Description
Steps to reproduce
- Quantize Mixtral 8x7B with a quantization that fully fits on the available GPU. In my case (16 GB GPU) these are `IQ2_XXS` and `IQ2_XS`.
- Run a short `perplexity` calculation with the model fully offloaded to the GPU. A few chunks are enough (example commands are sketched below this list).
- Now run the same calculation with `-ngl 30`, and observe how PPL is 2-3 times higher than in step 2.
- To verify that this is not due to a broken CPU kernel, make a build without CUDA support and run on the CPU. Notice how PPL is very similar to the result of step 2.
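For reference, a minimal sketch of the commands behind these steps. Model/file names, the imatrix file, and `wiki.test.raw` are placeholders, and on this build the IQ2 quants are normally produced with an importance matrix; `--chunks` just keeps the runs short:

```sh
# Quantize Mixtral 8x7B to IQ2_XXS (paths are placeholders; IQ2 quants
# generally need an importance matrix via --imatrix)
./quantize --imatrix mixtral-imatrix.dat mixtral-8x7b-f16.gguf mixtral-8x7b-iq2_xxs.gguf IQ2_XXS

# Step 2: full GPU offload, a few chunks are enough
./perplexity -m mixtral-8x7b-iq2_xxs.gguf -f wiki.test.raw -ngl 99 --chunks 14

# Step 3: partial offload, PPL comes out 2-3x higher
./perplexity -m mixtral-8x7b-iq2_xxs.gguf -f wiki.test.raw -ngl 30 --chunks 14

# Step 4: CPU-only comparison, using a build made without CUDA support
# (e.g. rebuild with `make clean && make`, i.e. without LLAMA_CUBLAS=1)
./perplexity -m mixtral-8x7b-iq2_xxs.gguf -f wiki.test.raw --chunks 14
```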
Here are some example runs:
All layers on GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 11586.00 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 109.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 70.50 MiB
llama_new_context_with_model: graph splits (measure): 4
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 567.194 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.32 seconds per pass - ETA 14.08 minutes
[1]4.0990,[2]4.9914,[3]5.6483,[4]6.3020,[5]6.2826,[6]6.2130,[7]6.4030,[8]6.4265,[9]6.5435,[10]6.8596,[11]7.0488,[12]7.0107,[13]7.0517,[14]7.0914
30 layers offloaded to GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 10806.75 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 60.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 108.03 MiB
llama_new_context_with_model: graph splits (measure): 5
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 564.023 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.44 seconds per pass - ETA 15.45 minutes
[1]9.7855,[2]10.1005,[3]12.9574,[4]13.0298,[5]12.7318,[6]11.8905,[7]11.7408,[8]11.9335,[9]11.8980,[10]12.4120,[11]12.8212,[12]13.9232,[13]13.9312,[14]14.1171
All on CPU
main: build = 1971 (1182cf4)
...
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU input buffer size = 9.01 MiB
llama_new_context_with_model: CPU compute buffer size = 114.53 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 566.337 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 72.40 seconds per pass - ETA 12 hours 54.68 minutes
[1]4.1341,[2]5.0092,[3]5.6687,[4]6.3300,[5]6.3044,[6]6.2292,[7]6.4185,[8]6.4343,[9]6.5516,[10]6.8710,[11]7.0630,[12]7.0260,[13]7.0671,
29 layers on GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 29 repeating layers to GPU
llm_load_tensors: offloaded 29/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 10417.12 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 6.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 58.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 108.03 MiB
llama_new_context_with_model: graph splits (measure): 5
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 566.749 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.52 seconds per pass - ETA 16.28 minutes
[1]4.0521,[2]4.9624,[3]5.5985,[4]6.2678,[5]6.2614,[6]6.2038,[7]6.4056,[8]6.4241,[9]6.5409,[10]6.8630,[11]7.0622,[12]7.0257,[13]7.0661,[14]7.1006,
31 layers on GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 11196.38 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 2.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 62.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 108.03 MiB
llama_new_context_with_model: graph splits (measure): 5
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 548.178 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.39 seconds per pass - ETA 14.82 minutes
[1]4.8836,[2]6.0415,[3]6.4471,[4]7.0981,[5]6.9666,[6]6.8581,[7]7.1009,[8]7.0858,[9]7.2431,[10]7.5545,[11]7.7723,[12]7.6741,[13]7.7159,[14]7.7463,