Description
Steps to reproduce
- Quantize Mixtral 8x7B with a quantization that fully fits on the available GPU. In my case (16 GB GPU) these are `IQ2_XXS` and `IQ2_XS`.
- Run a short `perplexity` calculation with the model fully offloaded to the GPU. A few chunks are enough (example commands are sketched below this list).
- Now run the same calculation with `-ngl 30`, and observe how PPL is 2-3 times higher than in step 2.
- To verify that this is not due to a broken CPU kernel, make a build without CUDA support and run on the CPU. Notice how PPL is very similar to the result of step 2.
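For reference, a minimal sketch of the commands behind these steps. Model/file names, the imatrix file, and `wiki.test.raw` are placeholders, and on this build the IQ2 quants are normally produced with an importance matrix; `--chunks` just keeps the runs short:

```sh
# Quantize Mixtral 8x7B to IQ2_XXS (paths are placeholders; IQ2 quants
# generally need an importance matrix via --imatrix)
./quantize --imatrix mixtral-imatrix.dat mixtral-8x7b-f16.gguf mixtral-8x7b-iq2_xxs.gguf IQ2_XXS

# Step 2: full GPU offload, a few chunks are enough
./perplexity -m mixtral-8x7b-iq2_xxs.gguf -f wiki.test.raw -ngl 99 --chunks 14

# Step 3: partial offload, PPL comes out 2-3x higher
./perplexity -m mixtral-8x7b-iq2_xxs.gguf -f wiki.test.raw -ngl 30 --chunks 14

# Step 4: CPU-only comparison, using a build made without CUDA support
# (e.g. rebuild with `make clean && make`, i.e. without LLAMA_CUBLAS=1)
./perplexity -m mixtral-8x7b-iq2_xxs.gguf -f wiki.test.raw --chunks 14
```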
Here are some example runs:
All layers on GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 11586.00 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 109.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 70.50 MiB
llama_new_context_with_model: graph splits (measure): 4
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 567.194 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.32 seconds per pass - ETA 14.08 minutes
[1]4.0990,[2]4.9914,[3]5.6483,[4]6.3020,[5]6.2826,[6]6.2130,[7]6.4030,[8]6.4265,[9]6.5435,[10]6.8596,[11]7.0488,[12]7.0107,[13]7.0517,[14]7.0914
30 layers offloaded to GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 10806.75 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 60.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 108.03 MiB
llama_new_context_with_model: graph splits (measure): 5
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 564.023 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.44 seconds per pass - ETA 15.45 minutes
[1]9.7855,[2]10.1005,[3]12.9574,[4]13.0298,[5]12.7318,[6]11.8905,[7]11.7408,[8]11.9335,[9]11.8980,[10]12.4120,[11]12.8212,[12]13.9232,[13]13.9312,[14]14.1171
All on CPU
main: build = 1971 (1182cf4)
...
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU input buffer size = 9.01 MiB
llama_new_context_with_model: CPU compute buffer size = 114.53 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 566.337 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 72.40 seconds per pass - ETA 12 hours 54.68 minutes
[1]4.1341,[2]5.0092,[3]5.6687,[4]6.3300,[5]6.3044,[6]6.2292,[7]6.4185,[8]6.4343,[9]6.5516,[10]6.8710,[11]7.0630,[12]7.0260,[13]7.0671,
29 layers on GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 29 repeating layers to GPU
llm_load_tensors: offloaded 29/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 10417.12 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 6.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 58.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 108.03 MiB
llama_new_context_with_model: graph splits (measure): 5
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 566.749 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.52 seconds per pass - ETA 16.28 minutes
[1]4.0521,[2]4.9624,[3]5.5985,[4]6.2678,[5]6.2614,[6]6.2038,[7]6.4056,[8]6.4241,[9]6.5409,[10]6.8630,[11]7.0622,[12]7.0257,[13]7.0661,[14]7.1006,
31 layers on GPU
main: build = 1971 (1182cf4d)
...
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = IQ2_XXS - 2.0625 bpw
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 11.44 GiB (2.10 BPW)
llm_load_print_meta: general.name = hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/33 layers to GPU
llm_load_tensors: CPU buffer size = 11712.97 MiB
llm_load_tensors: CUDA0 buffer size = 11196.38 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 2.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 62.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 108.03 MiB
llama_new_context_with_model: graph splits (measure): 5
system_info: n_threads = 32 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 548.178 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 1.39 seconds per pass - ETA 14.82 minutes
[1]4.8836,[2]6.0415,[3]6.4471,[4]7.0981,[5]6.9666,[6]6.8581,[7]7.1009,[8]7.0858,[9]7.2431,[10]7.5545,[11]7.7723,[12]7.6741,[13]7.7159,[14]7.7463,