Hey, I'm not sure if this is a bug or if I just need to change a config somewhere, but since the update I get this error when I try to run models split across two CUDA devices:
```
ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1
```
Is that related to this? ggml-org/llama.cpp#5240
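In case it helps triage, here is a quick check of what device limit my installed build actually reports. This is a sketch assuming the Python bindings expose either the newer `llama_max_devices()` function (after ggml-org/llama.cpp#5240 made the limit a runtime query) or the older compile-time constant `LLAMA_MAX_DEVICES`; the exact name depends on the llama-cpp-python version.

```python
# Sketch: report the device limit of the installed llama.cpp build.
# Assumption: the bindings expose either llama_max_devices() (newer)
# or the older compile-time constant LLAMA_MAX_DEVICES.
import llama_cpp

if hasattr(llama_cpp, "llama_max_devices"):
    print("llama_max_devices():", llama_cpp.llama_max_devices())
elif hasattr(llama_cpp, "LLAMA_MAX_DEVICES"):
    print("LLAMA_MAX_DEVICES:", llama_cpp.LLAMA_MAX_DEVICES)
else:
    print("No device-limit symbol found in this version of the bindings.")

# If this prints 1, the build only supports a single device, so a
# --tensor_split across two GPUs is rejected before the model loads.
```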
Here is the command I ran:
```shell
python3 -m llama_cpp.server --model /Models/deepseek-coder-33b-instruct.Q4_K_M.gguf \
  --n_gpu_layers 56 --tensor_split 64 36 --offload_kqv false \
  --n_ctx 8000 --n_batch 56 --chat_format chatml
```