Description
Context
Using the latest llama.cpp server (commit 17e98d4).
The server is started with a Llama-70B-like F16 model:
```sh
server \
  --model model-f16.gguf \
  --ctx-size 32768 \
  --n-predict 4096 \
  --parallel 32 \
  --n-gpu-layers 81 \
  --batch-size 4096 \
  --ubatch-size 256 \
  --metrics \
  --mg 1 \
  --log-format text \
  --defrag-thold 0.1
```
When sending 32 concurrent requests, the server crashes with:
```
GGML_ASSERT: /llama.cpp/ggml.c:16521: i != GGML_HASHTABLE_FULL
```
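A minimal reproduction sketch, assuming the server listens on the default `localhost:8080` and using a placeholder prompt and `n_predict` (adjust both to match your setup):

```sh
# Fire 32 concurrent completion requests at the server's /completion endpoint.
for i in $(seq 1 32); do
  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a short story.", "n_predict": 512}' > /dev/null &
done
wait  # block until all 32 background requests finish (or the server crashes)
```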
The backend is CUDA, running on 2x A100 GPUs (compute capability 8.0).
EDIT: The issue is related to KV cache defragmentation. Quick fix: disable defragmentation.
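For reference, a sketch of the workaround: the same launch command with `--defrag-thold` omitted, which leaves defragmentation disabled and avoids the assert (all other parameters unchanged from above):

```sh
server \
  --model model-f16.gguf \
  --ctx-size 32768 \
  --n-predict 4096 \
  --parallel 32 \
  --n-gpu-layers 81 \
  --batch-size 4096 \
  --ubatch-size 256 \
  --metrics \
  --mg 1 \
  --log-format text
# --defrag-thold omitted: KV cache defragmentation stays disabled
```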