parallel/server crashes with: ggml.c:16521: i != GGML_HASHTABLE_FULL when defragmentation is enabled #6685

Open
@phymbert

Description

Context

Using the latest llama.cpp server (commit 17e98d4).

The server is started with a Llama-70B-class F16 model:

server \
    --model model-f16.gguf \
    --ctx-size 32768 \
    --n-predict 4096 \
    --parallel 32 \
    --n-gpu-layers 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --metrics \
    --mg 1 \
    --log-format text \
    --defrag-thold 0.1

When sending 32 concurrent requests, the server crashes with:

GGML_ASSERT: /llama.cpp/ggml.c:16521: i != GGML_HASHTABLE_FULL

The backend is CUDA, running on 2× A100 GPUs (compute capability 8.0).

EDIT: The issue is related to KV cache defragmentation. Quick fix: disable defragmentation.
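As a sketch of the workaround, the same launch command with defragmentation disabled would look like the following. This assumes that omitting `--defrag-thold` (or passing a negative threshold, which llama.cpp treats as "disabled") is sufficient to turn the feature off; the other flags are unchanged from the report above.

```shell
# Same configuration as above, but with KV cache defragmentation disabled
# by dropping --defrag-thold 0.1 (assumption: defrag is off when no
# threshold is set, or when a negative threshold such as -1 is passed).
server \
    --model model-f16.gguf \
    --ctx-size 32768 \
    --n-predict 4096 \
    --parallel 32 \
    --n-gpu-layers 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --metrics \
    --mg 1 \
    --log-format text \
    --defrag-thold -1
```

With this configuration the 32 concurrent requests no longer trigger the `GGML_HASHTABLE_FULL` assertion, at the cost of leaving the KV cache fragmented.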
