Name and Version
$ ./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 5486 (aa50ba46)
built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu
Hello. There is an error in the /completion endpoint.
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 2700 / RTX 3090 Ti
Models
Any (reproduced here with phi-4-Q6_K.gguf)
Problem description & steps to reproduce
Reproduction steps:
- Start the server with any model:
./build/bin/llama-server -m phi-4-Q6_K.gguf -c 8192 -ngl 65
- Make a completion request:
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "System: You are a helpful assistant.\nAssistant:\nHey! How could I help?\nUser:\nTell me a joke.\nAssistant:\n",
    "temperature": 0.6,
    "n_predict": 200,
    "stop": ["User:\n", "Assistant:\n"],
    "stream": true
  }'
- Observe the crash; the server aborts with the error shown in the log output below.
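The failure seems tied to streaming combined with stop strings: in the log below, the previously streamed text ends in "User" (a partial match of the "User:\n" stop string), which was then trimmed from the final text. To help narrow this down, the same request can be repeated with streaming disabled (an assumption on my side: the failing diff check only runs for streamed responses):

curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "System: You are a helpful assistant.\nAssistant:\nHey! How could I help?\nUser:\nTell me a joke.\nAssistant:\n",
    "temperature": 0.6,
    "n_predict": 200,
    "stop": ["User:\n", "Assistant:\n"],
    "stream": false
  }'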
First Bad Commit
No response
Relevant log output
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 26
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 26, n_tokens = 26, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 26, n_tokens = 26
terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid diff: 'Why don't scientists trust atoms?
Because they make up everything!
User' not found at start of 'Why don't scientists trust atoms?
Because they make up everything!
'
zsh: IOT instruction (core dumped) ./build/bin/llama-server -m phi-4-Q6_K.gguf -c 8192 -ngl 65
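For context on the exception: the message format matches a prefix-based diff used to compute streamed deltas. A minimal sketch of that kind of check (the function name and details here are assumptions, not taken from the llama.cpp tree):

#include <stdexcept>
#include <string>

// Sketch of the prefix-diff check implied by the log message: each streamed
// update computes the delta between the text sent so far ('last') and the
// current snapshot ('current'), and 'last' must be a prefix of 'current'.
static std::string string_diff(const std::string & last, const std::string & current) {
    if (last.empty()) {
        return current;
    }
    if (current.rfind(last, 0) != 0) {
        // Trimming a partially matched stop string ("User") from the final
        // snapshot breaks the prefix invariant; the resulting exception is
        // never caught, so the server terminates as in the log above.
        throw std::runtime_error("Invalid diff: '" + last + "' not found at start of '" + current + "'");
    }
    return current.substr(last.size());
}

int main() {
    const std::string last    = "Why don't scientists trust atoms?\nBecause they make up everything!\n\nUser";
    const std::string current = "Why don't scientists trust atoms?\nBecause they make up everything!\n";
    string_diff(last, current); // throws -> terminate, matching the crash above
}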