Description
Name and Version
PS D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64> ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64\ggml-cpu-alderlake.dll
version: 5468 (d13d0f6)
built with clang version 18.1.8 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Starting with commit b5434, setting the -np (--parallel) parameter to a value greater than 1 causes llama-server to produce repetitive output, for example endlessly repeating a single character such as '=' or '3', after a certain number of tokens have been decoded.
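For illustration, a minimal setup that should trigger the behavior, assuming an arbitrary GGUF model at the hypothetical path model.gguf (the exact model, context size, and layer offload count are placeholders, not the reporter's original command):
llama-server -m model.gguf -ngl 99 -c 8192 -np 4
With -np 1 the same invocation decodes normally; raising -np above 1 is what leads to the degenerate repeated-character output described above.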
First Bad Commit
No response