Misc. bug: Eval bug: Repetitive Output After Certain Token Count When Using -np > 1 in llama.cpp (Ver. b5468) #13733

Closed
@thomasbergersen

Description

Name and Version

PS D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64> ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llama.cpp\release\llama-b5468-bin-win-cuda-12.4-x64\ggml-cpu-alderlake.dll
version: 5468 (d13d0f6)
built with clang version 18.1.8 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

Problem description & steps to reproduce

Starting with build b5434, setting the -np (--parallel) parameter to a value greater than 1 in llama.cpp causes the model to generate repetitive output, such as endlessly repeating characters like '=' or '3', after a certain number of tokens have been decoded.

(Screenshot of the repetitive output attached in the original issue.)
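
For anyone trying to reproduce this, a minimal llama-server invocation along the following lines matches the setup described above; the model path, context size, and port are illustrative placeholders rather than values taken from the original report:

./llama-server -m .\model.gguf -c 8192 -np 4 --port 8080

Per the description, the repetition only appears once -np is greater than 1; the same build behaves normally with the default -np 1.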

First Bad Commit

No response

Relevant log output
