What happened?
The max_tokens limit is respected when requesting a chat completion, but for non-chat completions the model keeps generating tokens forever (until ctx-len is reached). With non-streaming requests there is no way to stop generation. The current workaround is to stream the response and close the connection once the desired number of tokens has been received:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")  # placeholder local server endpoint

async def main():
    response = await client.completions.create(
        model="[A llama3 8B gguf]",
        prompt="Write me a funny story.",
        max_tokens=200,  # ignored by the non-chat endpoint
        stream=True,
    )
    async for chunk in response:
        print(chunk)

asyncio.run(main())
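
For reference, a minimal sketch of the workaround described above, assuming the openai v1 async Python client against a placeholder local server URL. bounded_completion is a hypothetical helper name, and treating each streamed chunk as roughly one token is an approximation; AsyncStream.close() is used to drop the connection and stop server-side generation:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")  # placeholder endpoint

async def bounded_completion(prompt, budget=200):
    stream = await client.completions.create(
        model="[A llama3 8B gguf]",
        prompt=prompt,
        stream=True,
    )
    text, received = "", 0
    async for chunk in stream:
        text += chunk.choices[0].text
        received += 1  # approximation: one streamed chunk ~ one token
        if received >= budget:
            await stream.close()  # close the connection to stop generation
            break
    return text

print(asyncio.run(bounded_completion("Write me a funny story.")))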
Name and Version
version: 3432 (45f2c19)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response