
Bug: non-chat completions not respecting the max_tokens parameter using the OpenAI api #8634

Open
@cloud11665

Description

What happened?

The limit is respected when requesting a chat completion, but for non-chat completions the model keeps generating tokens forever (until the context length is reached). With a non-streaming request there is no way to stop generation; the current workaround is to stream and close the connection once the desired number of tokens has been received (see the sketch after the reproducer below).

  from openai import AsyncOpenAI

  # assumes a local llama.cpp server exposing the OpenAI-compatible API on its default port
  client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")

  # run inside an async context (e.g. asyncio.run() or a notebook cell)
  response = await client.completions.create(
      model="[A llama3 8B gguf]",
      prompt="Write me a funny story.",
      max_tokens=200,
      stream=True,
  )
  async for chunk in response:
      print(chunk)
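
For reference, the workaround described above looks roughly like this. It is a minimal sketch, not a fix: it assumes the same `client` as in the reproducer, that llama.cpp streams roughly one token per chunk, and that breaking out of the iteration is enough for the client to drop the connection.

  desired_tokens = 200
  received = 0

  response = await client.completions.create(
      model="[A llama3 8B gguf]",
      prompt="Write me a funny story.",
      max_tokens=desired_tokens,  # sent, but not enforced server-side for non-chat completions
      stream=True,
  )
  async for chunk in response:
      # each streamed chunk is assumed to carry one token of completion text
      print(chunk.choices[0].text, end="", flush=True)
      received += 1
      if received >= desired_tokens:
          break  # stop consuming the stream so the connection can be torn down

Counting chunks is only an approximation of counting tokens, but it bounds the output length until max_tokens is honoured for non-chat completions.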

Name and Version

version: 3432 (45f2c19)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Labels: bug, bug-unconfirmed, high severity
