
Bug: non-chat completions not respecting the max_tokens parameter using the OpenAI api #8634

Open
@cloud11665

Description

What happened?

The limit is respected when requesting a chat completion, but for non-chat completions the model keeps generating tokens forever (until the context length is reached). With a non-streaming request there is no way to stop generation; the current workaround is to stream and close the connection once the desired number of tokens has been received (see the sketch after the reproducer below).

  from openai import AsyncOpenAI

  # assumes a local llama.cpp server exposing the OpenAI-compatible API on its default port
  client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")

  # run inside an async context (e.g. asyncio.run() or a notebook cell)
  response = await client.completions.create(
      model="[A llama3 8B gguf]",
      prompt="Write me a funny story.",
      max_tokens=200,
      stream=True,
  )
  async for chunk in response:
      print(chunk)
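
For reference, the workaround described above looks roughly like this. It is a minimal sketch, not a fix: it assumes the same `client` as in the reproducer, that llama.cpp streams roughly one token per chunk, and that breaking out of the iteration is enough for the client to drop the connection.

  desired_tokens = 200
  received = 0

  response = await client.completions.create(
      model="[A llama3 8B gguf]",
      prompt="Write me a funny story.",
      max_tokens=desired_tokens,  # sent, but not enforced server-side for non-chat completions
      stream=True,
  )
  async for chunk in response:
      # each streamed chunk is assumed to carry one token of completion text
      print(chunk.choices[0].text, end="", flush=True)
      received += 1
      if received >= desired_tokens:
          break  # stop consuming the stream so the connection can be torn down

Counting chunks is only an approximation of counting tokens, but it bounds the output length until max_tokens is honoured for non-chat completions.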

Name and Version

version: 3432 (45f2c19)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Labels: bug, bug-unconfirmed, high severity
