
server: streaming of tool calls and thoughts when --jinja is on #12379


Merged
102 commits merged into ggml-org:master from ochafik:tool-diffs on May 25, 2025

Conversation

@ochafik (Collaborator) commented Mar 14, 2025

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking models (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-format deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
  • Improve Functionary v3.2 support (allow raw Python code, which models prefer over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to #9639

Pending follow ups / bugfixes:

How to test / use

  • Get and build this PR's branch
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-diffs
    cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

    # Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
    llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF
    
    # Models w/ generic tool call support now return clean interrupted output when hitting token limit
    llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    
  • Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ],
      "stream": true
    }'
  • You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening <think> tag at the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs still ship their initial template version) and for models like Cohere Command R7B that natively use a different thinking-tag syntax (now normalized, since --reasoning-format deepseek is the default)

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned in a standard format, each with a function name, tool call ID and JSON-encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments, but not always.

Function calls and their arguments can be at various levels:

  • JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • JSON arguments object w/ name in some prefix (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • No JSON / raw code string for python tool call, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].
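
For reference, all of these syntaxes get normalized into the same kind of message structure before being converted to OpenAI-style responses. Roughly (a simplified sketch of common/chat.h; field names may not match the actual definition exactly):

#include <string>
#include <vector>

// Simplified sketch of the normalized message every chat format parses into.
struct common_chat_tool_call {
    std::string name;       // function name
    std::string arguments;  // JSON-encoded arguments, possibly partial when streaming
    std::string id;         // tool call id, if the model/template provides one
};

struct common_chat_msg {
    std::string role;
    std::string content;
    std::string reasoning_content;                  // extracted <think>... content
    std::vector<common_chat_tool_call> tool_calls;
};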

Now, when streaming, we may have sampled only a prefix of the aforementioned output, and we ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still about the same size): for every token coming in, I try to parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this (a sketch of the idea follows).
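
To make the diffing idea concrete, here is a minimal, illustrative sketch (not the actual llama.cpp code); it assumes that between two parses each field only ever grows by appending, which is what cutting the healed JSON at a safe place guarantees:

#include <iostream>
#include <string>

// Simplified message: only the fields needed to illustrate the diff.
struct parsed_msg {
    std::string content;
    std::string tool_name;
    std::string tool_args; // JSON-encoded arguments, cut at a safe place
};

// Append-only diff between the last message sent and the freshly re-parsed one.
static parsed_msg compute_delta(const parsed_msg & prev, const parsed_msg & cur) {
    parsed_msg d;
    d.content   = cur.content.substr(prev.content.size());
    d.tool_name = prev.tool_name.empty() ? cur.tool_name : ""; // the name is sent once, whole
    d.tool_args = cur.tool_args.substr(prev.tool_args.size());
    return d;
}

int main() {
    parsed_msg prev = { "", "python", "{\"code\": \"print(" };
    parsed_msg cur  = { "", "python", "{\"code\": \"print('hey')\"}" };
    parsed_msg d    = compute_delta(prev, cur);
    std::cout << d.tool_args << "\n"; // -> 'hey')"}
}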

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached, or streaming still in progress), and the cut may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

  • If it happens inside an arguments object or a contents string (for generic mode), we should return it partial / truncated (and json-dumped in the case of the arguments), and diffed from the last parsed value for the streamed case
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to return a half-finished function name, but as soon as we have a complete function name we can send a diff. So we try to heal the JSON (we identify which JSON paths can be partially healed, because they're inside the arguments, and which ones must be dropped), and only populate a tool call if we have at least a name. Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function. (A toy illustration of the healing idea follows below.)

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)
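
As a toy illustration of the healing idea (the real code in common/json-partial.* uses nlohmann/json's SAX interface plus a healing marker, and additionally drops paths that must not stay partial, such as a half function name), one can close whatever containers and strings are still open so that a truncated arguments object becomes parseable again:

#include <nlohmann/json.hpp>
#include <iostream>
#include <string>
#include <vector>

// Naive healer: track open strings/objects/arrays, then append the matching closers.
static std::string heal(const std::string & partial) {
    std::vector<char> stack;
    bool in_string = false, escaped = false;
    for (char c : partial) {
        if (escaped)           { escaped = false; continue; }
        if (in_string) {
            if (c == '\\')       escaped = true;
            else if (c == '"')   in_string = false;
            continue;
        }
        if (c == '"')                       in_string = true;
        else if (c == '{' || c == '[')      stack.push_back(c);
        else if (c == '}' || c == ']')      stack.pop_back();
    }
    std::string healed = partial;
    if (in_string) healed += '"';
    while (!stack.empty()) {
        healed += (stack.back() == '{') ? '}' : ']';
        stack.pop_back();
    }
    return healed;
}

int main() {
    // Output truncated in the middle of the arguments: healable, partial value kept.
    std::string s = R"({"name": "special_function", "arguments": {"code": "print('he)";
    auto j = nlohmann::json::parse(heal(s));
    std::cout << j["arguments"].dump() << "\n"; // {"code":"print('he"}
}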

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick: transform the original regex into a regex that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates, in reverse, where the partial match started); see the sketch after this list.
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal - e.g. function name - and cut any JSON encoded result at the "right" place, which must be somewhere inside function arguments: consume_json accepts a list of json paths under which to expect arguments objects; could be from the root = empty path if the entire json object is an arguments object)
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)
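
Here is a small illustration of the reversed-regex trick mentioned above, using a hand-reversed version of /abc/ (the real common_regex derives the reversed pattern automatically):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Reversed equivalent of /abc/: every tail of "abc" is optional, so it also
    // matches the reversal of a truncated "ab" or "a".
    std::regex rev(R"(((?:(?:c)?b)?a)[\s\S]*)");

    std::string out = "some text then ab";              // output truncated before the 'c'
    std::string reversed(out.rbegin(), out.rend());

    std::smatch m;
    if (std::regex_search(reversed, m, rev) && m[1].matched) {
        // The end of group 1 (in the reversed string) maps back to where the
        // partial match starts in the original string.
        size_t start = out.size() - (m.position(1) + m.length(1));
        std::cout << "partial match \"" << out.substr(start) << "\" at offset " << start << "\n";
    }
}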

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).
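
And a rough sketch of the try_* / partial-exception control flow (names are illustrative, not the actual common_chat_msg_parser API):

#include <algorithm>
#include <iostream>
#include <stdexcept>
#include <string>

struct partial_parse_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct toy_parser {
    std::string input;
    size_t      pos;
    bool        is_partial;

    // Consume `lit` exactly. If the input ends in the middle of `lit` and we are
    // parsing a partial output, interrupt the whole parse instead of failing.
    bool try_consume_literal(const std::string & lit) {
        const size_t avail = std::min(lit.size(), input.size() - pos);
        if (input.compare(pos, avail, lit, 0, avail) != 0) {
            return false;                 // not this literal at all
        }
        if (avail < lit.size()) {         // input ends mid-literal
            if (is_partial) throw partial_parse_exception(lit);
            return false;
        }
        pos += lit.size();
        return true;
    }
};

int main() {
    toy_parser p{"<tool_c", 0, /* is_partial= */ true};
    try {
        p.try_consume_literal("<tool_call>");
    } catch (const partial_parse_exception &) {
        std::cout << "output ends mid-literal: stop cleanly, keep what was parsed so far\n";
    }
}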

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax (sketched below), which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode (matching the DeepSeek API's behaviour).

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).
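
Rough shape of the struct, per the description above (the actual definition lives in common/chat.h and may differ in details):

// Illustrative only: enum values abbreviated.
enum common_reasoning_format { COMMON_REASONING_FORMAT_NONE, COMMON_REASONING_FORMAT_DEEPSEEK };
enum common_chat_format     { COMMON_CHAT_FORMAT_CONTENT_ONLY, COMMON_CHAT_FORMAT_GENERIC /* , ... */ };

struct common_chat_syntax {
    common_chat_format      format           = COMMON_CHAT_FORMAT_CONTENT_ONLY;
    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
    // The template appended an opening <think> tag to the prompt, so the output
    // starts "inside" a thought (DeepSeek R1, QwQ).
    bool thinking_forced_open = false;
    // Keep the thinking tags inline in `content` (current streaming behaviour,
    // matching the DeepSeek API) instead of a separate reasoning_content field.
    bool reasoning_in_content = false;
};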

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
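
For illustration, a full-output trigger of this shape only fires once any leading <think> block has been closed (a hypothetical pattern; the actual patterns are generated per chat template by common_chat_templates_apply):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Single capturing group: grammar-constrained sampling starts where it begins.
    std::regex trigger(R"(^(?:<think>[\s\S]*?</think>)?\s*(<tool_call>))");

    std::string thinking = "<think>maybe I should emit <tool_call>";   // still inside thoughts
    std::string ready    = "<think>done</think>\n<tool_call>";         // thoughts closed

    std::smatch m;
    std::cout << std::boolalpha
              << std::regex_search(thinking, m, trigger) << "\n";      // false: no spurious trigger
    if (std::regex_search(ready, m, trigger)) {
        std::cout << "trigger at offset " << m.position(1) << "\n";    // grammar activates here
    }
}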

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • Fix tool call id attribution logic (disabled for now) from tool-call: ensure there's always a non-empty tool call id #12292
  • Might need one last diff in the final response after a stream, say, to close any raw python code
  • Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
    • Edit: OpenAI returns null logprobs in tool call mode. Just need to ensure normal mode doesn't regress (test failing atm)
  • Fix Mistral Nemo crash (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
  • Send partial regex (common_regex) as separate PR: common: add partial regex support #12808
  • Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • Command R7B's non-tool-calling template (they have 3 templates) forces <|START_RESPONSE|> at the end of the prompt. Output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
  • Add some docs
  • Add more tests
  • Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

@github-actions github-actions bot added the documentation, testing, examples, python and server labels on Mar 14, 2025

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces streaming of tool calls and reasoning content when using the --jinja flag while also improving partial JSON parsing and proper handling of grammar triggers. Key changes include:

  • Addition of streaming request utilities and extensive updates to unit tests for tool call and chat completion behavior.
  • Integration of a unified chat syntax structure in server.cpp and changes in sampling.cpp to better handle grammar trigger patterns.
  • Major enhancements in the JSON partial parsing/healing logic and chat parser to recover truncated model outputs.

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Summary per file:

  • tools/server/tests/utils.py: Added a new make_any_request method to support both streaming and non-streaming calls.
  • tools/server/tests/unit/test_tool_call.py: Updated test functions to support streaming and adjusted assertions for tool calls.
  • tools/server/tests/unit/test_chat_completion.py: Modified stream assertion checks (using is None rather than empty string).
  • tools/server/server.cpp: Integrated common_chat_syntax and refined grammar trigger handling and logging.
  • tests/test-json-partial.cpp: Added comprehensive tests for partial JSON healing with various fragment examples.
  • tests/test-chat-parser.cpp: Expanded test coverage for chat parsing, including reasoning and regex consumption.
  • src/llama-grammar.cpp: Updated grammar trigger handling to capture the first non-empty capturing group.
  • scripts/tool_bench.py: Enhanced command-line benchmarking parameters by adding chat_template_file support.
  • models/templates/README.md & Qwen-QwQ-32B.jinja: Added documentation and a new Jinja template for Qwen-QwQ-32B.
  • docs/function-calling.md: Included a caution regarding extreme KV quantizations affecting tool calling.
  • common/sampling.cpp: Refactored grammar trigger pattern concatenation by removing patterns_at_start logic.
  • common/json-partial.*: Introduced and refined JSON healing logic with detailed parsing and recovery steps.
  • common/chat*.{h,cpp}: Updated chat structures and the parser to leverage the new chat syntax and reasoning.
Comments suppressed due to low confidence (2)

tools/server/tests/unit/test_tool_call.py:98

  • Consider removing or re-enabling the commented assertion for a non-empty tool call id if it is no longer applicable. This cleanup can help avoid confusion in test outputs and maintain code clarity.
    # assert len(tool_call.get("id", "") > 0, f'Expected non empty tool call id in {tool_call}')

common/sampling.cpp:164

  • [nitpick] The variable name 'trigger_patterns' (previously replacing 'patterns_at_start') could be made more descriptive. Consider renaming it (e.g. to 'grammar_trigger_patterns') for clarity and consistency with its usage later in the code.
    std::vector<std::string> trigger_patterns;

(Excerpt from the partial JSON healing code under review:)

it = temptative_end;
return true;
} catch (const std::exception & ex) {
// No, needs healing.

Copilot AI May 24, 2025


The healing logic for partial JSON parsing is quite complex. Adding detailed inline comments explaining the steps and decisions—especially the conditions under which different closing characters are appended—would improve maintainability and clarity.


@ericcurtin (Collaborator) left a comment

I'd be a fan of merging this, even if it was imperfect, so the community can build upon it

@cgruver commented May 24, 2025

+1 to @ericcurtin. This capability starts to enable llama.cpp as a backend for vibe coding with self-hosted models.

@ochafik ochafik merged commit f5cd27b into ggml-org:master May 25, 2025
48 checks passed
@ochafik (Collaborator, Author) commented May 25, 2025

I'd be a fan of merging this, even if it was imperfect, so the community can build upon it

Merged this as is 😅🤞🙏

@Rane2021 commented

[error] langchain stream tool call error for no tool_call_id

error:
pydantic_core._pydantic_core.ValidationError: 1 validation error for ToolMessage
tool_call_id
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.10/v/string_type
During task with name 'tools' and id '5e515507-10d7-c20a-75c3-856bcef1f4c1'
During task with name 'researcher' and id '029f68e9-0b87-3e4b-59f1-891fc72c88f2'

@TheTerrasque commented

I see a similar problem in n8n: [ERROR: All OpenAI tool calls must have an "id" field.]

@making commented May 26, 2025

A similar error occurs with Spring AI

java.lang.IllegalArgumentException: ToolResponseMessage must have an id
	at org.springframework.util.Assert.isTrue(Assert.java:116) ~[spring-core-6.2.7.jar:6.2.7]
	at org.springframework.ai.openai.OpenAiChatModel.lambda$createRequest$19(OpenAiChatModel.java:594) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) ~[na:na]
	at org.springframework.ai.openai.OpenAiChatModel.lambda$createRequest$21(OpenAiChatModel.java:594) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[na:na]
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:575) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260) ~[na:na]
	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:616) ~[na:na]
	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:622) ~[na:na]
	at java.base/java.util.stream.ReferencePipeline.toList(ReferencePipeline.java:627) ~[na:na]
	at org.springframework.ai.openai.OpenAiChatModel.createRequest(OpenAiChatModel.java:604) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at org.springframework.ai.openai.OpenAiChatModel.lambda$internalStream$14(OpenAiChatModel.java:272) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at reactor.core.publisher.FluxDeferContextual.subscribe(FluxDeferContextual.java:49) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:54) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.publisher.FluxSubscribeOn$SubscribeOnSubscriber.run(FluxSubscribeOn.java:194) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) ~[reactor-core-3.7.6.jar:3.7.6]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[na:na]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]

@ochafik (Collaborator, Author) commented May 26, 2025

Thanks all for the feedback so far! (and sorry for the ahem lil breakages & unpolished edges 😅)

As I have only 10h left of copyright ownership to push fixes 🙈, here's a recap of the PRs out for review (cc/ @ngxson @ggerganov @CISC ):

@pwilkin (Contributor) commented May 26, 2025

@ochafik I can confirm that #13786 fixes the tool calling on Qwen3 streaming.

@ochafik (Collaborator, Author) commented May 26, 2025

All pending fixes were merged (thanks for the prompt reviews!), please all sync and report more bugs if you have a chance, I’ll do a last round in a few hours ✌🏼

(ideally please provide self-contained repro instructions, esp. for integration issues with agentic frameworks: even if it’s on a hacky branch of yours, it will help tremendously)

Update: 1st release w/ the fixes is b5497

@teleprint-me (Contributor) commented

Just as an FYI, there's a minor regression with predicted logits.

20:13:18 | /mnt/valerie/public/agent
(.venv) git:(main | Δ) λ python -m agent.openai.stream
{'type': 'role', 'value': 'assistant'}
{'type': 'reasoning.open', 'value': '<think>Okay'}
{'type': 'content', 'value': ','}
{'type': 'content', 'value': ' the'}

Previously, in earlier commits, before this PR was merged, <think> and Okay were in fact separate tokens.

This is okay here in a controlled fashion, but I suspect this might creep up in unsuspected ways further down the line.

Updated simple example
# agent.openai.stream
import json
import os
from pprint import pprint
from typing import Any, Dict, Generator, Optional

import dotenv
from openai import OpenAI

from agent.tools import tools


class GPTRequest:
    def __init__(self, base_url: str = None, api_key: str = None):
        self.client = self._connect(base_url, api_key)

    def _connect(self, base_url: str = None, api_key: str = None) -> OpenAI:
        if not base_url or not api_key:
            dotenv.load_dotenv(".env")

        if not base_url:
            base_url = os.getenv("OPENAI_BASE_URL", "")

        if not api_key:
            api_key = os.getenv("OPENAI_API_KEY", "")

        if not api_key:
            raise ValueError("EnvironmentError: OPENAI_API_KEY not set in .env")

        # Setup default base URL if using local mode
        if api_key == "sk-no-key-required" and not base_url:
            base_url = "http://localhost:8080/v1"

        return OpenAI(api_key=api_key, base_url=base_url)

    def _classify_reasoning(self, text: str) -> Optional[str]:
        if "<think>" in text:
            return "reasoning.open"
        if "</think>" in text:
            return "reasoning.close"
        return None

    def _classify_tool(self, tool_call, buffer, args_fragments):
        fn = tool_call.function
        if fn.name:
            buffer["name"] = fn.name
        if fn.arguments:
            args_fragments.append(fn.arguments)
        if fn.arguments and fn.arguments.strip().endswith("}"):
            try:
                args = "".join(args_fragments)
                buffer["arguments"] = json.loads(args)
                args_fragments.clear()
                return {"type": "tool_call", "value": buffer.copy()}
            except json.JSONDecodeError:
                # Wait for more fragments
                pass
        return None

    def stream(self, **kwargs) -> Generator[Dict[str, Any], None, None]:
        response = self.client.chat.completions.create(**kwargs)
        tool_buffer = {}
        args_fragments = []

        for chunk in response:
            delta = chunk.choices[0].delta

            if hasattr(delta, "role") and delta.role:
                yield {"type": "role", "value": delta.role}

            if hasattr(delta, "content") and delta.content:
                reasoning_type = self._classify_reasoning(delta.content)
                yield {"type": reasoning_type or "content", "value": delta.content}

            if hasattr(delta, "tool_calls") and delta.tool_calls:
                for tool_call in delta.tool_calls:
                    result = self._classify_tool(tool_call, tool_buffer, args_fragments)
                    if result:
                        yield result

            if hasattr(delta, "refusal") and delta.refusal:
                yield {"type": "refusal", "value": delta.refusal.model_dump()}

            if hasattr(delta, "reasoning") and delta.reasoning:  # Future-compatible
                yield {"type": "reasoning", "value": delta.reasoning.model_dump()}


def main():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Paris, France?"},
    ]

    request = GPTRequest()
    generator = request.stream(
        model="gpt-3.5-turbo",  # Llama.Cpp expects this model definition
        messages=messages,
        max_tokens=-1,  # Allow the model to naturally stop on its own
        stream=True,
        temperature=0.8,
        tools=tools,
    )
    for obj in generator:
        pprint(obj)


if __name__ == "__main__":
    main()

My current suspicion is that it's an issue with token healing, which I have to handle upstream because of this issue. It's only obvious here because it shows up right away in a predictable and deterministic fashion.

@julien-c (Contributor) commented

Thanks so much for pushing this over the finish line @ochafik!

@strawberrymelonpanda (Contributor) commented May 27, 2025

As I have only 10h left of copyright ownership to push fixes 🙈

Minja is open-source, right? Do you need copyright ownership?
Curious what this means for the future if Llama.cpp will keep using minja.

@CISC (Collaborator) commented May 27, 2025

Minja is open-source, right? Do you need copyright ownership? Curious what this means for the future if Llama.cpp will keep using minja.

Both minja and llama.cpp share the same license (MIT); the comment was probably referring to the email address used in the commits (which is added to AUTHORS) belonging to Google.

@ochafik (Collaborator, Author) commented May 28, 2025

As I have only 10h left of copyright ownership to push fixes 🙈

Minja is open-source, right? Do you need copyright ownership? Curious what this means for the future if Llama.cpp will keep using minja.

@strawberrymelonpanda I hope to be able to clarify soon w/ my new employer whether I can contribute to minja, to a fork of it or... not at all (in any case, I've added a couple of extra maintainers from Google to the google/minja repo cc/ @matthewchan-g @Ferev, and there's always the option for llama.cpp to just fork it under ggml-org, etc)

@ericcurtin (Collaborator) commented

Do maintainers in the Google namespace have to be Google employees? If yes, that might be a decent reason to move to the ggml-org namespace.

Labels: documentation, examples, python, script, server, testing, tool calling

Successfully merging this pull request may close these issues.

Eval bug: llama-cpp-deepseek-r1.jinja template will miss the <think> tag