
server: streaming of tool calls and thoughts when --jinja is on #12379


Merged
102 commits merged into ggml-org:master from ochafik:tool-diffs on May 25, 2025

Conversation

@ochafik (Collaborator) commented Mar 14, 2025

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking models (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-format deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
  • Improve Functionary v3.2 support (allow raw Python code, which models prefer over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to #9639

Pending follow ups / bugfixes:

How to test / use

  • Get and build this PR's branch
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-diffs
    cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

    # Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
    llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF
    
    # Models w/ generic tool call support now return clean interrupted output when hitting token limit
    llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    
  • Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ],
      "stream": true
    }'
  • You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening <think> tag at the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs still ship their initial template version) and for models like Cohere Command R7B that natively use a different thinking-tag syntax (now normalized, since --reasoning-format deepseek is the default)

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned in a standard format, each with a function name, tool call ID and JSON-encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments, but not always.

Function calls and their arguments can be at various levels:

  • JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • JSON arguments object w/ name in some prefix (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • No JSON / raw code string for python tool call, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].
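
For reference, all of these syntaxes get normalized into the same kind of message structure before being converted to OpenAI-style responses. Roughly (a simplified sketch of common/chat.h; field names may not match the actual definition exactly):

#include <string>
#include <vector>

// Simplified sketch of the normalized message every chat format parses into.
struct common_chat_tool_call {
    std::string name;       // function name
    std::string arguments;  // JSON-encoded arguments, possibly partial when streaming
    std::string id;         // tool call id, if the model/template provides one
};

struct common_chat_msg {
    std::string role;
    std::string content;
    std::string reasoning_content;                  // extracted <think>... content
    std::vector<common_chat_tool_call> tool_calls;
};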

Now, when streaming, we may have sampled only a prefix of the aforementioned output, and we ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still about the same size): for every token coming in, I try to parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this (a sketch of the idea follows).
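
To make the diffing idea concrete, here is a minimal, illustrative sketch (not the actual llama.cpp code); it assumes that between two parses each field only ever grows by appending, which is what cutting the healed JSON at a safe place guarantees:

#include <iostream>
#include <string>

// Simplified message: only the fields needed to illustrate the diff.
struct parsed_msg {
    std::string content;
    std::string tool_name;
    std::string tool_args; // JSON-encoded arguments, cut at a safe place
};

// Append-only diff between the last message sent and the freshly re-parsed one.
static parsed_msg compute_delta(const parsed_msg & prev, const parsed_msg & cur) {
    parsed_msg d;
    d.content   = cur.content.substr(prev.content.size());
    d.tool_name = prev.tool_name.empty() ? cur.tool_name : ""; // the name is sent once, whole
    d.tool_args = cur.tool_args.substr(prev.tool_args.size());
    return d;
}

int main() {
    parsed_msg prev = { "", "python", "{\"code\": \"print(" };
    parsed_msg cur  = { "", "python", "{\"code\": \"print('hey')\"}" };
    parsed_msg d    = compute_delta(prev, cur);
    std::cout << d.tool_args << "\n"; // -> 'hey')"}
}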

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached, or streaming still in progress), and the cut may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

  • If it happens inside an arguments object or a contents string (for generic mode), we should return it partial / truncated (and json-dumped in the case of the arguments), and diffed from the last parsed value for the streamed case
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to return a half-finished function name, but as soon as we have a complete function name we can send a diff. So we try to heal the JSON (we identify which JSON paths can be partially healed, because they're inside the arguments, and which ones must be dropped), and only populate a tool call if we have at least a name. Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function. (A toy illustration of the healing idea follows below.)

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)
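
As a toy illustration of the healing idea (the real code in common/json-partial.* uses nlohmann/json's SAX interface plus a healing marker, and additionally drops paths that must not stay partial, such as a half function name), one can close whatever containers and strings are still open so that a truncated arguments object becomes parseable again:

#include <nlohmann/json.hpp>
#include <iostream>
#include <string>
#include <vector>

// Naive healer: track open strings/objects/arrays, then append the matching closers.
static std::string heal(const std::string & partial) {
    std::vector<char> stack;
    bool in_string = false, escaped = false;
    for (char c : partial) {
        if (escaped)           { escaped = false; continue; }
        if (in_string) {
            if (c == '\\')       escaped = true;
            else if (c == '"')   in_string = false;
            continue;
        }
        if (c == '"')                       in_string = true;
        else if (c == '{' || c == '[')      stack.push_back(c);
        else if (c == '}' || c == ']')      stack.pop_back();
    }
    std::string healed = partial;
    if (in_string) healed += '"';
    while (!stack.empty()) {
        healed += (stack.back() == '{') ? '}' : ']';
        stack.pop_back();
    }
    return healed;
}

int main() {
    // Output truncated in the middle of the arguments: healable, partial value kept.
    std::string s = R"({"name": "special_function", "arguments": {"code": "print('he)";
    auto j = nlohmann::json::parse(heal(s));
    std::cout << j["arguments"].dump() << "\n"; // {"code":"print('he"}
}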

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick: transform the original regex into a regex that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates, in reverse, where the partial match started); see the sketch after this list.
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal - e.g. function name - and cut any JSON encoded result at the "right" place, which must be somewhere inside function arguments: consume_json accepts a list of json paths under which to expect arguments objects; could be from the root = empty path if the entire json object is an arguments object)
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)
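
Here is a small illustration of the reversed-regex trick mentioned above, using a hand-reversed version of /abc/ (the real common_regex derives the reversed pattern automatically):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Reversed equivalent of /abc/: every tail of "abc" is optional, so it also
    // matches the reversal of a truncated "ab" or "a".
    std::regex rev(R"(((?:(?:c)?b)?a)[\s\S]*)");

    std::string out = "some text then ab";              // output truncated before the 'c'
    std::string reversed(out.rbegin(), out.rend());

    std::smatch m;
    if (std::regex_search(reversed, m, rev) && m[1].matched) {
        // The end of group 1 (in the reversed string) maps back to where the
        // partial match starts in the original string.
        size_t start = out.size() - (m.position(1) + m.length(1));
        std::cout << "partial match \"" << out.substr(start) << "\" at offset " << start << "\n";
    }
}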

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).
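
And a rough sketch of the try_* / partial-exception control flow (names are illustrative, not the actual common_chat_msg_parser API):

#include <algorithm>
#include <iostream>
#include <stdexcept>
#include <string>

struct partial_parse_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct toy_parser {
    std::string input;
    size_t      pos;
    bool        is_partial;

    // Consume `lit` exactly. If the input ends in the middle of `lit` and we are
    // parsing a partial output, interrupt the whole parse instead of failing.
    bool try_consume_literal(const std::string & lit) {
        const size_t avail = std::min(lit.size(), input.size() - pos);
        if (input.compare(pos, avail, lit, 0, avail) != 0) {
            return false;                 // not this literal at all
        }
        if (avail < lit.size()) {         // input ends mid-literal
            if (is_partial) throw partial_parse_exception(lit);
            return false;
        }
        pos += lit.size();
        return true;
    }
};

int main() {
    toy_parser p{"<tool_c", 0, /* is_partial= */ true};
    try {
        p.try_consume_literal("<tool_call>");
    } catch (const partial_parse_exception &) {
        std::cout << "output ends mid-literal: stop cleanly, keep what was parsed so far\n";
    }
}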

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax (sketched below), which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode (matching the DeepSeek API's behaviour).

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).
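
Rough shape of the struct, per the description above (the actual definition lives in common/chat.h and may differ in details):

// Illustrative only: enum values abbreviated.
enum common_reasoning_format { COMMON_REASONING_FORMAT_NONE, COMMON_REASONING_FORMAT_DEEPSEEK };
enum common_chat_format     { COMMON_CHAT_FORMAT_CONTENT_ONLY, COMMON_CHAT_FORMAT_GENERIC /* , ... */ };

struct common_chat_syntax {
    common_chat_format      format           = COMMON_CHAT_FORMAT_CONTENT_ONLY;
    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
    // The template appended an opening <think> tag to the prompt, so the output
    // starts "inside" a thought (DeepSeek R1, QwQ).
    bool thinking_forced_open = false;
    // Keep the thinking tags inline in `content` (current streaming behaviour,
    // matching the DeepSeek API) instead of a separate reasoning_content field.
    bool reasoning_in_content = false;
};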

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
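
For illustration, a full-output trigger of this shape only fires once any leading <think> block has been closed (a hypothetical pattern; the actual patterns are generated per chat template by common_chat_templates_apply):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Single capturing group: grammar-constrained sampling starts where it begins.
    std::regex trigger(R"(^(?:<think>[\s\S]*?</think>)?\s*(<tool_call>))");

    std::string thinking = "<think>maybe I should emit <tool_call>";   // still inside thoughts
    std::string ready    = "<think>done</think>\n<tool_call>";         // thoughts closed

    std::smatch m;
    std::cout << std::boolalpha
              << std::regex_search(thinking, m, trigger) << "\n";      // false: no spurious trigger
    if (std::regex_search(ready, m, trigger)) {
        std::cout << "trigger at offset " << m.position(1) << "\n";    // grammar activates here
    }
}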

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • Fix tool call id attribution logic (disabled for now) from tool-call: ensure there's always a non-empty tool call id #12292
  • Might need one last diff in the final response after a stream, say, to close any raw python code
  • Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
    • Edit: OpenAI returns null logprobs in tool call mode. Just need to ensure normal mode doesn't regress (test failing atm)
  • Fix Mistral Nemo crash (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
  • Send partial regex (common_regex) as separate PR: common: add partial regex support #12808
  • Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • Command R7B's non-tool-calling template (they have 3 templates) forces <|START_RESPONSE|> at the end of the prompt. Output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
  • Add some docs
  • Add more tests
  • Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

@github-actions github-actions bot added the documentation, testing, examples, python and server labels on Mar 14, 2025

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces streaming of tool calls and reasoning content when using the --jinja flag while also improving partial JSON parsing and proper handling of grammar triggers. Key changes include:

  • Addition of streaming request utilities and extensive updates to unit tests for tool call and chat completion behavior.
  • Integration of a unified chat syntax structure in server.cpp and changes in sampling.cpp to better handle grammar trigger patterns.
  • Major enhancements in the JSON partial parsing/healing logic and chat parser to recover truncated model outputs.

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Summary per file:

  • tools/server/tests/utils.py: Added a new make_any_request method to support both streaming and non-streaming calls.
  • tools/server/tests/unit/test_tool_call.py: Updated test functions to support streaming and adjusted assertions for tool calls.
  • tools/server/tests/unit/test_chat_completion.py: Modified stream assertion checks (using is None rather than empty string).
  • tools/server/server.cpp: Integrated common_chat_syntax and refined grammar trigger handling and logging.
  • tests/test-json-partial.cpp: Added comprehensive tests for partial JSON healing with various fragment examples.
  • tests/test-chat-parser.cpp: Expanded test coverage for chat parsing, including reasoning and regex consumption.
  • src/llama-grammar.cpp: Updated grammar trigger handling to capture the first non-empty capturing group.
  • scripts/tool_bench.py: Enhanced command-line benchmarking parameters by adding chat_template_file support.
  • models/templates/README.md & Qwen-QwQ-32B.jinja: Added documentation and a new Jinja template for Qwen-QwQ-32B.
  • docs/function-calling.md: Included a caution regarding extreme KV quantizations affecting tool calling.
  • common/sampling.cpp: Refactored grammar trigger pattern concatenation by removing patterns_at_start logic.
  • common/json-partial.*: Introduced and refined JSON healing logic with detailed parsing and recovery steps.
  • common/chat*.{h,cpp}: Updated chat structures and the parser to leverage the new chat syntax and reasoning.
Comments suppressed due to low confidence (2)

tools/server/tests/unit/test_tool_call.py:98

  • Consider removing or re-enabling the commented assertion for a non-empty tool call id if it is no longer applicable. This cleanup can help avoid confusion in test outputs and maintain code clarity.
    # assert len(tool_call.get("id", "") > 0, f'Expected non empty tool call id in {tool_call}')

common/sampling.cpp:164

  • [nitpick] The variable name 'trigger_patterns' (previously replacing 'patterns_at_start') could be made more descriptive. Consider renaming it (e.g. to 'grammar_trigger_patterns') for clarity and consistency with its usage later in the code.
    std::vector<std::string> trigger_patterns;

(Excerpt from the partial JSON healing code under review:)

it = temptative_end;
return true;
} catch (const std::exception & ex) {
// No, needs healing.

Copilot AI May 24, 2025


The healing logic for partial JSON parsing is quite complex. Adding detailed inline comments explaining the steps and decisions—especially the conditions under which different closing characters are appended—would improve maintainability and clarity.


@ericcurtin (Collaborator) left a comment

I'd be a fan of merging this, even if it was imperfect, so the community can build upon it

@cgruver commented May 24, 2025

+1 to @ericcurtin. This capability starts to enable llama.cpp as a backend for vibe coding with self-hosted models.

@ochafik ochafik merged commit f5cd27b into ggml-org:master May 25, 2025
48 checks passed
@ochafik (Collaborator, Author) commented May 25, 2025

I'd be a fan of merging this, even if it was imperfect, so the community can build upon it

Merged this as is 😅🤞🙏

@Rane2021 commented

[error] langchain stream tool call error for no tool_call_id

error:
pydantic_core._pydantic_core.ValidationError: 1 validation error for ToolMessage
tool_call_id
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.10/v/string_type
During task with name 'tools' and id '5e515507-10d7-c20a-75c3-856bcef1f4c1'
During task with name 'researcher' and id '029f68e9-0b87-3e4b-59f1-891fc72c88f2'

@TheTerrasque commented

I see a similar problem in n8n: [ERROR: All OpenAI tool calls must have an "id" field.]

@making commented May 26, 2025

A similar error occurs with Spring AI

java.lang.IllegalArgumentException: ToolResponseMessage must have an id
	at org.springframework.util.Assert.isTrue(Assert.java:116) ~[spring-core-6.2.7.jar:6.2.7]
	at org.springframework.ai.openai.OpenAiChatModel.lambda$createRequest$19(OpenAiChatModel.java:594) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) ~[na:na]
	at org.springframework.ai.openai.OpenAiChatModel.lambda$createRequest$21(OpenAiChatModel.java:594) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[na:na]
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:575) ~[na:na]
	at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260) ~[na:na]
	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:616) ~[na:na]
	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:622) ~[na:na]
	at java.base/java.util.stream.ReferencePipeline.toList(ReferencePipeline.java:627) ~[na:na]
	at org.springframework.ai.openai.OpenAiChatModel.createRequest(OpenAiChatModel.java:604) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at org.springframework.ai.openai.OpenAiChatModel.lambda$internalStream$14(OpenAiChatModel.java:272) ~[spring-ai-openai-1.0.0.jar:1.0.0]
	at reactor.core.publisher.FluxDeferContextual.subscribe(FluxDeferContextual.java:49) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:54) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.publisher.FluxSubscribeOn$SubscribeOnSubscriber.run(FluxSubscribeOn.java:194) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) ~[reactor-core-3.7.6.jar:3.7.6]
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) ~[reactor-core-3.7.6.jar:3.7.6]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[na:na]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]

@ochafik (Collaborator, Author) commented May 26, 2025

Thanks all for the feedback so far! (and sorry for the ahem lil breakages & unpolished edges 😅)

As I have only 10h left of copyright ownership to push fixes 🙈, here's a recap of the PRs out for review (cc/ @ngxson @ggerganov @CISC ):

@pwilkin (Contributor) commented May 26, 2025

@ochafik I can confirm that #13786 fixes the tool calling on Qwen3 streaming.

@ochafik (Collaborator, Author) commented May 26, 2025

All pending fixes were merged (thanks for the prompt reviews!), please all sync and report more bugs if you have a chance, I’ll do a last round in a few hours ✌🏼

(ideally please provide self-contained repro instructions, esp. for integration issues with agentic frameworks: even if it’s on a hacky branch of yours, it will help tremendously)

Update: 1st release w/ the fixes is b5497

@teleprint-me (Contributor) commented

Just as an FYI, there's a minor regression with predicted logits.

20:13:18 | /mnt/valerie/public/agent
(.venv) git:(main | Δ) λ python -m agent.openai.stream
{'type': 'role', 'value': 'assistant'}
{'type': 'reasoning.open', 'value': '<think>Okay'}
{'type': 'content', 'value': ','}
{'type': 'content', 'value': ' the'}

Previously, in earlier commits, before this PR was merged, <think> and Okay were in fact separate tokens.

This is okay here in a controlled fashion, but I suspect this might creep up in unsuspected ways further down the line.

Updated simple example
# agent.openai.stream
import json
import os
from pprint import pprint
from typing import Any, Dict, Generator, Optional

import dotenv
from openai import OpenAI

from agent.tools import tools


class GPTRequest:
    def __init__(self, base_url: str = None, api_key: str = None):
        self.client = self._connect(base_url, api_key)

    def _connect(self, base_url: str = None, api_key: str = None) -> OpenAI:
        if not base_url or not api_key:
            dotenv.load_dotenv(".env")

        if not base_url:
            base_url = os.getenv("OPENAI_BASE_URL", "")

        if not api_key:
            api_key = os.getenv("OPENAI_API_KEY", "")

        if not api_key:
            raise ValueError("EnvironmentError: OPENAI_API_KEY not set in .env")

        # Setup default base URL if using local mode
        if api_key == "sk-no-key-required" and not base_url:
            base_url = "http://localhost:8080/v1"

        return OpenAI(api_key=api_key, base_url=base_url)

    def _classify_reasoning(self, text: str) -> Optional[str]:
        if "<think>" in text:
            return "reasoning.open"
        if "</think>" in text:
            return "reasoning.close"
        return None

    def _classify_tool(self, tool_call, buffer, args_fragments):
        fn = tool_call.function
        if fn.name:
            buffer["name"] = fn.name
        if fn.arguments:
            args_fragments.append(fn.arguments)
        if fn.arguments and fn.arguments.strip().endswith("}"):
            try:
                args = "".join(args_fragments)
                buffer["arguments"] = json.loads(args)
                args_fragments.clear()
                return {"type": "tool_call", "value": buffer.copy()}
            except json.JSONDecodeError:
                # Wait for more fragments
                pass
        return None

    def stream(self, **kwargs) -> Generator[Dict[str, Any], None, None]:
        response = self.client.chat.completions.create(**kwargs)
        tool_buffer = {}
        args_fragments = []

        for chunk in response:
            delta = chunk.choices[0].delta

            if hasattr(delta, "role") and delta.role:
                yield {"type": "role", "value": delta.role}

            if hasattr(delta, "content") and delta.content:
                reasoning_type = self._classify_reasoning(delta.content)
                yield {"type": reasoning_type or "content", "value": delta.content}

            if hasattr(delta, "tool_calls") and delta.tool_calls:
                for tool_call in delta.tool_calls:
                    result = self._classify_tool(tool_call, tool_buffer, args_fragments)
                    if result:
                        yield result

            if hasattr(delta, "refusal") and delta.refusal:
                yield {"type": "refusal", "value": delta.refusal.model_dump()}

            if hasattr(delta, "reasoning") and delta.reasoning:  # Future-compatible
                yield {"type": "reasoning", "value": delta.reasoning.model_dump()}


def main():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Paris, France?"},
    ]

    request = GPTRequest()
    generator = request.stream(
        model="gpt-3.5-turbo",  # Llama.Cpp expects this model definition
        messages=messages,
        max_tokens=-1,  # Allow the model to naturally stop on its own
        stream=True,
        temperature=0.8,
        tools=tools,
    )
    for obj in generator:
        pprint(obj)


if __name__ == "__main__":
    main()

My current suspicion is that it's an issue with token healing, which I have to handle upstream because of this issue. It's only obvious here because it shows up right away in a predictable and deterministic fashion.

@julien-c (Contributor) commented

Thanks so much for pushing this over the finish line @ochafik!

@strawberrymelonpanda (Contributor) commented May 27, 2025

As I have only 10h left of copyright ownership to push fixes 🙈

Minja is open-source, right? Do you need copyright ownership?
Curious what this means for the future if Llama.cpp will keep using minja.

@CISC (Collaborator) commented May 27, 2025

Minja is open-source, right? Do you need copyright ownership? Curious what this means for the future if Llama.cpp will keep using minja.

Both minja and llama.cpp share the same license (MIT); the comment was probably referring to the email address used in the commits (which is added to AUTHORS) belonging to Google.

@ochafik (Collaborator, Author) commented May 28, 2025

As I have only 10h left of copyright ownership to push fixes 🙈

Minja is open-source, right? Do you need copyright ownership? Curious what this means for the future if Llama.cpp will keep using minja.

@strawberrymelonpanda I hope to be able to clarify soon w/ my new employer whether I can contribute to minja, to a fork of it or... not at all (in any case, I've added a couple of extra maintainers from Google to the google/minja repo cc/ @matthewchan-g @Ferev, and there's always the option for llama.cpp to just fork it under ggml-org, etc)

@ericcurtin (Collaborator) commented

Do maintainers in the Google namespace have to be Google employees? If yes, that might be a decent reason to move to the ggml-org namespace.

Labels: documentation, examples, python, script, server, testing, tool calling

Successfully merging this pull request may close these issues.

Eval bug: llama-cpp-deepseek-r1.jinja template will miss the <think> tag