fix: streaming responses drop content for tool-less requests#3
Open
beaglemoo wants to merge 2 commits into
Open
fix: streaming responses drop content for tool-less requests#3beaglemoo wants to merge 2 commits into
beaglemoo wants to merge 2 commits into
Conversation
A streaming request can carry parsed tool_calls with no declared tools (a high-temperature completion emitting tool-call-like markup). The function iterated over tools directly and crashed with TypeError: 'NoneType' object is not iterable. Return early when tools is falsy.
The streaming chat path computed tool_mode as `bool(request.tools or cfg.tool_call_parser)`. tool_call_parser defaults to a truthy value (qwen3_coder_xml), so tool_mode was True for every request -- including plain chat with no tools. That activated the tool-mode content-suppression heuristics on ordinary streamed content: - force_tool_work fired on any artifact-shaped prompt, and `force_tool_work and last_role != 'tool'` skipped every content delta; - _looks_like_invalid_tool_continuation had a blanket `len(stripped) <= 16 -> invalid` rule that dropped nearly every streamed content chunk, since streaming emits a few tokens at a time. Net effect: streaming responses emitted a role delta, a finish delta and [DONE] with little or no delta.content, while non-streaming was fine. Fix: tool_mode now tracks bool(request.tools) -- whether THIS request carries tools -- not whether a parser is configured. Also drop the length-based rule in _looks_like_invalid_tool_continuation; only the explicit junk fragments are real invalid-continuation markers. Verified: streaming now emits incremental delta.content (30 chunks for a 160-token reply), first token intact, MTP + tool calls still work.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Streaming chat completions drop almost all of their content. A
stream: truerequest that generates hundreds of tokens arrives at the client as a
roledelta, a
finishdelta,[DONE]— and little or nodelta.contentinbetween. Non-streaming requests are unaffected and return the full text.
This affects every model and every plain (tool-less) streaming chat request,
not a specific architecture — the bug is in the request-handling layer.
Two small, related fixes are included.
Symptoms
On
lightning-mlx serve <model>(tested on 0.6.10), a streaming request:usage.completion_tokensis correct (317 tokens were generated) but the textnever reached the client. The same request non-streaming returns all 317
tokens of text fine.
Root cause
1.
tool_modeis always on (routes/chat.py)stream_chat_completioncomputed:cfg.tool_call_parserdefaults to a truthy value (qwen3_coder_xml), sotool_modeisTruefor every request — including plain chat with notools. That switches on the tool-mode content-suppression heuristics againstordinary streamed text:
force_tool_workfires for any artifact-shaped prompt, andforce_tool_work and last_role != "tool"thencontinues past everycontent delta.
_looks_like_invalid_tool_continuationhas a blanketif len(stripped) <= 16: return Truerule. Streaming emits content a fewtokens at a time, so nearly every chunk is under 16 characters and gets
dropped.
Net effect: the engine generates correctly, then a tool-call safety filter —
meant for a different situation — eats the response on the way out.
2.
_validate_tool_call_paramscrashes ontools=None(service/helpers.py)A streaming request can produce parsed
tool_callswith no declaredtools(e.g. a high-temperature completion emitting tool-call-like markup).
_validate_tool_call_paramsiteratedtoolsdirectly:toolsisNonethere, raisingTypeError: 'NoneType' object is not iterableand aborting the stream.
Fix
routes/chat.py—tool_mode = bool(request.tools). Tool-modecontent-suppression should track whether this request carries tools, not
whether a parser happens to be configured. Also drop the length-based rule
in
_looks_like_invalid_tool_continuation; only the explicit junk fragmentsare genuine invalid-continuation markers, length alone is not.
service/helpers.py— return early from_validate_tool_call_paramswhen
toolsis falsy; there is nothing to validate against.Two files, +18/-3. Tool-calling behaviour is unchanged: when a request
does carry
tools,tool_modeisTrueexactly as before.Verification
Tested on an Apple M5 Pro (64 GB), lightning-mlx 0.6.10, Qwen3.6-27B and
Qwen3.6-35B-A3B:
delta.content— e.g. 30 chunks for a160-token reply, 62 for an 80-word reply — first token intact.
buffered/dropped, not just slow).
finish_reason: tool_calls, valid JSON arguments).