fix: streaming responses drop content for tool-less requests #3

Open
beaglemoo wants to merge 2 commits into samuelfaj:main from beaglemoo:fix/streaming-content-drop

@beaglemoo

Summary

Streaming chat completions drop almost all of their content. A stream: true
request that generates hundreds of tokens arrives at the client as a role
delta, a finish delta, [DONE] — and little or no delta.content in
between. Non-streaming requests are unaffected and return the full text.

This affects every model and every plain (tool-less) streaming chat request,
not a specific architecture — the bug is in the request-handling layer.

Two small, related fixes are included.

Symptoms

On lightning-mlx serve <model> (tested on 0.6.10), a streaming request returns only:

data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":317,...}}
data: [DONE]

usage.completion_tokens is correct (317 tokens were generated) but the text
never reached the client. The same request non-streaming returns all 317
tokens of text fine.
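
One way to check for this symptom from the client side is to count the content-bearing deltas in the stream. The sketch below is illustrative only: the helper name count_content_chunks is hypothetical, the sample stream trims the usage object to a single field, and it assumes the OpenAI-style "data: {...}" framing shown above.

```python
import json


def count_content_chunks(sse_lines):
    """Count SSE events whose delta carries non-empty content."""
    chunks = 0
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            if choice.get("delta", {}).get("content"):
                chunks += 1
    return chunks


# Sample stream mirroring the symptom above (usage trimmed to one field).
broken_stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}],'
    '"usage":{"completion_tokens":317}}',
    "data: [DONE]",
]

print(count_content_chunks(broken_stream))  # 0
```

A count of zero alongside a non-zero usage.completion_tokens is the signature described here: tokens were generated but none were delivered as delta.content.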

Root cause

1. tool_mode is always on (routes/chat.py)

stream_chat_completion computed:

tool_mode = bool(request.tools or cfg.tool_call_parser)

cfg.tool_call_parser defaults to a truthy value (qwen3_coder_xml), so
tool_mode is True for every request — including plain chat with no
tools. That switches on the tool-mode content-suppression heuristics against
ordinary streamed text:

  • force_tool_work fires for any artifact-shaped prompt, and the check
    force_tool_work and last_role != "tool" then skips (via continue) every
    content delta.
  • _looks_like_invalid_tool_continuation has a blanket
    if len(stripped) <= 16: return True rule. Streaming emits content a few
    tokens at a time, so nearly every chunk is 16 characters or fewer and gets
    dropped.

Net effect: the engine generates correctly, then a tool-call safety filter —
meant for a different situation — eats the response on the way out.
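
The two conditions can be reproduced in isolation. This is a minimal sketch, not the project's code: Config, Request, tool_mode_before, tool_mode_after and drops_short_chunk are hypothetical stand-ins, and drops_short_chunk models only the blanket length rule, not the rest of _looks_like_invalid_tool_continuation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    # Stand-in for the server config; the truthy default is taken from the PR.
    tool_call_parser: Optional[str] = "qwen3_coder_xml"


@dataclass
class Request:
    # Stand-in for the chat request; plain chat carries no tools.
    tools: Optional[list] = None


def tool_mode_before(request, cfg):
    """Expression quoted above: truthy whenever a parser is configured."""
    return bool(request.tools or cfg.tool_call_parser)


def tool_mode_after(request, cfg):
    """Expression proposed in this PR: depends only on the request."""
    return bool(request.tools)


def drops_short_chunk(text):
    """Models only the blanket `len(stripped) <= 16` rule."""
    return len(text.strip()) <= 16


cfg = Config()
plain_chat = Request(tools=None)
print(tool_mode_before(plain_chat, cfg))  # True
print(tool_mode_after(plain_chat, cfg))   # False

# Typical streamed fragments are only a few tokens long.
chunks = ["Hello", ", how", " can I", " help you today?"]
print([drops_short_chunk(c) for c in chunks])  # [True, True, True, True]
```

With the old expression a tool-less request still enters tool mode, and every one of these short fragments would then be classed as an invalid continuation.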

2. _validate_tool_call_params crashes on tools=None (service/helpers.py)

A streaming request can produce parsed tool_calls with no declared tools
(e.g. a high-temperature completion emitting tool-call-like markup).
_validate_tool_call_params iterated tools directly:

tool_defs = [t.model_dump() if hasattr(t, "model_dump") else t for t in tools]

tools is None there, raising TypeError: 'NoneType' object is not iterable
and aborting the stream.
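
The failure and the guard can be sketched as follows. The function names are hypothetical and only the list comprehension is taken from the PR; what the real _validate_tool_call_params returns on its early exit is not shown here, so the empty list is an assumption made for illustration.

```python
def build_tool_defs_before(tools):
    """Comprehension as quoted above: fails when tools is None."""
    return [t.model_dump() if hasattr(t, "model_dump") else t for t in tools]


def build_tool_defs_after(tools):
    """Same comprehension behind a falsy guard (return value is assumed)."""
    if not tools:
        return []  # nothing to validate against
    return [t.model_dump() if hasattr(t, "model_dump") else t for t in tools]


try:
    build_tool_defs_before(None)
except TypeError as exc:
    print(f"TypeError: {exc}")  # TypeError: 'NoneType' object is not iterable

print(build_tool_defs_after(None))  # []
print(build_tool_defs_after([{"type": "function"}]))  # [{'type': 'function'}]
```

Checking `not tools` rather than `tools is None` also covers an empty list, which matches the "when tools is falsy" wording of the fix.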

Fix

  • routes/chat.py: tool_mode = bool(request.tools). Tool-mode
    content-suppression should track whether this request carries tools, not
    whether a parser happens to be configured. Also drop the length-based rule
    in _looks_like_invalid_tool_continuation; only the explicit junk fragments
    are genuine invalid-continuation markers, and length alone is not.
  • service/helpers.py: return early from _validate_tool_call_params
    when tools is falsy; there is nothing to validate against.

Two files, +18/-3. Tool-calling behaviour is unchanged: when a request
does carry tools, tool_mode is True exactly as before.

Verification

Tested on an Apple M5 Pro (64 GB), lightning-mlx 0.6.10, Qwen3.6-27B and
Qwen3.6-35B-A3B:

  • Streaming now emits incremental delta.content — e.g. 30 chunks for a
    160-token reply, 62 for an 80-word reply — first token intact.
  • Streaming wall-clock matches non-streaming (the content was previously
    buffered/dropped, not just slow).
  • Tool calls still work, streaming and non-streaming (finish_reason: tool_calls, valid JSON arguments).
  • MTP + n-gram speculative decoding unaffected.

beaglemoo added 2 commits May 16, 2026 21:46
A streaming request can carry parsed tool_calls with no declared tools
(a high-temperature completion emitting tool-call-like markup). The
function iterated over tools directly and crashed with TypeError:
'NoneType' object is not iterable. Return early when tools is falsy.

The streaming chat path computed tool_mode as
`bool(request.tools or cfg.tool_call_parser)`. tool_call_parser
defaults to a truthy value (qwen3_coder_xml), so tool_mode was True for
every request -- including plain chat with no tools. That activated the
tool-mode content-suppression heuristics on ordinary streamed content:

- force_tool_work fired on any artifact-shaped prompt, and
  `force_tool_work and last_role != 'tool'` skipped every content delta;
- _looks_like_invalid_tool_continuation had a blanket
  `len(stripped) <= 16 -> invalid` rule that dropped nearly every
  streamed content chunk, since streaming emits a few tokens at a time.

Net effect: streaming responses emitted a role delta, a finish delta and
[DONE] with little or no delta.content, while non-streaming was fine.

Fix: tool_mode now tracks bool(request.tools) -- whether THIS request
carries tools -- not whether a parser is configured. Also drop the
length-based rule in _looks_like_invalid_tool_continuation; only the
explicit junk fragments are real invalid-continuation markers.

Verified: streaming now emits incremental delta.content (30 chunks for a
160-token reply), first token intact, MTP + tool calls still work.