fix: streaming responses drop content for tool-less requests by beaglemoo · Pull Request #3 · samuelfaj/lightning-mlx

beaglemoo · 2026-05-16T22:04:55Z

Summary

Streaming chat completions drop almost all of their content. A stream: true
request that generates hundreds of tokens arrives at the client as a role
delta, a finish delta, [DONE] — and little or no delta.content in
between. Non-streaming requests are unaffected and return the full text.

This affects every model and every plain (tool-less) streaming chat request,
not a specific architecture — the bug is in the request-handling layer.

Two small, related fixes are included.

Symptoms

On lightning-mlx serve <model> (tested on 0.6.10), a streaming request returns only:

data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}],"usage":{"completion_tokens":317,...}}
data: [DONE]

usage.completion_tokens is correct (317 tokens were generated) but the text
never reached the client. The same request non-streaming returns all 317
tokens of text fine.

Root cause

1. `tool_mode` is always on (`routes/chat.py`)

stream_chat_completion computed:

tool_mode = bool(request.tools or cfg.tool_call_parser)

cfg.tool_call_parser defaults to a truthy value (qwen3_coder_xml), so
tool_mode is True for every request — including plain chat with no
tools. That switches on the tool-mode content-suppression heuristics against
ordinary streamed text:

- `force_tool_work` fires for any artifact-shaped prompt, and the check
  `force_tool_work and last_role != "tool"` then continues past every
  content delta.
- `_looks_like_invalid_tool_continuation` has a blanket
  `if len(stripped) <= 16: return True` rule. Streaming emits content a few
  tokens at a time, so nearly every chunk is 16 characters or fewer and gets
  dropped.

Net effect: the engine generates correctly, then a tool-call safety filter —
meant for a different situation — eats the response on the way out.
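
The interaction can be reproduced in isolation. The sketch below is a hypothetical reduction, not the project's code: `Cfg`, `Request`, `looks_invalid_old`, `surviving_chunks` and the sample chunks are stand-ins, and only the `tool_mode` expression and the 16-character rule come from the description above. In this reduction the fixed case passes because tool mode is off for a tool-less request; the PR additionally removes the length rule itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cfg:
    # Stand-in for the server config; the default mirrors the one described above.
    tool_call_parser: Optional[str] = "qwen3_coder_xml"


@dataclass
class Request:
    # Stand-in for the chat request; a plain chat request declares no tools.
    tools: Optional[list] = None


def tool_mode_old(request: Request, cfg: Cfg) -> bool:
    # Original expression: a configured parser alone turns tool mode on.
    return bool(request.tools or cfg.tool_call_parser)


def tool_mode_new(request: Request) -> bool:
    # Fixed expression: tool mode tracks only the tools on this request.
    return bool(request.tools)


def looks_invalid_old(chunk: str) -> bool:
    # Only the blanket length rule is modelled here (hypothetical reduction).
    return len(chunk.strip()) <= 16


def surviving_chunks(chunks, tool_mode):
    # With tool mode on, every chunk that trips the length rule is dropped.
    return [c for c in chunks if not (tool_mode and looks_invalid_old(c))]


chunks = ["Hello", ", this", " is a", " streamed", " reply."]
plain = Request(tools=None)
cfg = Cfg()

dropped_case = surviving_chunks(chunks, tool_mode_old(plain, cfg))  # nothing survives
fixed_case = surviving_chunks(chunks, tool_mode_new(plain))         # all chunks survive
```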

2. `_validate_tool_call_params` crashes on `tools=None` (`service/helpers.py`)

A streaming request can produce parsed tool_calls with no declared tools
(e.g. a high-temperature completion emitting tool-call-like markup).
_validate_tool_call_params iterated tools directly:

tool_defs = [t.model_dump() if hasattr(t, "model_dump") else t for t in tools]

tools is None there, raising TypeError: 'NoneType' object is not iterable
and aborting the stream.
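
The failure is ordinary Python behaviour and can be shown without the server. In the sketch below, `tool_defs_unguarded` wraps the comprehension quoted above, while `tool_defs_guarded` and its empty-list return value are assumptions for illustration; the real function's return value on early exit is not stated here.

```python
from typing import Optional


def tool_defs_unguarded(tools: Optional[list]) -> list:
    # Same comprehension as quoted above; iterating None raises TypeError.
    return [t.model_dump() if hasattr(t, "model_dump") else t for t in tools]


def tool_defs_guarded(tools: Optional[list]) -> list:
    # Hypothetical guard: no declared tools means nothing to validate against.
    if not tools:
        return []
    return [t.model_dump() if hasattr(t, "model_dump") else t for t in tools]


try:
    tool_defs_unguarded(None)
    error_message = None
except TypeError as exc:
    # Captures the message of the same error that aborted the stream.
    error_message = str(exc)
```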

Fix

- routes/chat.py: `tool_mode = bool(request.tools)`. Tool-mode
  content-suppression should track whether this request carries tools, not
  whether a parser happens to be configured. Also drop the length-based rule
  in `_looks_like_invalid_tool_continuation`; only the explicit junk fragments
  are genuine invalid-continuation markers; length alone is not.
- service/helpers.py: return early from `_validate_tool_call_params`
  when tools is falsy; there is nothing to validate against.

Two files, +18/-3. Tool-calling behaviour is unchanged: when a request
does carry tools, tool_mode is True exactly as before.

Verification

Tested on an Apple M5 Pro (64 GB), lightning-mlx 0.6.10, Qwen3.6-27B and
Qwen3.6-35B-A3B:

- Streaming now emits incremental delta.content (e.g. 30 chunks for a
  160-token reply, 62 for an 80-word reply), first token intact.
- Streaming wall-clock matches non-streaming (the content was previously
  buffered/dropped, not just slow).
- Tool calls still work, streaming and non-streaming (finish_reason: tool_calls, valid JSON arguments).
- MTP + n-gram speculative decoding unaffected.
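
One way to check the first point is to count the content deltas in a captured stream. The helper below is a minimal sketch, assuming the `data: {...}` framing shown under Symptoms; `content_chunks` is a hypothetical name, and the `healthy` transcript is made-up illustrative data, not captured server output.

```python
import json


def content_chunks(sse_lines):
    """Return the delta.content strings found in an SSE transcript."""
    chunks = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                chunks.append(content)
    return chunks


# Shape of the broken stream: role delta, finish delta, [DONE], no content.
broken = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]

# Illustrative healthy stream with two content deltas.
healthy = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
```

An empty list from `content_chunks` alongside a non-zero usage.completion_tokens is the signature of the bug described here.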

A streaming request can carry parsed tool_calls with no declared tools (a high-temperature completion emitting tool-call-like markup). The function iterated over tools directly and crashed with TypeError: 'NoneType' object is not iterable. Return early when tools is falsy.

The streaming chat path computed tool_mode as `bool(request.tools or cfg.tool_call_parser)`. tool_call_parser defaults to a truthy value (qwen3_coder_xml), so tool_mode was True for every request -- including plain chat with no tools.

That activated the tool-mode content-suppression heuristics on ordinary streamed content:
- force_tool_work fired on any artifact-shaped prompt, and `force_tool_work and last_role != 'tool'` skipped every content delta;
- _looks_like_invalid_tool_continuation had a blanket `len(stripped) <= 16 -> invalid` rule that dropped nearly every streamed content chunk, since streaming emits a few tokens at a time.

Net effect: streaming responses emitted a role delta, a finish delta and [DONE] with little or no delta.content, while non-streaming was fine.

Fix: tool_mode now tracks bool(request.tools) -- whether THIS request carries tools -- not whether a parser is configured. Also drop the length-based rule in _looks_like_invalid_tool_continuation; only the explicit junk fragments are real invalid-continuation markers.

Verified: streaming now emits incremental delta.content (30 chunks for a 160-token reply), first token intact, MTP + tool calls still work.

beaglemoo added 2 commits May 16, 2026 21:46

beaglemoo wants to merge 2 commits into samuelfaj:main from beaglemoo:fix/streaming-content-drop
