ds4-server: watchdog thread + decode-loop SSE keepalive #238
Closed
Allen091080 wants to merge 1 commit into
The inference worker can stall in the underlying GPU/Metal kernel calls
(`ds4_session_sample` / `ds4_session_eval` / `ds4_session_eval_speculative_argmax`).
Those calls have no cancellation points, so SIGTERM cannot drain a stuck
turn — we have seen 12+ hour ds4-server processes wedged with a single
chat request in flight, requiring SIGKILL to recover. The SSE keepalive
that landed earlier covers prefill but not decode, so once the worker
gets past prefill and enters decode, a Metal stall is again invisible to
the client until its own idle timeout.
Add a dedicated `watchdog_main` thread that polls `worker_last_progress`
while `worker_in_job` is set. Two thresholds:
- soft stall (default 60s): set `worker_abort_requested` so the decode
loop bails on its next iteration; the in-flight HTTP client gets
an `error` finish_reason and the worker continues to pick up
future jobs.
- hard stall (default 120s): call `_exit(137)`. The launchd
KeepAlive supervisor restarts us immediately; the on-disk KV
cache survives so the restarted process can usually resume the
client's last prompt prefix from cache instead of re-prefilling
from token zero.
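The two-threshold policy above can be sketched as a pure decision function. This is a minimal illustration, not the code from the patch: `wd_action` and `wd_classify` are hypothetical names, and the real `watchdog_main` may structure the check differently.

```c
#include <stdbool.h>

/* Hypothetical names: the patch itself is not shown in this PR text. */
typedef enum {
    WD_OK,          /* progress is recent enough, or no job is running */
    WD_SOFT_ABORT,  /* ask the decode loop to bail on its next iteration */
    WD_HARD_EXIT    /* give up; the supervisor restarts the process */
} wd_action;

/* Decide what the watchdog should do on one poll.
 * stalled_secs is "now - last progress"; thresholds are in seconds. */
static wd_action wd_classify(bool in_job, long stalled_secs,
                             long soft_secs, long hard_secs) {
    if (!in_job) return WD_OK;  /* idle workers are never treated as stalled */
    if (stalled_secs >= hard_secs) return WD_HARD_EXIT;
    if (stalled_secs >= soft_secs) return WD_SOFT_ABORT;
    return WD_OK;
}
```

With the defaults from the commit message, `wd_classify(true, 62, 60, 120)` yields `WD_SOFT_ABORT`, which corresponds to the 62s soft trigger reported in the verification notes. Checking the hard threshold first keeps the result correct even if a poll is delayed past both limits.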
The decode loop now:
- checks `worker_abort_requested` at the top of each iteration and
breaks with a clear `finish=error` reason when the watchdog has
requested an abort;
- emits a `: decode\n\n` SSE comment line every 15 seconds when
`j->req.stream` is set, mirroring the prefill keepalive from the
earlier patch so reasoning-only stretches (`<think>...`) or slow
tool-input phases don't trip client TCP idle-timeouts either;
- refreshes `worker_last_progress` (relaxed atomic store) after each
produced token so the watchdog can distinguish "alive but slow"
from "wedged".
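The three decode-loop changes can be sketched as a single iteration step. Everything here is illustrative: `worker_state`, `client_state`, `decode_step`, and the in-memory `out` buffer are hypothetical stand-ins, and the sketch assumes the keepalive is sent only when nothing has been written to the client for 15 seconds.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative stand-ins, not the actual ds4_server.c structures. */
typedef struct {
    long last_progress;    /* seconds; plays the role of worker_last_progress */
    int  abort_requested;  /* plays the role of worker_abort_requested */
} worker_state;

typedef struct {
    bool stream;           /* is the client using SSE streaming? */
    long last_write;       /* time of the last bytes sent to the client */
    char out[256];         /* captured output, for demonstration only */
} client_state;

#define KEEPALIVE_SECS 15

/* Append text to the captured output without overflowing the buffer. */
static void emit(client_state *c, const char *s, long now) {
    strncat(c->out, s, sizeof c->out - strlen(c->out) - 1);
    c->last_write = now;
}

/* One decode iteration. `visible` is the text to stream for this token,
 * or "" when the token produces nothing streamable (e.g. reasoning).
 * Returns false when the loop should stop with finish=error. */
static bool decode_step(worker_state *w, client_state *c,
                        long now, const char *visible) {
    /* 1. Honor a pending watchdog abort before doing more work. */
    if (__atomic_load_n(&w->abort_requested, __ATOMIC_RELAXED)) return false;

    /* (the real loop would sample and evaluate the next token here) */

    /* 2. Stream visible text, or an SSE comment if the line has been quiet. */
    if (c->stream) {
        if (visible[0] != '\0') emit(c, visible, now);
        else if (now - c->last_write >= KEEPALIVE_SECS) emit(c, ": decode\n\n", now);
    }

    /* 3. Record progress so the watchdog can tell slow from wedged. */
    __atomic_store_n(&w->last_progress, now, __ATOMIC_RELAXED);
    return true;
}
```

An SSE line that starts with a colon is a comment, so clients ignore `: decode` while the connection still carries traffic.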
`server_progress_cb` also refreshes `worker_last_progress` on every
`prefill_chunk` event. Without this, a prefill that runs longer than
the soft threshold (likely for big prompts) would be mistaken for a
Metal stall. With it, chunks fire every ~7s in practice and reset the
timestamp well before the 60s soft limit, so a healthy prefill is not
falsely aborted.
Field layout:
- `worker_tid` / `watchdog_tid` / `watchdog_running` / thresholds
- 3 cross-thread atomics: `worker_last_progress`, `worker_in_job`,
`worker_abort_requested` (relaxed __atomic_* loads/stores)
The watchdog only acts while `worker_in_job` is set — idle workers
don't advance progress and we must not interpret that as a stall.
Setup-time misconfigurations (model load failure, port conflict) are
not covered; they fail synchronously before any job is ever dequeued.
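A possible shape for the state described above is sketched below. Only the field names come from the commit message; the enclosing struct, the field types, and the helper functions (`mono_secs`, `wd_job_begin`, `wd_job_end`, `wd_touch`, `wd_stalled_secs`) are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical layout: names follow the commit message, types are guesses. */
typedef struct {
    pthread_t worker_tid;
    pthread_t watchdog_tid;
    bool      watchdog_running;
    long      soft_stall_secs;      /* default 60 */
    long      hard_stall_secs;      /* default 120 */

    /* Cross-thread state, touched only through relaxed atomics. */
    long worker_last_progress;      /* monotonic seconds */
    int  worker_in_job;
    int  worker_abort_requested;
} server_wd_state;

/* Monotonic clock, so wall-clock adjustments cannot fake a stall. */
static long mono_secs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long)ts.tv_sec;
}

/* Worker side: stamp progress (called per token and per prefill chunk). */
static void wd_touch(server_wd_state *s) {
    __atomic_store_n(&s->worker_last_progress, mono_secs(), __ATOMIC_RELAXED);
}

/* Worker side: bracket each job so idle time is never measured. */
static void wd_job_begin(server_wd_state *s) {
    __atomic_store_n(&s->worker_abort_requested, 0, __ATOMIC_RELAXED);
    wd_touch(s);
    __atomic_store_n(&s->worker_in_job, 1, __ATOMIC_RELAXED);
}

static void wd_job_end(server_wd_state *s) {
    __atomic_store_n(&s->worker_in_job, 0, __ATOMIC_RELAXED);
}

/* Watchdog side: seconds since last progress, or -1 when no job is running. */
static long wd_stalled_secs(server_wd_state *s) {
    if (!__atomic_load_n(&s->worker_in_job, __ATOMIC_RELAXED)) return -1;
    return mono_secs()
         - __atomic_load_n(&s->worker_last_progress, __ATOMIC_RELAXED);
}
```

One design point worth noting: relaxed ordering does not guarantee that the watchdog sees the fresh timestamp before it sees `worker_in_job` set. A release store and acquire load on `worker_in_job` would rule out reading a stale timestamp at job start; whether that matters for the patch depends on code not shown here.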
Verified on macOS Metal, q2-imatrix GGUF, ctx=200000:
- `make` clean build, no new warnings
- `./ds4_test --server` passes
- Live ds4-server run: in an earlier 12+ hour run we saw a
hard wedge that needed SIGKILL; with this patch a follow-up run hit
the soft trigger ("WATCHDOG soft stall 62s >= 60s — requesting
decode-loop abort") and the client got a clean error within 70s
instead of hanging indefinitely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Fixed more organically in recent commits. See the original issue for more info. Thanks.
Author
Thanks for the quick reply! Agreed — f91c12b's Closing this PR. I'll send a much smaller follow-up that keeps only the decode-loop SSE keepalive piece (no watchdog thread, no |
Summary
Builds on the SSE keepalive merged in f027269 (PR #194). That patch
covered prefill silence; this one covers the other two failure modes
on the same surface:
- Stalls inside the GPU/Metal kernel calls, which have no
cancellation points. SIGTERM cannot drain a stuck turn: we observed
a 12+ hour ds4-server process wedged on a single chat request,
requiring SIGKILL.
- Reasoning-only stretches (`<think>...`) or slow tool-input phases:
the prefill keepalive does not fire, the
decode loop produces no streamable bytes for a while, and the client
TCP idle-timeout closes the connection.
Fix
Two cooperating pieces, both in `ds4_server.c`:
A) Watchdog thread
A dedicated `watchdog_main` polls `worker_last_progress` every 5s
while `worker_in_job` is set.
- soft stall (default 60s): sets `worker_abort_requested` so the
decode loop bails on its next iteration; the client gets an
`error` finish_reason and the worker continues serving future jobs.
- hard stall (default 120s): calls `_exit(137)`. The launchd
`KeepAlive` supervisor restarts immediately; the on-disk KV cache
survives so the restarted process can usually resume the client's
last prefix from cache instead of re-prefilling.
B) Decode-loop additions
The decode loop now:
- checks `worker_abort_requested` at the top of each iteration and breaks
with a clear `finish=error` when the watchdog has requested an abort;
- emits a `: decode\n\n` SSE comment line at most every 15s when
`j->req.stream` is set (matches the prefill keepalive cadence);
- refreshes `worker_last_progress` after each produced token via a
relaxed `__atomic_store_n`.
`server_progress_cb` also refreshes `worker_last_progress` on every
`prefill_chunk` callback. Without this, big prefills (≥ soft threshold)
would be mistaken for stalls. Chunks fire every ~7s in practice and
reset the timestamp well within the 60s soft limit, so prefill is
never falsely aborted.
State additions
The watchdog only acts while `worker_in_job` is set: idle workers
don't advance progress and we must not misinterpret that as a stall.
Verification
Machine: MacBook Pro M5 Max, 128 GiB RAM
Backend: Metal
Model: `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` (q2-imatrix)
Server: `./ds4-server --host 0.0.0.0 --port 8000 --ctx 500000 --kv-disk-dir … --kv-disk-space-mb 204800`
- `make`: clean build, no new warnings.
- `./ds4_test --server`: `server: OK` / `ds4 tests: ok`.
- In an earlier 12+ hour run we saw a hard wedge that needed SIGKILL.
With the patch applied, a follow-up run hit the soft trigger
(`WATCHDOG soft stall 62s >= 60s — requesting decode-loop abort`) and
the client got a clean error within 70s instead of hanging
indefinitely. launchd KeepAlive picked the process back up.
- `: decode` comment lines every ~15s, no client disconnects.
Test plan
- `./ds4_test --server`
- Simulated stall (e.g. add a temporary `sleep` in `ds4_session_eval`):
confirm the soft trigger fires within ~65s and `_exit(137)` fires
within ~125s.
- `kill -15` still drains cleanly via the existing `g_stop_requested`
shutdown path; the watchdog is purely additive.