ds4-server: watchdog thread + decode-loop SSE keepalive#238

Closed
Allen091080 wants to merge 1 commit into
antirez:mainfrom
Allen091080:watchdog-decode-keepalive
Conversation

@Allen091080

Summary

Builds on the SSE keepalive merged in f027269 (PR #194). That patch
covered prefill silence; this one covers the other two failure modes
on the same surface:

  1. Worker thread stalls in GPU/Metal kernel calls with no
    cancellation points. SIGTERM cannot drain a stuck turn — we observed
    a 12+ hour ds4-server process wedged on a single chat request,
    requiring SIGKILL.
  2. Decode-loop silence during reasoning-only stretches (<think>...)
    or slow tool-input phases: the prefill keepalive does not fire, the
    decode loop produces no streamable bytes for a while, and the client
    TCP idle-timeout closes the connection.

Fix

Two cooperating pieces, both in ds4_server.c:

A) Watchdog thread

A dedicated watchdog_main polls worker_last_progress every 5s
while worker_in_job is set.

  • Soft stall (default 60s): set worker_abort_requested so the
    decode loop bails on its next iteration; the client gets an error
    finish_reason and the worker continues serving future jobs.
  • Hard stall (default 120s): call _exit(137). The launchd
    KeepAlive supervisor restarts immediately; on-disk KV cache
    survives so the restarted process can usually resume the client's
    last prefix from cache instead of re-prefilling.
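
A minimal sketch of the two-threshold decision described above, not the code from the patch. The names `wd_action`, `wd_state`, `wd_classify`, and `wd_poll` are hypothetical, timestamps are assumed to be whole seconds, and the `>=` comparison is inferred from the log line quoted under Verification. The caller is assumed to sleep 5s between polls and to call `_exit(137)` itself when it sees `WD_HARD_STALL`.

```c
#include <assert.h>

/* Hypothetical result type; the names are illustrative, not from the patch. */
typedef enum { WD_OK, WD_SOFT_STALL, WD_HARD_STALL } wd_action;

/* Hypothetical container for the fields listed under "State additions". */
struct wd_state {
    long last_progress;    /* seconds, refreshed by the worker */
    int  in_job;           /* nonzero while a request is in flight */
    int  abort_requested;  /* raised here, read by the decode loop */
    int  soft_s, hard_s;   /* defaults in the description: 60 and 120 */
};

/* Pure decision: what should one watchdog poll do? */
static wd_action wd_classify(long now, long last_progress, int in_job,
                             int soft_s, int hard_s) {
    assert(soft_s > 0 && hard_s >= soft_s);
    if (!in_job) return WD_OK;              /* idle worker is never a stall */
    long idle = now - last_progress;
    if (idle >= hard_s) return WD_HARD_STALL;
    if (idle >= soft_s) return WD_SOFT_STALL;
    return WD_OK;
}

/* One poll: read the shared fields, raise the abort flag on a soft stall.
 * Uses relaxed accesses, as the description states. */
static wd_action wd_poll(struct wd_state *s, long now) {
    long last = __atomic_load_n(&s->last_progress, __ATOMIC_RELAXED);
    int in_job = __atomic_load_n(&s->in_job, __ATOMIC_RELAXED);
    wd_action a = wd_classify(now, last, in_job, s->soft_s, s->hard_s);
    if (a == WD_SOFT_STALL)
        __atomic_store_n(&s->abort_requested, 1, __ATOMIC_RELAXED);
    return a;  /* caller is assumed to _exit(137) on WD_HARD_STALL */
}
```

Keeping the decision in a pure function makes the thresholds checkable without threads or real sleeps; the thread body then reduces to sleep, poll, act.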

B) Decode-loop additions

The decode loop now:

  • Checks worker_abort_requested at the top of each iteration and breaks
    with a clear finish=error when the watchdog has requested an abort.
  • Emits a : decode\n\n SSE comment line at most every 15s when
    j->req.stream is set (matches the prefill keepalive cadence).
  • Refreshes worker_last_progress after each produced token via
    relaxed __atomic_store_n.
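
The keepalive cadence in the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not the patch: `struct keepalive`, `decode_keepalive`, and `KEEPALIVE_INTERVAL_S` are hypothetical names, time is in whole seconds, and the sketch writes into a caller-supplied buffer rather than a socket. It also assumes the timer restarts whenever bytes reach the client, which the description does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define KEEPALIVE_INTERVAL_S 15   /* cadence stated in the description */

/* Hypothetical bookkeeping: when bytes last went to the client. */
struct keepalive { long last_write; };

/* Copies ": decode\n\n" into out when a keepalive is due.
 * Returns the number of bytes produced, or 0 when nothing is due. */
static size_t decode_keepalive(struct keepalive *k, long now, bool stream,
                               char *out, size_t cap) {
    static const char msg[] = ": decode\n\n";   /* SSE comment line */
    const size_t len = sizeof(msg) - 1;
    assert(k != NULL && out != NULL);
    if (!stream) return 0;                      /* non-streaming request */
    if (now - k->last_write < KEEPALIVE_INTERVAL_S) return 0;
    if (cap < len) return 0;                    /* caller buffer too small */
    memcpy(out, msg, len);
    k->last_write = now;                        /* restart the timer */
    return len;
}
```

A line starting with `:` is a comment in the server-sent events format, so a conforming client discards it without dispatching an event, while the connection still sees traffic.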

server_progress_cb also refreshes worker_last_progress on every
prefill_chunk callback — without this, big prefills (≥ soft threshold)
would be mistaken for stalls. Chunks fire every ~7s in practice and
reset the timestamp well within the 60s soft limit, so prefill is
never falsely aborted.

State additions

pthread_t worker_tid;
pthread_t watchdog_tid;
bool watchdog_running;
int watchdog_stuck_soft_s;  // default 60
int watchdog_stuck_hard_s;  // default 120

// relaxed __atomic_* accessed across threads
volatile long worker_last_progress;
volatile int worker_in_job;
volatile int worker_abort_requested;

The watchdog only acts while worker_in_job is set — idle workers
don't advance progress and we must not misinterpret that as a stall.
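
For illustration, the worker side of this state might be wrapped in small accessors like the ones below. This is a sketch under assumptions, not the patch: `struct wd_shared` and the `worker_*` helper names are hypothetical, timestamps are whole seconds, and clearing `worker_abort_requested` at job start is assumed (the description only says the worker keeps serving future jobs). The stores to `worker_in_job` use release ordering, which is this sketch's choice and differs from the relaxed accesses the description mentions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical grouping of the three cross-thread fields. */
struct wd_shared {
    long worker_last_progress;    /* seconds, written by the worker */
    int  worker_in_job;           /* 1 while a request is being served */
    int  worker_abort_requested;  /* set by the watchdog on a soft stall */
};

/* Called when a job is dequeued. The timestamp is stamped before the
 * flag is raised so a watchdog never pairs in_job=1 with a stale time. */
static void worker_job_begin(struct wd_shared *s, long now) {
    assert(__atomic_load_n(&s->worker_in_job, __ATOMIC_RELAXED) == 0);
    __atomic_store_n(&s->worker_abort_requested, 0, __ATOMIC_RELAXED);
    __atomic_store_n(&s->worker_last_progress, now, __ATOMIC_RELAXED);
    __atomic_store_n(&s->worker_in_job, 1, __ATOMIC_RELEASE);
}

/* Called after each produced token and on each prefill chunk. */
static void worker_mark_progress(struct wd_shared *s, long now) {
    __atomic_store_n(&s->worker_last_progress, now, __ATOMIC_RELAXED);
}

/* Checked at the top of each decode iteration. */
static bool worker_should_abort(struct wd_shared *s) {
    return __atomic_load_n(&s->worker_abort_requested, __ATOMIC_RELAXED) != 0;
}

/* Called when the job finishes, with or without an error. */
static void worker_job_end(struct wd_shared *s) {
    __atomic_store_n(&s->worker_in_job, 0, __ATOMIC_RELEASE);
}
```

The ordering only helps if the watchdog reads `worker_in_job` with an acquire load before it reads the timestamp. With relaxed accesses on both sides, each field is still read and written atomically, but a reader may observe the flag before the fresh timestamp.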

Verification

Machine: MacBook Pro M5 Max, 128 GiB RAM
Backend: Metal
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf (q2-imatrix)
Server: ./ds4-server --host 0.0.0.0 --port 8000 --ctx 500000 --kv-disk-dir … --kv-disk-space-mb 204800

  • make — clean build, no new warnings.
  • ./ds4_test --server reports server: OK / ds4 tests: ok.
  • Real running ds4-server protected: in an earlier multi-hour run we
    saw a hard wedge that needed SIGKILL. With the patch applied, a
    follow-up run hit the soft trigger ("WATCHDOG soft stall 62s >= 60s —
    requesting decode-loop abort") and the client got a clean error
    within 70s instead of hanging indefinitely. launchd KeepAlive
    picked the process back up.
  • 35s decode burst with reasoning: client receives : decode comment
    lines every ~15s, no client disconnects.

Test plan

  • CI runs ./ds4_test --server
  • Manual: run a heavy chat request long enough to provoke a real
    or simulated stall (e.g. add a temporary sleep in ds4_session_eval)
    and confirm the soft trigger fires within ~65s and _exit(137)
    fires within ~125s.
  • Confirm kill -15 still drains cleanly via the existing
    g_stop_requested shutdown path — the watchdog is purely
    additive.

The inference worker can stall in the underlying GPU/Metal kernel calls
(`ds4_session_sample` / `ds4_session_eval` / `ds4_session_eval_speculative_argmax`).
Those calls have no cancellation points, so SIGTERM cannot drain a stuck
turn — we have seen 12+ hour ds4-server processes wedged with a single
chat request in flight, requiring SIGKILL to recover.  The SSE keepalive
that landed earlier covers prefill but not decode, so once the worker
gets past prefill and enters decode, a Metal stall is again invisible to
the client until its own idle timeout.

Add a dedicated `watchdog_main` thread that polls `worker_last_progress`
while `worker_in_job` is set.  Two thresholds:

  soft stall (default 60s) — set `worker_abort_requested` so the decode
      loop bails on its next iteration; the in-flight HTTP client gets
      an `error` finish_reason and the worker continues to pick up
      future jobs.

  hard stall (default 120s) — call `_exit(137)`.  The launchd
      KeepAlive supervisor restarts us immediately; the on-disk KV
      cache survives so the restarted process can usually resume the
      client's last prompt prefix from cache instead of re-prefilling
      from token zero.

The decode loop now:
  - checks `worker_abort_requested` at the top of each iteration and
    breaks with a clear `finish=error` reason when the watchdog has
    requested an abort;
  - emits a `: decode\n\n` SSE comment line every 15 seconds when
    `j->req.stream` is set, mirroring the prefill keepalive from the
    earlier patch so reasoning-only stretches (`<think>...`) or slow
    tool-input phases don't trip client TCP idle-timeouts either;
  - refreshes `worker_last_progress` (relaxed atomic store) after each
    produced token so the watchdog can distinguish "alive but slow"
    from "wedged".

`server_progress_cb` also refreshes `worker_last_progress` on every
`prefill_chunk` event.  Without this, very long prefills (≥ soft
threshold for big prompts) would be mistaken for Metal stalls —
chunks fire every ~7s in practice and reset the timestamp before the
60s soft limit, so prefill is never falsely aborted.

Field layout:
  - `worker_tid` / `watchdog_tid` / `watchdog_running` / thresholds
  - 3 cross-thread atomics: `worker_last_progress`, `worker_in_job`,
    `worker_abort_requested` (relaxed __atomic_* loads/stores)

The watchdog only acts while `worker_in_job` is set — idle workers
don't advance progress and we must not interpret that as a stall.
Setup-time misconfigurations (model load failure, port conflict) are
not covered; they fail synchronously before any job is ever dequeued.

Verified on macOS Metal, q2-imatrix GGUF, ctx=200000:

- `make` clean build, no new warnings
- `./ds4_test --server` passes
- Real running ds4-server protected: in an earlier 12+ hour run we saw a
  hard wedge that needed SIGKILL; with this patch a follow-up run hit
  the soft trigger ("WATCHDOG soft stall 62s >= 60s — requesting
  decode-loop abort") and the client got a clean error within 70s
  instead of hanging indefinitely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@antirez
Owner

antirez commented May 24, 2026

Fixed more organically in recent commits. See the original issue for more info. Thanks.

@Allen091080
Author

Thanks for the quick reply! Agreed — f91c12b's prefill_display event is the right shape for prefill stalls and lines up better with the project's minimalism.

Closing this PR. I'll send a much smaller follow-up that keeps only the decode-loop SSE keepalive piece (no watchdog thread, no _exit), since long-thinking and long tool-input phases during decode produce no streamable bytes for a while either and the current prefill keepalive doesn't cover those.
