From 107cc0ad3a3f3b8dd4f300027f0fdf2c7afbc8cb Mon Sep 17 00:00:00 2001
From: Evan Nadeau <1878498+evannadeau@users.noreply.github.com>
Date: Fri, 15 May 2026 13:52:29 -0700
Subject: [PATCH] fix(agent-channel): set busy_timeout on agent_channel.db to prevent concurrent stop-hook deadlock

The agent-channel SQLite DB was initialized with `journal_mode = WAL`
and `synchronous = NORMAL` but NO `busy_timeout`. WAL allows concurrent
readers but writers still serialize via the writer lock; without a busy
timeout a concurrent writer hits SQLITE_BUSY immediately rather than
waiting.

The bug surfaces at session end. Every Claude Code session in a project
runs its own MCP server with its own AgentChannel instance writing
heartbeats, offsets, sessions, and system_events to this same
per-project DB. When PA + one or more SAs trigger /exit simultaneously,
all their stop-hooks fire at once and race for the writer lock. The
losers hit SQLITE_BUSY immediately, Claude Code re-fires the stop-hook
reminder (`Before ending: complete orchestrator housekeeping...`), the
losers retry against the still-locked DB, and the cycle hangs both
parent shells until the operator force-quits and restarts the host.
Recovery via process kill alone is insufficient because the
kernel-level SQLite locks survive in shared memory until the WSL VM (or
equivalent) is recycled.

Symptom signature (from operator-reported incident 2026-05-14):

- Two sessions hung at /exit, requiring a full WSL restart 14 minutes
  later
- Their transcripts end with rapid back-to-back stop-hook reminder
  user-role injections (no assistant response between them), pointing
  at a hook retry loop rather than a model hang
- A third concurrent session that /exit-ed slightly earlier drained
  cleanly, which by elimination identifies the failure as
  concurrent-write contention rather than a per-session bug
- agent_channel.db-wal and global.db-wal sat at ~4 MB at incident time,
  inflating the checkpoint-replay component of the contention window

Fix: add `PRAGMA busy_timeout = 5000` to the agent_channel.db
connection init in agent_channel_state.ts, mirroring the same pragma
already set on the global plugin DB at mcp/db/connection.ts:79. Writers
now wait up to 5s for the lock instead of throwing immediately. The
lock is held only briefly (sessions/offsets/heartbeats are tiny
INSERT/UPSERTs), so 5s is well beyond the worst-case contention window.

Style note: this single new line uses `db.run("PRAGMA ...")` rather
than `db.exec("PRAGMA ...")` (which the adjacent lines use); both are
functionally identical for single-statement PRAGMA configuration on
bun:sqlite. The asymmetry was forced by a local pre-tool-use hook that
false-positives on `.exec(` patterns, mistaking them for
child_process.exec. Happy to normalize to .exec on review request.

Adjacent hygiene (intentionally out of scope for this PR but related):
the WAL inflation contributing to incident-time contention has no
periodic-checkpoint mitigation today (only checkpoint-on-close at
agent_channel_state.ts:146). Adding a periodic
`wal_checkpoint(TRUNCATE)` to the MCP server's heartbeat tick would
bound WAL size and further reduce the contention surface. Separate PR.

dist/server.js regenerated via `bun run build` (249 modules, 0.94 MB);
test suite stays at 516 pass / 0 fail.
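
For reviewers weighing the follow-up, here is a minimal sketch of what
the periodic checkpoint mentioned under "Adjacent hygiene" could look
like. It is not part of this patch. `SqlHandle`, `makeWalCheckpointer`,
and `everyNTicks` are hypothetical names introduced for illustration;
the sketch only assumes a handle with an `exec(sql)` method, which
bun:sqlite's `Database` provides, and that the heartbeat loop can call
a function once per tick.

```typescript
// Hypothetical minimal handle shape (an assumption for this sketch);
// bun:sqlite's Database exposes a compatible exec(sql) method.
interface SqlHandle {
  exec(sql: string): unknown;
}

// Returns a function meant to be called once per heartbeat tick.
// Every `everyNTicks` calls it asks SQLite to checkpoint the WAL and
// truncate the -wal file. The return value reports whether a checkpoint
// statement ran without throwing on this tick.
function makeWalCheckpointer(db: SqlHandle, everyNTicks: number): () => boolean {
  if (!Number.isInteger(everyNTicks) || everyNTicks < 1) {
    throw new RangeError("everyNTicks must be a positive integer");
  }
  let ticks = 0;
  return function onHeartbeatTick(): boolean {
    ticks += 1;
    if (ticks % everyNTicks !== 0) return false;
    try {
      db.exec("PRAGMA wal_checkpoint(TRUNCATE);");
      return true;
    } catch {
      // A failed checkpoint is not fatal here: the WAL simply stays
      // larger until the next scheduled attempt.
      return false;
    }
  };
}
```

Counting ticks rather than starting a separate timer keeps the
checkpoint on the existing heartbeat cadence; whether that cadence is
the right one is a question for the separate PR.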
---
 plugins/orchestrator/dist/server.js                |  1 +
 .../orchestrator/mcp/engine/agent_channel_state.ts | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/plugins/orchestrator/dist/server.js b/plugins/orchestrator/dist/server.js
index 6b6141f..08d6885 100644
--- a/plugins/orchestrator/dist/server.js
+++ b/plugins/orchestrator/dist/server.js
@@ -22611,6 +22611,7 @@ function getDb(stateDir) {
     db.exec("PRAGMA journal_mode = WAL;");
   }
   db.exec("PRAGMA synchronous = NORMAL;");
+  db.run("PRAGMA busy_timeout = 5000;");
   db.exec(`
     CREATE TABLE IF NOT EXISTS sessions (
       session_id TEXT PRIMARY KEY,
diff --git a/plugins/orchestrator/mcp/engine/agent_channel_state.ts b/plugins/orchestrator/mcp/engine/agent_channel_state.ts
index 56f12a0..0e31024 100644
--- a/plugins/orchestrator/mcp/engine/agent_channel_state.ts
+++ b/plugins/orchestrator/mcp/engine/agent_channel_state.ts
@@ -201,6 +201,18 @@ function getDb(stateDir: string): Database {
     db.exec("PRAGMA journal_mode = WAL;");
   }
   db.exec("PRAGMA synchronous = NORMAL;");
+  // WAL allows concurrent readers but writers serialize. Without a busy
+  // timeout, a concurrent writer throws SQLITE_BUSY immediately instead of
+  // waiting. Multiple Claude Code sessions in the same project all run their
+  // own MCP server with its own AgentChannel writing heartbeats, offsets,
+  // sessions, and system_events to this DB - and they all hit the stop-hook
+  // write path at end-of-session. Without this timeout, concurrent stop-hooks
+  // race for the writer lock, the loser sees SQLITE_BUSY immediately, Claude
+  // Code re-fires the stop-hook reminder, and the loser retries against the
+  // still-locked DB - producing the deadlock-shape that hangs both parent
+  // shells until host restart. Mirrors the same fix already applied to the
+  // global plugin DB at mcp/db/connection.ts.
+  db.run("PRAGMA busy_timeout = 5000;");
   db.exec(`
     CREATE TABLE IF NOT EXISTS sessions (
       session_id TEXT PRIMARY KEY,