Description
When an ACP child process becomes unresponsive (stale session, auth expired, CLI hung), the bot freezes completely — it continues posting ... placeholders for every incoming mention but never processes any prompt. This affects all threads, not just the one with the stale session. Confirmed on v0.7.2 (latest stable) — the affected code paths are unchanged.
Two bugs combine:
Bug 1: with_connection holds global write lock during streaming (pool.rs:72-80)
    pub async fn with_connection<F, R>(&self, thread_id: &str, f: F) -> Result<R> {
        let mut conns = self.connections.write().await; // global write lock
        let conn = conns.get_mut(thread_id)...;
        f(conn).await // held for ENTIRE duration of stream_prompt (minutes/hours)
    }
stream_prompt runs inside f, so the RwLock write guard on self.connections is held for the entire streaming duration. While held, all other get_or_create (read lock), with_connection (write lock), and cleanup_idle (write lock) calls block. Tokio's RwLock blocks new readers when a writer is waiting, so even the read-lock fast path in get_or_create is blocked.
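One way the pool-wide lock could be kept out of the streaming path is to give every connection its own lock and hold the map lock only long enough to clone a handle. The following is a minimal sketch of that idea, not openab's actual code: `Conn`, `prompts_handled`, and the `Pool` layout are hypothetical, and `std::sync` locks stand in for the async tokio locks so the example runs without external crates.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Hypothetical stand-in for an ACP connection.
struct Conn {
    prompts_handled: u32,
}

// Hypothetical pool: the map lock guards only the map, each Conn has its own Mutex.
struct Pool {
    connections: RwLock<HashMap<String, Arc<Mutex<Conn>>>>,
}

impl Pool {
    fn new() -> Self {
        Pool { connections: RwLock::new(HashMap::new()) }
    }

    fn get_or_create(&self, thread_id: &str) -> Arc<Mutex<Conn>> {
        {
            // Fast path: short-lived read lock, released at the end of this scope.
            let conns = self.connections.read().unwrap();
            if let Some(conn) = conns.get(thread_id) {
                return Arc::clone(conn);
            }
        }
        // Slow path: short-lived write lock, held only for the insert.
        let mut conns = self.connections.write().unwrap();
        let conn = conns
            .entry(thread_id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(Conn { prompts_handled: 0 })));
        Arc::clone(conn)
    }

    fn with_connection<F, R>(&self, thread_id: &str, f: F) -> Option<R>
    where
        F: FnOnce(&mut Conn) -> R,
    {
        // The read guard is a temporary: it is dropped at the end of this
        // statement, before `f` runs.
        let conn = self.connections.read().unwrap().get(thread_id).cloned()?;
        // Only this connection's own lock is held while `f` runs.
        let mut guard = conn.lock().unwrap();
        Some(f(&mut *guard))
    }
}
```

With this shape, a closure that never returns would stall only callers targeting the same `thread_id`; lookups, inserts, and cleanup on the map would still be able to take the map lock.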
Bug 2: rx.recv() hangs forever when ACP process dies (connection.rs:152-153)
When the reader task hits EOF (child process died), it does not close the notify channel:
    let sub = notify_tx.lock().await;
    drop(sub); // drops MutexGuard, NOT the Option<Sender> inside
The mpsc::UnboundedSender remains alive in Arc<Mutex<Option<Sender>>>, so rx.recv() in stream_prompt (discord.rs:269) never returns None — it waits forever.
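The difference between dropping the guard and clearing the `Option` can be reproduced in isolation. The sketch below is an analogue under stated assumptions, not the code from connection.rs: it uses `std::sync::mpsc` and `std::sync::Mutex` instead of the tokio types, where a closed channel shows up as `RecvTimeoutError::Disconnected` rather than `None`, and the names `SharedSender`, `eof_drop_guard`, `eof_clear_sender`, and `receiver_sees_close` are hypothetical.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Same shape as the Arc<Mutex<Option<Sender>>> described above.
type SharedSender = Arc<Mutex<Option<Sender<String>>>>;

/// Mimics the current EOF path: only the guard is dropped,
/// the Sender stays inside the Option.
fn eof_drop_guard(slot: &SharedSender) {
    let sub = slot.lock().unwrap();
    drop(sub);
}

/// Mimics the suggested fix: overwriting the Option with None
/// drops the Sender, which closes the channel.
fn eof_clear_sender(slot: &SharedSender) {
    let mut sub = slot.lock().unwrap();
    *sub = None;
}

/// Returns true if the receiver observes a closed channel within 50 ms
/// after the given EOF handler has run.
fn receiver_sees_close(on_eof: fn(&SharedSender)) -> bool {
    let (tx, rx) = channel::<String>();
    let slot: SharedSender = Arc::new(Mutex::new(Some(tx)));
    on_eof(&slot);
    matches!(
        rx.recv_timeout(Duration::from_millis(50)),
        Err(RecvTimeoutError::Disconnected)
    )
}
```

Calling `receiver_sees_close(eof_drop_guard)` times out because the only sender is still alive in the slot, while `receiver_sees_close(eof_clear_sender)` observes the disconnect. The timeout exists only so the first case terminates; a plain blocking `recv()` would wait indefinitely, which is the hang described above.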
Combined effect:
- ACP process becomes stale (CLI session expired, API auth timeout, OOM, etc.)
- stream_prompt sends prompt via session_prompt → rx.recv() hangs forever (sender not closed)
- Write lock on pool.connections held forever
- All subsequent get_or_create calls block (read lock blocked by held write lock)
- cleanup_idle also blocks (needs write lock), so the stale session cannot be cleaned up
- Bot posts "..." for every incoming mention (before pool access) but never updates them
- Entire bot is frozen: not just one thread, ALL threads
Suggested fixes:
- Fix 1 (critical): Close notify channel on EOF: *sub = None; instead of drop(sub);
- Fix 2 (critical): Don't hold global lock during streaming; use per-connection locks or extract connection from pool during streaming
- Fix 3 (recommended): Add timeout to rx.recv() in streaming loop
- Fix 4 (defense-in-depth): Improve alive() to also check child process status
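Fix 3 could look roughly like the loop below: each receive is bounded by an idle timeout, and the two ways the stream can end are reported separately so the caller can surface an error and release the connection. This is a sketch under assumptions rather than the discord.rs loop: it uses `std::sync::mpsc::Receiver::recv_timeout` so it runs without external crates (in async code, wrapping `rx.recv()` in `tokio::time::timeout` would play the same role), and `StreamEnd` and `drain_with_timeout` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// Hypothetical outcome type: why the streaming loop stopped.
#[derive(Debug, PartialEq)]
enum StreamEnd {
    /// All senders were dropped (e.g. the reader task closed the channel on EOF).
    Closed,
    /// No chunk arrived within the idle timeout (e.g. the child process is hung).
    TimedOut,
}

/// Collects chunks until the channel closes or stays silent for `idle`.
fn drain_with_timeout(rx: &Receiver<String>, idle: Duration) -> (Vec<String>, StreamEnd) {
    let mut chunks = Vec::new();
    loop {
        match rx.recv_timeout(idle) {
            Ok(chunk) => chunks.push(chunk),
            Err(RecvTimeoutError::Disconnected) => return (chunks, StreamEnd::Closed),
            Err(RecvTimeoutError::Timeout) => return (chunks, StreamEnd::TimedOut),
        }
    }
}
```

The timeout is per chunk rather than per prompt, so a long but steadily progressing stream is not cut off; only a silent one is.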
Analysis validated independently by Kiro (kiro-cli) and Codex (codex-acp).
Steps to Reproduce
- Deploy openab with any ACP agent (claude-agent-acp, gemini --acp, kiro-cli acp, codex-acp)
- Send a mention that triggers an ACP session — bot processes it normally
- Wait for the ACP child process to become stale (e.g., CLI session expires overnight, API auth times out, or the process hangs)
- Send another mention to the same bot
- Bot posts "..." placeholder but never updates it: stream_prompt hangs on rx.recv() while holding the pool write lock
- Send mentions targeting different threads — ALL are also frozen because the global write lock is held
Expected Behavior
- When an ACP process dies or becomes unresponsive, the notify channel should close (rx.recv() returns None), the error should be surfaced to the user (e.g., ⚠️ Failed), and the stale session should be cleaned up
- Other threads should not be affected by one stale session — the pool lock should not be held during streaming
Environment
- openab version: 0.6.0 on our VPS, but confirmed unchanged in 0.7.2 (pool.rs never modified since initial commit, connection.rs EOF cleanup unchanged)
- Helm chart: 0.6.0 (latest stable: 0.7.3-beta.56)
- ACP agents: claude-agent-acp, gemini --acp, kiro-cli acp, codex-acp
- K3s on Zeabur VPS (2 vCPU, 4GB RAM)
Screenshots / Logs
Logs — after freeze, only accepted bot message appears. No spawning agent, no pool error, no errors at all:
    [dispatcher] Dispatched #192 (code-review) → itachi (prompt delivered)
    # itachi log:
    INFO openab::discord: accepted bot message (in allowed_bots_from) bot_id=1490975142803669113
    # ... nothing else. No spawning, no errors. Repeated every 5 minutes for 10+ hours.
Discord thread — 25 consecutive ... messages from bots, none ever updated:
    [11:24:55] 千手扉間 (Dispatcher): <@itachi> (phase: code-review) — please pick up #192
    [11:24:56] 宇智波鼬: ...
    [11:29:39] 千手扉間 (Dispatcher): <@itachi> (phase: code-review) — please pick up #192
    [11:29:40] 宇智波鼬: ...
    # ... repeated 25 times, none updated beyond "..."
Scale: 4 agents frozen, 432+ mentions over ~10 hours, zero processed.