bug: with_connection global write lock + unclosed notify channel freezes bot when ACP process goes stale #295

@chengli

Description

When an ACP child process becomes unresponsive (stale session, auth expired, CLI hung), the bot freezes completely — it continues posting ... placeholders for every incoming mention but never processes any prompt. This affects all threads, not just the one with the stale session. Confirmed on v0.7.2 (latest stable) — the affected code paths are unchanged.

Two bugs combine:

Bug 1: with_connection holds global write lock during streaming (pool.rs:72-80)

pub async fn with_connection<F, R>(&self, thread_id: &str, f: F) -> Result<R> {
    let mut conns = self.connections.write().await;  // global write lock
    let conn = conns.get_mut(thread_id)...;
    f(conn).await  // held for ENTIRE duration of stream_prompt (minutes/hours)
}

stream_prompt runs inside f, so the RwLock write guard on self.connections is held for the entire streaming duration. While it is held, every other get_or_create (read lock), with_connection (write lock), and cleanup_idle (write lock) call blocks. A held write guard excludes readers as well as writers, and Tokio's RwLock is write-preferring (a queued writer also blocks new readers), so even the read-lock fast path in get_or_create cannot proceed.

Bug 2: rx.recv() hangs forever when ACP process dies (connection.rs:152-153)

When the reader task hits EOF (child process died), it does not close the notify channel:

let sub = notify_tx.lock().await;
drop(sub);  // drops MutexGuard, NOT the Option<Sender> inside

The mpsc::UnboundedSender remains alive in Arc<Mutex<Option<Sender>>>, so rx.recv() in stream_prompt (discord.rs:269) never returns None — it waits forever.
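
The difference between dropping the guard and clearing the Option can be reproduced in isolation. The sketch below is not openab code: it swaps in std::sync::mpsc and std::sync::Mutex for the Tokio types so that it builds without dependencies, and NotifyTx, new_channel, on_eof_buggy, on_eof_fixed, and is_closed are hypothetical names used only for illustration.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the Arc<Mutex<Option<Sender>>> described above.
type NotifyTx = Arc<Mutex<Option<mpsc::Sender<String>>>>;

/// Builds a channel whose sender is stored inside the shared Option.
fn new_channel() -> (NotifyTx, Receiver<String>) {
    let (tx, rx) = mpsc::channel();
    (Arc::new(Mutex::new(Some(tx))), rx)
}

/// Mirrors the reported EOF path: only the MutexGuard is dropped,
/// so the Sender stored inside the Option stays alive.
fn on_eof_buggy(notify_tx: &NotifyTx) {
    let sub = notify_tx.lock().unwrap();
    drop(sub);
}

/// Suggested fix: overwrite the Option with None, which drops the Sender.
fn on_eof_fixed(notify_tx: &NotifyTx) {
    let mut sub = notify_tx.lock().unwrap();
    *sub = None;
}

/// Reports whether the receiver sees the channel as closed.
fn is_closed(rx: &Receiver<String>) -> bool {
    matches!(rx.try_recv(), Err(TryRecvError::Disconnected))
}
```

After on_eof_buggy the receiver still sees an open, empty channel; only on_eof_fixed makes it observe disconnection. With Tokio's unbounded channel the analogous signal is recv() returning None once every sender has been dropped.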

Combined effect:

  1. ACP process becomes stale (CLI session expired, API auth timeout, OOM, etc.)
  2. stream_prompt sends prompt via session_prompt → rx.recv() hangs forever (sender not closed)
  3. Write lock on pool.connections held forever
  4. All subsequent get_or_create calls block (read lock blocked by held write lock)
  5. cleanup_idle also blocks (needs write lock) — can't clean up the stale session
  6. Bot posts ... for every incoming mention (before pool access) but never updates them
  7. Entire bot is frozen — not just one thread, ALL threads

Suggested fixes:

  • Fix 1 (critical): Close notify channel on EOF: *sub = None; instead of drop(sub); (the guard then has to be bound with let mut sub)
  • Fix 2 (critical): Don't hold global lock during streaming — use per-connection locks or extract connection from pool during streaming
  • Fix 3 (recommended): Add timeout to rx.recv() in streaming loop
  • Fix 4 (defense-in-depth): Improve alive() to also check child process status
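
One possible shape for Fix 2 is sketched below, assuming each pool entry is wrapped in its own lock. This is not the openab implementation: Conn, Pool, insert, and the prompts_handled field are hypothetical, and the sketch uses synchronous std locks so that it builds without dependencies. The map lock is held only long enough to clone an Arc; the caller's closure then runs under that one connection's lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Hypothetical connection type; the real one wraps an ACP child process.
struct Conn {
    prompts_handled: u32,
}

// Hypothetical pool: the outer lock guards only the map,
// and every entry carries its own lock.
struct Pool {
    connections: RwLock<HashMap<String, Arc<Mutex<Conn>>>>,
}

impl Pool {
    fn new() -> Self {
        Pool { connections: RwLock::new(HashMap::new()) }
    }

    fn insert(&self, thread_id: &str) {
        let conn = Arc::new(Mutex::new(Conn { prompts_handled: 0 }));
        self.connections
            .write()
            .unwrap()
            .insert(thread_id.to_string(), conn);
    }

    /// Clones the entry's Arc under a short read lock, releases the map
    /// lock, then runs `f` while holding only that connection's lock.
    fn with_connection<R>(
        &self,
        thread_id: &str,
        f: impl FnOnce(&mut Conn) -> R,
    ) -> Option<R> {
        let conns = self.connections.read().unwrap();
        let conn = conns.get(thread_id).cloned();
        drop(conns); // release the map lock before `f` runs
        let conn = conn?;
        let mut guard = conn.lock().unwrap();
        Some(f(&mut *guard))
    }
}
```

With this layout a long-running closure for one thread_id no longer prevents lookups, inserts, or cleanup for other threads. In async code the per-connection lock would need to be an async-aware mutex such as tokio::sync::Mutex, because the guard is held across await points.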

Analysis validated independently by Kiro (kiro-cli) and Codex (codex-acp).

Steps to Reproduce

  1. Deploy openab with any ACP agent (claude-agent-acp, gemini --acp, kiro-cli acp, codex-acp)
  2. Send a mention that triggers an ACP session — bot processes it normally
  3. Wait for the ACP child process to become stale (e.g., CLI session expires overnight, API auth times out, or the process hangs)
  4. Send another mention to the same bot
  5. Bot posts ... placeholder but never updates it — stream_prompt hangs on rx.recv() holding the pool write lock
  6. Send mentions targeting different threads — ALL are also frozen because the global write lock is held

Expected Behavior

  • When an ACP process dies or becomes unresponsive, the notify channel should close (rx.recv() returns None), the error is surfaced to the user (e.g., ⚠️ Failed), and the stale session is cleaned up
  • Other threads should not be affected by one stale session — the pool lock should not be held during streaming
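
The expected behavior, combined with the timeout from Fix 3, amounts to distinguishing three outcomes on every wait. The sketch below is an illustration under assumptions rather than openab code: StreamEvent and next_event are hypothetical names, and std::sync::mpsc::Receiver::recv_timeout stands in for the async receive so the example builds without dependencies. In the async code the same shape could be obtained by wrapping rx.recv() in tokio::time::timeout.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// Hypothetical outcome of one wait on the notify channel.
#[derive(Debug, PartialEq)]
enum StreamEvent {
    Message(String),
    Closed,   // every sender was dropped: the agent process is gone
    TimedOut, // nothing arrived within the limit: the agent may be stale
}

/// Waits for the next notification, but never longer than `limit`.
fn next_event(rx: &Receiver<String>, limit: Duration) -> StreamEvent {
    match rx.recv_timeout(limit) {
        Ok(msg) => StreamEvent::Message(msg),
        Err(RecvTimeoutError::Disconnected) => StreamEvent::Closed,
        Err(RecvTimeoutError::Timeout) => StreamEvent::TimedOut,
    }
}
```

A streaming loop built on this would break out on Closed or TimedOut, report the failure to the user, and let the stale session be cleaned up instead of waiting indefinitely.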

Environment

  • openab version: 0.6.0 on our VPS, but confirmed unchanged in 0.7.2 (pool.rs never modified since initial commit, connection.rs EOF cleanup unchanged)
  • Helm chart: 0.6.0 (latest stable: 0.7.3-beta.56)
  • ACP agents: claude-agent-acp, gemini --acp, kiro-cli acp, codex-acp
  • K3s on Zeabur VPS (2 vCPU, 4GB RAM)

Screenshots / Logs

Logs — after freeze, only accepted bot message appears. No spawning agent, no pool error, no errors at all:

[dispatcher] Dispatched #192 (code-review) → itachi (prompt delivered)
# itachi log:
INFO openab::discord: accepted bot message (in allowed_bots_from) bot_id=1490975142803669113
# ... nothing else. No spawning, no errors. Repeated every 5 minutes for 10+ hours.

Discord thread — 25 consecutive ... messages from bots, none ever updated:

[11:24:55] 千手扉間 (Dispatcher): <@itachi> (phase: code-review) — please pick up #192
[11:24:56] 宇智波鼬: ...
[11:29:39] 千手扉間 (Dispatcher): <@itachi> (phase: code-review) — please pick up #192
[11:29:40] 宇智波鼬: ...
# ... repeated 25 times, none updated beyond "..."

Scale: 4 agents frozen, 432+ mentions over ~10 hours, zero processed.
