
refactor(pool): per-connection Arc<Mutex> to unblock concurrent sessions#183

Closed
ruan330 wants to merge 1 commit into openabdev:main from ruan330:refactor/per-connection-lock

Conversation


@ruan330 ruan330 commented Apr 10, 2026

Problem

SessionPool::with_connection currently holds the pool's write lock for the entire callback duration. Because stream_prompt in discord.rs runs inside that callback and can take many seconds — or minutes — to drain an ACP turn, every other Discord thread is blocked from touching the pool while one session is streaming, even for get_or_create on a completely unrelated thread_id (which only needs the read lock).

In production on our fork (which runs a few dozen concurrent Discord threads against one broker), this manifests as every other thread stalling whenever a single session is mid-stream.

Fix

Wrap each AcpConnection in Arc<Mutex<_>>:

  • with_connection takes the pool's read lock only long enough to clone the Arc for the target thread_id, then releases it.
  • It then locks that specific connection's mutex for the duration of the callback.
  • Other sessions continue to stream concurrently.
  • get_or_create on unrelated thread_ids is no longer blocked.
  • Rebuilds still take the write lock briefly (correct — that's a structural change to the HashMap).
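
A minimal sketch of the new shape. This uses blocking `std::sync` primitives for illustration only — the actual pool is async (tokio's `RwLock`/`Mutex`), and the `SessionPool`/`AcpConnection` fields here are simplified stand-ins, not the real types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Simplified stand-in for the real AcpConnection.
pub struct AcpConnection {
    pub prompts_served: u32,
}

pub struct SessionPool {
    // Each connection lives behind its own Arc<Mutex<_>>.
    connections: RwLock<HashMap<String, Arc<Mutex<AcpConnection>>>>,
}

impl SessionPool {
    pub fn new() -> Self {
        SessionPool { connections: RwLock::new(HashMap::new()) }
    }

    pub fn get_or_create(&self, thread_id: &str) -> Arc<Mutex<AcpConnection>> {
        // Brief write lock: a structural change to the HashMap only.
        let mut map = self.connections.write().unwrap();
        map.entry(thread_id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(AcpConnection { prompts_served: 0 })))
            .clone()
    }

    pub fn with_connection<R>(
        &self,
        thread_id: &str,
        f: impl FnOnce(&mut AcpConnection) -> R,
    ) -> Option<R> {
        // Hold the pool's read lock only long enough to clone the Arc;
        // the temporary guard drops at the end of this statement.
        let conn_arc = self.connections.read().unwrap().get(thread_id).cloned()?;
        // Pool lock is released; now take this one connection's mutex
        // for the duration of the callback. Other connections stay free.
        let mut conn = conn_arc.lock().unwrap();
        Some(f(&mut conn))
    }
}
```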

cleanup_idle is updated to follow the same rule on the cleanup path: snapshot the Arcs under the read lock, release it, then try_lock each connection individually. A connection that's currently in use is by definition not idle, so try_lock lets us skip it without ever awaiting on a per-connection mutex while holding the pool lock. The write lock is only re-acquired if there are stale entries to remove.
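
The snapshot-then-probe shape, again sketched with blocking `std::sync` types for illustration (the real code is async and uses tokio's `try_lock`; `Conn`, `last_used`, and the idle criterion are illustrative):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};
use std::time::{Duration, Instant};

// Illustrative connection type: tracks when it was last used.
pub struct Conn {
    pub last_used: Instant,
}

pub struct Pool {
    connections: RwLock<HashMap<String, Arc<Mutex<Conn>>>>,
}

impl Pool {
    pub fn new() -> Self {
        Pool { connections: RwLock::new(HashMap::new()) }
    }

    pub fn insert(&self, id: &str, conn: Conn) {
        self.connections.write().unwrap()
            .insert(id.to_string(), Arc::new(Mutex::new(conn)));
    }

    pub fn len(&self) -> usize {
        self.connections.read().unwrap().len()
    }

    pub fn cleanup_idle(&self, max_idle: Duration) {
        // 1. Snapshot the Arcs under the read lock, then release it.
        let snapshot: Vec<(String, Arc<Mutex<Conn>>)> = {
            let map = self.connections.read().unwrap();
            map.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
        };
        // 2. Probe each connection without holding any pool lock.
        //    A failed try_lock means the connection is in use => not idle.
        let mut stale = Vec::new();
        for (id, arc) in snapshot {
            if let Ok(conn) = arc.try_lock() {
                if conn.last_used.elapsed() > max_idle {
                    stale.push(id);
                }
            }
        }
        // 3. Re-acquire the write lock only if something must be removed.
        if !stale.is_empty() {
            let mut map = self.connections.write().unwrap();
            for id in stale {
                map.remove(&id);
            }
        }
    }
}
```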

Note: The cleanup_idle shape addresses the P1 review comment that Codex bot left on the original #77 — it pointed out that cleanup_idle grabbing the pool write lock and then awaiting conn_arc.lock() would re-introduce the exact starvation this refactor is meant to eliminate. The snapshot + try_lock pattern fixes that.

The with_connection signature is unchanged, so no call sites need to be updated — the fix is entirely internal to pool.rs. Diff is +51 / -21 in a single file.

Why this approach

A few alternatives we considered:

  1. Leave it as one big RwLock, convert callers to read lock. Doesn't work — callers need &mut AcpConnection to drive session_prompt, and a read lock can't hand out mutable refs.
  2. DashMap<String, AcpConnection>. DashMap's shard locks are synchronous, so we'd still need per-entry async coordination to let a callback hold exclusive access across .await points. Doesn't avoid the underlying need for a per-connection async mutex.
  3. Per-connection Arc<Mutex<_>> (this PR). Minimal change, preserves the existing with_connection API, fixes the root cause (pool lock held during streaming). This matches the architecture discussed in #78 (RFC: Session Management — Design Proposal), §2b.
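
On alternative 1: a read guard only ever derefs to `&T`, so a shared lock cannot hand out the `&mut AcpConnection` the callback needs. A toy demonstration with an `i32` in place of the connection type:

```rust
use std::sync::RwLock;

// Returns the value after one mutation, showing that mutation
// requires the exclusive write guard, not the shared read guard.
fn demo() -> i32 {
    let lock = RwLock::new(5);
    {
        let guard = lock.read().unwrap();
        // *guard += 1;  // does not compile: RwLockReadGuard<T> derefs to &T
        let _value: &i32 = &guard; // shared, read-only access
    }
    // Mutation needs the write guard — which is exactly the lock
    // this refactor avoids holding across long callbacks.
    let mut guard = lock.write().unwrap();
    *guard += 1;
    *guard
}
```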

Scope

This PR is only the locking change. It was previously bundled in #77 alongside notification-loop resilience, an alive-check safety net, and a startup cleanup routine; on reflection that was too much for a single review. Closing #77 and splitting the work into three focused PRs; this is the first.

Next PRs (to follow in order, from separate branches off main):

  1. ✅ This PR — per-connection lock.
  2. Notification loop resilience — end_turn can arrive before the final agent_message_chunk via tokio::select! racing; fix is a small drain window + empty-response fallback. Fixes #76 (notification loop assumes ordered events, bounded prompts, and managed session lifecycle — none hold in production).
  3. Alive check + hard timeout — defensive safety net around notification_loop (30s alive check / 30min hard ceiling).

Supersedes #59 and #77. Closes #58.

Testing

Happy to add a focused test if you'd like — the behavioral change is subtle (a session that's streaming no longer blocks unrelated get_or_create calls), and a concurrent-access test with two tasks would make it visible.
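
A sketch of what that test could look like, in blocking form with std threads standing in for tokio tasks (names and the `u32` connection payload are illustrative; the real test would use tokio::spawn against the actual SessionPool):

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, Mutex, RwLock};
use std::thread;
use std::time::Duration;

type Pool = Arc<RwLock<HashMap<String, Arc<Mutex<u32>>>>>;

// Returns true if get_or_create("B") completed while "A" was streaming.
fn concurrent_access_demo() -> bool {
    let pool: Pool = Arc::new(RwLock::new(HashMap::new()));
    pool.write().unwrap().insert("A".into(), Arc::new(Mutex::new(0)));

    let (started_tx, started_rx) = mpsc::channel();
    let streaming_pool = Arc::clone(&pool);
    let streamer = thread::spawn(move || {
        // Clone the Arc under a brief read lock; the guard drops here.
        let conn = streaming_pool.read().unwrap().get("A").cloned().unwrap();
        let mut guard = conn.lock().unwrap();
        started_tx.send(()).unwrap();
        // Simulate a long stream_prompt turn while holding only A's mutex.
        thread::sleep(Duration::from_millis(100));
        *guard += 1;
    });

    // Wait until A is mid-"stream", then create an unrelated connection.
    started_rx.recv().unwrap();
    pool.write().unwrap().insert("B".into(), Arc::new(Mutex::new(0)));
    // Under the old design (pool write lock held for the whole callback),
    // this insert would have waited out the full 100 ms sleep.
    let created = pool.read().unwrap().contains_key("B");
    streamer.join().unwrap();
    created
}
```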

@ruan330 ruan330 requested a review from thepagent as a code owner April 10, 2026 12:40
`SessionPool::with_connection` currently holds the pool's write lock
for the entire callback duration. Because `stream_prompt` in discord.rs
runs inside that callback and can take many seconds (or minutes) to
drain an ACP turn, every other Discord thread is blocked from touching
the pool while one session streams — even for `get_or_create` on a
completely unrelated thread_id, which only needs the read lock.

The fix: wrap each `AcpConnection` in `Arc<Mutex<_>>`. `with_connection`
now takes only the pool's read lock long enough to clone the Arc, then
locks that specific connection's mutex for the callback. The pool lock
is released immediately, so:

  - Other sessions can still stream concurrently.
  - `get_or_create` on unrelated thread_ids proceeds without waiting.
  - Rebuilds still take the write lock briefly (correct — structural
    change to the HashMap).

`cleanup_idle` uses a snapshot-then-probe pattern so the same rule
holds on the cleanup path: clone the Arcs under the read lock, release
it, then `try_lock` each connection individually. A busy connection
is, by definition, not idle — `try_lock` lets us skip it without ever
awaiting on a per-connection mutex while holding the pool lock. The
write lock is only re-acquired if there are stale entries to remove.
This addresses the P1 review comment left by the Codex bot on the
original openabdev#77, which noted that awaiting `conn_arc.lock()` from inside
a held pool write lock would re-introduce the very starvation this
refactor is meant to eliminate.

This matches the architecture discussed in openabdev#78 §2b and closes openabdev#58
(pool write lock deadlock during long-running notification loops).

Supersedes openabdev#59 and openabdev#77. Scoped to just the locking change so it can
be reviewed in isolation — notification-loop resilience and alive
check will follow as separate PRs.

No call-site changes: the `with_connection` signature is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@thepagent
Collaborator

Closing in favor of #257 which addresses the same pool blocking issue with a targeted fix. Thanks! 🙏

@thepagent thepagent closed this Apr 15, 2026
