
refactor(pool): per-connection Arc<Mutex> to unblock concurrent sessions#183

Closed
ruan330 wants to merge 1 commit into openabdev:main from ruan330:refactor/per-connection-lock

Conversation


@ruan330 ruan330 commented Apr 10, 2026

Problem

SessionPool::with_connection currently holds the pool's write lock for the entire callback duration. Because stream_prompt in discord.rs runs inside that callback and can take many seconds — or minutes — to drain an ACP turn, every other Discord thread is blocked from touching the pool while one session is streaming, even for get_or_create on a completely unrelated thread_id (which only needs the read lock).

In production on our fork (which runs a few dozen concurrent Discord threads against one broker), this manifests as every other thread stalling whenever a single session is mid-stream.

Fix

Wrap each AcpConnection in Arc<Mutex<_>>:

  • with_connection takes the pool's read lock only long enough to clone the Arc for the target thread_id, then releases it.
  • It then locks that specific connection's mutex for the duration of the callback.
  • Other sessions continue to stream concurrently.
  • get_or_create on unrelated thread_ids is no longer blocked.
  • Rebuilds still take the write lock briefly (correct — that's a structural change to the HashMap).
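
A minimal sketch of the new shape. This uses blocking `std::sync` primitives for illustration only — the actual pool is async (tokio's `RwLock`/`Mutex`), and the `SessionPool`/`AcpConnection` fields here are simplified stand-ins, not the real types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Simplified stand-in for the real AcpConnection.
pub struct AcpConnection {
    pub prompts_served: u32,
}

pub struct SessionPool {
    // Each connection lives behind its own Arc<Mutex<_>>.
    connections: RwLock<HashMap<String, Arc<Mutex<AcpConnection>>>>,
}

impl SessionPool {
    pub fn new() -> Self {
        SessionPool { connections: RwLock::new(HashMap::new()) }
    }

    pub fn get_or_create(&self, thread_id: &str) -> Arc<Mutex<AcpConnection>> {
        // Brief write lock: a structural change to the HashMap only.
        let mut map = self.connections.write().unwrap();
        map.entry(thread_id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(AcpConnection { prompts_served: 0 })))
            .clone()
    }

    pub fn with_connection<R>(
        &self,
        thread_id: &str,
        f: impl FnOnce(&mut AcpConnection) -> R,
    ) -> Option<R> {
        // Hold the pool's read lock only long enough to clone the Arc;
        // the temporary guard drops at the end of this statement.
        let conn_arc = self.connections.read().unwrap().get(thread_id).cloned()?;
        // Pool lock is released; now take this one connection's mutex
        // for the duration of the callback. Other connections stay free.
        let mut conn = conn_arc.lock().unwrap();
        Some(f(&mut conn))
    }
}
```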

cleanup_idle is updated to follow the same rule on the cleanup path: snapshot the Arcs under the read lock, release it, then try_lock each connection individually. A connection that's currently in use is by definition not idle, so try_lock lets us skip it without ever awaiting on a per-connection mutex while holding the pool lock. The write lock is only re-acquired if there are stale entries to remove.
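
The snapshot-then-probe shape, again sketched with blocking `std::sync` types for illustration (the real code is async and uses tokio's `try_lock`; `Conn`, `last_used`, and the idle criterion are illustrative):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};
use std::time::{Duration, Instant};

// Illustrative connection type: tracks when it was last used.
pub struct Conn {
    pub last_used: Instant,
}

pub struct Pool {
    connections: RwLock<HashMap<String, Arc<Mutex<Conn>>>>,
}

impl Pool {
    pub fn new() -> Self {
        Pool { connections: RwLock::new(HashMap::new()) }
    }

    pub fn insert(&self, id: &str, conn: Conn) {
        self.connections.write().unwrap()
            .insert(id.to_string(), Arc::new(Mutex::new(conn)));
    }

    pub fn len(&self) -> usize {
        self.connections.read().unwrap().len()
    }

    pub fn cleanup_idle(&self, max_idle: Duration) {
        // 1. Snapshot the Arcs under the read lock, then release it.
        let snapshot: Vec<(String, Arc<Mutex<Conn>>)> = {
            let map = self.connections.read().unwrap();
            map.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
        };
        // 2. Probe each connection without holding any pool lock.
        //    A failed try_lock means the connection is in use => not idle.
        let mut stale = Vec::new();
        for (id, arc) in snapshot {
            if let Ok(conn) = arc.try_lock() {
                if conn.last_used.elapsed() > max_idle {
                    stale.push(id);
                }
            }
        }
        // 3. Re-acquire the write lock only if something must be removed.
        if !stale.is_empty() {
            let mut map = self.connections.write().unwrap();
            for id in stale {
                map.remove(&id);
            }
        }
    }
}
```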

Note: The cleanup_idle shape addresses the P1 review comment that Codex bot left on the original #77 — it pointed out that cleanup_idle grabbing the pool write lock and then awaiting conn_arc.lock() would re-introduce the exact starvation this refactor is meant to eliminate. The snapshot + try_lock pattern fixes that.

The with_connection signature is unchanged, so no call sites need to be updated — the fix is entirely internal to pool.rs. Diff is +51 / -21 in a single file.

Why this approach

A few alternatives we considered:

  1. Leave it as one big RwLock, convert callers to read lock. Doesn't work — callers need &mut AcpConnection to drive session_prompt, and a read lock can't hand out mutable refs.
  2. DashMap<String, AcpConnection>. DashMap's shard locks are synchronous, so we'd still need per-entry async coordination to let a callback hold exclusive access across .await points. Doesn't avoid the underlying need for a per-connection async mutex.
  3. Per-connection Arc<Mutex<_>> (this PR). Minimal change, preserves the existing with_connection API, fixes the root cause (pool lock held during streaming). This matches the architecture discussed in #78 (RFC: Session Management — Design Proposal), §2b.
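
On alternative 1: a read guard only ever derefs to `&T`, so a shared lock cannot hand out the `&mut AcpConnection` the callback needs. A toy demonstration with an `i32` in place of the connection type:

```rust
use std::sync::RwLock;

// Returns the value after one mutation, showing that mutation
// requires the exclusive write guard, not the shared read guard.
fn demo() -> i32 {
    let lock = RwLock::new(5);
    {
        let guard = lock.read().unwrap();
        // *guard += 1;  // does not compile: RwLockReadGuard<T> derefs to &T
        let _value: &i32 = &guard; // shared, read-only access
    }
    // Mutation needs the write guard — which is exactly the lock
    // this refactor avoids holding across long callbacks.
    let mut guard = lock.write().unwrap();
    *guard += 1;
    *guard
}
```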

Scope

This PR is only the locking change. It was previously bundled in #77 alongside notification-loop resilience, an alive-check safety net, and a startup cleanup routine; on reflection that was too much for a single review. Closing #77 and splitting the work into three focused PRs; this is the first.

Next PRs (to follow in order, from separate branches off main):

  1. ✅ This PR — per-connection lock.
  2. Notification loop resilience — end_turn can arrive before the final agent_message_chunk via tokio::select! racing; fix is a small drain window + empty-response fallback. Fixes #76 (notification loop assumes ordered events, bounded prompts, and managed session lifecycle — none hold in production).
  3. Alive check + hard timeout — defensive safety net around notification_loop (30s alive check / 30min hard ceiling).

Supersedes #59 and #77. Closes #58.

Testing

Happy to add a focused test if you'd like — the behavioral change is subtle (a session that's streaming no longer blocks unrelated get_or_create calls), and a concurrent-access test with two tasks would make it visible.
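
A sketch of what that test could look like, in blocking form with std threads standing in for tokio tasks (names and the `u32` connection payload are illustrative; the real test would use tokio::spawn against the actual SessionPool):

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, Mutex, RwLock};
use std::thread;
use std::time::Duration;

type Pool = Arc<RwLock<HashMap<String, Arc<Mutex<u32>>>>>;

// Returns true if get_or_create("B") completed while "A" was streaming.
fn concurrent_access_demo() -> bool {
    let pool: Pool = Arc::new(RwLock::new(HashMap::new()));
    pool.write().unwrap().insert("A".into(), Arc::new(Mutex::new(0)));

    let (started_tx, started_rx) = mpsc::channel();
    let streaming_pool = Arc::clone(&pool);
    let streamer = thread::spawn(move || {
        // Clone the Arc under a brief read lock; the guard drops here.
        let conn = streaming_pool.read().unwrap().get("A").cloned().unwrap();
        let mut guard = conn.lock().unwrap();
        started_tx.send(()).unwrap();
        // Simulate a long stream_prompt turn while holding only A's mutex.
        thread::sleep(Duration::from_millis(100));
        *guard += 1;
    });

    // Wait until A is mid-"stream", then create an unrelated connection.
    started_rx.recv().unwrap();
    pool.write().unwrap().insert("B".into(), Arc::new(Mutex::new(0)));
    // Under the old design (pool write lock held for the whole callback),
    // this insert would have waited out the full 100 ms sleep.
    let created = pool.read().unwrap().contains_key("B");
    streamer.join().unwrap();
    created
}
```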

@ruan330 ruan330 requested a review from thepagent as a code owner April 10, 2026 12:40
`SessionPool::with_connection` currently holds the pool's write lock
for the entire callback duration. Because `stream_prompt` in discord.rs
runs inside that callback and can take many seconds (or minutes) to
drain an ACP turn, every other Discord thread is blocked from touching
the pool while one session streams — even for `get_or_create` on a
completely unrelated thread_id, which only needs the read lock.

The fix: wrap each `AcpConnection` in `Arc<Mutex<_>>`. `with_connection`
now takes only the pool's read lock long enough to clone the Arc, then
locks that specific connection's mutex for the callback. The pool lock
is released immediately, so:

  - Other sessions can still stream concurrently.
  - `get_or_create` on unrelated thread_ids proceeds without waiting.
  - Rebuilds still take the write lock briefly (correct — structural
    change to the HashMap).

`cleanup_idle` uses a snapshot-then-probe pattern so the same rule
holds on the cleanup path: clone the Arcs under the read lock, release
it, then `try_lock` each connection individually. A busy connection
is, by definition, not idle — `try_lock` lets us skip it without ever
awaiting on a per-connection mutex while holding the pool lock. The
write lock is only re-acquired if there are stale entries to remove.
This addresses the P1 review comment left by the Codex bot on the
original openabdev#77, which noted that awaiting `conn_arc.lock()` from inside
a held pool write lock would re-introduce the very starvation this
refactor is meant to eliminate.

This matches the architecture discussed in openabdev#78 §2b and closes openabdev#58
(pool write lock deadlock during long-running notification loops).

Supersedes openabdev#59 and openabdev#77. Scoped to just the locking change so it can
be reviewed in isolation — notification-loop resilience and alive
check will follow as separate PRs.

No call-site changes: the `with_connection` signature is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@thepagent
Collaborator

Closing in favor of #257 which addresses the same pool blocking issue with a targeted fix. Thanks! 🙏

@thepagent thepagent closed this Apr 15, 2026
