Skip to content

fix: avoid blocking unrelated sessions during pool initialization (#256)#257

Open
clsung wants to merge 4 commits intoopenabdev:mainfrom
clsung:bugfix/session_pool
Open

fix: avoid blocking unrelated sessions during pool initialization (#256)#257
clsung wants to merge 4 commits intoopenabdev:mainfrom
clsung:bugfix/session_pool

Conversation

@clsung
Copy link
Copy Markdown
Contributor

@clsung clsung commented Apr 12, 2026

Refactor SessionPool to move expensive AcpConnection initialization outside of the write lock. This ensures that a single slow or stuck session startup does not block access to other active sessions in the pool.

Changes:

  • Wrap AcpConnection in Arc<Mutex<...>> for granular per-session locking.
  • Switch SessionPool to use read locks for connection retrieval.
  • Perform connection spawning and initialization before acquiring the pool's write lock.
  • Use a fast-path read lock to check session health before attempting a rebuild.
  • fixes bug: SessionPool lock scope crosses await and can stall prompt handling #256

Discord Discussion URL: https://discord.com/channels/1491295327620169908/1491365150664560881/1493769615984164975

clsung added 2 commits April 15, 2026 01:02
…enabdev#256)

Refactor SessionPool to move expensive AcpConnection initialization
outside of the write lock. This ensures that a single slow or stuck
session startup does not block access to other active sessions in
the pool.

Changes:
- Wrap AcpConnection in Arc<Mutex<...>> for granular per-session locking.
- Switch SessionPool to use read locks for connection retrieval.
- Perform connection spawning and initialization before acquiring the
  pool's write lock.
- Use a fast-path read lock to check session health before attempting
  a rebuild.
@clsung clsung force-pushed the bugfix/session_pool branch from 32087ea to 019a125 Compare April 15, 2026 00:19
@github-actions github-actions bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 3 days label Apr 15, 2026
@clsung
Copy link
Copy Markdown
Contributor Author

clsung commented Apr 15, 2026

Rebased this branch onto current openabdev/openab main and ported the fix onto the post-#310 PoolState design.

What changed:

  • moved per-session access to Arc<tokio::sync::Mutex<AcpConnection>>
  • dropped the global pool/state lock before long async work in get_or_create()
  • changed with_connection() to clone the per-connection handle under a read lock, then lock only that session
  • kept #310 suspend/resume behavior instead of reverting it
  • kept cleanup non-blocking by probing per-connection locks outside the global state lock
  • fixed two ABA-style races in eviction/cleanup by removing entries only when the sampled Arc handle still matches the current map entry

Validation:

  • cargo test passes on the rebased branch
  • added focused unit tests for the identity-checked removal helper
  • current result: 42 passed; 0 failed

This PR now targets the lock-scope bug from #256 without discarding the merged #310 pool changes.

@thepagent thepagent added the p1 High — address this sprint label Apr 15, 2026
@github-actions github-actions bot removed the closing-soon PR missing Discord Discussion URL — will auto-close in 3 days label Apr 15, 2026
Copy link
Copy Markdown
Contributor

@masami-agent masami-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #257

Thanks for the thorough refactor, @clsung. This addresses a real p1 issue — the current pool holds a write lock across AcpConnection::spawn() + initialize() + session_new(), which are all async and can take seconds. During that time, every other thread is blocked.

Architecture change

Before After
Connection storage HashMap<String, AcpConnection> HashMap<String, Arc<Mutex<AcpConnection>>>
Pool lock during spawn Write lock held across spawn + init + session_new Read lock for snapshot, no lock during spawn, write lock only for final insert
with_connection Write lock on pool state Read lock on pool state + per-connection mutex
Eviction Under write lock, direct access Snapshot + try_lock + remove_if_same_handle
Cleanup Under write lock Snapshot + try_lock + deferred removal

This is a significant improvement. The key insight — moving expensive async work outside the pool lock — is correct and well-executed.

✅ What looks good

  1. remove_if_same_handle — Smart use of Arc::ptr_eq to prevent ABA problems. If another task replaced the connection between snapshot and removal, the stale handle is not removed. Well-tested with two unit tests.

  2. with_connection now uses read lock — This is the biggest win. Previously, a streaming prompt held a write lock on the entire pool. Now it only holds a per-connection mutex, so other threads are unblocked.

  3. cleanup_idle uses try_lock — Busy connections are skipped instead of blocking the cleanup sweep. Correct behavior: a busy session is by definition not idle.

  4. Race condition handling in get_or_create — After spawning outside the lock, re-checks if another task already created a healthy connection for the same thread. Good.

  5. HRTB fix on with_connectionfor<'a> FnOnce(&'a mut AcpConnection) is the correct lifetime bound for the new pattern where the mutex guard (not the pool state) owns the borrow.

🔴 Must fix before merge

1. Eviction can fail silently, leaving pool permanently full

if state.active.len() >= self.max_sessions {
    if let Some((key, expected_conn, _, sid)) = eviction_candidate {
        if remove_if_same_handle(&mut state.active, &key, &expected_conn).is_some() {
            // ...
        }
    }
}

if state.active.len() >= self.max_sessions {
    return Err(anyhow!("pool exhausted ({} sessions)"));
}

If eviction_candidate is None (all other connections were locked via try_lock) or remove_if_same_handle returns None (connection was replaced), the pool stays full and returns an error. This is correct as a safety net, but the original code would always find an oldest entry because it had direct access under the write lock.

In the new design, under high concurrency, try_lock could fail on all connections, making eviction impossible even though idle sessions exist. Consider falling back to lock().await on the oldest candidate if try_lock fails on all entries, or at least logging a warning when eviction fails due to lock contention.

2. saved_session_id from suspended map is read under read lock but removed under write lock later — potential double-resume

let (existing, saved_session_id) = {
    let state = self.state.read().await;
    (
        state.active.get(thread_id).cloned(),
        state.suspended.get(thread_id).cloned(),  // read, not removed
    )
};

The suspended session ID is read (cloned) but not removed from the map. Later, state.suspended.remove(thread_id) happens under the write lock. But between the read and the write, another task for the same thread could also read the same suspended ID and attempt session_load with it. Two connections could try to resume the same session.

This is unlikely (same thread_id racing) but possible if a user sends two messages rapidly in the same thread. Consider either:

  • Accepting this as a known edge case (session_load is idempotent-ish)
  • Or adding a comment documenting the race window

🟡 Non-blocking

3. had_existing flag for session_reset

let had_existing = existing.is_some();
// ...
if !resumed {
    new_conn.session_new(&self.config.working_dir).await?;
    if had_existing || saved_session_id.is_some() {
        new_conn.session_reset = true;
    }
}

The original code only set session_reset when saved_session_id.is_some(). This PR adds had_existing as an additional condition. This is a behavioral change — if a stale connection existed but had no suspended session, session_reset is now set. Is this intentional? If so, a comment explaining why would help.

4. suspend_entry function removed

The suspend_entry helper is removed and its logic is inlined in multiple places. This is fine for now, but if more callers need suspension logic, consider extracting a helper again.

5. Connection drop timing

In the old code, suspend_entry explicitly drops the connection (triggering process group kill) inside the write lock. In the new code, the old Arc<Mutex<AcpConnection>> is dropped when remove_if_same_handle returns it and it goes out of scope. This should be fine since the Arc refcount will reach zero, but worth verifying that no other clone holds a reference that would delay the drop.

📝 Summary

Priority Item Status
🔴 Eviction can fail silently under lock contention Needs fix or documented fallback
🔴 Potential double-resume race on suspended session Needs comment or fix
🟡 had_existing behavioral change Needs clarification
🟡 suspend_entry inlined Non-blocking
🟡 Connection drop timing Non-blocking

The core design is sound and addresses the p1 issue well. The two 🔴 items are edge cases but worth addressing before merge given this is the session pool — the most critical concurrency component in the system.

Looking forward to the next revision!

@clsung
Copy link
Copy Markdown
Contributor Author

clsung commented Apr 15, 2026

Addressed the review follow-up on the latest revision:

  • #1 full-pool eviction under contention: kept the safe pool exhausted fallback, and now log explicit warnings when eviction fails because the sampled candidate changed or because all other sessions were busy during the eviction scan
  • #2 double-resume window: added a per-thread creation gate (creating) so same-thread get_or_create() / resume work is serialized and two rapid requests cannot race into duplicate session/load attempts
  • #3 session_reset behavior: added an inline comment documenting why stale live entries without a resumable session id still set the reset banner

Validation:

  • added a unit test for the per-thread gate helper
  • cargo test passes: 43 passed; 0 failed

Latest commit: 3fc2377

Copy link
Copy Markdown
Contributor

@masami-agent masami-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — all items addressed ✅

@clsung addressed all three review items cleanly. Verified the full diff (not just the items I raised).

🔴 #1 — Eviction under contention

Kept pool exhausted as the safe fallback (correct — better to reject than corrupt state). Added two explicit warning logs:

  • "pool full but eviction candidate changed before removal" — ABA case
  • "pool full but all other sessions were busy during eviction scan" — all try_lock failed

Operators can now diagnose eviction failures from logs. Acceptable approach.

🔴 #2 — Double-resume race

Added a per-thread creation gate (creating: HashMap<String, Arc<Mutex<()>>>). The _create_guard at the top of get_or_create serializes same-thread calls, completely eliminating the race window. The write lock for get_or_insert_gate is brief (just a HashMap insert) and released before any heavy async work. Clean solution.

Unit test get_or_insert_gate_reuses_gate_for_same_thread confirms gate reuse via Arc::ptr_eq.

🟡 #3session_reset behavioral change

Inline comment added explaining the intent — stale live entries without a resumable session ID still warrant the reset banner. Makes sense.

Full diff verification

  • ✅ Only src/acp/pool.rs changed — no Cargo.toml/lock version issues
  • ✅ Lock ordering is correct: write lock for gate insert → release → per-thread gate → read lock for snapshot → no lock during spawn/init → write lock for final insert
  • cleanup_idle and with_connection patterns are consistent with get_or_create
  • ✅ 3 unit tests covering remove_if_same_handle (match + mismatch) and get_or_insert_gate
  • cargo test: 43 passed, 0 failed

Recommend squash merge.

@masami-agent
Copy link
Copy Markdown
Contributor

@clsung Two items to fix before this can merge:

1. CI fail — clippy type_complexity

error: very complex type used
--> src/acp/pool.rs:103
|   let mut eviction_candidate: Option<(String, Arc<Mutex<AcpConnection>>, Instant, Option<String>)> = None;

Simplest fix: add a type alias, e.g.

type EvictionCandidate = (String, Arc<Mutex<AcpConnection>>, Instant, Option<String>);

then use Option<EvictionCandidate>.

2. Version regression — Cargo.toml
PR branch has version = "0.7.4" but main is already at 0.7.5. Please rebase or update to match main.

@clsung
Copy link
Copy Markdown
Contributor Author

clsung commented Apr 15, 2026

Addressed the two latest maintainer follow-ups:

  • fixed the clippy type_complexity complaint in src/acp/pool.rs by introducing an EvictionCandidate type alias
  • synced the crate version to current main (0.7.5) in Cargo.toml and the openab package entry in Cargo.lock

Validation:

  • cargo test passes: 43 passed; 0 failed
  • cargo clippy --all-targets -- -D warnings passes

Latest commit: deafa0e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

p1 High — address this sprint

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: SessionPool lock scope crosses await and can stall prompt handling

4 participants