Description
Summary
SessionPool currently keeps lock scope too broad across async work, which can stall prompt handling.
In practice, this showed up in the Discord flow as:
- one stuck request previously blocking other threads
- after switching to per-connection locking, cross-thread behavior improved
- but same-thread requests can still wait indefinitely if the previous stream never finishes cleanly
Root cause
Two patterns in SessionPool are problematic:
-
get_or_create() performs long async initialization while holding the pool lock
AcpConnection::spawn(...).await
conn.initialize().await
conn.session_new(...).await
-
with_connection() awaits the entire connection closure while holding the pool lock
That means a slow or stuck prompt/session lifecycle can block unrelated work.
Observed behavior
In the Discord handler:
- a request starts normally
- the upstream notification stream may stop producing events or never terminate cleanly
- later requests can stall before entering actual prompt handling
- after moving to per-connection locking, different threads no longer block each other
- however, requests on the same thread/session can still hang indefinitely behind the previous request
Suggested fix
Store connections as per-connection handles, for example:
RwLock<HashMap<String, Arc<tokio::sync::Mutex<AcpConnection>>>>
Then:
- use the pool lock only to look up / insert the handle
- drop the pool lock before any long-running
await
- lock only the specific connection while running per-session work
Also consider adding:
- timeout handling for same-thread prompt contention
- fail-fast "busy" behavior for same-thread concurrent requests
- regression tests for stuck-session scenarios
Steps to Reproduce
A practical reproduction path:
- Send a Discord message that starts a prompt stream
- Cause the upstream notification stream to stop producing events or never terminate cleanly
- Send another message
- Observe later requests stalling or waiting indefinitely, depending on whether they reuse the same session/thread
This makes the bot appear unresponsive under partial failure conditions, especially when an upstream ACP stream stops yielding notifications or never completes cleanly.
Expected Behavior
- pool-level locking should only protect the connection map
- long-running prompt/session work should not hold the global pool lock
- different threads should remain independent
- same-thread contention should not appear as an indefinite hang
Description
Summary
SessionPoolcurrently keeps lock scope too broad across async work, which can stall prompt handling.In practice, this showed up in the Discord flow as:
Root cause
Two patterns in
SessionPoolare problematic:get_or_create()performs long async initialization while holding the pool lockAcpConnection::spawn(...).awaitconn.initialize().awaitconn.session_new(...).awaitwith_connection()awaits the entire connection closure while holding the pool lockThat means a slow or stuck prompt/session lifecycle can block unrelated work.
Observed behavior
In the Discord handler:
Suggested fix
Store connections as per-connection handles, for example:
Then:
awaitAlso consider adding:
Steps to Reproduce
A practical reproduction path:
This makes the bot appear unresponsive under partial failure conditions, especially when an upstream ACP stream stops yielding notifications or never completes cleanly.
Expected Behavior