Skip to content

bug: SessionPool lock scope crosses await and can stall prompt handling #256

@clsung

Description

@clsung

Description

Summary

SessionPool currently keeps lock scope too broad across async work, which can stall prompt handling.

In practice, this showed up in the Discord flow as:

  • one stuck request previously blocking other threads
  • after switching to per-connection locking, cross-thread behavior improved
  • but same-thread requests can still wait indefinitely if the previous stream never finishes cleanly

Root cause

Two patterns in SessionPool are problematic:

  1. get_or_create() performs long async initialization while holding the pool lock

    • AcpConnection::spawn(...).await
    • conn.initialize().await
    • conn.session_new(...).await
  2. with_connection() awaits the entire connection closure while holding the pool lock

That means a slow or stuck prompt/session lifecycle can block unrelated work.

Observed behavior

In the Discord handler:

  • a request starts normally
  • the upstream notification stream may stop producing events or never terminate cleanly
  • later requests can stall before entering actual prompt handling
  • after moving to per-connection locking, different threads no longer block each other
  • however, requests on the same thread/session can still hang indefinitely behind the previous request

Suggested fix

Store connections as per-connection handles, for example:

RwLock<HashMap<String, Arc<tokio::sync::Mutex<AcpConnection>>>>

Then:

  • use the pool lock only to look up / insert the handle
  • drop the pool lock before any long-running await
  • lock only the specific connection while running per-session work

Also consider adding:

  • timeout handling for same-thread prompt contention
  • fail-fast "busy" behavior for same-thread concurrent requests
  • regression tests for stuck-session scenarios

Steps to Reproduce

A practical reproduction path:

  1. Send a Discord message that starts a prompt stream
  2. Cause the upstream notification stream to stop producing events or never terminate cleanly
  3. Send another message
  4. Observe later requests stalling or waiting indefinitely, depending on whether they reuse the same session/thread

This makes the bot appear unresponsive under partial failure conditions, especially when an upstream ACP stream stops yielding notifications or never completes cleanly.

Expected Behavior

  • pool-level locking should only protect the connection map
  • long-running prompt/session work should not hold the global pool lock
  • different threads should remain independent
  • same-thread contention should not appear as an indefinite hang

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingp1High — address this sprintsession

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions