Skip to content

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls#187

Open
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
RichardoMrMu:fix/agentscope-step-concurrency-contextvars
Open

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls#187
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
RichardoMrMu:fix/agentscope-step-concurrency-contextvars

Conversation

@RichardoMrMu
Copy link
Copy Markdown

@RichardoMrMu RichardoMrMu commented May 9, 2026

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls

Summary

Make AgentScope instrumentation safe for concurrent await agent(...) invocations on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons), by moving per-call ReAct step state from the agent instance to a ContextVar and managing the ReAct hooks with a reference-counted single registration.

This is a follow-up to the same-shape fix for the Hermes instrumentation (#184).

Problem

Two production-relevant scenarios trigger silent trace corruption:

  1. TUI / agent UI multi-tab — a single shared ReActAgent instance serves multiple chat tabs concurrently.
  2. asyncio.gather([agent(q1), agent(q2)]) — a service answers multiple questions in parallel using one agent instance.
  3. ASGI singleton — FastAPI / Starlette routes share an agent instance across coroutines.

In all three cases, the previous instrumentation produces wrong telemetry. Three independent bugs combine:

Bug 1: instance-level state slot is overwritten

# old _wrapper.py
state = _ReactStepState(original_context=_get_current_context())
call_self._react_step_state = state          # single slot on the instance
_register_react_hooks(call_self, state, self._handler)

Two concurrent invocations on the same call_self overwrite each other's _react_step_state.active_step. The earlier invocation's open step span loses its only reference and is never closed by stop_react_stepleaked span.

Bug 2: ReAct hooks are registered per-call with uuid-suffixed names

# old _ReactStepState
hook_name: str = field(
    default_factory=lambda: f"{_REACT_STEP_HOOK_PREFIX}_{uuid.uuid4().hex[:8]}"
)

Two concurrent calls register four hooks each (eight total) under different names. AgentScope's hook registry fires every registered callable, so a single _reasoning triggers two pre_reasoning hooks → start_react_step is called twice for one round → round counter inflates → step span tree corrupted.

Bug 3: finally block deletes the wrong invocation's state

# old _wrapper.py finally
if hasattr(call_self, "_react_step_state"):
    del call_self._react_step_state          # tears down the *current* instance value

If the first invocation finishes while the second is still running, this del removes the second invocation's state from under it — its remaining _reasoning / _acting callbacks then no-op silently.

The old module docstring acknowledged this limitation explicitly:

Callers must not overlap await agent(...) on the same instance across coroutines or threads without external serialization.

But that constraint is unrealistic for any production multi-tab UI or async-gather setup — the instrumentation should not require the user to serialize calls themselves.

Fix

1. Move per-call state to a ContextVar

_REACT_STATE: contextvars.ContextVar["_ReactStepState | None"] = (
    contextvars.ContextVar(
        "opentelemetry_agentscope_react_state",
        default=None,
    )
)

The wrapper sets the ContextVar on entry and resets it in finally; each asyncio task / thread observes its own state. All four hooks read state via _REACT_STATE.get() and no-op when the slot is None (meaning the hook fired in a context that did not opt into instrumentation).

2. Reference-counted hook registration under a fixed name

Instead of registering four uuid-suffixed hooks per __call__, the first concurrent invocation registers one set of hooks under the fixed name _REACT_HOOK_NAME; subsequent invocations only bump a refcount. Hooks are removed only when the last outstanding call unwinds.

The refcount is stored in a WeakKeyDictionary keyed on the agent itself rather than id(agent). CPython recycles object addresses, so a fresh agent allocated after another one is GC'd could otherwise be mis-recognised as already-instrumented if a prior acquire had crashed before its paired release. The weak dictionary lets the GC drop the entry automatically.

_REACT_HOOK_REGISTRY_LOCK = threading.Lock()
_REACT_HOOK_REGISTRY: MutableMapping[Any, int] = weakref.WeakKeyDictionary()

def _acquire_react_hooks(agent, handler):
    with _REACT_HOOK_REGISTRY_LOCK:
        count = _REACT_HOOK_REGISTRY.get(agent, 0)
        if count == 0:
            installed = []
            try:
                for hook_type, hook_fn in (
                    ("pre_reasoning",  _make_pre_reasoning_hook(handler)),
                    ("post_reasoning", _make_post_reasoning_hook(handler)),
                    ("pre_acting",     _make_pre_acting_hook()),
                    ("post_acting",    _make_post_acting_hook(handler)),
                ):
                    agent.register_instance_hook(hook_type, _REACT_HOOK_NAME, hook_fn)
                    installed.append(hook_type)
            except Exception:
                # Roll back any partial registration so the agent ends up
                # exactly as we found it; refcount is not bumped, so the
                # caller's matching release becomes a safe no-op.
                for hook_type in installed:
                    try:
                        agent.remove_instance_hook(hook_type, _REACT_HOOK_NAME)
                    except Exception:
                        logger.warning(...)
                raise
        _REACT_HOOK_REGISTRY[agent] = count + 1

def _release_react_hooks(agent):
    with _REACT_HOOK_REGISTRY_LOCK:
        current = _REACT_HOOK_REGISTRY.get(agent, 0)
        if current <= 0:                  # tolerate release-without-acquire
            return
        count = current - 1
        if count <= 0:
            for hook_type in _REACT_HOOK_TYPES:
                try:
                    agent.remove_instance_hook(hook_type, _REACT_HOOK_NAME)
                except (ValueError, KeyError):
                    logger.warning(...)   # hook lost externally; keep cleaning
            del _REACT_HOOK_REGISTRY[agent]
        else:
            _REACT_HOOK_REGISTRY[agent] = count

This guarantees:

  • _reasoning / _acting fire exactly once per logical step (no double-firing across hook registrations);
  • hooks stay alive for as long as any in-flight call still depends on them;
  • after the last call unwinds the agent has zero residual instrumentation hooks;
  • a partial register_instance_hook failure is rolled back atomically (the agent is never left half-instrumented);
  • a release without a paired acquire (e.g. caller's finally after acquire raised) is a safe no-op rather than an underflow;
  • a GC-collected agent does not leave a stale registry entry that could shadow a freshly-allocated one with the same id(...).

3. Wrapper changes

-                        state = _ReactStepState(
-                            original_context=_get_current_context(),
-                        )
-                        call_self._react_step_state = state
-                        _register_react_hooks(call_self, state, self._handler)
+                        state = _ReactStepState(
+                            original_context=_get_current_context(),
+                        )
+                        state_token = _REACT_STATE.set(state)
+                        _acquire_react_hooks(call_self, self._handler)
                         ...
                         finally:
-                            if is_react and state:
-                                _remove_react_hooks(call_self, state)
-                                if hasattr(call_self, "_react_step_state"):
-                                    del call_self._react_step_state
+                            if is_react:
+                                _release_react_hooks(call_self)
+                                if state_token is not None:
+                                    _REACT_STATE.reset(state_token)

Verification

Reproduction (before fix)

demo_concurrent_bug.py simulates two asyncio.gather-ed __call__s on a shared agent. Output on the buggy code:

start_react_step 调用次数 = 3(按 round = [1, 1, 2])
stop_react_step  调用次数 = 3(按 round = [1, 1, 2])

❌ BUG 复现成功(共 2 项异常):
   - start_react_step 次数错乱:期望 4,实际 3
   - stop_react_step 次数错乱:期望 4,实际 3
     (说明有 step span 泄露,永远不会被 close)

Expected for two concurrent invocations × two rounds each = 4 starts / 4 stops.

Verification (after fix)

demo_after_fix.py runs the same scenario and checks 6 invariants:

✅ [1] start_react_step 总数 == 4
✅ [2] stop_react_step  总数 == 4
✅ [3] round=1 出现恰好 2 次(两 tab 各一次)
✅ [4] round=2 出现恰好 2 次
✅ [5] _REACT_HOOK_REGISTRY 不残留本 agent
✅ [6] agent 实例上残留 hook 数 == 0

🎉 修复验证通过:6/6 断言全部成立

Unit tests

$ DASHSCOPE_API_KEY=test_api_key pytest tests/test_react_step_reentrancy.py \
    tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py \
    tests/test_utils.py -v

============================= 11 passed in 0.09s =============================

New regression coverage in tests/test_concurrent_react_step.py:

  • test_state_isolated_across_concurrent_invocationsasyncio.gather two concurrent driver loops on a shared fake agent, assert 4 starts / 4 stops with round=1 firing exactly 2 times.
  • test_hook_registry_releases_after_last_invocation — after both concurrent calls unwind, the agent's hook table is empty and _REACT_HOOK_REGISTRY[id(agent)] is gone.
  • test_hook_registry_keeps_hooks_during_overlapping_calls — when a sibling call is still in flight, releasing the first call's reference must not remove the shared hooks.

test_react_step_reentrancy.py is migrated to drive state through the _REACT_STATE ContextVar via a fixture; a new test_hook_no_ops_when_react_state_not_set test pins the safety-valve contract that hooks fired in a non-instrumented context are silent no-ops.

Compatibility

  • API: no public API change. Module exports (AgentScopeAgentWrapper, AgentScopeChatModelWrapper, etc.) keep their signatures; only the private state plumbing inside _wrapper.py changes.
  • Dependencies: no new runtime dependencies. threading and contextvars are stdlib.
  • Behavior with single-call workloads: identical to before. The ContextVar set/reset cost is negligible compared to span creation, and refcount-acquired hooks fire exactly once just as before.
  • Behavior with stacked proxies (ChatModelBase / AgentBase subclass calling super().__call__): unchanged — covered by the existing test_call_chain_reentrancy.py (still passing).

Follow-up

Same shape as Hermes #184 was applied here. AgentScope's own AgentBase keeps a single _reply_task slot for its interrupt() semantics; that is outside instrumentation scope and not touched in this PR. Concurrent await agent(...) on a shared instance is now safe for telemetry; users still need to consult AgentScope's own concurrency contract if they rely on interrupt().

Related

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 9, 2026

CLA assistant check
All committers have signed the CLA.

@RichardoMrMu RichardoMrMu force-pushed the fix/agentscope-step-concurrency-contextvars branch from 4cc9b3f to 8f69750 Compare May 9, 2026 02:29
…support concurrent same-instance calls

Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations
on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons).

Three independent bugs combined under same-instance concurrency:

1. Per-call state was stored in `call_self._react_step_state` (single instance
   slot). A second concurrent invocation overwrote the first one's
   active_step reference and the first step span was never closed
   (silent span leak).
2. ReAct hooks were registered with a uuid-suffixed name per __call__.
   AgentScope's hook registry fires every registered callable, so each
   _reasoning callback double-fired across concurrent calls and inflated
   the round counter.
3. The finally block did `del call_self._react_step_state`, which
   removed the *other* concurrent invocation's state mid-flight.

Fix:
- Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task
  / thread sees its own slot, hooks read via `_REACT_STATE.get()` and
  no-op when unset.
- Register the four ReAct hooks exactly once per agent under a fixed name,
  guarded by a refcount stored in a `WeakKeyDictionary` keyed on the
  agent itself (not `id(agent)`, which is reused after GC).
- `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of
  the four registrations, the already-installed hooks are rolled back and
  the refcount stays at zero so the agent ends up un-instrumented.
- `_release_react_hooks` tolerates extra calls without an underlying
  acquire (e.g. wrapper's `finally` after a partial acquire failure)
  and emits warning logs when AgentScope's own registry has lost a hook
  out from under us instead of silently leaking.
- Update wrapper to set/reset the ContextVar instead of mutating the
  agent instance.

Verification:
- New regression tests in tests/test_concurrent_react_step.py cover state
  isolation, hook lifetime under overlapping calls, refcount cleanup,
  acquire-rollback on partial registration failure, and release-without-
  acquire safety.
- Existing tests/test_react_step_reentrancy.py migrated to drive state
  through the ContextVar fixture; new test pins the safety-valve no-op
  contract when _REACT_STATE is unset.
- pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py
  tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed.

Companion to the same-shape Hermes fix (alibaba#184).
@RichardoMrMu RichardoMrMu force-pushed the fix/agentscope-step-concurrency-contextvars branch from 8f69750 to 8676571 Compare May 9, 2026 02:52
@ralf0131 ralf0131 requested a review from Copilot May 9, 2026 04:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the AgentScope instrumentation to be safe when the same AgentBase instance is awaited concurrently (e.g., asyncio.gather, multi-tab UI, ASGI singletons) by moving per-call ReAct step state off the agent instance and into a ContextVar, and by switching ReAct hook management to a shared, reference-counted registration.

Changes:

  • Introduce _REACT_STATE (ContextVar) for per-call ReAct step state and update hooks to read state via the ContextVar.
  • Add a per-agent hook refcount registry (WeakKeyDictionary + lock) to ensure hooks are registered once per agent and removed after the last in-flight call.
  • Add/adjust tests to cover concurrency, hook lifecycle, rollback behavior, and no-op safety when state is unset.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
instrumentation-loongsuite/loongsuite-instrumentation-agentscope/src/opentelemetry/instrumentation/agentscope/_wrapper.py Moves ReAct state to ContextVar and adds refcounted hook acquire/release to support concurrent same-instance calls.
instrumentation-loongsuite/loongsuite-instrumentation-agentscope/tests/test_react_step_reentrancy.py Updates tests to publish per-test ReAct state via _REACT_STATE and adds a no-op safety test when state is unset.
instrumentation-loongsuite/loongsuite-instrumentation-agentscope/tests/test_concurrent_react_step.py Adds new regression tests for concurrent invocations and hook registry lifetime/rollback semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

call_self._react_step_state = state
_register_react_hooks(call_self, state, self._handler)
state_token = _REACT_STATE.set(state)
_acquire_react_hooks(call_self, self._handler)
Comment on lines +34 to +39
2. ``_REACT_HOOK_REGISTRY`` is a process-global ``id(agent) -> ref count``
map guarded by ``_REACT_HOOK_REGISTRY_LOCK``. The first concurrent call
on an agent registers four hooks under the fixed name
``_REACT_HOOK_NAME``; subsequent concurrent calls only bump the ref
count. Hooks are removed only after the **last** outstanding call on
that agent unwinds. This is required because AgentScope's hook
Comment on lines +279 to +283
# First call unwinds -> refcount drops to 1, hooks must remain so
# that the second call's pending ``_reasoning`` / ``_acting`` keep firing.
_release_react_hooks(agent)
_REACT_STATE.reset(token_b)
assert _REACT_HOOK_REGISTRY.get(agent) == 1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants