fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls by RichardoMrMu · Pull Request #187 · alibaba/loongsuite-python-agent

RichardoMrMu · 2026-05-09T02:21:21Z

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls

Summary

Make AgentScope instrumentation safe for concurrent await agent(...) invocations on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons), by moving per-call ReAct step state from the agent instance to a ContextVar and managing the ReAct hooks with a reference-counted single registration.

This is a follow-up to the same-shape fix for the Hermes instrumentation (#184).

Problem

Two production-relevant scenarios trigger silent trace corruption:

TUI / agent UI multi-tab — a single shared ReActAgent instance serves multiple chat tabs concurrently.
asyncio.gather([agent(q1), agent(q2)]) — a service answers multiple questions in parallel using one agent instance.
ASGI singleton — FastAPI / Starlette routes share an agent instance across coroutines.

In all three cases, the previous instrumentation produces wrong telemetry. Three independent bugs combine:

Bug 1: instance-level state slot is overwritten

# old _wrapper.py
state = _ReactStepState(original_context=_get_current_context())
call_self._react_step_state = state          # single slot on the instance
_register_react_hooks(call_self, state, self._handler)

Two concurrent invocations on the same call_self overwrite each other's _react_step_state.active_step. The earlier invocation's open step span loses its only reference and is never closed by stop_react_step → leaked span.

Bug 2: ReAct hooks are registered per-call with uuid-suffixed names

# old _ReactStepState
hook_name: str = field(
    default_factory=lambda: f"{_REACT_STEP_HOOK_PREFIX}_{uuid.uuid4().hex[:8]}"
)

Two concurrent calls register four hooks each (eight total) under different names. AgentScope's hook registry fires every registered callable, so a single _reasoning triggers two pre_reasoning hooks → start_react_step is called twice for one round → round counter inflates → step span tree corrupted.

Bug 3: `finally` block deletes the wrong invocation's state

# old _wrapper.py finally
if hasattr(call_self, "_react_step_state"):
    del call_self._react_step_state          # tears down the *current* instance value

If the first invocation finishes while the second is still running, this del removes the second invocation's state from under it — its remaining _reasoning / _acting callbacks then no-op silently.

The old module docstring acknowledged this limitation explicitly:

Callers must not overlap await agent(...) on the same instance across coroutines or threads without external serialization.

But that constraint is unrealistic for any production multi-tab UI or async-gather setup — the instrumentation should not require the user to serialize calls themselves.

Fix

1. Move per-call state to a `ContextVar`

_REACT_STATE: contextvars.ContextVar["_ReactStepState | None"] = (
    contextvars.ContextVar(
        "opentelemetry_agentscope_react_state",
        default=None,
    )
)

The wrapper sets the ContextVar on entry and resets it in finally; each asyncio task / thread observes its own state. All four hooks read state via _REACT_STATE.get() and no-op when the slot is None (meaning the hook fired in a context that did not opt into instrumentation).

2. Reference-counted hook registration under a fixed name

Instead of registering four uuid-suffixed hooks per __call__, the first concurrent invocation registers one set of hooks under the fixed name _REACT_HOOK_NAME; subsequent invocations only bump a refcount. Hooks are removed only when the last outstanding call unwinds.

The refcount is stored in a WeakKeyDictionary keyed on the agent itself rather than id(agent). CPython recycles object addresses, so a fresh agent allocated after another one is GC'd could otherwise be mis-recognised as already-instrumented if a prior acquire had crashed before its paired release. The weak dictionary lets the GC drop the entry automatically.

_REACT_HOOK_REGISTRY_LOCK = threading.Lock()
_REACT_HOOK_REGISTRY: MutableMapping[Any, int] = weakref.WeakKeyDictionary()

def _acquire_react_hooks(agent, handler):
    with _REACT_HOOK_REGISTRY_LOCK:
        count = _REACT_HOOK_REGISTRY.get(agent, 0)
        if count == 0:
            installed = []
            try:
                for hook_type, hook_fn in (
                    ("pre_reasoning",  _make_pre_reasoning_hook(handler)),
                    ("post_reasoning", _make_post_reasoning_hook(handler)),
                    ("pre_acting",     _make_pre_acting_hook()),
                    ("post_acting",    _make_post_acting_hook(handler)),
                ):
                    agent.register_instance_hook(hook_type, _REACT_HOOK_NAME, hook_fn)
                    installed.append(hook_type)
            except Exception:
                # Roll back any partial registration so the agent ends up
                # exactly as we found it; refcount is not bumped, so the
                # caller's matching release becomes a safe no-op.
                for hook_type in installed:
                    try:
                        agent.remove_instance_hook(hook_type, _REACT_HOOK_NAME)
                    except Exception:
                        logger.warning(...)
                raise
        _REACT_HOOK_REGISTRY[agent] = count + 1

def _release_react_hooks(agent):
    with _REACT_HOOK_REGISTRY_LOCK:
        current = _REACT_HOOK_REGISTRY.get(agent, 0)
        if current <= 0:                  # tolerate release-without-acquire
            return
        count = current - 1
        if count <= 0:
            for hook_type in _REACT_HOOK_TYPES:
                try:
                    agent.remove_instance_hook(hook_type, _REACT_HOOK_NAME)
                except (ValueError, KeyError):
                    logger.warning(...)   # hook lost externally; keep cleaning
            del _REACT_HOOK_REGISTRY[agent]
        else:
            _REACT_HOOK_REGISTRY[agent] = count

This guarantees:

_reasoning / _acting fire exactly once per logical step (no double-firing across hook registrations);
hooks stay alive for as long as any in-flight call still depends on them;
after the last call unwinds the agent has zero residual instrumentation hooks;
a partial register_instance_hook failure is rolled back atomically (the agent is never left half-instrumented);
a release without a paired acquire (e.g. caller's finally after acquire raised) is a safe no-op rather than an underflow;
a GC-collected agent does not leave a stale registry entry that could shadow a freshly-allocated one with the same id(...).

3. Wrapper changes

-                        state = _ReactStepState(
-                            original_context=_get_current_context(),
-                        )
-                        call_self._react_step_state = state
-                        _register_react_hooks(call_self, state, self._handler)
+                        state = _ReactStepState(
+                            original_context=_get_current_context(),
+                        )
+                        state_token = _REACT_STATE.set(state)
+                        _acquire_react_hooks(call_self, self._handler)
                         ...
                         finally:
-                            if is_react and state:
-                                _remove_react_hooks(call_self, state)
-                                if hasattr(call_self, "_react_step_state"):
-                                    del call_self._react_step_state
+                            if is_react:
+                                _release_react_hooks(call_self)
+                                if state_token is not None:
+                                    _REACT_STATE.reset(state_token)

Verification

Reproduction (before fix)

demo_concurrent_bug.py simulates two asyncio.gather-ed __call__s on a shared agent. Output on the buggy code:

start_react_step 调用次数 = 3（按 round = [1, 1, 2]）
stop_react_step  调用次数 = 3（按 round = [1, 1, 2]）

❌ BUG 复现成功（共 2 项异常）：
   - start_react_step 次数错乱：期望 4，实际 3
   - stop_react_step 次数错乱：期望 4，实际 3
     （说明有 step span 泄露，永远不会被 close）

Expected for two concurrent invocations × two rounds each = 4 starts / 4 stops.

Verification (after fix)

demo_after_fix.py runs the same scenario and checks 6 invariants:

✅ [1] start_react_step 总数 == 4
✅ [2] stop_react_step  总数 == 4
✅ [3] round=1 出现恰好 2 次（两 tab 各一次）
✅ [4] round=2 出现恰好 2 次
✅ [5] _REACT_HOOK_REGISTRY 不残留本 agent
✅ [6] agent 实例上残留 hook 数 == 0

🎉 修复验证通过：6/6 断言全部成立

Unit tests

$ DASHSCOPE_API_KEY=test_api_key pytest tests/test_react_step_reentrancy.py \
    tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py \
    tests/test_utils.py -v

============================= 11 passed in 0.09s =============================

New regression coverage in tests/test_concurrent_react_step.py:

test_state_isolated_across_concurrent_invocations — asyncio.gather two concurrent driver loops on a shared fake agent, assert 4 starts / 4 stops with round=1 firing exactly 2 times.
test_hook_registry_releases_after_last_invocation — after both concurrent calls unwind, the agent's hook table is empty and _REACT_HOOK_REGISTRY[id(agent)] is gone.
test_hook_registry_keeps_hooks_during_overlapping_calls — when a sibling call is still in flight, releasing the first call's reference must not remove the shared hooks.

test_react_step_reentrancy.py is migrated to drive state through the _REACT_STATE ContextVar via a fixture; a new test_hook_no_ops_when_react_state_not_set test pins the safety-valve contract that hooks fired in a non-instrumented context are silent no-ops.

Compatibility

API: no public API change. Module exports (AgentScopeAgentWrapper, AgentScopeChatModelWrapper, etc.) keep their signatures; only the private state plumbing inside _wrapper.py changes.
Dependencies: no new runtime dependencies. threading and contextvars are stdlib.
Behavior with single-call workloads: identical to before. The ContextVar set/reset cost is negligible compared to span creation, and refcount-acquired hooks fire exactly once just as before.
Behavior with stacked proxies (ChatModelBase / AgentBase subclass calling super().__call__): unchanged — covered by the existing test_call_chain_reentrancy.py (still passing).

Follow-up

Same shape as Hermes #184 was applied here. AgentScope's own AgentBase keeps a single _reply_task slot for its interrupt() semantics; that is outside instrumentation scope and not touched in this PR. Concurrent await agent(...) on a shared instance is now safe for telemetry; users still need to consult AgentScope's own concurrency contract if they rely on interrupt().

…support concurrent same-instance calls Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons). Three independent bugs combined under same-instance concurrency: 1. Per-call state was stored in `call_self._react_step_state` (single instance slot). A second concurrent invocation overwrote the first one's active_step reference and the first step span was never closed (silent span leak). 2. ReAct hooks were registered with a uuid-suffixed name per __call__. AgentScope's hook registry fires every registered callable, so each _reasoning callback double-fired across concurrent calls and inflated the round counter. 3. The finally block did `del call_self._react_step_state`, which removed the *other* concurrent invocation's state mid-flight. Fix: - Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task / thread sees its own slot, hooks read via `_REACT_STATE.get()` and no-op when unset. - Register the four ReAct hooks exactly once per agent under a fixed name, guarded by a refcount stored in a `WeakKeyDictionary` keyed on the agent itself (not `id(agent)`, which is reused after GC). - `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of the four registrations, the already-installed hooks are rolled back and the refcount stays at zero so the agent ends up un-instrumented. - `_release_react_hooks` tolerates extra calls without an underlying acquire (e.g. wrapper's `finally` after a partial acquire failure) and emits warning logs when AgentScope's own registry has lost a hook out from under us instead of silently leaking. - Update wrapper to set/reset the ContextVar instead of mutating the agent instance. Verification: - New regression tests in tests/test_concurrent_react_step.py cover state isolation, hook lifetime under overlapping calls, refcount cleanup, acquire-rollback on partial registration failure, and release-without- acquire safety. - Existing tests/test_react_step_reentrancy.py migrated to drive state through the ContextVar fixture; new test pins the safety-valve no-op contract when _REACT_STATE is unset. - pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed. Companion to the same-shape Hermes fix (alibaba#184).

Copilot

Pull request overview

This PR updates the AgentScope instrumentation to be safe when the same AgentBase instance is awaited concurrently (e.g., asyncio.gather, multi-tab UI, ASGI singletons) by moving per-call ReAct step state off the agent instance and into a ContextVar, and by switching ReAct hook management to a shared, reference-counted registration.

Changes:

Introduce _REACT_STATE (ContextVar) for per-call ReAct step state and update hooks to read state via the ContextVar.
Add a per-agent hook refcount registry (WeakKeyDictionary + lock) to ensure hooks are registered once per agent and removed after the last in-flight call.
Add/adjust tests to cover concurrency, hook lifecycle, rollback behavior, and no-op safety when state is unset.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
instrumentation-loongsuite/loongsuite-instrumentation-agentscope/src/opentelemetry/instrumentation/agentscope/_wrapper.py	Moves ReAct state to `ContextVar` and adds refcounted hook acquire/release to support concurrent same-instance calls.
instrumentation-loongsuite/loongsuite-instrumentation-agentscope/tests/test_react_step_reentrancy.py	Updates tests to publish per-test ReAct state via `_REACT_STATE` and adds a no-op safety test when state is unset.
instrumentation-loongsuite/loongsuite-instrumentation-agentscope/tests/test_concurrent_react_step.py	Adds new regression tests for concurrent invocations and hook registry lifetime/rollback semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

-                        call_self._react_step_state = state
-                        _register_react_hooks(call_self, state, self._handler)
+                        state_token = _REACT_STATE.set(state)
+                        _acquire_react_hooks(call_self, self._handler)


+2. ``_REACT_HOOK_REGISTRY`` is a process-global ``id(agent) -> ref count``
+   map guarded by ``_REACT_HOOK_REGISTRY_LOCK``. The first concurrent call
+   on an agent registers four hooks under the fixed name
+   ``_REACT_HOOK_NAME``; subsequent concurrent calls only bump the ref
+   count. Hooks are removed only after the **last** outstanding call on
+   that agent unwinds. This is required because AgentScope's hook


+    # First call unwinds -> refcount drops to 1, hooks must remain so
+    # that the second call's pending ``_reasoning`` / ``_acting`` keep firing.
+    _release_react_hooks(agent)
+    _REACT_STATE.reset(token_b)
+    assert _REACT_HOOK_REGISTRY.get(agent) == 1


github-actions Bot assigned 123liuziming, Cirilla-zmh and ralf0131 May 9, 2026

github-actions Bot requested review from 123liuziming, Cirilla-zmh and ralf0131 May 9, 2026 02:21

RichardoMrMu force-pushed the fix/agentscope-step-concurrency-contextvars branch from 4cc9b3f to 8f69750 Compare May 9, 2026 02:29

RichardoMrMu force-pushed the fix/agentscope-step-concurrency-contextvars branch from 8f69750 to 8676571 Compare May 9, 2026 02:52

ralf0131 requested a review from Copilot May 9, 2026 04:14

Copilot started reviewing on behalf of ralf0131 May 9, 2026 04:15 View session

Copilot AI reviewed May 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls#187

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls#187
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
RichardoMrMu:fix/agentscope-step-concurrency-contextvars

RichardoMrMu commented May 9, 2026 •

edited

Loading

Uh oh!

CLAassistant commented May 9, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

RichardoMrMu commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls

Summary

Problem

Bug 1: instance-level state slot is overwritten

Bug 2: ReAct hooks are registered per-call with uuid-suffixed names

Bug 3: finally block deletes the wrong invocation's state

Fix

1. Move per-call state to a ContextVar

2. Reference-counted hook registration under a fixed name

3. Wrapper changes

Verification

Reproduction (before fix)

Verification (after fix)

Unit tests

Compatibility

Follow-up

Related

Uh oh!

CLAassistant commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

RichardoMrMu commented May 9, 2026 •

edited

Loading

Bug 3: `finally` block deletes the wrong invocation's state

1. Move per-call state to a `ContextVar`

CLAassistant commented May 9, 2026 •

edited

Loading