fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls#187
Open
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
Conversation
4cc9b3f to
8f69750
Compare
…support concurrent same-instance calls Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons). Three independent bugs combined under same-instance concurrency: 1. Per-call state was stored in `call_self._react_step_state` (single instance slot). A second concurrent invocation overwrote the first one's active_step reference and the first step span was never closed (silent span leak). 2. ReAct hooks were registered with a uuid-suffixed name per __call__. AgentScope's hook registry fires every registered callable, so each _reasoning callback double-fired across concurrent calls and inflated the round counter. 3. The finally block did `del call_self._react_step_state`, which removed the *other* concurrent invocation's state mid-flight. Fix: - Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task / thread sees its own slot, hooks read via `_REACT_STATE.get()` and no-op when unset. - Register the four ReAct hooks exactly once per agent under a fixed name, guarded by a refcount stored in a `WeakKeyDictionary` keyed on the agent itself (not `id(agent)`, which is reused after GC). - `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of the four registrations, the already-installed hooks are rolled back and the refcount stays at zero so the agent ends up un-instrumented. - `_release_react_hooks` tolerates extra calls without an underlying acquire (e.g. wrapper's `finally` after a partial acquire failure) and emits warning logs when AgentScope's own registry has lost a hook out from under us instead of silently leaking. - Update wrapper to set/reset the ContextVar instead of mutating the agent instance. Verification: - New regression tests in tests/test_concurrent_react_step.py cover state isolation, hook lifetime under overlapping calls, refcount cleanup, acquire-rollback on partial registration failure, and release-without- acquire safety. - Existing tests/test_react_step_reentrancy.py migrated to drive state through the ContextVar fixture; new test pins the safety-valve no-op contract when _REACT_STATE is unset. - pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed. Companion to the same-shape Hermes fix (alibaba#184).
8f69750 to
8676571
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates the AgentScope instrumentation to be safe when the same AgentBase instance is awaited concurrently (e.g., asyncio.gather, multi-tab UI, ASGI singletons) by moving per-call ReAct step state off the agent instance and into a ContextVar, and by switching ReAct hook management to a shared, reference-counted registration.
Changes:
- Introduce
_REACT_STATE(ContextVar) for per-call ReAct step state and update hooks to read state via the ContextVar. - Add a per-agent hook refcount registry (
WeakKeyDictionary+ lock) to ensure hooks are registered once per agent and removed after the last in-flight call. - Add/adjust tests to cover concurrency, hook lifecycle, rollback behavior, and no-op safety when state is unset.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| instrumentation-loongsuite/loongsuite-instrumentation-agentscope/src/opentelemetry/instrumentation/agentscope/_wrapper.py | Moves ReAct state to ContextVar and adds refcounted hook acquire/release to support concurrent same-instance calls. |
| instrumentation-loongsuite/loongsuite-instrumentation-agentscope/tests/test_react_step_reentrancy.py | Updates tests to publish per-test ReAct state via _REACT_STATE and adds a no-op safety test when state is unset. |
| instrumentation-loongsuite/loongsuite-instrumentation-agentscope/tests/test_concurrent_react_step.py | Adds new regression tests for concurrent invocations and hook registry lifetime/rollback semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| call_self._react_step_state = state | ||
| _register_react_hooks(call_self, state, self._handler) | ||
| state_token = _REACT_STATE.set(state) | ||
| _acquire_react_hooks(call_self, self._handler) |
Comment on lines
+34
to
+39
| 2. ``_REACT_HOOK_REGISTRY`` is a process-global ``id(agent) -> ref count`` | ||
| map guarded by ``_REACT_HOOK_REGISTRY_LOCK``. The first concurrent call | ||
| on an agent registers four hooks under the fixed name | ||
| ``_REACT_HOOK_NAME``; subsequent concurrent calls only bump the ref | ||
| count. Hooks are removed only after the **last** outstanding call on | ||
| that agent unwinds. This is required because AgentScope's hook |
Comment on lines
+279
to
+283
| # First call unwinds -> refcount drops to 1, hooks must remain so | ||
| # that the second call's pending ``_reasoning`` / ``_acting`` keep firing. | ||
| _release_react_hooks(agent) | ||
| _REACT_STATE.reset(token_b) | ||
| assert _REACT_HOOK_REGISTRY.get(agent) == 1 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls
Summary
Make AgentScope instrumentation safe for concurrent
await agent(...)invocations on the sameAgentBaseinstance (TUI multi-tab,asyncio.gather, ASGI singletons), by moving per-call ReAct step state from the agent instance to aContextVarand managing the ReAct hooks with a reference-counted single registration.This is a follow-up to the same-shape fix for the Hermes instrumentation (#184).
Problem
Two production-relevant scenarios trigger silent trace corruption:
ReActAgentinstance serves multiple chat tabs concurrently.asyncio.gather([agent(q1), agent(q2)])— a service answers multiple questions in parallel using one agent instance.In all three cases, the previous instrumentation produces wrong telemetry. Three independent bugs combine:
Bug 1: instance-level state slot is overwritten
Two concurrent invocations on the same
call_selfoverwrite each other's_react_step_state.active_step. The earlier invocation's open step span loses its only reference and is never closed bystop_react_step→ leaked span.Bug 2: ReAct hooks are registered per-call with uuid-suffixed names
Two concurrent calls register four hooks each (eight total) under different names. AgentScope's hook registry fires every registered callable, so a single
_reasoningtriggers twopre_reasoninghooks →start_react_stepis called twice for one round → round counter inflates → step span tree corrupted.Bug 3:
finallyblock deletes the wrong invocation's stateIf the first invocation finishes while the second is still running, this
delremoves the second invocation's state from under it — its remaining_reasoning/_actingcallbacks then no-op silently.The old module docstring acknowledged this limitation explicitly:
But that constraint is unrealistic for any production multi-tab UI or async-gather setup — the instrumentation should not require the user to serialize calls themselves.
Fix
1. Move per-call state to a
ContextVarThe wrapper sets the ContextVar on entry and resets it in
finally; each asyncio task / thread observes its own state. All four hooks read state via_REACT_STATE.get()and no-op when the slot isNone(meaning the hook fired in a context that did not opt into instrumentation).2. Reference-counted hook registration under a fixed name
Instead of registering four uuid-suffixed hooks per
__call__, the first concurrent invocation registers one set of hooks under the fixed name_REACT_HOOK_NAME; subsequent invocations only bump a refcount. Hooks are removed only when the last outstanding call unwinds.The refcount is stored in a
WeakKeyDictionarykeyed on the agent itself rather thanid(agent). CPython recycles object addresses, so a fresh agent allocated after another one is GC'd could otherwise be mis-recognised as already-instrumented if a prioracquirehad crashed before its pairedrelease. The weak dictionary lets the GC drop the entry automatically.This guarantees:
_reasoning/_actingfire exactly once per logical step (no double-firing across hook registrations);register_instance_hookfailure is rolled back atomically (the agent is never left half-instrumented);releasewithout a pairedacquire(e.g. caller'sfinallyafter acquire raised) is a safe no-op rather than an underflow;id(...).3. Wrapper changes
Verification
Reproduction (before fix)
demo_concurrent_bug.pysimulates twoasyncio.gather-ed__call__s on a shared agent. Output on the buggy code:Expected for two concurrent invocations × two rounds each = 4 starts / 4 stops.
Verification (after fix)
demo_after_fix.pyruns the same scenario and checks 6 invariants:Unit tests
$ DASHSCOPE_API_KEY=test_api_key pytest tests/test_react_step_reentrancy.py \ tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py \ tests/test_utils.py -v ============================= 11 passed in 0.09s =============================New regression coverage in
tests/test_concurrent_react_step.py:test_state_isolated_across_concurrent_invocations—asyncio.gathertwo concurrent driver loops on a shared fake agent, assert 4 starts / 4 stops with round=1 firing exactly 2 times.test_hook_registry_releases_after_last_invocation— after both concurrent calls unwind, the agent's hook table is empty and_REACT_HOOK_REGISTRY[id(agent)]is gone.test_hook_registry_keeps_hooks_during_overlapping_calls— when a sibling call is still in flight, releasing the first call's reference must not remove the shared hooks.test_react_step_reentrancy.pyis migrated to drive state through the_REACT_STATEContextVar via a fixture; a newtest_hook_no_ops_when_react_state_not_settest pins the safety-valve contract that hooks fired in a non-instrumented context are silent no-ops.Compatibility
AgentScopeAgentWrapper,AgentScopeChatModelWrapper, etc.) keep their signatures; only the private state plumbing inside_wrapper.pychanges.threadingandcontextvarsare stdlib.ChatModelBase/AgentBasesubclass callingsuper().__call__): unchanged — covered by the existingtest_call_chain_reentrancy.py(still passing).Follow-up
Same shape as Hermes #184 was applied here. AgentScope's own
AgentBasekeeps a single_reply_taskslot for itsinterrupt()semantics; that is outside instrumentation scope and not touched in this PR. Concurrentawait agent(...)on a shared instance is now safe for telemetry; users still need to consult AgentScope's own concurrency contract if they rely oninterrupt().Related
_wrapper.pymodule docstring updated to document the new concurrency guarantees.