fix(hermes-agent): isolate per-call state via ContextVar to fix concurrent trace corruption#184
Open
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
Open
Conversation
Collaborator
|
@RichardoMrMu Thanks for the contribution! Could you please complete the CLA check before we proceed?Thanks again for contributing! |
RichardoMrMu
pushed a commit
to RichardoMrMu/loongsuite-python-agent
that referenced
this pull request
May 9, 2026
…support concurrent same-instance calls Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons). Three independent bugs combined under same-instance concurrency: 1. Per-call state was stored in `call_self._react_step_state` (single instance slot). A second concurrent invocation overwrote the first one's active_step reference and the first step span was never closed (silent span leak). 2. ReAct hooks were registered with a uuid-suffixed name per __call__. AgentScope's hook registry fires every registered callable, so each _reasoning callback double-fired across concurrent calls and inflated the round counter. 3. The finally block did `del call_self._react_step_state`, which removed the *other* concurrent invocation's state mid-flight. Fix: - Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task / thread sees its own slot, hooks read via `_REACT_STATE.get()` and no-op when unset. - Register the four ReAct hooks exactly once per agent under a fixed name, guarded by a refcount stored in a `WeakKeyDictionary` keyed on the agent itself (not `id(agent)`, which is reused after GC). - `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of the four registrations, the already-installed hooks are rolled back and the refcount stays at zero so the agent ends up un-instrumented. - `_release_react_hooks` tolerates extra calls without an underlying acquire (e.g. wrapper's `finally` after a partial acquire failure) and emits warning logs when AgentScope's own registry has lost a hook out from under us instead of silently leaking. - Update wrapper to set/reset the ContextVar instead of mutating the agent instance. Verification: - New regression tests in tests/test_concurrent_react_step.py cover state isolation, hook lifetime under overlapping calls, refcount cleanup, acquire-rollback on partial registration failure, and release-without- acquire safety. - Existing tests/test_react_step_reentrancy.py migrated to drive state through the ContextVar fixture; new test pins the safety-valve no-op contract when _REACT_STATE is unset. - pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed. Companion to the same-shape Hermes fix (alibaba#184).
RichardoMrMu
added a commit
to RichardoMrMu/loongsuite-python-agent
that referenced
this pull request
May 9, 2026
…support concurrent same-instance calls Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons). Three independent bugs combined under same-instance concurrency: 1. Per-call state was stored in `call_self._react_step_state` (single instance slot). A second concurrent invocation overwrote the first one's active_step reference and the first step span was never closed (silent span leak). 2. ReAct hooks were registered with a uuid-suffixed name per __call__. AgentScope's hook registry fires every registered callable, so each _reasoning callback double-fired across concurrent calls and inflated the round counter. 3. The finally block did `del call_self._react_step_state`, which removed the *other* concurrent invocation's state mid-flight. Fix: - Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task / thread sees its own slot, hooks read via `_REACT_STATE.get()` and no-op when unset. - Register the four ReAct hooks exactly once per agent under a fixed name, guarded by a refcount stored in a `WeakKeyDictionary` keyed on the agent itself (not `id(agent)`, which is reused after GC). - `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of the four registrations, the already-installed hooks are rolled back and the refcount stays at zero so the agent ends up un-instrumented. - `_release_react_hooks` tolerates extra calls without an underlying acquire (e.g. wrapper's `finally` after a partial acquire failure) and emits warning logs when AgentScope's own registry has lost a hook out from under us instead of silently leaking. - Update wrapper to set/reset the ContextVar instead of mutating the agent instance. Verification: - New regression tests in tests/test_concurrent_react_step.py cover state isolation, hook lifetime under overlapping calls, refcount cleanup, acquire-rollback on partial registration failure, and release-without- acquire safety. - Existing tests/test_react_step_reentrancy.py migrated to drive state through the ContextVar fixture; new test pins the safety-valve no-op contract when _REACT_STATE is unset. - pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed. Companion to the same-shape Hermes fix (alibaba#184).
…rrent trace corruption Previously, helpers.state(instance) stored its per-call telemetry state on the agent instance itself (instance._otel_hermes_state). When the same AIAgent instance is shared across concurrent calls (e.g. TUI multi-tab, parallel subagents, or any code driving one AIAgent from multiple coroutines/threads), the concurrent calls share the same dict and overwrite each other's current_step_invocation, pending_step_finish_reason, token counters and last-response metadata. This caused silent trace corruption: - React-step spans leaked (never closed) when a sibling call overwrote current_step_invocation before the original step finished. - Token usage was mixed across concurrent invocations. - last_response_id / last_response_model could be attributed to the wrong invocation. Fix: move per-call state into a ContextVar (_HERMES_STATE). asyncio automatically copies the parent context into each Task, and threading threads also get isolated contexts, so sibling concurrent invocations now have independent state. The state(instance) / clear_state(instance) function signatures are unchanged for backward compatibility (instance is preserved as an unused parameter so no call site needs to change). The same package already uses contextvars correctly for _ACTIVE_TOOL_NAMES in wrappers.py - this change brings state() in line with that pattern. Adds a regression test in test_telemetry_spec.py that runs two concurrent threads on a shared AIAgent and asserts each gets a complete, independent react-step / chat / tool span chain (28 passed).
98a15d0 to
8a80e16
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
opentelemetry.instrumentation.hermes_agent.helpers.state(instance)currently stores its per-call telemetry state on the agent instance itself (instance._otel_hermes_state). When the sameAIAgentinstance is shared across concurrent calls — e.g. a TUI with multiple tabs, parallel subagents, or any code driving oneAIAgentfrom multiple coroutines/threads — the concurrent calls share the same dict and overwrite each other'scurrent_step_invocation,pending_step_finish_reason, token counters and last-response metadata.This causes silent corruption of the produced traces:
current_step_invocationbefore the original step is finished.invoke_agentmay showinput_tokens=0because a sibling cleared the counter).last_response_id/last_response_modelmay be attributed to the wrong invocation.This is a "half-done" concurrency story — the same package already uses
contextvars.ContextVarcorrectly for_ACTIVE_TOOL_NAMESinwrappers.py, butstate()was implemented as a plain instance attribute.Root cause
Because the dict lives on
instance, every concurrent call hits the same single-slot fields.Fix
Move per-call state into a
ContextVar. Function signatures ofstate(instance)andclear_state(instance)are kept intact for backward compatibility (instanceis preserved as an unused parameter so no call site needs to change).asyncio.create_taskautomatically copies the parent context into each task → sibling coroutines (e.g. twotab_handlertasks) get independent state dicts.threading.Threadalso gets isolated contexts → sibling threads also get independent state dicts.Verification
Existing telemetry spec tests (regression-free)
All 27 pre-existing tests pass + 1 new concurrency regression test.
New regression test
Added
test_state_is_isolated_across_concurrent_invocationsintest_telemetry_spec.py. It runs two concurrent threads on a sharedAIAgent, synchronises them with aBarrierafter each thread has opened its first step span (the exact moment the buggy implementation would clobber state), and asserts:len(agent_spans) == 2len(step_spans) == 4← would be 2 before the fix (silent leak)len(llm_spans) == 4len(tool_spans) == 2{2 step + 2 llm + 1 tool}sub-tree (no cross-tree contamination)Compatibility
state(instance)andclear_state(instance)keep their existing signatures.LLMCallWrapper,ToolBatchWrapper, etc.) call these helpers through their public name, so no wrapper code changes are needed.instance._otel_hermes_stateattribute is no longer set, but it was never part of the public API and is not used anywhere else.Follow-up: AgentScope likely has the same issue
opentelemetry.instrumentation.agentscopeuses a similarsetattr(call_self, "_react_step_state", state)pattern. A separate issue + fix is recommended, with the sameContextVarmigration shape. Happy to follow up with another PR.Files changed
instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/src/opentelemetry/instrumentation/hermes_agent/helpers.py_HERMES_STATEContextVar; rewritestate()/clear_state()to use it. Function signatures unchanged.instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/tests/test_telemetry_spec.pytest_state_is_isolated_across_concurrent_invocationsregression test.