fix(hermes-agent): isolate per-call state via ContextVar to fix concurrent trace corruption by RichardoMrMu · Pull Request #184 · alibaba/loongsuite-python-agent

RichardoMrMu · 2026-05-08T09:41:23Z

Summary

opentelemetry.instrumentation.hermes_agent.helpers.state(instance) currently stores its per-call telemetry state on the agent instance itself (instance._otel_hermes_state). When the same AIAgent instance is shared across concurrent calls — e.g. a TUI with multiple tabs, parallel subagents, or any code driving one AIAgent from multiple coroutines/threads — the concurrent calls share the same dict and overwrite each other's current_step_invocation, pending_step_finish_reason, token counters and last-response metadata.

This causes silent corruption of the produced traces:

React-step spans are leaked (never closed) when a sibling call overwrites current_step_invocation before the original step is finished.
Token usage is mixed across concurrent invocations (one invoke_agent may show input_tokens=0 because a sibling cleared the counter).
last_response_id / last_response_model may be attributed to the wrong invocation.

This is a "half-done" concurrency story — the same package already uses contextvars.ContextVar correctly for _ACTIVE_TOOL_NAMES in wrappers.py, but state() was implemented as a plain instance attribute.

Root cause

# helpers.py (before)
def state(instance):
    current = getattr(instance, "_otel_hermes_state", None)
    if current is None:
        current = {
            "current_step_invocation": None,    # single slot, shared
            "active_llm_depth": 0,
            "input_tokens": 0,
            ...
        }
        setattr(instance, "_otel_hermes_state", current)   # shared
    return current

Because the dict lives on instance, every concurrent call hits the same single-slot fields.

Fix

Move per-call state into a ContextVar. Function signatures of state(instance) and clear_state(instance) are kept intact for backward compatibility (instance is preserved as an unused parameter so no call site needs to change).

# helpers.py (after)
_HERMES_STATE: contextvars.ContextVar["dict[str, Any] | None"] = (
    contextvars.ContextVar("opentelemetry_hermes_state", default=None)
)

def state(instance):
    current = _HERMES_STATE.get()
    if current is None:
        current = {...}
        _HERMES_STATE.set(current)
    return current

def clear_state(instance):
    _HERMES_STATE.set(None)

asyncio.create_task automatically copies the parent context into each task → sibling coroutines (e.g. two tab_handler tasks) get independent state dicts.
threading.Thread also gets isolated contexts → sibling threads also get independent state dicts.

Verification

Existing telemetry spec tests (regression-free)

$ pytest instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/tests/test_telemetry_spec.py
============================== 28 passed in 0.14s ==============================

All 27 pre-existing tests pass + 1 new concurrency regression test.

New regression test

Added test_state_is_isolated_across_concurrent_invocations in test_telemetry_spec.py. It runs two concurrent threads on a shared AIAgent, synchronises them with a Barrier after each thread has opened its first step span (the exact moment the buggy implementation would clobber state), and asserts:

len(agent_spans) == 2
len(step_spans) == 4 ← would be 2 before the fix (silent leak)
len(llm_spans) == 4
len(tool_spans) == 2
Each agent span owns exactly its own complete {2 step + 2 llm + 1 tool} sub-tree (no cross-tree contamination)

Compatibility

state(instance) and clear_state(instance) keep their existing signatures.
All wrappers (LLMCallWrapper, ToolBatchWrapper, etc.) call these helpers through their public name, so no wrapper code changes are needed.
The instance._otel_hermes_state attribute is no longer set, but it was never part of the public API and is not used anywhere else.

Follow-up: AgentScope likely has the same issue

opentelemetry.instrumentation.agentscope uses a similar setattr(call_self, "_react_step_state", state) pattern. A separate issue + fix is recommended, with the same ContextVar migration shape. Happy to follow up with another PR.

Files changed

File	Change
`instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/src/opentelemetry/instrumentation/hermes_agent/helpers.py`	Add `_HERMES_STATE` ContextVar; rewrite `state()` / `clear_state()` to use it. Function signatures unchanged.
`instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/tests/test_telemetry_spec.py`	Add `test_state_is_isolated_across_concurrent_invocations` regression test.

CLAassistant · 2026-05-08T09:41:30Z

All committers have signed the CLA.

sipercai · 2026-05-08T10:01:14Z

@RichardoMrMu Thanks for the contribution! Could you please complete the CLA check before we proceed?Thanks again for contributing!

…support concurrent same-instance calls Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons). Three independent bugs combined under same-instance concurrency: 1. Per-call state was stored in `call_self._react_step_state` (single instance slot). A second concurrent invocation overwrote the first one's active_step reference and the first step span was never closed (silent span leak). 2. ReAct hooks were registered with a uuid-suffixed name per __call__. AgentScope's hook registry fires every registered callable, so each _reasoning callback double-fired across concurrent calls and inflated the round counter. 3. The finally block did `del call_self._react_step_state`, which removed the *other* concurrent invocation's state mid-flight. Fix: - Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task / thread sees its own slot, hooks read via `_REACT_STATE.get()` and no-op when unset. - Register the four ReAct hooks exactly once per agent under a fixed name, guarded by a refcount stored in a `WeakKeyDictionary` keyed on the agent itself (not `id(agent)`, which is reused after GC). - `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of the four registrations, the already-installed hooks are rolled back and the refcount stays at zero so the agent ends up un-instrumented. - `_release_react_hooks` tolerates extra calls without an underlying acquire (e.g. wrapper's `finally` after a partial acquire failure) and emits warning logs when AgentScope's own registry has lost a hook out from under us instead of silently leaking. - Update wrapper to set/reset the ContextVar instead of mutating the agent instance. Verification: - New regression tests in tests/test_concurrent_react_step.py cover state isolation, hook lifetime under overlapping calls, refcount cleanup, acquire-rollback on partial registration failure, and release-without- acquire safety. - Existing tests/test_react_step_reentrancy.py migrated to drive state through the ContextVar fixture; new test pins the safety-valve no-op contract when _REACT_STATE is unset. - pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed. Companion to the same-shape Hermes fix (alibaba#184).

…rrent trace corruption Previously, helpers.state(instance) stored its per-call telemetry state on the agent instance itself (instance._otel_hermes_state). When the same AIAgent instance is shared across concurrent calls (e.g. TUI multi-tab, parallel subagents, or any code driving one AIAgent from multiple coroutines/threads), the concurrent calls share the same dict and overwrite each other's current_step_invocation, pending_step_finish_reason, token counters and last-response metadata. This caused silent trace corruption: - React-step spans leaked (never closed) when a sibling call overwrote current_step_invocation before the original step finished. - Token usage was mixed across concurrent invocations. - last_response_id / last_response_model could be attributed to the wrong invocation. Fix: move per-call state into a ContextVar (_HERMES_STATE). asyncio automatically copies the parent context into each Task, and threading threads also get isolated contexts, so sibling concurrent invocations now have independent state. The state(instance) / clear_state(instance) function signatures are unchanged for backward compatibility (instance is preserved as an unused parameter so no call site needs to change). The same package already uses contextvars correctly for _ACTIVE_TOOL_NAMES in wrappers.py - this change brings state() in line with that pattern. Adds a regression test in test_telemetry_spec.py that runs two concurrent threads on a shared AIAgent and asserts each gets a complete, independent react-step / chat / tool span chain (28 passed).

github-actions Bot assigned 123liuziming, Cirilla-zmh and ralf0131 May 8, 2026

github-actions Bot requested review from 123liuziming, Cirilla-zmh and ralf0131 May 8, 2026 09:48

RichardoMrMu mentioned this pull request May 9, 2026

fix(agentscope): isolate per-call ReAct step state via ContextVar to support concurrent same-instance calls #187

Open

RichardoMrMu force-pushed the fix/hermes-step-concurrency-contextvars branch from 98a15d0 to 8a80e16 Compare May 9, 2026 02:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hermes-agent): isolate per-call state via ContextVar to fix concurrent trace corruption#184

fix(hermes-agent): isolate per-call state via ContextVar to fix concurrent trace corruption#184
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
RichardoMrMu:fix/hermes-step-concurrency-contextvars

RichardoMrMu commented May 8, 2026

Uh oh!

CLAassistant commented May 8, 2026 •

edited

Loading

Uh oh!

sipercai commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

RichardoMrMu commented May 8, 2026

Summary

Root cause

Fix

Verification

Existing telemetry spec tests (regression-free)

New regression test

Compatibility

Follow-up: AgentScope likely has the same issue

Files changed

Uh oh!

CLAassistant commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sipercai commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

CLAassistant commented May 8, 2026 •

edited

Loading