Skip to content

fix(hermes-agent): isolate per-call state via ContextVar to fix concurrent trace corruption#184

Open
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
RichardoMrMu:fix/hermes-step-concurrency-contextvars
Open

fix(hermes-agent): isolate per-call state via ContextVar to fix concurrent trace corruption#184
RichardoMrMu wants to merge 1 commit intoalibaba:mainfrom
RichardoMrMu:fix/hermes-step-concurrency-contextvars

Conversation

@RichardoMrMu
Copy link
Copy Markdown

Summary

opentelemetry.instrumentation.hermes_agent.helpers.state(instance) currently stores its per-call telemetry state on the agent instance itself (instance._otel_hermes_state). When the same AIAgent instance is shared across concurrent calls — e.g. a TUI with multiple tabs, parallel subagents, or any code driving one AIAgent from multiple coroutines/threads — the concurrent calls share the same dict and overwrite each other's current_step_invocation, pending_step_finish_reason, token counters and last-response metadata.

This causes silent corruption of the produced traces:

  • React-step spans are leaked (never closed) when a sibling call overwrites current_step_invocation before the original step is finished.
  • Token usage is mixed across concurrent invocations (one invoke_agent may show input_tokens=0 because a sibling cleared the counter).
  • last_response_id / last_response_model may be attributed to the wrong invocation.

This is a "half-done" concurrency story — the same package already uses contextvars.ContextVar correctly for _ACTIVE_TOOL_NAMES in wrappers.py, but state() was implemented as a plain instance attribute.

Root cause

# helpers.py (before)
def state(instance):
    current = getattr(instance, "_otel_hermes_state", None)
    if current is None:
        current = {
            "current_step_invocation": None,    # single slot, shared
            "active_llm_depth": 0,
            "input_tokens": 0,
            ...
        }
        setattr(instance, "_otel_hermes_state", current)   # shared
    return current

Because the dict lives on instance, every concurrent call hits the same single-slot fields.

Fix

Move per-call state into a ContextVar. Function signatures of state(instance) and clear_state(instance) are kept intact for backward compatibility (instance is preserved as an unused parameter so no call site needs to change).

# helpers.py (after)
_HERMES_STATE: contextvars.ContextVar["dict[str, Any] | None"] = (
    contextvars.ContextVar("opentelemetry_hermes_state", default=None)
)

def state(instance):
    current = _HERMES_STATE.get()
    if current is None:
        current = {...}
        _HERMES_STATE.set(current)
    return current

def clear_state(instance):
    _HERMES_STATE.set(None)
  • asyncio.create_task automatically copies the parent context into each task → sibling coroutines (e.g. two tab_handler tasks) get independent state dicts.
  • threading.Thread also gets isolated contexts → sibling threads also get independent state dicts.

Verification

Existing telemetry spec tests (regression-free)

$ pytest instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/tests/test_telemetry_spec.py
============================== 28 passed in 0.14s ==============================

All 27 pre-existing tests pass + 1 new concurrency regression test.

New regression test

Added test_state_is_isolated_across_concurrent_invocations in test_telemetry_spec.py. It runs two concurrent threads on a shared AIAgent, synchronises them with a Barrier after each thread has opened its first step span (the exact moment the buggy implementation would clobber state), and asserts:

  • len(agent_spans) == 2
  • len(step_spans) == 4 ← would be 2 before the fix (silent leak)
  • len(llm_spans) == 4
  • len(tool_spans) == 2
  • Each agent span owns exactly its own complete {2 step + 2 llm + 1 tool} sub-tree (no cross-tree contamination)

Compatibility

  • state(instance) and clear_state(instance) keep their existing signatures.
  • All wrappers (LLMCallWrapper, ToolBatchWrapper, etc.) call these helpers through their public name, so no wrapper code changes are needed.
  • The instance._otel_hermes_state attribute is no longer set, but it was never part of the public API and is not used anywhere else.

Follow-up: AgentScope likely has the same issue

opentelemetry.instrumentation.agentscope uses a similar setattr(call_self, "_react_step_state", state) pattern. A separate issue + fix is recommended, with the same ContextVar migration shape. Happy to follow up with another PR.

Files changed

File Change
instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/src/opentelemetry/instrumentation/hermes_agent/helpers.py Add _HERMES_STATE ContextVar; rewrite state() / clear_state() to use it. Function signatures unchanged.
instrumentation-loongsuite/loongsuite-instrumentation-hermes-agent/tests/test_telemetry_spec.py Add test_state_is_isolated_across_concurrent_invocations regression test.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 8, 2026

CLA assistant check
All committers have signed the CLA.

@sipercai
Copy link
Copy Markdown
Collaborator

sipercai commented May 8, 2026

@RichardoMrMu Thanks for the contribution! Could you please complete the CLA check before we proceed?Thanks again for contributing!

RichardoMrMu pushed a commit to RichardoMrMu/loongsuite-python-agent that referenced this pull request May 9, 2026
…support concurrent same-instance calls

Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations
on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons).

Three independent bugs combined under same-instance concurrency:

1. Per-call state was stored in `call_self._react_step_state` (single instance
   slot). A second concurrent invocation overwrote the first one's
   active_step reference and the first step span was never closed
   (silent span leak).
2. ReAct hooks were registered with a uuid-suffixed name per __call__.
   AgentScope's hook registry fires every registered callable, so each
   _reasoning callback double-fired across concurrent calls and inflated
   the round counter.
3. The finally block did `del call_self._react_step_state`, which
   removed the *other* concurrent invocation's state mid-flight.

Fix:
- Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task
  / thread sees its own slot, hooks read via `_REACT_STATE.get()` and
  no-op when unset.
- Register the four ReAct hooks exactly once per agent under a fixed name,
  guarded by a refcount stored in a `WeakKeyDictionary` keyed on the
  agent itself (not `id(agent)`, which is reused after GC).
- `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of
  the four registrations, the already-installed hooks are rolled back and
  the refcount stays at zero so the agent ends up un-instrumented.
- `_release_react_hooks` tolerates extra calls without an underlying
  acquire (e.g. wrapper's `finally` after a partial acquire failure)
  and emits warning logs when AgentScope's own registry has lost a hook
  out from under us instead of silently leaking.
- Update wrapper to set/reset the ContextVar instead of mutating the
  agent instance.

Verification:
- New regression tests in tests/test_concurrent_react_step.py cover state
  isolation, hook lifetime under overlapping calls, refcount cleanup,
  acquire-rollback on partial registration failure, and release-without-
  acquire safety.
- Existing tests/test_react_step_reentrancy.py migrated to drive state
  through the ContextVar fixture; new test pins the safety-valve no-op
  contract when _REACT_STATE is unset.
- pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py
  tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed.

Companion to the same-shape Hermes fix (alibaba#184).
RichardoMrMu added a commit to RichardoMrMu/loongsuite-python-agent that referenced this pull request May 9, 2026
…support concurrent same-instance calls

Make AgentScope instrumentation safe for concurrent `await agent(...)` invocations
on the same AgentBase instance (TUI multi-tab, asyncio.gather, ASGI singletons).

Three independent bugs combined under same-instance concurrency:

1. Per-call state was stored in `call_self._react_step_state` (single instance
   slot). A second concurrent invocation overwrote the first one's
   active_step reference and the first step span was never closed
   (silent span leak).
2. ReAct hooks were registered with a uuid-suffixed name per __call__.
   AgentScope's hook registry fires every registered callable, so each
   _reasoning callback double-fired across concurrent calls and inflated
   the round counter.
3. The finally block did `del call_self._react_step_state`, which
   removed the *other* concurrent invocation's state mid-flight.

Fix:
- Move per-call state into a ContextVar (`_REACT_STATE`); each asyncio task
  / thread sees its own slot, hooks read via `_REACT_STATE.get()` and
  no-op when unset.
- Register the four ReAct hooks exactly once per agent under a fixed name,
  guarded by a refcount stored in a `WeakKeyDictionary` keyed on the
  agent itself (not `id(agent)`, which is reused after GC).
- `_acquire_react_hooks` is all-or-nothing: if AgentScope rejects any of
  the four registrations, the already-installed hooks are rolled back and
  the refcount stays at zero so the agent ends up un-instrumented.
- `_release_react_hooks` tolerates extra calls without an underlying
  acquire (e.g. wrapper's `finally` after a partial acquire failure)
  and emits warning logs when AgentScope's own registry has lost a hook
  out from under us instead of silently leaking.
- Update wrapper to set/reset the ContextVar instead of mutating the
  agent instance.

Verification:
- New regression tests in tests/test_concurrent_react_step.py cover state
  isolation, hook lifetime under overlapping calls, refcount cleanup,
  acquire-rollback on partial registration failure, and release-without-
  acquire safety.
- Existing tests/test_react_step_reentrancy.py migrated to drive state
  through the ContextVar fixture; new test pins the safety-valve no-op
  contract when _REACT_STATE is unset.
- pytest tests/test_react_step_reentrancy.py tests/test_concurrent_react_step.py
  tests/test_call_chain_reentrancy.py tests/test_utils.py -> 13 passed.

Companion to the same-shape Hermes fix (alibaba#184).
…rrent trace corruption

Previously, helpers.state(instance) stored its per-call telemetry state on
the agent instance itself (instance._otel_hermes_state). When the same
AIAgent instance is shared across concurrent calls (e.g. TUI multi-tab,
parallel subagents, or any code driving one AIAgent from multiple
coroutines/threads), the concurrent calls share the same dict and
overwrite each other's current_step_invocation, pending_step_finish_reason,
token counters and last-response metadata.

This caused silent trace corruption:
- React-step spans leaked (never closed) when a sibling call overwrote
  current_step_invocation before the original step finished.
- Token usage was mixed across concurrent invocations.
- last_response_id / last_response_model could be attributed to the wrong
  invocation.

Fix: move per-call state into a ContextVar (_HERMES_STATE). asyncio
automatically copies the parent context into each Task, and threading
threads also get isolated contexts, so sibling concurrent invocations now
have independent state. The state(instance) / clear_state(instance)
function signatures are unchanged for backward compatibility (instance is
preserved as an unused parameter so no call site needs to change).

The same package already uses contextvars correctly for _ACTIVE_TOOL_NAMES
in wrappers.py - this change brings state() in line with that pattern.

Adds a regression test in test_telemetry_spec.py that runs two concurrent
threads on a shared AIAgent and asserts each gets a complete, independent
react-step / chat / tool span chain (28 passed).
@RichardoMrMu RichardoMrMu force-pushed the fix/hermes-step-concurrency-contextvars branch from 98a15d0 to 8a80e16 Compare May 9, 2026 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants