feat(ingress): add opt-in agent audit header propagation by teochenglim · Pull Request #4554 · restatedev/restate

teochenglim · 2026-04-03T05:37:09Z

Agent Audit — Design Doc

Date: 2026-04-03
Status: Ready for implementation

Problem

When Restate is used to orchestrate multi-agent AI workflows, each agent invocation needs to carry enough identity context to answer:

Who triggered this entire chain? (triggered_by)
Which human session did it originate from? (conversation_id)
Which exact agent instance ran this step? (agent_id)
Which workflow execution does this belong to? (workflow_id)
What step within the workflow is this? (workflow_step)

Today, none of these are first-class concepts in Restate. Users must roll their own ad-hoc solutions.

Audit Trace Model

trace_id: abc123
    └── agent_id: "agent_review_01"
    └── agent_type: "review_agent"
    └── workflow_id: "wf_tender_review_007"
    └── workflow_step: "step_3_validate"
    └── parent_trace_id: xyz789
            └── agent_id: "agent_orchestrator_01"
            └── workflow_id: "wf_tender_review_007"
            └── parent_trace_id: None
                    └── triggered_by: "user@gov.sg"
                    └── conversation_id: "sess_001"

Field-to-Restate Mapping

Field	Source in Restate	Needs header?
`trace_id`	OTel `ServiceInvocationSpanContext`	No — already propagated
`parent_trace_id`	OTel span cause	No — already propagated
`agent_id`	`ctx.key()` (object key)	No — already available
`agent_type`	`invocation_target.service_name()`	No — use service name
`agent_version`	Deployment pinned at invocation time	No — already tracked
`workflow_id`	`ctx.invocation_id()`	No — already available
`workflow_def`	`invocation_target` (name + handler)	No — already available
`workflow_step`	`invocation_target.handler_name()`	No — already available
`triggered_by`	User-supplied, must be propagated	Yes
`conversation_id`	User-supplied, must be propagated	Yes

Only triggered_by and conversation_id need explicit header propagation — everything else is already derivable from Restate's existing invocation context.
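To make the mapping concrete, here is a minimal sketch of how a handler might assemble a flat audit record from values it can already read, plus the two propagated headers. `InvocationInfo`, `AuditRecord` and `build_audit_record` are hypothetical names invented for illustration; they are not types or functions in the Restate codebase or any SDK.

```rust
use std::collections::HashMap;

pub const TRIGGERED_BY: &str = "x-restate-audit-triggered-by";
pub const CONVERSATION_ID: &str = "x-restate-audit-conversation-id";

/// Hypothetical stand-in for values a handler can already read from its
/// invocation context (object key, invocation target, invocation id).
pub struct InvocationInfo<'a> {
    pub object_key: &'a str,
    pub service_name: &'a str,
    pub handler_name: &'a str,
    pub invocation_id: &'a str,
}

/// Hypothetical flattened audit record; not a type in the Restate codebase.
#[derive(Debug, Clone, PartialEq)]
pub struct AuditRecord {
    pub agent_id: String,
    pub agent_type: String,
    pub workflow_id: String,
    pub workflow_step: String,
    pub triggered_by: Option<String>,
    pub conversation_id: Option<String>,
}

/// Derive every field from the invocation context except the two that
/// only exist as propagated headers.
pub fn build_audit_record(
    info: &InvocationInfo<'_>,
    headers: &HashMap<String, String>,
) -> AuditRecord {
    AuditRecord {
        agent_id: info.object_key.to_string(),
        agent_type: info.service_name.to_string(),
        workflow_id: info.invocation_id.to_string(),
        workflow_step: info.handler_name.to_string(),
        // Absent headers stay `None` instead of being replaced by a default.
        triggered_by: headers.get(TRIGGERED_BY).cloned(),
        conversation_id: headers.get(CONVERSATION_ID).cloned(),
    }
}
```

Keeping the two header-backed fields as `Option` reflects that they are only present when the ingress option is enabled and every hop re-attached them.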

Chosen Approach: Well-Known Headers (opt-in, disabled by default)

Design Principles

Opt-in at ingress. When disabled (default), x-restate-audit-* headers are stripped at ingress so no untrusted client can inject fake audit context. When enabled, they pass through to handlers.
SDK-side propagation discipline. Restate does not auto-forward these headers on service-to-service calls. The calling service/SDK is responsible for re-attaching them on each outbound call — the same model as W3C traceparent.
Minimal blast radius. No state machine changes, no new storage, no new wire formats.
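The propagation discipline described above can be sketched as a small helper that a service calls before each outbound call. This is an illustrative sketch only: `audit_headers_to_forward` is a hypothetical function, and headers are modelled as plain `(String, String)` pairs rather than any real SDK type.

```rust
pub const TRIGGERED_BY: &str = "x-restate-audit-triggered-by";
pub const CONVERSATION_ID: &str = "x-restate-audit-conversation-id";

/// Collect the audit headers present on the incoming request so they can be
/// re-attached to an outbound call. Headers that are missing are skipped,
/// and all non-audit headers are left behind.
pub fn audit_headers_to_forward(incoming: &[(String, String)]) -> Vec<(String, String)> {
    incoming
        .iter()
        .filter(|(name, _)| name.as_str() == TRIGGERED_BY || name.as_str() == CONVERSATION_ID)
        .cloned()
        .collect()
}
```

Matching on the two exact names, rather than the whole `x-restate-audit-` prefix, is a choice made for this sketch; a prefix match would also forward any audit headers added in the future.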

Header Constants

Defined in restate_types::invocation::audit:

/// The human principal that originally triggered this call chain.
/// Value: opaque string, e.g. "user@gov.sg"
/// Propagation: caller must re-attach on every outbound call.
pub const TRIGGERED_BY: &str = "x-restate-audit-triggered-by";

/// The human session/conversation that originated this call chain.
/// Value: opaque string, e.g. "sess_001"
/// Propagation: caller must re-attach on every outbound call.
pub const CONVERSATION_ID: &str = "x-restate-audit-conversation-id";

Config

In IngressOptions (crates/types/src/config/ingress.rs):

ingress:
  agent-audit: false   # default — strips x-restate-audit-* at ingress

File Changes

File	Change
`crates/types/src/invocation/audit.rs`	NEW — header constants + doc
`crates/types/src/invocation/mod.rs`	Add `pub mod audit;`
`crates/types/src/config/ingress.rs`	Add `agent_audit: bool` (default `false`)
`crates/ingress-http/src/handler/mod.rs`	Add `agent_audit: bool` to `Handler` struct
`crates/ingress-http/src/server.rs`	Thread `agent_audit` from `IngressOptions` → `HyperServerIngress` → `Handler`
`crates/ingress-http/src/handler/service_handler.rs`	Strip `x-restate-audit-*` in `parse_headers()` when disabled

Header Stripping in `parse_headers()`

// When agent_audit is disabled, strip audit headers to prevent injection
if !agent_audit && k.as_str().starts_with("x-restate-audit-") {
    continue;
}
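For readers who want to try the stripping rule in isolation, the following is a self-contained sketch of the same check applied to a list of headers. It is not the actual `parse_headers()` implementation: `filter_audit_headers` is a hypothetical function, and headers are plain string pairs, so the sketch lower-cases names itself instead of relying on an HTTP library to normalise them.

```rust
/// Prefix shared by all audit headers.
const AUDIT_HEADER_PREFIX: &str = "x-restate-audit-";

/// Return the headers that should reach the handler.
/// With `agent_audit == false`, every header whose name starts with the
/// audit prefix (case-insensitively) is dropped; all others pass through.
pub fn filter_audit_headers(
    headers: Vec<(String, String)>,
    agent_audit: bool,
) -> Vec<(String, String)> {
    headers
        .into_iter()
        .filter(|(name, _)| {
            agent_audit || !name.to_ascii_lowercase().starts_with(AUDIT_HEADER_PREFIX)
        })
        .collect()
}
```

With the flag off, a client-supplied `X-Restate-Audit-Triggered-By` header never reaches the handler; with the flag on, the input is returned unchanged.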

Usage Pattern (Python SDK)

AUDIT_TRIGGERED_BY = "x-restate-audit-triggered-by"
AUDIT_CONVERSATION_ID = "x-restate-audit-conversation-id"
AUDIT_HEADERS = (AUDIT_TRIGGERED_BY, AUDIT_CONVERSATION_ID)

@restate.handler()
async def review_document(ctx: Context, req: AgentRequest):
    # Re-attach the incoming audit headers on the outbound call so the
    # child agent sees the same triggered_by / conversation_id values.
    # Headers absent from the request are skipped rather than sent as None.
    audit_headers = {
        name: req.headers[name] for name in AUDIT_HEADERS if name in req.headers
    }
    await ctx.service_call(
        validator_agent.validate,
        arg=ValidateRequest(payload=req.payload),
        headers=audit_headers,
    )

Both header names are fixed, well-known strings, so an SDK can expose them as constants for handlers to reference.

What Is Not In This PR

The following were considered and explicitly deferred:

Emitting audit events to a log/table — out of scope; users can do this in their handler with ctx.run()
Validating header values at ingress (e.g. non-empty) — deferred, not needed for v1
Exposing audit context helpers in the SDK — SDK concern, follows this PR

Alternative Approaches (PR Comments)

Alt 1: Server-side auto-propagation

What: When Agent A calls Agent B, the Restate server automatically copies x-restate-audit-* headers from the caller's ServiceInvocation.headers into the callee's ServiceInvocation.headers.

Where: crates/worker/src/partition/state_machine/entries/call_commands.rs, in _ApplyCallCommand::apply(), after the CallRequest is destructured — merge any x-restate-audit-* headers from caller_invocation_metadata into the outgoing ServiceInvocation.headers.

Trade-offs:

Pro: No SDK discipline required — headers propagate automatically through every hop
Pro: Impossible to accidentally drop the audit context mid-chain
Con: Requires reading caller invocation metadata during call command processing (already available via caller_invocation_status)
Con: Caller cannot override/clear the headers for a specific child call
Con: State machine change — higher risk surface than header constants alone
Con: Requires storing headers on InvocationMetadata (currently only on ServiceInvocation), or a separate lookup

Verdict: Correct long-term direction for a fully-managed audit trail, but too much scope for a minimum PR. Revisit after the chosen well-known-headers approach (Option A) is validated.
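As a rough illustration of the merge step that Alt 1 describes, the sketch below copies the caller's audit headers onto an outgoing call. It is not code from the partition state machine: `merge_audit_headers` is a hypothetical function, headers are plain string pairs, and the rule that inherited values always win is an assumption chosen to match the "caller cannot override" trade-off listed above.

```rust
const AUDIT_HEADER_PREFIX: &str = "x-restate-audit-";

/// Merge the caller's audit headers into the headers of an outgoing call.
/// Audit headers set explicitly on the outgoing call are discarded first,
/// so the values inherited from the caller's own invocation always win.
/// Non-audit headers of the caller are never copied.
pub fn merge_audit_headers(
    caller: &[(String, String)],
    outgoing: Vec<(String, String)>,
) -> Vec<(String, String)> {
    let mut merged: Vec<(String, String)> = outgoing
        .into_iter()
        .filter(|(name, _)| !name.starts_with(AUDIT_HEADER_PREFIX))
        .collect();
    merged.extend(
        caller
            .iter()
            .filter(|(name, _)| name.starts_with(AUDIT_HEADER_PREFIX))
            .cloned(),
    );
    merged
}
```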

Alt 2: Audit headers as first-class fields on `ServiceInvocation`

What: Instead of using Vec<Header> as the carrier, add audit_ctx: Option<AuditContext> directly to ServiceInvocation and CallRequest.

pub struct AuditContext {
    pub triggered_by: ByteString,
    pub conversation_id: ByteString,
}

Where: crates/types/src/invocation/mod.rs (ServiceInvocation) and crates/types/src/journal_v2/command.rs (CallRequest).

Trade-offs:

Pro: Type-safe — no stringly-typed header names at the call site
Pro: Cannot be accidentally filtered or mangled by header processing logic
Pro: Visible in admin API / storage queries as a typed field
Con: Protocol change — CallRequest is part of the service protocol v4 Bilrost encoding; adding a field requires a protocol version bump
Con: Much larger blast radius: storage schema, wire format, admin REST model, partition store, WAL all need updating
Con: Overkill for what is essentially optional metadata that not all users need

Verdict: The right design if audit becomes a core Restate primitive (like idempotency key is today). Premature for the initial feature.
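If Alt 2 were pursued, ingress would presumably need to turn the incoming headers into the typed context. The sketch below shows one way that conversion could look. It rests on several assumptions: `String` stands in for `ByteString`, `from_headers` is a hypothetical constructor, and the rule that a context is only produced when both headers are present is a choice made for this example, not something the proposal specifies.

```rust
pub const TRIGGERED_BY: &str = "x-restate-audit-triggered-by";
pub const CONVERSATION_ID: &str = "x-restate-audit-conversation-id";

/// Mirrors the struct proposed above, with `String` standing in for `ByteString`.
#[derive(Debug, Clone, PartialEq)]
pub struct AuditContext {
    pub triggered_by: String,
    pub conversation_id: String,
}

impl AuditContext {
    /// Build a typed context from raw headers.
    /// Assumption: returns `None` unless both audit headers are present.
    pub fn from_headers(headers: &[(String, String)]) -> Option<Self> {
        // Look up the first header with exactly the given name.
        let find = |wanted: &str| {
            headers
                .iter()
                .find(|(name, _)| name.as_str() == wanted)
                .map(|(_, value)| value.clone())
        };
        Some(AuditContext {
            triggered_by: find(TRIGGERED_BY)?,
            conversation_id: find(CONVERSATION_ID)?,
        })
    }
}
```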

Alt 3: OTel span attributes instead of headers

What: Store triggered_by and conversation_id as OpenTelemetry span attributes on the ServiceInvocationSpanContext rather than as headers.

Where: Extend SpanContextDef or add a bag to ServiceInvocationSpanContext in crates/types/src/invocation/mod.rs; emit attributes via invocation_span! macro in crates/tracing-instrumentation.

Trade-offs:

Pro: Audit context automatically appears in every OTel span/trace — directly queryable in Jaeger, Grafana Tempo, etc.
Pro: No need for SDK propagation discipline — OTel baggage handles it
Con: OTel baggage propagation is not currently wired through Restate's internal span context
Con: Mixes audit identity (who triggered it) with observability concerns (how to trace it) — different audiences, different retention policies
Con: Requires changes to the tracing layer and the span context serialisation format
Con: Not accessible in handler code without going through OTel APIs

Verdict: Useful as a complementary feature (emit audit fields as span attributes when audit is enabled), not a replacement. Could be layered on top of the chosen well-known-headers approach (Option A) later.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

github-actions · 2026-04-03T05:37:18Z

All contributors have signed the CLA ✍️ ✅
(Posted by the CLA Assistant Lite bot.)

teochenglim · 2026-04-03T05:38:28Z

I have read the CLA Document and I hereby sign the CLA

tillrohrmann · 2026-04-07T07:40:21Z

Thanks a lot for creating this PR, @teochenglim. We will probably need a little time to review your contribution properly, as the team is quite busy these days.

@slinkydeveloper and @gvdongen, for your visibility, since you were looking into tracing and how to integrate Restate with AI observability tools before.

feat(ingress): add opt-in agent audit header propagation

73a646b

fix restate-vqueues ci failed

8235bb9

teochenglim wants to merge 2 commits into restatedev:main from teochenglim:main
