feat(ingress): add opt-in agent audit header propagation #4554

Open
teochenglim wants to merge 2 commits into restatedev:main from teochenglim:main

@teochenglim

Agent Audit — Design Doc

Date: 2026-04-03
Status: Ready for implementation


Problem

When Restate is used to orchestrate multi-agent AI workflows, each agent invocation needs to carry enough identity context to answer:

  • Who triggered this entire chain? (triggered_by)
  • Which human session did it originate from? (conversation_id)
  • Which exact agent instance ran this step? (agent_id)
  • Which workflow execution does this belong to? (workflow_id)
  • What step within the workflow is this? (workflow_step)

Today, none of these are first-class concepts in Restate. Users must roll their own ad-hoc solutions.


Audit Trace Model

trace_id: abc123
    └── agent_id: "agent_review_01"
    └── agent_type: "review_agent"
    └── workflow_id: "wf_tender_review_007"
    └── workflow_step: "step_3_validate"
    └── parent_trace_id: xyz789
            └── agent_id: "agent_orchestrator_01"
            └── workflow_id: "wf_tender_review_007"
            └── parent_trace_id: None
                    └── triggered_by: "user@gov.sg"
                    └── conversation_id: "sess_001"

Field-to-Restate Mapping

| Field | Source in Restate | Needs header? |
|---|---|---|
| trace_id | OTel ServiceInvocationSpanContext | No (already propagated) |
| parent_trace_id | OTel span cause | No (already propagated) |
| agent_id | ctx.key() (object key) | No (already available) |
| agent_type | invocation_target.service_name() | No (use service name) |
| agent_version | Deployment pinned at invocation time | No (already tracked) |
| workflow_id | ctx.invocation_id() | No (already available) |
| workflow_def | invocation_target (name + handler) | No (already available) |
| workflow_step | invocation_target.handler_name() | No (already available) |
| triggered_by | User-supplied, must be propagated | Yes |
| conversation_id | User-supplied, must be propagated | Yes |

Only triggered_by and conversation_id need explicit header propagation — everything else is already derivable from Restate's existing invocation context.
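
To make the mapping concrete, here is a minimal sketch of how a handler might assemble an audit record from values it already has plus the two propagated headers. `AuditRecord` and `build_audit_record` are hypothetical names used only for illustration; they are not part of Restate or its SDKs, and the sketch assumes header names arrive lowercased.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Header names as defined in this design.
TRIGGERED_BY = "x-restate-audit-triggered-by"
CONVERSATION_ID = "x-restate-audit-conversation-id"


@dataclass(frozen=True)
class AuditRecord:  # hypothetical type, for illustration only
    agent_id: str                   # derived from the object key
    agent_type: str                 # derived from the service name
    workflow_id: str                # derived from the invocation id
    workflow_step: str              # derived from the handler name
    triggered_by: Optional[str]     # header value, None if not propagated
    conversation_id: Optional[str]  # header value, None if not propagated


def build_audit_record(
    service_name: str,
    handler_name: str,
    object_key: str,
    invocation_id: str,
    headers: Mapping[str, str],
) -> AuditRecord:
    """Combine invocation context with the two user-supplied audit headers."""
    return AuditRecord(
        agent_id=object_key,
        agent_type=service_name,
        workflow_id=invocation_id,
        workflow_step=handler_name,
        triggered_by=headers.get(TRIGGERED_BY),
        conversation_id=headers.get(CONVERSATION_ID),
    )
```

Only the last two fields depend on header propagation; the rest come from context the handler already receives.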


Chosen Approach: Well-Known Headers (opt-in, disabled by default)

Design Principles

  • Opt-in at ingress. When disabled (default), x-restate-audit-* headers are stripped at ingress so no untrusted client can inject fake audit context. When enabled, they pass through to handlers.
  • SDK-side propagation discipline. Restate does not auto-forward these headers on service-to-service calls. The calling service/SDK is responsible for re-attaching them on each outbound call — the same model as W3C traceparent.
  • Minimal blast radius. No state machine changes, no new storage, no new wire formats.

Header Constants

Defined in restate_types::invocation::audit:

/// The human principal that originally triggered this call chain.
/// Value: opaque string, e.g. "user@gov.sg"
/// Propagation: caller must re-attach on every outbound call.
pub const TRIGGERED_BY: &str = "x-restate-audit-triggered-by";

/// The human session/conversation that originated this call chain.
/// Value: opaque string, e.g. "sess_001"
/// Propagation: caller must re-attach on every outbound call.
pub const CONVERSATION_ID: &str = "x-restate-audit-conversation-id";

Config

In IngressOptions (crates/types/src/config/ingress.rs):

[ingress]
agent-audit = false   # default: strips x-restate-audit-* at ingress

File Changes

| File | Change |
|---|---|
| crates/types/src/invocation/audit.rs | NEW: header constants + doc |
| crates/types/src/invocation/mod.rs | Add pub mod audit; |
| crates/types/src/config/ingress.rs | Add agent_audit: bool (default false) |
| crates/ingress-http/src/handler/mod.rs | Add agent_audit: bool to Handler struct |
| crates/ingress-http/src/server.rs | Thread agent_audit from IngressOptions → HyperServerIngress → Handler |
| crates/ingress-http/src/handler/service_handler.rs | Strip x-restate-audit-* in parse_headers() when disabled |

Header Stripping in parse_headers()

// When agent_audit is disabled, strip audit headers to prevent injection
if !agent_audit && k.as_str().starts_with("x-restate-audit-") {
    continue;
}
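
The stripping rule can be modelled outside the server as a small pure function. This is only an illustrative sketch of the intended behaviour, not the actual `parse_headers()` implementation; `filter_ingress_headers` is a hypothetical name, and the sketch lowercases names itself rather than assuming the HTTP layer has already done so.

```python
from typing import Iterable, List, Tuple

AUDIT_PREFIX = "x-restate-audit-"


def filter_ingress_headers(
    headers: Iterable[Tuple[str, str]],
    agent_audit: bool = False,
) -> List[Tuple[str, str]]:
    """Return the headers that would be passed on to the handler.

    With agent_audit disabled (the default), every header whose name starts
    with the audit prefix is dropped so clients cannot inject audit context.
    """
    kept = []
    for name, value in headers:
        if not agent_audit and name.lower().startswith(AUDIT_PREFIX):
            continue  # strip untrusted audit header
        kept.append((name, value))
    return kept
```

With the flag off, a request carrying `x-restate-audit-triggered-by` reaches the handler without it; with the flag on, the header list is passed through unchanged.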

Usage Pattern (Python SDK)

AUDIT_TRIGGERED_BY = "x-restate-audit-triggered-by"
AUDIT_CONVERSATION_ID = "x-restate-audit-conversation-id"

@restate.handler()
async def review_document(ctx: Context, req: AgentRequest):
    # Re-attach the audit headers received on this invocation to every
    # outbound call; Restate does not forward them automatically.
    # Assumes req.headers exposes the incoming headers as a dict.
    audit_headers = {
        name: req.headers[name]
        for name in (AUDIT_TRIGGERED_BY, AUDIT_CONVERSATION_ID)
        if name in req.headers  # skip absent headers instead of sending None
    }
    await ctx.service_call(
        validator_agent.validate,
        arg=ValidateRequest(payload=req.payload),
        headers=audit_headers,
    )

SDKs can expose both header names as well-known string constants for handlers to reference.
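
Because every hop must repeat the same re-attachment, it may help to factor it into a helper. The following is a sketch under assumptions: `outbound_audit_headers` is a hypothetical function, not an SDK API, and it expects the incoming headers as a plain mapping with lowercased names.

```python
from typing import Dict, Mapping, Optional

AUDIT_HEADERS = (
    "x-restate-audit-triggered-by",
    "x-restate-audit-conversation-id",
)


def outbound_audit_headers(
    incoming: Mapping[str, str],
    extra: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build the headers for an outbound call.

    Starts from any caller-chosen extra headers, then copies over the audit
    headers that are present on the incoming invocation. Absent audit headers
    are skipped, and non-audit incoming headers are not forwarded.
    """
    outbound = dict(extra or {})
    for name in AUDIT_HEADERS:
        if name in incoming:
            outbound[name] = incoming[name]
    return outbound
```

A handler would pass the result as the `headers` argument of each outbound call, so a hop that forgets to do so is the only place the audit context can be lost.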


What Is Not In This PR

The following were considered and explicitly deferred:

  • Emitting audit events to a log/table — out of scope; users can do this in their handler with ctx.run()
  • Validating header values at ingress (e.g. non-empty) — deferred, not needed for v1
  • Exposing audit context helpers in the SDK — SDK concern, follows this PR

Alternative Approaches (PR Comments)

Alt 1: Server-side auto-propagation

What: When Agent A calls Agent B, the Restate server automatically copies x-restate-audit-* headers from the caller's ServiceInvocation.headers into the callee's ServiceInvocation.headers.

Where: crates/worker/src/partition/state_machine/entries/call_commands.rs, in _ApplyCallCommand::apply(), after the CallRequest is destructured — merge any x-restate-audit-* headers from caller_invocation_metadata into the outgoing ServiceInvocation.headers.

Trade-offs:

  • Pro: No SDK discipline required — headers propagate automatically through every hop
  • Pro: Impossible to accidentally drop the audit context mid-chain
  • Con: Requires reading caller invocation metadata during call command processing (already available via caller_invocation_status)
  • Con: Caller cannot override/clear the headers for a specific child call
  • Con: State machine change — higher risk surface than header constants alone
  • Con: Requires storing headers on InvocationMetadata (currently only on ServiceInvocation), or a separate lookup

Verdict: Correct long-term direction for a fully managed audit trail, but too much scope for a minimal PR. Revisit after the chosen well-known-headers approach is validated.
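
For comparison, the merge that Alt 1 describes could behave roughly as follows. This is a simplified model in Python, not the state machine code; `merge_caller_audit_headers` is a hypothetical name, and letting the caller's inherited audit headers always win is an assumption that mirrors the "cannot override" trade-off listed above.

```python
from typing import Iterable, List, Tuple

AUDIT_PREFIX = "x-restate-audit-"


def _is_audit(name: str) -> bool:
    return name.lower().startswith(AUDIT_PREFIX)


def merge_caller_audit_headers(
    caller_headers: Iterable[Tuple[str, str]],
    callee_headers: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Model of server-side auto-propagation for one call.

    Audit headers set explicitly on the outgoing call are discarded and
    replaced by the caller invocation's own audit headers, so the context
    propagates unchanged through every hop.
    """
    merged = [(n, v) for n, v in callee_headers if not _is_audit(n)]
    merged.extend((n, v) for n, v in caller_headers if _is_audit(n))
    return merged
```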


Alt 2: Audit headers as first-class fields on ServiceInvocation

What: Instead of using Vec<Header> as the carrier, add audit_ctx: Option<AuditContext> directly to ServiceInvocation and CallRequest.

pub struct AuditContext {
    pub triggered_by: ByteString,
    pub conversation_id: ByteString,
}

Where: crates/types/src/invocation/mod.rs (ServiceInvocation) and crates/types/src/journal_v2/command.rs (CallRequest).

Trade-offs:

  • Pro: Type-safe — no stringly-typed header names at the call site
  • Pro: Cannot be accidentally filtered or mangled by header processing logic
  • Pro: Visible in admin API / storage queries as a typed field
  • Con: Protocol change — CallRequest is part of the service protocol v4 Bilrost encoding; adding a field requires a protocol version bump
  • Con: Much larger blast radius: storage schema, wire format, admin REST model, partition store, WAL all need updating
  • Con: Overkill for what is essentially optional metadata that not all users need

Verdict: The right design if audit becomes a core Restate primitive, as the idempotency key is today. Premature for the initial feature.
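
Even without a protocol change, handler code can get some of the type safety of Alt 2 by converting the headers into a typed value at the edge. The sketch below mirrors the `AuditContext` struct in Python; the `from_headers` and `to_headers` methods are hypothetical, and treating the context as absent unless both fields are present is an assumption, since the design does not say how partial context should be handled.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

TRIGGERED_BY = "x-restate-audit-triggered-by"
CONVERSATION_ID = "x-restate-audit-conversation-id"


@dataclass(frozen=True)
class AuditContext:
    triggered_by: str
    conversation_id: str

    @classmethod
    def from_headers(cls, headers: Mapping[str, str]) -> Optional["AuditContext"]:
        """Parse the audit context, or return None if either field is missing."""
        triggered_by = headers.get(TRIGGERED_BY)
        conversation_id = headers.get(CONVERSATION_ID)
        if triggered_by is None or conversation_id is None:
            return None
        return cls(triggered_by=triggered_by, conversation_id=conversation_id)

    def to_headers(self) -> Dict[str, str]:
        """Serialise back to the well-known headers for an outbound call."""
        return {
            TRIGGERED_BY: self.triggered_by,
            CONVERSATION_ID: self.conversation_id,
        }
```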


Alt 3: OTel span attributes instead of headers

What: Store triggered_by and conversation_id as OpenTelemetry span attributes on the ServiceInvocationSpanContext rather than as headers.

Where: Extend SpanContextDef or add a bag to ServiceInvocationSpanContext in crates/types/src/invocation/mod.rs; emit attributes via invocation_span! macro in crates/tracing-instrumentation.

Trade-offs:

  • Pro: Audit context automatically appears in every OTel span/trace — directly queryable in Jaeger, Grafana Tempo, etc.
  • Pro: No need for SDK propagation discipline — OTel baggage handles it
  • Con: OTel baggage propagation is not currently wired through Restate's internal span context
  • Con: Mixes audit identity (who triggered it) with observability concerns (how to trace it) — different audiences, different retention policies
  • Con: Requires changes to the tracing layer and the span context serialisation format
  • Con: Not accessible in handler code without going through OTel APIs

Verdict: Useful as a complementary feature (emit audit fields as span attributes when audit is enabled), not a replacement. Could be layered on top of the chosen well-known-headers approach later.
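
As a rough illustration of the complementary variant, the audit headers could be translated into span attributes only when the feature is enabled. The attribute keys below (`restate.audit.*`) and the function name are invented for this sketch and are not defined anywhere in the PR; the function returns a plain dict that a tracing layer could attach to a span.

```python
from typing import Dict, Mapping

# Hypothetical header-to-attribute mapping; the attribute keys are assumptions.
ATTRIBUTE_KEYS = {
    "x-restate-audit-triggered-by": "restate.audit.triggered_by",
    "x-restate-audit-conversation-id": "restate.audit.conversation_id",
}


def audit_span_attributes(
    headers: Mapping[str, str],
    agent_audit: bool,
) -> Dict[str, str]:
    """Return span attributes for the audit headers present on an invocation.

    Emits nothing when agent_audit is disabled, so tracing output is unchanged
    for users who have not opted in.
    """
    if not agent_audit:
        return {}
    return {
        attribute: headers[header]
        for header, attribute in ATTRIBUTE_KEYS.items()
        if header in headers
    }
```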

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions

github-actions Bot commented Apr 3, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@teochenglim
Author

I have read the CLA Document and I hereby sign the CLA

@tillrohrmann
Contributor

Thanks a lot for creating this PR @teochenglim. We will probably need a little time to review your contribution properly, as the team is quite busy these days.

@slinkydeveloper and @gvdongen for your visibility as you were looking into tracing and how to integrate Restate with AI observability tools before.
