Response envelope schema validation + schema drift canary #39

@stackbilt-admin

Problem

Our current resilience story (circuit breaker + provider failover + validateToolCalls) handles loud provider failures well, but leaves two gaps for the classic "API silently deprecated a field at 2am" scenario:

  1. No response envelope schema validation. Each provider's parser reads fields like stop_reason, usage.input_tokens, choices[0].message.content via direct property access. If a provider renames/drops a field, we get an uncaught TypeError or a silently-undefined value propagating into LLMResponse. Failover rescues us, but only after a failure; the first requests hitting the broken shape either throw or return corrupt data.
  2. No schema drift detection. We learn about breaking changes from production errors, not from canaries. There's no golden-sample comparison, no nightly probe, no changelog watcher.
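To make the two failure modes in gap 1 concrete, here is a minimal sketch. The `LLMResponseSketch` type, `parseUnchecked`, and the payload shapes are hypothetical simplifications, not the library's actual parsers; they only mirror the direct-property-access style described above.

```typescript
// Hypothetical, simplified result type; the real LLMResponse carries more fields.
interface LLMResponseSketch {
  content: string;
  inputTokens: number;
}

// Direct property access with no validation, as in the current parsers.
function parseUnchecked(raw: any): LLMResponseSketch {
  return {
    content: raw.choices[0].message.content,
    inputTokens: raw.usage.input_tokens,
  };
}

// The shape the parser was written against.
const current = {
  choices: [{ message: { content: "hi" } }],
  usage: { input_tokens: 3 },
};

// Drift case 1: a leaf field is renamed. Nothing throws; the value is just undefined.
const renamed = {
  choices: [{ message: { content: "hi" } }],
  usage: { prompt_tokens: 3 },
};

// Drift case 2: a container is dropped. Property access on undefined throws.
const dropped = { choices: [{ message: { content: "hi" } }] };

const ok = parseUnchecked(current);
const silent = parseUnchecked(renamed);

let thrown: unknown;
try {
  parseUnchecked(dropped);
} catch (err) {
  thrown = err;
}

console.log(ok.inputTokens);              // 3
console.log(silent.inputTokens);          // undefined, despite the `number` type
console.log(thrown instanceof TypeError); // true
```

The renamed-field case is the more dangerous one: the static type still claims `number`, so nothing downstream is warned.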

The existing validateToolCalls (src/providers/base.ts:335) is the right pattern — drop malformed entries with a warning instead of crashing — but it only covers the tool_calls array. We should extend the same defensive-parse posture to the whole response envelope.

Proposal

Part 1 — Response envelope schema validation (per provider)

Add a zod schema (or valibot, which has a lighter runtime) per provider describing the raw upstream response shape. Parse the raw response through it before any field access in generateResponse. On validation failure:

  • Log a structured warning with provider, model, and the offending path
  • Emit a schema_drift hook event (see src/utils/hooks.ts) so observability can alert
  • Throw a new SchemaDriftError that the circuit breaker/failover path treats like a transient provider error (so we fail over to a healthy provider instead of crashing the caller)
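The three steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the `Envelope` fields, the `DriftEvent` payload, and the `parseEnvelope` and `findDrift` helpers are hypothetical, and the hand-written checks stand in for a zod or valibot schema only so that the example has no dependencies.

```typescript
// Hypothetical error type; the proposal places the real one in src/errors.ts.
class SchemaDriftError extends Error {
  readonly provider: string;
  readonly model: string;
  readonly path: string;

  constructor(provider: string, model: string, path: string) {
    super(`Schema drift from ${provider} (${model}) at ${path}`);
    this.name = "SchemaDriftError";
    this.provider = provider;
    this.model = model;
    this.path = path;
    // Keeps instanceof working when compiling to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed hook payload; the actual event type would live in src/utils/hooks.ts.
type DriftEvent = { type: "schema_drift"; provider: string; model: string; path: string };
type DriftHook = (event: DriftEvent) => void;

// Assumed subset of the envelope that the parser depends on.
interface Envelope {
  stop_reason: string;
  usage: { input_tokens: number; output_tokens: number };
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns the first offending path, or null when the shape matches.
// A schema library would replace this function.
function findDrift(raw: unknown): string | null {
  if (!isRecord(raw)) return "$";
  if (typeof raw["stop_reason"] !== "string") return "stop_reason";
  const usage = raw["usage"];
  if (!isRecord(usage)) return "usage";
  if (typeof usage["input_tokens"] !== "number") return "usage.input_tokens";
  if (typeof usage["output_tokens"] !== "number") return "usage.output_tokens";
  return null;
}

// Validate first, then hand back a typed envelope that is safe to read.
function parseEnvelope(
  raw: unknown,
  provider: string,
  model: string,
  onDrift: DriftHook,
): Envelope {
  const path = findDrift(raw);
  if (path !== null) {
    // 1. Structured warning with provider, model, and offending path.
    console.warn(JSON.stringify({ level: "warn", msg: "schema_drift", provider, model, path }));
    // 2. Hook event so observability can alert.
    onDrift({ type: "schema_drift", provider, model, path });
    // 3. Typed error for the failover path to catch.
    throw new SchemaDriftError(provider, model, path);
  }
  return raw as Envelope;
}
```

Carrying the offending path on the error, and not only in the log line, lets the failover path and any alerting report the same location without re-parsing.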

Files likely touched:

  • src/providers/anthropic.ts, openai.ts, groq.ts, cerebras.ts, cloudflare.ts — add per-provider schema, parse before access
  • src/errors.ts — add SchemaDriftError
  • src/factory.ts — treat SchemaDriftError as fallback-eligible in getFallbackDecision
  • src/utils/hooks.ts — new schema_drift event type
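As a rough illustration of the src/factory.ts change, the sketch below treats SchemaDriftError the same way as ProviderError when deciding whether to try the next provider. The real signature of getFallbackDecision is not shown in this issue, so `getFallbackDecisionSketch`, `FallbackDecision`, and `firstHealthy` are hypothetical stand-ins, and the error classes are reduced to the minimum needed to run.

```typescript
// Minimal stand-ins for the library's error classes.
class ProviderError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ProviderError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class SchemaDriftError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SchemaDriftError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed decision shape; the real one may carry more information.
type FallbackDecision = { fallback: boolean; reason: string };

function getFallbackDecisionSketch(err: unknown): FallbackDecision {
  // Drift is handled like a transient provider failure: move to the next provider.
  if (err instanceof SchemaDriftError) return { fallback: true, reason: "schema_drift" };
  if (err instanceof ProviderError) return { fallback: true, reason: "provider_error" };
  // Anything else (for example a caller bug) is surfaced immediately.
  return { fallback: false, reason: "not_fallback_eligible" };
}

// Tries each provider attempt in order and returns the first success.
function firstHealthy(attempts: Array<() => string>): string {
  let lastError: unknown = new Error("no providers configured");
  for (const attempt of attempts) {
    try {
      return attempt();
    } catch (err) {
      lastError = err;
      if (!getFallbackDecisionSketch(err).fallback) throw err;
    }
  }
  throw lastError;
}

// Usage: the first provider returns a drifted shape, the second is healthy.
const answer = firstHealthy([
  () => { throw new SchemaDriftError("usage.input_tokens missing"); },
  () => "response from healthy provider",
]);
console.log(answer);
```

Keeping a distinct `reason` for drift lets dashboards separate "provider is down" from "provider changed its response shape", which call for different follow-up.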

Part 2 — Schema drift canary

A separate opt-in module (src/utils/schema-canary.ts) that:

  • Sends a minimal known-good request to each configured provider
  • Captures the raw response
  • Compares against a committed golden fixture (src/__tests__/fixtures/response-shapes/<provider>.json)
  • Reports added/removed/renamed top-level and usage fields

Consumer-facing surface: a runSchemaCanary(providers) function that returns a diff report. Leave scheduling to the consumer (cron Worker, GitHub Action, whatever) — the library shouldn't own cadence.
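The comparison at the core of the canary could look something like the sketch below. It covers only the diff logic, with no network calls; `diffShapes`, `shapeOf`, and the `ShapeDiff` report type are hypothetical names, and the sample payloads are invented for illustration. A rename is reported here as one removal paired with one addition; pairing them up automatically is left out.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Assumed report shape for a single provider.
interface ShapeDiff {
  added: string[];
  removed: string[];
}

function isObject(v: Json | undefined): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Collects field paths for top-level keys and for keys under `usage`,
// the two levels the canary is meant to report on.
function shapeOf(raw: Json): string[] {
  const paths: string[] = [];
  if (!isObject(raw)) return paths;
  for (const key of Object.keys(raw)) {
    paths.push(key);
    const value = raw[key];
    if (key === "usage" && isObject(value)) {
      for (const inner of Object.keys(value)) paths.push(`usage.${inner}`);
    }
  }
  return paths;
}

// Compares a live response against the committed golden fixture.
function diffShapes(golden: Json, live: Json): ShapeDiff {
  const goldenPaths = shapeOf(golden);
  const livePaths = shapeOf(live);
  return {
    added: livePaths.filter((p) => goldenPaths.indexOf(p) === -1),
    removed: goldenPaths.filter((p) => livePaths.indexOf(p) === -1),
  };
}

// Invented payloads: values differ freely, only field names matter.
const goldenFixture: Json = {
  id: "fixture",
  stop_reason: "end_turn",
  usage: { input_tokens: 1, output_tokens: 1 },
};
const liveResponse: Json = {
  id: "live",
  finish_reason: "stop",
  usage: { prompt_tokens: 1, output_tokens: 2 },
};

const report = diffShapes(goldenFixture, liveResponse);
console.log(report.removed); // [ 'stop_reason', 'usage.input_tokens' ]
console.log(report.added);   // [ 'finish_reason', 'usage.prompt_tokens' ]
```

Comparing field names and ignoring values keeps the fixture stable across runs, since ids and token counts change on every request.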

Stretch: a scripts/update-golden-shapes.ts that refreshes fixtures after human review of a diff (never automatically).

Tests

  • Unit: each provider's schema rejects known-bad shapes (missing usage, wrong type on content, etc.) and accepts current golden shapes
  • Integration: factory fails over on SchemaDriftError like it does on ProviderError
  • Canary smoke: golden shape comparison logic — fixture-driven, no network

Out of scope

  • Changelog/RSS polling — separate concern, different cadence
  • Auto-healing / auto-patching parsers — humans review drift diffs, period
  • Schema validation of request payloads — we already validate in validateRequest

Priority

Medium. Current failover + circuit breaker means this is a defense-in-depth upgrade, not a hair-on-fire gap. But the cost of the 2am JSON parse error is high enough that being proactive wins.

Labels: enhancement
