Problem
Our current resilience story (circuit breaker + provider failover + validateToolCalls) handles loud provider failures well, but leaves two gaps for the classic "API silently deprecated a field at 2am" scenario:
- No response envelope schema validation. Each provider's parser reads fields like stop_reason, usage.input_tokens, choices[0].message.content via direct property access. If a provider renames/drops a field, we get an uncaught TypeError or a silently-undefined value propagating into LLMResponse. Failover rescues us, but only after a failure; the first requests hitting the broken shape either throw or return corrupt data.
- No schema drift detection. We learn about breaking changes from production errors, not from canaries. There's no golden-sample comparison, no nightly probe, no changelog watcher.
The existing validateToolCalls (src/providers/base.ts:335) is the right pattern — drop malformed entries with a warning instead of crashing — but it only covers the tool_calls array. We should extend the same defensive-parse posture to the whole response envelope.
Proposal
Part 1 — Response envelope schema validation (per provider)
Add a zod (or valibot, lighter runtime) schema per provider describing the raw upstream response shape. Parse through it before field access in generateResponse. On validation failure:
- Log a structured warning with provider, model, and the offending path
- Emit a schema_drift hook event (see src/utils/hooks.ts) so observability can alert
- Throw a new SchemaDriftError that the circuit breaker/failover path treats like a transient provider error (so we fail over to a healthy provider instead of crashing the caller)
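A minimal sketch of the parse-before-access step, assuming an Anthropic-style envelope. To stay dependency-free it uses a hand-rolled check where the proposal would use a zod or valibot schema. The names parseAnthropicEnvelope and AnthropicEnvelope, the SchemaDriftError constructor shape, and the exact field list are illustrative assumptions, not existing code; logging and the schema_drift hook emission are left out.

```typescript
// Hypothetical error type; the proposal places the real one in src/errors.ts.
class SchemaDriftError extends Error {
  readonly provider: string;
  readonly model: string;
  readonly path: string;

  constructor(provider: string, model: string, path: string) {
    super(`schema drift from ${provider} (${model}) at ${path}`);
    this.name = "SchemaDriftError";
    this.provider = provider;
    this.model = model;
    this.path = path;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Subset of the envelope the parser reads; the field list is illustrative.
interface AnthropicEnvelope {
  stop_reason: string | null;
  usage: { input_tokens: number; output_tokens: number };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validate the raw upstream body before any field access.
// Throws SchemaDriftError naming the first offending path.
function parseAnthropicEnvelope(raw: unknown, model: string): AnthropicEnvelope {
  const fail = (path: string): never => {
    throw new SchemaDriftError("anthropic", model, path);
  };

  if (!isRecord(raw)) return fail("$");

  const stopReason = raw.stop_reason;
  if (stopReason !== null && typeof stopReason !== "string") {
    return fail("stop_reason");
  }

  const usage = raw.usage;
  if (!isRecord(usage)) return fail("usage");

  const { input_tokens, output_tokens } = usage;
  if (typeof input_tokens !== "number") return fail("usage.input_tokens");
  if (typeof output_tokens !== "number") return fail("usage.output_tokens");

  return { stop_reason: stopReason, usage: { input_tokens, output_tokens } };
}
```

Returning a freshly built object, rather than casting the raw body, means downstream code only ever sees fields that were actually checked.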
Files likely touched:
- src/providers/anthropic.ts, openai.ts, groq.ts, cerebras.ts, cloudflare.ts: add per-provider schema, parse before access
- src/errors.ts: add SchemaDriftError
- src/factory.ts: treat SchemaDriftError as fallback-eligible in getFallbackDecision
- src/utils/hooks.ts: new schema_drift event type
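How the factory might classify the new error can be sketched as follows. The real getFallbackDecision signature is not shown in this issue, so the FallbackDecision shape, the single-argument signature, and the stand-in error classes below are assumptions for illustration only.

```typescript
// Stand-in error classes so the sketch is self-contained;
// the real ones would live in src/errors.ts.
class ProviderError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ProviderError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class SchemaDriftError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SchemaDriftError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical decision shape; the real return type may differ.
type FallbackDecision = { fallback: boolean; reason: string };

function getFallbackDecision(error: unknown): FallbackDecision {
  if (error instanceof SchemaDriftError) {
    // Drift is specific to one provider's response shape,
    // so another provider is likely still healthy.
    return { fallback: true, reason: "schema_drift" };
  }
  if (error instanceof ProviderError) {
    return { fallback: true, reason: "provider_error" };
  }
  // Anything else (for example a caller bug) is not masked by failover.
  return { fallback: false, reason: "unhandled" };
}
```

Keeping a distinct reason for drift lets the circuit breaker and dashboards tell a renamed field apart from an ordinary outage.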
Part 2 — Schema drift canary
A separate opt-in module (src/utils/schema-canary.ts) that:
- Sends a minimal known-good request to each configured provider
- Captures the raw response
- Compares against a committed golden fixture (src/__tests__/fixtures/response-shapes/<provider>.json)
- Reports added/removed/renamed top-level and usage fields
Consumer-facing surface: a runSchemaCanary(providers) function that returns a diff report. Leave scheduling to the consumer (cron Worker, GitHub Action, whatever) — the library shouldn't own cadence.
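The comparison step could be sketched as a pure function over the golden fixture and a captured response, which is also what a fixture-driven, no-network test would exercise. diffShape and ShapeDiff are hypothetical names. This sketch reports added, removed, and retyped fields; rename detection is omitted because it needs a heuristic (for example, pairing one removed key with one added key of the same type).

```typescript
type Json = Record<string, unknown>;

interface ShapeDiff {
  added: string[];   // in the live response, absent from the fixture
  removed: string[]; // in the fixture, absent from the live response
  retyped: string[]; // in both, but with a different JSON type
}

function jsonType(value: unknown): string {
  if (value === null) return "null";
  return Array.isArray(value) ? "array" : typeof value;
}

function isObject(value: unknown): value is Json {
  return jsonType(value) === "object";
}

// Compare the keys of one object level and record paths under `prefix`.
function diffLevel(golden: Json, live: Json, prefix: string, out: ShapeDiff): void {
  for (const key of Object.keys(golden)) {
    const path = prefix + key;
    if (!(key in live)) {
      out.removed.push(path);
    } else if (jsonType(golden[key]) !== jsonType(live[key])) {
      out.retyped.push(path);
    }
  }
  for (const key of Object.keys(live)) {
    if (!(key in golden)) out.added.push(prefix + key);
  }
}

// Diff top-level fields, then descend one level into `usage`
// when both sides carry it as an object.
function diffShape(golden: Json, live: Json): ShapeDiff {
  const out: ShapeDiff = { added: [], removed: [], retyped: [] };
  diffLevel(golden, live, "", out);

  const goldenUsage = golden.usage;
  const liveUsage = live.usage;
  if (isObject(goldenUsage) && isObject(liveUsage)) {
    diffLevel(goldenUsage, liveUsage, "usage.", out);
  }
  return out;
}

// Example: a renamed top-level field plus a retyped usage field.
const report = diffShape(
  { id: "a", stop_reason: "end_turn", usage: { input_tokens: 1, output_tokens: 2 } },
  { id: "b", finish_reason: "stop", usage: { input_tokens: 1, output_tokens: "2" } },
);
console.log(report);
```

Only types and key names are compared, never values, so the fixture can hold any sanitized sample response.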
Stretch: a scripts/update-golden-shapes.ts that refreshes fixtures after human review of a diff (never automatically).
Tests
- Unit: each provider's schema rejects known-bad shapes (missing usage, wrong type on content, etc.) and accepts current golden shapes
- Integration: factory fails over on SchemaDriftError like it does on ProviderError
- Canary smoke: golden shape comparison logic — fixture-driven, no network
Out of scope
- Changelog/RSS polling — separate concern, different cadence
- Auto-healing / auto-patching parsers — humans review drift diffs, period
- Schema validation of request payloads: we already validate in validateRequest
Priority
Medium. Current failover + circuit breaker means this is a defense-in-depth upgrade, not a hair-on-fire gap. But the cost of the 2am JSON parse error is high enough that being proactive wins.