Response envelope schema validation + schema drift canary

## Problem

Our current resilience story (circuit breaker + provider failover + `validateToolCalls`) handles *loud* provider failures well, but leaves two gaps for the classic "API silently deprecated a field at 2am" scenario:

1. **No response envelope schema validation.** Each provider's parser reads fields like `stop_reason`, `usage.input_tokens`, `choices[0].message.content` via direct property access. If a provider renames/drops a field, we get an uncaught `TypeError` or a silently-undefined value propagating into `LLMResponse`. Failover rescues us, but only after a failure; the first requests hitting the broken shape either throw or return corrupt data.
2. **No schema drift detection.** We learn about breaking changes from production errors, not from canaries. There's no golden-sample comparison, no nightly probe, no changelog watcher.
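
To make gap 1 concrete, here is a minimal sketch of both failure modes. The accessor functions and the drifted payload are hypothetical illustrations, not code from the provider parsers.

```typescript
// Hypothetical accessors mirroring the direct property access described above.
// `any` is deliberate: it models an unvalidated upstream JSON body.
function readContentUnsafe(raw: any): string {
  return raw.choices[0].message.content;
}

function readInputTokensUnsafe(raw: any): number {
  return raw.usage.input_tokens;
}

// Invented drifted payload: `choices` became `outputs`,
// and `usage.input_tokens` became `usage.prompt_tokens`.
const drifted = {
  outputs: [{ message: { content: "hi" } }],
  usage: { prompt_tokens: 3 },
};

// Failure mode A: property access on `undefined` throws a TypeError.
let threwTypeError = false;
try {
  readContentUnsafe(drifted);
} catch (err) {
  threwTypeError = err instanceof TypeError;
}

// Failure mode B: no throw at all; `undefined` flows on while typed as `number`.
const inputTokens = readInputTokensUnsafe(drifted);

console.log({ threwTypeError, inputTokens });
```

Mode B is the more dangerous one because nothing trips the circuit breaker: the value simply propagates into whatever consumes token counts.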

The existing `validateToolCalls` (src/providers/base.ts:335) is the right pattern — drop malformed entries with a warning instead of crashing — but it only covers the `tool_calls` array. We should extend the same defensive-parse posture to the whole response envelope.

## Proposal

### Part 1 — Response envelope schema validation (per provider)

Add a zod schema (or valibot, for a lighter runtime) per provider describing the *raw* upstream response shape. Parse every response through it before any field access in `generateResponse`. On validation failure:

- Log a structured warning with provider, model, and the offending path
- Emit a `schema_drift` hook event (see src/utils/hooks.ts) so observability can alert
- Throw a new `SchemaDriftError` that the circuit breaker/failover path treats like a transient provider error (so we fail over to a healthy provider instead of crashing the caller)
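
The three steps above could look roughly like the sketch below. It uses hand-written checks so the example has no dependencies; in the actual change a zod or valibot schema would replace them. The `Envelope` fields, the `parseEnvelope` name, the hook signature, and the `SchemaDriftError` constructor are all assumptions for illustration, not the existing API in `src/utils/hooks.ts` or `src/errors.ts`.

```typescript
// Assumed error shape; the real class would live in src/errors.ts.
class SchemaDriftError extends Error {
  readonly provider: string;
  readonly model: string;
  readonly path: string;

  constructor(provider: string, model: string, path: string, detail: string) {
    super(`[${provider}/${model}] schema drift at ${path}: ${detail}`);
    this.name = "SchemaDriftError";
    this.provider = provider;
    this.model = model;
    this.path = path;
  }
}

// Assumed hook event and emitter types.
type SchemaDriftEvent = { type: "schema_drift"; provider: string; model: string; path: string };
type EmitHook = (event: SchemaDriftEvent) => void;

// Hypothetical normalized subset of the envelope.
interface Envelope {
  stopReason: string;
  inputTokens: number;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function parseEnvelope(raw: unknown, provider: string, model: string, emit: EmitHook): Envelope {
  // Log, emit, then throw: the three failure steps from the list above.
  const fail = (path: string, detail: string): never => {
    console.warn(JSON.stringify({ event: "schema_drift", provider, model, path, detail }));
    emit({ type: "schema_drift", provider, model, path });
    throw new SchemaDriftError(provider, model, path, detail);
  };

  if (!isRecord(raw)) return fail("$", "expected object");

  const stopReason = raw["stop_reason"];
  if (typeof stopReason !== "string") return fail("stop_reason", "expected string");

  const usage = raw["usage"];
  if (!isRecord(usage)) return fail("usage", "expected object");

  const inputTokens = usage["input_tokens"];
  if (typeof inputTokens !== "number") return fail("usage.input_tokens", "expected number");

  return { stopReason, inputTokens };
}

// Usage: a healthy payload parses; a drifted one raises SchemaDriftError.
const events: SchemaDriftEvent[] = [];
const record: EmitHook = (event) => { events.push(event); };

const parsed = parseEnvelope(
  { stop_reason: "end_turn", usage: { input_tokens: 12 } },
  "example-provider", "example-model", record,
);

let driftPath = "";
try {
  parseEnvelope({ stop_reason: "end_turn", usage: { prompt_tokens: 12 } },
    "example-provider", "example-model", record);
} catch (err) {
  if (err instanceof SchemaDriftError) driftPath = err.path;
}
```

Validating before access means the failure names the offending path (`usage.input_tokens`) instead of surfacing later as an `undefined` somewhere downstream.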

Files likely touched:
- `src/providers/anthropic.ts`, `openai.ts`, `groq.ts`, `cerebras.ts`, `cloudflare.ts` — add per-provider schema, parse before access
- `src/errors.ts` — add `SchemaDriftError`
- `src/factory.ts` — treat `SchemaDriftError` as fallback-eligible in `getFallbackDecision`
- `src/utils/hooks.ts` — new `schema_drift` event type
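
As a rough illustration of the `src/factory.ts` change, the sketch below classifies `SchemaDriftError` as fallback-eligible alongside `ProviderError`. The real signature and return type of `getFallbackDecision` are not shown in this issue, so the `FallbackDecision` shape, the `getFallbackDecisionSketch` name, and the synchronous `firstHealthy` loop are assumptions made to keep the example short.

```typescript
// Stand-ins for the real error classes.
class ProviderError extends Error {}
class SchemaDriftError extends Error {}

// Assumed decision shape; the real one may differ.
type FallbackDecision = { shouldFallback: boolean; reason: string };

function getFallbackDecisionSketch(err: unknown): FallbackDecision {
  if (err instanceof SchemaDriftError) return { shouldFallback: true, reason: "schema_drift" };
  if (err instanceof ProviderError) return { shouldFallback: true, reason: "provider_error" };
  // Anything else (for example a bug in our own code) should surface to the caller.
  return { shouldFallback: false, reason: "not_fallback_eligible" };
}

// Hypothetical provider attempt; synchronous only to keep the sketch small.
type Attempt = { name: string; call: () => string };

function firstHealthy(attempts: Attempt[]): { provider: string; text: string } {
  let lastError: unknown = new Error("no providers configured");
  for (const attempt of attempts) {
    try {
      return { provider: attempt.name, text: attempt.call() };
    } catch (err) {
      // Rethrow immediately unless the error is fallback-eligible.
      if (!getFallbackDecisionSketch(err).shouldFallback) throw err;
      lastError = err;
    }
  }
  throw lastError;
}

// Usage: the primary drifts, so the secondary serves the request.
const result = firstHealthy([
  { name: "primary", call: () => { throw new SchemaDriftError("usage.input_tokens missing"); } },
  { name: "secondary", call: () => "ok" },
]);
console.log(result.provider);
```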

### Part 2 — Schema drift canary

A separate opt-in module (`src/utils/schema-canary.ts`) that:

- Sends a minimal known-good request to each configured provider
- Captures the raw response
- Compares against a committed golden fixture (`src/__tests__/fixtures/response-shapes/<provider>.json`)
- Reports added/removed/renamed top-level and usage fields

Consumer-facing surface: a `runSchemaCanary(providers)` function that returns a diff report. Leave scheduling to the consumer (cron Worker, GitHub Action, whatever) — the library shouldn't own cadence.
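
The comparison step at the core of the canary could be sketched as follows. `ShapeDiff`, `collectShape`, and `diffShapes` are hypothetical names, and the payloads are invented. This version compares field paths and value types only, so a rename shows up as one removed path plus one added path rather than being detected as a rename.

```typescript
type ShapeDiff = { added: string[]; removed: string[]; retyped: string[] };

function typeName(value: unknown): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value;
}

// Record "path -> type" for every field, recursing into plain objects only.
function collectShape(value: unknown, prefix: string, out: Map<string, string>): void {
  if (typeName(value) !== "object") return;
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    const path = prefix ? `${prefix}.${key}` : key;
    out.set(path, typeName(child));
    collectShape(child, path, out);
  }
}

function diffShapes(golden: unknown, live: unknown): ShapeDiff {
  const goldenShape = new Map<string, string>();
  const liveShape = new Map<string, string>();
  collectShape(golden, "", goldenShape);
  collectShape(live, "", liveShape);

  const diff: ShapeDiff = { added: [], removed: [], retyped: [] };
  goldenShape.forEach((type, path) => {
    const liveType = liveShape.get(path);
    if (liveType === undefined) diff.removed.push(path);
    else if (liveType !== type) diff.retyped.push(`${path}: ${type} -> ${liveType}`);
  });
  liveShape.forEach((_type, path) => {
    if (!goldenShape.has(path)) diff.added.push(path);
  });
  return diff;
}

// Usage with invented payloads: one rename, one type change, one new usage field.
const goldenFixture = { id: "x", stop_reason: "end_turn", usage: { input_tokens: 1, output_tokens: 2 } };
const liveResponse = { id: "y", finish_reason: "end_turn", usage: { input_tokens: "1", output_tokens: 2, cache_tokens: 0 } };

const report = diffShapes(goldenFixture, liveResponse);
console.log(JSON.stringify(report, null, 2));
```

Because the diff ignores values and looks only at structure, fields that change on every request (ids, token counts, generated text) do not produce false positives.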

Stretch: a `scripts/update-golden-shapes.ts` that refreshes fixtures after human review of a diff (never automatically).

## Tests

- Unit: each provider's schema rejects known-bad shapes (missing `usage`, wrong type on `content`, etc.) and accepts current golden shapes
- Integration: factory fails over on `SchemaDriftError` like it does on `ProviderError`
- Canary smoke: golden shape comparison logic — fixture-driven, no network

## Out of scope

- Changelog/RSS polling — separate concern, different cadence
- Auto-healing / auto-patching parsers — humans review drift diffs, period
- Schema validation of *request* payloads — we already validate in `validateRequest`

## Priority

Medium. Current failover + circuit breaker means this is a defense-in-depth upgrade, not a hair-on-fire gap. But the cost of the 2am JSON parse error is high enough that being proactive wins.
