feat(videos): Level 2 post-render QA — per-beat vision-judge frame check #453

## Context

PR #452 ships Level 1 of post-render QA: deterministic stream sanity checks, sentinel-string detection, beat-midpoint frame extraction, and a PIL variance check for all-black/white frames. It catches the major classes of "render exited ok but the video is broken" failures that have shipped this week.

What Level 1 *can't* catch (because there's no vision intelligence in the pipeline):

- Text overflows the frame edges (rendered glyphs are wider than the layout assumed)
- Wrong cycle step is highlighted given the audible narration
- Stat number is unreadable due to low color contrast with the background
- Asset failed to load (frame is rendered but contains a Remotion error overlay or broken-image glyph)
- Captions visually overlap a clip's lower-third
- B-roll thumbnail is the wrong shot (someone swapped the gdrive id but the alias is still `@field-walking-towards-house`)

These are all "is this frame *acceptable*" judgments. A vision-capable model can answer them with a small rubric per beat-kind.

## Proposal

Add `scripts/qa_render_vision.py` (or extend `qa_render.py --vision`) that:

1. Reuses Level 1's per-beat midpoint frame extraction.
2. For each beat, builds a small rubric prompt parameterized by the beat's kind:
   - `intro_hook` → does the tagline text fit and read cleanly?
   - `intro_cycle` → which of {Learn, Deliver, Verify, Pay} appears highlighted? Does it match the caption text under it?
   - `intro_handoff` → is the program name visible?
   - `body_scene` → does the lower-third overlay obscure the subject of the b-roll?
   - `body_problem_stat`, `body_impact_stats` → does the big number fit, is the caption readable, is the source line visible but de-emphasized?
   - `body_product_beats` → are the three phone-frame screencaps positioned correctly, with the right caption beneath?
   - `outro_cta` → logo + tagline + "Request a demo" link all present?
3. Sends each frame + rubric to a vision-capable model (`claude-opus-4-7` or similar), gets a structured verdict (OK / WARN / FAIL + short explanation).
4. Surfaces results in the same OK/WARN/FAIL summary as Level 1.
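
A minimal sketch of steps 2 and 3, assuming the judge is asked to reply with a small JSON object. The names `RUBRICS`, `build_prompt`, `parse_verdict`, and `Verdict` are hypothetical, as is the JSON reply shape; the actual model call is left out because it depends on the client library in use. Treating an unparseable reply as WARN rather than FAIL is also an assumption, chosen so a flaky judge does not block a render.

```python
import json
from dataclasses import dataclass

# Shared rubric for the two stat beat kinds.
_STAT = ("Does the big number fit, is the caption readable, and is the "
         "source line visible but de-emphasized?")

# One rubric question per beat kind, taken from the list above.
RUBRICS = {
    "intro_hook": "Does the tagline text fit and read cleanly?",
    "intro_cycle": ("Which of Learn, Deliver, Verify, Pay appears highlighted? "
                    "Does it match the caption text under it?"),
    "intro_handoff": "Is the program name visible?",
    "body_scene": "Does the lower-third overlay obscure the subject of the b-roll?",
    "body_problem_stat": _STAT,
    "body_impact_stats": _STAT,
    "body_product_beats": ("Are the three phone-frame screencaps positioned "
                           "correctly, with the right caption beneath?"),
    "outro_cta": "Are the logo, tagline, and 'Request a demo' link all present?",
}

VERDICTS = ("OK", "WARN", "FAIL")


@dataclass
class Verdict:
    beat_id: str
    status: str        # one of VERDICTS
    explanation: str


def build_prompt(beat_kind: str) -> str:
    """Build the rubric prompt for one beat; raises KeyError on unknown kinds."""
    question = RUBRICS[beat_kind]
    return (
        f"You are checking one frame from a '{beat_kind}' beat.\n"
        f"Rubric: {question}\n"
        'Reply with JSON only: {"status": "OK" | "WARN" | "FAIL", '
        '"explanation": "<one sentence>"}'
    )


def parse_verdict(beat_id: str, raw: str) -> Verdict:
    """Turn the judge's raw reply into a Verdict, degrading to WARN on bad output."""
    try:
        data = json.loads(raw)
        status = str(data["status"]).upper()
        explanation = str(data.get("explanation", ""))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return Verdict(beat_id, "WARN", "unparseable judge reply")
    if status not in VERDICTS:
        return Verdict(beat_id, "WARN", f"unknown status {status!r}")
    return Verdict(beat_id, status, explanation)
```

Keeping the rubric table as plain data means a new beat kind only needs one new entry, and the parser can be unit-tested without any model access.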

## Open questions

- **Cost**: 8 beats per render. If we run on every render, that's ~$0.05–$0.20 of vision tokens per render. Acceptable, but worth keeping opt-in (`--vision`) initially.
- **Determinism**: vision judges drift between model versions. Tag the model id in the report so a future "this used to pass" investigation has a known baseline.
- **False positives**: model might flag a legitimate stylistic choice (e.g. a low-contrast headline by design). Need a per-program override path that says "ignore the contrast warning on impact[1]".
- **Frame selection**: midpoint might miss the worst frame in a beat (e.g. a stat that's mid-animation). Maybe sample at midpoint + last-2s + first-2s and judge the worst of the three.
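
One possible shape for the determinism, override, and frame-selection ideas above, as a sketch rather than a settled design. Everything here is hypothetical: the function names, the `(beat_id, rubric_tag)` override format, and the reading of "first-2s / last-2s" as "2 seconds in from each edge of the beat". Overrides only suppress WARN in this sketch; whether they should also suppress FAIL is left open.

```python
SEVERITY = {"OK": 0, "WARN": 1, "FAIL": 2}


def sample_times(start: float, end: float, margin: float = 2.0) -> list[float]:
    """Timestamps to judge: midpoint plus a point `margin` seconds in from each edge."""
    mid = (start + end) / 2
    if end - start <= 2 * margin:
        return [mid]  # beat too short for distinct edge samples
    return [start + margin, mid, end - margin]


def worst(statuses: list[str]) -> str:
    """Most severe status among the sampled frames of one beat."""
    return max(statuses, key=SEVERITY.__getitem__)


def apply_override(beat_id: str, rubric_tag: str, status: str,
                   ignored: set[tuple[str, str]]) -> str:
    """Downgrade a WARN to OK when the program has opted out of that check."""
    if status == "WARN" and (beat_id, rubric_tag) in ignored:
        return "OK"
    return status


def build_report(model_id: str, beat_statuses: dict[str, str]) -> dict:
    """Tag the report with the judge model id so later runs have a known baseline."""
    return {
        "model_id": model_id,
        "beats": beat_statuses,
        "overall": worst(list(beat_statuses.values())) if beat_statuses else "OK",
    }


# Usage: a 10 s beat starting at t=10 is sampled at 12 s, 15 s, and 18 s.
times = sample_times(10.0, 20.0)
status = worst(["OK", "WARN", "OK"])
status = apply_override("impact[1]", "contrast", status, {("impact[1]", "contrast")})
report = build_report("claude-opus-4-7", {"impact[1]": status})
```

Sampling three frames triples the per-beat vision cost, so it may need its own flag or be limited to beat kinds with animation.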

## Acceptance

- `--vision` flag added to `qa_render.py`
- Vision judge runs per beat against the midpoint frame
- Token spend ≤ \$0.25/render at default rubric size
- README / CLAUDE.md note on when to use vision vs. deterministic-only
- One regression case captured as a golden fixture: a chc render with the original "E2E-85648Z%" stat — vision judge should fail it on the contrast/overflow rubric, not just the sentinel regex
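
The token-spend criterion could be checked with a small guard like the one below. This is a sketch under assumptions: `estimate_cost_usd` and `within_budget` are hypothetical helpers, token counts would come from whatever usage data the model client reports, and the per-million-token prices in the usage lines are placeholders, not real pricing.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost of one judge call, given prices per million tokens."""
    return (input_tokens * usd_per_mtok_in
            + output_tokens * usd_per_mtok_out) / 1_000_000


def within_budget(per_beat_usage: list[tuple[int, int]],
                  usd_per_mtok_in: float, usd_per_mtok_out: float,
                  cap_usd: float = 0.25) -> bool:
    """True if the summed cost of all beats in a render stays under the cap."""
    total = sum(
        estimate_cost_usd(i, o, usd_per_mtok_in, usd_per_mtok_out)
        for i, o in per_beat_usage
    )
    return total <= cap_usd


# Usage with placeholder numbers: 8 beats, 1500 input and 100 output tokens each.
usage = [(1500, 100)] * 8
ok = within_budget(usage, usd_per_mtok_in=10.0, usd_per_mtok_out=50.0)
```

Passing prices in as parameters keeps the check valid when the judge model changes.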

Blocks: nothing — Level 1 stands alone. This is purely additive.
