Context
PR #452 ships Level 1 of post-render QA: deterministic stream sanity + sentinel-string detection + beat-midpoint frame extraction + PIL variance check for all-black/white frames. Catches the major classes of "render exited ok but the video is broken" failures that have shipped this week.
What Level 1 can't catch (because there's no vision intelligence in the pipeline):
- Text overflows the frame edges (rendered glyphs are wider than the layout assumed)
- Wrong cycle step is highlighted given the audible narration
- Stat number is unreadable due to color contrast with background
- Asset failed to load (frame is rendered but contains a Remotion error overlay or broken-image glyph)
- Captions visually overlap a clip's lower-third
- B-roll thumbnail is the wrong shot (someone swapped the gdrive id but the alias is still @field-walking-towards-house)
These are all "is this frame acceptable" judgments. A vision-capable model can answer them with a small rubric per beat-kind.
Proposal
Add scripts/qa_render_vision.py (or extend qa_render.py --vision) that:
- Reuses Level 1's per-beat midpoint frame extraction.
- For each beat, builds a small rubric prompt parameterized by the beat's kind:
  - intro_hook → does the tagline text fit and read cleanly?
  - intro_cycle → which of {Learn, Deliver, Verify, Pay} appears highlighted? Does it match the caption text under it?
  - intro_handoff → is the program name visible?
  - body_scene → does the lower-third overlay obscure the subject of the b-roll?
  - body_problem_stat, body_impact_stats → does the big number fit, is the caption readable, is the source line visible but de-emphasized?
  - body_product_beats → are the three phone-frame screencaps positioned correctly, with the right caption beneath each?
  - outro_cta → are the logo, tagline, and "Request a demo" link all present?
- Sends each frame + rubric to a vision-capable model (claude-opus-4-7 or similar), gets a structured verdict (OK / WARN / FAIL + short explanation).
- Surfaces results in the same OK/WARN/FAIL summary as Level 1.
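A minimal sketch of the rubric lookup and verdict parsing described above. The beat kinds and questions are taken from the list; everything else (the names RUBRICS, Verdict, build_prompt, parse_verdict, the JSON reply shape, and the choice to downgrade an unparseable reply to WARN) is a hypothetical illustration, not the actual qa_render.py design. The model call itself is left out because it depends on whichever client the pipeline ends up using.

```python
import json
from dataclasses import dataclass

# Rubric question per beat kind, copied from the proposal list.
_STAT_RUBRIC = (
    "Does the big number fit, is the caption readable, "
    "is the source line visible but de-emphasized?"
)
RUBRICS = {
    "intro_hook": "Does the tagline text fit and read cleanly?",
    "intro_cycle": (
        "Which of {Learn, Deliver, Verify, Pay} appears highlighted? "
        "Does it match the caption text under it?"
    ),
    "intro_handoff": "Is the program name visible?",
    "body_scene": "Does the lower-third overlay obscure the subject of the b-roll?",
    "body_problem_stat": _STAT_RUBRIC,
    "body_impact_stats": _STAT_RUBRIC,
    "body_product_beats": (
        "Are the three phone-frame screencaps positioned correctly, "
        "with the right caption beneath each?"
    ),
    "outro_cta": 'Are the logo, tagline, and "Request a demo" link all present?',
}

VERDICTS = ("OK", "WARN", "FAIL")


@dataclass
class Verdict:
    beat_id: str
    status: str        # one of VERDICTS
    explanation: str


def build_prompt(beat_kind: str) -> str:
    """Return the text prompt sent alongside the frame (hypothetical wording)."""
    question = RUBRICS[beat_kind]  # KeyError on an unknown beat kind is intentional
    return (
        "You are reviewing one frame from a rendered video.\n"
        f"Beat kind: {beat_kind}\n"
        f"Rubric: {question}\n"
        'Reply with JSON only: {"status": "OK|WARN|FAIL", "explanation": "<one sentence>"}'
    )


def parse_verdict(beat_id: str, raw_reply: str) -> Verdict:
    """Parse the model reply; anything malformed becomes WARN (an assumption)."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return Verdict(beat_id, "WARN", "unparseable model reply")
    if not isinstance(data, dict):
        return Verdict(beat_id, "WARN", "model reply was not a JSON object")
    status = str(data.get("status", "")).upper()
    if status not in VERDICTS:
        return Verdict(beat_id, "WARN", f"unknown status {status!r}")
    return Verdict(beat_id, status, str(data.get("explanation", "")))
```

Mapping a malformed reply to WARN rather than FAIL keeps a flaky judge from blocking a render outright, at the cost of needing a human to look at it. Whether that is the right default is open.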
Open questions
- Cost: 8 beats per render. If we run on every render, that's ~$0.05–$0.20 of vision tokens per render. Acceptable, but worth keeping opt-in (--vision) initially.
- Determinism: vision judges drift between model versions. Tag the model id in the report so a future "this used to pass" investigation has a known baseline.
- False positives: model might flag a legitimate stylistic choice (e.g. low-contrast Headline by design). Need a per-program override path that says "ignore the contrast warning on impact[1]".
- Frame selection: midpoint might miss the worst frame in a beat (e.g. a stat that's mid-animation). Maybe sample at midpoint + last-2s + first-2s and judge the worst of the three.
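One possible shape for the determinism, override, and frame-selection ideas above, as a sketch under stated assumptions. "first-2s" and "last-2s" are read here as timestamps 2 seconds in from each end of the beat, clamped to the midpoint for short beats. The override key shape (beat id, check name), the report layout, and all function names are hypothetical.

```python
# Severity order used to pick the worst verdict among sampled frames.
SEVERITY = {"OK": 0, "WARN": 1, "FAIL": 2}


def sample_times(start: float, end: float, margin: float = 2.0) -> list[float]:
    """Timestamps to extract for one beat: start+margin, midpoint, end-margin.

    For beats shorter than 2*margin the early/late samples collapse onto the
    midpoint, so duplicates are removed.
    """
    mid = (start + end) / 2
    early = min(start + margin, mid)
    late = max(end - margin, mid)
    return sorted({early, mid, late})


def worst(statuses: list[str]) -> str:
    """Return the most severe status; an empty list counts as OK."""
    return max(statuses, key=SEVERITY.__getitem__, default="OK")


def apply_override(beat_id: str, check: str, status: str,
                   overrides: set[tuple[str, str]]) -> str:
    """Suppress a non-OK status if the program opted out of this check.

    `overrides` is a hypothetical per-program set such as
    {("impact[1]", "contrast")}.
    """
    if (beat_id, check) in overrides:
        return "OK"
    return status


def build_report(model_id: str, per_beat: dict[str, list[str]]) -> dict:
    """Collapse per-frame statuses to one per beat and tag the model id."""
    return {
        "model_id": model_id,  # recorded so later drift has a known baseline
        "beats": {beat: worst(statuses) for beat, statuses in per_beat.items()},
    }
```

For example, a beat spanning 10 s to 20 s would be sampled at 12 s, 15 s, and 18 s, and a beat whose three frames came back OK, WARN, OK would be reported as WARN.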
Acceptance
- --vision flag added to qa_render.py
- Vision judge runs per beat against the midpoint frame
- Token spend ≤ $0.25/render at default rubric size
- README / CLAUDE.md note on when to use vision vs. deterministic-only
- One regression case captured as a golden fixture: a chc render with the original "E2E-85648Z%" stat — vision judge should fail it on the contrast/overflow rubric, not just the sentinel regex
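The spend criterion can be sanity-checked with simple arithmetic. The sketch below assumes cost scales linearly with the number of frames judged, and derives a per-frame cost from the ~$0.05–$0.20 estimate for 8 beats (roughly $0.006–$0.025 per frame). Both are assumptions rather than measured numbers, and the function names are hypothetical.

```python
def estimated_spend(beats: int, frames_per_beat: int, cost_per_frame: float) -> float:
    """Estimated vision spend in dollars for one render, assuming linear cost."""
    return beats * frames_per_beat * cost_per_frame


def within_budget(beats: int, frames_per_beat: int, cost_per_frame: float,
                  budget: float = 0.25) -> bool:
    """True if the estimate fits the per-render budget from the acceptance list."""
    return estimated_spend(beats, frames_per_beat, cost_per_frame) <= budget


# Per-frame cost bounds implied by the $0.05-$0.20 estimate over 8 beats.
LOW_COST_PER_FRAME = 0.05 / 8
HIGH_COST_PER_FRAME = 0.20 / 8

# One midpoint frame per beat fits the budget even at the high estimate.
print(within_budget(8, 1, HIGH_COST_PER_FRAME))
# Three frames per beat fits only if per-frame cost is near the low estimate.
print(within_budget(8, 3, LOW_COST_PER_FRAME))
print(within_budget(8, 3, HIGH_COST_PER_FRAME))
```

Under these assumptions, judging three frames per beat instead of one can exceed the $0.25 cap at the high end of the estimate, so multi-frame sampling and the spend criterion would need to be decided together.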
Blocks: nothing — Level 1 stands alone. This is purely additive.