feat(videos): Level 2 post-render QA — per-beat vision-judge frame check #453

@jjackson

Description

Context

PR #452 ships Level 1 of post-render QA: deterministic stream sanity + sentinel-string detection + beat-midpoint frame extraction + PIL variance check for all-black/white frames. Catches the major classes of "render exited ok but the video is broken" failures that have shipped this week.

What Level 1 can't catch (because there's no vision intelligence in the pipeline):

  • Text overflows the frame edges (rendered glyphs are wider than the layout assumed)
  • Wrong cycle step is highlighted given the audible narration
  • Stat number is unreadable due to color contrast with background
  • Asset failed to load (frame is rendered but contains a Remotion error overlay or broken-image glyph)
  • Captions visually overlap a clip's lower-third
  • B-roll thumbnail is the wrong shot (someone swapped the gdrive id but the alias is still @field-walking-towards-house)

These are all "is this frame acceptable" judgments. A vision-capable model can answer them with a small rubric per beat-kind.

Proposal

Add scripts/qa_render_vision.py (or extend qa_render.py with a --vision flag) that:

  1. Reuses Level 1's per-beat midpoint frame extraction.
  2. For each beat, builds a small rubric prompt parameterized by the beat's kind:
    • intro_hook → does the tagline text fit and read cleanly?
    • intro_cycle → which of {Learn, Deliver, Verify, Pay} appears highlighted? Does it match the caption text under it?
    • intro_handoff → is the program name visible?
    • body_scene → does the lower-third overlay obscure the subject of the b-roll?
    • body_problem_stat, body_impact_stats → does the big number fit, is the caption readable, is the source line visible but de-emphasized?
    • body_product_beats → are the three phone-frame screencaps positioned correctly, with the right caption beneath?
    • outro_cta → logo + tagline + "Request a demo" link all present?
  3. Sends each frame + rubric to a vision-capable model (claude-opus-4-7 or similar), gets a structured verdict (OK / WARN / FAIL + short explanation).
  4. Surfaces results in the same OK/WARN/FAIL summary as Level 1.
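
Steps 2 and 3 could be sketched roughly as below. This is a minimal illustration under assumptions, not the implementation: `RUBRICS`, `build_prompt`, `parse_verdict`, and `Verdict` are hypothetical names, the JSON reply shape is an assumed convention, and the actual model call is left out. The rubric wording is taken from the list above.

```python
import json
from dataclasses import dataclass

# Shared rubric for the two stat beat kinds (wording from the proposal).
STAT_RUBRIC = (
    "Does the big number fit, is the caption readable, "
    "is the source line visible but de-emphasized?"
)

# One rubric question per beat kind. Hypothetical structure.
RUBRICS = {
    "intro_hook": "Does the tagline text fit and read cleanly?",
    "intro_cycle": (
        "Which of {Learn, Deliver, Verify, Pay} appears highlighted? "
        "Does it match the caption text under it?"
    ),
    "intro_handoff": "Is the program name visible?",
    "body_scene": "Does the lower-third overlay obscure the subject of the b-roll?",
    "body_problem_stat": STAT_RUBRIC,
    "body_impact_stats": STAT_RUBRIC,
    "body_product_beats": (
        "Are the three phone-frame screencaps positioned correctly, "
        "with the right caption beneath?"
    ),
    "outro_cta": 'Are the logo, tagline, and "Request a demo" link all present?',
}

VALID_STATUSES = ("OK", "WARN", "FAIL")


@dataclass
class Verdict:
    beat_id: str
    status: str
    explanation: str


def build_prompt(kind: str) -> str:
    """Build the text part of the judge request for one beat kind."""
    rubric = RUBRICS[kind]  # an unknown kind raises KeyError on purpose
    return (
        f"You are reviewing one frame from a rendered video beat of kind '{kind}'.\n"
        f"Rubric: {rubric}\n"
        'Reply with JSON only: {"status": "OK" | "WARN" | "FAIL", '
        '"explanation": "<one sentence>"}'
    )


def parse_verdict(beat_id: str, reply: str) -> Verdict:
    """Parse the judge reply; malformed replies become WARN, never a silent pass."""
    try:
        data = json.loads(reply)
        status = str(data["status"]).upper()
        explanation = str(data.get("explanation", ""))
    except (ValueError, KeyError, TypeError, AttributeError):
        return Verdict(beat_id, "WARN", "unparseable judge reply")
    if status not in VALID_STATUSES:
        return Verdict(beat_id, "WARN", f"unknown status {status!r}")
    return Verdict(beat_id, status, explanation)
```

Mapping an unparseable reply to WARN is a design choice in this sketch: a broken judge response should show up in the summary instead of being counted as a pass or failing the render outright.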

Open questions

  • Cost: 8 beats per render. Running on every render would cost roughly $0.05–$0.20 in vision tokens per render. Acceptable, but worth keeping opt-in (--vision) initially.
  • Determinism: vision judges drift between model versions. Tag the model id in the report so a future "this used to pass" investigation has a known baseline.
  • False positives: the model might flag a legitimate stylistic choice (e.g. a headline that is low-contrast by design). Need a per-program override path that says "ignore the contrast warning on impact[1]".
  • Frame selection: midpoint might miss the worst frame in a beat (e.g. a stat that's mid-animation). Maybe sample at midpoint + last-2s + first-2s and judge the worst of the three.
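
Three of the open questions above lend themselves to a small sketch: sampling several frames per beat and keeping the worst verdict, waiving findings a program has explicitly accepted, and stamping the model id into the report. Everything here is hypothetical: the function names, reading "first-2s" / "last-2s" as 2 seconds in from each end of the beat, the idea that the judge returns short issue tags such as "contrast", and the override and report shapes.

```python
SEVERITY = {"OK": 0, "WARN": 1, "FAIL": 2}


def sample_times(start: float, end: float, margin: float = 2.0) -> list[float]:
    """Timestamps (seconds) to extract for one beat: near start, midpoint, near end.

    Beats too short to fit both margins fall back to the midpoint only.
    """
    mid = (start + end) / 2
    if end - start <= 2 * margin:
        return [mid]
    return [start + margin, mid, end - margin]


def worst(statuses: list[str]) -> str:
    """Return the most severe status; raises ValueError on an empty list."""
    return max(statuses, key=SEVERITY.__getitem__)


def apply_overrides(beat_id: str, status: str, tags: list[str],
                    overrides: dict[str, set[str]]) -> str:
    """Downgrade to OK only if every issue tag raised is waived for this beat."""
    waived = overrides.get(beat_id, set())
    if status != "OK" and tags and set(tags) <= waived:
        return "OK"
    return status


def build_report(model_id: str, beat_statuses: dict[str, str]) -> dict:
    """Summarize per-beat statuses and record which model produced them."""
    return {
        "model": model_id,  # baseline for later "this used to pass" checks
        "beats": dict(beat_statuses),
        "overall": worst(list(beat_statuses.values())),
    }
```

For example, a program config carrying `{"impact[1]": {"contrast"}}` would suppress a contrast-only warning on that beat, while a verdict tagged with both "contrast" and "overflow" would still surface.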

Acceptance

  • --vision flag added to qa_render.py
  • Vision judge runs per beat against the midpoint frame
  • Token spend ≤ $0.25/render at default rubric size
  • README / CLAUDE.md note on when to use vision vs. deterministic-only
  • One regression case captured as a golden fixture: a chc render with the original "E2E-85648Z%" stat — vision judge should fail it on the contrast/overflow rubric, not just the sentinel regex
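
The token-spend criterion could be checked with a small guard that sums per-beat usage and compares it against the cap. A sketch, assuming the judge call reports input and output token counts for each beat; `Usage`, `estimate_cost`, and `within_budget` are hypothetical names, and the per-million-token prices are supplied by the caller because no real pricing is assumed here.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


def estimate_cost(usages: list[Usage], usd_per_mtok_in: float,
                  usd_per_mtok_out: float) -> float:
    """Total USD for one render, given per-million-token prices."""
    total_in = sum(u.input_tokens for u in usages)
    total_out = sum(u.output_tokens for u in usages)
    return (total_in * usd_per_mtok_in + total_out * usd_per_mtok_out) / 1_000_000


def within_budget(cost_usd: float, cap_usd: float = 0.25) -> bool:
    """True if the render's vision spend is at or under the cap."""
    return cost_usd <= cap_usd


# Usage with placeholder numbers (8 beats, made-up token counts and prices):
usages = [Usage(input_tokens=1500, output_tokens=100) for _ in range(8)]
cost = estimate_cost(usages, usd_per_mtok_in=10.0, usd_per_mtok_out=50.0)
print(f"estimated spend: ${cost:.2f}, within cap: {within_budget(cost)}")
```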

Blocks: nothing — Level 1 stands alone. This is purely additive.
