feat(videos): cycle highlight uses ElevenLabs alignment timestamps #456

Merged
jjackson merged 1 commit into main from feat/cycle-tts-alignment on May 19, 2026

@jjackson
Owner

Summary

@jjackson: "we want the highlight to specifically transition when we say, learn, deliver, verify, pay onto each highlighted area when we say the word"

The narration-synced highlight from #451/#455 picked a reasonable proportional position (word_index / total_words) but that's an estimate — it transitions roughly when the word is said, not precisely. This PR wires up actual per-character TTS timestamps.

Pipeline

  1. voiceover.ts::synthesize — switched to ElevenLabs' /v1/text-to-speech/{voice_id}/with-timestamps endpoint. Same audio quality + voice settings; the response is JSON with audio_base64 + alignment.{characters, character_start_times_seconds, character_end_times_seconds}. The alignment is persisted in the sidecar JSON next to the mp3.
  2. scripts/render.ts — reads the cycle beat's sidecar, calls wordStartSeconds(alignment, "learn"|"deliver"|"verif"|"paid"|"pay"), and threads the 4 numbers through Remotion props as cycleStepStartSeconds.
  3. Root.tsx + Intro.tsx: Root.tsx threads cycleStepStartSeconds through to <Intro>; the Cycle component checks frame / fps against the 4 timestamps and switches activeIndex on each crossing. It falls back to the word-index proportional path when alignment isn't available (Studio preview, or cached audio from before this PR).
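
The lookup in step 2 can be sketched as below. This is an illustration only, not the code in this PR: the function name and the alignment field names come from the description above, but the signature (a list of candidate prefixes, earliest match wins, `null` when nothing matches) and the plain substring matching without word boundaries are assumptions.

```typescript
// Alignment shape as described in the PR (field names from the sidecar JSON).
interface Alignment {
  characters: string[];
  character_start_times_seconds: number[];
  character_end_times_seconds: number[];
}

// Hypothetical sketch: scan the character array left to right and return the
// start time of the first character of the earliest case-insensitive match
// of any needle. Returns null when no needle occurs in the narration.
function wordStartSeconds(alignment: Alignment, needles: string[]): number | null {
  const chars = alignment.characters.map((c) => c.toLowerCase());
  const targets = needles
    .map((n) => n.toLowerCase().split(""))
    .filter((t) => t.length > 0); // ignore empty needles
  for (let i = 0; i < chars.length; i++) {
    for (const target of targets) {
      // Compare the needle against the characters starting at position i.
      if (target.every((ch, j) => chars[i + j] === ch)) {
        return alignment.character_start_times_seconds[i] ?? null;
      }
    }
  }
  return null;
}
```

Under this sketch, the pay step would pass both `"paid"` and `"pay"` so that either wording in the narration matches, and a prefix such as `"verif"` covers both "verify" and "verified".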

Cache compatibility

Sidecar JSON files written before this PR have no alignment field. synthesize() now treats those as cache misses, so the audio re-synthesizes once per program (roughly $0.01 in ElevenLabs cost) to backfill alignment. Subsequent re-renders hit the cache as before.
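
A minimal sketch of that cache check, assuming the sidecar is parsed JSON of unknown shape. The helper name `hasUsableAlignment` and the equal-length check are hypothetical and not taken from this PR; only the three field names come from the description above.

```typescript
// Hypothetical helper: decide whether a parsed sidecar carries alignment data.
// A sidecar that fails this check would be treated as a cache miss.
function hasUsableAlignment(sidecar: unknown): boolean {
  if (typeof sidecar !== "object" || sidecar === null) return false;
  const alignment = (sidecar as { alignment?: unknown }).alignment;
  if (typeof alignment !== "object" || alignment === null) return false;
  const {
    characters,
    character_start_times_seconds: starts,
    character_end_times_seconds: ends,
  } = alignment as Record<string, unknown>;
  // Assumption: the three arrays are parallel, so their lengths must agree.
  return (
    Array.isArray(characters) &&
    Array.isArray(starts) &&
    Array.isArray(ends) &&
    characters.length === starts.length &&
    characters.length === ends.length
  );
}
```

A sidecar written before this PR has no `alignment` key, so the helper returns `false` and the caller would re-synthesize.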

Verified

render.log excerpt:

Cycle step timings (seconds into cycle audio):
  { learn: 0.975, deliver: 1.509, verify: 2.229, pay: 3.599 }

Frame extracts from a fresh chc/run-001 render (cycle beat starts at video t≈4s):

| Video t | Cycle-relative t | Highlighted | Expected (per spoken word) |
| --- | --- | --- | --- |
| 5s | 1s | Learn | Learn (0.975s ≤ t < 1.509s) ✓ |
| 6s | 2s | Deliver | Deliver (1.509s ≤ t < 2.229s) ✓ |
| 7s | 3s | Verify | Verify (2.229s ≤ t < 3.599s) ✓ |
| 8s | 4s | Pay | Pay (3.599s ≤ t) ✓ |

In each sampled frame, the highlighted step matches the window that starts at the spoken word's onset.
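
The frame-to-step mapping checked above can be reproduced with a short sketch. It assumes the step start times are in ascending order and that no step is highlighted before the first timestamp (returned as -1). Both are assumptions about the Cycle component, and `activeStepIndex` is a hypothetical name, not the PR's implementation.

```typescript
// Hypothetical sketch: index of the last step whose start time has been
// reached at the given frame, or -1 before the first step starts.
function activeStepIndex(
  stepStartSeconds: readonly number[],
  frame: number,
  fps: number,
): number {
  const t = frame / fps; // seconds into the cycle audio
  let active = -1;
  stepStartSeconds.forEach((start, i) => {
    if (t >= start) active = i;
  });
  return active;
}

// Timings from the render.log excerpt: learn, deliver, verify, pay.
const cycleStepStartSeconds = [0.975, 1.509, 2.229, 3.599];
```

With an assumed 30 fps, frames 30, 60, 90 and 120 (cycle-relative 1s, 2s, 3s, 4s) map to indices 0, 1, 2 and 3, which matches the Learn / Deliver / Verify / Pay frames listed above.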

🤖 Generated with Claude Code

The narration-synced highlight from #451/#455 picked a reasonable
proportional position (word_index / total_words) but that's an
estimate. The highlight transitioned roughly when the word was said,
not precisely. @jjackson asked for "specifically transition when we
say learn, deliver, verify, pay onto each highlighted area when we
say the word" — that requires actual per-character timestamps from
the TTS engine.

ElevenLabs has a /v1/text-to-speech/{voice_id}/with-timestamps
endpoint that returns the same audio mp3 plus a per-character
alignment array with start/end seconds. Switched synthesize() over
to it and persisted the alignment in the existing sidecar JSON
(adds an `alignment: {characters, character_start_times_seconds,
character_end_times_seconds}` field).

Pipeline:

  scripts/render.ts
    → reads alignment for the 'cycle' beat's per-beat audio sidecar
    → calls wordStartSeconds(alignment, "learn"|"deliver"|"verif"|"paid"|"pay")
    → passes the 4 numbers as `cycleStepStartSeconds` in Remotion props

  src/Root.tsx
    → threads cycleStepStartSeconds through to <Intro>

  src/compositions/Intro.tsx
    → Cycle component checks frame/fps against the 4 timestamps;
      switches activeIndex on each crossing. No estimation.
    → Falls back to the word-index proportional path when alignment
      isn't available (Studio preview, or cached audio from before
      the alignment switch).

Cache compatibility: sidecar JSON files written before this change
have no `alignment` field. `synthesize()` now treats those as cache
misses, so the audio re-synthesizes once (per-program, roughly $0.01)
to backfill alignment. Subsequent re-renders hit the cache as before.

Verified on chc/run-001:
  - render.log: "Cycle step timings (seconds into cycle audio):
    { learn: 0.975, deliver: 1.509, verify: 2.229, pay: 3.599 }"
  - Frame extracts at t=5/6/7/8s: Learn / Deliver / Verify / Pay
    highlighted respectively. Each transition coincides with the
    spoken word's onset.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jjackson merged commit bda0ed7 into main on May 19, 2026
3 checks passed
@jjackson deleted the feat/cycle-tts-alignment branch on May 19, 2026 at 19:55