feat(videos): cycle highlight uses ElevenLabs alignment timestamps #456
Merged
The narration-synced highlight from #451/#455 picked a reasonable proportional position (word_index / total_words), but that's an estimate: the highlight transitioned roughly when the word was said, not precisely. @jjackson asked for "specifically transition when we say learn, deliver, verify, pay onto each highlighted area when we say the word", which requires actual per-character timestamps from the TTS engine.

ElevenLabs has a /v1/text-to-speech/{voice_id}/with-timestamps endpoint that returns the same audio mp3 plus a per-character alignment array with start/end seconds. Switched synthesize() over to it and persisted the alignment in the existing sidecar JSON (adds an `alignment: {characters, character_start_times_seconds, character_end_times_seconds}` field).

Pipeline:
scripts/render.ts
→ reads alignment for the 'cycle' beat's per-beat audio sidecar
→ calls wordStartSeconds(alignment, "learn"|"deliver"|"verif"|"paid"|"pay")
→ passes the 4 numbers as `cycleStepStartSeconds` in Remotion props
src/Root.tsx
→ threads cycleStepStartSeconds through to <Intro>
src/compositions/Intro.tsx
→ Cycle component checks frame/fps against the 4 timestamps; switches activeIndex on each crossing. No estimation.
→ Falls back to the word-index proportional path when alignment isn't available (Studio preview, or cached audio from before the alignment switch).

Cache compatibility: the sidecar JSON used to live without an `alignment` field. `synthesize()` now treats those as cache misses so the audio re-synthesizes once (per-program, $0.01-ish) to backfill alignment. Re-renders after that hit cache as before.

Verified on chc/run-001:
- render.log: "Cycle step timings (seconds into cycle audio): { learn: 0.975, deliver: 1.509, verify: 2.229, pay: 3.599 }"
- Frame extracts at t=5/6/7/8s: Learn / Deliver / Verify / Pay highlighted respectively. Each transition coincides with the spoken word's onset.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
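The word lookup in the pipeline above can be sketched roughly as follows. This is not the PR's actual `wordStartSeconds`; it is a minimal illustration that assumes each entry of `characters` is a single character, that matching is case-insensitive and anchored to the start of a word, and that alternatives such as "paid"/"pay" are passed as a list. The `fakeAlignment` helper and the `null` return for a missing word are hypothetical and exist only to make the sketch self-contained.

```typescript
// Alignment shape as persisted in the sidecar JSON (field names from the PR).
interface Alignment {
  characters: string[];
  character_start_times_seconds: number[];
  character_end_times_seconds: number[];
}

// Hypothetical sketch: start time (seconds) of the first word that begins
// with any of the candidate prefixes, or null when none is found.
function wordStartSeconds(alignment: Alignment, candidates: string[]): number | null {
  const chars = alignment.characters.map((c) => c.toLowerCase());
  const needles = candidates.map((c) => c.toLowerCase());
  for (let i = 0; i < chars.length; i++) {
    // Only consider positions that start a word (previous char is not alphanumeric).
    const prev = i === 0 ? "" : chars[i - 1] ?? "";
    if (/[a-z0-9]/.test(prev)) continue;
    for (const needle of needles) {
      let matches = needle.length > 0;
      for (let k = 0; k < needle.length && matches; k++) {
        matches = chars[i + k] === needle[k];
      }
      if (matches) return alignment.character_start_times_seconds[i] ?? null;
    }
  }
  return null;
}

// Hypothetical test helper: evenly spaced timestamps, one entry per character.
function fakeAlignment(text: string, secondsPerChar: number): Alignment {
  const characters = Array.from(text);
  return {
    characters,
    character_start_times_seconds: characters.map((_, i) => i * secondsPerChar),
    character_end_times_seconds: characters.map((_, i) => (i + 1) * secondsPerChar),
  };
}
```

Passing a prefix such as "verif" lets one call match "verify", "verified", or "verification"; a caller in this sketch would treat a `null` result as a signal to use the proportional fallback.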
Summary
@jjackson: "we want the highlight to specifically transition when we say, learn, deliver, verify, pay onto each highlighted area when we say the word"
The narration-synced highlight from #451/#455 picked a reasonable proportional position (word_index / total_words), but that's an estimate: it transitions roughly when the word is said, not precisely. This PR wires up actual per-character TTS timestamps.
Pipeline
- voiceover.ts::synthesize: switched to ElevenLabs' /v1/text-to-speech/{voice_id}/with-timestamps endpoint. Same audio quality + voice settings; the response is JSON with `audio_base64` + `alignment.{characters, character_start_times_seconds, character_end_times_seconds}`. The alignment is persisted in the sidecar JSON next to the mp3.
- scripts/render.ts: reads the cycle beat's sidecar, calls wordStartSeconds(alignment, "learn"|"deliver"|"verif"|"paid"|"pay"), and threads the 4 numbers through Remotion props as `cycleStepStartSeconds`.
- Root.tsx + Intro.tsx: Cycle checks frame / fps against the 4 timestamps; switches activeIndex on each crossing. Falls back to the word-index proportional path when alignment isn't available (Studio preview, or cached audio from before this PR).
Cache compatibility
The sidecar JSON used to be written without an `alignment` field. `synthesize()` now treats those sidecars as cache misses, so the audio re-synthesizes once per program ($0.01-ish in ElevenLabs cost) to backfill alignment. Re-renders after that hit cache as before.
Verified
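render.log excerpt: "Cycle step timings (seconds into cycle audio): { learn: 0.975, deliver: 1.509, verify: 2.229, pay: 3.599 }"
Frame extracts from a fresh chc/run-001 render (cycle beat starts at video t≈4s): at t=5/6/7/8s, Learn / Deliver / Verify / Pay are highlighted respectively.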
Each transition coincides with the spoken word's onset.
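To make the "switch on each crossing" rule concrete, here is a small sketch of how the active step could be derived from the current frame. It is an illustration, not the code in Intro.tsx: the function name `activeStepIndex`, the `-1` value for "nothing highlighted yet", and the 30 fps figure in the usage note are all assumptions, and `frame` is assumed to be counted from the start of the cycle audio. The proportional fallback path is not shown.

```typescript
// Hypothetical sketch: index of the last step whose start time has been
// reached at the given frame, or -1 before the first step begins.
function activeStepIndex(
  frame: number,
  fps: number,
  stepStartSeconds: readonly number[],
): number {
  const t = frame / fps; // seconds into the cycle audio
  let active = -1;
  stepStartSeconds.forEach((start, i) => {
    // Start times are assumed ascending, so the last one crossed wins.
    if (t >= start) active = i;
  });
  return active;
}

// Timings reported in render.log for chc/run-001: learn, deliver, verify, pay.
const cycleStepStartSeconds = [0.975, 1.509, 2.229, 3.599];
```

With these timings and an assumed 30 fps, frame 29 (about 0.967 s) highlights nothing, frame 30 (1.0 s) highlights Learn, and frame 108 (3.6 s) highlights Pay.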
🤖 Generated with Claude Code