TTS debug CLIs (transcribe / synthesize / phonemize) + Kokoro post-processing #24
Merged
added 2 commits
May 10, 2026 13:31
Three new CLI tools under linux/tools/ that exercise the SDK without
Android, useful for diagnosing TTS / STT regressions:
- speech_transcribe — WAV → text via Parakeet
- speech_synthesize — text → WAV via Kokoro
- speech_phonemize — text → IPA + token IDs via the phonemizer
Round-tripping a known prompt through these surfaced two real issues:
the model produces catastrophic out-of-range output for very short
prompts (peak ≈250 for 'Hey' alone), and our peak normalization made
that worse by amplifying the surviving spike. Address what we can on
the consuming side; the residual audio-quality gap is in the ONNX
export itself (segmented cumsum / scatter ops missing — documented in
speech-models/models/kokoro-tts/export/NOTES.md).
Changes:
- linux/setup_linux.sh — handle Darwin arm64/x86_64 (download macOS ORT,
use libonnxruntime*.dylib glob).
- linux/CMakeLists.txt — SPEECH_BUILD_TOOLS option, three new targets.
- linux/src/speech.cpp — skip KokoroTts construction when
config.transcribe_only is true. Lets speech_transcribe run on a
slimmer model dir (no kokoro-e2e.onnx needed) and removes a chunk of
startup latency from STT-only consumers.
- sdk/.../kokoro_tts.cpp — match speech-swift's Kokoro post-processing:
- drop output entirely when peak > 2.0 (numerical instability marker
on short prompts), instead of normalizing garbage into audible
glitches
- 50 ms RMS-window trailing-silence trim with 10 ms ramp-down,
replacing the old 5 ms hard fade — Kokoro emits 100-300 ms of
low-energy noise + spike clicks past the real speech, the windowed
trim cleanly clips it
- keep the leading 5 ms fade-in to suppress click at start
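The three post-processing steps listed above can be sketched roughly as follows. This is a Python illustration, not the SDK's C++ code in kokoro_tts.cpp: the 24 kHz sample rate and the `silence_rms` threshold are assumptions chosen for the example, and the function name `postprocess` is hypothetical. Only the 2.0 peak limit, the 50 ms window, the 10 ms ramp-down and the 5 ms fade-in come from the commit message.

```python
import math

SAMPLE_RATE = 24_000  # assumption: adjust to whatever rate the SDK actually emits


def postprocess(samples, sample_rate=SAMPLE_RATE, peak_limit=2.0,
                window_ms=50, ramp_ms=10, fade_in_ms=5, silence_rms=0.01):
    """Return cleaned samples, or [] when the output looks numerically unstable.

    silence_rms is a hypothetical threshold; the real value is not in the PR.
    """
    if not samples:
        return []

    # 1. Peak guard: drop the whole buffer instead of normalizing garbage.
    if max(abs(s) for s in samples) > peak_limit:
        return []

    # 2. Trailing-silence trim: step back one window at a time until a
    #    window's RMS reaches the threshold.
    window = max(1, sample_rate * window_ms // 1000)
    end = len(samples)
    while end > 0:
        start = max(0, end - window)
        chunk = samples[start:end]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= silence_rms:
            break
        end = start
    out = list(samples[:end])

    # 3. Linear ramp-down over the last ramp_ms so the cut does not click.
    ramp = min(len(out), sample_rate * ramp_ms // 1000)
    for i in range(ramp):
        out[len(out) - ramp + i] *= 1.0 - (i + 1) / ramp

    # 4. Linear fade-in over the first fade_in_ms to suppress a start click.
    fade = min(len(out), sample_rate * fade_in_ms // 1000)
    for i in range(fade):
        out[i] *= i / fade

    return out


# 0.5 s of a 220 Hz tone followed by 0.2 s of low-level residue.
speech = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(12_000)]
cleaned = postprocess(speech + [0.001] * 4_800)
unstable = postprocess([0.0, 250.0, 0.1])
```

Returning an empty buffer for the unstable case mirrors the "drop output entirely" choice above: a caller can treat an empty result as "no audio" rather than play a normalized spike.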
Linux: cmake --build build && DYLD_LIBRARY_PATH=../ort-linux/lib \
./build/speech_transcribe ./tests/models input.wav
(DYLD_LIBRARY_PATH applies on macOS; on Linux use LD_LIBRARY_PATH.)
Android: ./gradlew :sdk:testDebugUnitTest — 23/23 still pass.
Kokoro's vocab maps '$' to id 0 and treats it as the start/end-of-sequence
marker (the same symbol is used for padding, distinguished by the attention
mask). Our phonemizer used BOS_ID=1 and EOS_ID=2, which are ';' and ':' in
the vocab, so every utterance was synthesized as if you'd said
";<phonemes>:" instead of just <phonemes>.
Visible effects this fixes:
- 'Hello world' was producing audio that round-tripped as 'I wrote' /
'Oh no'; with the right wrap it round-trips as 'Hello world'.
- 'The quick brown fox jumps over the lazy dog' was dropping 'The'; with
the fix it preserves the first word.
- 'Hey' alone produced output with peak ~247 (numerical instability from
the wrong BOS confusing the duration predictor for very short inputs);
it now produces normal audio at peak 0.5.
- Generated durations roughly doubled (0.65s -> 1.12s for 'Hello world').
Under-allocated frames were the real reason words felt rushed and
slurred: the wrong BOS made the predictor treat the leading semicolon
as a content token.
Verified by round-tripping prompts through the new linux/tools/
speech_synthesize -> speech_transcribe pair (added in this same PR).
Note: the verification step in
speech-models/models/kokoro-tts/export/convert_onnx.py already used
[0, ..., 0] in its hardcoded test tokens, but the SDK reimplementation
drifted to [1, ..., 2]. This is easy to miss, because the bug only
surfaces when you listen to the audio.
Local: ./gradlew :sdk:assembleDebug :sdk:testDebugUnitTest — 23/23 pass.
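The wrapping convention described above can be illustrated with a small Python sketch. It is not the SDK's phonemizer: the phoneme ids are placeholders rather than real vocab entries, the helper names are hypothetical, and the mask layout (1 for real positions, 0 for padding) is an assumption about how the two uses of id 0 are told apart.

```python
# '$' is id 0 in Kokoro's vocab (';' is 1 and ':' is 2, which is why the
# old BOS_ID=1 / EOS_ID=2 wrap injected punctuation into every utterance).
SEQ_MARKER_ID = 0  # start/end-of-sequence marker, also used as padding


def wrap_tokens(phoneme_ids):
    """Wrap a phoneme id sequence with the id-0 marker on both sides."""
    return [SEQ_MARKER_ID, *phoneme_ids, SEQ_MARKER_ID]


def pad_batch(sequences):
    """Right-pad wrapped sequences with id 0 and build attention masks.

    The mask is 1 for real positions (including the wrapping markers) and
    0 for padding, so the two meanings of id 0 stay distinguishable.
    """
    width = max(len(seq) for seq in sequences)
    ids = [seq + [SEQ_MARKER_ID] * (width - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return ids, mask


# Placeholder phoneme ids, chosen only for illustration.
short = wrap_tokens([50, 83])
longer = wrap_tokens([50, 83, 54, 57])
ids, mask = pad_batch([short, longer])
```

A unit test that asserts the first and last ids of the phonemizer output are 0 would catch this kind of drift without anyone having to listen to the audio.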
Stacked on #19. Diagnoses and (mostly) fixes the bad Kokoro TTS audio
quality the user observed in the demo. The root cause was a one-line bug in
the SDK's phonemizer; the rest of the PR is the diagnostic tooling that
made it findable, plus matching speech-swift's post-processing.
The fix
`KokoroPhonemizer` was wrapping every utterance with `BOS_ID = 1` /
`EOS_ID = 2`, but Kokoro's vocab maps id 1 to `;` and id 2 to `:`, which
are real punctuation. The actual start/end-of-sequence symbol is `$`,
mapped to id 0 (which doubles as padding, distinguished by the
attention mask). Every prompt was being synthesized as if you'd said
`";<phonemes>:"` instead of just `<phonemes>`.
A round-trip test (synthesize → transcribe through the new CLIs)
confirmed the fix.
Synthesized durations roughly doubled because the duration predictor
now allocates real frame counts to phonemes instead of squeezing
through a fake leading `;`. That's why the old audio sounded rushed and
slurred together.
The remaining minor mismatches ("lousy" for "lazy", "numb" for "name?")
are mostly Parakeet ASR quirks on synthetic speech rather than synthesis
problems.
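The round-trip check can be scripted along these lines. This is a hedged Python sketch, not part of the PR: the argument order follows the CLI usage given in this PR, but the `tools_dir` layout and the assumption that `speech_transcribe` prints only the transcript on stdout are mine, and all helper names are hypothetical.

```python
import difflib
import re
import subprocess


def synth_cmd(tools_dir, model_dir, wav_path, text, language=None):
    """argv for: speech_synthesize <model_dir> <out.wav> "<text>" [language]"""
    cmd = [f"{tools_dir}/speech_synthesize", model_dir, wav_path, text]
    if language:
        cmd.append(language)
    return cmd


def transcribe_cmd(tools_dir, model_dir, wav_path):
    """argv for: speech_transcribe <model_dir> <input.wav>"""
    return [f"{tools_dir}/speech_transcribe", model_dir, wav_path]


def normalize(text):
    """Lowercase and strip punctuation so 'name?' and 'name' compare equal."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_mismatches(expected, got):
    """Return (expected_words, transcribed_words) pairs that differ."""
    a, b = normalize(expected), normalize(got)
    ops = difflib.SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return [(" ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def round_trip(tools_dir, model_dir, text, wav_path="out.wav"):
    """Synthesize text, transcribe the WAV, and report word-level mismatches.

    Assumes the transcript is the only thing written to stdout.
    """
    subprocess.run(synth_cmd(tools_dir, model_dir, wav_path, text), check=True)
    result = subprocess.run(transcribe_cmd(tools_dir, model_dir, wav_path),
                            check=True, capture_output=True, text=True)
    return word_mismatches(text, result.stdout)
```

For example, `round_trip("./build", "./tests/models", "Hello world")` would return an empty list when the transcript matches, and a pair like `("lazy", "lousy")` when a single word differs.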
Diagnostic tooling
Three Linux/macOS CLIs added under `linux/tools/` that round-trip text
through the SDK without bouncing through Android. This whole bug was
found by running
`speech_synthesize "Hello world" out.wav && speech_transcribe out.wav`
and seeing "I wrote" come back. They're permanent now.
- `speech_transcribe <model_dir> <input.wav>`: WAV → text via Parakeet.
- `speech_synthesize <model_dir> <out.wav> "<text>" [language]`: text → WAV
via `KokoroTts` directly.
- `speech_phonemize <model_dir> "<text>" [language]`: text → IPA + token
IDs via `KokoroPhonemizer`.
Other changes
- `linux/setup_linux.sh` handles macOS arm64/x86_64 in addition to
Linux. Detects `uname -s`, downloads the right ONNX Runtime tarball,
copies the right shared-lib glob.
- `linux/CMakeLists.txt`: `SPEECH_BUILD_TOOLS` option, three new
targets.
- `linux/src/speech.cpp` skips `KokoroTts` construction when
`config.transcribe_only` is true. Saves model load time and lets
`speech_transcribe` run on a slimmer model dir.
- `sdk/.../kokoro_tts.cpp` post-processing matches speech-swift:
  - drop output when peak > 2.0 (triggered constantly by the BOS/EOS
  bug, harmless safety net now).
  - 50 ms RMS-window trailing-silence trim with 10 ms ramp-down,
  replacing the old 5 ms hard fade.
Test plan
- `./gradlew :sdk:assembleDebug`: green
- `./gradlew :sdk:testDebugUnitTest`: 23/23 pass
- `./linux/setup_linux.sh` on macOS arm64: pulls
`onnxruntime-osx-arm64-1.19.0.tgz`
- `cmake --build build`: builds `speech_transcribe`,
`speech_synthesize`, `speech_phonemize`
- Round-trip prompts such as "The quick brown fox..." all transcribe back
to the requested text (modulo Parakeet ASR quirks on synthetic audio)
- synthesised speech is intelligible
Notes
- `speech-models/models/kokoro-tts/export/convert_onnx.py` used
`[0, ..., 0]` in its hardcoded test tokens, so the canonical
convention is in fact correct in the export pipeline. Our SDK
reimplementation just drifted at some point.
- Re-ran `convert.py` to verify it wasn't a stale-weights problem; the
new export produces the same audio as what's published on
HuggingFace, so the export pipeline is current. The bug was purely on
the consumer side.