TTS debug CLIs (transcribe / synthesize / phonemize) + Kokoro post-processing by ivan-digital · Pull Request #24 · soniqo/speech-android

ivan-digital · 2026-05-10T11:32:21Z

Stacked on #19. Diagnoses and (mostly) fixes the bad Kokoro TTS audio
quality the user observed in the demo. The root cause was a one-line bug
in the SDK's phonemizer; the rest of the PR is the diagnostic tooling
that made it findable, plus post-processing changes that match
speech-swift.

The fix

`KokoroPhonemizer` was wrapping every utterance with `BOS_ID = 1` /
`EOS_ID = 2`, but Kokoro's vocab maps id 1 to `;` and id 2 to `:`,
which are real punctuation. The actual start/end-of-sequence symbol is
`$`, mapped to id 0 (which doubles as padding, distinguished by the
attention mask). Every prompt was being synthesized as if you'd said
`;<phonemes>:` instead of just `<phonemes>`.

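A minimal Python sketch of the corrected wrap, for illustration only. The SDK's phonemizer is not written in Python, and `wrap_tokens`, `PAD_ID`, and the sample token IDs are hypothetical; the only facts taken from this PR are that `$` maps to id 0 and that ids 1 and 2 are `;` and `:`.

```python
PAD_ID = 0  # '$' in Kokoro's vocab: start/end-of-sequence marker, also padding


def wrap_tokens(phoneme_ids, bos_id=PAD_ID, eos_id=PAD_ID):
    """Surround the phoneme IDs with the sequence markers the model expects."""
    return [bos_id, *phoneme_ids, eos_id]


phonemes = [50, 83, 54]  # hypothetical IDs, not real Kokoro vocab entries

buggy = wrap_tokens(phonemes, bos_id=1, eos_id=2)  # old behaviour: ';' ... ':'
fixed = wrap_tokens(phonemes)                      # new behaviour: '$' ... '$'

print(buggy)  # [1, 50, 83, 54, 2]
print(fixed)  # [0, 50, 83, 54, 0]
```
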
Round-trip test (synthesize → transcribe through the new CLIs):

| Prompt | Before fix | After fix |
| --- | --- | --- |
| "Hello world" | "I wrote" / "Oh no" (peak 0.65) | "Hello world" (peak 0.53, 1.12 s) |
| "The quick brown fox jumps over the lazy dog" | "Quick brown fox jumps over the Laza dog and" (dropped "The") | "The quick brown fox jumps over the lousy dog and" (keeps "The") |
| "Hey, what is your name?" | "What is your name?" (dropped "Hey") | "Huh? What is your numb?" ("Hey"-ish) |
| "Hey" alone (5 tokens) | model output peaks at 247, post-processed into noise | "Hy" at peak 0.54, 0.85 s of clean audio |

Synthesized durations roughly doubled because the duration predictor
now allocates real frame counts to phonemes instead of squeezing them
in behind a spurious leading `;`. That's why the old audio sounded
rushed and slurred together.

The remaining minor mismatches ("lousy" for "lazy", "numb" for "name?")
are mostly Parakeet ASR quirks on synthetic speech rather than synthesis
problems.

Diagnostic tooling

Three Linux/macOS CLIs added under linux/tools/ that round-trip text
through the SDK without bouncing through Android. This whole bug was
found by running `speech_synthesize <model_dir> out.wav "Hello world" && speech_transcribe <model_dir> out.wav`
and seeing "I wrote" come back. They're permanent now.

| Tool | Purpose |
| --- | --- |
| `speech_transcribe <model_dir> <input.wav>` | WAV → text via Parakeet. Resamples + downmixes to 16 kHz. Skips Kokoro load (also a fix in this PR). |
| `speech_synthesize <model_dir> <out.wav> "<text>" [language]` | text → WAV via `KokoroTts` directly. |
| `speech_phonemize <model_dir> "<text>" [language]` | dumps IPA + token IDs from `KokoroPhonemizer`. |
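
The round trip is easy to script. The sketch below is a hypothetical Python harness, not part of this PR: the argument order follows the table above, but the helper names are made up, the binaries are assumed to be on `PATH`, and `speech_transcribe` is assumed to print only the transcript on stdout. The word-level diff uses `difflib` so that a dropped leading word does not shift every later comparison.

```python
import difflib
import re
import subprocess


def roundtrip_commands(model_dir, text, wav_path, language=None):
    """Build argv lists for synthesize then transcribe (order per the table)."""
    synth = ["speech_synthesize", model_dir, wav_path, text]
    if language:
        synth.append(language)
    return synth, ["speech_transcribe", model_dir, wav_path]


def normalize(text):
    """Lowercase and split into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_diff(expected, actual):
    """Return the non-matching word spans between prompt and transcript."""
    a, b = normalize(expected), normalize(actual)
    ops = difflib.SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return [(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def run_roundtrip(model_dir, text, wav_path="out.wav"):
    """Synthesize, transcribe, and diff. Assumes the transcript is on stdout."""
    synth, transcribe = roundtrip_commands(model_dir, text, wav_path)
    subprocess.run(synth, check=True)
    result = subprocess.run(transcribe, check=True, capture_output=True, text=True)
    return word_diff(text, result.stdout)
```

An empty list from `run_roundtrip` means the transcript matched the prompt word for word; anything else shows which words were replaced, dropped, or added.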

Other changes

- `linux/setup_linux.sh` handles macOS arm64/x86_64 in addition to
  Linux. Detects `uname -s`, downloads the right ONNX Runtime tarball,
  copies the right shared-lib glob.
- `linux/CMakeLists.txt`: `SPEECH_BUILD_TOOLS` option, three new
  targets.
- `linux/src/speech.cpp` skips `KokoroTts` construction when
  `config.transcribe_only` is true. Saves model load time and lets
  `speech_transcribe` run on a slimmer model dir.
- `sdk/.../kokoro_tts.cpp` post-processing matches speech-swift:
  - Drops output when peak > 2.0 (numerical-instability marker; it was
    triggered constantly by the BOS/EOS bug and is a harmless safety
    net now).
  - 50 ms RMS-window trailing-silence trim with 10 ms ramp-down,
    replacing the old 5 ms hard fade.
  - Keeps the leading 5 ms fade-in.
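
The three post-processing steps listed above can be sketched in plain Python as follows. This is not the code in `kokoro_tts.cpp`: the 24 kHz sample rate, the RMS silence threshold, the linear ramp shapes, and the function name are assumptions chosen for illustration. Only the 2.0 peak limit and the 50 ms / 10 ms / 5 ms durations come from this PR.

```python
import math

PEAK_LIMIT = 2.0        # from the PR: a peak above this marks unstable output
SAMPLE_RATE = 24_000    # assumption: typical Kokoro output rate
RMS_THRESHOLD = 0.01    # assumption: the PR does not state the silence threshold


def postprocess(samples, sample_rate=SAMPLE_RATE, rms_threshold=RMS_THRESHOLD):
    """Return cleaned samples, or [] when the output should be dropped."""
    if not samples or max(abs(s) for s in samples) > PEAK_LIMIT:
        return []

    window = max(1, sample_rate * 50 // 1000)   # 50 ms RMS window
    ramp = max(1, sample_rate * 10 // 1000)     # 10 ms ramp-down
    fade_in = max(1, sample_rate * 5 // 1000)   # 5 ms fade-in

    # Step backwards one window at a time while the tail stays quiet.
    end = len(samples)
    while end > 0:
        start = max(0, end - window)
        chunk = samples[start:end]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= rms_threshold:
            break
        end = start
    out = list(samples[:end])

    # Linear ramp to zero over the last 10 ms of what is kept.
    n = min(ramp, len(out))
    for i in range(n):
        out[len(out) - n + i] *= (n - 1 - i) / n

    # Linear fade-in from zero over the first 5 ms.
    m = min(fade_in, len(out))
    for i in range(m):
        out[i] *= i / m

    return out


# 0.5 s of signal followed by 0.2 s of near-silence.
speech = [0.5] * 12_000 + [0.001] * 4_800
cleaned = postprocess(speech)
print(len(cleaned) / SAMPLE_RATE)  # 0.5
```

Returning an empty buffer for the over-range case mirrors the "drop rather than normalize" choice described above: a caller can treat it as "no audio" instead of playing an amplified glitch.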

Test plan

- `./gradlew :sdk:assembleDebug`: green
- `./gradlew :sdk:testDebugUnitTest`: 23/23 pass
- `./linux/setup_linux.sh` on macOS arm64: pulls
  `onnxruntime-osx-arm64-1.19.0.tgz`
- `cmake --build build`: builds `speech_transcribe`,
  `speech_synthesize`, `speech_phonemize`
- Round-trip "Hello world" / "Hey, what is your name?" /
  "The quick brown fox..." → all transcribe back to the requested
  text (modulo Parakeet ASR quirks on synthetic audio)
- Manual on the S23: install demo, invoke Echo mode, verify
  synthesised speech is intelligible

Notes

- The verification step in
  `speech-models/models/kokoro-tts/export/convert_onnx.py` already used
  `[0, ..., 0]` in its hardcoded test tokens, so the export pipeline
  follows the canonical convention. Our SDK reimplementation just
  drifted at some point.
- Re-exported the model with the current `convert.py` to verify it
  wasn't a stale-weights problem; the new export produces the same
  audio as what's published on HuggingFace, so the export pipeline is
  current. The bug was purely on the consumer side.

Three new CLI tools under linux/tools/ that exercise the SDK without Android, useful for diagnosing TTS / STT regressions:

- speech_transcribe: WAV → text via Parakeet
- speech_synthesize: text → WAV via Kokoro
- speech_phonemize: text → IPA + token IDs via the phonemizer

Round-tripping a known prompt through these surfaced two real issues: the model produces catastrophic out-of-range output for very short prompts (peak ≈250 for 'Hey' alone), and our peak normalization made that worse by amplifying the surviving spike. Address what we can on the consuming side; the residual audio-quality gap is in the ONNX export itself (segmented cumsum / scatter ops missing, documented in speech-models/models/kokoro-tts/export/NOTES.md).

Changes:

- linux/setup_linux.sh: handle Darwin arm64/x86_64 (download macOS ORT, use libonnxruntime*.dylib glob).
- linux/CMakeLists.txt: SPEECH_BUILD_TOOLS option, three new targets.
- linux/src/speech.cpp: skip KokoroTts construction when config.transcribe_only is true. Lets speech_transcribe run on a slimmer model dir (no kokoro-e2e.onnx needed) and removes a chunk of startup latency from STT-only consumers.
- sdk/.../kokoro_tts.cpp: match speech-swift's Kokoro post-processing:
  - drop output entirely when peak > 2.0 (numerical instability marker on short prompts), instead of normalizing garbage into audible glitches
  - 50 ms RMS-window trailing-silence trim with 10 ms ramp-down, replacing the old 5 ms hard fade. Kokoro emits 100-300 ms of low-energy noise + spike clicks past the real speech; the windowed trim cleanly clips it
  - keep the leading 5 ms fade-in to suppress click at start

Linux: cmake --build build && DYLD_LIBRARY_PATH=../ort-linux/lib \
  ./build/speech_transcribe ./tests/models input.wav

Android: ./gradlew :sdk:testDebugUnitTest (23/23 still pass).

Kokoro's vocab maps '$' to id 0 and treats it as the start/end-of-sequence marker (same symbol used for padding, distinguished by attention mask). Our phonemizer used BOS_ID=1 and EOS_ID=2, which are ';' and ':' in the vocab, so every utterance was synthesized as if you'd said ";<phonemes>:" instead of just <phonemes>.

Visible effects this fixes:

- 'Hello world' was producing audio that round-tripped as 'I wrote' / 'Oh no'; with the right wrap it round-trips as 'Hello world'.
- 'The quick brown fox jumps over the lazy dog' was dropping 'The'; with the fix it preserves the first word.
- 'Hey' alone produced output with peak ~247 (numerical instability from the wrong BOS confusing the duration predictor for very short inputs); now produces normal audio at peak 0.5.
- Generated durations roughly doubled (0.65s -> 1.12s for 'Hello world'), which was the real reason words felt rushed and slurred: the predictor was under-allocating frames because the wrong BOS made it treat the leading semicolon as a content token.

Verified by round-tripping prompts through the new linux/tools/ speech_synthesize -> speech_transcribe pair (added in this same PR).

Note: the verify in speech-models/models/kokoro-tts/export/convert_onnx.py already used [0, ..., 0] in its hardcoded test tokens, but the SDK reimplementation drifted to [1, ..., 2]. Easy to miss; the bug only surfaces when you listen to the audio.

Local: ./gradlew :sdk:assembleDebug :sdk:testDebugUnitTest (23/23 pass).

Ivan added 2 commits May 10, 2026 13:31

Base automatically changed from feat/recognition-service to main May 10, 2026 16:11

ivan-digital merged commit 2514a9d into main May 10, 2026

ivan-digital deleted the feat/tts-debug-tools branch May 10, 2026 16:38
