
TTS debug CLIs (transcribe / synthesize / phonemize) + Kokoro post-processing #24

Merged
ivan-digital merged 2 commits into main from feat/tts-debug-tools
May 10, 2026

@ivan-digital ivan-digital commented May 10, 2026

Stacked on #19. Diagnoses and (mostly) fixes the bad Kokoro TTS audio
quality the user observed in the demo. The lead was a one-line bug in
the SDK's phonemizer; the rest of the PR is the diagnostic tooling that
made it findable, plus matching speech-swift's post-processing.

The fix

KokoroPhonemizer was wrapping every utterance with BOS_ID = 1 /
EOS_ID = 2, but Kokoro's vocab maps id 1 to ; and id 2 to :, both
real punctuation. The actual start/end-of-sequence symbol is $,
mapped to id 0 (which doubles as padding, distinguished by the
attention mask). Every prompt was being synthesized as if you'd said
";<phonemes>:" instead of just <phonemes>.
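
For illustration, the corrected wrapping could look roughly like the sketch below. It is not the SDK's actual code: KokoroInput, wrap_phonemes and the max_len padding scheme are hypothetical names and assumptions, and only the id 0 convention and the role of the attention mask come from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container; the real SDK types are not shown in this PR.
struct KokoroInput {
    std::vector<int64_t> input_ids;       // boundary + phonemes + boundary + padding
    std::vector<int64_t> attention_mask;  // 1 = real token (boundaries included), 0 = padding
};

// Id 0 is '$' in Kokoro's vocab: start/end marker and padding symbol.
constexpr int64_t kBoundaryId = 0;

// Wrap phoneme ids as [0, ids..., 0] and right-pad with 0 up to max_len.
// Only the attention mask separates the boundary zeros from the padding zeros.
inline KokoroInput wrap_phonemes(const std::vector<int64_t>& phoneme_ids,
                                 std::size_t max_len) {
    const std::size_t used = phoneme_ids.size() + 2;  // two boundary tokens
    assert(used <= max_len && "utterance does not fit in max_len");

    KokoroInput out;
    out.input_ids.assign(max_len, kBoundaryId);
    out.attention_mask.assign(max_len, 0);
    for (std::size_t i = 0; i < phoneme_ids.size(); ++i) {
        out.input_ids[i + 1] = phoneme_ids[i];
    }
    for (std::size_t i = 0; i < used; ++i) {
        out.attention_mask[i] = 1;
    }
    return out;
}
```

With three placeholder phoneme ids and max_len = 8, the result has a zero at positions 0 and 4 with the mask set, and zeros from position 5 onward with the mask cleared.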

Round-trip test (synthesize → transcribe through the new CLIs):

| Prompt | Before fix | After fix |
| --- | --- | --- |
| "Hello world" | "I wrote" / "Oh no" (peak 0.65) | "Hello world" (peak 0.53, 1.12 s) |
| "The quick brown fox jumps over the lazy dog" | "Quick brown fox jumps over the Laza dog and" (dropped "The") | "The quick brown fox jumps over the lousy dog and" (keeps "The") |
| "Hey, what is your name?" | "What is your name?" (dropped "Hey") | "Huh? What is your numb?" (Hey-ish) |
| "Hey" alone (5 tokens) | model output peaks at 247, post-processed into noise | "Hy" at peak 0.54, 0.85 s of clean audio |

Synthesized durations roughly doubled because the duration predictor
now allocates real frame counts to phonemes instead of squeezing
through a fake leading ;. That's why the old audio sounded rushed and
slurred together.

The remaining minor mismatches ("lousy" for "lazy", "numb" for "name?")
are mostly Parakeet ASR quirks on synthetic speech rather than synthesis
problems.
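
One way to put a number on "minor mismatch" is a word-level edit distance between the prompt and the transcript, ignoring case and punctuation. The sketch below is only an illustration of that idea; normalize_words and word_edit_distance are hypothetical helpers, not part of the tools added in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Lowercase the text, drop punctuation, and split on whitespace.
inline std::vector<std::string> normalize_words(const std::string& text) {
    std::string cleaned;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            cleaned += static_cast<char>(std::tolower(c));
        } else if (std::isspace(c)) {
            cleaned += ' ';
        }  // anything else (punctuation, non-ASCII bytes) is dropped
    }
    std::istringstream iss(cleaned);
    std::vector<std::string> words;
    std::string w;
    while (iss >> w) words.push_back(w);
    return words;
}

// Levenshtein distance over words: substitutions, insertions, deletions.
inline std::size_t word_edit_distance(const std::string& expected,
                                      const std::string& actual) {
    const std::vector<std::string> a = normalize_words(expected);
    const std::vector<std::string> b = normalize_words(actual);
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

Applied to the table above, the after-fix "Hello world" row scores 0, while the "lazy dog" and "what is your name" rows each score 2.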

Diagnostic tooling

Three Linux/macOS CLIs added under linux/tools/ that round-trip text
through the SDK without bouncing through Android. This whole bug was
found by running speech_synthesize <model_dir> out.wav "Hello world" &&
speech_transcribe <model_dir> out.wav and seeing "I wrote" come back.
The tools are permanent now.

| Tool | Purpose |
| --- | --- |
| speech_transcribe <model_dir> <input.wav> | WAV → text via Parakeet. Resamples + downmixes to 16 kHz. Skips Kokoro load (also a fix in this PR). |
| speech_synthesize <model_dir> <out.wav> "<text>" [language] | text → WAV via KokoroTts directly. |
| speech_phonemize <model_dir> "<text>" [language] | dumps IPA + token IDs from KokoroPhonemizer. |
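
The "resamples + downmixes to 16 kHz" step in speech_transcribe can be pictured with the sketch below. The PR does not show which resampler the tool uses, so the linear interpolation here is a stand-in, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Average interleaved channels into a single mono channel.
inline std::vector<float> downmix_to_mono(const std::vector<float>& interleaved,
                                          int channels) {
    assert(channels > 0);
    const std::size_t ch = static_cast<std::size_t>(channels);
    const std::size_t frames = interleaved.size() / ch;
    std::vector<float> mono(frames);
    for (std::size_t f = 0; f < frames; ++f) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < ch; ++c) sum += interleaved[f * ch + c];
        mono[f] = sum / static_cast<float>(channels);
    }
    return mono;
}

// Linear-interpolation resampler. There is no anti-aliasing filter, so this
// is only adequate as a debugging aid, not as a production resampler.
inline std::vector<float> resample_linear(const std::vector<float>& in,
                                          int src_rate, int dst_rate) {
    assert(src_rate > 0 && dst_rate > 0);
    if (in.empty() || src_rate == dst_rate) return in;
    const std::size_t out_len = static_cast<std::size_t>(
        static_cast<double>(in.size()) * dst_rate / src_rate);
    const double step = static_cast<double>(src_rate) / dst_rate;
    std::vector<float> out(out_len);
    for (std::size_t i = 0; i < out_len; ++i) {
        const double pos = static_cast<double>(i) * step;
        const std::size_t i0 = std::min(static_cast<std::size_t>(pos), in.size() - 1);
        const std::size_t i1 = std::min(i0 + 1, in.size() - 1);
        const float frac = static_cast<float>(pos - static_cast<double>(i0));
        out[i] = in[i0] + (in[i1] - in[i0]) * frac;
    }
    return out;
}
```

A stereo 48 kHz recording would go through downmix_to_mono(samples, 2) and then resample_linear(mono, 48000, 16000) before being handed to the recognizer.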

Other changes

  • linux/setup_linux.sh handles macOS arm64/x86_64 in addition to
    Linux. Detects uname -s, downloads the right ONNX Runtime tarball,
    copies the right shared-lib glob.
  • linux/CMakeLists.txt: SPEECH_BUILD_TOOLS option, three new
    targets.
  • linux/src/speech.cpp skips KokoroTts construction when
    config.transcribe_only is true. Saves model load time and lets
    speech_transcribe run on a slimmer model dir.
  • sdk/.../kokoro_tts.cpp post-processing matches speech-swift:
    • Drops output when peak > 2.0 (numerical-instability marker — was
      triggered constantly by the BOS/EOS bug, harmless safety net now).
    • 50 ms RMS-window trailing-silence trim with 10 ms ramp-down,
      replacing the old 5 ms hard fade.
    • Keeps the leading 5 ms fade-in.
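
The post-processing steps in the last bullet can be sketched as follows. This is a minimal illustration rather than the code in kokoro_tts.cpp: the 24 kHz sample rate and the 0.01 RMS silence threshold are assumptions (the PR states neither), and PostProcessConfig and post_process are hypothetical names. The 2.0 peak limit and the 50 ms / 10 ms / 5 ms timings are taken from the list above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct PostProcessConfig {
    int sample_rate = 24000;     // assumption: not stated in the PR
    float max_peak = 2.0f;       // from the PR: drop output above this peak
    float silence_rms = 0.01f;   // assumption: threshold is not stated in the PR
    float window_ms = 50.0f;     // RMS window for the trailing-silence trim
    float ramp_down_ms = 10.0f;  // fade-out applied after trimming
    float fade_in_ms = 5.0f;     // leading fade-in to suppress a start click
};

inline float peak_abs(const std::vector<float>& x) {
    float p = 0.0f;
    for (float s : x) p = std::max(p, std::fabs(s));
    return p;
}

inline float window_rms(const std::vector<float>& x, std::size_t begin, std::size_t end) {
    if (end <= begin) return 0.0f;
    double acc = 0.0;
    for (std::size_t i = begin; i < end; ++i) acc += static_cast<double>(x[i]) * x[i];
    return static_cast<float>(std::sqrt(acc / static_cast<double>(end - begin)));
}

// Returns an empty vector when the output is treated as numerically unstable.
inline std::vector<float> post_process(std::vector<float> audio,
                                       const PostProcessConfig& cfg = PostProcessConfig{}) {
    if (audio.empty() || peak_abs(audio) > cfg.max_peak) return {};

    const auto ms_to_samples = [&cfg](float ms) {
        return static_cast<std::size_t>(cfg.sample_rate * ms / 1000.0f);
    };

    // Walk back from the end in window-sized steps while the window is quiet.
    const std::size_t win = std::max<std::size_t>(1, ms_to_samples(cfg.window_ms));
    std::size_t end = audio.size();
    while (end > 0) {
        const std::size_t begin = end > win ? end - win : 0;
        if (window_rms(audio, begin, end) >= cfg.silence_rms) break;
        end = begin;
    }
    audio.resize(end);

    // Linear ramp-down over the last ramp_down_ms of the kept audio.
    const std::size_t ramp = std::min(ms_to_samples(cfg.ramp_down_ms), audio.size());
    for (std::size_t i = 0; i < ramp; ++i) {
        audio[audio.size() - ramp + i] *= 1.0f - static_cast<float>(i + 1) / static_cast<float>(ramp);
    }

    // Linear fade-in over the first fade_in_ms.
    const std::size_t fade = std::min(ms_to_samples(cfg.fade_in_ms), audio.size());
    for (std::size_t i = 0; i < fade; ++i) {
        audio[i] *= static_cast<float>(i) / static_cast<float>(fade);
    }
    return audio;
}
```

Returning an empty buffer for the peak > 2.0 case mirrors the "drop output" behaviour described above; a caller would then skip playback instead of normalizing the spike into audible noise.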

Test plan

  • ./gradlew :sdk:assembleDebug — green
  • ./gradlew :sdk:testDebugUnitTest — 23/23 pass
  • ./linux/setup_linux.sh on macOS arm64 — pulls
    onnxruntime-osx-arm64-1.19.0.tgz
  • cmake --build build — builds speech_transcribe,
    speech_synthesize, speech_phonemize
  • Round-trip "Hello world" / "Hey, what is your name?" /
    "The quick brown fox..." → all transcribe back to the requested
    text (modulo Parakeet ASR quirks on synthetic audio)
  • Manual on the S23: install demo, invoke Echo mode, verify
    synthesised speech is intelligible

Notes

  • The verify step in speech-models/models/kokoro-tts/export/convert_onnx.py
    already uses [0, ..., 0] in its hardcoded test tokens, so the export
    pipeline follows the canonical convention. Our SDK reimplementation
    just drifted at some point.
  • Re-exported the model with the current convert.py to verify it
    wasn't a stale-weights problem; the new export produces the same
    audio as what's published on HuggingFace, so the export pipeline is
    current. The bug was purely on the consumer side.

Ivan added 2 commits May 10, 2026 13:31
Three new CLI tools under linux/tools/ that exercise the SDK without
Android, useful for diagnosing TTS / STT regressions:

- speech_transcribe — WAV → text via Parakeet
- speech_synthesize — text → WAV via Kokoro
- speech_phonemize  — text → IPA + token IDs via the phonemizer

Round-tripping a known prompt through these surfaced two real issues:
the model produces catastrophic out-of-range output for very short
prompts (peak ≈250 for 'Hey' alone), and our peak normalization made
that worse by amplifying the surviving spike. Address what we can on
the consuming side; the residual audio-quality gap is in the ONNX
export itself (segmented cumsum / scatter ops missing — documented in
speech-models/models/kokoro-tts/export/NOTES.md).

Changes:

- linux/setup_linux.sh — handle Darwin arm64/x86_64 (download macOS ORT,
  use libonnxruntime*.dylib glob).
- linux/CMakeLists.txt — SPEECH_BUILD_TOOLS option, three new targets.
- linux/src/speech.cpp — skip KokoroTts construction when
  config.transcribe_only is true. Lets speech_transcribe run on a
  slimmer model dir (no kokoro-e2e.onnx needed) and removes a chunk of
  startup latency from STT-only consumers.
- sdk/.../kokoro_tts.cpp — match speech-swift's Kokoro post-processing:
  - drop output entirely when peak > 2.0 (numerical instability marker
    on short prompts), instead of normalizing garbage into audible
    glitches
  - 50 ms RMS-window trailing-silence trim with 10 ms ramp-down,
    replacing the old 5 ms hard fade — Kokoro emits 100-300 ms of
    low-energy noise + spike clicks past the real speech, the windowed
    trim cleanly clips it
  - keep the leading 5 ms fade-in to suppress click at start

Linux: cmake --build build && LD_LIBRARY_PATH=../ort-linux/lib \
  ./build/speech_transcribe ./tests/models input.wav
(on macOS, set DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH)

Android: ./gradlew :sdk:testDebugUnitTest — 23/23 still pass.

Kokoro's vocab maps '$' to id 0 and treats it as the start/end-of-sequence
marker (same symbol used for padding, distinguished by attention mask).
Our phonemizer used BOS_ID=1 and EOS_ID=2, which are ';' and ':' in the
vocab — every utterance was synthesized as if you'd said ";<phonemes>:"
instead of just <phonemes>.

Visible effects this fixes:

- 'Hello world' was producing audio that round-tripped as 'I wrote' /
  'Oh no'; with the right wrap it round-trips as 'Hello world'.
- 'The quick brown fox jumps over the lazy dog' was dropping 'The';
  with the fix it preserves the first word.
- 'Hey' alone produced output with peak ~247 (numerical instability
  from the wrong BOS confusing the duration predictor for very short
  inputs); now produces normal audio at peak 0.5.
- Generated durations roughly doubled (0.65s -> 1.12s for 'Hello
  world'), which was the real reason words felt rushed and slurred —
  the predictor was under-allocating frames because the wrong BOS made
  it treat the leading semicolon as a content token.

Verified by round-tripping prompts through the new linux/tools/
speech_synthesize -> speech_transcribe pair (added in this same PR).

Note: the verify in speech-models/models/kokoro-tts/export/convert_onnx.py
already used [0, ..., 0] in its hardcoded test tokens, but the SDK
reimplementation drifted to [1, ..., 2] — easy to miss, the bug only
surfaces when you listen to the audio.

Local: ./gradlew :sdk:assembleDebug :sdk:testDebugUnitTest — 23/23 pass.
Base automatically changed from feat/recognition-service to main May 10, 2026 16:11
@ivan-digital ivan-digital merged commit 2514a9d into main May 10, 2026
@ivan-digital ivan-digital deleted the feat/tts-debug-tools branch May 10, 2026 16:38