
WaveKat TTS


Unified text-to-speech for voice pipelines, wrapping multiple TTS engines behind common Rust traits. Same pattern as wavekat-vad and wavekat-turn.

> [!WARNING]
> Early development. API may change between minor versions.

Backends

| Backend   | Feature flag | Status       | License    |
|-----------|--------------|--------------|------------|
| Qwen3-TTS | `qwen3-tts`  | ✅ Available | Apache 2.0 |
| CosyVoice | `cosyvoice`  | 🚧 Planned   | Apache 2.0 |

Quick start

```shell
cargo add wavekat-tts --features qwen3-tts
```

```rust
use wavekat_tts::{TtsBackend, SynthesizeRequest};
use wavekat_tts::backends::qwen3_tts::{Qwen3Tts, ModelConfig, ModelPrecision, ExecutionProvider};

// Auto-downloads INT4 model files on first run, runs on CPU (default):
let tts = Qwen3Tts::new()?;

// Or FP32 on CPU:
// let tts = Qwen3Tts::from_config(ModelConfig::default().with_precision(ModelPrecision::Fp32))?;

// Or INT4 from a local directory on CUDA:
// let tts = Qwen3Tts::from_config(
//     ModelConfig::default()
//         .with_dir("models/qwen3-tts-1.7b")
//         .with_execution_provider(ExecutionProvider::Cuda),
// )?;

let request = SynthesizeRequest::new("Hello, world")
    .with_instruction("Speak naturally and clearly.");
let audio = tts.synthesize(&request)?;

// Save to WAV (wavekat-core includes WAV I/O via the `wav` feature):
audio.write_wav("output.wav")?;

println!("{}s at {} Hz", audio.duration_secs(), audio.sample_rate());
```

Model files are cached by the HF Hub client at $HF_HOME/hub/ (default ~/.cache/huggingface/hub/). Set WAVEKAT_MODEL_DIR to load from a local directory and skip all downloads.
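For example, to load from a local directory and skip downloads entirely (the path below is illustrative):

```shell
# Point the crate at a local model directory instead of the HF Hub cache:
export WAVEKAT_MODEL_DIR=/path/to/models/qwen3-tts-1.7b
cargo run --example synthesize --features qwen3-tts -- "Hello"
```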

All backends produce AudioFrame<'static> from wavekat-core — the same type consumed by wavekat-vad and wavekat-turn.

Architecture

```
wavekat-vad   →  "is someone speaking?"
wavekat-turn  →  "are they done speaking?"
wavekat-tts   →  "synthesize the response"
     │                   │                     │
     └───────────────────┴─────────────────────┘
                         │
                   AudioFrame (wavekat-core)
```

Two trait families:

- `TtsBackend` — batch synthesis: text → `AudioFrame<'static>`
- `StreamingTtsBackend` — streaming: text → iterator of `AudioFrame<'static>` chunks

Examples

Generate a WAV file from text (model files are auto-downloaded on first run):

```shell
cargo run --example synthesize --features qwen3-tts -- "Hello, world!"
cargo run --example synthesize --features qwen3-tts -- --instruction "Speak in a warm, friendly tone." "Give every small business the voice of a big one."
cargo run --example synthesize --features qwen3-tts -- --precision fp32 "Hello"
cargo run --example synthesize --features qwen3-tts -- --model-dir /path/to/model --output hello.wav "Hello"
```

Performance

| Backend   | Precision | Provider | Hardware             | RTF short | RTF medium | RTF long |
|-----------|-----------|----------|----------------------|-----------|------------|----------|
| qwen3-tts | int4      | CPU      | Standard_NC4as_T4_v3 | 1.98      | 2.04       | 2.34     |
| qwen3-tts | int4      | CUDA     | Standard_NC4as_T4_v3 | 0.78      | 0.85       | 1.07     |

RTF < 1.0 = faster-than-real-time. Lower is better.
To update: run make bench-csv-cuda on target hardware, then commit bench/results/.

Feature flags

Backends

| Flag        | Default | Description                               |
|-------------|---------|-------------------------------------------|
| `qwen3-tts` | off     | Qwen3-TTS local ONNX inference            |
| `cosyvoice` | off     | CosyVoice local ONNX inference (planned)  |

Execution providers

Composable with any backend flag. Selects the inference hardware at build time.

| Flag       | Description          | Status            |
|------------|----------------------|-------------------|
| `cuda`     | NVIDIA CUDA GPU      | ✅ Working        |
| `tensorrt` | NVIDIA TensorRT      | 🚧 Not configured |
| `coreml`   | Apple CoreML (macOS) | 🚧 Not configured |
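For example, to pair the Qwen3-TTS backend with the CUDA provider, enable both flags together:

```shell
# Backend flag + execution-provider flag are composed in one feature list:
cargo add wavekat-tts --features qwen3-tts,cuda
```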

License

Licensed under Apache 2.0.

Copyright 2026 WaveKat.
