Supertonic Express is a lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
This repository provides both a Python implementation (with a production-ready FastAPI server) and a JavaScript/Node.js implementation, using the optimized ONNX model from onnx-community/Supertonic-TTS-2-ONNX.
- 2026.02.22 - 🎉 Smart Text Chunking added! Automatic sentence-based splitting prevents OOM errors and enables unlimited-length audio generation
- 2026.02.22 - 🎉 Improved Streaming: Chunk-by-chunk audio streaming reduces time-to-first-byte for long texts. Default format changed to Opus (WhatsApp-compatible, 64k Ogg)
- 2026.01.13 - 🎉 Migrated to use onnx-community/Supertonic-TTS-2-ONNX model with Transformers tokenizer for improved compatibility
- 2026.01.13 - 🎉 FastAPI Server released! OpenAI-compatible REST API with streaming support, multiple audio formats, and Docker deployment. Perfect for Open-WebUI integration! API Docs
- 2026.01.06 - 🎉 Supertonic 2 released with multilingual support! Now supports English (en), Korean (ko), Spanish (es), Portuguese (pt), and French (fr). Demo | Models
- Demo
- Why Supertonic?
- Language Support
- Getting Started
- Performance
- Built with Supertonic
- Citation
- License
Watch Supertonic running on a Raspberry Pi, demonstrating on-device, real-time text-to-speech synthesis:
supertonic_raspberry-pi_480.mov
Experience Supertonic on an Onyx Boox Go 6 e-reader in airplane mode, achieving an average RTF of 0.3 with zero network dependency:
supertonic_ebook.mp4
Turns any webpage into audio in under one second, delivering lightning-fast, on-device text-to-speech with zero network dependency—free, private, and effortless:
TLDRL_video_1_1_4_short_low.mp4
🎧 Try it now: Experience Supertonic in your browser with our Interactive Demo, or get started with pre-trained models from Hugging Face Hub
- ⚡ Blazingly Fast: Generates speech up to 167× faster than real-time on consumer hardware (M4 Pro)—unmatched by any other TTS system
- 🪶 Ultra Lightweight: Only 66M parameters, optimized for efficient on-device performance with minimal footprint
- 📱 On-Device Capable: Complete privacy and zero latency—all processing happens locally on your device
- 🎨 Natural Text Handling: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
- ⚙️ Highly Configurable: Adjust inference steps, batch processing, and other parameters to match your specific needs
- 🧩 Flexible Deployment: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends.
This repository provides implementations in multiple languages:
Full-featured implementation with FastAPI server, ideal for server deployments.
Location: /py directory
Features: ONNX Runtime, FastAPI REST API, OpenAI compatibility, Docker support
Pure JavaScript implementation optimized for Intel CPUs, perfect for Node.js applications and browsers.
Location: /js directory
Features: Transformers.js, Intel CPU optimizations, browser support, zero Python dependencies
For other language implementations, see:
- Original Supertonic Repository - Multi-language implementations (C++, Java, Swift, etc.)
We now provide a production-ready FastAPI server with OpenAI-compatible endpoints!
Features:
- ✅ OpenAI-compatible `/v1/audio/speech` endpoint
- ✅ Chunk-by-chunk streaming with low time-to-first-byte
- ✅ Smart text chunking — unlimited-length audio, no OOM errors
- ✅ Multiple audio formats (MP3, Opus [default], AAC, FLAC, WAV, PCM)
- ✅ Docker support (CPU & GPU)
- ✅ Works out-of-the-box with Open-WebUI 🎉
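Conceptually, the smart text chunker splits input on sentence boundaries and greedily groups sentences up to a size limit, so each synthesis call stays small. A minimal sketch of the idea (the function name `chunk_text` and the 300-character default are illustrative, not the server's actual parameters):

```python
import re

def chunk_text(text, max_chars=300):
    """Split text into sentence-based chunks no longer than max_chars.

    Sentences are grouped greedily; when adding the next sentence would
    exceed the limit, the current group is flushed as one chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized and streamed independently, which is what keeps memory bounded regardless of input length.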
Quick Start:
# Install dependencies
cd py
pip install -e .
# Download the ONNX model
python -c "from huggingface_hub import snapshot_download; snapshot_download('onnx-community/Supertonic-TTS-2-ONNX', local_dir='assets')"
# Start the server
./scripts/run_server_cpu.sh

Usage with OpenAI Client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8880/v1",
api_key="not-needed"
)
response = client.audio.speech.create(
model="supertonic",
voice="M1",
input="Hello from Supertonic!",
response_format="opus" # Default; WhatsApp-compatible Ogg/Opus
)
response.stream_to_file("output.opus")

📚 Documentation:
First, clone the repository:
git clone https://github.com/groxaxo/supertonic-express.git
cd supertonic-express

For JavaScript/Node.js implementation:
- Node.js >= 18.0.0
- No model download required (automatically handled by Transformers.js)
For Python implementation:
Download the ONNX model from Hugging Face:
# Using huggingface_hub
pip install huggingface-hub
python -c "from huggingface_hub import snapshot_download; snapshot_download('onnx-community/Supertonic-TTS-2-ONNX', local_dir='assets')"

JavaScript/Node.js Example (Details)
cd js
npm install
npm run example:basic

Python Example (Details)
cd py
pip install -e .
python example_onnx.py --onnx-dir ../assets

FastAPI Server
./scripts/run_server_cpu.sh

To run Supertonic on NVIDIA GPUs with maximum performance (~46× real-time), we provide a dedicated launch script that handles environment setup and library paths.
1. Create Conda Environment:

   # Create environment
   conda create -n supertonic-gpu python=3.10
   conda activate supertonic-gpu
   # Install dependencies
   pip install onnxruntime-gpu nvidia-cudnn-cu12
   pip install -r py/requirements_gpu.txt
   pip install -e py/

2. Launch with GPU Optimization:

   # Run the optimized server (handles LD_LIBRARY_PATH automatically)
   ./scripts/run_server_gpu.sh
For deployment without a GPU, use the CPU-optimized launch script. Performance is approximately 4.4× real-time.
# Run in CPU mode
./scripts/run_server_cpu.sh

The JavaScript implementation uses Transformers.js and includes Intel CPU optimizations:
import { SupertonicTTS } from './index.js';
// Create and initialize TTS
const tts = new SupertonicTTS();
await tts.initialize();
// Generate speech
await tts.generateAndSave('This is really cool!', 'output.wav', {
language: 'en',
speaker_embeddings: 'https://huggingface.co/onnx-community/Supertonic-TTS-2-ONNX/resolve/main/voices/M1.bin',
num_inference_steps: 5,
speed: 1.05,
});

CPU Optimization Settings: The JavaScript implementation includes automatic CPU optimizations for Intel processors:
- SIMD instructions enabled (AVX2, AVX-512)
- Automatic thread count optimization (75% of CPU cores)
- Configurable via the `OMP_NUM_THREADS` environment variable
See js/README.md for detailed documentation and optimization guides.
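The default thread heuristic is easy to mirror; this Python sketch (a hypothetical re-implementation for illustration, the real logic lives in the JS runtime configuration) shows how the `OMP_NUM_THREADS` override interacts with the 75%-of-cores fallback:

```python
import os

def default_thread_count():
    """Thread count for inference: OMP_NUM_THREADS if set,
    otherwise 75% of available cores (at least 1)."""
    cores = os.cpu_count() or 1
    fallback = max(1, int(cores * 0.75))
    return int(os.environ.get("OMP_NUM_THREADS", fallback))
```

Leaving roughly a quarter of the cores free keeps the host responsive while the thread pool saturates the rest.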
- Runtime: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode available)
- Model: Uses onnx-community/Supertonic-TTS-2-ONNX
- Tokenizer: Hugging Face Transformers AutoTokenizer
- Batch Processing: Supports batch inference for improved throughput
- Audio Output: 16-bit WAV files at a 44.1 kHz sample rate
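For reference, that output format is a standard mono, 16-bit PCM, 44.1 kHz WAV container. This standalone sketch (not part of the pipeline) writes a one-second test tone in that exact format using only the Python standard library:

```python
import math
import struct
import wave

SAMPLE_RATE = 44100  # matches Supertonic's output sample rate

def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as mono 16-bit PCM WAV."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# One second of a 440 Hz sine tone
tone = (math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE))
write_wav("tone.wav", tone)
```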
We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
Metrics:
- Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
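Both metrics fall out of a single timed synthesis run. A small helper (illustrative, not part of the benchmark harness):

```python
def tts_metrics(num_chars, audio_seconds, synth_seconds):
    """Return (characters per second, real-time factor) for one run.

    chars/sec: input length divided by synthesis time (higher is better).
    RTF: synthesis time divided by audio duration (lower is better).
    """
    return num_chars / synth_seconds, synth_seconds / audio_seconds
```

For example, synthesizing a 266-character input in about 0.21 s to produce roughly 17.5 s of audio gives ~1260 chars/sec and an RTF of ~0.012, consistent with the Long column for the M4 Pro CPU below.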
Characters per Second

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 912 | 1048 | 1263 |
| Supertonic (M4 pro - WebGPU) | 996 | 1801 | 2509 |
| Supertonic (RTX4090) | 2615 | 6548 | 12164 |
| API ElevenLabs Flash v2.5 | 144 | 209 | 287 |
| API OpenAI TTS-1 | 37 | 55 | 82 |
| API Gemini 2.5 Flash TTS | 12 | 18 | 24 |
| API Supertone Sona speech 1 | 38 | 64 | 92 |
| Open Kokoro | 104 | 107 | 117 |
| Open NeuTTS Air | 37 | 42 | 47 |
Notes:
- API = Cloud-based API services (measured from Seoul)
- Open = Open-source models
- Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
- Supertonic (RTX4090): Tested with PyTorch model
- Kokoro: Tested on M4 Pro CPU with ONNX
- NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
Real-time Factor

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 0.015 | 0.013 | 0.012 |
| Supertonic (M4 pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| Supertonic (RTX4090) | 0.005 | 0.002 | 0.001 |
| API ElevenLabs Flash v2.5 | 0.133 | 0.077 | 0.057 |
| API OpenAI TTS-1 | 0.471 | 0.302 | 0.201 |
| API Gemini 2.5 Flash TTS | 1.060 | 0.673 | 0.541 |
| API Supertone Sona speech 1 | 0.372 | 0.206 | 0.163 |
| Open Kokoro | 0.144 | 0.124 | 0.126 |
| Open NeuTTS Air | 0.390 | 0.338 | 0.343 |
Additional Performance Data (5-step inference)
Characters per Second (5-step)
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 596 | 691 | 850 |
| Supertonic (M4 pro - WebGPU) | 570 | 1118 | 1546 |
| Supertonic (RTX4090) | 1286 | 3757 | 6242 |
Real-time Factor (5-step)
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 pro - CPU) | 0.023 | 0.019 | 0.018 |
| Supertonic (M4 pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| Supertonic (RTX4090) | 0.011 | 0.004 | 0.002 |
We have implemented ONNX Runtime IO Binding to eliminate CPU-GPU data transfer overhead.
| Configuration | Inference Speedup | Notes |
|---|---|---|
| GPU (Optimized) | 46x Real-time | ~0.07s latency for short sentences. Uses scripts/run_server_gpu.sh |
| CPU (Baseline) | 4.4x Real-time | ~0.76s latency. Uses scripts/run_server_cpu.sh |
Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
🎧 View audio samples more easily: Check out our Interactive Demo for a better viewing experience of all audio examples
Overview of Test Cases:
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|---|---|---|---|---|---|---|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
Example 1: Financial Expression
Text:
"The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round."
Challenges:
- Decimal point in currency ($5.2M should be read as "five point two million")
- Abbreviated magnitude units (M for million, K for thousand)
- Currency symbol ($) that needs to be properly pronounced as "dollars"
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Example 2: Time and Date
Text:
"The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
Challenges:
- Time expression with PM notation (4:45 PM)
- Abbreviated weekday (Wed)
- Abbreviated month (Apr)
- Full date format (Apr 3, 2024)
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Example 3: Phone Number
Text:
"You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."
Challenges:
- Area code in parentheses that should be read as separate digits
- Phone number with hyphen separator (555-0142)
- Abbreviated extension notation (ext.)
- Extension number (402)
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Example 4: Technical Unit
Text:
"Our drone battery lasts 2.3h when flying at 30kph with full camera payload."
Challenges:
- Decimal time duration with abbreviation (2.3h = two point three hours)
- Speed unit with abbreviation (30kph = thirty kilometers per hour)
- Technical abbreviations (h for hours, kph for kilometers per hour)
- Technical/engineering context requiring proper pronunciation
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.
| Project | Description | Links |
|---|---|---|
| TLDRL | Free, on-device TTS extension for reading any webpage | Chrome |
| Read Aloud | Open-source TTS browser extension | Chrome · Edge · GitHub |
| PageEcho | E-Book reader app for iOS | App Store |
| VoiceChat | On-device voice-to-voice LLM chatbot in the browser | Demo · GitHub |
| OmniAvatar | Talking avatar video generator from photo + speech | Demo |
| CopiloTTS | Kotlin Multiplatform TTS SDK via ONNX Runtime | GitHub |
| Voice Mixer | PyQt5 tool for mixing and modifying voice styles | GitHub |
| Supertonic MNN | Lightweight library based on MNN (fp32/fp16/int8) | GitHub · PyPI |
| Transformers.js | Hugging Face's JS library with Supertonic support | GitHub PR · Demo |
| Pinokio | 1-click localhost cloud for Mac, Windows, and Linux | Pinokio · GitHub |
The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
@article{kim2025supertonic,
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
journal={arXiv preprint arXiv:2503.23108},
year={2025},
url={https://arxiv.org/abs/2503.23108}
}

This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
@article{kim2025larope,
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
journal={arXiv preprint arXiv:2509.11084},
year={2025},
url={https://arxiv.org/abs/2509.11084}
}

This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
@article{kim2025spfm,
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
journal={arXiv preprint arXiv:2509.19091},
year={2025},
url={https://arxiv.org/abs/2509.19091}
}

This project's sample code is released under the MIT License; see the LICENSE file for details.
The accompanying model is released under the OpenRAIL-M License; see the LICENSE file for details.
This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the LICENSE file for details.
Copyright (c) 2026 Supertone Inc.