A fully-local voice assistant demo with a super simple FastAPI backend and a simple HTML front-end. All the models (ASR / LLM / TTS) are open-weight and run locally, i.e. no data is sent to the Internet or to any API. It's intended to demonstrate how easy it is to run a fully-local AI setup on affordable commodity hardware, while also demonstrating the uncanny valley and teasing out the ethical considerations of such a setup (see Disclaimer & Ethical Considerations at the bottom).
Models used:
- ASR: NVIDIA parakeet-tdt-0.6b-v3 (600M)
- LLM: Mistral Ministral-3 3B (4-bit quantized)
- TTS (Simple): Hexgrad Kokoro (82M)
- TTS (With Voice Cloning): Qwen3-TTS
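As a quick sanity check that the LLM piece responds locally, something like the following can be run once Ollama is up. This is a hypothetical snippet, not part of the repo, and the model tag is an assumption - substitute whatever tag `./ova.sh install` actually pulls (`ollama list` shows what's installed):

```python
# Hypothetical sanity check of the local LLM via Ollama's Python client.
# The model tag "ministral-3b" is a guess; replace it with the tag that
# `ollama list` reports on your machine.
import ollama

resp = ollama.chat(
    model="ministral-3b",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(resp["message"]["content"])
```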
Why "Outrageous"? Because it was outrageously easy to create!
How it works:
- Frontend captures the user's audio and sends a blob of bytes to the backend `/chat` endpoint
- Backend parses the bytes, extracts the sample rate (SR) and channel count, then:
- Transcribes the audio to text using an automatic speech recognition (ASR) model
- Sends the transcribed text to the LLM, i.e. "the brain"
- Sends the LLM response to a text-to-speech (TTS) model
- Normalizes the TTS output, converts it to bytes, and sends the bytes back to the frontend
- The frontend plays the response audio back to the user
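Concretely, a minimal sketch of that round trip could look like the following. The `transcribe`, `generate_reply`, and `synthesize` helpers are hypothetical stand-ins for the real model wrappers, not this repo's actual code:

```python
# Sketch of the /chat round trip; the three helpers below are stubs
# standing in for the ASR, LLM, and TTS wrappers.
import io

import numpy as np
import soundfile as sf
from fastapi import FastAPI, Request, Response

app = FastAPI()

def transcribe(audio: np.ndarray, sr: int) -> str:
    raise NotImplementedError  # wrap the ASR model (parakeet) here

def generate_reply(text: str) -> str:
    raise NotImplementedError  # call the local LLM via Ollama here

def synthesize(text: str) -> tuple[np.ndarray, int]:
    raise NotImplementedError  # run the TTS model (e.g. Kokoro) here

@app.post("/chat")
async def chat(request: Request) -> Response:
    raw = await request.body()
    # soundfile infers the sample rate and channel count from the container
    audio, sr = sf.read(io.BytesIO(raw), dtype="float32")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # down-mix to mono

    text = transcribe(audio, sr)      # 1. ASR: speech -> text
    reply = generate_reply(text)      # 2. LLM: "the brain"
    wave, tts_sr = synthesize(reply)  # 3. TTS: text -> speech

    # 4. Peak-normalize the TTS output and send WAV bytes back
    wave = wave / max(1e-9, float(np.abs(wave).max()))
    buf = io.BytesIO()
    sf.write(buf, wave, tts_sr, format="WAV")
    return Response(content=buf.getvalue(), media_type="audio/wav")
```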
On my system (RTX 5070, 12GiB VRAM), the whole round-trip time using Kokoro is ~1 second.
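To get a rough timing on your own machine, a quick benchmark along these lines works. It assumes the backend is up, a short `sample.wav` exists in the working directory, and that the endpoint accepts a raw body - the exact request shape the frontend uses may differ:

```python
# Rough round-trip benchmark against a running backend (the port matches
# the back-end URL listed further down; adjust if yours differs).
import time

import requests

wav_bytes = open("sample.wav", "rb").read()
t0 = time.perf_counter()
r = requests.post("http://localhost:5173/chat", data=wav_bytes)
r.raise_for_status()
print(f"round trip: {time.perf_counter() - t0:.2f}s "
      f"({len(r.content)} bytes of audio back)")
```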
When using "profiles" (i.e. voice cloning), there is an additional pre-step in which a 3-5 second WAV clip, together with a corresponding transcription and a prompt, is used to condition the TTS. This leverages Qwen3-TTS and requires no finetuning. Note, however, that responses will be slightly slower. I kept Kokoro as the default (non-voice-cloning) TTS because (1) it's very fast, and (2) I really like the quality of its default voice.
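The pre-step amounts to something like the sketch below. `tts.clone_voice` and `tts.speak` are hypothetical stand-ins for whatever the real TTS wrapper exposes - they are not Qwen3-TTS's actual API:

```python
# Hypothetical sketch of the voice-cloning "warmup" - conditioning the TTS
# on a profile's 3-5s reference clip, with no finetuning involved.
from pathlib import Path

import soundfile as sf

def warmup_profile(tts, profile: Path):
    """Build a reusable cloned voice from profiles/<name>/."""
    ref_audio, sr = sf.read(profile / "ref_audio.wav", dtype="float32")
    ref_text = (profile / "ref_text.txt").read_text().strip()
    prompt = (profile / "prompt.txt").read_text().strip()
    return tts.clone_voice(ref_audio, sr, ref_text, style_prompt=prompt)

# Subsequent responses reuse the conditioned voice, e.g.:
# voice = warmup_profile(tts, Path("profiles/dua"))
# wave, out_sr = tts.speak(reply_text, voice=voice)
```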
Future work:
- Add Apple Silicon (MLX) support
- Add client-side Voice Activity Detection (VAD) (e.g. see https://docs.vad.ricky0123.com/)
- Remove the need for a prepared transcript for voice cloning - use ASR as part of a "warmup"
- Add orchestration to detect more complex tasks (e.g. those requiring GPT/Claude) or whether tool calls (web search, file search, other CLI/API calls) are needed
Demo videos (in the repo): outrageous-demo-voice-cloning-2.mp4 and Outrageous.-.Demo.mp4
Requirements:
- Python >=3.12
- `uv` installed and available in PATH
- Ollama installed and running (`ollama` CLI available)
- Verified on a system with an RTX 5070 (12GiB VRAM); lower-end setups should be possible
Fetch Python deps and HF/Ollama models:

```bash
# NVIDIA/CUDA
./ova.sh install --cuda

# Apple/MLX
./ova.sh install --mlx
```

Start the front-end and back-end services (non-blocking) with the fast default voice assistant:

```bash
./ova.sh start
```

To start the voice assistant with one of the pre-cloned voices (NOTE: response time will be slower):

```bash
OVA_PROFILE=dua ./ova.sh start  # NOTE: uses the cloned voice of a famous artist
```

- Front-end: http://localhost:8000
- Back-end: http://localhost:5173
Logs and PIDs are stored under `.ova/`. If you want to follow the logs in another terminal window:

```bash
tail -f .ova/backend.log
```

Stop all services:

```bash
./ova.sh stop
```

To add a new voice, no code changes are required. You simply need to do the following (an optional sanity-check sketch follows the steps):
- Create a new directory `profiles/<voice>/`
- Add a 3-5 second voice clip `ref_audio.wav`, a direct transcription of that clip in `ref_text.txt`, and any instructions in `prompt.txt` - all under the sub-directory created in the previous step
- To start the service with the new voice, simply run:

```bash
OVA_PROFILE=<voice> ./ova.sh start
```
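Optionally, before starting, a quick check can catch a missing file or an off-length clip. This is a hypothetical helper script, not part of the repo:

```python
# Sanity-check a new profile directory before starting the service.
import sys
from pathlib import Path

import soundfile as sf

def check_profile(name: str) -> None:
    p = Path("profiles") / name
    for f in ("ref_audio.wav", "ref_text.txt", "prompt.txt"):
        assert (p / f).exists(), f"missing {p / f}"
    info = sf.info(p / "ref_audio.wav")
    duration = info.frames / info.samplerate
    assert 3.0 <= duration <= 5.0, f"clip is {duration:.1f}s; aim for 3-5s"
    print(f"profile '{name}' looks good ({duration:.1f}s @ {info.samplerate} Hz)")

if __name__ == "__main__":
    check_profile(sys.argv[1])
```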
And that's it!
Enjoy!
Disclaimer & Ethical Considerations: This project is a proof-of-concept demonstration and is provided as-is, without any warranties or guarantees. It is intended for educational and experimental purposes only. The voice cloning is likewise purely for educational purposes - for real-life or commercial use, always seek the relevant permissions. The demo also highlights the ethical and security implications: the ease with which one can clone a voice with no finetuning, using only a 3-5 second audio clip, is both eerie and potentially dangerous in the wrong hands - and all of it can be accomplished on a commodity PC that most people can afford.
