Real-time speech-to-text transcription system for Raspberry Pi 5 that streams audio from a USB microphone to OpenAI's Realtime API, forwards finalized turns to a GPT-5-based assistant, and can speak or print the reply in the terminal.
- Real-time audio capture from USB microphone with wake-word gating, pre-roll, and silence auto-stop.
- Streams to OpenAI Realtime API for transcription, then routes each completed turn to GPT-5 Mini or GPT-5.1 with adjustable reasoning effort.
- Assistant replies can stream directly from the Responses API or fall back to local Audio API TTS with automatic sample-rate selection.
- Optional web-search tool calls, fixed system prompt, and location/language hints so the assistant stays on-topic.
- Voice stop commands (e.g., "Hey Jarvis stop") interrupt playback and clear the pending turn.
- Verbose logging captures wake-word scores, state transitions, and all console output in timestamped files under `~/.cache/pi-assistant/logs/`.
- Optimized for Raspberry Pi 5, with 24 kHz mono PCM audio throughout the capture pipeline.
- Raspberry Pi 5
- USB microphone (plug-and-play, ALSA compatible)
- USB speaker
- Python 3.11+
- uv (https://docs.astral.sh/uv/)
- OpenAI API key
```bash
cd /path/to/pi-assistant
curl -LsSf https://astral.sh/uv/install.sh | sh

# Runtime dependencies only
uv sync

# Include Ruff + pre-commit (optional dev tools)
uv sync --group dev
uv run pre-commit install
```

`uv sync` creates a managed `.venv/` if one does not already exist and keeps it up to date as dependencies change.
If your Raspberry Pi workflow needs Python 3.11 without touching the system installation, let uv download a portable interpreter that only applies inside this repo:
```bash
uv python pin 3.11
```

This stores the interpreter under `.uv/python/...` and records the version in `.python-version`. Commit `.python-version`, keep `.uv/` ignored, and every uv command you run here (including `uv sync`) will automatically use that local 3.11 build so packages like `tflite-runtime` install cleanly. To switch later, run `uv python pin <other-version>` or remove `.python-version`.
sounddevice depends on the PortAudio shared libraries, which Raspberry Pi OS does not install by default. Run the helper script once per Pi to pull in the required packages:
```bash
./scripts/provisioning/install-portaudio-deps.sh
```

The script performs an `apt-get update` and installs `libportaudio2`, `libportaudiocpp0`, and `portaudio19-dev` using sudo. If you're on a distro without apt-get, install the equivalent PortAudio development packages via your package manager before running `uv run pi-assistant`.
Create a .env file with your OpenAI API key:
```bash
cp .env.example .env
```

Then edit `.env` and add your actual API key:

```
OPENAI_API_KEY=your_actual_api_key_here
```
Get your API key from: https://platform.openai.com/api-keys
On first launch, if .env is missing or OPENAI_API_KEY is empty, the CLI prompts you to paste the key so the setup can continue without manual file edits.
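The first-launch fallback described above boils down to a simple decision: use the key from the environment when present, otherwise prompt for it. A minimal sketch (the function name and signature are illustrative, not the CLI's actual API):

```python
def resolve_api_key(env: dict[str, str], prompt=input) -> str:
    """Sketch of the first-launch fallback: return the configured key when
    one exists, otherwise ask the user to paste one so setup can continue."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    # No key on disk yet -- prompt interactively instead of failing.
    return prompt("Paste your OpenAI API key: ").strip()
```

Injecting `prompt` as a parameter keeps the helper testable without a TTY.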
See docs/cli.md for the complete CLI command and configuration guide, including execution modes, diagnostics, flags, environment variables, and wake-word tuning. The quick start remains:
```bash
uv run pi-assistant
```

Say "Hey Rhasspy stop" (or "Rhasspy stop") while the assistant is talking to immediately halt playback, clear the pending turn, and return to listening mode. Follow the CLI guide for verbose logging, simulated queries, assistant model overrides, and diagnostics such as `test-audio` and `test-websocket`.
When you need throughput counters (the per-100-chunk `[DEBUG] Processed …` lines), launch with `uv run pi-assistant -v --log-chunks` so only intentional debugging sessions see them.
pi_assistant.assistant.LLMResponder wraps the OpenAI Responses API and supports multiple presets. The CLI prompts for a default model/reasoning pair on first launch and stores the choice in .env, but you can override it per run via the flags above or by setting ASSISTANT_MODEL / ASSISTANT_REASONING_EFFORT.
| Preset | Model ID | Recommended use |
|---|---|---|
| `nano` (default) | `gpt-5-nano-2025-08-07` | Ultra-fast replies tuned for low, medium, or high reasoning. |
| `mini` | `gpt-5-mini-2025-08-07` | Balanced latency with optional minimal, low, medium, or high reasoning. |
| `5.1` | `gpt-5.1-2025-11-13` | Higher accuracy with none, low, medium, or high reasoning. |
| `4.1` | `gpt-4.1-2025-04-14` | Compatibility tier with reasoning disabled; ideal when you must stick to GPT-4-class models. |
- `ASSISTANT_REASONING_EFFORT` falls back to `low` when unset. Models without reasoning (e.g., GPT-4.1) reject CLI overrides and skip persisted values with a warning instead.
- `minimal` cannot be used while `ASSISTANT_WEB_SEARCH_ENABLED=1`.
- `ASSISTANT_SYSTEM_PROMPT`, `LOCATION_NAME`, `ASSISTANT_LANGUAGE`, and `TRANSCRIPTION_LANGUAGE` are sent as system messages so you can keep the assistant short, localized, and aware of the device's location.
- Set `ASSISTANT_WEB_SEARCH_ENABLED=0` to disable tool calls entirely, or leave it enabled to let GPT-5 issue web-search requests when the Responses API determines they are useful.
Two delivery paths are supported:
- `responses` (default): stream assistant audio directly from the Responses API when `ASSISTANT_TTS_RESPONSES_ENABLED=1`. The CLI automatically verifies whether the selected model supports the feature and falls back if the server rejects it.
- `local-tts`: request text first, then synthesize locally via the Audio API using `ASSISTANT_TTS_MODEL`, `ASSISTANT_TTS_VOICE`, `ASSISTANT_TTS_FORMAT`, and `ASSISTANT_TTS_SAMPLE_RATE`.
SpeechPlayer handles sample-rate mismatches and exposes a stop hook so voice commands can interrupt playback. The optional confirmation cue ("Got it.") is controlled via CONFIRMATION_CUE_ENABLED and CONFIRMATION_CUE_TEXT; the phrase is pre-rendered and cached so the tone plays instantly after each wake-word trigger.
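The stop-hook idea can be illustrated with a toy playback loop that checks an event between chunks (this is a sketch, not the real `SpeechPlayer` API):

```python
import threading
from collections.abc import Iterable

class ToyPlayer:
    """Illustrative stop-hook pattern: playback checks a threading.Event
    between chunks so another thread can interrupt mid-utterance."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self.played: list[bytes] = []

    def stop(self) -> None:
        # Called from another thread (e.g. the wake-word listener).
        self._stop.set()

    def play(self, chunks: Iterable[bytes]) -> bool:
        self._stop.clear()
        for chunk in chunks:
            if self._stop.is_set():
                return False  # interrupted: abandon the rest of the reply
            self.played.append(chunk)  # stand-in for writing to the device
        return True
```

Checking the event per chunk bounds the interruption latency to one chunk's duration.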
- Set `SIMULATED_QUERY_TEXT="What's the forecast today?"` in `.env` to auto-inject a prompt every time the pipeline starts, or use `--simulate-query` to test a single turn without speaking.
- Voice stop commands ("Hey Jarvis stop" / "Jarvis stop") signal the `SpeechPlayer` to halt audio and set `AUDIO_INPUT_DEVICE`-free capture back to listening mode with the current turn discarded.
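Matching such stop phrases against a transcript typically involves normalizing case and punctuation first. A minimal sketch (the real matching lives in the controller; these names are illustrative):

```python
import re

STOP_PHRASES = ("hey jarvis stop", "jarvis stop")

def is_stop_command(transcript: str) -> bool:
    """Lowercase, drop punctuation, collapse whitespace, then look for a
    known stop phrase anywhere in the normalized text."""
    text = re.sub(r"[^a-z\s]", " ", transcript.lower())
    text = " ".join(text.split())  # collapse runs of whitespace
    return any(phrase in text for phrase in STOP_PHRASES)
```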
Verbose logs are captured by default. Each `uv run pi-assistant` session:

- Writes a log file named with an ISO-8601 timestamp (e.g., `~/.cache/pi-assistant/logs/2024-11-30T14-03-12.123.log`).
- Mirrors console output timestamps in the log.
- Strips ANSI colors for readability.
- Run `uv run pi-assistant -v --log-chunks` to include the historical per-100-chunk counters in both the console and the captured log; omit the flag to keep verbose output quieter during normal development.

Files live under `~/.cache/pi-assistant/logs/` unless you override the folder with `VERBOSE_LOG_DIRECTORY=/path/to/dir`. Disable capture entirely via `VERBOSE_LOG_CAPTURE_ENABLED=0` to conserve space on constrained devices.
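The two transformations involved here, building a filesystem-safe timestamped filename and stripping ANSI color codes, might look like this (a sketch; the helper names are assumptions, not the repo's actual functions):

```python
import re
from datetime import datetime
from pathlib import Path

ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")  # matches color/style escape codes

def log_path(base: Path, now: datetime) -> Path:
    """Build a timestamped name like 2024-11-30T14-03-12.123.log
    (colons replaced with hyphens so the name is filesystem-safe)."""
    stamp = now.strftime("%Y-%m-%dT%H-%M-%S.") + f"{now.microsecond // 1000:03d}"
    return base / f"{stamp}.log"

def strip_ansi(line: str) -> str:
    """Remove color escape sequences before mirroring a console line to disk."""
    return ANSI_SGR.sub("", line)
```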
Ruff is configured via ruff.toml to handle both formatting and linting.
```bash
# Format the codebase
uv run ruff format .

# Run lint checks (apply autofixes when possible)
uv run ruff check --fix .

# Run the Git hook suite manually
uv run pre-commit run --all-files
```

These commands automatically exclude generated artifacts such as `.venv/` and `tests/manual/test_recording.wav`.
Pyright enforces static typing using pyrightconfig.json, mirroring the same exclusions.
```bash
# Run a full type-checking pass
uv run pyright

# Keep Pyright running in watch mode while editing
uv run pyright --watch
```

Before sending a pull request, run `uv run pyright && uv run pytest` so CI sees the same status you validated locally.
If you use VS Code, install the recommended Pylance extension (added via .vscode/extensions.json) so you get fast type checking, inline docstrings, and completion hints that mirror the repo's configuration.
On Raspberry Pi, install the required system libraries:
```bash
# Update system
sudo apt-get update

# Install audio libraries
sudo apt-get install -y libportaudio2 portaudio19-dev python3-dev

# Install Python pip
sudo apt-get install -y python3-pip

# Optional: set CPU governor to performance
sudo cpufreq-set -g performance
```

```bash
# List available microphones
arecord -l

# Test microphone recording (replace hw:X,Y with your device)
arecord --device=hw:1,0 --format S16_LE --rate 24000 -c 1 test.wav
```

Defaults live in `config/defaults.toml` and are surfaced via `pi_assistant.config`. Detailed environment-variable descriptions (assistant tuning, wake-word overrides, simulated queries, logging knobs, etc.) are documented in docs/cli.md.
If the assistant acknowledges a wake word before you're done speaking, increase `SERVER_STOP_MIN_SILENCE_SECONDS` in `.env` (default 0.75) so the controller ignores premature server VAD stop events until it hears a longer stretch of local silence.
If the controller logs "[TURN] Awaiting server confirmation before finalizing turn." indefinitely, lower `SERVER_STOP_TIMEOUT_SECONDS` (default 5) to make it fall back sooner, or raise it if your network regularly delays the server's `speech_stopped` acknowledgment. Setting the value to 0 disables the timeout if you would rather wait indefinitely for the server.
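The timeout semantics just described (wait for the server's acknowledgment, fall back on timeout, 0 means wait forever) can be sketched with `asyncio.wait_for`; the function name and event-based signaling are illustrative:

```python
import asyncio

async def await_server_stop(stopped: asyncio.Event, timeout: float) -> bool:
    """Wait for the server's speech_stopped acknowledgment. Returns True if
    it arrived, False on timeout; a timeout of 0 waits indefinitely."""
    if timeout <= 0:
        await stopped.wait()
        return True
    try:
        await asyncio.wait_for(stopped.wait(), timeout)
        return True
    except asyncio.TimeoutError:
        return False  # fall back to finalizing the turn locally
```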
```text
pi-assistant/
├── assets/                  # Shared fixtures (prompt snippets, sample WAVs)
├── config/
│   └── defaults.toml        # Baseline runtime configuration
├── docs/                    # CLI + architecture references
├── logs/
│   ├── diagnostics/         # Hardware checks
│   └── runtime/             # Verbose CLI sessions
├── models/
│   ├── wake_word/           # Bundled openWakeWord artifacts
│   ├── tts/
│   └── llm/
├── scripts/
│   ├── provisioning/        # Hardware + dependency setup
│   └── tooling/             # Repo automation helpers
├── src/
│   └── pi_assistant/
│       ├── assistant/
│       │   ├── llm.py
│       │   ├── session/services/
│       │   ├── transcription/
│       │   └── transcripts/
│       ├── audio/
│       │   ├── capture/
│       │   ├── playback/
│       │   ├── processing/
│       │   └── wake_word/
│       ├── cli/
│       │   ├── app.py
│       │   ├── controller/
│       │   └── logging.py
│       ├── core/
│       │   └── exceptions.py
│       ├── diagnostics/
│       ├── network/websocket/
│       └── config/
├── tests/
│   ├── assistant/
│   ├── audio/
│   ├── cli/
│   ├── config/
│   ├── diagnostics/
│   ├── network/
│   ├── wake_word/
│   └── manual/
├── README.md
└── pyproject.toml
```
The `tests/wake_word/test_wake_word.py` regression uses the generated `tests/hey_jarvis.wav` fixture (and explicitly loads the Jarvis model) to ensure the detector fires exactly once regardless of which wake word the CLI currently selects. Run it with:

```bash
uv run python -m unittest tests/wake_word/test_wake_word.py
```

Note: the wake-word test is skipped automatically when `openwakeword` (and its runtimes) are unavailable.
You can run the full test suite through uv (no manual activation required):

```bash
uv run pytest
```

Add `-v` for verbose output:

```bash
uv run pytest -v
```

To generate coverage reports (powered by pytest-cov):

```bash
uv run pytest --cov
```

Async tests rely on pytest-asyncio; no extra setup is needed when using `uv sync --group dev`.
Virtual environment not activated:

```bash
source .venv/bin/activate
```

WebSocket connection errors:

- Check API key in `.env` file
- Verify internet connection
- Ensure Realtime API access
No audio devices found:

```bash
# List microphones (Raspberry Pi)
arecord -l
```

Error querying device -1:

- Run `uv run pi-assistant test-audio` to confirm sounddevice can capture samples.
- Use `arecord -l` (or `sd.query_devices()` in Python) to note the correct ALSA card/index.
- Export `AUDIO_INPUT_DEVICE=<index-or-name>` so the client selects the right microphone.
`Microphone <name> does not support SAMPLE_RATE=24000 Hz`:

- The client now probes the hardware and, when possible, saves the microphone's preferred rate (for example 44100/48000 Hz) to `.env`. Look for the `[INFO] … Saved SAMPLE_RATE to .env` log, then restart the assistant so the new value takes effect.
- If the device does not report any supported rate, set `SAMPLE_RATE` manually (keep `STREAM_SAMPLE_RATE=24000` so capture audio is resampled before streaming) and re-run `uv run pi-assistant -v`.
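The capture-side resampling mentioned here (e.g. a 48 000 Hz hardware rate down to the 24 kHz streaming rate) can be sketched with naive linear interpolation; the real pipeline may well use a higher-quality resampler:

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive linear-interpolation resampler for mono PCM (illustrative only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    out: list[float] = []
    for i in range(out_len):
        # Map each output index back to a fractional source position.
        pos = i * (len(samples) - 1) / max(1, out_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```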
Licensing for bundled wake-word models:
- `models/wake_word/alexa_v0.1.(onnx|tflite)`, `models/wake_word/hey_jarvis_v0.1.(onnx|tflite)`, `models/wake_word/hey_rhasspy_v0.1.(onnx|tflite)`, `models/wake_word/melspectrogram.onnx`, and `models/wake_word/embedding_model.onnx` are distributed under the Apache 2.0 license from the openWakeWord project.
Microphone permission (macOS): System Settings → Privacy & Security → Microphone
This project is for educational and development purposes.
For issues and questions, refer to:
- OpenAI Realtime API docs: https://platform.openai.com/docs/guides/realtime
- Plan document: docs/wake-word.md