Real-time speech-to-text transcription system for Raspberry Pi 5 that streams audio from a USB microphone to OpenAI's Realtime API, forwards finalized turns to a GPT-5-based assistant, and can speak or print the reply in the terminal.
- Real-time audio capture from USB microphone with wake-word gating, pre-roll, and silence auto-stop.
- Streams to OpenAI Realtime API for transcription, then routes each completed turn to GPT-5 Mini or GPT-5.1 with adjustable reasoning effort.
- Assistant replies can stream directly from the Responses API or fall back to local Audio API TTS with automatic sample-rate selection.
- Optional web-search tool calls, fixed system prompt, and location/language hints so the assistant stays on-topic.
- Voice stop commands (e.g., "Hey Jarvis stop") interrupt playback and clear the pending turn.
- Verbose logging captures wake-word scores, state transitions, and all console output in timestamped files under `~/.cache/pi-assistant/logs/`.
- Optimized for Raspberry Pi 5, with 24 kHz mono PCM audio throughout the capture pipeline.
- Raspberry Pi 5
- USB microphone (plug-and-play, ALSA compatible)
- USB speaker
- Python 3.11+
- uv (https://docs.astral.sh/uv/)
- OpenAI API key
```bash
cd /path/to/pi-assistant
curl -LsSf https://astral.sh/uv/install.sh | sh

# Runtime dependencies only
uv sync

# Include Ruff + pre-commit (optional dev tools)
uv sync --group dev
uv run pre-commit install
```

`uv sync` creates a managed `.venv/` if one does not already exist and keeps it up to date as dependencies change.
If your Raspberry Pi workflow needs Python 3.11 without touching the system installation, let uv download a portable interpreter that only applies inside this repo:
```bash
uv python pin 3.11
```

This stores the interpreter under `.uv/python/...` and records the version in `.python-version`. Commit `.python-version`, keep `.uv/` ignored, and every uv command you run here (including `uv sync`) will automatically use that local 3.11 build so packages like `tflite-runtime` install cleanly. To switch later, run `uv python pin <other-version>` or remove `.python-version`.
sounddevice depends on the PortAudio shared libraries, which Raspberry Pi OS does not install by default. Run the helper script once per Pi to pull in the required packages:
```bash
./scripts/provisioning/install-portaudio-deps.sh
```

The script performs an `apt-get update` and installs `libportaudio2`, `libportaudiocpp0`, and `portaudio19-dev` using sudo. If you're on a distro without apt-get, install the equivalent PortAudio development packages via your package manager before running `uv run pi-assistant`.
Create a .env file with your OpenAI API key:
```bash
cp .env.example .env
```

Then edit `.env` and add your actual API key:

```
OPENAI_API_KEY=your_actual_api_key_here
```
Get your API key from: https://platform.openai.com/api-keys
On first launch, if .env is missing or OPENAI_API_KEY is empty, the CLI prompts you to paste the key so the setup can continue without manual file edits.
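The first-launch fallback described above boils down to a simple decision: use the key from the environment when present, otherwise prompt for it. A minimal sketch (the function name and signature are illustrative, not the CLI's actual API):

```python
def resolve_api_key(env: dict[str, str], prompt=input) -> str:
    """Sketch of the first-launch fallback: return the configured key when
    one exists, otherwise ask the user to paste one so setup can continue."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    # No key on disk yet -- prompt interactively instead of failing.
    return prompt("Paste your OpenAI API key: ").strip()
```

Injecting `prompt` as a parameter keeps the helper testable without a TTY.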
See docs/cli.md for the complete CLI command and configuration guide, including execution modes, diagnostics, flags, environment variables, and wake-word tuning. The quick start remains:
```bash
uv run pi-assistant
```

Say "Hey Rhasspy stop" (or "Rhasspy stop") while the assistant is talking to immediately halt playback, clear the pending turn, and return to listening mode. Follow the CLI guide for verbose logging, simulated queries, assistant model overrides, and diagnostics such as `test-audio` and `test-websocket`.
When you need throughput counters (the per-100-chunk `[DEBUG] Processed …` lines), launch with `uv run pi-assistant -v --log-chunks` so only intentional debugging sessions see them.
pi_assistant.assistant.LLMResponder wraps the OpenAI Responses API and supports multiple presets. The CLI prompts for a default model/reasoning pair on first launch and stores the choice in .env, but you can override it per run via the flags above or by setting ASSISTANT_MODEL / ASSISTANT_REASONING_EFFORT.
| Preset | Model ID | Recommended use |
|---|---|---|
| `nano` (default) | `gpt-5-nano-2025-08-07` | Ultra-fast replies tuned for low, medium, or high reasoning. |
| `mini` | `gpt-5-mini-2025-08-07` | Balanced latency with optional minimal, low, medium, or high reasoning. |
| `5.1` | `gpt-5.1-2025-11-13` | Higher accuracy with none, low, medium, or high reasoning. |
| `4.1` | `gpt-4.1-2025-04-14` | Compatibility tier with reasoning disabled; ideal when you must stick to GPT-4-class models. |
- `ASSISTANT_REASONING_EFFORT` falls back to `low` when unset. Models without reasoning (e.g., GPT-4.1) reject CLI overrides and skip persisted values with a warning instead.
- `minimal` cannot be used while `ASSISTANT_WEB_SEARCH_ENABLED=1`.
- `ASSISTANT_SYSTEM_PROMPT`, `LOCATION_NAME`, `ASSISTANT_LANGUAGE`, and `TRANSCRIPTION_LANGUAGE` are sent as system messages so you can keep the assistant short, localized, and aware of the device's location.
- Set `ASSISTANT_WEB_SEARCH_ENABLED=0` to disable tool calls entirely, or leave it enabled to let GPT-5 issue web-search requests when the Responses API determines they are useful.
Two delivery paths are supported:
- `responses` (default): stream assistant audio directly from the Responses API when `ASSISTANT_TTS_RESPONSES_ENABLED=1`. The CLI automatically verifies whether the selected model supports the feature and falls back if the server rejects it.
- `local-tts`: request text first, then synthesize locally via the Audio API using `ASSISTANT_TTS_MODEL`, `ASSISTANT_TTS_VOICE`, `ASSISTANT_TTS_FORMAT`, and `ASSISTANT_TTS_SAMPLE_RATE`.
SpeechPlayer handles sample-rate mismatches and exposes a stop hook so voice commands can interrupt playback. The optional confirmation cue ("Got it.") is controlled via CONFIRMATION_CUE_ENABLED and CONFIRMATION_CUE_TEXT; the phrase is pre-rendered and cached so the tone plays instantly after each wake-word trigger.
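The stop-hook idea can be illustrated with a toy playback loop that checks an event between chunks (this is a sketch, not the real `SpeechPlayer` API):

```python
import threading
from collections.abc import Iterable

class ToyPlayer:
    """Illustrative stop-hook pattern: playback checks a threading.Event
    between chunks so another thread can interrupt mid-utterance."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self.played: list[bytes] = []

    def stop(self) -> None:
        # Called from another thread (e.g. the wake-word listener).
        self._stop.set()

    def play(self, chunks: Iterable[bytes]) -> bool:
        self._stop.clear()
        for chunk in chunks:
            if self._stop.is_set():
                return False  # interrupted: abandon the rest of the reply
            self.played.append(chunk)  # stand-in for writing to the device
        return True
```

Checking the event per chunk bounds the interruption latency to one chunk's duration.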
- Set `SIMULATED_QUERY_TEXT="What's the forecast today?"` in `.env` to auto-inject a prompt every time the pipeline starts, or use `--simulate-query` to test a single turn without speaking.
- Voice stop commands ("Hey Jarvis stop" / "Jarvis stop") signal the `SpeechPlayer` to halt audio and set `AUDIO_INPUT_DEVICE`-free capture back to listening mode with the current turn discarded.
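Matching such stop phrases against a transcript typically involves normalizing case and punctuation first. A minimal sketch (the real matching lives in the controller; these names are illustrative):

```python
import re

STOP_PHRASES = ("hey jarvis stop", "jarvis stop")

def is_stop_command(transcript: str) -> bool:
    """Lowercase, drop punctuation, collapse whitespace, then look for a
    known stop phrase anywhere in the normalized text."""
    text = re.sub(r"[^a-z\s]", " ", transcript.lower())
    text = " ".join(text.split())  # collapse runs of whitespace
    return any(phrase in text for phrase in STOP_PHRASES)
```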
Verbose logs are captured by default. Each `uv run pi-assistant` session:

- Writes a log file named with an ISO-8601 timestamp (e.g., `~/.cache/pi-assistant/logs/2024-11-30T14-03-12.123.log`).
- Mirrors console output timestamps in the log.
- Strips ANSI colors for readability.
- Run `uv run pi-assistant -v --log-chunks` to include the historical per-100-chunk counters in both the console and the captured log; omit the flag to keep verbose output quieter during normal development.

Files live under `~/.cache/pi-assistant/logs/` unless you override the folder with `VERBOSE_LOG_DIRECTORY=/path/to/dir`. Disable capture entirely via `VERBOSE_LOG_CAPTURE_ENABLED=0` to conserve space on constrained devices.
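The two transformations involved here, building a filesystem-safe timestamped filename and stripping ANSI color codes, might look like this (a sketch; the helper names are assumptions, not the repo's actual functions):

```python
import re
from datetime import datetime
from pathlib import Path

ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")  # matches color/style escape codes

def log_path(base: Path, now: datetime) -> Path:
    """Build a timestamped name like 2024-11-30T14-03-12.123.log
    (colons replaced with hyphens so the name is filesystem-safe)."""
    stamp = now.strftime("%Y-%m-%dT%H-%M-%S.") + f"{now.microsecond // 1000:03d}"
    return base / f"{stamp}.log"

def strip_ansi(line: str) -> str:
    """Remove color escape sequences before mirroring a console line to disk."""
    return ANSI_SGR.sub("", line)
```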
Ruff is configured via ruff.toml to handle both formatting and linting.
```bash
# Format the codebase
uv run ruff format .

# Run lint checks (apply autofixes when possible)
uv run ruff check --fix .

# Run the Git hook suite manually
uv run pre-commit run --all-files
```

These commands automatically exclude generated artifacts such as `.venv/` and `tests/manual/test_recording.wav`.
Pyright enforces static typing using pyrightconfig.json, mirroring the same exclusions.
```bash
# Run a full type-checking pass
uv run pyright

# Keep Pyright running in watch mode while editing
uv run pyright --watch
```

Before sending a pull request, run `uv run pyright && uv run pytest` so CI sees the same status you validated locally.
If you use VS Code, install the recommended Pylance extension (added via .vscode/extensions.json) so you get fast type checking, inline docstrings, and completion hints that mirror the repo's configuration.
On Raspberry Pi, install the required system libraries:
```bash
# Update system
sudo apt-get update

# Install audio libraries
sudo apt-get install -y libportaudio2 portaudio19-dev python3-dev

# Install Python pip
sudo apt-get install -y python3-pip

# Optional: set CPU governor to performance
sudo cpufreq-set -g performance
```

```bash
# List available microphones
arecord -l

# Test microphone recording (replace hw:X,Y with your device)
arecord --device=hw:1,0 --format S16_LE --rate 24000 -c 1 test.wav
```

Defaults live in `config/defaults.toml` and are surfaced via `pi_assistant.config`. Detailed environment-variable descriptions (assistant tuning, wake-word overrides, simulated queries, logging knobs, etc.) are documented in docs/cli.md.
If the assistant acknowledges a wake word before you're done speaking, increase `SERVER_STOP_MIN_SILENCE_SECONDS` in `.env` (default 0.75) so the controller ignores premature server VAD stop events until it hears a longer stretch of local silence.
If the controller logs "[TURN] Awaiting server confirmation before finalizing turn." indefinitely, lower `SERVER_STOP_TIMEOUT_SECONDS` (default 5) to make it fall back sooner, or raise it if your network regularly delays the server's `speech_stopped` acknowledgment. Setting the value to 0 disables the timeout if you would rather wait indefinitely for the server.
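The timeout semantics just described (wait for the server's acknowledgment, fall back on timeout, 0 means wait forever) can be sketched with `asyncio.wait_for`; the function name and event-based signaling are illustrative:

```python
import asyncio

async def await_server_stop(stopped: asyncio.Event, timeout: float) -> bool:
    """Wait for the server's speech_stopped acknowledgment. Returns True if
    it arrived, False on timeout; a timeout of 0 waits indefinitely."""
    if timeout <= 0:
        await stopped.wait()
        return True
    try:
        await asyncio.wait_for(stopped.wait(), timeout)
        return True
    except asyncio.TimeoutError:
        return False  # fall back to finalizing the turn locally
```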
```text
pi-assistant/
├── assets/                  # Shared fixtures (prompt snippets, sample WAVs)
├── config/
│   └── defaults.toml        # Baseline runtime configuration
├── docs/                    # CLI + architecture references
├── logs/
│   ├── diagnostics/         # Hardware checks
│   └── runtime/             # Verbose CLI sessions
├── models/
│   ├── wake_word/           # Bundled openWakeWord artifacts
│   ├── tts/
│   └── llm/
├── scripts/
│   ├── provisioning/        # Hardware + dependency setup
│   └── tooling/             # Repo automation helpers
├── src/
│   └── pi_assistant/
│       ├── assistant/
│       │   ├── llm.py
│       │   ├── session/services/
│       │   ├── transcription/
│       │   └── transcripts/
│       ├── audio/
│       │   ├── capture/
│       │   ├── playback/
│       │   ├── processing/
│       │   └── wake_word/
│       ├── cli/
│       │   ├── app.py
│       │   ├── controller/
│       │   └── logging.py
│       ├── core/
│       │   └── exceptions.py
│       ├── diagnostics/
│       ├── network/websocket/
│       └── config/
├── tests/
│   ├── assistant/
│   ├── audio/
│   ├── cli/
│   ├── config/
│   ├── diagnostics/
│   ├── network/
│   ├── wake_word/
│   └── manual/
├── README.md
└── pyproject.toml
```
The `tests/wake_word/test_wake_word.py` regression uses the generated `tests/hey_jarvis.wav` fixture (and explicitly loads the Jarvis model) to ensure the detector fires exactly once regardless of which wake word the CLI currently selects. Run it with:

```bash
uv run python -m unittest tests/wake_word/test_wake_word.py
```

Note: the wake-word test is skipped automatically when `openwakeword` (and its runtimes) are unavailable.
You can run the full test suite through uv (no manual activation required):

```bash
uv run pytest
```

Add `-v` for verbose output:

```bash
uv run pytest -v
```

To generate coverage reports (powered by pytest-cov):

```bash
uv run pytest --cov
```

Async tests rely on pytest-asyncio; no extra setup is needed when using `uv sync --group dev`.
Virtual environment not activated:

```bash
source .venv/bin/activate
```

WebSocket connection errors:

- Check API key in `.env` file
- Verify internet connection
- Ensure Realtime API access
No audio devices found:

```bash
# List microphones (Raspberry Pi)
arecord -l
```

Error querying device -1:

- Run `uv run pi-assistant test-audio` to confirm sounddevice can capture samples.
- Use `arecord -l` (or `sd.query_devices()` in Python) to note the correct ALSA card/index.
- Export `AUDIO_INPUT_DEVICE=<index-or-name>` so the client selects the right microphone.
`Microphone <name> does not support SAMPLE_RATE=24000 Hz`:

- The client now probes the hardware and, when possible, saves the microphone's preferred rate (for example 44100/48000 Hz) to `.env`. Look for the `[INFO] … Saved SAMPLE_RATE to .env` log, then restart the assistant so the new value takes effect.
- If the device does not report any supported rate, set `SAMPLE_RATE` manually (keep `STREAM_SAMPLE_RATE=24000` so capture audio is resampled before streaming) and re-run `uv run pi-assistant -v`.
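The capture-side resampling mentioned here (e.g. a 48 000 Hz hardware rate down to the 24 kHz streaming rate) can be sketched with naive linear interpolation; the real pipeline may well use a higher-quality resampler:

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive linear-interpolation resampler for mono PCM (illustrative only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    out: list[float] = []
    for i in range(out_len):
        # Map each output index back to a fractional source position.
        pos = i * (len(samples) - 1) / max(1, out_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```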
Licensing for bundled wake-word models:
- `models/wake_word/alexa_v0.1.(onnx|tflite)`, `models/wake_word/hey_jarvis_v0.1.(onnx|tflite)`, `models/wake_word/hey_rhasspy_v0.1.(onnx|tflite)`, `models/wake_word/melspectrogram.onnx`, and `models/wake_word/embedding_model.onnx` are distributed under the Apache 2.0 license from the openWakeWord project.
Microphone permission (macOS): System Settings → Privacy & Security → Microphone
This project is for educational and development purposes.
For issues and questions, refer to:
- OpenAI Realtime API docs: https://platform.openai.com/docs/guides/realtime
- Plan document: docs/wake-word.md