Technical architecture reference for VoxWatch, an AI-powered security audio deterrent system that detects persons on camera via Frigate and delivers escalating vocal warnings through camera speakers.
VoxWatch runs as two Docker containers on host networking:
| Container | Stack | Resources | Purpose |
|---|---|---|---|
| `voxwatch` | Python 3.11 | 512MB / 2 CPU | Core detection and audio pipeline |
| `voxwatch-dashboard` | React 18 + FastAPI | 256MB / 1 CPU | Web UI and setup wizard |
Core ports:
| Port | Service | Container |
|---|---|---|
| 33344 | Dashboard (FastAPI + React SPA) | voxwatch-dashboard |
| 8891 | Audio HTTP server (serves files to go2rtc) | voxwatch |
| 8892 | Preview API (internal aiohttp) | voxwatch |
Both containers use network_mode: host to access Frigate, go2rtc, and MQTT on localhost.
┌─────────────┐
│ Frigate │
│ (Detection) │
└──────┬───────┘
│ MQTT: frigate/events
▼
┌───────────────────────────────────────────────────────────────┐
│ VoxWatch Service │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ MQTT Client │──▶│ 3-Stage │──▶│ Audio Pipeline │ │
│ │ (subscriber) │ │ Pipeline │ │ TTS → ffmpeg → │ │
│ └─────────────┘ │ │ │ HTTP → go2rtc │ │
│ │ AI Vision ──┤ └────────┬──────────┘ │
│ │ (7 provs) │ │ │
│ └──────────────┘ │ │
│ │ │
│ ┌──────────────────┐ │ │
│ │ MQTT Publisher │◀── events, status ───────┤ │
│ └────────┬─────────┘ │ │
└───────────┼─────────────────────────────────────┼──────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌────────────────┐
│ Home Assistant │ │ go2rtc │
│ (automations) │ │ (backchannel) │
└────────┬─────────┘ └───────┬────────┘
│ │
│ MQTT: voxwatch/announce ▼
└──────────▶ VoxWatch ──▶ TTS ──▶ Camera Speaker
Three integration paths:
- Detection flow: Frigate → MQTT → VoxWatch → AI + TTS → go2rtc → Camera Speaker
- Event publishing: VoxWatch → MQTT events → Home Assistant
- Announcement flow: Home Assistant → `voxwatch/announce` → VoxWatch → TTS → Camera Speaker
Stage 1 (pre-cached warning):
- Plays a pre-cached warning message immediately on person detection
- Backchannel warmup (silent push + 2s wait) runs concurrently with the start of AI analysis
- No AI analysis required; fixed latency, highly reliable
Stage 2 (AI description):
- 3 snapshots captured and sent to the AI vision provider
- AI generates a context-aware description of the person and situation
- Natural cadence speech: phrase-level pauses based on punctuation, speed variation
- Response mode templates with 8 substitution variables (see Response Modes)
- Queued after Stage 1 audio completes
Stage 3 (escalation):
- Video clips sent when supported (Gemini); snapshots as a fallback for other providers
- Person-still-present check via Frigate API before executing
- Escalated warning based on behavioral analysis
- Only triggers if person remains after Stage 2
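
The stage ordering above can be sketched in a few lines of asyncio. This is an illustration only, not VoxWatch's actual code: `play`, `describe`, and `person_present` are hypothetical callables standing in for the audio push, the AI vision call, and the Frigate presence check, and the message strings are placeholders.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_pipeline(
    play: Callable[[str], Awaitable[None]],
    describe: Callable[[], Awaitable[str]],
    person_present: Callable[[], Awaitable[bool]],
) -> List[str]:
    """Run the three stages in order and return the stages that fired."""
    fired = []

    # Start AI analysis early so it overlaps with the Stage 1 audio.
    analysis = asyncio.create_task(describe())

    # Stage 1: pre-cached message, no AI involved.
    await play("stage1-cached-warning")
    fired.append("stage1")

    # Stage 2: queued behind Stage 1, speaks the AI description.
    await play(await analysis)
    fired.append("stage2")

    # Stage 3: runs only if the person is still present.
    if await person_present():
        await play("stage3-escalation")
        fired.append("stage3")
    return fired
```

Starting the analysis task before the Stage 1 audio is what keeps Stage 1 latency fixed while Stage 2 still gets its description as early as possible.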
Seven AI vision providers with an automatic fallback chain:
| Provider | Type | Video Support | Notes |
|---|---|---|---|
| Gemini | Cloud | Yes (clips) | Primary recommended provider |
| OpenAI | Cloud | No (snapshots) | GPT-4 Vision |
| Anthropic Claude | Cloud | No (snapshots) | Claude vision |
| xAI Grok | Cloud | No (snapshots) | Grok vision |
| Ollama | Local | No (snapshots) | Self-hosted, offline capable |
| Custom OpenAI-compatible | Cloud/Local | No (snapshots) | Any OpenAI-API-compatible endpoint |
| Fallback | N/A | N/A | Generic pre-cached message |
- Automatic fallback chain: if the configured provider fails, the next available provider is tried
- Nightvision-aware prompts: when IR mode is detected, prompts instruct the AI to avoid color descriptions
Seven TTS providers with an automatic fallback chain:
| Provider | Type | Quality | Notes |
|---|---|---|---|
| Kokoro | Local | High | Neural TTS |
| Piper | Local | High | Neural TTS, pre-installed voice |
| ElevenLabs | Cloud | Very High | Premium voice cloning |
| Cartesia | Cloud | High | Low-latency streaming |
| Amazon Polly | Cloud | High | AWS neural voices |
| OpenAI TTS | Cloud | Very High | Multiple voice options |
| espeak-ng | Local | Low | Always available, robotic fallback |
Natural cadence speech system:
- Phrase-level pauses inserted based on punctuation (commas, periods, ellipses)
- Speed variation across phrases for natural rhythm
- Audio postprocessing: loudnorm (EBU R128), compression, silence trimming
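
As a rough illustration of the phrase-level pauses, the sketch below splits text at punctuation and attaches a pause to each phrase. The pause lengths in `PAUSE_MS` are invented for the example, and speed variation is not covered.

```python
import re
from typing import List, Tuple

# Hypothetical pause lengths in milliseconds; the real values are not documented here.
PAUSE_MS = {",": 150, ".": 350, "!": 350, "?": 350, "...": 600}

# A run of non-punctuation followed by an optional mark; "..." is tried
# first so an ellipsis is not read as a single period.
_PHRASE_RE = re.compile(r"([^,.!?]+)(\.\.\.|[,.!?])?")


def split_phrases(text: str) -> List[Tuple[str, int]]:
    """Split text into (phrase, pause_after_ms) pairs at punctuation."""
    phrases = []
    for body, mark in _PHRASE_RE.findall(text):
        body = body.strip()
        if body:
            phrases.append((body, PAUSE_MS.get(mark, 0)))
    return phrases
```

Each phrase could then be synthesized separately and joined with silence of the given length.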
14 built-in modes plus custom mode support, loaded from voxwatch/modes/loader.py.
Mode resolution order:
- Camera-specific override (`response_modes.camera_overrides`)
- `active_mode` setting
- `response_mode.name` setting
- Default: `standard`
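
The resolution order can be expressed as a short lookup. This sketch assumes the configuration is a nested dict and that `camera_overrides` maps a camera name to a mode name; neither detail is confirmed by this document.

```python
def resolve_mode(config: dict, camera: str) -> str:
    """Return the response mode for a camera using the documented order."""
    overrides = config.get("response_modes", {}).get("camera_overrides", {})
    candidates = (
        overrides.get(camera),                        # 1. camera-specific override
        config.get("active_mode"),                    # 2. active_mode setting
        config.get("response_mode", {}).get("name"),  # 3. response_mode.name setting
    )
    # The first non-empty candidate wins; otherwise use the default mode.
    return next((mode for mode in candidates if mode), "standard")
```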
AI description variables (8):
| Variable | Description |
|---|---|
| `{clothing_description}` | What the person is wearing |
| `{location_on_property}` | Where on the property they are |
| `{behavior_description}` | What they are doing |
| `{suspect_count}` | Number of persons detected |
| `{address_street}` | Street address |
| `{address_full}` | Full address |
| `{time_of_day}` | Current time context |
| `{camera_name}` | Name of the detecting camera |
Persona customization: mood presets, system names, guard dog names, operator names, surveillance presets.
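
To show how the eight template variables listed above could be filled in, here is a minimal sketch built on `str.format_map`. Treating a missing variable as an empty string is an assumption made for the example, not documented behavior.

```python
from collections import defaultdict
from typing import Dict

VARIABLES = (
    "clothing_description", "location_on_property", "behavior_description",
    "suspect_count", "address_street", "address_full", "time_of_day",
    "camera_name",
)


def render_template(template: str, values: Dict[str, str]) -> str:
    """Fill the documented variables; missing ones become empty strings."""
    # Ignore any keys that are not among the eight documented variables.
    known = {key: value for key, value in values.items() if key in VARIABLES}
    # defaultdict(str) yields "" for placeholders that have no value.
    return template.format_map(defaultdict(str, known)).strip()
```

For example, `render_template("Person in {clothing_description} at {camera_name}.", {...})` substitutes both fields and leaves the rest of the sentence untouched.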
Specialized dispatch mode (police_dispatch) that simulates radio communications.
Multi-segment architecture:
- Channel intro
- Main dispatch (location/description, crime-in-progress)
- Squelch pauses between segments
- Officer response
Radio effects processing:
- Bandpass filtering (300-3400Hz telephone band)
- Dynamic compression
- Radio noise overlay
- Squelch sound effects
Configurable parameters: address, agency name, callsign, officer voice, radio intensity level.
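
The bandpass and compression steps map naturally onto ffmpeg's `highpass`, `lowpass`, and `acompressor` audio filters. The sketch below builds a filter string for `ffmpeg -af`; the mapping from intensity to compression ratio is invented for illustration, and the noise overlay and squelch effects are left out.

```python
def radio_filter_chain(intensity: float = 0.5) -> str:
    """Build an ffmpeg -af string approximating a two-way-radio sound."""
    intensity = min(max(intensity, 0.0), 1.0)
    # Hypothetical mapping: higher intensity means heavier compression.
    ratio = 2 + round(intensity * 6)
    return ",".join([
        "highpass=f=300",              # cut below the telephone band
        "lowpass=f=3400",              # cut above the telephone band
        f"acompressor=ratio={ratio}",  # dynamic compression
    ])
```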
TTS Output
│
▼
ffmpeg codec conversion (PCM 16-bit 44.1kHz mono → PCMU 8kHz or PCMA 8kHz)
│
▼
Optional attention tone prepend
│
▼
Audio HTTP server (port 8891)
│
▼
go2rtc fetches file via HTTP → pushes to camera backchannel
│
▼
Camera Speaker
Working format: PCM 16-bit 44.1kHz mono internally, converted to target camera codec (PCMU/G.711 mu-law at 8kHz or PCMA/G.711 A-law at 8kHz).
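
A conversion to the camera codec could be driven by an ffmpeg command like the one built below. The function name and the choice of a raw G.711 output container are assumptions; the flags themselves (`-ar`, `-ac`, `-c:a pcm_mulaw`/`pcm_alaw`, `-f mulaw`/`alaw`) are standard ffmpeg options.

```python
from typing import List

# Camera codec name -> (ffmpeg encoder, ffmpeg output format)
CODECS = {"pcmu": ("pcm_mulaw", "mulaw"), "pcma": ("pcm_alaw", "alaw")}


def ffmpeg_args(src: str, dst: str, codec: str = "pcmu") -> List[str]:
    """Return an ffmpeg command converting src to 8 kHz mono G.711."""
    encoder, container = CODECS[codec.lower()]
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ar", "8000",    # resample to 8 kHz
        "-ac", "1",       # mono
        "-c:a", encoder,  # G.711 mu-law or A-law
        "-f", container,  # raw G.711 stream (assumed output container)
        dst,
    ]
```

The list can be passed directly to `subprocess.run(..., check=True)`.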
Backchannel warmup: Silent audio push + 2-second wait before real audio. Required for Reolink cameras to initialize the RTSP backchannel.
Per-camera push locks: asyncio.Lock per camera prevents overlapping audio pushes.
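
The warmup and locking behavior can be combined in one helper, sketched below under stated assumptions: `push` is a hypothetical coroutine that asks go2rtc to play a URL on a camera, and `warmup_url` points at a silent clip.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Optional

# One lock per camera name, created on first use.
_push_locks = defaultdict(asyncio.Lock)


async def push_audio(
    camera: str,
    url: str,
    push: Callable[[str, str], Awaitable[None]],
    warmup_url: Optional[str] = None,
    warmup_wait: float = 2.0,
) -> None:
    """Push audio to one camera, serialized against other pushes to it."""
    async with _push_locks[camera]:
        if warmup_url is not None:
            # Silent clip first, then wait for the backchannel to come up.
            await push(camera, warmup_url)
            await asyncio.sleep(warmup_wait)
        await push(camera, url)
```

Because the lock is keyed by camera name, pushes to different cameras still run in parallel.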
Subscribed topics:
| Topic | Purpose | Filtering |
|---|---|---|
| `frigate/events` (configurable) | Person detection events | type: "new", label: "person", score >= min_score |
| `voxwatch/announce` | TTS announcements from Home Assistant | Camera name + message text |
Published topics:
| Topic | Purpose |
|---|---|
| `voxwatch/events/detection` | New person detection |
| `voxwatch/events/stage` | Pipeline stage transitions |
| `voxwatch/events/ended` | Detection event completed |
| `voxwatch/events/error` | Pipeline errors |
| `voxwatch/status` | Service status (LWT: online/offline, retained, QoS 1) |
All outbound payloads are JSON with event_id, timestamp, camera, and context-specific fields. Publishing is fire-and-forget and never blocks the detection pipeline.
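
The inbound filter on `frigate/events` can be sketched as a pure function. It assumes the Frigate event shape in which `type` sits at the top level and `label` and `score` sit under `after`; the default `min_score` of 0.7 is a placeholder.

```python
import json


def should_trigger(payload: bytes, min_score: float = 0.7) -> bool:
    """Return True for a new person event at or above min_score."""
    try:
        event = json.loads(payload)
    except ValueError:
        # Malformed JSON or undecodable bytes: ignore the message.
        return False
    if not isinstance(event, dict):
        return False
    after = event.get("after") or {}
    return (
        event.get("type") == "new"
        and after.get("label") == "person"
        and float(after.get("score") or 0.0) >= min_score
    )
```

Keeping the filter free of I/O makes it cheap to call from the MQTT callback.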
- Frontend: React 18 + TypeScript + Tailwind CSS + Vite
- Backend: FastAPI + Pydantic + aiohttp
| Router | Purpose |
|---|---|
| `audio.py` | Audio test, announce, preview proxy |
| `cameras.py` | Camera listing and configuration |
| `config.py` | Configuration read/write |
| `system.py` | System status and health |
| `wizard.py` | Setup wizard API |
| `setup.py` | Initial setup flow |
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/audio/test` | POST | Test audio push to a camera |
| `/api/audio/announce` | POST | Send TTS announcement |
| `/api/audio/preview` | POST | Preview TTS output |
| `/api/cameras` | GET | List configured cameras |
| `/api/config` | GET/PUT | Read/write configuration |
9-step auto-discovery flow:
- Frigate connection
- MQTT broker
- Camera discovery
- AI provider configuration
- TTS provider configuration
- Response mode selection
- Camera-specific configuration
- Review and confirm
- Apply and start
camera_db.py contains 7 known camera models with codec and backchannel parameters. Uses fuzzy matching on model strings. Falls back to ONVIF identification for unknown cameras.
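
Fuzzy matching on model strings could be done with the standard library's `difflib`, as in this sketch. The entries in `KNOWN_MODELS` are made-up placeholders (the real contents of `camera_db.py` are not shown here), and the 0.6 cutoff is an assumption.

```python
import difflib
from typing import Optional

# Placeholder entries; not real camera models or parameters.
KNOWN_MODELS = {
    "examplecam a100": {"codec": "pcmu"},
    "samplecam b200": {"codec": "pcma"},
}


def match_model(model: str, cutoff: float = 0.6) -> Optional[dict]:
    """Return the profile of the closest known model, or None."""
    hits = difflib.get_close_matches(
        model.strip().lower(), list(KNOWN_MODELS), n=1, cutoff=cutoff
    )
    return KNOWN_MODELS[hits[0]] if hits else None
```

A `None` result is the point at which a caller would fall back to ONVIF identification.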
Internal aiohttp server running inside the voxwatch service container.
| Endpoint | Purpose |
|---|---|
| `/api/preview` | Generate and return TTS audio preview |
| `/api/preview/generate-intro` | Generate mode intro audio |
| `/api/announce` | Push TTS announcement to camera |
| `/api/health` | Health check |
Shares the same AudioPipeline instance as the main service (same TTS engines, same codec conversion). The dashboard container proxies preview and announce requests to this API.
| Mechanism | Implementation |
|---|---|
| API authentication | Bearer token via DASHBOARD_API_KEY env var, validated with hmac.compare_digest |
| Rate limiting | 5 audio pushes per camera per 60 seconds on test/wizard endpoints |
| Camera name validation | Regex ^[a-zA-Z0-9_-]+$ (SSRF prevention) |
| Path traversal protection | Validated on SPA static file serving |
| Secrets masking | API keys displayed as ***MASKED** in API responses |
| CORS | Configurable origins via CORS_ORIGINS env var |
| TTS input sanitization | Control character removal before synthesis |
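
Three of the mechanisms in the table are small enough to sketch directly. The function names are hypothetical, and the exact set of control characters removed before synthesis is an assumption.

```python
import hmac
import re

CAMERA_NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")


def token_is_valid(auth_header: str, expected_key: str) -> bool:
    """Check a 'Bearer <token>' header using a constant-time comparison."""
    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer" or not expected_key:
        return False
    return hmac.compare_digest(token.encode(), expected_key.encode())


def camera_name_is_valid(name: str) -> bool:
    """Accept only names matching the documented pattern."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    return CAMERA_NAME_RE.fullmatch(name) is not None


def sanitize_tts_text(text: str) -> str:
    """Drop ASCII control characters before synthesis."""
    return re.sub(r"[\x00-\x1f\x7f]", "", text)
```

`hmac.compare_digest` avoids leaking how many leading characters of the key matched through response timing.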
Single config.yaml file with environment variable substitution: ${ENV_VAR} and ${ENV_VAR:default}.
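
The two placeholder forms can be expanded with a single regular expression. In this sketch, an unset variable with no default expands to an empty string, which is an assumption rather than documented behavior.

```python
import os
import re

_ENV_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute_env(text: str, env=None) -> str:
    """Expand ${VAR} and ${VAR:default} placeholders in a string."""
    env = os.environ if env is None else env

    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        # Assumption: an unset variable without a default expands to "".
        return default if default is not None else ""

    return _ENV_RE.sub(repl, text)
```

Passing `env` explicitly keeps the function testable without touching the process environment.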
The service polls the config file every 10 seconds.
Hot-reloadable (no restart needed):
- TTS settings
- Stage 1 message text
- Active hours schedule
- Cooldown timers
- Response mode and active mode
- Dispatch configuration
Requires container restart:
- Frigate / go2rtc / MQTT connection parameters
- Camera list
- AI provider API keys
- Audio codec settings
- Pipeline stage enable/disable toggles
Write safety: Atomic writes via tempfile + os.replace().
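
The tempfile plus `os.replace()` pattern looks roughly like this. The function name is hypothetical; the key detail is that the temporary file lives in the same directory as the target so the final rename stays on one filesystem.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Write data to path so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A config poller reading the file mid-write therefore sees either the old contents or the new contents, never a truncated mix.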
| File | Format | Rotation | Content |
|---|---|---|---|
| `status.json` | JSON | Written every 5s (overwrite) | Service state, uptime, per-camera stats (detections, audio pushes, cooldowns) |
| `events.jsonl` | JSON Lines | 5MB rotation | One entry per detection event |
| `voxwatch.log` | Text | 10MB/file, 5 backups (50MB total) | Application log |
Docker logging: json-file driver, max-size: 10m, max-file: 3.
- Event loop: `asyncio` for all I/O (HTTP, audio pipeline, Frigate API)
- MQTT thread: `paho-mqtt` runs its own background thread; events bridged to asyncio via `call_soon_threadsafe`
- Per-camera locks: `asyncio.Lock` per camera prevents overlapping audio pushes
- Task tracking: Active async tasks tracked with automatic cleanup via done callbacks
- Graceful shutdown: Drains active tasks, publishes offline LWT status, disconnects MQTT
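
The thread bridge and task tracking from the list above can be shown together in a small sketch. A plain `threading.Thread` stands in for the paho-mqtt network thread, and `track`, `demo`, and the payload string are hypothetical names.

```python
import asyncio
import threading

active_tasks = set()


def track(coro) -> asyncio.Task:
    """Start a task and drop it from the set automatically when it finishes."""
    task = asyncio.ensure_future(coro)
    active_tasks.add(task)
    task.add_done_callback(active_tasks.discard)
    return task


async def demo() -> list:
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()
    handled = []

    def on_message(payload: str) -> None:
        # Runs on the MQTT thread: hand the payload over to the loop thread.
        loop.call_soon_threadsafe(queue.put_nowait, payload)

    async def handle(payload: str) -> None:
        handled.append(payload)

    # Stand-in for the MQTT network thread delivering one message.
    worker = threading.Thread(target=on_message, args=("event-1",))
    worker.start()

    await track(handle(await queue.get()))
    worker.join()
    await asyncio.sleep(0)  # give done callbacks a chance to run
    return handled
```

On shutdown, awaiting everything left in `active_tasks` is one way to drain in-flight work before disconnecting.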
Multi-stage Dockerfile producing a non-root container.
| Property | Value |
|---|---|
| Base | Python 3.11 |
| System dependencies | ffmpeg, espeak-ng, curl, piper (with en_US-lessac-medium voice) |
| Final image size | ~911MB (optimized from 1769MB) |
| Health check | curl -f http://localhost:8891/ |
| Run user | voxwatch (non-root) |
| Docker logging | json-file, max-size 10m, max-file 3 |
Resource limits (docker-compose):
| Container | Memory | CPU |
|---|---|---|
| `voxwatch` | 512MB | 2 |
| `voxwatch-dashboard` | 256MB | 1 |