audio-refinery is a GPU-accelerated audio processing pipeline: vocal separation (Demucs), speaker diarization (Pyannote), transcription (WhisperX), and text sentiment analysis. Its primary use case is building AI-ready audio databases — transforming raw recordings into structured, speaker-attributed JSON with word-level timestamps that feed directly into RAG pipelines, vector stores, and fine-tuning datasets. The pipeline uses a Ghost Track strategy: AI models run against a clean, music-free vocal stem to maximize accuracy, then the resulting metadata is applied back to the original audio, preserving its acoustic character. Designed to run on 24 GB consumer GPUs with all models resident in VRAM simultaneously, it processes large corpora in batch with no model reload overhead between files.
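The shape of the structured output described above can be sketched as follows; the field names here are illustrative, not the tool's exact schema:

```python
def build_segment(speaker, start, end, words):
    """Assemble one speaker-attributed transcript segment whose words
    each carry their own start/end timestamps."""
    return {
        "speaker": speaker,
        "start": start,
        "end": end,
        "text": " ".join(w["word"] for w in words),
        "words": words,
    }

segment = build_segment(
    "SPEAKER_00", 12.40, 13.30,
    [{"word": "hello", "start": 12.40, "end": 12.80},
     {"word": "there", "start": 12.95, "end": 13.30}],
)
```

Records like this are what downstream RAG pipelines and vector stores ingest directly.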
# Create and activate a Python 3.11 virtualenv
uv venv --python 3.11.14
source .venv/bin/activate
# Install all deps (uv sync, whisperx, CUDA torch wheels, pre-commit hooks)
make dev-setup
# Copy the env template and add your HuggingFace token
cp .env.example .env
# Edit .env and set HF_TOKEN=hf_your_token_here
# Verify the install
make test
audio-refinery --help

CUDA note:
`uv sync` resolves torch from PyPI and installs the CPU build. `make dev-setup` automatically reinstalls `torch==2.1.2+cu121` and `torchaudio==2.1.2+cu121` (CUDA 12.1) as its final step. If your system uses a different CUDA version, run `make install-torch-cuda` after editing the wheel URLs in the Makefile.
NumPy constraint:
`numpy<2.0.0` is pinned in `pyproject.toml`. Do not upgrade it — WhisperX and some audio libraries break with NumPy 2.x.
Pyannote speaker diarization models are gated on HuggingFace. Complete these steps once before running audio-refinery diarize:
- Create a HuggingFace account at huggingface.co if you don't have one.
- Accept the license for each gated model (must be logged in):
- Create a read-only access token: Profile → Settings → Access Tokens → New token.
- Add it to your `.env` file (copy `.env.example` to `.env`):

  HF_TOKEN=hf_your_token_here

  Or export it in your shell:

  export HF_TOKEN=hf_your_token_here
The .env file is gitignored. The token is never embedded in code.
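Token resolution follows the usual CLI convention: an explicit `--hf-token` flag overrides the `HF_TOKEN` environment variable. A minimal sketch (the function name is hypothetical):

```python
import os

def resolve_hf_token(cli_token=None, env=os.environ):
    """Resolve the HuggingFace token: an explicit --hf-token value
    wins; otherwise fall back to HF_TOKEN from .env or the shell."""
    if cli_token:
        return cli_token
    return env.get("HF_TOKEN")
```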
The default Demucs output directory is /mnt/fast_scratch/demucs_output. Using a tmpfs RAM disk avoids SSD write amplification during heavy audio processing and significantly reduces per-file latency. If the RAM disk is not available, the pipeline will ask for confirmation before falling back to local storage.
sudo mkdir -p /mnt/fast_scratch
sudo mount -t tmpfs -o size=32G,mode=1777 tmpfs /mnt/fast_scratch

The mode=1777 flag is required so non-root users can write to the mount.
To persist across reboots, add to /etc/fstab:
tmpfs /mnt/fast_scratch tmpfs defaults,size=32G,mode=1777 0 0
Or append it automatically:
echo "tmpfs /mnt/fast_scratch tmpfs defaults,size=32G,mode=1777 0 0" | sudo tee -a /etc/fstab

Check usage:
df -h /mnt/fast_scratch

If you prefer not to use a RAM disk, pass -o /some/other/dir to write output anywhere.
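The fallback behaviour described above can be sketched like this; the function and the `confirm` callback are hypothetical, not the tool's actual API:

```python
import os

def pick_scratch_dir(preferred="/mnt/fast_scratch",
                     fallback="/tmp/demucs_output",
                     confirm=lambda msg: True):
    """Prefer the RAM disk; ask for confirmation before falling
    back to local storage."""
    if os.path.ismount(preferred):
        return os.path.join(preferred, "demucs_output")
    if confirm(f"{preferred} is not mounted; fall back to {fallback}?"):
        return fallback
    raise SystemExit("no scratch directory available")
```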
The first run will auto-download the htdemucs model weights (~80 MB) to ~/.cache/torch/hub/checkpoints/. No HuggingFace token or manual download is required.
audio-refinery separate /path/to/audio.wav

This runs Demucs htdemucs with --two-stems=vocals on the GPU. Output:
- A blue panel summarizing input/output/device
- An animated spinner while Demucs processes
- A green results table with file info, timing, and output paths
- A JSON dump of the full `SeparationResult` provenance record
Demucs writes two stems:
/mnt/fast_scratch/demucs_output/htdemucs/<track_name>/vocals.wav
/mnt/fast_scratch/demucs_output/htdemucs/<track_name>/no_vocals.wav
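A small helper along these lines can predict where the stems of a given input will land (the helper itself is hypothetical; the path layout is the one shown above):

```python
from pathlib import Path

def stem_paths(input_file,
               output_dir="/mnt/fast_scratch/demucs_output",
               model="htdemucs"):
    """Demucs writes stems under <output_dir>/<model>/<track_name>/."""
    track = Path(input_file).stem
    base = Path(output_dir) / model / track
    return base / "vocals.wav", base / "no_vocals.wav"
```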
Usage: audio-refinery separate [OPTIONS] INPUT_FILE
Options:
-o, --output-dir PATH Output directory (default: /mnt/fast_scratch/demucs_output)
-d, --device TEXT Compute device: 'cpu', 'cuda', or 'cuda:N' (default: cuda)
--segment INTEGER Segment size in seconds for VRAM optimization
--help Show help and exit
# Pin to a specific GPU
audio-refinery separate /path/to/audio.wav -d cuda:0
# Use CPU instead of GPU (slower)
audio-refinery separate /path/to/audio.wav -d cpu
# Write output to a custom directory
audio-refinery separate /path/to/audio.wav -o /tmp/my_output
# Reduce VRAM usage for long files (splits into 40-second chunks)
audio-refinery separate /path/to/audio.wav --segment 40

Troubleshooting:
- `CUDA out of memory` — Try `--segment 40` (or lower) to reduce peak VRAM usage.
- `audio-refinery: command not found` — Run `uv pip install -e .` with the venv activated.
- `Demucs is not installed or not on PATH` — Make sure the venv is activated.
After step 1 produces vocals.wav, feed it to audio-refinery diarize to identify speaker turns.
audio-refinery diarize /mnt/fast_scratch/demucs_output/htdemucs/audio_track/vocals.wav

This runs Pyannote speaker-diarization-3.1. Output:
- A blue panel summarizing input and device
- An animated spinner while Pyannote processes
- A green summary table: speakers detected, total segments, total speech duration
- A per-speaker breakdown table
- A JSON dump of the full `DiarizationResult` provenance record
Usage: audio-refinery diarize [OPTIONS] INPUT_FILE
Options:
-d, --device TEXT Compute device: 'cpu', 'cuda', or 'cuda:N' (default: cuda)
--min-speakers INTEGER Minimum number of speakers (optional hint)
--max-speakers INTEGER Maximum number of speakers (optional hint)
--hf-token TEXT HuggingFace token (overrides HF_TOKEN env var)
-o, --output-file PATH Write DiarizationResult JSON to this file
--help Show help and exit
# Hint that the audio has 2–4 speakers
audio-refinery diarize /path/to/vocals.wav -d cuda:0 --min-speakers 2 --max-speakers 4
# Save result JSON for downstream processing
audio-refinery diarize /path/to/vocals.wav -d cuda:0 -o /tmp/diarization.json

Troubleshooting:
- `HuggingFace token not found` — Follow the HF setup steps above.
- `Failed to load Pyannote pipeline` — Check that you accepted both model licenses on HuggingFace.
After step 2 produces a DiarizationResult JSON, feed the same vocals.wav to audio-refinery transcribe to generate word-level timestamps. Optionally pass the diarization JSON to get speaker labels on each segment.
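Speaker labelling generally reduces to a maximum-overlap rule between transcript segments and diarization turns. A simplified sketch of that idea (not WhisperX's actual implementation, which also assigns speakers at the word level):

```python
def assign_speaker(segment, turns):
    """Label a transcript segment with the speaker whose diarization
    turn overlaps it the most; None if nothing overlaps."""
    best, best_overlap = None, 0.0
    for turn in turns:  # turn: {"speaker", "start", "end"}
        overlap = (min(segment["end"], turn["end"])
                   - max(segment["start"], turn["start"]))
        if overlap > best_overlap:
            best, best_overlap = turn["speaker"], overlap
    return best
```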
# Pure transcription (no speaker labels)
audio-refinery transcribe /path/to/vocals.wav -d cuda:0 -o /tmp/transcription.json
# With speaker assignment from step 2
audio-refinery transcribe /path/to/vocals.wav \
-d cuda:0 \
--diarization-file /tmp/diarization.json \
-o /tmp/transcription_with_speakers.json

This runs WhisperX (large-v3) with Wav2Vec2 forced alignment. Output:
- A blue panel summarizing input, device, compute type, and language
- An animated spinner while processing
- A green summary table: language, segments, words, processing time
- A transcript preview table (first 10 segments, with optional speaker column)
- A JSON dump of the full `TranscriptionResult` provenance record
Usage: audio-refinery transcribe [OPTIONS] INPUT_FILE
Options:
-d, --device TEXT Compute device: 'cpu', 'cuda', or 'cuda:N' (default: cuda)
--compute-type [float16|int8_float16|int8|float32]
CTranslate2 compute type (default: float16).
int8_float16 gives ~1.5× throughput with negligible accuracy delta.
--batch-size INTEGER Batch size for transcription (default: 16)
--language TEXT Language code, e.g. 'en', 'fr', or 'auto' (default: en)
--diarization-file PATH DiarizationResult JSON from step 2 for speaker assignment
-o, --output-file PATH Write TranscriptionResult JSON to this file
--help Show help and exit
# Use int8_float16 for ~1.5× throughput (recommended)
audio-refinery transcribe /path/to/vocals.wav --compute-type int8_float16
# Auto-detect language
audio-refinery transcribe /path/to/vocals.wav --language auto
# Full pipeline output with speaker labels
audio-refinery transcribe /path/to/vocals.wav \
-d cuda:0 \
--diarization-file /tmp/diarization.json \
-o /tmp/transcript.json

Troubleshooting:
- `whisperx is not installed` — Follow the install steps above. whisperx requires manual installation due to ctranslate2 version constraints.
- `No such file or directory: 'ffmpeg'` — Install the ffmpeg binary: sudo apt install ffmpeg
- `CUDA out of memory` — Try `--compute-type int8` to reduce VRAM usage, or lower `--batch-size`.
- `alignment failed` warning — No Wav2Vec2 alignment model for the detected language. Transcription falls back to Whisper's raw timestamps; segment text is still correct.
After step 3 produces a TranscriptionResult JSON, feed it to audio-refinery sentiment to score each segment as positive, neutral, or negative. This step is text-only — no audio or GPU required.
The analyzer uses cardiffnlp/twitter-roberta-base-sentiment-latest and writes results in two places:
- A standalone `SentimentResult` JSON (optional, via `-o`)
- In-place update of the source transcription JSON — each segment gains a `sentiment` field and `sentiment_applied` is set to `true`
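The in-place update can be pictured as follows; `sentiment` and `sentiment_applied` are the documented field names, everything else is illustrative:

```python
def apply_sentiment(transcription, scores):
    """Attach a sentiment label to each segment and mark the
    transcription document as sentiment-scored."""
    for segment, score in zip(transcription["segments"], scores):
        segment["sentiment"] = score  # e.g. {"label": "positive", "score": 0.91}
    transcription["sentiment_applied"] = True
    return transcription
```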
# Analyse and update the transcription in place
audio-refinery sentiment /data/audio/transcription/transcription_abc123.json
# Also write a standalone SentimentResult JSON
audio-refinery sentiment /data/audio/transcription/transcription_abc123.json \
-o /data/audio/sentiment/sentiment_abc123.json
# Use a different model
audio-refinery sentiment /path/to/transcription.json --model my-org/my-sentiment-model

Usage: audio-refinery sentiment [OPTIONS] TRANSCRIPTION_FILE
Arguments:
TRANSCRIPTION_FILE TranscriptionResult JSON written by step 3 [required]
Options:
--model TEXT HuggingFace model for text-classification
[default: cardiffnlp/twitter-roberta-base-sentiment-latest]
-d, --device TEXT Compute device: 'cpu', 'cuda', or 'cuda:N' [default: cpu]
-o, --output-file PATH Write SentimentResult JSON to this file
--help Show this message and exit.
audio-refinery pipeline runs all active stages (separation → diarization → transcription, and optionally sentiment) across an entire directory of WAV files. Processing is interleaved per-file: each file completes all active stages before the next begins, so ghost-track stems never accumulate beyond one file's worth of data at a time. All models are loaded once before the loop starts.
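The per-file interleaving can be sketched as pseudocode-style Python; the stage names and the `models` dict are illustrative, not the internal API:

```python
def run_pipeline(files, models, sentiment=False):
    """Each file passes through every active stage before the next
    file starts, so at most one file's ghost-track stems exist at a
    time. Everything in `models` was loaded once, before this loop."""
    results = []
    for f in files:
        stems = models["demucs"](f)              # step 1: separation
        turns = models["pyannote"](stems)        # step 2: diarization
        text = models["whisperx"](stems, turns)  # step 3: transcription
        if sentiment:
            text = models["sentiment"](text)     # step 4 (optional)
        results.append(text)
    return results
```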
/mnt/fast_scratch/ ← RAM disk (preferred scratch, strongly recommended)
demucs/ ← Demucs stems (created and deleted per-file by default)
<base>/
extracted/ ← input WAV files (must exist)
audio_<id>.wav
diarization/ ← intermediate Pyannote JSON (created automatically)
transcription/ ← final WhisperX JSON output (created automatically)
summary/ ← run summary JSON (created automatically)
pipeline_summary.json
If <base>/extracted does not exist the pipeline exits immediately with an error. All other subdirectories are created on demand.
Demucs scratch (priority order):
- `/mnt/fast_scratch/demucs` — used automatically if `/mnt/fast_scratch` is mounted (preferred)
- `<base>/demucs` — fallback to local storage if the RAM disk is unavailable; the pipeline asks for confirmation before proceeding
# Minimal invocation (steps 1–3: separate, diarize, transcribe)
audio-refinery pipeline --base-dir /data/audio/batch
# Pin to a specific GPU
audio-refinery pipeline --base-dir /data/audio/batch -d cuda:0
# Reduce VRAM pressure during Demucs
audio-refinery pipeline --base-dir /data/audio/batch --segment 40
# Re-run everything from scratch
audio-refinery pipeline --base-dir /data/audio/batch --no-resume
# Keep all ghost-track stems for inspection
audio-refinery pipeline --base-dir /data/audio/batch --keep-scratch
# Enable step 4 (text sentiment analysis)
audio-refinery pipeline --base-dir /data/audio/batch --sentiment
# Use int8_float16 for ~1.5× throughput with negligible accuracy delta
audio-refinery pipeline --base-dir /data/audio/batch --compute-type int8_float16
# Tighten the thermal ceiling (default is 80°C)
audio-refinery pipeline --base-dir /data/audio/batch --temp-limit 75
# Disable temperature monitoring entirely
audio-refinery pipeline --base-dir /data/audio/batch --temp-limit 0

Usage: audio-refinery pipeline [OPTIONS]
Options:
--base-dir PATH Base directory (extracted/ must exist inside) [required]
-d, --device TEXT Compute device: 'cpu', 'cuda', or 'cuda:N' [default: cuda]
--segment INTEGER Demucs segment size in seconds for VRAM optimisation
--compute-type [float16|int8_float16|int8|float32]
WhisperX compute type [default: float16]
--batch-size INTEGER WhisperX batch size [default: 16]
--language TEXT Language code (e.g. 'en') or 'auto' [default: en]
--whisper-model TEXT WhisperX model name [default: large-v3]
Variants: large-v3 (highest accuracy), distil-large-v3
(~2× faster), medium.en (~3.2× faster, English-only),
medium (~3× faster, multilingual)
--hf-token TEXT HuggingFace token for Pyannote (overrides HF_TOKEN env var)
--no-resume Re-process all files, ignoring existing outputs
--keep-scratch Keep all ghost-track stems after processing
--temp-limit INTEGER GPU temperature limit in °C. [default: 80]
A background thread checks every 5 s and shuts the pipeline
down if the limit is exceeded. Set to 0 to disable.
--sentiment Enable step 4: Text Sentiment Analysis
--help Show this message and exit.
By default, ghost-track stems are cleaned up per-file as soon as they are no longer needed:
- `no_vocals.wav` — deleted immediately after separation
- `vocals.wav` — deleted after transcription
This bounds scratch usage to roughly one file's worth of stems (~400 MB) at any point in the run. Use --keep-scratch to retain all stems.
By default, the pipeline skips any file whose transcription output already exists and is non-empty. An interrupted run can be safely restarted. Use --no-resume to force full reprocessing.
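The resume check reduces to a file-existence-and-size test; a sketch, with the function name being hypothetical:

```python
import os

def should_skip(output_path):
    """Resume logic: skip a source file if its transcription JSON
    already exists and is non-empty."""
    return os.path.isfile(output_path) and os.path.getsize(output_path) > 0
```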
At the end of every run, <base>/summary/pipeline_summary.json is written with per-stage counts (processed / skipped / failed), timing, active steps, and a list of any failures.
Troubleshooting:
- `Source directory not found` — Create `<base>/extracted/` and place `audio_<id>.wav` files inside.
- `No audio_*.wav files found` — Source files must match the pattern `audio_<id>.wav`.
- Partial failures — Check the failure table printed at the end of the run and `summary/pipeline_summary.json`.
- `Thermal shutdown: cuda:N at XX°C` — The GPU exceeded `--temp-limit`. Wait for it to cool, then resume (completed files are skipped automatically).
CUDA and nvidia-smi use independent GPU numbering schemes by default, and they don't always
agree. Without corrective configuration, cuda:0 in PyTorch may refer to a different physical
card than index 0 in nvidia-smi, making it impossible to reliably pin workloads to specific
GPUs.
Audio-refinery sets CUDA_DEVICE_ORDER=PCI_BUS_ID at startup, before any CUDA context is
created. This forces both CUDA and nvidia-smi to enumerate GPUs in PCI bus order, so their
indices always match.
# List all GPUs with their nvidia-smi index, name, and VRAM
nvidia-smi --query-gpu=index,name,memory.total --format=csv

Example output on a dual-GPU system:
index, name, memory.total [MiB]
0, NVIDIA GeForce RTX 3090 Ti, 24576 MiB
1, NVIDIA GeForce RTX 4090, 24576 MiB
With CUDA_DEVICE_ORDER=PCI_BUS_ID set, cuda:0 in any audio-refinery command maps to
index 0 in this output, and cuda:1 maps to index 1.
The --device flag uses PyTorch device syntax:
| String | Meaning |
|---|---|
| `cuda` | Default GPU (typically `cuda:0` in PCI bus order) |
| `cuda:0` | GPU at PCI bus index 0 |
| `cuda:1` | GPU at PCI bus index 1 |
| `cpu` | CPU only — significantly slower; useful for testing or GPU-free machines |
Before a long batch run, confirm that cuda:N refers to the GPU you intend:
python -c "
import os; os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
import torch
for i in range(torch.cuda.device_count()):
print(f'cuda:{i} ->', torch.cuda.get_device_name(i))
"

The output should match the nvidia-smi index order:
cuda:0 -> NVIDIA GeForce RTX 3090 Ti
cuda:1 -> NVIDIA GeForce RTX 4090
audio-refinery pipeline-parallel uses detect_gpu_order() to rank all available GPUs
before assigning workers. The ranking uses FP16 TFLOPS from src/gpu_tflops.toml as the
primary sort key, so the fastest GPU gets the first worker slot. GPUs not in the table fall
back to a (rounded VRAM GB, max SM clock) heuristic — reliable within a generation but not
across them.
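The ranking described above can be sketched as a two-tier sort key; the field names and the TFLOPS figure below are illustrative, not taken from `src/gpu_tflops.toml`:

```python
def rank_gpus(gpus, tflops_table):
    """Sort GPUs fastest-first: known FP16 TFLOPS ranks above the
    (rounded VRAM GB, max SM clock) fallback heuristic."""
    def key(gpu):
        tflops = tflops_table.get(gpu["name"])
        if tflops is not None:
            return (1, tflops)  # any listed GPU outranks any unlisted one
        return (0, round(gpu["vram_gb"]), gpu["sm_clock_mhz"])
    return sorted(gpus, key=key, reverse=True)
```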
To check whether your GPU is in the table:
nvidia-smi --query-gpu=name --format=csv,noheader

Compare the output against the entries in src/gpu_tflops.toml. If your GPU is listed, it
ranks above any unlisted GPU. If it is not, the heuristic applies — which may or may not
produce the ordering you want.
If your GPU is not in src/gpu_tflops.toml, add it for correct ranking in multi-GPU setups:
1. Get the exact name as `nvidia-smi` reports it:

   nvidia-smi --query-gpu=name --format=csv,noheader

2. Find the FP16 TFLOPS figure at TechPowerUp GPU Specs. Use the Shader Performance row for consumer GPUs, or FP16 (half) for data center GPUs. See the comments in `src/gpu_tflops.toml` for guidance on which value to use.

3. Add an entry to `src/gpu_tflops.toml`:

   "NVIDIA GeForce RTX 5080" = 86.4
The GPU name must match nvidia-smi output exactly. The lookup code automatically handles
the presence or absence of the NVIDIA prefix across driver versions.
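That prefix tolerance can be sketched as a normalised lookup (the function name is hypothetical):

```python
def lookup_tflops(name, table):
    """Match a GPU name against the table whether or not either side
    carries the 'NVIDIA ' prefix."""
    def strip(n):
        return n.removeprefix("NVIDIA ").strip()
    normalized = {strip(k): v for k, v in table.items()}
    return normalized.get(strip(name))
```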
Before starting any model work, every GPU command queries nvidia-smi to check whether the targeted device is already occupied. If active processes are found, a yellow panel lists each PID and its VRAM footprint and asks for confirmation before proceeding.
During audio-refinery pipeline runs the current GPU temperature is shown in the progress bar description, colour-coded relative to the configured limit:
| Colour | Meaning |
|---|---|
| green | More than 10°C below the limit (safe) |
| yellow | Within 10°C of the limit (warm — watch it) |
| red | At or above the limit (shutdown imminent) |
Temperature is sampled at most once every 5 seconds to avoid hammering nvidia-smi.
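The throttling can be sketched as a small cache keyed on a monotonic clock (the class is illustrative, not the tool's internal API):

```python
import time

class ThrottledSampler:
    """Return a cached reading unless `interval` seconds have passed
    since the last real sample."""
    def __init__(self, read, interval=5.0, clock=time.monotonic):
        self.read = read
        self.interval = interval
        self.clock = clock
        self._last_t = None
        self._last_v = None

    def value(self):
        now = self.clock()
        if self._last_t is None or now - self._last_t >= self.interval:
            self._last_t, self._last_v = now, self.read()
        return self._last_v
```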
# Print temperature + GPU utilisation every 10 s, log to file
nvidia-smi dmon -s tu -d 10 | tee /tmp/gpu_temps.log

audio-refinery pipeline-parallel runs N simultaneous pipeline workers — one per --device flag — for significantly higher throughput when multiple GPUs are available.
The launcher:
- Discovers all WAV files in `<base>/extracted/`
- Splits them into N interleaved partitions
- Writes per-worker manifests to `<base>/manifests/`
- Spawns one `audio-refinery pipeline` child process per `--device` flag, labelled W0, W1, …
- Redirects each worker's output to `<base>/logs/worker_N.log`
- Waits for all workers to complete, then prints a combined summary and writes `<base>/summary/combined_report.json`
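An interleaved partition is a simple round-robin slice, which keeps each worker's share of files balanced even when file sizes trend over the directory listing. A sketch:

```python
def interleave(files, n_workers):
    """Round-robin split: worker i gets files[i], files[i + n], ..."""
    return [files[i::n_workers] for i in range(n_workers)]
```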
<base>/
extracted/ ← input WAV files (must exist)
manifests/ ← auto-generated per-worker file lists
manifest_0.txt
manifest_1.txt
logs/ ← per-worker output logs
worker_0.log
worker_1.log
diarization/ ← shared output (workers write non-overlapping files)
transcription/ ← shared output (workers write non-overlapping files)
summary/
worker_0.json
worker_1.json
combined_report.json ← always written; aggregates all workers and failures
# Basic dual-GPU run (default: cuda:0 and cuda:1)
audio-refinery pipeline-parallel --base-dir /data/audio/batch
# Recommended: int8_float16 for ~1.5× per-GPU speedup
audio-refinery pipeline-parallel --base-dir /data/audio/batch \
--compute-type int8_float16
# With power limits (recommended for extended runs)
audio-refinery pipeline-parallel --base-dir /data/audio/batch \
--compute-type int8_float16 \
--power-limit 350
# Three-GPU run
audio-refinery pipeline-parallel --base-dir /data/audio/batch \
--device cuda:0 --device cuda:1 --device cuda:2
# Monitor workers in separate terminals
tail -f /data/audio/batch/logs/worker_0.log
tail -f /data/audio/batch/logs/worker_1.log

Usage: audio-refinery pipeline-parallel [OPTIONS]
Options:
--base-dir PATH Base directory (extracted/ must exist inside) [required]
--device TEXT GPU device for a worker. Repeat for each worker.
[default: cuda:0, cuda:1]
--segment INTEGER Demucs segment size in seconds for VRAM optimisation
--compute-type [float16|int8_float16|int8|float32]
WhisperX compute type [default: float16]
--batch-size INTEGER WhisperX batch size [default: 16]
--language TEXT Language code (e.g. 'en') or 'auto' [default: en]
--whisper-model TEXT WhisperX model name [default: large-v3]
--hf-token TEXT HuggingFace token for Pyannote
--sentiment Enable step 4: Text Sentiment Analysis
--no-resume Re-process all files, ignoring existing outputs
--keep-scratch Keep all ghost-track stems after processing
--temp-limit INTEGER GPU temperature limit in °C. [default: 80]
Each worker shuts down if its GPU exceeds this threshold.
Set to 0 to disable.
--power-limit INTEGER Set nvidia-smi power limit (watts) on each GPU before launch.
Requires passwordless sudo for nvidia-smi.
--help Show this message and exit.
combined_report.json is always written after all workers finish. It contains aggregate metrics across all workers:
| Field | Type | Description |
|---|---|---|
| `run_at` | string | ISO 8601 timestamp of run start (UTC) |
| `total_discovered` | int | Total WAV files found in extracted/ |
| `total_time_seconds` | float | Wall-clock seconds from first worker start to last finish |
| `total_audio_hours` | float | Total audio duration processed across all workers |
| `source_audio_bytes` | int | Combined size of all input WAV files |
| `total_words` | int | Total words transcribed across all files |
| `total_segments` | int | Total transcript segments across all files |
| `avg_time_per_file_seconds` | float | total_time / total_discovered — average wall-clock cost per file |
| `avg_time_per_mb_seconds` | float | total_time / source_MB — processing seconds per MB of source audio |
| `processing_speed_ratio` | float | audio_seconds / wall_seconds — real-time factor (e.g. 3.7 means the pipeline processed audio 3.7× faster than its playback duration) |
| `words_per_audio_hour` | float | Transcription density — useful for detecting sparse/silent audio or diarization misses |
| `gpu_temp_celsius` | object | Per-device temperature summary: peak_celsius, avg_celsius, sample_count |
| `workers` | array | Per-worker label, device, exit code, and individual summary |
| `combined_failures` | array | Aggregated failure records from all workers |
null is written for derived metrics when the divisor is zero (e.g. avg_time_per_file_seconds is null if no files were discovered).
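The null-on-zero-divisor rule can be sketched as follows; the helper is illustrative, and source_MB is assumed to mean bytes / 10^6:

```python
def derived_metrics(total_time_s, n_files, source_bytes, audio_s):
    """Compute the report's derived fields, emitting None (JSON null)
    whenever a divisor is zero."""
    div = lambda a, b: a / b if b else None
    source_mb = source_bytes / 1e6 if source_bytes else 0
    return {
        "avg_time_per_file_seconds": div(total_time_s, n_files),
        "avg_time_per_mb_seconds": div(total_time_s, source_mb),
        "processing_speed_ratio": div(audio_s, total_time_s),
    }
```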
--power-limit invokes sudo nvidia-smi -pl <watts>. To allow this without a password prompt:
echo 'YOUR_USERNAME ALL=(ALL) NOPASSWD: /usr/bin/nvidia-smi' | sudo tee /etc/sudoers.d/nvidia-smi

Troubleshooting:
- A worker exits non-zero — Inspect `<base>/logs/worker_N.log` and `<base>/summary/combined_report.json`.
- `nvidia-smi power limit failed: Insufficient Permissions` — Add the sudoers rule above, or set limits manually with sudo nvidia-smi -i 0 -pl 350 before running.
- `Thermal shutdown: Worker WN (cuda:N) at XX°C` — A worker's GPU exceeded `--temp-limit`. The offending worker is terminated; the others continue. Resume with `audio-refinery pipeline-parallel --base-dir ...` — completed files are skipped automatically.
Set SLACK_WEBHOOK_URL in your .env file (or environment) to receive notifications when a pipeline run completes or shuts down due to overheating:
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/your/webhook/url
Notifications are fire-and-forget — a failure to deliver never blocks or aborts the pipeline.
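A fire-and-forget sender reduces to a broad try/except around the HTTP post; a sketch using only the standard library (the function name is hypothetical):

```python
import json
import urllib.request

def notify_slack(webhook_url, text, timeout=5):
    """Post a message to a Slack webhook; swallow every failure so a
    notification problem can never abort the pipeline."""
    if not webhook_url:
        return False
    try:
        req = urllib.request.Request(
            webhook_url,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=timeout)
        return True
    except Exception:
        return False
```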
| Document | Description |
|---|---|
| Architecture | Ghost Track pipeline design, model selection rationale, data model |
| Use Cases | Who uses this and for what |
| Performance | Throughput benchmarks, scaling options, optimization guide |
| Deployment | Production patterns, async workers, Docker, monitoring |
| Development | Dev setup, testing, contributing, release process |
uv venv --python 3.11.14
source .venv/bin/activate
# Install all deps including whisperx, CUDA torch, dev tools, and pre-commit hooks
make dev-setup
# Copy the env template and add your HuggingFace token
cp .env.example .env
# Run unit tests (no GPU required)
make test
# Run integration tests (requires GPU, HF_TOKEN, and test audio)
make test-integration
# Lint and format
make lint
make format

audio-refinery is released under the MIT License.
Dependency note: The Pyannote model weights (pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0) are gated on HuggingFace under separate terms. If you run this tool in a commercial data product, verify that your HuggingFace account's accepted terms cover your use case. The MIT license on this software does not extend to the model weights — those are governed by their respective HuggingFace model cards.