Verba volant, scripta manent: "spoken words fly away, written words remain." This ancient Latin proverb has perhaps been the greatest disadvantage of radio programs and their digital successors, podcasts. We can always revisit what was written, but not what was spoken. Until now. This project breaks through that age-old barrier, enabling podcast enthusiasts who struggle to find where a particular topic was discussed to finally locate, read, and listen again to what was said.
STTCast is a comprehensive suite for automatic podcast transcription, speaker identification (diarization), and intelligent semantic search powered by RAG (Retrieval-Augmented Generation).
- WhisperX Transcription: Primary transcription engine based on OpenAI Whisper with CUDA acceleration
- Pyannote Diarization: Automatic speaker identification through voice clustering
- Alternative Vosk Engine: For GPU-free processing (Spanish only)
- Web Interface: Browser-based transcription management
- RAG Semantic Search: Intelligent search system across podcast collections
- Participation Analysis: Speaking time statistics per speaker
- Query Cache: Semantic caching system to optimize repeated searches
STTCast uses a three-tier architecture that separates responsibilities and enables scalability:
```
┌────────────────────────────────────────────────────────────────────┐
│                         PRESENTATION LAYER                         │
├────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐  │
│  │  Web Interface   │  │    RAG Client    │  │       CLI        │  │
│  │     (webif)      │  │   (rag/client)   │  │   (sttcast.py)   │  │
│  │    Port 8302     │  │    Port 8004     │  │                  │  │
│  └─────────┬────────┘  └─────────┬────────┘  └──────────────────┘  │
└────────────┼─────────────────────┼─────────────────────────────────┘
             │                     │
┌────────────┼─────────────────────┼─────────────────────────────────┐
│            │      SERVICE LAYER  │                                 │
├────────────┼─────────────────────┼─────────────────────────────────┤
│  ┌─────────┴────────┐  ┌─────────┴────────┐  ┌──────────────────┐  │
│  │  Transcription   │  │    RAG Server    │  │  Context Server  │  │
│  │      Server      │  │  (sttcast_rag_   │  │ (context_server) │  │
│  │  (sttctranssrv)  │  │     service)     │  │                  │  │
│  │    Port 8000     │  │    Port 5500     │  │    Port 8001     │  │
│  └─────────┬────────┘  └─────────┬────────┘  └─────────┬────────┘  │
└────────────┼─────────────────────┼─────────────────────┼───────────┘
             │                     │                     │
┌────────────┼─────────────────────┼─────────────────────┼───────────┐
│            │       DATA LAYER    │                     │           │
├────────────┼─────────────────────┼─────────────────────┼───────────┤
│  ┌─────────┴────────┐  ┌─────────┴────────┐  ┌─────────┴────────┐  │
│  │    PostgreSQL    │  │    OpenAI API    │  │  FAISS + SQLite  │  │
│  │    (webif_db)    │  │   (Embeddings)   │  │    (Vectors)     │  │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
```
The web interface (webif) is a FastAPI application that allows:
- Uploading audio files for transcription
- Configuring reusable transcription profiles
- Monitoring job progress
- Downloading results in HTML and SRT formats
- Managing users and permissions
The transcription server (sttctranssrv) is the backend that processes transcriptions:
- Job queue management
- WhisperX execution with Pyannote
- GPU resource control
- HTML and SRT file generation
The RAG server (sttcast_rag_service) is the artificial intelligence service:
- Embedding generation with OpenAI
- Answering questions about content
- Automatic summary generation
- GPT model integration
The context server (context_server) provides database management:
- Relational database queries (SQLite)
- Vector searches with FAISS
- Context provision for RAG
The RAG client (rag/client) is a Flask web application for end users:
- Semantic search across transcriptions
- Speaker participation analysis
- Intelligent query caching
- Direct episode references
- Python 3.10 or higher
- FFmpeg installed in PATH
- NVIDIA GPU with CUDA (recommended for Whisper)
- PostgreSQL (for web interface)
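Before installing, you can sanity-check the first three prerequisites with a short Python snippet (illustrative only; torch is only present once the requirements are installed):

```python
# Quick environment check: Python version, FFmpeg on the PATH, CUDA visibility.
import shutil
import sys

print("Python:", sys.version.split()[0])
print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND in PATH")
try:
    import torch  # installed with the Whisper requirements
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet; CUDA check skipped")
```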
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
WhisperX is the recommended engine for high-quality transcriptions. It combines:
- OpenAI Whisper: For speech-to-text transcription
- Pyannote: For diarization (speaker identification)
Features:
- CUDA acceleration for fast processing
- Multiple models based on quality/speed requirements
- Multi-language support
- Speaker identification with training file
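For orientation, here is a minimal sketch of such a WhisperX pipeline, independent of sttcast.py. It assumes whisperx is installed and that HUGGINGFACE_TOKEN grants access to the Pyannote models; exact function locations vary between whisperx releases:

```python
import os
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# 1. Transcribe with a Whisper model
model = whisperx.load_model("small", device, compute_type="float16")
result = model.transcribe(audio, language="es")

# 2. Align words to precise timestamps
align_model, metadata = whisperx.load_align_model(language_code="es", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Cluster voice segments with Pyannote and attach speaker labels
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=os.environ["HUGGINGFACE_TOKEN"], device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), f'{seg["start"]:.1f}', seg["text"])
```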
| Model | Required VRAM | Relative speed |
|---|---|---|
| tiny | ~1 GB | ~32x |
| base | ~1 GB | ~16x |
| small | ~2 GB | ~6x |
| medium | ~5 GB | ~2x |
| large | ~10 GB | 1x |
Lightweight engine based on Vosk-Kaldi:
- Runs on CPU only
- Spanish language only
- Lower accuracy but no GPU requirements
- Useful for large-scale batch processing when no GPU is available
Models are available at https://alphacephei.com/vosk/models. Recommended: vosk-model-es-0.42
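As a reference, a minimal Vosk transcription loop looks roughly like this (illustrative, not the sttcast.py implementation; it assumes a 16 kHz mono WAV input):

```python
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-es-0.42")
wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # include per-word confidences and timestamps

while True:
    data = wf.readframes(4000)  # same chunk size as the -r RWAVFRAMES default
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result()).get("text", ""))

print(json.loads(rec.FinalResult()).get("text", ""))
```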
A YouTube tutorial is available with installation and usage instructions. It covers an earlier version and will be updated soon.
```bash
# Transcription with Whisper (recommended)
./sttcast.py -w --whlanguage es audio.mp3

# Transcription with Whisper and embedded audio tags
./sttcast.py -w -a --whlanguage es audio.mp3

# Transcription with training file for diarization
./sttcast.py -w --whtraining training.mp3 --whlanguage es audio.mp3

# Transcribe entire directory
./sttcast.py -w --whlanguage es /path/to/directory/

# Transcription with Vosk (Spanish only, no GPU)
./sttcast.py -m /path/to/vosk/model audio.mp3
```
```
usage: sttcast.py [-h] [-m MODEL] [-s SECONDS] [-c CPUS] [-i HCONF] [-n MCONF]
                  [-l LCONF] [-o OVERLAP] [-r RWAVFRAMES] [-w] [--whmodel WHMODEL]
                  [--whdevice {cuda,cpu}] [--whlanguage WHLANGUAGE]
                  [--whtraining WHTRAINING] [--whsusptime WHSUSPTIME] [-a]
                  [--html-suffix HTML_SUFFIX] [--min-offset MIN_OFFSET]
                  [--max-gap MAX_GAP] [-p PREFIX] [--calendar CALENDAR]
                  [-t TEMPLATES] [--pyannote-method PYANNOTE_METHOD]
                  [--pyannote-min-cluster-size SIZE] [--pyannote-threshold THRESHOLD]
                  [--pyannote-min-speakers N] [--pyannote-max-speakers N]
                  fnames [fnames ...]
```
```
Positional arguments:
  fnames                Audio files or directories to transcribe

General options:
  -h, --help            Show help
  -s SECONDS            Seconds per task (default: 600)
  -c CPUS               CPUs to use (default: cores - 2)
  -a, --audio-tags      Include audio player in HTML
  --html-suffix SUFFIX  Suffix for HTML file (default: empty)
  -p PREFIX             Prefix for output files (default: ep)
  --calendar FILE       CSV file with episode calendar (default: calfile)
  -t TEMPLATES          HTML templates directory (default: templates)

Vosk options:
  -m MODEL              Path to Vosk model
  -i HCONF              High confidence threshold (default: 0.95)
  -n MCONF              Medium confidence threshold (default: 0.7)
  -l LCONF              Low confidence threshold (default: 0.5)
  -o OVERLAP            Overlap between fragments (default: 2)
  -r RWAVFRAMES         WAV read frames (default: 4000)

Whisper options:
  -w, --whisper         Use Whisper engine (recommended)
  --whmodel MODEL       Model: tiny|base|small|medium|large (default: small)
  --whdevice DEVICE     Acceleration: cuda|cpu (default: cuda)
  --whlanguage LANG     Language: es|en|fr|de... (default: es)
  --whtraining FILE     Training MP3 file for diarization (default: training.mp3)
  --whsusptime SECS     Minimum speaking time (default: 60.0)

Pyannote options (advanced diarization):
  --pyannote-method             Clustering method (default: ward)
  --pyannote-min-cluster-size   Minimum cluster size (default: 15)
  --pyannote-threshold          Clustering threshold (default: 0.7147)
  --pyannote-min-speakers       Expected minimum number of speakers
  --pyannote-max-speakers       Expected maximum number of speakers
```
The web interface allows managing transcriptions from the browser:
```bash
# Start transcription server
python -m sttctranssrv

# Start web interface
python -m webif.webif --port 8302
```
Environment variables are stored in files within the .env/ directory:
```bash
# Database and FAISS index
STTCAST_DB_FILE="/path/to/your/database.db"
STTCAST_FAISS_FILE="/path/to/your/index.faiss"
STTCAST_RELEVANT_FRAGMENTS=100

# OpenAI
OPENAI_API_KEY="sk-..."
OPENAI_GPT_MODEL="gpt-4o-mini"
OPENAI_EMBEDDINGS_MODEL="text-embedding-3-small"

# Podcast metadata
PODCAST_CAL_FILE="/path/to/calendar.csv"
PODCAST_PREFIX="ep"
PODCAST_WORKDIR="/path/to/podcasts/"
PODCAST_TEMPLATES="/path/to/templates/"

# HuggingFace and Pyannote
HUGGINGFACE_TOKEN="hf_..."
PYANNOTE_METHOD=ward
PYANNOTE_MIN_CLUSTER_SIZE=15
PYANNOTE_THRESHOLD=0.7147
# PYANNOTE_MIN_SPEAKERS=2
# PYANNOTE_MAX_SPEAKERS=5

# RAG server
RAG_SERVER_HOST="0.0.0.0"
RAG_SERVER_PORT=5500

# RAG client
RAG_CLIENT_HOST="0.0.0.0"
RAG_CLIENT_PORT=8004
RAG_CLIENT_STT_LANG="es-ES"
RAG_MP3_DIR="/path/to/mp3/files"

# Web interface: server
WEBIF_HOST=127.0.0.1
WEBIF_PORT=8302
WEBIF_DEBUG=false

# Web interface: initial admin user
WEBIF_ADMIN_NAME=admin
WEBIF_ADMIN_PASSWORD=secure_password
WEBIF_ADMIN_EMAIL=admin@example.com

# Web interface: PostgreSQL database
WEBIF_DB_HOST=localhost
WEBIF_DB_PORT=5432
WEBIF_DB_USER=sttcast
WEBIF_DB_PASSWORD=db_password
WEBIF_DB_NAME=sttcast_webif

# Web interface: session
WEBIF_SECRET_KEY=random-secret-key
WEBIF_SESSION_EXPIRE=480

# Web interface: file storage
WEBIF_UPLOAD_DIR=/tmp/sttcast_webif/uploads
WEBIF_RESULTS_DIR=/tmp/sttcast_webif/results
WEBIF_TRAINING_DIR=/tmp/sttcast_webif/training
WEBIF_MAX_UPLOAD_SIZE=524288000

# Transcription server
TRANSSRV_HOST=0.0.0.0
TRANSSRV_PORT=8000
TRANSSRV_API_KEY=secure-hmac-key
```
The Whisper/Pyannote pipeline automatically identifies speakers. Pyannote performs voice segment clustering but does not identify real names.
Pyannote requires a HuggingFace read access token. Store it in the HUGGINGFACE_TOKEN environment variable.
Since Pyannote clusters voices rather than identifying them, the program works around this limitation by adding known, recognized voices to the audio before processing. The system can then match unidentified segments to the closest known voice cluster.
```yaml
# training.yml
F01:
  name: John Smith
  files:
    - Training/John_1.mp3
    - Training/John_2.mp3
F02:
  name: Jane Doe
  files:
    - Training/Jane_1.mp3
    - Training/Jane_2.mp3
```
Generate the training MP3:
```bash
python diarization/trainingmp3.py -c training.yml -o training.mp3
```
Options:
```
-c CONFIG    YAML file with speakers (default: training.yml)
-o OUTPUT    Output MP3 file (default: training.mp3)
-s SILENCE   Silence between speakers in seconds (default: 5)
-t TIME      Total fragment duration in seconds (default: 600)
```
Then use the training file when transcribing:
```bash
./sttcast.py -w --whtraining training.mp3 --whlanguage es podcast.mp3
```
Generated HTML files include final comments with the total speaking time for each participant.
Extract statistics to CSV:
```bash
python diarization/speakingtime.py -o times.csv transcriptions/*.html
```
Analyze the results with the notebooks/speakingtimes.ipynb notebook.
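If you prefer a quick look outside the notebook, a few lines of pandas suffice (the column names here are assumptions; check the actual CSV header produced by speakingtime.py):

```python
import pandas as pd

df = pd.read_csv("times.csv")
# Hypothetical columns: one row per (episode, speaker) with seconds spoken
totals = df.groupby("speaker")["seconds"].sum().sort_values(ascending=False)
print((totals / totals.sum() * 100).round(1))  # share of speaking time, in %
```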
The RAG system enables semantic searches across transcribed podcast collections.
Transcription segments are stored in:
- Relational database (SQLite): For structured management and conventional queries
- Vector database (FAISS): For efficient semantic searches using embeddings
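The pairing works by keeping the FAISS vector ids aligned with the SQLite row ids, so a vector hit can be mapped back to its fragment text. A minimal sketch of the idea (table layout and column names are hypothetical, not STTCast's actual schema):

```python
import sqlite3
import faiss
import numpy as np

dim = 1536  # output size of text-embedding-3-small
index = faiss.IndexFlatIP(dim)  # inner product == cosine on normalized vectors
db = sqlite3.connect("database.db")
db.execute("CREATE TABLE IF NOT EXISTS fragments "
           "(id INTEGER PRIMARY KEY, episode TEXT, start REAL, text TEXT)")

def add_fragment(episode, start, text, embedding):
    """Insert metadata into SQLite and the vector into FAISS, keeping ids aligned."""
    cur = db.execute(
        "INSERT INTO fragments (episode, start, text) VALUES (?, ?, ?)",
        (episode, start, text))
    vec = np.asarray(embedding, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(vec)
    index.add(vec)  # FAISS row i corresponds to SQLite id i + 1
    return cur.lastrowid

def search(query_embedding, k=5):
    """Return the k most similar fragments with their similarity scores."""
    vec = np.asarray(query_embedding, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(vec)
    scores, ids = index.search(vec, k)
    return [(float(s), db.execute(
                "SELECT episode, start, text FROM fragments WHERE id = ?",
                (int(i) + 1,)).fetchone())
            for s, i in zip(scores[0], ids[0]) if i != -1]
```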
The RAG server (rag/sttcast_rag_service.py) is a FastAPI service that provides:
- Embedding generation from transcription fragments
- Automatic content question answering
- Combination of vector retrieval and natural language generation
- Automatic episode summary generation
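Conceptually, the question-answering step combines an embedding lookup with a chat completion. A minimal sketch using the OpenAI Python SDK and the models from the .env examples (the retrieve callable stands in for the context server and is hypothetical):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, retrieve) -> str:
    # Embed the question, fetch the closest transcription fragments,
    # and let the GPT model answer grounded in that context.
    emb = client.embeddings.create(model="text-embedding-3-small",
                                   input=question).data[0].embedding
    fragments = retrieve(emb, k=5)  # -> list of fragment texts
    context = "\n---\n".join(fragments)
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the podcast fragments provided."},
            {"role": "user",
             "content": f"Fragments:\n{context}\n\nQuestion: {question}"},
        ])
    return chat.choices[0].message.content
```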
The context server (db/context_server.py) queries both databases to provide context:
- Relational database queries
- Vector searches with FAISS
- Relevant fragment provision for RAG
The RAG client (rag/client) is a Flask web application that provides:
- Semantic search: Natural language questions about content
- Participation analysis: Speaking times per speaker
- Query cache: Semantic caching system storing previous queries (see the sketch after this list)
- Direct references: Links to episodes and specific timestamps
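A semantic cache differs from a plain key-value cache in that it matches by meaning: the embedding of a new question is compared against embeddings of past questions, and a sufficiently similar hit returns the stored answer without calling the RAG server again. A minimal sketch (threshold and structure are illustrative):

```python
import numpy as np

class SemanticCache:
    """Reuse answers for questions whose embeddings are nearly identical."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (unit embedding, answer)

    def get(self, emb: np.ndarray) -> str | None:
        emb = emb / np.linalg.norm(emb)
        for cached_emb, answer in self.entries:
            if float(np.dot(emb, cached_emb)) >= self.threshold:  # cosine similarity
                return answer
        return None

    def put(self, emb: np.ndarray, answer: str) -> None:
        self.entries.append((emb / np.linalg.norm(emb), answer))
```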
```bash
# Start context server
python db/context_server.py               # Port 8001

# Start RAG service
python rag/sttcast_rag_service.py         # Port 5500

# Start RAG web client
cd rag/client && python client_rag.py     # Port 8004
```
Access at http://localhost:8004
The system generates automatic episode summaries using GPT:
```bash
python summaries/get_rag_summaries.py -i /path/to/transcriptions -o /path/to/summaries
```
The service processes *_whisper_audio_es.html files in configurable blocks.
```bash
python summaries/insert_summaries.py -s /path/to/summaries -t /path/to/transcriptions
```
Summaries are inserted into all HTML files for the episode (different engines, languages, etc.).
Adds audio controls to an HTML file transcribed without the --audio-tags option:
```bash
./add_audio_tag.py --mp3-file audio.mp3 -o output.html transcription.html
```
If you don't have a local GPU, the Automation/ directory contains scripts to:
- Create AWS EC2 machines with GPU
- Automatically provision and install STTCast
- Process files and download results
- Destroy resources when finished
All with just two commands: one to create and process, another to destroy.
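The scripts wrap the usual boto3 lifecycle calls; stripped to its core, creating and later destroying a GPU instance looks roughly like this (AMI id, instance type, and key name are placeholders, not the values the Automation/ scripts actually use):

```python
import boto3

ec2 = boto3.client("ec2")

# Create a GPU instance (placeholder AMI and key pair)
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",
    InstanceType="g4dn.xlarge",   # entry-level NVIDIA GPU instance
    KeyName="my-key",
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# ... provision, transcribe, download results ...

# Destroy the resources when finished
ec2.terminate_instances(InstanceIds=[instance_id])
```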
This project is licensed under the terms specified in the LICENSE file.
STTCast - Intelligent podcast transcription with AI