A LangGraph-based AI agent that transcribes and summarizes sermon video recordings from MP4 or MP3 files.
- Transcription: Converts audio from MP4/MP3 files to text using OpenAI Whisper
- GPU Acceleration: Automatically detects and uses NVIDIA GPU (CUDA) for much faster transcription
- Waveform Generation: Pre-computes audio waveform data (480 normalized amplitude values) so the mobile app can render a waveform via adaptive downsampling within the device viewport, with no client-side audio analysis
- Video Clip Generation: Automatically creates sub-10-minute summary MP4s from full sermons using AI-powered segment selection
- Summarization: Generates end-user-friendly single-paragraph summaries using GPT-4o-mini
- Semantic Tagging: Automatically applies relevant topical tags to summaries for better organization and discovery
- Batch Processing: Process entire directories of sermon files at once
- LangGraph Architecture: Built with a graph-based workflow for clear separation of concerns
- CLI Interface: Easy-to-use command-line interface
- Python 3.8+
- FFmpeg (must be installed separately)
- OpenAI API key
For the best performance with GPU acceleration, use the automated setup script:
Windows:

```
setup_venv_gpu.bat
```

macOS/Linux:

```
chmod +x setup_venv_gpu.sh
./setup_venv_gpu.sh
```

This script will:
- Create a Python virtual environment
- Install CUDA-enabled PyTorch (for GPU acceleration)
- Install all other dependencies in the correct order
- Verify GPU detection
Then configure your environment:
```
cp .env.example .env
# Edit .env and add your OpenAI API key
```

If you don't have an NVIDIA GPU or prefer manual setup:
1. Clone the repository:

2. Create a virtual environment:

   ```
   python -m venv venv
   venv\Scripts\activate        # On Windows
   # source venv/bin/activate   # On macOS/Linux
   ```

3. Install dependencies:

   ```
   pip install -r requirements.txt
   ```

   Note: This installs CPU-only PyTorch. For GPU acceleration (4x faster), use the automated setup script above or see the GPU Acceleration section below.

4. Install FFmpeg (if not already installed):

   - Windows: `choco install ffmpeg` or download from ffmpeg.org
   - macOS: `brew install ffmpeg`
   - Linux: `apt install ffmpeg` or `yum install ffmpeg`

5. Configure environment variables:

   ```
   cp .env.example .env
   # Edit .env and add your OpenAI API key
   ```
Transcribe and summarize a single sermon file:

```
python agent.py --file path/to/sermon.mp4
```

Process all audio files in a directory at once:

```
python agent.py --batch-dir "G:\Thrive\Sermon Videos\Audio Files"
```

This will:
- Find all audio files (MP3, MP4, WAV, M4A, MOV) in the directory
- Process each file sequentially
- Create organized subdirectories for each file's outputs
- Generate a consolidated `batch_summaries.json` with all results
- Display progress as files are processed
- Continue processing even if individual files fail
If no file is specified, the agent will auto-detect the latest media file in the configured directory:
```
python agent.py
```

- `--file`, `-f`: Path to a single sermon audio/video file (MP4, MP3, WAV, M4A, MOV)
- `--batch-dir`, `-b`: Path to directory containing multiple sermon files for batch processing
- `--resume`: Skip files that have already been successfully processed (only works with `--batch-dir`)
Note: `--file` and `--batch-dir` are mutually exclusive. Use one or the other.
For large batches, use the --resume flag to enable checkpoint-based resumption:
```
# Resume an interrupted batch
python agent.py --batch-dir "G:\Thrive\Sermon Videos\Audio Files" --resume
```

Benefits:
- ✅ Skip files that have already been successfully processed
- ✅ Safely interrupt and resume large batch jobs
- ✅ Automatically retry failed files while preserving successful ones
- ✅ Perfect for processing hundreds of files across multiple sessions
Without --resume (default):
- Clears the `batch_outputs/` directory before starting
- Useful for testing prompt changes or configuration adjustments
See CHECKPOINT_GUIDE.md for detailed usage and examples.
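For illustration, resume-style filtering can be as simple as skipping any file whose entry in `batch_summaries.json` is already marked successful. A minimal sketch — the agent's actual checkpoint logic lives in the code and CHECKPOINT_GUIDE.md and may differ:

```python
import json
import os

MEDIA_EXTS = {".mp3", ".mp4", ".wav", ".m4a", ".mov"}

def files_to_process(batch_dir, results_path="batch_outputs/batch_summaries.json"):
    """Yield media files that have no successful entry in the results file yet."""
    done = set()
    if os.path.exists(results_path):
        with open(results_path) as f:
            done = {name for name, entry in json.load(f).items()
                    if entry.get("status") == "success"}
    for name in sorted(os.listdir(batch_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() in MEDIA_EXTS and stem not in done:
            yield os.path.join(batch_dir, name)
```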
The agent generates the following files in the current directory:
- `transcription.txt`: Full transcription text
- `transcription_segments.json`: Transcription with timestamps
- `summary.txt`: Single-paragraph summary
- `summary.json`: Summary with metadata and semantic tags
The agent generates:
- Individual file outputs in organized subdirectories:

  ```
  batch_outputs/
  ├── 2025-10-05-Recording/
  │   ├── transcription.txt
  │   ├── transcription_segments.json
  │   ├── summary.txt
  │   └── summary.json
  ├── 2025-10-12-Recording/
  │   ├── transcription.txt
  │   ├── transcription_segments.json
  │   ├── summary.txt
  │   └── summary.json
  └── ...
  ```

- Consolidated JSON output (`batch_summaries.json`):

  ```
  {
    "2025-10-05-Recording": {
      "filename": "2025-10-05-Recording.mp3",
      "summary": "This sermon explores the transformative power of faith...",
      "transcription_path": "C:\\...\\batch_outputs\\2025-10-05-Recording\\transcription.txt",
      "summary_path": "C:\\...\\batch_outputs\\2025-10-05-Recording\\summary.txt",
      "word_count": 158,
      "date_processed": "2025-10-09T10:30:00Z",
      "status": "success"
    },
    "2025-10-12-Recording": {
      "filename": "2025-10-12-Recording.mp3",
      "summary": "The message focuses on the importance of community...",
      "transcription_path": "C:\\...\\batch_outputs\\2025-10-12-Recording\\transcription.txt",
      "summary_path": "C:\\...\\batch_outputs\\2025-10-12-Recording\\summary.txt",
      "word_count": 142,
      "date_processed": "2025-10-09T10:35:00Z",
      "status": "success"
    }
  }
  ```

If a file fails to process, the entry will include error information:

```
{
  "problematic-file": {
    "filename": "problematic-file.mp3",
    "summary": null,
    "transcription_path": null,
    "summary_path": null,
    "word_count": 0,
    "date_processed": "2025-10-09T10:40:00Z",
    "status": "error",
    "error": "File not found or corrupted"
  }
}
```
The agent uses LangGraph with five main nodes:
- Transcribe Node: Converts audio to text using OpenAI Whisper
- Automatically detects and uses GPU (CUDA) if available
- Extracts audio from video files using FFmpeg
- Generates timestamped segments
- Waveform Node: Generates audio waveform data using librosa
- Calculates 480 RMS (Root Mean Square) amplitude values
- Normalizes values to 0.15-1.0 range for consistent visualization
- Saves waveform data to summary.json for API consumption
- Summarize Node: Generates a summary using GPT-4o-mini
- Creates end-user-friendly single-paragraph summaries
- Includes metadata (word count, processing date, etc.)
- Tagging Node: Applies semantic tags to summaries
- Analyzes summary content using GPT-4o-mini
- Selects up to 5 relevant tags from a predefined list (config/tags_config.py)
- Tags are cached in memory for efficient batch processing
- Updates summary.json with a tags array
- Clip Generation Node (optional): Creates summary video clips
- Uses GPT-4o-mini to analyze transcript and select key moments
- Optimizes segments with context padding and gap merging
- Generates MP4 clips using FFMPEG with GPU acceleration support
- Only runs when `ENABLE_CLIP_GENERATION=true` in `.env`
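For orientation, here is a minimal sketch of how these five nodes could be wired together with LangGraph — the state fields and node bodies are illustrative stubs, not the agent's actual code:

```python
import os
from typing import List, Optional, TypedDict

from langgraph.graph import END, StateGraph

class SermonState(TypedDict, total=False):
    file_path: str
    transcription: str
    waveform_data: List[float]
    summary: str
    tags: List[str]
    clip_path: Optional[str]

# Stub nodes standing in for the real implementations described above.
def transcribe(state: SermonState):
    return {"transcription": "..."}            # Whisper (GPU if available)

def waveform(state: SermonState):
    return {"waveform_data": [0.15] * 480}     # librosa RMS values

def summarize(state: SermonState):
    return {"summary": "..."}                  # GPT-4o-mini summary

def tag(state: SermonState):
    return {"tags": ["Faith"]}                 # semantic tagging

def generate_clip(state: SermonState):
    return {"clip_path": "..."}                # FFmpeg clip generation

graph = StateGraph(SermonState)
for name, fn in [("transcribe", transcribe), ("waveform", waveform),
                 ("summarize", summarize), ("tag", tag), ("clip", generate_clip)]:
    graph.add_node(name, fn)

graph.set_entry_point("transcribe")
graph.add_edge("transcribe", "waveform")
graph.add_edge("waveform", "summarize")
graph.add_edge("summarize", "tag")
# The clip node only runs when ENABLE_CLIP_GENERATION=true in .env
graph.add_conditional_edges(
    "tag",
    lambda s: "clip" if os.getenv("ENABLE_CLIP_GENERATION", "false") == "true" else END,
)
graph.add_edge("clip", END)

app = graph.compile()
```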
The agent automatically applies relevant topical tags to each sermon summary for better organization and discovery.
- Tag Source: Tags are defined in `config/tags_config.py`, which contains 102+ predefined sermon topics
- Hybrid Analysis: GPT-4o-mini analyzes BOTH the summary (main themes) and a transcript excerpt (comprehensive context) to determine relevant themes (see the sketch after this list)
- Selection: The AI selects up to 5 of the most relevant tags from the available list
- Storage: Tags are added to `summary.json` as a `tags` array field
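A minimal sketch of what the hybrid classification call could look like using the OpenAI SDK — the prompt wording and function shape are illustrative, not the agent's actual implementation:

```python
from typing import List

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_tags(summary: str, transcript: str, all_tags: List[str]) -> List[str]:
    """Hybrid analysis: full summary plus a transcript excerpt."""
    prompt = (
        "Pick up to 5 tags from this list that best fit the sermon:\n"
        f"{', '.join(all_tags)}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Transcript excerpt:\n{transcript[:15000]}\n\n"
        "Reply with a comma-separated list of tags only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    chosen = [t.strip() for t in resp.choices[0].message.content.split(",")]
    return [t for t in chosen if t in all_tags][:5]  # keep only valid tags
```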
Tags cover a wide range of sermon topics including:
- Relationships & Family: Marriage, Family, Friendship, Singleness
- Theological Foundations: Salvation, Faith, Trinity, Church, Holy Spirit
- Spiritual Disciplines: Prayer, Worship, Fasting, Bible Study
- Personal Growth: Hope, Love, Joy, Peace, Courage, Wisdom
- Life Challenges: Suffering, Anxiety, Doubt, Addiction, Grief
- Biblical Studies: Parables, Sermon on the Mount, specific book studies
- Seasonal: Advent, Christmas, Easter, Lent
- And many more...
```
{
  "summary": "This sermon explores the transformative power of faith...",
  "word_count": 120,
  "character_count": 750,
  "model": "gpt-4o-mini",
  "transcription_length": 28500,
  "tags": ["Faith", "Salvation", "Hope", "Discipleship"]
}
```

To add new tags:
1. Edit `config/tags_config.py`
2. Add your new tag to the appropriate category list (or create a new category)
3. The tag will be automatically included in `ALL_TAGS` and available on the next run
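For illustration, `config/tags_config.py` might be organized along these lines — the category and tag names shown here are examples, not the real file's contents:

```python
# config/tags_config.py -- illustrative structure; the real file defines 102+ tags
RELATIONSHIPS_AND_FAMILY = ["Marriage", "Family", "Friendship", "Singleness"]
SEASONAL = ["Advent", "Christmas", "Easter", "Lent"]  # add a new tag to a category list

# Aggregated automatically, so new tags are picked up on the next run
ALL_TAGS = sorted(set(RELATIONSHIPS_AND_FAMILY + SEASONAL))
```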
The agent automatically generates pre-computed audio waveform data for mobile app visualization, eliminating the need for resource-intensive client-side processing.
- Audio Analysis: After transcription, the waveform node uses librosa to analyze the extracted audio file
- RMS Calculation: Divides audio into 480 equal segments and calculates RMS (Root Mean Square) amplitude for each
- Normalization: Values are normalized to 0.15-1.0 range for consistent visualization across different audio files
- Storage: Waveform data is saved to `summary.json` as a `waveform_data` array field
RMS provides a better representation of perceived loudness than peak amplitude values:
- More stable and less sensitive to transient spikes
- Better represents the energy content of audio segments
The waveform data is an array of 480 floating-point values between 0.15 and 1.0:
```
{
  "summary": "This sermon explores...",
  "tags": ["Faith", "Hope"],
  "waveform_data": [
    0.42, 0.38, 0.45, 0.52, 0.48, 0.55, 0.61, 0.58, 0.64, 0.71,
    0.68, 0.75, 0.82, 0.78, 0.85, 0.91, 0.88, 0.95, 0.89, 0.83,
    ...
  ]
}
```

- Performance: Pre-computed server-side, eliminating mobile device CPU load
- Consistency: Normalized values ensure consistent visualization across all audio files
- Simplicity: Ready-to-use JSON format for direct integration with mobile apps
- Fast: Adds only ~5-10 seconds to processing time for typical sermon lengths
Waveform generation requires the librosa library, which is included in requirements.txt:
```
pip install librosa>=0.10.0
```

While librosa has many advanced features, this project uses it only for RMS amplitude calculations.
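A minimal sketch of that RMS computation — the function name is our own, and the agent's implementation may differ in detail:

```python
from typing import List

import librosa
import numpy as np

def compute_waveform(audio_path: str, num_points: int = 480,
                     floor: float = 0.15) -> List[float]:
    """RMS amplitude per segment, normalized into the 0.15-1.0 range."""
    y, _ = librosa.load(audio_path, sr=None, mono=True)
    segments = np.array_split(y, num_points)            # 480 near-equal chunks
    rms = np.array([np.sqrt(np.mean(s ** 2)) for s in segments])
    peak = rms.max() if rms.max() > 0 else 1.0          # guard against silence
    normalized = floor + (1.0 - floor) * (rms / peak)   # map into [floor, 1.0]
    return [round(float(v), 2) for v in normalized]
```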
- Tags are loaded once and cached in memory during batch processing
- Hybrid analysis uses ~15,000 characters of transcript plus the full summary
- Each sermon typically takes 3-4 seconds for tag classification
- Cost: ~$0.0008 per sermon (very affordable with GPT-4o-mini)
The agent can automatically generate short summary videos (under 10 minutes) from full sermon recordings. This feature uses AI to intelligently select and stitch together the most important moments from your sermon, which is useful for creating social media posts and other short-form content for casual viewing.
- AI-Powered Selection: GPT-4o-mini analyzes the sermon transcript to identify key moments and themes
- Intelligent Segmentation: Selects 30-60 second segments that capture the most important content
- Smart Optimization: Merges nearby segments, adds context padding, and ensures chronological order (see the sketch after this list)
- Video Processing: Uses FFmpeg to extract and concatenate selected segments with optional fade transitions
- GPU Acceleration: Supports NVIDIA CUDA (h264_nvenc) for faster video encoding with automatic CPU fallback
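The optimization step referenced above can be sketched as follows; the helper name and exact trimming policy are illustrative, with the padding and merge values mirroring the configuration settings described below:

```python
from typing import List, Tuple

def optimize_segments(segments: List[Tuple[float, float]],
                      padding: float = 5, merge_gap: float = 15,
                      max_total: float = 600) -> List[Tuple[float, float]]:
    """Pad each (start, end) segment, merge near-neighbors, cap total duration."""
    if not segments:
        return []
    padded = sorted((max(0.0, s - padding), e + padding) for s, e in segments)
    merged = [list(padded[0])]
    for s, e in padded[1:]:
        if s - merged[-1][1] <= merge_gap:        # close small gaps
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    total, result = 0.0, []
    for s, e in merged:                           # keep chronological order
        if total + (e - s) > max_total:
            break                                 # stop before exceeding the cap
        result.append((s, e))
        total += e - s
    return result
```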
Video clip generation is disabled by default and must be explicitly enabled in your .env file:
```
# Enable video clip generation
ENABLE_CLIP_GENERATION=true
```

Add these settings to your .env file to customize clip generation behavior:

```
# Video Clip Generation Settings
ENABLE_CLIP_GENERATION=false    # Set to 'true' to enable (default: false)
MAX_CLIP_DURATION=600           # Maximum clip length in seconds (default: 600 = 10 minutes)
MIN_SEGMENT_LENGTH=30           # Minimum segment length in seconds (default: 30)
CONTEXT_PADDING=5               # Seconds to add before/after each segment (default: 5)
MERGE_GAP_THRESHOLD=15          # Merge segments if gap is less than this (default: 15 seconds)
ENABLE_FADE_TRANSITIONS=true    # Add fade transitions between segments (default: true)
FADE_DURATION=0.5               # Fade transition duration in seconds (default: 0.5)
CLIP_OUTPUT_DIR=                # Optional: custom output directory (default: temp directory)
```

For significantly faster video processing, you can enable GPU-accelerated encoding using NVIDIA CUDA:
Performance Comparison:
- CPU (libx264): Standard encoding, works everywhere
- GPU (h264_nvenc): 3-5x faster encoding on NVIDIA GPUs (RTX/GTX series)
Setup Instructions:
See the detailed GPU Encoding Setup Guide for step-by-step instructions on:
- Installing NVIDIA drivers and CUDA toolkit
- Building FFMPEG with CUDA support
- Verifying GPU encoding is working
- Troubleshooting common issues
Automatic Fallback: If GPU encoding is not available or fails, the agent automatically falls back to CPU encoding (libx264) to ensure your clips are always generated successfully.
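That fallback behavior can be sketched like this — an illustrative helper, not the agent's actual code, using FFmpeg's concat demuxer:

```python
import subprocess

def encode_clip(concat_list: str, out_path: str) -> None:
    """Try NVENC first; fall back to libx264 so a clip is always produced."""
    base = ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", concat_list]
    try:
        # GPU path: NVIDIA hardware encoder
        subprocess.run(base + ["-c:v", "h264_nvenc", "-c:a", "aac", out_path],
                       check=True, capture_output=True)
    except subprocess.CalledProcessError:
        # CPU path: software encoder, works everywhere
        subprocess.run(base + ["-c:v", "libx264", "-c:a", "aac", out_path],
                       check=True, capture_output=True)
```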
When clip generation is enabled, the agent produces:
- `{filename}_Summary.mp4`: The generated summary video clip
- `{filename}_Summary_metadata.json`: Detailed metadata including:
  - Selected segments with timestamps and importance scores
  - Processing statistics (AI selection time, encoding time, etc.)
  - GPU encoding information
  - Token usage for AI segment selection
  - Compression ratio and file size reduction
```
{
"version": "2.0",
"generated_at": "2025-11-01 14:30:00",
"original_video": {
"duration_seconds": 2400,
"file_size_mb": 450.5
},
"output_video": {
"duration_seconds": 540,
"file_size_mb": 95.2,
"size_reduction_percent": 78.9
},
"summary": {
"total_segments": 12,
"total_duration_seconds": 540,
"average_importance_score": 8.5,
"compression_ratio": 4.44
},
"gpu_info": {
"encoding_method": "GPU (h264_nvenc)",
"gpu_available": true
}
}
```

- Time-Saving: Automatically creates shareable highlight reels from full sermons
- AI-Powered: Intelligently selects the most impactful moments
- Flexible: Highly configurable to match your needs
- Fast: GPU acceleration makes processing quick even for long sermons
- Professional: Smooth fade transitions and optimized encoding
- Cost-Effective: Uses GPT-4o-mini for affordable AI segment selection (~$0.001-0.003 per sermon)
Video clip generation adds minimal token usage:
- Input tokens: Transcript analysis (~5,000-15,000 tokens depending on sermon length)
- Output tokens: Segment selection data (~500-1,500 tokens)
- Cost: Approximately $0.001-0.003 per sermon with GPT-4o-mini
Token usage is tracked separately and included in the total cost breakdown.
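As a rough sanity check (assuming GPT-4o-mini pricing of about $0.15 per million input tokens and $0.60 per million output tokens at the time of writing): a 10,000-token transcript plus 1,000 output tokens costs roughly 10,000 × $0.00000015 + 1,000 × $0.0000006 ≈ $0.0021, consistent with the range above.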
Environment variables (in .env):
- `OPENAI_API_KEY`: Your OpenAI API key (required)
- `WHISPER_MODEL`: Whisper model to use (default: `small.en`)
  - Options: `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large`
  - Larger models are more accurate but slower (GPU highly recommended for medium/large)
- `WHISPER_FORCE_CPU`: Force CPU mode even if GPU is available (default: `false`)
- `SERMON_AUDIO_DIR`: Directory to search for sermon files (optional)
- `ENABLE_CLIP_GENERATION`: Enable automatic video clip generation (default: `false`)
- `MAX_CLIP_DURATION`: Maximum clip length in seconds (default: `600`)
- `MIN_SEGMENT_LENGTH`: Minimum segment length in seconds (default: `30`)
- `CONTEXT_PADDING`: Seconds to add before/after each segment (default: `5`)
- `MERGE_GAP_THRESHOLD`: Merge segments if the gap is less than this, in seconds (default: `15`)
- `ENABLE_FADE_TRANSITIONS`: Add fade transitions between segments (default: `true`)
- `FADE_DURATION`: Fade transition duration in seconds (default: `0.5`)
- `CLIP_OUTPUT_DIR`: Custom output directory for clips (optional, defaults to temp directory)
See the Video Clip Generation section for detailed information.
The agent automatically detects and uses NVIDIA GPUs (CUDA) for Whisper transcription, providing significant speed improvements.
Performance Comparison:
- CPU: ~15 minutes for a typical sermon
- GPU (RTX 4080 SUPER): ~4 minutes for the same sermon
- Speedup: ~4x faster with GPU acceleration
GPU Detection:
- If you have an NVIDIA GPU with CUDA support (RTX series, GTX series, etc.), it will be automatically detected and used
- The agent will display GPU information at startup:
  ```
  🚀 GPU detected: NVIDIA GeForce RTX 4080 SUPER
  Number of GPUs available: 1
  Using device: cuda with fp16 precision
  This will be MUCH faster than CPU!
  ```
- fp16 precision is automatically enabled on GPU for faster inference
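In code, the detection logic amounts to roughly the following sketch (assuming the openai-whisper and PyTorch packages; `sermon.mp3` is a placeholder path):

```python
import os

import torch
import whisper

def load_whisper(model_name: str = "small.en"):
    """Pick CUDA when available unless WHISPER_FORCE_CPU overrides it."""
    force_cpu = os.getenv("WHISPER_FORCE_CPU", "false").lower() == "true"
    use_gpu = torch.cuda.is_available() and not force_cpu
    device = "cuda" if use_gpu else "cpu"
    if use_gpu:
        print(f"🚀 GPU detected: {torch.cuda.get_device_name(0)}")
    return whisper.load_model(model_name, device=device), use_gpu

model, use_fp16 = load_whisper()
# fp16 halves memory use and speeds up inference on GPU
result = model.transcribe("sermon.mp3", fp16=use_fp16)
```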
Installing CUDA-Enabled PyTorch:
Recommended: Use the automated setup script (see Installation section):
```
# Windows
setup_venv_gpu.bat

# macOS/Linux
./setup_venv_gpu.sh
```

Manual Installation:
If you already have a venv and need to add GPU support:
1. Uninstall CPU-only PyTorch:

   ```
   pip uninstall torch torchvision torchaudio
   ```

2. Install CUDA-enabled PyTorch (for CUDA 11.8):

   ```
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   ```

   For other CUDA versions, visit: https://pytorch.org/get-started/locally/

3. Verify GPU is detected:

   ```
   python test_gpu.py
   ```

   You should see:

   ```
   ✅ CUDA available: True
   ✅ GPU 0: NVIDIA GeForce RTX [Your GPU Model]
   🚀 GPU is ready to use! Whisper will run MUCH faster.
   ```
Important Note about librosa:
When installing librosa (required for waveform generation), pip may replace CUDA-enabled PyTorch with the CPU version. The automated setup script handles this by installing PyTorch with CUDA first, then librosa. If you install packages manually and lose GPU support, simply reinstall CUDA-enabled PyTorch:
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

To force CPU mode:
Set `WHISPER_FORCE_CPU=true` in your `.env` file (useful for testing or if you encounter GPU memory issues).
Batch processing leverages GPU acceleration for optimal performance:
- With GPU (RTX 4080 SUPER):
  - ~4 minutes per sermon (30-45 minute audio)
  - Can process 15 sermons per hour
  - Recommended for large batches
- With CPU only:
  - ~15 minutes per sermon
  - Can process 4 sermons per hour
  - Still functional but significantly slower
Example batch processing time:
- 10 sermons with GPU: ~40 minutes
- 10 sermons with CPU: ~2.5 hours
Problem: Files are being skipped or not found
- Solution: Ensure all files have supported extensions (.mp3, .mp4, .wav, .m4a, .mov)
- Check that the directory path is correct and accessible
Problem: Some files fail to process
- Solution: Check `batch_summaries.json` for error details
- Individual file failures won't stop the batch - other files will continue processing
- Common issues: corrupted files, unsupported codecs, insufficient disk space
Problem: GPU out of memory during batch processing
- Solution:
  - Use a smaller Whisper model (e.g., `small.en` instead of `medium` or `large`)
  - Set `WHISPER_FORCE_CPU=true` to use CPU mode
  - Process files individually instead of in batch mode
A complete web application for interacting with the sermon summarization agent has been created with:
- Backend: C# .NET 9 Web API (`/API`)
- Frontend: React + Vite with TypeScript (`/UI`)
- Python Agent: Sermon processing engine with token tracking
- .NET 9 SDK
- Node.js 18+
- Python 3.8+ with dependencies installed
- OpenAI API key
1. Navigate to the API directory:

   ```
   cd API
   ```

2. Build the project:

   ```
   dotnet build
   ```

3. Run the API:

   ```
   dotnet run
   ```

   The API will start on `https://localhost:5001` (or `http://localhost:5000` in development).
1. Navigate to the UI directory:

   ```
   cd UI/SermonSummarizationUI
   ```

2. Install dependencies (if not already done):

   ```
   npm install
   ```

3. Create a `.env` file with the API URL:

   ```
   cp .env.example .env
   # Edit .env and set VITE_API_URL=http://localhost:5000/api
   ```

4. Start the development server:

   ```
   npm run dev
   ```

   The UI will be available at `http://localhost:5173`

5. To build for production:

   ```
   npm run build
   ```
- File Upload: Drag-and-drop or click to upload audio/video files (MP3, MP4, WAV, M4A, MOV)
- Real-time Processing: Watch as your sermon is transcribed, summarized, and tagged
- Skeleton Loaders: Modern loading states with skeleton screens instead of spinners
- Token Tracking: See exactly how many tokens were used for each operation
- Beautiful UI: Clean, modern interface with responsive design
- Error Handling: Comprehensive error messages and recovery
- `POST /api/sermons/process` - Upload and process a sermon file
- `GET /api/sermons/{id}/status` - Check processing status
- `GET /api/sermons/health` - Health check endpoint
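A hypothetical client call against these endpoints — the `file` form field and `id` response field are assumptions about the API's exact request and response shapes:

```python
import requests

API = "http://localhost:5000/api"

# Upload a sermon for processing
with open("sermon.mp3", "rb") as f:
    resp = requests.post(f"{API}/sermons/process",
                         files={"file": ("sermon.mp3", f, "audio/mpeg")})
resp.raise_for_status()
job = resp.json()

# Poll the status endpoint for progress
status = requests.get(f"{API}/sermons/{job['id']}/status").json()
print(status)
```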
The application tracks token usage across all operations:
- Transcription tokens: Tokens used for audio transcription
- Summarization tokens: Tokens used for generating the summary
- Tagging tokens: Tokens used for semantic tag classification
- Clip generation tokens: Tokens used for AI-powered segment selection (when enabled)
- Total tokens: Sum of all tokens used
Token counts and associated costs are displayed in the UI after processing completes.
See LICENSE file for details.