A LangGraph-based AI agent that transcribes and summarizes sermon video recordings from MP4 or MP3 files.
- Transcription: Converts audio from MP4/MP3 files to text using OpenAI Whisper
- GPU Acceleration: Automatically detects and uses NVIDIA GPU (CUDA) for much faster transcription
- Waveform Generation: Pre-computes audio waveform data (480 normalized amplitude values) so the mobile app can render a waveform via adaptive downsampling within the device viewport, with no client-side audio analysis
- Video Clip Generation: Automatically creates sub-10-minute summary MP4s from full sermons using AI-powered segment selection
- Summarization: Generates end-user-friendly single-paragraph summaries using GPT-4o-mini
- Semantic Tagging: Automatically applies relevant topical tags to summaries for better organization and discovery
- Batch Processing: Process entire directories of sermon files at once
- LangGraph Architecture: Built with a graph-based workflow for clear separation of concerns
- CLI Interface: Easy-to-use command-line interface
- Python 3.8+
- FFmpeg (must be installed separately)
- OpenAI API key
For the best performance with GPU acceleration, use the automated setup script:
Windows:

```
setup_venv_gpu.bat
```

macOS/Linux:

```
chmod +x setup_venv_gpu.sh
./setup_venv_gpu.sh
```

This script will:
- Create a Python virtual environment
- Install CUDA-enabled PyTorch (for GPU acceleration)
- Install all other dependencies in the correct order
- Verify GPU detection
Then configure your environment:
```
cp .env.example .env
# Edit .env and add your OpenAI API key
```

If you don't have an NVIDIA GPU or prefer manual setup:
1. Clone the repository:

2. Create a virtual environment:

   ```
   python -m venv venv
   venv\Scripts\activate        # On Windows
   # source venv/bin/activate   # On macOS/Linux
   ```

3. Install dependencies:

   ```
   pip install -r requirements.txt
   ```

   Note: This installs CPU-only PyTorch. For GPU acceleration (4x faster), use the automated setup script above or see the GPU Acceleration section below.

4. Install FFmpeg (if not already installed):

   - Windows: `choco install ffmpeg` or download from ffmpeg.org
   - macOS: `brew install ffmpeg`
   - Linux: `apt install ffmpeg` or `yum install ffmpeg`

5. Configure environment variables:

   ```
   cp .env.example .env
   # Edit .env and add your OpenAI API key
   ```
Transcribe and summarize a single sermon file:

```
python agent.py --file path/to/sermon.mp4
```

Process all audio files in a directory at once:

```
python agent.py --batch-dir "G:\Thrive\Sermon Videos\Audio Files"
```

This will:
- Find all audio files (MP3, MP4, WAV, M4A, MOV) in the directory
- Process each file sequentially
- Create organized subdirectories for each file's outputs
- Generate a consolidated `batch_summaries.json` with all results
- Display progress as files are processed
- Continue processing even if individual files fail
If no file is specified, the agent will auto-detect the latest media file in the configured directory:
```
python agent.py
```

- `--file`, `-f`: Path to a single sermon audio/video file (MP4, MP3, WAV, M4A, MOV)
- `--batch-dir`, `-b`: Path to directory containing multiple sermon files for batch processing
- `--resume`: Skip files that have already been successfully processed (only works with `--batch-dir`)
Note: `--file` and `--batch-dir` are mutually exclusive. Use one or the other.
For large batches, use the --resume flag to enable checkpoint-based resumption:
```
# Resume an interrupted batch
python agent.py --batch-dir "G:\Thrive\Sermon Videos\Audio Files" --resume
```

Benefits:
- ✅ Skip files that have already been successfully processed
- ✅ Safely interrupt and resume large batch jobs
- ✅ Automatically retry failed files while preserving successful ones
- ✅ Perfect for processing hundreds of files across multiple sessions
Without --resume (default):
- Clears the `batch_outputs/` directory before starting
- Useful for testing prompt changes or configuration adjustments
See CHECKPOINT_GUIDE.md for detailed usage and examples.
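For illustration, resume-style filtering can be as simple as skipping any file whose entry in `batch_summaries.json` is already marked successful. A minimal sketch — the agent's actual checkpoint logic lives in the code and CHECKPOINT_GUIDE.md and may differ:

```python
import json
import os

MEDIA_EXTS = {".mp3", ".mp4", ".wav", ".m4a", ".mov"}

def files_to_process(batch_dir, results_path="batch_outputs/batch_summaries.json"):
    """Yield media files that have no successful entry in the results file yet."""
    done = set()
    if os.path.exists(results_path):
        with open(results_path) as f:
            done = {name for name, entry in json.load(f).items()
                    if entry.get("status") == "success"}
    for name in sorted(os.listdir(batch_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() in MEDIA_EXTS and stem not in done:
            yield os.path.join(batch_dir, name)
```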
The agent generates the following files in the current directory:
- `transcription.txt`: Full transcription text
- `transcription_segments.json`: Transcription with timestamps
- `summary.txt`: Single-paragraph summary
- `summary.json`: Summary with metadata and semantic tags
The agent generates:
- Individual file outputs in organized subdirectories:

  ```
  batch_outputs/
  ├── 2025-10-05-Recording/
  │   ├── transcription.txt
  │   ├── transcription_segments.json
  │   ├── summary.txt
  │   └── summary.json
  ├── 2025-10-12-Recording/
  │   ├── transcription.txt
  │   ├── transcription_segments.json
  │   ├── summary.txt
  │   └── summary.json
  └── ...
  ```

- Consolidated JSON output (`batch_summaries.json`):

  ```
  {
    "2025-10-05-Recording": {
      "filename": "2025-10-05-Recording.mp3",
      "summary": "This sermon explores the transformative power of faith...",
      "transcription_path": "C:\\...\\batch_outputs\\2025-10-05-Recording\\transcription.txt",
      "summary_path": "C:\\...\\batch_outputs\\2025-10-05-Recording\\summary.txt",
      "word_count": 158,
      "date_processed": "2025-10-09T10:30:00Z",
      "status": "success"
    },
    "2025-10-12-Recording": {
      "filename": "2025-10-12-Recording.mp3",
      "summary": "The message focuses on the importance of community...",
      "transcription_path": "C:\\...\\batch_outputs\\2025-10-12-Recording\\transcription.txt",
      "summary_path": "C:\\...\\batch_outputs\\2025-10-12-Recording\\summary.txt",
      "word_count": 142,
      "date_processed": "2025-10-09T10:35:00Z",
      "status": "success"
    }
  }
  ```

If a file fails to process, the entry will include error information:

```
{
  "problematic-file": {
    "filename": "problematic-file.mp3",
    "summary": null,
    "transcription_path": null,
    "summary_path": null,
    "word_count": 0,
    "date_processed": "2025-10-09T10:40:00Z",
    "status": "error",
    "error": "File not found or corrupted"
  }
}
```
The agent uses LangGraph with five main nodes:
- Transcribe Node: Converts audio to text using OpenAI Whisper
- Automatically detects and uses GPU (CUDA) if available
- Extracts audio from video files using FFmpeg
- Generates timestamped segments
- Waveform Node: Generates audio waveform data using librosa
- Calculates 480 RMS (Root Mean Square) amplitude values
- Normalizes values to 0.15-1.0 range for consistent visualization
- Saves waveform data to summary.json for API consumption
- Summarize Node: Generates a summary using GPT-4o-mini
- Creates end-user-friendly single-paragraph summaries
- Includes metadata (word count, processing date, etc.)
- Tagging Node: Applies semantic tags to summaries
- Analyzes summary content using GPT-4o-mini
- Selects up to 5 relevant tags from a predefined list (config/tags_config.py)
- Tags are cached in memory for efficient batch processing
- Updates summary.json with a tags array
- Clip Generation Node (optional): Creates summary video clips
- Uses GPT-4o-mini to analyze transcript and select key moments
- Optimizes segments with context padding and gap merging
- Generates MP4 clips using FFMPEG with GPU acceleration support
- Only runs when `ENABLE_CLIP_GENERATION=true` in `.env`
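For orientation, here is a minimal sketch of how these five nodes could be wired together with LangGraph — the state fields and node bodies are illustrative stubs, not the agent's actual code:

```python
import os
from typing import List, Optional, TypedDict

from langgraph.graph import END, StateGraph

class SermonState(TypedDict, total=False):
    file_path: str
    transcription: str
    waveform_data: List[float]
    summary: str
    tags: List[str]
    clip_path: Optional[str]

# Stub nodes standing in for the real implementations described above.
def transcribe(state: SermonState):
    return {"transcription": "..."}            # Whisper (GPU if available)

def waveform(state: SermonState):
    return {"waveform_data": [0.15] * 480}     # librosa RMS values

def summarize(state: SermonState):
    return {"summary": "..."}                  # GPT-4o-mini summary

def tag(state: SermonState):
    return {"tags": ["Faith"]}                 # semantic tagging

def generate_clip(state: SermonState):
    return {"clip_path": "..."}                # FFmpeg clip generation

graph = StateGraph(SermonState)
for name, fn in [("transcribe", transcribe), ("waveform", waveform),
                 ("summarize", summarize), ("tag", tag), ("clip", generate_clip)]:
    graph.add_node(name, fn)

graph.set_entry_point("transcribe")
graph.add_edge("transcribe", "waveform")
graph.add_edge("waveform", "summarize")
graph.add_edge("summarize", "tag")
# The clip node only runs when ENABLE_CLIP_GENERATION=true in .env
graph.add_conditional_edges(
    "tag",
    lambda s: "clip" if os.getenv("ENABLE_CLIP_GENERATION", "false") == "true" else END,
)
graph.add_edge("clip", END)

app = graph.compile()
```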
The agent automatically applies relevant topical tags to each sermon summary for better organization and discovery.
- Tag Source: Tags are defined in `config/tags_config.py`, which contains 102+ predefined sermon topics
- Hybrid Analysis: GPT-4o-mini analyzes BOTH the summary (main themes) and a transcript excerpt (comprehensive context) to determine relevant themes (see the sketch after this list)
- Selection: The AI selects up to 5 of the most relevant tags from the available list
- Storage: Tags are added to `summary.json` as a `tags` array field
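A minimal sketch of what the hybrid classification call could look like using the OpenAI SDK — the prompt wording and function shape are illustrative, not the agent's actual implementation:

```python
from typing import List

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_tags(summary: str, transcript: str, all_tags: List[str]) -> List[str]:
    """Hybrid analysis: full summary plus a transcript excerpt."""
    prompt = (
        "Pick up to 5 tags from this list that best fit the sermon:\n"
        f"{', '.join(all_tags)}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Transcript excerpt:\n{transcript[:15000]}\n\n"
        "Reply with a comma-separated list of tags only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    chosen = [t.strip() for t in resp.choices[0].message.content.split(",")]
    return [t for t in chosen if t in all_tags][:5]  # keep only valid tags
```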
Tags cover a wide range of sermon topics including:
- Relationships & Family: Marriage, Family, Friendship, Singleness
- Theological Foundations: Salvation, Faith, Trinity, Church, Holy Spirit
- Spiritual Disciplines: Prayer, Worship, Fasting, Bible Study
- Personal Growth: Hope, Love, Joy, Peace, Courage, Wisdom
- Life Challenges: Suffering, Anxiety, Doubt, Addiction, Grief
- Biblical Studies: Parables, Sermon on the Mount, specific book studies
- Seasonal: Advent, Christmas, Easter, Lent
- And many more...
```
{
  "summary": "This sermon explores the transformative power of faith...",
  "word_count": 120,
  "character_count": 750,
  "model": "gpt-4o-mini",
  "transcription_length": 28500,
  "tags": ["Faith", "Salvation", "Hope", "Discipleship"]
}
```

To add new tags:
1. Edit `config/tags_config.py`
2. Add your new tag to the appropriate category list (or create a new category)
3. The tag will be automatically included in `ALL_TAGS` and available on the next run
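For illustration, `config/tags_config.py` might be organized along these lines — the category and tag names shown here are examples, not the real file's contents:

```python
# config/tags_config.py -- illustrative structure; the real file defines 102+ tags
RELATIONSHIPS_AND_FAMILY = ["Marriage", "Family", "Friendship", "Singleness"]
SEASONAL = ["Advent", "Christmas", "Easter", "Lent"]  # add a new tag to a category list

# Aggregated automatically, so new tags are picked up on the next run
ALL_TAGS = sorted(set(RELATIONSHIPS_AND_FAMILY + SEASONAL))
```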
The agent automatically generates pre-computed audio waveform data for mobile app visualization, eliminating the need for resource-intensive client-side processing.
- Audio Analysis: After transcription, the waveform node uses librosa to analyze the extracted audio file
- RMS Calculation: Divides audio into 480 equal segments and calculates RMS (Root Mean Square) amplitude for each
- Normalization: Values are normalized to 0.15-1.0 range for consistent visualization across different audio files
- Storage: Waveform data is saved to `summary.json` as a `waveform_data` array field
RMS provides a better representation of perceived loudness than peak amplitude values:
- More stable and less sensitive to transient spikes
- Better represents the energy content of audio segments
The waveform data is an array of 480 floating-point values between 0.15 and 1.0:
```
{
  "summary": "This sermon explores...",
  "tags": ["Faith", "Hope"],
  "waveform_data": [
    0.42, 0.38, 0.45, 0.52, 0.48, 0.55, 0.61, 0.58, 0.64, 0.71,
    0.68, 0.75, 0.82, 0.78, 0.85, 0.91, 0.88, 0.95, 0.89, 0.83,
    ...
  ]
}
```

- Performance: Pre-computed server-side, eliminating mobile device CPU load
- Consistency: Normalized values ensure consistent visualization across all audio files
- Simplicity: Ready-to-use JSON format for direct integration with mobile apps
- Fast: Adds only ~5-10 seconds to processing time for typical sermon lengths
Waveform generation requires the librosa library, which is included in requirements.txt:
```
pip install librosa>=0.10.0
```

While librosa has many advanced features, this project uses it only for RMS amplitude calculations.
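A minimal sketch of that RMS computation — the function name is our own, and the agent's implementation may differ in detail:

```python
from typing import List

import librosa
import numpy as np

def compute_waveform(audio_path: str, num_points: int = 480,
                     floor: float = 0.15) -> List[float]:
    """RMS amplitude per segment, normalized into the 0.15-1.0 range."""
    y, _ = librosa.load(audio_path, sr=None, mono=True)
    segments = np.array_split(y, num_points)            # 480 near-equal chunks
    rms = np.array([np.sqrt(np.mean(s ** 2)) for s in segments])
    peak = rms.max() if rms.max() > 0 else 1.0          # guard against silence
    normalized = floor + (1.0 - floor) * (rms / peak)   # map into [floor, 1.0]
    return [round(float(v), 2) for v in normalized]
```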
- Tags are loaded once and cached in memory during batch processing
- Hybrid analysis uses ~15,000 characters of transcript plus the full summary
- Each sermon typically takes 3-4 seconds for tag classification
- Cost: ~$0.0008 per sermon (very affordable with GPT-4o-mini)
The agent can automatically generate short summary videos (under 10 minutes) from full sermon recordings. This feature uses AI to intelligently select and stitch together the most important moments from your sermon, which is useful for creating social media posts and other short-form content for casual viewing.
- AI-Powered Selection: GPT-4o-mini analyzes the sermon transcript to identify key moments and themes
- Intelligent Segmentation: Selects 30-60 second segments that capture the most important content
- Smart Optimization: Merges nearby segments, adds context padding, and ensures chronological order (see the sketch after this list)
- Video Processing: Uses FFmpeg to extract and concatenate selected segments with optional fade transitions
- GPU Acceleration: Supports NVIDIA CUDA (h264_nvenc) for faster video encoding with automatic CPU fallback
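The optimization step referenced above can be sketched as follows; the helper name and exact trimming policy are illustrative, with the padding and merge values mirroring the configuration settings described below:

```python
from typing import List, Tuple

def optimize_segments(segments: List[Tuple[float, float]],
                      padding: float = 5, merge_gap: float = 15,
                      max_total: float = 600) -> List[Tuple[float, float]]:
    """Pad each (start, end) segment, merge near-neighbors, cap total duration."""
    if not segments:
        return []
    padded = sorted((max(0.0, s - padding), e + padding) for s, e in segments)
    merged = [list(padded[0])]
    for s, e in padded[1:]:
        if s - merged[-1][1] <= merge_gap:        # close small gaps
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    total, result = 0.0, []
    for s, e in merged:                           # keep chronological order
        if total + (e - s) > max_total:
            break                                 # stop before exceeding the cap
        result.append((s, e))
        total += e - s
    return result
```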
Video clip generation is disabled by default and must be explicitly enabled in your .env file:
```
# Enable video clip generation
ENABLE_CLIP_GENERATION=true
```

Add these settings to your .env file to customize clip generation behavior:

```
# Video Clip Generation Settings
ENABLE_CLIP_GENERATION=false    # Set to 'true' to enable (default: false)
MAX_CLIP_DURATION=600           # Maximum clip length in seconds (default: 600 = 10 minutes)
MIN_SEGMENT_LENGTH=30           # Minimum segment length in seconds (default: 30)
CONTEXT_PADDING=5               # Seconds to add before/after each segment (default: 5)
MERGE_GAP_THRESHOLD=15          # Merge segments if gap is less than this (default: 15 seconds)
ENABLE_FADE_TRANSITIONS=true    # Add fade transitions between segments (default: true)
FADE_DURATION=0.5               # Fade transition duration in seconds (default: 0.5)
CLIP_OUTPUT_DIR=                # Optional: custom output directory (default: temp directory)
```

For significantly faster video processing, you can enable GPU-accelerated encoding using NVIDIA CUDA:
Performance Comparison:
- CPU (libx264): Standard encoding, works everywhere
- GPU (h264_nvenc): 3-5x faster encoding on NVIDIA GPUs (RTX/GTX series)
Setup Instructions:
See the detailed GPU Encoding Setup Guide for step-by-step instructions on:
- Installing NVIDIA drivers and CUDA toolkit
- Building FFMPEG with CUDA support
- Verifying GPU encoding is working
- Troubleshooting common issues
Automatic Fallback: If GPU encoding is not available or fails, the agent automatically falls back to CPU encoding (libx264) to ensure your clips are always generated successfully.
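That fallback behavior can be sketched like this — an illustrative helper, not the agent's actual code, using FFmpeg's concat demuxer:

```python
import subprocess

def encode_clip(concat_list: str, out_path: str) -> None:
    """Try NVENC first; fall back to libx264 so a clip is always produced."""
    base = ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", concat_list]
    try:
        # GPU path: NVIDIA hardware encoder
        subprocess.run(base + ["-c:v", "h264_nvenc", "-c:a", "aac", out_path],
                       check=True, capture_output=True)
    except subprocess.CalledProcessError:
        # CPU path: software encoder, works everywhere
        subprocess.run(base + ["-c:v", "libx264", "-c:a", "aac", out_path],
                       check=True, capture_output=True)
```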
When clip generation is enabled, the agent produces:
- `{filename}_Summary.mp4`: The generated summary video clip
- `{filename}_Summary_metadata.json`: Detailed metadata including:
  - Selected segments with timestamps and importance scores
  - Processing statistics (AI selection time, encoding time, etc.)
  - GPU encoding information
  - Token usage for AI segment selection
  - Compression ratio and file size reduction
```
{
"version": "2.0",
"generated_at": "2025-11-01 14:30:00",
"original_video": {
"duration_seconds": 2400,
"file_size_mb": 450.5
},
"output_video": {
"duration_seconds": 540,
"file_size_mb": 95.2,
"size_reduction_percent": 78.9
},
"summary": {
"total_segments": 12,
"total_duration_seconds": 540,
"average_importance_score": 8.5,
"compression_ratio": 4.44
},
"gpu_info": {
"encoding_method": "GPU (h264_nvenc)",
"gpu_available": true
}
}
```

- Time-Saving: Automatically creates shareable highlight reels from full sermons
- AI-Powered: Intelligently selects the most impactful moments
- Flexible: Highly configurable to match your needs
- Fast: GPU acceleration makes processing quick even for long sermons
- Professional: Smooth fade transitions and optimized encoding
- Cost-Effective: Uses GPT-4o-mini for affordable AI segment selection (~$0.001-0.003 per sermon)
Video clip generation adds minimal token usage:
- Input tokens: Transcript analysis (~5,000-15,000 tokens depending on sermon length)
- Output tokens: Segment selection data (~500-1,500 tokens)
- Cost: Approximately $0.001-0.003 per sermon with GPT-4o-mini
Token usage is tracked separately and included in the total cost breakdown.
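As a rough sanity check (assuming GPT-4o-mini pricing of about $0.15 per million input tokens and $0.60 per million output tokens at the time of writing): a 10,000-token transcript plus 1,000 output tokens costs roughly 10,000 × $0.00000015 + 1,000 × $0.0000006 ≈ $0.0021, consistent with the range above.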
Environment variables (in .env):
- `OPENAI_API_KEY`: Your OpenAI API key (required)
- `WHISPER_MODEL`: Whisper model to use (default: `small.en`)
  - Options: `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large`
  - Larger models are more accurate but slower (GPU highly recommended for medium/large)
- `WHISPER_FORCE_CPU`: Force CPU mode even if GPU is available (default: `false`)
- `SERMON_AUDIO_DIR`: Directory to search for sermon files (optional)
- `ENABLE_CLIP_GENERATION`: Enable automatic video clip generation (default: `false`)
- `MAX_CLIP_DURATION`: Maximum clip length in seconds (default: `600`)
- `MIN_SEGMENT_LENGTH`: Minimum segment length in seconds (default: `30`)
- `CONTEXT_PADDING`: Seconds to add before/after each segment (default: `5`)
- `MERGE_GAP_THRESHOLD`: Merge segments if the gap is less than this, in seconds (default: `15`)
- `ENABLE_FADE_TRANSITIONS`: Add fade transitions between segments (default: `true`)
- `FADE_DURATION`: Fade transition duration in seconds (default: `0.5`)
- `CLIP_OUTPUT_DIR`: Custom output directory for clips (optional, defaults to temp directory)
See the Video Clip Generation section for detailed information.
The agent automatically detects and uses NVIDIA GPUs (CUDA) for Whisper transcription, providing significant speed improvements.
Performance Comparison:
- CPU: ~15 minutes for a typical sermon
- GPU (RTX 4080 SUPER): ~4 minutes for the same sermon
- Speedup: ~4x faster with GPU acceleration
GPU Detection:
- If you have an NVIDIA GPU with CUDA support (RTX series, GTX series, etc.), it will be automatically detected and used
- The agent will display GPU information at startup:
  ```
  🚀 GPU detected: NVIDIA GeForce RTX 4080 SUPER
  Number of GPUs available: 1
  Using device: cuda with fp16 precision
  This will be MUCH faster than CPU!
  ```
- fp16 precision is automatically enabled on GPU for faster inference
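In code, the detection logic amounts to roughly the following sketch (assuming the openai-whisper and PyTorch packages; `sermon.mp3` is a placeholder path):

```python
import os

import torch
import whisper

def load_whisper(model_name: str = "small.en"):
    """Pick CUDA when available unless WHISPER_FORCE_CPU overrides it."""
    force_cpu = os.getenv("WHISPER_FORCE_CPU", "false").lower() == "true"
    use_gpu = torch.cuda.is_available() and not force_cpu
    device = "cuda" if use_gpu else "cpu"
    if use_gpu:
        print(f"🚀 GPU detected: {torch.cuda.get_device_name(0)}")
    return whisper.load_model(model_name, device=device), use_gpu

model, use_fp16 = load_whisper()
# fp16 halves memory use and speeds up inference on GPU
result = model.transcribe("sermon.mp3", fp16=use_fp16)
```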
Installing CUDA-Enabled PyTorch:
Recommended: Use the automated setup script (see Installation section):
```
# Windows
setup_venv_gpu.bat

# macOS/Linux
./setup_venv_gpu.sh
```

Manual Installation:
If you already have a venv and need to add GPU support:
1. Uninstall CPU-only PyTorch:

   ```
   pip uninstall torch torchvision torchaudio
   ```

2. Install CUDA-enabled PyTorch (for CUDA 11.8):

   ```
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   ```

   For other CUDA versions, visit: https://pytorch.org/get-started/locally/

3. Verify GPU is detected:

   ```
   python test_gpu.py
   ```

   You should see:

   ```
   ✅ CUDA available: True
   ✅ GPU 0: NVIDIA GeForce RTX [Your GPU Model]
   🚀 GPU is ready to use! Whisper will run MUCH faster.
   ```
Important Note about librosa:
When installing librosa (required for waveform generation), pip may replace CUDA-enabled PyTorch with the CPU version. The automated setup script handles this by installing PyTorch with CUDA first, then librosa. If you install packages manually and lose GPU support, simply reinstall CUDA-enabled PyTorch:
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

To force CPU mode:
Set `WHISPER_FORCE_CPU=true` in your `.env` file (useful for testing or if you encounter GPU memory issues).
Batch processing leverages GPU acceleration for optimal performance:
- With GPU (RTX 4080 SUPER):
  - ~4 minutes per sermon (30-45 minute audio)
  - Can process 15 sermons per hour
  - Recommended for large batches
- With CPU only:
  - ~15 minutes per sermon
  - Can process 4 sermons per hour
  - Still functional but significantly slower
Example batch processing time:
- 10 sermons with GPU: ~40 minutes
- 10 sermons with CPU: ~2.5 hours
Problem: Files are being skipped or not found
- Solution: Ensure all files have supported extensions (.mp3, .mp4, .wav, .m4a, .mov)
- Check that the directory path is correct and accessible
Problem: Some files fail to process
- Solution: Check `batch_summaries.json` for error details
- Individual file failures won't stop the batch - other files will continue processing
- Common issues: corrupted files, unsupported codecs, insufficient disk space
Problem: GPU out of memory during batch processing
- Solution:
  - Use a smaller Whisper model (e.g., `small.en` instead of `medium` or `large`)
  - Set `WHISPER_FORCE_CPU=true` to use CPU mode
  - Process files individually instead of in batch mode
A complete web application for interacting with the sermon summarization agent has been created with:
- Backend: C# .NET 9 Web API (`/API`)
- Frontend: React + Vite with TypeScript (`/UI`)
- Python Agent: Sermon processing engine with token tracking
- .NET 9 SDK
- Node.js 18+
- Python 3.8+ with dependencies installed
- OpenAI API key
1. Navigate to the API directory:

   ```
   cd API
   ```

2. Build the project:

   ```
   dotnet build
   ```

3. Run the API:

   ```
   dotnet run
   ```

   The API will start on `https://localhost:5001` (or `http://localhost:5000` in development).
1. Navigate to the UI directory:

   ```
   cd UI/SermonSummarizationUI
   ```

2. Install dependencies (if not already done):

   ```
   npm install
   ```

3. Create a `.env` file with the API URL:

   ```
   cp .env.example .env
   # Edit .env and set VITE_API_URL=http://localhost:5000/api
   ```

4. Start the development server:

   ```
   npm run dev
   ```

   The UI will be available at `http://localhost:5173`

5. To build for production:

   ```
   npm run build
   ```
- File Upload: Drag-and-drop or click to upload audio/video files (MP3, MP4, WAV, M4A, MOV)
- Real-time Processing: Watch as your sermon is transcribed, summarized, and tagged
- Skeleton Loaders: Modern loading states with skeleton screens instead of spinners
- Token Tracking: See exactly how many tokens were used for each operation
- Beautiful UI: Clean, modern interface with responsive design
- Error Handling: Comprehensive error messages and recovery
- `POST /api/sermons/process` - Upload and process a sermon file
- `GET /api/sermons/{id}/status` - Check processing status
- `GET /api/sermons/health` - Health check endpoint
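A hypothetical client call against these endpoints — the `file` form field and `id` response field are assumptions about the API's exact request and response shapes:

```python
import requests

API = "http://localhost:5000/api"

# Upload a sermon for processing
with open("sermon.mp3", "rb") as f:
    resp = requests.post(f"{API}/sermons/process",
                         files={"file": ("sermon.mp3", f, "audio/mpeg")})
resp.raise_for_status()
job = resp.json()

# Poll the status endpoint for progress
status = requests.get(f"{API}/sermons/{job['id']}/status").json()
print(status)
```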
The application tracks token usage across all operations:
- Transcription tokens: Tokens used for audio transcription
- Summarization tokens: Tokens used for generating the summary
- Tagging tokens: Tokens used for semantic tag classification
- Clip generation tokens: Tokens used for AI-powered segment selection (when enabled)
- Total tokens: Sum of all tokens used
Token counts and associated costs are displayed in the UI after processing completes.
See LICENSE file for details.