ThriveCommunityChurch/Sermon_Summarization_Agent

Sermon Summarization Agent

A LangGraph-based AI agent that transcribes and summarizes sermon video recordings from MP4 or MP3 files.

Features

  • Transcription: Converts audio from MP4/MP3 files to text using OpenAI Whisper
  • GPU Acceleration: Automatically detects and uses NVIDIA GPU (CUDA) for much faster transcription
  • Waveform Generation: Pre-computes audio waveform data (480 normalized amplitude values) so the mobile app can render a waveform by adaptively downsampling to the device's viewport, without processing audio on the client
  • Video Clip Generation: Automatically creates sub-10-minute summary MP4s from full sermons using AI-powered segment selection
  • Summarization: Generates end-user-friendly single-paragraph summaries using GPT-4o-mini
  • Semantic Tagging: Automatically applies relevant topical tags to summaries for better organization and discovery
  • Batch Processing: Process entire directories of sermon files at once
  • LangGraph Architecture: Built with a graph-based workflow for clear separation of concerns
  • CLI Interface: Easy-to-use command-line interface

Requirements

  • Python 3.8+
  • FFmpeg (must be installed separately)
  • OpenAI API key

Installation

Quick Setup with GPU Support (Recommended)

For the best performance with GPU acceleration, use the automated setup script:

Windows:

setup_venv_gpu.bat

macOS/Linux:

chmod +x setup_venv_gpu.sh
./setup_venv_gpu.sh

This script will:

  • Create a Python virtual environment
  • Install CUDA-enabled PyTorch (for GPU acceleration)
  • Install all other dependencies in the correct order
  • Verify GPU detection

Then configure your environment:

cp .env.example .env
# Edit .env and add your OpenAI API key

Manual Installation (CPU-only)

If you don't have an NVIDIA GPU or prefer manual setup:

  1. Clone the repository:

  2. Create a virtual environment:

    python -m venv venv
    venv\Scripts\activate  # On Windows
    # source venv/bin/activate  # On macOS/Linux
  3. Install dependencies:

    pip install -r requirements.txt

    Note: This installs CPU-only PyTorch. For GPU acceleration (about 4x faster transcription in the benchmark figures below), use the automated setup script above or see the GPU Acceleration section below.

  4. Install FFmpeg (if not already installed):

    • Windows: choco install ffmpeg or download from ffmpeg.org
    • macOS: brew install ffmpeg
    • Linux: apt install ffmpeg or yum install ffmpeg
  5. Configure environment variables:

    cp .env.example .env
    # Edit .env and add your OpenAI API key

Usage

Single File Mode

Transcribe and summarize a single sermon file:

python agent.py --file path/to/sermon.mp4

Batch Processing Mode

Process all audio files in a directory at once:

python agent.py --batch-dir "G:\Thrive\Sermon Videos\Audio Files"

This will:

  • Find all audio files (MP3, MP4, WAV, M4A, MOV) in the directory
  • Process each file sequentially
  • Create organized subdirectories for each file's outputs
  • Generate a consolidated batch_summaries.json with all results
  • Display progress as files are processed
  • Continue processing even if individual files fail
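
The discovery step above can be sketched in a few lines. This is a minimal illustration, not the agent's actual code: `find_media_files` and `output_dir_for` are hypothetical names, and the extension list is simply the one given in this README.

```python
from pathlib import Path

# Extensions listed in this README; the agent's real list may differ.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".m4a", ".mov"}


def find_media_files(directory):
    """Return supported media files directly inside `directory`, sorted by name."""
    root = Path(directory)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def output_dir_for(media_file, base="batch_outputs"):
    """Map a media file to its per-file output subdirectory (assumed layout)."""
    return Path(base) / Path(media_file).stem
```

Matching on the lowercased suffix means files such as `Sermon.MP3` are picked up too, which is a common cause of "file not found" reports in batch tools.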

Auto-detect Latest File

If no file is specified, the agent will auto-detect the latest media file in the configured directory:

python agent.py

Options

  • --file, -f: Path to a single sermon audio/video file (MP4, MP3, WAV, M4A, MOV)
  • --batch-dir, -b: Path to directory containing multiple sermon files for batch processing
  • --resume: Skip files that have already been successfully processed (only works with --batch-dir)

Note: --file and --batch-dir are mutually exclusive. Use one or the other.
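
For readers curious how these rules can be enforced, here is a small `argparse` sketch. It is an assumption about how such a CLI could be wired, not a copy of `agent.py`; in particular, rejecting `--resume` without `--batch-dir` is this sketch's choice, since the README only says the flag has no effect outside batch mode.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Transcribe and summarize sermons")
    # --file and --batch-dir cannot be combined; argparse enforces this.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--file", "-f", help="Path to a single sermon file")
    source.add_argument("--batch-dir", "-b", help="Directory of sermon files")
    parser.add_argument("--resume", action="store_true",
                        help="Skip files already processed (batch mode only)")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: treat --resume without --batch-dir as a usage error.
    if args.resume and not args.batch_dir:
        parser.error("--resume only works with --batch-dir")
    return args
```

With neither `--file` nor `--batch-dir`, both attributes are `None`, which is where the auto-detect behavior described above would take over.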

Checkpoint-Based Resumption

For large batches, use the --resume flag to enable checkpoint-based resumption:

# Resume an interrupted batch
python agent.py --batch-dir "G:\Thrive\Sermon Videos\Audio Files" --resume

Benefits:

  • ✅ Skip files that have already been successfully processed
  • ✅ Safely interrupt and resume large batch jobs
  • ✅ Automatically retry failed files while preserving successful ones
  • ✅ Perfect for processing hundreds of files across multiple sessions

Without --resume (default):

  • Clears the batch_outputs/ directory before starting
  • Useful for testing prompt changes or configuration adjustments

See CHECKPOINT_GUIDE.md for detailed usage and examples.
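
As a rough illustration of the resume rule, the sketch below decides which files a run should handle. It assumes the checkpoint is the `status` field recorded per file in `batch_summaries.json`; the real mechanism is described in CHECKPOINT_GUIDE.md and may differ. `files_to_process` is a hypothetical helper.

```python
def files_to_process(candidates, previous_results, resume):
    """Pick which files a batch run should handle.

    candidates: list of file names, e.g. "2025-10-05-Recording.mp3"
    previous_results: dict shaped like batch_summaries.json
    """
    if not resume:
        return list(candidates)  # a default run starts from scratch
    done = {
        entry["filename"]
        for entry in previous_results.values()
        if entry.get("status") == "success"
    }
    # Failed or never-seen files are retried.
    return [name for name in candidates if name not in done]
```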

Output

Single File Mode

The agent generates the following files in the current directory:

  • transcription.txt: Full transcription text
  • transcription_segments.json: Transcription with timestamps
  • summary.txt: Single-paragraph summary
  • summary.json: Summary with metadata and semantic tags

Batch Processing Mode

The agent generates:

  1. Individual file outputs in organized subdirectories:

    batch_outputs/
    ├── 2025-10-05-Recording/
    │   ├── transcription.txt
    │   ├── transcription_segments.json
    │   ├── summary.txt
    │   └── summary.json
    ├── 2025-10-12-Recording/
    │   ├── transcription.txt
    │   ├── transcription_segments.json
    │   ├── summary.txt
    │   └── summary.json
    └── ...
    
  2. Consolidated JSON output (batch_summaries.json):

    {
      "2025-10-05-Recording": {
        "filename": "2025-10-05-Recording.mp3",
        "summary": "This sermon explores the transformative power of faith...",
        "transcription_path": "C:\\...\\batch_outputs\\2025-10-05-Recording\\transcription.txt",
        "summary_path": "C:\\...\\batch_outputs\\2025-10-05-Recording\\summary.txt",
        "word_count": 158,
        "date_processed": "2025-10-09T10:30:00Z",
        "status": "success"
      },
      "2025-10-12-Recording": {
        "filename": "2025-10-12-Recording.mp3",
        "summary": "The message focuses on the importance of community...",
        "transcription_path": "C:\\...\\batch_outputs\\2025-10-12-Recording\\transcription.txt",
        "summary_path": "C:\\...\\batch_outputs\\2025-10-12-Recording\\summary.txt",
        "word_count": 142,
        "date_processed": "2025-10-09T10:35:00Z",
        "status": "success"
      }
    }

    If a file fails to process, the entry will include error information:

    {
      "problematic-file": {
        "filename": "problematic-file.mp3",
        "summary": null,
        "transcription_path": null,
        "summary_path": null,
        "word_count": 0,
        "date_processed": "2025-10-09T10:40:00Z",
        "status": "error",
        "error": "File not found or corrupted"
      }
    }
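
A short script can turn this file into a quick pass/fail report. The sketch below relies only on the `status` and `error` fields shown in the examples above; `split_results` and `report` are hypothetical helper names.

```python
import json


def split_results(batch_summaries):
    """Separate successful entries from failed ones."""
    ok, failed = {}, {}
    for key, entry in batch_summaries.items():
        (ok if entry.get("status") == "success" else failed)[key] = entry
    return ok, failed


def report(path="batch_summaries.json"):
    """Print a one-line total followed by each failure and its error message."""
    with open(path, encoding="utf-8") as fh:
        ok, failed = split_results(json.load(fh))
    print(f"{len(ok)} succeeded, {len(failed)} failed")
    for key, entry in failed.items():
        print(f"  {key}: {entry.get('error', 'unknown error')}")
```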

Architecture

The agent uses LangGraph with five main nodes:

  1. Transcribe Node: Converts audio to text using OpenAI Whisper
    • Automatically detects and uses GPU (CUDA) if available
    • Extracts audio from video files using FFmpeg
    • Generates timestamped segments
  2. Waveform Node: Generates audio waveform data using librosa
    • Calculates 480 RMS (Root Mean Square) amplitude values
    • Normalizes values to 0.15-1.0 range for consistent visualization
    • Saves waveform data to summary.json for API consumption
  3. Summarize Node: Generates a summary using GPT-4o-mini
    • Creates end-user-friendly single-paragraph summaries
    • Includes metadata (word count, processing date, etc.)
  4. Tagging Node: Applies semantic tags to summaries
    • Analyzes summary content using GPT-4o-mini
    • Selects up to 5 relevant tags from a predefined list (config/tags_config.py)
    • Tags are cached in memory for efficient batch processing
    • Updates summary.json with a tags array
  5. Clip Generation Node (optional): Creates summary video clips
    • Uses GPT-4o-mini to analyze transcript and select key moments
    • Optimizes segments with context padding and gap merging
    • Generates MP4 clips using FFmpeg with GPU acceleration support
    • Only runs when ENABLE_CLIP_GENERATION=true in .env
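
To make the node order concrete, here is a plain-Python stand-in for the graph. It deliberately does not use the LangGraph API; it only mirrors the pattern where each node receives the shared state and returns a partial update, with the clip node added conditionally. All node bodies are placeholders and every function name is hypothetical.

```python
import os
from pathlib import Path


def transcribe(state):
    # Placeholder for Whisper transcription.
    return {"transcript": f"transcript of {state['file']}"}


def waveform(state):
    # Placeholder for the 480-value RMS waveform.
    return {"waveform_data": [0.15] * 480}


def summarize(state):
    # Placeholder for the GPT-4o-mini summary call.
    return {"summary": state["transcript"][:40]}


def tag(state):
    # Placeholder for tag selection.
    return {"tags": []}


def generate_clip(state):
    # Placeholder for clip generation; the file name pattern follows this README.
    return {"clip_path": Path(state["file"]).stem + "_Summary.mp4"}


def build_pipeline(enable_clips):
    nodes = [transcribe, waveform, summarize, tag]
    if enable_clips:
        nodes.append(generate_clip)  # optional fifth node
    return nodes


def run(file, enable_clips=None):
    if enable_clips is None:
        enable_clips = os.getenv("ENABLE_CLIP_GENERATION", "false").lower() == "true"
    state = {"file": file}
    for node in build_pipeline(enable_clips):
        state.update(node(state))  # each node returns a partial state update
    return state
```

Keeping each node a pure function of the state makes it easy to test nodes in isolation, which is the main benefit the graph-based design is aiming for.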

Semantic Tagging

The agent automatically applies relevant topical tags to each sermon summary for better organization and discovery.

How It Works

  1. Tag Source: Tags are defined in config/tags_config.py, which contains 102+ predefined sermon topics
  2. Hybrid Analysis: GPT-4o-mini analyzes BOTH the summary (main themes) and transcript excerpt (comprehensive context) to determine relevant themes
  3. Selection: The AI selects up to 5 most relevant tags from the available list
  4. Storage: Tags are added to summary.json as a tags array field
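
Because the model is asked to choose from a fixed list, its answer is usually checked against that list before being stored. The following sketch shows one way to do that; it is an assumption about the validation step, and `select_valid_tags` is a hypothetical helper rather than a function from this repository.

```python
def select_valid_tags(model_tags, allowed_tags, limit=5):
    """Keep only known tags, preserve model order, drop duplicates, cap at `limit`."""
    canonical = {tag.lower(): tag for tag in allowed_tags}
    selected = []
    for raw in model_tags:
        tag = canonical.get(raw.strip().lower())
        if tag and tag not in selected:
            selected.append(tag)
        if len(selected) == limit:
            break
    return selected
```

Matching case-insensitively and returning the canonical spelling keeps `summary.json` consistent even when the model answers with `faith` instead of `Faith`.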

Available Tag Categories

Tags cover a wide range of sermon topics including:

  • Relationships & Family: Marriage, Family, Friendship, Singleness
  • Theological Foundations: Salvation, Faith, Trinity, Church, Holy Spirit
  • Spiritual Disciplines: Prayer, Worship, Fasting, Bible Study
  • Personal Growth: Hope, Love, Joy, Peace, Courage, Wisdom
  • Life Challenges: Suffering, Anxiety, Doubt, Addiction, Grief
  • Biblical Studies: Parables, Sermon on the Mount, specific book studies
  • Seasonal: Advent, Christmas, Easter, Lent
  • And many more...

Example Output

{
  "summary": "This sermon explores the transformative power of faith...",
  "word_count": 120,
  "character_count": 750,
  "model": "gpt-4o-mini",
  "transcription_length": 28500,
  "tags": ["Faith", "Salvation", "Hope", "Discipleship"]
}

Extending Tags

To add new tags:

  1. Edit config/tags_config.py
  2. Add your new tag to the appropriate category list (or create a new category)
  3. The tag will be automatically included in ALL_TAGS and available on the next run

Waveform Generation

The agent automatically generates pre-computed audio waveform data for mobile app visualization, eliminating the need for resource-intensive client-side processing.

How It Works

  1. Audio Analysis: After transcription, the waveform node uses librosa to analyze the extracted audio file
  2. RMS Calculation: Divides audio into 480 equal segments and calculates RMS (Root Mean Square) amplitude for each
  3. Normalization: Values are normalized to 0.15-1.0 range for consistent visualization across different audio files
  4. Storage: Waveform data is saved to summary.json as a waveform_data array field
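
The steps above can be sketched without any audio library. The agent itself uses librosa; this version uses only the standard library, assumes the audio has already been decoded into a list of float samples, and assumes min-max scaling into the 0.15-1.0 range, since the README does not state the exact normalization formula.

```python
import math


def rms_waveform(samples, points=480, floor=0.15):
    """Return `points` RMS values scaled into [floor, 1.0]."""
    if not samples:
        return [floor] * points
    n = len(samples)
    values = []
    for i in range(points):
        # Integer boundaries split the signal into near-equal segments.
        start = i * n // points
        end = max((i + 1) * n // points, start + 1)
        chunk = samples[start:end]
        values.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    lo, hi = min(values), max(values)
    if hi == lo:
        return [floor] * points  # flat signal: nothing to scale
    # Assumed min-max normalization into [floor, 1.0].
    return [floor + (v - lo) / (hi - lo) * (1.0 - floor) for v in values]
```

A non-zero floor means even the quietest segment still draws a visible bar.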

Why RMS (Root Mean Square)?

RMS provides a better representation of perceived loudness than peak amplitude values:

  • More stable and less sensitive to transient spikes
  • Better represents the energy content of audio segments

Output Format

The waveform data is an array of 480 floating-point values between 0.15 and 1.0:

{
  "summary": "This sermon explores...",
  "tags": ["Faith", "Hope"],
  "waveform_data": [
    0.42, 0.38, 0.45, 0.52, 0.48, 0.55, 0.61, 0.58, 0.64, 0.71,
    0.68, 0.75, 0.82, 0.78, 0.85, 0.91, 0.88, 0.95, 0.89, 0.83,
    ...
  ]
}
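
On the consuming side, a client typically reduces the 480 values to however many bars fit the screen. The sketch below shows one simple approach, bucket averaging, written in Python for consistency with the rest of this README; the mobile app's actual downsampling logic is not documented here, and `downsample` is a hypothetical name.

```python
def downsample(waveform, bars):
    """Average consecutive values so `bars` columns cover the whole waveform."""
    if bars <= 0:
        raise ValueError("bars must be positive")
    n = len(waveform)
    if bars >= n:
        return list(waveform)  # nothing to reduce
    out = []
    for i in range(bars):
        start = i * n // bars
        end = (i + 1) * n // bars
        bucket = waveform[start:end]
        out.append(sum(bucket) / len(bucket))
    return out
```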

Benefits

  • Performance: Pre-computed server-side, eliminating mobile device CPU load
  • Consistency: Normalized values ensure consistent visualization across all audio files
  • Simplicity: Ready-to-use JSON format for direct integration with mobile apps
  • Fast: Adds only ~5-10 seconds to processing time for typical sermon lengths

Dependencies

Waveform generation requires the librosa library, which is included in requirements.txt:

pip install "librosa>=0.10.0"

librosa offers many advanced audio-analysis features, but this project uses it only for RMS amplitude calculations.

Tagging Performance

  • Tags are loaded once and cached in memory during batch processing
  • Hybrid analysis uses ~15000 characters of transcript + full summary
  • Each sermon typically takes 3-4 seconds for tag classification
  • Cost: ~$0.0008 per sermon (very affordable with GPT-4o-mini)

Video Clip Generation

The agent can automatically generate short summary videos (under 10 minutes) from full sermon recordings. It uses AI to select and stitch together the most important moments of the sermon, which is useful for social media posts and other short-form content.

How It Works

  1. AI-Powered Selection: GPT-4o-mini analyzes the sermon transcript to identify key moments and themes
  2. Intelligent Segmentation: Selects 30-60 second segments that capture the most important content
  3. Smart Optimization: Merges nearby segments, adds context padding, and ensures chronological order
  4. Video Processing: Uses FFmpeg to extract and concatenate selected segments with optional fade transitions
  5. GPU Acceleration: Supports NVIDIA CUDA (h264_nvenc) for faster video encoding with automatic CPU fallback

Enabling Video Clip Generation

Video clip generation is disabled by default and must be explicitly enabled in your .env file:

# Enable video clip generation
ENABLE_CLIP_GENERATION=true

Configuration Options

Add these settings to your .env file to customize clip generation behavior:

# Video Clip Generation Settings
ENABLE_CLIP_GENERATION=false          # Set to 'true' to enable (default: false)
MAX_CLIP_DURATION=600                 # Maximum clip length in seconds (default: 600 = 10 minutes)
MIN_SEGMENT_LENGTH=30                 # Minimum segment length in seconds (default: 30)
CONTEXT_PADDING=5                     # Seconds to add before/after each segment (default: 5)
MERGE_GAP_THRESHOLD=15                # Merge segments if gap is less than this (default: 15 seconds)
ENABLE_FADE_TRANSITIONS=true          # Add fade transitions between segments (default: true)
FADE_DURATION=0.5                     # Fade transition duration in seconds (default: 0.5)
CLIP_OUTPUT_DIR=                      # Optional: Custom output directory (default: temp directory)
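
For reference, this is roughly how such settings can be read in Python with the documented defaults. It is a sketch under assumptions: `ClipSettings` and `load_clip_settings` are hypothetical names, and the agent's own parsing (for example its handling of inline comments in `.env`) may differ.

```python
import os
from dataclasses import dataclass


def _env_bool(name, default, env):
    return env.get(name, str(default)).strip().lower() == "true"


@dataclass(frozen=True)
class ClipSettings:
    enabled: bool
    max_clip_duration: int
    min_segment_length: int
    context_padding: int
    merge_gap_threshold: int
    fade_transitions: bool
    fade_duration: float
    output_dir: str  # empty string means "use a temp directory"


def load_clip_settings(env=None):
    """Read clip settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return ClipSettings(
        enabled=_env_bool("ENABLE_CLIP_GENERATION", False, env),
        max_clip_duration=int(env.get("MAX_CLIP_DURATION", "600")),
        min_segment_length=int(env.get("MIN_SEGMENT_LENGTH", "30")),
        context_padding=int(env.get("CONTEXT_PADDING", "5")),
        merge_gap_threshold=int(env.get("MERGE_GAP_THRESHOLD", "15")),
        fade_transitions=_env_bool("ENABLE_FADE_TRANSITIONS", True, env),
        fade_duration=float(env.get("FADE_DURATION", "0.5")),
        output_dir=env.get("CLIP_OUTPUT_DIR", ""),
    )
```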

GPU-Accelerated Video Encoding

For significantly faster video processing, you can enable GPU-accelerated encoding using NVIDIA CUDA:

Performance Comparison:

  • CPU (libx264): Standard encoding, works everywhere
  • GPU (h264_nvenc): 3-5x faster encoding on NVIDIA GPUs (RTX/GTX series)

Setup Instructions:

See the detailed GPU Encoding Setup Guide for step-by-step instructions on:

  • Installing NVIDIA drivers and CUDA toolkit
  • Building FFmpeg with CUDA support
  • Verifying GPU encoding is working
  • Troubleshooting common issues

Automatic Fallback: If GPU encoding is not available or fails, the agent automatically falls back to CPU encoding (libx264) to ensure your clips are always generated successfully.
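
The fallback behavior can be pictured as "try NVENC, then retry with libx264". The sketch below is a simplified assumption of that flow, not the agent's code: it re-encodes a whole file rather than cutting segments, and `build_encode_cmd` and `encode_with_fallback` are hypothetical names. The `-c:v h264_nvenc` and `-c:v libx264` encoder options are standard FFmpeg flags.

```python
import subprocess


def build_encode_cmd(src, dst, encoder):
    """Build an FFmpeg command that re-encodes video with the given encoder."""
    return ["ffmpeg", "-y", "-i", src, "-c:v", encoder, "-c:a", "aac", dst]


def encode_with_fallback(src, dst, run=subprocess.run):
    """Try NVENC first; fall back to libx264 if FFmpeg exits non-zero."""
    for encoder in ("h264_nvenc", "libx264"):
        result = run(build_encode_cmd(src, dst, encoder), capture_output=True)
        if result.returncode == 0:
            return encoder  # report which encoder actually produced the file
    raise RuntimeError("FFmpeg failed with both h264_nvenc and libx264")
```

Passing the runner as a parameter keeps the function testable without FFmpeg installed.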

Output Files

When clip generation is enabled, the agent produces:

  • {filename}_Summary.mp4: The generated summary video clip
  • {filename}_Summary_metadata.json: Detailed metadata including:
    • Selected segments with timestamps and importance scores
    • Processing statistics (AI selection time, encoding time, etc.)
    • GPU encoding information
    • Token usage for AI segment selection
    • Compression ratio and file size reduction

Example Output

{
  "version": "2.0",
  "generated_at": "2025-11-01 14:30:00",
  "original_video": {
    "duration_seconds": 2400,
    "file_size_mb": 450.5
  },
  "output_video": {
    "duration_seconds": 540,
    "file_size_mb": 95.2,
    "size_reduction_percent": 78.9
  },
  "summary": {
    "total_segments": 12,
    "total_duration_seconds": 540,
    "average_importance_score": 8.5,
    "compression_ratio": 4.44
  },
  "gpu_info": {
    "encoding_method": "GPU (h264_nvenc)",
    "gpu_available": true
  }
}
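
The two derived figures in this example can be reproduced from the other fields. The formulas below are inferred from the sample numbers (the ratio matches duration, not file size) rather than taken from the source code, and `clip_metrics` is a hypothetical helper.

```python
def clip_metrics(original_mb, output_mb, original_seconds, output_seconds):
    """Recompute the derived fields of the metadata example."""
    return {
        # Share of the original file size that was removed.
        "size_reduction_percent": round((1 - output_mb / original_mb) * 100, 1),
        # Inferred: original duration divided by clip duration.
        "compression_ratio": round(original_seconds / output_seconds, 2),
    }


print(clip_metrics(450.5, 95.2, 2400, 540))
# {'size_reduction_percent': 78.9, 'compression_ratio': 4.44}
```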

Benefits

  • Time-Saving: Automatically creates shareable highlight reels from full sermons
  • AI-Powered: Intelligently selects the most impactful moments
  • Flexible: Highly configurable to match your needs
  • Fast: GPU acceleration makes processing quick even for long sermons
  • Professional: Smooth fade transitions and optimized encoding
  • Cost-Effective: Uses GPT-4o-mini for affordable AI segment selection (~$0.001-0.003 per sermon)

Token Usage

Video clip generation adds minimal token usage:

  • Input tokens: Transcript analysis (~5,000-15,000 tokens depending on sermon length)
  • Output tokens: Segment selection data (~500-1,500 tokens)
  • Cost: Approximately $0.001-0.003 per sermon with GPT-4o-mini

Token usage is tracked separately and included in the total cost breakdown.
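
As a sanity check on the cost range, the arithmetic can be written out. The prices below are an assumption (GPT-4o-mini list prices of $0.15 per 1M input tokens and $0.60 per 1M output tokens); check current OpenAI pricing before relying on them.

```python
# Assumed GPT-4o-mini list prices in USD per 1M tokens; verify before use.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def estimate_cost(input_tokens, output_tokens):
    """Estimated USD cost of one segment-selection call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


low = estimate_cost(5_000, 500)      # short sermon
high = estimate_cost(15_000, 1_500)  # long sermon
print(f"${low:.5f} to ${high:.5f} per sermon")
```

Under these assumed prices the result lands at about $0.001 to $0.003, in line with the range quoted above.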

Configuration

Environment variables (in .env):

Core Settings

  • OPENAI_API_KEY: Your OpenAI API key (required)
  • WHISPER_MODEL: Whisper model to use (default: small.en)
    • Options: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large
    • Larger models are more accurate but slower (GPU highly recommended for medium/large)
  • WHISPER_FORCE_CPU: Force CPU mode even if GPU is available (default: false)
  • SERMON_AUDIO_DIR: Directory to search for sermon files (optional)

Video Clip Generation Settings

  • ENABLE_CLIP_GENERATION: Enable automatic video clip generation (default: false)
  • MAX_CLIP_DURATION: Maximum clip length in seconds (default: 600)
  • MIN_SEGMENT_LENGTH: Minimum segment length in seconds (default: 30)
  • CONTEXT_PADDING: Seconds to add before/after each segment (default: 5)
  • MERGE_GAP_THRESHOLD: Merge segments if gap is less than this in seconds (default: 15)
  • ENABLE_FADE_TRANSITIONS: Add fade transitions between segments (default: true)
  • FADE_DURATION: Fade transition duration in seconds (default: 0.5)
  • CLIP_OUTPUT_DIR: Custom output directory for clips (optional, defaults to temp directory)

See the Video Clip Generation section for detailed information.

GPU Acceleration ⚡

The agent automatically detects and uses NVIDIA GPUs (CUDA) for Whisper transcription, providing significant speed improvements.

Performance Comparison:

  • CPU: ~15 minutes for a typical sermon
  • GPU (RTX 4080 SUPER): ~4 minutes for the same sermon
  • Speedup: ~4x faster with GPU acceleration

GPU Detection:

  • If you have an NVIDIA GPU with CUDA support (RTX series, GTX series, etc.), it will be automatically detected and used
  • The agent will display GPU information at startup:
    🚀 GPU detected: NVIDIA GeForce RTX 4080 SUPER
       Number of GPUs available: 1
       Using device: cuda with fp16 precision
       This will be MUCH faster than CPU!
    
  • fp16 precision is automatically enabled on GPU for faster inference

Installing CUDA-Enabled PyTorch:

Recommended: Use the automated setup script (see Installation section):

# Windows
setup_venv_gpu.bat

# macOS/Linux
./setup_venv_gpu.sh

Manual Installation:

If you already have a venv and need to add GPU support:

  1. Uninstall CPU-only PyTorch:

    pip uninstall torch torchvision torchaudio
  2. Install CUDA-enabled PyTorch (for CUDA 11.8):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    For other CUDA versions, visit: https://pytorch.org/get-started/locally/

  3. Verify GPU is detected:

    python test_gpu.py

    You should see:

    ✓ CUDA available: True
    ✓ GPU 0: NVIDIA GeForce RTX [Your GPU Model]
    🚀 GPU is ready to use! Whisper will run MUCH faster.
    

Important Note about librosa: When installing librosa (required for waveform generation), pip may replace CUDA-enabled PyTorch with the CPU version. The automated setup script handles this by installing PyTorch with CUDA first, then librosa. If you install packages manually and lose GPU support, simply reinstall CUDA-enabled PyTorch:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

To force CPU mode: Set WHISPER_FORCE_CPU=true in your .env file (useful for testing or if you encounter GPU memory issues)
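
The detection and override logic described in this section can be sketched as follows. This is an illustrative assumption, not the agent's code: `pick_device` is a hypothetical helper, and PyTorch is imported lazily so the snippet still runs on machines without it.

```python
import os


def pick_device(force_cpu=None):
    """Return (device, use_fp16) for Whisper."""
    if force_cpu is None:
        force_cpu = os.getenv("WHISPER_FORCE_CPU", "false").lower() == "true"
    if force_cpu:
        return "cpu", False
    try:
        import torch  # optional here so the sketch runs without PyTorch
    except ImportError:
        return "cpu", False
    if torch.cuda.is_available():
        print(f"GPU detected: {torch.cuda.get_device_name(0)}")
        return "cuda", True  # fp16 is only worthwhile on GPU
    return "cpu", False
```

With the openai-whisper package, the result could then be passed along as `whisper.load_model(name, device=device)` and `model.transcribe(path, fp16=use_fp16)`.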

Batch Processing Performance

Batch processing leverages GPU acceleration for optimal performance:

  • With GPU (RTX 4080 SUPER):

    • ~4 minutes per sermon (30-45 minute audio)
    • Can process 15 sermons per hour
    • Recommended for large batches
  • With CPU only:

    • ~15 minutes per sermon
    • Can process 4 sermons per hour
    • Still functional but significantly slower

Example batch processing time:

  • 10 sermons with GPU: ~40 minutes
  • 10 sermons with CPU: ~2.5 hours

Troubleshooting

Batch Processing Issues

Problem: Files are being skipped or not found

  • Solution: Ensure all files have supported extensions (.mp3, .mp4, .wav, .m4a, .mov)
  • Check that the directory path is correct and accessible

Problem: Some files fail to process

  • Solution: Check the batch_summaries.json for error details
  • Individual file failures won't stop the batch - other files will continue processing
  • Common issues: corrupted files, unsupported codecs, insufficient disk space

Problem: GPU out of memory during batch processing

  • Solution:
    • Use a smaller Whisper model (e.g., small.en instead of medium or large)
    • Set WHISPER_FORCE_CPU=true to use CPU mode
    • Process files individually instead of in batch mode

Full-Stack Web Application

The repository also includes a complete web application for interacting with the sermon summarization agent.

Architecture

  • Backend: C# .NET 9 Web API (/API)
  • Frontend: React + Vite with TypeScript (/UI)
  • Python Agent: Sermon processing engine with token tracking

Getting Started with the Web App

Prerequisites

  • .NET 9 SDK
  • Node.js 18+
  • Python 3.8+ with dependencies installed
  • OpenAI API key

Running the Backend API

  1. Navigate to the API directory:

    cd API
  2. Build the project:

    dotnet build
  3. Run the API:

    dotnet run

    The API will start on https://localhost:5001 (or http://localhost:5000 in development)

Running the Frontend UI

  1. Navigate to the UI directory:

    cd UI/SermonSummarizationUI
  2. Install dependencies (if not already done):

    npm install
  3. Create a .env file with the API URL:

    cp .env.example .env
    # Edit .env and set VITE_API_URL=http://localhost:5000/api
  4. Start the development server:

    npm run dev

    The UI will be available at http://localhost:5173

  5. To build for production:

    npm run build

Features

  • File Upload: Drag-and-drop or click to upload audio/video files (MP3, MP4, WAV, M4A, MOV)
  • Real-time Processing: Watch as your sermon is transcribed, summarized, and tagged
  • Skeleton Loaders: Modern loading states with skeleton screens instead of spinners
  • Token Tracking: See exactly how many tokens were used for each operation
  • Beautiful UI: Clean, modern interface with responsive design
  • Error Handling: Comprehensive error messages and recovery

API Endpoints

  • POST /api/sermons/process - Upload and process a sermon file
  • GET /api/sermons/{id}/status - Check processing status
  • GET /api/sermons/health - Health check endpoint
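
A client would typically upload a file and then poll the status endpoint. The sketch below covers only the polling half and rests on assumptions: the response schema is not documented in this README, so the `status` field and its `completed`/`failed` values are hypothetical, as are the helper names. The base URL matches the `VITE_API_URL` example above.

```python
import json
import time
import urllib.request


def get_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_sermon(base_url, sermon_id, fetch=get_json, interval=5.0, attempts=120):
    """Poll the status endpoint until processing finishes."""
    url = f"{base_url}/sermons/{sermon_id}/status"
    for _ in range(attempts):
        body = fetch(url)
        # ASSUMPTION: the body has a "status" field with these terminal values.
        if body.get("status") in ("completed", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"sermon {sermon_id} still processing")


# Example (requires the API to be running):
# result = wait_for_sermon("http://localhost:5000/api", "some-id")
```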

Token Usage

The application tracks token usage across all operations:

  • Transcription tokens: Tokens used for audio transcription
  • Summarization tokens: Tokens used for generating the summary
  • Tagging tokens: Tokens used for semantic tag classification
  • Clip generation tokens: Tokens used for AI-powered segment selection (when enabled)
  • Total tokens: Sum of all tokens used

Token counts and associated costs are displayed in the UI after processing completes.

License

See LICENSE file for details.
