Skip to content

timothywarner-org/pptx-shredder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

40 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

PPTX Shredder 🎯

One tool. One purpose. Rock solid.

Transform PowerPoint presentations into LLM-optimized markdown. Built for technical trainers who need dead-simple reliability.

Does one thing and does it very, very well ✨

⚑ Quick Start

Step 1: Get Running (30 seconds)

# Clone and setup
git clone https://github.com/timothywarner-org/pptx-shredder.git
cd pptx-shredder
pip install -r requirements.txt

# Drop your PPTX files in input/ folder, then:
python shred.py

# βœ… Done! Your markdown is in output/

Step 2: Use from Claude Desktop (optional)

# Global access (works from any directory)
claude mcp add pptx-shredder npx -y @timothywarner/pptx-shredder-mcp

# Now use in Claude Desktop from any project:
# "Use shred_pptx to process my presentation.pptx"

🧠 How It Works (Now with AI!)

NEW: Intelligent Extraction with DeepSeek LLM πŸ€–

flowchart TD
    A[πŸ“ Drop PPTX in input/] --> B[πŸš€ Run python shred.py]
    B --> C[πŸ” PowerPoint Object Model]
    C --> D[πŸ“ Structured Content Extract]
    D --> E[🧠 DeepSeek LLM Analysis]
    E --> F[🎯 Learning Objectives Detection]
    E --> G[πŸ“š Module Boundary Recognition]
    E --> H[🏷️ Activity Type Classification]
    E --> I[⏱️ Time Estimation]
    F --> J[🧩 Intelligent Chunking]
    G --> J
    H --> J
    I --> J
    J --> K[πŸ“‹ Rich YAML Metadata]
    K --> L[πŸ“„ High-Quality Markdown]
    L --> M[πŸ“‚ Save to output/]
    
    N[πŸ–₯️ Claude Desktop<br/>Any Directory] --> O[πŸ“¦ npx @timothywarner/<br/>pptx-shredder-mcp]
    O --> P[πŸ”Œ MCP Server]
    P --> B
    
    style A fill:#e1f5fe
    style M fill:#e8f5e8
    style N fill:#fff3e0
    style O fill:#e8f0ff
    style E fill:#ff9800
    style J fill:#f3e5f5
Loading

πŸ“ Project Structure

pptx-shredder/                 🏠 Main project directory
β”œβ”€β”€ πŸš€ shred.py               ← Entry point (run this!)
β”œβ”€β”€ πŸ“‹ requirements.txt       ← Python dependencies  
β”œβ”€β”€ πŸ“¦ package.json           ← npm package for global MCP access
β”œβ”€β”€ βš™οΈ config.yaml           ← Settings (optional)
β”œβ”€β”€ πŸ”Œ mcp_server.py         ← MCP server (Python)
β”œβ”€β”€ πŸ“„ .mcp.json             ← MCP configuration (local + global)
β”‚
β”œβ”€β”€ πŸ“‚ bin/                   🌍 Global npm package entry
β”‚   └── mcp-server.js        ← Node.js wrapper for global access
β”‚
β”œβ”€β”€ πŸ“‚ src/                   🧠 Core application logic
β”‚   β”œβ”€β”€ πŸ” extractor.py      ← Legacy PPTX extraction (regex-based)
β”‚   β”œβ”€β”€ πŸ€– intelligent_extractor.py ← NEW: AI-powered extraction
β”‚   β”œβ”€β”€ ✨ formatter.py      ← Legacy markdown formatting  
β”‚   β”œβ”€β”€ 🧠 intelligent_formatter.py ← NEW: AI-optimized formatting
β”‚   β”œβ”€β”€ πŸŽ›οΈ shred.py          ← CLI interface with Rich UI
β”‚   └── πŸ› οΈ utils.py          ← Helpers & token counting
β”‚
β”œβ”€β”€ πŸ“‚ input/                 πŸ“₯ Drop your PPTX files here
β”‚   └── πŸ“– README.md         ← Usage instructions
β”‚
β”œβ”€β”€ πŸ“‚ output/                πŸ“€ Generated markdown appears here
β”‚   └── πŸ“– README.md         ← What gets created
β”‚
β”œβ”€β”€ πŸ§ͺ tests/                 πŸ”¬ 64 comprehensive tests
β”‚   β”œβ”€β”€ test_extractor.py    ← Content extraction tests
β”‚   β”œβ”€β”€ test_formatter.py    ← Markdown generation tests
β”‚   └── test_integration.py  ← End-to-end workflow tests
β”‚
β”œβ”€β”€ 🐳 .devcontainer/        πŸ“¦ VS Code dev environment
β”œβ”€β”€ πŸ€– .github/              βš™οΈ CI/CD & automation
β”‚   β”œβ”€β”€ workflows/           ← GitHub Actions
β”‚   └── dependabot.yml      ← Dependency updates
β”‚
└── πŸ“š docs/                  πŸ“– Documentation
    └── PRD.md               ← Product requirements

🎯 What It Does (The Magic)

Single Purpose: Convert PPTX β†’ LLM-ready markdown
Rock Solid: 64 tests, 95%+ coverage, enterprise CI/CD
Dead Simple: Drop files, run command, collect results

Core Intelligence

  • 🧠 Pattern Recognition: Detects modules, labs, exercises, learning objectives
  • πŸ“š Context Preservation: Maintains instructional flow and narrative
  • πŸ€– LLM Optimization: Token-counted chunks (1500-2000) with smart overlap
  • πŸ’» Code Detection: Identifies and formats code in 15+ languages
  • πŸ“‹ Rich Metadata: YAML frontmatter with semantic context

Enterprise Training Intelligence

  • πŸŽ“ Pedagogical Awareness: Categorizes instructor notes by intent (timing, emphasis, tips, warnings)
  • πŸ“Š Difficulty Assessment: Automatic difficulty level detection (beginner/intermediate/advanced)
  • ⏱️ Time Estimation: Activity-based duration calculation with multipliers
  • πŸ” Prerequisites Detection: Extracts required knowledge from content and notes
  • πŸ“ˆ Learning Analytics: Cognitive load, interaction level, and learning mode analysis
  • πŸ›‘οΈ Compliance Ready: Detects regulatory markers (GDPR, HIPAA, SOX, ISO, NIST, PCI)
  • 🎯 Assessment Extraction: Identifies quiz questions and knowledge checks
  • πŸ–ΌοΈ Visual Context: Describes images, tables, charts, and layout semantics

πŸ“– Usage Guide

Basic Commands

# Production mode - scan input/ folder
python shred.py

# Process specific files  
python shred.py presentation.pptx course.pptx

# Preview mode (no files created)
python shred.py --dry-run

# Show help
python shred.py --help

Advanced Options

# Custom chunking strategy
python shred.py --strategy sequential --chunk-size 2000

# Verbose output with detailed logging
python shred.py --verbose

# Custom output directory
python shred.py --output-dir ./my-markdown

# Force overwrite existing files
python shred.py --force

Processing Strategies

  • instructional (default): Smart chunking that preserves learning modules
  • sequential: Simple slide-by-slide processing
  • single: One file per presentation

🎯 What It Does

Core Features

PPTX Shredder intelligently:

  • Extracts Everything: Text, speaker notes, slide structure, code blocks
  • Recognizes Patterns: Modules, labs, exercises, learning objectives
  • Optimizes for LLMs: Token-counted chunks (1500-2000 tokens) with overlap
  • Preserves Context: Instructional narrative and relationships
  • Rich Metadata: YAML frontmatter with learning context
  • Code Detection: Identifies and formats code in 15+ languages
  • Beautiful UI: Progress bars, tables, and colored output

Instructional Design Awareness

  • Detects module boundaries and learning objectives
  • Preserves lab instructions and exercise context
  • Maintains teaching flow and narrative structure
  • Groups related content intelligently

πŸ“„ Output Example

---
module_id: 01-azure-storage-fundamentals
module_title: Azure Storage Fundamentals
slide_range: [1, 8]
chunk_index: 1
total_chunks: 3
learning_objectives:
  - Configure blob storage with appropriate security settings
  - Implement lifecycle management policies for cost optimization
  - Apply compliance requirements for enterprise data governance
prerequisites:
  - Basic understanding of cloud computing concepts
  - Familiarity with Azure portal navigation
concepts: ["Azure", "Storage", "Security", "Compliance", "GDPR"]
difficulty_level: intermediate
estimated_duration: 25 minutes
learning_context:
  primary_learning_mode: experiential
  cognitive_load: medium
  interaction_level: high
activity_type: hands-on-lab
compliance_markers: ["GDPR", "SECURITY"]
instructor_guidance_categories: ["timing", "emphasis", "examples", "tips", "warnings"]
---

# Azure Storage Fundamentals

*This is part 1 of 3 in the Azure Storage Fundamentals module series.*

**πŸ”’ Compliance Notice:** This content relates to GDPR, SECURITY requirements.

## πŸ“‹ Prerequisites
Before starting this module, you should have:
- Basic understanding of cloud computing concepts
- Familiarity with Azure portal navigation

## 🎯 Learning Objectives
By the end of this module, you will be able to:
- Configure blob storage with appropriate security settings
- Implement lifecycle management policies for cost optimization
- Apply compliance requirements for enterprise data governance

## πŸ“š Content

### πŸ§ͺ Storage Account Configuration
**Objective**: Create and configure a storage account with enterprise security

#### πŸ’» Lab Code:
```powershell
# Create storage account with security features
$storageAccount = New-AzStorageAccount `
  -ResourceGroupName "rg-storage-lab" `
  -Name "stentsec$((Get-Random))" `
  -AllowBlobPublicAccess $false `
  -EnableHttpsTrafficOnly $true `
  -MinimumTlsVersion "TLS1_2"

🧠 Knowledge Check:

Q: What is the minimum TLS version required for enterprise security compliance?

πŸ‘¨β€πŸ« Instructor Guidance:

⏱️ Timing: Allow 8 minutes for storage account creation ⚠️ Emphasis: Critical to stress importance of disabling public blob access πŸ’‘ Examples: Show real-world scenario where public access led to data breach πŸ”§ Tips: Use naming conventions that include environment and purpose


## πŸ”§ Status: Production Ready

| Aspect | Status | Details |
|--------|--------|---------|
| **🎯 Core Function** | βœ… Complete | PPTX β†’ Markdown conversion working perfectly |
| **πŸ§ͺ Testing** | βœ… 64 tests, 95%+ coverage | Unit, integration, cross-platform tests |
| **πŸš€ CI/CD** | βœ… Enterprise grade | GitHub Actions, Dependabot, auto-review |
| **πŸ“Š UI** | βœ… Rich console | Progress bars, tables, colored output |
| **πŸ”’ Security** | βœ… Local only | Zero network calls, NDA-friendly |
| **🌍 Global Access** | βœ… npm package | Works from any directory via npx |
| **πŸ“ Content Quality** | βœ… Automated linting | Markdown formatting and URL validation |
| **⚑ Platform** | βœ… Cross-platform | Windows, macOS, Linux support |
| **🐳 DevOps** | βœ… Full automation | Dev containers, automated dependencies |

## 🎯 Rock Solid Philosophy

**Single Responsibility**: We do ONE thing - convert PPTX to LLM-ready markdown  
**Zero Surprises**: Predictable, reliable behavior every time  
**Maximum Clarity**: Simple workflow, clear output, obvious structure  
**Bullet Proof**: Comprehensive testing prevents regressions  
**Privacy First**: All processing local, no external dependencies

## 🎬 Try It Now

### Quick Demo
```bash
# Run the interactive demo
python demo.py

# Or try with sample presentations
cp samples/*.pptx input/
python shred.py

Real-World Example

# Process a technical training deck
python shred.py "Azure Fundamentals Course.pptx"

# Output includes:
# - Module detection and grouping
# - Lab instructions preserved
# - Code blocks properly formatted
# - Learning objectives extracted
# - Smart chunking for LLM context windows

πŸ§ͺ Development

Testing

# Run all tests with verbose output
PYTHONPATH=src python -m pytest tests/ -v

# Run with coverage report
PYTHONPATH=src python -m pytest tests/ --cov=src --cov-report=html

# Run specific test category
PYTHONPATH=src python -m pytest tests/test_extractor.py -v

# Quick test run
make test

Code Quality

# Format code
black src/ tests/

# Type checking
mypy src/

# Lint code
ruff check src/

# Run all checks
make check

Development Workflow

# Install dev dependencies
pip install -r requirements-dev.txt

# Run in watch mode
make watch

# Build and test
make all

🎯 Recent Improvements (v0.2.0)

πŸ€– Intelligent Extraction System

  • Replaced regex-based extraction with PowerPoint object model + DeepSeek LLM
  • Learning objectives detection now uses semantic understanding instead of pattern matching
  • Module boundary recognition identifies instructional structure automatically
  • Activity type classification (lecture, demo, lab, assessment, etc.)
  • Time estimation based on content complexity and activity type
  • Prerequisites extraction from both content and speaker notes

🧠 AI-Powered Analysis

  • Uses DeepSeek API for instructional design inference
  • Structured content extraction via PPTX object model
  • Intelligent chunking based on pedagogical flow
  • Rich YAML frontmatter with 20+ metadata fields

πŸ“Š Quality Improvements

  • Fixed malformed docstrings and syntax errors
  • Enhanced error handling and robust slide processing
  • Proper import resolution for modular architecture
  • Cross-platform compatibility maintained

πŸ“‹ Outstanding TODOs

Performance Optimization

  • Batch DeepSeek API calls - Currently 1 call per slide (slow for large presentations)
  • Implement caching - Cache LLM responses for similar slide patterns
  • Parallel processing - Process multiple slides concurrently
  • Fallback modes - Graceful degradation when API unavailable

Feature Enhancements

  • Multi-language support - Detect and handle non-English content
  • Custom LLM providers - Support OpenAI, Anthropic, local models
  • Export formats - Add JSON, HTML, and SCORM output options
  • Template system - Customizable markdown templates for different use cases

Enterprise Features

  • Batch directory processing - Process entire folder hierarchies
  • Git integration - Track changes across presentation versions
  • Compliance tracking - Enhanced detection of regulatory markers
  • Quality metrics - Automated assessment of content quality

Content Quality

# Check markdown formatting and URLs
./scripts/local-content-check.sh

# Markdown linting only
./scripts/local-content-check.sh markdown-only

# URL validation only
./scripts/local-content-check.sh urls-only

πŸ‘₯ Perfect For

Technical Trainers

  • Convert course materials for AI-assisted delivery
  • Create searchable knowledge bases from presentations
  • Generate practice questions and assessments

Instructional Designers

  • Repurpose existing content for new formats
  • Extract learning objectives and outcomes
  • Analyze course structure and flow

Content Teams

  • Build AI training datasets from presentations
  • Create documentation from training materials
  • Generate summaries and abstracts

Developers

  • Process technical presentations for RAG systems
  • Extract code examples and documentation
  • Build knowledge bases for AI assistants

πŸ—οΈ Simple Architecture

graph TB
    subgraph "🎯 Single Purpose Design"
        A[πŸ“ Input PPTX Files] --> B[πŸ” Extractor]
        B --> C[✨ Formatter] 
        C --> D[πŸ“„ Output Markdown]
    end
    
    subgraph "🧠 Core Components"
        B --> B1[Extract Text]
        B --> B2[Extract Notes]
        B --> B3[Detect Patterns]
        
        C --> C1[Smart Chunking]
        C --> C2[Add Metadata]
        C --> C3[Generate Files]
    end
    
    subgraph "πŸ”Œ Integrations"
        E[πŸ–₯️ Claude Desktop] --> F[MCP Server]
        F --> B
        
        G[πŸŽ›οΈ CLI Interface] --> B
    end
    
    style A fill:#e1f5fe
    style D fill:#e8f5e8
    style B fill:#fff3e0
    style C fill:#f3e5f5
Loading

πŸ”§ Configuration

Default settings in config.yaml:

extraction:
  extract_text: true
  extract_notes: true
  extract_images: false  # Coming soon
  
formatting:
  default_chunk_size: 1500
  chunk_overlap: 200
  include_metadata: true
  
output:
  overwrite_existing: false
  create_summary: true

πŸš€ Roadmap

  • Core PPTX text extraction
  • Instructional design patterns
  • LLM-optimized chunking
  • Rich console interface
  • Comprehensive testing
  • CI/CD pipeline
  • Image extraction and description
  • Table preservation
  • Multi-language support
  • Web interface
  • API endpoint

πŸ“š Documentation

🀝 Contributing

Contributions welcome! This project uses:

  • Automated PR review assignment
  • GitHub Copilot code review
  • Comprehensive test requirements
  • Pre-commit hooks for quality

See CONTRIBUTING.md for guidelines.

πŸ“„ License

MIT License - see LICENSE


🎯 The Bottom Line

PPTX Shredder does ONE thing and does it very, very well.

βœ… Zero Configuration - Works out of the box
βœ… Zero Surprises - Predictable, reliable results
βœ… Zero Network - Completely local processing
βœ… Maximum Clarity - Simple workflow, clear output

Built by technical trainers, for technical trainers. πŸŽ“

πŸ“§ Questions? πŸ› Found a bug? Open an issue

About

Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors