A multi-stage Map-Reduce LLM pipeline, built with LangChain, that extracts and validates action items from long meeting transcripts.
Each extracted action item conforms to this JSON schema:

```json
{
  "task": "",
  "owner": "",
  "deadline": "",
  "confidence": 0.0
}
```

Project structure:

```
Map-Reduce-Chain/
├── src/
│   ├── config.py
│   ├── models.py              # Pydantic schemas
│   ├── document_loader.py     # LangChain Documents + metadata
│   ├── map_chain.py           # MAP chain (Prompt + LLM + Parser)
│   ├── reduce_chain.py        # REDUCE chain
│   ├── confidence_chain.py    # Confidence scoring chain
│   ├── validation.py
│   ├── main.py                # Pipeline orchestration
│   └── prompts/
│       ├── map_prompt.yaml
│       └── reduce_prompt.yaml
├── tests/
├── data/
├── notebooks/
├── .env.example
└── README.md
```
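The schema above maps naturally onto a Pydantic model. A minimal sketch (the class name `ActionItem` and the field descriptions are assumptions, not the exact contents of `models.py`):

```python
from pydantic import BaseModel, Field

class ActionItem(BaseModel):
    """One extracted action item (fields mirror the JSON schema above)."""
    task: str = Field(description="What needs to be done")
    owner: str = Field(description="Who is responsible")
    deadline: str = Field(description="When it is due, as stated in the transcript")
    confidence: float = Field(ge=0.0, le=1.0, description="Heuristic certainty score")

item = ActionItem(task="Send the Q3 report", owner="Alice",
                  deadline="Friday", confidence=0.9)
```

The `ge`/`le` constraints make Pydantic reject out-of-range confidence scores at parse time, which is what lets the retry logic catch malformed LLM output.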
Create and activate a virtual environment:

```shell
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Configure your API key:

```shell
# Copy the example to .env
cp .env.example .env

# Edit .env and add your OpenAI API key
# OPENAI_API_KEY=sk-...
```

Verify the setup:

```shell
pytest tests/ -v
```

Project milestones:

- ✅ Define action item schema
- ✅ Transcript ingestion & metadata handling
- ✅ Smart chunking (by speaker turns, 1-2 minutes)
- ✅ MAP prompt + LangChain chain
- ✅ Output validation & retry logic
- ✅ Merge logic definition
- ✅ REDUCE prompt + chain
- ✅ Confidence scoring layer
- ✅ Edge case handling
- ✅ UI/CLI implementation
- ✅ Documentation
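The speaker-turn chunking step can be sketched in plain Python. This is a simplified stand-in for the real splitter in `document_loader.py`; it assumes `Name: utterance` lines:

```python
import re

def split_by_speaker_turns(transcript: str, max_turns: int = 6) -> list[str]:
    """Split a 'Name: utterance' transcript into chunks of consecutive
    speaker turns, so no turn is ever cut in half (simplified sketch)."""
    # A new turn starts at a line beginning with a capitalized name and a colon
    turns = [t.strip()
             for t in re.split(r"\n(?=[A-Z][\w ]*:)", transcript)
             if t.strip()]
    return ["\n".join(turns[i:i + max_turns])
            for i in range(0, len(turns), max_turns)]

raw = "Alice: Let's ship Friday.\nBob: I'll own the release notes.\nAlice: Great."
chunks = split_by_speaker_turns(raw, max_turns=2)
```

Splitting on turn boundaries (rather than a fixed character count) is what preserves the "who said what" context the MAP prompt relies on.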
LangChain concepts used:

- Map-Reduce Chains: Split, process, and consolidate
- PromptTemplate: Reusable prompt patterns
- LLMChain: Chain prompts with LLM calls
- PydanticOutputParser: Structured extraction
- Document Objects: Metadata-aware text processing
- Custom Text Splitters: Preserve speaker context
- Retry & Validation: Reliability patterns
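A `map_prompt.yaml` in LangChain's `load_prompt` serialization format might look like this (hypothetical wording, not the actual file contents):

```yaml
# prompts/map_prompt.yaml — hypothetical sketch
_type: prompt
input_variables: ["chunk", "format_instructions"]
template: |
  You are extracting action items from a meeting transcript excerpt.
  Extract every task with its owner and deadline. If a field is not
  stated, leave it empty. Do not invent information.

  Transcript chunk:
  {chunk}

  {format_instructions}
```

The `format_instructions` slot is where `PydanticOutputParser.get_format_instructions()` output would be injected at runtime.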
Python API:

```python
from src.main import ActionItemExtractor

extractor = ActionItemExtractor()
items = extractor.extract("path/to/transcript.txt", "meeting_001")

for item in items:
    print(f"Task: {item.task}")
    print(f"Owner: {item.owner}")
    print(f"Confidence: {item.confidence}")
```

CLI:

```shell
python src/main.py transcript.txt actions.json
```

Streamlit UI:

```shell
streamlit run src/app.py
```

Edit `src/config.py` to customize:

- `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
- `BATCH_SIZE`: Number of chunks to process at once
- `CONFIDENCE_THRESHOLD`: Minimum confidence score (0-1)
- `OPENAI_MODEL`: LLM to use (gpt-4, gpt-3.5-turbo)
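A minimal sketch of how `src/config.py` might wire these settings to environment variables (the defaults here are assumptions, not the project's actual values):

```python
# Sketch of src/config.py — settings fall back to defaults when unset
import os

OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "5"))
CONFIDENCE_THRESHOLD = float(os.getenv("CONFIDENCE_THRESHOLD", "0.6"))

# Fail fast on nonsensical values rather than mid-pipeline
assert 0.0 <= CONFIDENCE_THRESHOLD <= 1.0
assert BATCH_SIZE >= 1
```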
Run all tests:

```shell
pytest tests/ -v
```

Run a specific test:

```shell
pytest tests/ -k map -v
```

With coverage:

```shell
pytest tests/ --cov=src --cov-report=html
```

Why Map-Reduce:

- Scalability: Process transcripts of any length
- Reliability: LLM operates on focused contexts
- Debuggability: Each stage is testable independently
- Flexibility: Easy to add validation layers
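The retry-and-validation layer from the checklist is the simplest example of that flexibility. A minimal sketch with stand-in `call_fn`/`parse_fn` (no LLM involved):

```python
import json

def extract_with_retry(call_fn, parse_fn, max_attempts: int = 3):
    """Call the model, parse its output, and retry on parse failure."""
    last_error = None
    for attempt in range(max_attempts):
        raw = call_fn(attempt)
        try:
            return parse_fn(raw)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            last_error = err
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error

# Stand-in "model": the first reply is malformed, the retry is valid JSON
outputs = iter(["not-json", '{"task": "ship"}'])
result = extract_with_retry(lambda _attempt: next(outputs), json.loads)
```

In the real pipeline, `parse_fn` would be the `PydanticOutputParser`, so schema violations trigger a retry just like malformed JSON does.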
Pipeline flow:

```
Raw Transcript
      ↓
[Ingestion] → Add metadata, normalize
      ↓
[Chunking] → Speaker turns, preserve context
      ↓
[MAP Phase] → Extract candidates from each chunk
      ↓
[REDUCE Phase] → Deduplicate, fill gaps, normalize
      ↓
[Confidence Scoring] → Rate certainty
      ↓
[Validation] → Handle edge cases
      ↓
Structured Action Items
```
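Stripped of the LLM calls, the MAP and REDUCE phases above reduce to a small loop. The `map_fn`/`reduce_fn` below are illustrative stand-ins for the real chains:

```python
from collections import OrderedDict

def run_pipeline(chunks, map_fn, reduce_fn):
    """MAP each chunk independently, then REDUCE all candidates once."""
    candidates = []
    for chunk in chunks:              # MAP: the LLM sees one focused chunk
        candidates.extend(map_fn(chunk))
    return reduce_fn(candidates)      # REDUCE: dedupe, fill gaps, normalize

# Stand-ins for the LLM chains:
def map_fn(chunk):
    return [{"task": t.strip()} for t in chunk.split(";") if t.strip()]

def reduce_fn(items):
    # Deduplicate by task text, keeping first-seen order
    return list(OrderedDict((i["task"], i) for i in items).values())

result = run_pipeline(["ship release; ship release", "update docs"], map_fn, reduce_fn)
```

Because each MAP call only sees one chunk, the same task mentioned in two chunks shows up twice in `candidates`; deduplication is deliberately deferred to the single REDUCE pass.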
Known limitations:

- Requires clear speaker labels in the transcript
- Performance degrades on very long transcripts (>1hr) without chunking optimization
- Confidence scores are heuristic-based
- LLM hallucinations possible on ambiguous deadlines
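One cheap guard against hallucinated owners or deadlines is to require that extracted values literally appear in the source text. A sketch (the actual checks in `validation.py` may differ):

```python
def grounded(item: dict, transcript: str) -> bool:
    """Reject an item whose owner or deadline never appears in the transcript."""
    text = transcript.lower()
    for field in ("owner", "deadline"):
        value = item.get(field, "")
        if value and value.lower() not in text:
            return False
    return True

transcript = "Bob: I'll send the report by Friday."
ok = grounded({"task": "send report", "owner": "Bob", "deadline": "Friday"}, transcript)
```

This substring check is crude (it misses paraphrases like "end of week" vs. "Friday"), but it cheaply filters the worst fabrications before confidence scoring.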
Contributing:

- Create a feature branch
- Write tests for new features
- Run `black` and `flake8` before committing
- Update README if adding new functionality
Built with: LangChain, OpenAI, Pydantic, Streamlit