Welcome to Week 3 of the LLM learning track! This guide will get you up and running in 5 minutes.
This project uses FREE local models via Ollama - no API costs!
You will build a reliable LLM system that extracts structured data from messy text:
"Invoice #123, total $456.78, due March 15th"
↓
{
  "invoice_number": "123",
  "total_amount": 456.78,
  "due_date": "2025-03-15"
}
It adds validation and retries, aiming for 99%+ reliability.
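To make the validation half of that concrete, here is a minimal standard-library sketch of the checks the example output above would have to pass. The project itself uses Pydantic; `InvoiceSketch` and its rules are hypothetical stand-ins, not the contents of src/schemas.py.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class InvoiceSketch:
    """Hypothetical stand-in for the project's invoice schema."""

    invoice_number: str
    total_amount: float
    due_date: str  # expected as YYYY-MM-DD

    def __post_init__(self) -> None:
        if not isinstance(self.invoice_number, str) or not self.invoice_number:
            raise ValueError("invoice_number must be a non-empty string")
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(self.total_amount, bool) or not isinstance(
            self.total_amount, (int, float)
        ):
            raise ValueError("total_amount must be a number")
        if self.total_amount <= 0:
            raise ValueError("total_amount must be positive")
        if not isinstance(self.due_date, str) or not re.fullmatch(
            r"\d{4}-\d{2}-\d{2}", self.due_date
        ):
            raise ValueError("due_date must look like YYYY-MM-DD")
        # Rejects impossible calendar dates such as 2025-02-30
        date.fromisoformat(self.due_date)


raw = {"invoice_number": "123", "total_amount": 456.78, "due_date": "2025-03-15"}
invoice = InvoiceSketch(**raw)
print(invoice)
```

Passing the unnormalized text "March 15th" as `due_date` raises `ValueError`, which is the kind of failure the retry step is meant to recover from.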
Install Ollama first:
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Or macOS with Homebrew
brew install ollama
# Start Ollama service
ollama serve
# Pull the model (in another terminal)
ollama pull llama3.2
Then set up the project:
# Run the quick start script
./quickstart.sh
This will:
- ✓ Check Ollama installation
- ✓ Download llama3.2 model if needed
- ✓ Create virtual environment
- ✓ Install dependencies
- ✓ Run a demo extraction
Or manually:
# Install and start Ollama
ollama serve # Keep running in one terminal
ollama pull llama3.2 # In another terminal
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up environment (optional - has defaults)
cp .env.example .env
# Extract from a sample invoice
python cli.py extract \
  --input sample_inputs/invoice_tech.txt \
  --type invoice
You should see:
✓ Extraction succeeded after 1 attempt!
┌─ Extracted Data ───────────────────┐
│ {
│   "invoice_number": "INV-2025-0342",
│   "total_amount": 8470.43,
│   "vendor_name": "TechSupply Solutions Inc.",
│   ...
│ }
└────────────────────────────────────┘
# Extract from an email
python cli.py extract \
  --input sample_inputs/email_project.txt \
  --type email
# Extract from a support ticket
python cli.py extract \
  --input sample_inputs/support_ticket_urgent.txt \
  --type support_ticket
# See all available schema types
python cli.py list-schemas
Read CONCEPTS.md to understand:
- Why function calling matters
- How guardrails work
- What makes output reliable
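As a rough illustration of why function calling helps, the sketch below declares the expected output as a JSON-Schema-style tool definition, the general shape several chat APIs accept. The tool name `extract_invoice_data`, the field list, and the helper `missing_required` are assumptions for illustration; the definition the project actually sends is not shown in this guide.

```python
import json
from typing import List

# Hypothetical tool definition: the model is asked to "call" this function,
# so its answer arrives as named, typed arguments instead of free text.
invoice_tool = {
    "type": "function",
    "function": {
        "name": "extract_invoice_data",
        "description": "Extract structured fields from an invoice",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total_amount": {"type": "number"},
                "due_date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["invoice_number", "total_amount"],
        },
    },
}


def missing_required(arguments: dict, tool: dict) -> List[str]:
    """Return required parameter names absent from the model's arguments."""
    required = tool["function"]["parameters"]["required"]
    return [name for name in required if name not in arguments]


# Simulated arguments, standing in for what a model might return.
simulated_arguments = json.loads('{"invoice_number": "123", "total_amount": 456.78}')
print(missing_required(simulated_arguments, invoice_tool))
```

A check like `missing_required` is one small guardrail; schema validation then catches wrong types and formats that a presence check cannot.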
Start with schemas:
# Open the schema definitions
code src/schemas.py
Look for:
- InvoiceData - see how fields are defined
- Field() validators - see validation rules
- @field_validator - see custom validation
Then explore the engine:
# Open the extraction engine
code src/extractor.py
Look for:
- extract() method - main entry point
- _call_llm() - how we use function calling
- Retry logic in the main loop
- _build_validation_feedback() - error recovery
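Before reading the real engine, it may help to see the general shape of a validate-and-retry loop. The sketch below is not the code in src/extractor.py: `validate`, `build_feedback`, and `extract_with_retries` are hypothetical names, and the model is replaced by a stub that answers wrongly once and then correctly.

```python
import json
from typing import Callable, List, Optional, Tuple


def validate(data: dict) -> List[str]:
    """Return human-readable validation errors (empty list means valid)."""
    errors = []
    if not isinstance(data.get("invoice_number"), str):
        errors.append("invoice_number: expected a string")
    total = data.get("total_amount")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total <= 0:
        errors.append("total_amount: expected a positive number")
    return errors


def build_feedback(errors: List[str]) -> str:
    """Turn validation errors into text the model can act on."""
    lines = "\n".join(f"- {error}" for error in errors)
    return f"Your previous answer failed validation:\n{lines}"


def extract_with_retries(
    text: str, call_model: Callable[[str], str], max_attempts: int = 3
) -> Tuple[Optional[dict], int]:
    """Call the model until its output validates or attempts run out."""
    prompt = text
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            errors = validate(data) if isinstance(data, dict) else ["expected a JSON object"]
        except json.JSONDecodeError as exc:
            data, errors = None, [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data, attempt
        # Feed the errors back so the next attempt can correct them.
        prompt = f"{text}\n\n{build_feedback(errors)}"
    return None, max_attempts


# Stub model: first reply has the wrong type, second reply is valid.
replies = iter([
    '{"invoice_number": 123, "total_amount": 50}',
    '{"invoice_number": "123", "total_amount": 50}',
])
data, attempts = extract_with_retries("Invoice #123, total $50", lambda prompt: next(replies))
print(data, attempts)
```

The design point is that a failed attempt is not discarded: its errors become part of the next prompt, which is what lets a retry do better than the first try.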
# See everything that happens
python cli.py extract \
  --input sample_inputs/invoice_tech.txt \
  --type invoice \
  --verbose
Then check the logs:
ls -lt logs/ | head -2
cat logs/extraction_*.log
Create a file with incomplete data:
# Single quotes stop the shell from expanding $5 inside "$50"
echo 'Invoice #123, total $50' > test_invoice.txt
python cli.py extract \
  --input test_invoice.txt \
  --type invoice \
  --verbose
Questions:
- What validation errors occur?
- Does it retry? How many times?
- What's the final error message?
Create JSON that parses but violates the schema:
echo '{"invoice_number": 123}' > bad.json
python cli.py validate \
  --schema invoice \
  --file bad.json
Questions:
- What error does Pydantic report?
- Why is 123 invalid for invoice_number?
- What would be valid?
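If you want to reproduce the check in isolation, the sketch below assumes Pydantic v2 is installed and that `invoice_number` is declared as a plain `str` (an assumption; src/schemas.py is not reproduced in this guide). `InvoiceStub` and `error_types` are hypothetical names.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class InvoiceStub(BaseModel):
    """Hypothetical one-field stand-in for InvoiceData."""

    invoice_number: str


def error_types(payload: str) -> List[str]:
    """Validate a JSON string and return the Pydantic error type codes."""
    try:
        InvoiceStub.model_validate_json(payload)
    except ValidationError as exc:
        return [error["type"] for error in exc.errors()]
    return []


# A JSON number versus a JSON string for the same field
print(error_types('{"invoice_number": 123}'))
print(error_types('{"invoice_number": "123"}'))
```

Compare the two printed lists with what `python cli.py validate` reports for bad.json.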
Open src/schemas.py and add a new optional field to InvoiceData:
payment_method: Optional[str] = Field(
    None,
    description="Payment method (credit card, check, etc.)"
)
Save and run:
python cli.py extract \
  --input sample_inputs/invoice_tech.txt \
  --type invoice
Does it extract the new field?
Add a new extraction type for receipts:
- Add to src/schemas.py:
class ReceiptData(BaseModel):
    store_name: str
    purchase_date: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}$")
    total: float = Field(..., gt=0)
    items: List[str]
    payment_method: Optional[str] = None
- Register it:
EXTRACTION_SCHEMAS["receipt"] = {
    "model": ReceiptData,
    "description": "Extract data from receipts",
    "name": "extract_receipt_data"
}
- Create a sample receipt in sample_inputs/receipt.txt
- Test it:
python cli.py extract \
  --input sample_inputs/receipt.txt \
  --type receipt
- Install Ollama from https://ollama.ai
- On macOS: brew install ollama
- Verify: ollama --version
- Start Ollama: ollama serve
- Check if running: curl http://localhost:11434/api/tags
- Make sure no firewall is blocking port 11434
- Pull the model: ollama pull llama3.2
- List installed models: ollama list
- Try a different model: ollama pull mistral
- Check the input text - is it really an invoice/email/ticket?
- Look at the validation errors in the output
- Check logs for detailed error messages
- Try adjusting the schema if it's too strict
- Make sure you activated the virtual environment
- Run pip install -r requirements.txt again
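The connection and model checks above can also be scripted. This is a small sketch using only the standard library against the same http://localhost:11434/api/tags endpoint used by the curl check; the helper names `installed_models` and `has_model` are hypothetical and not part of the project.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def installed_models(
    url: str = OLLAMA_TAGS_URL, timeout: float = 2.0, opener=urllib.request.urlopen
) -> Optional[List[str]]:
    """Return the model names Ollama reports, or None if it is unreachable."""
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not JSON
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


def has_model(names: List[str], wanted: str = "llama3.2") -> bool:
    """Match 'llama3.2' as well as tagged names such as 'llama3.2:latest'."""
    return any(name == wanted or name.startswith(wanted + ":") for name in names)


if __name__ == "__main__":
    names = installed_models()
    if names is None:
        print("Ollama is not reachable on port 11434; try: ollama serve")
    elif not has_model(names):
        print("llama3.2 is not installed; try: ollama pull llama3.2")
    else:
        print("Ollama is running and llama3.2 is available")
```

The `opener` parameter exists only so the function can be exercised without a running server.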
Always run with --verbose when learning:
python cli.py extract -i <file> -t <type> --verbose
Logs show everything:
# Find latest log
ls -lt logs/ | head -2
# View it
cat logs/extraction_*.log
Try inputs that should fail:
- Missing fields
- Wrong formats
- Ambiguous data
- Empty files
# More deterministic
python cli.py extract -i <file> -t <type> --temperature 0.0
# More creative
python cli.py extract -i <file> -t <type> --temperature 0.7
Test schemas without LLM calls:
python cli.py validate -s invoice -f data.json
- README.md - Complete documentation
- CONCEPTS.md - Why these patterns matter
- LEARNING_OUTCOMES.md - What you'll master
- PROJECT_STRUCTURE.md - Code organization
- example_usage.py - Programmatic examples
1. Quick Start (5 min)
   └── Get it running
2. Concepts (15 min)
   └── Understand why
3. Code Exploration (30 min)
   ├── Read schemas.py
   ├── Read extractor.py
   └── Run with --verbose
4. Hands-On (60 min)
   ├── Exercise 1: Break it
   ├── Exercise 2: Test validation
   ├── Exercise 3: Add a field
   └── Exercise 4: Create new schema
5. Deep Dive (60+ min)
   ├── Modify retry logic
   ├── Add custom validators
   ├── Integrate with your app
   └── Deploy to production
After completing this project, you should be able to:
- Explain why function calling is more reliable than parsing text
- Define a Pydantic schema with validation rules
- Use the CLI to extract data from text
- Understand how retry logic improves success rates
- Debug validation failures using logs
- Add a new extraction schema to the system
- Configure temperature for determinism vs creativity
- Validate JSON against schemas programmatically
- Integrate the extractor into your own code
- Explain when to use LLMs vs traditional parsing
Once you're comfortable with this project:
- Extend it
  - Add new schemas (contracts, resumes, catalogs)
  - Add async support for batch processing
  - Build a web UI with Streamlit
- Apply it
  - Use it in your own projects
  - Process real documents
  - Build a data pipeline
- Continue learning
  - Week 4: RAG & Knowledge Integration
  - Week 5: Agents & Complex Workflows
  - Week 6: Production Deployment
- Check logs in logs/
- Review error messages carefully
- Read the relevant documentation section
- Try the troubleshooting guide above
- Experiment with simpler inputs first
Remember: The goal isn't just to make it work; it's to understand why it works and when to use these patterns.
Take your time, experiment, break things, and learn!