fnusatvik07/rag-source

RAG Source

A single-repository guide to parsing and preparing documents for Retrieval-Augmented Generation (RAG). It contains working code examples, generated sample documents, and guides for every major document type.

What's Inside

9 document types, 26+ extraction methods, all with working Python code and sample documents:

| Type | Methods | Key Libraries |
|------|---------|---------------|
| PDF | pypdf, pdfplumber, PyMuPDF, OCR, table extraction, comparison | pypdf, pdfplumber, pymupdf, pytesseract |
| Word (DOCX) | python-docx, mammoth, docx2txt | python-docx, mammoth |
| PowerPoint (PPTX) | Basic extraction, structured slide parsing | python-pptx |
| HTML / Web | BeautifulSoup, html2text, trafilatura | bs4, html2text, trafilatura |
| Spreadsheets | openpyxl, pandas, csv stdlib | openpyxl, pandas |
| Images (OCR) | Tesseract, EasyOCR | pytesseract, easyocr |
| Email (EML) | stdlib email parsing, structured extraction | email (stdlib) |
| Markdown / Text | Chunking strategies, AST parsing, semantic chunking | mistune |
| EPUB | ebooklib extraction, full text pipeline | ebooklib |
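As an illustration of the "Email (EML)" row, here is a minimal sketch of structured extraction using only the standard-library `email` package. It is not taken from the repository's scripts: the `parse_eml` helper, the `RAW_EML` sample message, and the output field names are hypothetical and chosen only for this example.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message, inlined so the example is self-contained.
RAW_EML = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Content-Type: text/plain; charset="utf-8"

Hi Bob,
The report is attached.
"""


def parse_eml(raw: str) -> dict:
    """Parse a raw RFC 5322 message into a flat dict of headers and body text."""
    # policy=default returns an EmailMessage, which provides get_body().
    msg = message_from_string(raw, policy=default)
    # Prefer the plain-text part; fall back to an empty body if there is none.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "body": body.strip(),
    }


record = parse_eml(RAW_EML)
print(record["subject"])
print(record["body"])
```

Keeping headers as separate fields, rather than concatenating everything into one string, lets a downstream RAG step store sender and subject as metadata alongside the body text.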

Production-grade parsing libraries with AI-powered extraction:

| Library | Type | Key Strength | License |
|---------|------|--------------|---------|
| Docling (IBM) | Local | Multi-format, built-in chunking, TableFormer | MIT |
| Unstructured.io | Local/Cloud | Widest format support, typed elements | Apache 2.0 |
| Azure Doc Intelligence | Cloud | Highest accuracy, prebuilt models (invoice, receipt, ID) | Proprietary |
| LlamaParse | Cloud | GenAI-native parsing, LlamaIndex integration | Proprietary |
| Marker | Local | Best PDF-to-Markdown, equation support | GPL |
| MegaParse (Quivr) | Local/API | Simplest API, vision mode with GPT-4o/Claude | Apache 2.0 |

Getting Started

# Install dependencies
uv sync

# Install optional OCR dependencies
uv sync --extra ocr

# Install advanced parsing libraries (pick what you need)
uv sync --extra docling        # IBM Docling
uv sync --extra unstructured   # Unstructured.io
uv sync --extra azure          # Azure Document Intelligence
uv sync --extra llamaparse     # LlamaParse
uv sync --extra marker         # Marker PDF-to-Markdown

# Generate all sample documents
for f in unstructured_documents/*/sample_docs/generate_samples.py; do
  uv run python "$f"
done

# Run any extraction script
uv run python unstructured_documents/01_pdf/01_pypdf_extraction.py

Focus

This repo focuses on document parsing and extraction strategies: how to get text out of various file formats and prepare it for RAG. It does not implement RAG pipelines, vector databases, or embedding models. The goal is to be the only guide you need for the "parse and chunk" phase of any RAG system.
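To make the "chunk" half of that phase concrete, the following is a minimal sketch of fixed-size chunking with overlap, one common baseline strategy. It is an assumption for illustration, not the repository's implementation: the `chunk_text` name and its `chunk_size` and `overlap` parameters are hypothetical, and sizes are measured in characters rather than tokens.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, overlap=2))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk when it is shorter than the overlap, at the cost of some duplicated text in the index.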

Requirements

  • Python 3.11+
  • uv for dependency management
