A Python tool that downloads all publicly available Jeffrey Epstein–related document collections from multiple sources and extracts searchable text from them.
This project aggregates court records, document dumps, and OCR archives into a single local dataset, then extracts raw text layers from PDFs to enable full-text search and analysis.
- Downloads all known public Epstein document collections from:
  - Internet Archive (court filings, black books, phone books, oversight releases)
  - GitHub OCR archive (8,186 structured documents with metadata & entities)
- Extracts raw text layers from PDFs
  - Uses PyMuPDF when available
  - Falls back to pdftotext if needed
- Preserves document structure and metadata
- Generates searchable .txt files
- Detects potentially interesting patterns (emails, phone numbers)
- Unified search across all downloaded material
- Automatically installs missing Python dependencies
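The pattern detection listed above could be approximated with two regular expressions. This is a minimal sketch, not the tool's actual implementation: the function name `find_patterns`, both regexes, and the US-style phone format are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumed patterns for illustration; the tool's real expressions are not documented here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches common US-style numbers such as (212) 555-0100 or 212-555-0100.
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def find_patterns(text: str) -> Dict[str, List[str]]:
    """Return de-duplicated emails and phone numbers, in order of appearance."""
    def unique(items):
        # dict.fromkeys keeps first-seen order while dropping repeats.
        return list(dict.fromkeys(items))

    return {
        "emails": unique(EMAIL_RE.findall(text)),
        "phones": unique(m.group(0) for m in PHONE_RE.finditer(text)),
    }
```

Patterns this loose will produce false positives on case numbers and dates, so any hits are best treated as leads to check against the source PDF.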
Text-searchable PDF collections, including:
- Giuffre v. Maxwell court documents
- House Oversight Committee releases
- Epstein Black Book (multiple versions)
- Epstein phone books
- Python 3.8+
- Internet connection
- Disk space:
  - ~5–7 GB for PDFs
  - Additional space for extracted text
- PyMuPDF (auto-installed)
- pdftotext (system fallback)
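The PyMuPDF-first, pdftotext-second order could look roughly like the sketch below. The function name `extract_text` and the `-layout` flag are assumptions; the script's own extraction code may differ.

```python
import shutil
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return a PDF's text layer, preferring PyMuPDF over the pdftotext CLI."""
    try:
        import fitz  # PyMuPDF's import name
    except ImportError:
        fitz = None

    if fitz is not None:
        # Concatenate the text layer of every page; no OCR is involved.
        with fitz.open(pdf_path) as doc:
            return "\n".join(page.get_text() for page in doc)

    if shutil.which("pdftotext") is None:
        raise RuntimeError("Neither PyMuPDF nor pdftotext is available")

    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```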
git clone https://github.com/lanefiedler731-gif/Epstein-Downloader.git
cd Epstein-Downloader
python3 epstein_downloader.py --all
Dependencies are installed automatically if missing.
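One common way to install a missing dependency at startup is to check for the module and call pip through the running interpreter. The helper below is a hypothetical sketch of that idea (the name `ensure_package` is an assumption), not a copy of what the script does.

```python
import importlib.util
import subprocess
import sys


def ensure_package(import_name: str, pip_name: str) -> bool:
    """Install pip_name if import_name is not importable. Returns True if an install ran."""
    if importlib.util.find_spec(import_name) is not None:
        return False  # already importable, nothing to do
    # Use the current interpreter so the package lands in the active environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
    importlib.invalidate_caches()
    return True
```

For PyMuPDF the call would be `ensure_package("fitz", "PyMuPDF")`, since the import name and the package name differ.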
Download everything:
python3 epstein_downloader.py --all
Show status:
python3 epstein_downloader.py --status
Search all documents:
python3 epstein_downloader.py --search "keyword"
Only Internet Archive:
python3 epstein_downloader.py --ia-only
Only GitHub OCR:
python3 epstein_downloader.py --github-only
Extract PDF text only:
python3 epstein_downloader.py --extract
documents/
├── internet_archive/
├── github_ocr/
│   └── analyses.json
extracted_text/
├── pdf_text/
└── interesting_finds.json
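Given the layout above, the extracted text can also be searched directly from Python. The sketch below assumes the extracted files are UTF-8 `.txt` files somewhere under `extracted_text/`; the function name `search_text` is hypothetical, and the built-in `--search` option may work differently.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_text(root: str, keyword: str) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for every case-insensitive match under root."""
    needle = keyword.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the scan going past badly encoded bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if needle in line.lower():
                    yield path, line_no, line.rstrip("\n")
```

A call such as `for path, line_no, line in search_text("extracted_text", "keyword"): print(path, line_no, line)` prints each hit with its location.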
- Extracts raw PDF text layers
- Improper redactions may still expose underlying text
- Blank pages are flagged
- No OCR is performed on image-only PDFs
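Because no OCR is performed, a scanned page with no text layer yields an empty string and looks the same as a blank page. A minimal sketch of flagging such pages, assuming the criterion is simply "whitespace-only text" (the function name `flag_blank_pages` is hypothetical):

```python
from typing import List


def flag_blank_pages(page_texts: List[str]) -> List[int]:
    """Return 1-based numbers of pages whose text layer is empty or whitespace-only."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]
```

A flagged page is therefore a candidate for manual review or external OCR, not proof that the page has no content.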
- Only publicly available documents are accessed
- No private or restricted material is obtained
- Users are responsible for interpretation and use
- Intended for research, journalism, and archival transparency
This software makes no claims regarding accuracy, completeness, or interpretation of the documents.
It is a data aggregation and text extraction utility only.
Without a license, all rights are reserved.