A Python-based web scraper and RAG (Retrieval-Augmented Generation) API for the U.S. Department of Justice Epstein Files.
- Automated Browser Control: Uses Playwright to bypass bot detection and age verification.
- DOJ Disclosures Scan (Default): Automatically navigates the disclosures page and downloads datasets.
- Legacy Alphabet Search: Optional mode to search through documents using letter-based queries.
- Deduplication: Removes duplicate files before downloading.
- Vision PDF Parsing: Uses GPT-4o-mini Vision to extract text AND understand images from PDFs (see the sketch after this list).
- Multimodal Understanding: Images, signatures, and diagrams are described and indexed.
- Vector Search: Qdrant vector database for semantic search.
- OpenAI-Compatible: Supports custom API endpoints (Ollama, Azure, etc.).
- Background Sync: Indexes PDFs in background without blocking the API.
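The Vision parsing step works roughly as follows. This is a standalone sketch using `pdf2image` and the OpenAI chat-completions vision format, not the actual code in `src/rag/parser.py`:

```python
# Standalone sketch of Vision-based PDF parsing; src/rag/parser.py may differ.
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path  # requires poppler installed

client = OpenAI()


def parse_page(pdf_path: str, page_number: int = 1) -> str:
    # Render a single PDF page to an in-memory PNG.
    page = convert_from_path(
        pdf_path, first_page=page_number, last_page=page_number
    )[0]
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    # Ask the vision model to transcribe text and describe visual content.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Transcribe all text and describe any images, "
                        "signatures, or diagrams on this page.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    )
    return resp.choices[0].message.content
```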
Project structure:

```
eua_gov/
├── src/
│   ├── __init__.py          # Package exports
│   ├── app.py               # Scraper orchestrator
│   ├── api.py               # FastAPI RAG server
│   ├── config.py            # Pydantic Settings
│   ├── logging_config.py    # Loguru setup
│   ├── scraper.py           # Web scraping logic
│   ├── downloader.py        # PDF download logic
│   └── rag/                 # RAG module
│       ├── llm.py           # LLM client wrapper
│       ├── parser.py        # Vision PDF parser
│       ├── embeddings.py    # Embeddings wrapper
│       ├── store.py         # Qdrant wrapper
│       └── sync.py          # Document sync logic
├── downloads/               # Downloaded PDF files
├── logs/                    # Log files
├── main.py                  # Entry point
├── routes.http              # HTTP client tests
├── .env                     # Environment configuration
├── Dockerfile               # Docker image
├── docker-compose.yml       # Docker Compose
└── pyproject.toml           # Dependencies
```
Prerequisites:

- Python 3.10+
- uv (recommended) or pip
- Docker (for Qdrant)
```bash
cd epstein_crawler_docs

# Install dependencies
uv sync

# Install Playwright browsers (for scraper)
uv run playwright install chromium
```

Create a `.env` file in the project root:
```
# ============================================================
# Scraper Configuration (Optional)
# ============================================================
MAX_DOWNLOADS=               # Max files to download (default: unlimited)
NAVIGATION_TIMEOUT=60000     # Navigation timeout in ms

# ============================================================
# RAG API Configuration (Required for API)
# ============================================================
OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=             # Optional: Custom API URL (Ollama, Azure, etc.)

# Models (optional, these are defaults for OpenAI)
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_EMBEDDING_DIMENSION=1536  # Dimension of the embedding model
OPENAI_CHAT_MODEL=gpt-5-mini
OPENAI_VISION_MODEL=         # Optional: Separate model for PDF parsing (defaults to CHAT_MODEL)

# PDF Parsing
MAX_PAGES_PER_PDF=0          # 0 = no limit

# Qdrant (optional if running locally)
QDRANT_HOST=localhost
QDRANT_PORT=6333

# API Server (optional)
API_PORT=8000
API_CORS_ORIGINS=*
```
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | OpenAI API key (or `ollama` for Ollama) |
| `OPENAI_BASE_URL` | (none) | Custom API URL for Ollama, Azure, etc. |
| `OPENAI_EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model |
| `OPENAI_EMBEDDING_DIMENSION` | `1536` | Must match the model's output size |
| `OPENAI_CHAT_MODEL` | `gpt-5-mini` | Chat model for Q&A |
| `OPENAI_VISION_MODEL` | (chat model) | Vision model for PDF parsing |
| `MAX_PAGES_PER_PDF` | `0` (unlimited) | Max pages per PDF to process |
| `QDRANT_HOST` | `localhost` | Qdrant server host |
| `QDRANT_PORT` | `6333` | Qdrant server port |
| `API_PORT` | `8000` | FastAPI server port |
| `API_CORS_ORIGINS` | `*` | CORS allowed origins |
| `MAX_DOWNLOADS` | (none) | Max PDFs to download (scraper) |
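These variables are loaded by `src/config.py` via pydantic-settings. A minimal sketch of what that loading might look like (field names here are inferred from the variables above, not taken from the actual source):

```python
# Hypothetical sketch of src/config.py — the real module may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Reads .env and the process environment (case-insensitive matching).
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    openai_api_key: str
    openai_base_url: str | None = None
    openai_embedding_model: str = "text-embedding-3-small"
    openai_embedding_dimension: int = 1536
    openai_chat_model: str = "gpt-5-mini"
    max_pages_per_pdf: int = 0
    qdrant_host: str = "localhost"
    qdrant_port: int = 6333
    api_port: int = 8000


settings = Settings()
```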
Common embedding models and their dimensions:

| Model | Dimension |
|---|---|
| `text-embedding-3-small` (OpenAI) | 1536 |
| `text-embedding-3-large` (OpenAI) | 3072 |
| `text-embedding-ada-002` (OpenAI) | 1536 |
| `qwen3-embedding` (Ollama) | 1024 (or 4096; check the model card) |
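If you are unsure of a model's output size, a quick probe against the configured endpoint settles it. This standalone script (not part of the repo) prints the actual dimension to use for `OPENAI_EMBEDDING_DIMENSION`:

```python
# Probe the configured endpoint for the real embedding dimension.
# Reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment.
import os

from openai import OpenAI

client = OpenAI()
model = os.environ.get("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
resp = client.embeddings.create(model=model, input="dimension probe")
print(model, "->", len(resp.data[0].embedding))  # e.g. 1536
```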
Run the scraper (default disclosures scan):

Local:

```bash
uv run main.py
```

Docker Compose:

```bash
docker compose --profile scraper up
```

Run the legacy alphabet search:

Local:

```bash
ALPHABET=abc uv run main.py --search
```

Docker Compose:

```bash
docker compose --profile search up
```

Note: The API needs PDFs in the `downloads/` directory to work. Run the scraper first (or manually add PDFs).

Start the RAG API:

```bash
docker compose --profile api up --build
```

On startup, the API will:
- Check for unindexed PDFs in `downloads/`
- Start a background sync (parse PDFs with Vision, generate embeddings)
- Remain immediately available while the sync runs (sketched below)
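A minimal sketch of this non-blocking startup pattern using FastAPI's lifespan hook (illustrative names; the actual `src/api.py` may be structured differently):

```python
# Hypothetical sketch of a non-blocking startup sync.
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

sync_state = {"status": "idle", "indexed": 0}


async def sync_documents() -> None:
    sync_state["status"] = "running"
    # ... parse PDFs with Vision, embed, upsert into Qdrant ...
    sync_state["status"] = "done"


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Fire-and-forget: the API starts serving while indexing continues.
    task = asyncio.create_task(sync_documents())
    yield
    task.cancel()


app = FastAPI(lifespan=lifespan)


@app.get("/sync/status")
async def sync_status():
    return sync_state
```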
Check sync progress:

```bash
curl http://localhost:8000/sync/status
```

Ask a question:
```bash
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "Who visited the island?"}'
```

Response:
```json
{
  "answer": "Based on the documents...",
  "sources": [
    {
      "filename": "EFTA00001234.pdf",
      "score": 0.89,
      "preview": "Flight log showing..."
    }
  ]
}
```
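The same request from Python (a standalone snippet, not code from the repo):

```python
# Query the RAG API from Python using requests.
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "Who visited the island?"},
    timeout=120,  # retrieval plus generation can take a while
)
resp.raise_for_status()
data = resp.json()
print(data["answer"])
for source in data["sources"]:
    print(f'{source["filename"]} (score {source["score"]:.2f})')
```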
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check (includes sync status) |
| GET | `/stats` | Index statistics |
| POST | `/sync` | Trigger background sync |
| GET | `/sync/status` | Check sync progress |
| POST | `/ask` | Ask a question |
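For example:

```bash
# Trigger a background re-index and watch its progress
curl -X POST http://localhost:8000/sync
curl http://localhost:8000/sync/status

# Index statistics
curl http://localhost:8000/stats
```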
The Docker Compose setup is organized into profiles:

| Profile | Services | Use Case |
|---|---|---|
| `scraper` | scraper-scan | Download PDFs from DOJ |
| `search` | scraper-search | Legacy search mode |
| `api` | qdrant, api | RAG API + Vector DB |
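A hedged sketch of how these profiles might be wired up in `docker-compose.yml` (the service names come from the table above; images, ports, and volumes are illustrative assumptions, not the repo's real file):

```yaml
# Illustrative docker-compose.yml excerpt — the actual file may differ.
services:
  qdrant:
    image: qdrant/qdrant
    profiles: ["api"]
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

  api:
    build: .
    profiles: ["api"]
    env_file: .env
    ports:
      - "8000:8000"
    depends_on:
      - qdrant

  scraper-scan:
    build: .
    profiles: ["scraper"]
    env_file: .env
    volumes:
      - ./downloads:/app/downloads

  scraper-search:
    build: .
    profiles: ["search"]
    env_file: .env
    volumes:
      - ./downloads:/app/downloads

volumes:
  qdrant_data:
```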
```bash
# Run scraper
docker compose --profile scraper up

# Start RAG API + Qdrant
docker compose --profile api up --build

# Stop all services (must specify profile if active)
docker compose --profile api down

# Stop and remove volumes (clean Qdrant data)
docker compose --profile api down -v
```

Ollama:

```
OPENAI_API_KEY=ollama
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_CHAT_MODEL=llava  # Vision model
```
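Running fully on Ollama also means the embedding settings must match an Ollama model; based on the dimension table above, something like:

```
OPENAI_EMBEDDING_MODEL=qwen3-embedding
OPENAI_EMBEDDING_DIMENSION=1024  # verify against the model card
```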
Azure OpenAI:

```
OPENAI_API_KEY=your-azure-key
OPENAI_BASE_URL=https://your-resource.openai.azure.com/v1
OPENAI_CHAT_MODEL=gpt-4o-mini
```

Key dependencies, by component:

Scraper:
- `playwright` - Browser automation
- `beautifulsoup4` - HTML parsing
- `loguru` - Logging
- `pydantic-settings` - Configuration
RAG API:
- `pdf2image` - PDF to image conversion
- `pillow` - Image processing
- `openai` - Vision, embeddings & chat
- `qdrant-client` - Vector database
- `fastapi` - Web framework
- `uvicorn` - ASGI server
To execute the test suite:
```bash
uv run pytest tests/ -v
```

We use ruff to maintain code quality:
```bash
# Check for errors
uv run ruff check .

# Auto-fix simple errors
uv run ruff check . --fix

# Format code
uv run ruff format .
```

The scraper is built to be robust against failures:
- Stop & Resume: You can stop the script (
Ctrl+C) at any time. Run it again, and it will pick up exactly where it left off. - Incremental Downloads: It checks
downloaded_urls(success list) andfailed_urls(error list) on startup to avoid re-downloading existing files or retrying known broken links. - Atomic Saves: Progress is saved to a temporary file first (
.tmp) and then renamed. This prevents JSON corruption if a crash occurs during a write operation. - Batch Processing: Downloads happen in batches. If one file fails (e.g., 404), it's logged, and the batch continues without crashing the entire process.
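The atomic-save pattern is small enough to sketch (illustrative code, not the repo's actual implementation):

```python
# Illustrative atomic-save pattern: write to a .tmp file, then rename.
# os.replace is atomic on POSIX and Windows, so a crash mid-write leaves
# the previous progress file intact instead of a half-written JSON.
import json
import os
from pathlib import Path


def save_progress(path: Path, state: dict) -> None:
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes hit disk before the rename
    os.replace(tmp_path, path)  # atomic swap


save_progress(Path("progress.json"), {"downloaded_urls": [], "failed_urls": []})
```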
Images are automatically built and published to Docker Hub (rodrigobrocchi/epstein_crawler_docs) when a new tag is pushed.
```bash
git tag v1.0.6
git push origin v1.0.6
```

License: MIT