
EPSTEIN-RIPPER

Tools for automated indexing and download of the DOJ Epstein files datasets.

Note: although db_scanner detects and can work on any .json or .sqlite file in a data# dir, EpRip expects the naming format index_data#.sqlite (or index_data#.json) inside the corresponding data# directory.

NEW-UPGRADE :: 4/24/2026

  • faster

    • upgraded to SQLite index use
    • massive speed increase over JSON when downloading / working with large indexes
  • stronger

    • error handling is pretty bulletproof now, imo. Things are working very well, with long hands-free download sessions.
  • smarter

    • upgraded the 'scan' mode selection for the main (EpRip) program.
    • combined the functionality of the previous index_tools/ into one tool, db_scanner. It is a standalone scanner/indexer, separate from EpRip, that performs the same battle-tested scanning. It can also be used to get stats from your index files, repair them (matching on-disk files against the downloaded= value in the index), and perform other index management tasks.
  • prettier

    • color-coded EpRip UX. Over long downloads/scans with lines streaming, everything blends together after a while, so I color-coded features of the user interface to make it faster to visually acquire relevant information from the readout.

IMPORTANT

  • use index_convertor.py to quickly convert your current JSON indexes over to SQLite.
  • db_scanner can also do this with its import-JSON mode, but it's not as straightforward as index_convertor.
  • using SQLite over JSON will really open up the speed advantage of this new upgrade.

Overview

The DOJ dataset pages are not especially crawler-friendly. In practice, the project has had to deal with:

  • pagination that repeats or remixes results
  • no trustworthy last-page indicator
  • short-lived auth/session state
  • abuse-deterrent gates
  • occasional non-PDF payloads served from PDF-looking endpoints

The goal of this repo is straightforward:

  • build reliable indexes of the DOJ file listings
  • download PDFs safely and resumably
  • repair index/download-state drift when disk and index no longer agree
  • provide utilities for validation, conversion, and post-processing

Recommended Workflow

For most users:

  1. use NEW_EpRip.py for normal scanning/downloading work
  2. use db_scanner_NEW.py when you want deeper control over index building, resume/rewalk behavior, repair, or SQLite maintenance
  3. keep your working index in SQLite
  4. treat older JSON indexes as import sources, not as the preferred working format

If you have an older JSON index, there are two ways to convert it to SQLite (method 2 is more straightforward, imo).

METHOD 1 - db_scanner_NEW.py

  1. start db_scanner_NEW.py and pick your dataset
  2. it will detect .sqlite and .json files.
  3. choose Import JSON into SQLite
  4. it will prompt for the .json to use
  5. point it at a new or existing SQLite target
  6. continue all future work against the SQLite index

METHOD 2 - index_convertor.py

  1. start 'index_convertor.py'
  2. pick dataset #
  3. pick mode JSON -> SQLite (it also works the other way, SQLite -> JSON)
  4. it will show the existing JSON files in that dataset's dir; pick the one you want to use
  5. enter the output path/filename for the new .sqlite file that will be created
  6. that's it. I left this util in, even though db_scanner can do the same, because it's a more straightforward process. db_scanner contains nearly every previous index_files util from the previous release, but this main conversion is paramount to experiencing the speed of the updated version of EpRip. (A rough sketch of what the conversion does is below.)
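
For intuition, the JSON -> SQLite conversion boils down to something like the following minimal sketch. The table and column names here are illustrative assumptions, not EpRip's actual schema:

import json
import sqlite3

def convert(json_path: str, sqlite_path: str) -> None:
    # Load the legacy JSON index (assumed here: a list of entry dicts).
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)

    con = sqlite3.connect(sqlite_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "filename TEXT PRIMARY KEY, url TEXT, downloaded INTEGER)"
    )
    for e in entries:
        # INSERT OR IGNORE keeps any rows already in the SQLite target.
        con.execute(
            "INSERT OR IGNORE INTO files (filename, url, downloaded) VALUES (?, ?, ?)",
            (e["filename"], e.get("url"), int(e.get("downloaded", 0))),
        )
    con.commit()
    con.close()

convert("data9/index_data9.json", "data9/index_data9.sqlite")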

Main Programs

NEW_EpRip.py

Primary ripper/downloader.

Use this when you want the main integrated workflow:

  • select one or more datasets
  • scan DOJ pages
  • download missing PDFs
  • resume interrupted work
  • keep index and download progress together in the main ripper flow

Core operating modes:

  • sync = scan + download
  • scan = update index only
  • download = download missing files from an existing index

NEW_EpRip.py is the best default choice if your goal is to archive datasets end-to-end.

db_scanner_NEW.py

Standalone index scanner and index maintenance utility.

Use this when you want tighter control over index operations than the main ripper exposes.

It can:

  • scan DOJ dataset pages into an EpRip-compatible SQLite index
  • resume discovery scans
  • rewalk existing indexed ranges
  • repair suspect/error pages
  • import legacy JSON indexes into SQLite
  • sync on-disk PDFs into the database
  • repair download-state against actual disk state
  • audit duplicate-like conditions
  • produce disk/DB consistency reports
  • duplicate an active DB into a clean shareable copy with download-state reset

This tool now contains the practical functionality that used to live in older standalone index repair/maintenance utilities.

If you prefer to separate indexing from downloading, or just prefer the db_scanner tool's layout, db_scanner_NEW.py can be used as the standalone scanner/index builder instead of relying on the main ripper for dataset scanning. That said, EpRip does do everything self-contained (scan+download, scan, download). A sketch of the disk-vs-DB reconciliation idea follows.
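
As a mental model, reconciling download-state against the actual disk amounts to something like this minimal sketch (illustrative table/column names, not the tool's actual code):

import sqlite3
from pathlib import Path

def reconcile(db_path: str, dataset_dir: str) -> None:
    # Mark a row downloaded only if the PDF actually exists on disk.
    on_disk = {p.name for p in Path(dataset_dir).glob("*.pdf")}
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT filename FROM files").fetchall()  # hypothetical schema
    for (name,) in rows:
        con.execute(
            "UPDATE files SET downloaded = ? WHERE filename = ?",
            (1 if name in on_disk else 0, name),
        )
    con.commit()
    con.close()

reconcile("data9/index_data9.sqlite", "data9")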

Index Format

Recommended format:

  • SQLite

Why:

  • faster for large indexes
  • safer for ongoing maintenance
  • supports richer scan state and repair workflows
  • matches the current EpRip-compatible utility path

Legacy format still supported:

  • JSON import into SQLite
  • JSON is still supported, but conversion to SQLite is HIGHLY recommended
  • the gains in download speed with SQLite are MASSIVE in comparison!

Current JSON import policy is preservation-first:

  • existing SQLite rows are preserved
  • JSON entries are normalized into the current schema
  • only missing filenames are inserted
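
In SQLite terms, a preservation-first import is essentially an additive insert keyed on filename. A minimal illustration, using the same hypothetical schema as the conversion sketch above:

import sqlite3

def import_entry(con: sqlite3.Connection, entry: dict) -> None:
    # Insert only if the filename is not already present;
    # existing SQLite rows are never touched.
    con.execute(
        "INSERT OR IGNORE INTO files (filename, url, downloaded) VALUES (?, ?, 0)",
        (entry["filename"], entry.get("url")),
    )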

Repo Layout

Intended top-level structure:

index_files/
corruption_scan.py
db_scan_readme.md
db_scanner_NEW.py
filter_ripped_images.py
image_ripper.py
index_convertor.py
LEGACY_auto_ep_rip.py
NEW_EpRip.py
README.md
requirements.txt

index_files/

index_files/ contains some premade index files for datasets 9-12. The index files are zipped since they're over 100 MB each, and they're in JSON format, so they will need to be converted over to SQLite.

To use one, place the index file into the corresponding data#/ directory that lives in the epstein-ripper root dir.

This is the dir structure EpRip expects. However, this new version lets you choose from discovered index files, as long as they're contained in the corresponding data# dir, where # == 1-12.

epstein-ripper/
  data#/
    index_data#.sqlite

index_convertor.py

Format conversion utility for index files.

Useful when someone specifically needs to flip between the supported EpRip index formats:

JSON -> SQLite
SQLite -> JSON

db_scanner can also move JSON to SQLite, but this standalone util is a solid, straightforward way to convert EpRip-compatible index files.

The recommended long-term format for EpRip is now SQLite.

LEGACY_auto_ep_rip.py

Older ripper version kept for reference/compatibility.

I will leave this in the repo for a period of time before removing it. Not sure how long to wait; maybe a month or so from now I'll consider removing it.

The current preferred ripper is NEW_EpRip.py.

corruption_scan.py

LEGACY UTIL: not really needed anymore; I kept it in for just-in-case/nostalgia purposes.

One-time corruption sweep utility.

Use it to:

  • scan a dataset directory
  • validate PDF signatures (the check is sketched below)
  • quarantine obvious bad/corrupt files

After removing bad files from disk, use db_scanner_NEW.py utilities to reconcile download-state in the index.
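
The signature check itself is simple. A minimal sketch of the idea, with hypothetical paths and quarantine layout rather than the util's actual code:

from pathlib import Path
import shutil

def looks_like_pdf(path: Path) -> bool:
    # A real PDF starts with the "%PDF-" magic bytes.
    with path.open("rb") as f:
        return f.read(5) == b"%PDF-"

def sweep(dataset_dir: str) -> None:
    quarantine = Path(dataset_dir) / "quarantine"
    quarantine.mkdir(exist_ok=True)
    for pdf in Path(dataset_dir).glob("*.pdf"):
        if not looks_like_pdf(pdf):
            # Move obvious bad/corrupt files aside for review.
            shutil.move(str(pdf), str(quarantine / pdf.name))

sweep("data9")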

image_ripper.py

Embedded image extraction utility for PDF collections.

Bulk embedded image extractor (GUI).

Features:

  • Recursive folder scanning
  • Incremental re-run support
  • Process tracking via processed_pdfs.txt
  • Image mapping log (image_map.txt)

Requirements:

pip install pymupdf pillow

Designed for:

  • Large disclosure datasets
  • Forensic review
  • Visual content isolation

filter_ripped_images.py

Sorts a dir containing images (recursive) and moves:

  • all-black (redacted) images

  • images that appear to be all text/documents

  • other images with traits that don't seem like an actual image/picture

It puts these in different categories/buckets, moving them to new categorized folders for review.

Using copy mode can balloon hard-disk usage and cause problems when used on a large amount of data!

    • recommended to use Move mode.

Example: I ran image_ripper on the datasets I have (far from complete, but several hundred thousand PDFs) and ended up with over 400k image files ripped. After running this filter program on my ripped_images, I reviewed what it pulled out and it was very accurate; there weren't many images I needed to save out of the ones it pulled. I haven't gone through all of what was left behind, but I did go through the pulls, and out of the almost half million images, over 300k were moved from the scanned dir to the sorted dirs.

That makes this, imo, a very valuable tool in the pre-cleaning stage of going through the images extracted with image_ripper.py.

Because of the way the PDFs were generated, a lot of them register as 'image' themselves and get saved as image files along with the extracted content. This makes the sorting tool a necessary post-processing step for extracted images. (A toy sketch of the all-black check follows.)
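
For intuition, flagging an all-black redacted image can be as simple as a mean-brightness check with Pillow. A toy sketch with an assumed threshold, not the tool's actual heuristic:

from PIL import Image, ImageStat

def is_mostly_black(path: str, threshold: float = 10.0) -> bool:
    # Grayscale mean near zero means the image is essentially all black.
    img = Image.open(path).convert("L")
    return ImageStat.Stat(img).mean[0] < threshold

print(is_mostly_black("ripped_images/img_12.png"))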

Data Safety / Integrity

This project is built around preserving operator confidence:

  • downloads use validation-aware handling
  • index work is resumable
  • SQLite is the preferred working format
  • JSON import is additive and non-destructive to existing SQLite rows
  • disk repair utilities reconcile stale download-state
  • Ctrl+C shutdown in the scanner reports saved progress and resume position

The duplicate-DB utility in db_scanner_NEW.py is specifically intended for:

  • creating a clean shareable index
  • preserving a working DB while producing a reset copy
  • letting another operator re-repair download-state against their own local disk later
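
Conceptually, producing that reset copy amounts to duplicating the DB file and clearing the download flags in the copy only (again with illustrative table/column names):

import shutil
import sqlite3

# Duplicate the working DB, then reset download-state in the copy.
shutil.copy("data9/index_data9.sqlite", "data9/index_data9_share.sqlite")
con = sqlite3.connect("data9/index_data9_share.sqlite")
con.execute("UPDATE files SET downloaded = 0")  # hypothetical schema
con.commit()
con.close()

The receiving operator can then run the disk repair utilities to re-mark whatever they already have locally.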

Choosing Between The Two Main Tools

Use NEW_EpRip.py when:

  • you want the main archival workflow
  • you want one tool to handle scanning and downloading together (or separately)
  • you want the default operator experience
  • the workflow is straightforward and does everything needed

db_scanner has specialized functionality for indexes and scanning. Use db_scanner_NEW.py when:

  • you want standalone index scanning
  • you want to migrate legacy JSON indexes into SQLite
  • you want index repair or disk reconciliation
  • you want a clean copy of an index
  • you want more direct control over index behavior and maintenance

Requirements

Baseline requirements:

  • Python 3.9+
  • Playwright
  • Chromium via Playwright

Quick Start

git clone https://github.com/prizmatik666/epstein-ripper
cd epstein-ripper
pip install -r requirements.txt
playwright install chromium
python NEW_EpRip.py

Some optional utilities may require additional packages depending on the tool being used.

Practical Notes

  • Do not rename or delete files while a scan/download job is actively running.
  • Prefer SQLite for active work.
  • Prefer importing old JSON indexes into SQLite rather than continuing to maintain JSON as the primary index format.
  • Use the DB scanner utility section when disk contents and index state need to be reconciled.

Support

If you are auditing, preserving, or extending the archive, the most useful thing you can do is use the newer SQLite-based workflow and keep the index state clean and reproducible.

The pursuit of truth, justice, and .pdf punishment is imperative. We're all a tool for change. - Prizm

I had info for $upporting this project but someone tried to hack my paypal so I've removed the support info for that. Thanks =\
