# PMC: A Human + AI Compression Experiment

> What happens when a developer pair-programs with AI for 48 hours on a lossless compressor?

PMC (Predictive Mix Context) is a single-file lossless data compressor written in C. It uses a PAQ-inspired context-mixing architecture with 52 prediction models, adaptive mixing, and rANS entropy coding.

This is not meant to be the best compressor in the world — tools like zpaq have years of research behind them. The point is to show that a human+AI team can get remarkably close to state-of-the-art results in a very short time, and even beat established tools on certain workloads.

## Results

Tested on the Silesia Corpus (a standard compression benchmark) plus specialized files. All figures are compressed size as a percentage of the original (lower is better).

### Silesia Corpus

| File | Original | gzip -9 | zstd -19 | xz -9 | zpaq -5 | PMC |
|---|---|---|---|---|---|---|
| dickens (Text) | 10.2 MB | 37.8% | 28.0% | 27.8% | 20.6% | 22.6% |
| file (ELF) | 6.5 MB | 25.8% | 17.6% | 14.7% | 12.2% | 11.7% |
| mozilla (Tar) | 51.2 MB | 37.1% | 29.5% | 26.1% | 23.5% | 24.3% |
| mr (Medical) | 10.0 MB | 36.8% | 31.2% | 27.6% | 21.9% | 20.4% |
| nci (Chemistry) | 33.6 MB | 8.9% | 5.0% | 5.2% | 3.7% | 4.6% |
| ooffice (Office) | 6.1 MB | 50.2% | 42.2% | 39.4% | 28.7% | 29.5% |
| osdb (Database) | 10.1 MB | 36.8% | 30.7% | 28.3% | 21.9% | 22.7% |
| reymont (Text) | 6.6 MB | 27.5% | 20.3% | 19.9% | 14.4% | 15.6% |
| samba (Source) | 21.6 MB | 25.0% | 18.1% | 17.4% | 14.1% | 15.4% |
| sao (Astronomy) | 7.3 MB | 73.5% | 69.0% | 60.9% | 53.8% | 58.9% |
| webster (Text) | 41.5 MB | 29.1% | 21.0% | 20.2% | 13.7% | 12.6% |
| xml (Markup) | 5.3 MB | 12.4% | 8.5% | 8.5% | 6.1% | 6.6% |
| **TOTAL** | **210 MB** | **30.1%** | **23.3%** | **21.6%** | **17.3%** | **17.8%** |

PMC reaches 17.8% overall — within 0.5 percentage points of zpaq, and ahead of xz by nearly 4 points. It wins outright on ELF binaries, medical data (mr), and dictionary text (webster).

### Specialized Workloads

These are files where PMC's auto-detection and specialized models make the biggest difference.

| File Type | File | Original | xz -9 | zpaq -5 | PMC |
|---|---|---|---|---|---|
| Audio (16-bit PCM) | WAV 48 kHz | 2.8 MB | 53.7% | 42.2% | 27.2% |
| Bitmap (24-bit) | BMP 1280×853 | 3.3 MB | 64.3% | 53.1% | 47.4% |
| Medical imaging | X-Ray (16-bit raw) | 8.4 MB | 53.0% | 43.3% | 44.7% |

On WAV audio, PMC beats zpaq by 15 percentage points. On BMP images, by nearly 6 points. The X-Ray result is close — PMC auto-detects the raw image geometry without headers, a feature most compressors lack, but zpaq's deeper modeling still edges it out.

## Where PMC Falls Short

Honesty matters. Here's where zpaq clearly wins:

- **Pure text** (dickens, reymont): zpaq's language models are deeper. PMC's word predictor is right only ~8–12% of the time; natural language is hard.
- **Astronomical data** (sao): near-random at the byte level; compressing it well requires modeling the floating-point structure, which PMC doesn't attempt.
- **Chemistry** (nci): highly repetitive, but needs long-range context that zpaq handles better.
- **Speed:** PMC processes ~6.5 KB/s (single-threaded, symmetric compress/decompress). zpaq and xz are significantly faster.

## Usage

```sh
# Build (C99, no dependencies beyond libm)
gcc -O2 -o compressor compressor.c -lm

# Compress / decompress / verify round-trip
./compressor c input_file output.pmc
./compressor d output.pmc recovered_file
diff input_file recovered_file
```

## How It Works

PMC uses a 4-stage pipeline:

### 1. Preprocessing & Auto-Detection

Before compression, PMC analyzes each data block and applies reversible transforms:

- **BCJ x86 filter:** converts relative CALL addresses to absolute ones, which makes x86 binaries more compressible.
- **BMP vertical delta:** for 24/32-bit BMP images, subtracts each pixel row from the row above.
- **WAV per-channel delta:** for 16-bit PCM WAV files, subtracts the previous sample of the same channel.
- **Raw image auto-detection:** for files without recognized headers (like X-ray scans), scans candidate row widths and measures vertical byte correlation; if a strong pattern is found, applies a 16-bit vertical delta filter.
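
To make the idea concrete, here is a minimal sketch of the 16-bit per-channel delta described above (illustrative code, not PMC's actual implementation; function names are invented). Replacing each sample with its difference from the previous sample of the same channel concentrates values near zero, which the downstream models compress far better:

```c
#include <stddef.h>
#include <stdint.h>

/* Forward filter: iterate backwards so the transform works in place. */
static void wav_delta_forward(int16_t *s, size_t n, size_t channels) {
    for (size_t i = n; i-- > channels; )
        s[i] = (int16_t)(s[i] - s[i - channels]);
}

/* Inverse filter: iterate forwards to rebuild the original samples exactly. */
static void wav_delta_inverse(int16_t *s, size_t n, size_t channels) {
    for (size_t i = channels; i < n; i++)
        s[i] = (int16_t)(s[i] + s[i - channels]);
}
```

The BMP vertical delta works the same way, with the "previous sample of the same channel" being the byte one row above.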

### 2. Prediction Models (52 Total)

Instead of a single algorithm, PMC uses an ensemble of specialized models:

- **9 order-N context models:** nine models spanning orders 0 through 16, each mapping a hashed byte context to a state tracked by finite-state counters.
- **34 sparse context models:** the key to binary compression. They look at non-adjacent bytes (e.g., byte[i-4] and byte[i-8]) to find table columns and struct fields.
- **Linguistic models:** a word hash, a word trigram, and a shadow dictionary that attempts to predict the entire next word.
- **Geometric models:** auto-detect per-block data periodicity and predict from previous stride positions.
- **Correction models:** a match model (hash-chain match finder), an ICM (indirect context model), and LZP (Lempel-Ziv prediction).
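
The order-N models above share one lookup/update shape. The sketch below is an assumed structure, not PMC's code: a plain 12-bit probability stands in for the finite-state counters, and all names are invented. An order-2 model hashes the last two bytes into a table slot, reads a bit probability, and after coding nudges it toward the observed bit:

```c
#include <stdint.h>
#include <stddef.h>

#define TBL_BITS 16
#define TBL_SIZE (1u << TBL_BITS)

/* Each slot holds p(next bit = 1) scaled to [0, 4096). */
typedef struct { uint16_t p[TBL_SIZE]; } CtxModel;

/* FNV-1a hash over the `order` bytes before position `pos` (pos >= order). */
static uint32_t ctx_hash(const uint8_t *buf, size_t pos, int order) {
    uint32_t h = 2166136261u;
    for (int i = 1; i <= order; i++)
        h = (h ^ buf[pos - i]) * 16777619u;
    return h & (TBL_SIZE - 1);
}

static void ctx_init(CtxModel *m) {
    for (uint32_t i = 0; i < TBL_SIZE; i++) m->p[i] = 2048;  /* p = 0.5 */
}

static int ctx_predict(const CtxModel *m, uint32_t slot) { return m->p[slot]; }

/* Exponential moving average toward the observed bit (rate 1/32). */
static void ctx_update(CtxModel *m, uint32_t slot, int bit) {
    if (bit) m->p[slot] += (4096 - m->p[slot]) >> 5;
    else     m->p[slot] -= m->p[slot] >> 5;
}
```

In a real context mixer, all 52 models run this predict/update loop for every bit, and their probabilities feed the mixer described in the next stage.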

### 3. Mixing & Correction

- **Dual adaptive logistic mixers**, averaged in the logit domain.
- **Text block gating:** text blocks have all 34 sparse models zeroed out to eliminate noise.
- **5-stage APM/ISSE pipeline:** post-processing stages that learn the mixer's systematic errors.
- **Word-aware SSE:** learns per-word correction biases for text.

### 4. Entropy Coding

**rANS (Asymmetric Numeral Systems):** 4-way interleaved binary rANS for high throughput and near-optimal coding density.
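
A single-lane version of a binary rANS coder looks roughly like this (illustrative sketch only: PMC's real coder runs four interleaved lanes, and the fixed probability `p1` here stands in for the model's adaptive per-bit prediction). rANS is LIFO, so bits are encoded in reverse and the byte stream is consumed back-to-front:

```c
#include <stdint.h>
#include <stddef.h>

#define PROB_BITS 12
#define PROB_SCALE (1u << PROB_BITS)
#define RANS_L (1u << 23)  /* lower bound of the normalized state range */

/* Encode n bits with p(bit=1) = p1/4096; returns bytes written to `out`. */
static size_t rans_encode(const uint8_t *bits, size_t n, uint32_t p1, uint8_t *out) {
    uint32_t x = RANS_L;
    size_t pos = 0;
    for (size_t i = n; i-- > 0; ) {                 /* LIFO: encode in reverse */
        uint32_t f     = bits[i] ? p1 : PROB_SCALE - p1;
        uint32_t start = bits[i] ? 0  : p1;
        uint32_t x_max = ((RANS_L >> PROB_BITS) << 8) * f;
        while (x >= x_max) { out[pos++] = x & 0xFF; x >>= 8; }   /* renormalize */
        x = ((x / f) << PROB_BITS) + (x % f) + start;
    }
    for (int k = 0; k < 4; k++) { out[pos++] = x & 0xFF; x >>= 8; }  /* flush */
    return pos;
}

static void rans_decode(const uint8_t *in, size_t len, uint32_t p1,
                        uint8_t *bits, size_t n) {
    size_t pos = len;
    uint32_t x = 0;
    for (int k = 0; k < 4; k++) x = (x << 8) | in[--pos];   /* reload state */
    for (size_t i = 0; i < n; i++) {
        uint32_t slot = x & (PROB_SCALE - 1);
        uint32_t f, start;
        if (slot < p1) { bits[i] = 1; f = p1;              start = 0;  }
        else           { bits[i] = 0; f = PROB_SCALE - p1; start = p1; }
        x = f * (x >> PROB_BITS) + slot - start;
        while (x < RANS_L && pos > 0) x = (x << 8) | in[--pos];
    }
}
```

Interleaving four such states (as PMC does) hides the serial dependency between symbols and is what makes rANS markedly faster than a classic arithmetic coder at the same coding density.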

## Version History

| Version | Key Change | Silesia Total |
|---|---|---|
| v3 | Baseline context mixer | |
| v4 | +34 sparse models | |
| v4.1–4.3 | +Preprocessing filters (BCJ, BMP, WAV) | |
| v4.5 | +Word-aware SSE | |
| v4.6 | +Raw image auto-detection | |
| v4.7 | Mixer LR tuning (10/12 → 24/24) | ~18.0% |
| v4.8 | Mixer + ISSE fine-tuning (28/28) | ~17.9% |
| v4.8.2 | wpred + match confidence calibration | 17.8% |

## Limitations

- Input size limited to 4 GB.
- Memory usage is ~600 MB (dominated by the 34 sparse model tables at ~8 MB each).
- Single-threaded; the symmetric architecture means decompression is as slow as compression.
- Research prototype: prioritizes compression ratio over speed.

## License

MIT License. See LICENSE for details.


Created by André Zaiats, with Gemini (Google DeepMind) & Claude (Anthropic) — 2026
