# PMC: A Human + AI Compression Experiment

> What happens when a developer pair-programs with AI for 48 hours on a lossless compressor?

PMC (Predictive Mix Context) is a single-file lossless data compressor written in C. It uses a PAQ-inspired context-mixing architecture with 52 prediction models, adaptive mixing, and rANS entropy coding.

This is not meant to be the best compressor in the world — tools like zpaq have years of research behind them. The point is to show that a human+AI team can get remarkably close to state-of-the-art results in a very short time, and even beat established tools on certain workloads.

## Results

Tested on the Silesia Corpus (a standard compression benchmark) plus specialized files. All figures are compressed size as a percentage of the original (lower is better).

### Silesia Corpus

| File | Original | gzip -9 | zstd -19 | xz -9 | zpaq -5 | PMC |
|---|---|---|---|---|---|---|
| dickens (Text) | 10.2 MB | 37.8% | 28.0% | 27.8% | 20.6% | 22.6% |
| file (ELF) | 6.5 MB | 25.8% | 17.6% | 14.7% | 12.2% | 11.7% |
| mozilla (Tar) | 51.2 MB | 37.1% | 29.5% | 26.1% | 23.5% | 24.3% |
| mr (Medical) | 10.0 MB | 36.8% | 31.2% | 27.6% | 21.9% | 20.4% |
| nci (Chemistry) | 33.6 MB | 8.9% | 5.0% | 5.2% | 3.7% | 4.6% |
| ooffice (Office) | 6.1 MB | 50.2% | 42.2% | 39.4% | 28.7% | 29.5% |
| osdb (Database) | 10.1 MB | 36.8% | 30.7% | 28.3% | 21.9% | 22.7% |
| reymont (Text) | 6.6 MB | 27.5% | 20.3% | 19.9% | 14.4% | 15.6% |
| samba (Source) | 21.6 MB | 25.0% | 18.1% | 17.4% | 14.1% | 15.4% |
| sao (Astronomy) | 7.3 MB | 73.5% | 69.0% | 60.9% | 53.8% | 58.9% |
| webster (Text) | 41.5 MB | 29.1% | 21.0% | 20.2% | 13.7% | 12.6% |
| xml (Markup) | 5.3 MB | 12.4% | 8.5% | 8.5% | 6.1% | 6.6% |
| **TOTAL** | **210 MB** | **30.1%** | **23.3%** | **21.6%** | **17.3%** | **17.8%** |

PMC reaches 17.8% overall — within 0.5 percentage points of zpaq, and ahead of xz by nearly 4 points. It wins outright on ELF binaries, medical data (mr), and dictionary text (webster).

### Specialized Workloads

These are files where PMC's auto-detection and specialized models make the biggest difference.

| File Type | File | Original | xz -9 | zpaq -5 | PMC |
|---|---|---|---|---|---|
| Audio (16-bit PCM) | WAV 48 kHz | 2.8 MB | 53.7% | 42.2% | 27.2% |
| Bitmap (24-bit) | BMP 1280×853 | 3.3 MB | 64.3% | 53.1% | 47.4% |
| Medical imaging | X-Ray (16-bit raw) | 8.4 MB | 53.0% | 43.3% | 44.7% |

On WAV audio, PMC beats zpaq by 15 percentage points. On BMP images, by nearly 6 points. The X-Ray result is close — PMC auto-detects the raw image geometry without headers, a feature most compressors lack, but zpaq's deeper modeling still edges it out.

## Where PMC Falls Short

Honesty matters. Here's where zpaq clearly wins:

- **Pure text** (dickens, reymont): zpaq's language models are deeper. PMC's word predictor is right only ~8–12% of the time; natural language is hard.
- **Astronomical data** (sao): near-random at the byte level; compressing it well requires modeling the floating-point structure, which PMC doesn't attempt.
- **Chemistry** (nci): highly repetitive, but needs long-range context that zpaq handles better.
- **Speed:** PMC processes ~6.5 KB/s (single-threaded, symmetric compress/decompress). zpaq and xz are significantly faster.

## Usage

```sh
# Build (C99, no dependencies beyond libm)
gcc -O2 -o compressor compressor.c -lm

# Compress / decompress / verify round-trip
./compressor c input_file output.pmc
./compressor d output.pmc recovered_file
diff input_file recovered_file
```

## How It Works

PMC uses a 4-stage pipeline:

### 1. Preprocessing & Auto-Detection

Before compression, PMC analyzes each data block and applies reversible transforms:

- **BCJ x86 filter:** converts relative CALL addresses to absolute ones, which makes x86 binaries more compressible.
- **BMP vertical delta:** for 24/32-bit BMP images, subtracts each pixel row from the row above.
- **WAV per-channel delta:** for 16-bit PCM WAV files, subtracts the previous sample of the same channel.
- **Raw image auto-detection:** for files without recognized headers (like X-ray scans), scans candidate row widths and measures vertical byte correlation; if a strong pattern is found, applies a 16-bit vertical delta filter.
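
To make the idea concrete, here is a minimal sketch of the 16-bit per-channel delta described above (illustrative code, not PMC's actual implementation; function names are invented). Replacing each sample with its difference from the previous sample of the same channel concentrates values near zero, which the downstream models compress far better:

```c
#include <stddef.h>
#include <stdint.h>

/* Forward filter: iterate backwards so the transform works in place. */
static void wav_delta_forward(int16_t *s, size_t n, size_t channels) {
    for (size_t i = n; i-- > channels; )
        s[i] = (int16_t)(s[i] - s[i - channels]);
}

/* Inverse filter: iterate forwards to rebuild the original samples exactly. */
static void wav_delta_inverse(int16_t *s, size_t n, size_t channels) {
    for (size_t i = channels; i < n; i++)
        s[i] = (int16_t)(s[i] + s[i - channels]);
}
```

The BMP vertical delta works the same way, with the "previous sample of the same channel" being the byte one row above.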

### 2. Prediction Models (52 Total)

Instead of a single algorithm, PMC uses an ensemble of specialized models:

- **9 order-N context models:** nine models spanning orders 0 through 16, each mapping a hashed byte context to a state tracked by finite-state counters.
- **34 sparse context models:** the key to binary compression. They look at non-adjacent bytes (e.g., byte[i-4] and byte[i-8]) to find table columns and struct fields.
- **Linguistic models:** a word hash, a word trigram, and a shadow dictionary that attempts to predict the entire next word.
- **Geometric models:** auto-detect per-block data periodicity and predict from previous stride positions.
- **Correction models:** a match model (hash-chain match finder), an ICM (indirect context model), and LZP (Lempel-Ziv prediction).
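
The order-N models above share one lookup/update shape. The sketch below is an assumed structure, not PMC's code: a plain 12-bit probability stands in for the finite-state counters, and all names are invented. An order-2 model hashes the last two bytes into a table slot, reads a bit probability, and after coding nudges it toward the observed bit:

```c
#include <stdint.h>
#include <stddef.h>

#define TBL_BITS 16
#define TBL_SIZE (1u << TBL_BITS)

/* Each slot holds p(next bit = 1) scaled to [0, 4096). */
typedef struct { uint16_t p[TBL_SIZE]; } CtxModel;

/* FNV-1a hash over the `order` bytes before position `pos` (pos >= order). */
static uint32_t ctx_hash(const uint8_t *buf, size_t pos, int order) {
    uint32_t h = 2166136261u;
    for (int i = 1; i <= order; i++)
        h = (h ^ buf[pos - i]) * 16777619u;
    return h & (TBL_SIZE - 1);
}

static void ctx_init(CtxModel *m) {
    for (uint32_t i = 0; i < TBL_SIZE; i++) m->p[i] = 2048;  /* p = 0.5 */
}

static int ctx_predict(const CtxModel *m, uint32_t slot) { return m->p[slot]; }

/* Exponential moving average toward the observed bit (rate 1/32). */
static void ctx_update(CtxModel *m, uint32_t slot, int bit) {
    if (bit) m->p[slot] += (4096 - m->p[slot]) >> 5;
    else     m->p[slot] -= m->p[slot] >> 5;
}
```

In a real context mixer, all 52 models run this predict/update loop for every bit, and their probabilities feed the mixer described in the next stage.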

### 3. Mixing & Correction

- **Dual adaptive logistic mixers**, averaged in the logit domain.
- **Text block gating:** text blocks have all 34 sparse models zeroed out to eliminate noise.
- **5-stage APM/ISSE pipeline:** post-processing stages that learn the mixer's systematic errors.
- **Word-aware SSE:** learns per-word correction biases for text.

### 4. Entropy Coding

**rANS (Asymmetric Numeral Systems):** 4-way interleaved binary rANS for high throughput and near-optimal coding density.
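
A single-lane version of a binary rANS coder looks roughly like this (illustrative sketch only: PMC's real coder runs four interleaved lanes, and the fixed probability `p1` here stands in for the model's adaptive per-bit prediction). rANS is LIFO, so bits are encoded in reverse and the byte stream is consumed back-to-front:

```c
#include <stdint.h>
#include <stddef.h>

#define PROB_BITS 12
#define PROB_SCALE (1u << PROB_BITS)
#define RANS_L (1u << 23)  /* lower bound of the normalized state range */

/* Encode n bits with p(bit=1) = p1/4096; returns bytes written to `out`. */
static size_t rans_encode(const uint8_t *bits, size_t n, uint32_t p1, uint8_t *out) {
    uint32_t x = RANS_L;
    size_t pos = 0;
    for (size_t i = n; i-- > 0; ) {                 /* LIFO: encode in reverse */
        uint32_t f     = bits[i] ? p1 : PROB_SCALE - p1;
        uint32_t start = bits[i] ? 0  : p1;
        uint32_t x_max = ((RANS_L >> PROB_BITS) << 8) * f;
        while (x >= x_max) { out[pos++] = x & 0xFF; x >>= 8; }   /* renormalize */
        x = ((x / f) << PROB_BITS) + (x % f) + start;
    }
    for (int k = 0; k < 4; k++) { out[pos++] = x & 0xFF; x >>= 8; }  /* flush */
    return pos;
}

static void rans_decode(const uint8_t *in, size_t len, uint32_t p1,
                        uint8_t *bits, size_t n) {
    size_t pos = len;
    uint32_t x = 0;
    for (int k = 0; k < 4; k++) x = (x << 8) | in[--pos];   /* reload state */
    for (size_t i = 0; i < n; i++) {
        uint32_t slot = x & (PROB_SCALE - 1);
        uint32_t f, start;
        if (slot < p1) { bits[i] = 1; f = p1;              start = 0;  }
        else           { bits[i] = 0; f = PROB_SCALE - p1; start = p1; }
        x = f * (x >> PROB_BITS) + slot - start;
        while (x < RANS_L && pos > 0) x = (x << 8) | in[--pos];
    }
}
```

Interleaving four such states (as PMC does) hides the serial dependency between symbols and is what makes rANS markedly faster than a classic arithmetic coder at the same coding density.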

## Version History

| Version | Key Change | Silesia Total |
|---|---|---|
| v3 | Baseline context mixer | |
| v4 | +34 sparse models | |
| v4.1–4.3 | +Preprocessing filters (BCJ, BMP, WAV) | |
| v4.5 | +Word-aware SSE | |
| v4.6 | +Raw image auto-detection | |
| v4.7 | Mixer LR tuning (10/12 → 24/24) | ~18.0% |
| v4.8 | Mixer + ISSE fine-tuning (28/28) | ~17.9% |
| v4.8.2 | wpred + match confidence calibration | 17.8% |

## Limitations

- Input size limited to 4 GB.
- Memory usage is ~600 MB (dominated by the 34 sparse model tables at ~8 MB each).
- Single-threaded; the symmetric architecture means decompression is as slow as compression.
- Research prototype: prioritizes compression ratio over speed.

## License

MIT License. See LICENSE for details.


Created by André Zaiats, with Gemini (Google DeepMind) & Claude (Anthropic) — 2026
