What happens when a developer pair-programs with AI for 48 hours on a lossless compressor?
PMC (Predictive Mix Context) is a single-file lossless data compressor written in C. It uses a PAQ-inspired context-mixing architecture with 52 prediction models, adaptive mixing, and rANS entropy coding.
This is not meant to be the best compressor in the world — tools like zpaq have years of research behind them. The point is to show that a human+AI team can get remarkably close to state-of-the-art results in a very short time, and even beat established tools on certain workloads.
Tested on the Silesia Corpus (a standard compression benchmark) plus specialized files.
| File | Original | gzip -9 | zstd -19 | xz -9 | zpaq -5 | PMC |
|---|---|---|---|---|---|---|
| dickens (Text) | 10.2 MB | 37.8% | 28.0% | 27.8% | 20.6% | 22.6% |
| file (ELF) | 6.5 MB | 25.8% | 17.6% | 14.7% | 12.2% | 11.7% |
| mozilla (Tar) | 51.2 MB | 37.1% | 29.5% | 26.1% | 23.5% | 24.3% |
| mr (Medical) | 10.0 MB | 36.8% | 31.2% | 27.6% | 21.9% | 20.4% |
| nci (Chemistry) | 33.6 MB | 8.9% | 5.0% | 5.2% | 3.7% | 4.6% |
| ooffice (Office) | 6.1 MB | 50.2% | 42.2% | 39.4% | 28.7% | 29.5% |
| osdb (Database) | 10.1 MB | 36.8% | 30.7% | 28.3% | 21.9% | 22.7% |
| reymont (Text) | 6.6 MB | 27.5% | 20.3% | 19.9% | 14.4% | 15.6% |
| samba (Source) | 21.6 MB | 25.0% | 18.1% | 17.4% | 14.1% | 15.4% |
| sao (Astronomy) | 7.3 MB | 73.5% | 69.0% | 60.9% | 53.8% | 58.9% |
| webster (Text) | 41.5 MB | 29.1% | 21.0% | 20.2% | 13.7% | 12.6% |
| xml (Markup) | 5.3 MB | 12.4% | 8.5% | 8.5% | 6.1% | 6.6% |
| TOTAL | 210 MB | 30.1% | 23.3% | 21.6% | 17.3% | 17.8% |
PMC reaches 17.8% overall — within 0.5 percentage points of zpaq, and ahead of xz by nearly 4 points. It wins outright on ELF binaries, medical data (mr), and dictionary text (webster).
These are files where PMC's auto-detection and specialized models make the biggest difference.
| File Type | File | Original | xz -9 | zpaq -5 | PMC |
|---|---|---|---|---|---|
| Audio (16-bit PCM) | WAV 48kHz | 2.8 MB | 53.7% | 42.2% | 27.2% |
| Bitmap (24-bit) | BMP 1280×853 | 3.3 MB | 64.3% | 53.1% | 47.4% |
| Medical Imaging | X-Ray (16-bit raw) | 8.4 MB | 53.0% | 43.3% | 44.7% |
On WAV audio, PMC beats zpaq by 15 percentage points. On BMP images, by nearly 6 points. The X-Ray result is close — PMC auto-detects the raw image geometry without headers, a feature most compressors lack, but zpaq's deeper modeling still edges it out.
Honesty matters. Here's where zpaq clearly wins:
- Pure text (dickens, reymont): zpaq's language models are deeper. PMC's word predictor is right only ~8-12% of the time — natural language is hard.
- Astronomical data (sao): near-random at the byte level, requires modeling float structure that PMC doesn't attempt.
- Chemistry (nci): highly repetitive but needs long-range context that zpaq handles better.
- Speed: PMC processes ~6.5 KB/s (single-threaded, symmetric compress/decompress). zpaq and xz are significantly faster.
```sh
# Build (C99, no dependencies)
gcc -O2 -o compressor compressor.c -lm

# Compress / Decompress / Verify
./compressor c input_file output.pmc
./compressor d output.pmc recovered_file
diff input_file recovered_file
```
PMC uses a 4-stage pipeline:
Before compression, PMC analyzes each data block and applies reversible transforms:
- BCJ x86 Filter: Converts relative CALL addresses to absolute for better compression of x86 binaries.
- BMP Vertical Delta: For 24/32-bit BMP images, subtracts each pixel row from the row above.
- WAV Per-Channel Delta: For 16-bit PCM WAV files, subtracts the previous sample of the same channel.
- Raw Image Auto-Detection: For files without recognized headers (like X-Ray scans), scans candidate row widths and measures vertical byte correlation. If a strong pattern is found, applies a 16-bit vertical delta filter.
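The BMP and WAV filters share one idea: subtract the neighboring sample so that smooth signals collapse to small values near zero, which the context models then compress well. A minimal sketch of such a reversible delta transform (simplified; function names and the plain-stride interface are illustrative, not PMC's actual code, which also parses headers):

```c
#include <stdint.h>
#include <stddef.h>

/* Forward: replace each 16-bit sample with its difference from the previous
   sample at the same stride position (stride = channel count for WAV,
   samples per row for a vertical image delta). Walking backwards keeps each
   reference sample intact until it has been used. */
static void delta_forward(int16_t *s, size_t n, size_t stride) {
    for (size_t i = n; i-- > stride; )
        s[i] = (int16_t)(s[i] - s[i - stride]);
}

/* Inverse: a running sum restores the original samples bit-exactly. */
static void delta_inverse(int16_t *s, size_t n, size_t stride) {
    for (size_t i = stride; i < n; i++)
        s[i] = (int16_t)(s[i] + s[i - stride]);
}
```

The raw-image auto-detector reuses the same inverse-transform machinery: it only has to pick the stride (candidate row width) that minimizes the magnitude of these deltas.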
Instead of a single algorithm, PMC uses an ensemble of specialized models:
- 9 Order-N Context Models: Orders 0 through 16. Each maps a hashed byte context to a state via finite-state counters.
- 34 Sparse Context Models: The key to binary compression. They look at non-adjacent bytes (e.g., byte[i-4] and byte[i-8]) to find table columns and struct fields.
- Linguistic Models: Word hash, word trigram, and a shadow dictionary that attempts to predict the next word entirely.
- Geometric Models: Auto-detects data periodicity per block and predicts from previous stride positions.
- Correction Models: Match model (hash-chain match finder), ICM (indirect context model), and LZP (Lempel-Ziv prediction).
- Dual Adaptive Logistic Mixers: two mixers whose outputs are averaged in the logit domain.
- Text Block Gating: Text blocks have their 34 sparse models zeroed to eliminate noise.
- 5-Stage APM/ISSE Pipeline: Post-processing that learns the mixer's systematic errors.
- Word-Aware SSE: Learns per-word correction biases for text.
rANS (range variant of Asymmetric Numeral Systems): 4-way interleaved binary rANS for high throughput and near-optimal coding density.
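A self-contained binary rANS coder in the ryg-rans style shows the mechanics. This sketch is a single stream rather than PMC's 4-way interleaving, and the 12-bit probability scale and fixed per-stream probability are simplifying assumptions:

```c
#include <stdint.h>

#define PROB_BITS 12
#define PROB_SCALE (1u << PROB_BITS)
#define RANS_L (1u << 23)          /* lower bound of the normalized state */

/* Encode bits[0..n-1] with fixed probability p1 of a 1-bit. rANS encodes
   in reverse and writes output bytes backwards, starting at *pptr. */
static void rans_encode_bits(const uint8_t *bits, int n, uint32_t p1,
                             uint8_t **pptr)
{
    uint32_t x = RANS_L;
    uint8_t *ptr = *pptr;
    for (int i = n - 1; i >= 0; i--) {
        uint32_t p0    = PROB_SCALE - p1;
        uint32_t freq  = bits[i] ? p1 : p0;  /* bit 0 owns [0,p0), bit 1 [p0,SCALE) */
        uint32_t start = bits[i] ? p0 : 0;
        uint32_t x_max = ((RANS_L >> PROB_BITS) << 8) * freq;
        while (x >= x_max) { *--ptr = (uint8_t)x; x >>= 8; }  /* renormalize */
        x = ((x / freq) << PROB_BITS) + (x % freq) + start;
    }
    for (int k = 0; k < 4; k++) { *--ptr = (uint8_t)x; x >>= 8; }  /* flush */
    *pptr = ptr;
}

/* Decode runs forward, consuming the byte stream the encoder produced. */
static void rans_decode_bits(uint8_t *bits, int n, uint32_t p1,
                             const uint8_t *ptr)
{
    uint32_t x = 0;
    for (int k = 0; k < 4; k++) x = (x << 8) | *ptr++;  /* reload state */
    for (int i = 0; i < n; i++) {
        uint32_t p0   = PROB_SCALE - p1;
        uint32_t slot = x & (PROB_SCALE - 1);
        uint8_t  b    = (uint8_t)(slot >= p0);
        uint32_t freq  = b ? p1 : p0;
        uint32_t start = b ? p0 : 0;
        x = freq * (x >> PROB_BITS) + slot - start;
        while (x < RANS_L) x = (x << 8) | *ptr++;       /* renormalize */
        bits[i] = b;
    }
}
```

Interleaving four such states lets independent renormalizations overlap in the pipeline, which is where the throughput claim comes from.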
| Version | Key Change | Silesia Total |
|---|---|---|
| v3 | Baseline context mixer | — |
| v4 | +34 sparse models | — |
| v4.1–4.3 | +Preprocessing filters (BCJ, BMP, WAV) | — |
| v4.5 | +Word-aware SSE | — |
| v4.6 | +Raw image auto-detection | — |
| v4.7 | Mixer LR tuning (10/12 → 24/24) | ~18.0% |
| v4.8 | Mixer + ISSE fine-tuning (28/28) | ~17.9% |
| v4.8.2 | wpred + match confidence calibration | 17.8% |
- Input size limited to 4 GB.
- Memory usage is ~600 MB (dominated by 34 sparse model tables at ~8 MB each).
- Single-threaded. Symmetric architecture means decompression is as slow as compression.
- Research prototype — prioritizes compression ratio over speed.
MIT License. See LICENSE for details.
Created by André Zaiats, with Gemini (Google DeepMind) & Claude (Anthropic) — 2026