quant.cpp Roadmap

Vision

quant.cpp is the single-header C reference implementation of TurboQuant and related KV cache quantization research.

Not competing with Google. Not competing with llama.cpp. Filling the gap nobody else fills: TurboQuant-class compression anywhere a C compiler runs.

See docs/positioning.md for the full strategy.

Positioning

Data-center TurboQuant?           → Google reference (arXiv:2504.19874)
Workstation speed?                → llama.cpp
Batch serving?                    → vLLM
TurboQuant on iPhone?             → quant.cpp
TurboQuant in a browser?          → quant.cpp
TurboQuant in a game engine?      → quant.cpp
TurboQuant on a microcontroller?  → quant.cpp

Direction 1: Embedding Engine ("the SQLite of LLMs")

The world's simplest way to add an LLM to a C/C++ project.

Done

quant.h single header (15K LOC, 628KB)
6-function API (load, new, generate, ask, free_ctx, free_model)
WASM build (192KB binary)
MSVC/MinGW Windows support
Zero external dependencies
API documentation (docs/api.md)
quant.h synced with the latest source
Embedding examples (minimal, chat, KV compare)

Planned

Direction 2: KV Compression Research Platform

A C reference engine for KV cache quantization research.

Production-ready

uniform_4b KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
uniform_4b + Q4 V combo (6.9x KV memory reduction)
Delta compression (P-frame encoding)
QK-norm aware compression (Gemma 4 / hybrid attention models)
Plugin architecture (3 functions to add a new type)
35 unit tests
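
As a rough illustration of what a uniform 4-bit KV type does, here is a minimal sketch of block-wise min/max quantization. The block size of 32, the `block_u4` layout, and the `u4_*` function names are assumptions made for this example; they are not taken from the quant.cpp source, and the real uniform_4b format may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32  /* hypothetical block size: values sharing one min and scale */

/* Hypothetical block layout: 32 values -> 16 packed bytes + min + scale. */
typedef struct {
    float   min;           /* smallest value in the block */
    float   scale;         /* step between adjacent 4-bit levels */
    uint8_t q[BLOCK / 2];  /* two 4-bit codes per byte, low nibble first */
} block_u4;

/* Map one value to a code in 0..15 (round to nearest level). */
static uint8_t u4_code(float v, float lo, float inv) {
    int c = (int)((v - lo) * inv + 0.5f);  /* v >= lo, so this is non-negative */
    return (uint8_t)(c > 15 ? 15 : c);
}

/* Quantize BLOCK floats using the block's own min/max range. */
static void u4_quantize(const float *x, block_u4 *out) {
    assert(x != NULL && out != NULL);
    float lo = x[0], hi = x[0];
    for (size_t i = 1; i < BLOCK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    float scale = (hi - lo) / 15.0f;                 /* 16 levels: 0..15 */
    float inv = scale > 0.0f ? 1.0f / scale : 0.0f;  /* constant block -> all zeros */
    out->min = lo;
    out->scale = scale;
    for (size_t i = 0; i < BLOCK; i += 2) {
        uint8_t a = u4_code(x[i], lo, inv);
        uint8_t b = u4_code(x[i + 1], lo, inv);
        out->q[i / 2] = (uint8_t)(a | (b << 4));
    }
}

/* Reconstruct BLOCK floats from the packed codes. */
static void u4_dequantize(const block_u4 *in, float *x) {
    assert(in != NULL && x != NULL);
    for (size_t i = 0; i < BLOCK; i += 2) {
        x[i]     = in->min + in->scale * (float)(in->q[i / 2] & 0x0F);
        x[i + 1] = in->min + in->scale * (float)(in->q[i / 2] >> 4);
    }
}
```

With this scheme the reconstruction error of each value is bounded by half a quantization step (`scale / 2`), which is why narrow per-block ranges matter for quality.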

Building blocks (research, not yet production-ready)

Random Hadamard Transform (tq_rht.c)
Lloyd-Max-Gaussian codebook quantizer (tq_codebook.c)
1-bit QJL sign hash (tq_qjl.c)
PolarQuant (polar coordinate) compression (tq_polar.c)
turbo_kv_* types composing the building blocks (follow the paper's structure; quality still trails the paper)
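
To make the first building block concrete, the following is a minimal sketch of a randomized Hadamard transform: random sign flips followed by an orthonormal fast Walsh-Hadamard transform. The function names, the xorshift32 sign generator, and the seed handling are assumptions for illustration only and are not taken from tq_rht.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INV_SQRT2 0.70710678118654752f

/* In-place orthonormal fast Walsh-Hadamard transform.
 * n must be a power of two. Scaling each butterfly by 1/sqrt(2)
 * yields an overall 1/sqrt(n) factor, so the transform preserves norms. */
static void fwht_ortho(float *x, size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += h << 1) {
            for (size_t j = i; j < i + h; j++) {
                float a = x[j], b = x[j + h];
                x[j]     = (a + b) * INV_SQRT2;
                x[j + h] = (a - b) * INV_SQRT2;
            }
        }
    }
}

/* xorshift32 PRNG: an illustrative choice, state must be non-zero. */
static uint32_t xorshift32(uint32_t *s) {
    uint32_t v = *s;
    v ^= v << 13;
    v ^= v >> 17;
    v ^= v << 5;
    return *s = v;
}

/* Multiply x by a diagonal of pseudo-random +/-1 signs derived from seed. */
static void random_signs(float *x, size_t n, uint32_t seed) {
    uint32_t s = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++)
        if (xorshift32(&s) & 1u) x[i] = -x[i];
}

/* Forward transform: y = H D x, with H orthonormal and D the sign diagonal. */
static void rht_forward(float *x, size_t n, uint32_t seed) {
    random_signs(x, n, seed);
    fwht_ortho(x, n);
}

/* Inverse transform: x = D H y, since H and D are each their own inverse. */
static void rht_inverse(float *x, size_t n, uint32_t seed) {
    fwht_ortho(x, n);
    random_signs(x, n, seed);
}
```

Because the transform is orthogonal, it leaves vector norms and dot products unchanged while spreading energy across coordinates. Running `rht_inverse` with the same seed recovers the input up to floating-point rounding.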

Open: TurboQuant paper reproduction

Close the gap on turbo_kv_* quality vs. the Google paper (see issue #14)
Per-channel outlier handling (paper's 32-channel split)
QJL constant verification for Rademacher rows
Per-head rotation seeds
Regression test pinning turbo_kv_4b PPL on Llama 3.2 3B ≤ 14.5
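
One way the perplexity regression check could be structured is sketched below. Perplexity is the exponential of the mean negative log-likelihood per token, so `PPL <= 14.5` is equivalent to `mean NLL <= ln(14.5)`. The function names and the precomputed constant are assumptions for this example; how the project actually collects per-token log-probabilities is not specified here.

```c
#include <assert.h>
#include <stddef.h>

/* ln(14.5), precomputed so the check does not need exp() or log(). */
#define LN_PPL_LIMIT 2.674149

/* Mean negative log-likelihood in nats.
 * logp[i] is the natural-log probability the model gave to token i. */
static double mean_nll(const double *logp, size_t n) {
    assert(logp != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += logp[i];
    return -sum / (double)n;
}

/* Returns 1 if perplexity is within the limit, 0 otherwise.
 * ln_limit is the natural log of the perplexity ceiling. */
static int ppl_within_limit(const double *logp, size_t n, double ln_limit) {
    return mean_nll(logp, n) <= ln_limit;
}
```

A test harness would evaluate the model on a fixed text, collect the log-probabilities, and fail the build when `ppl_within_limit(logp, n, LN_PPL_LIMIT)` returns 0.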

Planned (after Direction 2 reproduction)

"Add Your Own Type" tutorial polish (docs/custom-quantization.md)
Arxiv tech report
llama.cpp KV type PR (ggml type registration) — only after paper reproduction works
vLLM KV compression plugin
Benchmarking suite (PPL across models × KV types)

Non-Goals

❌ GPU speed competition with llama.cpp (requires tensor graph IR)
❌ Batch serving (vLLM's domain)
❌ Training support
❌ 100+ model coverage

Architecture Principles

One file forward pass: tq_transformer.c contains the entire inference loop
Plugin quantization: Add types via tq_traits.c registration
Zero dependencies: libc + pthreads only (+ Metal on macOS)
CPU-first: NEON/AVX2 optimized, GPU as optional accelerator
Embeddable: quant.h works anywhere a C compiler does
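
The plugin principle can be pictured as a small table of function pointers. The sketch below is a hypothetical registry, not the interface in tq_traits.c: the struct name, the choice of three callbacks (`packed_size`, `quantize`, `dequantize`), and the registration functions are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical traits record: the three callbacks a new KV type supplies. */
typedef struct {
    const char *name;
    size_t (*packed_size)(size_t n);                            /* bytes for n floats */
    void   (*quantize)(const float *src, void *dst, size_t n);  /* floats -> packed */
    void   (*dequantize)(const void *src, float *dst, size_t n);/* packed -> floats */
} kv_type_traits;

#define MAX_KV_TYPES 16
static kv_type_traits g_types[MAX_KV_TYPES];
static size_t g_num_types = 0;

/* Register a type; returns 0 on success, -1 if the table is full. */
static int kv_register_type(const kv_type_traits *t) {
    assert(t && t->name && t->packed_size && t->quantize && t->dequantize);
    if (g_num_types == MAX_KV_TYPES) return -1;
    g_types[g_num_types++] = *t;
    return 0;
}

/* Look a type up by name; returns NULL if it was never registered. */
static const kv_type_traits *kv_find_type(const char *name) {
    for (size_t i = 0; i < g_num_types; i++)
        if (strcmp(g_types[i].name, name) == 0) return &g_types[i];
    return NULL;
}

/* Example plugin: a pass-through "fp32" type that stores floats unchanged. */
static size_t fp32_size(size_t n) { return n * sizeof(float); }
static void fp32_quant(const float *s, void *d, size_t n) {
    memcpy(d, s, n * sizeof(float));
}
static void fp32_dequant(const void *s, float *d, size_t n) {
    memcpy(d, s, n * sizeof(float));
}
static const kv_type_traits fp32_traits = {
    "fp32", fp32_size, fp32_quant, fp32_dequant
};
```

The appeal of this shape is that the inference loop only ever calls through the table, so adding a type does not require touching the forward pass.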
