quant.cpp is the single-header C reference implementation of TurboQuant and related KV cache quantization research.
Not competing with Google. Not competing with llama.cpp. Filling the gap nobody else does: TurboQuant-class compression anywhere a C compiler runs.
See docs/positioning.md for the full strategy.
Data-center TurboQuant? → Google reference (arXiv:2504.19874)
Workstation speed? → llama.cpp
Batch serving? → vLLM
TurboQuant on iPhone? → quant.cpp
TurboQuant in a browser? → quant.cpp
TurboQuant in a game engine? → quant.cpp
TurboQuant on a microcontroller? → quant.cpp
The world's simplest way to add an LLM to a C/C++ project.
- quant.h single header (15K LOC, 628KB)
- 6-function API (load, new, generate, ask, free_ctx, free_model)
- WASM build (192KB binary)
- MSVC/MinGW Windows support
- Zero external dependencies
- API documentation (docs/api.md)
- quant.h sync with latest source
- Embedding examples (minimal, chat, KV compare)
- pip install quantcpp (Python bindings)
- iOS SDK + demo app
- Android NDK build guide
- Unity C# plugin
- Unreal C++ integration
- npm package (WASM)
- GitHub Pages live demo with pre-loaded model
A C reference engine for KV cache quantization research.
- `uniform_4b` KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
- `uniform_4b` + Q4 V combo (6.9x KV memory reduction)
- Delta compression (P-frame encoding)
- QK-norm-aware compression (Gemma 4 / hybrid attention models)
- Plugin architecture (3 functions to add new type)
- 35 unit tests
- Random Hadamard Transform (`tq_rht.c`)
- Lloyd-Max-Gaussian codebook quantizer (`tq_codebook.c`)
- 1-bit QJL sign hash (`tq_qjl.c`)
- PolarQuant (polar-coordinate) compression (`tq_polar.c`)
- `turbo_kv_*` types composing the building blocks (paper structure, gap in quality)
- Close the gap on `turbo_kv_*` quality vs. the Google paper — see issue #14
- Per-channel outlier handling (the paper's 32-channel split)
- QJL constant verification for Rademacher rows
- Per-head rotation seeds
- Regression test pinning `turbo_kv_4b` PPL on Llama 3.2 3B at ≤ 14.5
- "Add Your Own Type" tutorial polish (docs/custom-quantization.md)
- Arxiv tech report
- llama.cpp KV type PR (ggml type registration) — only after paper reproduction works
- vLLM KV compression plugin
- Benchmarking suite (PPL across models × KV types)
- ❌ GPU speed competition with llama.cpp (requires tensor graph IR)
- ❌ Batch serving (vLLM's domain)
- ❌ Training support
- ❌ 100+ model coverage
- One-file forward pass: tq_transformer.c contains the entire inference loop
- Plugin quantization: Add types via tq_traits.c registration
- Zero dependencies: libc + pthreads only (+ Metal on macOS)
- CPU-first: NEON/AVX2 optimized, GPU as optional accelerator
- Embeddable: quant.h works anywhere a C compiler does