Skip to content

Latest commit

 

History

History
93 lines (73 loc) · 3.37 KB

File metadata and controls

93 lines (73 loc) · 3.37 KB

quant.cpp Roadmap

Vision

quant.cpp is the single-header C reference implementation of TurboQuant and related KV cache quantization research.

Not competing with Google. Not competing with llama.cpp. Filling the gap nobody else fills: TurboQuant-class compression anywhere a C compiler runs.

See docs/positioning.md for the full strategy.

Positioning

Data-center TurboQuant?       → Google reference (arxiv:2504.19874)
Workstation speed?            → llama.cpp
Batch serving?                → vLLM
TurboQuant on iPhone?         → quant.cpp
TurboQuant in a browser?      → quant.cpp
TurboQuant in a game engine?  → quant.cpp
TurboQuant on a microcontroller? → quant.cpp

Direction 1: Embedding Engine ("LLM의 SQLite")

The world's simplest way to add LLM to a C/C++ project.

Done

  • quant.h single header (15K LOC, 628KB)
  • 6-function API (load, new, generate, ask, free_ctx, free_model)
  • WASM build (192KB binary)
  • MSVC/MinGW Windows support
  • Zero external dependencies
  • API documentation (docs/api.md)
  • quant.h sync with latest source
  • Embedding examples (minimal, chat, KV compare)

Planned

  • pip install quantcpp (Python bindings)
  • iOS SDK + demo app
  • Android NDK build guide
  • Unity C# plugin
  • Unreal C++ integration
  • npm package (WASM)
  • GitHub Pages live demo with pre-loaded model

Direction 2: KV Compression Research Platform

A C reference engine for KV cache quantization research.

Production-ready

  • uniform_4b KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
  • uniform_4b + Q4 V combo (6.9x KV memory reduction)
  • Delta compression (P-frame encoding)
  • QK-norm aware compression (Gemma 4 / hybrid attention models)
  • Plugin architecture (3 functions to add new type)
  • 35 unit tests

Building blocks (research, not yet production-ready)

  • Random Hadamard Transform (tq_rht.c)
  • Lloyd-Max-Gaussian codebook quantizer (tq_codebook.c)
  • 1-bit QJL sign hash (tq_qjl.c)
  • PolarQuant (polar coordinate) compression (tq_polar.c)
  • turbo_kv_* types composing the building blocks (paper structure, gap in quality)

Open: TurboQuant paper reproduction

  • Close the gap on turbo_kv_* quality vs Google paper — see issue #14
  • Per-channel outlier handling (paper's 32-channel split)
  • QJL constant verification for Rademacher rows
  • Per-head rotation seeds
  • Regression test pinning turbo_kv_4b PPL on Llama 3.2 3B ≤ 14.5

Planned (after Direction 2 reproduction)

  • "Add Your Own Type" tutorial polish (docs/custom-quantization.md)
  • Arxiv tech report
  • llama.cpp KV type PR (ggml type registration) — only after paper reproduction works
  • vLLM KV compression plugin
  • Benchmarking suite (PPL across models × KV types)

Non-Goals

  • ❌ GPU speed competition with llama.cpp (requires tensor graph IR)
  • ❌ Batch serving (vLLM's domain)
  • ❌ Training support
  • ❌ 100+ model coverage

Architecture Principles

  1. One file forward pass: tq_transformer.c contains the entire inference loop
  2. Plugin quantization: Add types via tq_traits.c registration
  3. Zero dependencies: libc + pthreads only (+ Metal on macOS)
  4. CPU-first: NEON/AVX2 optimized, GPU as optional accelerator
  5. Embeddable: quant.h works anywhere a C compiler does