
Commit f8bdf6e

unamedkr and claude committed
Add ROADMAP.md: project direction and non-goals
Two clear directions: 1. "SQLite of LLMs" embedding engine (quant.h, WASM, mobile); 2. KV compression research platform (plugin architecture, papers). Explicit non-goals: GPU speed competition, batch serving, training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent bc92e04 commit f8bdf6e

1 file changed: ROADMAP.md (80 additions, 0 deletions)
@@ -0,0 +1,80 @@
# quant.cpp Roadmap

## Vision

**quant.cpp is the SQLite of LLM inference.**

Not the fastest. Not the most feature-complete.
The most embeddable, the most readable, and the only engine
that compresses KV cache 7x without quality loss.
## Positioning

```
Need speed?                                     → llama.cpp
Need throughput?                                → vLLM
Need to embed an LLM in your app with one file? → quant.cpp
Need 7x longer context on the same hardware?    → quant.cpp
```
## Direction 1: Embedding Engine (the "SQLite of LLMs")

The world's simplest way to add an LLM to a C/C++ project.
### Done

- [x] quant.h single header (15K LOC, 628KB)
- [x] 6-function API (load, new, generate, ask, free_ctx, free_model)
- [x] WASM build (192KB binary)
- [x] MSVC/MinGW Windows support
- [x] Zero external dependencies
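Embedding then looks like the sketch below. The six function names come from the list above; the `tq_` prefix, the exact signatures, and the stub bodies are assumptions for illustration (the stubs stand in for quant.h so the sketch compiles on its own), not the real quant.h API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shapes for the six calls listed above:
   load, new, generate, ask, free_ctx, free_model.
   Stub bodies stand in for quant.h so this is self-contained. */
typedef struct { int loaded; } tq_model;
typedef struct { tq_model *model; } tq_ctx;

static tq_model *tq_load(const char *path) { (void)path; return calloc(1, sizeof(tq_model)); }
static tq_ctx *tq_new(tq_model *m) { tq_ctx *c = calloc(1, sizeof(tq_ctx)); c->model = m; return c; }
static const char *tq_generate(tq_ctx *c, const char *prompt) { (void)c; return prompt; /* echo stub */ }
static const char *tq_ask(tq_ctx *c, const char *question) { return tq_generate(c, question); /* chat-templated variant */ }
static void tq_free_ctx(tq_ctx *c) { free(c); }
static void tq_free_model(tq_model *m) { free(m); }

/* Typical host-application flow: load once, answer, clean up. */
static const char *demo(void)
{
    tq_model *model = tq_load("model.bin");
    tq_ctx *ctx = tq_new(model);
    const char *reply = tq_ask(ctx, "What is SQLite?");
    tq_free_ctx(ctx);
    tq_free_model(model);
    return reply; /* echo stub: returns the question itself */
}
```

The whole surface a host app touches is those six calls, which is what makes the engine embeddable the way SQLite is.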
### In Progress

- [ ] API documentation (docs/api.md)
- [ ] quant.h sync with latest source
- [ ] Embedding examples (minimal, chat, KV compare)
### Planned

- [ ] pip install quantcpp (Python bindings)
- [ ] iOS SDK + demo app
- [ ] Android NDK build guide
- [ ] Unity C# plugin
- [ ] Unreal C++ integration
- [ ] npm package (WASM)
- [ ] GitHub Pages live demo with a pre-loaded model
## Direction 2: KV Compression Research Platform

The reference implementation for KV cache quantization research.
### Done

- [x] 7 quantization types (Polar, QJL, Turbo, Uniform, TurboKV)
- [x] Delta compression (P-frame encoding)
- [x] QK-norm-aware compression
- [x] Plugin architecture (3 functions to add a new type)
- [x] 34 unit tests
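The three-function contract can be pictured with a self-contained sketch. The `tq_kv_traits` struct, the registry array, and the toy 8-bit uniform quantizer below are illustrative assumptions, not the actual tq_traits.c interface:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* A KV quantization type supplies three functions: encode a block
   of floats, decode it back, and report its packed size.
   (Names and signatures here are illustrative.) */
typedef struct {
    const char *name;
    void   (*quantize)(const float *in, uint8_t *out, size_t n, float *scale);
    void   (*dequantize)(const uint8_t *in, float *out, size_t n, float scale);
    size_t (*packed_size)(size_t n);
} tq_kv_traits;

/* Toy plugin: 8-bit symmetric uniform quantization. */
static void u8_quantize(const float *in, uint8_t *out, size_t n, float *scale) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++)
        if (fabsf(in[i]) > amax) amax = fabsf(in[i]);
    *scale = (amax > 0.0f) ? amax / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(int)(in[i] / *scale + 127.5f); /* [-amax, amax] -> [0, 254] */
}
static void u8_dequantize(const uint8_t *in, float *out, size_t n, float scale) {
    for (size_t i = 0; i < n; i++)
        out[i] = (float)((int)in[i] - 127) * scale;
}
static size_t u8_packed_size(size_t n) { return n; /* one byte per value; scale stored per block */ }

/* Registering a new type is appending one traits entry. */
static const tq_kv_traits tq_types[] = {
    { "uniform-u8", u8_quantize, u8_dequantize, u8_packed_size },
};
```

A new type drops in by implementing the same three functions with a different codebook, which is the shape the "Add Your Own Type" tutorial is meant to document.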
### In Progress

- [ ] "Add Your Own Type" tutorial (docs/custom-quantization.md)
- [ ] arXiv tech report
### Planned

- [ ] llama.cpp KV type PR (ggml type registration)
- [ ] vLLM KV compression plugin
- [ ] Benchmarking suite (PPL across models × KV types)
- [ ] Learned codebook quantization
- [ ] Per-head adaptive bit allocation
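The P-frame delta compression listed under Done rests on one observation: adjacent KV entries are correlated, so storing residuals against a reference frame leaves values clustered near zero, and those residuals quantize with a far smaller scale than the raw values would need. A minimal sketch of the idea (names and layout are illustrative, not the actual implementation):

```c
#include <stddef.h>

/* P-frame style delta coding: keep one reference (I-frame) vector,
   store subsequent vectors as residuals against it. Residuals are
   small for correlated frames, so a downstream quantizer can use a
   much tighter scale. (Illustrative sketch only.) */
static void delta_encode(const float *iframe, const float *frame,
                         float *residual, size_t n) {
    for (size_t i = 0; i < n; i++)
        residual[i] = frame[i] - iframe[i];
}

static void delta_decode(const float *iframe, const float *residual,
                         float *frame, size_t n) {
    for (size_t i = 0; i < n; i++)
        frame[i] = iframe[i] + residual[i];
}
```

Decoding is exact before quantization; the compression win comes entirely from the residuals' smaller dynamic range.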
## Non-Goals

- ❌ GPU speed competition with llama.cpp (requires tensor graph IR)
- ❌ Batch serving (vLLM's domain)
- ❌ Training support
- ❌ 100+ model coverage
## Architecture Principles

1. **One-file forward pass**: tq_transformer.c contains the entire inference loop
2. **Plugin quantization**: add types via tq_traits.c registration
3. **Zero dependencies**: libc + pthreads only (+ Metal on macOS)
4. **CPU-first**: NEON/AVX2 optimized, GPU as an optional accelerator
5. **Embeddable**: quant.h works anywhere a C compiler does
