
Commit ae668c1

unamedkr and claude committed
README: add Book-in-a-Chat demo, Python package, speed FAQ (EN + KO)
- Book-in-a-Chat section: load entire novel, Q&A with KV compression
- Python section: pip install quantcpp with streaming example
- Speed FAQ: honest explanation + v1.3 Metal GPU roadmap link
- H2H benchmark + KV landscape blog in Documentation table
- Community feedback: speed demand validated (6 upvotes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent bb35354 commit ae668c1

2 files changed

Lines changed: 92 additions & 2 deletions

File tree

README.ko.md

Lines changed: 45 additions & 1 deletion
@@ -66,7 +66,23 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
 ./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "안녕!" -k uniform_4b -v q4
 ```
 
-> **[API reference](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Custom quantization guide](docs/custom-quantization.md)**
+> **[API reference](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Custom quantization guide](docs/custom-quantization.md)** · **[Python: `pip install quantcpp`](#python)**
+
+---
+
+## See It In Action: Book-in-a-Chat
+
+Put an entire novel into context and ask questions about it. llama.cpp runs out of memory; quant.cpp remembers the whole book.
+
+```bash
+# Load Alice in Wonderland (~27K tokens) with KV compression
+bash bench/demo/book_chat.sh models/Llama-3.2-3B-Instruct-Q8_0.gguf
+
+# Q: "What riddle did the Mad Hatter ask Alice?"
+# A: "Why is a raven like a writing-desk?" — from Chapter 7, the Mad Tea-Party...
+```
+
+16GB Mac + Llama 3.2 3B: llama.cpp OOMs at ~50K tokens. quant.cpp compresses KV 6.9x → **350K tokens** — about 12 novels' worth.
 
 ---
 
@@ -285,6 +301,27 @@ curl http://localhost:8080/v1/chat/completions \
 
 ---
 
+## Python
+
+```bash
+cd bindings/python && pip install .
+```
+
+```python
+from quantcpp import Model
+
+with Model("model.gguf", kv_compress=1) as m:
+    print(m.ask("프랑스의 수도는?"))
+
+    # Streaming
+    for token in m.generate("옛날 옛적에"):
+        print(token, end="", flush=True)
+```
+
+No build dependencies beyond a C compiler. Compiles `quant.h` automatically at install time.
+
+---
+
 ## Backends & Performance
 
 | Backend | Platform | Status | Notes |
@@ -335,6 +372,13 @@ llama.cpp also supports KV cache quantization (Q8_0 K + Q5_0 V recommended, ~1.
 
 </details>
 
+<details>
+<summary><b>Why is it slower than llama.cpp?</b></summary>
+
+Three reasons: (1) llama.cpp has years of hand-tuned NEON/AVX2 assembly for every quant format, (2) it offloads the full forward pass to Metal/CUDA GPUs, and (3) at 250K+ LOC vs 72K LOC it carries many more micro-optimizations. quant.cpp optimized for memory and embeddability first; speed work (full Metal GPU offload, more SIMD kernels) is in progress. See the [v1.3 plan](docs/plan/prd/prd_v1.3.md).
+
+</details>
+
 <details>
 <summary><b>Can I embed it in my app?</b></summary>
 
README.md

Lines changed: 47 additions & 1 deletion
@@ -66,7 +66,23 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
 ./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -k uniform_4b -v q4
 ```
 
-> **[Full API docs](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Add your own KV type](docs/custom-quantization.md)**
+> **[Full API docs](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Add your own KV type](docs/custom-quantization.md)** · **[Python: `pip install quantcpp`](#python)**
+
+---
+
+## See It In Action: Book-in-a-Chat
+
+Load an entire novel into context and ask questions about it. llama.cpp runs out of memory. quant.cpp remembers the whole book.
+
+```bash
+# Load Alice in Wonderland (~27K tokens) with KV compression
+bash bench/demo/book_chat.sh models/Llama-3.2-3B-Instruct-Q8_0.gguf
+
+# Q: "What riddle did the Mad Hatter ask Alice?"
+# A: "Why is a raven like a writing-desk?" — from Chapter 7, A Mad Tea-Party...
+```
+
+On a 16GB Mac with Llama 3.2 3B: llama.cpp maxes out at ~50K tokens (FP16 KV). quant.cpp compresses KV 6.9x → **350K tokens** — enough for 12 novels.
 
 ---
 
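The 50K-vs-350K claim in the hunk above can be sanity-checked with back-of-envelope KV-cache arithmetic. A rough sketch in Python; the Llama 3.2 3B shape constants (28 layers, 8 GQA KV heads, head dim 128) are assumptions from the commonly published config, not numbers taken from quant.cpp:

```python
# Back-of-envelope KV-cache sizing for Llama 3.2 3B.
# Model shape below is an assumption (typical published config),
# not something read from quant.cpp itself.
N_LAYERS = 28
N_KV_HEADS = 8     # GQA: far fewer KV heads than query heads
HEAD_DIM = 128
FP16_BYTES = 2

def kv_bytes_per_token(bytes_per_elem):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem

fp16 = kv_bytes_per_token(FP16_BYTES)    # 114,688 B, about 112 KiB/token
compressed = fp16 / 6.9                  # the quoted 6.9x KV compression

GIB = 2**30
print(f"FP16 KV at 50K tokens:        {50_000 * fp16 / GIB:.1f} GiB")
print(f"Compressed KV at 350K tokens: {350_000 * compressed / GIB:.1f} GiB")
```

Both figures land near the same ~5 GiB budget, which is why a ~7x KV compression ratio translates into roughly 7x the context length on the same machine.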
@@ -285,6 +301,27 @@ Build with `-DTQ_BUILD_SERVER=ON`. Streaming SSE supported. KV compression confi
 
 ---
 
+## Python
+
+```bash
+cd bindings/python && pip install .
+```
+
+```python
+from quantcpp import Model
+
+with Model("model.gguf", kv_compress=1) as m:
+    print(m.ask("What is the capital of France?"))
+
+    # Streaming
+    for token in m.generate("Once upon a time"):
+        print(token, end="", flush=True)
+```
+
+Zero build dependencies beyond a C compiler. Compiles `quant.h` at install time.
+
+---
+
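The `with Model(...)` form in the Python example relies on the context-manager protocol so the native model (and its KV cache) can be freed deterministically, even if an exception is raised mid-generation. A hypothetical sketch of that pattern; the class internals and names like `_load`/`close` are invented for illustration and are not quantcpp's actual implementation:

```python
# Hypothetical sketch of a context-manager wrapper around a native model.
# Names (_load, close) and internals are invented; this is NOT
# quantcpp's real implementation.
class Model:
    def __init__(self, path, kv_compress=0):
        self._handle = self._load(path, kv_compress)

    def _load(self, path, kv_compress):
        # A real binding would call into the C library here (e.g. via
        # ctypes or a compiled extension) and return a native pointer.
        return object()

    def close(self):
        # A real binding would free the native model + KV cache here.
        self._handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()   # runs even if the with-body raised
        return False   # never swallow exceptions

with Model("model.gguf", kv_compress=1) as m:
    assert m._handle is not None
```

The same guarantee is what makes the streaming loop safe to interrupt: leaving the `with` block releases the native memory.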
 ## Backends & Performance
 
 | Backend | Platform | Status | Notes |
@@ -346,6 +383,13 @@ Works on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.
 
 </details>
 
+<details>
+<summary><b>Why is it slower than llama.cpp?</b></summary>
+
+Three reasons: (1) llama.cpp has years of hand-tuned NEON/AVX2 assembly for every quant format, (2) llama.cpp offloads the full forward pass to Metal/CUDA GPU, (3) 250K+ LOC vs 72K LOC means more micro-optimizations. quant.cpp optimized for memory and embeddability first. Speed improvements (full Metal GPU offload, more SIMD kernels) are actively in progress — see [v1.3 plan](docs/plan/prd/prd_v1.3.md).
+
+</details>
+
 <details>
 <summary><b>No GPU — is this useless?</b></summary>
 
@@ -375,6 +419,8 @@ Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acce
 |:---------|:------------|
 | **[API Reference](docs/api.md)** | Full C API for quant.h and libturboquant (730 lines) |
 | **[Custom Quantization](docs/custom-quantization.md)** | Add your own KV type in 3 functions |
+| **[H2H Benchmark](bench/head_to_head/)** | Reproducible quant.cpp vs llama.cpp comparison |
+| **[KV Compression Landscape](docs/blog/kv-cache-landscape.md)** | Eviction vs Architecture vs Compression guide |
 | **[ROADMAP](ROADMAP.md)** | Project direction and planned features |
 | **[CHANGELOG](CHANGELOG.md)** | Version history and release notes |
 | **[Tech Report](docs/papers/quant_cpp_tech_report.md)** | Architecture and benchmarks (arXiv draft) |
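The `uniform_4b` KV type used in the quickstart refers to uniform 4-bit quantization. A minimal illustration of what that does to one block of KV values (pure Python, illustrative only; quant.cpp's actual C kernel additionally handles storage layout and SIMD):

```python
# Illustrative uniform 4-bit quantization of one block of KV values.
# Not quant.cpp's actual kernel: this shows only the scale/zero-point
# arithmetic, without bit-packing or SIMD.
def quantize_4b(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0          # 4 bits -> 16 levels (0..15)
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo                # codes + 2 floats of metadata

def dequantize_4b(codes, scale, lo):
    return [lo + c * scale for c in codes]

block = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.29]
codes, scale, lo = quantize_4b(block)
approx = dequantize_4b(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(block, approx))

assert all(0 <= c <= 15 for c in codes)    # every value fits in 4 bits
assert max_err <= scale / 2 + 1e-9         # error bounded by half a step
```

Relative to FP16 this trades 16 bits per value for 4 bits plus a small per-block overhead, which is where most of the quoted KV savings come from.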
