
Commit ae668c1

unamedkr and claude committed
README: add Book-in-a-Chat demo, Python package, speed FAQ (EN + KO)
- Book-in-a-Chat section: load entire novel, Q&A with KV compression
- Python section: pip install quantcpp with streaming example
- Speed FAQ: honest explanation + v1.3 Metal GPU roadmap link
- H2H benchmark + KV landscape blog in Documentation table
- Community feedback: speed demand validated (6 upvotes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent bb35354 commit ae668c1

2 files changed

Lines changed: 92 additions & 2 deletions

File tree

README.ko.md

Lines changed: 45 additions & 1 deletion
@@ -66,7 +66,23 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
 ./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "안녕!" -k uniform_4b -v q4
 ```
 
-> **[API reference](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Custom quantization guide](docs/custom-quantization.md)**
+> **[API reference](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Custom quantization guide](docs/custom-quantization.md)** · **[Python: `pip install quantcpp`](#python)**
+
+---
+
+## See It In Action: Book-in-a-Chat
+
+Put an entire novel into context and ask questions about it. llama.cpp runs out of memory; quant.cpp remembers the whole book.
+
+```bash
+# Load Alice in Wonderland (~27K tokens) with KV compression
+bash bench/demo/book_chat.sh models/Llama-3.2-3B-Instruct-Q8_0.gguf
+
+# Q: "What riddle did the Mad Hatter ask Alice?"
+# A: "Why is a raven like a writing-desk?" — from Chapter 7, the Mad Tea-Party...
+```
+
+16GB Mac + Llama 3.2 3B: llama.cpp OOMs at ~50K tokens. quant.cpp compresses KV 6.9x → **350K tokens** — about 12 novels' worth.
 
 ---
 
@@ -285,6 +301,27 @@ curl http://localhost:8080/v1/chat/completions \
 
 ---
 
+## Python
+
+```bash
+cd bindings/python && pip install .
+```
+
+```python
+from quantcpp import Model
+
+with Model("model.gguf", kv_compress=1) as m:
+    print(m.ask("프랑스의 수도는?"))
+
+    # Streaming
+    for token in m.generate("옛날 옛적에"):
+        print(token, end="", flush=True)
+```
+
+No build dependencies beyond a C compiler. Compiles `quant.h` automatically at install time.
+
+---
+
 ## Backends & Performance
 
 | Backend | Platform | Status | Notes |
@@ -335,6 +372,13 @@ llama.cpp also supports KV cache quantization (Q8_0 K + Q5_0 V recommended, ~1.
 
 </details>
 
+<details>
+<summary><b>Why is it slower than llama.cpp?</b></summary>
+
+Three reasons: (1) llama.cpp has years of hand-tuned NEON/AVX2 assembly for every quant format, (2) it offloads the full forward pass to Metal/CUDA GPUs, and (3) at 250K+ LOC vs 72K LOC it carries many more micro-optimizations. quant.cpp optimized for memory and embeddability first; speed work (full Metal GPU offload, more SIMD kernels) is in progress. See the [v1.3 plan](docs/plan/prd/prd_v1.3.md).
+
+</details>
+
 <details>
 <summary><b>Can I embed it in my app?</b></summary>
 
README.md

Lines changed: 47 additions & 1 deletion
@@ -66,7 +66,23 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
 ./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -k uniform_4b -v q4
 ```
 
-> **[Full API docs](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Add your own KV type](docs/custom-quantization.md)**
+> **[Full API docs](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Add your own KV type](docs/custom-quantization.md)** · **[Python: `pip install quantcpp`](#python)**
+
+---
+
+## See It In Action: Book-in-a-Chat
+
+Load an entire novel into context and ask questions about it. llama.cpp runs out of memory. quant.cpp remembers the whole book.
+
+```bash
+# Load Alice in Wonderland (~27K tokens) with KV compression
+bash bench/demo/book_chat.sh models/Llama-3.2-3B-Instruct-Q8_0.gguf
+
+# Q: "What riddle did the Mad Hatter ask Alice?"
+# A: "Why is a raven like a writing-desk?" — from Chapter 7, A Mad Tea-Party...
+```
+
+On a 16GB Mac with Llama 3.2 3B: llama.cpp maxes out at ~50K tokens (FP16 KV). quant.cpp compresses KV 6.9x → **350K tokens** — enough for 12 novels.
 
 ---
 
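The 50K-vs-350K claim in the hunk above can be sanity-checked with back-of-envelope KV-cache arithmetic. A rough sketch in Python; the Llama 3.2 3B shape constants (28 layers, 8 GQA KV heads, head dim 128) are assumptions from the commonly published config, not numbers taken from quant.cpp:

```python
# Back-of-envelope KV-cache sizing for Llama 3.2 3B.
# Model shape below is an assumption (typical published config),
# not something read from quant.cpp itself.
N_LAYERS = 28
N_KV_HEADS = 8     # GQA: far fewer KV heads than query heads
HEAD_DIM = 128
FP16_BYTES = 2

def kv_bytes_per_token(bytes_per_elem):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem

fp16 = kv_bytes_per_token(FP16_BYTES)    # 114,688 B, about 112 KiB/token
compressed = fp16 / 6.9                  # the quoted 6.9x KV compression

GIB = 2**30
print(f"FP16 KV at 50K tokens:        {50_000 * fp16 / GIB:.1f} GiB")
print(f"Compressed KV at 350K tokens: {350_000 * compressed / GIB:.1f} GiB")
```

Both figures land near the same ~5 GiB budget, which is why a ~7x KV compression ratio translates into roughly 7x the context length on the same machine.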
@@ -285,6 +301,27 @@ Build with `-DTQ_BUILD_SERVER=ON`. Streaming SSE supported. KV compression confi
 
 ---
 
+## Python
+
+```bash
+cd bindings/python && pip install .
+```
+
+```python
+from quantcpp import Model
+
+with Model("model.gguf", kv_compress=1) as m:
+    print(m.ask("What is the capital of France?"))
+
+    # Streaming
+    for token in m.generate("Once upon a time"):
+        print(token, end="", flush=True)
+```
+
+Zero build dependencies beyond a C compiler. Compiles `quant.h` at install time.
+
+---
+
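The `with Model(...)` form in the Python example relies on the context-manager protocol so the native model (and its KV cache) can be freed deterministically, even if an exception is raised mid-generation. A hypothetical sketch of that pattern; the class internals and names like `_load`/`close` are invented for illustration and are not quantcpp's actual implementation:

```python
# Hypothetical sketch of a context-manager wrapper around a native model.
# Names (_load, close) and internals are invented; this is NOT
# quantcpp's real implementation.
class Model:
    def __init__(self, path, kv_compress=0):
        self._handle = self._load(path, kv_compress)

    def _load(self, path, kv_compress):
        # A real binding would call into the C library here (e.g. via
        # ctypes or a compiled extension) and return a native pointer.
        return object()

    def close(self):
        # A real binding would free the native model + KV cache here.
        self._handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()   # runs even if the with-body raised
        return False   # never swallow exceptions

with Model("model.gguf", kv_compress=1) as m:
    assert m._handle is not None
```

The same guarantee is what makes the streaming loop safe to interrupt: leaving the `with` block releases the native memory.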
 ## Backends & Performance
 
 | Backend | Platform | Status | Notes |
@@ -346,6 +383,13 @@ Works on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.
 
 </details>
 
+<details>
+<summary><b>Why is it slower than llama.cpp?</b></summary>
+
+Three reasons: (1) llama.cpp has years of hand-tuned NEON/AVX2 assembly for every quant format, (2) llama.cpp offloads the full forward pass to Metal/CUDA GPU, (3) 250K+ LOC vs 72K LOC means more micro-optimizations. quant.cpp optimized for memory and embeddability first. Speed improvements (full Metal GPU offload, more SIMD kernels) are actively in progress — see [v1.3 plan](docs/plan/prd/prd_v1.3.md).
+
+</details>
+
 <details>
 <summary><b>No GPU — is this useless?</b></summary>
 
@@ -375,6 +419,8 @@ Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acce
 |:---------|:------------|
 | **[API Reference](docs/api.md)** | Full C API for quant.h and libturboquant (730 lines) |
 | **[Custom Quantization](docs/custom-quantization.md)** | Add your own KV type in 3 functions |
+| **[H2H Benchmark](bench/head_to_head/)** | Reproducible quant.cpp vs llama.cpp comparison |
+| **[KV Compression Landscape](docs/blog/kv-cache-landscape.md)** | Eviction vs Architecture vs Compression guide |
 | **[ROADMAP](ROADMAP.md)** | Project direction and planned features |
 | **[CHANGELOG](CHANGELOG.md)** | Version history and release notes |
 | **[Tech Report](docs/papers/quant_cpp_tech_report.md)** | Architecture and benchmarks (arXiv draft) |
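The `uniform_4b` KV type used in the quickstart refers to uniform 4-bit quantization. A minimal illustration of what that does to one block of KV values (pure Python, illustrative only; quant.cpp's actual C kernel additionally handles storage layout and SIMD):

```python
# Illustrative uniform 4-bit quantization of one block of KV values.
# Not quant.cpp's actual kernel: this shows only the scale/zero-point
# arithmetic, without bit-packing or SIMD.
def quantize_4b(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0          # 4 bits -> 16 levels (0..15)
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo                # codes + 2 floats of metadata

def dequantize_4b(codes, scale, lo):
    return [lo + c * scale for c in codes]

block = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.29]
codes, scale, lo = quantize_4b(block)
approx = dequantize_4b(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(block, approx))

assert all(0 <= c <= 15 for c in codes)    # every value fits in 4 bits
assert max_err <= scale / 2 + 1e-9         # error bounded by half a step
```

Relative to FP16 this trades 16 bits per value for 4 bits plus a small per-block overhead, which is where most of the quoted KV savings come from.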
