</details>

<details>
<summary><b>Why is it slower than llama.cpp?</b></summary>

Three reasons: (1) llama.cpp has years of hand-tuned NEON/AVX2 assembly for every quant format, (2) it offloads the full forward pass to the Metal/CUDA GPU, and (3) at 250K+ LOC vs 72K LOC it carries more micro-optimizations. quant.cpp optimized for memory and embeddability first. Speed improvements (full Metal GPU offload, more SIMD kernels) are in progress; see the [v1.3 plan](docs/plan/prd/prd_v1.3.md).

</details>
> **[Full API docs](docs/api.md)** · **[WASM demo](https://quantumaikr.github.io/quant.cpp/)** · **[Add your own KV type](docs/custom-quantization.md)** · **[Python: `pip install quantcpp`](#python)**

---

## See It In Action: Book-in-a-Chat

Load an entire novel into context and ask questions about it. llama.cpp runs out of memory. quant.cpp remembers the whole book.
```bash
# Load Alice in Wonderland (~27K tokens) with KV compression
```
Zero build dependencies beyond a C compiler. Compiles `quant.h` at install time.
---

## Backends & Performance

| Backend | Platform | Status | Notes |
|---------|----------|--------|-------|
Works on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.

</details>
<details>
<summary><b>Why is it slower than llama.cpp?</b></summary>

Three reasons: (1) llama.cpp has years of hand-tuned NEON/AVX2 assembly for every quant format, (2) llama.cpp offloads the full forward pass to Metal/CUDA GPU, (3) 250K+ LOC vs 72K LOC means more micro-optimizations. quant.cpp optimized for memory and embeddability first. Speed improvements (full Metal GPU offload, more SIMD kernels) are actively in progress — see [v1.3 plan](docs/plan/prd/prd_v1.3.md).

</details>
<details>
<summary><b>No GPU — is this useless?</b></summary>