
Commit ff45ff8

unamedkr and claude committed
README: fair llama.cpp comparison — add Q8K+Q5V, remove 'broken' label
Community feedback (u/audioen) correctly pointed out that llama.cpp Q4_0 is also per-block and supports separate K/V quant types. Updated:

- Add llama.cpp Q8_0 K + Q5_0 V (~1.6x, ~+1% PPL) to the comparison charts
- Remove the 'broken' label from Q4_0: it's not broken, just lower quality
- Clarify the actual differences: block size (128 vs 32), min-max encoding, delta
- Rewrite the FAQ to acknowledge llama.cpp KV quant and position quant.cpp for the 4-7x compression range where the quality gap matters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7c9236b commit ff45ff8

2 files changed

Lines changed: 24 additions & 14 deletions


README.ko.md

Lines changed: 12 additions & 7 deletions
@@ -77,21 +77,25 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
 ```
 KV Quantization Quality (SmolLM2 1.7B, WikiText-2)
 
-llama.cpp Q4_0 KV   │██████████████████████████████████████ PPL +10.6% ← broken
+llama.cpp Q4_0 KV   │██████████████████████████████████████ PPL +10.6%
 
-quant.cpp 4-bit     │▏ PPL +0.0% ← lossless
+llama.cpp Q8K+Q5V   │▎ PPL ~+1% ← recommended config (1.6x compression)
 
-quant.cpp 3-bit     │█ PPL +1.3% ← delta compression
+quant.cpp 4-bit     │▏ PPL +0.0% ← lossless (3.8x compression)
+
+quant.cpp 3-bit     │█ PPL +1.3% ← delta compression (4.3x)
 
 └────────────────────────────────────────────────
  0%                                         +12%
              Perplexity Degradation →
 ```
 
+Both are per-block methods. The quality difference comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range**, where the gap is large.
+
 ### vs other engines
 
 | | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
 |:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV compression | **7x, +0% PPL** | +10.6% PPL | -- | -- | -- |
+| KV compression | **7x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
 | Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
 | Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
 | Embeddable | **single header** | -- | -- | -- | complex |
@@ -163,7 +167,8 @@ quant.cpp: quantize keys to 4-bit → 4 bits/element → 3.8x
 4b K + FP16 V       14.63 │ ● identical
 4b K + Q4 V         14.57 │ ● slightly better (!)
 delta 3b K + Q4 V   14.82 │ ● +1.3%
-llama.cpp Q4 KV     16.18 │ ● +10.6%
+llama.cpp Q8K+Q5V   ~14.8 │ ● ~+1% (1.6x compression)
+llama.cpp Q4_0 KV   16.18 │ ● +10.6% (3.8x compression)
 3b K (no delta)     ——    │ ● +62%
 └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──
  14 15 16 17 18 19 20 21+
@@ -317,9 +322,9 @@ llama.cpp is a full-featured inference framework (250K+ LOC). quant.c
 </details>
 
 <details>
-<summary><b>llama.cpp also has Q4 KV. What's the difference?</b></summary>
+<summary><b>llama.cpp also has KV quantization. What's the difference?</b></summary>
 
-Both are 4-bit, but the quality difference is large: llama.cpp Q4_0 KV gives PPL +10.6%, quant.cpp +0.0%. quant.cpp quantizes K and V independently, each with the method best suited to it, and also offers delta compression (PPL +1.3% at 3-bit). llama.cpp has no equivalent.
+llama.cpp also supports KV cache quantization (Q8_0 K + Q5_0 V recommended: ~1.6x compression with almost no quality loss). quant.cpp targets higher compression: 4-bit K + Q4 V gives 3.8x at PPL +0.0%, and delta compression reaches 4.3x at +1.3%. The quality advantage comes from 128-element min-max blocks (vs 32-element), independent K/V quantization, and delta encoding of adjacent keys. If 1.6x is enough, llama.cpp's KV quantization is sufficient. If you need 4-7x, use quant.cpp.
 
 </details>

README.md

Lines changed: 12 additions & 7 deletions
@@ -77,21 +77,25 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
 ```
 KV Quantization Quality (SmolLM2 1.7B, WikiText-2)
 
-llama.cpp Q4_0 KV   │██████████████████████████████████████ PPL +10.6% ← broken
+llama.cpp Q4_0 KV   │██████████████████████████████████████ PPL +10.6%
 
-quant.cpp 4-bit     │▏ PPL +0.0% ← lossless
+llama.cpp Q8 K+Q5 V │▎ PPL ~+1% ← recommended (1.6x compression)
 
-quant.cpp 3-bit     │█ PPL +1.3% ← delta compression
+quant.cpp 4-bit     │▏ PPL +0.0% ← lossless (3.8x compression)
+
+quant.cpp 3-bit     │█ PPL +1.3% ← delta compression (4.3x)
 
 └────────────────────────────────────────────────
  0%                                         +12%
              Perplexity Degradation →
 ```
 
+Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression, not from a fundamental design flaw in llama.cpp. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range** where the difference matters.
+
 ### vs every other engine
 
 | | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
 |:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV compression | **7x, +0% PPL** | +10.6% PPL | -- | -- | -- |
+| KV compression | **7x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
 | Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
 | Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
 | Embeddable | **single header** | -- | -- | -- | complex |
@@ -163,7 +167,8 @@ Like video compression: I-frames (FP32) every 64 tokens, P-frames (3-bit delta)
 4b K + FP16 V       14.63 │ ● identical
 4b K + Q4 V         14.57 │ ● slightly better (!)
 delta 3b K + Q4 V   14.82 │ ● +1.3%
-llama.cpp Q4 KV     16.18 │ ● +10.6%
+llama.cpp Q8K+Q5V   ~14.8 │ ● ~+1% (1.6x compression)
+llama.cpp Q4_0 KV   16.18 │ ● +10.6% (3.8x compression)
 3b K (no delta)     ——    │ ● +62%
 └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──
  14 15 16 17 18 19 20 21+
@@ -317,9 +322,9 @@ llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a min
 </details>
 
 <details>
-<summary><b>llama.cpp already has Q4 KV. How is yours better?</b></summary>
+<summary><b>llama.cpp already has KV quantization. How is yours different?</b></summary>
 
-Both use 4 bits, but quality differs: llama.cpp Q4_0 KV gives PPL +10.6%, quant.cpp gives +0.0%. The difference: quant.cpp quantizes K and V independently with type-appropriate methods, and offers delta compression (3-bit at +1.3% PPL). llama.cpp has no equivalent.
+llama.cpp supports KV cache quantization (Q8_0 K + Q5_0 V is the recommended config, ~1.6x compression with minimal quality loss). quant.cpp targets higher compression: 4-bit K + Q4 V gives 3.8x at +0.0% PPL, and delta compression pushes to 4.3x at +1.3% PPL. The quality advantage comes from 128-element min-max blocks (vs 32-element), independent K/V quantization methods, and delta encoding of adjacent keys, a technique llama.cpp doesn't have. Use llama.cpp's KV quant if 1.6x is enough; use quant.cpp if you need 4-7x.
 
 </details>
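For reference, the llama.cpp setup the FAQ recommends can be requested on the command line. The flag names below match recent llama.cpp builds but should be verified against `llama-cli --help`; the model path is a placeholder, and recent builds require flash attention for a quantized V cache.

```shell
# Sketch: Q8_0 keys + Q5_0 values, ~1.6x KV cache compression.
# Verify flag spellings with `llama-cli --help` on your build.
llama-cli -m SmolLM2-1.7B-Instruct-Q8_0.gguf \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q5_0 \
  -p "Hello"
```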
