Community feedback (u/audioen) correctly pointed out that llama.cpp Q4_0
is also per-block and supports separate K/V quant types. Updated:
- Add llama.cpp Q8_0 K + Q5_0 V (~1.6x, ~+1% PPL) to comparison charts
- Remove 'broken' label from Q4_0 — it's not broken, just lower quality
- Clarify the difference: block size (128 vs 32), min-max encoding, delta
- FAQ rewritten to acknowledge llama.cpp KV quant and position quant.cpp
for the 4-7x compression range where the quality gap matters
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both are per-block methods. The quality gap comes from block size (128 vs 32), min-max range encoding, independent K/V treatment, and delta compression — not from a fundamental design flaw in llama.cpp. At ~1.6x compression, llama.cpp Q8+Q5 is excellent. quant.cpp targets the **4-7x range** where the difference matters.
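The block-size and range-encoding distinction above can be illustrated with a minimal sketch. This is a hypothetical example, not quant.cpp's or llama.cpp's actual code: it implements per-block min-max (asymmetric) quantization, where each block stores its own minimum and scale, and shows how block size trades reconstruction error against per-block metadata.

```python
import numpy as np

def quantize_minmax(x, bits=4, block=128):
    """Per-block asymmetric (min-max) quantization with round-trip decode.

    Illustrative sketch only: each block stores (lo, scale) and maps
    values to integer codes 0..2^bits - 1, so an outlier only widens
    the range of the block it lives in.
    """
    levels = (1 << bits) - 1               # 4 bits -> 15 steps between min and max
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        b = x[i:i + block]
        lo, hi = b.min(), b.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((b - lo) / scale)     # integer codes 0..levels
        out[i:i + block] = q * scale + lo  # dequantize back to float
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
err128 = np.abs(quantize_minmax(x, block=128) - x).mean()
err32 = np.abs(quantize_minmax(x, block=32) - x).mean()
```

Smaller blocks adapt more tightly to the local value range, so 32-element blocks give lower raw error than 128-element blocks at the cost of more metadata per element; the quality claims above therefore rest on the other ingredients (min-max encoding, independent K/V treatment, delta compression), not on block size alone.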
@@ -317,9 +322,9 @@ llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a min
 </details>

 <details>
-<summary><b>llama.cpp already has Q4 KV. How is yours better?</b></summary>
+<summary><b>llama.cpp already has KV quantization. How is yours different?</b></summary>

-Both use 4 bits, but quality differs: llama.cpp Q4_0 KV gives PPL +10.6%, quant.cpp gives +0.0%. The difference: quant.cpp quantizes K and V independently with type-appropriate methods, and offers delta compression (3-bit at +1.3% PPL). llama.cpp has no equivalent.
+llama.cpp supports KV cache quantization (Q8_0 K + Q5_0 V is the recommended config, ~1.6x compression with minimal quality loss). quant.cpp targets higher compression: 4-bit K + Q4 V gives 3.8x at +0.0% PPL, and delta compression pushes to 4.3x at +1.3% PPL. The quality advantage comes from 128-element min-max blocks (vs 32-element), independent K/V quantization methods, and delta encoding of adjacent keys — a technique llama.cpp doesn't have. Use llama.cpp's KV quant if 1.6x is enough; use quant.cpp if you need 4-7x.
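The delta-encoding idea mentioned in the answer above can be sketched as follows. This is an illustrative toy, not quant.cpp's real format (`delta_quantize_keys` is a made-up name): the first key is kept at full precision and only the residuals against the previously reconstructed key are quantized, exploiting the similarity of adjacent keys in the cache.

```python
import numpy as np

def delta_quantize_keys(K, bits=3):
    """Toy delta encoding for a sequence of key vectors.

    Row K[0] is the full-precision anchor; each later row is encoded as
    a low-bit min-max quantized residual against the previous
    reconstruction, so quantization error does not accumulate.
    """
    levels = (1 << bits) - 1
    recon = np.empty_like(K)
    recon[0] = K[0]                        # anchor key, stored unquantized
    for t in range(1, len(K)):
        d = K[t] - recon[t - 1]            # residual vs previous reconstruction
        lo, hi = d.min(), d.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((d - lo) / scale)     # integer codes 0..levels
        recon[t] = recon[t - 1] + q * scale + lo
    return recon

rng = np.random.default_rng(1)
base = rng.standard_normal(64)
# simulate slowly drifting keys across 16 cache positions
K = np.cumsum(rng.standard_normal((16, 64)) * 0.05, axis=0) + base
err = np.abs(delta_quantize_keys(K) - K).mean()
```

Because adjacent residuals are much smaller than the keys themselves, 3 bits per element can cover them with a fine scale, which is how a delta scheme can stay accurate at compression ratios where direct low-bit quantization degrades.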