## Problem
Symmetric turbo3 K+V produces catastrophic PPL on Qwen2.5-7B (PPL 2,887 vs baseline 7.43). This has never worked — bisect confirms PPL 10,829 at the very first CUDA turbo commit (97dddac).
## Root cause
Qwen2.5-7B has 4 KV heads serving 28 Q heads (GQA ratio 7:1). Each K head's turbo3 quantization error gets broadcast to 7 Q heads, amplifying the error. Models with GQA ≤ 4:1 are unaffected (Mistral-7B: 8 KV heads, turbo3 PPL = 7.71, only +4.4%).
## Bisect data
| Commit | Label | Qwen turbo3/turbo3 PPL |
|---|---|---|
| 97dddac | first VEC FA fix | 10,829 |
| f2b3936 | asymmetric KV support | 2,887 |
| 53180d9 | block-size 128 | 106,035 |
| fe15d61 | InnerQ equalization | 2,886 |
| bc05a68 | current HEAD | 2,887 |
Never worked. The asymmetric KV support (q8_0-K + turbo3-V) was introduced as a manual fix for exactly this issue.
## Proposed fix
Auto-detect high GQA ratio at KV cache init and silently upgrade K to q8_0:
```cpp
const uint32_t gqa_ratio = n_head / n_head_kv;
if (gqa_ratio >= 6 && type_k == type_v && k_is_turbo) {
    type_k = GGML_TYPE_Q8_0; // auto-asymmetric
}
```
Results with fix:
- Qwen2.5-7B turbo3/turbo3: PPL 7.06 (was 2,887), NIAH 5/5
- Mistral-7B: unaffected (GQA 4:1, below threshold)
- Opt-out: `TURBO_AUTO_ASYMMETRIC=0`
Implementation: https://github.com/signalnine/llama-cpp-turboquant/tree/fix/auto-asymmetric-gqa
Also included in PR #53.
## Affected models
Any model with n_head_kv ≤ 4 and n_head ≥ 24 (GQA ≥ 6:1):
- Qwen2.5 family (4 KV heads, 28 Q heads)
- Qwen2 family (same architecture)
- Potentially other models with aggressive GQA