
Symmetric turbo3 KV catastrophic on Qwen2.5 (GQA 7:1) — auto-asymmetric fix proposed #54

@signalnine

Description

Problem

Symmetric turbo3 quantization of both K and V produces catastrophic perplexity on Qwen2.5-7B (PPL 2,887 vs. baseline 7.43). This has never worked: bisect confirms PPL 10,829 at the very first CUDA turbo commit (97dddac).

Root cause

Qwen2.5-7B has 4 KV heads serving 28 Q heads (GQA ratio 7:1). Each K head's turbo3 quantization error is broadcast to 7 Q heads, amplifying its effect on attention scores. Models with GQA ratio ≤ 4:1 are unaffected (Mistral-7B, 8 KV heads: turbo3 PPL 7.71, only +4.4% over baseline).
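To make the threshold concrete, here is a minimal sketch of the ratio check (helper names `gqa_ratio` / `needs_asymmetric_kv` are illustrative, not from the branch); the ratio is simply how many query heads share each K/V head, i.e. how many times a K head's quantization error is replicated:

```cpp
#include <cstdint>

// Number of query heads that share one K/V head. Illustrative helper,
// not the actual function name in the proposed patch.
static uint32_t gqa_ratio(uint32_t n_head, uint32_t n_head_kv) {
    return n_head / n_head_kv;
}

// Threshold from the proposed fix: ratios >= 6 trigger the auto-asymmetric
// upgrade of K to q8_0.
static bool needs_asymmetric_kv(uint32_t n_head, uint32_t n_head_kv) {
    return gqa_ratio(n_head, n_head_kv) >= 6;
}
```

Qwen2.5-7B (28 Q / 4 KV) gives a ratio of 7 and trips the check; Mistral-7B (32 Q / 8 KV) gives 4 and does not.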

Bisect data

Commit    Label                    Qwen turbo3/turbo3 PPL
97dddac   first VEC FA fix         10,829
f2b3936   asymmetric KV support    2,887
53180d9   block-size 128           106,035
fe15d61   InnerQ equalization      2,886
bc05a68   current HEAD             2,887

It has never worked. The asymmetric KV mode (q8_0 K + turbo3 V) was introduced as a manual workaround for exactly this issue.

Proposed fix

Auto-detect high GQA ratio at KV cache init and silently upgrade K to q8_0:

const uint32_t gqa_ratio = n_head / n_head_kv;
if (gqa_ratio >= 6 && type_k == type_v && k_is_turbo) {
    type_k = GGML_TYPE_Q8_0;  // auto-asymmetric
}

Results with fix:

  • Qwen2.5-7B turbo3/turbo3: PPL 7.06 (was 2,887), NIAH 5/5
  • Mistral-7B: unaffected (GQA 4:1, below threshold)
  • Opt-out: TURBO_AUTO_ASYMMETRIC=0
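A sketch of how the opt-out could gate the upgrade, assuming `TURBO_AUTO_ASYMMETRIC` is read via `getenv` and that any value other than `0` leaves the feature on (the exact parsing in the branch may differ):

```cpp
#include <cstdlib>
#include <cstring>

// Assumed semantics: auto-asymmetric is on by default; setting
// TURBO_AUTO_ASYMMETRIC=0 disables it. Parsing details are a guess.
static bool auto_asymmetric_enabled(void) {
    const char * env = std::getenv("TURBO_AUTO_ASYMMETRIC");
    return env == nullptr || std::strcmp(env, "0") != 0;
}
```

The gate would then wrap the GQA-ratio check shown above, so users who prefer symmetric turbo3 (e.g. to reproduce the broken baseline) can still get it.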

Implementation: https://github.com/signalnine/llama-cpp-turboquant/tree/fix/auto-asymmetric-gqa
Also included in PR #53.

Affected models

Any model with n_head_kv ≤ 4 and n_head ≥ 24 (GQA ≥ 6:1):

  • Qwen2.5 family (4 KV heads, 28 Q heads)
  • Qwen2 family (same architecture)
  • Potentially other models with aggressive GQA
