TurboQuant KV Cache Compression — Full HIP/ROCm Port (gfx1100) #21526
Replies: 19 comments 74 replies
-
Excellent! I'll test this and report. I also have a 7900 XTX Linux system, but with a more modern ROCm version (7.2.1). I had tried to get it to work on a fork of TheTom's turboquant branch, but since this is out of my wheelhouse I had Claude give it a go (thus not admissible as a PR). I did manage to get it working, but it's not a clean implementation. I was hoping it would serve as inspiration for someone to properly tackle it. And then your PR popped up :) Just in case there's anything remotely useful in it, here is my fork: https://github.com/stragulus/llama-cpp-turboquant/tree/bug/rocm-vram-leakage
-
Great, I will test both on my Strix Halo gfx1151. Regarding TheTom's: that fork has now also implemented weight quantization, so it would be great to have both things working on our AMDs :). Thanks for sharing!
-
Gemma 4 31B Dense — TurboQuant preliminary results

I also tested Gemma 4 31B Dense (Q4_K_M, bartowski GGUF) on RX 7900 XTX / ROCm 6.4. Unlike the earlier Gemma 4 26B A4B MoE result, I did not see catastrophic quality collapse. PPL (WikiText-2 raw, ctx=4096):

This is an instruct model evaluated on raw WikiText-2, so the absolute PPL values are not directly comparable to standard base-model perplexity runs. That said, I am not interpreting this as "turbo3 improves Gemma 4 quality". I suspect the surprising ordering is more likely related to the evaluation regime (instruct model on raw WikiText) and/or Gemma-specific evaluation behavior in llama.cpp. There is also an open upstream issue about Gemma final logit softcapping potentially not working correctly (#21388), so I'm treating these 31B Dense PPL results as preliminary until that is clarified. The safer conclusion is narrower:
Chat quality (turbo3 all layers)

3-turn smoke test (thermodynamics laws, IEEE 754 significand reasoning, Python precision demo): all responses coherent, factually correct, well-structured. Generation speed: 24–25 t/s.

Practical implications for 24GB GPUs
Summary: Gemma 4 family + TurboQuant
-
TriAttention KV Cache Pruning — 32K context validated

Found and fixed a critical bug: the calibration script was storing an incorrect theta. Results with correct theta (Qwen3-8B Q4_K_M, WikiText-2, 20 chunks, RX 7900 XTX, ROCm 6.4):

75% retention holds under +1% PPL degradation from 4K to 32K. Even 50% retention (half the KV cache evicted) stays under +3%. Physical compaction at 50% gives a 28% wall-time speedup because Flash Attention operates on a shorter, contiguous cache after pruning. Also added
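The physical-compaction step can be sketched as follows. This is a toy illustration of the idea (score tokens, keep the top half in original order, copy the survivors into a contiguous prefix so Flash Attention sees a shorter cache); all names are hypothetical, not the branch's actual API:

```python
# Toy sketch of KV pruning with physical compaction at 50% retention.
# Hypothetical names; a stand-in for the real scoring and cache layout.
def prune_and_compact(kv, scores, retention=0.5):
    """kv: list of per-token cache entries; scores: per-token importance.
    Returns a contiguous, order-preserving cache of the retained tokens."""
    n_keep = int(len(kv) * retention)
    # indices of the highest-scoring tokens, restored to original order
    keep = sorted(sorted(range(len(kv)), key=lambda i: scores[i], reverse=True)[:n_keep])
    return [kv[i] for i in keep]

cache = list(range(4096))              # stand-in for 4096 cached tokens
scores = [i % 7 for i in range(4096)]  # stand-in importance scores
compacted = prune_and_compact(cache, scores)
print(len(compacted))  # 2048: half the cache, contiguous after compaction
```

The speedup comes from the last step: because survivors are copied into a dense prefix, attention kernels iterate over 2048 contiguous entries instead of a masked 4096-entry cache.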
-
I have a Strix Halo (128GB, NixOS, 6.19.11 kernel, ROCm 7.2.1) and just built this. I'm more than happy to help, experiment, and collaborate. Just let me know what benchmark script(s), models, etc. would be helpful.
-
Thanks for building and offering to help! Here's what would be most useful:

1. PPL comparison (f16 vs turbo3):

./scripts/get-wikitext-2.sh
# baseline
./build/bin/llama-perplexity -m your-model.gguf -f wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99
# turbo3
./build/bin/llama-perplexity -m your-model.gguf -f wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3

2. VRAM monitoring during a long session — stragulus reported VRAM growth on Ubuntu that I can't reproduce on openSUSE. With your NixOS + ROCm 7.2.1 that's a third data point:

watch -n 5 rocm-smi --showmeminfo vram

while running llama-server with turbo3 KV and sending multiple requests.

3. Any model you have handy — Qwen3.5-27B, Llama 3, Gemma 4 are all interesting. Strix Halo with 128GB is especially useful for long context tests.

The startup log lines showing KV buffer sizes and the build warnings (if any) would also be helpful for gfx1151 validation.
-
Update: Full benchmark results (2026-04-10, GSM8K corrected 2026-04-11)

Since the initial post, I've run significantly more thorough benchmarks on the current codebase. All tests on RX 7900 XTX, ROCm 6.4.

GSM8K Math (Qwen3.5-27B Q5_K_M, temperature=0)

Both validated on the full 1319-problem GSM8K set. Difference is +0.1% — within statistical noise. turbo3 does not degrade math reasoning at 5.12× KV compression.

Needle-in-a-Haystack (Qwen3.5-27B, turbo3 K+V)

20/20 passed across 2K–16K context at all depths (0%, 25%, 50%, 75%, 100%). No retrieval degradation.

Tool Calling (Qwen3.5-27B, turbo3 K+V)

15/15 passed — correct tool selection and parameter extraction across 5 tool types.

WikiText-2 PPL (Qwen3.5-27B Q5_K_M)

turbo3 matches or slightly beats f16 at 16K context.

Speed

Gemma 4 findings

Best Gemma 4 config: K-side attention sharpening (α=1.036) is critical for symmetric turbo3 configs — without it, GSM8K drops from 83% to 74%.

New: Boundary V hybrid architecture fix

Fixed the known bug where boundary layer detection used raw layer index instead of KV layer ordinal. On Qwen3.5-27B (64 total layers, 16 KV layers), boundary V now correctly targets the first/last KV attention layers. Added mode 8 for symmetric configs.

Bug fixes
Repo: https://github.com/domvox/llama.cpp-turboquant-hip (branch:
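The KV-ordinal mapping behind the boundary-V fix can be sketched as follows. The exact 3:1 DeltaNet/attention layout is assumed for illustration (every 4th layer full attention), and the function names are hypothetical, not the branch's actual code:

```python
# Toy sketch: on a hybrid model, map a raw layer index to its KV
# (full-attention) ordinal and detect boundary layers among KV layers only.
def is_attention_layer(layer: int) -> bool:
    # assumed layout: every 4th layer (3, 7, ..., 63) is full attention
    return layer % 4 == 3

def kv_ordinal(layer: int) -> int:
    """Number of full-attention layers strictly before `layer`."""
    return sum(1 for l in range(layer) if is_attention_layer(l))

def is_boundary_kv_layer(layer: int, n_layers: int = 64) -> bool:
    """True for the first and last KV attention layers (the fixed behavior),
    not for raw layer indices 0 and n_layers-1 (the old bug)."""
    if not is_attention_layer(layer):
        return False
    n_kv = sum(1 for l in range(n_layers) if is_attention_layer(l))
    return kv_ordinal(layer) in (0, n_kv - 1)

# 64 total layers, 16 KV layers: boundaries are raw layers 3 and 63.
boundaries = [l for l in range(64) if is_boundary_kv_layer(l)]
print(boundaries)  # [3, 63]
```

With the old raw-index logic, layer 0 (a DeltaNet layer with no KV cache) would have been targeted and the first real KV layer missed.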
-
Update on the VRAM growth issue (@stragulus)

I dug into the FA kernel code and I think I found the root cause of the VRAM growth you reported.

The TILE FA kernel passes need_f16_K=true, need_f16_V=true to launch_fattn, which causes it to allocate a full f16 dequantization buffer via K_f16.alloc(ggml_nelements(K)). For a 262K context with 16 KV layers, that's ~16 GiB of temp buffer — on top of the turbo KV cache itself.

The VEC FA kernel sets need_f16_K = (type_K == GGML_TYPE_F16), so for turbo types it's false — no temp buffer, inline dequant instead.

On my setup (gfx1100, openSUSE), turbo types route to VEC, so I don't see the growth. If your setup routes to TILE (different GPU, different dispatch path, or different FA kernel selection), you'd get the full f16 temp allocation on every forward pass. This would explain:

- why VEC (your Vulkan fallback) didn't show the issue
- why the growth scales with KV cache size
- why it's not turbo-specific per se — any quantized KV type going through TILE would have the same behavior

The relevant code is in ggml/src/ggml-cuda/fattn-common.cuh lines ~1300-1360 and fattn-tile.cuh where launch_fattn is called with true, true.

Could you check which FA kernel your build selects? Look for lines like ggml_cuda_flash_attn_ext_vec or ggml_cuda_flash_attn_ext_tile in verbose output.
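For scale, a back-of-the-envelope check of that ~16 GiB figure, assuming the Qwen3.5-27B shape quoted elsewhere in this thread (16 KV layers, 4 KV heads, head_dim 256, K and V both dequantized to f16):

```python
# Rough size of the f16 dequant temp buffer the TILE path would allocate.
# Model shape is an assumption taken from the Qwen3.5-27B numbers above.
ctx_tokens = 262_144  # 262K context
kv_layers = 16
kv_heads = 4
head_dim = 256
f16_bytes = 2

per_token = kv_layers * kv_heads * head_dim * 2 * f16_bytes  # x2 for K + V
total_gib = ctx_tokens * per_token / 2**30
print(f"{total_gib:.1f} GiB")  # 16.0 GiB
```

So the temp buffer alone is roughly triple the compressed turbo3 cache it sits next to, which matches the growth pattern reported.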
-
I dug deeper into this as well, was able to reproduce your example, and debugged the issue. I plan to propose a fix, as I have multiple solutions weighing performance vs. VRAM usage impact, and will pick a smart default. The fix will work for any quantized KV cache and is indeed not turbo-related at all.
-
Great to hear you reproduced it and are working on a fix. The fact that it affects any quantized KV type (not just turbo) makes it a much more impactful fix for the broader community. Happy to test any patches on gfx1100 / ROCm 6.4 if that helps.
-
Awesome! Thanks a lot for putting in the hard work to implement this. I was running:

Unoptimized (f16): 21824.00 MiB

gfx1151, ROCm 7.2.1
-
Thanks for this work. I tried it on my dual MI50 (ROCm 6.4) and didn't notice any particular issues. However, PP speed is roughly halved on qwen3-coder-next-Q4_K_M (1130 t/s -> 500 t/s), and TG speed drops from 41.5 t/s to 34.5 t/s. But it can definitely be helpful for large contexts. I tried turbo4 instead and it gave me exactly the same performance as turbo3. Hoping this helps!
-
TurboQuant + TriAttention combo: measured results (2026-04-11)

All tests on RX 7900 XTX 24GB, ROCm 6.4, openSUSE Tumbleweed. Qwen3 thinking mode disabled via

GSM8K math reasoning

Model: Qwen3.5-27B Q5_K_M, temp=0, full 1319-problem test set.

No measurable reasoning degradation on GSM8K from TriAttention pruning in this run.

Single-needle NIAH retrieval

TriAttention params:

¹ 8B tested at 4K+8K only in the ≤12K range (no 2K/12K points in this run). Single failure at 4K d=0.75. Effective KV compression with combo: ~6.8×.

Observed limitations

Repos
-
Hi @domvox,

First, thank you for this work on TurboQuant HIP — the KV cache compression is exactly what 24GB GPU owners need for long context. I've been testing your port on my RX 7900 XTX (ROCm 7.2.1) and turbo3 works as advertised.

I wanted to bring something to your attention. Are you aware of @lhl's

Your build doesn't use

lhl's PR was never merged — @JohannesGaessler declined it in November 2025, saying he planned to rewrite the WMMA kernel as native MMA within ~1 month. That was 5 months ago and the replacement hasn't materialized, leaving RDNA3 users without these optimizations in upstream.

I attempted to merge both branches to get TurboQuant + optimized WMMA in one build. Here's what I found:

Attempt 1 — lhl as base, turboquant merged on top:

Attempt 2 — turboquant as base, lhl merged on top:

The fundamental issue is that both forks diverged from different upstream commits, and the FA pipeline changed enough between them that neither merge direction applies cleanly.

Suggestion: if you're planning further development, rebasing TurboQuant on top of lhl's

This could also help build the case for finally merging lhl's PR upstream — if multiple community projects depend on it, there's a stronger incentive to get the WMMA fixes into mainline rather than waiting indefinitely for the MMA rewrite.

Happy to test any builds on my 7900 XTX. Thanks again — turbo3 at 80K context on a 24GB card is a game changer.
-
I have an RTX 3090 and a 7900 XTX at my disposal, and as I said before in another thread: the Ampere lineup can only do fp16, int8, and int4, same with RDNA3 and RDNA4 AMD GPUs. What we are missing is fp8 and 3-bit and sub-1-bit support, which is physically impossible on these architectures and which is a requirement for TurboQuant's 3.5-bit quantization, i.e. what the whole paper is about. There might be ways to possibly achieve this, but at the cost of overhead and workarounds that will cause other problems.
-
TurboQuant PPL quality benchmarks — long context analysis

Following up on the throughput numbers, here are perplexity measurements across different models, context lengths, and KV cache configurations. All tests on RX 7900 XTX 24GB, ROCm 6.4, Docker build, wikitext-2 test set, 5 chunks per measurement.

Qwen3-8B Q4_K_M (full RoPE, rope_theta=1M, 8 KV heads, head_dim=128)
Llama-3.1-8B base Q4_K_M (full RoPE, rope_theta=500K, 8 KV heads, head_dim=128)
Qwen3.5-27B Q5_K_M (partial_rotary_factor=0.25, 4 KV heads, head_dim=256)
Analysis: K cache is the sensitive component

The K/V ablation on Qwen3-8B isolates the problem clearly:
The likely mechanism is that llama.cpp stores K after RoPE. RoPE applies position-dependent rotations to K channel pairs, making the post-RoPE K distribution harder to quantize — especially with only 8 centroids (turbo3). turbo4's 16 centroids handle this well in all tested configurations. The severity is model-dependent. Qwen3-8B shows catastrophic turbo3 K failure at 16K, Llama-3.1-8B shows moderate regression (+8.4%), and Qwen3.5-27B with partial RoPE (only 25% of K dimensions rotated) shows no measurable regression. Multiple factors likely contribute: RoPE coverage, rope_theta, head dimension, and the model's learned K distribution. Layer-adaptive mode with first+last 4 layers at q8_0 prevents the catastrophic collapse on Qwen3-8B (19.70 → 7.06), suggesting boundary layers are particularly sensitive. This is consistent with prior work on post-RoPE K quantization sensitivity (KVQuant, Berkeley) and frequency-dependent quantization error in RoPE channels (Q-ROAR). Practical recommendations
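A toy illustration of why post-RoPE K is harder to quantize with few levels: rotation mixes each channel pair's large and small components, so a shared absmax scale loses the small channel's precision. This uses a deliberately simplified uniform quantizer as a stand-in for an 8-centroid (turbo3-like) code, not the actual TurboQuant codebook:

```python
import math

# Simplified shared-scale quantizer standing in for few-centroid codes.
def quantize_pair(x, y, levels=8):
    amax = max(abs(x), abs(y))
    if amax == 0.0:
        return x, y
    step = amax / (levels // 2)
    return round(x / step) * step, round(y / step) * step

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

dim, pos, base = 64, 1000, 10000.0
# toy key: each channel pair has one large and one tiny component
k = [10.0 if i % 2 == 0 else 0.01 for i in range(dim)]

# (a) quantize the raw (pre-RoPE) key
direct = []
for i in range(0, dim, 2):
    direct += quantize_pair(k[i], k[i + 1])

# (b) rotate each pair by the RoPE angle first (llama.cpp stores K
# post-RoPE), quantize, then rotate back to compare against the original
recovered = []
for i in range(0, dim, 2):
    t = pos * base ** (-i / dim)
    c, s = math.cos(t), math.sin(t)
    qx, qy = quantize_pair(k[i] * c - k[i + 1] * s, k[i] * s + k[i + 1] * c)
    recovered += [qx * c + qy * s, -qx * s + qy * c]

pre_err, post_err = mse(k, direct), mse(k, recovered)
print(pre_err, post_err)  # post-RoPE error is far larger in this toy
```

The real mechanism is more subtle (learned centroids, per-model K distributions), but the toy captures why doubling the centroid count (turbo4) and sparing RoPE-heavy boundary layers both help.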
For models with

Note: PPL on wikitext-2 is not a complete proxy for long-context retrieval or generation quality. These numbers should be treated as directional guidance.
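For reference, the layer-adaptive mode mentioned in the analysis (first+last 4 layers at q8_0, the rest at a turbo type) amounts to a simple per-layer type policy. A minimal sketch with hypothetical names (36 layers is Qwen3-8B's depth):

```python
# Sketch of a layer-adaptive KV type policy. The function and its interface
# are illustrative, not the port's actual configuration mechanism.
def kv_type_for_layer(layer, n_layers, boundary=4,
                      boundary_type="q8_0", default_type="turbo3"):
    """Keep the quantization-sensitive boundary layers at higher precision."""
    if layer < boundary or layer >= n_layers - boundary:
        return boundary_type
    return default_type

# Qwen3-8B (36 layers): layers 0-3 and 32-35 stay q8_0, the rest turbo3.
types = [kv_type_for_layer(l, 36) for l in range(36)]
print(types.count("q8_0"), types.count("turbo3"))  # 8 28
```

The memory cost is modest: only 8 of 36 layers pay the q8_0 price, while the boundary layers that drove the 19.70 → 7.06 collapse stay at higher precision.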
-
I tested the branch locally (last commit I tested was f229a36) on a gfx1200 9060 XT and observed similar performance and quality results as others. However, I was going to post all the results until I realized that, for some reason, peak VRAM usage wasn't really different. Concretely, with unsloth/gemma-4-31B-it-GGUF:UD-IQ3_XXS I was able to go up to a context size of 30000 with my setup and llama-bench, with both f16 and turbo4/turbo3. In fact, turbo crashes at 31000 with not enough VRAM, while f16 still works at 31000 with just enough. Commands I used (which may be wrong, I am new to this):

I tried some other models and quantizations, but had similar VRAM usage observations.
-
TriAttention scoring update: GPU consistency fix + NIAH benchmark

GPU scoring fix

The HIP q8_0 scoring kernel was computing a different angle than the CPU path. After passing

Commit:

NIAH retrieval benchmark (Qwen3-8B)

Qwen3-8B Q4_K_M, q8_0 KV cache, no thinking mode, temp=0. Budget: 512 scored old tokens + 512 recent window (~1024 retained). Diverse haystack with random needle codes and distractor facts, 5 reps per cell.

Scored eviction answers the needle correctly about twice as often as random (80% vs 38%). The scored failures we inspected were near-misses: the model found the needle but dropped a digit (e.g.

Small initial benchmark (5 reps per cell): the gap is directionally consistent across context lengths and depths in this run.

Hybrid model observation (Qwen3.5-27B)

On Qwen3.5-27B (3:1 Gated DeltaNet / gated-attention), scored and random eviction produced the same NIAH results in this exact setup: both 4/6 at 8K, 4/6 at 16K, 2/3 at 25K. One possible explanation is that the recurrent DeltaNet path carries enough retrieval signal at these lengths that attention-KV eviction differences are not visible in this benchmark. Not tested beyond 25K yet; treating this as an observation rather than a conclusion.
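The retention policy used in this benchmark (512 scored old tokens + a 512-token recent window) can be sketched as follows; function and variable names are illustrative, not the branch's actual API:

```python
import random

# Toy sketch of budget-based scored eviction: always keep a recent window,
# keep the highest-scoring older tokens up to a budget, evict the rest.
def tokens_to_keep(scores, window=512, budget=512):
    """scores: one importance score per cached token, oldest first.
    Returns sorted indices of the tokens retained in the KV cache."""
    n = len(scores)
    recent = list(range(max(0, n - window), n))  # recent window: always kept
    old = range(max(0, n - window))              # older tokens compete on score
    kept_old = sorted(old, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(kept_old + recent)

random.seed(0)
scores = [random.random() for _ in range(8192)]  # stand-in scores, 8K context
kept = tokens_to_keep(scores)
print(len(kept))  # 1024 retained: 512 scored old tokens + 512 recent
```

Random eviction corresponds to replacing the score-based sort with a shuffle of the old tokens, which is the 38% baseline above.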
-
TriAttention scoring: budget sweep and Qwen3.5 hybrid check

Follow-up to my earlier NIAH result after the GPU scoring fix. Two extra checks:

1. Budget Sensitivity (Qwen3-8B, 8K context, 18-trial NIAH)

Setup: Qwen3-8B Q4_K_M, q8_0 KV cache, FA, temp=0, no thinking mode, scoring interval=128, window=128. Same NIAH generator as the previous 80% vs 38% post.

At budget=1024 (1152 total retained, ~14% of 8K context), all near-miss failures disappear. The misses I inspected were near the cutoff: needle tokens scored around z=2.88-2.99. In this setup, a larger budget was enough to keep them. For retrieval-sensitive workloads with tight budgets,

2. Hybrid SSM+Attention (Qwen3.5-27B)

Setup: Qwen3.5-27B Q4_K_M, q8_0 KV cache, FA, temp=0, no thinking mode, budget=512, window=128. Qwen3.5-27B is a Gated DeltaNet + full attention hybrid (3:1 ratio, 16 full-attention layers out of 64 total).

Scored eviction matched random eviction at every tested context length. I do not know the root cause yet. One obvious limitation is partial RoPE: Qwen3.5 rotates 64 of 256 K dimensions, so this score only directly sees 25% of each key. I have not separated that from other hybrid-model effects. For this hybrid model/configuration, I would use KV quantization before eviction. With only 16 full-attention layers (vs 64 total), full KV cache at 16K context is ~1 GiB — already much smaller than a same-shape pure transformer. Quantizing that (q8_0, q4_0, turbo) keeps all tokens and avoids the retrieval loss seen here.

3. Scoring Ablations

Tried several scoring changes against the budget=512 baseline (89%) on Qwen3-8B:

Current defaults were the best of the variants I tried. The consistent improvement came from increasing the retained-token budget, not from changing the scoring formula.

Branch

Updated branch:
-
I ported TurboQuant KV cache compression (Zandieh et al., ICLR 2026) to HIP/ROCm on clean llama.cpp HEAD (b8680). The original fork hung for me on HIP; this clean port onto mainline HEAD does not.

Repo: https://github.com/domvox/llama.cpp-turboquant-hip
Branch: feature/turboquant-hip-port-clean (commit 6a8df6c)
Paper: https://arxiv.org/abs/2504.19874
Hardware / Software
b8680

What's included
TURBO2_0 (2-bit, 6.4× compression), TURBO3_0 (3-bit, 4.9×), TURBO4_0 (4-bit, 3.8×)

Benchmark Results
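As a sanity check, the stated ratios can be converted to effective bits per stored value (relative to 16-bit f16). The margin above the nominal code width would be per-block metadata such as scales or centroids; that split is an inference here, only the ratios come from the post:

```python
# Effective bits per stored value implied by the stated compression ratios.
ratios = {"TURBO2_0": (2, 6.4), "TURBO3_0": (3, 4.9), "TURBO4_0": (4, 3.8)}
effective = {}
for name, (nominal_bits, ratio) in ratios.items():
    bits = 16 / ratio
    effective[name] = bits
    print(f"{name}: {bits:.2f} bits/value (~{bits - nominal_bits:.2f} bits overhead)")
```

So turbo3 stores roughly 3.27 bits per value, i.e. about a quarter bit of overhead on top of its 3-bit codes.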
Perplexity — Qwen3.5-9B Q4_K (Wikitext-2, 145 chunks, ctx=2048)
Perplexity — Qwen3.5-27B Q5_K_M (Wikitext-2, 20 chunks, ctx=2048)
Smoke-check run (no f16 baseline for 27B yet — comparative results pending):
Throughput — Qwen3.5-27B Q5_K_M (16K context)
Throughput is within 1% — turbo3 is essentially free in terms of speed.
VRAM — The Hero Case (27B Q5_K_M @ 80K context, 24 GB GPU)
This is the real value proposition: turbo3 lets you run long-context workloads that simply don't fit with f16 KV cache.
Baseline Regression Check
f16 and q8_0 throughput on this branch is identical to clean mainline. Zero regression.
Test Matrix
Known Limitations
HIP_VISIBLE_DEVICES=0.

Not Yet Tested
Build Instructions
Upstream Plan
This is an experimental port. The goal is to upstream useful pieces to ggml-org/llama.cpp as small, reviewable PRs. Feedback welcome from maintainers and anyone running HIP — especially other AMD GPU owners (RDNA2, RDNA3, RDNA4, MI-series).
Based on TheTom/llama-cpp-turboquant and Discussion #20969.
Standalone HIP kernel benchmark: https://github.com/domvox/turboquant-hip