TurboQuant KV Cache Compression — Full HIP/ROCm Port (gfx1100) #21526

domvox · 2026-04-06T18:48:41Z

domvox
Apr 6, 2026

I ported TurboQuant KV cache compression (Zandieh et al., ICLR 2026) to HIP/ROCm on clean llama.cpp HEAD (b8680). The original fork hung for me on HIP; this clean port onto mainline HEAD does not.

Repo: https://github.com/domvox/llama.cpp-turboquant-hip
Branch: feature/turboquant-hip-port-clean (commit 6a8df6c)
Paper: https://arxiv.org/abs/2504.19874

Hardware / Software

GPU: AMD Radeon RX 7900 XTX (gfx1100, 24 GB VRAM)
ROCm: 6.4
CPU: Ryzen 9 9950X3D
OS: openSUSE Tumbleweed
Base: llama.cpp HEAD b8680

What's included

New GGML types: TURBO2_0 (2-bit, 6.4× compression), TURBO3_0 (3-bit, 4.9×), TURBO4_0 (4-bit, 3.8×)
CPU quantize/dequantize + Walsh-Hadamard Transform
Full HIP/CUDA kernel port: set-rows, convert, dequantize, turbo-wht, turbo-innerq, mmvq-tq, flash attention instances (all 16 K×V combinations)
KV cache integration with llama-kv-cache, llama-graph WHT rotation
llama-bench turbo type support

Benchmark Results

Perplexity — Qwen3.5-9B Q4_K (Wikitext-2, 145 chunks, ctx=2048)

KV type	PPL	Δ vs f16
f16	7.152 ± 0.046	baseline
q8_0	7.149 ± 0.046	−0.04%
turbo4	7.228 ± 0.047	+1.06%
turbo3	7.236 ± 0.047	+1.17%

Perplexity — Qwen3.5-27B Q5_K_M (Wikitext-2, 20 chunks, ctx=2048)

Smoke-check run (no f16 baseline for 27B yet — comparative results pending):

KV type	PPL
turbo3	6.905 ± 0.126

Throughput — Qwen3.5-27B Q5_K_M (16K context)

KV type	Prompt (t/s)	Decode (t/s)
f16	395	29.8
turbo3	394	29.6

Throughput is within 1% — turbo3 is essentially free in terms of speed.

VRAM — The Hero Case (27B Q5_K_M @ 80K context, 24 GB GPU)

KV type	Result
f16	OOM (needs ~26 GiB)
turbo3	Runs — 314 t/s pp, 29.4 t/s tg (~20 GiB)

This is the real value proposition: turbo3 lets you run long-context workloads that simply don't fit with f16 KV cache.

Baseline Regression Check

f16 and q8_0 throughput on this branch is identical to clean mainline. Zero regression.

Test Matrix

Test	Status
Build (GGML_HIP=ON, HEAD b8680)	✅
Baseline regression (f16/q8_0 vs mainline)	✅
All 16 K×V combos on GPU (llama-bench)	✅
PPL 27B turbo3 (Wikitext-2, 20 chunks)	✅ 6.905 ± 0.126
Hero benchmark (27B @ 80K, OOM vs runs)	✅
Text generation (llama-completion, turbo3)	✅
Multi-turn conversation (5 turns, llama-cli -cnv)	✅
llama-server (OAI-compatible API)	✅
Context shift / KV eviction (ctx=256, overflow)	✅
Cross-architecture: Llama 3 8B	✅
Cross-architecture: Mistral 7B v0.3	✅
turbo2 (2-bit) on GPU	✅
CPU-only fallback (ngl=0)	✅

Known Limitations

Asymmetric KV (e.g. f16 K + turbo3 V) works but falls back to the slow non-fused flash attention path. Symmetric combos use the fast fused path.
Tested on gfx1100 only (RDNA3). Other AMD GPUs not yet validated — feedback welcome.
iGPU crash: The integrated GPU in Ryzen 9950X3D crashes even on clean mainline. Workaround: HIP_VISIBLE_DEVICES=0.
Concurrent sessions / parallel decoding: Not yet thoroughly tested.

Not Yet Tested

Extensive load testing under llama-server
Models beyond Qwen3.5, Llama 3, Mistral (Gemma, Phi, etc.)
turbo2 perplexity validation

Build Instructions

git clone https://github.com/domvox/llama.cpp-turboquant-hip.git
cd llama.cpp-turboquant-hip
git checkout feature/turboquant-hip-port-clean

cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j16

# Run with turbo3 KV cache:
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-cli \
  -m your-model.gguf -c 4096 -ngl 99 -cnv \
  --cache-type-k turbo3 --cache-type-v turbo3

Upstream Plan

This is an experimental port. The goal is to upstream useful pieces to ggml-org/llama.cpp as small, reviewable PRs. Feedback welcome from maintainers and anyone running HIP — especially other AMD GPU owners (RDNA2, RDNA3, RDNA4, MI-series).

Based on TheTom/llama-cpp-turboquant and Discussion #20969.
Standalone HIP kernel benchmark: https://github.com/domvox/turboquant-hip

stragulus · 2026-04-06T19:08:57Z

stragulus
Apr 6, 2026

Excellent! I'll test this and report, I also have a 7900XTX linux system but with a modere modern rocm version (7.2.1). I had tried to get it to work on fork of TheTom's turboquant branch, but since this is out of my wheelhouse I had claude give it a go (thus not admissible as a PR). I did manage to get it working but it's not a clean implementation. I was hoping it would serve as inspiration for someone to properly tackle it. And then your PR popped up :) Just in case there's anything remotely useful in it, here is my fork: https://github.com/stragulus/llama-cpp-turboquant/tree/bug/rocm-vram-leakage

12 replies

stragulus Apr 6, 2026

That is good news! Thanks for checking. Combined with this datapoint and my own changes I'll hopefully be able to narrow it down further.

stragulus Apr 8, 2026

I just confirmed that the same issue also exists when using rocm-6.40 on ubuntu 24.04 - so sadly we can't rule out rocm as the culprit here.

domvox Apr 8, 2026
Author

Interesting — so the VRAM growth reproduces on ROCm 6.4 + Ubuntu 24.04 but not on my ROCm 6.4 + openSUSE Tumbleweed (kernel 6.19.11). That narrows it down to something distro/kernel-specific rather than ROCm version. What kernel are you running on Ubuntu?

stragulus Apr 8, 2026

Kernel is: Linux rocm-test 6.17.0-20-generic #20~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 19 01:28:37 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I've been pestering QWen-3.5 using my own vram leak patched version to figure out the code base, and all signs point to the flash attention kernels in llama.cpp lazily reserving the entire kv cache size in a buffer that holds the dequantized f16/f16 k/v data, even if they only need to process entries one by one. But it also looks like this should then impact literally everyone who would have flash attention turned on and using quantized k/v parameters. In which case I would have expected many more people to report that than the bug report seems to indicate. The only FA kernel that does NOT do this is the VEC kernel, which looks like it's the lowest prio fallback if nothing else is available (this was the case for me when I used the vulkan port someone else on here made, thus no vram leakage there).

Now I know most people here are just loading synthetic benchmarks as the focus is not yet on making this production ready, so it could easily escape attention. But if one runs llama-bench with, say, a 30k prompt size and -fa 1 - it SHOULD show the vram issue for everyone. Would love it if someone could poke a hole in my understanding. I could very well be wrong.

domvox Apr 9, 2026
Author

Thanks for the detailed analysis — the FA lazy-allocation theory is interesting.

I just tried a matching single-run check on my side (gfx1100, ROCm 6.4, openSUSE Tumbleweed, kernel 6.19.11) with llama-bench -fa 1 -p 30000 -n 1 on Qwen3.5-27B Q5_K_M.

What I see here is:

f16: ~22.4 GB VRAM, 374.1 t/s prefill, 29.9 t/s tg
turbo3: ~20.9 GB VRAM, 372.5 t/s prefill, 29.3 t/s tg

On my setup both stay flat during the run and both return to baseline VRAM afterward, so I'm not seeing growth from a single 30K benchmark run.

So at least from this one datapoint, it doesn't look like "quantized KV + FA always leaks for everyone". That makes me think it may depend on some combination of:

distro / kernel,
the exact FA kernel path selected,
or whether the growth is cumulative across repeated requests rather than visible in a single run.

Your hypothesis still sounds plausible — I just can't reproduce it under the same high-level conditions on my machine.

hvico · 2026-04-06T19:15:58Z

hvico
Apr 6, 2026

Great, I will test both on my Strix Halo gfx1151. Regarding TheTom's, that fork now also implemented weight quantization, so it would be great to have both things working on our AMDs :).

Thanks for sharing!

1 reply

domvox Apr 6, 2026
Author

Thanks — that would be awesome.
Strix Halo / gfx1151 coverage would be really valuable, since I've only validated on gfx1100 so far. And yes, fully agree — having both KV-cache compression and weight quantization working on AMD would be a great outcome.
Please share results when you have them.

domvox · 2026-04-07T17:59:52Z

domvox
Apr 7, 2026
Author

Gemma 4 31B Dense — TurboQuant preliminary results

I also tested Gemma 4 31B Dense (Q4_K_M, bartowski GGUF) on RX 7900 XTX / ROCm 6.4.

Unlike the earlier Gemma 4 26B A4B MoE result, I did not see catastrophic quality collapse from turbo3 applied to all KV layers on this model.

PPL (WikiText-2 raw, ctx=4096)

Config	20 chunks	72 chunks (full)	KV VRAM
f16 KV	17,508 ± 661	23,300 ± 464	1520 MiB
turbo3 all layers	4,956 ± 187	6,987 ± 140	297 MiB

This is an instruct model evaluated on raw WikiText-2, so the absolute PPL values are not directly comparable to standard base-model perplexity runs. The unexpected turbo3 < f16 PPL ordering persists on the larger eval set, so it is not just a small-sample artifact.

That said, I am not interpreting this as "turbo3 improves Gemma 4 quality". I suspect the surprising ordering is more likely related to the evaluation regime (instruct model on raw WikiText) and/or Gemma-specific evaluation behavior in llama.cpp. There is also an open upstream issue about Gemma final logit softcapping potentially not working correctly (#21388), so I'm treating these 31B Dense PPL results as preliminary until that is clarified.

The safer conclusion is narrower:

on Gemma 4 26B A4B MoE, turbo3 on all layers was catastrophic and SWA bypass was needed
on Gemma 4 31B Dense, turbo3 on all layers does not show the same catastrophic failure in my tests
KV VRAM savings are still substantial: 297 MiB vs 1520 MiB (~80% reduction)

Chat quality (turbo3 all layers)

3-turn smoke test (thermodynamics laws, IEEE 754 significand reasoning, Python precision demo): all responses coherent, factually correct, well-structured. Generation speed: 24–25 t/s.

Practical implications for 24GB GPUs

Model weights (Q4_K_M): 18.7 GB
KV cache (turbo3, 4K ctx): 0.3 GB
Compute buffers: ~0.5 GB
Total: ~19.5 GB — fits comfortably on 24GB cards

Summary: Gemma 4 family + TurboQuant

Model	turbo3 all layers	SWA bypass needed?
26B A4B MoE	❌ catastrophic PPL	✅ yes, turbo3 global + f16 SWA
31B Dense	✅ no catastrophic failure in preliminary tests	❌ not needed in my tests

3 replies

baramofme Apr 11, 2026

First of all, thank you for creating the hip port version.
I switched over from TheTom's turboquantplus branch, so I am using turbo3 and turbo4 with cache-type-k and cache-type-v as they are.

How do I bypass SWA when using the Gemma4 26B A4B MoE version? Is it the asymmetric K/V setting? As shown below?
- "--cache-type-k"
- "q8_0"
- "--cache-type-v"
- "turbo3"
I would also like to try applying ggml compression. However, what should I use? I tried TQ3S_1 and TQ4S_1 from the TheTom branch.
- TURBO3_0(3bit, 4.9x), TURBO4_0(4bit, 3.8x), TURBO2_0 (2bit, 6.4x),
- TQ3_1S, TQ4_1S => Is this the same as the one from TheTom's branch?
Is convert_hf_to_gguf.py the tool to use for model weight compression?And should I proceed exactly as described in TheTom's weight-compression-tq4.md file?
- https://github.com/domvox/llama.cpp-turboquant-hip/blob/main/convert_hf_to_gguf.py
- https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md#7-practical-use
I am currently using the server-rocm Docker image you created. To use TriAttention KV Cache Pruning, do I just need to set the kv cache type to turbo3/turbo3 as before?
- https://github.com/domvox/llama.cpp-turboquant-hip/pkgs/container/llama.cpp-turboquant-hip/790159687?tag=server-rocm

domvox Apr 11, 2026
Author

Thanks for trying it out! Answers:

1. SWA bypass for Gemma 4 26B A4B:
The config you showed (q8_0 K + turbo3 V) is not quite right. The best config I found:

--cache-type-k turbo3 --cache-type-v turbo3 --cache-type-k-swa turbo3 --cache-type-v-swa q8_0

This applies turbo3 to both global K and V, turbo3 to SWA K, and q8_0 to SWA V. The key finding: SWA V with turbo3 hurts quality, but SWA K is fine with turbo3. The --cache-type-k-swa and --cache-type-v-swa flags control the SWA layers separately.

2. Weight quantization (TQ3_1S, TQ4_1S):
The type definitions exist in the headers (merged from TheTom's branch), but they are not wired up in llama-quantize or convert_hf_to_gguf.py in my fork. So you can't use them for weight quantization here. For weight compression, use standard ggml types (Q4_K_M, Q5_K_M, etc.) via llama-quantize. TheTom's turboquantplus branch has the working weight quantization implementation.

3. convert_hf_to_gguf.py:
Correct observation — turbo types are not there because they're runtime KV cache compression, not model weight formats. Just use any standard GGUF and add --cache-type-k turbo3 --cache-type-v turbo3 when running the server. No model conversion needed.

baramofme Apr 11, 2026

답변 감사합니다!

domvox · 2026-04-09T13:20:31Z

domvox
Apr 9, 2026
Author

TriAttention KV Cache Pruning — 32K context validated

Found and fixed a critical bug: the calibration script was storing rope_theta=10000 instead of the model's actual rope_theta=1000000 (Qwen3-8B). This caused scoring frequencies to be 100× too fast, making phase predictions essentially random at long context.

Results with correct theta (Qwen3-8B Q4_K_M, WikiText-2, 20 chunks, RX 7900 XTX, ROCm 6.4):

ctx	Baseline	75% retention	Δ	50% retention	Δ
4096	8.1524	8.1735	+0.3%	8.2290	+0.9%
16384	8.1213	8.1550	+0.4%	8.2488	+1.6%
32768	8.7843	8.8630	+0.9%	9.0344	+2.8%

75% retention holds under +1% PPL degradation from 4K to 32K. Even 50% retention (half the KV cache evicted) stays under +3%.

Physical compaction at 50% gives 28% wall-time speedup because Flash Attention operates on a shorter, contiguous cache after pruning.

Also added --tri-sink N flag to unconditionally protect the first N tokens (attention sinks, StreamingLLM-style).

Repo: https://github.com/domvox/triattention-ggml

0 replies

mikiadev · 2026-04-10T13:12:11Z

mikiadev
Apr 10, 2026

I have a Strix Halo (128GB NixOS 6.19.11 kernel) ROCm 7.2.1 and just built this, I'm more than happy help, experiment and collaborate. just let me know what benchmark script(s), models etc. would be helpful?

1 reply

domvox Apr 11, 2026
Author

Thanks for offering! Strix Halo with 128GB unified memory is an incredible test platform — you can push context lengths none of us can reach on 24GB discrete.

Here's what would be most valuable:

Build check — does it compile cleanly on ROCm 7.2.1? All our testing has been on 6.4, so this alone is useful info.
Perplexity (f16 vs turbo3):

./llama-perplexity -m -f wikitext-2-raw/wiki.test.raw -c 4096 -ngl 99 --cache-type-k f16 --cache-type-v f16
./llama-perplexity -m -f wikitext-2-raw/wiki.test.raw -c 4096 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3

Long context stress test — with 128GB you could be the first to push turbo3 past 200K. Try increasing -c until it OOMs and report peak VRAM (rocm-
smi):

./llama-bench -m -p 512 -n 128 -ngl 99 -c 131072 --cache-type-k turbo3 --cache-type-v turbo3
./llama-bench -m -p 512 -n 128 -ngl 99 -c 262144 --cache-type-k turbo3 --cache-type-v turbo3

Generation speed — both prefill (pp) and token generation (tg):

./llama-bench -m -p 4096 -n 128 -ngl 99 --cache-type-k f16 --cache-type-v f16
./llama-bench -m -p 4096 -n 128 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3

Ideal model: Qwen3.5-27B Q5_K_M — we have full baselines for that one. Llama-3 70B Q4_K_M would also be great if you want to go bigger.

domvox · 2026-04-10T18:12:38Z

domvox
Apr 10, 2026
Author

Thanks for building and offering to help! Here's what would be most useful:

1. PPL comparison (f16 vs turbo3):

./scripts/get-wikitext-2.sh

# baseline
./build/bin/llama-perplexity -m your-model.gguf -f wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99

# turbo3
./build/bin/llama-perplexity -m your-model.gguf -f wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3

2. VRAM monitoring during a long session — stragulus reported VRAM growth on Ubuntu that I can't reproduce on openSUSE. With your NixOS + ROCm 7.2.1 that's a third data point:

watch -n 5 rocm-smi --showmeminfo vram

while running llama-server with turbo3 KV and sending multiple requests.

3. Any model you have handy — Qwen3.5-27B, Llama 3, Gemma 4 are all interesting. Strix Halo with 128GB is especially useful for long context tests.

The startup log lines showing KV buffer sizes and the build warnings (if any) would also be helpful for gfx1151 validation.

2 replies

LyraMoonDev Apr 15, 2026

RX 6700 XT | MN-12B-Mag-Mell-R1-Q5_K_L.gguf

HSA_OVERRIDE_GFX_VERSION=10.3.0 build/bin/llama-perplexity -m RP_Models/MN-12B-Mag-Mell-R1-Q5_K_L.gguf -f wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99

type	PPL	size log line
fp16	5.9396 +/- 0.09823	`llama_kv_cache: size = 640.00 MiB ( 4096 cells, 40 layers, 1/1 seqs), K (f16): 320.00 MiB, V (f16): 320.00 MiB`
turbo4	6.0370 +/- 0.10059	`llama_kv_cache: size = 170.00 MiB ( 4096 cells, 40 layers, 1/1 seqs), K (turbo4): 85.00 MiB, V (turbo4): 85.00 MiB`
turbo3	6.1359 +/- 0.10211	`llama_kv_cache: size = 125.00 MiB ( 4096 cells, 40 layers, 1/1 seqs), K (turbo3): 62.50 MiB, V (turbo3): 62.50 MiB`

HSA_OVERRIDE_GFX_VERSION=10.3.0 build/bin/llama-bench -m RP_Models/MN-12B-Mag-Mell-R1-Q5_K_L.gguf -fa 1 -p "4096,16384,32768" -ngl 99

type	pp4096	pp16384	pp32768	tg128
fp16	557.25 +/- 1.09	450.21 +/- 0.11	OOM	36.15 +/- 0.01
turbo4	550.71 +/- 0.35	443.30 +/- 0.32	350.78 +/- 0.09	34.49 +/- 0.01
turbo3	549.63 +/- 0.80	442.93 +/- 0.07	350.92 +/- 0.06	33.52 +/- 0.01

I built with a custom command:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build \
        -DGGML_HIP=ON \
        -DGPU_TARGETS=gfx1030 \
        -DCMAKE_BUILD_TYPE=Release \
        -DGGML_NATIVE=ON \
    && cmake --build build --config Release -- -j 16

Don't know if that makes a difference but there you go.

Let me know if you can use other tests as well, I don't really understand any of those numbers.
It does use less VRAM though so it is definitively working!

domvox Apr 17, 2026
Author

@LyraMoonDev Excellent — first RDNA2 validation! Your results look solid:

KV Type	PPL (4K)	Δ	KV Size	Compression
f16	5.94	—	640 MiB	1×
turbo4	6.04	+1.7%	170 MiB	3.8×
turbo3	6.14	+3.3%	125 MiB	5.1×

And the key result: turbo enables 32K context where f16 OOMs.

The decode speed drop (36→33 t/s) is consistent with TurboQuant overhead — likely a mix of FA dequantization, WHT rotations, and different kernel paths. On RDNA2 with HSA_OVERRIDE_GFX_VERSION the exact cost profile can vary. Fair trade for 5× memory savings.

Your build command is fine. We also just added Docker Compose support:

GPU_TARGETS=gfx1030 docker compose build
docker compose up

Adding gfx1030 to the confirmed hardware list — thanks for the thorough testing!

domvox · 2026-04-10T18:15:04Z

domvox
Apr 10, 2026
Author

Update: Full benchmark results (2026-04-10, GSM8K corrected 2026-04-11)

Since the initial post, I've run significantly more thorough benchmarks on the current codebase. All tests on RX 7900 XTX, ROCm 6.4.

GSM8K Math (Qwen3.5-27B Q5_K_M, temperature=0)

Config	Accuracy	N
f16 KV	71.9%	1319 (full set)
turbo3 KV	72.0%	1319 (full set)

Both validated on the full 1319-problem GSM8K set. Difference is +0.1% — within statistical noise. turbo3 does not degrade math reasoning at 5.12× KV compression.

Needle-in-a-Haystack (Qwen3.5-27B, turbo3 K+V)

20/20 passed across 2K–16K context at all depths (0%, 25%, 50%, 75%, 100%). No retrieval degradation.

Tool Calling (Qwen3.5-27B, turbo3 K+V)

15/15 passed — correct tool selection and parameter extraction across 5 tool types.

WikiText-2 PPL (Qwen3.5-27B Q5_K_M)

Context	f16	turbo3	Δ
4K	6.6657	6.6657	+0.02%
16K	6.2752	6.2187	-0.9%

turbo3 matches or slightly beats f16 at 16K context.

Speed

Model	Config	Prefill (tok/s)	Decode (tok/s)
Qwen3.5-27B	f16	427	30.15
	turbo3	423 (-1%)	29.49 (-2%)
Gemma 4 26B	f16	2939	94.53
	turbo3	2878 (-2%)	87.13 (-8%)

Gemma 4 findings

Model	f16	turbo3	Drop
Gemma 4 26B-A4B	83%	81.5% (avg 2 runs)	-1.5%
Gemma 4 31B Dense	96%	97%	+1%

Best Gemma 4 config: --cache-type-k turbo3 --cache-type-v turbo3 --cache-type-k-swa turbo3 --cache-type-v-swa q8_0

K-side attention sharpening (α=1.036) is critical for symmetric turbo3 configs — without it, GSM8K drops from 83% to 74%.

New: Boundary V hybrid architecture fix

Fixed the known bug where boundary layer detection used raw layer index instead of KV layer ordinal. On Qwen3.5-27B (64 total layers, 16 KV layers), boundary V now correctly targets the first/last KV attention layers. Added mode 8 for symmetric configs.

Bug fixes

Fixed linker error with GGML_BACKEND_DL=ON (Docker CI)
Fixed cudaMemcpyToSymbol nodiscard warnings on HIP
Disabled MUSA and s390x Docker builds (unsupported)

Repo: https://github.com/domvox/llama.cpp-turboquant-hip (branch: feature/triattention-scoring)

0 replies

domvox · 2026-04-10T20:11:02Z

domvox
Apr 10, 2026
Author

Update on the VRAM growth issue (@stragulus)

I dug into the FA kernel code and I think I found the root cause of the VRAM growth you reported.

The TILE FA kernel passes need_f16_K=true, need_f16_V=true to launch_fattn, which causes it to allocate a full f16 dequantization buffer via K_f16.alloc(ggml_nelements(K)). For a 262K context with 16 KV layers, that's ~16 GiB of temp buffer — on top of the turbo KV cache itself.

The VEC FA kernel sets need_f16_K = (type_K == GGML_TYPE_F16), so for turbo types it's false — no temp buffer, inline dequant instead.

On my setup (gfx1100, openSUSE), turbo types route to VEC, so I don't see the growth. If your setup routes to TILE (different GPU, different dispatch path, or different FA kernel selection), you'd get the full f16 temp allocation on every forward pass.

This would explain:

why VEC (your Vulkan fallback) didn't show the issue
why the growth scales with KV cache size
why it's not turbo-specific per se — any quantized KV type going through TILE would have the same behavior

The relevant code is in ggml/src/ggml-cuda/fattn-common.cuh lines ~1300-1360 and fattn-tile.cuh where launch_fattn is called with true, true.

Could you check which FA kernel your build selects? Look for lines like ggml_cuda_flash_attn_ext_vec or ggml_cuda_flash_attn_ext_tile in verbose output.

0 replies

stragulus · 2026-04-10T22:13:41Z

stragulus
Apr 10, 2026

I dug deeper into this as well, was able to reproduce your example as well, and debugged the issue. I plan to propose a fix for this as I have multiple solutions, weighing performance or vram usage impact and picking a smart default choice. The fix will work for any quantized kv cache and is indeed not turbo-related at all.

…

On Fri, Apr 10, 2026 at 10:11 PM Dominik ***@***.***> wrote: Update on the VRAM growth issue ***@***.*** <https://github.com/stragulus> ) I dug into the FA kernel code and I think I found the root cause of the VRAM growth you reported. The *TILE FA kernel* passes need_f16_K=true, need_f16_V=true to launch_fattn, which causes it to allocate a full f16 dequantization buffer via K_f16.alloc(ggml_nelements(K)). For a 262K context with 16 KV layers, that's ~16 GiB of temp buffer — on top of the turbo KV cache itself. The *VEC FA kernel* sets need_f16_K = (type_K == GGML_TYPE_F16), so for turbo types it's false — no temp buffer, inline dequant instead. On my setup (gfx1100, openSUSE), turbo types route to VEC, so I don't see the growth. If your setup routes to TILE (different GPU, different dispatch path, or different FA kernel selection), you'd get the full f16 temp allocation on every forward pass. This would explain: - why VEC (your Vulkan fallback) didn't show the issue - why the growth scales with KV cache size - why it's not turbo-specific per se — any quantized KV type going through TILE would have the same behavior The relevant code is in ggml/src/ggml-cuda/fattn-common.cuh lines ~1300-1360 and fattn-tile.cuh where launch_fattn is called with true, true . Could you check which FA kernel your build selects? Look for lines like ggml_cuda_flash_attn_ext_vec or ggml_cuda_flash_attn_ext_tile in verbose output. — Reply to this email directly, view it on GitHub <#21526 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAD66B3MIX3Q57LX7STS6C34VFIOXAVCNFSM6AAAAACXOOQMX6VHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTMNJSGIYTKMI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

0 replies

domvox · 2026-04-10T22:16:52Z

domvox
Apr 10, 2026
Author

Great to hear you reproduced it and are working on a fix. The fact that it affects any quantized KV type (not just turbo) makes it a much more impactful fix for the broader community.

Happy to test any patches on gfx1100 / ROCm 6.4 if that helps.

9 replies

stragulus Apr 12, 2026

https://github.com/domvox/llama.cpp-turboquant-hip/pull/2/changes There it is, just had to patch in a single commit. No more vram usage creep across all test cases without impacting performance noticeably. In your own previous example where you didn't notice vram usage creep using the qwen Q5_K_M model, you would still see if for longer prompts (e.g. 60k) because it just so happened to have pre-allocated more vram than the Q4_K_M model which showed vram creep with smaller prompt sizes. If you compare that test against this branch, you'll also spot the difference.

domvox Apr 12, 2026
Author

@stragulus Cherry-picked into feature/triattention-scoring and pushed — built and tested on 7900 XTX, no performance regression (952 t/s pp512 turbo3). Clean patch, thanks!

domvox Apr 12, 2026
Author

Hey @stragulus, while doing a Codex review of the full codebase I found two potential issues in the batched FA KV conversion path (flash_attn_combine2_inplace + do_batch loop in fattn-common.cuh). These are in the batched path code, not in the turbo dequant itself.

1. Attention sinks counted per KV batch

do_kernel_call always passes sinks to the FA kernel. In the batched path, each do_batch() call runs the kernel with blockIdx.y == 0 (since batch_blocks.y = 1), so the sink logit is added to the softmax in every batch. When combine2_inplace merges via log-sum-exp, the sink contribution is counted N times (once per batch) instead of once.

Impact: only when attention sinks are enabled + batched KV conversion is active (long context). The sink weight in the final softmax is inflated proportionally to the number of batches.

Fix: pass sinks only for batch 0.

2. Mask row stride wrong for ncols > 1 in batched path

The vec kernel indexes the mask as maskh[j*ne11 + i_KQ], using ne11 as the stride between query rows. In the batched path, ne11 = kv_count (slice size), but the mask was allocated with the full KV length as row stride. For ncols == 1 (token generation) this is fine. For ncols == 2 (prompt processing), query column j=1 reads from maskh[kv_count + i_KQ] instead of maskh[full_kv_length + i_KQ].

Fix: use nb31 / sizeof(half) as the mask row stride instead of ne11 (nb31 is already passed to the kernel).

Neither affects the common single-token generation path, so not urgent. Just flagging for when you have time.

stragulus Apr 12, 2026

Good catch! I validated and added my fixes to my forked branch: stragulus/llama.cpp-turboquant-hip@ecc534c - since I can't cleanly PR can you cherry-pick from there?

domvox Apr 12, 2026
Author

Cherry-picked in 199b852. Thanks for the quick fix! 🙏

rkmth · 2026-04-11T11:22:55Z

rkmth
Apr 11, 2026

Awesome! Thanks a lot for putting in the hardwork for implementing this.

I was running minimax-m2.5-q3/UD-Q3_K_XL for 90k context

Unoptimized (f16): 21824.00 MiB
Optimized (turbo3): 4262.50 MiB

gfx1151, ROCm 7.2.1

1 reply

domvox Apr 11, 2026
Author

Thanks for testing! That's the first gfx1151 result with turbo3 KV — great to see it working on Strix Halo.

5.12× compression on MiniMax-M2.5 at 90K context is solid. If you get a chance, two things would be really useful:

Any quality observations in chat (does it feel the same as f16 KV?)
If you push context higher (130K+), let me know if you hit OOM — there's a known issue with FA temp buffers growing with context fill that we're tracking.

wtarreau · 2026-04-11T16:02:04Z

wtarreau
Apr 11, 2026

Thanks for this work. I tried it on my dual-MI50 (rocm-6.4) and I didn't notice particular issues, however I noticed that the PP speed is roughly halved on qwen3-coder-next-Q4_K_M (1130 t/s -> 500 t/s), and the TG speed drops from 41.5 t/s to 34.5. But it can definitely be helpful for large contexts. I tried with turbo4 instead and it gave me the exact same performance as turbo3. Hoping this helps!

9 replies

domvox Apr 12, 2026
Author

@wtarreau Thanks for running the full matrix — this is the most detailed gfx906 data we have.

For reference, here's q8_0 vs turbo3 K+V on a single RX 7900 XTX (gfx1100) with Qwen3.5-27B:

Context	q8_0 (t/s)	turbo3 (t/s)	Overhead
pp512	947	937	1.1%
pp4096	910	904	0.6%
pp16384	818	814	0.4%

Not an apples-to-apples comparison — different model (27B dense vs 80B MoE), single vs dual GPU, and different GPU architecture. But the gap is striking enough to suggest the turbo3 overhead is strongly platform/workload-dependent. The bit-unpacking in the dequant kernels likely hits GCN's wave64 harder than RDNA3's wave32, though the model and topology differences make it hard to isolate the cause cleanly.

turbo3 gives ~2.3x nominal KV compression vs q8_0, so at pp65536 on dual MI50 the VRAM savings may still justify the throughput cost.

DmitryAkDev Apr 19, 2026

ROCM 7.2.1
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
power cap to 180

Build - a lot of warnings:
../llama.cpp-turboquant-hip/ggml/src/ggml-cuda/template-instances/../turbo-quant.cuh:275:13: warning: ignoring return value of type 'hipError_t' declared with 'nodiscard' attribute [-Wunused-value] 275 | cudaMemcpyToSymbol(d_innerq_calibrating, &zero, sizeof(int)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GEMMA-4-31B-Q4_K_M

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99
llama_kv_cache: size = 1200.00 MiB ( 1536 cells, 50 layers, 1/1 seqs), K (f16): 600.00 MiB, V (f16): 600.00 MiB
Final estimate: PPL = 18934.0944 +/- 964.54416

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3
llama_kv_cache: size = 234.38 MiB ( 1536 cells, 50 layers, 1/1 seqs), K (turbo3): 117.19 MiB, V (turbo3): 117.19 MiB
Final estimate: PPL = 565584988568.3198 +/- 28640633423.13766

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4
llama_kv_cache: size = 318.75 MiB ( 1536 cells, 50 layers, 1/1 seqs), K (turbo4): 159.38 MiB, V (turbo4): 159.38 MiB
Final estimate: PPL = 48212618.8387 +/- 2422002.53041

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k f16 --cache-type-v f16
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	test	t/s
gemma4 ?B Q4_K - Medium	18.24 GiB	30.70 B	ROCm	99	pp4096	156.48 ± 0.14
gemma4 ?B Q4_K - Medium	18.24 GiB	30.70 B	ROCm	99	tg128	18.47 ± 0.01

build: 11aced9 (8699)

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3
main: error: failed to create context with model '/mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf'

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
gemma4 ?B Q4_K - Medium	18.24 GiB	30.70 B	ROCm	99	turbo3	turbo3	1	pp4096	183.78 ± 0.20
gemma4 ?B Q4_K - Medium	18.24 GiB	30.70 B	ROCm	99	turbo3	turbo3	1	tg128	18.35 ± 0.02

build: 11aced9 (8699)

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/gemma4_31_update/google_gemma-4-31B-it-Q4_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4 -fa 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
gemma4 ?B Q4_K - Medium	18.24 GiB	30.70 B	ROCm	99	turbo4	turbo4	1	pp4096	183.65 ± 0.17
gemma4 ?B Q4_K - Medium	18.24 GiB	30.70 B	ROCm	99	turbo4	turbo4	1	tg128	18.55 ± 0.01

build: 11aced9 (8699)

Qwen3.5-27B-Q5_K_M

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k f16 --cache-type-v f16
llama_kv_cache: size = 256.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
Final estimate: PPL = 7.1514 +/- 0.13525

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3
llama_kv_cache: size = 50.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (turbo3): 25.00 MiB, V (turbo3): 25.00 MiB
Final estimate: PPL = 19.4613 +/- 0.39019

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4
llama_kv_cache: size = 68.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (turbo4): 34.00 MiB, V (turbo4): 34.00 MiB
Final estimate: PPL = 10.3809 +/- 0.20076

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k f16 --cache-type-v f16
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	test	t/s
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	pp4096	149.34 ± 0.02
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	tg128	18.44 ± 0.01

build: 11aced9 (8699)

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
main: error: failed to create context with model '/mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf'

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo3	turbo3	1	pp4096	152.36 ± 0.03
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo3	turbo3	1	tg128	18.25 ± 0.01

build: 11aced9 (8699)

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4 -fa 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo4	turbo4	1	pp4096	152.33 ± 0.04
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo4	turbo4	1	tg128	18.29 ± 0.01

build: 11aced9 (8699)

domvox Apr 19, 2026
Author

@DmitryAkDev Thanks for testing on MI50! The PPL degradation you're seeing is expected on the branch you built (feature/turboquant-hip-port-clean, build 11aced9). That branch is missing several fixes now in feature/triattention-scoring:

HIP VEC forcing for quantized KV — without it, turbo types fall through to TILE FA which causes the OOM you saw without -fa 1
InnerQ partial-RoPE auto-disable — Qwen3.5-27B has partial RoPE (25% of K dims rotated). Current branch auto-detects this. Old branch doesn't, corrupting WHT rotation. On our 7900 XTX with current branch: Qwen3.5-27B turbo3 PPL = 6.96 vs f16 6.91 (+0.7%)
InnerQ group-size safeguards — prevents OOB on non-128 groups

Could you rebuild from feature/triattention-scoring? Most useful rerun:

# baseline
llama-perplexity -m Qwen3.5-27B-Q5_K_M.gguf -f wiki.test.raw -c 4096 --chunks 10 -ngl 99
# turbo3
llama-perplexity -m Qwen3.5-27B-Q5_K_M.gguf -f wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
# conservative
llama-perplexity -m Qwen3.5-27B-Q5_K_M.gguf -f wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k q8_0 --cache-type-v turbo3 -fa 1

The Gemma4 f16 PPL being 18,934 is expected — instruct model on raw wikitext. The turbo PPL exploding relative to that is the same missing-safeguards issue from the old branch.

Build warnings (cudaMemcpyToSymbol nodiscard) are harmless, also fixed in current branch.

DmitryAkDev Apr 19, 2026

Thank you for your work first!
Sure,
Qwen3.5-27B-Q5_K_M, feature/triattention-scoring, build: b0ae34e72 (8941)

Just to compare
RX 7900 XTX
ROCM 7.2.1

HIP_VISIBLE_DEVICES=1 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
llama_kv_cache: size = 256.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
Final estimate: PPL = 7.1424 +/- 0.13484

HIP_VISIBLE_DEVICES=1 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k q8_0 --cache-type-v q8_0 -fa 1
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
llama_kv_cache: size = 136.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (q8_0): 68.00 MiB, V (q8_0): 68.00 MiB
Final estimate: PPL = 7.1450 +/- 0.13523

HIP_VISIBLE_DEVICES=1 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
llama_kv_cache: size = 50.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (turbo3): 25.00 MiB, V (turbo3): 25.00 MiB
Final estimate: PPL = 7.1812 +/- 0.13576

HIP_VISIBLE_DEVICES=1 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k q8_0 --cache-type-v turbo3 -fa 1
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
llama_kv_cache: size = 93.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (q8_0): 68.00 MiB, V (turbo3): 25.00 MiB
Final estimate: PPL = 7.1924 +/- 0.13636

MI50-32gb
ROCM 7.2.1

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
llama_kv_cache: size = 256.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
Final estimate: PPL = 7.1514 +/- 0.13525

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
llama_kv_cache: size = 50.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (turbo3): 25.00 MiB, V (turbo3): 25.00 MiB
Final estimate: PPL = 12.6968 +/- 0.25501

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k q8_0 --cache-type-v turbo3 -fa 1
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
llama_kv_cache: size = 93.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (q8_0): 68.00 MiB, V (turbo3): 25.00 MiB
Final estimate: PPL = 10.7293 +/- 0.21486

Dual test
Mi50-32gb + RX 7900 XTX
ROCM 7.2.1

./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
Device 1: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
llama_kv_cache: size = 256.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
Final estimate: PPL = 7.1696 +/- 0.13572

./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
Device 1: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
llama_kv_cache: size = 50.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (turbo3): 25.00 MiB, V (turbo3): 25.00 MiB
Final estimate: PPL = 9.8961 +/- 0.21645

./build/bin/llama-perplexity -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -f ./scripts/wikitext-2-raw/wiki.test.raw -c 4096 --chunks 10 -ngl 99 --cache-type-k q8_0 --cache-type-v turbo3 -fa 1
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
Device 1: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
llama_kv_cache: size = 93.00 MiB ( 4096 cells, 16 layers, 1/1 seqs), K (q8_0): 68.00 MiB, V (turbo3): 25.00 MiB
Final estimate: PPL = 9.3899 +/- 0.20403

power cap to 180
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k f16 --cache-type-v f16
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	test	t/s
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	pp4096	149.39 ± 0.06
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	tg128	18.44 ± 0.02

build: b0ae34e72 (8941)

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
main: error: failed to create context with model '/mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf'

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3 -fa 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo3	turbo3	1	pp4096	121.13 ± 0.03
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo3	turbo3	1	tg128	18.20 ± 0.01

build: b0ae34e72 (8941)

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4 -fa 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo4	turbo4	1	pp4096	126.11 ± 0.03
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo4	turbo4	1	tg128	18.33 ± 0.01

build: b0ae34e72 (8941)

To compare with "native" powercap
GPU[0] : Successfully set power to: 225W

HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m /mnt/h/gguf/Qwen3.5-27B/Qwen3.5-27B-Q5_K_M.gguf -p 4096 -n 128 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4 -fa 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo4	turbo4	1	pp4096	126.50 ± 0.00
qwen35 27B Q5_K - Medium	18.25 GiB	26.90 B	ROCm	99	turbo4	turbo4	1	tg128	19.43 ± 0.10

build: b0ae34e72 (8941)

TheTom Apr 20, 2026

Great data, thanks for the thorough testing. XTX numbers confirm the fixes landed clean on RDNA3.

MI50 PPL degradation is a gfx906-specific issue... likely wave64 lane mapping in the WHT butterfly. I don't have GCN hardware to debug against, so if you're able to dig into it and PR a fix to the fork that would be hugely appreciated. Happy to review and merge.

domvox · 2026-04-11T18:23:26Z

domvox
Apr 11, 2026
Author

TurboQuant + TriAttention combo: measured results (2026-04-11)

All tests on RX 7900 XTX 24GB, ROCm 6.4, openSUSE Tumbleweed. Qwen3 thinking mode disabled via chat_template_kwargs: {"enable_thinking": false} — otherwise reasoning tokens consume max_tokens and produce false NIAH failures.

GSM8K math reasoning

Model: Qwen3.5-27B Q5_K_M, temp=0, full 1319-problem test set.

Config	Accuracy	N
f16 KV	71.9% (948/1319)	1319
turbo3 KV	72.0% (950/1319)	1319
turbo3 + TriAttention 75%	72.0% (950/1319)	1319

No measurable reasoning degradation on GSM8K from TriAttention pruning in this run.

Single-needle NIAH retrieval

TriAttention params: --tri-budget 75 --tri-window 512 --tri-interval 128 --tri-sink 0
Test grid: 5 depths (0.00, 0.25, 0.50, 0.75, 1.00) × context lengths.

Config	Model	to 12K	to 30K
turbo3-only baseline	Qwen3.5-27B	20/20 (100%)	25/25 (100%)
turbo3 + TriAtt 75%	Qwen3.5-27B	20/20 (100%)	23/25 (92%)
turbo3 + TriAtt 75%	Qwen3-8B	9/10 (90%)¹	19/25 (76%)

¹ 8B tested at 4K+8K only in the ≤12K range (no 2K/12K points in this run). Single failure at 4K d=0.75.

Effective KV compression with combo: ~6.8×.

Observed limitations

27B combo: both failures at depth 0.75 (4K and 30K context). Baseline has zero failures at any depth.
8B degrades faster under pruning: 6 failures total vs 2 for 27B. 4 of 6 at d=0.75; additionally 16K context shows weakness at multiple depths (d=0.25, d=0.50, d=0.75).
Single-needle retrieval only — multi-needle not yet tested.
Tested on Qwen3.5-27B (hybrid SSM+attn) and Qwen3-8B (pure transformer) only.
Combo perplexity not yet measured.

Repos

TurboQuant HIP port: https://github.com/domvox/llama.cpp-turboquant-hip
TriAttention calibration: https://github.com/domvox/triattention-ggml

2 replies

TheTom Apr 20, 2026

Great work on the independent implementation and the GSM8K + NIAH matrix. A few things jumped out because I've been down this exact road on the Metal side.

Your d=0.75 NIAH failures on 27B match a pattern I hit early on with paper-faithful scoring. I wrote up the full progression in a paper that covers three iterations of the eviction policy: https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/triattention-v3.md

The short version:

V1 (paper-faithful global sort) produces the same position-dependent retrieval failures you're seeing. On Qwen2.5-7B at 32K, V1 passes start and middle but silently drops the needle at the end position. The model doesn't refuse or hallucinate randomly. It reaches for the nearest semantic neighbor in the surviving context, which is exactly what your "Operation Manchu" and depth-0.75 failures look like.

V3 (prefix protection + per-segment quota) fixes this on pure transformers. Two structural constraints on top of the same trig scoring: a hard prefix window where the first N tokens are never evicted (default 128), and a per-segment eviction quota that forces eviction to spread across 8 position buckets instead of concentrating wherever the trig score is lowest. On 7B at 32K with 90% retention, V3 lands at +0.006% PPL with clean NIAH at all three positions. At 64K it's bit-identical to baseline.

What didn't work:

V2 (per-segment quota without prefix protection) rescued end-position retrieval but broke start-position retrieval and added +4.49% PPL. The fix wasn't spreading eviction more evenly, it was explicitly protecting the tokens that V1 was implicitly keeping.
Boundary layer skipping (motivated by Sharpe's coupled-oscillator framework) was a clean negative. Excluding the first few attention layers from scoring hurts PPL on pure transformers and doesn't rescue NIAH on hybrids.

Where it still breaks: V3 fails NIAH at middle and end positions on Qwen3.5 hybrid models (27B dense, 35B-A3B MoE), even though PPL transfers cleanly (+0.29% and +0.21%). The scoring formula only operates on the rotated portion of K, and Qwen3.5 rotates only 64 of 256 head dimensions. V3 is blind to 75% of each cell's K content on these models. I haven't found a fix for this yet.

Your per-layer MRL-based budget scaling is something I haven't tried. I use uniform budgets across layers. The idea of giving less-predictable layers more conservative budgets makes sense and could complement V3's positional constraints. That's a different axis entirely.

I'd like to collaborate on this if you're interested. You have HIP/CUDA testing I don't have access to, and V3's prefix + segment policy might resolve your d=0.75 failures without requiring changes to the scoring formula itself. The implementation is on the experiment/triattention-integration branch: https://github.com/TheTom/llama-cpp-turboquant/tree/experiment/triattention-integration

Each commit is a clean checkpoint you can reproduce independently. Open an issue or PR on either repo, have a discord as well if you're interested.

domvox Apr 20, 2026
Author

Thanks for the full paper — really thorough. The methodology section alone saved me from a few traps I'd already fallen into (the -b 512 thing bit me, and I had a --no-display-prompt false-pass episode too).

We picked up your V3 policy on April 11 from your earlier analysis (prefix=128, 8 segments). On Qwen3.5-27B at 75% retention, 32K ctx, NIAH passes everywhere except d=0.75 where we get 3/5 — near-miss at 4K, full miss at 30K. Baseline clean at 25/25.

I think the gap is mostly retention — you validated at 90% and even 85% already shows cracks. The Qwen3.5 hybrid issue from your Section 4.5 lines up too. We confirmed partial_rotary_factor=0.25, so 32 frequency pairs out of head_dim=256.

Happy to run HIP/ROCm testing on your branch if that's useful. I'll open an issue on your repo.

a-n-t-0 · 2026-04-11T18:53:27Z

a-n-t-0
Apr 11, 2026

Hi @domvox,

First, thank you for this work on TurboQuant HIP — the KV cache compression is exactly what 24GB GPU owners need for long context. I've been testing your port on my RX 7900 XTX (ROCm 7.2.1) and turbo3 works as advertised.

I wanted to bring something to your attention. Are you aware of @lhl's rocm-wmma-tune branch (PR #16827)? It's a set of RDNA3-specific optimizations to the FlashAttention WMMA kernel:

Adaptive KQ stride (128 for D<=128) to reduce LDS footprint
__launch_bounds__ for minimum 2 blocks/SM occupancy on RDNA3
WMMA for prefill only, VEC/TILE fallback for decode (fixes the long-context decode regression)

Your build doesn't use -DGGML_HIP_ROCWMMA_FATTN=ON, so you're on the standard TILE/VEC path. With lhl's optimized WMMA path enabled, prefill throughput on the same 7900 XTX is significantly higher. Here are my benchmarks on Gemma 4 31B Q4_K_XL (pp512, -mmp 0):

Build	base	KV type	pp512 (t/s)	tg32 (t/s)
lhl wmma-tune	b8669	q8_0	897	27.6
Your turboquant-hip	b8680	q8_0	724	27.5
Your turboquant-hip	b8680	turbo3	662	27.5

lhl's PR was never merged — @JohannesGaessler declined it in November 2025, saying he planned to rewrite the WMMA kernel as native MMA within ~1 month. That was 5 months ago and the replacement hasn't materialized, leaving RDNA3 users without these optimizations in upstream.

I attempted to merge both branches to get TurboQuant + optimized WMMA in one build. Here's what I found:

Attempt 1 — lhl as base, turboquant merged on top:
Git auto-merges cleanly, compiles with -DGGML_HIP_ROCWMMA_FATTN=ON, turbo3 works. However, prefill regresses to ~677 t/s even for q8_0 — the upstream commits between b8669 and b8680 change enough in the pipeline (context, kv-cache, graph scheduling) to cause a measurable regression that I couldn't isolate to a single file.

Attempt 2 — turboquant as base, lhl merged on top:
Conflicts in fattn.cu and fattn-tile.cuh. After manual resolution, WMMA crashes on non-turbo types due to interface mismatches between the two code bases.

The fundamental issue is that both forks diverged from different upstream commits and the FA pipeline changed enough between them.

Suggestion: If you're planning further development, rebasing TurboQuant on top of lhl's rocm-wmma-tune branch (rather than upstream HEAD) could give RDNA3 users both optimized WMMA prefill and turbo KV cache compression. The turbo VEC kernels would still handle turbo types, while the standard FA path benefits from lhl's WMMA optimizations.

This could also help build the case for finally merging lhl's PR upstream — if multiple community projects depend on it, there's stronger incentive to get the WMMA fixes into mainline rather than waiting indefinitely for the MMA rewrite.

Happy to test any builds on my 7900 XTX. Thanks again — turbo3 at 80K context on a 24GB card is a game changer.

5 replies

domvox Apr 11, 2026
Author

@a-n-t-0 Thanks for the detailed benchmarks and for pointing me at lhl's branch. I dug into this properly tonight.

TL;DR: The ~897 t/s result on RX 7900 XTX is real, but in my tests it comes from using AMD's official ROCm toolchain/userspace stack, not from rocWMMA or lhl's tuning. On RDNA3, WMMA was unnecessary on ROCm 6.3.1 and a regression on ROCm 7.2.1.

I ran a controlled matrix on the same RX 7900 XTX with the same codebase, same models, -DAMDGPU_TARGETS=gfx1100, and flash attention enabled. I tested native openSUSE ROCm 6.4 and AMD's official ROCm Docker images for 6.3.1 and 7.2.1 to separate the host distro ROCm path from AMD's official ROCm stack.

Gemma 4 31B Q4_K_M, pp512 f16:

Environment	Toolchain	WMMA	pp512 (t/s)
openSUSE native ROCm 6.4	distro ROCm / clang 19	OFF	450
Docker ROCm 6.3.1	AMD clang 18	OFF	903
Docker ROCm 6.3.1	AMD clang 18	ON	908
Docker ROCm 7.2.1	AMD clang 22	OFF	889
Docker ROCm 7.2.1	AMD clang 22	ON	679

Qwen3.5-27B Q5_K_M, pp512 f16:

Environment	Toolchain	WMMA	pp512 (t/s)
openSUSE native ROCm 6.4	distro ROCm / clang 19	OFF	426
Docker ROCm 6.3.1	AMD clang 18	OFF	957
Docker ROCm 6.3.1	AMD clang 18	ON	955
Docker ROCm 7.2.1	AMD clang 22	OFF	937
Docker ROCm 7.2.1	AMD clang 22	ON	873

Key findings:

The big prefill gap tracks AMD's official ROCm toolchain/userspace stack versus my distro-packaged ROCm path. On my openSUSE Tumbleweed setup, that was the real bottleneck — not WMMA.
On AMD's toolchain, WMMA is neutral on ROCm 6.3.1 and clearly hurts on ROCm 7.2.1 (889→679 on Gemma, 937→873 on Qwen). The standard TILE/VEC path is as fast or faster on RDNA3.
lhl's two cherry-picked commits (adaptive KQ stride + launch_bounds tuning) add no measurable benefit on top of WMMA ON in my ROCm 7.2.1 tests.
Decode throughput stayed essentially unchanged across configurations (~27–30 t/s), which matches the expectation that decode is mostly memory-bound.

Bottom line for RX 7900 XTX / gfx1100 users: the main lever is using AMD's official ROCm stack/toolchain. On this GPU, I would keep GGML_HIP_ROCWMMA_FATTN=OFF. AMD's official ROCm Docker images are a practical way to test that stack without replacing distro packages.

This is for RX 7900 XTX / gfx1100 on these two models — not claiming the same behavior for every AMD GPU or model shape.

Happy to share full logs if anyone wants to reproduce.

a-n-t-0 Apr 12, 2026

Hi @domvox,

Following up on the toolchain discussion — you were right that WMMA is unnecessary on gfx1100 with the AMD toolchain. I confirmed: WMMA OFF = 910 t/s, WMMA ON = 906 t/s at pp512. Identical.

However, I ran a deeper comparison with context depth and found a significant decode regression on dense models between your base (b8852) and the older b8669 codebase. The regression is specific to models with many global attention layers at long context.

Setup: RX 7900 XTX, ROCm 7.2.1, AMD clang 22.0.0, Pop!_OS 24.04 (kernel 6.18.7), headless. b8669 compiled with ROCWMMA_FATTN=ON (lhl's branch — but decode uses VEC/TILE per lhl's patches). b8852 compiled with ROCWMMA_FATTN=OFF. Both with -DAMDGPU_TARGETS=gfx1100. I separately confirmed WMMA ON vs OFF makes no difference at pp512 (910 vs 906 t/s).

Gemma 4 31B Dense Q4_K_XL (48 global attention layers):

Test	b8669 q8_0	b8852 q8_0	b8852 turbo3	Δ b8669 vs b8852
pp512 @ d0	897	903	886	+1%
pp512 @ d4096	732	727	718	-1%
pp512 @ d16384	553	543	539	-2%
tg128 @ d0	27.7	28.4	27.7	+2%
tg128 @ d4096	26.4	23.4	23.2	-11%
tg128 @ d16384	24.0	16.8	16.7	-30%

Qwen3.5-27B Q4_K_XL (hybrid SWA, only 10 global attention layers):

Test	b8669 q8_0	b8852 q8_0	b8852 turbo3	Δ b8669 vs b8852
pp512 @ d0	1019	1018	1018	0%
pp512 @ d4096	924	919	919	-1%
pp512 @ d16384	729	719	722	-1%
tg128 @ d0	29.0	29.7	29.5	+2%
tg128 @ d4096	28.4	28.9	28.6	+2%
tg128 @ d16384	26.2	26.4	26.7	+1%

Key observations:

Prefill is identical across all configs — confirms your toolchain finding.
Turbo3 has zero performance cost vs q8_0 at every depth — turbo is genuinely free.
Decode regression is dense-model specific. Gemma 4 (48 global attention layers) loses 30% at d16384. Qwen3.5 (10 global + 50 SWA) is completely unaffected.
Confirmed in llama-server too: Gemma 4 at ~30K context gives ~12 t/s on b8852 vs ~21.7 t/s on b8669.

This looks like an upstream regression between b8669 and b8852 in the FA decode path that specifically hits models with many global attention layers at long context. Since Qwen3.5's SWA layers only scan ~1536 tokens regardless of total context, they mask the issue.

Not sure if this is worth chasing on your side since it's an upstream change, but wanted to flag it since it significantly impacts Gemma 4 usability at long context on 24GB cards. Happy to run any additional tests.

domvox Apr 12, 2026
Author

@a-n-t-0 Great data — the turbo3 zero-cost result across all tested depths matches our measurements.

The decode regression on Gemma 4 between b8669 and b8852 is a useful find. Since prefill is unaffected, q8_0 regresses the same way, and Qwen3.5 does not show the issue, this points away from TurboQuant and suggests a regression in the long-context full-attention decode path. The Gemma 4 d16384 comparison (24.0 → 16.8 t/s) looks like a clear repro case. If you have time for a git bisect between those two builds, pinpointing the exact commit would make it much easier for maintainers to act on.

a-n-t-0 Apr 12, 2026

I ran a few more builds to narrow this down before attempting a full bisect.

My first suspect was commit 668ed76 ("HIP: enable WMMA-MMQ INT kernels for RDNA 3"), which was bisected as a regression source in Issue #17917. But building b8852 with -DGGML_HIP_MMQ_MFMA=OFF made no difference — still 16.8 t/s.

I then tested whether WMMA itself was somehow helping decode on b8669. Built b8669 with ROCWMMA_FATTN=OFF (pure TILE/VEC, no WMMA at all):

Gemma 4 31B Q4_K_XL, tg128 @ d16384, q8_0 KV, RX 7900 XTX, ROCm 7.2.1:

Build	WMMA FA	MMQ_MFMA	tg128 @ d16384
b8669	ON	ON	24.0 t/s
b8669	OFF	ON	24.1 t/s
b8852	OFF	ON	16.8 t/s
b8852	OFF	OFF	16.8 t/s

So: WMMA is irrelevant (confirms your toolchain finding), MMQ_MFMA is irrelevant, and the regression is purely in the ~180 upstream commits between b8669 and b8852. Something in the FA decode path or graph scheduling changed. Prefill is unaffected across all builds (~900 t/s at d0, ~545 t/s at d16384).

This only manifests on dense models with many global attention layers (Gemma 4, 48 layers) at long context. Qwen3.5 hybrid (10 global + 50 SWA) is completely unaffected.

I don't have a clean upstream checkout to bisect between these two points since both builds are forks, but the regression window is b8669..b8852 and the repro is straightforward: Gemma 4 31B, -fa 1 -ctk q8_0 -ctv q8_0 -d 16384 -n 128. Happy to test specific commits if you or anyone can identify candidates.

domvox Apr 12, 2026
Author

@a-n-t-0 Top suspect: PR #21159 (15f786e65) — the stream_k fixup kernel optimization. stream_k is active on RDNA3 via amd_wmma_available(cc), it was tested only on NVIDIA, and it already caused a post-merge issue on Blackwell (#21564) tied to b8680. Quickest test would be b8679 vs b8680 on your setup, or revert just that commit on b8852.

BlackBookOfficial · 2026-04-11T22:56:27Z

BlackBookOfficial
Apr 11, 2026

I have a rtx 3090 and a 7900 xtx at my disposal and i said it before in another thread, the ampere lineup can only do fp16 int8 and int4 same with rdna 3 and rdna 4 amd gpu's what we are missing is fp8 and 3bit and sub 1 bit which is physically impossible on these architectures and which is a requirement for turboquant because of 3,5 bit quantization aka what the whole paper is about.

There might be ways to possibly achieve this but at the cost of overhead and ways around which will cause other problems.

13 replies

domvox Apr 14, 2026
Author

@a-n-t-0 New commit e22b3c898 — restored unconditional sparse V skip for all V types. The gating in 7a76caecd was removing a
beneficial optimization: at long context, skipping V reads for negligible attention weights saves HBM bandwidth even on q8_0.

Bisect confirmed on same hardware (pp16384 tg128, Gemma 4 31B Q4_K_M, q8_0 KV): cd9e3b35e = 26.97, e22b3c898 = 27.47.

Couldn't reproduce your d16384 test — context allocation fails on my build. What's your exact command line for the depth test?

Also: Gemma 4 31B GSM8K (100, temp=0): f16 99%, turbo3 96%.

Would appreciate a retest on e22b3c898.

a-n-t-0 Apr 14, 2026

Tested e22b3c898 — still 16.78 t/s at d16384.

Here's my exact command:

llama-bench -m gemma-4-31B-it-UD-Q4_K_XL.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -p 512 -n 128 -d 0,16384 -mmp 0 -r 1

The -d 16384 flag fills the KV cache with 16384 tokens before running the pp512/tg128 benchmark. This is what makes the decode slow — the VEC kernel has to scan 16K+ positions per layer per token.

Your pp16384 test is different — that's a 16K prefill (large batch, compute-bound), not a decode with 16K already in cache (single-token, memory-bound). Your 26.97/27.47 numbers are in the normal short-context decode range (~28 t/s). The regression only shows on the decode-with-deep-context path.

Regarding context allocation failing: I had the same issue initially — Gemma 4 has a 262K default context which OOMs llama-bench. The -mmp 0 (disable mmap) helps, and the Q4_K_XL quant (17.5 GiB) leaves enough room for the d16384 KV cache. If you're using Q4_K_M it might be slightly larger. You could also try with -d 8192 first.

The key data point: your original feature/turboquant-hip-port-clean (b8699) does 24.0 t/s on the same test. The regression is somewhere in the commits between that branch and triattention-scoring — not in Sparse V specifically (I tested that in isolation on upstream b8770, no effect). A bisect between those two branches would pinpoint it.

domvox Apr 14, 2026
Author

Thanks, your -d 16384 repro was exactly what I needed. I had been testing with -p 16384, which is prefill, not decode with a deep cache. Different workload, so it makes sense I wasn't seeing it.

Found the issue and pushed a fix: a13c3db12.

It came from the stragulus cherry-pick for batched KV conversion, abf395d, which added about 530 lines to fattn-common.cuh. The VEC q8_0 path doesn't even use that code, but HIP clang on gfx1100 generated worse code for the VEC kernel just because the header changed. Classic compiler side effect.

Bisect on Gemma 4 31B Q4_K_M, q8_0 KV, -fa 1 -d 0,16384 -mmp 0:

Before: d0=27.4 d16384=23.7
Broken: d0=27.8 d16384=16.7
Fixed: d0=27.1 d16384=23.75

Side note: -fa on silently fails in llama-bench; the bool parser only accepts 0/1. That cost us a few hours while figuring out why q8_0 KV wouldn't create a context. -fa 1 works.

I'll re-add the batched conversion later in a separate .cu so it doesn't affect VEC codegen.

baramofme Apr 16, 2026

@a-n-t-0 Thanks!!!!! @domvox Please distribute the patch as a package!! I would like to try using it directly in Docker Compose. Thank you.

domvox Apr 17, 2026
Author

@baramofme Just pushed Docker Compose support!

bash
git clone -b feature/triattention-scoring https://github.com/domvox/llama.cpp-turboquant-hip.git
cd llama.cpp-turboquant-hip
cp .env.example .env

Edit .env: set MODEL_PATH and GPU_TARGETS (gfx1030/gfx1100/gfx1200)

docker compose build
docker compose up

The .env.example has all the knobs — KV cache type (turbo3/turbo4/q8_0), context size, GPU
layers, etc. Server runs on port 8080 with OpenAI-compatible API.

For other tools (bench, perplexity):
bash
docker compose run --rm --entrypoint /app/llama-bench llama-turboquant
-m /models/your-model.gguf -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3

Let me know if you run into any issues with the build!

domvox · 2026-04-12T16:34:18Z

domvox
Apr 12, 2026
Author

TurboQuant PPL quality benchmarks — long context analysis

Following up on the throughput numbers, here are perplexity measurements across different models, context lengths, and KV cache configurations. All tests on RX 7900 XTX 24GB, ROCm 6.4, Docker build, wikitext-2 test set, 5 chunks per measurement.

Qwen3-8B Q4_K_M (full RoPE, rope_theta=1M, 8 KV heads, head_dim=128)

Config	ctx=16K	vs f16
f16 KV (baseline)	6.92	—
turbo4 K+V	6.99	+0.9%
turbo4 K + turbo3 V	7.01	+1.3%
q8_0 K + turbo3 V	6.95	+0.4%
turbo3 K + q8_0 V	21.16	+206% ⚠️
layer-adaptive (first+last 4 layers q8_0 K+V)	7.06	+2.0%
turbo3 K+V	19.70	+185% ⚠️

Llama-3.1-8B base Q4_K_M (full RoPE, rope_theta=500K, 8 KV heads, head_dim=128)

Config	ctx=16K	vs f16
f16 KV (baseline)	4.91	—
turbo4 K + turbo3 V	5.13	+4.4%
turbo3 K+V	5.33	+8.4%

Qwen3.5-27B Q5_K_M (partial_rotary_factor=0.25, 4 KV heads, head_dim=256)

Config	ctx=16K	vs f16
f16 KV (baseline)	6.12	—
turbo3 K+V	6.11	no measurable regression

Analysis: K cache is the sensitive component

The K/V ablation on Qwen3-8B isolates the problem clearly:

q8_0 K + turbo3 V = 6.95 — V quantization is robust
turbo3 K + q8_0 V = 21.16 — K quantization causes the failure

The likely mechanism is that llama.cpp stores K after RoPE. RoPE applies position-dependent rotations to K channel pairs, making the post-RoPE K distribution harder to quantize — especially with only 8 centroids (turbo3). turbo4's 16 centroids handle this well in all tested configurations.

The severity is model-dependent. Qwen3-8B shows catastrophic turbo3 K failure at 16K, Llama-3.1-8B shows moderate regression (+8.4%), and Qwen3.5-27B with partial RoPE (only 25% of K dimensions rotated) shows no measurable regression. Multiple factors likely contribute: RoPE coverage, rope_theta, head dimension, and the model's learned K distribution.

Layer-adaptive mode with first+last 4 layers at q8_0 prevents the catastrophic collapse on Qwen3-8B (19.70 → 7.06), suggesting boundary layers are particularly sensitive.

This is consistent with prior work on post-RoPE K quantization sensitivity (KVQuant, Berkeley) and frequency-dependent quantization error in RoPE channels (Q-ROAR).

Practical recommendations

Context length	Config
≤ 8K	`--cache-type-k turbo3 --cache-type-v turbo3`
8K–32K	`--cache-type-k turbo4 --cache-type-v turbo3`
32K+	`--cache-type-k turbo4 --cache-type-v turbo4` or benchmark your model

For models with partial_rotary_factor < 1.0 (e.g. Qwen3.5 family), turbo3 K+V showed no regression at 16K in this test, but longer contexts and other model sizes still need validation.

Note: PPL on wikitext-2 is not a complete proxy for long-context retrieval or generation quality. These numbers should be treated as directional guidance.

0 replies

Gerporgl · 2026-04-16T17:39:22Z

Gerporgl
Apr 16, 2026

I tested the branch locally (last I tested was f229a36) on a gfx1200 9060XT and I observed similar performance and quality results as others.
I am using ROCm 7.2.2 since this is what I'm currently testing with.

However, I was going to post all the results, until I realized that the peak vram usage wasn't really different, for some reason.
Although when I use turbo4/turbo3 I can see the vram usage growing slowly during llama-bench, it does reach pretty much almost the same point as f16.
The main difference is that vram grows to the max and stay the same without turbo, while turbo grows slowly.

Concretely with unsloth/gemma-4-31B-it-GGUF:UD-IQ3_XXS I was able to go up to context size of 30000 with my setup and llama-bench, with both f16 and turbo4/turbo3. In fact, turbo will crash at 31000 with not enough enough vram, while f16 would still work at 31000 and had just enough ram.

Commands I used (which may be wrong, I am new to this):

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ3_XXS -ngl 99 -fa 1 -p 512 -n 128 -d 0,30000 -r 1 -ctk turbo4 -ctv turbo3
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ3_XXS -ngl 99 -fa 1 -p 512 -n 128 -d 0,30000 -r 1 -ctk f16 -ctv f16

I tried some other models and quantization, but had similar vram usage observations.

16 replies

Gerporgl Apr 18, 2026

I had some time to test as well (build: 0757ff4), and yes it seems to use less vram now with turboquant with the same context size.
This is always with my same amdgpu setup (9060XT 16Gb VRAM) and ROCm 7.2.2

f16 test with gemma4 31B (fails over 38K):

root@CT-llama-swap:~# llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ3_XXS -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 36000,38000,40000,42000 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon Graphics, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
load_backend: loaded ROCm backend from /opt/llama/bin/libggml-hip.so
load_backend: loaded CPU backend from /opt/llama/bin/libggml-cpu-zen4.so
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |  1 |  pp512 @ d36000 |        186.46 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |  1 |  tg128 @ d36000 |         11.51 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |  1 |  pp512 @ d38000 |        179.41 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |  1 |  tg128 @ d38000 |         11.47 ± 0.00 |
main: error: failed to create context with model '/root/data/models/models--unsloth--gemma-4-31B-it-GGUF/snapshots/43e80d41a220ac7c83023daacd6a0d1fd8559251/gemma-4-31B-it-UD-IQ3_XXS.gguf'

q8_0+turbo4 test (succeed all the way up to 42K):

root@CT-llama-swap:~# llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ3_XXS -ngl 99 -fa 1 -ctk q8_0 -ctv turbo4 -p 512 -n 128 -d 36000,38000,40000,42000 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon Graphics, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
load_backend: loaded ROCm backend from /opt/llama/bin/libggml-hip.so
load_backend: loaded CPU backend from /opt/llama/bin/libggml-cpu-zen4.so
| model                          |       size |     params | backend    | ngl | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | --------------: | -------------------: |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  pp512 @ d36000 |        149.68 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  tg128 @ d36000 |          9.19 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  pp512 @ d38000 |        145.52 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  tg128 @ d38000 |          9.01 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  pp512 @ d40000 |        141.46 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  tg128 @ d40000 |          8.70 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  pp512 @ d42000 |        137.19 ± 0.00 |
| gemma4 ?B IQ3_XXS - 3.0625 bpw |  11.01 GiB |    30.70 B | ROCm       |  99 |   q8_0 | turbo4 |  1 |  tg128 @ d42000 |          8.58 ± 0.00 |

build: 0757ff4 (1)

baramofme Apr 21, 2026

@domvox Has this error been resolved in this branch commit? domvox/llama.cpp-turboquant-hip@e819d68 ?

domvox Apr 21, 2026
Author

@baramofme The latest commits on that branch are scoring/eviction improvements (GPU consistency fix, non-rotary scoring support). They don't address the OOM issue from #1 — that's a HIP memory management issue on NO_VMM devices (RX 9060 XT), not related to KV cache scoring. The workarounds from the issue still apply.

TheTom Apr 21, 2026

RE the oom: My fork already has the fix for the OOM without speed degradation. I'm working with the upstream maintainers to get fix that matches what they are expecting.

baramofme Apr 21, 2026

@TheTom Thanks. I will give it a try as well.

domvox · 2026-04-21T11:41:12Z

domvox
Apr 21, 2026
Author

TriAttention scoring update: GPU consistency fix + NIAH benchmark

GPU scoring fix

The HIP q8_0 scoring kernel was computing a different angle than the CPU path. After passing cur_pos into the kernel, both paths now use delta = cur_pos - p_k + offset, so the mismatch would no longer show up as different eviction choices between CPU and GPU.

Commit: 221c56721 on feature/triattention-scoring.

NIAH retrieval benchmark (Qwen3-8B)

Qwen3-8B Q4_K_M, q8_0 KV cache, no thinking mode, temp=0. Budget: 512 scored old tokens + 512 recent window (~1024 retained). Diverse haystack with random needle codes and distractor facts, 5 reps per cell.

              8K           16K          32K
           25  50  75   25  50  75   25  50  75   Total
no-prune  5/5 5/5 5/5  5/5 5/5 5/5  5/5 5/5 5/5  45/45 (100%)
scored    4/5 4/5 3/5  4/5 4/5 4/5  4/5 5/5 4/5  36/45 (80%)
random    2/5 4/5 2/5  2/5 0/5 2/5  2/5 1/5 2/5  17/45 (38%)

Scored eviction answers the needle correctly about twice as often as random (80% vs 38%). The scored failures we inspected were near-misses — the model found the needle but dropped a digit (e.g. KRFA-397-ZMAM instead of KRFA-3977-ZMAM). Random failures were complete hallucinations. Three server crashes (HTTP 500) hit both scored and random on the same prompts and are counted as failures for both; excluding those, rates are 36/42 (86%) scored vs 17/42 (40%) random.

Small initial benchmark (5 reps per cell) — the gap is directionally consistent across context lengths and depths in this run.

Hybrid model observation (Qwen3.5-27B)

On Qwen3.5-27B (3:1 Gated DeltaNet / gated-attention), scored and random eviction produced the same NIAH results in this exact setup: both 4/6 at 8K, 4/6 at 16K, 2/3 at 25K.

One possible explanation is that the recurrent DeltaNet path carries enough retrieval signal at these lengths that attention-KV eviction differences are not visible in this benchmark. Not tested beyond 25K yet — treating this as an observation rather than a conclusion.

0 replies

domvox · 2026-04-22T07:34:06Z

domvox
Apr 22, 2026
Author

TriAttention scoring: budget sweep and Qwen3.5 hybrid check

Follow-up to my earlier NIAH result after the GPU scoring fix.

Two extra checks:

On Qwen3-8B, the 512-token scored budget was too tight for this NIAH setup. Raising the scored old-token budget to 1024 (1152 total with window) removed the misses in this 18-trial run.
On Qwen3.5-27B hybrid, scored eviction did not beat random eviction in my small NIAH runs. Full KV or KV quantization looks like the safer option there for now.

1. Budget Sensitivity (Qwen3-8B, 8K context, 18-trial NIAH)

Setup: Qwen3-8B Q4_K_M, q8_0 KV cache, FA, temp=0, no thinking mode, scoring interval=128, window=128. Same NIAH generator as the previous 80% vs 38% post.

Budget (old tokens)	Window	Total retained	NIAH accuracy
512	128	640	89% (16/18)
768	128	896	94% (17/18)
1024	128	1152	100% (18/18)
1536	128	1664	100% (18/18)

At budget=1024 (1152 total retained, ~14% of 8K context), all near-miss failures disappear. The misses I inspected were near the cutoff: needle tokens scored around z=2.88-2.99. In this setup, a larger budget was enough to keep them.

For retrieval-sensitive workloads with tight budgets, TRIA_BUDGET_TOKENS=1024 (absolute budget env var) is worth trying. With window=128 this retains ~1152 tokens total.

2. Hybrid SSM+Attention (Qwen3.5-27B)

Setup: Qwen3.5-27B Q4_K_M, q8_0 KV cache, FA, temp=0, no thinking mode, budget=512, window=128.

Qwen3.5-27B is a Gated DeltaNet + full attention hybrid (3:1 ratio, 16 full-attention layers out of 64 total).

Method	NIAH 8K (6 trials)	NIAH 16K (6 trials)	NIAH 25K (3 trials)
No pruning (full KV)	100%	100%	100%
TriAttention scored	67% (4/6)	67% (4/6)	67% (2/3)
Random eviction	67% (4/6)	67% (4/6)	67% (2/3)

Scored eviction matched random eviction at every tested context length.

I do not know the root cause yet. One obvious limitation is partial RoPE: Qwen3.5 rotates 64 of 256 K dimensions, so this score only directly sees 25% of each key. I have not separated that from other hybrid-model effects.

For this hybrid model/configuration, I would use KV quantization before eviction. With only 16 full-attention layers (vs 64 total), full KV cache at 16K context is ~1 GiB — already much smaller than a same-shape pure transformer. Quantizing that (q8_0, q4_0, turbo) keeps all tokens and avoids the retrieval loss seen here.

3. Scoring Ablations

Tried several scoring changes against the budget=512 baseline (89%) on Qwen3-8B:

Tweak	Result
Stratified reserve 10-20%	worse (83-78%)
Max-over-offset beta 0.5-1.0	worse (78-61%)
V-lambda 0.5-2.0	worse (72-50%)
V-merge / merge-style variants	random-level or worse
Halo protection ±2-4	no measurable effect
Non-rotary alpha 0-10	no measurable effect

Current defaults were the best of the variants I tried. The consistent improvement came from increasing the retained-token budget, not from changing the scoring formula.

Branch

Updated branch: domvox/llama.cpp-turboquant-hip @ feature/triattention-scoring (26e9f259c)

GPU scoring consistency fix (commit 221c56721, reported in previous post)
TRIA_V_LAMBDA env var for v-energy tuning (default 0.25 unchanged)
All failed experiments reverted, clean state

0 replies

TurboQuant KV Cache Compression — Full HIP/ROCm Port (gfx1100) #21526

Uh oh!

Hardware / Software

What's included

Benchmark Results

Perplexity — Qwen3.5-9B Q4_K (Wikitext-2, 145 chunks, ctx=2048)

Perplexity — Qwen3.5-27B Q5_K_M (Wikitext-2, 20 chunks, ctx=2048)

Throughput — Qwen3.5-27B Q5_K_M (16K context)

VRAM — The Hero Case (27B Q5_K_M @ 80K context, 24 GB GPU)

Baseline Regression Check

Test Matrix

Known Limitations

Not Yet Tested

Build Instructions

Upstream Plan

Replies: 19 comments · 74 replies

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

domvox Apr 8, 2026 Author

Uh oh!

Uh oh!

domvox Apr 9, 2026 Author

Uh oh!

Uh oh!

domvox Apr 6, 2026 Author

Uh oh!

domvox Apr 7, 2026 Author

Gemma 4 31B Dense — TurboQuant preliminary results

PPL (WikiText-2 raw, ctx=4096)

Chat quality (turbo3 all layers)

Practical implications for 24GB GPUs

Summary: Gemma 4 family + TurboQuant

Uh oh!

Uh oh!

Uh oh!

domvox Apr 11, 2026 Author

Uh oh!

Uh oh!

domvox Apr 9, 2026 Author

TriAttention KV Cache Pruning — 32K context validated

Uh oh!

Uh oh!

domvox Apr 11, 2026 Author

Uh oh!

domvox Apr 10, 2026 Author

Uh oh!

RX 6700 XT | MN-12B-Mag-Mell-R1-Q5_K_L.gguf

Uh oh!

domvox Apr 17, 2026 Author

Uh oh!

Uh oh!

domvox Apr 10, 2026 Author

Update: Full benchmark results (2026-04-10, GSM8K corrected 2026-04-11)

GSM8K Math (Qwen3.5-27B Q5_K_M, temperature=0)

Needle-in-a-Haystack (Qwen3.5-27B, turbo3 K+V)

Tool Calling (Qwen3.5-27B, turbo3 K+V)

WikiText-2 PPL (Qwen3.5-27B Q5_K_M)

Speed

Gemma 4 findings

New: Boundary V hybrid architecture fix

Bug fixes

Uh oh!

domvox Apr 10, 2026 Author

Update on the VRAM growth issue (@stragulus)

Uh oh!

Uh oh!

domvox Apr 10, 2026 Author

Uh oh!

Uh oh!

Uh oh!

Replies: 19 comments 74 replies

domvox Apr 8, 2026
Author

domvox Apr 9, 2026
Author

domvox Apr 6, 2026
Author

domvox
Apr 7, 2026
Author

domvox Apr 11, 2026
Author

domvox
Apr 9, 2026
Author

domvox Apr 11, 2026
Author

domvox
Apr 10, 2026
Author

domvox Apr 17, 2026
Author

domvox
Apr 10, 2026
Author

domvox
Apr 10, 2026
Author

domvox
Apr 10, 2026
Author