Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache, CUDA + Metal)#243
Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache, CUDA + Metal)#243TheTom wants to merge 9 commits into
Conversation
…ference) Adds --kv-cache turbo3 to ds4: a 3-bit-per-value KV cache layout based on the TurboQuant+ scheme (Walsh-Hadamard rotation + Lloyd-Max codebook + matched-norm L2 scale + per-group FP8 scale byte). This commit ships the simulation (CPU reference + CUDA pack/dequant kernels) behind --kv-cache turbo3, alongside the existing fp8 default. Footprint: 431 bytes per KV row, a 4.75x reduction vs the float-sim fp8 path (2048 bytes per row). Quality preservation matches TQ+ paper expectations on Qwen3.6-A3B IQ2XXS. Includes A/B bench CSVs from GB10 Spark + reproduction notes under speed-bench/turbo3/. Engine open fails fast with a clean error if --kv-cache turbo3 is combined with the Metal backend (Metal kernel arrives in a later commit); use --cuda or --cpu for now.
Replaces the float-sim turbo3 path with the real packed 431-byte per-row layout the TQ+ paper describes: signs (8 B) | data (3 bits/value, 24 groups of 8 = 24 B) | FP8 scale bytes (24 B) | row L2 norm (fp16, 2 B) ... with `ds4_kv_row_bytes` exposing the active dtype's row stride and new CUDA pack/dequant kernels (pack_turbo3_kernel + device-side dequant helpers callable from attention inner loops). Cache wiring: the GPU layer_raw_cache pool now holds packed bytes when --kv-cache turbo3 is active; reads decompress-to-scratch (superseded by inline-dequant attention kernels in the next commit). The on-disk warm cache header gains a kv_dtype field (v2) so a cold run with mismatched dtype rebuilds the cache rather than loading fp8 bytes into a turbo3 pool. ds4-bench prints a 3-way KV cache footprint summary at startup so the user can see the savings.
Removes the "decompress turbo3 row to scratch fp16, then run fp8 attention" hop that the previous commit's pack/dequant kernels fell back to. Inline-dequants the packed 431-byte row directly inside the CUDA attention kernels (decode-token, prefill, indexed mixed, head-batched online) so the packed bytes stream straight into the K-dot and V-acc inner loops. Decode-token + prefill + indexed-mixed-online + tile-batched V-acc all carry the inline-dequant path. Tile sizes raised from 4/8 to 16 on the online turbo3 kernels for better simdgroup utilisation. Result on GB10 Spark (Qwen3.6-A3B IQ2XXS, ds4-bench): - turbo3 decode_tps within 1% of fp8 across 2K..32K - turbo3 prefill_tps within 2% of fp8 - 4.75x KV cache memory reduction preserved
The Spark target maps to DGX Spark / GB10, which is Blackwell sm_120. Previously cuda-spark passed an empty CUDA_ARCH which fell through to the nvcc default (PTX-JIT against sm_75 / Turing). With my Phase 2b inline-dequant kernels this default produced 320-byte stack frames (register spills to local memory) - measured 6x regression on attention_decode_mixed_turbo3_kernel (464us per call vs ~80us at sm_120, which has 0-byte stack frame and 96 registers per thread). Setting -arch=sm_120 also tightens register usage across the existing fp8 kernels and avoids the JIT pause on first launch. No correctness change; the kernels are unchanged. ATTN profile delta at 8K context decode, DSV4 IQ2XXS, 300-word prompt: attention_decode_mixed_turbo3_kernel sm_75 464us/call -> sm_120 ~80us/call
Adds three flags to ds4-bench for KV-cache quality validation
that match the TurboQuant+ asymmetric-kv-compression validation suite:
--ppl-prompt FILE
Skip the throughput sweep. Tokenize FILE, walk token by token,
accumulate -log P(token_t | tokens_<t) from the live logits.
Reports nll_avg / ppl / scored_tokens. Use to compare quality
across --kv-cache dtypes apples-to-apples.
--quality-emit FILE
During --ppl-prompt, dump per-position full-vocab raw logits to
FILE. Binary format: "DS4Q" magic | u32 vocab | u32 scored
| (scored * vocab * float32). Run with --kv-cache fp8 to
capture a baseline.
--quality-baseline FILE
Read FILE (from a prior --quality-emit run) and compare every
position's logits to the current run. Reports:
- KLD(baseline || current) full-vocab mean and max, in nats
- top-1 agreement current argmax == baseline argmax
- top-5 agreement current argmax in baseline top-5
Standard usage to validate turbo3 / turbo4 against fp8:
./ds4-bench --ppl-prompt PROMPT --kv-cache fp8 --quality-emit baseline.bin
./ds4-bench --ppl-prompt PROMPT --kv-cache turbo3 --quality-baseline baseline.bin
./ds4-bench --ppl-prompt PROMPT --kv-cache turbo4 --quality-baseline baseline.bin
Sample, GB10 Spark, IQ2XXS, 255 positions, vocab=129280, two prompts
(high-entropy security + low-entropy code audit):
dtype | PPL low-entropy / PPL high-entropy | KLD mean | top-5
fp8 | 4.35 / 35.28 | - | 100%
turbo4 | 4.30 / 32.53 | 0.02-0.12 nats | 99.6%
turbo3 | 4.12 / 31.83 | 0.04-0.21 nats | 99.6%
Same on Mac Metal: turbo4 top-5 hits 100% on the low-entropy prompt.
Ports the CUDA turbo3 pipeline to Metal so --kv-cache turbo3 works
on Apple Silicon backends. New MSL kernels in metal/dsv4_turbo3.metal:
- pack: in-place quant of fp16 KV row to 431 packed bytes
- dequant-to-scratch: reconstruct fp16 row for the existing fp8
attention path (Wave M0/M1)
- attention decode: inline-dequant turbo3 row directly in the
K-dot and V-acc inner loops, dispatched
via decode_heads_turbo3_tensor and
decode_mixed_batch_turbo3_heads_tensor
(Wave M2-M5)
Lifts the --kv-cache turbo3 + Metal engine-open guard that the earlier
"Metal kernel deferred" commit installed.
The in-place quant on Metal currently delegates to the fp8 pack +
post-pass turbo3 path; a native cooperative-WHT MSL quant is gated
behind DS4_METAL_TURBO3_QUANT_NATIVE=1 (debug only - a known
cooperative-WHT bug in the native kernel is being investigated
separately). The decode/attention path is fully native.
Bench, M5 Max 128 GB, Qwen3.6-A3B IQ2XXS, "What is 2+2?":
fp8: prefill 45.98 t/s, decode 36.36 t/s, ppl 3.3596
turbo3: prefill 59.27 t/s (+29%), decode 30.20 t/s, ppl 3.4283 (+2.04%)
KV row 2048 B -> 431 B (4.75x reduction), preserved across the port.
Upstream commit 23e1ea5 ("Store Metal attention compressed KV cache in F16") added the comp_kv_f16 parameter to the attention launcher signature but did not update tests/cuda_long_context_smoke.c, so make cuda-regression has been failing to build on main. Pass 0 here (the CUDA path stores comp_kv as float32 at default build settings, matching what the test prepares). Restores make cuda-regression to a buildable state on the CUDA backend.
ASCII hyphens (-) only across new comments and docstrings.
Stripped workflow labels ("Phase 1/2/2a/2b/2b Wave M0-M2/2c v2/3a/9") that
referenced an internal phasing scheme meaningless to readers; replaced with
descriptive ASCII phrases or deleted where the surrounding sentence still
reads cleanly. Removed dead pointers to docs/turbo3-roadmap.md (file is
not in this PR). Swapped Unicode box-drawing dividers and arrows for
ASCII (===, ->, *, <=) to match antirez's existing section headers.
Shortened a few verbose comment blocks. Also fixed "GB10" used as machine
name in two spots to "GX10" (the machine; GB10 is the Blackwell chip).
|
Ran two targeted tests on the MLA-stacking concern. 1. Long-context fact retrieval ( Recall is preserved end-to-end. The MLA latent + 3-bit stack doesn't break the test on Mac Metal. 2. KLD vs fp8 baseline as context grows, same
KLD dropped 3.6x as context grew 2K -> 8K, and top-1 agreement improved 4.5%. Compounding didn't happen at the contexts I tested - looks like sparser long-context attention patterns let the matched-norm L2 scale absorb the quantization noise rather than amplify it. If you have a workload that you think would actually stress the stack harder (recall over wider haystacks, longer chains, a specific code-completion regression you're worried about), happy to bench it. |
|
FYI in case useful, the follow-up PRs are staged on my fork:
Both branches build green on Mac Metal + GX10 CUDA, turbo3/turbo4 smoke runs pass. Happy to open them as PRs once this lands, or sooner if useful. |
|
Thank you @TheTom do I understand correctly that this is only compressing raw SWA ring right now? This means that at 100k context we save only 1% of memory in total, at the same time risking a correctness problem. Or I'm misunderstanding the goal of this PR? Thanks! |
|
This is neat for those of us trying to do more with less VRAM. Going off of the math in the kv footprint estimator here in the PR, this saves about ~0.28GB of VRAM at 1M context? That seems to match the 1% figure that @antirez is saying. Do you intend to attempt turboquant on the compressed KV rows @TheTom? I'd be curious to see how that affects correctness, it would be neat if it is a small drift. Along those lines, curious what the correctness difference is between Turboquant at 3 bits vs 4 bits? 4 bits seem like an interesting middle ground. |
|
On the asymptote concern - had Phase 7 (comp-cache compression) sitting locally as the natural follow-up; didn't include here to keep this PR small enough to review. Just pushed it: TheTom:pr/04-phase7-comp-cache (diff). This PR is the foundation (turbo3 enum + raw cache); the follow-up extends to the comp pool, which closes the asymptote. Measured on Spark GB10 / IQ2XXS,
Quality with both kv + comp turbo3 active:
Happy to fold the follow-up into this PR, or open it separately once this lands - whatever you prefer. |
|
CUDA side smoke on a consumer Ada card:
Result: build passes and ds4: CUDA backend initialized on NVIDIA GeForce RTX 4090 (sm_89)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.003s
cuda long-context regression: OKI did not run the model/throughput bench on this box — the WSL2 env only exposes ~31 GiB RAM, so this is just build + CUDA regression coverage for Full receipt/log: https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-24-4090-ds4-pr243-turbo3-cuda-build-regression.md |
|
Also checked the follow-up comp-cache branch on the same consumer Ada box:
Result: build passes and ds4: CUDA backend initialized on NVIDIA GeForce RTX 4090 (sm_89)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.003s
cuda long-context regression: OKSame limitation as above: no model/throughput/footprint reproduction on this box because the WSL2 env only exposes ~31 GiB RAM. This is just cross-hardware build + CUDA-regression coverage for the comp-cache follow-up. Full receipt/log: https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-25-4090-ds4-pr04-phase7-comp-cache-cuda-build-regression.md |
Adds an opt-in
--kv-cache turbo3dtype that packs each KV row to431 bytes - a 4.75x reduction over the float-sim fp8 baseline
(2048 B/row) - using the TurboQuant+ scheme: per-group Walsh-Hadamard
rotation + 8-level Lloyd-Max codebook + matched-norm L2 scale + FP8
per-group scale byte + RoPE tail.
Default behaviour is unchanged (fp8). Ships on both CUDA and Metal
in this PR. turbo4 (4-bit), h8 head-batched Flash attention, and
sparse-V skip are separate follow-ups once this lands.
Bench
GX10 (ASUS Ascent, GB10 Blackwell chip, 128 GB), IQ2XXS,
ds4-bench --backend cuda, 5-run medians:prefill_tpswithin +/-2% of fp8 across all measured frontiers.M5 Max 128 GB, IQ2XXS,
ds4-bench --backend metal:Mac Metal decode is ~19% slower at short context: the Wave M3 path
dequants the packed row to scratch per attention call before reusing
the stock fp8 attention kernel. The follow-up PR with the h8 head-
batched Flash attention kernels eliminates the scratch hop (one dequant
per K=V tile, shared across the 8 query heads) and closes the gap to
within ~5% of fp8 on the same prompt. Prefill_tps is unchanged. CUDA
decode is +1.9% to +3.3% across 8K..32K (no regression on that backend).
Quality (teacher-forced PPL + full-vocab KLD + top-k agreement)
This PR also adds
ds4-bench --ppl-prompt FILEfor teacher-forcedperplexity, plus
--quality-emit FILE/--quality-baseline FILEtodump per-position full-vocab logits and report KL divergence + top-1 /
top-5 agreement vs a saved baseline. Matches the TurboQuant+
asymmetric-kv-compression paper's validation suite.
GX10, IQ2XXS, 255 scored positions, vocab=129280:
Mac Metal cross-backend validation (same harness, M5 Max):
top-5 agreement >=99.6% on both prompts, both backends. The model
still ranks the right tokens at the top of the distribution; the
argmax just shuffles within the top set. PPL alone is misleading on
small samples (turbo3 reports lower PPL than fp8 due to quantisation
noise acting as regularisation); KLD reveals the actual distribution
drift.
Reproduce
make ds4-bench ./ds4-bench --kv-cache fp8 --quality-emit baseline.bin \ --ppl-prompt tests/test-vectors/prompts/long_code_audit.txt ./ds4-bench --kv-cache turbo3 --quality-baseline baseline.bin \ --ppl-prompt tests/test-vectors/prompts/long_code_audit.txtTests
make cuda-regressionwas already broken onmainbefore this PR(upstream
23e1ea5addedcomp_kv_f16to the attention launchersignature but did not update
tests/cuda_long_context_smoke.c, sothe test failed to compile). This PR includes a one-line courtesy
fix that restores
make cuda-regressionto a buildable state - feelfree to drop that commit if you want to send the fix separately.
The official continuation scorer in
gguf-tools/quality-testing/was not run (collecting fresh official continuations needs a paid
DEEPSEEK API key). Recommend running it locally before merge if
you want belt-and-suspenders quality confirmation; our KLD harness
is the equivalent local proxy.
Files
Why a flag and not the default (or default-replacing fp8)
AGENT.md says "Do not add permanent semantic variants behind flags." I
read this as a guard against shipping two competing production paths with
no clear winner - happy to defer to your read on whether
--kv-cache turbo3crosses that line. My case for shipping it as an opt-in:
performance: identical to main.
./ds4_test --logprob-vectorsisbit-identical with the default flag value.
distribution drift (KLD mean 0.04-0.21 nats vs fp8 baseline, top-1
argmax shuffles within the top-5 ~7-17% of positions) for a 4.75x
smaller KV cache. That's a memory-pressure / context-length trade,
not a free win - long-context users with tight unified-memory budgets
benefit, short-context users on big machines don't. Removing fp8 or
flipping the default would force the trade on everyone.
--quality. ds4 already exposesnumerical-fidelity knobs to the CLI; this is one more along that axis.
If you'd rather it ship a different way (default-flipped, gated by a
build-time
#define, removed entirely and resubmitted only once thequality gap closes, etc.), I'm happy to rework - just let me know the
shape you want.
What this PR deliberately doesn't do