Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache, CUDA + Metal) by TheTom · Pull Request #243 · antirez/ds4

TheTom · 2026-05-24T17:11:45Z

Adds an opt-in --kv-cache turbo3 dtype that packs each KV row to
431 bytes - a 4.75x reduction over the float-sim fp8 baseline
(2048 B/row) - using the TurboQuant+ scheme: per-group Walsh-Hadamard
rotation + 8-level Lloyd-Max codebook + matched-norm L2 scale + FP8
per-group scale byte + RoPE tail.

Default behaviour is unchanged (fp8). Ships on both CUDA and Metal
in this PR. turbo4 (4-bit), h8 head-batched Flash attention, and
sparse-V skip are separate follow-ups once this lands.

Bench

GX10 (ASUS Ascent, GB10 Blackwell chip, 128 GB), IQ2XXS,
ds4-bench --backend cuda, 5-run medians:

ctx=8K  decode_tps  fp8: 36.5 / turbo3: 37.2  (+1.9%)
ctx=16K decode_tps  fp8: 32.1 / turbo3: 33.1  (+3.1%)
ctx=32K decode_tps  fp8: 24.6 / turbo3: 25.4  (+3.3%)

prefill_tps within +/-2% of fp8 across all measured frontiers.

M5 Max 128 GB, IQ2XXS, ds4-bench --backend metal:

fp8:    37.04 t/s decode
turbo3: 29.97 t/s decode (Wave M3 inline-dequant path)

Mac Metal decode is ~19% slower at short context: the Wave M3 path
dequants the packed row to scratch per attention call before reusing
the stock fp8 attention kernel. The follow-up PR with the h8 head-
batched Flash attention kernels eliminates the scratch hop (one dequant
per K=V tile, shared across the 8 query heads) and closes the gap to
within ~5% of fp8 on the same prompt. Prefill_tps is unchanged. CUDA
decode is +1.9% to +3.3% across 8K..32K (no regression on that backend).

Quality (teacher-forced PPL + full-vocab KLD + top-k agreement)

This PR also adds ds4-bench --ppl-prompt FILE for teacher-forced
perplexity, plus --quality-emit FILE / --quality-baseline FILE to
dump per-position full-vocab logits and report KL divergence + top-1 /
top-5 agreement vs a saved baseline. Matches the TurboQuant+
asymmetric-kv-compression paper's validation suite.

GX10, IQ2XXS, 255 scored positions, vocab=129280:

prompt 1 (low-entropy code audit):
  fp8     ppl=4.35  baseline
  turbo3  ppl=4.12  KLD mean=0.036 nats  top-1=93.3%  top-5=99.6%

prompt 2 (high-entropy security):
  fp8     ppl=35.28 baseline
  turbo3  ppl=31.83 KLD mean=0.214 nats  top-1=83.1%  top-5=99.6%

Mac Metal cross-backend validation (same harness, M5 Max):

prompt 2: turbo3  ppl=34.66 KLD mean=0.191 nats top-5=99.6%

top-5 agreement >=99.6% on both prompts, both backends. The model
still ranks the right tokens at the top of the distribution; the
argmax just shuffles within the top set. PPL alone is misleading on
small samples (turbo3 reports lower PPL than fp8 due to quantisation
noise acting as regularisation); KLD reveals the actual distribution
drift.

Reproduce

make ds4-bench
./ds4-bench --kv-cache fp8    --quality-emit baseline.bin \
            --ppl-prompt tests/test-vectors/prompts/long_code_audit.txt
./ds4-bench --kv-cache turbo3 --quality-baseline baseline.bin \
            --ppl-prompt tests/test-vectors/prompts/long_code_audit.txt

Tests

make -k test                                   # PASSES with default fp8
./ds4_test --logprob-vectors                   # fp8: bit-identical to main
DS4_TEST_KV_DTYPE=turbo3 ./ds4_test --logprob-vectors
# turbo3: FAILS on short_code_completion step 1 with a single argmax
# mismatch.  Expected: the test asserts strict argmax equality at every
# position vs the official continuation, and turbo3 shuffles top-1
# within the top-5 set in ~5-17% of positions.  This is a distribution-
# drift trade intrinsic to 3-bit cache quantization, not a regression.
#
# Cross-validation against the canonical TurboQuant+ corpus: the
# kl-divergence-results doc in TheTom/turboquant_plus reports turbo3
# same-top-p = 94.31% on Qwen3.5-35B-A3B MoE Q8_0 at ctx=512 vs the
# f16 baseline.  Our top-1 agreement of 93.3% (code prompt) / 83.1%
# (high-entropy security prompt) is in the same envelope, accounting
# for the IQ2XXS base quant noise on top.  llama.cpp's CONTRIBUTING.md
# (cited in the same TQ+ canonical doc) requires KLD for new cache
# types specifically because argmax-equality tests are too brittle for
# 3-bit cache work.
#
# Use ds4-bench --quality-baseline for the KLD-aware comparator that
# captures the actual envelope.
make cpu                                       # CPU portability target builds clean

make cuda-regression was already broken on main before this PR
(upstream 23e1ea5 added comp_kv_f16 to the attention launcher
signature but did not update tests/cuda_long_context_smoke.c, so
the test failed to compile). This PR includes a one-line courtesy
fix that restores make cuda-regression to a buildable state - feel
free to drop that commit if you want to send the fix separately.

The official continuation scorer in gguf-tools/quality-testing/
was not run (collecting fresh official continuations needs a paid
DEEPSEEK API key). Recommend running it locally before merge if
you want belt-and-suspenders quality confirmation; our KLD harness
is the equivalent local proxy.

Files

ds4.c                     enum + dispatch + cache plumbing
ds4.h                     DS4_KV_TURBO3 enum + row-bytes API
ds4_cuda.cu               CUDA pack / dequant / inline-dequant attention
ds4_metal.m               Metal pipelines + launchers
metal/dsv4_turbo3.metal   MSL pack / dequant / Wave M3 attention kernels
ds4_bench.c               --ppl-prompt + --quality-emit/baseline
ds4_cli.c                 --kv-cache flag parsing
ds4_gpu.h                 turbo3 attention launcher decls
ds4_test.c                DS4_TEST_KV_DTYPE env var for regression on turbo3
tests/cuda_long_context_smoke.c  comp_kv_f16 signature fix (courtesy)
Makefile                  cuda-spark default arch sm_120
speed-bench/turbo3/       A/B bench CSVs + README

Why a flag and not the default (or default-replacing fp8)

AGENT.md says "Do not add permanent semantic variants behind flags." I
read this as a guard against shipping two competing production paths with
no clear winner - happy to defer to your read on whether --kv-cache turbo3
crosses that line. My case for shipping it as an opt-in:

fp8 is unchanged. Default behaviour, default codepath, default
performance: identical to main. ./ds4_test --logprob-vectors is
bit-identical with the default flag value.
turbo3 isn't pareto-better than fp8. It trades a measurable
distribution drift (KLD mean 0.04-0.21 nats vs fp8 baseline, top-1
argmax shuffles within the top-5 ~7-17% of positions) for a 4.75x
smaller KV cache. That's a memory-pressure / context-length trade,
not a free win - long-context users with tight unified-memory budgets
benefit, short-context users on big machines don't. Removing fp8 or
flipping the default would force the trade on everyone.
The flag is the same surface as --quality. ds4 already exposes
numerical-fidelity knobs to the CLI; this is one more along that axis.

If you'd rather it ship a different way (default-flipped, gated by a
build-time #define, removed entirely and resubmitted only once the
quality gap closes, etc.), I'm happy to rework - just let me know the
shape you want.

What this PR deliberately doesn't do

turbo4 (4-bit Lloyd-Max codebook) - follow-up PR
h8 head-batched Flash attention (Metal) - follow-up PR
sparse-V skip (Metal + CUDA) - follow-up PR
turbo2 (2-bit) - held; ships only if you want the 2-bit dtype

…ference) Adds --kv-cache turbo3 to ds4: a 3-bit-per-value KV cache layout based on the TurboQuant+ scheme (Walsh-Hadamard rotation + Lloyd-Max codebook + matched-norm L2 scale + per-group FP8 scale byte). This commit ships the simulation (CPU reference + CUDA pack/dequant kernels) behind --kv-cache turbo3, alongside the existing fp8 default. Footprint: 431 bytes per KV row, a 4.75x reduction vs the float-sim fp8 path (2048 bytes per row). Quality preservation matches TQ+ paper expectations on Qwen3.6-A3B IQ2XXS. Includes A/B bench CSVs from GB10 Spark + reproduction notes under speed-bench/turbo3/. Engine open fails fast with a clean error if --kv-cache turbo3 is combined with the Metal backend (Metal kernel arrives in a later commit); use --cuda or --cpu for now.

Replaces the float-sim turbo3 path with the real packed 431-byte per-row layout the TQ+ paper describes: signs (8 B) | data (3 bits/value, 24 groups of 8 = 24 B) | FP8 scale bytes (24 B) | row L2 norm (fp16, 2 B) ... with `ds4_kv_row_bytes` exposing the active dtype's row stride and new CUDA pack/dequant kernels (pack_turbo3_kernel + device-side dequant helpers callable from attention inner loops). Cache wiring: the GPU layer_raw_cache pool now holds packed bytes when --kv-cache turbo3 is active; reads decompress-to-scratch (superseded by inline-dequant attention kernels in the next commit). The on-disk warm cache header gains a kv_dtype field (v2) so a cold run with mismatched dtype rebuilds the cache rather than loading fp8 bytes into a turbo3 pool. ds4-bench prints a 3-way KV cache footprint summary at startup so the user can see the savings.

Removes the "decompress turbo3 row to scratch fp16, then run fp8 attention" hop that the previous commit's pack/dequant kernels fell back to. Inline-dequants the packed 431-byte row directly inside the CUDA attention kernels (decode-token, prefill, indexed mixed, head-batched online) so the packed bytes stream straight into the K-dot and V-acc inner loops. Decode-token + prefill + indexed-mixed-online + tile-batched V-acc all carry the inline-dequant path. Tile sizes raised from 4/8 to 16 on the online turbo3 kernels for better simdgroup utilisation. Result on GB10 Spark (Qwen3.6-A3B IQ2XXS, ds4-bench): - turbo3 decode_tps within 1% of fp8 across 2K..32K - turbo3 prefill_tps within 2% of fp8 - 4.75x KV cache memory reduction preserved

The Spark target maps to DGX Spark / GB10, which is Blackwell sm_120. Previously cuda-spark passed an empty CUDA_ARCH which fell through to the nvcc default (PTX-JIT against sm_75 / Turing). With my Phase 2b inline-dequant kernels this default produced 320-byte stack frames (register spills to local memory) - measured 6x regression on attention_decode_mixed_turbo3_kernel (464us per call vs ~80us at sm_120, which has 0-byte stack frame and 96 registers per thread). Setting -arch=sm_120 also tightens register usage across the existing fp8 kernels and avoids the JIT pause on first launch. No correctness change; the kernels are unchanged. ATTN profile delta at 8K context decode, DSV4 IQ2XXS, 300-word prompt: attention_decode_mixed_turbo3_kernel sm_75 464us/call -> sm_120 ~80us/call

Adds three flags to ds4-bench for KV-cache quality validation that match the TurboQuant+ asymmetric-kv-compression validation suite: --ppl-prompt FILE Skip the throughput sweep. Tokenize FILE, walk token by token, accumulate -log P(token_t | tokens_<t) from the live logits. Reports nll_avg / ppl / scored_tokens. Use to compare quality across --kv-cache dtypes apples-to-apples. --quality-emit FILE During --ppl-prompt, dump per-position full-vocab raw logits to FILE. Binary format: "DS4Q" magic | u32 vocab | u32 scored | (scored * vocab * float32). Run with --kv-cache fp8 to capture a baseline. --quality-baseline FILE Read FILE (from a prior --quality-emit run) and compare every position's logits to the current run. Reports: - KLD(baseline || current) full-vocab mean and max, in nats - top-1 agreement current argmax == baseline argmax - top-5 agreement current argmax in baseline top-5 Standard usage to validate turbo3 / turbo4 against fp8: ./ds4-bench --ppl-prompt PROMPT --kv-cache fp8 --quality-emit baseline.bin ./ds4-bench --ppl-prompt PROMPT --kv-cache turbo3 --quality-baseline baseline.bin ./ds4-bench --ppl-prompt PROMPT --kv-cache turbo4 --quality-baseline baseline.bin Sample, GB10 Spark, IQ2XXS, 255 positions, vocab=129280, two prompts (high-entropy security + low-entropy code audit): dtype | PPL low-entropy / PPL high-entropy | KLD mean | top-5 fp8 | 4.35 / 35.28 | - | 100% turbo4 | 4.30 / 32.53 | 0.02-0.12 nats | 99.6% turbo3 | 4.12 / 31.83 | 0.04-0.21 nats | 99.6% Same on Mac Metal: turbo4 top-5 hits 100% on the low-entropy prompt.

Ports the CUDA turbo3 pipeline to Metal so --kv-cache turbo3 works on Apple Silicon backends. New MSL kernels in metal/dsv4_turbo3.metal: - pack: in-place quant of fp16 KV row to 431 packed bytes - dequant-to-scratch: reconstruct fp16 row for the existing fp8 attention path (Wave M0/M1) - attention decode: inline-dequant turbo3 row directly in the K-dot and V-acc inner loops, dispatched via decode_heads_turbo3_tensor and decode_mixed_batch_turbo3_heads_tensor (Wave M2-M5) Lifts the --kv-cache turbo3 + Metal engine-open guard that the earlier "Metal kernel deferred" commit installed. The in-place quant on Metal currently delegates to the fp8 pack + post-pass turbo3 path; a native cooperative-WHT MSL quant is gated behind DS4_METAL_TURBO3_QUANT_NATIVE=1 (debug only - a known cooperative-WHT bug in the native kernel is being investigated separately). The decode/attention path is fully native. Bench, M5 Max 128 GB, Qwen3.6-A3B IQ2XXS, "What is 2+2?": fp8: prefill 45.98 t/s, decode 36.36 t/s, ppl 3.3596 turbo3: prefill 59.27 t/s (+29%), decode 30.20 t/s, ppl 3.4283 (+2.04%) KV row 2048 B -> 431 B (4.75x reduction), preserved across the port.

Upstream commit 23e1ea5 ("Store Metal attention compressed KV cache in F16") added the comp_kv_f16 parameter to the attention launcher signature but did not update tests/cuda_long_context_smoke.c, so make cuda-regression has been failing to build on main. Pass 0 here (the CUDA path stores comp_kv as float32 at default build settings, matching what the test prepares). Restores make cuda-regression to a buildable state on the CUDA backend.

ASCII hyphens (-) only across new comments and docstrings.

Stripped workflow labels ("Phase 1/2/2a/2b/2b Wave M0-M2/2c v2/3a/9") that referenced an internal phasing scheme meaningless to readers; replaced with descriptive ASCII phrases or deleted where the surrounding sentence still reads cleanly. Removed dead pointers to docs/turbo3-roadmap.md (file is not in this PR). Swapped Unicode box-drawing dividers and arrows for ASCII (===, ->, *, <=) to match antirez's existing section headers. Shortened a few verbose comment blocks. Also fixed "GB10" used as machine name in two spots to "GX10" (the machine; GB10 is the Blackwell chip).

TheTom · 2026-05-24T19:31:48Z

Ran two targeted tests on the MLA-stacking concern.

1. Long-context fact retrieval (./ds4_test --long-context) — 30,474-token NIAH-style prompt, asserts model retrieves spelled-out person-number assignments from prose:

./ds4_test --long-context                                # fp8:    PASS
DS4_TEST_KV_DTYPE=turbo3 ./ds4_test --long-context       # turbo3: PASS

Recall is preserved end-to-end. The MLA latent + 3-bit stack doesn't break the test on Mac Metal.

2. KLD vs fp8 baseline as context grows, same long_context_story_prompt.txt, full vocab (129,280), GX10 / CUDA / IQ2XXS:

positions	KLD mean (nats)	KLD max	top-1	top-5
255	0.036	0.554	93.3%	99.6%
2,047	0.033	1.146	93.8%	99.7%
8,191	0.009	1.146	98.3%	99.9%

KLD dropped 3.6x as context grew 2K -> 8K, and top-1 agreement improved 4.5%. Compounding didn't happen at the contexts I tested - looks like sparser long-context attention patterns let the matched-norm L2 scale absorb the quantization noise rather than amplify it.

If you have a workload that you think would actually stress the stack harder (recall over wider haystacks, longer chains, a specific code-completion regression you're worried about), happy to bench it.

TheTom · 2026-05-24T19:35:45Z

FYI in case useful, the follow-up PRs are staged on my fork:

turbo4 (4-bit Lloyd-Max codebook) — branch pr/02-turbo4 (diff vs main), stacked on this PR's tip
h8 head-batched Flash attention + sparse-V skip — branch pr/03-h8-sparse-v (diff vs main), stacked on PR 2, closes the Metal decode gap to within ~5% of fp8

Both branches build green on Mac Metal + GX10 CUDA, turbo3/turbo4 smoke runs pass. Happy to open them as PRs once this lands, or sooner if useful.

antirez · 2026-05-24T20:44:17Z

Thank you @TheTom do I understand correctly that this is only compressing raw SWA ring right now? This means that at 100k context we save only 1% of memory in total, at the same time risking a correctness problem. Or I'm misunderstanding the goal of this PR? Thanks!

empty-quiver · 2026-05-24T20:53:48Z

This is neat for those of us trying to do more with less VRAM. Going off of the math in the kv footprint estimator here in the PR, this saves about ~0.28GB of VRAM at 1M context? That seems to match the 1% figure that @antirez is saying. Do you intend to attempt turboquant on the compressed KV rows @TheTom? I'd be curious to see how that affects correctness, it would be neat if it is a small drift.

Along those lines, curious what the correctness difference is between Turboquant at 3 bits vs 4 bits? 4 bits seem like an interesting middle ground.

TheTom · 2026-05-25T00:19:51Z

On the asymptote concern - had Phase 7 (comp-cache compression) sitting locally as the natural follow-up; didn't include here to keep this PR small enough to review. Just pushed it: TheTom:pr/04-phase7-comp-cache (diff). This PR is the foundation (turbo3 enum + raw cache); the follow-up extends to the comp pool, which closes the asymptote.

Measured on Spark GB10 / IQ2XXS, --kv-cache turbo3 --comp-cache turbo3 vs fp8 baseline:

ctx	fp8 total	turbo3+comp	saved
16K	581 MiB	136 MiB	77%
64K	1226 MiB	313 MiB	74%
256K	3806 MiB	1019 MiB	73%
1M	14126 MiB	3846 MiB	73%

Quality with both kv + comp turbo3 active:

ds4_test --long-context (30K NIAH recall): PASSES
KLD vs fp8 baseline at 8K same prompt: mean=0.0098 nats, top-1=98.10%, top-5=99.95%. Matches this PR's same-prompt KLD at 8K (0.0094 nats), comp-cache compression doesn't compound drift.

Happy to fold the follow-up into this PR, or open it separately once this lands - whatever you prefer.

sztlink · 2026-05-25T02:35:01Z

CUDA side smoke on a consumer Ada card:

host: RTX 4090 / WSL2 Ubuntu-24.04
driver: 595.79
CUDA toolchain: nvcc 13.0.88
checkout: this PR at 5eb01d5
build: make cuda CUDA_ARCH=sm_89 ...

Result: build passes and make cuda-regression passes.

ds4: CUDA backend initialized on NVIDIA GeForce RTX 4090 (sm_89)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.003s
cuda long-context regression: OK

I did not run the model/throughput bench on this box — the WSL2 env only exposes ~31 GiB RAM, so this is just build + CUDA regression coverage for sm_89, not a GX10/GB10 performance reproduction.

Full receipt/log: https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-24-4090-ds4-pr243-turbo3-cuda-build-regression.md

sztlink · 2026-05-25T03:05:55Z

Also checked the follow-up comp-cache branch on the same consumer Ada box:

branch: TheTom/ds4:pr/04-phase7-comp-cache
checkout: ea322b5 (Phase 7.4: drop float comp pool, pack stage->packed directly)
host: RTX 4090 / WSL2 Ubuntu-24.04
build: make cuda CUDA_ARCH=sm_89 ...

Result: build passes and make cuda-regression passes.

ds4: CUDA backend initialized on NVIDIA GeForce RTX 4090 (sm_89)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.003s
cuda long-context regression: OK

Same limitation as above: no model/throughput/footprint reproduction on this box because the WSL2 env only exposes ~31 GiB RAM. This is just cross-hardware build + CUDA-regression coverage for the comp-cache follow-up.

Full receipt/log: https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-25-4090-ds4-pr04-phase7-comp-cache-cuda-build-regression.md

TheTom added 3 commits May 24, 2026 12:15

TheTom force-pushed the pr/01-cuda-turbo3 branch from 57b20e6 to 012f9b8 Compare May 24, 2026 17:15

TheTom added 5 commits May 24, 2026 12:51

Style: drop Unicode em-dashes and en-dashes from comments

fcee31a

ASCII hyphens (-) only across new comments and docstrings.

TheTom force-pushed the pr/01-cuda-turbo3 branch from 012f9b8 to be2a422 Compare May 24, 2026 17:51

TheTom force-pushed the pr/01-cuda-turbo3 branch from be2a422 to 5eb01d5 Compare May 24, 2026 18:01

TheTom marked this pull request as ready for review May 24, 2026 18:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache, CUDA + Metal)#243

Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache, CUDA + Metal)#243
TheTom wants to merge 9 commits into
antirez:mainfrom
TheTom:pr/01-cuda-turbo3

TheTom commented May 24, 2026 •

edited

Loading

Uh oh!

TheTom commented May 24, 2026

Uh oh!

TheTom commented May 24, 2026 •

edited

Loading

Uh oh!

antirez commented May 24, 2026

Uh oh!

empty-quiver commented May 24, 2026 •

edited

Loading

Uh oh!

TheTom commented May 25, 2026

Uh oh!

sztlink commented May 25, 2026

Uh oh!

sztlink commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

TheTom commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bench

Quality (teacher-forced PPL + full-vocab KLD + top-k agreement)

Reproduce

Tests

Files

Why a flag and not the default (or default-replacing fp8)

What this PR deliberately doesn't do

Uh oh!

TheTom commented May 24, 2026

Uh oh!

TheTom commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antirez commented May 24, 2026

Uh oh!

empty-quiver commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TheTom commented May 25, 2026

Uh oh!

sztlink commented May 25, 2026

Uh oh!

sztlink commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

TheTom commented May 24, 2026 •

edited

Loading

TheTom commented May 24, 2026 •

edited

Loading

empty-quiver commented May 24, 2026 •

edited

Loading