HIP port of TurboQuant (Zandieh et al., ICLR 2026) for KV cache compression on AMD RDNA3 GPUs.
4.9x compression at 3-bit (TQ3) with zero accuracy loss. First working HIP implementation.
| Metric | TQ3 (3-bit) | TQ4 (4-bit) | Paper |
|---|---|---|---|
| MSE | 0.0337 | 0.0093 | 0.034 / 0.009 |
| Compression | 4.9x | 3.8x | 4.9x / 3.8x |
| Block size | 52 bytes | 68 bytes | — |
| Operation | Throughput | GB/s |
|---|---|---|
| Quantize (TQ3) | 64M vec/s | 36 GB/s |
| Dequantize (TQ3) | 178M vec/s | 101 GB/s |
| Fused dot product | 1.07B dot/s | — |
GPU vs CPU max difference: 0.000001 (agreement to within float rounding).
| Context | FP16 KV | TQ3 KV | Saved |
|---|---|---|---|
| 64K | 2.00 GiB | 0.41 GiB | 1.59 GiB |
| 128K | 4.00 GiB | 0.81 GiB | 3.19 GiB |
| 256K | 8.00 GiB | 1.62 GiB | 6.38 GiB |
Requires ROCm 6.x with HIP.
```sh
./build.sh                  # builds tq_bench
hipcc -O3 --offload-arch=gfx1100 -o tq_validate \
    ggml_turboquant.c ggml_turboquant.hip.cpp tq_validate.cpp -lm
./tq_bench                  # 65K vectors, dim=128
./tq_bench 262144 256       # 262K vectors, dim=256 (2×128 blocks)
./tq_validate               # 9 correctness tests
./tq_test                   # 18 CPU reference tests
```

| File | Description |
|---|---|
| `ggml_turboquant.hip.cpp` | HIP kernels: quantize, dequantize, fused dot (TQ3 + TQ4) |
| `ggml_turboquant.h` | Header: types, codebooks, API |
| `ggml_turboquant.c` | CPU reference implementation |
| `ggml_turboquant.cu` | Original CUDA kernels (reference) |
| `tq_hip_benchmark.cpp` | GPU benchmark harness |
| `tq_validate.cpp` | GPU validation suite (9 tests) |
| `tq_test.c` | CPU test suite (18 tests) |
Per-vector pipeline (head_dim=128):
- Norm — store `||x||₂` (4 bytes)
- Normalize — `x_unit = x / ||x||`
- Rotate — `y = Π · x_unit` (deterministic orthogonal matrix)
- Quantize — nearest centroid from Lloyd-Max codebook (3 or 4 bits)
- Bit-pack — 128 × 3 bits = 48 bytes (TQ3) or 128 × 4 bits = 64 bytes (TQ4)
Dequantize reverses: unpack → codebook lookup → inverse rotation → scale by norm.
Changes from CUDA original:
- `__shfl_down` without sync mask (HIP style)
- `warpSize`-aware reduction (works with RDNA3 wave32 and wave64)
- Parallel bit-packing via `atomicOr` on shared memory (8x faster than single-thread)
- TQ4 uses nibble packing (2 values per byte, no atomics needed)
- `hipMemcpyToSymbol` for constant-memory codebooks
- [x] TQ3 quantize/dequantize kernels
- [x] TQ4 quantize/dequantize kernels
- [x] Fused dot product kernel
- [x] GPU vs CPU validation (bit-exact)
- [x] MSE matches paper
- [ ] Integration with llama.cpp KV cache
- [ ] Real model testing
- Original CUDA implementation: veritatisquaesitoressumus
- Paper: Zandieh et al., "Online Vector Quantization with Near-optimal Distortion Rate", ICLR 2026 (arXiv:2504.19874)
- HIP port: domvox