
TurboQuant HIP — KV Cache Compression for AMD GPUs

HIP port of TurboQuant (Zandieh et al., ICLR 2026) for KV cache compression on AMD RDNA3 GPUs.

4.9x compression at 3-bit (TQ3) with zero accuracy loss. First working HIP implementation.

Results on RX 7900 XTX (gfx1100)

| Metric      | TQ3 (3-bit) | TQ4 (4-bit) | Paper (TQ3 / TQ4) |
|-------------|-------------|-------------|-------------------|
| MSE         | 0.0337      | 0.0093      | 0.034 / 0.009     |
| Compression | 4.9x        | 3.8x        | 4.9x / 3.8x       |
| Block size  | 52 bytes    | 68 bytes    |                   |

| Operation         | Throughput   | Bandwidth |
|-------------------|--------------|-----------|
| Quantize (TQ3)    | 64M vec/s    | 36 GB/s   |
| Dequantize (TQ3)  | 178M vec/s   | 101 GB/s  |
| Fused dot product | 1.07B dot/s  |           |

GPU vs CPU max diff: 0.000001 (agreement to within float32 rounding).

VRAM Savings (Qwen3.5-9B, 8 attention layers, 4 KV heads, head_dim 256)

| Context | FP16 KV  | TQ3 KV   | Saved    |
|---------|----------|----------|----------|
| 64K     | 2.00 GiB | 0.41 GiB | 1.59 GiB |
| 128K    | 4.00 GiB | 0.81 GiB | 3.19 GiB |
| 256K    | 8.00 GiB | 1.62 GiB | 6.38 GiB |

Build

Requires ROCm 6.x with HIP.

./build.sh                          # builds tq_bench
hipcc -O3 --offload-arch=gfx1100 -o tq_validate \
    ggml_turboquant.c ggml_turboquant.hip.cpp tq_validate.cpp -lm

Run

./tq_bench                          # 65K vectors, dim=128
./tq_bench 262144 256               # 262K vectors, dim=256 (2×128 blocks)
./tq_validate                       # 9 correctness tests
./tq_test                           # 18 CPU reference tests

Files

| File                    | Description                                                 |
|-------------------------|-------------------------------------------------------------|
| ggml_turboquant.hip.cpp | HIP kernels: quantize, dequantize, fused dot (TQ3 + TQ4)    |
| ggml_turboquant.h       | Header: types, codebooks, API                               |
| ggml_turboquant.c       | CPU reference implementation                                |
| ggml_turboquant.cu      | Original CUDA kernels (reference)                           |
| tq_hip_benchmark.cpp    | GPU benchmark harness                                       |
| tq_validate.cpp         | GPU validation suite (9 tests)                              |
| tq_test.c               | CPU test suite (18 tests)                                   |

Algorithm

Per-vector pipeline (head_dim=128):

  1. Norm — store ||x||₂ (4 bytes)
  2. Normalize — x_unit = x / ||x||
  3. Rotate — y = Π · x_unit (deterministic orthogonal matrix)
  4. Quantize — nearest centroid from Lloyd-Max codebook (3 or 4 bits)
  5. Bit-pack — 128 × 3 bits = 48 bytes (TQ3) or 128 × 4 bits = 64 bytes (TQ4)

Dequantize reverses: unpack → codebook lookup → inverse rotation → scale by norm.

HIP Port Notes

Changes from CUDA original:

  • __shfl_down without sync mask (HIP style)
  • warpSize-aware reduction (works with RDNA3 wave32 and wave64)
  • Parallel bit-packing via atomicOr on shared memory (8x faster than single-thread)
  • TQ4 uses nibble packing (2 values per byte, no atomics needed)
  • hipMemcpyToSymbol for constant memory codebooks
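The reason TQ4 needs no atomics is ownership: two 4-bit codes share one byte, so in a parallel kernel each thread can own one output byte outright, whereas TQ3's 3-bit codes straddle byte boundaries and concurrent writers must OR into shared bytes. A CPU sketch of the nibble layout (function names illustrative, not the library's API):

```c
#include <stdint.h>

/* Pack 128 4-bit codes into 64 bytes: even index -> low nibble,
 * odd index -> high nibble. Each output byte has exactly one
 * producer, which is why the GPU kernel needs no atomicOr here. */
static void tq4_pack(const uint8_t codes[128], uint8_t out[64]) {
    for (int i = 0; i < 64; i++)
        out[i] = (uint8_t)((codes[2 * i] & 0x0F) | (codes[2 * i + 1] << 4));
}

static void tq4_unpack(const uint8_t in[64], uint8_t codes[128]) {
    for (int i = 0; i < 64; i++) {
        codes[2 * i]     = in[i] & 0x0F;
        codes[2 * i + 1] = in[i] >> 4;
    }
}
```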

Status

  • [x] TQ3 quantize/dequantize kernels
  • [x] TQ4 quantize/dequantize kernels
  • [x] Fused dot product kernel
  • [x] GPU vs CPU validation (bit-exact)
  • [x] MSE matches paper
  • [ ] Integration with llama.cpp KV cache
  • [ ] Real model testing

Credits
