[Paper] [Project Page]
- Fast KVzip trains a lightweight gating mechanism for KV cache compression across both prefill and decoding stages.
- Near-lossless performance on general tasks with up to a 70% KV cache eviction ratio while significantly improving attention efficiency.
- A Low-Rank Sink Attention gate architecture, trained by directly distilling importance scores from KVzip in under one H100 hour (an illustrative sketch follows this list).
- NVIDIA KVpress adds support for Fast KVzip (see also Leaderboard).
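The gate module itself is not shown in this README; the sketch below is a rough, hypothetical illustration of how a low-rank gate could score cached KV pairs and evict the lowest-scoring tokens. All class and function names, tensor shapes, the rank, and the eviction logic are assumptions for illustration, not the released Fast KVzip implementation (sink-token handling is omitted).

```python
# Hypothetical sketch of a low-rank KV eviction gate (NOT the released Fast KVzip code).
# A low-rank bottleneck maps each cached key vector to a scalar keep-score; the
# lowest-scoring tokens are then evicted down to a target cache budget.
import torch
import torch.nn as nn


class LowRankKVGate(nn.Module):
    def __init__(self, head_dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(head_dim, rank, bias=False)  # low-rank projection
        self.up = nn.Linear(rank, 1, bias=False)           # scalar importance score

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: [num_kv_heads, seq_len, head_dim] -> scores: [num_kv_heads, seq_len]
        return torch.sigmoid(self.up(self.down(keys))).squeeze(-1)


def evict(keys, values, scores, keep_ratio: float = 0.3):
    """Keep the top `keep_ratio` fraction of tokens per KV head by gate score."""
    seq_len = keys.shape[1]
    k = max(1, int(seq_len * keep_ratio))
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve token order
    gather = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, gather), values.gather(1, gather)
```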
Supported GPUs: NVIDIA Ampere (e.g., A100, RTX 3090) and Hopper (e.g., H100).
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.7.3 --no-build-isolation
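Optionally, before building the kernels in csrc, a quick check (not part of this repository) can confirm that the CUDA-enabled PyTorch wheel and flash-attn import correctly:

```python
# Optional sanity check (not part of the repo): verify the CUDA wheel and flash-attn.
import torch
import flash_attn

print(torch.__version__, torch.version.cuda)  # expect 2.7.0 and a cu128 build
print(torch.cuda.is_available())              # should be True on a supported GPU
print(flash_attn.__version__)                 # expect 2.7.3
```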
cd csrc
make
cd ../prefill
pip install -r requirements.txt

We release trained gates for:
- Qwen/Qwen2.5-{7,14}B-Instruct-1M
- Qwen/Qwen3-{8,14}B
- Qwen/Qwen3-8B-FP8
- Qwen/Qwen3-4B-Instruct-2507
- google/gemma-3-12b-it
Gates for these models will be automatically downloaded via HuggingFace.
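If you want to pre-fetch a gate checkpoint (e.g., for an offline node), something like the snippet below should work via the Hugging Face Hub. The repo_id and filename are placeholders, not the actual Hub locations, which are resolved by the loading code in this repository.

```python
# Hypothetical pre-download; repo_id and filename below are placeholders,
# not the actual Hub paths used by this repository.
from huggingface_hub import hf_hub_download

gate_path = hf_hub_download(
    repo_id="<gate-repo-id>",         # replace with the actual gate repository
    filename="<gate-checkpoint.pt>",  # replace with the actual checkpoint file
)
print(gate_path)  # local cache path
```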
- For other models, you first need to train gates. Please refer to the section Train Gates for New Models in this README.
- For prefill-intensive tasks, please refer to ./prefill.
- For decoding-intensive tasks, please refer to ./math.
source train_gate.sh $model_name

- Results will be saved in the ./result_gate folder.
- After training gates, please correct the file_path in the get_gate_weight function in prefill/attention/gate.py and math/method/load_gate.py (an illustrative sketch follows).
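For illustration only, the edit amounts to repointing file_path at the checkpoint written under ./result_gate; the actual get_gate_weight signature and checkpoint layout in this repository may differ:

```python
# Illustrative only: the real get_gate_weight in prefill/attention/gate.py
# (and math/method/load_gate.py) may differ. The edit is repointing file_path
# at the gate trained by train_gate.sh.
import torch


def get_gate_weight(model_name: str):
    file_path = f"./result_gate/{model_name}/gate.pt"  # hypothetical path layout
    return torch.load(file_path, map_location="cpu")
```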
Our code is built upon the following open-source projects:
@article{kim2026fastkvzip,
title={Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction},
author={Jang-Hyun Kim and Dongyoon Han and Sangdoo Yun},
journal={arXiv preprint arXiv:2601.17668},
year={2026},
}