[Paper] [Project Page]
- Fast KVzip trains a lightweight gating mechanism for KV cache compression across both prefill and decoding stages.
- Near-lossless performance on general tasks with up to a 70% KV cache eviction ratio while significantly improving attention efficiency.
- A Low-Rank Sink Attention gate architecture, trained by directly distilling importance scores from KVzip in under one H100 hour (an illustrative sketch follows this list).
- NVIDIA KVpress adds support for Fast KVzip (see also Leaderboard).
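The gate module itself is not shown in this README; the sketch below is a rough, hypothetical illustration of how a low-rank gate could score cached KV pairs and evict the lowest-scoring tokens. All class and function names, tensor shapes, the rank, and the eviction logic are assumptions for illustration, not the released Fast KVzip implementation (sink-token handling is omitted).

```python
# Hypothetical sketch of a low-rank KV eviction gate (NOT the released Fast KVzip code).
# A low-rank bottleneck maps each cached key vector to a scalar keep-score; the
# lowest-scoring tokens are then evicted down to a target cache budget.
import torch
import torch.nn as nn


class LowRankKVGate(nn.Module):
    def __init__(self, head_dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(head_dim, rank, bias=False)  # low-rank projection
        self.up = nn.Linear(rank, 1, bias=False)           # scalar importance score

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: [num_kv_heads, seq_len, head_dim] -> scores: [num_kv_heads, seq_len]
        return torch.sigmoid(self.up(self.down(keys))).squeeze(-1)


def evict(keys, values, scores, keep_ratio: float = 0.3):
    """Keep the top `keep_ratio` fraction of tokens per KV head by gate score."""
    seq_len = keys.shape[1]
    k = max(1, int(seq_len * keep_ratio))
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve token order
    gather = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, gather), values.gather(1, gather)
```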
Supported GPUs: NVIDIA Ampere (e.g., A100, RTX 3090) and Hopper (e.g., H100).
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.7.3 --no-build-isolation
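Optionally, before building the kernels in csrc, a quick check (not part of this repository) can confirm that the CUDA-enabled PyTorch wheel and flash-attn import correctly:

```python
# Optional sanity check (not part of the repo): verify the CUDA wheel and flash-attn.
import torch
import flash_attn

print(torch.__version__, torch.version.cuda)  # expect 2.7.0 and a cu128 build
print(torch.cuda.is_available())              # should be True on a supported GPU
print(flash_attn.__version__)                 # expect 2.7.3
```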
cd csrc
make
cd ../prefill
pip install -r requirements.txt

We release trained gates for:
- Qwen/Qwen2.5-{7,14}B-Instruct-1M
- Qwen/Qwen3-{8,14}B
- Qwen/Qwen3-8B-FP8
- Qwen/Qwen3-4B-Instruct-2507
- google/gemma-3-12b-it
Gates for these models will be automatically downloaded via HuggingFace.
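If you want to pre-fetch a gate checkpoint (e.g., for an offline node), something like the snippet below should work via the Hugging Face Hub. The repo_id and filename are placeholders, not the actual Hub locations, which are resolved by the loading code in this repository.

```python
# Hypothetical pre-download; repo_id and filename below are placeholders,
# not the actual Hub paths used by this repository.
from huggingface_hub import hf_hub_download

gate_path = hf_hub_download(
    repo_id="<gate-repo-id>",         # replace with the actual gate repository
    filename="<gate-checkpoint.pt>",  # replace with the actual checkpoint file
)
print(gate_path)  # local cache path
```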
- For other models, you first need to train gates. Please refer to the section Train Gates for New Models in this README.
- For prefill-intensive tasks, please refer to ./prefill.
- For decoding-intensive tasks, please refer to ./math.
source train_gate.sh $model_name

- Results will be saved in the ./result_gate folder.
- After training gates, please correct the file_path in the get_gate_weight function in prefill/attention/gate.py and math/method/load_gate.py (an illustrative sketch follows).
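For illustration only, the edit amounts to repointing file_path at the checkpoint written under ./result_gate; the actual get_gate_weight signature and checkpoint layout in this repository may differ:

```python
# Illustrative only: the real get_gate_weight in prefill/attention/gate.py
# (and math/method/load_gate.py) may differ. The edit is repointing file_path
# at the gate trained by train_gate.sh.
import torch


def get_gate_weight(model_name: str):
    file_path = f"./result_gate/{model_name}/gate.pt"  # hypothetical path layout
    return torch.load(file_path, map_location="cpu")
```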
Our code is built upon the following open-source projects:
@article{kim2026fastkvzip,
title={Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction},
author={Jang-Hyun Kim and Dongyoon Han and Sangdoo Yun},
journal={arXiv preprint arXiv:2601.17668},
year={2026},
}