Maintainers - Aaron Zhao and Cheng Zhang
A curated list of LLM quantization papers, based in part on Sudarsharm Sreeram's initial effort.
| Title | Venue | Code | Notes |
|---|---|---|---|
| LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices | arXiv | - | PTQ; a low-rank version of FlexRound with fewer trainable parameters |
| Title | Venue | Code | Notes |
|---|---|---|---|
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | arXiv | GitHub | - |
| ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | arXiv | GitHub | - |
| Title | Venue | Type | Code |
|---|---|---|---|
| A Survey on Model Compression for Large Language Models | - | - | - |
| Name | Notes |
|---|---|
| BitsandBytes | Tim Dettmers, LLM.int8(), Single GPU |
| DeepSpeed | Microsoft, Multi-GPU |
| Flash Attention | Dao-AILab (CS Princeton), Single GPU |
| Megatron-LM | NVIDIA, Multi-GPU |
| MXScaling | Microsoft, MX |
This paper addresses this with a new Dense-and-Sparse quantization method. Dense-and-Sparse splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier entries of the weight matrix. With this approach, larger models can be served with a smaller memory footprint, the same latency, and higher accuracy and quality. For instance, the Squeeze variant of the Vicuna models can be served within 6 GB of memory and reaches 2% higher MMLU than the FP16 baseline, which has a 2x larger memory footprint.
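The Dense-and-Sparse idea can be sketched in a few lines of NumPy. This is an illustrative decomposition under my own assumptions (magnitude-based outlier selection, symmetric uniform quantization), not the paper's exact algorithm: the largest-magnitude entries are pulled into a full-precision sparse matrix, and the dense remainder, now free of outliers, is quantized to low bit-width.

```python
import numpy as np

def dense_and_sparse_quant(W, outlier_frac=0.005, bits=4):
    """Split W into a full-precision sparse outlier part and a heavily
    quantized dense part (illustrative sketch, not the paper's method)."""
    # Threshold separating the outlier_frac largest-magnitude weights.
    thresh = np.quantile(np.abs(W), 1.0 - outlier_frac)
    sparse = np.where(np.abs(W) >= thresh, W, 0.0)  # outliers, kept in FP
    dense = W - sparse                              # remainder, to be quantized
    # Symmetric uniform quantization of the dense part.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(dense).max() / qmax
    q = np.clip(np.round(dense / scale), -qmax - 1, qmax)
    dense_hat = q * scale                           # dequantized dense part
    return dense_hat, sparse

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[3, 5] = 30.0                       # inject one large outlier
dense_hat, sparse = dense_and_sparse_quant(W)
W_hat = dense_hat + sparse           # reconstruction used at inference time
```

Because the outlier is routed to the sparse matrix, the dense part's quantization scale stays small and the per-entry error stays bounded by half a quantization step.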
PTQ; weight-activation; INT4/6/8; uses activation statistics to compute channel-wise zero-points and scale factors, which can be fused into the weights.
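The fusion trick mentioned in the note can be shown concretely. The sketch below is my own illustration, not the paper's exact method: with per-channel zero-point z and scale s computed from activation statistics, the normalized activation (x - z) / s feeds an equivalent linear layer, since y = x @ W + b = ((x - z) / s) @ (diag(s) @ W) + (z @ W + b), so z and s disappear into the fused weight and bias.

```python
import numpy as np

def fuse_channel_stats(W, b, z, s):
    """Fold channel-wise zero-point z and scale s into the next linear layer
    (sketch): scale each input channel's weights and absorb the shift
    into the bias, so the normalized activations produce identical outputs."""
    W_fused = W * s[:, None]   # equivalent to diag(s) @ W
    b_fused = b + z @ W        # zero-point shift absorbed into the bias
    return W_fused, b_fused

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=2.0, size=(8, 16))
W = rng.normal(size=(16, 4))
b = rng.normal(size=4)

z = X.mean(axis=0)             # channel-wise zero-point from statistics
s = X.std(axis=0)              # channel-wise scale from statistics
X_norm = (X - z) / s           # outlier-suppressed activations, easier to quantize

W_f, b_f = fuse_channel_stats(W, b, z, s)
# The fused layer reproduces the original output exactly.
assert np.allclose(X @ W + b, X_norm @ W_f + b_f)
```

The point of the fusion is that the normalization costs nothing at inference time: the quantizer only ever sees the well-behaved X_norm, while W_f and b_f are precomputed offline.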