Skip to content

SamsungLabs/LittleBit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LittleBit: Ultra Low-Bit Quantization
via Latent Factorization

Banseok Lee*, Dongkyu Kim*, Youngcheon You, Youngmin Kim

*Equal Contribution, Corresponding Author

arXiv NeurIPS


📢 Abstract

LittleBit is a novel method for extreme LLM compression, targeting levels like 0.1 bits per weight (BPW). By representing weights in a low-rank form using latent matrix factorization and subsequently binarizing these factors, it achieves nearly 31× memory reduction (e.g., Llama2-13B to under 0.9 GB). To counteract information loss, it integrates a multi-scale compensation mechanism including row, column, and an additional latent dimension learning per-rank importance.


✨ Key Features

🧠 Model Architecture & Support

  • Extreme Compression: Targets 0.1 BPW regime.
  • High Efficiency: 31× memory reduction compared to FP16.
  • Novel Method: Latent Matrix Factorization with Binarization & Multi-scale Compensation.

🏗️ Supported Models

The codebase currently supports the following architectures:

  • OPT
  • Llama (Llama-2, Llama-3)
  • Phi-4
  • Qwen2.5 (QwQ)
  • Gemma 2 & Gemma 3
  • Qwen3

💿 Installation

Set up the environment using Conda and Pip. We recommend using Python 3.12.

conda create -n littlebit python=3.12
conda activate littlebit

# Install CUDA toolkit (adjust version as necessary)
conda install nvidia/label/cuda-12.4.1::cuda-toolkit -c nvidia/label/cuda-12.4.1

# Install PyTorch
pip install torch==2.8.0+cu124 torchvision==0.23.0+cu124 torchaudio==2.8.0+cu124 --index-url https://download.pytorch.org/whl/cu124

# Install dependencies
pip install -r requirements.txt

🚀 Usage

1. Training (QAT)

Train the model using Quantization-Aware Training (QAT) with the LittleBit approach.

Single GPU Example:

CUDA_VISIBLE_DEVICES=0 python -m main \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset c4_wiki \
    --save_dir ./outputs/Llama-2-7b-LittleBit \
    --num_train_epochs 5.0 \
    --per_device_train_batch_size 4 \
    --lr 4e-05 \
    --warmup_ratio 0.02 \
    --report wandb \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --residual True \
    --eff_bit 1.0 \
    --kv_factor 1.0 \
    --min_split_dim 8 \
    --l2l_loss_scale 10.0

Multi-GPU (DeepSpeed) Example:

deepspeed --num_gpus=4 main.py \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset c4_wiki \
    --save_dir ./outputs/Llama-2-7b-LittleBit \
    --ds_config_path configs/zero3.json \
    --num_train_epochs 5.0 \
    --per_device_train_batch_size 4 \
    --lr 4e-05 \
    --report wandb \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --residual True \
    --eff_bit 1.0 \
    --kv_factor 1.0 \
    --min_split_dim 8

2. Evaluation

Evaluate the trained LittleBit model on Perplexity (PPL) tasks and Zero-shot benchmarks. You can evaluate a locally trained model or one hosted directly on the Hugging Face Hub.

Standard Evaluation:

# From a local directory
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --model_id ./outputs/Llama-2-7b-LittleBit \
    --seqlen 2048 \
    --ppl_task wikitext2,c4 \
    --zeroshot_task boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa

# From the Hugging Face Hub
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --model_id username/littlebit-llama-7b-0.1bpw \
    --seqlen 2048 \
    --ppl_task wikitext2

Evaluating Legacy Models (Manual Override): If you are evaluating older models that do not contain the new littlebit_config.json file, you can explicitly provide the quantization parameters via CLI. These arguments will override any saved configurations:

CUDA_VISIBLE_DEVICES=0 python eval.py \
    --model_id ./outputs/Legacy-Llama-2-7b \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --num_expert 4 \
    --split_dim 1024

Note

Parameter Loading Priority: The evaluation script automatically loads quantization parameters in the following order:

  1. Explicit CLI arguments (Highest priority, overrides everything else)
  2. littlebit_config.json in the model directory
  3. config.json (Fallback for older checkpoints)

📝 Citation

If you find this work useful, please cite our paper:

@inproceedings{littlebit,
  title={LittleBit: Ultra Low-Bit Quantization via Latent Factorization},
  author={Lee, Banseok and Kim, Dongkyu and You, Youngcheon and Kim, Youngmin},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
}

⚖️ License

This project is licensed under the CC BY-NC 4.0 license.

About

Official implementation of "LittleBit: Ultra Low-Bit Quantization via Latent Factorization" (NeurIPS 2025)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages