LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Banseok Lee*, Dongkyu Kim*, Youngcheon You, Youngmin Kim†

*Equal Contribution, †Corresponding Author

📢 Abstract

LittleBit is a novel method for extreme LLM compression, targeting levels as low as 0.1 bits per weight (BPW). It represents each weight matrix in low-rank form via latent matrix factorization and then binarizes the resulting factors, achieving nearly 31× memory reduction (e.g., Llama2-13B to under 0.9 GB). To counteract the information loss from binarization, it integrates a multi-scale compensation mechanism that learns scales along the row dimension, the column dimension, and an additional latent dimension, where the latent scale captures per-rank importance.
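
To make the idea concrete, here is a minimal pure-Python sketch of a binarized low-rank layer with three scale vectors. The composition W ≈ diag(h) · sign(U) · diag(l) · sign(V)^T · diag(g), the symbol names (U, V, h, g, l), and the function names are assumptions read off the description above. They are not taken from the repository's code.

```python
# Illustrative sketch only: names and the exact composition are assumptions.

def sign(value):
    """Binarize a real value to +1 or -1 (zero maps to +1)."""
    return 1 if value >= 0 else -1


def binarize(matrix):
    """Apply sign() to every entry of a list-of-lists matrix."""
    return [[sign(v) for v in row] for row in matrix]


def littlebit_forward(x, U, V, h, g, l):
    """Compute y = diag(h) sign(U) diag(l) sign(V)^T diag(g) x.

    x: input vector of length d_in
    U: latent factor of shape d_out x r
    V: latent factor of shape d_in x r
    h: row scales (d_out), g: column scales (d_in), l: latent scales (r)
    """
    Ub, Vb = binarize(U), binarize(V)
    d_out, d_in, r = len(U), len(V), len(l)
    xs = [g[j] * x[j] for j in range(d_in)]             # column scaling
    z = [sum(Vb[j][k] * xs[j] for j in range(d_in))     # project to rank-r space
         for k in range(r)]
    z = [l[k] * z[k] for k in range(r)]                 # per-rank importance
    y = [sum(Ub[i][k] * z[k] for k in range(r))         # expand to d_out
         for i in range(d_out)]
    return [h[i] * y[i] for i in range(d_out)]          # row scaling


def dense_equivalent(U, V, h, g, l):
    """Materialize the effective d_out x d_in weight, for comparison only."""
    Ub, Vb = binarize(U), binarize(V)
    r = len(l)
    return [[h[i] * g[j] * sum(Ub[i][k] * l[k] * Vb[j][k] for k in range(r))
             for j in range(len(V))]
            for i in range(len(U))]


# Tiny example: d_out = 2, d_in = 3, rank r = 2.
U = [[0.3, -0.2], [-0.7, 0.1]]
V = [[0.5, 0.4], [-0.1, 0.9], [0.2, -0.6]]
h, g, l = [2.0, 0.5], [1.0, 2.0, 0.5], [1.0, 0.25]
x = [1.0, 1.0, 2.0]
y = littlebit_forward(x, U, V, h, g, l)
```

In this form the only full-size objects are the two ±1 matrices, which need one bit per entry. The three scale vectors stay in higher precision but are small next to the weight matrix.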

✨ Key Features

🧠 Model Architecture & Support

Extreme Compression: Targets 0.1 BPW regime.
High Efficiency: Nearly 31× memory reduction compared to FP16.
Novel Method: Latent Matrix Factorization with Binarization & Multi-scale Compensation.
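
As a rough sanity check on these figures, the arithmetic below estimates the FP16 footprint of a 13-billion-parameter model and what a 31× reduction would leave. It also shows one possible way to count bits per weight for a single factorized layer. The nominal parameter count of 13e9, the decimal gigabyte, the 16-bit scale vectors, and all function names are assumptions for illustration. The accounting ignores any residual branch and any parts of the model left unquantized, so it is not the repository's formula.

```python
def fp16_size_gb(num_params, bits_per_weight=16):
    """Model size in decimal gigabytes at the given precision."""
    return num_params * bits_per_weight / 8 / 1e9


def compressed_size_gb(num_params, reduction):
    """Size after dividing the FP16 footprint by a reduction factor."""
    return fp16_size_gb(num_params) / reduction


def factorized_bpw(d_out, d_in, rank, scale_bits=16):
    """Assumed accounting for one layer: two binary factors plus three scale vectors."""
    binary_bits = rank * (d_out + d_in)               # sign(U) and sign(V), 1 bit each
    scale_bits_total = scale_bits * (d_out + d_in + rank)
    return (binary_bits + scale_bits_total) / (d_out * d_in)


full_gb = fp16_size_gb(13e9)             # FP16 baseline
small_gb = compressed_size_gb(13e9, 31)  # after a 31x reduction
layer_bpw = factorized_bpw(4096, 4096, rank=200)
```

Under these assumptions the baseline is 26 GB and the reduced size falls below 0.9 GB, consistent with the figure quoted for Llama2-13B.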

🏗️ Supported Models

The codebase currently supports the following architectures:

✅ OPT
✅ Llama (Llama-2, Llama-3)
✅ Phi-4
✅ Qwen2.5 (QwQ)
✅ Gemma 2 & Gemma 3
✅ Qwen3

💿 Installation

Set up the environment using Conda and pip. We recommend Python 3.12.

conda create -n littlebit python=3.12
conda activate littlebit

# Install CUDA toolkit (adjust version as necessary)
conda install nvidia/label/cuda-12.4.1::cuda-toolkit -c nvidia/label/cuda-12.4.1

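# Install PyTorch (the +cuXXX tag must match the toolkit above and a wheel published on the index; adjust as necessary)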
pip install torch==2.8.0+cu124 torchvision==0.23.0+cu124 torchaudio==2.8.0+cu124 --index-url https://download.pytorch.org/whl/cu124

# Install dependencies
pip install -r requirements.txt

🚀 Usage

1. Training (QAT)

Train the model using Quantization-Aware Training (QAT) with the LittleBit approach.

Single GPU Example:

CUDA_VISIBLE_DEVICES=0 python -m main \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset c4_wiki \
    --save_dir ./outputs/Llama-2-7b-LittleBit \
    --num_train_epochs 5.0 \
    --per_device_train_batch_size 4 \
    --lr 4e-05 \
    --warmup_ratio 0.02 \
    --report wandb \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --residual True \
    --eff_bit 1.0 \
    --kv_factor 1.0 \
    --min_split_dim 8 \
    --l2l_loss_scale 10.0

Multi-GPU (DeepSpeed) Example:

deepspeed --num_gpus=4 main.py \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset c4_wiki \
    --save_dir ./outputs/Llama-2-7b-LittleBit \
    --ds_config_path configs/zero3.json \
    --num_train_epochs 5.0 \
    --per_device_train_batch_size 4 \
    --lr 4e-05 \
    --report wandb \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --residual True \
    --eff_bit 1.0 \
    --kv_factor 1.0 \
    --min_split_dim 8
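
Both commands pass `--quant_func SmoothSign`. Quantization-aware training has to send gradients through a sign function whose true derivative is zero almost everywhere. A common remedy is to keep the hard sign in the forward pass and substitute a smooth surrogate gradient in the backward pass. The sketch below illustrates that general idea with a tanh surrogate. The slope `k`, the choice of tanh, and the function names are assumptions, and this is not claimed to be the repository's definition of SmoothSign.

```python
import math


def sign_forward(x):
    """Forward pass: hard binarization to +1.0 or -1.0."""
    return 1.0 if x >= 0 else -1.0


def surrogate_grad(x, k=10.0):
    """Backward pass surrogate: d/dx tanh(k * x) = k * (1 - tanh(k * x) ** 2)."""
    t = math.tanh(k * x)
    return k * (1.0 - t * t)


def sign_backward(x, upstream_grad, k=10.0):
    """Chain rule with the surrogate in place of the (zero) true derivative."""
    return upstream_grad * surrogate_grad(x, k)


# Latent weights near zero receive large gradients; saturated ones receive almost none.
near_zero = sign_backward(0.01, upstream_grad=1.0)
saturated = sign_backward(1.0, upstream_grad=1.0)
```

A larger `k` makes the surrogate a closer match to the hard sign but concentrates the gradient in a narrower band around zero.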

2. Evaluation

Evaluate the trained LittleBit model on perplexity (PPL) tasks and zero-shot benchmarks. You can evaluate a locally trained model or one hosted on the Hugging Face Hub.

Standard Evaluation:

# From a local directory
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --model_id ./outputs/Llama-2-7b-LittleBit \
    --seqlen 2048 \
    --ppl_task wikitext2,c4 \
    --zeroshot_task boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa

# From the Hugging Face Hub
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --model_id username/littlebit-llama-7b-0.1bpw \
    --seqlen 2048 \
    --ppl_task wikitext2

Evaluating Legacy Models (Manual Override): If you are evaluating older checkpoints that do not contain a littlebit_config.json file, provide the quantization parameters explicitly on the command line. These arguments override any saved configuration:

CUDA_VISIBLE_DEVICES=0 python eval.py \
    --model_id ./outputs/Legacy-Llama-2-7b \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --num_expert 4 \
    --split_dim 1024

Note

Parameter Loading Priority: The evaluation script resolves quantization parameters in the following order, highest priority first:

1. Explicit CLI arguments (override everything else)
2. littlebit_config.json in the model directory
3. config.json (fallback for older checkpoints)
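
The priority order above can be sketched as a layered merge in which each higher-priority source overwrites the ones below it. This is one plausible reading, not the actual logic in eval.py. The function name `resolve_quant_params`, the `QUANT_KEYS` tuple, the assumption that JSON keys mirror the CLI flag names, and the placeholder value "OtherFunc" are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical: assumes JSON keys mirror the CLI flag names shown in this README.
QUANT_KEYS = ("quant_func", "quant_mod", "eff_bit", "split_dim")


def resolve_quant_params(model_dir, cli_args):
    """Merge quantization settings; later sources override earlier ones."""
    merged = {}
    # Lowest priority first: config.json, then littlebit_config.json.
    for name in ("config.json", "littlebit_config.json"):
        path = Path(model_dir) / name
        if path.is_file():
            data = json.loads(path.read_text())
            merged.update({k: data[k] for k in QUANT_KEYS if k in data})
    # Highest priority: CLI values that were actually supplied (not None).
    merged.update({k: v for k, v in cli_args.items()
                   if k in QUANT_KEYS and v is not None})
    return merged


def demo():
    """Build a throwaway model directory and resolve its parameters."""
    with tempfile.TemporaryDirectory() as model_dir:
        Path(model_dir, "config.json").write_text(
            json.dumps({"quant_func": "OtherFunc", "split_dim": 512}))
        Path(model_dir, "littlebit_config.json").write_text(
            json.dumps({"quant_func": "SmoothSign"}))
        return resolve_quant_params(
            model_dir, {"split_dim": 1024, "quant_mod": None})
```

In `demo()`, `quant_func` comes from littlebit_config.json because it outranks config.json, `split_dim` comes from the CLI, and the unset `quant_mod` is left out rather than overwriting anything.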

📝 Citation

If you find this work useful, please cite our paper:

@inproceedings{littlebit,
  title={LittleBit: Ultra Low-Bit Quantization via Latent Factorization},
  author={Lee, Banseok and Kim, Dongkyu and You, Youngcheon and Kim, Youngmin},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
}

⚖️ License

This project is licensed under the CC BY-NC 4.0 license.
