Efficient speculative sampling for language models
Speculoos is a Python package that accelerates language model inference using speculative sampling. It uses a smaller draft model to propose multiple tokens at once, which are then verified by the target model. This provides significant speedups while maintaining the exact same output distribution as standard auto-regressive sampling.
Speculative sampling is an inference optimization technique that:
- Uses a fast "draft" model to propose K tokens ahead
- Validates these proposals with the target model in parallel
- Accepts or rejects tokens based on probability matching
- Achieves 2-3x speedup with no quality loss
The key insight is that while a single forward pass through a large model is expensive, verifying multiple tokens in parallel is much faster than generating them one at a time.
# Clone the repository
git clone https://github.com/bijinc/speculoos.git
cd speculoos
# Install in editable mode
pip install -e .
# Or install with development dependencies
pip install -e ".[dev]"pip install speculoosExample code
from speculoos import SpeculativeSampler
# Create a sampler with draft and target models
sampler = SpeculativeSampler(
draft_model_name="gpt2",
target_model_name="gpt2-large",
K=5 # Number of speculative tokens per iteration
)
# Generate text
text = sampler.sample_and_decode("Once upon a time", T=20)
print(text)
# Compare with baseline auto-regressive sampling
baseline_text = sampler.auto_regressive_sample_and_decode("Once upon a time", T=20)
print(baseline_text)from speculoos import SpeculativeSampler
sampler = SpeculativeSampler("distilgpt2", "gpt2-medium")
# Encode input text to token IDs
input_ids = sampler.encode("Once upon a time")
# Generate tokens
output_ids = sampler.sample(input_ids, T=20)
# Decode back to text
text = sampler.decode(output_ids)
print(text)The K parameter controls the number of speculative tokens generated per iteration:
- Higher K (e.g., 8-10): More tokens proposed, but lower acceptance rate
- Lower K (e.g., 3-5): Fewer tokens proposed, but higher acceptance rate
- Optimal K: Typically 4-6 for most model pairs
# Experiment with different K values
for k in [3, 5, 7, 10]:
sampler = SpeculativeSampler("distilgpt2", "gpt2-medium", K=k)
# ... benchmark performanceFor best results:
- Draft model: Should be 2-4x smaller/faster than target model
- Same model family: Use consecutive sizes from same architecture (not distilled versions)
- Compatible tokenizers: Use models with the same tokenizer
Good pairs (high acceptance rate):
gpt2→gpt2-mediumorgpt2-large✅gpt2-medium→gpt2-large✅- Small models in same family (e.g.,
llama-7b→llama-13b) ✅
Avoid (low acceptance rate):
distilgpt2→gpt2-medium❌ (distilled model has different distribution)- Models from different families ❌
See the examples/ directory for complete examples:
- examples/demo.py - Complete working demo with benchmarking
Run the demo:
python examples/demo.pySpeculativeSampler - Main class for speculative sampling.
Methods:
sample(input_ids, T)- Generate T tokens using speculative samplingauto_regressive_sample(input_ids, T)- Generate T tokens using baseline samplingsample_and_decode(text, T)- End-to-end generation from textencode(text)- Convert text to token IDsdecode(token_ids)- Convert token IDs to text
speculative_sampling(...)- Functional API for speculative samplingauto_regressive_sampling(...)- Functional API for baseline sampling
- Python ≥ 3.8
- PyTorch ≥ 2.0.0
- Transformers ≥ 4.30.0
- NumPy < 2.0
MIT License - see LICENSE file for details.
If you use this package in your research, please cite:
@software{speculoos2026,
title = {Speculoos: Efficient Speculative Sampling for Language Models},
author = {Chakraborty, Bijin},
year = {2026},
url = {https://github.com/bijinc/speculoos}
}Based on the paper:
- Accelerating Large Language Model Decoding with Speculative Sampling (Chen et. al, 2023)
Contributions are welcome! Please feel free to submit a Pull Request.
For questions and issues, please open an issue on GitHub.