Advanced Optimizers (AIO)

A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.

🔥 What's New

in 2.2.2

Simplified_AdEMAMix now uses the same LR as AdamW for all beta1 and alpha_grad values!

in 2.1.x

Added Signum (SignSGD with momentum) as a new optimizer in the family: SignSGD_adv.
More info coming soon.

in 2.0.x

Added torch.compile support for all advanced optimizers. Set compiled_optimizer=True to compile and fuse the optimizer step.
Improved the 1-bit factored mode, enabled with nnmf_factor=True.
Various improvements across the optimizers.

in 1.2.x

Added advanced variants of Muon optimizer with features and settings from recent papers.

Optimizer	Description
`Muon_adv`	Advanced Muon implementation with CANS, NorMuon, Low-Rank ortho, etc. features.
`AdaMuon_adv`	Advanced AdaMuon implementation, which combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization.

Documentation coming soon.

Implemented Cautious Weight Decay for all advanced optimizers.
Improved parameter update and weight decay for BF16 with stochastic rounding. The updates are now accumulated in float32 and rounded once at the end.
Use fused and in-place operations whenever possible for all advanced optimizers.
Prodigy variants are now 50% faster by avoiding CUDA syncs. Thanks to @dxqb!
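
The BF16 change above (accumulate in higher precision, round once at the end) can be illustrated with a small pure-Python sketch. This is not the library's code: the function names are hypothetical, Python floats stand in for the float32 accumulator, and inputs are assumed to be finite.

```python
import random
import struct


def _f32_to_bits(x):
    """Bit pattern of x after rounding it to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_to_f32(bits):
    """Float value of a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def stochastic_round_bf16(x, rng):
    """Round a finite value to BF16 precision, up or down at random.

    BF16 keeps the top 16 bits of a float32. Adding uniform noise to the
    16 bits that will be discarded, then truncating, rounds away from zero
    with probability equal to the discarded fraction, so the result is
    unbiased on average.
    """
    bits = _f32_to_bits(x) + rng.getrandbits(16)
    return _bits_to_f32(bits & 0xFFFF0000)


def apply_update_bf16(weight, update, rng):
    """Hypothetical helper: subtract the update in full precision, round once."""
    return stochastic_round_bf16(weight - update, rng)
```

With nearest rounding, an update much smaller than the BF16 spacing would be lost on every step; with stochastic rounding it is applied in expectation.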

📦 Installation

pip install adv_optm

🧠 Core Innovations

This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:

Memory-Efficient Optimization (SMMF-inspired)

Paper: SMMF: Square-Matricized Momentum Factorization
Approach: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
Innovation:
- First moment split into 1-bit sign + absolute value
- Final storage: four factored vectors + one 1-bit sign state
- Preserves Adam-like update quality with drastically reduced memory
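
The cycle above (factor → reconstruct → update → factor) can be sketched for the first moment as follows. This is a minimal illustration in plain Python, assuming the common row-sum/column-sum form of rank-1 factorization for non-negative matrices. The function names are hypothetical and the library's actual implementation may differ.

```python
def factor(matrix):
    """Compress a matrix into row sums, column sums, and a sign mask."""
    signs = [[v >= 0 for v in row] for row in matrix]  # 1 bit per entry
    mags = [[abs(v) for v in row] for row in matrix]
    row_sums = [sum(row) for row in mags]
    col_sums = [sum(col) for col in zip(*mags)]
    return row_sums, col_sums, signs


def reconstruct(row_sums, col_sums, signs):
    """Rebuild a rank-1 approximation of the magnitudes and reapply the signs."""
    total = sum(row_sums)
    if total == 0.0:
        return [[0.0 for _ in col_sums] for _ in row_sums]
    return [
        [(1.0 if s else -1.0) * r * c / total for c, s in zip(col_sums, sign_row)]
        for r, sign_row in zip(row_sums, signs)
    ]


def factored_momentum_step(factors, grad, beta=0.9):
    """One cycle: reconstruct the momentum, apply the EMA update, factor again."""
    momentum = reconstruct(*factors)
    momentum = [
        [beta * m + (1.0 - beta) * g for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(momentum, grad)
    ]
    return factor(momentum)
```

Only the two vectors and the sign mask persist between steps; the full matrix exists only temporarily inside the step.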

⚡ Performance Characteristics

Memory Efficiency (SDXL Model – 6.5GB)

Optimizer	Memory Usage	Description
`Adopt_Factored`	328 MB	4 small vectors + 1-bit state
`Adopt_Factored + AdEMAMix`	625 MB	6 small vectors + two 1-bit states
`Simplified_AdEMAMix`	328 MB	Same as standard factored (no extra state)

Speed Comparison (SDXL, Batch Size 4)

Optimizer	Speed	Notes
`Adafactor`	~8.5s/it	Baseline
`Adopt_Factored`	~10s/it	+18% overhead from compression
`Adopt_Factored + AdEMAMix`	~12s/it	+41% overhead (3 factored states)

🧪 Available Optimizers

Standard Optimizers (All support `factored=True/False`)

Optimizer	Description	Best For
`Adam_Adv`	Advanced Adam implementation	General purpose
`Adopt_Adv`	Adam-variant with independent beta2	Stable training for small batch size regimes
`Prodigy_Adv`	Prodigy with D-Adaptation	Adam with automatic LR tuning
`Simplified_AdEMAMix`	Adam variant with accumulator momentum	Small/large batch training when tuned correctly
`Lion_Adv`	Advanced Lion implementation	Memory-constrained environments
`Prodigy_Lion_Adv`	Prodigy + Lion combination	Lion with automatic LR tuning

⚙️ Feature Matrix

Feature	Adam_Adv	Adopt_Adv	Prodigy_Adv	Simplified_AdEMAMix	Lion_Adv
Factored	✓	✓	✓	✓	✓
AdEMAMix	✓	✓	✓	✗	✗
Simplified_AdEMAMix	✗	✓	✓	✓	✗
OrthoGrad	✓	✓	✓	✓	✓
Grams	✓	✓	✓	✗	✗
Cautious	✓	✓	✓	✗	✓
atan2	✓	✓	✓	✗	✗
Stochastic Rounding	✓	✓	✓	✓	✓
Fused Backward Pass	✓	✓	✓	✓	✓
Kourkoutas-β	✓	✓	✓	✓	✗

🛠️ Comprehensive Feature Guide

A. Universal Safe Features

These features work with all optimizers and are generally safe to enable.

Feature	Description	Recommended Usage	Performance Impact	Theoretical Basis	Compatibility
Fused Backward Pass	Fuses the optimizer step into the backward pass: each gradient is applied as soon as it is computed, then freed	Memory-constrained environments	Reduces peak memory	Memory optimization	All optimizers
Stochastic Rounding	Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16	BF16 training	Minimal overhead (<5%)	Revisiting BFloat16 Training	All optimizers
OrthoGrad	Removes gradient component parallel to weights to reduce overfitting	Full fine-tuning without weight decay	+33% time overhead (BS=4); less at larger BS	Grokking at Edge	All optimizers
Factored	Memory-efficient optimization via rank-1 1-bit factorization of optimizer states	Large models / memory-limited hardware	Adds compression overhead	SMMF	All optimizers
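
As an illustration of the OrthoGrad row above, here is a minimal sketch of removing the gradient component parallel to the weights. The function name is hypothetical, and the optional rescaling back to the original gradient norm is an assumption, not something this README specifies.

```python
import math


def orthograd(weights, grads, rescale=True, eps=1e-30):
    """Project grads onto the subspace orthogonal to weights (flat lists)."""
    dot_wg = sum(w * g for w, g in zip(weights, grads))
    dot_ww = sum(w * w for w in weights)
    scale = dot_wg / (dot_ww + eps)
    # Subtract the component of the gradient that points along the weights.
    ortho = [g - scale * w for w, g in zip(weights, grads)]
    if not rescale:
        return ortho
    # Assumption: restore the original gradient norm after projection.
    g_norm = math.sqrt(sum(g * g for g in grads))
    o_norm = math.sqrt(sum(o * o for o in ortho))
    return [o * g_norm / (o_norm + eps) for o in ortho]
```

The projected gradient has no component along the weight vector, so to first order the step does not change the weight norm.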

B. Individual Features

Feature	Description	Recommended Usage	Performance Impact	Theoretical Basis	Compatibility
Cautious	Only applies update if gradient direction aligns with momentum direction	Accelerating convergence	No overhead	C-Optim	Adam/Adopt/Prodigy/Lion
Grams	Update direction derived purely from current gradient	When Cautious is insufficient	No overhead	Grams	Adam/Adopt/Prodigy
AdEMAMix	Dual EMA system that retains relevance of gradients over tens of thousands of steps	Long training runs, especially where model forgetting is a concern	+1 state memory	AdEMAMix	Adam/Adopt/Prodigy
Simplified_AdEMAMix	Accumulator-based momentum, single EMA variant of AdEMAMix	All scenarios when tuned correctly	No overhead	Connections	Adam/Adopt/Prodigy
atan2	Robust epsilon replacement with built-in gradient clipping	Stable, bounded updates; especially recommended for Adopt, which needs the clipping	No overhead	Adam-atan2	Adam/Adopt/Prodigy
Kourkoutas-β	Layer-wise adaptive β₂ based on gradient “sunspike” ratio	Noisy/small/large-batch/high-LR training	No overhead	Kourkoutas-β	Adam/Adopt/Prodigy/Simplified_AdEMAMix

Note: If both Cautious and Grams are enabled, Grams takes precedence and Cautious is disabled.
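
The Cautious and Grams rows above can be sketched element-wise as follows. This is a simplified illustration with hypothetical function names. It omits any rescaling of the masked update, and the library's exact formulation may differ.

```python
def cautious(update, grad):
    """Keep an update element only where its sign agrees with the gradient."""
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]


def grams(update, grad):
    """Keep the update magnitude but take the direction from the gradient."""

    def sign(x):
        return (x > 0) - (x < 0)

    return [sign(g) * abs(u) for u, g in zip(update, grad)]
```

Cautious zeroes the elements where momentum and gradient disagree, while Grams flips them to follow the gradient.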

🔍 Feature Deep Dives

AdEMAMix

Adds a slow-decaying second EMA (beta3) that retains gradient memory over tens of thousands of steps.
Particularly effective for small batch sizes, where Adam’s standard first moment is nearly useless.

Tunable Hyperparameters

Parameter	Default	Tuning Guide
`beta3`	0.9999	• Runs >120k steps: 0.9999 • Runs ≤120k steps: 0.999
`alpha`	5	• Reduce to 2–3 if diverging • Increase to strengthen long-term memory

✅ Pro Tip: Set beta1=0 in Adam/Adopt/Prodigy to skip standard EMA entirely and rely solely on AdEMAMix’s slow EMA, ideal for small-batch regimes.
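
A scalar sketch of the dual-EMA update described above, following the general form in the AdEMAMix paper. The function and state names are hypothetical, and the schedulers that the paper applies to alpha and beta3 are omitted.

```python
import math


def new_state():
    return {"t": 0, "m_fast": 0.0, "m_slow": 0.0, "v": 0.0}


def ademamix_step(state, grad, beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8):
    """Return the update direction for one scalar parameter."""
    state["t"] += 1
    t = state["t"]
    # Fast EMA: the usual Adam first moment.
    state["m_fast"] = beta1 * state["m_fast"] + (1.0 - beta1) * grad
    # Slow EMA: decays with beta3, so it remembers gradients far longer.
    state["m_slow"] = beta3 * state["m_slow"] + (1.0 - beta3) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    m_fast_hat = state["m_fast"] / (1.0 - beta1 ** t)
    v_hat = state["v"] / (1.0 - beta2 ** t)
    return (m_fast_hat + alpha * state["m_slow"]) / (math.sqrt(v_hat) + eps)
```

In this sketch, beta1=0 makes the fast term equal to the current gradient, so all longer-term memory comes from the slow EMA.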

Simplified_AdEMAMix

Introduced in Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431).
Replaces Adam’s first moment with a theory-based momentum that emphasizes the raw gradient, combining the stability of long memory with responsiveness to recent gradients.
Key insight: Classical momentum does not accelerate in noisy (small-batch) regimes; this accumulator does.

Tunable Hyperparameters

Parameter	Default	Tuning Guide
`beta1`	0.99	Controls accumulator memory length: • Small BS: 0.99–0.9999 • Large BS: 0.9
`Grad α` (`alpha_grad`)	100	Most critical parameter: • Inversely scales with batch size • 10–100 for small BS (≤32) • 0.1–1 for large BS (≥512)
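
A scalar sketch of the accumulator update, based on my reading of the referenced paper. The names are hypothetical, and the learning-rate normalization mentioned in the 2.2.2 notes is not modeled because this README does not describe it.

```python
import math


def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}


def simplified_ademamix_step(state, grad, beta1=0.99, beta2=0.999,
                             alpha_grad=100.0, eps=1e-8):
    """Return the update direction for one scalar parameter."""
    state["t"] += 1
    # Accumulator momentum: no (1 - beta1) factor, so gradients add up.
    state["m"] = beta1 * state["m"] + grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    # alpha_grad weights the raw gradient against the accumulated momentum.
    return (alpha_grad * grad + state["m"]) / (math.sqrt(v_hat) + eps)
```

Because the accumulator can grow to roughly 1 / (1 - beta1) times the gradient scale, alpha_grad decides how much the current gradient still matters relative to history.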

atan2

Replaces eps in Adam-family optimizers with a scale-invariant, bounded update rule.
Automatically clips updates to [-2, 2], preventing destabilizing jumps.
Highly recommended for Adopt_Adv, which is prone to instability without clipping.
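
A minimal sketch of the bounded update rule, assuming the form a * atan2(m, b * sqrt(v)). The constants a = 1.27 and b = 1 are assumptions taken from the linked reference code, not confirmed defaults of this library.

```python
import math


def atan2_update(m_hat, v_hat, a=1.27, b=1.0):
    """Bounded replacement for m_hat / (sqrt(v_hat) + eps)."""
    # atan2 stays within (-pi/2, pi/2) for a non-negative second argument,
    # so the result is bounded by a * pi / 2 (just under 2 for a = 1.27),
    # and it is well defined even when v_hat is zero.
    return a * math.atan2(m_hat, b * math.sqrt(v_hat))
```

Scaling the gradient by a constant c scales m_hat by c and v_hat by c², which leaves the result unchanged; this is what scale invariance means here.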

📚 Reference:

Paper: https://arxiv.org/abs/2407.05872

Code: https://github.com/lucidrains/adam-atan2-pytorch

Kourkoutas-β

Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv, and Simplified_AdEMAMix.

Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:

During gradient bursts → β₂ ↓ toward the lower β₂ bound → faster reaction
During calm phases → β₂ ↑ toward the selected β₂ → stronger smoothing

This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
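
The β₂ modulation described above can be sketched per layer as follows. This is a hedged illustration based on my reading of the Kourkoutas-β paper. The function name, the default bounds, and the EMA coefficient are assumptions, not this library's defaults.

```python
def kourkoutas_beta2(grad_norm, norm_ema, beta2_max=0.999, beta2_min=0.88,
                     ema_alpha=0.93, tiny=1e-9):
    """Return (beta2, updated norm EMA) for one layer."""
    # Track a running average of this layer's gradient norm.
    norm_ema = ema_alpha * norm_ema + (1.0 - ema_alpha) * grad_norm
    # Sunspike ratio: current norm relative to its running average.
    raw = grad_norm / (norm_ema + tiny)
    sunspike = raw / (1.0 + raw)  # squashed into [0, 1)
    # Bursts push beta2 toward beta2_min; calm phases keep it near beta2_max.
    beta2 = beta2_max - (beta2_max - beta2_min) * sunspike
    return beta2, norm_ema
```

The returned β₂ would then replace the fixed value in that layer's second-moment update for the current step.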

Pros/Cons

Category	Details
✅ Pros	• Layer-wise adaptation blends benefits of high β₂ (strong smoothing) and low β₂ (fast reaction). • Robust to sudden loss landscape shifts, reacts quickly during gradient bursts, smooths during calm phases. • High tolerance to aggressive learning rates.
⚠️ Cons	• Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using `K-β Warmup Steps`.

💡 Best Practice: Set K_warmup_steps equal to your standard LR warmup steps. During warmup, the optimizer uses the static beta2; adaptation begins only after warmup ends.

📚 Reference:

Paper: Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair

Code: kbeta
