[advoptm] New Features: Scaled Optimizers, Centered WD, Factored 2nd Moment & More #1344
Koratahiu wants to merge 25 commits into Nerogar:master
Conversation
Update 1: Improved K-B for LoRA/OFT
K-B now calculates its adaptive statistics per rank/block, ensuring every rank/block in LoRA/OFT has its own adaptive state.

Improved OrthoGrad for LoRA/OFT
OrthoGrad has been optimized for LoRA/OFT: the projection is now computed per rank/block. This results in more accurate orthogonalization and ensures that if a specific block or rank is noisy, it will not negatively impact the orthogonalization of others.
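A minimal sketch of what per-rank/block orthogonalization looks like (function names are hypothetical; the projection shown is the standard OrthoGrad one, removing the gradient component parallel to the weights, applied independently per block):

```python
def orthograd_block(w, g, eps=1e-30):
    """Project gradient g orthogonal to weight w for ONE block/rank:
    g_orth = g - (<w, g> / <w, w>) * w.
    Running this per block means a noisy block cannot perturb
    the projection of the others."""
    dot_wg = sum(wi * gi for wi, gi in zip(w, g))
    dot_ww = sum(wi * wi for wi in w) + eps
    coef = dot_wg / dot_ww
    return [gi - coef * wi for wi, gi in zip(w, g)]

def orthograd_per_block(w_blocks, g_blocks):
    # Apply the projection independently to every LoRA rank / OFT block.
    return [orthograd_block(w, g) for w, g in zip(w_blocks, g_blocks)]
```

After the projection, each block's gradient is orthogonal to that block's weights, so a large-norm or noisy block only affects its own coefficient.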
Update 2: Added Fisher Weight Decay (Natural Weight Decay)
The Fisher Information Matrix (FIM) tells us how sensitive the model's output is to changes in its parameters. The diagonal of this matrix, diag(F), represents the individual sensitivity of each weight, and Adam's second moment can be interpreted as an approximation of diag(F), giving the importance of each parameter. Standard weight decay applies a uniform penalty; Fisher WD is a form of adaptive weight decay derived from Adam's second moment, applying a per-parameter penalty instead. How it behaves:
Scale-Invariant WD
By multiplying by the (inverse) square root of the Fisher, the regularization becomes scale-invariant: the penalty is proportional to how much the weight actually matters to the loss, rather than to how large its raw numerical value is.
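A minimal sketch of the idea, assuming Adam's `exp_avg_sq` as the diagonal Fisher estimate; the mean-normalization and the function name are illustrative assumptions, not the PR's exact code:

```python
import math

def fisher_wd_step(w, exp_avg_sq, lr, wd, eps=1e-8):
    """Per-parameter weight decay scaled by sqrt(diag(F)) ~ sqrt(v).
    Normalizing by the mean importance keeps the overall decay strength
    comparable to plain decoupled WD (illustrative assumption)."""
    importance = [math.sqrt(v) for v in exp_avg_sq]
    mean_imp = sum(importance) / len(importance) + eps
    # Weights the loss is sensitive to receive a proportionally stronger pull.
    return [wi * (1.0 - lr * wd * (imp / mean_imp))
            for wi, imp in zip(w, importance)]
```

With this scaling, a parameter with zero observed gradient variance is not decayed at all, while high-importance parameters are regularized more strongly.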
Gemini 3.1 Pro inspection of the Scaled Optimizer logic for all adapters (pretty neat): based on the provided codebase, the optimizer effectively decouples the learning rate from the architecture's geometry (similar to muP / Maximal Update Parametrization scaling). The inspection gives a mathematical breakdown of the exact update variance for each module, covering an update-complexity analysis (assuming a standard initialization of the input vector) and per-adapter terms such as the DoRA scale (magnitude vector).

This PR introduces several powerful improvements to the advanced optimizers, including:
(Continuation of [advoptm] Spectral Normalization for Muon Variants #1263, but for all advanced optimizers!)
Scaled Optimizer
Automatically scales the LR/WD to transfer seamlessly across all training methods. You only need to tune once, and your hyperparameters will transfer optimally across all LoRA ranks, Full Finetuning, etc.
We already achieved this with spectral Muon (#1263), but this PR extends the technique to all advanced optimizers!
Compatible with: Full finetuning, LoRA, DoRA, and OFT.
Set `alpha=rank` for LoRA/DoRA.
This decouples the WD from the LR.
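A minimal sketch of the kind of geometry-aware factor involved, assuming the spectral/muP convention of scaling a matrix update by sqrt(d_out/d_in), with LoRA's alpha/rank folded in (the PR's exact factors may differ):

```python
import math

def scaled_update_factor(d_out, d_in, rank=None, alpha=None):
    """Spectral-style LR scale: keeps the induced change in a layer's
    output roughly independent of the layer's shape, so one tuned LR
    transfers across full finetuning and LoRA ranks.
    Illustrative sketch, not the PR's exact code."""
    scale = math.sqrt(d_out / d_in)
    if rank is not None and alpha is not None:
        # With alpha = rank (as recommended above) this factor is 1,
        # leaving only the geometric sqrt(d_out/d_in) term.
        scale *= alpha / rank
    return scale
```

Because the factor depends only on layer geometry (and alpha/rank cancels when alpha=rank), a learning rate tuned at one rank or on full weights transfers to the others.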
Centered Weight Decay
For small-to-medium scale training, we want to learn new concepts while preserving the model's original knowledge. However, standard WD (for full finetuning & DoRA) is often useless or harmful here, as it pulls weights down to zero, leading to forgetting and destructive behavior.
Centered WD mitigates this by pulling the weight down to its original state. This forces the optimizer to learn a smooth representation of the dataset while preserving the original model knowledge.
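Centered WD is just decoupled weight decay with the anchor moved from zero to the pretrained weights; a minimal sketch (names hypothetical):

```python
def centered_wd_step(w, w_orig, lr, wd):
    """Decoupled weight decay toward the pretrained weights w_orig
    instead of toward zero: w <- w - lr * wd * (w - w_orig).
    The penalty vanishes exactly at the original model, so the
    regularizer resists drift rather than erasing knowledge."""
    return [wi - lr * wd * (wi - oi) for wi, oi in zip(w, w_orig)]
```

Standard WD is recovered by setting `w_orig` to zeros; with a nonzero anchor, un-updated weights stay where the base model left them.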
Compatible with: `full`, `fp8`, `int8`, and `int4`.

Signed Optimizers
This PR introduces two main improvements to signed optimizers (`SignSGD_adv` and `Lion_adv`):

1. Freeze-on-Flip (Projected Variant)
The sign operation is discontinuous; it flips (+/- 1) randomly and rapidly near zero, leading to unstable training. This instability is especially visible near convergence, where the optimizer has to slow down to reach the optimal solution (this is also why signed optimizers typically need decaying schedulers).
Freeze-on-Flip is a simple method that stores the sign of the previous step (skipped in factored mode; stored as `uint8` in standard mode) and freezes any components of the current update whose sign has flipped. Over successive steps this simulates an update of 0 for oscillating components and makes the signed update semi-continuous, leading to better convergence and much more stable training.

2. Projected Adaptivity
We can make signed optimizers more curvature-aware and adaptive by scaling the LR by the L1 norm, utilizing the dual norm of their geometry ($L_\infty \rightarrow L_1$).
This makes the optimizer semi-adaptive to changes in gradients.
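Putting the two mechanisms together, a minimal per-step sketch of a SignSGD-style update (the state layout and the L1-per-dimension scaling are illustrative assumptions, not the PR's exact code):

```python
def signed_step(w, g, prev_sign, lr):
    """One signed step with both improvements:
    - Projected adaptivity: LR scaled by mean |g| (L1 norm / dim),
      the dual norm of the sign update's L_inf geometry.
    - Freeze-on-Flip: components whose sign flipped since the previous
      step are frozen (treated as 0) to damp oscillation near zero.
    Returns the new weights and the signs to store for the next step."""
    dim = len(g)
    scale = sum(abs(gi) for gi in g) / dim        # L1 norm per dimension
    sign = [(gi > 0) - (gi < 0) for gi in g]      # elementwise sign in {-1, 0, 1}
    new_w = []
    for wi, si, pi in zip(w, sign, prev_sign):
        flipped = pi != 0 and si != 0 and si != pi
        step = 0 if flipped else si               # freeze on a sign flip
        new_w.append(wi - lr * scale * step)
    return new_w, sign
```

As gradients shrink near convergence, the L1 scale shrinks the effective step, while freeze-on-flip zeroes out components that would otherwise oscillate around the optimum.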