[advoptm] Spectral Normalization for Muon Variants#1263
Koratahiu wants to merge 11 commits into Nerogar:master
Conversation
Update 1: I have made the weight decay for this method static and "decoupled" from the learning rate.

Update 2:
I tested this, and it works very well. I used the same hyperparameters for LoRA/finetuning across SDXL, Chroma, and Zib; training succeeded for all of them and delivered very solid results. I've found the baseline to be:
From there, you can adjust as needed (e.g., a higher batch size requires a larger LR, BF16 needs a higher LR, etc.). The weight decay is constant and differs from standard implementations, meaning it maintains the same effect regardless of whether the LR is high or low. The formula for it is:

❗ Note that this method does not work with DoRA, as DoRA has its own scaling, which conflicts with this approach. It also behaves unpredictably with OFT (not sure; it trains at 0.1 LR).

❕ For full finetuning, 1D vectors are trained using AuxAdam, so you should use standard AdamW LR and weight decay settings for those.
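For illustration, here is a minimal sketch of what "static and decoupled from the learning rate" means for the weight decay, compared with AdamW-style decoupled decay (the function name and update form are mine, not the PR's code):

```python
def apply_static_weight_decay(param, update, lr, weight_decay):
    """Apply an LR-independent ("static") weight decay.

    In standard decoupled AdamW the decay strength still shrinks with
    the learning rate (p -= lr * wd * p). Here the decay is applied at
    a constant strength, so its regularizing effect is the same whether
    the LR is high or low.
    """
    param = param - weight_decay * param  # static decay: no lr factor
    param = param - lr * update           # the optimizer's update step
    return param
```

With this form, lowering the LR changes only the update step, not how strongly weights are pulled toward zero.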
More Helpful Notes for LoRA Using This Method

1) Rank-Invariant Updates

It is interesting to note that using this method for LoRA completely cancels out the rank effect (assuming alpha = rank). In this scenario, rank cancels out, allowing us to apply the full-finetuning scaling rule:

This achieves the same learning rate as full finetuning and results in rank-invariant updates.

2) Addressing the LoRA A-Matrix

Muon appears to be sub-optimal for LoRA because the A matrix is often extremely "flat" or exhibits extreme dimensions, leading to unstable or "garbage" orthogonalization. Mathematically, we achieve:
These are my own findings, as the original paper did not experiment with LoRAs. Nonetheless, these results are very promising for LoRA/Muon combinations.
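Since the scaling formulas themselves are elided above, here is a hedged sketch of how the rank cancellation can be seen, assuming the spectral-µP style rule of scaling a layer's update by sqrt(fan_out / fan_in); this is a common choice for Muon-style orthogonalized updates, not necessarily the exact formula in this PR, and the function names are mine:

```python
import math

def spectral_update_scale(d_out: int, d_in: int) -> float:
    """One common spectral scaling factor for matrix updates: it keeps
    the update's operator norm roughly shape-independent, which is what
    lets a single LR transfer across layer widths."""
    return math.sqrt(d_out / d_in)

def lora_composed_scale(d_out: int, d_in: int, r: int) -> float:
    """For a LoRA pair B (d_out x r) and A (r x d_in) with alpha == rank,
    applying the spectral rule to each factor makes the rank-dependent
    terms cancel in the composed update B @ A."""
    scale_B = math.sqrt(d_out / r)   # B is a (d_out x r) matrix
    scale_A = math.sqrt(r / d_in)    # A is a (r x d_in) matrix
    return scale_B * scale_A         # == sqrt(d_out / d_in), rank-free
```

The product equals the full-finetuning factor for a (d_out x d_in) weight, illustrating why (under this rule) the same LR works at any rank.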
Is this superseded by #1344?
@dxqb |
```python
joined_patterns = "|".join([re.escape(p) for p in default_patterns])
pattern = re.compile(rf'(?:^|\.)(?:{joined_patterns})\.\d+$')
```
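For context, the filter in the diff above behaves like this; the example contents of `default_patterns` and the layer names are hypothetical, since the real list is defined elsewhere in the PR:

```python
import re

# Hypothetical example values; the PR defines default_patterns elsewhere.
default_patterns = ["transformer_blocks", "resnets"]
joined_patterns = "|".join([re.escape(p) for p in default_patterns])
pattern = re.compile(rf'(?:^|\.)(?:{joined_patterns})\.\d+$')

# Matches module paths that END in a known block name plus an index:
assert pattern.search("down_blocks.0.resnets.1")
assert pattern.search("transformer_blocks.12")
# Does not match without a trailing numeric index, or with a suffix:
assert not pattern.search("resnets")
assert not pattern.search("resnets.1.conv")
```

Note that the `$` anchor restricts matches to the block module itself, not its child parameters.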
|
|
```python
layer_counts = {}
```
This function always returns `{}`: `layer_counts` is never modified. It doesn't seem to have side effects either. What is it supposed to do?

It appears that it's supposed to count the number of trained layers, I guess for scaling later in the optimizer. But why does it have its own regex layer filter? Shouldn't the count depend on which layers the user is actually training (via the layer filter on the training tab)?
> This function always returns `{}`: `layer_counts` is never modified. It doesn't seem to have side effects either. What is it supposed to do?
Fixed, it was deleted accidentally.
> It appears that it's supposed to count the number of trained layers, I guess for scaling later in the optimizer. But why does it have its own regex layer filter? Shouldn't the count depend on what layers the user is actually training (via the layer filter on the training tab)?
It calculates the model depth (the number of residual layers). For SDXL, this consists of `transformer_blocks` and `resnets`; for transformer models, it includes only `transformer_blocks` (or their equivalent names).
I think we have two additional options:
- Create a new utility specifically to calculate depth (the same logic as this)
- Hardcode the integer values (e.g., SDXL = 48).
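A sketch of what the first option (a dedicated depth utility) could look like, counting distinct residual-block indices in the parameter names; the function name, pattern list, and example names are my own illustration, not the PR's code:

```python
import re

def count_residual_depth(param_names, patterns=("transformer_blocks", "resnets")):
    """Count the model depth as the number of distinct residual blocks,
    identified by a known block name followed by a numeric index."""
    joined = "|".join(re.escape(p) for p in patterns)
    block_re = re.compile(rf'(?:^|\.)((?:{joined})\.\d+)(?:\.|$)')
    blocks = set()
    for name in param_names:
        m = block_re.search(name)
        if m:
            blocks.add(m.group(1))  # e.g. "transformer_blocks.0"
    return len(blocks)

names = [
    "transformer_blocks.0.attn.q.weight",
    "transformer_blocks.0.attn.k.weight",   # same block, counted once
    "transformer_blocks.1.attn.q.weight",
    "down_blocks.0.resnets.0.conv.weight",
]
assert count_residual_depth(names) == 3  # two transformer blocks + one resnet
```

Counting unique block indices (rather than matching parameters) avoids the bug flagged above, where the dict was never populated.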
You may ask why we need the depth. To achieve scale-invariance in the optimizer, we must utilize the depth as follows:
- For Muon: It is inserted as a damping factor for orthogonalization (eps).
- For Adam: It is inserted as a damping factor for the second moment (eps).
This ensures that the damping factor grows as the model grows. For example, with Klein 8B and Klein 4B, these scalings allow us to use the same hyperparameters for both models.
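As a hedged sketch of the Adam case described above: the exact form of the depth scaling is not shown in this thread, so `base_eps * depth` is my assumption for "a damping factor that grows as the model grows", and the function is illustrative only:

```python
import math

def adamlike_step(param, grad, m, v, lr, depth,
                  beta1=0.9, beta2=0.999, base_eps=1e-8):
    """One scalar Adam-style step with a depth-scaled damping term.

    ASSUMPTION: eps is scaled linearly with model depth; the thread only
    says depth is inserted as a damping factor for the second moment.
    """
    eps = base_eps * depth                       # depth-scaled damping
    m = beta1 * m + (1 - beta1) * grad           # first moment
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment
    param = param - lr * m / (math.sqrt(v) + eps)
    return param, m, v
```

Under this reading, an 8B model (more residual layers) automatically gets a larger eps than a 4B model, so the same LR and weight decay can be reused across both.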
TL;DR: Tune Once, Train Anywhere
This PR implements the spectral normalization/scaling proposed in Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales (NeurIPS 2025) for `Muon_adv` and `AdaMuon_adv`. This method allows you to tune hyperparameters (LR, weight decay) just once, and they will transfer to any model size.
Important Notes:

- `alpha=rank` to disable the internal scaling!

Other Notes:
More Info:
Koratahiu/Advanced_Optimizers#14