Fix and Enhance LoRA-Muon Setup: Orthogonalize B, Adam A #1314
Closed
Koratahiu wants to merge 1 commit into Nerogar:master from
Conversation
Collaborator
Even if this is clearly useful, I'm not sure this should be hardcoded. When a user selects the Muon optimizer, I think they'd expect OneTrainer to use the Muon optimizer as if they had downloaded it themselves and used it as the authors recommend.
Contributor
Author
Even though this is closed;
Muon wasn't originally designed for LoRAs, and as a result, the existing documentation and implementations are quite lacking.
In my initial implementation of Muon (within OT), I applied Muon to all LoRA layers based on the hidden-layer mapping. However, this is suboptimal: Muon should only be applied to the B matrix. Because the A matrix typically has an extreme aspect ratio, Muon's orthogonalization process tends to produce noise (garbage updates) rather than meaningful ones.
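For context, the orthogonalization step in question is the quintic Newton-Schulz iteration from the reference Muon implementation (the coefficients below are the ones published by its author). This is a hedged sketch, not OneTrainer's actual code; it only illustrates the update Muon applies to a gradient matrix:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G: push its singular values toward 1
    while keeping its singular vectors (quintic Newton-Schulz iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon implementation
    X = G.float()
    X = X / (X.norm() + eps)  # scale so the spectral norm is <= 1 before iterating
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T  # iterate on the short side for a smaller X @ X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # X <- a*X + b*(X X^T)X + c*(X X^T)^2 X
    if transposed:
        X = X.T
    return X
```

For a near-square B matrix this is well behaved, but for an A matrix of shape, say, rank x in_features (e.g. 16 x 4096), almost all singular directions are forced toward unit scale regardless of how little signal they carry, which is one way to see why the update degrades into noise.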
This PR addresses this by applying Muon's orthogonalized update only to the B matrix, and optimizing the A matrix with Adam.
This setup makes LoRA significantly more robust; theoretically, we can achieve DoRA-like effects on standard LoRA using this configuration.
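The split itself amounts to routing parameters into two optimizer groups. A minimal sketch, assuming the common `lora_A`/`lora_B` naming convention (an assumption; OneTrainer's actual parameter naming and optimizer plumbing may differ):

```python
import torch

def split_lora_params(named_params):
    """Route LoRA B matrices to Muon and everything else (including A) to Adam.

    `named_params` is an iterable of (name, parameter) pairs; the "lora_B"
    substring check is a hypothetical convention for this sketch.
    """
    muon_params, adam_params = [], []
    for name, p in named_params:
        if not p.requires_grad:
            continue
        if "lora_B" in name:
            muon_params.append(p)   # near-square, safe to orthogonalize
        else:
            adam_params.append(p)   # extreme aspect ratio: keep on Adam
    return muon_params, adam_params
```

The Adam group can then be constructed as an ordinary `torch.optim.AdamW` over `adam_params`, while `muon_params` go to the Muon step.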
Technical Context
I identified this issue while testing #1263 (comment). After applying the scaling to both the A and B matrices, I observed that spectral normalization severely cripples the orthogonalization of the A matrix when using Muon.
This approach is supported by recent research (see: arXiv:2508.17901), which proposes orthogonalizing the B matrix to achieve superior results. The paper notes that if the B matrix contains redundant (correlated) columns, the "effective rank" of the update falls below the actual rank. Orthogonalization resolves this by ensuring the update remains full-rank and efficient.
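The effective-rank point can be made concrete with a small, self-contained illustration (not from the paper or this PR; `effective_rank` below is a crude SVD-threshold proxy I'm introducing for the example):

```python
import torch

torch.manual_seed(0)
rank, d_out = 8, 64

# B whose 8 columns are near-copies of one direction: nominal rank 8,
# but almost all of the update energy lives in a single direction.
base = torch.randn(d_out, 1)
B_redundant = base.repeat(1, rank) + 0.01 * torch.randn(d_out, rank)

# An orthonormal B of the same shape (via QR of a random matrix).
B_ortho, _ = torch.linalg.qr(torch.randn(d_out, rank))

def effective_rank(M: torch.Tensor, rel_tol: float = 0.1) -> int:
    """Count singular values above rel_tol * (largest singular value)."""
    s = torch.linalg.svdvals(M)
    return int((s > rel_tol * s.max()).sum())

print(effective_rank(B_redundant))  # 1  (correlated columns collapse the rank)
print(effective_rank(B_ortho))      # 8  (all directions contribute)
```

Orthogonalizing B is what moves it from the first regime to the second, which is exactly the "update remains full-rank" claim above.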
Potential Benefits
❕ Tests, comparisons, and feedback are always welcome.