
Fix and Enhance LoRA-Muon Setup: Orthogonalize B, Adam A #1314

Closed
Koratahiu wants to merge 1 commit into Nerogar:master from Koratahiu:lora_muon

Conversation

@Koratahiu
Contributor

Muon wasn't originally designed for LoRAs, and as a result, the existing documentation and implementations are quite lacking.

In my initial implementation of Muon (within OT), I applied Muon to all LoRA layers based on hidden-layer mapping. However, this is suboptimal: Muon should only be applied to the B matrix. Because the A matrix typically has an extreme aspect ratio (rank × input dimension, with the rank far smaller), Muon's orthogonalization tends to produce noise (garbage outputs) rather than meaningful updates.

This PR addresses this by:

  • Applying Muon exclusively to the B matrix.
  • Assigning AuxAdam to the A matrix.

This setup makes LoRA significantly more robust; theoretically, we can achieve DoRA-like effects on standard LoRA using this configuration.
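A minimal sketch of the proposed routing, assuming a name-based convention (`lora_A`/`lora_B` are illustrative, not OneTrainer's actual parameter names). For clarity the B-branch uses an exact SVD-based orthogonalization, which Muon itself only approximates via Newton-Schulz iterations, and a plain elementwise step stands in for AuxAdam:

```python
import numpy as np

def orthogonalize(grad):
    """Project a gradient onto the nearest (semi-)orthogonal matrix.

    Muon approximates this with a Newton-Schulz iteration; an exact
    SVD is used here purely for illustration.
    """
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return U @ Vt

def route_lora_update(name, grad):
    """Pick the update rule per LoRA factor, as proposed in this PR:
    Muon-style orthogonalization for B, an Adam-like step for A.
    (The name matching is a hypothetical convention.)
    """
    if "lora_B" in name:
        return orthogonalize(grad)  # Muon branch
    # AuxAdam branch, reduced to an elementwise sign step for brevity
    return np.sign(grad)

rng = np.random.default_rng(0)
r, d_in, d_out = 8, 512, 512
grad_B = rng.normal(size=(d_out, r))  # tall: benign aspect ratio
grad_A = rng.normal(size=(r, d_in))   # extreme aspect ratio

upd_B = route_lora_update("lora_B", grad_B)
# The orthogonalized B-update has all singular values equal to 1:
print(np.linalg.svd(upd_B, compute_uv=False).round(6))
```

The point of the split is visible in the shapes: the B gradient is tall (512×8), where orthogonalization is well behaved, while the A gradient (8×512) is exactly the extreme-aspect-ratio case this PR routes away from Muon.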

Technical Context

I identified this issue while testing #1263 (comment). After applying the scaling to both the A and B matrices, I observed that spectral normalization severely cripples the orthogonalization of the A matrix when using Muon.

This approach is supported by recent research (see: arXiv:2508.17901), which proposes orthogonalizing the B matrix to achieve superior results. The paper notes that if the B matrix contains redundant (correlated) columns, the "effective rank" of the update falls below the actual rank. Orthogonalization resolves this by ensuring the update remains full-rank and efficient.
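The effective-rank argument can be checked numerically. The sketch below uses the stable rank, ‖B‖²_F / ‖B‖²₂, as a simple proxy for the paper's effective rank (my choice of proxy, not the paper's exact measure), and again uses an exact SVD where Muon would apply its Newton-Schulz approximation:

```python
import numpy as np

def stable_rank(M):
    """||M||_F^2 / ||M||_2^2 -- a common proxy for effective rank."""
    s = np.linalg.svd(M, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

rng = np.random.default_rng(0)
d_out, r = 64, 8

# B with highly correlated columns: every column is the same base
# direction plus a little noise, so the update is nearly rank 1
# even though the actual rank is r.
base = rng.normal(size=(d_out, 1))
B_corr = base + 0.01 * rng.normal(size=(d_out, r))

# Orthogonalization pushes every singular value to 1,
# restoring the full actual rank.
U, _, Vt = np.linalg.svd(B_corr, full_matrices=False)
B_orth = U @ Vt

print(f"stable rank before: {stable_rank(B_corr):.2f}")  # close to 1
print(f"stable rank after:  {stable_rank(B_orth):.2f}")  # equals r = 8
```

This is exactly the failure mode the PR description cites: correlated columns collapse the effective rank of the update well below r, and orthogonalizing B recovers it.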

Potential Benefits

  • Enhanced Robustness: Prevents the "garbage" noise generation caused by applying orthogonalization to the extreme aspect ratios of the A matrix.
  • Improved Effective Rank: By orthogonalizing the B matrix, we eliminate column correlation, ensuring the update maintains its full actual rank.
  • DoRA-like Performance: Theoretically allows standard LoRA to achieve effects similar (or superior) to Weight-Decomposed Low-Rank Adaptation (DoRA) without the extra overhead.
  • Optimization Efficiency: Uses the strengths of Muon where it excels (B matrix) and relies on AuxAdam where Muon struggles (A matrix).

❕ Tests, comparisons, and feedback are always welcome.

@dxqb
Collaborator

dxqb commented Feb 12, 2026

Even if this is clearly useful, I'm not sure this should be hardcoded.

When a user selects the Muon optimizer, I think they'd expect OneTrainer to use the Muon optimizer as if they had downloaded it themselves and used it as the authors recommend.
If the authors haven't considered LoRA at all, I'd prefer to

  • discuss it with the authors
  • or, make it a toggle

@Koratahiu Koratahiu closed this Feb 16, 2026
@Koratahiu
Contributor Author

Muon distorts the LoRA geometry anyway

@Koratahiu Koratahiu deleted the lora_muon branch February 18, 2026 07:49
@Koratahiu
Contributor Author

Even though this is closed:
I found that SignSGD (a.k.a. simplified Adam) produces updates with a consistent RMS of 1 for all ranks of the LoRA A factor, meaning the same LR can be used for all of them.
This was without any scaling.
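The RMS-of-1 observation is easy to verify: the sign of a continuous-valued gradient is ±1 elementwise, so the update's RMS is exactly 1 regardless of the A factor's rank or aspect ratio. A quick check (the shapes are illustrative):

```python
import numpy as np

def rms(x):
    """Root mean square of all elements."""
    return float(np.sqrt(np.mean(x ** 2)))

rng = np.random.default_rng(0)
d_in = 768
for rank in (4, 8, 32, 128):
    grad_A = rng.normal(size=(rank, d_in))  # LoRA A factor gradient
    update = np.sign(grad_A)                # SignSGD update
    print(rank, rms(update))                # RMS is 1 at every rank
```

This rank-independence is what makes a single learning rate work across all A factors without any extra scaling.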
