
Geometry-Aware LoRA Optimization for Faster and Stable Convergence #1407

Draft
Koratahiu wants to merge 7 commits into Nerogar:master from Koratahiu:precond_lora

Conversation

@Koratahiu
Contributor

This PR implements the gradient preconditioning technique proposed in *Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models*.

Standard optimizers treat LoRA's $A$ and $B$ matrices as completely independent parameters. This method changes how the optimizer sees the gradients by accounting for the dependency between the two factors (the actual low-rank manifold $W = BA$), essentially acting as a specialized, highly efficient second-order optimizer for low-rank adapters.

  • Geometry-Aware: By preconditioning the gradients based on the relationship between the $A$ and $B$ matrices, it follows the true gradient of the low-rank space, preventing the optimizer from taking inefficient steps.
  • Optimizer Agnostic: Because this applies a preconditioning transformation to the raw gradient right before the optimizer step, it is fully compatible with your favorite standard optimizers (AdamW, Prodigy, SGD, etc.).
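Concretely, the paper rescales each factor's raw gradient by the inverse Gram matrix of its partner before the optimizer step. A minimal sketch of that transformation (the function name and the `eps` damping term here are illustrative, not taken from this PR's code):

```python
import torch

def precondition_lora_grads(A: torch.Tensor, B: torch.Tensor, eps: float = 1e-8):
    """Rescale LoRA gradients in place, assuming W = B @ A with
    B of shape (m, r) and A of shape (r, n):

        grad_A <- (B^T B + eps*I)^(-1) grad_A
        grad_B <- grad_B (A A^T + eps*I)^(-1)
    """
    r = A.shape[0]
    with torch.no_grad():
        damp = eps * torch.eye(r, device=A.device, dtype=A.dtype)
        # Both Gram matrices are only r x r, so each solve costs O(r^3),
        # independent of the full layer dimensions m and n.
        A.grad = torch.linalg.solve(B.T @ B + damp, A.grad)
        # Right-multiplication by a symmetric inverse, done as a transposed solve.
        B.grad = torch.linalg.solve(A @ A.T + damp, B.grad.T).T
```

Called between `loss.backward()` and `optimizer.step()`, a hook like this leaves the optimizer itself untouched, which is what makes the approach optimizer-agnostic.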

Important Notes:

  • Implementation Mechanics: This requires the $A$ and $B$ tensors to be "aware" of each other. The codebase handles this by cross-linking the tensors (_lora_pair) during initialization so the preconditioner can calculate the necessary matrix inversions on the fly.
  • Negligible Overhead: The matrix inversion required for the preconditioning is bounded by the LoRA rank (e.g., inverting a 16x16 or 64x64 matrix), not the full parameter dimension, so it won't noticeably affect your s/it.
  • Not for DoRA (Yet): This preconditioning math is derived specifically for the standard LoRA formulation ($W = BA$). Applying it to decomposed weights (DoRA) is not recommended without further mathematical adaptation or testing.
  • Untested for Muon: compatibility with the Muon optimizer has not been verified yet.
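The cross-linking described above could be wired up roughly as follows. This is a hypothetical sketch (`_lora_pair` is the attribute name mentioned in the PR; the helper function and the example shapes are invented for illustration):

```python
import torch

def link_lora_pair(down: torch.nn.Parameter, up: torch.nn.Parameter) -> None:
    # Store a reference to the partner factor on each parameter so the
    # preconditioning step can reach B while transforming A's gradient,
    # and vice versa, without changing the optimizer's API.
    down._lora_pair = up
    up._lora_pair = down

# Example: a rank-16 adapter for a 512 -> 512 linear layer.
lora_down = torch.nn.Parameter(torch.randn(16, 512) * 0.01)  # A
lora_up = torch.nn.Parameter(torch.zeros(512, 16))           # B
link_lora_pair(lora_down, lora_up)
```

Because the link is just an attribute on the parameter, the pairing survives being handed to any standard optimizer as an ordinary parameter list.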

Other Notes:

  • Suggested Ranges: You can generally start with your standard LoRA learning rates (e.g., 1e-4 to 4e-4 for AdamW). However, because the gradients are better conditioned, you might find that the model can tolerate and benefit from higher learning rates than usual.
  • Faster Convergence: Expect to hit your target loss/image quality in significantly fewer steps than with standard, un-preconditioned training.
  • Improved Stability: This method naturally stabilizes the training dynamics, particularly at higher ranks where standard Adam can sometimes struggle to balance the updates between the $A$ and $B$ matrices.

Usage

  • Enable Riemannian Preconditioning in the LoRA tab (the name may still change)

torch.linalg.solve is more accurate, numerically stable, faster, and cheaper than torch.linalg.inv for this specific use case, and the result is mathematically identical.
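The equivalence is easy to check directly. In this illustrative snippet (float64 is used only to make the comparison tight; the matrix shapes mimic a rank-16 Gram system):

```python
import torch

torch.manual_seed(0)
r, n = 16, 1024
# A symmetric positive-definite r x r system, the shape produced by
# B^T B plus damping in the preconditioner.
M = torch.randn(r, r, dtype=torch.float64)
M = M @ M.T + 1e-6 * torch.eye(r, dtype=torch.float64)
G = torch.randn(r, n, dtype=torch.float64)

via_inv = torch.linalg.inv(M) @ G     # explicit inverse, then a matmul
via_solve = torch.linalg.solve(M, G)  # one factorization-backed solve

assert torch.allclose(via_inv, via_solve, rtol=1e-6, atol=1e-8)
```

`solve` never materializes the r x r inverse: it factorizes `M` once and back-substitutes, which is both cheaper and less prone to amplifying rounding error when `M` is ill-conditioned.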
@Koratahiu
Contributor Author

Update 1:

