
DoRA-OFT (DOFT): More Stable and Faster DoRA-variant#1335

Draft
Koratahiu wants to merge 15 commits into Nerogar:master from Koratahiu:DoRA_OFT

Conversation

@Koratahiu
Contributor

I recently revisited some "dusty tomes" of abandoned sketch ideas and rediscovered my attempt at combining DoRA with OFT. I originally scrapped it because it seemed at odds with OFT's theory of norm-energy preservation, but in hindsight it is actually a powerhouse combo.

It turns out DoRA-OFT is just as effective as standard DoRA, but significantly more stable and much faster. Here’s why:

  • The Synergy: OFT handles weight rotation (direction and angle) while DoRA manages the norm (magnitude).
  • The Speed: Since OFT is orthogonal and therefore norm-preserving, the DoRA calculation collapses: the initial weight norms remain valid for the entire run, unlike standard LoRA, which changes them every step.
  • The Result: We bypass the heavy re-calculation overhead of DoRA, achieving the same it/s as standard OFT.
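To make the speed claim concrete, here is a minimal numpy sketch (names and shapes are mine, not from the PR; real OFT parameterizes a block-diagonal orthogonal matrix, e.g. via a Cayley transform, while this uses a random orthogonal matrix as a stand-in). Because an orthogonal rotation preserves every column norm, the per-step norm recomputation that standard DoRA needs can be replaced by norms computed once at initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 6

# Frozen base weight.
W = rng.standard_normal((d_out, d_in))

# Stand-in for the learned OFT rotation: any orthogonal matrix works
# for the purpose of this demonstration.
R, _ = np.linalg.qr(rng.standard_normal((d_out, d_out)))

# Per-column norms of the base weight, computed ONCE at init.
col_norms = np.linalg.norm(W, axis=0)

# Trainable DoRA magnitude vector (here just perturbed from init).
m = col_norms * 1.1

W_rot = R @ W

# Standard DoRA: recompute the column norms of the adapted weight
# every step before renormalizing.
dora_exact = m * (W_rot / np.linalg.norm(W_rot, axis=0))

# DoRA-OFT shortcut: orthogonal R preserves column norms, so the
# cached init-time norms can be reused with no recomputation.
doft_fast = m * (W_rot / col_norms)

print(np.allclose(dora_exact, doft_fast))  # True
```

The two paths produce the same adapted weight, which is why the combined method can run at the same it/s as plain OFT.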

Performance:

In my tests, this method learned in half the steps of standard LoRA while maintaining very high expressivity (an area where standard OFT typically struggles).

By merging these two, we get the superior training dynamics of DoRA with the stability and speed of OFT. It’s a very promising method.

@yamatazen

Is there a paper for this?

@Koratahiu
Contributor Author

> Is there a paper for this?

No, but DoRA is a theory of decoupling the norm (magnitude) from the direction. It can be applied to any adapter method (e.g., LoHa, LoKr, etc.).
What makes the combination with OFT unique, however, is that OFT preserves the norms of the base weights and learns purely the rotation (direction). This makes it possible to bypass the heavy calculations associated with DoRA, achieving the same speed as standard OFT.
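Written out (my notation, assuming the rotation acts on the output side): DoRA decomposes the adapted weight into a trainable magnitude vector $m$ and a normalized direction,

$$
W' \;=\; m \odot \frac{RW}{\lVert RW \rVert_c}
\;=\; m \odot \frac{RW}{\lVert W \rVert_c},
$$

where $\lVert \cdot \rVert_c$ denotes per-column norms. The second equality holds because an orthogonal $R$ satisfies $\lVert R w \rVert_2 = \lVert w \rVert_2$ for every column $w$ of $W$, so the denominator is a constant that can be cached at initialization.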

@yamatazen

Is this based on OFTv2?

@Koratahiu
Contributor Author

Merged in #1315.

This method should be used with the scaled OFT option.
This is because sign-based optimizers (e.g., Adam) produce a step size of O(1) for the DoRA scale parameter, but a step size of O(1/√n_elements) for the OFT blocks.
Within a stable learning-rate range, this effectively degenerates to standard OFT, because the effective LR for the DoRA scale would be too low for it to learn anything.

By using the scaled OFT setting, both parameters will share the same step size and LR range.

