DoRA-OFT (DOFT): More Stable and Faster DoRA-variant#1335
Koratahiu wants to merge 15 commits into Nerogar:master
Conversation
> Is there a paper for this?
No. DoRA is a general idea: decouple the norm (magnitude) of a weight from its direction. It can be applied to any adapter method (e.g., LoHa, LoKr, etc.).
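To make the decoupling concrete, here is a minimal numpy sketch of the DoRA-style decomposition (function names and the column-wise convention are illustrative assumptions, not the PR's actual implementation):

```python
import numpy as np

def dora_decompose(W):
    # Split W into a per-column magnitude vector m and a unit-norm
    # direction matrix V, so that W = m * V (DoRA-style decoupling).
    m = np.linalg.norm(W, axis=0, keepdims=True)  # per-column norms
    V = W / m                                      # unit-norm directions
    return m, V

def dora_recompose(m, V):
    # During training, m and V are updated separately; the effective
    # weight is their product.
    return m * V

W = np.array([[3.0, 0.0],
              [4.0, 1.0]])
m, V = dora_decompose(W)
W_rec = dora_recompose(m, V)
```

Because `m` and `V` are separate trainable quantities, any method that only rotates or perturbs directions (like OFT) can be paired with a learned magnitude.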
> Is this based on OFTv2?
Merged in #1315. This method should be used with the scaled OFT option; with scaled OFT, both parameters share the same step size and LR range.
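For context, the reason OFT preserves norm-energy (and why a DoRA magnitude term complements it) can be sketched as follows. This assumes a Cayley-transform parametrization of the orthogonal matrix, which is one common choice for OFT-style methods; OFTv2's exact parametrization may differ:

```python
import numpy as np

def cayley(S):
    # Build an orthogonal matrix R from a skew-symmetric S via the
    # Cayley transform: R = (I + S)^{-1} (I - S).
    I = np.eye(S.shape[0])
    return np.linalg.solve(I + S, I - S)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = A - A.T                     # skew-symmetric, so cayley(S) is orthogonal
R = cayley(S)

W = rng.standard_normal((4, 3))
W_rot = R @ W                   # OFT-style update: rotate the weights
```

An orthogonal `R` leaves every column norm of `W` unchanged, which is exactly OFT's norm-energy preservation; adding DoRA's learned magnitudes restores the freedom to rescale, which is where the extra expressivity comes from.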
I recently revisited some "dusty tomes" of abandoned sketch ideas and rediscovered my attempt at combining DoRA with OFT. I originally scrapped it because it felt at odds with OFT's theory of norm-energy preservation, but looking back, it's actually a powerhouse combo.
It turns out DoRA-OFT is just as effective as standard DoRA, but significantly more stable and much faster. Here’s why:
Performance:
In my tests, this method learned in half the steps of standard LoRA while maintaining very high expressivity (an area where standard OFT typically struggles).
By merging these two, we get the superior training dynamics of DoRA with the stability and speed of OFT. It’s a very promising method.