Scaled OFT: Block-size Invariant Learning Rates #1315
Koratahiu wants to merge 9 commits into Nerogar:master from
Conversation
While using this, doubling the block size from 256 (yellow) to 512 (purple) maintained the same LR. @dxqb, is there anything I can do for this PR? It’s well-tested and works just fine.


In standard OFT, changing the block size fundamentally alters the number of trainable elements in the rotation matrix, which in turn shifts the magnitude of the weights during the Cayley transform. This is particularly problematic with sign-like optimizers (e.g., AdamW), as the update scale becomes a moving target: larger block sizes effectively require a smaller LR.
This PR addresses this by scaling the OFT weight by `1/√n_elements` before the Cayley transform. This ensures that the "step size" taken by the optimizer remains mathematically consistent, regardless of whether you are using a small or large block size.
Technical Context
The number of elements (`n_elements`) of the OFT matrix `[rank, n_elements]` is calculated as:

`n_elements = block_size * (block_size - 1) / 2`

Without scaling, larger blocks result in higher internal variance, which effectively "dilutes" or "amplifies" the learning rate when passed through the Cayley batch process.
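To see how quickly the element count grows, here is a quick sketch (the function name is just for illustration, not code from this PR):

```python
# Hypothetical helper: free parameters per skew-symmetric block, and the
# 1/sqrt(n_elements) factor this PR uses to keep update scale comparable.
def n_elements(block_size: int) -> int:
    # strictly-lower-triangular entries of a block_size x block_size block
    return block_size * (block_size - 1) // 2

for bs in (32, 64, 128, 256, 512):
    n = n_elements(bs)
    print(bs, n, n ** -0.5)
```

Doubling the block size roughly quadruples `n_elements`, so the unscaled update magnitude drifts substantially between configurations.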
By implementing

`effective_weight = self.weight / (self.n_elements**0.5)`

we stabilize the input to the Cayley parametrization. This ensures that the resulting orthogonal matrix maintains a consistent deviation from the identity matrix across different block sizes.

Sanity Check
Other than my extensive tests (which got lost in the pit of TensorBoard cache), here's a sanity check for block sizes 32, 64, and 128, using a 1e-3 LR:
Purple: 32, green: 64, pink: 128.
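To make the invariance concrete, here is a standalone NumPy sketch (not the PR's actual code; the function name and parameter layout are illustrative) of a Cayley transform with the proposed `1/√n_elements` scaling applied to the input parameters:

```python
import numpy as np

def cayley_from_params(params: np.ndarray, block_size: int, scaled: bool = True) -> np.ndarray:
    """Build an orthogonal block via the Cayley transform Q = (I - S)(I + S)^-1.

    `params` holds the strictly-lower-triangular entries of the skew-symmetric
    generator S. With `scaled=True`, params are divided by sqrt(n_elements)
    first, as this PR proposes. (Illustrative layout, not the PR's code.)
    """
    n_elements = block_size * (block_size - 1) // 2
    assert params.shape == (n_elements,)
    if scaled:
        params = params / np.sqrt(n_elements)
    S = np.zeros((block_size, block_size))
    S[np.tril_indices(block_size, k=-1)] = params
    S = S - S.T  # skew-symmetric: S^T = -S
    I = np.eye(block_size)
    return (I - S) @ np.linalg.inv(I + S)

# Deviation from identity stays comparable across block sizes when scaled:
rng = np.random.default_rng(0)
for bs in (32, 64, 128):
    p = rng.normal(size=bs * (bs - 1) // 2) * 1e-2
    Q = cayley_from_params(p, bs)
    print(bs, np.linalg.norm(Q - np.eye(bs)))
```

With the scaling in place, the Frobenius distance ‖Q − I‖ printed at the end stays roughly constant as the block size grows, which is exactly the invariance this PR targets; with `scaled=False` it grows with the block size.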
More Technical Context
If we interpret the OFT weight matrix `[rank, n_elements]` as a set of vectors (where `set = rank` and `vector = n_elements`), an update size of `1/√n_elements` represents the theoretical and empirical update complexity of signed optimizers (e.g., Adam) and row-wise normalization optimizers (LMO). By scaling the weights by `1/√n_elements`, we are enforcing an update complexity of 1 for all sizes of `n_elements`.
Potential Benefits
Learning rates no longer depend on the chosen OFT Block Size. ❕ This is meant to keep different block sizes in the same LR range, similar in effect to alpha=1 in LoRA. In my extensive tests, it enforced a 1e-3 LR as a stable baseline for all block sizes (for SDXL), but it's still an approximation (accurate enough in practice).
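The `1/√n_elements` argument can be checked with quick arithmetic: a sign-based step moves each of the n elements by exactly `lr`, so the raw update's L2 norm is `lr·√n`, and dividing the effective weight by `√n` cancels that growth. A minimal sketch (the loop values are simply `n_elements` for block sizes 32, 64, and 128):

```python
import math

# Illustrative check (not the PR's code): a sign optimizer moves every one
# of the n elements by exactly lr, so the raw-update L2 norm grows as sqrt(n).
# Dividing the effective weight by sqrt(n) cancels that growth.
lr = 1e-3
for n in (496, 2016, 8128):  # n_elements for block sizes 32, 64, 128
    raw_update_norm = lr * math.sqrt(n)           # ||lr * sign(g)||_2
    effective_update_norm = raw_update_norm / math.sqrt(n)
    print(n, raw_update_norm, effective_update_norm)
```

The effective update norm comes out equal to `lr` for every `n`, so the same learning rate produces the same effective step regardless of block size.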
This also solves #1231