
Scaled OFT: Block-size Invariant Learning Rates#1315

Open
Koratahiu wants to merge 9 commits into Nerogar:master from Koratahiu:scaled_oft

Conversation

@Koratahiu
Contributor

In standard OFT, changing the block size fundamentally alters the number of trainable elements in the rotation matrix, which in turn shifts the magnitude of the weights passed through the Cayley transform. This is particularly problematic with sign-based optimizers (e.g., AdamW), because the update scale becomes a moving target (a larger block size requires a smaller LR).

This PR addresses this by:

  • Introducing Scaled OFT, which applies a (1/√n_elements) scaling factor to the rotation weights.
  • Normalizing the effective weight based on the number of elements (N) in the skew-symmetric matrix before the parametrization step.

This ensures that the "step size" taken by the optimizer remains mathematically consistent, regardless of whether you are using a small or large block size.

Technical Context

The number of elements (n_elements) of the OFT weight matrix [rank, n_elements] is calculated as:
n_elements = block_size * (block_size - 1) / 2
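The formula above can be sketched as a small helper (the function name is illustrative, not from the PR's code):

```python
# Hypothetical helper: each OFT block is parametrized by the strictly
# upper-triangular entries of a block_size x block_size skew-symmetric
# matrix, hence block_size * (block_size - 1) / 2 trainable elements.
def oft_n_elements(block_size: int) -> int:
    return block_size * (block_size - 1) // 2

# Doubling the block size roughly quadruples the per-block parameter count:
print(oft_n_elements(32))   # 496
print(oft_n_elements(64))   # 2016
print(oft_n_elements(128))  # 8128
```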

Without scaling, larger blocks produce higher internal variance, which effectively "dilutes" or "amplifies" the learning rate when passed through the batched Cayley transform.

By implementing effective_weight = self.weight / (self.n_elements**0.5), we stabilize the input to the Cayley parametrization. This ensures that the resulting orthogonal matrix maintains a consistent deviation from the identity matrix across different ranks.
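A minimal PyTorch sketch of this scaling, assuming a [rank, n_elements] weight that is unpacked into skew-symmetric blocks before the Cayley transform. The unpacking and transform here are illustrative, not the exact OneTrainer implementation; only the `effective_weight = weight / n_elements**0.5` line mirrors the PR text.

```python
import torch

def cayley_from_scaled_weight(weight: torch.Tensor, block_size: int) -> torch.Tensor:
    # weight: [rank, n_elements], n_elements = block_size * (block_size - 1) // 2
    n_elements = weight.shape[1]
    effective_weight = weight / (n_elements ** 0.5)  # the 1/sqrt(n_elements) scaling

    rank = weight.shape[0]
    # Unpack each row into the upper triangle of a skew-symmetric matrix Q.
    Q = weight.new_zeros(rank, block_size, block_size)
    iu = torch.triu_indices(block_size, block_size, offset=1)
    Q[:, iu[0], iu[1]] = effective_weight
    Q = Q - Q.transpose(-1, -2)  # enforce Q^T = -Q

    # Cayley transform: R = (I - Q)^-1 (I + Q), which equals (I + Q)(I - Q)^-1
    # since both factors commute; R is orthogonal for skew-symmetric Q.
    I = torch.eye(block_size, dtype=weight.dtype, device=weight.device)
    return torch.linalg.solve(I - Q, I + Q)
```

Because `effective_weight` is kept small and size-invariant, the resulting rotation stays a consistent distance from the identity regardless of block size.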


Sanity Check

Other than my extensive tests (lost to the black hole of the TensorBoard cache), here's a sanity check across block sizes (32, 64, 128) at a 1e-3 LR:

[image: training curves for block sizes 32, 64, 128]

Purple: 32, green: 64, pink: 128.


More Technical Context

If we interpret the OFT weight matrix [rank, n_elements] as a set of row vectors (rank rows, each of length n_elements), then an update size of 1/√n_elements matches the theoretical and empirical update magnitude of signed optimizers (e.g., Adam) and row-wise normalization optimizers (LMO).

By scaling the weights by 1/√n_elements, we enforce a unit effective update magnitude for all values of n_elements.
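A quick numeric check of this argument (a sketch, not the optimizer itself): a sign-style update changes every element by ±lr, so the per-row update norm grows as √n_elements, and dividing the effective weight by √n_elements cancels that growth exactly.

```python
import math

lr = 1e-3
for block_size in (32, 64, 128):
    n = block_size * (block_size - 1) // 2
    raw_norm = lr * math.sqrt(n)            # norm of a sign update on the raw weight
    scaled_norm = raw_norm / math.sqrt(n)   # norm after the 1/sqrt(n) scaling
    print(block_size, raw_norm, scaled_norm)
```

The raw update norm grows with block size, while the scaled norm stays at `lr` for every block size, which is why a single LR transfers across configurations.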


Potential Benefits

  • Optimizer Consistency: Specifically benefits AdamW and other signed optimizers by ensuring the update magnitude is invariant to the OFT Block Size.
  • LR Portability: Allows users to find a stable Learning Rate once and keep it consistent even if they decide to change the block size.

❕ This is meant to keep different block sizes in the same LR range, similar in effect to alpha=1 in LoRA. In my extensive tests it made a 1e-3 LR a stable baseline for all block sizes (on SDXL), but it's still an approximation (one that's accurate enough).

This also solves #1231

@Koratahiu
Contributor Author

Update 1:

  • A 1e-3 LR works very well with Flux.2 Klein 4B using this method:
[image: training curves]
  • It appears that we also need to scale the OFT weight during inference (very similar to LoRA alpha). This should be simple enough to implement (by adding a scalar), and support has been added in my ComfyUI patch.

@Koratahiu
Contributor Author

[image: training curves comparing block sizes 256 and 512]

While using this, doubling the block size from 256 (yellow) to 512 (purple) maintained the same LR.

@dxqb, is there anything I can do for this PR? It’s well-tested and works just fine.
I think it should be the default.
