
Compiled basic optimizers #1185 (Draft)

dxqb wants to merge 1 commit into Nerogar:master from dxqb:compiled_opt

Conversation

@dxqb (Collaborator) commented Dec 3, 2025

optimizer.step = step_adafactor.__get__(optimizer, Adafactor)
optimizer.step_parameter = step_adafactor_parameter.__get__(optimizer, Adafactor)
# lambdas don't work because of scheduler patching:
def step(*args, **kwargs):
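
The `__get__` calls above use Python's descriptor protocol to attach a free function to one optimizer instance as a bound method. A minimal standalone sketch of the same trick, with hypothetical names (not the PR's actual classes):

```python
# Sketch: binding a free function as a method on a single instance.
# function.__get__(instance, cls) returns a bound method, so the
# function receives the instance as `self` when called.

class Optimizer:
    def __init__(self):
        self.steps = 0

def patched_step(self):
    # `self` is the Optimizer instance the function was bound to
    self.steps += 1
    return self.steps

opt = Optimizer()
# bind the free function to this one instance only;
# other Optimizer instances keep their original step()
opt.step = patched_step.__get__(opt, Optimizer)
opt.step()  # increments opt.steps to 1
```

Unlike assigning a plain function (`opt.step = lambda: ...`), the bound method keeps access to the instance state without capturing it in a closure, which matters here because the scheduler later re-patches `optimizer.step`.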
Contributor commented:

Should this and the below step_parameter function get decorators like @functools.wraps(step_adafactor) to make these wrappers (slightly more) seamless?
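
For context on the suggestion: `functools.wraps` copies metadata such as `__name__`, `__doc__`, and `__wrapped__` from the wrapped function onto the wrapper, so introspection and debugging tools see the original identity. A small illustration with hypothetical functions (not the PR's code):

```python
import functools

def step_adafactor(closure=None):
    """Original step function with its own name and docstring."""
    return closure

# A bare wrapper loses the wrapped function's identity:
def plain_wrapper(*args, **kwargs):
    return step_adafactor(*args, **kwargs)

# With functools.wraps the wrapper advertises the original:
@functools.wraps(step_adafactor)
def wrapped(*args, **kwargs):
    return step_adafactor(*args, **kwargs)

print(plain_wrapper.__name__)  # 'plain_wrapper'  (identity lost)
print(wrapped.__name__)        # 'step_adafactor' (identity preserved)
print(wrapped.__wrapped__ is step_adafactor)  # True
```

Whether that matters for these internal wrappers is the open question in the thread; it mainly helps stack traces and tooling that inspect `optimizer.step`.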

@dxqb (Collaborator, Author) replied:

Can you elaborate? What's the problem with the current code?

@dxqb (Collaborator, Author) commented Mar 24, 2026

AdamW_adv can do everything AdamW can and also compiles, so there is not much point in implementing compilation again for the patched AdamW.

However, Adafactor is the preset for full finetuning, and compilation has a large effect in full finetuning. adv_optm currently doesn't have Adafactor.
CC @Koratahiu

@Koratahiu (Contributor) commented Mar 24, 2026

> However, Adafactor is the preset for full finetuning and compile has a large effect in full finetuning. adv_optm currently doesn't have Adafactor CC @Koratahiu

Any adv Adam variant with beta1 = 0 and the factored option is essentially an improved Adafactor (factored second moment, but with an uncompressed raw gradient contribution and better factorization).
I am also implementing a factored second-moment option in #1344, which behaves just like Adafactor (dense first moment, factored second moment).

I guess you can use factored AdamW_adv with beta1 = 0 for the full-finetuning presets?
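
For readers unfamiliar with the factoring being discussed: Adafactor stores only the row and column sums of the squared-gradient matrix and reconstructs a rank-1 approximation of the full second moment, cutting memory from O(n*m) to O(n+m). A rough NumPy sketch of a single step (ignoring beta2 accumulation; not the adv_optm or Adafactor implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 6))
g2 = grad ** 2  # elementwise squared gradient for one step

# Store only the row and column sums of g2 ...
row = g2.sum(axis=1)   # shape (4,)
col = g2.sum(axis=0)   # shape (6,)

# ... and reconstruct a rank-1 approximation of the full matrix:
v_hat = np.outer(row, col) / g2.sum()

# 4 + 6 stored scalars instead of 4 * 6, at the cost of a
# rank-1 approximation of the true second-moment matrix.
print(v_hat.shape)  # (4, 6)
```

Setting beta1 = 0 on top of this drops the first-moment buffer entirely, which is why the factored beta1 = 0 configuration ends up so close to Adafactor.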


3 participants