```python
optimizer.step = step_adafactor.__get__(optimizer, Adafactor)
optimizer.step_parameter = step_adafactor_parameter.__get__(optimizer, Adafactor)
# lambdas don't work because of scheduler patching:
def step(*args, **kwargs):
```
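For context, the `__get__` calls in the diff use the descriptor protocol to turn a plain function into a bound method on an existing instance. A minimal sketch (the `Optimizer`/`step_patched` names here are hypothetical stand-ins, not the PR's actual classes):

```python
class Optimizer:
    def __init__(self):
        self.steps = 0

# A replacement step function defined outside the class, mirroring how
# step_adafactor is defined in the diff.
def step_patched(self, *args, **kwargs):
    self.steps += 1
    return "patched"

opt = Optimizer()
# function.__get__(instance, owner) returns a bound method, so `self`
# is supplied automatically on every call -- the same idiom as above.
opt.step = step_patched.__get__(opt, Optimizer)

print(opt.step())  # calls step_patched with self=opt
```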
Should this and the below step_parameter function get decorators like @functools.wraps(step_adafactor) to make these wrappers (slightly more) seamless?
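The suggestion can be sketched as follows (the wrapper names here are illustrative, not the PR's actual code): without `functools.wraps`, the wrapper hides the wrapped function's name and docstring from introspection; with it, `__name__`, `__doc__`, and `__wrapped__` point back to the original.

```python
import functools

def step_adafactor(self=None, closure=None):
    """Adafactor-style step (stand-in for the patched method)."""
    return closure

# Without @functools.wraps, introspection sees only the wrapper:
def bare_step(*args, **kwargs):
    return step_adafactor(*args, **kwargs)

# With @functools.wraps, help(), __name__, __doc__, and __wrapped__
# all see through to the wrapped function:
@functools.wraps(step_adafactor)
def wrapped_step(*args, **kwargs):
    return step_adafactor(*args, **kwargs)

print(bare_step.__name__)     # 'bare_step'
print(wrapped_step.__name__)  # 'step_adafactor'
```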
Can you elaborate? What's the problem with the current code?
AdamW_adv can do everything AdamW can and also compiles, so there is not much point in implementing it again for patched AdamW. However, Adafactor is the preset for full finetuning, and compile has a large effect in full finetuning. adv_optm currently doesn't have Adafactor.

Any adv Adam variant with beta1 = 0 and the factored option is essentially Improved Adafactor (factored second moment, but with an uncompressed raw gradient contribution and better factorization). I guess you could do factored AdamW_adv with beta1 = 0 for the full-finetuning presets?
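To see why beta1 = 0 collapses Adam toward an Adafactor-like method, here is a minimal scalar Adam step using the standard update rule (a sketch for illustration; the second moment here is unfactored, which is the part a `factored` option would replace):

```python
import math

def adam_step(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam: first/second moment EMAs with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return -lr * m_hat / (math.sqrt(v_hat) + eps), m, v

g = 0.5
# With beta1 = 0, m_hat equals the raw gradient g exactly, so the update
# is g / (sqrt(v_hat) + eps): a second-moment-only method with an
# "uncompressed raw gradient contribution", as described above.
update, m, v = adam_step(g, m=0.0, v=0.0, t=1, beta1=0.0)
print(m == g)  # True: the first moment is just the current gradient
```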
Draft implementation
AdamW: not planned