Fixing MuP by marcobellagente93 · Pull Request #1061 · EleutherAI/gpt-neox

marcobellagente93 · 2023-10-19T10:10:49Z

The current MuP implementation in neox is buggy. This PR restores the main functionality without major changes to the code. Current limitations:

only supports non-tied models
does not yet support embedding multipliers
does not yet support attention factor multipliers or 1/d attention scaling (in place of the standard 1/sqrt(d))

The main issue in the current code is that the model is always initialized with use_mup = False, and the flag is only set to its correct value later. This doesn't work: every class reads the wrong value in its init, so in effect mup was never used. The best solution would be to loop through all modules and set the correct attribute there. The current workaround is a minimal modification: the attribute is reset when the linear layers are re-initialized, which does the correct thing for everything except self-attention and the embedding.
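The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not gpt-neox code: `Args`, `ToyReadout`, `base_hidden_size`, and `set_use_mup` are hypothetical names, and the readout scaling only stands in for whatever a real layer derives from the flag.

```python
from dataclasses import dataclass


@dataclass
class Args:
    use_mup: bool = False
    hidden_size: int = 1024
    base_hidden_size: int = 256  # hypothetical name for the base width


class ToyReadout:
    """Toy layer that copies the flag once, at construction time."""

    def __init__(self, args):
        self.use_mup = args.use_mup
        self.width_mult = args.hidden_size / args.base_hidden_size

    def scale(self):
        # Under mup the readout is divided by the width multiplier.
        return 1.0 / self.width_mult if self.use_mup else 1.0


args = Args()
layer = ToyReadout(args)

# Flipping the config after construction does not reach the layer.
args.use_mup = True
scale_after_late_flip = layer.scale()


def set_use_mup(modules, value):
    """Loop over all modules and set the attribute explicitly."""
    for module in modules:
        if hasattr(module, "use_mup"):
            module.use_mup = value


set_use_mup([layer], True)
scale_after_fix = layer.scale()
```

In a real model the loop would run over every submodule, which is the "best solution" mentioned above; the PR instead resets the attribute when the linear layers are re-initialized.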

A second issue is that the code as written expects the mu-optimizer to provide the multiplier target_width/base_width, but the mup library does not provide it. We should probably just open a PR on mup and get rid of this workaround. As the fastest solution, mup is added to the repo, with the multiplier added to the optimizer dict. I also removed the torch dependencies for mup, since they are unnecessary and can only lead to issues.
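As a rough illustration of what "the multiplier added to the optimizer dict" means, the sketch below builds optimizer-style parameter groups that each carry a `width` entry equal to target_width/base_width. It is plain Python with no mup or torch calls; `make_param_groups` and the tuple layout are assumptions made for this example. Dividing the learning rate of matrix-like parameters by the width multiplier follows the usual Adam rule in μP.

```python
def make_param_groups(params, base_lr):
    """Group parameters by width multiplier (target_width / base_width).

    `params` is a list of (name, width_mult, is_matrix_like) tuples.
    Each returned group stores its multiplier under the "width" key so that
    a scheduler can rescale the learning rate later.
    """
    groups = {}
    for name, width_mult, is_matrix_like in params:
        # Vector-like parameters (biases, norms) keep the base learning rate.
        key = width_mult if is_matrix_like else 1.0
        if key not in groups:
            groups[key] = {"params": [], "lr": base_lr / key, "width": key}
        groups[key]["params"].append(name)
    return list(groups.values())


# Illustrative parameter names; the target model is 4x wider than the base.
groups = make_param_groups(
    [
        ("dense.weight", 4.0, True),
        ("dense.bias", 4.0, False),
        ("out.weight", 4.0, True),
    ],
    base_lr=1e-3,
)
```

Without the `width` entry, any downstream code that wants to rescale the learning rate per group has nothing to read, which is the gap described above.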

Plots are also added for testing the implementation:

coordinate checker
wider always better
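Both plots can also be read numerically. The sketch below is an assumption about how one might do that, not part of the PR: `loglog_slope` fits the slope of an activation statistic against width on a log-log scale (close to 0 means the curve is horizontal), and `wider_is_better` checks that the final loss does not increase with width. The function names and all numbers are made up for illustration.

```python
import math


def loglog_slope(widths, stats):
    """Least-squares slope of log2(stat) against log2(width)."""
    xs = [math.log2(w) for w in widths]
    ys = [math.log2(s) for s in stats]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def wider_is_better(loss_by_width, tol=0.0):
    """True if the loss never increases (beyond tol) as the width grows."""
    widths = sorted(loss_by_width)
    losses = [loss_by_width[w] for w in widths]
    return all(b <= a + tol for a, b in zip(losses, losses[1:]))


widths = [256, 512, 1024, 2048]

# A healthy coordinate check: the statistic does not depend on width.
flat_slope = loglog_slope(widths, [1.0, 1.0, 1.0, 1.0])

# An unhealthy one: the statistic grows linearly with width.
growing_slope = loglog_slope(widths, [w / 256 for w in widths])

# Made-up final losses for a width sweep at a fixed learning rate.
good_sweep = wider_is_better({256: 3.2, 512: 3.0, 1024: 2.9})
bad_sweep = wider_is_better({256: 3.0, 512: 3.1})
```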

CLAassistant · 2023-10-19T10:10:56Z

All committers have signed the CLA.

accesslint

There are accessibility issues in these changes.

accesslint · 2023-10-19T10:11:28Z

+This can be used to tune extremely large neural networks such as large pretrained transformers, as we have done in our work.
+More generally, μP reduces the fragility and uncertainty when transitioning from exploration to scaling up, which are not often talked about explicitly in the deep learning literature.
+
+![](figures/sp_vs_mup_dashed.png)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:11:28Z

+
+μP turns out to be the *unique* "natural" parametrization that has this hyperparameter stability property across width, as empirically verified in the gif below on MLPs trained with SGD. Here, across time, we interpolate between PyTorch default and μP's learning rate and initialization scalings (right), and we scale up the width-256 model (log2(width)=8) to width 2^13 = 8192 using this interpolated scaling rule (left).
+
+![](figures/parametrizations.gif)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:11:28Z

+```
+You should find the generated plots under `./coord_checks`, which show stable coordinate sizes under μP, e.g., 
+
+![](coord_checks/μp_mlp_sgd_coord.png)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:11:28Z

+
+and growing sizes under SP, e.g.,
+
+![](coord_checks/sp_mlp_sgd_coord.png)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:11:29Z

+The first set of 3 plots shows an MLP in standard parametrization (SP), trained by adam.
+We see after 1 step of update, activation/output `l1` are exploding with width.
+This means SP is "incorrect."
+![](coord_checks/sp_mlp_adam_lr0.001_nseeds5_bn0_coord.png)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:11:29Z

+![](coord_checks/sp_mlp_adam_lr0.001_nseeds5_bn0_coord.png)
+We now do the same for an MLP in maximal update parametrization (μP) (including using `mup.optim.MuAdam` instead of `torch.optim.Adam`).
+In contrast to the above, all curves stay horizontal, indicating that μP is implemented correctly.
+![](coord_checks/μp_mlp_adam_lr0.001_nseeds5_bn0_coord.png)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:11:29Z

+
+### Wider is Always Better
+
+![](figures/widerbetter.png)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint

There are accessibility issues in these changes.

accesslint · 2023-10-19T10:12:25Z

+3. Run once. gpt-neox will output jpg images similar to those below and exit immediately
 4. Set coord-check to false
+What you will get is some statistics of pre-activations for models that differ only in width. If done correctly, these should be approximately horizontal
+![](mup/figures/coord_check_up.0.jpg)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:12:25Z

+What you will get is some statistics of pre-activations for models that differ only in width. If done correctly, these should be approximately horizontal
+![](mup/figures/coord_check_up.0.jpg)
+<font size="1"> *Healthy coordinate check*</font> 
+![](mup/figures/coord_check_sp.0.jpg)


This image is missing a text alternative. This is a problem for people using screen readers.

accesslint · 2023-10-19T10:12:25Z

+<font size="1"> *Something's wrong*</font> 
+
+A second kind of test is to pick any configuration and learning rate (that doesn't lead to diverging training) and simply run a few different experiments, fixing everything except for the width. Since with mup wider is always better, the results should look like the figure below
+![](mup/figures/width_check.png)


This image is missing a text alternative. This is a problem for people using screen readers.


Quentin-Anthony · 2023-10-19T14:57:20Z

@nsarka -- FYI

StellaAthena · 2023-10-19T17:53:13Z

Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

Quentin-Anthony · 2023-10-19T20:09:56Z

> Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

Not all of his changes are muP-related. I've separated out the muP 1-line change into our fork until microsoft/mup#65 is merged. We can discuss the GPT-NeoX specific changes here and remove the mup subdir.


StellaAthena · 2023-10-20T05:03:29Z

> > Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

> Not all of his changes are muP-related. I've separated out the muP 1-line change into our fork until microsoft/mup#65 is merged. We can discuss the GPT-NeoX specific changes here and remove the mup subdir.

Oh I see, I read the previous discussion backwards (thinking it was a 1-line NeoX edit and a substantial muP edit). I went ahead and removed the muP changes (moving them to the fork) and imported the new muP library. I haven't had a chance to check the correctness of this implementation yet, however.

marcobellagente93 · 2023-10-20T07:32:52Z

> > > Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

> > Not all of his changes are muP-related. I've separated out the muP 1-line change into our fork until microsoft/mup#65 is merged. We can discuss the GPT-NeoX specific changes here and remove the mup subdir.

> Oh I see, I read the previous discussion backwards (thinking it was a 1-line NeoX edit and a substantial muP edit). I went ahead and removed the muP changes (moving them to the fork) and imported the new muP library. I haven't had a chance to check the correctness of this implementation yet however.

That's fairly quick to verify. Currently neox adjusts the learning rate with the width here, but group['width'] doesn't exist, and since it is only accessed inside the if block, it never raised an error:

https://github.com/EleutherAI/gpt-neox/blob/b02d98932f95fe0500c28698b38acb175e92e980/megatron/learning_rates.py#L97C1-L101C37
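The pattern described in this comment can be sketched as follows. This mirrors the behavior as described rather than reproducing the linked lines; `set_lr` and `set_lr_strict` are hypothetical names, and the parameter groups are plain dicts standing in for optimizer groups.

```python
def set_lr(param_groups, new_lr, use_mup):
    """Lenient update: scale by width only when the key is present."""
    for group in param_groups:
        if use_mup and "width" in group:
            group["lr"] = new_lr / group["width"]
        else:
            # A missing "width" entry silently falls through to here.
            group["lr"] = new_lr


def set_lr_strict(param_groups, new_lr, use_mup):
    """Stricter variant: a missing "width" entry is an error under mup."""
    for group in param_groups:
        if use_mup:
            if "width" not in group:
                raise KeyError("use_mup is set but the group has no 'width'")
            group["lr"] = new_lr / group["width"]
        else:
            group["lr"] = new_lr


# Group without the multiplier: the width scaling is skipped without error.
without_width = [{"params": [], "lr": 0.0}]
set_lr(without_width, 1e-3, use_mup=True)

# Group that carries the multiplier: the learning rate is divided by it.
with_width = [{"params": [], "lr": 0.0, "width": 4.0}]
set_lr(with_width, 1e-3, use_mup=True)
```

The strict variant is one possible design choice: it trades convenience for an early failure when the optimizer does not supply the multiplier.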

StellaAthena · 2023-10-21T03:30:49Z

> That's fairly quick to verify. Currently neox adjusts the learning rate with the width here, but group['width'] doesn't exist, and since it is only accessed inside the if block, it never raised an error:

> https://github.com/EleutherAI/gpt-neox/blob/b02d98932f95fe0500c28698b38acb175e92e980/megatron/learning_rates.py#L97C1-L101C37

It looks like this is unchanged from your branch? I thought your branch was working. Or am I missing something?

marcobellagente93 · 2023-10-21T23:17:57Z

> > That's fairly quick to verify. Currently neox adjusts the learning rate with the width here, but group['width'] doesn't exist, and since it is only accessed inside the if block, it never raised an error:
> > https://github.com/EleutherAI/gpt-neox/blob/b02d98932f95fe0500c28698b38acb175e92e980/megatron/learning_rates.py#L97C1-L101C37

> It looks like this is unchanged from your branch? I thought your branch was working. Or am I missing something?

It's fixed if we use the EleutherAI fork of mup, where I added the width to the optimizer dict: https://github.com/EleutherAI/mup/blob/14e436bc013418725976e7cfb1b4e74e8901ab80/mup/optim.py#L75C9-L80C52
Apologies for the confusion.

marcobellagente93 added 5 commits October 19, 2023 11:43

added mup version with fixes to neox root

f85352c

removed buggy attention scale factor

a24ba2d

added comment and removed buggy embedding multiplier

e6ae3dc

set use_mup attribute at re-init and fix output multiplier

faff65e

do not reinit the output layer

4ebfb86

marcobellagente93 requested a review from a team as a code owner October 19, 2023 10:10

marcobellagente93 requested review from Quentin-Anthony and StellaAthena October 19, 2023 10:10

accesslint Bot reviewed Oct 19, 2023

update readme-mup

fe70494

accesslint Bot reviewed Oct 19, 2023

Delete mup/CODE_OF_CONDUCT.md

b5308f3

accesslint Bot reviewed Oct 19, 2023

StellaAthena closed this Oct 19, 2023

Quentin-Anthony reopened this Oct 19, 2023

StellaAthena added 2 commits October 19, 2023 16:24

Delete mup/mup directory

57ddc14

Delete mup directory

a34a6c6

accesslint Bot reviewed Oct 19, 2023

Update requirements.txt

05ff7df

Quentin-Anthony closed this Jun 6, 2025

