| Benchmark | Optimizer | Hyperparameter | Constraint |
|---|---|---|---|
| gpt_oss_20b | adamw | opt_end_learning_rate | opt_base_learning_rate * 0.1 |
:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": "opt_adamw_epsilon", "value": 1e-05, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": "opt_adamw_weight_decay", "value": 0.1, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": "opt_gradient_clip_norm", "value": 1.0, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
:::MLLOG {"namespace": "", "time_ms": 1773021880374, "event_type": "POINT_IN_TIME", "key": opt_end_learning_rate", "value": 4e-05, "metadata": {"file": "/opt/venv/lib/python3.10/site-packages/primus_mllog/mlperf_pre_training.py", "lineno": 65}}
While collecting submission logs and comparing them against the RCPs, we found that the GBS 64 RCPs use a base LR of 1e-05 together with a MIN_LR of 4e-05. Given the constraint opt_end_learning_rate = opt_base_learning_rate * 0.1, a base LR of 1e-05 should yield an end LR of 1e-06, not 4e-05 (which is above the base LR itself). This violates the hyperparameter rules for this benchmark:
https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#91-hyperparameters
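For illustration, here is a minimal sketch of the consistency check that surfaces this mismatch: it parses the `:::MLLOG` lines from a log file and compares `opt_end_learning_rate` against `opt_base_learning_rate * 0.1`. The file name `run_0.log`, the helper names, and the tolerance are assumptions for the sketch; this is not the official RCP checker.

```python
# Minimal sketch (not the official RCP checker): scan an MLLOG file and
# verify the gpt_oss_20b constraint
#   opt_end_learning_rate == opt_base_learning_rate * 0.1
import json
import math

MLLOG_PREFIX = ":::MLLOG "

def read_mllog_values(path):
    """Collect the last logged value for each MLLOG key in the file."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith(MLLOG_PREFIX):
                continue
            record = json.loads(line[len(MLLOG_PREFIX):])
            values[record["key"]] = record.get("value")
    return values

def check_end_lr_constraint(values, ratio=0.1):
    """Check end LR = base LR * ratio; returns True if the rule holds."""
    base = values.get("opt_base_learning_rate")
    end = values.get("opt_end_learning_rate")
    if base is None or end is None:
        raise KeyError("opt_base_learning_rate / opt_end_learning_rate not logged")
    expected = base * ratio
    ok = math.isclose(end, expected, rel_tol=1e-6)
    print(f"base={base} end={end} expected_end={expected} -> "
          f"{'OK' if ok else 'VIOLATION'}")
    return ok

if __name__ == "__main__":
    # With base=1e-05 and end=4e-05 as in the logs above, this prints
    # VIOLATION, since the expected end LR would be 1e-06.
    check_end_lr_constraint(read_mllog_values("run_0.log"))
```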
Logs from reference RCPs:
https://github.com/mlcommons/training/blob/master/small_llm_moe_pretraining/primus/rcp_logs/gbs64/run_0.log#L30
Log snippet: