This repository was archived by the owner on Feb 24, 2026. It is now read-only.
In MuJoCo tasks, the learning rate linearly decays from 3e-4 to 0.
For Atari games, the learning rate linearly decays from 2.5e-4 to 0.
Generalized Advantage Estimation
Termination caused by environment time limits must be counted as non-terminal in the target value calculation (i.e., the value of the next state is still bootstrapped).
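A minimal sketch of GAE with this time-limit handling, assuming `dones[t]` records only true environment terminations (a time-limit truncation is stored as 0, so the next value is still bootstrapped); all names are illustrative:

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    dones[t] == 1 means the episode truly terminated after step t;
    a time-limit truncation should be recorded as 0 so the next
    state's value still contributes to the target.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        v_next = next_value if t == T - 1 else values[t + 1]
        # TD error; the bootstrap term is zeroed only on true terminations
        delta = rewards[t] + gamma * v_next * nonterminal - values[t]
        lastgaelam = delta + gamma * lam * nonterminal * lastgaelam
        advantages[t] = lastgaelam
    returns = advantages + values
    return advantages, returns
```

Note how a rollout ending in a truncation (done=0) produces a different target than one ending in a true termination (done=1), because the former still bootstraps from `next_value`.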
Mini-batch Updates
Normalization of Advantages
After calculating the advantages based on GAE, PPO normalizes the advantages by subtracting their mean and dividing them by their standard deviation. In particular, this normalization happens at the minibatch level instead of the whole batch level!
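The minibatch-level normalization described above amounts to a one-liner applied inside the update loop, not once over the whole rollout; a sketch:

```python
import numpy as np

def normalize_advantages(mb_advantages, eps=1e-8):
    """Normalize a *minibatch* of GAE advantages to zero mean and unit std.

    Called per minibatch inside the optimization epochs, not once on the
    full batch; eps guards against division by zero.
    """
    return (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + eps)
```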
Value Function Loss Clipping
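A sketch of the clipped value loss, assuming the PPO-style form where the new value prediction is kept within a clipping range of the value recorded at rollout time and the element-wise maximum of the two squared errors is taken (names are illustrative):

```python
import numpy as np

def clipped_value_loss(new_values, old_values, returns, clip_coef=0.2):
    """Value loss with PPO-style clipping.

    new_values: current value predictions; old_values: predictions stored
    during the rollout; returns: GAE-based value targets.
    """
    v_loss_unclipped = (new_values - returns) ** 2
    v_clipped = old_values + np.clip(new_values - old_values, -clip_coef, clip_coef)
    v_loss_clipped = (v_clipped - returns) ** 2
    # taking the max keeps whichever error is larger, limiting the update
    return 0.5 * np.maximum(v_loss_unclipped, v_loss_clipped).mean()
```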
Overall Loss and Entropy Bonus
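The three terms combine into one scalar objective, with entropy entering as a bonus (subtracted, since the loss is minimized); the coefficient values below are illustrative defaults, not prescribed by this document:

```python
def overall_loss(pg_loss, v_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Combine PPO's policy loss, value loss, and entropy bonus.

    Minimizing this maximizes entropy (exploration bonus) while jointly
    optimizing the policy and value objectives.
    """
    return pg_loss - ent_coef * entropy + vf_coef * v_loss
```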
Global Gradient Clipping
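Global gradient clipping rescales all gradients jointly so their combined L2 norm stays under a threshold (the same behavior as `torch.nn.utils.clip_grad_norm_`); a numpy sketch with illustrative names:

```python
import numpy as np

def clip_global_grad_norm(grads, max_norm=0.5):
    """Rescale a list of gradient arrays so their *global* L2 norm
    is at most max_norm; returns the rescaled grads and the pre-clip norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        # one shared scale factor preserves the gradients' direction
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total_norm
```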
Debug variables
policy_loss
value_loss
entropy_loss
clipfrac
approxkl
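The last two diagnostics can be computed from the per-sample log-ratio log(pi_new/pi_old); the sketch below uses one common KL estimator, `(-logratio).mean()` (other estimators exist), and defines clipfrac as the fraction of samples whose ratio fell outside the clipping range:

```python
import numpy as np

def debug_stats(logratio, clip_coef=0.2):
    """Compute approxkl and clipfrac from log(pi_new / pi_old) samples.

    approx_kl: a simple (biased) estimator of KL(pi_old || pi_new);
    clipfrac: fraction of samples where the ratio was clipped.
    """
    ratio = np.exp(logratio)
    approx_kl = (-logratio).mean()
    clipfrac = float((np.abs(ratio - 1.0) > clip_coef).mean())
    return approx_kl, clipfrac
```

A steadily growing approxkl or a large clipfrac usually signals that the policy is moving too fast per update.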
Shared and separate MLP networks for policy and value functions
Details for continuous action domains (e.g. Mujoco)
Continuous actions via normal distributions
State-independent log standard deviation
Independent action components
Separate MLP networks for policy and value functions
Handling of action clipping to valid range and storage
The original unclipped action is stored as part of the episodic data
Applying a squashing function (tanh) to the Gaussian samples to satisfy the action-range constraints works better
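The continuous-action details above can be sketched together: a diagonal Gaussian with a state-independent log std, independent action components (so per-component log-probabilities sum), and clipping to the valid range only for the environment while the unclipped sample is stored. Class and variable names are illustrative:

```python
import numpy as np

class DiagGaussianPolicy:
    """Minimal diagonal-Gaussian policy head (sketch)."""

    def __init__(self, action_dim, rng=None):
        # log std is a free parameter, NOT a function of the state
        self.log_std = np.zeros(action_dim)
        self.rng = rng or np.random.default_rng(0)

    def sample(self, mean):
        std = np.exp(self.log_std)
        action = mean + std * self.rng.standard_normal(mean.shape)
        # independent components: per-dimension log-probs are summed
        logprob = (-0.5 * ((action - mean) / std) ** 2
                   - np.log(std) - 0.5 * np.log(2 * np.pi)).sum()
        return action, logprob

policy = DiagGaussianPolicy(action_dim=3)
action, logp = policy.sample(mean=np.zeros(3))
clipped = np.clip(action, -1.0, 1.0)  # only this goes to the environment
# the original *unclipped* `action` is stored in the rollout buffer
```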
Normalization of Observation
VecNormalize. The raw observation is normalized by subtracting its running mean and dividing by the square root of its running variance (i.e., its running standard deviation).
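A sketch of the running statistics behind VecNormalize-style observation normalization, using the standard parallel mean/variance update; names are illustrative:

```python
import numpy as np

class RunningMeanStd:
    """Track a running mean and variance of observations."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4  # small prior count avoids division issues early on

    def update(self, batch):
        b_mean, b_var, b_count = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        tot = self.count + b_count
        new_mean = self.mean + delta * b_count / tot
        # combine the two second moments (parallel variance formula)
        m2 = self.var * self.count + b_var * b_count \
            + delta ** 2 * self.count * b_count / tot
        self.mean, self.var, self.count = new_mean, m2 / tot, tot

    def normalize(self, obs, eps=1e-8):
        # divide by the standard deviation: sqrt of the running variance
        return (obs - self.mean) / np.sqrt(self.var + eps)
```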
Reproducing the official PPO implementation
Please check the link above for details.
Core implementation details
Details for continuous action domains (e.g. Mujoco)
nsteps=2048, nminibatches=32, lam=0.95, gamma=0.99, noptepochs=10, log_interval=1, ent_coef=0.0, lr=lambda f: 3e-4 * f, cliprange=0.2, value_network='copy'
LSTM implementation details
Auxiliary implementation details