Conversation
Pull request overview
This PR aims to fix CartPole RL training by correcting (1) how success is determined for the CartPole task and (2) how reward-manager terms are combined with the task’s base reward.
Changes:
- Update CartPole success logic to only report success when balance conditions are met at the end of an episode.
- Fix reward aggregation so `RewardManager` terms are added on top of the task's `get_reward()` output (instead of replacing it).
- Adjust CartPole PPO training config naming and evaluation frequency.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `embodichain/lab/gym/envs/tasks/rl/basic/cart_pole.py` | Changes CartPole success condition to depend on final-step + balance criteria. |
| `embodichain/lab/gym/envs/embodied_env.py` | Fixes reward extension to preserve base task reward and add reward-manager contributions. |
| `configs/agents/rl/basic/cart_pole/train_config.json` | Updates experiment name and evaluation frequency for CartPole PPO training. |
Comments suppressed due to low confidence (1)
embodichain/lab/gym/envs/embodied_env.py:356
- This reward change is a behavioral fix but isn’t covered by a regression test. Since `BaseEnv.step()` routes all rewards through `_extend_reward()`, it would be good to add a unit test that configures a simple `reward_manager` term and asserts the final reward equals `get_reward()` + the manager reward (and that `info['rewards']` contains the term breakdown).
```python
extra_rewards, reward_info = self.reward_manager.compute(
    obs=obs, action=action, info=info
)
info["rewards"] = reward_info
# Add manager terms to base reward from get_reward() so task reward is kept
rewards = rewards + extra_rewards
```
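A regression test along the lines the comment asks for could look like the sketch below. The `DummyRewardManager` and `extend_reward` names are hypothetical stand-ins for the project's actual classes, and plain Python lists are used in place of torch tensors so the sketch stays dependency-free:

```python
class DummyRewardManager:
    """Hypothetical stand-in for the project's RewardManager:
    returns one constant bonus term per environment."""
    def compute(self, obs=None, action=None, info=None):
        extra = [0.5, 0.5]                    # summed manager terms, 2 envs
        return extra, {"bonus": list(extra)}  # (totals, per-term breakdown)

def extend_reward(base_rewards, reward_manager, obs, action, info):
    """Mirrors the fixed behavior: manager terms are ADDED to the
    base task reward instead of replacing it."""
    extra, breakdown = reward_manager.compute(obs=obs, action=action, info=info)
    info["rewards"] = breakdown
    return [b + e for b, e in zip(base_rewards, extra)]

base = [1.0, 2.0]                  # pretend get_reward() output
info = {}
total = extend_reward(base, DummyRewardManager(), None, None, info)
assert total == [1.5, 2.5]         # base reward preserved, bonus added
assert "bonus" in info["rewards"]  # breakdown exposed via info['rewards']
```

Such a test would have caught the original bug, where the manager output overwrote `rewards` and the first assertion would see `[0.5, 0.5]` instead.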
```diff
  upward_distance = torch.abs(qpos)
- is_success = torch.logical_and(upward_distance < 0.02, torch.abs(qvel) < 0.05)
+ balance = torch.logical_and(upward_distance < 0.02, torch.abs(qvel) < 0.05)
  at_final_step = self._elapsed_steps >= self.episode_length - 1
```
at_final_step is checked against self.episode_length - 1, but check_truncated() times out at self.episode_length and BaseEnv.step() terminates immediately on info['success']. This makes the episode end one step early (and can also make success impossible when using gymnasium.make + TimeLimitWrapper if max_episode_steps != episode_length, e.g. CartPole is registered with max_episode_steps=50 while the cart_pole gym config sets episode_length=500). Consider deriving the final-step condition from the actual timeout condition (same threshold as truncation) and/or aligning episode_length with the env time limit so success can be reached in all supported entrypoints.
```diff
- at_final_step = self._elapsed_steps >= self.episode_length - 1
+ at_final_step = self._elapsed_steps >= self.episode_length
```
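The off-by-one can be illustrated with a toy rollout loop (hypothetical helper names; the real truncation logic lives in `check_truncated()`):

```python
def run_episode(episode_length, final_step_threshold):
    """Count how many steps actually execute before the episode ends,
    given that success (at_final_step) terminates the episode immediately."""
    elapsed = 0
    while True:
        elapsed += 1                              # one env step
        at_final_step = elapsed >= final_step_threshold
        truncated = elapsed >= episode_length     # check_truncated() behavior
        if at_final_step or truncated:
            return elapsed

# With `episode_length - 1`, the episode terminates one step early:
assert run_episode(500, 500 - 1) == 499
# Using the truncation threshold itself, the episode runs the full length:
assert run_episode(500, 500) == 500
```

This is why deriving `at_final_step` from the same threshold as truncation (as the suggestion above does) keeps the episode at its full length; aligning `episode_length` with the registered `max_episode_steps` is still needed for the `gymnasium.make` entrypoint.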
Description
This PR fixes the reward computation and the success condition of the CartPole RL task.
Type of change
Screenshots
Checklist
I have run the `black .` command to format the code base.