Hi authors, thank you very much for the very insightful work and for releasing your code! While replicating the results of RLVMR on scienceworld, I found that the success rate metric as implemented in the code appears to differ significantly from the definition stated in the paper. The paper defines success rate as "The percentage of tasks successfully completed by the agent on each evaluation split," but the code uses a trivially satisfiable condition that inflates this metric.
Code in Question
In envs.py, the won flag is defined as:
```python
isCompleted = done
info["won"] = isCompleted and info["score"] > 0
```
And reward is computed as:
```python
def compute_reward(info, multi_modal=False):
    reward = 10.0 * float(info['won'])
    return reward
```
The Problem
In ScienceWorld, done=True is triggered by three conditions:
- The agent completes the task (score = 100)
- The agent hits the step limit (score can be anything from 0 to 99)
- The agent fails catastrophically (score < 0, forced termination by ScienceWorld)
The > 0 threshold correctly filters out case 3 (catastrophic failures with negative scores). However, it still counts case 2 (step-limit timeouts) as successes whenever the agent has earned any positive partial score, even a trivial one (e.g., score = 2 out of 100), which is far from actual task completion.
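To make this concrete, here is a minimal, self-contained sketch (not code from the repo; classify_outcome and won_as_implemented are hypothetical helpers) that maps a terminal (done, score) pair onto the three cases above and compares it against the current > 0 check:

```python
# Minimal illustration (not from the repo): map a terminal (done, score) pair to one
# of the three termination cases and compare it with the repo's current `> 0` check.
def classify_outcome(done: bool, score: float) -> str:
    if not done:
        return "running"
    if score >= 100:
        return "completed"              # case 1: task fully solved
    if score < 0:
        return "catastrophic_failure"   # case 3: forced termination by ScienceWorld
    return "timeout"                    # case 2: step limit, partial score in [0, 99]

def won_as_implemented(done: bool, score: float) -> bool:
    # current repo logic: any positive partial score at termination counts as a win
    return done and score > 0

for done, score in [(True, 100), (True, 2), (True, 0), (True, -50)]:
    print(classify_outcome(done, score), won_as_implemented(done, score))
# -> completed/True, timeout/True (the inflated case), timeout/False, catastrophic_failure/False
```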
Empirical Evidence
When training with this code, I observed:
- Success rate (won) converged to ~100%, which is expected, since virtually every finished episode ends with some positive partial score and therefore counts as a "win"
- Meanwhile, the actual score (logged when won=True) decreased to ~2.1 (out of 100), indicating the agent was not meaningfully completing tasks. This confirms that the success rate metric is not measuring task completion.
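For reference, the gap between the two definitions can be computed directly from logged final scores; the numbers below are placeholders for illustration, not my actual logs:

```python
# Sketch with placeholder scores: the repo's `score > 0` flag vs. a strict success
# rate that follows the paper's definition (full task completion, score >= 100).
episode_scores = [2.0, 5.0, 0.0, 100.0, 3.0]  # illustrative final scores only

inflated_rate = sum(s > 0 for s in episode_scores) / len(episode_scores)    # current `won` logic
strict_rate = sum(s >= 100 for s in episode_scores) / len(episode_scores)   # paper's definition
print(f"inflated: {inflated_rate:.0%}   strict: {strict_rate:.0%}")  # inflated: 80%   strict: 20%
```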
Expected Behavior
A meaningful success rate definition should require actual task completion, e.g.:
info["won"] = isCompleted and info["score"] >= 100 # fully completed
Questions
- Is this an intentional design choice, or a bug (e.g., should > 0 be >= 100)?
- Were the results reported in the paper evaluated using this same code, or was a different success criterion used during evaluation?