Hi authors, thank you very much for the very insightful work and for releasing your code! While replicating the results of RLVMR on scienceworld, I found that the success rate metric as implemented in the code appears to differ significantly from the definition stated in the paper. The paper defines success rate as "The percentage of tasks successfully completed by the agent on each evaluation split," but the code uses a trivially satisfiable condition that inflates this metric.
Code in Question
In envs.py, the won flag is defined as:
```python
isCompleted = done
info["won"] = isCompleted and info["score"] > 0
```
And reward is computed as:
```python
def compute_reward(info, multi_modal=False):
    reward = 10.0 * float(info['won'])
    return reward
```
The Problem
In ScienceWorld, done=True is triggered by three conditions:
- The agent completes the task (score = 100)
- The agent hits the step limit (score can be anything from 0 to 99)
- The agent fails catastrophically (score < 0, forced termination by ScienceWorld)
The > 0 threshold correctly filters out case 3 (catastrophic failures with negative scores). However, it still counts case 2 (step-limit timeouts) as successes whenever the agent has earned any positive partial score, even a trivial one (e.g., score = 2 out of 100), which is far from actual task completion.
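To make this concrete, here is a minimal, self-contained sketch (not code from the repo; classify_outcome and won_as_implemented are hypothetical helpers) that maps a terminal (done, score) pair onto the three cases above and compares it against the current > 0 check:

```python
# Minimal illustration (not from the repo): map a terminal (done, score) pair to one
# of the three termination cases and compare it with the repo's current `> 0` check.
def classify_outcome(done: bool, score: float) -> str:
    if not done:
        return "running"
    if score >= 100:
        return "completed"              # case 1: task fully solved
    if score < 0:
        return "catastrophic_failure"   # case 3: forced termination by ScienceWorld
    return "timeout"                    # case 2: step limit, partial score in [0, 99]

def won_as_implemented(done: bool, score: float) -> bool:
    # current repo logic: any positive partial score at termination counts as a win
    return done and score > 0

for done, score in [(True, 100), (True, 2), (True, 0), (True, -50)]:
    print(classify_outcome(done, score), won_as_implemented(done, score))
# -> completed/True, timeout/True (the inflated case), timeout/False, catastrophic_failure/False
```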
Empirical Evidence
When training with this code, I observed:
- Success rate (won) converged to ~100%, which is expected, since virtually every finished episode ends with some positive partial score and therefore counts as a "win"
- Meanwhile, the actual score (logged when won=True) decreased to ~2.1 (out of 100), indicating the agent was not meaningfully completing tasks. This confirms that the success rate metric is not measuring task completion.
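For reference, the gap between the two definitions can be computed directly from logged final scores; the numbers below are placeholders for illustration, not my actual logs:

```python
# Sketch with placeholder scores: the repo's `score > 0` flag vs. a strict success
# rate that follows the paper's definition (full task completion, score >= 100).
episode_scores = [2.0, 5.0, 0.0, 100.0, 3.0]  # illustrative final scores only

inflated_rate = sum(s > 0 for s in episode_scores) / len(episode_scores)    # current `won` logic
strict_rate = sum(s >= 100 for s in episode_scores) / len(episode_scores)   # paper's definition
print(f"inflated: {inflated_rate:.0%}   strict: {strict_rate:.0%}")  # inflated: 80%   strict: 20%
```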
Expected Behavior
A meaningful success rate definition should require actual task completion, e.g.:
info["won"] = isCompleted and info["score"] >= 100 # fully completed
Questions
- Is this an intentional design choice, or a bug (e.g., should > 0 be >= 100)?
- Were the results reported in the paper evaluated using this same code, or was a different success criterion used during evaluation?