Reproduction Environment
- GPU: 4× H20
- Configuration: Codebase default settings
- Dataset: First 15k SAT samples (as per default config)
My Results
Figure 1. My training curve

Figure 2. My test curve

- My reproduction (base+GRPO): 58.4 (step 1000)
- Qwen2-VL instruct model: 61.6
Results from report
Figure 3. Test curve from report

Key Questions
- Why is there a performance gap (58.4 vs ~59.5) between my reproduction and the reported results?
- Why is the Qwen2-VL instruct model's performance inconsistent (61.6 locally vs ~56 in the report)?
- Why do the reproduced SFT and GRPO curves show an abnormal trend (Figure 2)?
- Why does the default config use only the first 15k SAT samples instead of the full dataset?
- In your experience, what are the possible reasons for these abnormal reproduction results?