
Fix: skip trainer config validation in bench mode #535

Merged
pan-x-c merged 1 commit into agentscope-ai:main from shiweijiezero:fix/skip-trainer-check-bench-mode on May 9, 2026

Conversation

@shiweijiezero
Contributor

Bug

In bench mode, cluster.trainer_gpu_num is left at its default 0 because the cluster validator deliberately skips trainer GPU allocation for bench / explore / serve (these modes don't train) — see config_validator.py:244.

But the trainer config validator (config_validator.py:1168) still keeps bench in its whitelist. So for any local-model bench run (i.e. external_model.enable=false), the call chain reaches:

```python
# trinity/trainer/verl/verl_config.py:430
if train_batch_size % (world_size // sp_size) != 0:
#                         ↑ world_size comes from trainer_gpu_num = 0 → ZeroDivisionError
```
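Spelled out with concrete numbers (a minimal sketch; `sp_size` and `train_batch_size` are illustrative values, and `world_size` is taken to derive from `trainer_gpu_num` as the annotation above indicates):

```python
trainer_gpu_num = 0           # bench mode: cluster validator skips trainer GPU allocation
world_size = trainer_gpu_num  # per the annotation above, world_size derives from it
sp_size = 1                   # illustrative; any positive value leaves the divisor at 0
train_batch_size = 32         # illustrative

train_batch_size % (world_size // sp_size)
# 32 % (0 // 1) -> 32 % 0 -> ZeroDivisionError: integer division or modulo by zero
```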

Repro: any YAML config with `mode: bench`, a non-external model, and engine_num × tensor_parallel_size equal to the total GPU count (a config sketch follows).
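For concreteness, a hypothetical bench config matching that description. The key nesting here is assumed, not taken from the Trinity-RFT schema; only `mode`, `external_model.enable`, `engine_num`, and `tensor_parallel_size` come from the report above:

```yaml
mode: bench                  # cluster validator: no trainer GPUs allocated
cluster:
  node_num: 1
  gpu_per_node: 8            # total GPU = 8
explorer:
  rollout_model:
    engine_num: 4            # 4 engines × TP 2 = all 8 GPUs go to the explorer
    tensor_parallel_size: 2
external_model:
  enable: false              # local model, so the trainer validator still ran (pre-fix)
```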

Fix

Drop bench from the trainer-config-check whitelist. Bench mode runs explorer-only and never touches the trainer, so its trainer parallelism config doesn't need validation. Same fast-path semantics as the existing external_model.enable=true check immediately below at line 1170.
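A sketch of the guard this PR changes, assuming a whitelist-style check. `TRAINER_CHECK_MODES`, `validate`, and the attribute paths are illustrative names, not the actual identifiers in config_validator.py:

```python
# config_validator.py, around line 1168 (illustrative names; real code may differ)

# Before: ("train", "both", "colocate", "bench"); bench was still validated.
TRAINER_CHECK_MODES = ("train", "both", "colocate")  # after: bench dropped

def validate(config):
    if config.mode not in TRAINER_CHECK_MODES:
        # bench/explore/serve run explorer-only and never touch the trainer.
        return
    if config.external_model.enable:
        # Existing fast path at line 1170: external models need no trainer checks.
        return
    synchronize_config(config)  # pre-fix, bench reached this and hit verl_config.py:430
```

Skipping `synchronize_config()` here is what the Test section below verifies to be safe for bench runs.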

After the fix, all 6 modes behave correctly:

| mode | trainer GPU allocated | trainer config check |
| --- | --- | --- |
| train | yes | yes |
| both | yes | yes |
| colocate | yes (1 GPU placeholder) | yes |
| bench | no | no (was: yes, broken) |
| explore | no | no (already) |
| serve | no | no (already) |

Test

A local-model bench run (Qwen3.6-27B + frozen_lake_obscure eval, 1 node × 8 GPU, engine_num=4 × TP=2) reproduces the ZeroDivisionError on main and runs cleanly with this patch.

config.trainer.trainer_config is not accessed by any module outside config_validator.py (verified via repo-wide grep), so skipping synchronize_config() in bench mode has no downstream effect.

Bench mode runs explorer-only; cluster.trainer_gpu_num is left at 0
because the cluster validator (line 244) skips trainer GPU allocation
for bench/explore/serve. The trainer config validator however still
kept 'bench' in its whitelist, so any local-model bench run hit:

    trinity/trainer/verl/verl_config.py:430
    if train_batch_size % (world_size // sp_size) != 0:
        ZeroDivisionError: integer division or modulo by zero

Drop bench from the whitelist; same fast-path semantics as the existing
external_model.enable check immediately below.
pan-x-c merged commit 5b1d8a7 into agentscope-ai:main on May 9, 2026
1 check passed
chenyushuo added a commit to chenyushuo/Trinity-RFT that referenced this pull request May 13, 2026
commit 159743e
Author: chenyushuo <297086016@qq.com>
Date:   Mon May 11 18:12:00 2026 +0800

    1. fix unittest
    2. add `reset_running_requests=True` to `reset_prefix_cache`

commit 1db7973
Merge: e9f8316 8d69663
Author: chenyushuo <297086016@qq.com>
Date:   Mon May 11 11:06:34 2026 +0800

    Merge branch 'main' of github.com:modelscope/Trinity-RFT into dev/fix_qwen3_5

commit 8d69663
Author: Xuchen Pan <32844285+pan-x-c@users.noreply.github.com>
Date:   Mon May 11 10:42:37 2026 +0800

    Support SGLang Inference Engine (agentscope-ai#533)

commit 5b1d8a7
Author: weijie <34210233+shiweijiezero@users.noreply.github.com>
Date:   Sat May 9 20:34:47 2026 +0800

    Fix: skip trainer config validation in bench mode (agentscope-ai#535)

commit e9f8316
Author: chenyushuo <297086016@qq.com>
Date:   Sat May 9 11:39:40 2026 +0800

    1. Fix Qwen3.5 sequence-parallel training bugs.
    2. Fix Qwen3.5 multimodal training bugs.
    3. Fix incorrect Qwen3.5 checkpoint parameter naming when saving with Transformers 5.4.0-5.5.4.
    4. Add freeze_vision_tower support.
    5. Fix compatibility issues with vLLM 0.20.
    6. Fix a bug in Experience serialization.
    7. Fix the condition for skipping TrainerConfigValidator checks.
    8. Improve explorer robustness by safely handling missing rollout coordinator instead of hard-asserting.
    9. Propagate checkpoint_job_dir into workflow/taskset runtime arguments.
    10. Improve FSDP worker initialization and logging behavior for better stability and observability.
    11. Apply typo and minor message fixes.

    Co-authored-by: Copilot <copilot@github.com>