
Fix: skip trainer config validation in bench mode #535

Merged
pan-x-c merged 1 commit into agentscope-ai:main from shiweijiezero:fix/skip-trainer-check-bench-mode on May 9, 2026

Conversation

@shiweijiezero
Contributor

Bug

In bench mode, cluster.trainer_gpu_num is left at its default 0 because the cluster validator deliberately skips trainer GPU allocation for bench / explore / serve (these modes don't train) — see config_validator.py:244.

But the trainer config validator (config_validator.py:1168) still keeps bench in its whitelist. So for any local-model bench run (i.e. external_model.enable=false), the call chain reaches:

```python
# trinity/trainer/verl/verl_config.py:430
if train_batch_size % (world_size // sp_size) != 0:
#                         ↑ world_size comes from trainer_gpu_num = 0 → ZeroDivisionError
```
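Spelled out with concrete numbers (a minimal sketch; `sp_size` and `train_batch_size` are illustrative values, and `world_size` is taken to derive from `trainer_gpu_num` as the annotation above indicates):

```python
trainer_gpu_num = 0           # bench mode: cluster validator skips trainer GPU allocation
world_size = trainer_gpu_num  # per the annotation above, world_size derives from it
sp_size = 1                   # illustrative; any positive value leaves the divisor at 0
train_batch_size = 32         # illustrative

train_batch_size % (world_size // sp_size)
# 32 % (0 // 1) -> 32 % 0 -> ZeroDivisionError: integer division or modulo by zero
```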

Repro: any YAML config with `mode: bench`, a non-external model, and engine_num × tensor_parallel_size equal to the total GPU count (a config sketch follows).
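For concreteness, a hypothetical bench config matching that description. The key nesting here is assumed, not taken from the Trinity-RFT schema; only `mode`, `external_model.enable`, `engine_num`, and `tensor_parallel_size` come from the report above:

```yaml
mode: bench                  # cluster validator: no trainer GPUs allocated
cluster:
  node_num: 1
  gpu_per_node: 8            # total GPU = 8
explorer:
  rollout_model:
    engine_num: 4            # 4 engines × TP 2 = all 8 GPUs go to the explorer
    tensor_parallel_size: 2
external_model:
  enable: false              # local model, so the trainer validator still ran (pre-fix)
```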

Fix

Drop bench from the trainer-config-check whitelist. Bench mode runs explorer-only and never touches the trainer, so its trainer parallelism config doesn't need validation. Same fast-path semantics as the existing external_model.enable=true check immediately below at line 1170.
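A sketch of the guard this PR changes, assuming a whitelist-style check. `TRAINER_CHECK_MODES`, `validate`, and the attribute paths are illustrative names, not the actual identifiers in config_validator.py:

```python
# config_validator.py, around line 1168 (illustrative names; real code may differ)

# Before: ("train", "both", "colocate", "bench"); bench was still validated.
TRAINER_CHECK_MODES = ("train", "both", "colocate")  # after: bench dropped

def validate(config):
    if config.mode not in TRAINER_CHECK_MODES:
        # bench/explore/serve run explorer-only and never touch the trainer.
        return
    if config.external_model.enable:
        # Existing fast path at line 1170: external models need no trainer checks.
        return
    synchronize_config(config)  # pre-fix, bench reached this and hit verl_config.py:430
```

Skipping `synchronize_config()` here is what the Test section below verifies to be safe for bench runs.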

After the fix, all 6 modes behave correctly:

| mode | trainer GPU allocated | trainer config check |
| --- | --- | --- |
| train | yes | yes |
| both | yes | yes |
| colocate | yes (1 GPU placeholder) | yes |
| bench | no | no (was: yes, broken) |
| explore | no | no (already) |
| serve | no | no (already) |

Test

A local-model bench run (Qwen3.6-27B + frozen_lake_obscure eval, 1 node × 8 GPU, engine_num=4 × TP=2) reproduces the ZeroDivisionError on main and runs cleanly with this patch.

config.trainer.trainer_config is not accessed by any module outside config_validator.py (verified via repo-wide grep), so skipping synchronize_config() in bench mode has no downstream effect.

Bench mode runs explorer-only; cluster.trainer_gpu_num is left at 0
because the cluster validator (line 244) skips trainer GPU allocation
for bench/explore/serve. The trainer config validator however still
kept 'bench' in its whitelist, so any local-model bench run hit:

    trinity/trainer/verl/verl_config.py:430
    if train_batch_size % (world_size // sp_size) != 0:
        ZeroDivisionError: integer division or modulo by zero

Drop bench from the whitelist; same fast-path semantics as the existing
external_model.enable check immediately below.
pan-x-c merged commit 5b1d8a7 into agentscope-ai:main on May 9, 2026
1 check passed
chenyushuo added a commit to chenyushuo/Trinity-RFT that referenced this pull request May 13, 2026
commit 159743e
Author: chenyushuo <297086016@qq.com>
Date:   Mon May 11 18:12:00 2026 +0800

    1. fix unittest
    2. add `reset_running_requests=True` to `reset_prefix_cache`

commit 1db7973
Merge: e9f8316 8d69663
Author: chenyushuo <297086016@qq.com>
Date:   Mon May 11 11:06:34 2026 +0800

    Merge branch 'main' of github.com:modelscope/Trinity-RFT into dev/fix_qwen3_5

commit 8d69663
Author: Xuchen Pan <32844285+pan-x-c@users.noreply.github.com>
Date:   Mon May 11 10:42:37 2026 +0800

    Support SGLang Inference Engine (agentscope-ai#533)

commit 5b1d8a7
Author: weijie <34210233+shiweijiezero@users.noreply.github.com>
Date:   Sat May 9 20:34:47 2026 +0800

    Fix: skip trainer config validation in bench mode (agentscope-ai#535)

commit e9f8316
Author: chenyushuo <297086016@qq.com>
Date:   Sat May 9 11:39:40 2026 +0800

    1. Fix Qwen3.5 sequence-parallel training bugs.
    2. Fix Qwen3.5 multimodal training bugs.
    3. Fix incorrect Qwen3.5 checkpoint parameter naming when saving with Transformers 5.4.0-5.5.4.
    4. Add freeze_vision_tower support.
    5. Fix compatibility issues with vLLM 0.20.
    6. Fix a bug in Experience serialization.
    7. Fix the condition for skipping TrainerConfigValidator checks.
    8. Improve explorer robustness by safely handling missing rollout coordinator instead of hard-asserting.
    9. Propagate checkpoint_job_dir into workflow/taskset runtime arguments.
    10. Improve FSDP worker initialization and logging behavior for better stability and observability.
    11. Apply typo and minor message fixes.

    Co-authored-by: Copilot <copilot@github.com>