List index out of range when debug_train_only is enabled #84

@Dogacel

When I set debug_train_only: true, I get the following error:

[2026-04-27 13:22:40,186] factory.py:184 INFO Initializing 1 Sgl engines (2 GPU(s) each, nnodes=1, replicas=1)
rgpuid: []
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/TorchSpec/torchspec/train_entry.py", line 367, in <module>
    train_async_no_generation(args)
  File "/root/TorchSpec/torchspec/train_entry.py", line 323, in train_async_no_generation
    inference_engines, engine_init_refs = prepare_inference_engines(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/TorchSpec/torchspec/inference/factory.py", line 85, in prepare_inference_engines
    engines, init_refs = _prepare_sgl_engines(args, inference_pg, mooncake_config, engine_group)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/TorchSpec/torchspec/inference/factory.py", line 203, in _prepare_sgl_engines
    base_gpu_id = int(reordered_gpu_ids[bundle_offset])
                      ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: list index out of range

The pg is never initialized with reordered GPU ids (the log above prints rgpuid: []), so reordered_gpu_ids is empty and indexing it with bundle_offset fails when debug_train_only is enabled.
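
As a possible stopgap, the index could be validated before use so the failure names its cause instead of raising a bare IndexError. This is only a minimal sketch: the helper name `resolve_base_gpu_id` and the error message are hypothetical and not part of TorchSpec; only `reordered_gpu_ids` and `bundle_offset` come from the traceback. It does not fix the missing initialization itself.

```python
from typing import Sequence


def resolve_base_gpu_id(reordered_gpu_ids: Sequence[int], bundle_offset: int) -> int:
    """Hypothetical helper: return the base GPU id or fail with a clear message."""
    # Guard against an empty or too-short list before indexing into it.
    if not 0 <= bundle_offset < len(reordered_gpu_ids):
        raise RuntimeError(
            f"bundle_offset={bundle_offset} is out of range for "
            f"{len(reordered_gpu_ids)} reordered GPU id(s); "
            "the placement group may not have been initialized"
        )
    return int(reordered_gpu_ids[bundle_offset])
```

With the empty list from the log, `resolve_base_gpu_id([], 0)` would raise a RuntimeError that mentions the uninitialized placement group rather than a generic IndexError.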


Reproduction

Set debug_train_only: true for Qwen3 8B and run ./examples/qwen3-8b-single-node/run.sh
