Megatron-Bridge r0.3.0 enhancement#830

Open
juntaowww wants to merge 30 commits into NVIDIA:main from juntaowww:mbridge_r0.3.0

Conversation


juntaowww (Contributor) commented Mar 11, 2026

Summary

This PR introduces several enhancements to the CloudAI Megatron Bridge workload and Slurm infrastructure:

  • Node placement propagation to Megatron Bridge: The resolved node list from CloudAI's scenario config is now forwarded to Megatron Bridge's setup_experiment.py via --additional_slurm_params nodelist=.... This ensures that when users specify explicit partition:group:count node specs in the scenario, the exact placement is honored inside the inner Slurm job. For example, in a scenario file:

    [[Tests]]
    id = "test_id"
    num_nodes = "8"
    exclude_nodes = ["node01"]
    nodes = [
      "partition:group1:4",
      "partition:group2:4",
    ]

    CloudAI resolves these partition:group specs into concrete node names, excludes node01, selects count nodes from each pool, and passes the final list as --additional_slurm_params 'nodelist=node02,node03,...' so that Megatron-Bridge schedules on exactly those nodes.
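The resolution described above can be sketched in Python. This is a hypothetical sketch: the function name, the group-to-hostname mapping, and the selection order are illustrative, not CloudAI's actual API.

```python
from typing import Dict, List


def resolve_node_specs(
    specs: List[str],
    groups: Dict[str, List[str]],  # group name -> concrete node hostnames
    exclude_nodes: List[str],
) -> List[str]:
    """Expand 'partition:group:count' specs, drop excluded nodes, pick count."""
    resolved: List[str] = []
    for spec in specs:
        _partition, group, count = spec.split(":")
        pool = [n for n in groups[group] if n not in exclude_nodes]
        if len(pool) < int(count):
            raise ValueError(f"spec {spec!r} cannot be satisfied after exclusions")
        resolved.extend(pool[: int(count)])
    return resolved


groups = {
    "group1": ["node01", "node02", "node03", "node04", "node05"],
    "group2": ["node06", "node07", "node08", "node09"],
}
nodes = resolve_node_specs(
    ["partition:group1:4", "partition:group2:4"], groups, exclude_nodes=["node01"]
)
print(f"--additional_slurm_params 'nodelist={','.join(nodes)}'")
```

With the scenario above, node01 is dropped from group1's pool before the count of 4 is taken, so the forwarded nodelist starts at node02.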

  • Reserved-node allocation: Slurm nodes in the resv (reserved) state are now treated as allocatable, so jobs with --reservation configured in extra_srun_args can resolve their node specifications instead of failing with "0 nodes available."

  • GPU resource requesting: gpus_per_node and gres are now forwarded as --additional_slurm_params gpus-per-node=N;gres=gpu:N alongside the nodelist when the system supports the GPU directive. For example, setting gpus_per_node = 4 with a nodelist produces:

    --additional_slurm_params 'gpus-per-node=4;gres=gpu:4;nodelist=node02,node03'

    This ensures GPU resources are requested explicitly, since some Slurm systems do not allocate them automatically.
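Assembling the parameter string reduces to a small join. A minimal sketch, assuming a hypothetical helper (the real logic lives in the Megatron Bridge command-gen strategy):

```python
from typing import List, Optional


def build_additional_slurm_params(
    gpus_per_node: Optional[int], nodelist: List[str]
) -> str:
    """Join gpus-per-node/gres/nodelist into a ';'-separated Slurm param string."""
    params: List[str] = []
    if gpus_per_node is not None:
        params.append(f"gpus-per-node={gpus_per_node}")
        params.append(f"gres=gpu:{gpus_per_node}")
    if nodelist:
        params.append("nodelist=" + ",".join(nodelist))
    return ";".join(params)


print(build_additional_slurm_params(4, ["node02", "node03"]))
# gpus-per-node=4;gres=gpu:4;nodelist=node02,node03
```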

  • Git submodule initialization: If init_submodules = True is set on a GitRepo, both SlurmInstaller and KubernetesInstaller now run git submodule update --init --recursive after cloning. This is required for repos like Megatron-Bridge that bundle Megatron-LM as a submodule. On failure, the cloned directory is cleaned up.

    Example:

    [[git_repos]]
    url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
    commit = "r0.3.0"
    init_submodules = true
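The init-with-cleanup step could look roughly like the sketch below. This is an approximation; the actual method is _init_submodules in the installers and returns an InstallStatusResult rather than a bool.

```python
import shutil
import subprocess
from pathlib import Path


def init_submodules(repo_path: Path) -> bool:
    """Run 'git submodule update --init --recursive'; on failure, remove the clone."""
    result = subprocess.run(
        ["git", "submodule", "update", "--init", "--recursive"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # A half-initialized clone is worse than none: clean it up entirely.
        shutil.rmtree(repo_path, ignore_errors=True)
        return False
    return True
```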
  • Optional container mount for Megatron Bridge: When mount_as is set on the git repo entry (e.g. mount_as = "/opt/Megatron-Bridge"), the locally cloned repo is bind-mounted into the container via -cm /path/to/clone:/opt/Megatron-Bridge, overriding the image's built-in copy. Omitting mount_as uses the container's built-in copy as-is. Example test config:

    [[git_repos]]
    url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
    commit = "r0.3.0"
    init_submodules = true
    mount_as = "/opt/Megatron-Bridge"   # remove this line to use the container's copy
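The mount decision is a single conditional. A sketch under the PR's own conventions (the -cm flag name comes from the description above; the function name is illustrative):

```python
from pathlib import Path
from typing import Optional


def container_mount_flag(clone_path: Path, mount_as: Optional[str]) -> str:
    """Return '-cm src:dst' when mount_as is set, else '' (use the image's copy)."""
    if mount_as is None:
        return ""
    return f"-cm {clone_path}:{mount_as}"


print(container_mount_flag(Path("/git/Megatron-Bridge"), "/opt/Megatron-Bridge"))
# -cm /git/Megatron-Bridge:/opt/Megatron-Bridge
```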
  • Qwen recipe update: Updated the GB200 Qwen 30B recipe to use tag r0.3.0 and container nvcr.io#nvidia/nemo:26.02, and added PYTHONPATH for the bundled Megatron-LM submodule.

Test Plan

  • Corresponding unit tests + manual runs

Additional Notes

  • Submodule init is essential for the Qwen recipe on Megatron-Bridge r0.3.0, which sets PYTHONPATH = "/opt/Megatron-Bridge/3rdparty/Megatron-LM:${PYTHONPATH}" to pick up the bundled Megatron-LM.


coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds per-test exclude_nodes support across SLURM node resolution and command generation; introduces optional git submodule initialization in installers with cleanup on failure; adjusts Megatron-Bridge repo mounting and launcher Slurm params; updates configs and adds tests for these behaviors.

Changes

Cohort / File(s) | Summary

  • Configuration updates
    Files: conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml, conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
    Summary: Bumped git commit and container image tags; added init_submodules = true to git repo stanzas and added extra_env_vars.PYTHONPATH entries; small header year update.

  • Test run model & parser
    Files: src/cloudai/_core/test_scenario.py, src/cloudai/models/scenario.py, src/cloudai/test_scenario_parser.py
    Summary: Added an exclude_nodes field to TestRun/TestRunModel and propagate it from parsed models into created TestRun instances.

  • SLURM core system
    Files: src/cloudai/systems/slurm/slurm_system.py
    Summary: Added an exclude_nodes parameter across node-resolution APIs; filter excluded hostnames during grouping, parsing, and allocation; include RESERVED in available-state handling and raise an error if exclusions yield empty resolved node lists.

  • SLURM command generation
    Files: src/cloudai/systems/slurm/slurm_command_gen_strategy.py, src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
    Summary: Emit #SBATCH --exclude when appropriate; extend the node-spec cache to include excluded nodes; parse extra srun args into Slurm params; add --additional_slurm_params with gpus-per-node, nodelist/exclude, and parsed srun args; conditionally bind-mount Megatron-Bridge into the container when mount_as is set.

  • Installers: git + submodules
    Files: src/cloudai/_core/installables.py, src/cloudai/systems/slurm/slurm_installer.py, src/cloudai/systems/kubernetes/kubernetes_installer.py
    Summary: Added init_submodules: bool to GitRepo; introduced _init_submodules(path: Path) -> InstallStatusResult and invoke it after checkout when requested; on failure, log, remove the cloned repo, and return an error.

  • Megatron-Bridge selection & ref data
    Files: src/cloudai/workloads/megatron_bridge/megatron_bridge.py, tests/ref_data/megatron-bridge.sbatch
    Summary: Stop forcing a mount_as override when selecting the repo (keep the repo's mount_as); updated the sbatch reference to include the conditional Megatron-Bridge bind-mount and additional Slurm params.

  • Tests added/updated
    Files: tests/systems/slurm/test_allocation.py, tests/systems/slurm/test_system.py, tests/systems/slurm/test_command_gen_strategy.py, tests/test_git_repo_installer.py, tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py
    Summary: Added tests covering node exclusion (single/multiple), RESERVED-state allocation, parse/empty-spec error cases, --exclude and --additional_slurm_params generation, mount_as behavior, Slurm srun-args parsing, and git submodule init success/failure and cleanup.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled nodes and left a few behind,
I fetched curled submodules, tidy and kind.
Mounts snug like carrots, mapped just right,
Slurm learns new params and jobs take flight.
Hoppity tests pass under moonlit night.

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title 'Megatron-Bridge r0.3.0 enhancement' accurately summarizes the primary focus of the changeset: updating Megatron-Bridge to r0.3.0 and adding related enhancements.
  • Description check: ✅ Passed. The pull request description comprehensively explains the changes, providing clear examples of node placement propagation, GPU resource requesting, git submodule initialization, container mounting, and recipe updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR introduces several enhancements to the Megatron-Bridge workload and Slurm infrastructure for CloudAI: node placement propagation (resolved nodelist= forwarded to setup_experiment.py), reserved-node allocation support, GPU resource forwarding via --additional_slurm_params, git submodule initialization in both SlurmInstaller and KubernetesInstaller, and optional container bind-mounting of the cloned repo via mount_as. A new r0.3.0 scenario file for Qwen 30B is also added.

Key changes:

  • slurm_command_gen_strategy.py (Megatron Bridge): Builds additional_slurm_params from gpus_per_node, the resolved nodelist, and extra_srun_args; standalone boolean flags in extra_srun_args are now correctly forwarded. The elif self.test_run.exclude_nodes branch correctly forwards exclusions when no explicit node list is provided.
  • slurm_system.py: group_nodes_by_state and allocate_nodes now include RESERVED nodes, allowing jobs with --reservation to resolve their node specs without hitting "0 nodes available" errors. However, RESERVED is the lowest priority in the integer-count allocation path (after IDLE, COMPLETING, ALLOCATED), which can conflict with workflows where reserved nodes should be selected first.
  • SlurmInstaller / KubernetesInstaller: Symmetric _init_submodules method added (with cleanup on failure); both the "new clone" and "existing repo" paths respect init_submodules.
  • Config files: The new megatron_bridge_r0.3.0_qwen_30b.toml scenario uses commit = "r0.3.0", while the existing per-GPU test configs still use commit = "v0.3.0" — these may be the same release under different tag names, but the inconsistency warrants confirmation.
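The RESERVED-priority concern can be illustrated with a small sketch of state-ordered allocation. The state names follow the review; the allocator itself is a simplified stand-in for allocate_nodes, not the shipped code.

```python
from typing import Dict, List

# RESERVED sits last in the pick order, which is the conflict the review flags:
# with --reservation in play, reserved nodes arguably should be preferred.
STATE_PRIORITY = ["IDLE", "COMPLETING", "ALLOCATED", "RESERVED"]


def allocate(grouped: Dict[str, List[str]], count: int) -> List[str]:
    """Pick `count` nodes, exhausting higher-priority states first."""
    picked: List[str] = []
    for state in STATE_PRIORITY:
        for node in grouped.get(state, []):
            if len(picked) == count:
                return picked
            picked.append(node)
    return picked


grouped = {"IDLE": ["n1", "n2"], "RESERVED": ["r1", "r2"]}
print(allocate(grouped, 3))  # reserved nodes are only used after idle ones
```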

Confidence Score: 3/5

  • Safe to merge with caveats — the RESERVED node priority ordering may silently cause inner-job failures for reservation-based workflows, and the v0.3.0/r0.3.0 tag discrepancy across config files should be confirmed before shipping to users.
  • The core submodule-init and GPU-directive forwarding logic is well-tested and correct. The exclude_nodes forwarding for the inner job is now handled. The main concerns are: (1) the RESERVED priority order in allocate_nodes puts reserved nodes last, creating a potential conflict when --reservation is in use and reserved nodes should be selected over IDLE ones; (2) tag name inconsistency (v0.3.0 vs r0.3.0) across config files that could silently pin users to the wrong release.
  • src/cloudai/systems/slurm/slurm_system.py (RESERVED allocation priority order) and the per-GPU test TOML configs (tag name inconsistency)

Important Files Changed

Filename | Overview

  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py: Core logic for forwarding nodelist/gpus/exclude to setup_experiment.py via --additional_slurm_params; RESERVED priority ordering may conflict with --reservation workflows (reserved nodes allocated last, after IDLE/COMPLETING/ALLOCATED)
  • src/cloudai/systems/slurm/slurm_system.py: Adds the RESERVED state to the grouped_nodes dict and allocate_nodes; reserved nodes are now considered allocatable for both integer counts and max_avail; the change is well-targeted
  • src/cloudai/systems/slurm/slurm_installer.py: Adds _init_submodules and integrates it into _install_one_git_repo and _clone_and_setup_repo with proper cleanup on failure; the implementation is correct
  • src/cloudai/_core/installables.py: Adds mount_as and init_submodules fields to GitRepo; the container_mount property correctly falls back to /git/{repo_name} when mount_as is unset
  • tests/test_git_repo_installer.py: Comprehensive tests for submodule init/failure paths including cleanup, existing-repo re-init, and skipping when not requested; covers both Slurm and Kubernetes installers
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml: Removes mount_as (now optional) but retains commit = "v0.3.0", inconsistent with the new scenario's r0.3.0 tag; also lacks init_submodules, which may be required depending on the version

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant SlurmInstaller
    participant SlurmSystem
    participant MBStrategy as MegatronBridgeCmdGenStrategy
    participant SetupExperiment as setup_experiment.py
    participant Slurm

    User->>CloudAI: cloudai install with init_submodules=true
    CloudAI->>SlurmInstaller: install GitRepo
    SlurmInstaller->>SlurmInstaller: clone and checkout commit
    SlurmInstaller->>SlurmInstaller: _init_submodules NEW
    SlurmInstaller-->>CloudAI: installed_path set

    User->>CloudAI: cloudai run with partition group node specs
    CloudAI->>SlurmSystem: get_nodes_by_spec
    SlurmSystem->>SlurmSystem: group_nodes_by_state IDLE COMPLETING ALLOCATED RESERVED
    Note over SlurmSystem: RESERVED state now included NEW
    SlurmSystem->>SlurmSystem: allocate_nodes picks from all states
    SlurmSystem-->>CloudAI: resolved nodelist

    CloudAI->>MBStrategy: gen_exec_command
    MBStrategy->>MBStrategy: build additional_slurm_params NEW
    Note over MBStrategy: gpus-per-node and nodelist forwarded
    MBStrategy->>MBStrategy: add mount_as bind-mount to cm NEW
    MBStrategy-->>CloudAI: wrapper script path

    CloudAI->>SetupExperiment: python setup_experiment.py with additional_slurm_params
    SetupExperiment->>Slurm: sbatch inner training job with forwarded nodelist and GPU resources
    Slurm-->>SetupExperiment: job ID
    SetupExperiment-->>CloudAI: Submitted batch job id

Last reviewed commit: "Restructure configur..."

coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml (1)

28-40: ⚠️ Potential issue | 🟠 Major

Do not leave hf_token empty in this test recipe.

This updated recipe still ships with hf_token = "", which makes the config unusable unless callers patch the TOML out-of-band. Please wire this to a user-supplied secret/input, or explicitly exempt this scenario from the non-empty validation path. Based on learnings: For Megatron Bridge test TOML configurations under conf/experimental/megatron_bridge/test/, hf_token must be non-empty; tests should fail validation if hf_token is left empty to ensure users supply their own Hugging Face token.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml`
around lines 28 - 40, The test TOML leaves the [cmd_args] key hf_token empty
which breaks validation; change the recipe to require a user-supplied token by
wiring hf_token to a runtime secret/input (e.g., use a placeholder token
variable like HF_TOKEN referenced by the deployment system) or update the config
validation logic to reject empty hf_token for Megatron Bridge test configs;
specifically modify the [cmd_args] hf_token entry and/or the validation path
that checks test recipes under conf/experimental/megatron_bridge/test to enforce
non-empty hf_token so tests fail if not provided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/slurm/slurm_system.py`:
- Around line 722-725: The current logic in
parse_node_list/parse_nodes/get_nodes_by_spec can produce an empty node list
after applying exclude_nodes and then fall back to an unconstrained allocation;
modify the code so that after filtering expanded_nodes = [n for n in
expanded_nodes if n not in exclude_nodes] you detect if expanded_nodes is empty
and immediately raise a clear exception (e.g., ValueError/RuntimeError) that
includes the offending node_spec and exclude_nodes, and ensure get_nodes_by_spec
propagates this error instead of returning (num_nodes, []). This prevents
SlurmCommandGenStrategy from seeing an empty nodelist and emitting -N for
unconstrained placement; alternatively, if a non-exceptional path is required,
return a sentinel that preserves the exclusion constraint and is handled by
SlurmCommandGenStrategy rather than an empty list fallback.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 323-351: The tests are using the defaults from the test factory so
they pass regardless of the override plumbing; update the failing tests to use
non-default values so the overrides are actually exercised: in
test_mount_as_adds_repo_to_container_mounts call make_test_run with a distinct
mount_as (e.g. "/opt/custom-megatron") and assert the wrapper_content contains
f"-cm {repo_path.absolute()}:/opt/custom-megatron" (use the same mount_as
string), in test_gpus_per_node_passed_as_additional_slurm_param pass
cmd_args_overrides={"gpus_per_node": 2} (or another non-default number) and
assert "gpus-per-node=2" appears in wrapper_content; keep using
MegatronBridgeSlurmCommandGenStrategy, make_test_run, and _wrapper_content to
locate where to change the values.

---

Outside diff comments:
In `@conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml`:
- Around line 28-40: The test TOML leaves the [cmd_args] key hf_token empty
which breaks validation; change the recipe to require a user-supplied token by
wiring hf_token to a runtime secret/input (e.g., use a placeholder token
variable like HF_TOKEN referenced by the deployment system) or update the config
validation logic to reject empty hf_token for Megatron Bridge test configs;
specifically modify the [cmd_args] hf_token entry and/or the validation path
that checks test recipes under conf/experimental/megatron_bridge/test to enforce
non-empty hf_token so tests fail if not provided.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b086916f-352d-4c4c-b94f-f9fa9c3846a0

📥 Commits

Reviewing files that changed from the base of the PR and between dd076e8 and e14fa6f.

📒 Files selected for processing (15)
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/models/scenario.py
  • src/cloudai/systems/kubernetes/kubernetes_installer.py
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • src/cloudai/systems/slurm/slurm_installer.py
  • src/cloudai/systems/slurm/slurm_system.py
  • src/cloudai/test_scenario_parser.py
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/ref_data/megatron-bridge.sbatch
  • tests/systems/slurm/test_allocation.py
  • tests/systems/slurm/test_system.py
  • tests/test_git_repo_installer.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 323-334: The test function
test_mount_as_adds_repo_to_container_mounts declares an unused tmp_path fixture
parameter; remove tmp_path from the test signature so it becomes def
test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system:
SlurmSystem, make_test_run: Callable[..., TestRun]) -> None: and update any
references if the test was called elsewhere (none expected) to eliminate the
unused parameter warning; locate this function by name in
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py and adjust
the signature accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d59eb2c8-9f04-4000-8d8d-493717387607

📥 Commits

Reviewing files that changed from the base of the PR and between e14fa6f and f26fe7b.

📒 Files selected for processing (6)
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • src/cloudai/systems/slurm/slurm_system.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/systems/slurm/test_command_gen_strategy.py
  • tests/systems/slurm/test_system.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/_core/installables.py (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Fix copyright year to resolve CI failure.

The pipeline reports that the copyright year is invalid. It should be 2025-2026 to match the header pattern.

🔧 Proposed fix
-# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/_core/installables.py` at line 2, Update the file header line
that currently reads "Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
rights reserved." to use the required year range "2025-2026" so it matches the
project's header pattern; locate the top-of-file copyright string in
src/cloudai/_core/installables.py and replace the single year with "2025-2026".
♻️ Duplicate comments (1)
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py (1)

323-325: ⚠️ Potential issue | 🟡 Minor

Remove unused tmp_path fixture from test signature.

tmp_path is unused in this method (Line 324), and keeping it triggers ARG002 noise.

🧹 Proposed fix
 def test_mount_as_adds_repo_to_container_mounts(
-    self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
+    self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
 ) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around
lines 323 - 325, The test function test_mount_as_adds_repo_to_container_mounts
currently declares an unused tmp_path fixture which triggers ARG002; remove
tmp_path from the function signature so it reads def
test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system:
SlurmSystem, make_test_run: Callable[..., TestRun]) -> None:, and ensure there
are no remaining references to tmp_path in the body of that test (adjust imports
or fixtures if needed).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_git_repo_installer.py`:
- Around line 281-300: The tests were reformatted by ruff-format; run the
formatter and pre-commit hooks to fix CI failures by running the project's
formatting hooks (e.g., pre-commit run --all-files or ruff format) and commit
the changes; ensure the test functions
(test_submodules_skipped_when_not_requested, test_submodules_run_when_requested)
and the installer methods referenced (installer._install_one_git_repo,
installer._init_submodules, installer._clone_repository,
installer._checkout_commit) remain unchanged in behavior after formatting so the
assertions and mocks still match the corrected file.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 362-370: The test test_test_run_extra_srun_args_forwarded
currently only checks for the existence of "--additional_slurm_params" which can
be present by default; update it to assert that the value derived from
TestRun.extra_srun_args is actually forwarded by
MegatronBridgeSlurmCommandGenStrategy. Specifically, set tr.extra_srun_args =
"--exclusive" (already done), call _wrapper_content(cmd_gen) as before, and
replace the loose assert with one that checks the wrapper_content contains the
forwarded flag and value (e.g., that the string representing the additional
slurm params includes "--exclusive" or an exact parameter encoding produced by
the command generator) so the test fails if extra_srun_args is ignored;
reference the test function name, the TestRun.extra_srun_args field, the
MegatronBridgeSlurmCommandGenStrategy constructor, and the _wrapper_content
helper to locate the code.
- Around line 335-343: Replace the brittle blanket check for "-cm" in
test_no_mount_as_skips_repo_container_mount with a precise assertion that the
Megatron-Bridge repo mount is not present: remove or change the assert "-cm" not
in wrapper_content to assert that the specific mount target or mount flag for
the repo (for example a string containing ':/opt/Megatron-Bridge' or the '-v
...:/opt/Megatron-Bridge' pattern) is not in wrapper_content, keeping the
existing assert "/opt/Megatron-Bridge" absence check or tightening it to match
the exact mount syntax emitted by MegatronBridgeSlurmCommandGenStrategy.

---

Outside diff comments:
In `@src/cloudai/_core/installables.py`:
- Line 2: Update the file header line that currently reads "Copyright (c) 2025
NVIDIA CORPORATION & AFFILIATES. All rights reserved." to use the required year
range "2025-2026" so it matches the project's header pattern; locate the
top-of-file copyright string in src/cloudai/_core/installables.py and replace
the single year with "2025-2026".

---

Duplicate comments:
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 323-325: The test function
test_mount_as_adds_repo_to_container_mounts currently declares an unused
tmp_path fixture which triggers ARG002; remove tmp_path from the function
signature so it reads def test_mount_as_adds_repo_to_container_mounts(self,
configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]) ->
None:, and ensure there are no remaining references to tmp_path in the body of
that test (adjust imports or fixtures if needed).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57739f39-82cc-44a7-add5-3cc1080fcf41

📥 Commits

Reviewing files that changed from the base of the PR and between a159736 and 52cef4c.

📒 Files selected for processing (17)
  • conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
  • conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
  • src/cloudai/_core/installables.py
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/models/scenario.py
  • src/cloudai/systems/kubernetes/kubernetes_installer.py
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • src/cloudai/systems/slurm/slurm_installer.py
  • src/cloudai/systems/slurm/slurm_system.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/systems/slurm/test_allocation.py
  • tests/systems/slurm/test_command_gen_strategy.py
  • tests/systems/slurm/test_system.py
  • tests/test_git_repo_installer.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/_core/installables.py`:
- Line 94: The early return in both _install_one_git_repo() implementations
skips submodule setup for existing clones, so ensure submodules are initialized
when item.init_submodules is true even if repo_path.exists() and commit matches;
move or duplicate the item.init_submodules handling out of the post-clone-only
block (or add an explicit check after the early-return path) so that when
repo_path.exists() and item.init_submodules is true you call the same submodule
init/update sequence used after cloning (reference symbols:
_install_one_git_repo, repo_path.exists(), item.init_submodules).

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 111-133: The _parse_srun_args_as_slurm_params function incorrectly
uses str.split() which breaks quoted shell arguments; replace tokens =
srun_args.split() with tokens = shlex.split(srun_args) so quoted values (e.g.
--comment "nightly run") are treated as a single token and the rest of the
parsing logic (the while loop handling --key=val, --key value, and standalone
flags) remains unchanged; shlex is already imported so just swap the split call
to shlex.split.

In `@tests/test_git_repo_installer.py`:
- Around line 281-299: Add a regression test that covers the "existing repo"
fast path with init_submodules enabled: create a test (in
tests/test_git_repo_installer.py) that pre-creates the repository directory
referenced by the GitRepo fixture (repo_path), set git.init_submodules = True,
mock installer._verify_commit to return InstallStatusResult(True) and mock
installer._clone_repository/_checkout_commit as needed, call
installer._install_one_git_repo(git), assert the result is successful and that
installer._init_submodules.assert_called_once() to ensure submodules are
initialized on the pre-existing-repo path.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 375-385: Add a new unit test in the same test file exercising
MegatronBridgeSlurmCommandGenStrategy._parse_srun_args_as_slurm_params with a
quoted value, e.g. call it with '--comment "nightly run" --reservation
my_reserv' and assert the returned list includes 'comment=nightly run' and
'reservation=my_reserv' (ordering-insensitive or match exact expected order used
in existing tests); this ensures the parser correctly strips quotes and
preserves spaces for values like "nightly run".
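
The shlex-based parsing recommended above for _parse_srun_args_as_slurm_params can be sketched as follows. This is an approximation of the suggested fix, not the shipped method:

```python
import shlex
from typing import List


def parse_srun_args_as_slurm_params(srun_args: str) -> List[str]:
    """Turn '--key=val', '--key val', and bare flags into 'key[=val]' entries."""
    tokens = shlex.split(srun_args)  # keeps quoted values like "nightly run" intact
    params: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i].lstrip("-")
        if "=" in tok:
            params.append(tok)  # --key=val form
            i += 1
        elif i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            params.append(f"{tok}={tokens[i + 1]}")  # --key val form
            i += 2
        else:
            params.append(tok)  # standalone boolean flag
            i += 1
    return params


print(parse_srun_args_as_slurm_params('--comment "nightly run" --exclusive'))
# ['comment=nightly run', 'exclusive']
```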

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 001e2203-7e96-421a-ad8b-79c332c8123d

📥 Commits

Reviewing files that changed from the base of the PR and between 52cef4c and b6a5dcf.

📒 Files selected for processing (5)
  • conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
  • src/cloudai/_core/installables.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/test_git_repo_installer.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py
