
Add support for x2 nodes serving for vLLM and SGLang #839

Merged
amaslenn merged 7 commits into main from am/2nodes-serve
Mar 18, 2026

Conversation

@amaslenn
Contributor

Summary

Add support for x2 nodes serving for vLLM and SGLang.

Test Plan

  1. CI (extended)
  2. Manual runs for vLLM/SGLang on 1 and 2 nodes.

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 59701c1e-7db9-4e42-b53c-23f5cfb8fd01

📥 Commits

Reviewing files that changed from the base of the PR and between 65754c8 and 7e9efde.

📒 Files selected for processing (1)
  • src/cloudai/workloads/sglang/slurm_command_gen_strategy.py

📝 Walkthrough

Walkthrough

Refactors LLM serving SLURM generators to add disaggregated multi-node mode: unified GPU ID parsing, per-role GPU/port/log properties, new public APIs (get_serve_commands/get_bench_command/get_helper_command), reworked script-generation flows, and corresponding SGLang/vLLM alignment plus tests and sbatch templates.

Changes

Cohort / File(s) Summary
Core LLM Serving Framework
src/cloudai/workloads/common/llm_serving.py
Adds parse_gpu_ids, unified GPU resolution (all_gpu_ids), port on LLMServingCmdArgs with validation, abstract workload_name/workload_slug, disaggregation flags and per-role GPU/port/log properties, public APIs (get_serve_commands, get_bench_command, get_helper_command), and new script-generation flow for aggregated vs disaggregated modes (preamble, role env, node setup, srun/script generators).
SGLang Workload
src/cloudai/workloads/sglang/sglang.py, src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
Changes the default of disaggregation_transfer_backend to "nixl"; removes port and health_endpoint from SglangCmdArgs; adapts the Slurm strategy to the new base-class API (renamed serve/bench/helper methods, workload_name, aggregated_serve_env, host-aware serve/bench/helper generation); removes legacy script-gen helpers.
vLLM Workload
src/cloudai/workloads/vllm/vllm.py, src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
Removes port from VllmCmdArgs; renames public methods to get_serve_commands/get_bench_command/get_helper_command, adds workload_name and disaggregation helpers (disaggregated_script_preamble, disaggregated_role_env), and updates bench base URLs to use 127.0.0.1 and serve_port.
SLURM Batch Script Templates
tests/ref_data/sglang.sbatch, tests/ref_data/sglang-disagg.sbatch, tests/ref_data/sglang-disagg-2nodes.sbatch, tests/ref_data/vllm.sbatch, tests/ref_data/vllm-disagg.sbatch, tests/ref_data/vllm-disagg-2nodes.sbatch
Adds two-node disaggregated sbatch templates and updates existing templates: node discovery (PREFILL_NODE/DECODE_NODE), per-role startup/health checks, use of SERVE_PID/HELPER_PID, bench base URL → 127.0.0.1, helper/proxy lifecycle and cleanup adjustments.
Tests and Fixtures
tests/test_acceptance.py, tests/workloads/sglang/test_command_gen_strategy_slurm.py, tests/workloads/vllm/test_command_gen_strategy_slurm.py, tests/workloads/test_llm_serving.py
Adds two-node disaggregation test variants and fixtures, updates tests to new method names and disaggregated expectations, adds GPU-parsing unit tests and FakeLLMSlurmStrategy scaffold, and tests for role-host mapping and >2-node rejection.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia
  • podkidyshev

Poem

🐰 I nibble logs and ports by moonlit beams,
Prefill and decode chase shared GPU dreams.
Two nodes, one carrot—routes tied with care,
Health checks hum softly through the midnight air.
Hop, deploy, and rest — the cluster's all fair.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: adding support for 2-node serving configurations for vLLM and SGLang, which aligns with the extensive changes in the changeset.
  • Description check: ✅ Passed. The description is related to the changeset, stating the feature addition and providing a test plan, though lacking detailed technical depth about the disaggregated architecture changes.


@greptile-apps
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR extends vLLM and SGLang serving workloads to support 2-node disaggregated prefill/decode deployments, alongside a refactoring that consolidates duplicated script-generation logic into the shared LLMServingSlurmCommandGenStrategy base class.

Key changes:

  • 2-node disaggregated support: New is_two_node_disaggregated property gates on node count (1 or 2). generate_disaggregated_node_setup() creates bash variables for node roles. The --relative=0/-N1 and --relative=1/-N1 srun flags pin prefill/decode services to specific nodes. GPU IDs are shared (not split) per role when each node owns its full GPU set.
  • Base class consolidation: _gen_aggregated_script and _gen_disaggregated_script are promoted to the base class with hook methods (aggregated_serve_env, disaggregated_role_env, disaggregated_script_preamble, get_helper_command) for workload-specific customization. Both subclasses shed approximately 80 lines of duplicated code.
  • Port handling: port moved to LLMServingCmdArgs; a disaggregated validator ensures port <= 65335 so both prefill (+100) and decode (+200) offsets fit within the valid TCP range.
  • Isolation improvement: Per-command inline environment variable injection replaces global export statements, scoping env vars to each individual srun step. Health check endpoints now use actual node hostnames instead of 0.0.0.0.
  • Regression: SglangArgs.disaggregation_transfer_backend default changed from None to "nixl", but the old or "nixl" guard was removed. An explicit null in a user TOML config would now produce a literal "None" string passed as the backend argument, which SGLang would reject at runtime.
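
The port rule described above can be sketched as follows; the offset constants and the function name are assumptions reconstructed from the review text, not code taken from the repository:

```python
PREFILL_OFFSET = 100  # assumed from the "+100" offset mentioned in the review
DECODE_OFFSET = 200   # assumed from the "+200" offset

def role_ports(base_port: int, disaggregated: bool) -> tuple[int, int]:
    """Derive per-role ports; the 65335 cap keeps decode within the TCP range."""
    if not 1 <= base_port <= 65535:
        raise ValueError("port must be in 1..65535")
    if disaggregated and base_port > 65535 - DECODE_OFFSET:
        raise ValueError("Disaggregated runs require port <= 65335.")
    return base_port + PREFILL_OFFSET, base_port + DECODE_OFFSET
```

This makes the "magic number" concrete: 65335 is simply 65535 minus the larger (decode) offset.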

Confidence Score: 4/5

  • Safe to merge with one minor fix recommended — the None transfer backend regression in SGLang command generation.
  • The core 2-node logic is well-structured, test coverage is thorough (unit + acceptance tests with reference sbatch files), and the base class consolidation is clean. One P1 logic regression exists: removing the or "nixl" guard after changing the default from None to "nixl" leaves explicit None values producing an invalid "None" string in the generated command. The fix is a one-liner. All other concerns are style-level.
  • src/cloudai/workloads/sglang/slurm_command_gen_strategy.py line 103 — the str(args.disaggregation_transfer_backend) call should guard against explicit None.

Important Files Changed

Filename Overview
src/cloudai/workloads/common/llm_serving.py Core refactoring: adds port field to base class, introduces is_two_node_disaggregated, consolidates script generation into _gen_aggregated_script/_gen_disaggregated_script on the base, and adds parse_gpu_ids helper. Port validator magic number 65335 is correct but unexplained. Logic for GPU-per-role splitting (full gpu_ids per role in 2-node mode) is sound.
src/cloudai/workloads/sglang/slurm_command_gen_strategy.py Heavily slimmed down — most logic moved to base class. Contains the regression in get_serve_commands where an explicit None for disaggregation_transfer_backend becomes the string "None". Also introduces aggregated_serve_env() override for CUDA_VISIBLE_DEVICES.
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py Logic moved to base class. Adds disaggregated_script_preamble() for NIXL port computation, switches from export env vars to env prefix (cleaner isolation), and routes host references to disaggregated_role_host() shell variables for 2-node support.
src/cloudai/workloads/sglang/sglang.py Removes duplicate port field (moved to base) and health_endpoint field. Changes disaggregation_transfer_backend default from None to "nixl", which is correct but paired with removal of the or "nixl" fallback guard in the command gen strategy.
tests/ref_data/vllm-disagg-2nodes.sbatch New reference file for 2-node vLLM disaggregated mode. Uses --relative=0 -N1 / --relative=1 -N1 to pin prefill/decode to specific nodes, env prefix for isolation, and ${PREFILL_NODE} / ${DECODE_NODE} shell variables throughout.
tests/ref_data/sglang-disagg-2nodes.sbatch New reference file for 2-node SGLang disaggregated mode. Structure mirrors vllm-disagg-2nodes with appropriate SGLang-specific commands.
tests/workloads/test_llm_serving.py Comprehensive new tests for 2-node disaggregated behavior: GPU ID sharing, node role hosts, port validator, used_gpus_count for multi-node, and parse_gpu_ids. Good coverage of the new base-class functionality.
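
The walkthrough and these tests both mention a new parse_gpu_ids helper. A minimal sketch of what such a parser could look like, with the signature and accepted input shapes assumed rather than taken from the repository:

```python
def parse_gpu_ids(spec) -> list[int]:
    """Parse a GPU ID spec such as "0,1,3" (or an iterable of IDs) into ints."""
    if spec is None:
        return []
    if isinstance(spec, str):
        return [int(part) for part in spec.split(",") if part.strip()]
    return [int(part) for part in spec]
```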

Sequence Diagram

sequenceDiagram
    participant Gen as LLMServingSlurmCommandGenStrategy
    participant Base as _gen_llm_serving_srun_command
    participant Agg as _gen_aggregated_script
    participant Disagg as _gen_disaggregated_script

    Gen->>Base: get_serve_commands()
    Base-->>Gen: [cmd] or [prefill_cmd, decode_cmd]

    alt 1 serve command (aggregated)
        Base->>Agg: serve_cmd, bench_cmd
        Agg-->>Base: bash script\n(SERVE_PID, health check on $NODE)
    else 2 serve commands (disaggregated)
        Base->>Disagg: [prefill_cmd, decode_cmd], bench_cmd
        Disagg->>Gen: generate_disaggregated_node_setup()
        Gen-->>Disagg: PREFILL_NODE / DECODE_NODE setup

        alt is_two_node_disaggregated
            Disagg-->>Disagg: --relative=0 -N1 (prefill)\n--relative=1 -N1 (decode)
        else 1-node disaggregated
            Disagg-->>Disagg: no --relative (share single node)
        end

        Disagg->>Gen: get_helper_command()
        Gen-->>Disagg: router/proxy command
        Disagg-->>Base: bash script\n(PREFILL_PID, DECODE_PID, HELPER_PID)
    end
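The generate_disaggregated_node_setup() step in the diagram emits the PREFILL_NODE/DECODE_NODE variables the sbatch templates rely on. A hedged sketch of what that emission might look like; the Slurm commands and the function name are assumptions based on the review, not the actual implementation:

```python
def node_setup(two_node: bool) -> str:
    """Emit bash assigning PREFILL_NODE/DECODE_NODE for the generated script."""
    if two_node:
        # One hostname per role, taken from the job's node list.
        return (
            'NODES=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))\n'
            "PREFILL_NODE=${NODES[0]}\n"
            "DECODE_NODE=${NODES[1]}\n"
        )
    # Single-node disaggregation: both roles share the local host.
    return 'PREFILL_NODE="$SLURMD_NODENAME"\nDECODE_NODE="$SLURMD_NODENAME"\n'
```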

Comments Outside Diff (1)

  1. src/cloudai/workloads/sglang/slurm_command_gen_strategy.py, line 103

    None transfer backend produces "None" string in generated command

    str(args.disaggregation_transfer_backend) converts a Python None to the string "None" instead of falling back to a sensible default. The old code used args.disaggregation_transfer_backend or "nixl" as a guard.

    While the field's default was changed from None to "nixl" in this PR, the field type still declares str | list[str] | None, so None is still a valid value a user can pass explicitly in a TOML config (e.g. disaggregation_transfer_backend: null). Any such config would now generate --disaggregation-transfer-backend None at runtime, which would be rejected by the SGLang server.
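
The one-line guard the reviewer describes can be sketched like this; the helper name and the comma-joining of list values are assumptions for illustration, not the repository's actual code:

```python
def transfer_backend_arg(value) -> str:
    """Normalize a str | list[str] | None backend, treating None as "nixl"."""
    backend = value or "nixl"  # restores the old fallback for explicit None
    if isinstance(backend, list):
        backend = ",".join(str(item) for item in backend)
    return backend
```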

Last reviewed commit: 7e9efde

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/cloudai/workloads/common/llm_serving.py (1)

71-80: ⚠️ Potential issue | 🟠 Major

Validate cmd_args.port before deriving role ports.

With the new shared port field, port=0 makes the generated health checks and benchmark URLs target port 0, and any disaggregated value above 65335 produces invalid prefill_port/decode_port. This now fails at runtime instead of during model validation.

🛡️ Possible fix
 class LLMServingCmdArgs(CmdArgs, Generic[LLMServingArgsT]):
     """Shared command-argument shape for LLM serving workloads."""
 
     docker_image_url: str
     model: str
-    port: int = 8000
+    port: int = Field(default=8000, ge=1, le=65535)
     serve_wait_seconds: int = 300
     prefill: LLMServingArgsT | None = Field(default=None)
     decode: LLMServingArgsT
+
+    @model_validator(mode="after")
+    def validate_port_range(self) -> Self:
+        if self.prefill is not None and self.port > 65335:
+            raise ValueError("Disaggregated runs require port <= 65335.")
+        return self

Also applies to: 383-391

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/cloudai/workloads/common/llm_serving.py` around lines 71 - 80, Validate
the shared port field to prevent invalid/zero/overflow ports: add a validation
on LLMServingCmdArgs.port (e.g., a Pydantic/validator or __post_init__ check) to
require 1 <= port <= 65535 and raise a clear validation error otherwise; also
update the logic that derives per-role ports for prefill/decode (the code paths
that compute prefill_port and decode_port around the other block noted at
383-391) to either clamp/validate derived ports or raise if derivation would
produce values outside 1–65535 so invalid runtime URLs/healthchecks are avoided.
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py (1)

44-71: ⚠️ Potential issue | 🔴 Critical

Set --host 0.0.0.0 in disaggregated serve commands.

In disaggregated mode, the vLLM serve instances must be reachable from other nodes via ${PREFILL_NODE} and ${DECODE_NODE}. Without an explicit --host binding (defaulting to localhost), the health checks and proxy router connections will fail. SGLang's disaggregated implementation explicitly sets --host 0.0.0.0 for this reason. Add --host to the serve commands generated by get_serve_commands() when in disaggregated mode, or make it configurable via VllmArgs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/cloudai/workloads/vllm/slurm_command_gen_strategy.py` around lines 44 -
71, get_serve_commands currently omits an explicit host binding for the
disaggregated branch; update the disaggregated command generation inside
get_serve_commands (the loop that iterates over
self.prefill_port/self.decode_port and uses tdef.cmd_args.prefill /
tdef.cmd_args.decode and _to_json_str_arg) to include the flag pair "--host",
"0.0.0.0" (add it to each command before "--port"); alternatively you may read a
host value from VllmCmdArgs/VllmArgs if you prefer configurability, but ensure
the disaggregated kv_producer/kv_consumer commands bind to 0.0.0.0 so other
nodes can reach them.
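
A minimal sketch of the suggested change, assuming the serve command is built as a flat argument list; the helper is hypothetical and only illustrates where the flag pair would go:

```python
def with_host_binding(cmd: list[str], disaggregated: bool) -> list[str]:
    """Insert "--host 0.0.0.0" before "--port" so other nodes can connect."""
    if not disaggregated or "--host" in cmd:
        return cmd
    insert_at = cmd.index("--port") if "--port" in cmd else len(cmd)
    return cmd[:insert_at] + ["--host", "0.0.0.0"] + cmd[insert_at:]
```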
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `src/cloudai/workloads/common/llm_serving.py`:
- Around line 276-291: The report undercounts GPUs for two-node disaggregated
runs because all_gpu_ids() only returns the local node’s GPU list while
prefill_gpu_ids() and decode_gpu_ids() can be node-local; update the logic so
the denominator reflects the true total GPUs: either plumb num_nodes into the
total-GPU helper (all_gpu_ids / used_gpus_count()) so all_gpu_ids() preserves
its “all GPUs” invariant across nodes, or change used_gpus_count() to compute
the denominator as len(self.prefill_gpu_ids) + len(self.decode_gpu_ids) for
disaggregated runs; modify the code paths around all_gpu_ids(),
used_gpus_count(), prefill_gpu_ids, and decode_gpu_ids to use the chosen
approach ensuring two-node runs count both nodes’ GPUs (reference symbols:
prefill_gpu_ids, decode_gpu_ids, all_gpu_ids, used_gpus_count, num_nodes,
is_two_node_disaggregated).

In `src/cloudai/workloads/sglang/slurm_command_gen_strategy.py`:
- Line 41: The linter flags for binding to 0.0.0.0 should be explicitly
suppressed with a justification: add an inline Ruff suppression comment (e.g.
append "  # noqa: S104 - intentional bind to all interfaces for serving" to the
literal  "0.0.0.0" occurrences) in the get_serve_commands function signature and
the other binding site around line 91 so CI stops failing while documenting the
intent.

---

Outside diff comments:
In `src/cloudai/workloads/common/llm_serving.py`:
- Around line 71-80: Validate the shared port field to prevent
invalid/zero/overflow ports: add a validation on LLMServingCmdArgs.port (e.g., a
Pydantic/validator or __post_init__ check) to require 1 <= port <= 65535 and
raise a clear validation error otherwise; also update the logic that derives
per-role ports for prefill/decode (the code paths that compute prefill_port and
decode_port around the other block noted at 383-391) to either clamp/validate
derived ports or raise if derivation would produce values outside 1–65535 so
invalid runtime URLs/healthchecks are avoided.

In `src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`:
- Around line 44-71: get_serve_commands currently omits an explicit host binding
for the disaggregated branch; update the disaggregated command generation inside
get_serve_commands (the loop that iterates over
self.prefill_port/self.decode_port and uses tdef.cmd_args.prefill /
tdef.cmd_args.decode and _to_json_str_arg) to include the flag pair "--host",
"0.0.0.0" (add it to each command before "--port"); alternatively you may read a
host value from VllmCmdArgs/VllmArgs if you prefer configurability, but ensure
the disaggregated kv_producer/kv_consumer commands bind to 0.0.0.0 so other
nodes can reach them.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4809f61d-ba57-4be6-88c6-ba4bc5f294b6

📥 Commits

Reviewing files that changed from the base of the PR and between 61e16c7 and deae17d.

📒 Files selected for processing (14)
  • src/cloudai/workloads/common/llm_serving.py
  • src/cloudai/workloads/sglang/sglang.py
  • src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • tests/ref_data/sglang-disagg-2nodes.sbatch
  • tests/ref_data/sglang-disagg.sbatch
  • tests/ref_data/sglang.sbatch
  • tests/ref_data/vllm-disagg-2nodes.sbatch
  • tests/ref_data/vllm-disagg.sbatch
  • tests/ref_data/vllm.sbatch
  • tests/test_acceptance.py
  • tests/workloads/sglang/test_command_gen_strategy_slurm.py
  • tests/workloads/test_llm_serving.py
  • tests/workloads/vllm/test_command_gen_strategy_slurm.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/workloads/sglang/sglang.py (1)

42-48: ⚠️ Potential issue | 🟡 Minor

Default change makes implicit behavior explicit; verify this is the intended direction.

The default for disaggregation_transfer_backend has changed from None to "nixl". While this technically alters the field default, the command generation fallback at line 62 of slurm_command_gen_strategy.py (backend = args.disaggregation_transfer_backend or "nixl") means the runtime behavior likely doesn't change—both paths resolve to "nixl". Since this is an internal field (excluded from serve_args), the impact is minimal. Confirm this change aligns with your PR goals.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/cloudai/workloads/sglang/sglang.py` around lines 42 - 48, The PR changed
the default of the Field disaggregation_transfer_backend from None to "nixl",
but slurm_command_gen_strategy.py still uses the fallback `backend =
args.disaggregation_transfer_backend or "nixl"`; decide which behavior you want
and make it consistent: either revert disaggregation_transfer_backend to
default=None (so runtime defaulting remains centralized in
slurm_command_gen_strategy) or keep the Field default="nixl" and remove the
redundant fallback in slurm_command_gen_strategy to avoid implicit duplication;
update any related documentation and tests for disaggregation_transfer_backend
and ensure the field remains excluded from serve_args as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `src/cloudai/workloads/common/llm_serving.py`:
- Around line 557-565: After spawning the router (the block that echoes
"Starting {self.proxy_router_name}..." and runs the helper_cmd setting
{self.proxy_router_pid_var}), add an explicit readiness wait before the "Running
benchmark..." srun: poll the router's readiness endpoint (use /health or the
router's specific readiness path) on {self.serve_port} with a retry loop and
timeout (curl or wget checking HTTP 200), logging retries and failing the script
if timeout is reached; only proceed to run the benchmark that writes to
{(self.test_run.output_path / self.bench_log_file).absolute()} after the
readiness check succeeds so the benchmark won't race with router startup.
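
The retry loop the bot asks for could be emitted by the Python generator roughly as follows; the function name, endpoint, and timing defaults are assumptions, sketched in the same generate-bash style the strategies use:

```python
def readiness_check(url: str, timeout_s: int = 300, interval_s: int = 5) -> str:
    """Emit a bash loop that polls `url` and fails the script on timeout."""
    attempts = max(1, timeout_s // interval_s)
    return (
        "READY=0\n"
        f"for _ in $(seq 1 {attempts}); do\n"
        f'  if curl -sf "{url}" > /dev/null; then READY=1; break; fi\n'
        f'  echo "Waiting for router at {url}..."\n'
        f"  sleep {interval_s}\n"
        "done\n"
        '[ "$READY" = "1" ] || { echo "Router not ready" >&2; exit 1; }\n'
    )
```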

---

Outside diff comments:
In `src/cloudai/workloads/sglang/sglang.py`:
- Around line 42-48: The PR changed the default of the Field
disaggregation_transfer_backend from None to "nixl", but
slurm_command_gen_strategy.py still uses the fallback `backend =
args.disaggregation_transfer_backend or "nixl"`; decide which behavior you want
and make it consistent: either revert disaggregation_transfer_backend to
default=None (so runtime defaulting remains centralized in
slurm_command_gen_strategy) or keep the Field default="nixl" and remove the
redundant fallback in slurm_command_gen_strategy to avoid implicit duplication;
update any related documentation and tests for disaggregation_transfer_backend
and ensure the field remains excluded from serve_args as intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 85beaa9c-1199-4bdd-8532-6934d7c8255a

📥 Commits

Reviewing files that changed from the base of the PR and between deae17d and 67b68d1.

📒 Files selected for processing (4)
  • src/cloudai/workloads/common/llm_serving.py
  • src/cloudai/workloads/sglang/sglang.py
  • src/cloudai/workloads/vllm/vllm.py
  • tests/workloads/test_llm_serving.py
💤 Files with no reviewable changes (1)
  • src/cloudai/workloads/vllm/vllm.py

@amaslenn amaslenn requested a review from podkidyshev March 17, 2026 20:17
@amaslenn amaslenn merged commit 8b435bf into main Mar 18, 2026
5 checks passed
@amaslenn amaslenn deleted the am/2nodes-serve branch March 18, 2026 15:31