Add support for 2-node serving for vLLM and SGLang #839
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough
Refactors LLM serving SLURM generators to add a disaggregated multi-node mode: unified GPU ID parsing, per-role GPU/port/log properties, new public APIs (get_serve_commands/get_bench_command/get_helper_command), reworked script-generation flows, and corresponding SGLang/vLLM alignment plus tests and sbatch templates.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Greptile Summary
This PR extends vLLM and SGLang serving workloads to support 2-node disaggregated prefill/decode deployments, alongside a refactoring that consolidates duplicated script-generation logic into the shared `llm_serving` common module.
Confidence Score: 4/5
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cloudai/workloads/common/llm_serving.py (1)
71-80: ⚠️ Potential issue | 🟠 Major
Validate `cmd_args.port` before deriving role ports.

With the new shared `port` field, `port=0` makes the generated health checks and benchmark URLs target port 0, and any disaggregated value above 65335 produces an invalid `prefill_port`/`decode_port`. This now fails at runtime instead of during model validation.

🛡️ Possible fix

```diff
 class LLMServingCmdArgs(CmdArgs, Generic[LLMServingArgsT]):
     """Shared command-argument shape for LLM serving workloads."""

     docker_image_url: str
     model: str
-    port: int = 8000
+    port: int = Field(default=8000, ge=1, le=65535)
     serve_wait_seconds: int = 300
     prefill: LLMServingArgsT | None = Field(default=None)
     decode: LLMServingArgsT
+
+    @model_validator(mode="after")
+    def validate_port_range(self) -> Self:
+        if self.prefill is not None and self.port > 65335:
+            raise ValueError("Disaggregated runs require port <= 65335.")
+        return self
```

Also applies to: 383-391
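The suggested validation can be sketched without Pydantic as well; in this sketch the +100/+200 role-port offsets and the `ServingPorts` name are assumptions standing in for the real derivation around lines 383-391:

```python
from dataclasses import dataclass

PORT_MAX = 65535
PREFILL_OFFSET = 100  # assumed offset for the derived prefill port
DECODE_OFFSET = 200   # assumed offset for the derived decode port


@dataclass
class ServingPorts:
    """Hypothetical stand-in for the port handling in LLMServingCmdArgs."""

    port: int = 8000
    disaggregated: bool = False

    def __post_init__(self) -> None:
        # Reject zero/negative/overflow base ports up front, at "model
        # validation" time rather than at runtime.
        if not 1 <= self.port <= PORT_MAX:
            raise ValueError(f"port must be in 1..{PORT_MAX}, got {self.port}")
        # Cap the base port so the derived decode port stays valid.
        if self.disaggregated and self.port + DECODE_OFFSET > PORT_MAX:
            raise ValueError(
                f"disaggregated runs require port <= {PORT_MAX - DECODE_OFFSET}"
            )

    @property
    def prefill_port(self) -> int:
        return self.port + PREFILL_OFFSET

    @property
    def decode_port(self) -> int:
        return self.port + DECODE_OFFSET
```

With the assumed offsets, the 65335 ceiling in the review falls out naturally as `65535 - 200`.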
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/workloads/common/llm_serving.py` around lines 71 - 80, validate the shared port field to prevent invalid/zero/overflow ports: add a validation on LLMServingCmdArgs.port (e.g., a Pydantic validator or __post_init__ check) to require 1 <= port <= 65535 and raise a clear validation error otherwise; also update the logic that derives per-role ports for prefill/decode (the code paths that compute prefill_port and decode_port around the other block noted at 383-391) to either clamp/validate derived ports or raise if derivation would produce values outside 1-65535 so invalid runtime URLs/health checks are avoided.

src/cloudai/workloads/vllm/slurm_command_gen_strategy.py (1)
44-71: ⚠️ Potential issue | 🔴 Critical
Set `--host 0.0.0.0` in disaggregated serve commands.

In disaggregated mode, the vLLM serve instances must be reachable from other nodes via `${PREFILL_NODE}` and `${DECODE_NODE}`. Without an explicit `--host` binding (defaulting to localhost), the health checks and proxy router connections will fail. SGLang's disaggregated implementation explicitly sets `--host 0.0.0.0` for this reason. Add `--host` to the serve commands generated by `get_serve_commands()` when in disaggregated mode, or make it configurable via `VllmArgs`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/workloads/vllm/slurm_command_gen_strategy.py` around lines 44 - 71, get_serve_commands currently omits an explicit host binding for the disaggregated branch; update the disaggregated command generation inside get_serve_commands (the loop that iterates over self.prefill_port/self.decode_port and uses tdef.cmd_args.prefill / tdef.cmd_args.decode and _to_json_str_arg) to include the flag pair "--host", "0.0.0.0" (add it to each command before "--port"); alternatively you may read a host value from VllmCmdArgs/VllmArgs if you prefer configurability, but ensure the disaggregated kv_producer/kv_consumer commands bind to 0.0.0.0 so other nodes can reach them.
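A minimal sketch of the suggested change, assuming vLLM's standard `--host`/`--port` CLI flags; the kv-transfer-config payload and the `build_serve_command` helper name are illustrative, not the real schema:

```python
from typing import List, Optional


def build_serve_command(model: str, port: int, kv_role: Optional[str] = None) -> List[str]:
    """Sketch of a disaggregation-aware serve command builder."""
    cmd = ["vllm", "serve", model]
    if kv_role is not None:
        # Disaggregated mode: bind to all interfaces so the peer node can
        # reach this instance via ${PREFILL_NODE}/${DECODE_NODE}; vLLM
        # otherwise binds to localhost only.
        cmd += ["--host", "0.0.0.0"]  # noqa: S104 - must be reachable cross-node
    cmd += ["--port", str(port)]
    if kv_role is not None:
        # Illustrative kv-transfer-config payload for the producer/consumer role.
        cmd += ["--kv-transfer-config", f'{{"kv_role": "{kv_role}"}}']
    return cmd
```

Placing `--host` before `--port`, as the prompt suggests, keeps the flag ordering identical for the prefill and decode commands.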
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 276-291: The report undercounts GPUs for two-node disaggregated
runs because all_gpu_ids() only returns the local node’s GPU list while
prefill_gpu_ids() and decode_gpu_ids() can be node-local; update the logic so
the denominator reflects the true total GPUs: either plumb num_nodes into the
total-GPU helper (all_gpu_ids / used_gpus_count()) so all_gpu_ids() preserves
its “all GPUs” invariant across nodes, or change used_gpus_count() to compute
the denominator as len(self.prefill_gpu_ids) + len(self.decode_gpu_ids) for
disaggregated runs; modify the code paths around all_gpu_ids(),
used_gpus_count(), prefill_gpu_ids, and decode_gpu_ids to use the chosen
approach ensuring two-node runs count both nodes’ GPUs (reference symbols:
prefill_gpu_ids, decode_gpu_ids, all_gpu_ids, used_gpus_count, num_nodes,
is_two_node_disaggregated).
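The second option the prompt describes (summing the per-role lists for the denominator) can be sketched as follows; the parameter names mirror the review's symbols, not necessarily the actual signatures:

```python
from typing import Sequence


def used_gpus_count(
    local_gpu_ids: Sequence[int],
    prefill_gpu_ids: Sequence[int],
    decode_gpu_ids: Sequence[int],
    is_two_node_disaggregated: bool,
) -> int:
    """Sketch of the denominator fix for the GPU-usage report."""
    if is_two_node_disaggregated:
        # all_gpu_ids() only sees the local node in a two-node run, so
        # summing the per-role lists counts both nodes' GPUs.
        return len(prefill_gpu_ids) + len(decode_gpu_ids)
    return len(local_gpu_ids)
```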
In `@src/cloudai/workloads/sglang/slurm_command_gen_strategy.py`:
- Line 41: The linter flags for binding to 0.0.0.0 should be explicitly
suppressed with a justification: add an inline Ruff suppression comment (e.g.
append " # noqa: S104 - intentional bind to all interfaces for serving" to the
literal "0.0.0.0" occurrences) in the get_serve_commands function signature and
the other binding site around line 91 so CI stops failing while documenting the
intent.
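One way to apply the suppression once rather than at both binding sites is to name the literal; a hedged sketch (the constant name and command shape are made up, only the literal and its justified noqa comment matter):

```python
from typing import List

HOST_ALL_INTERFACES = "0.0.0.0"  # noqa: S104 - intentional bind to all interfaces for serving


def get_serve_commands(host: str = HOST_ALL_INTERFACES) -> List[str]:
    """Illustrative signature only; both binding sites can reference the
    constant so the suppression and its rationale live in one place."""
    return ["python", "-m", "sglang.launch_server", "--host", host]
```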
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4809f61d-ba57-4be6-88c6-ba4bc5f294b6
📒 Files selected for processing (14)
- src/cloudai/workloads/common/llm_serving.py
- src/cloudai/workloads/sglang/sglang.py
- src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
- src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
- tests/ref_data/sglang-disagg-2nodes.sbatch
- tests/ref_data/sglang-disagg.sbatch
- tests/ref_data/sglang.sbatch
- tests/ref_data/vllm-disagg-2nodes.sbatch
- tests/ref_data/vllm-disagg.sbatch
- tests/ref_data/vllm.sbatch
- tests/test_acceptance.py
- tests/workloads/sglang/test_command_gen_strategy_slurm.py
- tests/workloads/test_llm_serving.py
- tests/workloads/vllm/test_command_gen_strategy_slurm.py
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/sglang/sglang.py (1)
42-48: ⚠️ Potential issue | 🟡 Minor
Default change makes implicit behavior explicit; verify this is the intended direction.

The default for `disaggregation_transfer_backend` has changed from `None` to `"nixl"`. While this technically alters the field default, the command generation fallback at line 62 of `slurm_command_gen_strategy.py` (`backend = args.disaggregation_transfer_backend or "nixl"`) means the runtime behavior likely doesn't change; both paths resolve to `"nixl"`. Since this is an internal field (excluded from serve_args), the impact is minimal. Confirm this change aligns with your PR goals.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/workloads/sglang/sglang.py` around lines 42 - 48, The PR changed the default of the Field disaggregation_transfer_backend from None to "nixl", but slurm_command_gen_strategy.py still uses the fallback `backend = args.disaggregation_transfer_backend or "nixl"`; decide which behavior you want and make it consistent: either revert disaggregation_transfer_backend to default=None (so runtime defaulting remains centralized in slurm_command_gen_strategy) or keep the Field default="nixl" and remove the redundant fallback in slurm_command_gen_strategy to avoid implicit duplication; update any related documentation and tests for disaggregation_transfer_backend and ensure the field remains excluded from serve_args as intended.
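The first option in the prompt (field default stays `None`, with the `"nixl"` fallback centralized in one place) can be sketched like this; the helper and constant names are hypothetical:

```python
from typing import Optional

DEFAULT_TRANSFER_BACKEND = "nixl"


def resolve_transfer_backend(configured: Optional[str]) -> str:
    """Single source of truth for the transfer-backend default, so the
    field default and the command-gen fallback cannot drift apart."""
    return configured or DEFAULT_TRANSFER_BACKEND
```

The alternative is the mirror image: keep `default="nixl"` on the field and delete the `or "nixl"` fallback entirely. Either way, only one site should own the default.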
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 557-565: After spawning the router (the block that echoes
"Starting {self.proxy_router_name}..." and runs the helper_cmd setting
{self.proxy_router_pid_var}), add an explicit readiness wait before the "Running
benchmark..." srun: poll the router's readiness endpoint (use /health or the
router's specific readiness path) on {self.serve_port} with a retry loop and
timeout (curl or wget checking HTTP 200), logging retries and failing the script
if timeout is reached; only proceed to run the benchmark that writes to
{(self.test_run.output_path / self.bench_log_file).absolute()} after the
readiness check succeeds so the benchmark won't race with router startup.
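The readiness gate could be emitted by a helper along these lines, as a hedged sketch; the `/health` path, the helper name, and the timing defaults are all assumptions:

```python
from typing import List


def router_readiness_lines(port: int, timeout_s: int = 120, interval_s: int = 5) -> List[str]:
    """Sketch of shell lines to emit into the sbatch script between the
    router launch and the benchmark srun."""
    retries = timeout_s // interval_s
    return [
        f"for i in $(seq 1 {retries}); do",
        f'  if curl -sf "http://localhost:{port}/health" > /dev/null; then',
        '    echo "Proxy router is ready."; break',
        "  fi",
        # Fail the script if the router never comes up, instead of letting
        # the benchmark race against a half-started router.
        f'  if [ "$i" -eq {retries} ]; then echo "Proxy router did not become ready" >&2; exit 1; fi',
        f'  echo "Waiting for proxy router ($i/{retries})..."; sleep {interval_s}',
        "done",
    ]
```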
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 85beaa9c-1199-4bdd-8532-6934d7c8255a
📒 Files selected for processing (4)
- src/cloudai/workloads/common/llm_serving.py
- src/cloudai/workloads/sglang/sglang.py
- src/cloudai/workloads/vllm/vllm.py
- tests/workloads/test_llm_serving.py
💤 Files with no reviewable changes (1)
- src/cloudai/workloads/vllm/vllm.py
Summary
Add support for 2-node serving for vLLM and SGLang.
Test Plan
Additional Notes
–