
feat: add P/D disaggregated examples for XPU+CUDA with host buffer and RDMA#7673

Open
pallavijaini0525 wants to merge 7 commits into ai-dynamo:main from pallavijaini0525:xpu_cuda_examples

Conversation

@pallavijaini0525

@pallavijaini0525 pallavijaini0525 commented Mar 30, 2026

Overview:

Adds two new Kubernetes deployment examples for mixed-device disaggregated prefill/decode (P/D) serving: an Intel XPU as the prefill worker paired with an NVIDIA CUDA GPU as the decode worker. Two variants are provided: one using a CPU host buffer over TCP, and one using direct GPU-to-GPU KV cache transfer over an InfiniBand/RoCE RDMA fabric via NIXL.

Details:

disagg_xpu_cuda.yaml: Mixed P/D disaggregated deployment with Intel XPU prefill and NVIDIA CUDA decode. KV cache is staged through a CPU host buffer and transferred over TCP (UCX_TLS: tcp). KV cache events are published over ZMQ for cache-aware routing (DYN_ROUTER_MODE=kv). The XPU device is allocated via Kubernetes DRA (ResourceClaimTemplate with gpu.intel.com).
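The shape of the TCP variant described above can be sketched as follows. This is an illustrative fragment only, not the actual manifest: the component and claim names are hypothetical, and only the env-var names/values quoted in the description are taken from the PR.

```yaml
# Sketch of the TCP host-buffer variant (illustrative, not the real manifest).
PrefillWorker:            # Intel XPU prefill worker
  envs:
    - name: UCX_TLS
      value: "tcp"        # KV cache staged in a CPU host buffer, moved over TCP
    - name: DYN_ROUTER_MODE
      value: "kv"         # cache-aware routing, fed by ZMQ KV cache events
  resources:
    claims:
      - name: xpu-claim   # hypothetical name; DRA ResourceClaimTemplate
                          # requesting a gpu.intel.com device
```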

disagg_xpu_cuda_rdma.yaml: Same mixed P/D topology but with high-performance RDMA KV transfer. Both the prefill (XPU, kv_buffer_device: xpu, ze_copy) and decode (CUDA, kv_buffer_device: cuda, cuda_copy) workers claim an RDMA NIC via ResourceClaimTemplate (rdma-dranet). UCX uses ib,rc transports for GPU-to-GPU KV movement without staging through CPU memory.
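Schematically, the RDMA variant differs from the TCP one in the transports and buffer devices on each side. Again, this is a hedged sketch assembled from the description above — the real manifest's keys and structure may differ, and the flag/claim names here are assumptions:

```yaml
# Sketch of the RDMA variant (illustrative; exact keys may differ).
PrefillWorker:              # Intel XPU side: kv_buffer_device=xpu, ze_copy
  envs:
    - name: UCX_TLS
      value: "ib,rc"        # InfiniBand/RoCE transports, no CPU staging
  resources:
    claims:
      - name: rdma-net-claim  # hypothetical; ResourceClaimTemplate (rdma-dranet)
DecodeWorker:               # NVIDIA CUDA side: kv_buffer_device=cuda, cuda_copy
  envs:
    - name: UCX_TLS
      value: "ib,rc"
  resources:
    claims:
      - name: rdma-net-claim
```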

README.md: Updated Section 6 (Intel XPU deployments) with architecture descriptions for both new templates.

Where should the reviewer start?

examples/backends/vllm/deploy/disagg_xpu_cuda_rdma.yaml — the more complex new template; verify the dual ResourceClaimTemplate setup (xpu-template + rdma-net-template) and the UCX/NIXL env vars on each worker

examples/backends/vllm/deploy/disagg_xpu_cuda.yaml — simpler variant

examples/backends/vllm/deploy/README.md — verify the new architecture descriptions, prerequisites, and links are accurate

Summary by CodeRabbit

  • Documentation

    • Expanded Intel XPU deployment guide with additional configuration templates and verification steps for mixed workload deployments.
  • New Features

    • Added two new Kubernetes deployment configurations supporting mixed XPU and CUDA environments with TCP and RDMA network variants.

…r and RDMA

Signed-off-by: pallavi jaini <pallavi.jaini@intel.com>
@pallavijaini0525 pallavijaini0525 requested a review from a team as a code owner March 30, 2026 00:08
@copy-pr-bot

copy-pr-bot Bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

👋 Hi pallavijaini0525! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added external-contribution Pull request is from an external contributor documentation Improvements or additions to documentation backend::vllm Relates to the vllm backend labels Mar 30, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Mar 30, 2026

Walkthrough

This pull request introduces two new Kubernetes deployment manifests for mixed XPU and CUDA disaggregated inference configurations and updates the deployment documentation to describe these new architecture options and their usage.

Changes

Documentation Update — examples/backends/vllm/deploy/README.md
Expanded Intel XPU deployment documentation to include new mixed XPU+CUDA disaggregated templates (disagg_xpu_cuda.yaml, disagg_xpu_cuda_rdma.yaml) alongside existing DRA-based templates. Added descriptions of four architecture variants and new run/verification subsections for both TCP and RDMA-based mixed deployments.

Mixed XPU + CUDA Disaggregated Deployments — examples/backends/vllm/deploy/disagg_xpu_cuda.yaml, examples/backends/vllm/deploy/disagg_xpu_cuda_rdma.yaml
New Kubernetes manifests defining multi-component DynamoGraphDeployment configurations. The TCP variant uses ZMQ KV event publishing and KV-aware routing for XPU prefill + CUDA decode. The RDMA variant adds RDMA network resource claims, elevated Linux capabilities, and RDMA/NIXL fabric configuration for the same XPU/CUDA disaggregated split.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Description check — ✅ Passed: The PR description comprehensively covers all template sections: Overview clearly states the purpose, Details explains both new files and README updates with technical specifics, and Where to Start provides focused reviewer guidance on file review order.
Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check — ✅ Passed: The title accurately and specifically describes the main change: adding P/D disaggregated deployment examples for XPU+CUDA with host buffer and RDMA configurations.


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/backends/vllm/deploy/disagg_xpu_cuda_rdma.yaml`:
- Around line 72-80: The manifest's securityContext.capabilities currently adds
IPC_LOCK, SYS_PTRACE, NET_ADMIN, NET_RAW, SYS_ADMIN, and SYS_RESOURCE which
violate Kubernetes Pod Security standards; either prune this list to only the
capabilities actually required by your RDMA/NIXL workload (e.g., determine
minimal set and remove NET_ADMIN, SYS_ADMIN, SYS_PTRACE, SYS_RESOURCE, IPC_LOCK,
NET_RAW unless proven necessary) in the deployment YAML, or add a clear Pod
Security Policy note in the example README explaining that the template requires
a relaxed namespace pod-security label (e.g., pod-security.kubernetes.io/enforce
set to baseline/unrestricted or a cluster exception) and document the exact
capabilities and why each is needed; update the securityContext.capabilities
block and README accordingly (search for securityContext and capabilities in the
YAML to locate both occurrence sites).
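One way to act on this finding is to prune the `capabilities` list down to what RDMA memory registration actually requires. The block below is a sketch of what that might look like if only `IPC_LOCK` turns out to be needed — the true minimal set must still be verified against the RDMA/NIXL workload, and any capability retained should be documented in the README:

```yaml
# Sketch of a pruned securityContext (assumes only IPC_LOCK is required;
# verify against the actual RDMA/NIXL workload before adopting).
securityContext:
  capabilities:
    add:
      - IPC_LOCK   # pin memory pages for RDMA buffer registration
    # Re-add NET_ADMIN, SYS_ADMIN, SYS_PTRACE, SYS_RESOURCE, NET_RAW only
    # if a concrete failure shows they are needed, and document why.
```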

In `@examples/backends/vllm/deploy/disagg_xpu_cuda.yaml`:
- Around line 82-85: The PrefillWorker block places requests at the service root
instead of under resources, so the ephemeral-storage request is ignored; move
the existing requests: ephemeral-storage: "2Gi" into a resources: requests:
subtree for PrefillWorker (matching how DecodeWorker and the PrefillWorker in
disagg_xpu_cuda_rdma.yaml are defined), ensuring indentation and nesting are
correct so resources.requests.ephemeral-storage is applied.
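The fix the comment asks for is purely a nesting change. Schematically (indentation illustrative, surrounding fields omitted):

```yaml
# Before: requests sits at the service root, so it is ignored.
PrefillWorker:
  requests:
    ephemeral-storage: "2Gi"

# After: requests nested under resources, so the scheduler applies it.
PrefillWorker:
  resources:
    requests:
      ephemeral-storage: "2Gi"
```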
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01339973-c570-4bbc-a1b3-f453233274ed

📥 Commits

Reviewing files that changed from the base of the PR and between 98d0ce9 and cc3a017.

📒 Files selected for processing (3)
  • examples/backends/vllm/deploy/README.md
  • examples/backends/vllm/deploy/disagg_xpu_cuda.yaml
  • examples/backends/vllm/deploy/disagg_xpu_cuda_rdma.yaml

Signed-off-by: pallavi jaini <pallavi.jaini@intel.com>
Signed-off-by: pallavi jaini <pallavi.jaini@intel.com>
@pallavijaini0525 pallavijaini0525 changed the title Added the PD disaggregation examples for XPU and CUDA using hostbuffe… feat: add P/D disaggregated examples for XPU+CUDA with host buffer and RDMA Mar 30, 2026
@github-actions github-actions Bot added the feat label Mar 30, 2026
@github-actions github-actions Bot added the xpu label Apr 8, 2026

Labels

backend::vllm Relates to the vllm backend documentation Improvements or additions to documentation external-contribution Pull request is from an external contributor feat size/L xpu
