feat: add P/D disaggregated examples for XPU+CUDA with host buffer and RDMA #7673
pallavijaini0525 wants to merge 7 commits into ai-dynamo:main
Conversation
…r and RDMA
Signed-off-by: pallavi jaini <pallavi.jaini@intel.com>
Walkthrough
This pull request introduces two new Kubernetes deployment manifests for mixed XPU and CUDA disaggregated inference configurations, and updates the deployment documentation to describe these new architecture options and their usage.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/backends/vllm/deploy/disagg_xpu_cuda_rdma.yaml`:
- Around lines 72-80: The manifest's securityContext.capabilities adds IPC_LOCK, SYS_PTRACE, NET_ADMIN, NET_RAW, SYS_ADMIN, and SYS_RESOURCE, which violates the Kubernetes Pod Security Standards. Either prune the list to only the capabilities actually required by the RDMA/NIXL workload (determine the minimal set and remove NET_ADMIN, SYS_ADMIN, SYS_PTRACE, SYS_RESOURCE, NET_RAW, and even IPC_LOCK unless proven necessary), or add a clear Pod Security note to the example README explaining that the template requires a relaxed namespace pod-security label (e.g., pod-security.kubernetes.io/enforce set to baseline or privileged, or a cluster exception) and documenting exactly which capabilities are needed and why. Update the securityContext.capabilities block and the README accordingly (search the YAML for securityContext and capabilities to locate both occurrence sites).
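The pruning the comment suggests could look like the sketch below. This is illustrative only: the truly minimal capability set depends on the RDMA stack in use, but IPC_LOCK is commonly needed so UCX/NIXL can pin (mlock) RDMA-registered memory, while the others are frequently droppable.

```yaml
# Sketch of a least-privilege securityContext for the worker container.
# Assumption: only IPC_LOCK is needed for pinning registered KV-cache
# buffers; re-add other capabilities only if the workload demonstrably
# fails without them.
securityContext:
  capabilities:
    drop:
      - ALL        # start from least privilege
    add:
      - IPC_LOCK   # allow mlock of RDMA-registered memory
```

With only IPC_LOCK added, the pod can typically still run under the baseline Pod Security level, avoiding the need for a privileged namespace.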
In `@examples/backends/vllm/deploy/disagg_xpu_cuda.yaml`:
- Around lines 82-85: The PrefillWorker block places requests: at the service root instead of under resources:, so the ephemeral-storage request is ignored. Move the existing requests: ephemeral-storage: "2Gi" into a resources: requests: subtree for PrefillWorker (matching how DecodeWorker here, and PrefillWorker in disagg_xpu_cuda_rdma.yaml, are defined), with correct indentation and nesting so that resources.requests.ephemeral-storage is actually applied.
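The nesting fix described above can be sketched as follows (the surrounding worker-block layout is illustrative of the manifest, not copied from it):

```yaml
# Before (ignored): requests sits at the service root, so Kubernetes
# never interprets it as a resource request.
PrefillWorker:
  requests:
    ephemeral-storage: "2Gi"

# After (applied): nested under resources.requests, matching DecodeWorker
# and the PrefillWorker in disagg_xpu_cuda_rdma.yaml.
PrefillWorker:
  resources:
    requests:
      ephemeral-storage: "2Gi"
```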
📒 Files selected for processing (3)
examples/backends/vllm/deploy/README.md
examples/backends/vllm/deploy/disagg_xpu_cuda.yaml
examples/backends/vllm/deploy/disagg_xpu_cuda_rdma.yaml
Signed-off-by: pallavi jaini <pallavi.jaini@intel.com>
Overview:
Adds two new Kubernetes deployment examples for mixed-device disaggregated prefill/decode (P/D) serving: an Intel XPU prefill worker paired with an NVIDIA CUDA GPU decode worker. Two variants are provided: one using a CPU host buffer over TCP, and one using direct GPU-to-GPU KV cache transfer over an InfiniBand/RoCE RDMA fabric via NIXL.
Details:
disagg_xpu_cuda.yaml: Mixed P/D disaggregated deployment with Intel XPU prefill and NVIDIA CUDA decode. KV cache is staged through a CPU host buffer and transferred over TCP (UCX_TLS: tcp). KV cache events are published over ZMQ for cache-aware routing (DYN_ROUTER_MODE=kv). XPU device is allocated via Kubernetes DRA (ResourceClaimTemplate with gpu.intel.com).
disagg_xpu_cuda_rdma.yaml: Same mixed P/D topology but with high-performance RDMA KV transfer. Both the prefill (XPU, kv_buffer_device: xpu, ze_copy) and decode (CUDA, kv_buffer_device: cuda, cuda_copy) workers claim an RDMA NIC via ResourceClaimTemplate (rdma-dranet). UCX uses ib,rc transports for GPU-to-GPU KV movement without staging through CPU memory.
README.md: Updated Section 6 (Intel XPU deployments) with architecture descriptions for both new templates.
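The key difference between the two templates described above is how the KV cache moves between workers. A condensed sketch of the relevant worker environment variables follows; it is illustrative only, and the exact keys, values, and container layout should be taken from the actual manifests:

```yaml
# disagg_xpu_cuda.yaml (host-buffer variant): KV cache is staged in CPU
# memory and moved over TCP; KV cache events are published for
# cache-aware routing.
env:
  - name: UCX_TLS
    value: "tcp"
  - name: DYN_ROUTER_MODE
    value: "kv"

# disagg_xpu_cuda_rdma.yaml (RDMA variant): direct GPU-to-GPU KV movement
# over InfiniBand/RoCE with no CPU staging; each worker additionally
# claims an RDMA NIC via a ResourceClaimTemplate (rdma-dranet) and sets
# kv_buffer_device (xpu or cuda) in its NIXL configuration.
env:
  - name: UCX_TLS
    value: "ib,rc"
```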
Where should the reviewer start?
examples/backends/vllm/deploy/disagg_xpu_cuda_rdma.yaml — the more complex new template; verify the dual ResourceClaimTemplate setup (xpu-template + rdma-net-template) and the UCX/NIXL env vars on each worker
examples/backends/vllm/deploy/disagg_xpu_cuda.yaml — simpler variant
examples/backends/vllm/deploy/README.md — verify the new architecture descriptions, prerequisites, and links are accurate