feat: add XPU container run script #7669
ZhengHongming888 wants to merge 3 commits into ai-dynamo:main
Conversation
👋 Hi ZhengHongming888! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough
A new executable Bash script is added at `container/run_xpu.sh`.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
container/run_xpu.sh (2)
229-239: Minor inconsistency with run.sh mount path.

The /mnt mount on line 232 lacks the trailing slash that run.sh uses (/mnt/:/mnt). While functionally equivalent for directories, consider aligning for consistency across container run scripts.

```diff
 VOLUME_MOUNTS+=" -v ${SOURCE_DIR}/..:/workspace "
 VOLUME_MOUNTS+=" -v /tmp:/tmp "
-VOLUME_MOUNTS+=" -v /mnt:/mnt "
+VOLUME_MOUNTS+=" -v /mnt/:/mnt "
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@container/run_xpu.sh` around lines 229 - 239, In the MOUNT_WORKSPACE block where VOLUME_MOUNTS is assembled (conditioned by MOUNT_WORKSPACE and using SOURCE_DIR), change the mount entry currently added as "-v /mnt:/mnt" to include the trailing slash so it becomes "-v /mnt/:/mnt" to match the style used in run.sh; update the VOLUME_MOUNTS concatenation that references /mnt so the variable and surrounding concatenations (VOLUME_MOUNTS, MOUNT_WORKSPACE, SOURCE_DIR) remain consistent.
281-288: Consider multi-XPU environments.

The script only checks /dev/dri/renderD128, but systems with multiple Intel XPUs will have additional render nodes (renderD129, renderD130, etc.). While the GID is typically the same for all render devices (the render group), consider:

- Verifying /dev/dri exists rather than a specific device
- Documenting this assumption in the help text

```diff
 # XPU specific: Add DRI device group for GPU access
-if [ -e /dev/dri/renderD128 ]; then
-    DRI_GROUP=$(stat -c '%g' /dev/dri/renderD128)
+# Get GID from any render device (all share the same group)
+RENDER_DEVICE=$(ls /dev/dri/renderD* 2>/dev/null | head -n1)
+if [ -n "$RENDER_DEVICE" ]; then
+    DRI_GROUP=$(stat -c '%g' "$RENDER_DEVICE")
     GROUP_ADD_STRING="--group-add ${DRI_GROUP}"
 else
     GROUP_ADD_STRING=""
-    echo "Warning: /dev/dri/renderD128 not found. XPU access may not work properly."
+    echo "Warning: No /dev/dri/renderD* devices found. XPU access may not work properly."
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@container/run_xpu.sh` around lines 281 - 288, The check for a single device node /dev/dri/renderD128 is fragile in multi-XPU setups; update the logic around DRI_GROUP and GROUP_ADD_STRING to first verify /dev/dri exists and then derive the group ID from any renderD* entry (e.g., find the first /dev/dri/renderD*), and update the help/warning text to document the assumption that all render devices share the same GID; keep using DRI_GROUP and GROUP_ADD_STRING but replace the hardcoded path check with a /dev/dri presence check and a probe for renderD* to obtain the GID.
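The review's underlying assumption, that every /dev/dri/renderD* node shares one GID, can be exercised with a small standalone probe. The sketch below follows the variable names from the review comment (RENDER_DEVICE, DRI_GROUP, GROUP_ADD_STRING) but is illustrative only, not the script's actual code; the empty-string fallback for machines without render nodes is an assumption:

```shell
#!/usr/bin/env bash
# Sketch: derive the --group-add value from the first render node found,
# instead of hardcoding /dev/dri/renderD128. Falls back to an empty
# string when no render node exists (e.g. a machine without a GPU).

RENDER_DEVICE=$(ls /dev/dri/renderD* 2>/dev/null | head -n1)
if [ -n "${RENDER_DEVICE}" ]; then
    DRI_GROUP=$(stat -c '%g' "${RENDER_DEVICE}")
    GROUP_ADD_STRING="--group-add ${DRI_GROUP}"
else
    GROUP_ADD_STRING=""
fi
echo "GROUP_ADD_STRING=${GROUP_ADD_STRING}"
```

On a multi-XPU host this picks whichever render node sorts first, which is sufficient because the GID comes from the shared render group rather than from any one device.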
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@container/run_xpu.sh`:
- Around line 352-372: The docker run command invoked by ${RUN_PREFIX} needs
explicit XPU device passthrough when running non-privileged (i.e. when
${PRIVILEGED_STRING} indicates FALSE); update the script to add device flags
(for example --device /dev/dri --device /dev/dri/renderD128 or a new
${DEVICE_STRING} variable) into the docker run invocation path that is used when
PRIVILEGED_STRING is not set, so the container retains access to /dev/dri
devices; alternatively, if non-privileged XPU is unsupported, remove the
--privileged option or update the help text accordingly.
---
Nitpick comments:
In `@container/run_xpu.sh`:
- Around line 229-239: In the MOUNT_WORKSPACE block where VOLUME_MOUNTS is
assembled (conditioned by MOUNT_WORKSPACE and using SOURCE_DIR), change the
mount entry currently added as "-v /mnt:/mnt" to include the trailing slash so
it becomes "-v /mnt/:/mnt" to match the style used in run.sh; update the
VOLUME_MOUNTS concatenation that references /mnt so the variable and surrounding
concatenations (VOLUME_MOUNTS, MOUNT_WORKSPACE, SOURCE_DIR) remain consistent.
- Around line 281-288: The check for a single device node /dev/dri/renderD128 is
fragile in multi-XPU setups; update the logic around DRI_GROUP and
GROUP_ADD_STRING to first verify /dev/dri exists and then derive the group ID
from any renderD* entry (e.g., find the first /dev/dri/renderD*), and update the
help/warning text to document the assumption that all render devices share the
same GID; keep using DRI_GROUP and GROUP_ADD_STRING but replace the hardcoded
path check with a /dev/dri presence check and a probe for renderD* to obtain the
GID.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bf5a77e0-1ae4-4a79-aeee-ac53125f0bd2
📒 Files selected for processing (1)
container/run_xpu.sh
```shell
${RUN_PREFIX} docker run \
    ${INTERACTIVE} \
    ${RM_STRING} \
    ${PRIVILEGED_STRING} \
    ${GROUP_ADD_STRING} \
    --network "$NETWORK" \
    --shm-size=10G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    ${ENVIRONMENT_VARIABLES} \
    ${VOLUME_MOUNTS} \
    ${PORT_MAPPINGS} \
    -w "$WORKDIR" \
    --cap-add CAP_SYS_PTRACE \
    --ipc host \
    ${USER_STRING} \
    ${NAME_STRING} \
    ${ENTRYPOINT_STRING} \
    ${IMAGE} \
    "${REMAINING_ARGS[@]}"
```
Missing XPU device passthrough for non-privileged mode.
When --privileged FALSE is specified, the container loses access to XPU devices because there's no --device flag to explicitly pass /dev/dri. The --group-add provides permission but not device visibility.
Consider adding device passthrough when not running privileged:

```diff
+# XPU device passthrough (only needed when not privileged)
+if [[ ${PRIVILEGED^^} == "FALSE" ]] && [ -d /dev/dri ]; then
+    DEVICE_STRING="--device=/dev/dri"
+else
+    DEVICE_STRING=""
+fi
+
 ${RUN_PREFIX} docker run \
+    ${DEVICE_STRING} \
     ${INTERACTIVE} \
     ${RM_STRING} \
```

Alternatively, if non-privileged XPU mode isn't intended to be supported, consider either removing the --privileged option or documenting this limitation in the help text.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@container/run_xpu.sh` around lines 352 - 372, The docker run command invoked
by ${RUN_PREFIX} needs explicit XPU device passthrough when running
non-privileged (i.e. when ${PRIVILEGED_STRING} indicates FALSE); update the
script to add device flags (for example --device /dev/dri --device
/dev/dri/renderD128 or a new ${DEVICE_STRING} variable) into the docker run
invocation path that is used when PRIVILEGED_STRING is not set, so the container
retains access to /dev/dri devices; alternatively, if non-privileged XPU is
unsupported, remove the --privileged option or update the help text accordingly.
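The suggested fix can be factored into a small helper for clarity. This is a hypothetical sketch: the function name `device_string` and the extra directory parameter are inventions for illustration and testability; the real script would simply check /dev/dri inline:

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute the docker --device flag. Passing the
# device directory as a parameter lets the logic be exercised even on
# machines without /dev/dri.
device_string() {
    local privileged="$1"          # "TRUE" or "FALSE" (case-insensitive)
    local dri_dir="${2:-/dev/dri}" # directory to pass through
    if [[ "${privileged^^}" == "FALSE" && -d "${dri_dir}" ]]; then
        echo "--device=${dri_dir}"
    else
        # A privileged container already sees all host devices, so no
        # flag is needed; likewise when the directory does not exist.
        echo ""
    fi
}

device_string TRUE
```

The result would then be spliced into the docker run invocation alongside ${GROUP_ADD_STRING}: the group grants permission, while --device grants visibility.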
Any reason not to reuse the existing logic from run.sh instead of creating an xpu-specific script? https://github.com/ai-dynamo/dynamo/blob/main/container/run.sh
In the beginning I added the xpu modifications to run.sh as well, but later found that many parameters differ, which changed the original (cuda) run.sh quite a bit. If more devices are added later, that one file will get a little complicated. But of course, if you think it's necessary to keep everything together in one file, I'm OK with changing it. :-) The same choice comes up in other places like the dockerfile and examples, which currently have different preferences: dockerfile templating uses one file for multiple devices, while the examples use separate per-device folders. Either one is OK for me.
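As a rough illustration of the single-script alternative being discussed, one run.sh could branch on a device argument. The flag values below are assumptions made for the sketch, not the actual run.sh logic:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: one script serving multiple device types by
# mapping a device name to the docker flags it needs.
device_flags() {
    case "${1:-cuda}" in
        cuda) echo "--gpus all" ;;          # NVIDIA path (assumed)
        xpu)  echo "--device=/dev/dri" ;;   # Intel XPU path (assumed)
        *)    echo "unsupported device: $1" >&2; return 1 ;;
    esac
}

device_flags cuda
device_flags xpu
```

The trade-off is exactly the one described above: one dispatch point versus one file per device, each accumulating complexity in a different place.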
nv-tusharma
left a comment
Changes LGTM, please follow up on pre-merge failures
Force-pushed from bdad23e to 83d5b3b
Add run_xpu.sh script for launching Dynamo containers with Intel XPU support. The script handles XPU-specific device access, DRI group management, and provides flexible container configuration options. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Add run_xpu.sh to the ignore filter to resolve pre-merge CI error about uncovered files. This script is a development helper like run.sh and doesn't require CI testing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Force-pushed from 507e118 to 50bf408
fixed! thanks.
Overview:
This PR is to add run_xpu.sh script for launching Dynamo containers with Intel XPU support. The script handles XPU-specific device access, DRI group management, and provides flexible container configuration options.
Details:
Generate dockerfile for xpu:

```shell
python container/render.py --framework vllm --target local-dev --device xpu --output-short-filename
```

Docker build:

```shell
docker build --progress=plain --target local-dev -t dynamo:latest-vllm17.1-local-dev6 -f container/rendered.Dockerfile --build-arg USER_UID=$(id -u) --build-arg USER_GID=$(id -g) --build-arg http_proxy=... --build-arg https_proxy=... --build-arg no_proxy=localhost,127.0.0.1,0.0.0.0 .
```

Create container:

```shell
container/run_xpu.sh --mount-workspace --image dynamo:latest-vllm17.1-local-dev -it
```

For the local-dev container you still need to install dynamo inside the container:

```shell
cargo build --features dynamo-llm/block-manager && cd /workspace/lib/bindings/python && maturin develop --uv && cd /workspace && uv pip install --no-deps -e /workspace
```

With run_xpu.sh this simplifies the steps for creating the xpu container and is also aligned with the Dynamo templating style.

Tested with run_xpu.sh it works fine.
Thanks.
Summary by CodeRabbit
Release Notes