Support Data Center precompiled driver container for Arm (Ubuntu 24.04) #533
  - name: Set up Holodeck
-   uses: NVIDIA/holodeck@v0.2.18
+   uses: NVIDIA/holodeck@main
I will update it and specify the actual version once @ArangoGutierrez releases the new version of Holodeck.
Pull request overview
This pull request adds ARM64 (aarch64) platform support to the Ubuntu 24.04 precompiled driver container builds, while maintaining AMD64 as the default architecture. The changes enable multi-platform Docker builds and update the CI/CD pipeline to handle both architectures.
Changes:
- Added ARM64 platform support for Ubuntu 24.04 precompiled driver containers with architecture-specific package handling
- Updated CI workflow to build, test, and publish both AMD64 and ARM64 artifacts with platform-specific suffixes
- Modified Holodeck test infrastructure to support ARM64 instances (g5g.xlarge in us-west-2) and Ubuntu 24.04 OS specification
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ubuntu24.04/precompiled/nvidia-driver | Added conditional installation of libnvidia-fbc1 package (AMD64 only) |
| ubuntu24.04/precompiled/local-repo.sh | Added conditional downloads for ARM64-incompatible packages (linux-signatures-nvidia, libnvidia-fbc1) |
| ubuntu24.04/precompiled/Dockerfile | Made i386 architecture and CUDA repository URLs conditional based on target architecture |
| tests/scripts/findkernelversion.sh | Added optional PLATFORM_SUFFIX parameter for artifact matching and platform-specific manifest inspection |
| tests/scripts/ci-precompiled-helpers.sh | Added PLATFORM_SUFFIX parameter support for kernel version testing |
| tests/holodeck_ubuntu24.04.yaml | Removed file (merged into holodeck_ubuntu.yaml) |
| tests/holodeck_ubuntu.yaml | Removed hardcoded ingressIpRanges and AMI, added OS specification support |
| multi-arch.mk | Removed AMD64-only platform restriction for ubuntu24.04 builds |
| Makefile | Added DOCKER_BUILD_PLATFORM_OPTIONS to base image build targets |
| .github/workflows/precompiled.yaml | Added platform matrix dimension, platform-aware artifact naming, ARM64 e2e testing with appropriate instance types, and Holodeck version update |
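
The architecture-specific package handling summarized in the table can be sketched roughly as follows. This is only an illustration, not the code from the PR: the function name `extra_packages_for_arch` is hypothetical, the architecture values follow Docker's `amd64`/`arm64` naming, and the package names are written exactly as they appear in the summary (the real scripts may append kernel or driver versions).

```shell
# Hypothetical helper: print the extra packages to fetch for an architecture.
# Per the review summary, linux-signatures-nvidia and libnvidia-fbc1 are
# handled only on AMD64 because they are not available for ARM64.
extra_packages_for_arch() {
  case "$1" in
    amd64)
      echo "linux-signatures-nvidia libnvidia-fbc1"
      ;;
    arm64)
      # Nothing extra to fetch on ARM64.
      echo ""
      ;;
    *)
      echo "unsupported architecture: $1" >&2
      return 1
      ;;
  esac
}

# Example: iterate over whatever the helper returns for the target arch.
for pkg in $(extra_packages_for_arch amd64); do
  echo "would download: ${pkg}"
done
```

Keeping the mapping in one function means the Dockerfile, `local-repo.sh`, and the `nvidia-driver` script could share a single source of truth for which packages are architecture-specific, assuming they can all source the same helper.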
  - name: Set up Holodeck
-   uses: NVIDIA/holodeck@v0.2.18
+   uses: NVIDIA/holodeck@main
The Holodeck version has been changed from a pinned version (v0.2.18) to @main, which is not a best practice for CI/CD workflows. Using @main introduces unpredictability as the main branch could contain breaking changes at any time. The other workflow file (.github/workflows/ci.yaml) uses NVIDIA/holodeck@v0.2.18. Consider using a specific pinned version or tag instead of @main for stability and reproducibility.
- uses: NVIDIA/holodeck@main
+ uses: NVIDIA/holodeck@v0.2.18
- dist: ubuntu24.04
  driver_branch: 535
- dist: ubuntu24.04
  driver_branch: 570
Refer to .common.yaml.
.github/workflows/precompiled.yaml
Outdated
BASE_IMAGE_TAG="${PRIVATE_REGISTRY}/nvidia/driver:base-${BASE_TARGET}-${LTS_KERNEL}-${KERNEL_FLAVOR}-${{ matrix.driver_branch }}"
docker tag ${BASE_IMAGE_TAG} ${BASE_IMAGE_TAG}-${{ env.PLATFORM_NAME }}
docker push "${BASE_IMAGE_TAG}-${{ env.PLATFORM_NAME }}"
docker buildx imagetools create -t "${BASE_IMAGE_TAG}" --append "${BASE_IMAGE_TAG}-${{ env.PLATFORM_NAME }}" || docker buildx imagetools create -t "${BASE_IMAGE_TAG}" "${BASE_IMAGE_TAG}-${{ env.PLATFORM_NAME }}"
Multi-arch support.
.github/workflows/precompiled.yaml
Outdated
DRIVER_IMAGE_TAG="${PRIVATE_REGISTRY}/nvidia/driver:${{ matrix.driver_branch }}-${{ env.KERNEL_VERSION }}"
docker tag ${DRIVER_IMAGE_TAG} ${DRIVER_IMAGE_TAG}-${{ env.PLATFORM_NAME }}
docker push "${DRIVER_IMAGE_TAG}-${{ env.PLATFORM_NAME }}"
docker buildx imagetools create -t "${DRIVER_IMAGE_TAG}" --append "${DRIVER_IMAGE_TAG}-${{ env.PLATFORM_NAME }}" || docker buildx imagetools create -t "${DRIVER_IMAGE_TAG}" "${DRIVER_IMAGE_TAG}-${{ env.PLATFORM_NAME }}"
Multi-arch support.
As mentioned in my earlier comment, let's do a single multiarch build instead of building individual platform-specific images and then merging them together.
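For reference, a single multi-arch build along the lines suggested here might look like the sketch below. It is an assumption about the shape of the command, not what the PR does: the function name `multiarch_build_cmd`, the example registry, the tag, and the `.` build context are all placeholders. The function only prints the `docker buildx build` invocation so it can be reviewed without a Docker daemon.

```shell
# Sketch: compose one multi-platform buildx invocation instead of pushing
# per-platform tags and merging them with `docker buildx imagetools create`.
# $1 = image tag (placeholder), $2 = optional comma-separated platform list.
multiarch_build_cmd() {
  image_tag="$1"
  platforms="${2:-linux/amd64,linux/arm64}"
  printf 'docker buildx build --platform %s --tag %s --push .\n' \
    "$platforms" "$image_tag"
}

# Print the command for review. Run it only on a host where a buildx builder
# with multi-platform support is configured.
multiarch_build_cmd "registry.example.com/nvidia/driver:example-tag"
```

With `--platform` listing both targets and `--push`, buildx publishes one manifest list under the tag, which is what removes the need for the tag, push, and `imagetools create --append` sequence quoted above.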
Note that we don't want separate images for Arm. For the precompiled driver packages that have Arm variants, we want to start building multi-arch images that support arm64 alongside amd64.
Yes, this PR already includes this feature. Confirmed with the command:
kernel_flavors: ${{ steps.extract_driver_branch.outputs.kernel_flavors }}
dist: ${{ steps.extract_driver_branch.outputs.dist }}
lts_kernel: ${{ steps.extract_driver_branch.outputs.lts_kernel }}
platforms: ${{ steps.extract_driver_branch.outputs.platforms }}
Is it possible to avoid expanding the matrix? Adding a new matrix column makes the CI manifests considerably more complex.
Let's look at alternatives, please.
Signed-off-by: Shiva Kumar (SW-CLOUD) <shivaku@nvidia.com>
@shivakunv On putting more thought into this, can we do a multi-arch build of the precompiled image instead of building the arm64 and amd64 images separately and then stitching them together?
DRIVER_BRANCHES=($(echo "$driver_branch_json" | jq -r '.[]'))
echo "DRIVER_BRANCHES=${DRIVER_BRANCHES[*]}" >> $GITHUB_ENV
- name: Set kernel version in holodeck_${{ env.DIST }}.yaml
- name: Configure Holodeck e2e test config (kernel, OS, instance)
This PR already has a large diff. Let's revisit the holodeck changes in a follow-up PR
These changes are required for arm64 (please check the yq replacement).
I will create a separate PR for holodeck changes that should be merged before this one, so that this PR will only include the arm64 changes.
Code Changes Summary:
Platform Support
- Added support for the ARM64 platform.
- AMD64 remains the default architecture.
Artifacts Update
- ARM64 build artifacts are now uploaded with the -arm64 suffix.
Instance Type and Region Mapping
- g4dn.xlarge: AMD64 architecture, supported region us-west-1, used for AMD64 builds.
- g5g.xlarge: ARM64 architecture, supported region us-west-2, used for ARM64 builds.
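The instance type and region mapping above could be expressed in a workflow step roughly like this. It is a minimal sketch: the function name `holodeck_instance_for_platform` and the `INSTANCE_TYPE`/`REGION` variable names are hypothetical, and only the instance types and regions come from the summary.

```shell
# Hypothetical lookup: platform -> "<instance type> <region>", using the
# values listed in the summary above.
holodeck_instance_for_platform() {
  case "$1" in
    amd64) echo "g4dn.xlarge us-west-1" ;;
    arm64) echo "g5g.xlarge us-west-2" ;;
    *)
      echo "unknown platform: $1" >&2
      return 1
      ;;
  esac
}

# Example: split the result into two variables with POSIX parameter expansion.
mapping="$(holodeck_instance_for_platform arm64)"
INSTANCE_TYPE="${mapping% *}"   # everything before the space
REGION="${mapping#* }"          # everything after the space
echo "instance=${INSTANCE_TYPE} region=${REGION}"
```

The resulting values could then be written into the Holodeck config (for example with `yq`, as mentioned in the review thread) before the e2e test starts.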
Passing pipeline run: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22180871853
Passing pipeline run: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22337833186