Skip to content

[CICD] Add auto build and push CUDA Docker images to Harbor pipeline#1143

Merged
Darryl233 merged 41 commits intoflagos-ai:mainfrom
zihugithub:auto-build-push-image-to-harbor
Apr 22, 2026
Merged

[CICD] Add auto build and push CUDA Docker images to Harbor pipeline#1143
Darryl233 merged 41 commits intoflagos-ai:mainfrom
zihugithub:auto-build-push-image-to-harbor

Conversation

@zihugithub
Copy link
Copy Markdown
Collaborator

PR Category

CICD

PR Types

New Features

PR Description

  • Add build_image_cuda.yml for building CUDA Docker images on PR and push
  • Add push_image_harbor.yml for promoting validated images to Harbor registry
  • Update all_tests_cuda.yml to chain build -> test -> push workflow
  • Upgrade vllm from 0.11.0 to 0.13.0 (cu128)

@zihugithub zihugithub requested a review from aoyulong as a code owner March 10, 2026 09:30
Comment thread .github/workflows/all_tests_cuda.yml Outdated
name: cuda_tests

on:
# Trigger after Build Docker Images - CUDA succeeds
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request to main will trigger both Build Docker Images and all_tests_cuda?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please ignore docs as well.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please ignore docs as well

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please ignore docs as well

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please ignore docs as well

xmhubj
xmhubj previously approved these changes Apr 16, 2026
@Darryl233
Copy link
Copy Markdown
Collaborator

@claude review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Apr 20, 2026

Claude finished @Darryl233's task in 4m 20s —— View job


PR Review: Auto Build and Push CUDA Docker Images

Todo List

  • Read repository guidelines (CLAUDE.md)
  • Review new workflow: build_image_cuda.yml
  • Review new workflow: push_image_harbor.yml
  • Review modified workflows
  • Review Dockerfile changes
  • Analyze security considerations
  • Post comprehensive review feedback

Summary

This PR introduces a solid build → test → push pipeline for CUDA Docker images. The overall architecture is well-thought-out. Below are findings ranging from blocking issues to minor suggestions.


Issues

1. push_image_harbor.yml — tar-based promotion is fragile and racey

push_image_harbor.yml:40-42 detects images by scanning a shared directory (/home/flagscale_cicd/images_tar) for the latest tar file by name pattern. This approach has two problems:

  • Race condition: if two PRs build concurrently, tail -1 (sorted by name) may pick up a tar from a different run. There's no run-ID or SHA in the detection logic.
  • Stale tars: if a previous run's tar was not cleaned up (e.g. due to a failure), it will be promoted on the next push to main, even if it's outdated.

The build_image_cuda.yml already outputs exact tar paths (train_tar, inference_tar, all_tar) via job outputs. Consider passing those directly to push_image_harbor.yml via workflow_call inputs instead of re-scanning the directory. Fix this →

2. push_image_harbor.yml:122,144 — inconsistent sudo on rm

promote_train uses sudo rm -f (line 100) but promote_inference and promote_all use plain rm -f (lines 122, 144). If the tar was written by a privileged process (as in build_image_cuda.yml:255 which uses sudo docker save), the plain rm will silently fail, leaving stale tars on disk. Fix this →

3. build_image_cuda.yml:305load_images uses docker push without sudo

The image was loaded with sudo docker load (line 304) but pushed with plain docker push (line 305). On self-hosted runners where Docker requires root, this will fail. Same issue on lines 315-316. Fix this →

4. push_image_harbor.ymlcontents: write permission is overly broad

push_image_harbor.yml:14 sets permissions: contents: write at the workflow level. Only the update_config job needs write access. The promote and cleanup jobs don't need it. Scope permissions to the job level to follow least-privilege. Fix this →


Observations

5. build_image_cuda.ymlall image tar is built but never loaded

load_images job (lines 298-321) loads and pushes train and inference tars, but skips the all image entirely. The all_tag and all_tar outputs are exported but unused. If the all image is needed for tests or promotion, this is a gap. If it's intentionally excluded, the dead outputs should be removed.

6. all_tests_cuda.yml:58always() + condition may allow skipped tests to pass silently

if: always() && (github.event_name == 'push' || needs.check_docker_changes.outputs.has_docker_changes != 'true')

When check_docker_changes is skipped (e.g. on push events), needs.check_docker_changes.outputs.has_docker_changes evaluates to '', which is != 'true', so run_tests always runs on push. This is correct, but the logic is subtle and could confuse future maintainers. A comment explaining this would help.

7. push_image_harbor.yml — no trigger coordination with build_image_cuda.yml

push_image_harbor.yml triggers on push to main with the same path filters as build_image_cuda.yml on PRs. But there's no explicit dependency between the two workflows — push_image_harbor.yml assumes the tar files from the PR build are still present on the runner. If the runner is different or the PR build ran on a different machine, the tars won't be there and needs_promotion will be false, silently skipping the push. This is worth documenting explicitly.

8. Dockerfile.trainrelease stage re-declares env stage logic

The release stage (line 206+) duplicates the venv creation guard from the env stage rather than inheriting from deps. This means release doesn't benefit from the cached dependency layer in deps. Consider whether release should build FROM deps AS release with --no-dev flag instead.

9. build_image_cuda.yml:109 — matrix injection via string interpolation

echo 'matrix={"task":["${{ inputs.task }}"]}' >> $GITHUB_OUTPUT

inputs.task is constrained to a choice type, so injection is not possible here. But it's worth noting this pattern is fragile if the input type ever changes to string.


Minor

  • push_image_harbor.yml:87,110,133docker load output format (Loaded image: <tag>) can vary between Docker versions. A more robust approach is to inspect the loaded image ID directly rather than parsing stdout.
  • build_image_cuda.yml:52 — the default runs_on and container_volumes in workflow_dispatch inputs are hardcoded strings that duplicate the self-hosted runner labels. These could drift from the actual runner config in .github/configs/cuda.yml.
  • all_tests_ascend.yml and all_tests_metax.yml now ignore build_image_cuda.yml and push_image_harbor.yml in paths-ignore, which is correct since those workflows are CUDA-specific.

Overall this is a well-structured pipeline. The main concerns are the tar-based promotion race condition (#1), the inconsistent sudo usage (#2, #3), and the overly broad permissions (#4).

Copy link
Copy Markdown
Collaborator

@Darryl233 Darryl233 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Darryl233 Darryl233 merged commit f87324e into flagos-ai:main Apr 22, 2026
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants