[CICD] Add auto build and push CUDA Docker images to Harbor pipeline by zihugithub · Pull Request #1143 · flagos-ai/FlagScale

zihugithub · 2026-03-10T09:30:53Z

PR Category

CICD

PR Types

New Features

PR Description

Add build_image_cuda.yml for building CUDA Docker images on PR and push
Add push_image_harbor.yml for promoting validated images to Harbor registry
Update all_tests_cuda.yml to chain build -> test -> push workflow
Upgrade vllm from 0.11.0 to 0.13.0 (cu128)

Darryl233 · 2026-03-17T02:13:20Z

 name: cuda_tests

 on:
+  # Trigger after Build Docker Images - CUDA succeeds


Will a pull request to main trigger both Build Docker Images and all_tests_cuda?

xmhubj · 2026-04-16T03:05:01Z

please ignore docs as well.

xmhubj · 2026-04-16T03:05:43Z

please ignore docs as well

xmhubj · 2026-04-16T03:06:09Z

please ignore docs as well

xmhubj · 2026-04-16T03:07:06Z

please ignore docs as well

Darryl233 · 2026-04-20T11:37:05Z

@claude review this PR

claude · 2026-04-20T11:37:20Z

Claude finished @Darryl233's task in 4m 20s

PR Review: Auto Build and Push CUDA Docker Images

Todo List

Read repository guidelines (CLAUDE.md)
Review new workflow: build_image_cuda.yml
Review new workflow: push_image_harbor.yml
Review modified workflows
Review Dockerfile changes
Analyze security considerations
Post comprehensive review feedback

Summary

This PR introduces a solid build → test → push pipeline for CUDA Docker images. The overall architecture is well-thought-out. Below are findings ranging from blocking issues to minor suggestions.

Issues

1. push_image_harbor.yml - tar-based promotion is fragile and racy

push_image_harbor.yml:40-42 detects images by scanning a shared directory (/home/flagscale_cicd/images_tar) for the latest tar file by name pattern. This approach has two problems:

Race condition: if two PRs build concurrently, tail -1 (sorted by name) may pick up a tar from a different run. There's no run-ID or SHA in the detection logic.
Stale tars: if a previous run's tar was not cleaned up (e.g. due to a failure), it will be promoted on the next push to main, even if it's outdated.

The build_image_cuda.yml already outputs exact tar paths (train_tar, inference_tar, all_tar) via job outputs. Consider passing those directly to push_image_harbor.yml via workflow_call inputs instead of re-scanning the directory.
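To illustrate the difference, here is a minimal shell sketch. The `<kind>-<sha>.tar` naming scheme and both function names are hypothetical, since the actual tar file names are not shown in this review. The point is only that a lookup keyed on the commit cannot pick up another run's artifact, whereas a name-sorted scan can.

```shell
#!/usr/bin/env bash
# Sketch only: the "<kind>-<sha>.tar" naming scheme and these function
# names are hypothetical, not taken from the PR's workflows.

# Fragile: the last name in sort order wins, regardless of which run
# produced it.
latest_tar_by_name() {
  local dir="$1" kind="$2"
  ls "$dir"/"$kind"-*.tar 2>/dev/null | sort | tail -1
}

# Safer: resolve the one tar that belongs to a specific commit, or fail
# loudly instead of promoting something else.
tar_for_sha() {
  local dir="$1" kind="$2" sha="$3"
  local path="$dir/$kind-$sha.tar"
  if [ -f "$path" ]; then
    printf '%s\n' "$path"
  else
    echo "no $kind tar for commit $sha in $dir" >&2
    return 1
  fi
}
```

In a workflow, the commit value could come from `github.sha` or from the build job's outputs; either way the promotion step would stop with an error when the expected tar is missing rather than silently choosing a different one.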

2. push_image_harbor.yml:122,144 — inconsistent sudo on rm

promote_train uses sudo rm -f (line 100) but promote_inference and promote_all use plain rm -f (lines 122, 144). If the tar was written by a privileged process (as in build_image_cuda.yml:255 which uses sudo docker save), the plain rm will silently fail, leaving stale tars on disk.
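One way to keep the three promote jobs consistent is to route every cleanup through a single helper. The sketch below is an assumption about how that could look: `remove_tar` is a hypothetical name, and the fallback relies on the runner allowing non-interactive sudo, which the existing `sudo rm -f` in promote_train suggests but does not prove.

```shell
#!/usr/bin/env bash
# Sketch only: remove_tar is a hypothetical helper, not part of the PR.

# Remove a tar, falling back to non-interactive sudo when the plain rm
# fails, and report an error if the file is still present afterwards.
remove_tar() {
  local path="$1"
  rm -f -- "$path" 2>/dev/null || sudo -n rm -f -- "$path" || true
  if [ -e "$path" ]; then
    echo "failed to remove $path" >&2
    return 1
  fi
}
```

Because the helper checks that the file is really gone, a leftover tar turns into a visible step failure instead of a stale artifact waiting for the next run.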

3. build_image_cuda.yml:305 — load_images uses docker push without sudo

The image was loaded with sudo docker load (line 304) but pushed with plain docker push (line 305). On self-hosted runners where Docker requires root, this will fail. Same issue on lines 315-316.

4. push_image_harbor.yml — contents: write permission is overly broad

push_image_harbor.yml:14 sets permissions: contents: write at the workflow level. Only the update_config job needs write access. The promote and cleanup jobs don't need it. Scope permissions to the job level to follow least-privilege.

Observations

5. build_image_cuda.yml — all image tar is built but never loaded

load_images job (lines 298-321) loads and pushes train and inference tars, but skips the all image entirely. The all_tag and all_tar outputs are exported but unused. If the all image is needed for tests or promotion, this is a gap. If it's intentionally excluded, the dead outputs should be removed.

6. all_tests_cuda.yml:58 — always() + condition may allow skipped tests to pass silently

if: always() && (github.event_name == 'push' || needs.check_docker_changes.outputs.has_docker_changes != 'true')

When check_docker_changes is skipped (e.g. on push events), needs.check_docker_changes.outputs.has_docker_changes evaluates to '', which is != 'true', so run_tests always runs on push. This is correct, but the logic is subtle and could confuse future maintainers. A comment explaining this would help.
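The cases can be spelled out by modelling the condition as a small shell function. This is an illustration of the boolean logic only, not the Actions expression evaluator, and `should_run_tests` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: models the boolean part of the `if:` condition above.
# should_run_tests is a hypothetical name.

# Succeeds (exit 0) when run_tests would run.
#   $1: event name, e.g. "push" or "pull_request"
#   $2: has_docker_changes output; empty when check_docker_changes was skipped
should_run_tests() {
  local event_name="$1" has_docker_changes="$2"
  [ "$event_name" = "push" ] || [ "$has_docker_changes" != "true" ]
}
```

Under this model, tests are skipped in exactly one case: a non-push event where has_docker_changes is 'true'. An empty output from a skipped check_docker_changes job counts as "no Docker changes" and lets the tests run.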

7. push_image_harbor.yml — no trigger coordination with build_image_cuda.yml

push_image_harbor.yml triggers on push to main with the same path filters as build_image_cuda.yml on PRs. But there's no explicit dependency between the two workflows — push_image_harbor.yml assumes the tar files from the PR build are still present on the runner. If the runner is different or the PR build ran on a different machine, the tars won't be there and needs_promotion will be false, silently skipping the push. This is worth documenting explicitly.

8. Dockerfile.train — release stage re-declares env stage logic

The release stage (line 206+) duplicates the venv creation guard from the env stage rather than inheriting from deps. This means release doesn't benefit from the cached dependency layer in deps. Consider whether release should build FROM deps AS release with --no-dev flag instead.

9. build_image_cuda.yml:109 — matrix injection via string interpolation

echo 'matrix={"task":["${{ inputs.task }}"]}' >> $GITHUB_OUTPUT

inputs.task is constrained to a choice type, so injection is not possible here. But it's worth noting this pattern is fragile if the input type ever changes to string.
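A more defensive variant is sketched below. It assumes the valid tasks are train, inference, and all (the review names those three images, but the actual choice options are not shown here), and `build_matrix` is a hypothetical helper. The idea is to pass the input to the script through an environment variable and validate it before writing any JSON, so the script text never contains interpolated user input.

```shell
#!/usr/bin/env bash
# Sketch only: build_matrix and the allowed task list are assumptions.
# In the workflow step, TASK would be set via
#   env:
#     TASK: ${{ inputs.task }}
# and the step would run:  build_matrix "$TASK" >> "$GITHUB_OUTPUT"

# Emit the matrix output line for a known task, or fail for anything else.
build_matrix() {
  local task="$1"
  case "$task" in
    train|inference|all)
      printf 'matrix={"task":["%s"]}\n' "$task"
      ;;
    *)
      echo "unexpected task: $task" >&2
      return 1
      ;;
  esac
}
```

With the allowlist in place, the output stays well-formed even if the input type is later relaxed to a free-form string.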

Minor

push_image_harbor.yml:87,110,133 — docker load output format (Loaded image: <tag>) can vary between Docker versions. A more robust approach is to inspect the loaded image ID directly rather than parsing stdout.
build_image_cuda.yml:52 — the default runs_on and container_volumes in workflow_dispatch inputs are hardcoded strings that duplicate the self-hosted runner labels. These could drift from the actual runner config in .github/configs/cuda.yml.
all_tests_ascend.yml and all_tests_metax.yml now ignore build_image_cuda.yml and push_image_harbor.yml in paths-ignore, which is correct since those workflows are CUDA-specific.
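
For the first minor point, a tolerant parser is one intermediate step. The sketch below accepts both line formats that docker load is known to print, `Loaded image: <tag>` and `Loaded image ID: <id>`. Note that this is still stdout parsing; inspecting the image afterwards, as suggested above, would avoid parsing altogether. `parse_loaded_ref` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: parse_loaded_ref is a hypothetical helper.

# Read `docker load` output on stdin and print the last loaded reference,
# whether it was reported as a tag or as an image ID.
parse_loaded_ref() {
  sed -n -e 's/^Loaded image ID: //p' -e 's/^Loaded image: //p' | tail -1
}
```

It could be used as `ref="$(sudo docker load -i "$tar" | parse_loaded_ref)"`, followed by a check that `ref` is non-empty before tagging or pushing.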

Overall this is a well-structured pipeline. The main concerns are the tar-based promotion race condition (#1), the inconsistent sudo usage (#2, #3), and the overly broad permissions (#4).

Darryl233

LGTM

[CICD] Add auto build and push CUDA Docker images to Harbor pipeline

2314a21

zihugithub requested a review from aoyulong as a code owner March 10, 2026 09:30

zihugithub added 7 commits March 10, 2026 18:07

fix: code style adjustments in PR

6a6a947

Set safe directory

7843266

Clean workspace

1f895eb

Remove redundant clean workspace steps before checkout

903c520

fix(ci): use head_ref for PR checkout to avoid detached HEAD on push

84d54bc

fix git fetch failure in update-config job

753242a

Clean workspace

26c3a14

Darryl233 reviewed Mar 17, 2026

zihugithub and others added 20 commits March 17, 2026 10:56

add paths-ignore for test workflow and fix PR checkout in build workflow

bf71fbe

support fork PRs in build_image_cuda workflow

9cfa2e1

use FORK_PUSH_TOKEN for fork PR push and add token-help job

7310831

fix(ci): use stable image tags without timestamp for registry push

8b39e35

debug1

a510a24

Merge branch 'flagos-ai:main' into auto-build-push-image-to-harbor

7513472

re-enable Docker build/push and CUDA tests in build_image_cuda workflow

3492db7

add two-stage pipeline to support fork PR builds

03ee149

debug2

bc95289

Merge branch 'main' into auto-build-push-image-to-harbor

6c5a953

Merge branch 'flagos-ai:main' into auto-build-push-image-to-harbor

3095f1c

ci: refactor build_image_cuda workflow

02ffa31

Merge branch 'auto-build-push-image-to-harbor' of https://github.com/zihugithub/FlagScale into auto-build-push-image-to-harbor

f37da2e

ci: trigger push_image_harbor on build workflow success

56c30b1

debug3

63ae083

ci: add paths-ignore for ascend/metax, fix sudo docker load, support all-image promotion

cd02656

ci: pass proxy settings to docker build stages

3b99fa3

Merge branch 'main' into auto-build-push-image-to-harbor

cd777fb

ci: auto-detect proxy from runner environment for docker build

c4c3e10

Merge branch 'auto-build-push-image-to-harbor' of https://github.com/zihugithub/FlagScale into auto-build-push-image-to-harbor

b78b2f1

zihugithub and others added 6 commits April 15, 2026 10:52

ci: add runs_on parameter for custom runner selection

1a7a3bc

ci: add runs_on and container_volumes as overridable workflow inputs

fcbdbcc

ci: add runs_on and container_volumes as overridable workflow inputs

d6301a7

debug0

abcb13b

Merge branch 'main' into auto-build-push-image-to-harbor

cbf4723

ci: update Ascend volume paths and re-enable unit tests

747517c

xmhubj reviewed Apr 16, 2026

Comment thread .github/workflows/all_tests_ascend.yml

Collaborator

xmhubj Apr 16, 2026

please ignore docs as well.

xmhubj reviewed Apr 16, 2026

Comment thread .github/workflows/all_tests_common.yml

Collaborator

xmhubj Apr 16, 2026

please ignore docs as well

xmhubj reviewed Apr 16, 2026

Comment thread .github/workflows/all_tests_cuda.yml

Collaborator

xmhubj Apr 16, 2026

please ignore docs as well

xmhubj reviewed Apr 16, 2026

Comment thread .github/workflows/all_tests_metax.yml

Collaborator

xmhubj Apr 16, 2026

please ignore docs as well

Darryl233 added 2 commits April 16, 2026 14:38

ignore docs

852340d

syntax

264d68e

xmhubj previously approved these changes Apr 16, 2026

resolve conflict

b1fe283

Darryl233 dismissed xmhubj’s stale review via b1fe283 April 20, 2026 11:55

Darryl233 added 4 commits April 21, 2026 17:15

fix

d7eb059

fix

5c9eeb4

unify tar dir

9c3f0b4

cleanup

cd4e2d5

xmhubj approved these changes Apr 22, 2026

Darryl233 approved these changes Apr 22, 2026

Darryl233 merged commit f87324e into flagos-ai:main Apr 22, 2026
49 checks passed

Darryl233 merged 41 commits into flagos-ai:main from zihugithub:auto-build-push-image-to-harbor

3 participants