Feat-Infra: Add Cloud/RunPod Docker Support with Automated Builds #1569

Open

FNGarvin wants to merge 6 commits into deepbeepmeep:main from FNGarvin:feat-infra-ci

Conversation

@FNGarvin

Add Cloud/RunPod Docker Support with Automated Builds

Hiya, Deep. Long-time fan of your amazing project and the motivation behind it. I've been recommending it frequently for quite some time. It's amazing how much leading edge tech you've packed into it. Thanks.

Summary

This PR adds first-class container support for local or cloud deployment, including CUDA 13.0 Blackwell (sm_12x) support. It features an automated build pipeline that compiles and publishes architecture-specific SageAttention wheels and Blackwell NVFP4 kernels as release assets. The final container image is highly portable, bundling an SSH daemon and a web-based file manager, and is automatically published to GHCR on every commit. A step-by-step RunPod Deployment Guide is included to help users get started quickly. The two images (cu128 and cu130) both have the appropriate Nunchaku kernels baked in.

The changes are entirely additive — nothing in the existing codebase, install scripts,
or local workflow is touched. Users who run WanGP locally or via the existing
`run-docker-cuda-deb.sh` script are completely unaffected. Should the PR be accepted, an "official" container image will be available at `ghcr.io/deepbeepmeep/wan2gp`, always up-to-date with almost zero extra maintenance effort. Installing it will be as simple as `docker run --gpus all -p 7860:7860 -v /my/wangp_storage:/workspace ghcr.io/deepbeepmeep/wan2gp`.

I recognize that this PR seems like A LOT. But it basically boils down to merging, clicking a few action buttons to run the initial wheel builds, and from then on always having automatic Docker images. You can even use the image builds as a way of sanity-checking future commits or merges. I tried very hard to make this one come with no downsides.


Motivation

The existing Dockerfile is a solid foundation, but it has a few limitations that
make cloud deployment awkward:

  • It only targets Ampere GPUs (SM 8.0 and 8.6) by default. Ada Lovelace (RTX 40xx),
    Hopper (H100), and Blackwell (RTX 50xx) users get no pre-compiled CUDA kernels and
    fall back to slower paths.
  • The run-docker-cuda-deb.sh script (which has great GPU-detection logic) is designed
    for local use — it assumes the project is bind-mounted from the host. Cloud platforms
    like RunPod require the code to be copied into the image.
  • There is no automated build, so users must build the image themselves — a process
    that takes 30–60 minutes on a laptop due to the SageAttention CUDA compilation.

This PR fixes all three.


What Changed

New: Automated Build Workflow (.github/workflows/docker-build.yml)

A GitHub Actions workflow that automatically builds and publishes the Docker image to
the GitHub Container Registry (ghcr.io/deepbeepmeep/wan2gp) whenever code is pushed.

When it runs:

| Event | What happens |
| --- | --- |
| Push to `main` | Build the image and publish it to GHCR as `:latest` |
| Push to any feature branch | Build the image and publish it (useful for testing) |
| Pull request to `main` | Build only — no publish — acts as a safety check before merge |
| Changes to docs/markdown only | Skipped entirely — no point building for a README fix |

What this means in practice: Once merged, users can pull a pre-built, ready-to-run
image with a single command rather than compiling for an hour themselves:

docker run --gpus all -p 7860:7860 ghcr.io/deepbeepmeep/wan2gp

Modified: Dockerfile — Three-Stage Optimized Architecture

The Dockerfile has been refactored into a three-stage build to solve the "45-minute recompile" problem while ensuring maximum GPU compatibility.

Stage 1: base (Shared Foundation)
Contains the common environment (Ubuntu 24.04 + CUDA 12.8 + PyTorch 2.10.0+cu128 + uv installer). Both local and CI builds start here.

Stage 2: sage-tools (The Compiler — Robot Only)
A heavy-duty compiler stage used strictly by the automated GitHub "Sage Wheels" workflow. It compiles SageAttention for all four GPU generations. This stage is never run by the user or the primary CI build.

Stage 3: sage-compile (The Production Stage)
The actual production image. It downloads the three architecture-specific wheels produced by the wheel factory directly from your GitHub Releases. This keeps the primary build time under 10 minutes.

┌─────────────────────────────────────────────────────┐
│ 1. [base] Shared Environment (CUDA + PyTorch)       │
└──────────────────────────┬──────────────────────────┘
             ┌─────────────┴─────────────┐
             ▼                           ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ 2. [sage-tools] Compiler  │ │ 3. [sage-compile] Prod    │
│ (Runs once per version)   │ │ (Downloads Wheels)        │
│ ~45 mins NVCC             │ │ ~10 mins build time       │
└────────────┬──────────────┘ └───────────────────────────┘
             │
             └─► Pushes 3x specific .whl to GitHub Release
                 (Ampere/Ada, Hopper, Blackwell)

Native coverage expanded from Ampere to Blackwell:

| Architecture | GPU Examples | Before | After |
| --- | --- | --- | --- |
| Ampere (SM 8.0/8.6) | A100, RTX 3090 | ✅ Native | ✅ Native |
| Ada Lovelace (SM 8.9) | RTX 4090 | ⚠️ Fallback | ✅ Native |
| Hopper (SM 9.0) | H100, H200 | ⚠️ Fallback | ✅ Native |
| Blackwell (SM 10.0/12.0) | RTX 5090 | ⚠️ Fallback | ✅ Native (CUDA 13.0) |
  • Dual CUDA Environments: Optimized support for both stable CUDA 12.8 (Default) and bleeding-edge CUDA 13.0 (Blackwell/FP4).
  • Native Blackwell (NVFP4): Includes a dedicated lightx2v_kernel factory for high-performance FP4 inference on 50-series GPUs.
  • Parallel Dispatch: The image bundles architecture-specific wheels, and entrypoint.sh installs the optimal one at boot.
  • Privacy: All library telemetry (HF, Transformers, Gradio) is disabled by default.
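As a sketch of the "Parallel Dispatch" bullet above: a minimal version of the compute-capability-to-wheel mapping, assuming the wheel suffixes used by the wheel factory (`ampere.ada.rtx30.40`, `hopper.h100.h200`, `blackwell.rtx50`). The function name is illustrative; the real logic lives in `entrypoint.sh`.

```shell
# Hypothetical sketch: map the compute capability reported by nvidia-smi
# (e.g. "8.9" on an RTX 4090) to one of the three wheel suffixes published
# by the sage-wheels workflow. Not the actual entrypoint.sh code.
select_sage_wheel() {
  local cc="$1"
  case "$cc" in
    8.0|8.6|8.9) echo "ampere.ada.rtx30.40" ;;  # Ampere / Ada
    9.0|10.0)    echo "hopper.h100.h200" ;;     # Hopper
    12.*)        echo "blackwell.rtx50" ;;      # Blackwell (SM 12.x)
    *)           echo "" ;;                     # unknown: fall back to sdpa
  esac
}

# Example: an RTX 4090 reports compute capability 8.9.
select_sage_wheel 8.9   # -> ampere.ada.rtx30.40
```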

New: Step-by-Step RunPod Guide

To make cloud deployment as frictionless as possible, I've added a visual guide (docs/RUNPOD-HOWTO.md) that walks users through:

  • Selecting the right GPU (CUDA 12.8 vs 13.0).
  • Configuring environment variables and persistent storage.
  • Monitoring the automated setup via system logs.
  • Connecting to the Gradio UI and the web-based file manager.
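As a rough local equivalent of the RunPod template the guide describes, here is a hedged example invocation. The variable names `SSH_PORT`, `FILEBROWSER_PORT`, `WGP_PROFILE`, and `WGP_ATTENTION` come from `entrypoint.sh`; the port numbers and storage path are placeholders.

```shell
# Illustrative only — see docs/RUNPOD-HOWTO.md for the actual cloud setup.
# Port 7860 is the Gradio UI; sshd starts when SSH_PORT is set; filebrowser
# starts when FILEBROWSER_PORT is set; the WGP_* variables override the
# entrypoint's automatic profile/attention detection.
docker run --gpus all \
  -p 7860:7860 -p 2222:22 -p 8080:8080 \
  -e SSH_PORT=22 -e FILEBROWSER_PORT=8080 \
  -e WGP_PROFILE=3 -e WGP_ATTENTION=sage2 \
  -v /my/wangp_storage:/workspace \
  ghcr.io/deepbeepmeep/wan2gp
```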

I would be happy to set up a public "Official Wan2GP" template as pictured in the docs, but would want your blessing first.


The SageAttention Version Pin — and How to Upgrade It

SageAttention is pinned to a specific release (v2.2.0, commit eb615cf6) in the
Dockerfile:

git clone --branch v2.2.0 --depth 1 \
    https://github.com/thu-ml/SageAttention.git /tmp/SageAttention

Why pin it? Without a pin, every automated build pulls the latest commit from
SageAttention's main branch. In a fast-moving repo, that means a breaking change
upstream can silently make our image fail to build — potentially days after it was
introduced, mid-release, with no obvious cause. Pinning gives us a reproducible,
auditable build: the image built today and the image built six months from now will
compile the same SageAttention code.

How to upgrade it:

Upgrading SageAttention is a one-line change in the Dockerfile. When a new release
of SageAttention is published at https://github.com/thu-ml/SageAttention/releases:

  1. Note the new tag (e.g., v2.3.0) and its full commit SHA (shown on the releases page)
  2. Open Dockerfile and find the line:
    git clone --branch v2.2.0 --depth 1 \
    
  3. Change v2.2.0 to the new tag:
    git clone --branch v2.3.0 --depth 1 \
    
  4. Update the comment above it:
    # Pinned to v2.3.0 (abc123de) for reproducibility.
    
  5. Commit and push — the automated build will pick it up, compile the new version,
    and publish a fresh image. The old cached layers are reused up to that point.

What about automatic upgrade notifications?

This PR also includes a .github/dependabot.yml file. Dependabot is a free GitHub
service that opens pull requests to bump dependency versions automatically. It is
configured here to check GitHub Actions references weekly and open a PR if any of
the CI action versions have newer releases — completely automated, zero configuration
required, one click to merge.

Dependabot cannot currently bump git clone refs inside Dockerfiles, so the
SageAttention tag remains a manual one-liner as described above. Upgrading it takes
about 60 seconds of human effort.


Caching: How Builds Stay Fast After the First Run

The first automated build is slow — primarily because of the SageAttention CUDA
compilation, which takes ~45 minutes on a standard CI runner. Every build after
that is fast because of BuildKit layer caching:

BuildKit (the Docker build engine) saves each completed build stage as a snapshot.
On the next build, it checks whether anything that affects a stage has changed. If
not, it restores the stage from cache in seconds rather than re-running it.

Here is what triggers a rebuild of each stage:

| What you changed | Deps stage (expensive) | Runtime stage (cheap) |
| --- | --- | --- |
| A `.py` or `.sh` file | ✅ Cache hit — skipped | 🔄 Rebuilds in ~2 s |
| `entrypoint.sh` | ✅ Cache hit — skipped | 🔄 Rebuilds in ~2 s |
| `requirements.txt` | 🔄 Rebuilds (~2 min) | 🔄 Rebuilds in ~2 s |
| PyTorch version in `Dockerfile` | 🔄 Rebuilds (~1 hr) | 🔄 Rebuilds in ~2 s |
| SageAttention tag in `Dockerfile` | 🔄 Rebuilds (~45 min) | 🔄 Rebuilds in ~2 s |
| A markdown or docs file | ⏭️ Build skipped entirely | ⏭️ Build skipped entirely |

In practice, the vast majority of commits are source code changes — those hit the
cache on the expensive stage and only re-run the trivial copy step.
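The underlying idea can be illustrated in plain shell: BuildKit (roughly speaking) keys each stage on a digest of its inputs, so a stage only re-runs when that digest changes. The function names below are illustrative, not BuildKit APIs:

```shell
# Toy model of layer-cache invalidation: the "deps" stage's cache key is
# a hash of requirements.txt; an unchanged file means a cache hit.
stage_cache_key() { sha256sum "$1" | cut -d' ' -f1; }

# Returns success (0) when the file's current hash differs from the key
# recorded at the last build, i.e. when the stage must be rebuilt.
needs_rebuild() {
  local file="$1" last_key="$2"
  [ "$(stage_cache_key "$file")" != "$last_key" ]
}
```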


New: sage-wheels.yml — The Wheel Factory

This workflow compiles SageAttention and publishes it to GitHub Releases.

  • The production (sage-compile) stage pulls these wheels via URL at build time.
  • Benefit: the primary container build is decoupled from the 45-minute NVCC compilation.
  • Portability: it uses build-args so that forks automatically use their own releases, and the upstream repo will automatically use its own releases after merge.

One-Time Setup Required (Post-Merge)

To ensure the Docker image can pull the necessary optimized binaries, the following GitHub Actions must be run manually once after merging this PR:

1. Blackwell Static Kernel (cu130)

  • Purpose: Compiles the lightx2v_kernel (NVFP4) for RTX 50-series.
  • Release Name: blackwell-kernels

2. SageAttention Factory (cu130)

  • Purpose: Compiles parallel wheels for CUDA 13.0 (Ampere -> Blackwell).
  • Release Name: sage-v<version>-cu130-cp312

3. SageAttention Pre-built Wheel (cu128)

  • Purpose: Compiles standard wheels for CUDA 12.8.
  • Release Name: sage-v<version>-cu128-cp312

Tip

Run these via the Actions tab on GitHub by selecting the workflow and clicking Run workflow. Once these releases are populated, the main Docker builds will complete in ~3 minutes.
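For those who prefer the command line, the same one-time workflows can be dispatched with the GitHub CLI (assuming `gh` is installed and authenticated against the repo; the workflow file names are taken from this PR's file list):

```shell
# Trigger the three one-time wheel/kernel builds from a checkout of the repo.
gh workflow run blackwell-kernels.yml
gh workflow run sage-wheels-cu13.yml
gh workflow run sage-wheels.yml

# Optionally watch progress:
gh run list --workflow sage-wheels.yml --limit 3
```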


Testing

To test locally before pulling the published image:

# Build and verify the dependency stage only (no GPU needed)
docker build --target deps -t wan2gp-deps:test .

# Full build
docker build -t wan2gp:test .

# Run (requires NVIDIA Container Toolkit)
docker run --gpus all -p 7860:7860 wan2gp:test

Files Changed

| File | Change |
| --- | --- |
| `Dockerfile` | New: Main Dockerfile (CUDA 12.8) |
| `Dockerfile.cu13` | New: Blackwell-optimized Dockerfile (CUDA 13.0) |
| `entrypoint.sh` | New: GPU detection + smart architecture-specific wheel dispatch |
| `.github/workflows/docker-build.yml` | New: Automated build + push (cu128) |
| `.github/workflows/docker-build-cu13.yml` | New: Automated build + push (cu130) |
| `.github/workflows/blackwell-kernels.yml` | New: One-time lightx2v kernel factory |
| `.github/workflows/sage-wheels.yml` | New: 3-way parallel wheel factory (cu128) |
| `.github/workflows/sage-wheels-cu13.yml` | New: 3-way parallel wheel factory (cu130) |
| `.github/dependabot.yml` | New: Weekly CI dependency pin bumps |
| `.dockerignore` | New: Functions like `.gitignore` to keep unwanted files out of the image build context |
| `.gitignore` | Minor modifications: exception lines so the new dotfiles (e.g. `.github/`) can be committed |
| `docs/RUNPOD-HOWTO.md` | New: Visual step-by-step guide for RunPod deployment |
| `docs/images/*.jpg` | New: Supporting screenshots for the RunPod guide |
| `.yamllint` | New: YAML linting rules for GH Actions workflows |
| `requirements.txt` | Modified: Added scipy and safetensors dependencies |

Note on .gitignore: The existing `.gitignore` has a blanket `.*` rule that
ignores all dotfiles and dotfolders, which would prevent `.github/` from ever being
committed. Two exception lines were added alongside the existing `!.gitignore`
exception, plus a few other similarly small changes.

@FNGarvin
Author

Reviewer's Guide

Containerization and infrastructure overhaul for Wan2GP: introduces a multi-stage Docker build targeting CUDA 12.8 and 13.0, runtime GPU/profile auto-detection with SageAttention 2++ wheel selection, optional SSH and filebrowser services, and CI workflows to build/publish Docker images and SageAttention/Blackwell wheels for cloud (RunPod) deployments.

Sequence diagram for container startup and Wan2GP launch

sequenceDiagram
    actor Operator
    participant Pod as RunPod_container
    participant Entrypoint as entrypoint_sh
    participant SSHD as sshd
    participant FB as filebrowser
    participant GPU as nvidia_smi
    participant Sage as Sage_wheels_installer
    participant BW as Blackwell_kernel_installer
    participant App as wgp_py

    Operator->>Pod: Start Wan2GP image
    Pod->>Entrypoint: Invoke /entrypoint.sh
    Entrypoint->>Entrypoint: Sanitize env
    Entrypoint->>Entrypoint: Configure cache and telemetry

    Entrypoint->>SSHD: Start sshd (if SSH_PORT set)
    Entrypoint->>FB: Start filebrowser (if FILEBROWSER_PORT set)

    Entrypoint->>GPU: Query GPU name and VRAM
    GPU-->>Entrypoint: GPU_NAME, VRAM_GB

    Entrypoint->>Entrypoint: Derive PROFILE and ATTN
    Entrypoint->>Entrypoint: Apply WGP_PROFILE and WGP_ATTENTION overrides

    Entrypoint->>Sage: Detect compute_capability via nvidia-smi
    Sage-->>Entrypoint: Matching SageAttention wheel path
    Entrypoint->>Sage: pip install selected wheel

    Entrypoint->>BW: Check Blackwell and CUDA 13.0
    BW-->>Entrypoint: NVFP4 kernel wheel (if available)
    Entrypoint->>BW: pip install NVFP4 kernel

    Entrypoint->>App: python3 wgp.py --listen --profile PROFILE --attention ATTN WGP_ARGS "$@"
    App-->>Operator: Serve UI on port 7860
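The "Derive PROFILE and ATTN" step in the sequence above can be sketched as a simple VRAM threshold map. The thresholds and function name here are hypothetical illustrations; the actual mapping lives in `entrypoint.sh`:

```shell
# Hypothetical sketch: pick a WanGP memory profile (1-5) from the VRAM
# reported by nvidia-smi. Thresholds are invented for illustration.
pick_profile() {
  local vram_gb="$1"
  if   [ "$vram_gb" -ge 48 ]; then echo 1   # datacenter-class VRAM
  elif [ "$vram_gb" -ge 24 ]; then echo 3
  elif [ "$vram_gb" -ge 12 ]; then echo 4
  else                             echo 5   # low-VRAM fallback
  fi
}
```

A `WGP_PROFILE` environment override (as shown in the diagram) would simply take precedence over this detected value.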

Flow diagram for SageAttention wheel factory and Blackwell kernel workflows

graph TD
    A[workflow_dispatch<br>sage-wheels or sage-wheels-cu13] --> B[initialize-release job]
    B --> B1[Checkout repo]
    B1 --> B2[Parse Dockerfile to get SAGE_VERSION]
    B2 --> B3[Compute release tag<br>sage-vX.Y.Z-cu128_or_cu130-cp312]
    B3 --> B4[Create or refresh prerelease<br>on GitHub Releases]

    B4 --> C[build-wheels matrix job]

    subgraph MatrixBuilds
        C --> C1[Variant ampere-ada-rtx-30-40<br>CUDA_ARCHITECTURES 8.0;8.6;8.9]
        C --> C2[Variant hopper-h100-h200<br>CUDA_ARCHITECTURES 9.0;10.0]
        C --> C3[Variant blackwell-rtx-50<br>CUDA_ARCHITECTURES 12.0+PTX]
    end

    C1 --> D1[Build Docker target sage-tools<br>with docker-build-push-action]
    C2 --> D2[Build Docker target sage-tools]
    C3 --> D3[Build Docker target sage-tools]

    D1 --> E1[Run container and copy /tmp/sa_dist/*.whl]
    D2 --> E2[Run container and copy /tmp/sa_dist/*.whl]
    D3 --> E3[Run container and copy /tmp/sa_dist/*.whl]

    E1 --> F1[Rename wheel with suffix ampere.ada.rtx30.40]
    E2 --> F2[Rename wheel with suffix hopper.h100.h200]
    E3 --> F3[Rename wheel with suffix blackwell.rtx50]

    F1 --> G[Upload wheel assets to release tag]
    F2 --> G
    F3 --> G

    subgraph BlackwellKernelsWorkflow
        H[workflow_dispatch<br>blackwell-kernels] --> I[Build Docker target blackwell-tools<br>from Dockerfile.cu13]
        I --> J[Extract /tmp/bw_dist/*.whl<br>NVFP4 kernels]
        J --> K[Ensure blackwell-kernels release exists]
        K --> L[Upload kernel wheels to blackwell-kernels release]
    end

File-Level Changes

Change Details Files
Replace the simple entrypoint with a production-grade runtime bootstrap that configures env/caches, optionally starts SSH and filebrowser, auto-detects GPU to select WanGP profile and attention mode, and dynamically installs the best SageAttention/Blackwell wheel before launching wgp.py.
  • Add strict shell options, sanitize LD_LIBRARY_PATH, and move cache directories to /workspace for persistence.
  • Set telemetry-disable env vars, CPU thread counts, TF32 flags, and TORCH_CUDA_ARCH_LIST (from CUDA_ARCHITECTURES or sensible defaults).
  • Inject SSH public key from env vars and optionally start sshd on a configurable port.
  • Optionally start filebrowser scoped to /workspace when FILEBROWSER_PORT is set.
  • Implement GPU name/VRAM detection via nvidia-smi and map to WanGP profile (1–5) and attention mode (sage2/sage/sdpa) with env overrides.
  • Add dynamic SageAttention wheel selection/installation from /opt/sage_wheels based on compute capability and optional Blackwell NVFP4 (LightX2V) kernel installation.
  • Change working directory to /workspace/wan2gp and exec wgp.py with auto-detected profile/attention plus extra WGP_ARGS.
entrypoint.sh
Refactor Docker build into a multi-stage pipeline (base, sage-tools, sage-compile, deps, runtime) with pinned CUDA 12.8, PyTorch, SageAttention wheels, and optional filebrowser, tailored for cloud/RunPod environments.
  • Introduce a base image on nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 that installs Python, uv, core build tools, and PyTorch 2.10.0+cu128 stack.
  • Add a sage-tools stage that builds SageAttention from source using TORCH_CUDA_ARCH_LIST from CUDA_ARCHITECTURES and exports wheels in /tmp/sa_dist.
  • Add a sage-compile stage that configures sshd and downloads pre-built SageAttention 2.2.0 wheels for ampere/ada, hopper, and blackwell from GitHub Releases into /opt/sage_wheels.
  • Add a deps stage that installs project requirements via uv, including a pinned nunchaku wheel and specific huggingface-hub/diffusers versions, with cross-index resolution settings.
  • Add a runtime stage that copies the project into /workspace/wan2gp, creates model/cache directories, installs filebrowser, wires CUDA_ARCHITECTURES through, and sets /entrypoint.sh as ENTRYPOINT with ports 7860 and 22 exposed.
Dockerfile
Extend Python dependencies to support new optimizations and serialization features.
  • Add scipy to requirements for numerical routines used by the new stack.
  • Add safetensors to support safe tensor serialization/loading.
  • Document that torchao is managed in the Dockerfile base stage rather than via requirements.txt.
requirements.txt
Introduce CI workflows to build/publish SageAttention wheels (CUDA 12.8 and 13.0), Blackwell NVFP4 kernels, and Docker images to GHCR, with caching and merge-gate behavior.
  • Add sage-wheels.yml to build SageAttention wheels for multiple GPU arch sets via the sage-tools stage, upload them as pre-release assets, and avoid release race conditions by serializing release initialization.
  • Add sage-wheels-cu13.yml to do the same for CUDA 13.0 using Dockerfile.cu13 and sage-tools target.
  • Add blackwell-kernels.yml to build and publish Blackwell-specific NVFP4 (LightX2V) kernel wheels to a static blackwell-kernels release.
  • Add docker-build.yml to build and push the main CUDA 12.8 image (and tags) to GHCR on pushes/PRs, with BuildKit cache and safe login behavior for PRs.
  • Add docker-build-cu13.yml to build and push a CUDA 13.0 variant image when relevant files change.
.github/workflows/sage-wheels.yml
.github/workflows/sage-wheels-cu13.yml
.github/workflows/blackwell-kernels.yml
.github/workflows/docker-build.yml
.github/workflows/docker-build-cu13.yml
Add RunPod deployment documentation and repo-level automation/config for infra hygiene.
  • Add RUNPOD-HOWTO.md with step-by-step instructions and screenshots for selecting GPUs, deploying pods, monitoring logs, and accessing the UI/filebrowser.
  • Introduce dependabot.yml to keep GitHub Actions dependencies up to date weekly.
  • Add .dockerignore and .yamllint stubs to control Docker build context and YAML style (contents not shown in diff).
  • Introduce Dockerfile.cu13 (not shown fully in diff) to support a CUDA 13.0 image variant aligned with the cu13 workflows and Blackwell kernels.
docs/RUNPOD-HOWTO.md
.github/dependabot.yml
.dockerignore
.yamllint
Dockerfile.cu13


@FNGarvin
Author

Hello from Reddit.
