Add nightly workflow for testing with latest Triton #366

Draft
Copilot wants to merge 8 commits into main from copilot/add-nightly-tests-action

Copilot AI commented Feb 7, 2026

Adds a GitHub Actions workflow that runs nightly to validate Iris against Triton's main branch, catching compatibility regressions early.

Summary

  • Nightly workflow that reinstalls Triton from main inside the existing CI container, then runs the full test suite
  • No container rebuild — reuses the same cached Apptainer image as normal CI
  • Same test matrix as iris-tests.yml: 5 test dirs × 4 rank counts = 20 jobs
  • Silent failures: no notifications are sent, so check the Actions tab or the README status badge manually
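
As a quick check on the job count, the matrix can be expanded by hand. This is a minimal sketch only: the directory names and rank counts come from the Details table in this PR, while `expand_matrix` and the dictionary keys are hypothetical names chosen for illustration, not identifiers from the workflow file.

```python
from itertools import product

# Values taken from the Details table; the variable names are illustrative.
TEST_DIRS = ["examples", "unittests", "ccl", "x", "ops"]
RANK_COUNTS = [1, 2, 4, 8]


def expand_matrix(test_dirs, rank_counts):
    """Return one job description per (test_dir, num_ranks) combination."""
    return [
        {"test_dir": test_dir, "num_ranks": num_ranks}
        for test_dir, num_ranks in product(test_dirs, rank_counts)
    ]


jobs = expand_matrix(TEST_DIRS, RANK_COUNTS)
print(len(jobs))  # 5 directories x 4 rank counts = 20 jobs
```

Because fail-fast is false, each of these 20 combinations would run to completion even if another one fails.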

Approach

Instead of rebuilding the container image with a different Triton (the original Copilot approach), each job:

  1. Reuses the existing cached container via container_build.sh
  2. Acquires GPUs via acquire_gpus.sh (flock-based bitmap allocator)
  3. Inside the container overlay: pip install --force-reinstall --no-deps git+...triton@main
  4. Installs iris (pip install -e .), runs tests with torchrun --rdzv-endpoint=localhost:0
  5. Releases GPUs via release_gpus.sh (if: always())

Each job gets its own container overlay, so the Triton reinstall is isolated and parallel-safe.
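
To illustrate what a flock-based bitmap allocator (step 2) might look like, here is a small Python sketch. It is not the implementation in acquire_gpus.sh or release_gpus.sh, which this PR does not show. The state-file format, the function names, the 8-GPU count (inferred from the runner name), and the choice to return `None` rather than wait when too few GPUs are free are all assumptions.

```python
import fcntl
import os
import tempfile

NUM_GPUS = 8  # assumption: inferred from the "8gpu" part of the runner name


def _update_bitmap(state_path, update):
    """Run update(bitmap) -> (new_bitmap, result) while holding an exclusive flock."""
    fd = os.open(state_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        raw = os.read(fd, 64).decode().strip()
        bitmap = int(raw) if raw else 0  # an empty file means no GPU is claimed
        new_bitmap, result = update(bitmap)
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, str(new_bitmap).encode())
        return result
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def acquire_gpus(state_path, count):
    """Claim `count` free GPU indices; return them, or None if too few are free."""
    def update(bitmap):
        free = [i for i in range(NUM_GPUS) if not bitmap & (1 << i)]
        if len(free) < count:
            return bitmap, None  # leave the bitmap unchanged
        chosen = free[:count]
        for i in chosen:
            bitmap |= 1 << i  # set bit i to mark GPU i as in use
        return bitmap, chosen
    return _update_bitmap(state_path, update)


def release_gpus(state_path, gpu_ids):
    """Clear the bits for the given GPU indices."""
    def update(bitmap):
        for i in gpu_ids:
            bitmap &= ~(1 << i)
        return bitmap, None
    _update_bitmap(state_path, update)


# Small demonstration against a throwaway state file.
state = os.path.join(tempfile.mkdtemp(), "gpu_bitmap")
first = acquire_gpus(state, 4)
second = acquire_gpus(state, 2)
too_many = acquire_gpus(state, 4)  # only 2 GPUs remain free at this point
release_gpus(state, first)
retry = acquire_gpus(state, 4)
print(first, second, too_many, retry)
```

The exclusive lock is what would make concurrent jobs on the same runner safe: only one process at a time can read and rewrite the bitmap. Releasing in a step guarded by `if: always()` plays the same role as a `finally` block, so GPUs are returned even when the tests fail.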

Changes

  • .github/workflows/iris-nightly-triton-test.yml — nightly workflow (complete rewrite from Copilot original)
  • README.md — status badge

Details

| Property | Value |
|---|---|
| Schedule | Midnight Pacific daily (0 7 * * * UTC) |
| Manual trigger | workflow_dispatch |
| Runner | linux-mi325-8gpu-ossci-rad |
| Test matrix | examples, unittests, ccl, x, ops × 1, 2, 4, 8 ranks |
| fail-fast | false |
| Timeout | 180 min per job |
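
One detail worth checking about the schedule: GitHub Actions evaluates cron expressions in UTC and does not follow daylight saving time, so a 07:00 UTC trigger corresponds to midnight only while Pacific Daylight Time is in effect. The sketch below uses fixed UTC offsets to show this; the `local_start` helper and the two sample dates are illustrative choices, not part of the workflow.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the two halves of the year in the Pacific time zone.
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")


def local_start(year, month, day, tz):
    """Convert the 07:00 UTC trigger on the given UTC date to a fixed-offset local time."""
    trigger_utc = datetime(year, month, day, 7, 0, tzinfo=timezone.utc)
    return trigger_utc.astimezone(tz)


summer = local_start(2026, 7, 1, PDT)
winter = local_start(2026, 1, 15, PST)
print(summer.strftime("%Y-%m-%d %H:%M %Z"))  # 2026-07-01 00:00 PDT
print(winter.strftime("%Y-%m-%d %H:%M %Z"))  # 2026-01-14 23:00 PST
```

In other words, during standard time the run would start at 11 PM Pacific on the previous calendar day, which is probably acceptable for a nightly job.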

Fixes #365

Copilot AI changed the title from "[WIP] Add nightly tests using latest Triton version" to "Add nightly workflow for testing with latest Triton" on Feb 7, 2026
Copilot AI requested a review from mawad-amd February 7, 2026 14:30
Copilot AI and others added 7 commits March 22, 2026 20:10
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>

Instead of rebuilding the container with Triton from main (slow,
fragile), reuse the existing cached CI container and reinstall
Triton from main at test time via pip. This means:

- Zero changes to build infrastructure
- Each test job is independent (parallel-safe, own container overlay)
- Uses container_build.sh for cache hits, container_exec.sh for execution
- Same test matrix as iris-tests.yml (5 dirs × 4 ranks)
- Editable install only (no git/pip variants needed for nightly)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update to match main's CI infrastructure:
- Runner: linux-mi325-8gpu-ossci-rad (not self-hosted/mi3xx)
- GPU allocation: flock-based acquire_gpus.sh/release_gpus.sh
- Remove hardcoded gpu_devices from matrix
- Remove separate build-container-image job (each job builds its own)
- Remove manual tritonBLAS install (now baked into container)
- Use torchrun --rdzv-endpoint=localhost:0 (not run_tests_distributed --num_ranks)
- Match timeout-minutes: 180

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mawad-amd force-pushed the copilot/add-nightly-tests-action branch from 0a8f234 to 02457b8 on March 23, 2026 03:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>