Add nightly workflow for testing with latest Triton#366
Draft
Copilot AI changed the title from "[WIP] Add nightly tests using latest Triton version" to "Add nightly workflow for testing with latest Triton" on Feb 7, 2026
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Instead of rebuilding the container with Triton from main (slow, fragile), reuse the existing cached CI container and reinstall Triton from main at test time via pip. This means:
- Zero changes to build infrastructure
- Each test job is independent (parallel-safe, own container overlay)
- Uses container_build.sh for cache hits, container_exec.sh for execution
- Same test matrix as iris-tests.yml (5 dirs × 4 ranks)
- Editable install only (no git/pip variants needed for nightly)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update to match main's CI infrastructure:
- Runner: linux-mi325-8gpu-ossci-rad (not self-hosted/mi3xx)
- GPU allocation: flock-based acquire_gpus.sh/release_gpus.sh
- Remove hardcoded gpu_devices from matrix
- Remove separate build-container-image job (each job builds its own)
- Remove manual tritonBLAS install (now baked into container)
- Use torchrun --rdzv-endpoint=localhost:0 (not run_tests_distributed --num_ranks)
- Match timeout-minutes: 180

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0a8f234 to 02457b8
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a GitHub Actions workflow that runs nightly to validate Iris against Triton's `main` branch, catching compatibility regressions early.

Summary
- Reinstalls Triton from `main` inside the existing CI container, then runs the full test suite
- Same test matrix as `iris-tests.yml`: 5 test dirs × 4 rank counts = 20 jobs

Approach
Instead of rebuilding the container image with a different Triton (the original Copilot approach), each job:
1. Builds the container via `container_build.sh`, reusing the cached image
2. Acquires GPUs via `acquire_gpus.sh` (flock-based bitmap allocator)
3. Reinstalls Triton with `pip install --force-reinstall --no-deps git+...triton@main`
4. Installs Iris in editable mode (`pip install -e .`), runs tests with `torchrun --rdzv-endpoint=localhost:0`
5. Releases GPUs via `release_gpus.sh` (`if: always()`)

Each job gets its own container overlay, so the Triton reinstall is isolated and parallel-safe.
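The `acquire_gpus.sh` and `release_gpus.sh` scripts themselves are not shown in this conversation. As a rough illustration of what a flock-based bitmap allocator can look like in general, here is a minimal Python sketch. The function names, the state-file layout (a decimal integer whose bit `i` marks GPU `i` as busy), and the default of 8 GPUs (guessed from the runner name) are all assumptions for illustration, not the repository's implementation.

```python
import fcntl
import os
import tempfile


def _read_bitmap(fd):
    """Read the bitmap (stored as a decimal integer) from the state file."""
    os.lseek(fd, 0, os.SEEK_SET)
    raw = os.read(fd, 64).decode().strip()
    return int(raw) if raw else 0


def _write_bitmap(fd, bitmap):
    """Overwrite the state file with the new bitmap value."""
    os.lseek(fd, 0, os.SEEK_SET)
    os.ftruncate(fd, 0)
    os.write(fd, str(bitmap).encode())


def acquire_gpus(state_path, count, total=8):
    """Hypothetical helper: claim `count` free GPUs, or return None."""
    fd = os.open(state_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Exclusive lock: only one job may inspect/modify the bitmap at a time.
        fcntl.flock(fd, fcntl.LOCK_EX)
        bitmap = _read_bitmap(fd)
        free = [i for i in range(total) if not bitmap & (1 << i)]
        if len(free) < count:
            return None
        chosen = free[:count]
        for i in chosen:
            bitmap |= 1 << i
        _write_bitmap(fd, bitmap)
        return chosen
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def release_gpus(state_path, gpus):
    """Hypothetical helper: clear the bits for the given GPU indices."""
    fd = os.open(state_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        bitmap = _read_bitmap(fd)
        for i in gpus:
            bitmap &= ~(1 << i)
        _write_bitmap(fd, bitmap)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "gpu_bitmap")
    gpus = acquire_gpus(path, 4)
    try:
        print("running tests on GPUs", gpus)
    finally:
        # Mirrors the `if: always()` release step: free GPUs even on failure.
        release_gpus(path, gpus)
```

Because the lock is held only while the bitmap is read and updated, concurrent jobs on the same runner serialize briefly on allocation and then run their tests in parallel on disjoint GPUs.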
Changes
- `.github/workflows/iris-nightly-triton-test.yml`: nightly workflow (complete rewrite from Copilot original)
- `README.md`: status badge

Details
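| Setting | Value |
| --- | --- |
| Schedule | nightly (`0 7 * * *` UTC) |
| Manual trigger | `workflow_dispatch` |
| Runner | `linux-mi325-8gpu-ossci-rad` |
| `fail-fast` | `false` |

Fixes #365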