130 changes: 130 additions & 0 deletions .github/workflows/iris-nightly-triton-test.yml
@@ -0,0 +1,130 @@
name: Iris Nightly Triton Test

on:
  schedule:
    # Run nightly at 07:00 UTC (midnight PDT, 11 PM PST)
    - cron: '0 7 * * *'
  workflow_dispatch: # Allow manual triggering

concurrency:
  group: ${{ github.workflow }}
  cancel-in-progress: true

permissions:
  contents: read

jobs:
  test-nightly:
    name: Test ${{ matrix.test_dir }} (${{ matrix.num_ranks }} ranks, nightly Triton)
    runs-on: [linux-mi325-8gpu-ossci-rad]
    timeout-minutes: 180
    strategy:
      fail-fast: false
      matrix:
        include:
          - test_dir: examples
            num_ranks: 1
          - test_dir: examples
            num_ranks: 2
          - test_dir: examples
            num_ranks: 4
          - test_dir: examples
            num_ranks: 8
          - test_dir: unittests
            num_ranks: 1
          - test_dir: unittests
            num_ranks: 2
          - test_dir: unittests
            num_ranks: 4
          - test_dir: unittests
            num_ranks: 8
          - test_dir: ccl
            num_ranks: 1
          - test_dir: ccl
            num_ranks: 2
          - test_dir: ccl
            num_ranks: 4
          - test_dir: ccl
            num_ranks: 8
          - test_dir: x
            num_ranks: 1
          - test_dir: x
            num_ranks: 2
          - test_dir: x
            num_ranks: 4
          - test_dir: x
            num_ranks: 8
          - test_dir: ops
            num_ranks: 1
          - test_dir: ops
            num_ranks: 2
          - test_dir: ops
            num_ranks: 4
          - test_dir: ops
            num_ranks: 8

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup Apptainer (if not available)
        run: |
          if ! command -v apptainer &> /dev/null && ! command -v docker &> /dev/null; then
            echo "Neither Apptainer nor Docker found, installing Apptainer..."
            apt-get update && apt-get install -y software-properties-common
            add-apt-repository -y ppa:apptainer/ppa
            apt-get update && apt-get install -y apptainer
          else
            echo "Container runtime already available"
Comment on lines +72 to +78
Copilot AI Mar 23, 2026

The guard condition treats Docker as a sufficient substitute for Apptainer, but later steps call container_build.sh / container_exec.sh (and the PR description emphasizes Apptainer). If those scripts require Apptainer, this condition incorrectly skips installation whenever Docker is present, and later steps fail. Also, apt-get / add-apt-repository typically require elevated privileges (sudo) and may be blocked on hardened or self-hosted runners, which makes this step brittle. Consider requiring a runner image with Apptainer preinstalled, or check for and install Apptainer specifically (not “Apptainer or Docker”), using sudo if your runner environment supports it.

Suggested change
- if ! command -v apptainer &> /dev/null && ! command -v docker &> /dev/null; then
-   echo "Neither Apptainer nor Docker found, installing Apptainer..."
-   apt-get update && apt-get install -y software-properties-common
-   add-apt-repository -y ppa:apptainer/ppa
-   apt-get update && apt-get install -y apptainer
- else
-   echo "Container runtime already available"
+ if ! command -v apptainer &> /dev/null; then
+   echo "Apptainer not found, attempting to install..."
+   if command -v sudo &> /dev/null; then
+     sudo apt-get update && sudo apt-get install -y software-properties-common
+     sudo add-apt-repository -y ppa:apptainer/ppa
+     sudo apt-get update && sudo apt-get install -y apptainer
+   else
+     echo "Error: sudo is not available; Apptainer must be preinstalled on this runner."
+     exit 1
+   fi
+ else
+   echo "Apptainer already available"

          fi

      - name: Build Iris container
        run: |
          bash .github/scripts/container_build.sh

      - name: Acquire GPUs
        run: |
          bash .github/scripts/acquire_gpus.sh "${{ matrix.num_ranks }}"

      - name: Run ${{ matrix.test_dir }} tests with ${{ matrix.num_ranks }} ranks (nightly Triton)
        run: |
          set -e
          echo "::group::Running ${{ matrix.test_dir }} tests with ${{ matrix.num_ranks }} ranks (nightly Triton)"

          # Build GPU argument (GPU_DEVICES set by acquire_gpus.sh)
          GPU_ARG=""
          if [ -n "$GPU_DEVICES" ]; then
            GPU_ARG="--gpus $GPU_DEVICES"
          fi
Comment on lines +85 to +98
Copilot AI Mar 23, 2026

GPU_DEVICES set inside acquire_gpus.sh will not automatically persist to later GitHub Actions steps unless it’s written to $GITHUB_ENV (or exposed as a step output). As written, $GPU_DEVICES will typically be empty in the “Run … tests” step, which can result in not passing the intended GPU selection to container_exec.sh. Persist the allocated devices by having acquire_gpus.sh (or the step wrapper) write GPU_DEVICES=... to $GITHUB_ENV, or convert Acquire GPUs into an id: step that sets outputs and reference those outputs in the next step.
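
A minimal sketch of the first option, assuming the script already knows the allocated device list. `persist_gpu_devices` is a hypothetical helper, not something taken from acquire_gpus.sh; the only GitHub Actions behavior it relies on is that `NAME=value` lines appended to the file named by `$GITHUB_ENV` become environment variables in later steps of the same job.

```shell
#!/usr/bin/env bash
# Hypothetical helper: persist the allocated GPU list for later workflow steps.
# The name and the argument format are assumptions made for illustration.
persist_gpu_devices() {
  local devices="$1"

  # GITHUB_ENV is provided by the Actions runner; fail loudly when it is absent.
  if [ -z "${GITHUB_ENV:-}" ]; then
    echo "Error: GITHUB_ENV is not set (not running under GitHub Actions?)" >&2
    return 1
  fi

  # Each NAME=value line appended here is exported to subsequent steps.
  echo "GPU_DEVICES=${devices}" >> "$GITHUB_ENV"
}
```

With something like this called at the end of the acquire step, the existing `if [ -n "$GPU_DEVICES" ]` check in the test step would see the value without further changes.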


          # Run tests in container, reinstalling Triton from main first
          bash .github/scripts/container_exec.sh $GPU_ARG "
            set -e

            # Reinstall Triton from main branch
            echo \"Reinstalling Triton from main branch...\"
            pip install --force-reinstall --no-deps \
              git+https://github.com/triton-lang/triton@main
Comment on lines +104 to +107
Copilot AI Mar 23, 2026

Rebuilding/reinstalling Triton from main in every matrix job (20× per nightly run) can significantly increase total runtime and load on external infrastructure, and increases failure rates due to transient network/build issues. Consider adding pip caching (e.g., caching wheels/build artifacts keyed by commit date), or building a single Triton wheel once (in a separate job) and reusing it across matrix jobs as an artifact; the tradeoff is a bit more workflow complexity in exchange for faster, more reliable nightly runs.
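
One way the build-once idea could look, as a rough sketch: reuse a previously built wheel if one is present in a shared directory, and fall back to `pip wheel` only on a miss. `ensure_triton_wheel`, `list_triton_wheels`, and the wheel directory are hypothetical; how that directory would be shared across matrix jobs (artifact, cache, or shared filesystem) is left open, and the wheel filename pattern `triton-*.whl` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: both names and the wheel directory layout are assumptions.

# Print the paths of Triton wheels found in a directory, one per line.
list_triton_wheels() {
  local f
  for f in "$1"/triton-*.whl; do
    # Skips the literal pattern when nothing matches.
    [ -f "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Print the path of a usable Triton wheel, building one only when none is cached.
ensure_triton_wheel() {
  local wheel_dir="$1" wheel
  mkdir -p "$wheel_dir"

  wheel="$(list_triton_wheels "$wheel_dir" | head -n 1)"
  if [ -z "$wheel" ]; then
    # Cache miss: build from the same source the workflow installs from.
    pip wheel --no-deps --wheel-dir "$wheel_dir" \
      "git+https://github.com/triton-lang/triton@main" >&2 || return 1
    wheel="$(list_triton_wheels "$wheel_dir" | head -n 1)"
  fi

  [ -n "$wheel" ] || return 1
  printf '%s\n' "$wheel"
}
```

A matrix job could then run `pip install --force-reinstall --no-deps "$(ensure_triton_wheel /path/to/wheels)"` in place of the direct git install, where the path is a placeholder.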

            echo \"Triton version: \$(pip show triton 2>/dev/null | grep Version || echo unknown)\"

            # Install iris in editable mode
            echo \"Installing iris in editable mode\"
            pip install -e .

            # Run tests in the specified directory
            for test_file in tests/${{ matrix.test_dir }}/test_*.py; do
              if [ -f \"\$test_file\" ]; then
                echo \"Testing: \$test_file with ${{ matrix.num_ranks }} ranks (nightly Triton)\"
                torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 \
                  --nnodes=1 --nproc_per_node=${{ matrix.num_ranks }} \
                  tests/run_tests_distributed.py \"\$test_file\" -v --tb=short --durations=10
              fi
            done
Comment on lines +114 to +122
Copilot AI Mar 23, 2026

If tests/${{ matrix.test_dir }}/test_*.py matches no files, Bash will iterate once with the literal glob pattern, -f will be false, and the job will exit successfully having run zero tests (a silent false-green). Enable nullglob and/or explicitly fail when no test files are found (e.g., collect into an array and assert non-empty), or run a test runner invocation that fails appropriately when the directory contains no tests for the selected pattern.
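
A sketch of the "explicitly fail" variant, written as a standalone function rather than as the escaped string passed to container_exec.sh. It counts the files it actually ran and returns non-zero when the count is zero, so it does not depend on `nullglob`. `run_tests_or_fail` is a hypothetical name, and the command to run per file is passed in by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command once per test file and fail if none matched.
# Usage: run_tests_or_fail <dir> <command> [args...]
run_tests_or_fail() {
  local dir="$1"
  shift
  local count=0 test_file

  for test_file in "$dir"/test_*.py; do
    # Without nullglob an unmatched pattern is passed through literally; skip it.
    [ -f "$test_file" ] || continue
    # The test file path is appended as the last argument of the command.
    "$@" "$test_file" || return 1
    count=$((count + 1))
  done

  if [ "$count" -eq 0 ]; then
    echo "Error: no test files matched $dir/test_*.py" >&2
    return 1
  fi
  echo "Ran $count test file(s) from $dir"
}
```

For example, `run_tests_or_fail tests/ccl echo "Testing:"` only prints the files it would run, and fails if the directory has no `test_*.py` files.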

"
          echo "::endgroup::"
          echo "✅ ${{ matrix.test_dir }} tests with ${{ matrix.num_ranks }} ranks (nightly Triton) passed!"

      - name: Release GPUs
        if: always()
        run: |
          bash .github/scripts/release_gpus.sh