Skip to content

Add test sharding, proactive clean, and retry logic for self-hosted CI#1171

Draft
sbryngelson wants to merge 12 commits intoMFlowCode:masterfrom
sbryngelson:ci-test
Draft

Add test sharding, proactive clean, and retry logic for self-hosted CI#1171
sbryngelson wants to merge 12 commits intoMFlowCode:masterfrom
sbryngelson:ci-test

Conversation

@sbryngelson
Copy link
Member

@sbryngelson sbryngelson commented Feb 19, 2026

Summary

Hardens self-hosted CI with test sharding, retry logic, and script deduplication.

Test sharding & retry

  • Add --shard i/n flag to ./mfc.sh test — splits tests via modular arithmetic for even distribution
  • Frontier GPU matrix now runs 2 shards per interface (acc/omp), halving wall-clock time
  • GitHub runner tests retry up to 5 sporadic failures using tests/failed_uuids.txt
  • Abort path cleans failed_uuids.txt to prevent stale retries

--only filter improvements

  • UUIDs use OR logic (match any), labels use AND logic (match all)
  • --only matching zero tests now raises an error instead of silently passing

CI script consolidation

  • Merge submit-bench.sh into submit.sh for all 3 clusters (frontier, frontier_amd, phoenix) — submit.sh auto-detects bench vs test mode from the submitted script's basename
  • Unify frontier/ and frontier_amd/ scripts via directory-name detection — build.sh, bench.sh, submit.sh, and test.sh are now byte-identical across both directories
  • Net deletion of 3 files and ~120 lines of duplicated shell code

Other

  • --requeue on Phoenix SLURM jobs for preemption recovery
  • Build retry wrapper (3 attempts with clean between)
  • Lint-gate must pass before self-hosted tests run
  • Skip benchmark workflow for bot review events

Depends on: #1170

Test plan

  • Frontier GPU tests run in 2 shards per interface and complete within 2h
  • Phoenix tests pass with --requeue and preemption recovery
  • Lint-gate blocks self-hosted tests on lint failure
  • GitHub runner retry logic fires on ≤5 test failures
  • Benchmark jobs submit correctly via merged submit.sh (bench mode auto-detected)
  • frontier/ and frontier_amd/ scripts are identical and detect cluster correctly

Copilot AI review requested due to automatic review settings February 19, 2026 20:01
@codeant-ai codeant-ai bot added the size:M This PR changes 30-99 lines, ignoring generated files label Feb 19, 2026

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

cubic-dev-ai[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

@codeant-ai codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:M This PR changes 30-99 lines, ignoring generated files labels Feb 20, 2026
coderabbitai[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

@codeant-ai codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 21, 2026
@codeant-ai codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 23, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 23, 2026
@codeant-ai codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 24, 2026
@codeant-ai codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 24, 2026
@sbryngelson sbryngelson marked this pull request as draft February 25, 2026 01:04
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026
sbryngelson and others added 3 commits February 25, 2026 21:23
The -s check already guarantees the file is non-empty, so
NUM_FAILED > 0 is always true in that branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…zero-match guard

- Include shard in SLURM job_slug to prevent output file collisions
  between parallel shards (e.g., test-gpu-acc-1-of-2.out)
- Consolidate frontier/ and frontier_amd/ submit.sh and test.sh into
  identical scripts that derive compiler flag and config from directory
- Add $shard_opts to CPU test branch for future-proofing
- Add zero-match guard for --only filter to fail loudly instead of
  silently exiting 0 when no tests match
- Hoist failed_uuids_path to single definition at top of test()
- Compute log slug dynamically in test.yml for shard-aware filenames
- Remove unnecessary shard: '' from non-sharded matrix entries
- Replace useless cat|tr pipeline with tr < file

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --only filter now detects whether each term is a UUID (8-char hex)
or a trace label and applies appropriate matching:
  - Labels: AND logic (--only 2D Bubbles matches tests with both)
  - UUIDs: OR logic (--only UUID1 UUID2 matches tests with either)
  - Mixed: keep case if all labels match OR any UUID matches

This preserves the documented behavior for label filtering while
correctly supporting the CI retry path that passes multiple UUIDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 26, 2026
submit.sh now auto-detects job type (bench vs test) from the submitted
script's basename, selecting the appropriate SBATCH account, time limit,
and partition. This eliminates three submit-bench.sh files and makes
frontier/ and frontier_amd/ scripts byte-identical via directory-name
detection for compiler flags and cluster-specific options.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 26, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 26, 2026
@codecov
Copy link

codecov bot commented Feb 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.05%. Comparing base (df28255) to head (06c0641).
⚠️ Report is 29 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1171   +/-   ##
=======================================
  Coverage   44.05%   44.05%           
=======================================
  Files          70       70           
  Lines       20496    20496           
  Branches     1989     1989           
=======================================
  Hits         9030     9030           
  Misses      10328    10328           
  Partials     1138     1138           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

Claude Code Review

Head SHA: a1c55ede5160f32ec503b4e154c0ee04f2a3e4da
Files changed: 17

Changed files
  • .github/scripts/submit_and_monitor_bench.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh
  • .github/workflows/frontier/build.sh
  • .github/workflows/frontier/submit-bench.sh (deleted)
  • .github/workflows/frontier/submit.sh
  • .github/workflows/frontier/test.sh
  • .github/workflows/frontier_amd/bench.sh
  • .github/workflows/frontier_amd/build.sh
  • .github/workflows/frontier_amd/submit-bench.sh (deleted)
  • .github/workflows/frontier_amd/submit.sh
  • .github/workflows/frontier_amd/test.sh
  • .github/workflows/phoenix/submit-bench.sh (deleted)
  • .github/workflows/phoenix/submit.sh
  • .github/workflows/test.yml
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/test.py

Summary

  • Introduces --shard i/n flag for modular test distribution across SLURM jobs; sharding is applied post---only filtering in __filter().
  • Merges submit-bench.sh into submit.sh (3 files deleted, ~120 lines removed); bench vs. test mode is auto-detected from the submitted script basename.
  • frontier/ and frontier_amd/ scripts are unified via SCRIPT_DIR/cluster_name detection, eliminating duplicated shell code.
  • GitHub runner adds retry logic for ≤5 sporadic test failures using tests/failed_uuids.txt; abort path cleans the file to prevent stale retries.
  • --only now raises an error on zero matches (previously silent pass); UUID terms use OR logic while label terms use AND logic.

Findings

1. Behaviour change in --only (breaking for mixed use) — toolchain/mfc/test/test.py ~L881

The new _filter_only changes semantics when both UUIDs and labels are supplied:

if labels and uuids:
    keep = label_ok or uuid_ok   # OR between the two groups

Previously every term in --only was an AND (subset check). Now a UUID hit alone overrides label filtering entirely. This is documented in the docstring and intentional for the retry use-case, but it is a breaking change for any user who passes mixed terms expecting AND behaviour. A release note or in-help mention would help.

2. No zero-test guard after --shardtoolchain/mfc/test/test.py ~L932

The --only path (just above) raises MFCException when filtering yields zero cases. The shard path does not:

cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
# no guard: if not cases: raise ...

If test count < shard count (e.g., --shard 2/2 on a 1-test suite), the shard silently produces an empty run, exits 0, and a CI matrix job shows green with nothing executed. Consider adding the same zero-test check here.

3. Unpinned third-party action — .github/workflows/test.yml ~L802

uses: nick-fields/retry@v3

This is a floating tag; tags can be moved to point to different commits. For security best practice—especially in CI that touches self-hosted nodes with cluster credentials—pin to a specific commit SHA.

4. Empty sbatch_extra expands to blank line in heredoc — .github/workflows/frontier/submit.sh ~L247

sbatch_extra=""
...
$sbatch_extra   # expands to empty line inside <<EOT

SLURM ignores blank lines, so this is harmless in practice. A guard (${sbatch_extra:+$'\n'$sbatch_extra} or an if block) would be cleaner.


Nits

  • bench.yml: bot-review skip checks github.event.review.user.type != 'Bot'; works for GitHub Apps but not for regular user accounts acting as bots. Low risk given typical usage.
  • submit_and_monitor_bench.sh: output_file is now defined only inside the error branch (line 60); that is the only place it is used — correct.
  • Removal of ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16 and the unsecure-node workaround is a welcome clean-up.
  • --requeue added to Phoenix but not Frontier — appears intentional (different infra).
  • Retry step omits TEST_PCT (-% 20) intentionally — correct, since we re-run specific UUIDs.

Overall this is a well-structured hardening PR. The two items most worth addressing before merge are the zero-test guard after sharding (#2) and pinning the third-party action (#3).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L This PR changes 100-499 lines, ignoring generated files

Development

Successfully merging this pull request may close these issues.

2 participants