Add test sharding, proactive clean, and retry logic for self-hosted CI by sbryngelson · Pull Request #1171 · MFlowCode/MFC

sbryngelson · 2026-02-19T20:01:54Z

Summary

Hardens self-hosted CI with test sharding, retry logic, and script deduplication.

Test sharding & retry

Add --shard i/n flag to ./mfc.sh test — splits tests via modular arithmetic for even distribution
Frontier GPU matrix now runs 2 shards per interface (acc/omp), halving wall-clock time
GitHub runner tests retry up to 5 sporadic failures using tests/failed_uuids.txt
Abort path cleans failed_uuids.txt to prevent stale retries

`--only` filter improvements

UUIDs use OR logic (match any), labels use AND logic (match all)
--only matching zero tests now raises an error instead of silently passing

CI script consolidation

Merge submit-bench.sh into submit.sh for all 3 clusters (frontier, frontier_amd, phoenix) — submit.sh auto-detects bench vs test mode from the submitted script's basename
Unify frontier/ and frontier_amd/ scripts via directory-name detection — build.sh, bench.sh, submit.sh, and test.sh are now byte-identical across both directories
Net deletion of 3 files and ~120 lines of duplicated shell code

Other

--requeue on Phoenix SLURM jobs for preemption recovery
Build retry wrapper (3 attempts with clean between)
Lint-gate must pass before self-hosted tests run
Skip benchmark workflow for bot review events

Depends on: #1170

Test plan

Frontier GPU tests run in 2 shards per interface and complete within 2h
Phoenix tests pass with --requeue and preemption recovery
Lint-gate blocks self-hosted tests on lint failure
GitHub runner retry logic fires on ≤5 test failures
Benchmark jobs submit correctly via merged submit.sh (bench mode auto-detected)
frontier/ and frontier_amd/ scripts are identical and detect cluster correctly

The -s check already guarantees the file is non-empty, so NUM_FAILED > 0 is always true in that branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…zero-match guard - Include shard in SLURM job_slug to prevent output file collisions between parallel shards (e.g., test-gpu-acc-1-of-2.out) - Consolidate frontier/ and frontier_amd/ submit.sh and test.sh into identical scripts that derive compiler flag and config from directory - Add $shard_opts to CPU test branch for future-proofing - Add zero-match guard for --only filter to fail loudly instead of silently exiting 0 when no tests match - Hoist failed_uuids_path to single definition at top of test() - Compute log slug dynamically in test.yml for shard-aware filenames - Remove unnecessary shard: '' from non-sharded matrix entries - Replace useless cat|tr pipeline with tr < file Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The --only filter now detects whether each term is a UUID (8-char hex) or a trace label and applies appropriate matching: - Labels: AND logic (--only 2D Bubbles matches tests with both) - UUIDs: OR logic (--only UUID1 UUID2 matches tests with either) - Mixed: keep case if all labels match OR any UUID matches This preserves the documented behavior for label filtering while correctly supporting the CI retry path that passes multiple UUIDs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

submit.sh now auto-detects job type (bench vs test) from the submitted script's basename, selecting the appropriate SBATCH account, time limit, and partition. This eliminates three submit-bench.sh files and makes frontier/ and frontier_amd/ scripts byte-identical via directory-name detection for compiler flags and cluster-specific options. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-02-26T06:33:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.05%. Comparing base (df28255) to head (06c0641).
⚠️ Report is 29 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1171   +/-   ##
=======================================
  Coverage   44.05%   44.05%           
=======================================
  Files          70       70           
  Lines       20496    20496           
  Branches     1989     1989           
=======================================
  Hits         9030     9030           
  Misses      10328    10328           
  Partials     1138     1138

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-02-26T14:52:04Z

Claude Code Review

Head SHA: a1c55ede5160f32ec503b4e154c0ee04f2a3e4da
Files changed: 17

Changed files

.github/scripts/submit_and_monitor_bench.sh
.github/workflows/bench.yml
.github/workflows/frontier/bench.sh
.github/workflows/frontier/build.sh
.github/workflows/frontier/submit-bench.sh (deleted)
.github/workflows/frontier/submit.sh
.github/workflows/frontier/test.sh
.github/workflows/frontier_amd/bench.sh
.github/workflows/frontier_amd/build.sh
.github/workflows/frontier_amd/submit-bench.sh (deleted)
.github/workflows/frontier_amd/submit.sh
.github/workflows/frontier_amd/test.sh
.github/workflows/phoenix/submit-bench.sh (deleted)
.github/workflows/phoenix/submit.sh
.github/workflows/test.yml
toolchain/mfc/cli/commands.py
toolchain/mfc/test/test.py

Summary

Introduces --shard i/n flag for modular test distribution across SLURM jobs; sharding is applied post---only filtering in __filter().
Merges submit-bench.sh into submit.sh (3 files deleted, ~120 lines removed); bench vs. test mode is auto-detected from the submitted script basename.
frontier/ and frontier_amd/ scripts are unified via SCRIPT_DIR/cluster_name detection, eliminating duplicated shell code.
GitHub runner adds retry logic for ≤5 sporadic test failures using tests/failed_uuids.txt; abort path cleans the file to prevent stale retries.
--only now raises an error on zero matches (previously silent pass); UUID terms use OR logic while label terms use AND logic.

Findings

1. Behaviour change in --only (breaking for mixed use) — toolchain/mfc/test/test.py ~L881

The new _filter_only changes semantics when both UUIDs and labels are supplied:

if labels and uuids:
    keep = label_ok or uuid_ok   # OR between the two groups

Previously every term in --only was an AND (subset check). Now a UUID hit alone overrides label filtering entirely. This is documented in the docstring and intentional for the retry use-case, but it is a breaking change for any user who passes mixed terms expecting AND behaviour. A release note or in-help mention would help.

2. No zero-test guard after --shard — toolchain/mfc/test/test.py ~L932

The --only path (just above) raises MFCException when filtering yields zero cases. The shard path does not:

cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
# no guard: if not cases: raise ...

If test count < shard count (e.g., --shard 2/2 on a 1-test suite), the shard silently produces an empty run, exits 0, and a CI matrix job shows green with nothing executed. Consider adding the same zero-test check here.

3. Unpinned third-party action — .github/workflows/test.yml ~L802

uses: nick-fields/retry@v3

This is a floating tag; tags can be moved to point to different commits. For security best practice—especially in CI that touches self-hosted nodes with cluster credentials—pin to a specific commit SHA.

4. Empty sbatch_extra expands to blank line in heredoc — .github/workflows/frontier/submit.sh ~L247

sbatch_extra=""
...
$sbatch_extra   # expands to empty line inside <<EOT

SLURM ignores blank lines, so this is harmless in practice. A guard (${sbatch_extra:+$'\n'$sbatch_extra} or an if block) would be cleaner.

Nits

bench.yml: bot-review skip checks github.event.review.user.type != 'Bot'; works for GitHub Apps but not for regular user accounts acting as bots. Low risk given typical usage.
submit_and_monitor_bench.sh: output_file is now defined only inside the error branch (line 60); that is the only place it is used — correct.
Removal of ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16 and the unsecure-node workaround is a welcome clean-up.
--requeue added to Phoenix but not Frontier — appears intentional (different infra).
Retry step omits TEST_PCT (-% 20) intentionally — correct, since we re-run specific UUIDs.

Overall this is a well-structured hardening PR. The two items most worth addressing before merge are the zero-test guard after sharding (#2) and pinning the third-party action (#3).

Copilot AI review requested due to automatic review settings February 19, 2026 20:01

Copilot started reviewing on behalf of sbryngelson February 19, 2026 20:02 View session

codeant-ai bot added the size:M This PR changes 30-99 lines, ignoring generated files label Feb 19, 2026

This comment was marked as outdated.

Sign in to view

codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:M This PR changes 30-99 lines, ignoring generated files labels Feb 20, 2026

This comment was marked as outdated.

Sign in to view

codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 21, 2026

sbryngelson force-pushed the ci-test branch from 55b68e5 to 491b27b Compare February 23, 2026 14:50

codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 23, 2026

MFlowCode deleted a comment from github-actions bot Feb 23, 2026

sbryngelson force-pushed the ci-test branch from 3ce4f39 to f3bab46 Compare February 24, 2026 16:03

codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 24, 2026

sbryngelson force-pushed the ci-test branch from 749eb67 to a9b1e40 Compare February 24, 2026 16:49

codeant-ai bot added size:L This PR changes 100-499 lines, ignoring generated files and removed size:L This PR changes 100-499 lines, ignoring generated files labels Feb 24, 2026

sbryngelson marked this pull request as draft February 25, 2026 01:04

MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026

MFlowCode deleted a comment from Copilot AI Feb 26, 2026

MFlowCode deleted a comment from coderabbitai bot Feb 26, 2026

MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026

MFlowCode deleted a comment from coderabbitai bot Feb 26, 2026

MFlowCode deleted a comment from codeant-ai bot Feb 26, 2026

sbryngelson and others added 3 commits February 25, 2026 21:23

Remove redundant NUM_FAILED > 0 guard in test retry logic

b5c095f

The -s check already guarantees the file is non-empty, so NUM_FAILED > 0 is always true in that branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MFlowCode deleted a comment from github-actions bot Feb 26, 2026

Use normal QOS instead of hackathon for Frontier test jobs

a1c55ed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add test sharding, proactive clean, and retry logic for self-hosted CI#1171

Add test sharding, proactive clean, and retry logic for self-hosted CI#1171
sbryngelson wants to merge 12 commits intoMFlowCode:masterfrom
sbryngelson:ci-test

sbryngelson commented Feb 19, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

codecov bot commented Feb 26, 2026

Uh oh!

github-actions bot commented Feb 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

sbryngelson commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test sharding & retry

--only filter improvements

CI script consolidation

Other

Test plan

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

codecov bot commented Feb 26, 2026

Codecov Report

Uh oh!

github-actions bot commented Feb 26, 2026

Claude Code Review

Summary

Findings

Nits

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

sbryngelson commented Feb 19, 2026 •

edited

Loading

`--only` filter improvements