Add test sharding, proactive clean, and retry logic for self-hosted CI #1171
sbryngelson wants to merge 12 commits into MFlowCode:master
The -s check already guarantees the file is non-empty, so NUM_FAILED > 0 is always true in that branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…zero-match guard

- Include shard in SLURM job_slug to prevent output file collisions between parallel shards (e.g., test-gpu-acc-1-of-2.out)
- Consolidate frontier/ and frontier_amd/ submit.sh and test.sh into identical scripts that derive compiler flag and config from directory
- Add $shard_opts to CPU test branch for future-proofing
- Add zero-match guard for --only filter to fail loudly instead of silently exiting 0 when no tests match
- Hoist failed_uuids_path to single definition at top of test()
- Compute log slug dynamically in test.yml for shard-aware filenames
- Remove unnecessary shard: '' from non-sharded matrix entries
- Replace useless cat|tr pipeline with tr < file

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
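As a rough illustration of the shard-aware job slug described in this commit, the sketch below appends the shard to a base slug so that parallel shards write to distinct output files. The function name `job_slug` and the `i/n` input form are assumptions made for illustration; the actual CI scripts are shell and may build the slug differently.

```python
from typing import Optional


def job_slug(base: str, shard: Optional[str] = None) -> str:
    """Build an output-file slug, adding the shard when one is given.

    Hypothetical helper: `base` is a slug such as "test-gpu-acc" and
    `shard` is an "i/n" string such as "1/2" (or None/"" when unsharded).
    """
    if not shard:
        # Non-sharded jobs keep the plain slug.
        return base
    idx, count = shard.split("/")
    # "1/2" becomes "1-of-2", which is safe to use in a file name.
    return f"{base}-{idx}-of-{count}"


print(job_slug("test-gpu-acc", "1/2") + ".out")
```

With the example from the commit message, shard 1 of 2 would write to test-gpu-acc-1-of-2.out, while an unsharded job keeps test-gpu-acc.out.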
The --only filter now detects whether each term is a UUID (8-char hex) or a trace label and applies appropriate matching:

- Labels: AND logic (--only 2D Bubbles matches tests with both)
- UUIDs: OR logic (--only UUID1 UUID2 matches tests with either)
- Mixed: keep case if all labels match OR any UUID matches

This preserves the documented behavior for label filtering while correctly supporting the CI retry path that passes multiple UUIDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
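The matching rules in this commit message can be sketched roughly as follows. This is not the code from the PR: the function name `keep_case`, its parameters, and the exact UUID pattern are assumptions based only on the description above (a UUID is taken to be exactly eight hexadecimal characters).

```python
import re

# Assumption: a term counts as a UUID if it is exactly 8 hex characters.
UUID_RE = re.compile(r"[0-9A-Fa-f]{8}")


def keep_case(case_uuid, case_labels, only_terms):
    """Decide whether a test case survives the --only filter (illustrative)."""
    if not only_terms:
        return True  # no filter given: keep everything

    uuids = [t for t in only_terms if UUID_RE.fullmatch(t)]
    labels = [t for t in only_terms if not UUID_RE.fullmatch(t)]

    # Labels use AND logic: every label term must appear in the case's trace.
    label_ok = bool(labels) and all(label in case_labels for label in labels)
    # UUIDs use OR logic: the case's UUID must equal any one of the terms.
    uuid_ok = bool(uuids) and case_uuid.lower() in {u.lower() for u in uuids}

    # Mixed input: keep if all labels match OR any UUID matches.
    return label_ok or uuid_ok


print(keep_case("1A2B3C4D", ["2D", "Bubbles"], ["2D", "Bubbles"]))
```

One consequence of classifying by shape is that a label that happens to be eight hex characters would be treated as a UUID; whether the PR handles that case is not stated here.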
submit.sh now auto-detects job type (bench vs test) from the submitted script's basename, selecting the appropriate SBATCH account, time limit, and partition. This eliminates three submit-bench.sh files and makes frontier/ and frontier_amd/ scripts byte-identical via directory-name detection for compiler flags and cluster-specific options.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | master | #1171 |
|---|---|---|
| Coverage | 44.05% | 44.05% |
| Files | 70 | 70 |
| Lines | 20496 | 20496 |
| Branches | 1989 | 1989 |
| Hits | 9030 | 9030 |
| Misses | 10328 | 10328 |
| Partials | 1138 | 1138 |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review

Head SHA:

Changed files
Summary
Findings

1. Behaviour change in …

The new …

    if labels and uuids:
        keep = label_ok or uuid_ok  # OR between the two groups

Previously every term in …

2. No zero-test guard after …

The …

    cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
    # no guard: if not cases: raise ...

If test count < shard count (e.g., …

3. Unpinned third-party action

    uses: nick-fields/retry@v3

This is a floating tag; tags can be moved to point to different commits. For security best practice, especially in CI that touches self-hosted nodes with cluster credentials, pin to a specific commit SHA.

4. Empty …

    sbatch_extra=""
    ...
    $sbatch_extra   # expands to empty line inside <<EOT

SLURM ignores blank lines, so this is harmless in practice. A guard ( …

Nits
Overall this is a well-structured hardening PR. The two items most worth addressing before merge are the zero-test guard after sharding (#2) and pinning the third-party action (#3).
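To make the zero-test guard from finding 2 concrete, here is a minimal sketch of modular sharding with that guard added. The selection expression is the one quoted in the review; the function name `select_shard`, the `i/n` string parsing, and the choice of exception types are assumptions for illustration, not the PR's actual code.

```python
def select_shard(cases, shard):
    """Return the subset of `cases` for a shard given as "i/n" (1-based i)."""
    idx_str, count_str = shard.split("/")
    shard_idx, shard_count = int(idx_str), int(count_str)
    if not 1 <= shard_idx <= shard_count:
        raise ValueError(f"invalid shard {shard!r}: expected 1 <= i <= n")

    # Round-robin split: case k goes to shard (k % n) + 1.
    selected = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]

    # Guard suggested by the review: fail loudly instead of running zero tests.
    if not selected:
        raise RuntimeError(f"shard {shard} selected no tests out of {len(cases)}")
    return selected


print(select_shard(["a", "b", "c", "d", "e"], "1/2"))
```

Whether an empty shard should be an error or a silent no-op is a policy decision for the CI; the sketch follows the reviewer's suggestion to raise.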
Summary
Hardens self-hosted CI with test sharding, retry logic, and script deduplication.
Test sharding & retry
- --shard i/n flag to ./mfc.sh test: splits tests via modular arithmetic for even distribution
- … tests/failed_uuids.txt
- … failed_uuids.txt to prevent stale retries

--only filter improvements

- --only matching zero tests now raises an error instead of silently passing

CI script consolidation
- … submit-bench.sh into submit.sh for all 3 clusters (frontier, frontier_amd, phoenix); submit.sh auto-detects bench vs test mode from the submitted script's basename
- … frontier/ and frontier_amd/ scripts via directory-name detection; build.sh, bench.sh, submit.sh, and test.sh are now byte-identical across both directories

Other
- --requeue on Phoenix SLURM jobs for preemption recovery

Depends on: #1170
Test plan
- … --requeue and preemption recovery
- … submit.sh (bench mode auto-detected)
- frontier/ and frontier_amd/ scripts are identical and detect cluster correctly