B2: triage run -- mitigation x environment matrix runner by amd-vivekag · Pull Request #160 · ROCm/aorta

amd-vivekag · 2026-05-05T09:55:35Z

Implements task B2 (issue #151): the aorta triage run command that sweeps a (mitigation x environment x trials) matrix in-process via B1's run_trials, aggregates per-cell stats, classifies each cell against a baseline with speed-confound detection, and emits matrix.md / matrix.json / recipe.resolved.yaml under a ticket-grouped output directory.

Scope (matrix mode only per design D11; optimize mode deferred to P1):

triage/recipe.py -- schema v1, YAML/JSON loader, and the build_recipe_from_flags shim. Recipe/Cell/InlineEnv/ ConfoundCfg are frozen dataclasses; loader rejects unknown schema_version, unknown keys, duplicate cell names, and resolves every (mitigation, environment) reference through B3's registry (built-ins + entry-point plugins + B3.1 sidecars). Inline-docker envs auto-name as _inline_<8char-blake2b> so two cells with the same image ref collapse to one environment probe.
triage/runner.py -- run_recipe() is the single execution path both modes converge on. Resolves <output>/<ticket>/<workload>/<ts>/, writes inline-docker sidecar JSON for B1's resolver to consume, captures host + per-environment collect_env snapshots, dispatches every cell via in-process run_trials. Per-cell exceptions are caught and surfaced as an error row so a single bad cell doesn't kill the matrix. Strictly no subprocess -- a grep test under tests/triage/ enforces it.
triage/matrix.py -- aggregate_cell() collapses per-trial results into median/p90/min/max step time, NaN counts, exit-status histogram, and the per-cell label set.
triage/confound.py -- resolve_baseline() (explicit override -> first baseline-* cell -> first mitigations: [none] cell) and classify_all() (cell_step_time / baseline_step_time > threshold -> speed (+N%) flag). Threshold default 1.15 per D11.
triage/output.py -- writes the three artifacts. matrix.md is human-first (table with per-row label badges + run summary); matrix.json is machine-first (full per-cell stats and the resolved recipe round-trips into recipe.resolved.yaml so a matrix can be re-run byte-identically).
cli/triage.py -- two equivalent entry points: --recipe <file> (primary) and --mode matrix --workload ... --mitigation-axis ... --environment-axis ... (flag shim). Mixing the two raises a UsageError up-front instead of silently letting one win. Adds list-mitigations / list-environments discovery subcommands (delegates to B3's resolver, prints source_package so users can see whether a mitigation came from aorta, a plugin, or a sidecar).

Recipes:

recipes/README.md -- full schema reference + worked examples.
recipes/example-fsdp-smoke.yaml -- minimal end-to-end example.

Tests (1364 lines, 5 files):

test_recipe.py -- schema validation: unknown version, unknown top-level keys, duplicate cell names, missing required fields, inline-env hash determinism, sidecar resolution.
test_runner_b1_api.py -- per-cell error isolation, in-process guarantee (greps triage/runner.py for subprocess), inline-env sidecar round-trip, per-environment snapshot caching.
test_matrix.py -- aggregation against synthetic TrialResult fixtures incl. the all-NaN and mixed-exit-status edges.
test_confound.py -- baseline resolution priority and the threshold classifier (boundary at exactly 1.15).
test_output_layout.py -- artifact paths, matrix.md table shape, matrix.json schema, resolved-recipe round-trip.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Implements task B2 (issue #151): the `aorta triage run` command that sweeps a (mitigation x environment x trials) matrix in-process via B1's `run_trials`, aggregates per-cell stats, classifies each cell against a baseline with speed-confound detection, and emits matrix.md / matrix.json / recipe.resolved.yaml under a ticket-grouped output directory. Scope (matrix mode only per design D11; optimize mode deferred to P1): * `triage/recipe.py` -- schema v1, YAML/JSON loader, and the `build_recipe_from_flags` shim. `Recipe`/`Cell`/`InlineEnv`/ `ConfoundCfg` are frozen dataclasses; loader rejects unknown schema_version, unknown keys, duplicate cell names, and resolves every `(mitigation, environment)` reference through B3's registry (built-ins + entry-point plugins + B3.1 sidecars). Inline-docker envs auto-name as `_inline_<8char-blake2b>` so two cells with the same image ref collapse to one environment probe. * `triage/runner.py` -- `run_recipe()` is the single execution path both modes converge on. Resolves `<output>/<ticket>/<workload>/<ts>/`, writes inline-docker sidecar JSON for B1's resolver to consume, captures host + per-environment `collect_env` snapshots, dispatches every cell via in-process `run_trials`. Per-cell exceptions are caught and surfaced as an `error` row so a single bad cell doesn't kill the matrix. Strictly no `subprocess` -- a grep test under `tests/triage/` enforces it. * `triage/matrix.py` -- `aggregate_cell()` collapses per-trial results into median/p90/min/max step time, NaN counts, exit-status histogram, and the per-cell label set. * `triage/confound.py` -- `resolve_baseline()` (explicit override -> first `baseline-*` cell -> first `mitigations: [none]` cell) and `classify_all()` (cell_step_time / baseline_step_time > threshold -> `speed (+N%)` flag). Threshold default 1.15 per D11. * `triage/output.py` -- writes the three artifacts. matrix.md is human-first (table with per-row label badges + run summary); matrix.json is machine-first (full per-cell stats and the resolved recipe round-trips into recipe.resolved.yaml so a matrix can be re-run byte-identically). * `cli/triage.py` -- two equivalent entry points: `--recipe <file>` (primary) and `--mode matrix --workload ... --mitigation-axis ... --environment-axis ...` (flag shim). Mixing the two raises a UsageError up-front instead of silently letting one win. Adds `list-mitigations` / `list-environments` discovery subcommands (delegates to B3's resolver, prints source_package so users can see whether a mitigation came from `aorta`, a plugin, or a sidecar). Recipes: * `recipes/README.md` -- full schema reference + worked examples. * `recipes/example-fsdp-smoke.yaml` -- minimal end-to-end example. Tests (1364 lines, 5 files): * `test_recipe.py` -- schema validation: unknown version, unknown top-level keys, duplicate cell names, missing required fields, inline-env hash determinism, sidecar resolution. * `test_runner_b1_api.py` -- per-cell error isolation, in-process guarantee (greps `triage/runner.py` for `subprocess`), inline-env sidecar round-trip, per-environment snapshot caching. * `test_matrix.py` -- aggregation against synthetic TrialResult fixtures incl. the all-NaN and mixed-exit-status edges. * `test_confound.py` -- baseline resolution priority and the threshold classifier (boundary at exactly 1.15). * `test_output_layout.py` -- artifact paths, matrix.md table shape, matrix.json schema, resolved-recipe round-trip. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Adds the B2 triage matrix workflow to aorta: recipe/flag-driven orchestration around B1 run_trials, aggregation/classification, output artifact generation, and CLI entry points for running and inspecting mitigation/environment registries.

Changes:

Adds triage core modules for recipe parsing, per-cell execution, aggregation, confound classification, and artifact writing.
Replaces the placeholder aorta triage run CLI with recipe mode, flag-shim matrix mode, and registry discovery subcommands.
Adds recipe documentation/examples plus a substantial new tests/triage/ suite covering loader, runner, output, matrix, and confound behavior.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`src/aorta/cli/triage.py`	Implements the triage CLI, dual entry paths, and list subcommands.
`src/aorta/triage/__init__.py`	Exposes the new triage public API.
`src/aorta/triage/recipe.py`	Defines recipe schema/dataclasses, file loader, and flag-mode builder.
`src/aorta/triage/runner.py`	Orchestrates matrix execution, env capture, cell dispatch, and artifact emission.
`src/aorta/triage/matrix.py`	Aggregates per-trial results into per-cell statistics.
`src/aorta/triage/confound.py`	Resolves the baseline and classifies cells for speed confounds.
`src/aorta/triage/output.py`	Resolves run directories and writes `matrix.md`, `matrix.json`, and resolved recipe output.
`recipes/README.md`	Documents the recipe schema and output layout.
`recipes/example-fsdp-smoke.yaml`	Provides a minimal example triage recipe.
`tests/triage/__init__.py`	Adds the new triage test package.
`tests/triage/test_recipe.py`	Covers recipe loading/validation and flag builder behavior.
`tests/triage/test_runner_b1_api.py`	Verifies runner plumbing, in-process execution, and CLI smoke paths.
`tests/triage/test_matrix.py`	Tests cell aggregation behavior and timing fallbacks.
`tests/triage/test_confound.py`	Tests baseline resolution and confound classification.
`tests/triage/test_output_layout.py`	Verifies artifact layout/content and fail-soft output behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            environment_axis=environment_axis,
+            trials=trials,
+            baseline_cell=baseline_cell,
+        )


+        resolved_cells.append(
+            {
+                "name": cell.name,
+                "mitigations": list(cell.mitigations),
+                "mitigation_contributions": mit_contributions,
+                "environment": cell.environment,
+                "resolved_environment": resolved_env,
+                "extra_env": dict(cell.extra_env),
+                "resolved_mitigation_env": mit_union,
+                "trials": cell.effective_trials(recipe.trials),
+                "steps": cell.effective_steps(recipe.steps),
+            }
+        )
+
+    doc = {
+        "schema_version": recipe.schema_version,
+        "ticket": recipe.ticket,
+        "workload": recipe.workload,
+        "trials": recipe.trials,
+        "steps": recipe.steps,
+        "confound": {
+            "threshold": recipe.confound.threshold,
+            "baseline_cell": recipe.confound.baseline_cell,
+        },
+        "inline_environments": [
+            {"name": e.name, "docker": e.docker} for e in recipe.inline_environments
+        ],
+        "cells": resolved_cells,


+        if cell.environment not in seen_envs:
+            _capture_env(
+                env_dir / cell.environment / "env.json",
+                scope=f"environment:{cell.environment}",
+                warnings=warnings,
+            )


+    _validate_names_resolve(cells_tuple, inline_envs, sidecar_files)
+
+    if confound.baseline_cell is not None:
+        names = {c.name for c in cells_tuple}
+        if confound.baseline_cell not in names:
+            raise RecipeCellError(
+                f"confound.baseline_cell {confound.baseline_cell!r} does not "
+                f"match any cell name; cells: {sorted(names)}"
+            )
+
+    return Recipe(
+        schema_version=SCHEMA_VERSION,
+        workload=workload,
+        trials=trials,
+        steps=steps,
+        cells=cells_tuple,
+        ticket=ticket,
+        confound=confound,
+        inline_environments=tuple(inline_envs.values()),
+        source_path=source_path,
+        source_sha256=source_sha256,
+    )


+    cell_dir = _cells_dir(run_dir) / cell.name
+    cell_dir.mkdir(parents=True, exist_ok=True)


+            r = build_recipe_from_flags(
+                workload=workload,  # type: ignore[arg-type]
+                mitigation_axis=mitigation_axis,  # type: ignore[arg-type]
+                environment_axis=environment_axis,  # type: ignore[arg-type]
+                trials=trials,  # type: ignore[arg-type]
+                steps=steps,
+                ticket=ticket,
+                baseline_cell=baseline_cell,
+                confound_threshold=confound_threshold,
+                sidecar_files=mitigation_files or None,
+            )


end-of-file-fixer flagged tests/triage/__init__.py: the file had a single stray `\n` (1 byte). The hook's rule for empty packages is "zero bytes, no trailing newline"; truncating to 0 bytes satisfies it and Python still treats the directory as a package. Co-authored-by: Cursor <cursoragent@cursor.com>

… audit Copilot's review surfaced 6 distinct issues; categorising them showed three overlapping classes worth closing for the next reviewer instead of addressing one comment at a time. Class A -- validation incompleteness / not fail-fast ==================================================== (Comments 1, 4, 6.) Inputs that should have been rejected at the recipe boundary instead reached execution. * cli/triage.py: ``_reject_flag_mode_args`` only checked --workload, --mitigation-axis, --environment-axis, --trials, --baseline-cell. --steps / --ticket / --confound-threshold passed through silently when combined with --recipe, masking misconfigurations. Helper now checks every flag that affects recipe content; the no-op runner-level flags (--mode, --output-dir, --dry-run, --mitigations-file) are explicitly documented as intentionally allowed. * cli/triage.py: --confound-threshold default flips from 1.15 to None so the "user explicitly set this" check actually works. Flag-mode default is applied at build_recipe_from_flags() call time instead. * triage/recipe.py: build_recipe_from_flags() now validates the same set of primitive constraints load_recipe enforces -- positive trials/steps, non-empty workload, ticket-when-provided non-empty, and confound_threshold > 0. Previously ``aorta triage run --trials 0`` got all the way to ``run_trials()`` before failing. * triage/recipe.py: cross-mitigation env-var collision detection landed as ``_validate_no_mitigation_collisions`` and is wired into both _build_recipe (load path) and build_recipe_from_flags (flag path). The Cell docstring previously promised this check existed; it didn't, and ``dict.update`` was silently letting the later mitigation win. ``extra_env`` overrides stay silent because they are documented as the intentional-override knob. Class B -- documented contract not honoured =========================================== (Comments 2, 4.) Code claimed behaviours its implementation didn't deliver. * triage/output.py: ``recipe.resolved.yaml`` was advertised as rerunnable but contained per-cell debug fields (``mitigation_contributions``, ``resolved_environment``, ``resolved_mitigation_env``) and a top-level ``inline_environments`` block, all of which load_recipe rejects as unknown schema keys. Rewrote write_resolved_recipe to emit a strict, schema-valid recipe. Inline-docker cells re-emit the ``{docker: <ref>}`` shorthand so the same ``_inline_<hash>`` is re-derived at reload without a sidecar. Per-cell expansion data (the resolved Environment descriptor) moved into matrix.json::cells[*].resolved_environment, where it logically belongs as run-state. The unioned env-var bundle was already in matrix.json::cells[*].resolved_env_vars (via CellStats), so nothing is lost. * recipes/README.md: ``§Re-running a past matrix`` updated to reflect the strict-rerunnable property, with an explicit note that registry drift on named entries is currently not pinned (use matrix.json's resolved_env_vars to detect it). * triage/recipe.py: Cell docstring rewritten -- the "later names win" language was the silent-update bug in disguise. Now spells out "rejected at recipe-construction time". Class C -- recipe strings used as filesystem paths ================================================== (Comment 5.) Recipe-supplied strings reached Path operations without sanitisation; ``cell.name = "../foo"`` could escape the run directory. * triage/recipe.py: cell-name regex and reserved-name list: ``^[A-Za-z0-9_][A-Za-z0-9_.\-]*$`` plus a reject list for ``.``, ``..``, ``matrix.md``, ``matrix.json``. Validated at parse time (recipe mode) and at flag-mode cell synthesis. The leading-char restriction also prevents ``--`` getting interpreted as a CLI flag if anyone copy-pastes a cell name into a shell. * triage/output.py: safe_slug() additionally collapses ``.`` and ``..`` to ``_`` -- the character-class regex alone allowed those through, so a ticket like ``..`` would have moved the run dir up one level. * triage/runner.py: the cell.name and cell.environment path components now go through safe_slug() at the runner boundary (defence in depth -- the recipe loader already rejects unsafe names, but the slug at the boundary means even a hand-built Recipe with a bad name can't escape). * recipes/README.md: ``cells[*].name`` doc updated with the regex. Comment 3 (per-environment env probe is misleading) =================================================== The runner was calling ``collect_env()`` in the runner process and labelling the result as the docker / venv environment. Since B1 doesn't actually swap to docker today (in-process), the artifact was the host state under a docker label. * triage/runner.py: new ``_is_isolated_environment`` predicate (true if the env's Environment descriptor sets ``docker`` or ``venv``, or if the env is an inline-docker auto-name). When true, the runner writes a non-misleading placeholder to ``environments/<name>/env.json``: ``snapshot_captured=False``, the descriptor itself, and an explanation of why (B1 in-process). The per-run host_env.json still captures the runner's view, so nothing about the host is lost. * A warning is appended to matrix.md / matrix.json so the operator sees the skip explicitly. Tests ===== 22 new tests across the three classes (124 passing in tests/triage, was 102): * test_recipe.py: - 9 cell-name path-safety params (../escape, a/b, .., ., matrix.md, leading dash, embedded whitespace, etc.) + 1 happy case asserting ``PROJ_1.fix-cell`` parses fine. - test_stacked_mitigations_with_disagreeing_keys_rejected / _agreeing_keys_accepted / _error_names_the_cell_and_keys for the new collision detector (mitigations monkey-patched). - test_extra_env_overrides_mitigation_silently pins that extra_env overrides intentionally do NOT trip the collision check. - 4 flag-mode boundary-validation tests: trials=0 / -1, steps=0 / -5, empty workload, threshold=0/-1/-0.5, ticket="". * test_runner_b1_api.py: - test_cli_recipe_mode_rejects_every_flag_mode_knob parametrised over all 8 recipe-content flags (the comment-1 fix). - test_cli_recipe_mode_allows_runner_only_flags pins that --output-dir / --dry-run still work alongside --recipe. - test_cli_flag_mode_rejects_non_positive_trials pins the --trials 0 fail-fast (comment 6). * test_output_layout.py: - test_safe_slug_rejects_dot_components for the safe_slug fix. - test_resolved_recipe_is_loadable_by_load_recipe pins the strict rerunnable contract (comment 2). - test_resolved_recipe_inline_envs_emit_docker_shorthand pins the ``{docker: <ref>}`` re-emission for inline cells. - test_matrix_json_records_resolved_environment_per_cell pins that the moved-out debug expansion lands in matrix.json. - test_isolated_env_writes_placeholder_not_runner_snapshot pins the comment-3 fix (placeholder, not misleading collect_env call). - test_host_env_collected_exactly_once updated: 1 host + 1 local-env probe = 2, not 3, because the inline-docker env now skips its runner-process probe. Verification ============ * pytest tests/triage/: 124 passing (was 102; +22 new). * pytest tests/ (excl. tests/hw_queue_eval which needs torch): 403 passing. * ruff check + ruff format: clean across all touched src + tests. * mypy: 2 pre-existing yaml-import-untyped errors only; no new errors. * pre-commit run --all-files: clean. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    threshold = raw.get("threshold", 1.15)
+    if not isinstance(threshold, (int, float)) or isinstance(threshold, bool):
+        raise RecipeSchemaError(
+            f"{path_hint}.confound.threshold: must be a number, got {type(threshold).__name__}"
+        )
+    baseline = raw.get("baseline_cell")
+    if baseline is not None and not isinstance(baseline, str):
+        raise RecipeSchemaError(
+            f"{path_hint}.confound.baseline_cell: must be a string, got {type(baseline).__name__}"
+        )
+    return ConfoundCfg(threshold=float(threshold), baseline_cell=baseline)


+    Lookup failures fall back to "treat as local" so probe behaviour is
+    unchanged for envs we genuinely don't know anything about; the
+    pre-flight :func:`_validate_names_resolve` would already have failed if
+    the name were truly unknown.
+    """
+    if any(env_name == e.name for e in inline_envs):
+        return True
+    try:
+        descriptor = get_environment(env_name)
+    except RegistryError:
+        return False


+    """Write a schema-valid ``recipe.resolved.yaml`` that can be re-loaded as-is.
+
+    The output is a strict :func:`aorta.triage.recipe.load_recipe` input -- no
+    debug-only fields, no per-cell ``resolved_mitigation_env`` block, no
+    top-level ``inline_environments`` key. The "resolved" property comes from
+    inline-docker cells being re-emitted in the ``{docker: <ref>}`` shorthand
+    form (so re-loading on another machine reproduces the same
+    ``_inline_<hash>`` env without needing a sidecar JSON next to the file).
+
+    Per-cell debug expansions (the ``Environment`` descriptor and the unioned
+    mitigation env-var bundle) live in ``matrix.json`` instead, where they
+    belong as run-time state. ``sidecar_files`` is intentionally unused at
+    write time -- B3 already resolved everything by the time we got here --
+    but the parameter is preserved so the runner can keep its single
+    "everything-needed-for-replay" call site.
+    """
+    del sidecar_files  # see docstring; kept for caller-stable signature


+    @property
+    def nan_rate(self) -> float:
+        """Fraction of trials that failed. 0.0 for an empty cell."""
+        if self.trials == 0:
+            return 0.0
+        return self.failed_count / self.trials


+    passed_count: int
+    failed_count: int
+    mean_step_time_ms: float
+    std_step_time_ms: float
+    p50_step_time_ms: float
+    p99_step_time_ms: float
+    mean_wall_clock_sec: float
+    step_times_ms: list[float] = field(default_factory=list)
+    trial_paths: list[str] = field(default_factory=list)


+    found: list[Path] = []
+    for candidate in sorted(results_dir.rglob("trial_*.json")):
+        found.append(candidate)


+        if cell.environment not in seen_envs:
+            env_json_path = env_dir / safe_slug(cell.environment) / "env.json"
+            if _is_isolated_environment(cell.environment, recipe.inline_environments):
+                _write_isolated_env_placeholder(
+                    env_json_path,
+                    cell.environment,
+                    recipe.inline_environments,
+                    warnings,
+                )
+            else:
+                _capture_env(
+                    env_json_path,
+                    scope=f"environment:{cell.environment}",
+                    warnings=warnings,
+                )
+            seen_envs.add(cell.environment)


+def list_mitigations(files: tuple[Path, ...]) -> None:
+    """List every registered mitigation with its source_package and env-var bundle."""
+    registry = load_mitigations(extra_files=list(files) or None)
+    name_w = max(len("NAME"), *(len(n) for n in registry))
+    src_w = max(len("SOURCE"), *(len(m.source_package) for m in registry.values()))
+    click.echo(f"{'NAME'.ljust(name_w)}  {'SOURCE'.ljust(src_w)}  ENV")
+    for name in sorted(registry):
+        m = registry[name]
+        env_str = " ".join(f"{k}={v}" for k, v in sorted(m.env.items())) or "(none)"
+        click.echo(f"{name.ljust(name_w)}  {m.source_package.ljust(src_w)}  {env_str}")


+def list_environments(files: tuple[Path, ...]) -> None:
+    """List every registered environment with its source_package and docker/venv."""
+    registry = load_environments(extra_files=list(files) or None)
+    name_w = max(len("NAME"), *(len(n) for n in registry))
+    src_w = max(len("SOURCE"), *(len(e.source_package) for e in registry.values()))
+    docker_w = max(len("DOCKER"), *(len(e.docker or "-") for e in registry.values()))
+    click.echo(f"{'NAME'.ljust(name_w)}  {'SOURCE'.ljust(src_w)}  {'DOCKER'.ljust(docker_w)}  VENV")
+    for name in sorted(registry):
+        e = registry[name]
+        click.echo(
+            f"{name.ljust(name_w)}  {e.source_package.ljust(src_w)}  "
+            f"{(e.docker or '-').ljust(docker_w)}  {e.venv or '-'}"


… audit Copilot's second pass surfaced 9 distinct comments. Categorising them showed five overlapping classes, all of which got the same audit-and-fix treatment as round 1: find every instance of each pattern in the diff, fix all of them, add tests pinning each behaviour. Class A -- validation parity (recipe vs flag mode disagree) =========================================================== (Comment 3188207431.) Recipe-mode loader accepted ``confound.threshold`` of ``0`` or negative, even though ``build_recipe_from_flags`` already rejected those values. A non-positive threshold makes ``classify`` flag every non-baseline cell as a speed confound (any positive ratio satisfies ``ratio > threshold``), so the recipe was always misuse. * triage/recipe.py: ``_parse_confound`` now applies the same ``> 0`` check, with an error message that names the offending value. Class B -- ``--mitigations-file`` (sidecar) lifecycle is incomplete ================================================================== (Comments 3188207484, 3188207509, 3188207675, 3188207707.) Four independent symptoms of one root cause: sidecar files weren't threaded through the parts of the system that need to know about them. * triage/runner.py: ``_is_isolated_environment`` and ``_write_isolated_env_placeholder`` now accept ``sidecar_files`` and forward it to ``get_environment``. Without this, a sidecar-defined docker/venv env failed the registry lookup, fell back to "treat as local", and the runner wrote a misleading host ``collect_env()`` snapshot under the env's name. Both call sites in ``run_recipe`` pass the runner's sidecar bundle in. * triage/runner.py: new ``_copy_operator_sidecars`` snapshots each ``--mitigations-file`` JSON into ``<run_dir>/sidecars/<basename>`` before B1 sees them. The runner then uses the snapshotted paths as the resolver source -- so what gets executed and what gets archived for replay are byte-identical. Without this, ``recipe.resolved.yaml`` was unrunnable on its own for sidecar-backed runs (the YAML references mitigation/env names, but the names were only resolvable via the operator's local sidecar). The runner also prints a one-line ``to rerun:`` hint on stdout when sidecars are involved so the operator does not have to reconstruct the flags by hand. * triage/output.py: ``write_resolved_recipe`` docstring updated to document the snapshot-and-replay convention; the YAML schema stays strict (no metadata namespace) -- the README documents the rerun command shape. * cli/triage.py: ``list-mitigations`` and ``list-environments`` now wrap ``RegistryError`` in ``ClickException`` so a malformed ``--mitigations-file`` produces the same one-line clean error that ``triage run`` and ``aorta run`` already produced. Previously a bad sidecar dumped a Python traceback. Class C -- recipe-supplied strings -> filesystem path components ================================================================ (Comment 3188207640.) Round 1 added ``safe_slug`` for ``cell.name`` and gave cell names a regex; environment names slipped through. Distinct registered names like ``a/b`` and ``a:b`` both slug to ``a_b`` and the second probe would silently overwrite the first while ``matrix.json`` kept them as distinct envs. * triage/runner.py: new ``_check_env_slug_collisions`` runs at the top of ``run_recipe`` and raises ``RegistryError`` naming both colliding env names + the shared slug, before any directory is created. This matches the existing cell-name regex enforcement in spirit (different mechanism because env names come from the registry / sidecar, not the recipe text directly). Class D -- aggregation model accuracy + completeness ==================================================== (Comments 3188207553, 3188207584.) Two issues with ``CellStats``: * ``nan_rate`` was misnamed -- the property counts every trial whose ``exit_status != "ok"`` or ``WorkloadResult.passed`` is False, which is generic failure (NaN, corruption, divergence, infrastructure crash), not NaN-specific. Renamed to ``failure_rate`` everywhere it is read or rendered: matrix.py property, confound.py classifier, output.py formatter + matrix.json key + matrix.md column header ("NaN rate" -> "Failure rate"). Consumer-facing rename is documented in the matrix.md "Notes" section so readers know to look at ``exit_status_counts`` for failure-mode detail. * ``CellStats`` was missing fields the PR description promised: ``min_step_time_ms``, ``max_step_time_ms``, ``p90_step_time_ms``, and an ``exit_status_counts`` histogram keyed by ``TrialResult.exit_status``. Without these, ``matrix.json`` consumers could not distinguish ``workload_failed`` from ``infrastructure_failed`` when triaging a cell. ``aggregate_cell`` populates all four; the error path zeroes them. README ``matrix.json per-cell shape`` section enumerates the full set so future field additions have a documented home. Class E -- sort-order bug ========================= (Comment 3188207612.) ``sorted(rglob("trial_*.json"))`` is lexicographic, so ``trial_10.json`` came before ``trial_2.json`` once a cell had 10+ trials, and ``matrix.json`` ``trial_paths`` no longer matched execution order. * triage/runner.py: ``_collect_trial_paths`` extracts the integer index from the filename and sorts on that. Files that don't parse as ``trial_<int>.json`` sort last (alphabetically), preserving any future sibling artifacts without breaking the contract for the common case. Tests ===== * tests/triage/test_matrix.py: pinned the new ``min``/``max``/``p90`` fields, the ``exit_status_counts`` histogram, and the ``failure_rate``-not-``nan_rate`` rename (``hasattr`` regression check). * tests/triage/test_recipe.py: parameterised ``test_load_recipe_rejects_non_positive_confound_threshold`` matching the existing flag-mode test. * tests/triage/test_output_layout.py: pinned the natural-sort fix, the env-name collision detection, the renamed matrix.md column, and three sidecar-lifecycle behaviours (snapshot-and-replay, isolated-env classification with sidecars, matrix.json fields). * tests/triage/test_runner_b1_api.py: updated the existing ``test_extra_sidecar_files_threaded_to_run_trials`` for the new snapshot-and-use-the-snapshot contract; added two new tests pinning the ``ClickException`` wrapping for ``list-mitigations`` / ``list-environments``. * tests/triage/test_confound.py: ``_stats`` factory updated for the new ``CellStats`` field set; renamed test names that mentioned ``nan_rate`` so future grep is honest. Comments NOT pushed back on =========================== None. All 9 comments are real correctness or contract gaps; the biggest scope-add is Class D (new ``CellStats`` fields + rename), but that matches what the PR description promised, so it belongs. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    if dry_run:
+        _print_dry_run(recipe)
+        return Path(".")


+    now = now or _dt.datetime.now(_dt.timezone.utc)
+    return now.strftime("%Y-%m-%dT%H-%M-%S")


+        "- `recipe.resolved.yaml` (alongside this file) captures the registry state "
+        "at run time -- re-run it to reproduce."


… audit Copilot's third pass flagged 4 distinct comments. Each one is a different class (no overlap this round); fixed all of them in the same audit-and-pin pattern as rounds 1 and 2. Class A -- inconsistent error-shape contract ============================================ (Comment 3188441868.) ``triage run`` wrapped ``load_recipe`` / ``build_recipe_from_flags`` errors in ``ClickException`` but left ``run_recipe()`` exceptions unhandled. Two flavours of the same kind of error therefore exited the CLI with different shapes depending on which validator caught them. This was a regression introduced in round 2: I added ``_check_env_slug_collisions`` raising ``RegistryError`` from the runner without extending the CLI wrapper to match. * cli/triage.py: ``triage run`` now wraps the ``run_recipe`` call in a ``try`` that catches ``(RegistryError, RecipeCellError, RecipeSchemaError)`` and re-raises as ``click.ClickException``. Both pre-flight failures (env-slug collisions, baseline resolution) and any late-stage recipe-level rejection now produce the same one-line ``Error: ...`` shape that ``aorta run`` and the list subcommands already produce. Class B -- ``--dry-run`` skips runtime preflight ================================================ (Comment 3188441945.) Dry-run printed a clean summary for recipes that a real run immediately rejected -- e.g. multi-cell recipes with no auto-resolvable baseline, or environment names whose slugs collide. Same "documented contract not honoured" pattern as round 1. * triage/runner.py: new ``_preflight_validate(recipe)`` collects every pure (no FS / network / subprocess) validation check and runs BEFORE the dry-run early-return. Currently covers env-slug collisions and baseline resolution; the docstring documents the purity requirement for callables added later. The inline collision check inside ``run_recipe`` is removed (it's now redundant) so there is exactly one call site. * triage/runner.py: dry-run's "no FS side effects" guarantee is strengthened: preflight runs before ``resolve_run_dir`` so a failing dry-run leaves no breadcrumbs on disk. Class C -- matrix.md note overpromised reproducibility ====================================================== (Comment 3188442020.) The Notes footer claimed ``recipe.resolved.yaml`` "captures the registry state at run time", but the writer only preserves names + inline-docker shorthands. Named mitigations / environments are NOT expanded, so a registry change between runs silently changes behaviour. The README NOTE block already warned about this; matrix.md disagreed with itself. * triage/output.py: rewrote the matrix.md Notes line to match the README's cautious wording -- explicitly flags the inline-vs-named asymmetry and points operators at ``matrix.json::cells[*].resolved_env_vars`` for drift detection. Class D -- same-second runs overwrite each other ================================================ (Comment 3188441990.) ``resolve_run_dir`` used ``mkdir(exist_ok=True)`` plus a second-precision timestamp, so two runs in the same wall-clock second for the same ``(ticket, workload)`` silently reused the same directory and the second run clobbered the first run's artifacts. The docstring guaranteed "never overwrites". Easy to hit in CI loops. * triage/output.py: ``resolve_run_dir`` now creates the ``<output>/<ticket>/<workload>/`` parent with ``exist_ok=True`` (it's a shared parent across runs), then atomically claims the leaf with ``mkdir(exist_ok=False)``. On collision it appends ``-2``, ``-3``, ... until the call succeeds. The race between two parallel processes is resolved by ``mkdir`` itself: only one wins for a given suffix, the loser bumps and retries. Bounded at 10000 iterations so a stuck clock or runaway loop surfaces as a clean RuntimeError instead of hanging. * recipes/README.md: output-layout block updated to document the ``[-N]`` suffix. Tests ===== * tests/triage/test_output_layout.py: 6 new tests -- ``test_resolve_run_dir_appends_suffix_on_same_timestamp`` and ``test_resolve_run_dir_handles_many_collisions`` (Class D); ``test_dry_run_rejects_unresolvable_baseline``, ``test_dry_run_rejects_env_slug_collision`` and ``test_dry_run_does_not_create_run_dir_on_validation_failure`` (Class B); ``test_matrix_md_does_not_overpromise_reproducibility`` (Class C). * tests/triage/test_runner_b1_api.py: new ``test_cli_run_wraps_run_recipe_errors_in_click_exception`` exercising the Class A wrapper via the unresolvable-baseline path (the same kind of late-stage error that previously escaped raw). Comments NOT pushed back on =========================== None. All 4 are real correctness / contract gaps; D is particularly bad in CI loops, A was a regression I introduced last round. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 6 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            cell_name = f"{m}-{display}"
+            _validate_cell_name(
+                f"--mitigation-axis x --environment-axis ({m!r}, {display!r})", cell_name
+            )


+    if baseline.error is not None or baseline.mean_step_time_ms <= 0:
+        # Baseline unusable: we can't compute a ratio, so surface None and
+        # let the renderer emit "n/a". The runner also writes a top-of-file
+        # warning in matrix.md so readers see why.
+        return CONFOUND_NEUTRAL, None


+  env-var bundles, trial JSON paths, cell-level errors, etc.).
+* ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry
+  name expanded to its underlying env-var bundle + docker ref. This is the
+  reproducibility artifact: re-running it on a different machine produces
+  the same matrix even if the registries drift in the interim.


+> "Reproducing the same matrix" is up to the registries available at
+> rerun time. If a sidecar mitigation's env-var bundle drifts between
+> runs the rerun will use the new bundle (snapshotting the *file* doesn't
+> pin its *contents* once the operator edits it later). Inline-docker
+> cells are immune (the docker ref is in the recipe text itself), but
+> registry drift on named entries is currently not pinned. Compare each
+> run's `matrix.json::cells[*].resolved_env_vars` to detect drift.


+        "Failure rate",
+        "Trials",
+        "Mean step (ms)",
+        "Confound",
+    )
+    rows: list[tuple[str, ...]] = [header]
+    for cell in cell_stats:
+        tag, _ = confound_tags.get(cell.name, (cell.error and "error" or "-", None))
+        rows.append(
+            (
+                cell.name,
+                _format_mitigations(cell.mitigations),
+                cell.environment,
+                _format_failure_rate(cell),
+                _format_fail_count(cell),


+    ticket_slug = safe_slug(recipe.ticket) if recipe.ticket else NO_TICKET_SLUG
+    workload_slug = safe_slug(recipe.workload)
+    ts = timestamp or format_timestamp()
+    parent = Path(output_dir) / ticket_slug / workload_slug


Copilot

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    except Exception as exc:
+        raise click.ClickException(str(exc)) from exc
+    # different exit shape, depending on which validator catches it.
+    try:
+        run_dir = run_recipe(
+            r,
+            output_dir=output_dir,
+            dry_run=dry_run,
+            extra_sidecar_files=mitigation_files,
+        )
+    except (RegistryError, RecipeCellError, RecipeSchemaError) as exc:
+        raise click.ClickException(str(exc)) from exc


+* ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry
+  name expanded to its underlying env-var bundle + docker ref. This is the
+  reproducibility artifact: re-running it on a different machine produces
+  the same matrix even if the registries drift in the interim.


+schema_version: 1
+
+ticket: EXAMPLE-151
+workload: fsdp


Copilot

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+def run_recipe(
+    recipe: Recipe,
+    output_dir: Path,
+    dry_run: bool = False,
+    extra_sidecar_files: tuple[Path, ...] = (),
+    timestamp: str | None = None,
+) -> Path:
+    """Execute a recipe and write matrix.md / matrix.json / recipe.resolved.yaml.
+
+    Args:
+        recipe: Pre-validated recipe (from :func:`aorta.triage.recipe.load_recipe`
+            or :func:`aorta.triage.recipe.build_recipe_from_flags`).
+        output_dir: Top-level output directory (the CLI's ``--output-dir``).
+        dry_run: When True, validates and prints the resolved cell list to
+            stdout without touching the filesystem and returns a sentinel
+            ``Path(".")``.
+        extra_sidecar_files: Operator-supplied sidecar JSONs (from
+            ``--mitigations-file``). These are threaded through to B1's
+            registry resolver alongside the runner-generated inline sidecar.


+    header = (
+        "Cell",
+        "Mitigations",
+        "Environment",
+        "Failure rate",
+        "Trials",
+        "Mean step (ms)",
+        "Confound",
+    )
+    rows: list[tuple[str, ...]] = [header]
+    for cell in cell_stats:
+        tag, _ = confound_tags.get(cell.name, (cell.error and "error" or "-", None))
+        rows.append(
+            (
+                cell.name,
+                _format_mitigations(cell.mitigations),
+                cell.environment,
+                _format_failure_rate(cell),
+                _format_fail_count(cell),


+    if baseline.error is not None or baseline.mean_step_time_ms <= 0:
+        # Baseline unusable: we can't compute a ratio, so surface None and
+        # let the renderer emit "n/a". The runner also writes a top-of-file
+        # warning in matrix.md so readers see why.
+        return CONFOUND_NEUTRAL, None


+* ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry
+  name expanded to its underlying env-var bundle + docker ref. This is the
+  reproducibility artifact: re-running it on a different machine produces
+  the same matrix even if the registries drift in the interim.


Copilot AI review requested due to automatic review settings May 5, 2026 09:55

Copilot started reviewing on behalf of amd-vivekag May 5, 2026 09:57 View session