B2: triage run -- mitigation x environment matrix runner#160
B2: triage run -- mitigation x environment matrix runner#160amd-vivekag wants to merge 5 commits intousers/vivekag/b1-aorta-runfrom
Conversation
Implements task B2 (issue #151): the `aorta triage run` command that sweeps a (mitigation x environment x trials) matrix in-process via B1's `run_trials`, aggregates per-cell stats, classifies each cell against a baseline with speed-confound detection, and emits matrix.md / matrix.json / recipe.resolved.yaml under a ticket-grouped output directory. Scope (matrix mode only per design D11; optimize mode deferred to P1): * `triage/recipe.py` -- schema v1, YAML/JSON loader, and the `build_recipe_from_flags` shim. `Recipe`/`Cell`/`InlineEnv`/ `ConfoundCfg` are frozen dataclasses; loader rejects unknown schema_version, unknown keys, duplicate cell names, and resolves every `(mitigation, environment)` reference through B3's registry (built-ins + entry-point plugins + B3.1 sidecars). Inline-docker envs auto-name as `_inline_<8char-blake2b>` so two cells with the same image ref collapse to one environment probe. * `triage/runner.py` -- `run_recipe()` is the single execution path both modes converge on. Resolves `<output>/<ticket>/<workload>/<ts>/`, writes inline-docker sidecar JSON for B1's resolver to consume, captures host + per-environment `collect_env` snapshots, dispatches every cell via in-process `run_trials`. Per-cell exceptions are caught and surfaced as an `error` row so a single bad cell doesn't kill the matrix. Strictly no `subprocess` -- a grep test under `tests/triage/` enforces it. * `triage/matrix.py` -- `aggregate_cell()` collapses per-trial results into median/p90/min/max step time, NaN counts, exit-status histogram, and the per-cell label set. * `triage/confound.py` -- `resolve_baseline()` (explicit override -> first `baseline-*` cell -> first `mitigations: [none]` cell) and `classify_all()` (cell_step_time / baseline_step_time > threshold -> `speed (+N%)` flag). Threshold default 1.15 per D11. * `triage/output.py` -- writes the three artifacts. matrix.md is human-first (table with per-row label badges + run summary); matrix.json is machine-first (full per-cell stats and the resolved recipe round-trips into recipe.resolved.yaml so a matrix can be re-run byte-identically). * `cli/triage.py` -- two equivalent entry points: `--recipe <file>` (primary) and `--mode matrix --workload ... --mitigation-axis ... --environment-axis ...` (flag shim). Mixing the two raises a UsageError up-front instead of silently letting one win. Adds `list-mitigations` / `list-environments` discovery subcommands (delegates to B3's resolver, prints source_package so users can see whether a mitigation came from `aorta`, a plugin, or a sidecar). Recipes: * `recipes/README.md` -- full schema reference + worked examples. * `recipes/example-fsdp-smoke.yaml` -- minimal end-to-end example. Tests (1364 lines, 5 files): * `test_recipe.py` -- schema validation: unknown version, unknown top-level keys, duplicate cell names, missing required fields, inline-env hash determinism, sidecar resolution. * `test_runner_b1_api.py` -- per-cell error isolation, in-process guarantee (greps `triage/runner.py` for `subprocess`), inline-env sidecar round-trip, per-environment snapshot caching. * `test_matrix.py` -- aggregation against synthetic TrialResult fixtures incl. the all-NaN and mixed-exit-status edges. * `test_confound.py` -- baseline resolution priority and the threshold classifier (boundary at exactly 1.15). * `test_output_layout.py` -- artifact paths, matrix.md table shape, matrix.json schema, resolved-recipe round-trip. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
Adds the B2 triage matrix workflow to aorta: recipe/flag-driven orchestration around B1 run_trials, aggregation/classification, output artifact generation, and CLI entry points for running and inspecting mitigation/environment registries.
Changes:
- Adds triage core modules for recipe parsing, per-cell execution, aggregation, confound classification, and artifact writing.
- Replaces the placeholder
aorta triage runCLI with recipe mode, flag-shim matrix mode, and registry discovery subcommands. - Adds recipe documentation/examples plus a substantial new
tests/triage/suite covering loader, runner, output, matrix, and confound behavior.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/aorta/cli/triage.py |
Implements the triage CLI, dual entry paths, and list subcommands. |
src/aorta/triage/__init__.py |
Exposes the new triage public API. |
src/aorta/triage/recipe.py |
Defines recipe schema/dataclasses, file loader, and flag-mode builder. |
src/aorta/triage/runner.py |
Orchestrates matrix execution, env capture, cell dispatch, and artifact emission. |
src/aorta/triage/matrix.py |
Aggregates per-trial results into per-cell statistics. |
src/aorta/triage/confound.py |
Resolves the baseline and classifies cells for speed confounds. |
src/aorta/triage/output.py |
Resolves run directories and writes matrix.md, matrix.json, and resolved recipe output. |
recipes/README.md |
Documents the recipe schema and output layout. |
recipes/example-fsdp-smoke.yaml |
Provides a minimal example triage recipe. |
tests/triage/__init__.py |
Adds the new triage test package. |
tests/triage/test_recipe.py |
Covers recipe loading/validation and flag builder behavior. |
tests/triage/test_runner_b1_api.py |
Verifies runner plumbing, in-process execution, and CLI smoke paths. |
tests/triage/test_matrix.py |
Tests cell aggregation behavior and timing fallbacks. |
tests/triage/test_confound.py |
Tests baseline resolution and confound classification. |
tests/triage/test_output_layout.py |
Verifies artifact layout/content and fail-soft output behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| environment_axis=environment_axis, | ||
| trials=trials, | ||
| baseline_cell=baseline_cell, | ||
| ) |
| resolved_cells.append( | ||
| { | ||
| "name": cell.name, | ||
| "mitigations": list(cell.mitigations), | ||
| "mitigation_contributions": mit_contributions, | ||
| "environment": cell.environment, | ||
| "resolved_environment": resolved_env, | ||
| "extra_env": dict(cell.extra_env), | ||
| "resolved_mitigation_env": mit_union, | ||
| "trials": cell.effective_trials(recipe.trials), | ||
| "steps": cell.effective_steps(recipe.steps), | ||
| } | ||
| ) | ||
|
|
||
| doc = { | ||
| "schema_version": recipe.schema_version, | ||
| "ticket": recipe.ticket, | ||
| "workload": recipe.workload, | ||
| "trials": recipe.trials, | ||
| "steps": recipe.steps, | ||
| "confound": { | ||
| "threshold": recipe.confound.threshold, | ||
| "baseline_cell": recipe.confound.baseline_cell, | ||
| }, | ||
| "inline_environments": [ | ||
| {"name": e.name, "docker": e.docker} for e in recipe.inline_environments | ||
| ], | ||
| "cells": resolved_cells, |
| if cell.environment not in seen_envs: | ||
| _capture_env( | ||
| env_dir / cell.environment / "env.json", | ||
| scope=f"environment:{cell.environment}", | ||
| warnings=warnings, | ||
| ) |
| _validate_names_resolve(cells_tuple, inline_envs, sidecar_files) | ||
|
|
||
| if confound.baseline_cell is not None: | ||
| names = {c.name for c in cells_tuple} | ||
| if confound.baseline_cell not in names: | ||
| raise RecipeCellError( | ||
| f"confound.baseline_cell {confound.baseline_cell!r} does not " | ||
| f"match any cell name; cells: {sorted(names)}" | ||
| ) | ||
|
|
||
| return Recipe( | ||
| schema_version=SCHEMA_VERSION, | ||
| workload=workload, | ||
| trials=trials, | ||
| steps=steps, | ||
| cells=cells_tuple, | ||
| ticket=ticket, | ||
| confound=confound, | ||
| inline_environments=tuple(inline_envs.values()), | ||
| source_path=source_path, | ||
| source_sha256=source_sha256, | ||
| ) |
| cell_dir = _cells_dir(run_dir) / cell.name | ||
| cell_dir.mkdir(parents=True, exist_ok=True) |
| r = build_recipe_from_flags( | ||
| workload=workload, # type: ignore[arg-type] | ||
| mitigation_axis=mitigation_axis, # type: ignore[arg-type] | ||
| environment_axis=environment_axis, # type: ignore[arg-type] | ||
| trials=trials, # type: ignore[arg-type] | ||
| steps=steps, | ||
| ticket=ticket, | ||
| baseline_cell=baseline_cell, | ||
| confound_threshold=confound_threshold, | ||
| sidecar_files=mitigation_files or None, | ||
| ) |
end-of-file-fixer flagged tests/triage/__init__.py: the file had a single stray `\n` (1 byte). The hook's rule for empty packages is "zero bytes, no trailing newline"; truncating to 0 bytes satisfies it and Python still treats the directory as a package. Co-authored-by: Cursor <cursoragent@cursor.com>
… audit
Copilot's review surfaced 6 distinct issues; categorising them showed three
overlapping classes worth closing for the next reviewer instead of
addressing one comment at a time.
Class A -- validation incompleteness / not fail-fast
====================================================
(Comments 1, 4, 6.) Inputs that should have been rejected at the recipe
boundary instead reached execution.
* cli/triage.py: ``_reject_flag_mode_args`` only checked --workload,
--mitigation-axis, --environment-axis, --trials, --baseline-cell.
--steps / --ticket / --confound-threshold passed through silently when
combined with --recipe, masking misconfigurations. Helper now checks
every flag that affects recipe content; the no-op runner-level flags
(--mode, --output-dir, --dry-run, --mitigations-file) are explicitly
documented as intentionally allowed.
* cli/triage.py: --confound-threshold default flips from 1.15 to None so
the "user explicitly set this" check actually works. Flag-mode default
is applied at build_recipe_from_flags() call time instead.
* triage/recipe.py: build_recipe_from_flags() now validates the same set
of primitive constraints load_recipe enforces -- positive trials/steps,
non-empty workload, ticket-when-provided non-empty, and
confound_threshold > 0. Previously ``aorta triage run --trials 0`` got
all the way to ``run_trials()`` before failing.
* triage/recipe.py: cross-mitigation env-var collision detection landed
as ``_validate_no_mitigation_collisions`` and is wired into both
_build_recipe (load path) and build_recipe_from_flags (flag path).
The Cell docstring previously promised this check existed; it didn't,
and ``dict.update`` was silently letting the later mitigation win.
``extra_env`` overrides stay silent because they are documented as the
intentional-override knob.
Class B -- documented contract not honoured
===========================================
(Comments 2, 4.) Code claimed behaviours its implementation didn't
deliver.
* triage/output.py: ``recipe.resolved.yaml`` was advertised as
rerunnable but contained per-cell debug fields
(``mitigation_contributions``, ``resolved_environment``,
``resolved_mitigation_env``) and a top-level ``inline_environments``
block, all of which load_recipe rejects as unknown schema keys.
Rewrote write_resolved_recipe to emit a strict, schema-valid recipe.
Inline-docker cells re-emit the ``{docker: <ref>}`` shorthand so the
same ``_inline_<hash>`` is re-derived at reload without a sidecar.
Per-cell expansion data (the resolved Environment descriptor) moved
into matrix.json::cells[*].resolved_environment, where it logically
belongs as run-state. The unioned env-var bundle was already in
matrix.json::cells[*].resolved_env_vars (via CellStats), so nothing
is lost.
* recipes/README.md: ``§Re-running a past matrix`` updated to reflect
the strict-rerunnable property, with an explicit note that registry
drift on named entries is currently not pinned (use matrix.json's
resolved_env_vars to detect it).
* triage/recipe.py: Cell docstring rewritten -- the "later names win"
language was the silent-update bug in disguise. Now spells out
"rejected at recipe-construction time".
Class C -- recipe strings used as filesystem paths
==================================================
(Comment 5.) Recipe-supplied strings reached Path operations without
sanitisation; ``cell.name = "../foo"`` could escape the run directory.
* triage/recipe.py: cell-name regex and reserved-name list:
``^[A-Za-z0-9_][A-Za-z0-9_.\-]*$`` plus a reject list for ``.``,
``..``, ``matrix.md``, ``matrix.json``. Validated at parse time
(recipe mode) and at flag-mode cell synthesis. The leading-char
restriction also prevents ``--`` getting interpreted as a CLI flag
if anyone copy-pastes a cell name into a shell.
* triage/output.py: safe_slug() additionally collapses ``.`` and ``..``
to ``_`` -- the character-class regex alone allowed those through,
so a ticket like ``..`` would have moved the run dir up one level.
* triage/runner.py: the cell.name and cell.environment path components
now go through safe_slug() at the runner boundary (defence in depth
-- the recipe loader already rejects unsafe names, but the slug at
the boundary means even a hand-built Recipe with a bad name can't
escape).
* recipes/README.md: ``cells[*].name`` doc updated with the regex.
Comment 3 (per-environment env probe is misleading)
===================================================
The runner was calling ``collect_env()`` in the runner process and
labelling the result as the docker / venv environment. Since B1 doesn't
actually swap to docker today (in-process), the artifact was the host
state under a docker label.
* triage/runner.py: new ``_is_isolated_environment`` predicate (true if
the env's Environment descriptor sets ``docker`` or ``venv``, or if
the env is an inline-docker auto-name). When true, the runner writes
a non-misleading placeholder to ``environments/<name>/env.json``:
``snapshot_captured=False``, the descriptor itself, and an
explanation of why (B1 in-process). The per-run host_env.json still
captures the runner's view, so nothing about the host is lost.
* A warning is appended to matrix.md / matrix.json so the operator sees
the skip explicitly.
Tests
=====
22 new tests across the three classes (124 passing in tests/triage,
was 102):
* test_recipe.py:
- 9 cell-name path-safety params (../escape, a/b, .., ., matrix.md,
leading dash, embedded whitespace, etc.) + 1 happy case asserting
``PROJ_1.fix-cell`` parses fine.
- test_stacked_mitigations_with_disagreeing_keys_rejected /
_agreeing_keys_accepted / _error_names_the_cell_and_keys for the
new collision detector (mitigations monkey-patched).
- test_extra_env_overrides_mitigation_silently pins that extra_env
overrides intentionally do NOT trip the collision check.
- 4 flag-mode boundary-validation tests: trials=0 / -1, steps=0 / -5,
empty workload, threshold=0/-1/-0.5, ticket="".
* test_runner_b1_api.py:
- test_cli_recipe_mode_rejects_every_flag_mode_knob parametrised
over all 8 recipe-content flags (the comment-1 fix).
- test_cli_recipe_mode_allows_runner_only_flags pins that
--output-dir / --dry-run still work alongside --recipe.
- test_cli_flag_mode_rejects_non_positive_trials pins the
--trials 0 fail-fast (comment 6).
* test_output_layout.py:
- test_safe_slug_rejects_dot_components for the safe_slug fix.
- test_resolved_recipe_is_loadable_by_load_recipe pins the strict
rerunnable contract (comment 2).
- test_resolved_recipe_inline_envs_emit_docker_shorthand pins the
``{docker: <ref>}`` re-emission for inline cells.
- test_matrix_json_records_resolved_environment_per_cell pins that
the moved-out debug expansion lands in matrix.json.
- test_isolated_env_writes_placeholder_not_runner_snapshot pins the
comment-3 fix (placeholder, not misleading collect_env call).
- test_host_env_collected_exactly_once updated: 1 host + 1 local-env
probe = 2, not 3, because the inline-docker env now skips its
runner-process probe.
Verification
============
* pytest tests/triage/: 124 passing (was 102; +22 new).
* pytest tests/ (excl. tests/hw_queue_eval which needs torch): 403
passing.
* ruff check + ruff format: clean across all touched src + tests.
* mypy: 2 pre-existing yaml-import-untyped errors only; no new errors.
* pre-commit run --all-files: clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| threshold = raw.get("threshold", 1.15) | ||
| if not isinstance(threshold, (int, float)) or isinstance(threshold, bool): | ||
| raise RecipeSchemaError( | ||
| f"{path_hint}.confound.threshold: must be a number, got {type(threshold).__name__}" | ||
| ) | ||
| baseline = raw.get("baseline_cell") | ||
| if baseline is not None and not isinstance(baseline, str): | ||
| raise RecipeSchemaError( | ||
| f"{path_hint}.confound.baseline_cell: must be a string, got {type(baseline).__name__}" | ||
| ) | ||
| return ConfoundCfg(threshold=float(threshold), baseline_cell=baseline) |
| Lookup failures fall back to "treat as local" so probe behaviour is | ||
| unchanged for envs we genuinely don't know anything about; the | ||
| pre-flight :func:`_validate_names_resolve` would already have failed if | ||
| the name were truly unknown. | ||
| """ | ||
| if any(env_name == e.name for e in inline_envs): | ||
| return True | ||
| try: | ||
| descriptor = get_environment(env_name) | ||
| except RegistryError: | ||
| return False |
| """Write a schema-valid ``recipe.resolved.yaml`` that can be re-loaded as-is. | ||
|
|
||
| The output is a strict :func:`aorta.triage.recipe.load_recipe` input -- no | ||
| debug-only fields, no per-cell ``resolved_mitigation_env`` block, no | ||
| top-level ``inline_environments`` key. The "resolved" property comes from | ||
| inline-docker cells being re-emitted in the ``{docker: <ref>}`` shorthand | ||
| form (so re-loading on another machine reproduces the same | ||
| ``_inline_<hash>`` env without needing a sidecar JSON next to the file). | ||
|
|
||
| Per-cell debug expansions (the ``Environment`` descriptor and the unioned | ||
| mitigation env-var bundle) live in ``matrix.json`` instead, where they | ||
| belong as run-time state. ``sidecar_files`` is intentionally unused at | ||
| write time -- B3 already resolved everything by the time we got here -- | ||
| but the parameter is preserved so the runner can keep its single | ||
| "everything-needed-for-replay" call site. | ||
| """ | ||
| del sidecar_files # see docstring; kept for caller-stable signature |
| @property | ||
| def nan_rate(self) -> float: | ||
| """Fraction of trials that failed. 0.0 for an empty cell.""" | ||
| if self.trials == 0: | ||
| return 0.0 | ||
| return self.failed_count / self.trials |
| passed_count: int | ||
| failed_count: int | ||
| mean_step_time_ms: float | ||
| std_step_time_ms: float | ||
| p50_step_time_ms: float | ||
| p99_step_time_ms: float | ||
| mean_wall_clock_sec: float | ||
| step_times_ms: list[float] = field(default_factory=list) | ||
| trial_paths: list[str] = field(default_factory=list) |
| found: list[Path] = [] | ||
| for candidate in sorted(results_dir.rglob("trial_*.json")): | ||
| found.append(candidate) |
| if cell.environment not in seen_envs: | ||
| env_json_path = env_dir / safe_slug(cell.environment) / "env.json" | ||
| if _is_isolated_environment(cell.environment, recipe.inline_environments): | ||
| _write_isolated_env_placeholder( | ||
| env_json_path, | ||
| cell.environment, | ||
| recipe.inline_environments, | ||
| warnings, | ||
| ) | ||
| else: | ||
| _capture_env( | ||
| env_json_path, | ||
| scope=f"environment:{cell.environment}", | ||
| warnings=warnings, | ||
| ) | ||
| seen_envs.add(cell.environment) |
| def list_mitigations(files: tuple[Path, ...]) -> None: | ||
| """List every registered mitigation with its source_package and env-var bundle.""" | ||
| registry = load_mitigations(extra_files=list(files) or None) | ||
| name_w = max(len("NAME"), *(len(n) for n in registry)) | ||
| src_w = max(len("SOURCE"), *(len(m.source_package) for m in registry.values())) | ||
| click.echo(f"{'NAME'.ljust(name_w)} {'SOURCE'.ljust(src_w)} ENV") | ||
| for name in sorted(registry): | ||
| m = registry[name] | ||
| env_str = " ".join(f"{k}={v}" for k, v in sorted(m.env.items())) or "(none)" | ||
| click.echo(f"{name.ljust(name_w)} {m.source_package.ljust(src_w)} {env_str}") |
| def list_environments(files: tuple[Path, ...]) -> None: | ||
| """List every registered environment with its source_package and docker/venv.""" | ||
| registry = load_environments(extra_files=list(files) or None) | ||
| name_w = max(len("NAME"), *(len(n) for n in registry)) | ||
| src_w = max(len("SOURCE"), *(len(e.source_package) for e in registry.values())) | ||
| docker_w = max(len("DOCKER"), *(len(e.docker or "-") for e in registry.values())) | ||
| click.echo(f"{'NAME'.ljust(name_w)} {'SOURCE'.ljust(src_w)} {'DOCKER'.ljust(docker_w)} VENV") | ||
| for name in sorted(registry): | ||
| e = registry[name] | ||
| click.echo( | ||
| f"{name.ljust(name_w)} {e.source_package.ljust(src_w)} " | ||
| f"{(e.docker or '-').ljust(docker_w)} {e.venv or '-'}" |
… audit
Copilot's second pass surfaced 9 distinct comments. Categorising them
showed five overlapping classes, all of which got the same audit-and-fix
treatment as round 1: find every instance of each pattern in the diff,
fix all of them, add tests pinning each behaviour.
Class A -- validation parity (recipe vs flag mode disagree)
===========================================================
(Comment 3188207431.) Recipe-mode loader accepted ``confound.threshold``
of ``0`` or negative, even though ``build_recipe_from_flags`` already
rejected those values. A non-positive threshold makes ``classify`` flag
every non-baseline cell as a speed confound (any positive ratio
satisfies ``ratio > threshold``), so the recipe was always misuse.
* triage/recipe.py: ``_parse_confound`` now applies the same ``> 0``
check, with an error message that names the offending value.
Class B -- ``--mitigations-file`` (sidecar) lifecycle is incomplete
==================================================================
(Comments 3188207484, 3188207509, 3188207675, 3188207707.) Four
independent symptoms of one root cause: sidecar files weren't threaded
through the parts of the system that need to know about them.
* triage/runner.py: ``_is_isolated_environment`` and
``_write_isolated_env_placeholder`` now accept ``sidecar_files`` and
forward it to ``get_environment``. Without this, a sidecar-defined
docker/venv env failed the registry lookup, fell back to "treat as
local", and the runner wrote a misleading host ``collect_env()``
snapshot under the env's name. Both call sites in ``run_recipe`` pass
the runner's sidecar bundle in.
* triage/runner.py: new ``_copy_operator_sidecars`` snapshots each
``--mitigations-file`` JSON into ``<run_dir>/sidecars/<basename>``
before B1 sees them. The runner then uses the snapshotted paths as
the resolver source -- so what gets executed and what gets archived
for replay are byte-identical. Without this, ``recipe.resolved.yaml``
was unrunnable on its own for sidecar-backed runs (the YAML
references mitigation/env names, but the names were only resolvable
via the operator's local sidecar). The runner also prints a one-line
``to rerun:`` hint on stdout when sidecars are involved so the
operator does not have to reconstruct the flags by hand.
* triage/output.py: ``write_resolved_recipe`` docstring updated to
document the snapshot-and-replay convention; the YAML schema stays
strict (no metadata namespace) -- the README documents the rerun
command shape.
* cli/triage.py: ``list-mitigations`` and ``list-environments`` now
wrap ``RegistryError`` in ``ClickException`` so a malformed
``--mitigations-file`` produces the same one-line clean error that
``triage run`` and ``aorta run`` already produced. Previously a bad
sidecar dumped a Python traceback.
Class C -- recipe-supplied strings -> filesystem path components
================================================================
(Comment 3188207640.) Round 1 added ``safe_slug`` for ``cell.name`` and
gave cell names a regex; environment names slipped through. Distinct
registered names like ``a/b`` and ``a:b`` both slug to ``a_b`` and the
second probe would silently overwrite the first while ``matrix.json``
kept them as distinct envs.
* triage/runner.py: new ``_check_env_slug_collisions`` runs at the top
of ``run_recipe`` and raises ``RegistryError`` naming both colliding
env names + the shared slug, before any directory is created. This
matches the existing cell-name regex enforcement in spirit (different
mechanism because env names come from the registry / sidecar, not
the recipe text directly).
Class D -- aggregation model accuracy + completeness
====================================================
(Comments 3188207553, 3188207584.) Two issues with ``CellStats``:
* ``nan_rate`` was misnamed -- the property counts every trial whose
``exit_status != "ok"`` or ``WorkloadResult.passed`` is False, which
is generic failure (NaN, corruption, divergence, infrastructure
crash), not NaN-specific. Renamed to ``failure_rate`` everywhere it
is read or rendered: matrix.py property, confound.py classifier,
output.py formatter + matrix.json key + matrix.md column header
("NaN rate" -> "Failure rate"). Consumer-facing rename is documented
in the matrix.md "Notes" section so readers know to look at
``exit_status_counts`` for failure-mode detail.
* ``CellStats`` was missing fields the PR description promised:
``min_step_time_ms``, ``max_step_time_ms``, ``p90_step_time_ms``, and
an ``exit_status_counts`` histogram keyed by ``TrialResult.exit_status``.
Without these, ``matrix.json`` consumers could not distinguish
``workload_failed`` from ``infrastructure_failed`` when triaging a
cell. ``aggregate_cell`` populates all four; the error path zeroes
them. README ``matrix.json per-cell shape`` section enumerates the
full set so future field additions have a documented home.
Class E -- sort-order bug
=========================
(Comment 3188207612.) ``sorted(rglob("trial_*.json"))`` is
lexicographic, so ``trial_10.json`` came before ``trial_2.json`` once
a cell had 10+ trials, and ``matrix.json`` ``trial_paths`` no longer
matched execution order.
* triage/runner.py: ``_collect_trial_paths`` extracts the integer
index from the filename and sorts on that. Files that don't parse as
``trial_<int>.json`` sort last (alphabetically), preserving any
future sibling artifacts without breaking the contract for the
common case.
Tests
=====
* tests/triage/test_matrix.py: pinned the new
``min``/``max``/``p90`` fields, the ``exit_status_counts``
histogram, and the ``failure_rate``-not-``nan_rate`` rename
(``hasattr`` regression check).
* tests/triage/test_recipe.py: parameterised
``test_load_recipe_rejects_non_positive_confound_threshold``
matching the existing flag-mode test.
* tests/triage/test_output_layout.py: pinned the
natural-sort fix, the env-name collision detection, the renamed
matrix.md column, and three sidecar-lifecycle behaviours
(snapshot-and-replay, isolated-env classification with sidecars,
matrix.json fields).
* tests/triage/test_runner_b1_api.py: updated the existing
``test_extra_sidecar_files_threaded_to_run_trials`` for the new
snapshot-and-use-the-snapshot contract; added two new tests pinning
the ``ClickException`` wrapping for ``list-mitigations`` /
``list-environments``.
* tests/triage/test_confound.py: ``_stats`` factory updated for the
new ``CellStats`` field set; renamed test names that mentioned
``nan_rate`` so future grep is honest.
Comments NOT pushed back on
===========================
None. All 9 comments are real correctness or contract gaps; the
biggest scope-add is Class D (new ``CellStats`` fields + rename), but
that matches what the PR description promised, so it belongs.
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if dry_run: | ||
| _print_dry_run(recipe) | ||
| return Path(".") |
| now = now or _dt.datetime.now(_dt.timezone.utc) | ||
| return now.strftime("%Y-%m-%dT%H-%M-%S") |
| "- `recipe.resolved.yaml` (alongside this file) captures the registry state " | ||
| "at run time -- re-run it to reproduce." |
… audit Copilot's third pass flagged 4 distinct comments. Each one is a different class (no overlap this round); fixed all of them in the same audit-and-pin pattern as rounds 1 and 2. Class A -- inconsistent error-shape contract ============================================ (Comment 3188441868.) ``triage run`` wrapped ``load_recipe`` / ``build_recipe_from_flags`` errors in ``ClickException`` but left ``run_recipe()`` exceptions unhandled. Two flavours of the same kind of error therefore exited the CLI with different shapes depending on which validator caught them. This was a regression introduced in round 2: I added ``_check_env_slug_collisions`` raising ``RegistryError`` from the runner without extending the CLI wrapper to match. * cli/triage.py: ``triage run`` now wraps the ``run_recipe`` call in a ``try`` that catches ``(RegistryError, RecipeCellError, RecipeSchemaError)`` and re-raises as ``click.ClickException``. Both pre-flight failures (env-slug collisions, baseline resolution) and any late-stage recipe-level rejection now produce the same one-line ``Error: ...`` shape that ``aorta run`` and the list subcommands already produce. Class B -- ``--dry-run`` skips runtime preflight ================================================ (Comment 3188441945.) Dry-run printed a clean summary for recipes that a real run immediately rejected -- e.g. multi-cell recipes with no auto-resolvable baseline, or environment names whose slugs collide. Same "documented contract not honoured" pattern as round 1. * triage/runner.py: new ``_preflight_validate(recipe)`` collects every pure (no FS / network / subprocess) validation check and runs BEFORE the dry-run early-return. Currently covers env-slug collisions and baseline resolution; the docstring documents the purity requirement for callables added later. The inline collision check inside ``run_recipe`` is removed (it's now redundant) so there is exactly one call site. * triage/runner.py: dry-run's "no FS side effects" guarantee is strengthened: preflight runs before ``resolve_run_dir`` so a failing dry-run leaves no breadcrumbs on disk. Class C -- matrix.md note overpromised reproducibility ====================================================== (Comment 3188442020.) The Notes footer claimed ``recipe.resolved.yaml`` "captures the registry state at run time", but the writer only preserves names + inline-docker shorthands. Named mitigations / environments are NOT expanded, so a registry change between runs silently changes behaviour. The README NOTE block already warned about this; matrix.md disagreed with itself. * triage/output.py: rewrote the matrix.md Notes line to match the README's cautious wording -- explicitly flags the inline-vs-named asymmetry and points operators at ``matrix.json::cells[*].resolved_env_vars`` for drift detection. Class D -- same-second runs overwrite each other ================================================ (Comment 3188441990.) ``resolve_run_dir`` used ``mkdir(exist_ok=True)`` plus a second-precision timestamp, so two runs in the same wall-clock second for the same ``(ticket, workload)`` silently reused the same directory and the second run clobbered the first run's artifacts. The docstring guaranteed "never overwrites". Easy to hit in CI loops. * triage/output.py: ``resolve_run_dir`` now creates the ``<output>/<ticket>/<workload>/`` parent with ``exist_ok=True`` (it's a shared parent across runs), then atomically claims the leaf with ``mkdir(exist_ok=False)``. On collision it appends ``-2``, ``-3``, ... until the call succeeds. The race between two parallel processes is resolved by ``mkdir`` itself: only one wins for a given suffix, the loser bumps and retries. Bounded at 10000 iterations so a stuck clock or runaway loop surfaces as a clean RuntimeError instead of hanging. * recipes/README.md: output-layout block updated to document the ``[-N]`` suffix. Tests ===== * tests/triage/test_output_layout.py: 6 new tests -- ``test_resolve_run_dir_appends_suffix_on_same_timestamp`` and ``test_resolve_run_dir_handles_many_collisions`` (Class D); ``test_dry_run_rejects_unresolvable_baseline``, ``test_dry_run_rejects_env_slug_collision`` and ``test_dry_run_does_not_create_run_dir_on_validation_failure`` (Class B); ``test_matrix_md_does_not_overpromise_reproducibility`` (Class C). * tests/triage/test_runner_b1_api.py: new ``test_cli_run_wraps_run_recipe_errors_in_click_exception`` exercising the Class A wrapper via the unresolvable-baseline path (the same kind of late-stage error that previously escaped raw). Comments NOT pushed back on =========================== None. All 4 are real correctness / contract gaps; D is particularly bad in CI loops, A was a regression I introduced last round. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cell_name = f"{m}-{display}" | ||
| _validate_cell_name( | ||
| f"--mitigation-axis x --environment-axis ({m!r}, {display!r})", cell_name | ||
| ) |
| if baseline.error is not None or baseline.mean_step_time_ms <= 0: | ||
| # Baseline unusable: we can't compute a ratio, so surface None and | ||
| # let the renderer emit "n/a". The runner also writes a top-of-file | ||
| # warning in matrix.md so readers see why. | ||
| return CONFOUND_NEUTRAL, None |
| env-var bundles, trial JSON paths, cell-level errors, etc.). | ||
| * ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry | ||
| name expanded to its underlying env-var bundle + docker ref. This is the | ||
| reproducibility artifact: re-running it on a different machine produces | ||
| the same matrix even if the registries drift in the interim. |
| > "Reproducing the same matrix" is up to the registries available at | ||
| > rerun time. If a sidecar mitigation's env-var bundle drifts between | ||
| > runs the rerun will use the new bundle (snapshotting the *file* doesn't | ||
| > pin its *contents* once the operator edits it later). Inline-docker | ||
| > cells are immune (the docker ref is in the recipe text itself), but | ||
| > registry drift on named entries is currently not pinned. Compare each | ||
| > run's `matrix.json::cells[*].resolved_env_vars` to detect drift. |
| "Failure rate", | ||
| "Trials", | ||
| "Mean step (ms)", | ||
| "Confound", | ||
| ) | ||
| rows: list[tuple[str, ...]] = [header] | ||
| for cell in cell_stats: | ||
| tag, _ = confound_tags.get(cell.name, (cell.error and "error" or "-", None)) | ||
| rows.append( | ||
| ( | ||
| cell.name, | ||
| _format_mitigations(cell.mitigations), | ||
| cell.environment, | ||
| _format_failure_rate(cell), | ||
| _format_fail_count(cell), |
| ticket_slug = safe_slug(recipe.ticket) if recipe.ticket else NO_TICKET_SLUG | ||
| workload_slug = safe_slug(recipe.workload) | ||
| ts = timestamp or format_timestamp() | ||
| parent = Path(output_dir) / ticket_slug / workload_slug |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| except Exception as exc: | ||
| raise click.ClickException(str(exc)) from exc | ||
| # different exit shape, depending on which validator catches it. | ||
| try: | ||
| run_dir = run_recipe( | ||
| r, | ||
| output_dir=output_dir, | ||
| dry_run=dry_run, | ||
| extra_sidecar_files=mitigation_files, | ||
| ) | ||
| except (RegistryError, RecipeCellError, RecipeSchemaError) as exc: | ||
| raise click.ClickException(str(exc)) from exc |
| * ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry | ||
| name expanded to its underlying env-var bundle + docker ref. This is the | ||
| reproducibility artifact: re-running it on a different machine produces | ||
| the same matrix even if the registries drift in the interim. |
| schema_version: 1 | ||
|
|
||
| ticket: EXAMPLE-151 | ||
| workload: fsdp |
b4ec361 to
c6debd8
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def run_recipe( | ||
| recipe: Recipe, | ||
| output_dir: Path, | ||
| dry_run: bool = False, | ||
| extra_sidecar_files: tuple[Path, ...] = (), | ||
| timestamp: str | None = None, | ||
| ) -> Path: | ||
| """Execute a recipe and write matrix.md / matrix.json / recipe.resolved.yaml. | ||
|
|
||
| Args: | ||
| recipe: Pre-validated recipe (from :func:`aorta.triage.recipe.load_recipe` | ||
| or :func:`aorta.triage.recipe.build_recipe_from_flags`). | ||
| output_dir: Top-level output directory (the CLI's ``--output-dir``). | ||
| dry_run: When True, validates and prints the resolved cell list to | ||
| stdout without touching the filesystem and returns a sentinel | ||
| ``Path(".")``. | ||
| extra_sidecar_files: Operator-supplied sidecar JSONs (from | ||
| ``--mitigations-file``). These are threaded through to B1's | ||
| registry resolver alongside the runner-generated inline sidecar. |
| header = ( | ||
| "Cell", | ||
| "Mitigations", | ||
| "Environment", | ||
| "Failure rate", | ||
| "Trials", | ||
| "Mean step (ms)", | ||
| "Confound", | ||
| ) | ||
| rows: list[tuple[str, ...]] = [header] | ||
| for cell in cell_stats: | ||
| tag, _ = confound_tags.get(cell.name, (cell.error and "error" or "-", None)) | ||
| rows.append( | ||
| ( | ||
| cell.name, | ||
| _format_mitigations(cell.mitigations), | ||
| cell.environment, | ||
| _format_failure_rate(cell), | ||
| _format_fail_count(cell), |
| if baseline.error is not None or baseline.mean_step_time_ms <= 0: | ||
| # Baseline unusable: we can't compute a ratio, so surface None and | ||
| # let the renderer emit "n/a". The runner also writes a top-of-file | ||
| # warning in matrix.md so readers see why. | ||
| return CONFOUND_NEUTRAL, None |
| * ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry | ||
| name expanded to its underlying env-var bundle + docker ref. This is the | ||
| reproducibility artifact: re-running it on a different machine produces | ||
| the same matrix even if the registries drift in the interim. |
Implements task B2 (issue #151): the
aorta triage runcommand that sweeps a (mitigation x environment x trials) matrix in-process via B1'srun_trials, aggregates per-cell stats, classifies each cell against a baseline with speed-confound detection, and emits matrix.md / matrix.json / recipe.resolved.yaml under a ticket-grouped output directory.Scope (matrix mode only per design D11; optimize mode deferred to P1):
triage/recipe.py-- schema v1, YAML/JSON loader, and thebuild_recipe_from_flagsshim.Recipe/Cell/InlineEnv/ConfoundCfgare frozen dataclasses; loader rejects unknown schema_version, unknown keys, duplicate cell names, and resolves every(mitigation, environment)reference through B3's registry (built-ins + entry-point plugins + B3.1 sidecars). Inline-docker envs auto-name as_inline_<8char-blake2b>so two cells with the same image ref collapse to one environment probe.triage/runner.py--run_recipe()is the single execution path both modes converge on. Resolves<output>/<ticket>/<workload>/<ts>/, writes inline-docker sidecar JSON for B1's resolver to consume, captures host + per-environmentcollect_envsnapshots, dispatches every cell via in-processrun_trials. Per-cell exceptions are caught and surfaced as anerrorrow so a single bad cell doesn't kill the matrix. Strictly nosubprocess-- a grep test undertests/triage/enforces it.triage/matrix.py--aggregate_cell()collapses per-trial results into median/p90/min/max step time, NaN counts, exit-status histogram, and the per-cell label set.triage/confound.py--resolve_baseline()(explicit override -> firstbaseline-*cell -> firstmitigations: [none]cell) andclassify_all()(cell_step_time / baseline_step_time > threshold ->speed (+N%)flag). Threshold default 1.15 per D11.triage/output.py-- writes the three artifacts. matrix.md is human-first (table with per-row label badges + run summary); matrix.json is machine-first (full per-cell stats and the resolved recipe round-trips into recipe.resolved.yaml so a matrix can be re-run byte-identically).cli/triage.py-- two equivalent entry points:--recipe <file>(primary) and--mode matrix --workload ... --mitigation-axis ... --environment-axis ...(flag shim). Mixing the two raises a UsageError up-front instead of silently letting one win. Addslist-mitigations/list-environmentsdiscovery subcommands (delegates to B3's resolver, prints source_package so users can see whether a mitigation came fromaorta, a plugin, or a sidecar).Recipes:
recipes/README.md-- full schema reference + worked examples.recipes/example-fsdp-smoke.yaml-- minimal end-to-end example.Tests (1364 lines, 5 files):
test_recipe.py-- schema validation: unknown version, unknown top-level keys, duplicate cell names, missing required fields, inline-env hash determinism, sidecar resolution.test_runner_b1_api.py-- per-cell error isolation, in-process guarantee (grepstriage/runner.pyforsubprocess), inline-env sidecar round-trip, per-environment snapshot caching.test_matrix.py-- aggregation against synthetic TrialResult fixtures incl. the all-NaN and mixed-exit-status edges.test_confound.py-- baseline resolution priority and the threshold classifier (boundary at exactly 1.15).test_output_layout.py-- artifact paths, matrix.md table shape, matrix.json schema, resolved-recipe round-trip.Motivation
Technical Details
Test Plan
Test Result
Submission Checklist