Skip to content

B2: triage run -- mitigation x environment matrix runner#160

Open
amd-vivekag wants to merge 5 commits intousers/vivekag/b1-aorta-runfrom
users/vivekag/b2-triage-run
Open

B2: triage run -- mitigation x environment matrix runner#160
amd-vivekag wants to merge 5 commits intousers/vivekag/b1-aorta-runfrom
users/vivekag/b2-triage-run

Conversation

@amd-vivekag
Copy link
Copy Markdown
Collaborator

Implements task B2 (issue #151): the aorta triage run command that sweeps a (mitigation x environment x trials) matrix in-process via B1's run_trials, aggregates per-cell stats, classifies each cell against a baseline with speed-confound detection, and emits matrix.md / matrix.json / recipe.resolved.yaml under a ticket-grouped output directory.

Scope (matrix mode only per design D11; optimize mode deferred to P1):

  • triage/recipe.py -- schema v1, YAML/JSON loader, and the build_recipe_from_flags shim. Recipe/Cell/InlineEnv/ ConfoundCfg are frozen dataclasses; loader rejects unknown schema_version, unknown keys, duplicate cell names, and resolves every (mitigation, environment) reference through B3's registry (built-ins + entry-point plugins + B3.1 sidecars). Inline-docker envs auto-name as _inline_<8char-blake2b> so two cells with the same image ref collapse to one environment probe.

  • triage/runner.py -- run_recipe() is the single execution path both modes converge on. Resolves <output>/<ticket>/<workload>/<ts>/, writes inline-docker sidecar JSON for B1's resolver to consume, captures host + per-environment collect_env snapshots, dispatches every cell via in-process run_trials. Per-cell exceptions are caught and surfaced as an error row so a single bad cell doesn't kill the matrix. Strictly no subprocess -- a grep test under tests/triage/ enforces it.

  • triage/matrix.py -- aggregate_cell() collapses per-trial results into median/p90/min/max step time, NaN counts, exit-status histogram, and the per-cell label set.

  • triage/confound.py -- resolve_baseline() (explicit override -> first baseline-* cell -> first mitigations: [none] cell) and classify_all() (cell_step_time / baseline_step_time > threshold -> speed (+N%) flag). Threshold default 1.15 per D11.

  • triage/output.py -- writes the three artifacts. matrix.md is human-first (table with per-row label badges + run summary); matrix.json is machine-first (full per-cell stats and the resolved recipe round-trips into recipe.resolved.yaml so a matrix can be re-run byte-identically).

  • cli/triage.py -- two equivalent entry points: --recipe <file> (primary) and --mode matrix --workload ... --mitigation-axis ... --environment-axis ... (flag shim). Mixing the two raises a UsageError up-front instead of silently letting one win. Adds list-mitigations / list-environments discovery subcommands (delegates to B3's resolver, prints source_package so users can see whether a mitigation came from aorta, a plugin, or a sidecar).

Recipes:

  • recipes/README.md -- full schema reference + worked examples.
  • recipes/example-fsdp-smoke.yaml -- minimal end-to-end example.

Tests (1364 lines, 5 files):

  • test_recipe.py -- schema validation: unknown version, unknown top-level keys, duplicate cell names, missing required fields, inline-env hash determinism, sidecar resolution.
  • test_runner_b1_api.py -- per-cell error isolation, in-process guarantee (greps triage/runner.py for subprocess), inline-env sidecar round-trip, per-environment snapshot caching.
  • test_matrix.py -- aggregation against synthetic TrialResult fixtures incl. the all-NaN and mixed-exit-status edges.
  • test_confound.py -- baseline resolution priority and the threshold classifier (boundary at exactly 1.15).
  • test_output_layout.py -- artifact paths, matrix.md table shape, matrix.json schema, resolved-recipe round-trip.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Implements task B2 (issue #151): the `aorta triage run` command that
sweeps a (mitigation x environment x trials) matrix in-process via
B1's `run_trials`, aggregates per-cell stats, classifies each cell
against a baseline with speed-confound detection, and emits
matrix.md / matrix.json / recipe.resolved.yaml under a ticket-grouped
output directory.

Scope (matrix mode only per design D11; optimize mode deferred to P1):

* `triage/recipe.py` -- schema v1, YAML/JSON loader, and the
  `build_recipe_from_flags` shim. `Recipe`/`Cell`/`InlineEnv`/
  `ConfoundCfg` are frozen dataclasses; loader rejects unknown
  schema_version, unknown keys, duplicate cell names, and resolves
  every `(mitigation, environment)` reference through B3's registry
  (built-ins + entry-point plugins + B3.1 sidecars). Inline-docker
  envs auto-name as `_inline_<8char-blake2b>` so two cells with the
  same image ref collapse to one environment probe.

* `triage/runner.py` -- `run_recipe()` is the single execution path
  both modes converge on. Resolves `<output>/<ticket>/<workload>/<ts>/`,
  writes inline-docker sidecar JSON for B1's resolver to consume,
  captures host + per-environment `collect_env` snapshots, dispatches
  every cell via in-process `run_trials`. Per-cell exceptions are
  caught and surfaced as an `error` row so a single bad cell doesn't
  kill the matrix. Strictly no `subprocess` -- a grep test under
  `tests/triage/` enforces it.

* `triage/matrix.py` -- `aggregate_cell()` collapses per-trial
  results into median/p90/min/max step time, NaN counts, exit-status
  histogram, and the per-cell label set.

* `triage/confound.py` -- `resolve_baseline()` (explicit override ->
  first `baseline-*` cell -> first `mitigations: [none]` cell) and
  `classify_all()` (cell_step_time / baseline_step_time > threshold
  -> `speed (+N%)` flag). Threshold default 1.15 per D11.

* `triage/output.py` -- writes the three artifacts. matrix.md is
  human-first (table with per-row label badges + run summary);
  matrix.json is machine-first (full per-cell stats and the resolved
  recipe round-trips into recipe.resolved.yaml so a matrix can be
  re-run byte-identically).

* `cli/triage.py` -- two equivalent entry points: `--recipe <file>`
  (primary) and `--mode matrix --workload ... --mitigation-axis ...
  --environment-axis ...` (flag shim). Mixing the two raises a
  UsageError up-front instead of silently letting one win. Adds
  `list-mitigations` / `list-environments` discovery subcommands
  (delegates to B3's resolver, prints source_package so users can
  see whether a mitigation came from `aorta`, a plugin, or a
  sidecar).

Recipes:

* `recipes/README.md` -- full schema reference + worked examples.
* `recipes/example-fsdp-smoke.yaml` -- minimal end-to-end example.

Tests (1364 lines, 5 files):

* `test_recipe.py` -- schema validation: unknown version, unknown
  top-level keys, duplicate cell names, missing required fields,
  inline-env hash determinism, sidecar resolution.
* `test_runner_b1_api.py` -- per-cell error isolation, in-process
  guarantee (greps `triage/runner.py` for `subprocess`), inline-env
  sidecar round-trip, per-environment snapshot caching.
* `test_matrix.py` -- aggregation against synthetic TrialResult
  fixtures incl. the all-NaN and mixed-exit-status edges.
* `test_confound.py` -- baseline resolution priority and the
  threshold classifier (boundary at exactly 1.15).
* `test_output_layout.py` -- artifact paths, matrix.md table shape,
  matrix.json schema, resolved-recipe round-trip.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 5, 2026 09:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the B2 triage matrix workflow to aorta: recipe/flag-driven orchestration around B1 run_trials, aggregation/classification, output artifact generation, and CLI entry points for running and inspecting mitigation/environment registries.

Changes:

  • Adds triage core modules for recipe parsing, per-cell execution, aggregation, confound classification, and artifact writing.
  • Replaces the placeholder aorta triage run CLI with recipe mode, flag-shim matrix mode, and registry discovery subcommands.
  • Adds recipe documentation/examples plus a substantial new tests/triage/ suite covering loader, runner, output, matrix, and confound behavior.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/aorta/cli/triage.py Implements the triage CLI, dual entry paths, and list subcommands.
src/aorta/triage/__init__.py Exposes the new triage public API.
src/aorta/triage/recipe.py Defines recipe schema/dataclasses, file loader, and flag-mode builder.
src/aorta/triage/runner.py Orchestrates matrix execution, env capture, cell dispatch, and artifact emission.
src/aorta/triage/matrix.py Aggregates per-trial results into per-cell statistics.
src/aorta/triage/confound.py Resolves the baseline and classifies cells for speed confounds.
src/aorta/triage/output.py Resolves run directories and writes matrix.md, matrix.json, and resolved recipe output.
recipes/README.md Documents the recipe schema and output layout.
recipes/example-fsdp-smoke.yaml Provides a minimal example triage recipe.
tests/triage/__init__.py Adds the new triage test package.
tests/triage/test_recipe.py Covers recipe loading/validation and flag builder behavior.
tests/triage/test_runner_b1_api.py Verifies runner plumbing, in-process execution, and CLI smoke paths.
tests/triage/test_matrix.py Tests cell aggregation behavior and timing fallbacks.
tests/triage/test_confound.py Tests baseline resolution and confound classification.
tests/triage/test_output_layout.py Verifies artifact layout/content and fail-soft output behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/aorta/cli/triage.py
environment_axis=environment_axis,
trials=trials,
baseline_cell=baseline_cell,
)
Comment thread src/aorta/triage/output.py Outdated
Comment on lines +308 to +335
resolved_cells.append(
{
"name": cell.name,
"mitigations": list(cell.mitigations),
"mitigation_contributions": mit_contributions,
"environment": cell.environment,
"resolved_environment": resolved_env,
"extra_env": dict(cell.extra_env),
"resolved_mitigation_env": mit_union,
"trials": cell.effective_trials(recipe.trials),
"steps": cell.effective_steps(recipe.steps),
}
)

doc = {
"schema_version": recipe.schema_version,
"ticket": recipe.ticket,
"workload": recipe.workload,
"trials": recipe.trials,
"steps": recipe.steps,
"confound": {
"threshold": recipe.confound.threshold,
"baseline_cell": recipe.confound.baseline_cell,
},
"inline_environments": [
{"name": e.name, "docker": e.docker} for e in recipe.inline_environments
],
"cells": resolved_cells,
Comment thread src/aorta/triage/runner.py Outdated
Comment on lines +255 to +260
if cell.environment not in seen_envs:
_capture_env(
env_dir / cell.environment / "env.json",
scope=f"environment:{cell.environment}",
warnings=warnings,
)
Comment on lines +468 to +489
_validate_names_resolve(cells_tuple, inline_envs, sidecar_files)

if confound.baseline_cell is not None:
names = {c.name for c in cells_tuple}
if confound.baseline_cell not in names:
raise RecipeCellError(
f"confound.baseline_cell {confound.baseline_cell!r} does not "
f"match any cell name; cells: {sorted(names)}"
)

return Recipe(
schema_version=SCHEMA_VERSION,
workload=workload,
trials=trials,
steps=steps,
cells=cells_tuple,
ticket=ticket,
confound=confound,
inline_environments=tuple(inline_envs.values()),
source_path=source_path,
source_sha256=source_sha256,
)
Comment thread src/aorta/triage/runner.py Outdated
Comment on lines +156 to +157
cell_dir = _cells_dir(run_dir) / cell.name
cell_dir.mkdir(parents=True, exist_ok=True)
Comment thread src/aorta/cli/triage.py
Comment on lines +183 to +193
r = build_recipe_from_flags(
workload=workload, # type: ignore[arg-type]
mitigation_axis=mitigation_axis, # type: ignore[arg-type]
environment_axis=environment_axis, # type: ignore[arg-type]
trials=trials, # type: ignore[arg-type]
steps=steps,
ticket=ticket,
baseline_cell=baseline_cell,
confound_threshold=confound_threshold,
sidecar_files=mitigation_files or None,
)
vmnit and others added 2 commits May 5, 2026 15:32
end-of-file-fixer flagged tests/triage/__init__.py: the file had a
single stray `\n` (1 byte). The hook's rule for empty packages is
"zero bytes, no trailing newline"; truncating to 0 bytes satisfies it
and Python still treats the directory as a package.

Co-authored-by: Cursor <cursoragent@cursor.com>
… audit

Copilot's review surfaced 6 distinct issues; categorising them showed three
overlapping classes worth closing for the next reviewer instead of
addressing one comment at a time.

Class A -- validation incompleteness / not fail-fast
====================================================

(Comments 1, 4, 6.) Inputs that should have been rejected at the recipe
boundary instead reached execution.

* cli/triage.py: ``_reject_flag_mode_args`` only checked --workload,
  --mitigation-axis, --environment-axis, --trials, --baseline-cell.
  --steps / --ticket / --confound-threshold passed through silently when
  combined with --recipe, masking misconfigurations. Helper now checks
  every flag that affects recipe content; the no-op runner-level flags
  (--mode, --output-dir, --dry-run, --mitigations-file) are explicitly
  documented as intentionally allowed.

* cli/triage.py: --confound-threshold default flips from 1.15 to None so
  the "user explicitly set this" check actually works. Flag-mode default
  is applied at build_recipe_from_flags() call time instead.

* triage/recipe.py: build_recipe_from_flags() now validates the same set
  of primitive constraints load_recipe enforces -- positive trials/steps,
  non-empty workload, ticket-when-provided non-empty, and
  confound_threshold > 0. Previously ``aorta triage run --trials 0`` got
  all the way to ``run_trials()`` before failing.

* triage/recipe.py: cross-mitigation env-var collision detection landed
  as ``_validate_no_mitigation_collisions`` and is wired into both
  _build_recipe (load path) and build_recipe_from_flags (flag path).
  The Cell docstring previously promised this check existed; it didn't,
  and ``dict.update`` was silently letting the later mitigation win.
  ``extra_env`` overrides stay silent because they are documented as the
  intentional-override knob.

Class B -- documented contract not honoured
===========================================

(Comments 2, 4.) Code claimed behaviours its implementation didn't
deliver.

* triage/output.py: ``recipe.resolved.yaml`` was advertised as
  rerunnable but contained per-cell debug fields
  (``mitigation_contributions``, ``resolved_environment``,
  ``resolved_mitigation_env``) and a top-level ``inline_environments``
  block, all of which load_recipe rejects as unknown schema keys.
  Rewrote write_resolved_recipe to emit a strict, schema-valid recipe.
  Inline-docker cells re-emit the ``{docker: <ref>}`` shorthand so the
  same ``_inline_<hash>`` is re-derived at reload without a sidecar.
  Per-cell expansion data (the resolved Environment descriptor) moved
  into matrix.json::cells[*].resolved_environment, where it logically
  belongs as run-state. The unioned env-var bundle was already in
  matrix.json::cells[*].resolved_env_vars (via CellStats), so nothing
  is lost.

* recipes/README.md: ``§Re-running a past matrix`` updated to reflect
  the strict-rerunnable property, with an explicit note that registry
  drift on named entries is currently not pinned (use matrix.json's
  resolved_env_vars to detect it).

* triage/recipe.py: Cell docstring rewritten -- the "later names win"
  language was the silent-update bug in disguise. Now spells out
  "rejected at recipe-construction time".

Class C -- recipe strings used as filesystem paths
==================================================

(Comment 5.) Recipe-supplied strings reached Path operations without
sanitisation; ``cell.name = "../foo"`` could escape the run directory.

* triage/recipe.py: cell-name regex and reserved-name list:
  ``^[A-Za-z0-9_][A-Za-z0-9_.\-]*$`` plus a reject list for ``.``,
  ``..``, ``matrix.md``, ``matrix.json``. Validated at parse time
  (recipe mode) and at flag-mode cell synthesis. The leading-char
  restriction also prevents ``--`` getting interpreted as a CLI flag
  if anyone copy-pastes a cell name into a shell.

* triage/output.py: safe_slug() additionally collapses ``.`` and ``..``
  to ``_`` -- the character-class regex alone allowed those through,
  so a ticket like ``..`` would have moved the run dir up one level.

* triage/runner.py: the cell.name and cell.environment path components
  now go through safe_slug() at the runner boundary (defence in depth
  -- the recipe loader already rejects unsafe names, but the slug at
  the boundary means even a hand-built Recipe with a bad name can't
  escape).

* recipes/README.md: ``cells[*].name`` doc updated with the regex.

Comment 3 (per-environment env probe is misleading)
===================================================

The runner was calling ``collect_env()`` in the runner process and
labelling the result as the docker / venv environment. Since B1 doesn't
actually swap to docker today (in-process), the artifact was the host
state under a docker label.

* triage/runner.py: new ``_is_isolated_environment`` predicate (true if
  the env's Environment descriptor sets ``docker`` or ``venv``, or if
  the env is an inline-docker auto-name). When true, the runner writes
  a non-misleading placeholder to ``environments/<name>/env.json``:
  ``snapshot_captured=False``, the descriptor itself, and an
  explanation of why (B1 in-process). The per-run host_env.json still
  captures the runner's view, so nothing about the host is lost.

* A warning is appended to matrix.md / matrix.json so the operator sees
  the skip explicitly.

Tests
=====

22 new tests across the three classes (124 passing in tests/triage,
was 102):

* test_recipe.py:
  - 9 cell-name path-safety params (../escape, a/b, .., ., matrix.md,
    leading dash, embedded whitespace, etc.) + 1 happy case asserting
    ``PROJ_1.fix-cell`` parses fine.
  - test_stacked_mitigations_with_disagreeing_keys_rejected /
    _agreeing_keys_accepted / _error_names_the_cell_and_keys for the
    new collision detector (mitigations monkey-patched).
  - test_extra_env_overrides_mitigation_silently pins that extra_env
    overrides intentionally do NOT trip the collision check.
  - 4 flag-mode boundary-validation tests: trials=0 / -1, steps=0 / -5,
    empty workload, threshold=0/-1/-0.5, ticket="".

* test_runner_b1_api.py:
  - test_cli_recipe_mode_rejects_every_flag_mode_knob parametrised
    over all 8 recipe-content flags (the comment-1 fix).
  - test_cli_recipe_mode_allows_runner_only_flags pins that
    --output-dir / --dry-run still work alongside --recipe.
  - test_cli_flag_mode_rejects_non_positive_trials pins the
    --trials 0 fail-fast (comment 6).

* test_output_layout.py:
  - test_safe_slug_rejects_dot_components for the safe_slug fix.
  - test_resolved_recipe_is_loadable_by_load_recipe pins the strict
    rerunnable contract (comment 2).
  - test_resolved_recipe_inline_envs_emit_docker_shorthand pins the
    ``{docker: <ref>}`` re-emission for inline cells.
  - test_matrix_json_records_resolved_environment_per_cell pins that
    the moved-out debug expansion lands in matrix.json.
  - test_isolated_env_writes_placeholder_not_runner_snapshot pins the
    comment-3 fix (placeholder, not misleading collect_env call).
  - test_host_env_collected_exactly_once updated: 1 host + 1 local-env
    probe = 2, not 3, because the inline-docker env now skips its
    runner-process probe.

Verification
============

* pytest tests/triage/: 124 passing (was 102; +22 new).
* pytest tests/ (excl. tests/hw_queue_eval which needs torch): 403
  passing.
* ruff check + ruff format: clean across all touched src + tests.
* mypy: 2 pre-existing yaml-import-untyped errors only; no new errors.
* pre-commit run --all-files: clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 5, 2026 11:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +201 to +211
threshold = raw.get("threshold", 1.15)
if not isinstance(threshold, (int, float)) or isinstance(threshold, bool):
raise RecipeSchemaError(
f"{path_hint}.confound.threshold: must be a number, got {type(threshold).__name__}"
)
baseline = raw.get("baseline_cell")
if baseline is not None and not isinstance(baseline, str):
raise RecipeSchemaError(
f"{path_hint}.confound.baseline_cell: must be a string, got {type(baseline).__name__}"
)
return ConfoundCfg(threshold=float(threshold), baseline_cell=baseline)
Comment thread src/aorta/triage/runner.py Outdated
Comment on lines +115 to +125
Lookup failures fall back to "treat as local" so probe behaviour is
unchanged for envs we genuinely don't know anything about; the
pre-flight :func:`_validate_names_resolve` would already have failed if
the name were truly unknown.
"""
if any(env_name == e.name for e in inline_envs):
return True
try:
descriptor = get_environment(env_name)
except RegistryError:
return False
Comment on lines +295 to +311
"""Write a schema-valid ``recipe.resolved.yaml`` that can be re-loaded as-is.

The output is a strict :func:`aorta.triage.recipe.load_recipe` input -- no
debug-only fields, no per-cell ``resolved_mitigation_env`` block, no
top-level ``inline_environments`` key. The "resolved" property comes from
inline-docker cells being re-emitted in the ``{docker: <ref>}`` shorthand
form (so re-loading on another machine reproduces the same
``_inline_<hash>`` env without needing a sidecar JSON next to the file).

Per-cell debug expansions (the ``Environment`` descriptor and the unioned
mitigation env-var bundle) live in ``matrix.json`` instead, where they
belong as run-time state. ``sidecar_files`` is intentionally unused at
write time -- B3 already resolved everything by the time we got here --
but the parameter is preserved so the runner can keep its single
"everything-needed-for-replay" call site.
"""
del sidecar_files # see docstring; kept for caller-stable signature
Comment on lines +60 to +65
@property
def nan_rate(self) -> float:
"""Fraction of trials that failed. 0.0 for an empty cell."""
if self.trials == 0:
return 0.0
return self.failed_count / self.trials
Comment on lines +49 to +57
passed_count: int
failed_count: int
mean_step_time_ms: float
std_step_time_ms: float
p50_step_time_ms: float
p99_step_time_ms: float
mean_wall_clock_sec: float
step_times_ms: list[float] = field(default_factory=list)
trial_paths: list[str] = field(default_factory=list)
Comment thread src/aorta/triage/runner.py Outdated
Comment on lines +213 to +215
found: list[Path] = []
for candidate in sorted(results_dir.rglob("trial_*.json")):
found.append(candidate)
Comment on lines +331 to +346
if cell.environment not in seen_envs:
env_json_path = env_dir / safe_slug(cell.environment) / "env.json"
if _is_isolated_environment(cell.environment, recipe.inline_environments):
_write_isolated_env_placeholder(
env_json_path,
cell.environment,
recipe.inline_environments,
warnings,
)
else:
_capture_env(
env_json_path,
scope=f"environment:{cell.environment}",
warnings=warnings,
)
seen_envs.add(cell.environment)
Comment thread src/aorta/cli/triage.py
Comment on lines +261 to +270
def list_mitigations(files: tuple[Path, ...]) -> None:
"""List every registered mitigation with its source_package and env-var bundle."""
registry = load_mitigations(extra_files=list(files) or None)
name_w = max(len("NAME"), *(len(n) for n in registry))
src_w = max(len("SOURCE"), *(len(m.source_package) for m in registry.values()))
click.echo(f"{'NAME'.ljust(name_w)} {'SOURCE'.ljust(src_w)} ENV")
for name in sorted(registry):
m = registry[name]
env_str = " ".join(f"{k}={v}" for k, v in sorted(m.env.items())) or "(none)"
click.echo(f"{name.ljust(name_w)} {m.source_package.ljust(src_w)} {env_str}")
Comment thread src/aorta/cli/triage.py
Comment on lines +281 to +292
def list_environments(files: tuple[Path, ...]) -> None:
"""List every registered environment with its source_package and docker/venv."""
registry = load_environments(extra_files=list(files) or None)
name_w = max(len("NAME"), *(len(n) for n in registry))
src_w = max(len("SOURCE"), *(len(e.source_package) for e in registry.values()))
docker_w = max(len("DOCKER"), *(len(e.docker or "-") for e in registry.values()))
click.echo(f"{'NAME'.ljust(name_w)} {'SOURCE'.ljust(src_w)} {'DOCKER'.ljust(docker_w)} VENV")
for name in sorted(registry):
e = registry[name]
click.echo(
f"{name.ljust(name_w)} {e.source_package.ljust(src_w)} "
f"{(e.docker or '-').ljust(docker_w)} {e.venv or '-'}"
… audit

Copilot's second pass surfaced 9 distinct comments. Categorising them
showed five overlapping classes, all of which got the same audit-and-fix
treatment as round 1: find every instance of each pattern in the diff,
fix all of them, add tests pinning each behaviour.

Class A -- validation parity (recipe vs flag mode disagree)
===========================================================

(Comment 3188207431.) Recipe-mode loader accepted ``confound.threshold``
of ``0`` or negative, even though ``build_recipe_from_flags`` already
rejected those values. A non-positive threshold makes ``classify`` flag
every non-baseline cell as a speed confound (any positive ratio
satisfies ``ratio > threshold``), so the recipe was always misuse.

* triage/recipe.py: ``_parse_confound`` now applies the same ``> 0``
  check, with an error message that names the offending value.

Class B -- ``--mitigations-file`` (sidecar) lifecycle is incomplete
==================================================================

(Comments 3188207484, 3188207509, 3188207675, 3188207707.) Four
independent symptoms of one root cause: sidecar files weren't threaded
through the parts of the system that need to know about them.

* triage/runner.py: ``_is_isolated_environment`` and
  ``_write_isolated_env_placeholder`` now accept ``sidecar_files`` and
  forward it to ``get_environment``. Without this, a sidecar-defined
  docker/venv env failed the registry lookup, fell back to "treat as
  local", and the runner wrote a misleading host ``collect_env()``
  snapshot under the env's name. Both call sites in ``run_recipe`` pass
  the runner's sidecar bundle in.

* triage/runner.py: new ``_copy_operator_sidecars`` snapshots each
  ``--mitigations-file`` JSON into ``<run_dir>/sidecars/<basename>``
  before B1 sees them. The runner then uses the snapshotted paths as
  the resolver source -- so what gets executed and what gets archived
  for replay are byte-identical. Without this, ``recipe.resolved.yaml``
  was unrunnable on its own for sidecar-backed runs (the YAML
  references mitigation/env names, but the names were only resolvable
  via the operator's local sidecar). The runner also prints a one-line
  ``to rerun:`` hint on stdout when sidecars are involved so the
  operator does not have to reconstruct the flags by hand.

* triage/output.py: ``write_resolved_recipe`` docstring updated to
  document the snapshot-and-replay convention; the YAML schema stays
  strict (no metadata namespace) -- the README documents the rerun
  command shape.

* cli/triage.py: ``list-mitigations`` and ``list-environments`` now
  wrap ``RegistryError`` in ``ClickException`` so a malformed
  ``--mitigations-file`` produces the same one-line clean error that
  ``triage run`` and ``aorta run`` already produced. Previously a bad
  sidecar dumped a Python traceback.

Class C -- recipe-supplied strings -> filesystem path components
================================================================

(Comment 3188207640.) Round 1 added ``safe_slug`` for ``cell.name`` and
gave cell names a regex; environment names slipped through. Distinct
registered names like ``a/b`` and ``a:b`` both slug to ``a_b`` and the
second probe would silently overwrite the first while ``matrix.json``
kept them as distinct envs.

* triage/runner.py: new ``_check_env_slug_collisions`` runs at the top
  of ``run_recipe`` and raises ``RegistryError`` naming both colliding
  env names + the shared slug, before any directory is created. This
  matches the existing cell-name regex enforcement in spirit (different
  mechanism because env names come from the registry / sidecar, not
  the recipe text directly).

Class D -- aggregation model accuracy + completeness
====================================================

(Comments 3188207553, 3188207584.) Two issues with ``CellStats``:

* ``nan_rate`` was misnamed -- the property counts every trial whose
  ``exit_status != "ok"`` or ``WorkloadResult.passed`` is False, which
  is generic failure (NaN, corruption, divergence, infrastructure
  crash), not NaN-specific. Renamed to ``failure_rate`` everywhere it
  is read or rendered: matrix.py property, confound.py classifier,
  output.py formatter + matrix.json key + matrix.md column header
  ("NaN rate" -> "Failure rate"). Consumer-facing rename is documented
  in the matrix.md "Notes" section so readers know to look at
  ``exit_status_counts`` for failure-mode detail.

* ``CellStats`` was missing fields the PR description promised:
  ``min_step_time_ms``, ``max_step_time_ms``, ``p90_step_time_ms``, and
  an ``exit_status_counts`` histogram keyed by ``TrialResult.exit_status``.
  Without these, ``matrix.json`` consumers could not distinguish
  ``workload_failed`` from ``infrastructure_failed`` when triaging a
  cell. ``aggregate_cell`` populates all four; the error path zeroes
  them. README ``matrix.json per-cell shape`` section enumerates the
  full set so future field additions have a documented home.

Class E -- sort-order bug
=========================

(Comment 3188207612.) ``sorted(rglob("trial_*.json"))`` is
lexicographic, so ``trial_10.json`` came before ``trial_2.json`` once
a cell had 10+ trials, and ``matrix.json`` ``trial_paths`` no longer
matched execution order.

* triage/runner.py: ``_collect_trial_paths`` extracts the integer
  index from the filename and sorts on that. Files that don't parse as
  ``trial_<int>.json`` sort last (alphabetically), preserving any
  future sibling artifacts without breaking the contract for the
  common case.

Tests
=====

* tests/triage/test_matrix.py: pinned the new
  ``min``/``max``/``p90`` fields, the ``exit_status_counts``
  histogram, and the ``failure_rate``-not-``nan_rate`` rename
  (``hasattr`` regression check).

* tests/triage/test_recipe.py: parameterised
  ``test_load_recipe_rejects_non_positive_confound_threshold``
  matching the existing flag-mode test.

* tests/triage/test_output_layout.py: pinned the
  natural-sort fix, the env-name collision detection, the renamed
  matrix.md column, and three sidecar-lifecycle behaviours
  (snapshot-and-replay, isolated-env classification with sidecars,
  matrix.json fields).

* tests/triage/test_runner_b1_api.py: updated the existing
  ``test_extra_sidecar_files_threaded_to_run_trials`` for the new
  snapshot-and-use-the-snapshot contract; added two new tests pinning
  the ``ClickException`` wrapping for ``list-mitigations`` /
  ``list-environments``.

* tests/triage/test_confound.py: ``_stats`` factory updated for the
  new ``CellStats`` field set; renamed test names that mentioned
  ``nan_rate`` so future grep is honest.

Comments NOT pushed back on
===========================

None. All 9 comments are real correctness or contract gaps; the
biggest scope-add is Class D (new ``CellStats`` fields + rename), but
that matches what the PR description promised, so it belongs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/aorta/cli/triage.py Outdated
Comment on lines +389 to +391
if dry_run:
_print_dry_run(recipe)
return Path(".")
Comment on lines +72 to +73
now = now or _dt.datetime.now(_dt.timezone.utc)
return now.strftime("%Y-%m-%dT%H-%M-%S")
Comment thread src/aorta/triage/output.py Outdated
Comment on lines +229 to +230
"- `recipe.resolved.yaml` (alongside this file) captures the registry state "
"at run time -- re-run it to reproduce."
… audit

Copilot's third pass flagged 4 distinct comments. Each one is a different
class (no overlap this round); fixed all of them in the same audit-and-pin
pattern as rounds 1 and 2.

Class A -- inconsistent error-shape contract
============================================

(Comment 3188441868.) ``triage run`` wrapped ``load_recipe`` /
``build_recipe_from_flags`` errors in ``ClickException`` but left
``run_recipe()`` exceptions unhandled. Two flavours of the same kind of
error therefore exited the CLI with different shapes depending on which
validator caught them. This was a regression introduced in round 2: I
added ``_check_env_slug_collisions`` raising ``RegistryError`` from the
runner without extending the CLI wrapper to match.

* cli/triage.py: ``triage run`` now wraps the ``run_recipe`` call in a
  ``try`` that catches ``(RegistryError, RecipeCellError,
  RecipeSchemaError)`` and re-raises as ``click.ClickException``. Both
  pre-flight failures (env-slug collisions, baseline resolution) and any
  late-stage recipe-level rejection now produce the same one-line
  ``Error: ...`` shape that ``aorta run`` and the list subcommands
  already produce.

Class B -- ``--dry-run`` skips runtime preflight
================================================

(Comment 3188441945.) Dry-run printed a clean summary for recipes that a
real run immediately rejected -- e.g. multi-cell recipes with no
auto-resolvable baseline, or environment names whose slugs collide. Same
"documented contract not honoured" pattern as round 1.

* triage/runner.py: new ``_preflight_validate(recipe)`` collects every
  pure (no FS / network / subprocess) validation check and runs BEFORE
  the dry-run early-return. Currently covers env-slug collisions and
  baseline resolution; the docstring documents the purity requirement
  for callables added later. The inline collision check inside
  ``run_recipe`` is removed (it's now redundant) so there is exactly one
  call site.

* triage/runner.py: dry-run's "no FS side effects" guarantee is
  strengthened: preflight runs before ``resolve_run_dir`` so a failing
  dry-run leaves no breadcrumbs on disk.

Class C -- matrix.md note overpromised reproducibility
======================================================

(Comment 3188442020.) The Notes footer claimed
``recipe.resolved.yaml`` "captures the registry state at run time",
but the writer only preserves names + inline-docker shorthands. Named
mitigations / environments are NOT expanded, so a registry change
between runs silently changes behaviour. The README NOTE block already
warned about this; matrix.md disagreed with itself.

* triage/output.py: rewrote the matrix.md Notes line to match the
  README's cautious wording -- explicitly flags the inline-vs-named
  asymmetry and points operators at
  ``matrix.json::cells[*].resolved_env_vars`` for drift detection.

Class D -- same-second runs overwrite each other
================================================

(Comment 3188441990.) ``resolve_run_dir`` used ``mkdir(exist_ok=True)``
plus a second-precision timestamp, so two runs in the same wall-clock
second for the same ``(ticket, workload)`` silently reused the same
directory and the second run clobbered the first run's artifacts. The
docstring guaranteed "never overwrites". Easy to hit in CI loops.

* triage/output.py: ``resolve_run_dir`` now creates the
  ``<output>/<ticket>/<workload>/`` parent with ``exist_ok=True`` (it's
  a shared parent across runs), then atomically claims the leaf with
  ``mkdir(exist_ok=False)``. On collision it appends ``-2``, ``-3``,
  ... until the call succeeds. The race between two parallel processes
  is resolved by ``mkdir`` itself: only one wins for a given suffix,
  the loser bumps and retries. Bounded at 10000 iterations so a stuck
  clock or runaway loop surfaces as a clean RuntimeError instead of
  hanging.

* recipes/README.md: output-layout block updated to document the
  ``[-N]`` suffix.

Tests
=====

* tests/triage/test_output_layout.py: 6 new tests --
  ``test_resolve_run_dir_appends_suffix_on_same_timestamp`` and
  ``test_resolve_run_dir_handles_many_collisions`` (Class D);
  ``test_dry_run_rejects_unresolvable_baseline``,
  ``test_dry_run_rejects_env_slug_collision`` and
  ``test_dry_run_does_not_create_run_dir_on_validation_failure``
  (Class B); ``test_matrix_md_does_not_overpromise_reproducibility``
  (Class C).

* tests/triage/test_runner_b1_api.py: new
  ``test_cli_run_wraps_run_recipe_errors_in_click_exception``
  exercising the Class A wrapper via the unresolvable-baseline path
  (the same kind of late-stage error that previously escaped raw).

Comments NOT pushed back on
===========================

None. All 4 are real correctness / contract gaps; D is particularly bad
in CI loops, A was a regression I introduced last round.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +657 to +660
cell_name = f"{m}-{display}"
_validate_cell_name(
f"--mitigation-axis x --environment-axis ({m!r}, {display!r})", cell_name
)
Comment on lines +120 to +124
if baseline.error is not None or baseline.mean_step_time_ms <= 0:
# Baseline unusable: we can't compute a ratio, so surface None and
# let the renderer emit "n/a". The runner also writes a top-of-file
# warning in matrix.md so readers see why.
return CONFOUND_NEUTRAL, None
Comment on lines +9 to +13
env-var bundles, trial JSON paths, cell-level errors, etc.).
* ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry
name expanded to its underlying env-var bundle + docker ref. This is the
reproducibility artifact: re-running it on a different machine produces
the same matrix even if the registries drift in the interim.
Comment thread recipes/README.md
Comment on lines +167 to +173
> "Reproducing the same matrix" is up to the registries available at
> rerun time. If a sidecar mitigation's env-var bundle drifts between
> runs the rerun will use the new bundle (snapshotting the *file* doesn't
> pin its *contents* once the operator edits it later). Inline-docker
> cells are immune (the docker ref is in the recipe text itself), but
> registry drift on named entries is currently not pinned. Compare each
> run's `matrix.json::cells[*].resolved_env_vars` to detect drift.
Comment on lines +191 to +205
"Failure rate",
"Trials",
"Mean step (ms)",
"Confound",
)
rows: list[tuple[str, ...]] = [header]
for cell in cell_stats:
tag, _ = confound_tags.get(cell.name, (cell.error and "error" or "-", None))
rows.append(
(
cell.name,
_format_mitigations(cell.mitigations),
cell.environment,
_format_failure_rate(cell),
_format_fail_count(cell),
Comment on lines +95 to +98
ticket_slug = safe_slug(recipe.ticket) if recipe.ticket else NO_TICKET_SLUG
workload_slug = safe_slug(recipe.workload)
ts = timestamp or format_timestamp()
parent = Path(output_dir) / ticket_slug / workload_slug
Copilot AI review requested due to automatic review settings May 5, 2026 17:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/aorta/cli/triage.py Outdated
Comment on lines +212 to +223
except Exception as exc:
raise click.ClickException(str(exc)) from exc
# different exit shape, depending on which validator catches it.
try:
run_dir = run_recipe(
r,
output_dir=output_dir,
dry_run=dry_run,
extra_sidecar_files=mitigation_files,
)
except (RegistryError, RecipeCellError, RecipeSchemaError) as exc:
raise click.ClickException(str(exc)) from exc
Comment on lines +10 to +13
* ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry
name expanded to its underlying env-var bundle + docker ref. This is the
reproducibility artifact: re-running it on a different machine produces
the same matrix even if the registries drift in the interim.
schema_version: 1

ticket: EXAMPLE-151
workload: fsdp
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +391 to +409
def run_recipe(
recipe: Recipe,
output_dir: Path,
dry_run: bool = False,
extra_sidecar_files: tuple[Path, ...] = (),
timestamp: str | None = None,
) -> Path:
"""Execute a recipe and write matrix.md / matrix.json / recipe.resolved.yaml.

Args:
recipe: Pre-validated recipe (from :func:`aorta.triage.recipe.load_recipe`
or :func:`aorta.triage.recipe.build_recipe_from_flags`).
output_dir: Top-level output directory (the CLI's ``--output-dir``).
dry_run: When True, validates and prints the resolved cell list to
stdout without touching the filesystem and returns a sentinel
``Path(".")``.
extra_sidecar_files: Operator-supplied sidecar JSONs (from
``--mitigations-file``). These are threaded through to B1's
registry resolver alongside the runner-generated inline sidecar.
Comment on lines +187 to +205
header = (
"Cell",
"Mitigations",
"Environment",
"Failure rate",
"Trials",
"Mean step (ms)",
"Confound",
)
rows: list[tuple[str, ...]] = [header]
for cell in cell_stats:
tag, _ = confound_tags.get(cell.name, (cell.error and "error" or "-", None))
rows.append(
(
cell.name,
_format_mitigations(cell.mitigations),
cell.environment,
_format_failure_rate(cell),
_format_fail_count(cell),
Comment on lines +120 to +124
if baseline.error is not None or baseline.mean_step_time_ms <= 0:
# Baseline unusable: we can't compute a ratio, so surface None and
# let the renderer emit "n/a". The runner also writes a top-of-file
# warning in matrix.md so readers see why.
return CONFOUND_NEUTRAL, None
Comment on lines +10 to +13
* ``recipe.resolved.yaml`` -- the recipe AS EXECUTED, with every registry
name expanded to its underlying env-var bundle + docker ref. This is the
reproducibility artifact: re-running it on a different machine produces
the same matrix even if the registries drift in the interim.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants