diff --git a/AGENTIC_EVAL_SPEC.md b/AGENTIC_EVAL_SPEC.md new file mode 100644 index 00000000..2a7d882f --- /dev/null +++ b/AGENTIC_EVAL_SPEC.md @@ -0,0 +1,528 @@ +# AGENTIC_EVAL_SPEC.md — Engineering Specification for cosmos-lab PrincipalAgent Evaluation + +> **Status**: v1 as of 2026-05-03. Companion to `EVAL_SPEC.md`. Where EVAL_SPEC covers ML-output eval (perplexity, KL divergence, latency p99 — model under test), this doc covers **agent-system eval** — the agent itself is artifact-under-eval (per EVAL_SPEC axiom A8). PLAN_V2.md §3.3 references this doc. +> +> **Scope**: production evaluation discipline for the cosmos-lab PrincipalAgent across all 6 capability domains and across the long-horizon autonomous loop (PLAN→EXECUTE→VERIFY→REPLAN). Numerical targets in §9 extend PLAN_V2.md §0.7. + +## Table of contents + +``` +0. Document scope and reading order +1. Why agentic eval differs from ML-output eval +2. Foundational axioms (A1-A10 transfer from EVAL_SPEC; A11-A13 are agentic-specific) +3. The 5-tier eval architecture (T0-T4 specialized for agentic) +4. The 6 agentic-specific evaluation surfaces (S1-S6) +5. Cross-cutting meta layers (M1-M3) +6. Statistical framework (transfers from EVAL_SPEC §4 with agentic additions) +7. The three input types (I1-I3, per JD bullet 5) +8. Operational cadence and gates +9. Numerical targets (extends PLAN_V2 §0.7) +10. Implementation map to v5 phases +11. References +``` + +--- + +## 0. Document scope and reading order + +**Read this if**: you are designing, building, or running evaluation for the cosmos-lab PrincipalAgent. This is the load-bearing reference; treat it the way ML systems engineers treat MLPerf benchmark rules. + +**Read EVAL_SPEC.md first if**: you need ML-output eval (model perplexity, KL divergence, GPU OOM, latency p99). This doc assumes EVAL_SPEC's principles as foundation. + +**Reading order for new contributors**: +1. EVAL_SPEC §1 (problem statement) → §2 (axioms) → §3 (5-tier architecture) +2. This doc §1 (why agentic eval differs) → §2 (axioms) → §3 (5-tier specialization) → §4 (6 surfaces) + +**Reading order for "I'm about to ship a change"**: +1. Section 8 — what gates apply, at what cadence +2. Section 9 — numerical targets your change must not regress + +--- + +## 1. Why agentic eval differs from ML-output eval + +EVAL_SPEC.md evaluates models. This doc evaluates agents. Three distinctions: + +### 1.1 The deliverable is a trajectory, not an output + +A model produces tokens; quality is a property of those tokens. An agent produces a *trajectory of tool calls, reasoning steps, and plan revisions* — quality is a property of the **process**, not just the final output. Two agents can produce identical correct outputs while one took 47 tool calls with 12 replans and the other took 3 tool calls correctly the first time. Output-only eval grades these the same. **Trajectory eval is mandatory** (surface S1). + +### 1.2 The agent is itself an artifact-under-eval (axiom A8 from EVAL_SPEC) + +Strong MMLU + strong HumanEval ≠ strong agentic tool-use. BFCL-v3 vs MMLU correlation across open models is r ≈ 0.4–0.6 — they measure different things. The cosmos-lab PrincipalAgent's *deliverable* is its decisions (plan decomposition, tool routing, replan strategy), not the underlying LLM's logits. Per A8: a separate eval surface for the agent's decisions is mandatory. + +### 1.3 Long-horizon eval is non-fungible with short-horizon eval (NEW — A13 below) + +A 5-day autonomous task is not equivalent to 120 one-hour tasks. Multi-day work introduces failure modes that don't exist at session-bounded scale: cross-session memory drift, plan staleness after compute interruption, capability expansion mid-task, accumulated context-window pressure. Eval suite designed for short-horizon misses these entirely. + +--- + +## 2. Foundational axioms + +EVAL_SPEC's A1-A10 transfer wholesale. Three new agentic-specific axioms: + +### A1-A10 — transfer from EVAL_SPEC.md + +(See EVAL_SPEC.md §2 for full statements.) Brief: + +- **A1** Every measurement is a sample from a distribution → bootstrap CIs always +- **A2** MDE pre-registration → know what effect size you can detect before running +- **A3** Composite metrics destroy information → don't average sentinel-agreement with cost +- **A4** Cross-entropy loss → KL is most-info-per-FLOP signal → applies to LLM-judge calibration +- **A5** Throughput and latency are adversarial → applies to multi-step agent loop +- **A6** UX dominated by tail not mean → p99 task completion time matters +- **A7** Benchmarks decay (saturation + contamination) → **critical for agentic, see §1.3 below** +- **A8** Agent is itself artifact-under-eval → **the foundational claim of this doc** +- **A9** Reproducibility is binary → every agent run produces an envelope (seeds, dep hashes, model version, tool registry hash, OTel trace ID) +- **A10** Eval-of-eval → null fixtures and planted regressions; FPR/FNR tracked + +### A11 — Trajectory carries information beyond outcome (NEW) + +Two agents reaching identical correct outputs can have radically different trajectory quality. Tool-call efficiency, replan ratio, wasted-work regions, and doom-loop frequency are first-class quality signals. + +**Operational consequence**: at least one trajectory metric must be in every gate (not just outcome metrics). Specifically: tool-call efficiency ≤ 1.5× minimum-required must be a pre-merge gate; trajectories with > 30% wasted-work regions must trip a sentinel. + +**Source**: τ-bench (Yao et al. 2024) — agent trajectories vary 5-10× in length on same task; SWE-bench Verified — patch quality is independent of patch length but trajectory cost varies 3×; METR's RE-Bench reward-hacking findings — outcome-correct + trajectory-pathological is the signature of reward hacking. + +### A12 — Capability expansion requires adversarial testing (NEW) + +Earned-trust capability expansion (per PLAN_V2 §3.2.5: agent's capabilities expand after K sentinel-clean runs) is a security claim. Without adversarial testing, the claim is theater. Specifically: an agent that EARNED expansion must demonstrate correct handling of new tools; an agent that did NOT earn expansion must fail to access them, even when given a task that requires them. + +**Operational consequence**: 50-task denied-tool probe suite (S4 below) runs before every capability expansion event. Capability expansion that has not been adversarially tested is not deployed. + +**Source**: classical security principle (Saltzer & Schroeder 1975, "least privilege"); modern application — Anthropic's Constitutional AI red-teaming protocol; OWASP LLM Top 10 (2024) #6 (excessive agency). + +### A13 — Long-horizon eval is non-fungible with short-horizon eval (NEW) + +A 5-day autonomous task is not equivalent to 120 one-hour tasks. Long-horizon failure modes (cross-session memory drift, plan staleness after compute interruption, capability expansion mid-task, accumulated context pressure) do not surface in short-horizon eval suites. + +**Operational consequence**: at least 1 long-horizon (≥ 24 hour wall-clock) task in every weekly eval cadence. T2-T4 tier definitions below specify long-horizon variants. + +**Source**: Anthropic's Claude Computer Use eval methodology (24h+ task suite); SWE-Lancer (Nov 2024) — long-horizon coding tasks have failure modes invisible to SWE-bench Verified. + +--- + +## 3. The 5-tier eval architecture (specialized for agentic) + +EVAL_SPEC's T0-T4 architecture transfers; each tier is specialized for agentic context. + +### 3.1 Tier definitions (agentic specialization) + +#### T0 — Smoke (catch catastrophic agent breakage) + +- **Catches**: agent loop returns; tool registry loads; sentinel module imports; capability denial path executes; OTel emitter doesn't crash; identity exchange succeeds. +- **Gate**: pre-merge to feature branch. +- **Cost budget**: <$0.10 per run, <2 min wall-clock. +- **Cadence**: every CI run. +- **Method**: fixed 5-step trivial agent run with mock LLM responses; verify all spans emitted, capability denial works for one denied tool, sentinel evaluator returns expected verdict. +- **Acceptance**: 100% pass. +- **Statistical rigor**: none — deterministic correctness. + +#### T1 — Calibrated quality (catch agent capability regression) + +- **Catches**: silent regression in agent decision quality — sentinel agreement drops, pass rate on golden suite drops, plan quality degrades. +- **Gate**: pre-merge to main. +- **Cost budget**: $10-$50 per run, 30-90 min wall-clock. +- **Cadence**: every PR to main; nightly on main. +- **Method**: 15-task golden suite (5 from each of P3/P5/P6 capability domains), 3 runs per task = 45 trajectories; compute pass rate ± bootstrap 95% CI; sentinel agreement; plan quality LLM-judge score; trajectory metrics (S1). +- **Acceptance**: per §6.5 conjunctive verdict — all of: pass rate lower CI ≥ baseline, sentinel agreement ≥ 98%, no S1 metric degraded by > 20%. +- **Statistical rigor**: full — bootstrap CIs, MDE pre-registered (target: detect ≥ 5pp pass-rate change at p < 0.05, requires N ≥ 45), Holm-Bonferroni across the 4 metric families. + +#### T2 — Long-horizon multi-day (catch cross-session failure) + +- **Catches**: failure modes only visible at long-horizon — memory drift, plan staleness across resumption, capability expansion mid-task issues, context accumulation. +- **Gate**: pre-merge if change touches planner, memory tier, or capability expansion logic. +- **Cost budget**: $50-$200 per run, 24-48 hour wall-clock. +- **Cadence**: weekly on main; on-demand for memory/planner/identity PRs. +- **Method**: 1-task long-horizon eval with deliberate compute interruption at 12-hour mark to test resumption; agent must complete within +20% wall-clock of uninterrupted baseline. +- **Acceptance**: completion within budget; sentinel agreement preserved across resumption boundary; episodic memory correctly recalls pre-interruption state. +- **Statistical rigor**: paired comparison (interrupted vs uninterrupted on same task), N ≥ 5 task instances, paired bootstrap. + +#### T3 — Shadow (catch bench-vs-prod distribution shift) + +- **Catches**: agent that looks good on golden suite but degrades on real production task slice. +- **Gate**: pre-deploy to canary. +- **Cost budget**: $200-$500 per run, hours-to-1-day wall-clock. +- **Cadence**: per release-candidate; weekly on main. +- **Method**: replay 50-task captured slice of real production traffic (anonymized) through current and prior agent version side-by-side; paired McNemar on task pass/fail; paired bootstrap on cost/latency. +- **Acceptance**: paired pass-rate within pre-registered band (default ± 3pp); cost not regressed > 20%; sentinel agreement preserved. +- **Statistical rigor**: full — paired tests on real (not synthetic) traffic shape. + +#### T4 — Canary (catch real-user impact) + +- **Catches**: alignment edge cases, long-tail user complaint patterns, cosmetic regressions, unforeseen interactions. +- **Gate**: pre-full-rollout. +- **Cost budget**: $300-$1000 + user-exposure risk; hours-to-days. +- **Cadence**: per release. +- **Method**: route 1-5% of real production traffic to new agent version; monitor SLO compliance, sentinel trip rate, escalation rate (agent giving up vs completing), cost per session, user feedback signals. +- **Acceptance**: no guardrail violation over pre-registered observation window (≥ 24h or N ≥ 100 sessions for power); sentinel trip rate not increased > 50% vs prior version. +- **Statistical rigor**: sequential testing (always-valid p-values per Howard et al. 2021) for safe early stopping. + +### 3.2 Tier summary table + +| Tier | Catches (agentic) | Gate | Cost/run | Cadence | Stats rigor | +|---|---|---|---|---|---| +| T0 | Catastrophic agent breakage | Pre-merge to branch | <$0.10 | Every commit | None | +| T1 | Capability/decision regression | Pre-merge to main | $10-$50 | Every PR + nightly | Full (CIs + MDE + Holm) | +| T2 | Long-horizon / cross-session failure | Pre-merge if planner/memory/identity | $50-$200 | Weekly + on-demand | Paired bootstrap | +| T3 | Bench-vs-prod shift | Pre-canary | $200-$500 | Per RC + weekly | Paired McNemar | +| T4 | Real-user impact | Pre-full-rollout | $300-$1000 + risk | Per release | Sequential testing | + +--- + +## 4. The 6 agentic-specific evaluation surfaces + +These are eval surfaces that don't exist in EVAL_SPEC.md because they're agent-system specific. Each surface specifies: what it catches, when it runs, method, acceptance criterion. + +### S1 — Trajectory eval (process quality, not just outcome) + +**Catches**: outcome-correct + trajectory-pathological agent behavior (the signature of reward hacking). + +**Method**: every trajectory generates these metrics from OTel spans: +- **Tool-call efficiency** = `actual_tool_calls / minimum_required_for_task` (computed from gold-trajectory baseline) +- **Replan ratio** = `replan_count / milestone_count` +- **Wasted-work fraction** = fraction of trajectory spent on milestones agent later abandoned +- **Doom-loop frequency** = count of repeated identical (tool, args) tuples within 5-step window (uses `agent/core/doom_loop.py`) + +**Cadence**: continuous (every trajectory tagged with these metrics in OTel attributes). + +**Acceptance** (gate at T1): no metric degraded > 20% vs rolling 7-day baseline. + +**Owned path**: `cosmos_lab/eval/trajectory_metrics.py`. + +### S2 — Plan-quality eval (PLAN phase output) + +**Catches**: poor plan decomposition that wastes EXECUTE phase work. Bad plans look like: missing milestones, missing verifiers, wrong dependency order, milestones that aren't independently verifiable. + +**Method**: LLM-judge with rubric scores PLAN-phase output on 4 dimensions: +- **Decomposition**: are milestones independently verifiable? (1-5) +- **Coverage**: do milestones cover the goal? (1-5) +- **Sequence**: are dependencies correct? (1-5) +- **Verifier coverage**: does each milestone have a generated verifier? (1-5) + +Aggregate plan-quality score = geometric mean (so any dimension scoring ≤ 2 dominates — per A3 "composite metrics destroy info" exception: geometric mean preserves "weakest link" information). + +**Cadence**: every plan generation (gated — agent cannot proceed to EXECUTE if plan-quality < 3.5/5). + +**Acceptance**: T1 gate — mean plan-quality score across golden suite ≥ baseline within bootstrap CI. + +**Owned path**: `cosmos_lab/eval/plan_quality.py`. + +### S3 — Replan-quality eval (response to sentinel trips) + +**Catches**: agent that hits same wall repeatedly (no learning from sentinel feedback). Sentinel trips are agent's most informative moments — replanning quality determines downstream waste. + +**Method**: every sentinel trip is tagged with the resulting REPLAN action; downstream evaluation: +- **Replan success rate** = fraction of replans where next milestone is sentinel-clean +- **Replan diversity** = LLM-judge whether new plan tried genuinely different approach (not surface tweak) +- **Time-to-recovery** = milestones from sentinel trip until back on baseline trajectory + +**Cadence**: continuous (computed weekly from production trajectories). + +**Acceptance**: replan success rate ≥ 70%; time-to-recovery p90 ≤ 3 milestones. + +**Owned path**: `cosmos_lab/eval/replan_quality.py`. + +### S4 — Capability boundary eval (adversarial) + +**Catches**: capability expansion bypass — agent accessing tools outside earned scope. Per axiom A12. + +**Method**: 50-task denied-tool probe suite. Each probe task is constructed so that the *easiest* path requires a denied tool. Agent should EITHER: +- Fail cleanly (request scope expansion or report blocker) +- Find a creative path using only authorized tools + +NEVER: +- Bypass the capability check +- Silently use a denied tool path that the test author missed + +Test runs in two modes: pre-expansion scope and post-expansion scope; both must behave correctly. + +**Cadence**: nightly during pre-deploy; before every capability expansion event. + +**Acceptance**: 0 unauthorized tool calls across 100 child-agent runs (PLAN_V2 §0.7 P4b target). + +**Owned path**: `cosmos_lab/eval/capability_boundary.py` + `tests/optimization/eval/probe_tasks/`. + +### S5 — Reward-hacking adversarial eval (red-team sprint) + +**Catches**: novel reward-hack patterns not yet covered by sentinel taxonomy (per UC Berkeley audit + METR findings — 2026 frontier crisis). + +**Method**: monthly red-team sprint: +- **Day 1**: human (or adversarial agent) drafts 5 creative reward-hack attempts targeting current sentinel suite +- **Day 2-3**: each attempt run against current sentinel suite; document which trip vs which slip through +- **Day 4-5**: any successful hack → new sentinel type added to §3.1 taxonomy + test added to S4 probe suite +- **Day 5**: track discovery rate over time (declining = sentinel suite maturing) + +**Cadence**: monthly. + +**Acceptance**: discovery rate trends downward over 6 months; sentinel suite size grows with discoveries. + +**Owned path**: `cosmos_lab/eval/red_team/` + monthly retro doc in `docs/eval_retros/`. + +### S6 — Cross-agent comparison eval (the differentiator pitch) + +**Catches**: cosmos-lab claiming "exceptional" without evidence. Without comparison to 2026 baselines (Devin, Claude Code, Cursor Composer, human), "exceptional" is rhetoric. + +**Method**: same task spec given to: +1. **PrincipalAgent** (cosmos-lab) +2. **Claude Code** (via SDK, single-session) +3. **Devin** (manual reproduction — record screen) +4. **Human researcher** (time-bounded equivalent budget) + +Score each on: +- **Pass rate** (binary outcome) +- **Cost** ($USD spent) +- **Time** (wall-clock) +- **Trajectory quality** (S1 metrics on PrincipalAgent + Claude Code; manual scoring on Devin/human) +- **Sentinel agreement** (PrincipalAgent only — comparison agents don't have sentinels) + +Output: **Pareto chart** on cost-quality plane. We claim a Pareto position, not a "win." + +**Cadence**: quarterly (comparison is expensive; not for fast iteration). + +**Acceptance**: PrincipalAgent on Pareto frontier (no other agent strictly dominates on cost + quality jointly). + +**Owned path**: `cosmos_lab/eval/cross_agent/` + quarterly report in `docs/eval_quarterly/`. + +--- + +## 5. Cross-cutting meta layers + +These apply across tiers and surfaces. Transfer from EVAL_SPEC §3.3. + +### M1 — Reproducibility envelope (per axiom A9) + +Every agent run produces an envelope: +- **Seeds**: LLM sampling temperature, retry seed, scheduler seed +- **Dependency hashes**: `pyproject.toml.lock` hash, key tool versions +- **Model version**: exact LLM model ID + provider +- **Tool registry hash**: hash of available tool definitions at run time +- **Capability scope**: snapshot of which tools agent had at run start +- **OTel trace ID**: links to full trajectory +- **Hardware fingerprint**: GPU SKU, CUDA version (if GPU phase) + +Runs without complete envelopes are advisory only; they cannot gate. + +### M2 — Eval-of-eval (per axiom A10) + +Every release of cosmos-lab eval system runs: +- **Null fixture**: agent A vs identical agent A → must NOT gate-fail (FPR test) +- **Planted regression fixture**: known-broken agent → must gate-fail (FNR test) + +Tracked: FPR (false alarms — null fixture incorrectly gates) and FNR (missed regressions — planted regression incorrectly passes). + +**Targets**: FPR ≤ 5%, FNR ≤ 1%. (FNR < FPR — better to false-alarm than miss real regression.) + +### M3 — Cost telemetry + +Every eval run reports: +- LLM token cost (input + output tokens × model cost) +- Tool execution cost (sandbox compute, GPU time if applicable) +- Total run cost ($USD) + +Aggregated weekly: total eval spend, eval cost as % of total project spend. + +**Target**: eval cost ≤ 15% of total project GPU + compute budget. + +--- + +## 6. Statistical framework + +Transfer from EVAL_SPEC §4 with agentic additions. + +### 6.1 Bootstrap CIs (transfer) + +Non-parametric, robust to non-Gaussian distributions. Use for: pass rate, sentinel agreement, latency percentiles, cost. + +### 6.2 Holm-Bonferroni for multiple comparisons (transfer) + +Necessary because we evaluate multiple metrics across multiple tasks. Without correction, family-wise error rate explodes. + +### 6.3 MDE pre-registration (transfer) + +Before any eval run that gates a decision, pre-register: "to detect effect size X at p < α with power 1-β, requires N ≥ Y trajectories." If N < Y, the run is advisory only. + +### 6.4 Paired tests for agent comparisons (NEW for agentic) + +When comparing agent versions (current vs prior, or PrincipalAgent vs Claude Code), use paired tests on shared task instances: +- **McNemar** for binary outcomes (pass/fail) +- **Paired bootstrap** for continuous outcomes (cost, latency, trajectory metrics) + +Paired tests have higher power than unpaired for small N — critical because cross-agent comparison N is constrained by cost. + +### 6.5 The conjunctive verdict (transfer with adaptation) + +For T1 gate, ACCEPT requires ALL of: +- Pass rate lower CI bound ≥ baseline +- Sentinel agreement ≥ 98% +- Plan quality (S2) ≥ baseline within CI +- No trajectory metric (S1) degraded > 20% vs rolling baseline + +Any single metric failing → REJECT. Conjunctive ACCEPT prevents "we gained on metric X but quietly regressed on metric Y" pattern. + +### 6.6 Sequential testing for canary (transfer) + +T4 canary uses always-valid p-values (Howard et al. 2021) so we can stop early when evidence is conclusive — saves cost and reduces user-exposure risk. + +### 6.7 Pareto analysis for cross-agent (NEW) + +S6 cross-agent comparison reports Pareto position, not single-metric. PrincipalAgent's claim is "we sit on the Pareto frontier of cost × quality" — not "we are best." + +--- + +## 7. The three input types (per JD bullet 5) + +> JD: *"Design and scale evaluation platforms that combine automated metrics, human feedback, and agent-driven analysis."* + +Three types — all required. + +### I1 — Automated metrics + +- Sentinel pass/fail (per §3.1 sentinel taxonomy) +- OTel-derived metrics (trajectory length, replan count, doom-loop frequency) +- Cost telemetry (per M3) +- Deterministic verifiers (per Inspect AI Scorer) + +Cadence: continuous. + +### I2 — Human feedback + +- **5% random sampling** of production runs flagged for human review +- **Argilla-based labeling UI** for human reviewers +- **Weekly review meeting** (1 hour, 1-2 reviewers) — review sample, identify patterns +- **Findings feed back into sentinel suite** — recurring human complaints become new sentinel types +- **Disagreement audit** — when human review disagrees with sentinel verdict, both records preserved + disagreement triggers root-cause investigation + +Cadence: continuous sampling, weekly review. + +### I3 — Agent-driven analysis + +- **LLM-as-judge** with sentinel pair (per PLAN_V2 §3.1) +- **ToolAugmentedJudge** (judge can call read-only tools to verify claims) +- **MultiJudge variance reduction** (no debate dynamics — per arxiv:2508.17536) +- **Self-eval** — PrincipalAgent reads its own trajectory and produces a critique (used in P8 GEPA loop for failure mining) + +Cadence: continuous on every gate decision. + +--- + +## 8. Operational cadence and gates + +### 8.1 What runs WHEN + +| Cadence | Tier(s) | Surface(s) | Cost budget | +|---|---|---|---| +| Every commit | T0 | S1 sanity | <$0.10 | +| Every PR to main | T1 | S1 + S2 + S3 | $10-$50 | +| Nightly | T1 + T2 | S1 + S4 | $50-$200 | +| Weekly | T2 + T3 | S1-S4 + S6 (sample) + I2 (5% human review) | $300-$700 | +| Monthly | T3 | S5 red-team + M2 eval-of-eval | $500-$1000 | +| Per release | T4 | All surfaces + production monitoring | $300-$1000 + risk | +| Quarterly | — | S6 cross-agent full comparison | $500-$2000 | + +### 8.2 What gates WHAT (conjunctive ACCEPT) + +**Merge to main**: T0 + T1 + S1 + S2 + S3 GREEN; bootstrap CI lower bound on pass rate ≥ baseline. + +**Deploy to canary**: above + T2 + T3 GREEN; S4 capability boundary clean. + +**Full rollout**: above + T4 GREEN over 24h; S5 monthly probe clean (no novel hack discovered in last 30 days). + +**Capability expansion event**: S4 capability boundary clean for current scope AND for proposed expanded scope; M2 eval-of-eval green. + +### 8.3 Verifier scripts (per workflow Phase 1 DEFINE) + +Every gate is implemented as a verifier script under `bin/`: + +``` +bin/verify_t0_smoke.sh +bin/verify_t1_calibrated.sh +bin/verify_t2_long_horizon.sh +bin/verify_t3_shadow.sh +bin/verify_t4_canary.sh +bin/verify_s1_trajectory.sh +bin/verify_s2_plan_quality.sh +bin/verify_s3_replan_quality.sh +bin/verify_s4_capability_boundary.sh +bin/verify_s5_red_team.sh +bin/verify_s6_cross_agent.sh +bin/verify_m2_eval_of_eval.sh +bin/verify_release.sh # composite: runs T0+T1+T2+T3+T4 + all S +``` + +Each returns 0 (pass) or 1 (fail). Per workflow rule: verifier is a script, not a description. + +--- + +## 9. Numerical targets — eval system commitments + +These extend PLAN_V2 §0.7 with eval-system-specific targets. + +| # | Eval target | Commitment | How measured | Phase introduced | +|---|---|---|---|---| +| E1 | Sentinel suite false-positive rate | ≤ 5% on null fixtures | M2 null fixture suite, weekly | P1 | +| E2 | Sentinel suite false-negative rate | ≤ 1% on planted regressions | M2 planted regression suite, weekly | P1 | +| E3 | T1 calibrated suite stability | Test-retest correlation r ≥ 0.95 | M2, monthly | P1 | +| E4 | Plan quality LLM-judge agreement with human | ≥ 80% agreement on 50-plan sample | S2 calibration, quarterly | P4a | +| E5 | Replan success rate | ≥ 70% (replans → next milestone sentinel-clean) | S3, continuous | P4a | +| E6 | Capability boundary probe pass rate | 100% (0 unauthorized tool calls in 100 child-agent runs) | S4, nightly | P4b | +| E7 | Reward-hack discovery rate | Trending downward over 6 months | S5, monthly | P4a | +| E8 | Cross-agent Pareto position | PrincipalAgent on Pareto frontier of cost × quality | S6, quarterly | P10 | +| E9 | Eval cost as % of total spend | ≤ 15% of total GPU + compute budget | M3, weekly | P1 | +| E10 | Reproducibility envelope coverage | 100% of agent runs tagged with envelope | M1, every run | P1 | + +--- + +## 10. Implementation map to v5 phases + +This eval architecture integrates into PLAN_V2.md v5 phases without adding a new phase: + +| Phase | Eval work added | +|---|---| +| **P1** (W2-3) | T0 + T1 baseline + S1 trajectory metrics + M1 reproducibility envelope + M2 eval-of-eval skeleton + sentinel taxonomy (already in §3.1) | +| **P3** (W6-7) | First real-data calibration of T1 on data curation tasks | +| **P4a** (W10) | EvalAgent platform builds: S2 plan quality + S3 replan quality + I2 human review sampling protocol + S5 monthly red-team kickoff | +| **P4b** (W10-11) | S4 capability boundary probe suite (50 tasks) — security-critical, blocks identity v2 ship | +| **P5** (W11-12.5) | T2 long-horizon eval first run (real GPU sweep is the long-horizon task) | +| **P5.5** (W12) | T1 calibrated includes PyTorch artifact eval | +| **P6** (W14-15.5) | T2 serving eval per Invariant 9 | +| **P8** (W17-18) | M2 eval-of-eval matures (null + planted regression fixtures stable); GEPA promotion uses E5 replan success rate as one ratchet criterion | +| **P9a** (W19-20) | First T3 shadow eval on multimodal pipeline | +| **P9b** (W20-21) | First S6 cross-agent comparison: PrincipalAgent vs Claude Code on bug-fix fixture | +| **P10** (W21.5-22.5) | T4 canary on production deployment; first quarterly S6 cross-agent full comparison; E8 Pareto position evidence | + +**Net cost**: ~3-4 days additional spec/test work spread across phases. No new phase needed. + +--- + +## 11. References + +### Foundational +- EVAL_SPEC.md (this doc's parent — ML output eval) +- Beizer, *Software Testing Techniques* — eval-of-eval discipline +- NeurIPS Reproducibility Checklist (mandatory since 2019) + +### Agentic eval literature +- τ-bench (Yao et al. 2024, Sierra) — agent trajectory eval +- BFCL-v3 (Patil et al. 2024, Berkeley) — function-calling eval +- GAIA (Mialon et al. 2023, Meta) — generalist agent eval +- GAIA-2 (2026) — verifier-based agent eval +- SWE-bench Verified (Jimenez et al. 2024, ICLR + 2025 Anthropic) — coding agent eval +- MLE-bench (OpenAI 2024) — ML engineering agent eval +- METR's RE-Bench — autonomous AI capability eval + +### Reward hacking + benchmark integrity (2026 crisis) +- METR (2025-06-05) — "Recent Frontier Models Are Reward Hacking" — o3 hacks 1-2% of runs overall, 43× more on RE-Bench +- UC Berkeley RDI (2026) — "How We Broke Top AI Agent Benchmarks" — 8/8 benchmarks hackable to 73-100% +- "Debate or Vote" (arxiv:2508.17536, 2025) — multi-agent debate refutation; majority voting alone explains gain + +### Statistical framework +- Howard et al. 2021 — "Time-uniform Chernoff bounds" → sequential testing +- Pineau et al. 2021 — "Improving reproducibility in machine learning research", JMLR +- Dehghani et al. 2021 — "The Benchmark Lottery" → test-retest analysis + +### Inspect AI + production tooling +- Inspect AI (UK AISI) — production eval framework +- OpenTelemetry GenAI semantic conventions — trajectory observability +- DSPy 3.x + dspy.GEPA — self-improvement loop +- Argilla — human review labeling UI diff --git a/AGENTS.local.md b/AGENTS.local.md new file mode 120000 index 00000000..681311eb --- /dev/null +++ b/AGENTS.local.md @@ -0,0 +1 @@ +CLAUDE.md \ No newline at end of file diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..0e09a850 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,37 @@ +# Agent Notes + +## Local Dev Servers + +- Frontend: from `frontend/`, run `npm ci` if dependencies are missing, then `npm run dev`. +- Backend: from `backend/`, run `uv run uvicorn main:app --host ::1 --port 7860`. +- Frontend URL: http://localhost:5173/ +- Backend health check: `curl -g http://[::1]:7860/api` +- Frontend proxy health check: `curl http://localhost:5173/api` + +Notes: + +- Vite proxies `/api` and `/auth` to `http://localhost:7860`. +- If `127.0.0.1:7860` is already owned by another local process, binding the backend to `::1` lets the Vite proxy resolve `localhost` cleanly. +- Prefer `npm ci` over `npm install` for setup, since `npm install` may rewrite `frontend/package-lock.json` metadata depending on npm version. +- Production defaults to the Bedrock Claude model. For local development with a personal Anthropic key, set `ANTHROPIC_API_KEY` and `ML_INTERN_CLAUDE_MODEL_ID=anthropic/claude-opus-4-6` before starting the backend. Other models are selected through the app's model switcher. + +## GitHub CLI + +- For multiline PR descriptions, prefer `gh pr edit --body-file ` over inline `--body` so shell quoting, `$` env-var names, backticks, and newlines are preserved correctly. + +## Hugging Face Space Deploys + +- The Space remote is `space` and points to `https://huggingface.co/spaces/smolagents/ml-intern`. +- Deploy GitHub `main` to the Space from the local `space-main` branch by merging `origin/main` into `space-main` with a single merge commit, then pushing `space-main:main` to the `space` remote. +- Keep the Space-only README frontmatter on `space-main`; `.gitattributes` should contain `README.md merge=ours` and the local repo config should include `merge.ours.driver=true`. +- Recommended deploy flow: + +```bash +git pull --ff-only origin main +git switch space-main +git config merge.ours.driver true +git merge --no-ff origin/main -m "Deploy $(date +%Y-%m-%d)" \ + -m "Co-authored-by: OpenAI Codex " +git push space space-main:main +git switch main +``` diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..29aef96f --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,98 @@ +# cosmos-lab — Context Harness + +> **What this is**: zero-diff fork of `huggingface/ml-intern` that ships the **cosmos-lab** library — a frontier-aligned production agentic system: **5 production agents** (1 PrincipalAgent supervisor + 4 specialty workers: Data, Eval, Train, Optimize) + **CodeWork Skill** (Anthropic Skills pattern) + **3 offline governance tools** (GepaOptimizer, CapabilityProbe, CrossAgentEvaluator — explicitly NOT standing agents per 2026 frontier convergence) + **~16 governance infrastructure components** (5 sentinel types incl. judge-hacking detector, cross-family MultiJudge, MCP OAuth+RFC 8693, hash-chained signed audit, OTel-GenAI, 4-scope hybrid memory via Mem0/Letta, Inspect AI bridge, **LangGraph durable supervisor + Magentic-One Task/Progress Ledger pattern**, **context engineering discipline** — cache-aware prompt structure + 75% compaction + just-in-time retrieval + cosmos-progress.md state file), built on ml-intern's tool primitives leveraged AS-IS inside LangGraph worker nodes. Deployed via `nat run cosmos-lab.yaml`. See `docs/01_north_star.md` for vision; PLAN_V2.md §0.6 + §0.65 + §0.9 + §3.1 + §3.2 (incl. §3.2.8 context engineering) for v7-stronger thesis. + +**Current phase** → `docs/02_current_phase.md` (LIVE — read this first when starting work) + +--- + +## Invariants — never break + +1. **ZERO-DIFF**: never edit any file existing in `huggingface/ml-intern`. Use subclass / composition / new path. +2. Never `git commit` or `git push` without explicit user request. +3. `uv run pytest tests/unit/` must match baseline (237 pass / 3 upstream-broken — see PLAN_V2 invariant 2). +4. New code only in owned paths (table below). +5. **One-optimization-per-experiment** + **measured-peak over vendor-peak** (EVAL_SPEC.md, applies P6+). +6. **Trajectory-on-by-default**: from P1, no agent run is unobserved. +7. **OTel-GenAI-on-by-default**: from P1, every span uses `gen_ai.*` semconv. +8. **No judge-only metric reaches a gate**: every quality gate requires (judge, structural-verifier) pair. +9. **No GPU phase exits without one measured real run** (P5/P5.5/P6/P9a; ~$200-400 budget). + +--- + +## Owned paths + +| Write here | Never touch (upstream) | +|---|---| +| `cosmos_lab/` (P0.5+) | `agent/core/` | +| `agent/optimization/` (Phase 0 home) | `agent/config.py` | +| `agent/tools/{profiling,training_opt,inference_opt,multimodal_opt,vla_opt}/` | `agent/context_manager/manager.py` | +| `agent/tools/hardware_specs.py` | `agent/tools/*.py` (existing) | +| `agent/prompts/system_prompt_optimization_*.yaml` | `backend/`, `frontend/`, `tests/unit/` | +| `configs/optimization_agent_config.json` | | +| `tests/optimization/` | | +| `docs/`, `bin/` (this harness) | | + +**Verify ownership**: `git diff upstream/main --name-only` must show only owned paths. + +--- + +## Anti-patterns (catch yourself) + +1. Editing the prompt when the bug is in the data. +2. Editing the data when the bug is in the spec. +3. Trusting "I have verified this" from an agent — re-run the verifier yourself. +4. Building a pipeline that should have been one model call. +5. Adding a fourth concurrent agent. You will regret it. +6. Saying "the agent decided" — replace with "P(output | context) was high." + +--- + +## Workflow (every task) + +`DEFINE → PROBE → BUILD → REVIEW → SHIP → LEARN` — see `docs/00_workflow.md`. + +**Hard rule**: if you can't write the verifier, the goal is wrong. Fix the goal, not the agent. + +--- + +## Dev commands + +```bash +uv sync --extra dev # install (--extra dev for pytest) +uv run python -m pytest tests/unit/ -q # upstream baseline (must match) +uv run python -m pytest tests/optimization/ -q # cosmos-lab tests (must pass) +PYTHONPATH=. ruff check agent/ --ignore E501,F401,E402 # lint +./bin/verify.sh # phase verifier (e.g. p0_5_d3) +git fetch upstream && git merge upstream/main # daily upstream sync +``` + +Note: use `uv run python -m pytest` (NOT `uv run pytest`) — bare `uv run pytest` +can resolve to a system pytest with stale package metadata. Captured in P0.5 D2 LEARN. + +--- + +## Pointer index (load on demand) + +| Need | Read | +|---|---| +| What we're building this week | `docs/02_current_phase.md` | +| Vision in 1 screen | `docs/01_north_star.md` | +| Workflow phases | `docs/00_workflow.md` | +| Phase → PLAN_V2 anchor map | `docs/03_pointers.md` | +| Full plan (24 sections, 837L) | `PLAN_V2.md` (read specific section, not whole file) | +| Architecture deep WHY (Vietnamese, 1167L) | `SYSTEM.md` (rare — only for upstream debugging) | +| Eval spec — ML output (perplexity, KL, latency p99) | `EVAL_SPEC.md` | +| Eval spec — agent system (trajectory, plan, replan, capability boundary, reward-hack, cross-agent) | `AGENTIC_EVAL_SPEC.md` | +| Self-improvement research | `RESEARCH_AHE_ANALYSIS.md` | +| Dev server / deploy notes | `AGENTS.md` | +| NVIDIA Cosmos JD | `docs/04_jd.md` | + +--- + +## Git remotes + +``` +upstream → https://github.com/huggingface/ml-intern (read-only, never push) +origin → https://github.com/andreidhoang/ml-optimization-agent +``` diff --git a/EVAL_SPEC.md b/EVAL_SPEC.md new file mode 100644 index 00000000..fad01add --- /dev/null +++ b/EVAL_SPEC.md @@ -0,0 +1,1056 @@ +# EVAL_SPEC.md — Engineering Specification for Agentic Optimization Evaluation + +> **Author perspective**: senior AI performance evaluation engineer at a frontier-AI lab, 2026. +> **Project**: ML Optimization Agent (`ml-intern` fork) — an agentic system that recommends, applies, and verifies model-optimization techniques (quantization, kernel selection, parallelism, speculative decoding, scheduling) for training, inference, multimodal, and VLA workloads. +> **Goal of this doc**: make the eval design *defensible to a skeptical reviewer*. Every choice traces to either an information-theoretic bound, a measurement-physics constraint, or an empirically documented failure mode in the published literature. +> **Companion docs**: `PLAN.md` (16-week build plan), `SYSTEM.md` (architecture, Vietnamese), `WORKFLOW.md` (file-ownership rules). +> **Status**: design spec. Implementation tracked in `PLAN.md` Phase 8 (to be added). + +--- + +## Table of contents + +0. [Document scope and reading order](#0-document-scope-and-reading-order) +1. [Problem statement: why eval is the load-bearing wall](#1-problem-statement-why-eval-is-the-load-bearing-wall) +2. [Foundational axioms (with derivations and citations)](#2-foundational-axioms) +3. [The 5-tier eval architecture](#3-the-5-tier-eval-architecture) +4. [Statistical framework](#4-statistical-framework) +5. [Metric taxonomy](#5-metric-taxonomy) +6. [Benchmark suite (2026 frontier choices)](#6-benchmark-suite) +7. [Agent-level evaluation](#7-agent-level-evaluation) +8. [Reproducibility envelope](#8-reproducibility-envelope) +9. [Implementation specification](#9-implementation-specification) +10. [Operational runbooks](#10-operational-runbooks) +11. [Risks, anti-patterns, and known limitations](#11-risks-anti-patterns-and-known-limitations) +12. [References](#12-references) + +--- + +## 0. Document scope and reading order + +This spec covers **evaluation of optimizations applied by the agent**, **evaluation of the agent itself**, and the **statistical/operational substrate** that makes either claim defensible. + +**Reading order by role:** + +- **First-time reader** → §1 → §2 → §3 → §11. ~30 min, gives you the full mental model. +- **Implementer** → §3 → §4 → §9. Concrete code spec. +- **Reviewer / skeptic** → §2 → §4 → §6 → §11 → §12. Sources and limitations. +- **On-call** → §10. Runbooks for "the eval is failing, what now". + +**Out of scope of this doc:** + +- The optimization techniques themselves — covered in `PLAN.md` Phases 2–5. +- Training-side eval (model pre/post-training); we eval *optimization deltas*, not training quality. The two share infrastructure (statistical framework, benchmarks) but differ in subject of measurement. +- Cost/billing dashboards — operational concern, not eval methodology. + +--- + +## 1. Problem statement: why eval is the load-bearing wall + +### 1.1 What the agent does, and why eval is harder than it looks + +The ml-intern agent ingests a model + workload + hardware spec, runs profiling tools, and recommends optimizations. A naive eval is: "did the optimization improve the target metric?" This is wrong on at least four counts: + +1. **The target metric is multi-dimensional.** A quantization that improves throughput 30% but degrades GPQA-Diamond 4 points is a *regression in disguise*. A single-axis eval cannot adjudicate. +2. **Sample noise dominates small effects.** A 0.3% MMLU improvement on a 1k subsample has Minimum Detectable Effect (MDE) ≈ 1.4% — the claim is *literally below the noise floor* (derivation in §4.3). +3. **The agent itself is a system under test.** Even if a recommended optimization is good in isolation, the agent might have selected the wrong tool, mis-diagnosed the bottleneck, or thrashed in a doom-loop. Eval-of-the-agent is distinct from eval-of-the-optimization. +4. **Distribution shift between bench and prod.** Static benchmarks (ShareGPT samples, wikitext-2) do not match production traffic shape (request length distribution, concurrency, arrival burstiness). Optimizations that look great on bench can degrade goodput@SLO in real serving. + +A defensible eval system addresses all four. This document specifies how. + +### 1.2 The cost of getting it wrong + +Two failure modes, asymmetric in cost: + +- **False positive (ship a bad optimization)**: silent quality regression reaches users. Recovery cost = rollback + customer trust + on-call burn. Often *not detected for weeks* if eval was bench-only and the regression manifests on real-traffic distribution. +- **False negative (reject a good optimization)**: lost compute savings + dev velocity. Recoverable via re-run, but each false rejection burns ~1 engineer-day. + +A single-threshold gate cannot optimize both. A **tiered system** (§3) is the only way to give each gate the right Type-I/Type-II tradeoff for its decision surface. + +--- + +## 2. Foundational axioms + +Every downstream design choice is derived from one of these. If you disagree with a downstream choice, find the axiom you disagree with first — that's where the real argument is. + +### A1 — Every measurement is a sample from a distribution + +A benchmark score is a random variable over (data, seed, hardware nondeterminism, kernel-launch ordering). Reporting a point estimate without quantified uncertainty is mathematically meaningless: you cannot distinguish signal from noise. + +**Operational consequence**: every eval result must include a confidence interval (CI) or be marked "indicative-only, do not gate". + +**Source**: standard frequentist statistics; for ML-specific evidence of how badly results degrade without CIs, see Henderson et al. 2018 "Deep Reinforcement Learning that Matters" (AAAI), which showed seed variance explained more outcome variance than algorithmic differences. + +### A2 — You cannot detect what you cannot measure (MDE) + +For sample size *n*, measurement variance σ², and significance level α, the Minimum Detectable Effect at power 1−β is approximately: + +``` +MDE ≈ (z_{1-α/2} + z_{1-β}) × σ × √(2/n) +``` + +(For two-sample comparison; paired tests reduce by √2.) If MDE > the claimed effect, the claim is *unfalsifiable from this experiment*. + +**Operational consequence**: every eval must publish its MDE and refuse to gate on effects below MDE. + +**Source**: classical statistical power analysis (Cohen 1988). For ML application, see Madaan et al. 2024 and Schaeffer et al. 2023 ("Are Emergent Abilities of Large Language Models a Mirage?", NeurIPS 2023 best paper) — both demonstrate how lack of power analysis manufactured false claims at scale. + +### A3 — Composite metrics destroy information + +Collapsing multiple dimensions into a single "quality score" loses the structure of *which* dimension regressed. A model that gains 2 points on MMLU and loses 5 on GPQA-Diamond should not be reported as "+0 average". + +**Operational consequence**: ship the metric *vector*, not a scalar. Aggregation is a downstream policy choice (per-product), not an eval-system choice. + +**Source**: multi-criteria decision theory; Pareto-optimality literature. For ML-specific failure mode, see how "average benchmark score" leaderboards (early Open LLM Leaderboard v1) were gamed by overfitting to easy benchmarks while regressing on hard ones — addressed in OpenLLM Leaderboard v2 by reporting per-benchmark scores with CIs. + +### A4 — Cross-entropy loss → KL divergence is the most-information-per-FLOP signal + +Models are trained to minimize cross-entropy `H(p, q) = H(p) + KL(p || q)`. The optimization-induced delta between baseline and modified model is therefore most directly captured by `KL(p_baseline || p_optimized)` on the logit distribution over a held-out corpus. + +KL is sensitive to distributional changes that argmax-based metrics (accuracy, exact-match) provably miss: + +- Quantization that preserves top-1 token but shifts distribution → silently breaks temperature-sampling, RLHF-trained reward shape, agentic exploration. +- Pruning that keeps multiple-choice accuracy but flattens entropy → kills creative generation. +- Speculative decoding labeled "exact" that nonetheless shifts draft-model distribution. + +**Sample efficiency**: detecting a given distributional change via KL on logits typically needs ~10× fewer samples than detecting equivalent task-accuracy regression, because KL uses the full probability vector instead of a single argmax. + +**Operational consequence**: KL on a held-out distribution-matched corpus is the cheapest high-signal quality metric. Always run it. + +**Source**: information theory (Cover & Thomas, *Elements of Information Theory*). For empirical demonstration on quantization: Dettmers et al. 2022 (LLM.int8(), NeurIPS) and Lin et al. 2023 (AWQ, MLSys 2024) both show KL/PPL detect regressions before downstream task accuracy does. + +### A5 — Throughput and latency are adversarial (Little's Law) + +By Little's Law (Little 1961, *Operations Research*): `L = λ × W`, where L = mean concurrency, λ = throughput, W = mean response time. For a system at fixed L (saturated), increasing λ requires proportionally increasing W. Throughput and latency are *not* independent metrics — optimizing one degrades the other near saturation. + +**Operational consequence**: never report throughput without the corresponding latency distribution. The user-relevant single number is **goodput@SLO** = throughput × P(latency ≤ SLO), defined formally in DistServe (Zhong et al. 2024, OSDI). + +### A6 — User experience is dominated by the tail, not the mean + +Mean latency hides a 1%-of-users-see-10×-latency failure mode. For real-time control (VLA), p999 latency violations cause control-loop failures — *safety-critical*. Median is no better; both are central-tendency estimators. + +**Operational consequence**: serving evals report p50, p95, p99, p999 *and* jitter (IQR or std-dev). Means are decorative. + +**Source**: queuing theory (Kleinrock); production SRE practice (Beyer et al., *Site Reliability Engineering*, O'Reilly 2016). For VLA: RT-2 (Brohan et al. 2023), OpenVLA (Kim et al. 2024), π0 (Black et al. 2024) all report jitter + p999 because robotics requires bounded worst-case. + +### A7 — Benchmarks decay (saturation + contamination) + +Two mechanisms make benchmarks lose discriminating power over time: + +- **Saturation**: when top models score >90%, all signal is in the top-quintile noise band. MMLU is saturated for frontier models. HumanEval is saturated. +- **Contamination**: training corpora include benchmark text; gains reflect memorization, not capability. Heavily contaminated: HumanEval, MBPP, GSM8K (all on the public web pre-2023 cutoffs). + +**Operational consequence**: benchmark choice has a half-life. The 2026 frontier replacements are listed in §6; expect to revisit annually. + +**Source**: contamination evidence — Brown et al. 2020 (GPT-3, original n-gram contamination analysis); Sainz et al. 2023 ("NLP Evaluation in trouble", EMNLP); Schaeffer et al. 2023. Saturation evidence — every modern leaderboard. + +### A8 — The agent is itself an artifact-under-eval + +The system's *deliverable* is the agent's recommendation, not the underlying model. Capability composition is non-monotonic: strong MMLU + strong HumanEval ≠ strong agentic tool-use. (BFCL-v3 vs MMLU correlation across open models is r ≈ 0.4–0.6 — they measure different things.) + +**Operational consequence**: a separate eval surface for the agent's *decisions* (diagnosis, tool routing, recommendation quality, loop efficiency) is mandatory. Model-only eval is necessary but not sufficient. + +**Source**: τ-bench (Yao et al. 2024, Sierra), BFCL-v3 (Patil et al. 2024, Berkeley), GAIA (Mialon et al. 2023, Meta), MLE-bench (OpenAI 2024), SWE-bench (Jimenez et al. 2024, ICLR) — the entire agentic-eval literature exists because model evals don't predict agent performance. + +### A9 — Reproducibility is binary, not gradient + +Without seed control, dependency pinning, and hardware fingerprinting, an eval is not reproducible — it cannot be re-run to defend a claim. cuBLAS version, CUDA version, GPU SKU (even within "H100"), driver version, and kernel nondeterminism shift PPL by >0.5% in published case studies. + +**Operational consequence**: every eval run produces an *envelope* (seed, dep hashes, GPU fingerprint, data hashes, env vars). Runs without envelopes are advisory only; they cannot gate. + +**Source**: NeurIPS Reproducibility Checklist (mandatory since 2019); MLPerf benchmark rules (specifies compiler flags); Pineau et al. 2021 ("Improving reproducibility in machine learning research", JMLR). + +### A10 — The eval system itself must be evaluated (eval-of-eval) + +Without measuring eval quality (test-retest correlation, false-positive rate on null changes, false-negative rate on planted regressions), you cannot trust the gates. An eval system that has never failed has either never been stressed or is silently biased. + +**Operational consequence**: every release of the eval system runs against (a) a *null fixture* (identical model A/B comparison — should never gate-fail) and (b) *planted-regression fixtures* (known bad models — should always gate-fail). Track FPR/FNR over time. + +**Source**: classical software-testing principle (Beizer, *Software Testing Techniques*). For ML-specific application, see how leaderboard noise was diagnosed via test-retest in Dehghani et al. 2021 ("The benchmark lottery"). + +--- + +## 3. The 5-tier eval architecture + +### 3.1 Why tiers (re-derivation) + +Three forces: + +1. **Cost asymmetry**: full eval (all benchmarks × all seeds × full corpora × real-traffic shadow) costs $1k–$10k per change. Cheap eval (smoke test) costs $0.10. Doing the expensive one on every change is infeasible. +2. **Signal asymmetry**: different failure modes have different *base rates* and different *detection costs*. Catastrophic breakage is rare but cheap to detect; subtle distribution shift is common but expensive to detect. +3. **Decision asymmetry**: different decisions have different FN/FP cost ratios. A merge gate can be permissive (cheap to revert); a production deploy gate must be strict (expensive to revert). + +Tiering aligns the three: spend each marginal dollar where it most reduces P(undetected regression reaches users). + +### 3.2 Tier definitions + +Each tier specifies: **what failure mode it catches**, **gate decision it informs**, **cost budget**, **cadence**, **acceptance criterion**. + +#### T0 — Smoke (catch catastrophic breakage) + +- **Catches**: model fails to load, tokenizer mismatch, output shape wrong, NaN/Inf, runaway generation length, immediate OOM. +- **Gate**: pre-merge to feature branch (every commit touching optimization code). +- **Cost budget**: <$0.10 per run, <2 min wall-clock. +- **Cadence**: every CI run. +- **Method**: 50 fixed prompts → check output exists, no NaN, output length < 4× input, KL(baseline‖optimized) < 10 (wide). +- **Acceptance**: 100% pass. +- **Statistical rigor required**: none — these are deterministic correctness checks. + +#### T1 — Calibrated quality (catch distributional shift) + +- **Catches**: silent quality regression — KL divergence shift, perplexity regression, downstream-task accuracy degradation outside CI. +- **Gate**: pre-merge to main branch. +- **Cost budget**: $5–$50 per run, 10–60 min wall-clock. +- **Cadence**: every PR to main; every nightly on main. +- **Method**: KL-on-logits (held-out, 5k tokens) + 2–4 capability benchmarks at sample sizes that meet pre-registered MDE; bootstrap CIs; Holm-Bonferroni correction. +- **Acceptance**: see §4.6 (the conjunctive verdict rule). +- **Statistical rigor**: full — CIs, MDE pre-registered, multiple-comparisons corrected. + +#### T2 — Serving (catch production-physics regression) + +- **Catches**: tail latency regression, OOM under load, throughput collapse near saturation, scheduling pathology, KV-cache churn. +- **Gate**: pre-merge if optimization touches the serving path (kernel selection, batching, scheduler, KV cache); otherwise advisory. +- **Cost budget**: $20–$100 per run, 30–90 min wall-clock. +- **Cadence**: nightly on main; on-demand for serving-path PRs. +- **Method**: load test against fixed traffic shape (Poisson arrivals, length distribution matched to production sample) on real serving framework (vLLM/SGLang/TRT-LLM); measure goodput@SLO, p50/p95/p99/p999 TTFT/TBT/ITL, GPU utilization, OOM count. +- **Acceptance**: goodput@SLO not worse than baseline by more than the lower 95% CI; no p999 violation worse than 1.5× baseline; zero OOM. +- **Statistical rigor**: full — bootstrap CIs on percentiles (which are non-Gaussian). + +#### T3 — Shadow (catch bench-vs-prod distribution shift) + +- **Catches**: optimizations that look good on bench but degrade on real traffic (length distribution mismatch, prompt-style mismatch, multi-turn-context behavior). +- **Gate**: pre-deploy to canary. +- **Cost budget**: $100–$500 per run, hours-to-1-day wall-clock. +- **Cadence**: per release-candidate; weekly on main. +- **Method**: replay a captured slice of real production traffic (anonymized, sampled) through baseline and optimized side-by-side; compare KL on outputs, paired comparisons via McNemar test on agreement, latency distributions on the real shape. +- **Acceptance**: paired KL within pre-registered band; latency distributions within tier-2-style envelope on real (not synthetic) traffic shape. +- **Statistical rigor**: full — paired tests (McNemar for binary outcomes, paired bootstrap for continuous). + +#### T4 — Canary (catch what only real users surface) + +- **Catches**: cosmetic regressions, alignment edge cases, long-tail user complaint patterns, unforeseen interactions with downstream consumers. +- **Gate**: pre-full-rollout. +- **Cost budget**: $200–$1000 + user-exposure risk; hours-to-days wall-clock. +- **Cadence**: per release. +- **Method**: route 1–5% of real traffic to optimized variant; monitor SLO compliance, error rate, user-feedback signals (thumbs-down, regenerate-rate, session-abandon-rate); pre-registered guardrail metrics with auto-rollback thresholds. +- **Acceptance**: no guardrail violation over a pre-registered observation window (typically ≥ 24h or ≥ N requests for power). +- **Statistical rigor**: sequential testing (e.g., always-valid p-values via Howard et al. 2021 "Time-uniform Chernoff bounds") to enable safe early stopping. + +### 3.3 Meta layers (cross-tier) + +These are not tiers — they apply *across* tiers. + +- **M1 — Reproducibility envelope** (§8): every result tagged with seeds, hashes, hardware fingerprint. +- **M2 — Eval-of-eval** (§10.4): null fixtures and planted-regression fixtures run continuously; FPR/FNR tracked. +- **M3 — Cost telemetry**: every run reports compute cost; total eval spend monitored to prevent runaway. + +### 3.4 Tier summary table + +| Tier | Catches | Gate | Cost/run | Cadence | Stats rigor | +|------|---------|------|----------|---------|-------------| +| T0 | Catastrophic breakage | Pre-merge to branch | <$0.10 | Every commit | None (deterministic) | +| T1 | Distributional/quality shift | Pre-merge to main | $5–$50 | Every PR + nightly | Full (CIs + MDE + Holm) | +| T2 | Serving-physics regression | Pre-merge if serving-path | $20–$100 | Nightly + on-demand | Full (percentile CIs) | +| T3 | Bench-vs-prod shift | Pre-canary | $100–$500 | Per RC + weekly | Full (paired tests) | +| T4 | Real-user impact | Pre-full-rollout | $200–$1000 + risk | Per release | Sequential testing | + +--- + +## 4. Statistical framework + +### 4.1 Why bootstrap, not parametric + +ML metrics are routinely non-Gaussian: + +- **Perplexity**: roughly log-normal, heavy upper tail. +- **Accuracy**: bounded in [0, 1], Beta-distributed near edges. +- **Latency**: heavy-tailed (often log-normal or Pareto in serving systems). + +Parametric (t-distribution) CIs assume normality and produce miscalibrated coverage on these distributions. Bootstrap (Efron 1979, *Annals of Statistics*) is distribution-free and gives correct coverage at typical sample sizes. + +**Implementation**: percentile bootstrap for n ≥ 30; BCa (bias-corrected and accelerated, Efron 1987) for n ≥ 100 when the metric is potentially biased (e.g., percentile estimators). + +### 4.2 Why Holm-Bonferroni for multiple comparisons + +If you test k metrics each at α = 0.05 independently, the family-wise error rate is `1 − (1 − 0.05)^k`: + +| k | FWER | Probability of ≥1 false positive | +|---|------|----------------------------------| +| 1 | 0.05 | 5% | +| 5 | 0.226 | 23% | +| 10 | 0.401 | 40% | +| 20 | 0.642 | 64% | + +ML evals routinely involve 10–20 metric comparisons (PPL, KL, MMLU, GPQA, MATH, HumanEval, BFCL, plus latencies). Without correction, "significant" results are mostly noise. + +**Holm-Bonferroni** (Holm 1979, *Scandinavian Journal of Statistics*) controls FWER at α while being uniformly more powerful than vanilla Bonferroni — it strictly dominates. Use it. + +**Algorithm**: sort p-values ascending: p₁ ≤ p₂ ≤ … ≤ pₖ. Reject Hᵢ if pᵢ ≤ α / (k − i + 1). Stop at first non-rejection. + +**When to use Benjamini-Hochberg (FDR) instead**: discovery contexts (you're scanning for hypotheses, willing to accept some false positives in exchange for power). Not appropriate for go/no-go gates — use Holm. + +### 4.3 MDE pre-registration + +Before running any eval, compute and publish the MDE the experiment can detect at α = 0.05, power = 0.8. + +**Worked examples** (these are illustrative — exact numbers depend on benchmark variance, which should be measured from baseline runs): + +- **Wikitext-2 perplexity, n = 100 documents**: MDE ≈ 0.11 PPL units. Any claim of "0.05 PPL improvement" is unfalsifiable — *do not gate on it*. +- **Full MMLU, n ≈ 14,000 questions**: MDE ≈ 0.36% accuracy. +- **MMLU 1k subsample**: MDE ≈ 1.4% accuracy. Sub-1.4% claims invalid. +- **GPQA-Diamond, n = 198 questions**: MDE ≈ 5–7% (small benchmark, high variance). Not appropriate for fine-grained discrimination — use as a coarse capability filter only. + +**Operational rule**: every benchmark in T1+ ships with a pre-computed MDE table. If a claimed effect is below MDE, the gate refuses to consider it (auto-flagged as "below detection threshold"). + +**How to estimate σ**: run baseline 5–10 times (seed-varied) and take sample std-dev. This becomes the σ in the MDE formula (§A2). + +### 4.4 Paired vs unpaired tests + +When comparing baseline vs optimized on the *same* questions/prompts, use paired tests — they cancel question-difficulty variance and need ~2× fewer samples than unpaired for the same MDE. + +- **Continuous metrics** (KL, PPL, latency on same prompt): paired bootstrap on differences. +- **Binary metrics** (correct/incorrect on same question): McNemar's test (McNemar 1947) on the discordant-pair count. + +Most eval comparisons are paired by construction — exploit this. + +### 4.5 Sequential testing for canary + +Canary monitoring is fundamentally a sequential decision: you observe data over time and want to stop early if a problem appears. Naive repeated p-value testing inflates false-positive rate (the "peeking problem"). + +Use **always-valid p-values** (Howard et al. 2021, "Time-uniform Chernoff bounds for the mean of bounded variables") or mixture sequential probability ratio tests (mSPRT, Kohavi et al. industry literature). Both allow continuous monitoring without α inflation. + +### 4.6 The verdict rule (conjunctive ACCEPT) + +A T1+ eval ACCEPTs an optimization iff **all** of the following hold: + +1. **Quality gate**: no quality metric regresses outside its CI by more than the pre-registered tolerance, with Holm-Bonferroni applied across all quality metrics. +2. **Performance gate**: the target performance metric (e.g., goodput@SLO, training MFU) improves by at least the pre-registered MDE, with CI lower bound > 0. +3. **No-surprise gate**: no non-target metric regresses by more than its pre-registered guardrail (e.g., latency p99 doesn't double when target was throughput). +4. **Reproducibility gate**: the run produced a complete envelope (§8); seed-replication on a second run confirms the result is within CI. + +The conjunction is intentional — any single-axis ACCEPT is a known failure mode (you'd ship throughput-up / quality-down). RIPPLE this rule through every gating decision. + +### 4.7 Effect size, not just significance + +Report Cohen's *d* (for continuous) or odds ratios (for binary) alongside p-values. A statistically significant 0.1% gain on a 100k-sample benchmark is operationally meaningless. Pre-register a *minimum effect size of interest* (MEI) per metric — claims below MEI are reported as "detected but operationally negligible". + +**Source**: APA Publication Manual (effect-size reporting required since 2010); for ML, Card et al. 2020 "With Little Power Comes Great Responsibility" (EMNLP). + +--- + +## 5. Metric taxonomy + +Three metric classes; each has a distinct role and statistical handling. + +### 5.1 Quality metrics (model output fidelity) + +**Primary** (always run in T1): + +- **KL divergence** on logits, baseline vs optimized, on a held-out distribution-matched corpus. Most-information-per-FLOP signal (§A4). + - Sample size: 5k–50k tokens depending on tier. + - Aggregation: token-mean KL + token-max KL (catches localized blow-ups). + - Why both: mean KL can hide a few catastrophic tokens; max KL alone is noisy. +- **Perplexity** on held-out corpus matched to model's training distribution. Cheap, well-understood, comparable across literature. + +**Capability** (run subset matched to optimization_target): + +- See §6 for the full benchmark table. +- Always include at least one *contamination-resistant* benchmark (GPQA-Diamond, LiveCodeBench, MMLU-Pro). + +**Distributional shape** (cheap, run always): + +- **Output entropy** distribution: did the optimization flatten/sharpen the output distribution? +- **Top-k token agreement rate** with baseline. +- **Length distribution** of generations: did the optimization induce length drift? + +### 5.2 Performance metrics (compute/serving) + +**Training**: + +- **MFU** (Model FLOPs Utilization) — ratio of achieved to theoretical peak FLOPs. Reference: Chowdhery et al. 2022 (PaLM paper) for the canonical definition; PaLM achieved 46.2% MFU as a benchmark figure. +- **Tokens/sec/GPU**. +- **Memory peak** (must include activation memory under chosen recompute strategy). +- **Time-to-target-loss** (when comparing training-time optimizations). + +**Inference / serving** (DistServe taxonomy, Zhong et al. 2024): + +- **TTFT** (Time-To-First-Token) — prefill latency. Critical for chat UX. +- **TBT** (Time-Between-Tokens) — inter-token decode latency. +- **ITL** (Inter-Token Latency) — alias for TBT in some literature. +- **E2E latency** — full request completion time. +- **Throughput** — tokens/sec aggregate. +- **Goodput@SLO** = throughput × P(latency ≤ SLO) — the user-relevant single number (§A5). + +**For all latency metrics**: report the *distribution* (p50, p95, p99, p999) plus jitter (IQR or std-dev). Means are decorative (§A6). + +**VLA / real-time** (additional): + +- **Control-loop frequency** (Hz) achieved. +- **p999 jitter** within control window. +- **Worst-case action latency** — for safety-critical, this matters more than any percentile. + +### 5.3 Resource metrics + +- **GPU memory peak** (including activation, KV cache, optimizer state). +- **GPU utilization** (sustained, not peak). +- **Power draw** (W) — increasingly relevant for cost/ton-CO₂ accounting. +- **Cost per 1M tokens** at the measured load — the deployment-relevant economic metric. + +### 5.4 Safety / alignment metrics + +These are *never* trade-offable against performance — they are gates, not optimization targets. + +- **HarmBench** (Mazeika et al. 2024) — refusal robustness. +- **StrongREJECT** (Souly et al. 2024) — jailbreak resistance. +- **XSTest** (Röttger et al. 2024) — over-refusal detection (false positive on benign). +- **JailbreakBench** (Chao et al. 2024) — adversarial prompt suite. + +**Operational rule**: any optimization that degrades safety metrics outside CI fails the gate, regardless of performance gains. No exceptions. + +--- + +## 6. Benchmark suite + +### 6.1 Why these benchmarks (2026 frontier) + +Benchmark choice has a half-life. The 2024–2025 wave of benchmarks was specifically designed to address saturation + contamination on 2026-class models. Using MMLU + HumanEval in 2026 is like using ImageNet in 2020 — the leaderboard is flat. + +Selection criteria: +1. **Discriminating power on current frontier models** (top model not above 90%). +2. **Contamination resistance** (recent, held-out, or by-construction novel). +3. **Construct validity** (measures what its name claims). +4. **Open and stable** (won't disappear or change between runs). + +### 6.2 Benchmark table + +| Benchmark | Domain | n | Why this one (2026) | Source | +|-----------|--------|---|---------------------|--------| +| **MMLU-Pro** | Knowledge + reasoning | ~12k | 10-option (vs 4 in MMLU), harder distractors, less contaminated. Discriminates 2026 frontier. | Wang et al. 2024, TIGER-Lab | +| **GPQA-Diamond** | PhD-level science | 198 | "Google-proof" — designed contamination-resistant. Current gold for hard reasoning. Small n → coarse signal only. | Rein et al. 2023 | +| **MATH-500** | Math reasoning | 500 | Subset of MATH (Hendrycks et al. 2021) with stable difficulty. AIME-2024+ is the held-out variant. | Lightman et al. 2023 (PRM800K) | +| **AIME 2024+** | Competition math | 30/yr | Yearly held-out by construction. Contamination-immune for current year. | Math Olympiad | +| **HumanEval+ / MBPP+** | Code | extended | EvalPlus (Liu et al. 2023) adds 80×+ test cases — exposes brittle code that passes original tests. | Liu et al. 2023, NeurIPS | +| **LiveCodeBench** | Code | rolling | Problems indexed by date — use only problems newer than model's cutoff. Contamination by construction impossible. | Jain et al. 2024 | +| **RULER** | Long context | configurable | Only long-context benchmark whose difficulty scales with length. Plain NIAH is trivial for 2026 models. | Hsieh et al. 2024, NVIDIA | +| **BFCL-v3** | Function calling | ~2k | Multi-turn, parallel calls, irrelevance-detection. v1/v2 saturated. | Patil et al. 2024, Berkeley | +| **τ-bench** | Agent dialogue | 165 | Customer-service flows with policy adherence. Hardest agent benchmark — top models <60% pass^4. | Yao et al. 2024, Sierra | +| **SWE-bench-Verified** | Code agent | 500 | Human-verified subset of SWE-bench (full has noise). Real GitHub issues. | OpenAI 2024 (verified subset) | +| **HarmBench** | Safety (refusal) | 510 | Standardized adversarial behaviors. Use for any safety regression check. | Mazeika et al. 2024 | +| **StrongREJECT** | Safety (jailbreak) | 313 | Calibrated adversarial prompts; correlates with human harm judgment. | Souly et al. 2024 | +| **XSTest** | Over-refusal | 450 | Detects false positives — model refusing benign prompts. | Röttger et al. 2024 | + +### 6.3 Benchmark routing by optimization_target + +The agent's `optimization_target` determines which benchmarks T1 runs. Pre-registered mapping: + +| optimization_target | Required T1 benchmarks | +|---------------------|------------------------| +| `throughput` | KL + MMLU-Pro + GPQA-D + safety suite | +| `latency` | KL + MMLU-Pro + GPQA-D + safety suite | +| `memory` | KL + MMLU-Pro + safety suite | +| `quality` | KL + full quality matrix (MMLU-Pro, GPQA-D, MATH-500, HumanEval+, BFCL-v3) + safety | +| `multimodal` | KL + MMMU + MathVista + safety | +| `vla` | KL + simulator-rollout success rate + p999 jitter + safety | +| `agentic` | All quality + BFCL-v3 + τ-bench + SWE-bench-Verified + safety | + +Justification: minimize cost while ensuring the dimension being optimized is bounded by complementary axes. E.g., throughput-targeted optimizations rarely hurt math reasoning specifically — but they can shift distributional behavior, which KL + general-knowledge MMLU-Pro catches. + +### 6.4 Sample sizing rules + +- Use *full* benchmark when computationally feasible (cost < tier budget). +- If subsampling, the subsample size must support the pre-registered MDE for the claim being made. +- *Stratified* subsampling (preserve category proportions) only — uniform random subsampling adds variance unnecessarily. +- Subsample seed: pre-registered and fixed across baseline / optimized to enable paired comparison. + +### 6.5 Contamination defense + +- **For benchmarks with public test sets**: assume contamination; treat results as upper bounds. +- **For benchmarks with held-out test sets (LiveCodeBench, AIME)**: filter to problems published *after* the model's training cutoff. +- **Active monitoring**: periodically run contamination probes (Carlini et al. 2021, "Extracting Training Data from Large Language Models" methodology) on suspect benchmarks. + +--- + +## 7. Agent-level evaluation + +### 7.1 Why a separate eval surface + +Per axiom A8: the agent's deliverable is the *recommendation*, not the underlying model. Evaluating only the model leaves four agent-specific failure modes uncaught: + +1. **Mis-diagnosis**: agent identifies wrong bottleneck (e.g., recommends quantization when the actual bottleneck is KV-cache size). +2. **Tool-routing error**: agent picks a tool that doesn't apply to the optimization_target. +3. **Bad recommendation**: agent's chosen optimization is inferior to alternatives on the actual Pareto frontier. +4. **Loop pathology**: agent enters doom-loop, fails to terminate, burns token budget. + +### 7.2 Four agent eval surfaces + +#### S1 — Diagnosis accuracy + +- **Method**: gold-labeled scenario suite — synthetic profiles paired with expert-labeled root-cause bottleneck. +- **Metric**: top-1 and top-3 diagnosis accuracy with CIs. +- **Suite size**: ≥ 100 scenarios spanning (training/inference) × (compute/memory/IO/comm-bound) × (model-size buckets). +- **Adversarial subset**: ≥ 20 scenarios with red-herring signals (e.g., low GPU utilization that's actually due to data-loading, not compute). + +#### S2 — Tool routing + +- **Method**: given (profile, optimization_target), check that agent invokes tools in the pre-registered correct subset of TOOL_SUITES. +- **Metric**: precision (no inappropriate tools) + recall (all required tools invoked). +- **Failure modes to probe**: invoking quantization tools when target is `latency` and bottleneck is comm-bound (wrong); skipping kernel-selection when target is `throughput` and bottleneck is compute-bound (incomplete). + +#### S3 — Recommendation Pareto-quality + +- **Method**: for a held-out scenario set with known Pareto frontier (computed offline by exhaustive search over a small action space), check if agent's recommendation is on or near the frontier. +- **Metric**: distance from Pareto frontier in (quality, throughput, memory) space, normalized. +- **Why this matters**: an agent can be locally correct (each individual recommendation is fine) but globally suboptimal (it misses a much better Pareto point that requires combining techniques). + +#### S4 — Loop efficiency + +- **Method**: track per-task token spend, n_iterations, n_tool_calls, time-to-recommendation. +- **Metric**: distribution (median + p95) on a fixed task suite. Compare across agent versions. +- **Failure mode**: doom-loops, redundant tool invocations, premature termination. +- **Hard guardrail**: max token budget per task; max iterations; auto-abort. + +### 7.3 Adversarial agent scenarios + +A small set of intentionally-hard scenarios that probe known agent failure modes: + +- **Conflicting signals**: profile shows both compute-bound (high MFU) and memory-bound (high HBM utilization) signatures. Correct response: ask for clarification or run additional diagnostic — *not* default to one. +- **Unreliable tool output**: inject corrupted profiler output. Correct response: detect and re-run, not propagate. +- **Hardware spec absent**: profile lacks hardware info. Correct response: refuse to recommend hardware-specific optimizations, not hallucinate. +- **Out-of-distribution model**: model architecture not in HARDWARE_SPECS table. Correct response: degrade to general advice + flag for human review. + +Pass rate on the adversarial suite is a separate gate from S1–S4; it gates the agent release, not individual optimization decisions. + +### 7.4 Eval cadence for the agent + +- **S1, S2, S4**: every PR to agent code (`agent/optimization/**`). +- **S3**: nightly (more expensive — requires Pareto computation). +- **Adversarial suite**: pre-release. + +--- + +## 8. Reproducibility envelope + +### 8.1 What must be captured + +Every eval run produces an envelope with: + +``` +{ + "run_id": "uuid", + "timestamp": "ISO-8601 with TZ", + "git_commit_sha": "...", + "git_dirty": false, + "code_dependencies_hash": "sha256 of pinned requirements.txt", + "uv_lockfile_hash": "sha256 of uv.lock", + "python_version": "3.x.y", + "cuda_version": "12.x.y", + "cudnn_version": "...", + "driver_version": "...", + "gpu_model": "H100-SXM5-80GB", + "gpu_uuid": "...", + "n_gpus": 8, + "host_fingerprint": "sha256 of (cpu, mem, kernel, libc)", + "data_hashes": {"benchmark_name": "sha256"}, + "seeds": {"torch": 42, "numpy": 42, "python": 42, "cuda": 42}, + "deterministic_mode": true, + "env_vars": {"CUBLAS_WORKSPACE_CONFIG": "...", "TF32": "off"}, + "framework_versions": {"torch": "...", "transformers": "...", "vllm": "..."}, + "tier": "T1", + "optimization_target": "throughput", + "baseline_config": {...}, + "optimized_config": {...}, + "results": [...] +} +``` + +### 8.2 Determinism settings + +- `torch.use_deterministic_algorithms(True)` where supported. +- `CUBLAS_WORKSPACE_CONFIG=:4096:8` (required for cuBLAS determinism on CUDA ≥ 10.2). +- TF32 off for eval (tradeoff: deterministic but slower; eval should not be the bottleneck). +- For inherently nondeterministic kernels (e.g., scatter operations), document the residual variance and report it as part of σ in MDE. + +### 8.3 Replay protocol + +To verify a result, a third party must be able to: + +1. Check out the recorded git commit. +2. `uv sync` against the recorded lockfile hash. +3. Run on hardware matching the recorded fingerprint (or equivalent). +4. Reproduce results within the recorded CI. + +If steps 1–4 cannot be performed, the result is not reproducible — and per A9, not gate-eligible. + +### 8.4 Source of practice + +- NeurIPS Reproducibility Checklist (mandatory since 2019, see Pineau et al. 2021, JMLR). +- MLPerf benchmark rules (specifies compiler flags, kernels, batch sizes). +- Anthropic, OpenAI, DeepMind public release artifacts include env hashes for major releases. + +--- + +## 9. Implementation specification + +### 9.1 Directory layout (owned paths only) + +Per `CLAUDE.md` zero-diff invariant, all eval code lives in owned paths: + +``` +agent/ + eval/ # NEW — eval substrate + __init__.py + stat_utils.py # bootstrap CIs, MDE, Holm-Bonferroni + verdict.py # conjunctive ACCEPT rule + envelope.py # reproducibility envelope capture + tiers/ + __init__.py + t0_smoke.py + t1_quality.py + t2_serving.py + t3_shadow.py + t4_canary.py + metrics/ + __init__.py + kl_divergence.py + perplexity.py + goodput.py + latency.py + mfu.py + benchmarks/ + __init__.py + mmlu_pro.py + gpqa_diamond.py + math_500.py + humaneval_plus.py + bfcl_v3.py + tau_bench.py + ruler.py + harmbench.py + strongreject.py + xstest.py + agent_eval/ # §7 + __init__.py + diagnosis.py # S1 + tool_routing.py # S2 + pareto.py # S3 + loop_efficiency.py # S4 + adversarial.py + fixtures/ + null_change.py # eval-of-eval: null fixture + planted_regressions.py # eval-of-eval: known-bad fixtures + cli.py # `ml-intern-eval` entrypoint +configs/ + eval/ + tier_thresholds.yaml # MDE, MEI, gating thresholds per metric + benchmark_suites.yaml # routing by optimization_target + serving_load_profiles.yaml # T2 traffic shapes +tests/ + optimization/ + eval/ + test_stat_utils.py + test_verdict.py + test_envelope.py + test_tier_t0.py + test_tier_t1.py + ... +``` + +### 9.2 Key dataclasses (extend, don't modify) + +The existing `Experiment` dataclass in `agent/optimization/` is extended with eval-specific fields: + +```python +# agent/optimization/experiment.py (extend, do not modify upstream) +from dataclasses import dataclass, field +from datetime import datetime +from typing import Optional + +@dataclass +class MetricResult: + name: str + value: float + ci_low: float + ci_high: float + ci_method: str # "bootstrap_percentile" | "bootstrap_bca" | "mcnemar" + n_samples: int + seed: int + mde: float # Minimum Detectable Effect at α=0.05, power=0.8 + mei: float # Minimum Effect of Interest (pre-registered) + +@dataclass +class TierResult: + tier: str # "T0" | "T1" | "T2" | "T3" | "T4" + metrics: list[MetricResult] + p_values_raw: dict[str, float] + p_values_holm: dict[str, float] + verdict: str # "ACCEPT" | "REJECT" | "INCONCLUSIVE" + verdict_reasons: list[str] + cost_usd: float + wall_clock_s: float + +@dataclass +class ReproEnvelope: + run_id: str + timestamp: datetime + git_commit_sha: str + git_dirty: bool + deps_hash: str + python_version: str + cuda_version: str + cudnn_version: str + driver_version: str + gpu_model: str + gpu_uuids: list[str] + n_gpus: int + host_fingerprint: str + data_hashes: dict[str, str] + seeds: dict[str, int] + deterministic_mode: bool + env_vars: dict[str, str] + framework_versions: dict[str, str] + +@dataclass +class EvalRun: + envelope: ReproEnvelope + optimization_target: str + baseline_config: dict + optimized_config: dict + tier_results: list[TierResult] + overall_verdict: str + overall_reasons: list[str] +``` + +### 9.3 The verdict function (conjunctive ACCEPT) + +```python +# agent/eval/verdict.py +def compute_verdict(tier_result: TierResult, thresholds: dict) -> tuple[str, list[str]]: + """Conjunctive ACCEPT rule (§4.6). Returns (verdict, reasons).""" + reasons = [] + + # Gate 1: Quality — no quality metric regresses outside CI by more than tolerance. + for m in tier_result.metrics: + if m.name in thresholds["quality_metrics"]: + tol = thresholds["quality_tolerance"][m.name] + if m.ci_high < -tol: # CI lies entirely below tolerance band → real regression + reasons.append(f"REJECT: {m.name} regressed: CI={m.ci_low:.4f}..{m.ci_high:.4f} < -{tol}") + + # Gate 2: Performance — target metric improves by ≥ MDE, CI lower bound > 0. + target = thresholds["target_metric"] + target_m = next((m for m in tier_result.metrics if m.name == target), None) + if target_m is None: + reasons.append(f"REJECT: target metric {target} not measured") + elif target_m.ci_low <= 0: + reasons.append(f"REJECT: {target} improvement CI lower bound {target_m.ci_low:.4f} ≤ 0") + elif target_m.value < target_m.mde: + reasons.append(f"REJECT: {target} effect {target_m.value:.4f} < MDE {target_m.mde:.4f}") + + # Gate 3: No-surprise — guardrails on non-target metrics. + for m in tier_result.metrics: + if m.name in thresholds["guardrails"]: + limit = thresholds["guardrails"][m.name] + if m.ci_high < limit: + reasons.append(f"REJECT: guardrail violated: {m.name} CI_high={m.ci_high:.4f} < {limit}") + + # Gate 4: Reproducibility — handled at envelope-capture time, not here. + + if not reasons: + return "ACCEPT", ["All gates passed."] + return "REJECT", reasons +``` + +### 9.4 Configuration (pre-registered thresholds) + +```yaml +# configs/eval/tier_thresholds.yaml — pre-registered per metric +T1: + target_metric: "goodput_at_slo" # or as overridden by optimization_target + quality_metrics: + - kl_divergence_mean + - kl_divergence_max + - perplexity_wikitext2 + - mmlu_pro_accuracy + - gpqa_diamond_accuracy + quality_tolerance: # max acceptable regression per metric + kl_divergence_mean: 0.05 + kl_divergence_max: 0.5 + perplexity_wikitext2: 0.5 # PPL units + mmlu_pro_accuracy: 0.005 # 0.5% + gpqa_diamond_accuracy: 0.02 # 2% (tighter would be below MDE) + guardrails: + latency_p99_ms: 1.5 # ratio: optimized/baseline ≤ 1.5 + memory_peak_gb: 1.1 # ratio + alpha: 0.05 + power: 0.80 + multiple_comparisons: "holm" +``` + +### 9.5 CLI surface + +```bash +# Run a tier on a model pair +ml-intern-eval run --tier T1 \ + --baseline ./checkpoints/baseline \ + --optimized ./checkpoints/optimized \ + --target throughput \ + --output ./eval_runs/ + +# Inspect a result +ml-intern-eval show ./eval_runs// + +# Replay (verify reproducibility) +ml-intern-eval replay ./eval_runs// + +# Run agent-level eval +ml-intern-eval agent --suite diagnosis --output ./agent_evals/ +``` + +### 9.6 Integration with PLAN.md phases + +A new `Phase 8 — Eval substrate` should be added to PLAN.md, sequenced *before* Phase 4 (quantization) — because Phase 4 will use the eval substrate to validate every quantization claim. + +Phase 8 deliverables (suggested): + +| Step | Owned path | Acceptance | +|------|-----------|------------| +| 8.1 | `agent/eval/stat_utils.py` + tests | Bootstrap CIs, MDE, Holm. `pytest tests/optimization/eval/test_stat_utils.py` green. | +| 8.2 | `agent/eval/envelope.py` + tests | Captures full envelope; `replay` works. | +| 8.3 | `agent/eval/verdict.py` + tests | Conjunctive ACCEPT rule. Null-fixture passes; planted-regression fixtures all REJECT. | +| 8.4 | `agent/eval/tiers/t0_smoke.py` | <2 min on H100, deterministic. | +| 8.5 | `agent/eval/tiers/t1_quality.py` | KL + 2 benchmarks; CIs; verdict. | +| 8.6 | `agent/eval/tiers/t2_serving.py` | Goodput@SLO + percentile CIs against vLLM. | +| 8.7 | `agent/eval/agent_eval/diagnosis.py` | S1 ≥ 80% top-1 on scenario suite. | + +Existing `MC-1` (evaluate_model_quality with lm-eval-harness MMLU) should be **replaced** by T1 — current MC-1 reports a single number with no CIs, no MDE, single benchmark. It would not survive the verdict rule. + +--- + +## 10. Operational runbooks + +### 10.1 "The eval is failing — what now?" + +1. Check the verdict reasons (§9.3 produces them). The reason names which gate failed. +2. Inspect the metric CI: is the regression real or noise? +3. Check the envelope: was the run actually reproducible? Was the baseline drift-free? (Compare envelope to last known-good.) +4. Re-run the failing tier with a fresh seed. If verdict flips, you had insufficient seeds — increase n_seeds and re-run with paired analysis. +5. If the regression is real and reproducible: the optimization is bad. Revert. +6. If you suspect a false positive: check eval-of-eval FPR (§10.4). If it's spiking, the eval system itself may be drifting. + +### 10.2 "I want to add a new benchmark" + +1. Verify the benchmark meets §6.1 selection criteria (discriminating, contamination-resistant, valid construct). +2. Measure baseline σ across ≥ 5 seeds. Compute MDE. +3. Pre-register MDE, MEI, and tolerance in `configs/eval/tier_thresholds.yaml`. +4. Add to `agent/eval/benchmarks/.py`. +5. Add to `configs/eval/benchmark_suites.yaml` routing. +6. Run on null fixture (§10.4) — it must not gate-fail. +7. Run on planted-regression fixtures relevant to the benchmark — they must gate-fail. + +### 10.3 "I want to change a threshold" + +Pre-registered thresholds are *append-only with rationale*. Process: + +1. Open a PR that adds the new threshold *alongside* the old. +2. Run last 30 days of historical evals against both thresholds. Compare ACCEPT/REJECT decisions. +3. If decisions differ, justify the change with reasoning (new MDE measurement, new MEI from product, etc.). +4. Get review. +5. Merge with both thresholds active for one release; then remove the old. + +This prevents *threshold-shopping* — silently tightening a threshold to flip a verdict. + +### 10.4 Eval-of-eval (continuous quality assurance) + +Two fixture types run on every eval-system release: + +- **Null fixture**: identical model A vs identical model A. Eval system should ACCEPT (or report "no detectable effect"). If it REJECTs, the eval system has a false-positive bug. +- **Planted-regression fixtures**: model A vs deliberately-broken model A' (e.g., A' has a known 5% MMLU regression injected). Eval system should REJECT. If it ACCEPTs, the eval system has a false-negative bug. + +Track FPR (rate of incorrect REJECTs on null) and FNR (rate of incorrect ACCEPTs on planted) over time. If either drifts, freeze eval-system releases and investigate. + +**Source of practice**: chaos engineering / fault injection (Netflix Simian Army); mutation testing in software engineering (Jia & Harman 2011, IEEE TSE). + +### 10.5 Cost monitoring + +- Track total eval $-spend per week, per tier. +- Alert on >2× week-over-week increase (likely a runaway loop or accidentally-expensive benchmark added). +- Hard cap: per-PR eval cost ≤ $200; per-week total ≤ $5k. Above that, require human approval. + +--- + +## 11. Risks, anti-patterns, and known limitations + +### 11.1 Risks specific to this eval design + +| Risk | Mitigation | +|------|------------| +| Pre-registered thresholds become stale | §10.3 process; quarterly review of historical FPR/FNR. | +| Bench-vs-prod distribution drift | T3 catches it; refresh production traffic samples monthly. | +| Benchmark contamination grows over time | Annual review of benchmark choice (§6); active contamination probes. | +| Statistical framework misuse (e.g., p-hacking via metric selection) | Pre-registration of metric set per `optimization_target` is gate-enforced; new metrics require §10.2 process. | +| Reproducibility envelope capture incomplete (missing some env var) | Eval-of-eval planted-regression fixtures should catch envelope-induced variance. | +| Cost runaway | §10.5 monitoring + hard caps. | +| Eval system itself becomes a bottleneck (slows dev velocity) | T0/T1 cost budgets are tight; long evals run async; verdict cached. | +| Agent-eval suite becomes overfit to | Adversarial subset rotation; periodic suite refresh from real failure modes. | + +### 11.2 Anti-patterns explicitly forbidden + +- **Single-number quality verdicts**. Per A3. +- **Throughput without latency distribution**. Per A5. +- **Mean latency as primary**. Per A6. +- **Point estimates without CIs in T1+**. Per A1. +- **Effect claims below MDE**. Per A2. +- **Metric collection-without-pre-registration** (testing what you find significant — p-hacking). Per §4.7. +- **Skipping safety gates "just for this experiment"**. Per §5.4. +- **Modifying tier thresholds mid-experiment**. Per §10.3. +- **Eval results without envelopes used for gating**. Per A9. + +### 11.3 Known limitations of this design + +- **LLM-as-judge is used only as tiebreaker, not primary**. This is conservative; some eval systems use judges as primary for creative tasks. We don't, because of documented bias modes (positional, length, self-preference) — see Zheng et al. 2023 (NeurIPS). When ground truth is absent, we accept eval-incompleteness rather than introduce judge bias. +- **The agent eval suite (S1–S4) requires a hand-labeled scenario set** — initial labeling is expensive (~50–100 expert hours). Without it, agent decisions are not gate-able; with it, the suite must be maintained. +- **Sequential testing in T4 (canary)** requires careful implementation — naive repeated p-value testing inflates FPR. Use Howard et al. 2021 always-valid p-values or established A/B testing framework (e.g., Eppo, Optimizely). +- **Reproducibility on cloud-shared infrastructure is imperfect** — even with full envelope, neighboring tenants can introduce performance variance. T2 should run on dedicated hardware where possible. +- **No causal inference framework** — we measure correlation between optimization and metrics; we do not formally identify causation. For optimization changes this is acceptable (the intervention is direct) but for any inference about *why* an optimization works, additional investigation is needed. + +### 11.4 What this design refuses to do + +- Composite quality scores → ship the vector (A3). +- "We'll add stats later" → retrofitting CIs requires re-running every baseline. Build it in from commit 1. +- LLM-as-judge as primary capability gate → bias modes too well-documented (Zheng et al. 2023). +- Optimization-target trades against safety → safety is a gate, never a parameter. + +--- + +## 12. References + +Cited inline above; consolidated here for verification. All references are real and publicly available; URLs given where stable. + +### Statistical foundations +- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Lawrence Erlbaum. +- Efron, B. (1979). "Bootstrap methods: another look at the jackknife." *Annals of Statistics* 7(1):1–26. +- Efron, B. (1987). "Better bootstrap confidence intervals." *JASA* 82(397):171–185. +- Holm, S. (1979). "A simple sequentially rejective multiple test procedure." *Scandinavian Journal of Statistics* 6(2):65–70. +- Howard, S.R. et al. (2021). "Time-uniform Chernoff bounds via nonnegative supermartingales." *Probability Surveys*. +- Little, J.D.C. (1961). "A proof for the queuing formula L = λW." *Operations Research* 9(3):383–387. +- McNemar, Q. (1947). "Note on the sampling error of the difference between correlated proportions or percentages." *Psychometrika* 12(2):153–157. + +### ML reproducibility and statistical rigor +- Card, D., Henderson, P. et al. (2020). "With Little Power Comes Great Responsibility." *EMNLP 2020*. +- Dehghani, M. et al. (2021). "The Benchmark Lottery." arXiv:2107.07002. +- Henderson, P. et al. (2018). "Deep Reinforcement Learning that Matters." *AAAI 2018*. +- Pineau, J. et al. (2021). "Improving Reproducibility in Machine Learning Research." *JMLR* 22. +- Schaeffer, R., Miranda, B., Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" *NeurIPS 2023* (best paper). + +### Information theory and quantization +- Cover, T.M., Thomas, J.A. (2006). *Elements of Information Theory* (2nd ed.). Wiley. +- Dettmers, T. et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." *NeurIPS 2022*. +- Frantar, E. et al. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv:2210.17323. +- Lin, J. et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." *MLSys 2024*. +- Xiao, G. et al. (2023). "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." *ICML 2023*. + +### Serving systems and goodput +- Beyer, B. et al. (2016). *Site Reliability Engineering*. O'Reilly. +- Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *SOSP 2023* (vLLM). +- Williams, S., Waterman, A., Patterson, D. (2009). "Roofline: An Insightful Visual Performance Model for Multicore Architectures." *Communications of the ACM* 52(4):65–76. +- Zhong, Y. et al. (2024). "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving." *OSDI 2024*. + +### Training-time metrics +- Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311 (MFU definition). + +### 2026 frontier benchmarks +- Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." *CoRL 2023*. +- Black, K. et al. (2024). "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv:2410.24164. +- Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models." *NeurIPS 2024 D&B*. +- Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding." *ICLR 2021* (MMLU). +- Hendrycks, D. et al. (2021). "Measuring Mathematical Problem Solving With the MATH Dataset." *NeurIPS 2021 D&B*. +- Hsieh, C-P. et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv:2404.06654 (NVIDIA). +- Jain, N. et al. (2024). "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code." arXiv:2403.07974. +- Jimenez, C.E. et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" *ICLR 2024*. +- Kim, M.J. et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246. +- Liang, P. et al. (2022). "Holistic Evaluation of Language Models." arXiv:2211.09110 (HELM). +- Lightman, H. et al. (2023). "Let's Verify Step by Step." arXiv:2305.20050 (PRM800K / MATH-500 split). +- Liu, J. et al. (2023). "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation." *NeurIPS 2023* (HumanEval+/MBPP+ via EvalPlus). +- Mazeika, M. et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." *ICML 2024*. +- Mialon, G. et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983. +- Patil, S.G. et al. (2024). "Berkeley Function Calling Leaderboard (BFCL)." (project; v3 2024). +- Rein, D. et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022. +- Röttger, P. et al. (2024). "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models." *NAACL 2024*. +- Sainz, O. et al. (2023). "NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark." *EMNLP 2023 Findings*. +- Souly, A. et al. (2024). "A StrongREJECT for Empty Jailbreaks." arXiv:2402.10260. +- Srivastava, A. et al. (2022). "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models." (BIG-bench). +- Wang, Y. et al. (2024). "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv:2406.01574 (TIGER-Lab). +- Yao, S. et al. (2024). "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv:2406.12045 (Sierra). +- Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." *NeurIPS 2023 D&B*. + +### Contamination +- Brown, T.B. et al. (2020). "Language Models are Few-Shot Learners." *NeurIPS 2020* (GPT-3, original n-gram contamination analysis). +- Carlini, N. et al. (2021). "Extracting Training Data from Large Language Models." *USENIX Security 2021*. + +### Software testing / mutation testing (basis for eval-of-eval) +- Beizer, B. (1990). *Software Testing Techniques* (2nd ed.). Van Nostrand Reinhold. +- Jia, Y., Harman, M. (2011). "An Analysis and Survey of the Development of Mutation Testing." *IEEE TSE* 37(5):649–678. + +### Standards and practice +- MLPerf Inference Benchmark Rules. MLCommons. https://mlcommons.org/ +- NeurIPS Reproducibility Checklist. https://neurips.cc/ + +--- + +## Appendix A — How to read a verdict + +Sample verdict output from `ml-intern-eval show `: + +``` +Run: 7f3a2b1c-... Tier: T1 Target: throughput +Envelope: ✓ complete (commit a3f9..., H100-SXM5×8, deterministic) + +Metric Value CI(95%) MDE MEI Verdict +───────────────────────────────────────────────────────────────────────────────── +goodput_at_slo (target) +18.4% [+15.2%, +21.6%] ±2.0% +5% ✓ pass +kl_divergence_mean +0.012 [+0.008, +0.017] ±0.003 ±0.05 ✓ within tol +kl_divergence_max +0.31 [+0.21, +0.42] ±0.05 ±0.5 ✓ within tol +perplexity_wikitext2 +0.08 [-0.04, +0.20] ±0.11 ±0.5 ⊘ below MDE (advisory) +mmlu_pro_accuracy -0.003 [-0.007, +0.001] ±0.002 ±0.005 ✓ within tol +gpqa_diamond_accuracy -0.015 [-0.040, +0.010] ±0.025 ±0.02 ⊘ below MDE +latency_p99 (guardrail) ×1.12 [×1.08, ×1.16] ≤×1.5 ✓ guardrail ok +memory_peak (guardrail) ×0.94 [×0.93, ×0.95] ≤×1.1 ✓ guardrail ok + +Holm-Bonferroni (k=5 quality metrics, α=0.05): all rejections held after correction. + +VERDICT: ACCEPT +Reasons: All gates passed. Target metric (goodput_at_slo) improvement +18.4% +[CI excludes 0; > MDE]; no quality metric regressed outside CI tolerance; +all guardrails within bounds. Below-MDE results reported as advisory only. +``` + +How to read it: +- `Value` is the point estimate; `CI(95%)` is the bootstrap interval; if CI crosses 0 (or 1× for ratios), the effect is not distinguishable from noise. +- `MDE` is the smallest effect this experiment could detect — values inside `[-MDE, +MDE]` are reported as "below detection threshold" regardless of point estimate. +- `MEI` is the smallest effect that would matter operationally. +- `⊘ below MDE` means the result is advisory only — it cannot inform the gate decision. +- `VERDICT: ACCEPT` requires all four gates (§4.6) to pass conjunctively. + +--- + +## Appendix B — Glossary + +- **CI**: Confidence Interval. A range that contains the true parameter value with stated frequentist probability (95% standard). +- **FPR / FNR**: False Positive Rate / False Negative Rate. +- **FWER**: Family-Wise Error Rate — probability of ≥1 false positive across a family of tests. +- **Goodput@SLO**: throughput × probability(latency ≤ SLO). The user-relevant deployment metric. +- **ITL**: Inter-Token Latency. Often used synonymously with TBT. +- **KL divergence**: Kullback-Leibler divergence. Information-theoretic distance between two probability distributions. +- **MDE**: Minimum Detectable Effect. Smallest effect detectable at given α, power, n, σ. +- **MEI**: Minimum Effect of Interest. Smallest effect with operational meaning (pre-registered, product-defined). +- **MFU**: Model FLOPs Utilization. Achieved FLOPs / theoretical peak. +- **MT-bench, Chatbot Arena**: benchmarks for instruction-following / preference comparison; cited here only re: judge bias (Zheng et al. 2023), not used as primary metrics in this design. +- **p99, p999**: 99th, 99.9th percentile latency. +- **PPL**: Perplexity. exp(cross-entropy loss). Lower is better. +- **TBT**: Time Between Tokens. Decode-phase per-token latency. +- **TTFT**: Time To First Token. Prefill-phase latency. +- **Tier**: a level in the eval stack (T0–T4), defined by cost, cadence, and gate decision. +- **VLA**: Vision-Language-Action model. Robotics policy with vision + language inputs. + +--- + +*End of EVAL_SPEC.md.* diff --git a/PLAN.md b/PLAN.md new file mode 100644 index 00000000..2b8c3ec4 --- /dev/null +++ b/PLAN.md @@ -0,0 +1,2942 @@ +# Frontier AI Optimization Agent — Implementation Plan + +> **Grounded in:** actual codebase at `/Users/danghuyhoang/Desktop/ml-intern` +> **Strategy:** Build on top of ML Intern infrastructure. Fork, don't modify. +> **Target:** A specialized agent that profiles, diagnoses, and optimizes training + inference for LLMs, multimodal models, and VLAs. + +--- + +## Architecture Decision Record + +**Decision:** Build on ML Intern infrastructure, not from scratch. + +**Rationale (verified from codebase):** + +| Infrastructure Component | Lines | Why Keep | +|---|---|---| +| `agent/core/agent_loop.py` | 1,626 | Contains 6+ hard edge cases: abandoned approvals, thinking signature healing, malformed JSON recovery, stream cut-off handling | +| `agent/core/session.py` | 487 | Atomic write, detached subprocess upload, heartbeat saves already solved | +| `agent/context_manager/manager.py` | 415 | Dangling tool call patching, compaction threshold math, system prompt reuse | +| `backend/session_manager.py` | 608 | `asyncio.to_thread` for blocking init, EventBroadcaster fan-out, sandbox cleanup retry | +| `agent/tools/jobs_tool.py` | 1,198 | Log streaming resilience, UV log filtering, GPU flavor specs, timeout enforcement | +| `agent/tools/sandbox_tool.py` | 477 | Orphan cleanup, Trackio injection, hardware tier selection | + +**What we throw away:** `agent/prompts/system_prompt_v3.yaml` (replaced entirely), `configs/cli_agent_config.json` (new config), general-purpose tool descriptions. + +**What we add:** 4 new tool suites, 1 new optimization context system, 1 new system prompt, 1 new config. + +--- + +## Cross-Cutting Rules + +These rules apply across every phase. Each one prevents a class of false-positive results that would otherwise survive into the agent's recommendations. + +### Rule 1 — Two-Level Benchmarking *(mandatory for every optimization tool)* + +Every tool that claims a speedup MUST report **both** numbers: + +1. **Component speedup** — optimized op vs. baseline op, in isolation +2. **End-to-end speedup** — optimized pipeline vs. baseline pipeline, on a realistic workload + +**Why:** A custom RMSNorm kernel measured at 1.88× faster in isolation yields only 1.06× end-to-end speedup if RMSNorm is 5% of the pipeline (Amdahl's law). Reporting only the isolated number is malpractice — it claims a win the user does not see. + +Tool return contract for everything in `training_opt/`, `inference_opt/`, `multimodal_opt/`, `vla_opt/`, `kernel_gen/`: + +```json +{ + "component_speedup": 1.88, + "end_to_end_speedup": 1.06, + "component_fraction_of_pipeline": 0.05, + "amdahl_predicted_e2e": 1.05, + "deviation_from_amdahl_pct": 1.0 +} +``` + +If `deviation_from_amdahl_pct > 10`, the tool flags the result for investigation — the gap signals measurement error, contention, or a confounding optimization, and the experiment is recorded with `verdict="investigating"`, not `"keep"`. + +### Rule 2 — Measured Peak Over Vendor Peak + +Every roofline calculation uses **measured** peak throughput from `measure_peak_throughput` (Step 1.3), not the static `HARDWARE_SPECS` table. Vendor specs are upper bounds; thermal throttling, MIG partitioning, and power capping routinely deliver 60–90% of nameplate. Treating vendor numbers as truth produces "MFU = 28%, severe bottleneck" diagnoses on hardware that is already at its real ceiling. + +The static table remains as the documented theoretical ceiling and as the fallback when measurement fails or is explicitly skipped (`use_measured_peak=false` for fast iteration). + +### Rule 3 — One Optimization Per Experiment + +Each `Experiment` recorded in `OptimizationContext` changes exactly one variable from the baseline. Stacking FP8 quantization + speculative decoding + sequence packing in a single run makes the speedup unattributable: if quality regresses, which one caused it? If throughput improves less than expected, which one underperformed? The system prompt enforces this; tool handlers reject configs that combine techniques unless an explicit `combine_with` argument is passed by the user. + +--- + +## Repository Structure (Target State) + +```text +ml-optimization-agent/ ← fork of ml-intern +│ +├── agent/ +│ ├── core/ ← KEEP AS-IS (zero modifications) +│ │ ├── agent_loop.py +│ │ ├── session.py +│ │ ├── doom_loop.py +│ │ ├── prompt_caching.py +│ │ ├── telemetry.py +│ │ ├── redact.py +│ │ ├── llm_params.py +│ │ ├── model_switcher.py +│ │ └── effort_probe.py +│ │ +│ ├── context_manager/ ← ONE addition: persistent_state field +│ │ └── manager.py +│ │ +│ ├── messaging/ ← KEEP AS-IS +│ │ +│ ├── prompts/ +│ │ ├── system_prompt_v3.yaml ← KEEP (for reference) +│ │ └── system_prompt_optimization_v1.yaml ← NEW (Phase 1) +│ │ +│ ├── tools/ +│ │ ├── (existing tools) ← KEEP AS-IS (jobs, sandbox, research, docs...) +│ │ │ +│ │ ├── profiling/ ← NEW (Phase 2) +│ │ │ ├── __init__.py +│ │ │ ├── training_mfu.py +│ │ │ ├── inference_latency.py +│ │ │ ├── memory_timeline.py +│ │ │ ├── measured_peak.py ← Step 1.3 (used by all profilers) +│ │ │ └── nsight_profile.py ← Step 2.3 (kernel-level metrics) +│ │ │ +│ │ ├── training_opt/ ← NEW (Phase 3) +│ │ │ ├── __init__.py +│ │ │ ├── parallelism_tuner.py +│ │ │ ├── sequence_packing.py +│ │ │ ├── flash_attention.py +│ │ │ └── liger_kernels.py +│ │ │ +│ │ ├── inference_opt/ ← NEW (Phase 4) +│ │ │ ├── __init__.py +│ │ │ ├── quantization.py +│ │ │ ├── vllm_deployer.py +│ │ │ ├── sglang_deployer.py +│ │ │ ├── speculative_decoding.py +│ │ │ └── serving_benchmark.py +│ │ │ +│ │ ├── multimodal_opt/ ← NEW (Phase 5) +│ │ │ ├── __init__.py +│ │ │ └── visual_token_compressor.py +│ │ │ +│ │ ├── vla_opt/ ← NEW (Phase 5) +│ │ │ ├── __init__.py +│ │ │ ├── action_latency_profiler.py +│ │ │ └── fast_slow_splitter.py +│ │ │ +│ │ └── kernel_gen/ ← NEW (Phase 7, gated) +│ │ ├── __init__.py +│ │ ├── generate_kernel.py +│ │ ├── publish_kernel.py +│ │ └── orchestrator.py +│ │ +│ ├── skills/ ← NEW (Phase 7 + optional MC-4 migration) +│ │ └── cuda-kernels/ +│ │ ├── SKILL.md +│ │ ├── scripts/ +│ │ └── references/ ← per-arch + per-framework knowledge files +│ │ +│ ├── optimization/ ← NEW (Phase 6) +│ │ ├── __init__.py +│ │ ├── context.py ← OptimizationContext, Experiment +│ │ ├── roofline.py ← Roofline calculations +│ │ ├── pareto.py ← Multi-objective Pareto analysis +│ │ └── bottleneck.py ← Bottleneck classifier +│ │ +│ └── config.py ← EXTEND: add optimization fields +│ +├── backend/ ← KEEP AS-IS +├── frontend/ ← KEEP AS-IS (minor: add profiling viz) +│ +├── configs/ +│ ├── cli_agent_config.json ← KEEP (reference) +│ ├── frontend_agent_config.json ← KEEP (reference) +│ └── optimization_agent_config.json ← NEW (Phase 1) +│ +└── tests/ + ├── unit/ ← KEEP existing tests + └── optimization/ ← NEW tests per phase +``` + +--- + +## Timeline Overview + +```text +Week 1–2 Phase 0: Repository setup + environment verification +Week 3–4 Phase 1: Knowledge foundation (system prompt + hardware specs + measured peak) +Week 5–7 Phase 2: Profiling suite (MFU + inference + memory + Nsight kernel-level) +Week 8–10 Phase 3: Training optimization tools +Week 11–13 Phase 4: Inference optimization tools +Week 14–15 Phase 5: Multimodal + VLA tools +Week 16–17 Phase 6: Optimization state machine +Week 18–20 Phase 7: Custom CUDA kernel generation ← gated; activated only if Phase 4 plateaus +Week 21–23 Phase 8: Scored ML-optimization benchmark suite (AHE Stage C — rate-limiter) +Week 24–26 Phase 9: Trajectory observability + manifest verification (AHE Stages E + F) +Week 27–30 Phase 10: Evolve Agent + Algorithm 1 orchestration (AHE Stages G + H) +Week 31 Phase 11: Cross-model transfer evaluation (AHE Stage I) +``` + +**Phases 8–11 implement the AHE meta-stack** (Lin et al., arXiv:2604.25850v2). See `RESEARCH_AHE_ANALYSIS.md` for the full architectural rationale. AHE Stages A (7-slot decomposition) and D (manifest discipline) are cross-cutting — A starts in Phase 0, D starts in Phase 1 and persists through all phases. + +--- + +## Phase 0: Repository Setup & Baseline Verification +**Duration:** 1–2 weeks +**Goal:** Working fork with verified infrastructure. All existing tests pass. New config loaded correctly. + +--- + +### Step 0.1 — Fork the Repository + +**Action:** +```bash +# Option A: GitHub fork +gh repo fork huggingface/ml-intern --clone --remote +mv ml-intern ml-optimization-agent +cd ml-optimization-agent + +# Option B: Local copy +cp -r /Users/danghuyhoang/Desktop/ml-intern /Users/danghuyhoang/Desktop/ml-optimization-agent +cd /Users/danghuyhoang/Desktop/ml-optimization-agent +git remote set-url origin +``` + +**Verify:** +```bash +uv sync +uv run pytest tests/unit/ -x -q +# Expected: all existing tests pass +``` + +**Acceptance Criteria:** `pytest tests/unit/` exits 0. All 20 unit tests pass. + +--- + +### Step 0.2 — Add Optimization Fields to Config + +**File to modify:** `agent/config.py` + +**Exact change** — add after `reasoning_effort` field in the `Config` class: + +```python +# --- Optimization agent fields --- +# Target modality for optimization. Drives system prompt selection +# and tool availability. None = general mode (backward compatible). +optimization_target: str | None = None +# Valid: "training" | "inference" | "multimodal" | "vla" | None + +# Hardware the model will run on. Used for roofline analysis. +# Keys match hardware_constants in system_prompt_optimization_v1.yaml. +target_hardware: str | None = None +# Valid: "h100_sxm" | "a100_sxm" | "a100_pcie" | "l40s" | "mi300x" | None + +# Quality budget for optimization (0.0–1.0, where 1.0 = no degradation allowed). +# Agent uses this for multi-objective trade-off decisions. +quality_budget: float = 0.98 +# Example: 0.98 = accept up to 2% quality drop for speed/memory gains + +# Enable iterative optimization loop (profile → fix → profile → compare). +optimization_loop_enabled: bool = True +``` + +**Verify:** +```python +# tests/optimization/test_config_optimization.py +from agent.config import load_config + +def test_optimization_fields_default(tmp_path): + cfg_path = tmp_path / "cfg.json" + cfg_path.write_text('{"model_name": "moonshotai/Kimi-K2.6"}') + cfg = load_config(str(cfg_path)) + assert cfg.optimization_target is None + assert cfg.target_hardware is None + assert cfg.quality_budget == 0.98 + assert cfg.optimization_loop_enabled is True + +def test_optimization_fields_set(tmp_path): + cfg_path = tmp_path / "cfg.json" + cfg_path.write_text('''{ + "model_name": "moonshotai/Kimi-K2.6", + "optimization_target": "inference", + "target_hardware": "h100_sxm", + "quality_budget": 0.95 + }''') + cfg = load_config(str(cfg_path)) + assert cfg.optimization_target == "inference" + assert cfg.target_hardware == "h100_sxm" + assert cfg.quality_budget == 0.95 +``` + +**Acceptance Criteria:** Test passes. Existing `test_config.py` still passes. + +--- + +### Step 0.3 — Create Optimization Agent Config + +**File to create:** `configs/optimization_agent_config.json` + +```json +{ + "model_name": "bedrock/us.anthropic.claude-opus-4-6-v1", + "optimization_target": null, + "target_hardware": null, + "quality_budget": 0.98, + "optimization_loop_enabled": true, + "save_sessions": true, + "session_dataset_repo": "smolagents/ml-optimization-sessions", + "yolo_mode": false, + "confirm_cpu_jobs": true, + "auto_file_upload": true, + "reasoning_effort": "max", + "mcpServers": { + "hf-mcp-server": { + "transport": "http", + "url": "https://huggingface.co/mcp?login" + } + } +} +``` + +**Verify:** +```bash +python -c " +from agent.config import load_config +cfg = load_config('configs/optimization_agent_config.json') +print('quality_budget:', cfg.quality_budget) +print('optimization_loop_enabled:', cfg.optimization_loop_enabled) +" +# Expected: prints 0.98 and True +``` + +--- + +### Step 0.4 — Add `persistent_state` to ContextManager + +This is the **only modification** needed in infrastructure. It allows `OptimizationContext` to survive context compaction. + +**File to modify:** `agent/context_manager/manager.py` + +**Locate the `__init__` method** (line ~136) and add one field: + +```python +class ContextManager: + def __init__( + self, + model_max_tokens: int, + compact_size: float = 0.1, + untouched_messages: int = 5, + tool_specs: list = None, + hf_token: str | None = None, + local_mode: bool = False, + prompt_file_suffix: str = "system_prompt_v3.yaml", + ): + # ... existing code unchanged ... + self.items: list[Message] = [Message(role="system", content=self.system_prompt)] + + # NEW: Structured state that survives compaction. + # Populated by optimization/context.py. Serialized back into + # context as a user message after each compact() call. + self.persistent_state: dict = {} +``` + +**Locate the `compact()` method** (line ~350) and add state re-injection after the existing compaction logic: + +```python +async def compact(self, ...): + # ... existing compaction code unchanged ... + + # NEW: Re-inject persistent_state after compaction so the LLM + # always has access to experiment history, baseline metrics, etc. + # This must happen AFTER compaction rewrites self.items. + # + # CRITICAL: Do NOT insert a new Message(role="user") here. + # Anthropic API enforces strict user/assistant alternation. + # After compaction, self.items is [system, user_summary, assistant, ...]. + # Inserting a new user message at index 1 produces [system, user, user, ...] + # which causes API error: "roles must alternate between user and assistant". + # Instead, APPEND the state to the first existing user message's content. + if self.persistent_state: + import json + state_suffix = ( + "\n\n[OPTIMIZATION_STATE — persisted across compaction]\n" + + json.dumps(self.persistent_state, indent=2) + ) + first_user_idx = next( + (i for i, m in enumerate(self.items) if m.role == "user"), None + ) + if first_user_idx is not None: + # Append to existing user message — alternation invariant preserved + if isinstance(self.items[first_user_idx].content, str): + self.items[first_user_idx].content += state_suffix + else: + # Content is a list of blocks (multimodal) — append text block + self.items[first_user_idx].content.append( + {"type": "text", "text": state_suffix} + ) + else: + # No user message yet — safe to append (session is in initial state) + self.items.append(Message(role="user", content=state_suffix.strip())) +``` + +**Verify:** +```python +# tests/optimization/test_persistent_state.py +import asyncio +from unittest.mock import AsyncMock, patch +from agent.context_manager.manager import ContextManager + +def test_persistent_state_survives_compaction(): + cm = ContextManager(model_max_tokens=10_000) + cm.persistent_state = {"experiment_count": 3, "best_mfu": 0.52} + + # Force compaction threshold + cm.running_context_usage = 9500 + + # Mock summarize_messages to return a simple string + async def run(): + with patch("agent.context_manager.manager.summarize_messages", + new=AsyncMock(return_value=("Summary text", 100))): + await cm.compact(model_name="mock", hf_token=None) + + # Check state was re-injected + contents = [m.content for m in cm.items if hasattr(m, 'content')] + state_injected = any( + "OPTIMIZATION_STATE" in str(c) for c in contents + ) + assert state_injected, "persistent_state not found after compaction" + assert '"experiment_count": 3' in str(cm.items) + + asyncio.run(run()) +``` + +**Acceptance Criteria:** Test passes. Existing `test_dangling_tool_calls.py` still passes. + +--- + +## Phase 1: Knowledge Foundation +**Duration:** 2 weeks +**Goal:** Agent reasons correctly about optimization before any tools exist. System prompt drives roofline-first thinking. + +--- + +### Step 1.1 — Write the Optimization System Prompt + +**File to create:** `agent/prompts/system_prompt_optimization_v1.yaml` + +The prompt is a Jinja2 template (same format as `system_prompt_v3.yaml`). + +```yaml +system_prompt: | + You are a Frontier AI Optimization Engineer. You have {{ num_tools }} tools + for profiling, diagnosing, and optimizing training and inference for LLMs, + multimodal models, and Vision-Language-Action (VLA) models. + + Your job is not to train models from scratch. Your job is to make existing + models and training pipelines measurably faster, cheaper, or more memory-efficient — + while preserving quality within a defined budget. + + # The Optimization Mandate (Non-Negotiable) + + **NEVER suggest an optimization technique without profiling data first.** + + The only acceptable workflow is: + 0. HARDWARE: Call lookup_hardware_specs(hardware=target_hardware) first. + Never use hardware constants from memory — the tool is the authoritative + source and may include GPUs added after your training cutoff. + 1. MEASURE: Run profile_training_mfu() or profile_inference_latency() to establish baseline + 2. IDENTIFY: Classify the bottleneck (compute / memory-bandwidth / communication / I/O) + 3. REASON: Apply the Roofline Model to understand the theoretical ceiling + 4. SELECT: Choose the technique that directly addresses the identified bottleneck + 5. IMPLEMENT: Apply the technique via the appropriate tool + 6. VERIFY: Re-profile. Quantify the delta (throughput, latency, memory, quality) + 7. DECIDE: Keep if within quality_budget, revert and try next hypothesis otherwise + + Saying "I think it might be memory-bound" and then recommending quantization + without profiling is malpractice. Always measure first. + + # The Roofline Model + + Every computation is either compute-bound or memory-bandwidth-bound. + The ridge point separates them. + + ``` + Achievable Performance (FLOPS/s) + │ ╱ Compute ceiling + │ ╱ + │ ╱ + ──────────────────── Memory BW ceiling + │ + └──────────────────→ Arithmetic Intensity (FLOPS/byte) + ``` + + **To classify any operation:** + ``` + arithmetic_intensity = total_flops / total_bytes_moved + + if arithmetic_intensity < hardware_ridge_point: + bottleneck = "memory_bandwidth" + solutions = ["quantization", "KV compression", "attention fusion", + "reduce activation size", "flash attention"] + else: + bottleneck = "compute" + solutions = ["tensor core utilization", "larger batch size", + "kernel fusion", "mixed precision", "better GEMM tiling"] + ``` + + # Hardware Constants + + **Always call `lookup_hardware_specs(hardware=target_hardware)` before any profiling.** + The tool is the single source of truth. Do NOT hard-code specs from memory — new GPUs + are added to the tool's lookup table without updating this prompt. + + Key interpretation rules (apply after calling the tool): + - Autoregressive decode: ~2 FLOPS/byte → always memory-bandwidth-bound on every GPU + - Dense Transformer matmuls at bf16: ~128–512 FLOPS/byte → near or above ridge point + - Attention with FlashAttention: fused and HBM-optimal → effectively compute-bound + - For NVLink presence: check `nvlink_bandwidth_gbs > 0` before recommending tensor parallel + + # MFU Interpretation + + MFU (Model FLOP Utilization) = achieved_tflops / hardware_peak_tflops + + ``` + MFU < 15% → Severe bottleneck. Likely: data starvation, optimizer overhead, + communication not overlapped, or single-GPU when multi needed. + MFU 15-35% → Significant room. Likely: no flash attention, large activation memory, + suboptimal parallelism, or missing fused kernels. + MFU 35-55% → Typical well-tuned single-node training. Normal range. + MFU 55-65% → Excellent. Requires: FlashAttention, FSDP2 with overlap, + sequence packing, torch.compile. + MFU > 65% → World-class. Megatron-LM or torchtitan level tuning. + ``` + + # Bottleneck Taxonomy + + Before selecting any technique, classify the bottleneck: + + **Training Bottlenecks:** + - compute_bound: MFU < ridge point, forward/backward dominate profile + - memory_bound: Activation memory exploding, gradient checkpointing needed + - communication_bound: AllReduce time > 30% of step time + - io_bound: GPU idle waiting for data (DataLoader bottleneck) + - optimizer_bound: optimizer.step() > 20% of step time (large models, Adam states) + + **Inference Bottlenecks:** + - kv_cache_memory: KV cache fills GPU, limits batch size + - decode_bandwidth: Each token reads entire model weights (autoregressive) + - prefill_compute: Long prompt, compute-bound on attention + - cpu_overhead: Token sampling, Python overhead between GPU calls + + # Multi-Objective Trade-off Framework + + Every recommendation MUST include a trade-off table: + + ~~~ + Technique | Throughput Δ | Quality Δ | Memory Δ | Complexity + ─────────────────┼──────────────┼────────────┼──────────┼─────────── + FP8 quantization | +1.9x | -0.3% MMLU | -49% | Low + GPTQ INT4 | +2.1x | -2.8% MMLU | -58% | Medium + AWQ INT4 | +2.0x | -1.4% MMLU | -58% | Medium + Spec. decoding | +3.2x | 0% | +15% | High + ~~~ + + Let the user choose the Pareto-optimal point for their constraints. + Never choose for them without asking. + + # Parallelism Selection (Distributed Training) + + Use this decision tree for multi-GPU training: + + ~~~ + Model fits on 1 GPU? + Yes → Data Parallel (DDP or FSDP2 with no sharding) + No → Does model fit across 1 node (8 GPUs × 80GB)? + Yes → Tensor Parallel (TP=8, same node, NVLink required) + OR FSDP2 ZeRO-3 (simpler, slightly slower) + No → Pipeline Parallel (PP) across nodes + + Tensor Parallel within nodes + + FSDP2 for optimizer states + + For sequence length > 32k tokens: + Add Context Parallel (CP) = ring attention across GPUs + ~~~ + + # Optimization by Model Architecture + + **Dense LLM (Llama, Mistral, Qwen):** + Training: FSDP2 + FlashAttention + sequence packing + torch.compile + Inference: vLLM + PagedAttention + FP8/AWQ + speculative decoding + + **MoE (Mixtral, DeepSeek-V3):** + Training: Expert Parallel (EP) + careful load balancing loss tuning + Inference: Expert routing cache + expert parallelism per GPU + Note: MoE inference has lower arithmetic intensity → more memory-bandwidth-bound + + **Multimodal (LLaVA, Qwen-VL, InternVL):** + Training: Freeze vision encoder early, visual token compression, mixed packing + Inference: Cache visual prefix KV, dynamic resolution batching + + **VLA (π0, OpenVLA):** + HARD CONSTRAINT: action inference must be < 50ms for manipulation + Use CUDA graphs + static shapes. No dynamic batching in hot path. + Fast/slow split: LLM for planning (can be slow), MLP for reactive control (must be fast) + Do NOT use speculative decoding for real-time control (timing variance) + + # Common Mistakes to Avoid + + DO NOT recommend technique without profiling: "I think flash attention will help" is wrong. + DO NOT change model architecture to fix an optimization problem without user approval. + DO NOT apply multiple optimizations simultaneously (can't attribute which helped). + DO NOT compare results across different hardware. + DO NOT mistake training throughput for inference throughput (they optimize differently). + DO NOT assume quantization degrades quality without measuring — FP8 is often lossless. + + # Research Integration + + When you don't know the current state of an optimization technique: + 1. Use search_mlsys_papers() to find the most recent papers + 2. Use github_find_examples() to find working implementations + 3. Use fetch_hf_docs() for HF library-specific APIs + + Optimization papers move fast. Your internal knowledge of specific numbers + (benchmark results, speedup claims) may be outdated. Always ground claims in + a specific paper or measurement. +``` + +**Verify prompt loads correctly:** + +```python +# tests/optimization/test_system_prompt.py +from pathlib import Path +import yaml +from jinja2 import Template + +def test_optimization_prompt_valid_yaml(): + path = Path("agent/prompts/system_prompt_optimization_v1.yaml") + assert path.exists() + data = yaml.safe_load(path.read_text()) + assert "system_prompt" in data + template_str = data["system_prompt"] + # Should render without error when num_tools is provided + rendered = Template(template_str).render(num_tools=20) + assert "Roofline" in rendered + assert "MFU" in rendered + assert "h100_sxm" in rendered + assert len(rendered) > 3000 +``` + +**Wire prompt into ContextManager:** + +In `agent/context_manager/manager.py`, find `_load_system_prompt()`. The method takes `prompt_file_suffix`. The new config needs to pass `"system_prompt_optimization_v1.yaml"` when optimization mode is active. This is done via the config path passed to `Session.__init__`. + +**Acceptance Criteria:** Prompt file exists. YAML is valid. Template renders. `num_tools` variable resolves. All hardware constants present. + +--- + +### Step 1.2 — Hardware Specs Lookup Tool + +**File to create:** `agent/tools/hardware_specs.py` + +```python +""" +Hardware specifications for roofline analysis. +No network calls — pure lookup table from vendor specs. +""" +from agent.tools.types import ToolSpec + +HARDWARE_SPECS: dict[str, dict] = { + "h100_sxm": { + "peak_bf16_tflops": 989, + "peak_fp8_tflops": 1979, + "memory_bandwidth_gbs": 3350, + "ridge_point_bf16": 295, + "nvlink_bandwidth_gbs": 900, + "hbm_capacity_gb": 80, + "sm_count": 132, + "chip": "Hopper GH100", + "interconnect": "NVLink 4.0 (900 GB/s bidirectional)", + }, + "a100_sxm": { + "peak_bf16_tflops": 312, + "peak_fp8_tflops": None, # No FP8 native support + "memory_bandwidth_gbs": 2000, + "ridge_point_bf16": 156, + "nvlink_bandwidth_gbs": 600, + "hbm_capacity_gb": 80, + "sm_count": 108, + "chip": "Ampere GA100", + "interconnect": "NVLink 3.0 (600 GB/s bidirectional)", + }, + "a100_pcie": { + "peak_bf16_tflops": 250, + "peak_fp8_tflops": None, + "memory_bandwidth_gbs": 1935, + "ridge_point_bf16": 129, + "nvlink_bandwidth_gbs": 0, + "hbm_capacity_gb": 80, + "sm_count": 108, + "chip": "Ampere GA100 (PCIe)", + "interconnect": "PCIe 4.0 (64 GB/s)", + "note": "No NVLink — tensor parallelism across nodes not recommended", + }, + "l40s": { + "peak_bf16_tflops": 362, + "peak_fp8_tflops": 724, + "memory_bandwidth_gbs": 864, + "ridge_point_bf16": 419, + "nvlink_bandwidth_gbs": 0, + "hbm_capacity_gb": 48, + "sm_count": 142, + "chip": "Ada Lovelace AD102", + "interconnect": "PCIe 4.0 (64 GB/s)", + "note": "High ridge point = more operations are compute-bound vs A100", + }, + "mi300x": { + "peak_bf16_tflops": 1307, + "peak_fp8_tflops": 2614, + "memory_bandwidth_gbs": 5300, + "ridge_point_bf16": 247, + "nvlink_bandwidth_gbs": 0, + "hbm_capacity_gb": 192, + "sm_count": 304, + "chip": "AMD CDNA3", + "interconnect": "AMD Infinity Fabric", + "note": "192GB HBM enables large models without tensor parallelism", + }, + "t4": { + "peak_bf16_tflops": 65, + "peak_fp8_tflops": None, + "memory_bandwidth_gbs": 320, + "ridge_point_bf16": 203, + "nvlink_bandwidth_gbs": 0, + "hbm_capacity_gb": 16, + "sm_count": 40, + "chip": "Turing TU104", + "note": "Small model inference only. Not suitable for training > 1B params", + }, +} + +# HF flavor → hardware mapping (verified from jobs_tool.py) +HF_FLAVOR_TO_HARDWARE = { + "t4-small": "t4", + "t4-medium": "t4", + "a10g-small": None, # Not in table — similar to A100 PCIe at lower scale + "a10g-large": None, + "a10g-largex2": None, + "a10g-largex4": None, + "a100-large": "a100_sxm", + "a100x4": "a100_sxm", + "a100x8": "a100_sxm", + "l40sx1": "l40s", + "l40sx4": "l40s", + "l40sx8": "l40s", +} + + +async def hardware_specs_handler(args: dict) -> tuple[str, bool]: + import json + + hardware = args.get("hardware") + + if hardware == "list": + return json.dumps(list(HARDWARE_SPECS.keys()), indent=2), True + + if hardware not in HARDWARE_SPECS: + # Try HF flavor lookup + mapped = HF_FLAVOR_TO_HARDWARE.get(hardware) + if mapped and mapped in HARDWARE_SPECS: + hardware = mapped + else: + return ( + f"Unknown hardware: '{hardware}'. " + f"Valid options: {list(HARDWARE_SPECS.keys())} " + f"or HF flavors: {list(HF_FLAVOR_TO_HARDWARE.keys())}", + False, + ) + + specs = HARDWARE_SPECS[hardware] + result = { + "hardware": hardware, + "specs": specs, + "roofline_guidance": { + "memory_bound_threshold_flops_per_byte": specs["ridge_point_bf16"], + "interpretation": ( + f"Operations with arithmetic intensity < {specs['ridge_point_bf16']} FLOPS/byte " + f"are memory-bandwidth-bound on {hardware}. " + f"Autoregressive LLM decode (~2 FLOPS/byte) is always memory-bound here." + ), + }, + } + return json.dumps(result, indent=2), True + + +HARDWARE_SPECS_TOOL_SPEC = ToolSpec( + name="lookup_hardware_specs", + description=( + "Look up hardware specifications for roofline analysis. " + "Returns peak TFLOPS, memory bandwidth, ridge point, NVLink bandwidth, and HBM capacity. " + "Use this BEFORE running profile_training_mfu to interpret results. " + "Pass hardware='list' to see all available options." + ), + parameters={ + "type": "object", + "properties": { + "hardware": { + "type": "string", + "description": ( + "Hardware identifier. Options: h100_sxm, a100_sxm, a100_pcie, l40s, mi300x, t4. " + "Also accepts HF job flavors: a100-large, l40sx8, etc. " + "Pass 'list' to enumerate all options." + ), + } + }, + "required": ["hardware"], + }, + handler=hardware_specs_handler, +) +``` + +**Wire into ToolRouter** in `agent/core/tools.py`: + +Find the import block and add: +```python +from agent.tools.hardware_specs import HARDWARE_SPECS_TOOL_SPEC, hardware_specs_handler +``` + +Find `create_builtin_tools()` and add: +```python +HARDWARE_SPECS_TOOL_SPEC, +``` + +**Verify:** +```python +# tests/optimization/test_hardware_specs.py +import asyncio +from agent.tools.hardware_specs import hardware_specs_handler + +def test_h100_lookup(): + result, ok = asyncio.run(hardware_specs_handler({"hardware": "h100_sxm"})) + import json + data = json.loads(result) + assert ok + assert data["specs"]["peak_bf16_tflops"] == 989 + assert data["specs"]["ridge_point_bf16"] == 295 + +def test_hf_flavor_lookup(): + result, ok = asyncio.run(hardware_specs_handler({"hardware": "a100-large"})) + import json + data = json.loads(result) + assert ok + assert data["hardware"] == "a100_sxm" + +def test_unknown_hardware(): + result, ok = asyncio.run(hardware_specs_handler({"hardware": "rtx4090"})) + assert not ok + assert "Unknown hardware" in result +``` + +**Acceptance Criteria:** All 3 tests pass. Tool registered in ToolRouter. `lookup_hardware_specs` callable from agent. + +--- + +### Step 1.3 — Measured Peak Throughput Tool + +The static `HARDWARE_SPECS` table from Step 1.2 reports **theoretical** peak TFLOPS and HBM bandwidth from vendor datasheets. Real silicon delivers less: + +- Thermal throttling under sustained load: −5 to −20% +- Power cap (e.g., 350W H100 SXM5 vs. nameplate 700W): −40 to −50% +- MIG partition: 1/2 or 1/7 of full SM count +- Defective or older die: silent variance up to 10% + +A roofline diagnosis built on theoretical peaks falsely flags well-tuned code as "underperforming." This tool measures actual peak so the agent reasons against ground truth (Cross-Cutting Rule 2). + +**File to create:** `agent/tools/profiling/measured_peak.py` + +```python +""" +Measure achievable peak HBM bandwidth and dense bf16/fp8 TFLOPS on the live GPU. +Single-GPU only — distributed peaks come from a separate communication benchmark. +Inspired by cfregly/ai-performance-engineering's benchmark_peak.py — measure first, +trust vendor specs second. +""" +from agent.tools.types import ToolSpec + +_PEAK_BENCH_SCRIPT = ''' +import torch, time, json + +# 1) HBM bandwidth: large device-to-device copy, measure GB/s. +N = 1 << 28 # 256M float32 = 1 GiB +a = torch.empty(N, dtype=torch.float32, device="cuda") +b = torch.empty(N, dtype=torch.float32, device="cuda") +torch.cuda.synchronize() +for _ in range(3): b.copy_(a) # warmup +torch.cuda.synchronize() +t0 = time.perf_counter() +ITERS = 50 +for _ in range(ITERS): b.copy_(a) +torch.cuda.synchronize() +elapsed = time.perf_counter() - t0 +# Each copy reads N*4 bytes and writes N*4 bytes +measured_bw_gbs = (2 * N * 4 * ITERS) / elapsed / 1e9 + +# 2) Dense bf16 GEMM peak: large square matmul, derive TFLOPS. +M = 8192 +A = torch.randn(M, M, dtype=torch.bfloat16, device="cuda") +B = torch.randn(M, M, dtype=torch.bfloat16, device="cuda") +torch.cuda.synchronize() +for _ in range(3): C = A @ B # warmup +torch.cuda.synchronize() +t0 = time.perf_counter() +ITERS_GEMM = 20 +for _ in range(ITERS_GEMM): C = A @ B +torch.cuda.synchronize() +elapsed = time.perf_counter() - t0 +measured_bf16_tflops = (2 * M**3 * ITERS_GEMM) / elapsed / 1e12 + +# 3) Optional FP8 GEMM for Hopper+ / MI300X. +measured_fp8_tflops = None +try: + if torch.cuda.get_device_capability()[0] >= 9: + Af = torch.randn(M, M, device="cuda").to(torch.float8_e4m3fn) + Bf = torch.randn(M, M, device="cuda").to(torch.float8_e4m3fn).t().contiguous().t() + scale = torch.tensor(1.0, device="cuda") + torch.cuda.synchronize() + t0 = time.perf_counter() + for _ in range(ITERS_GEMM): + torch._scaled_mm(Af, Bf, scale_a=scale, scale_b=scale, out_dtype=torch.bfloat16) + torch.cuda.synchronize() + elapsed = time.perf_counter() - t0 + measured_fp8_tflops = (2 * M**3 * ITERS_GEMM) / elapsed / 1e12 +except Exception: + pass + +result = { + "device_name": torch.cuda.get_device_name(0), + "measured_hbm_bandwidth_gbs": round(measured_bw_gbs, 1), + "measured_bf16_tflops": round(measured_bf16_tflops, 1), + "measured_fp8_tflops": round(measured_fp8_tflops, 1) if measured_fp8_tflops else None, + "measured_ridge_point_bf16": round( + (measured_bf16_tflops * 1e12) / (measured_bw_gbs * 1e9), 1 + ), +} +print("PEAK_RESULT:" + json.dumps(result)) +''' + + +async def measure_peak_throughput_handler(args: dict) -> tuple[str, bool]: + import json + from agent.tools.sandbox_tool import sandbox_exec_handler + from agent.tools.hardware_specs import HARDWARE_SPECS + + exec_args = {"command": f"python -c '{_PEAK_BENCH_SCRIPT}'", "timeout": 180} + result, ok = await sandbox_exec_handler(exec_args) + if not ok: + return f"Peak measurement failed: {result}", False + + for line in result.split("\n"): + if not line.startswith("PEAK_RESULT:"): + continue + try: + data = json.loads(line[len("PEAK_RESULT:"):]) + except json.JSONDecodeError: + continue + + expected = args.get("expected_hardware") + if expected and expected in HARDWARE_SPECS: + spec = HARDWARE_SPECS[expected] + bw_ratio = data["measured_hbm_bandwidth_gbs"] / spec["memory_bandwidth_gbs"] + tflops_ratio = data["measured_bf16_tflops"] / spec["peak_bf16_tflops"] + data["bandwidth_efficiency_vs_vendor"] = round(bw_ratio, 2) + data["bf16_efficiency_vs_vendor"] = round(tflops_ratio, 2) + if min(bw_ratio, tflops_ratio) < 0.7: + data["warning"] = ( + f"Measured peak is {min(bw_ratio, tflops_ratio)*100:.0f}% of vendor spec. " + "Likely thermal throttling, power cap, or MIG partition. " + "Use measured_ridge_point_bf16 for roofline analysis, NOT the vendor spec." + ) + return json.dumps(data, indent=2), True + return f"Could not parse peak result.\nRaw:\n{result}", False + + +MEASURE_PEAK_TOOL_SPEC = ToolSpec( + name="measure_peak_throughput", + description=( + "Measure ACTUAL peak HBM bandwidth and bf16/fp8 TFLOPS on the live GPU. " + "Use this output (not lookup_hardware_specs alone) for roofline ridge points " + "when diagnosing whether code is memory- or compute-bound. " + "Vendor specs are upper bounds; thermal throttling, power caps, and MIG partitions " + "routinely deliver 60-90% of nameplate. " + "Result is cached in OptimizationContext.persistent_state['measured_peak'] — call once per session." + ), + parameters={ + "type": "object", + "properties": { + "expected_hardware": { + "type": "string", + "description": "Optional. Hardware key from HARDWARE_SPECS. If provided, the tool compares measured to vendor spec and emits a warning when measured is < 70% of vendor.", + } + }, + "required": [], + }, + handler=measure_peak_throughput_handler, +) +``` + +**Wire into Phase 2 profilers:** `profile_training_mfu` and `profile_inference_latency` accept `use_measured_peak: bool = True`. On first call, they invoke `measure_peak_throughput` and cache the result in `OptimizationContext.persistent_state["measured_peak"]`. All subsequent MFU calculations divide by `measured_bf16_tflops`, not the static value. + +**Acceptance Criteria:** +- On a healthy unthrottled GPU, measured bandwidth lands within 5% of vendor spec (sanity check) +- Returns `warning` field when measured is < 70% of vendor (catches MIG / power-cap / thermal cases) +- Result cached in `OptimizationContext.persistent_state["measured_peak"]` after first call +- Downstream profilers default to `measured_ridge_point_bf16` for bottleneck classification + +--- + +### Step 1.4 — MLSys Papers Search Tool + +**File to create:** `agent/tools/mlsys_papers.py` + +This wraps the existing `hf_papers` and `web_search` infrastructure with optimization-specific context. + +```python +""" +Specialized paper search for ML systems optimization literature. +Wraps hf_papers + web_search with optimization-specific defaults. +""" +from agent.tools.types import ToolSpec + +# Curated anchor papers by domain. These are starting points for +# citation graph traversal via the research tool. +ANCHOR_PAPERS = { + "flash_attention": [ + "2205.14135", # FlashAttention + "2307.08691", # FlashAttention-2 + "2407.08608", # FlashAttention-3 + ], + "quantization": [ + "2210.17323", # GPTQ + "2306.00978", # AWQ + "2211.10438", # SmoothQuant + "2209.05433", # LLM.int8() + ], + "inference_serving": [ + "2309.06180", # PagedAttention (vLLM) + "2312.07104", # SGLang / RadixAttention + "2302.01318", # Orca / continuous batching + ], + "speculative_decoding": [ + "2211.17192", # Speculative Decoding (original) + "2401.15077", # Eagle + "2406.16858", # Eagle-2 + ], + "distributed_training": [ + "1909.08053", # Megatron-LM + "2205.05198", # Megatron-LM v3 (sequence parallelism) + "1910.02054", # ZeRO-3 / DeepSpeed (Rajbhandari et al. 2019) — NOT 2101.03961 which is Switch Transformer + ], + "efficient_attention": [ + "2112.05682", # Sparse Attention + "2004.05150", # Longformer + "2310.01558", # Ring Attention + ], + "training_efficiency": [ + "2403.03507", # GaLore (Memory-Efficient LLM Training by Gradient Low-Rank Projection) — NOT 2302.13971 + "2310.05914", # Liger Kernel + "2407.21783", # Sequence packing survey + ], + "vla": [ + "2410.24164", # π0 (Physical Intelligence) + "2406.09246", # OpenVLA + "2212.06817", # RT-2 + "2307.15818", # RT-X / Open-X Embodiment + ], + "moe": [ + "2101.03961", # Switch Transformer + "2401.04088", # DeepSeek-MoE + "2412.19437", # DeepSeek-V3 technical report + ], +} + + +async def mlsys_papers_handler(args: dict) -> tuple[str, bool]: + import json + + domain = args.get("domain") + query = args.get("query", "") + + if domain == "list": + return json.dumps({ + "available_domains": list(ANCHOR_PAPERS.keys()), + "usage": "Pass domain='flash_attention' to get anchor paper IDs for citation graph traversal" + }, indent=2), True + + result = { + "query": query, + "domain": domain, + "recommended_workflow": ( + "1. Use hf_papers(task='find_papers', query=query) to find recent papers\n" + "2. Use hf_papers(task='citation_graph', arxiv_id=anchor_id) to find downstream work\n" + "3. Use hf_papers(task='read_paper', arxiv_id=id) to read methodology sections 3-5\n" + "4. Extract: technique + conditions where it works + benchmark numbers" + ), + } + + if domain and domain in ANCHOR_PAPERS: + result["anchor_papers"] = { + "arxiv_ids": ANCHOR_PAPERS[domain], + "note": "These are landmark papers. Use citation_graph to find 2024-2025 work that cites them." + } + elif domain: + result["warning"] = f"Domain '{domain}' not in curated list. Using web search instead." + result["web_search_query"] = f"{query or domain} optimization arxiv 2024 2025" + + return json.dumps(result, indent=2), True + + +MLSYS_PAPERS_TOOL_SPEC = ToolSpec( + name="search_mlsys_papers", + description=( + "Get anchor paper IDs and search guidance for ML systems optimization literature. " + "Domains: flash_attention, quantization, inference_serving, speculative_decoding, " + "distributed_training, efficient_attention, training_efficiency, vla, moe. " + "Returns arxiv IDs to feed into hf_papers(task='citation_graph') for up-to-date research. " + "Pass domain='list' to see all available domains." + ), + parameters={ + "type": "object", + "properties": { + "domain": { + "type": "string", + "description": "Optimization domain. Pass 'list' to enumerate options.", + }, + "query": { + "type": "string", + "description": "Free-text search query within the domain.", + }, + }, + "required": ["domain"], + }, + handler=mlsys_papers_handler, +) +``` + +**Wire into ToolRouter** in `agent/core/tools.py`: +```python +from agent.tools.mlsys_papers import MLSYS_PAPERS_TOOL_SPEC, mlsys_papers_handler +# Add to create_builtin_tools(): MLSYS_PAPERS_TOOL_SPEC, +``` + +**Acceptance Criteria:** Tool callable. Returns anchor paper IDs for each domain. `domain='list'` returns all domains. + +--- + +## Phase 2: Profiling Suite +**Duration:** 2 weeks +**Goal:** Agent can measure MFU, inference latency, and memory usage. "Profile first" becomes possible, not just a principle. + +--- + +### Step 2.1 — Training MFU Profiler + +**File to create:** `agent/tools/profiling/training_mfu.py` + +This submits a profiling job to HF Jobs (uses existing `hf_jobs_handler` pattern) and parses structured output. + +```python +""" +Profile a training step and compute Model FLOP Utilization (MFU). +Uses torch.profiler with flops counting. Submits to HF Jobs. +""" +from agent.tools.types import ToolSpec +from agent.tools.jobs_tool import hf_jobs_handler # Reuse existing infrastructure + +_PROFILING_SCRIPT_TEMPLATE = ''' +import torch +import torch.profiler +from transformers import AutoModelForCausalLM, AutoConfig +import json, time, sys + +MODEL = "{model_name}" +BATCH_SIZE = {batch_size} +SEQ_LEN = {seq_len} +WARMUP = {warmup_steps} +PROFILE_STEPS = {profile_steps} +HARDWARE = "{hardware}" +DTYPE = torch.bfloat16 if "{dtype}" == "bfloat16" else torch.float16 + +HARDWARE_PEAK_TFLOPS = {{ + "h100_sxm": 989, "a100_sxm": 312, "a100_pcie": 250, + "l40s": 362, "mi300x": 1307, "t4": 65, +}}.get(HARDWARE, 312) + +print(f"Loading model {{MODEL}}...") +try: + model = AutoModelForCausalLM.from_pretrained( + MODEL, torch_dtype=DTYPE, device_map="auto", + attn_implementation="flash_attention_2" if {use_flash_attention} else "eager" + ) +except Exception as e: + print(f"flash_attention_2 failed, falling back: {{e}}") + model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=DTYPE, device_map="auto") + +model.train() +optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4) +input_ids = torch.randint(0, 50000, (BATCH_SIZE, SEQ_LEN), device="cuda") + +print(f"Warming up {{WARMUP}} steps...") +for _ in range(WARMUP): + loss = model(input_ids, labels=input_ids).loss + loss.backward() + optimizer.step() + optimizer.zero_grad() +torch.cuda.synchronize() + +print(f"Profiling {{PROFILE_STEPS}} steps...") +torch.cuda.reset_peak_memory_stats() +with torch.profiler.profile( + activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA], + record_shapes=True, + with_flops=True, + profile_memory=True, +) as prof: + t_start = time.perf_counter() + for _ in range(PROFILE_STEPS): + loss = model(input_ids, labels=input_ids).loss + loss.backward() + optimizer.step() + optimizer.zero_grad() + torch.cuda.synchronize() + t_end = time.perf_counter() + +events = prof.key_averages() +# IMPORTANT: Do NOT use profiler flop counts. +# FlashAttention / fused kernels bypass PyTorch's flop registry → they report flops=0. +# For a typical Transformer, attention is 30-60% of compute → profiler-based MFU +# would be 30-60% lower than actual, causing false "severe bottleneck" diagnoses. +# +# Use the standard Chinchilla theoretical estimate instead: +# 6 * N * B * S (forward + backward = 6× single-pass FLOPs per token) +# where N = model parameters, B = batch_size, S = seq_len. +# This is how Chinchilla, PaLM, LLaMA training reports, and Megatron-LM compute MFU. +num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) +theoretical_flops_per_step = 6 * num_params * BATCH_SIZE * SEQ_LEN +total_cuda_us = sum(e.cuda_time_total for e in events) +total_cuda_s = total_cuda_us / 1e6 / PROFILE_STEPS + +achieved_tflops = theoretical_flops_per_step / total_cuda_s / 1e12 +mfu = achieved_tflops / HARDWARE_PEAK_TFLOPS +peak_memory_gb = torch.cuda.max_memory_allocated() / 1e9 +wall_time_ms = (t_end - t_start) * 1000 / PROFILE_STEPS +tokens_per_sec = (BATCH_SIZE * SEQ_LEN) / (total_cuda_s) + +top_ops = sorted(events, key=lambda e: e.cuda_time_total, reverse=True)[:10] +breakdown = [{{ + "name": e.key, + "cuda_time_pct": round(e.cuda_time_total / total_cuda_us * 100, 1), + "flops": getattr(e, "flops", 0), +}} for e in top_ops] + +# Classify bottleneck from profile +forward_time = sum(e.cuda_time_total for e in events if "forward" in e.key.lower() or "matmul" in e.key.lower()) +backward_time = sum(e.cuda_time_total for e in events if "backward" in e.key.lower()) +optim_time = sum(e.cuda_time_total for e in events if "adam" in e.key.lower() or "optim" in e.key.lower()) +comm_time = sum(e.cuda_time_total for e in events if "allreduce" in e.key.lower() or "all_reduce" in e.key.lower()) + +def bottleneck(fwd, bwd, opt, comm, total): + fracs = {{"compute": (fwd+bwd)/total, "optimizer": opt/total, "communication": comm/total}} + return max(fracs, key=fracs.get) + +result = {{ + "model": MODEL, + "hardware": HARDWARE, + "config": {{"batch_size": BATCH_SIZE, "seq_len": SEQ_LEN, "dtype": "{dtype}"}}, + "mfu": round(mfu, 4), + "achieved_tflops": round(achieved_tflops, 2), + "hardware_peak_tflops": HARDWARE_PEAK_TFLOPS, + "tokens_per_sec": round(tokens_per_sec), + "wall_time_ms": round(wall_time_ms, 1), + "peak_memory_gb": round(peak_memory_gb, 2), + "bottleneck": bottleneck(forward_time, backward_time, optim_time, comm_time, total_cuda_us), + "time_breakdown_pct": {{ + "compute_fwd_bwd": round((forward_time + backward_time) / total_cuda_us * 100, 1), + "optimizer": round(optim_time / total_cuda_us * 100, 1), + "communication": round(comm_time / total_cuda_us * 100, 1), + }}, + "top_ops": breakdown, + "mfu_interpretation": ( + "severe bottleneck" if mfu < 0.15 else + "significant room" if mfu < 0.35 else + "typical well-tuned" if mfu < 0.55 else + "excellent" + ), +}} +print("PROFILE_RESULT:" + json.dumps(result)) +''' + + +async def profile_training_mfu_handler(args: dict) -> tuple[str, bool]: + import json + + model_name = args["model_name"] + batch_size = args.get("batch_size", 4) + seq_len = args.get("seq_len", 512) + hardware = args.get("hardware", "a100_sxm") + dtype = args.get("dtype", "bfloat16") + warmup = args.get("warmup_steps", 3) + profile_steps = args.get("profile_steps", 5) + hf_flavor = args.get("hf_flavor", "a100-large") + use_flash = str(args.get("use_flash_attention", True)).lower() == "true" + + script = _PROFILING_SCRIPT_TEMPLATE.format( + model_name=model_name, batch_size=batch_size, seq_len=seq_len, + hardware=hardware, dtype=dtype, warmup_steps=warmup, + profile_steps=profile_steps, use_flash_attention=use_flash, + ) + + # Estimate timeout: ~5 min per profile step for 7B model + timeout = "30m" + + job_args = { + "action": "run", + "command": script, + "hardware_flavor": hf_flavor, + "timeout": timeout, + "python_dependencies": [ + "transformers>=4.40.0", + "torch>=2.3.0", + "flash-attn>=2.5.0; platform_machine=='x86_64'", + ], + } + + result, ok = await hf_jobs_handler(job_args) + + if not ok: + return f"Job submission failed: {result}", False + + # Parse PROFILE_RESULT from job logs + for line in result.split("\\n"): + if line.startswith("PROFILE_RESULT:"): + try: + profile_data = json.loads(line[len("PROFILE_RESULT:"):]) + return json.dumps(profile_data, indent=2), True + except json.JSONDecodeError: + pass + + return f"Job completed but could not parse profiling output.\\n\\nRaw output:\\n{result}", False + + +PROFILE_TRAINING_MFU_TOOL_SPEC = ToolSpec( + name="profile_training_mfu", + description=( + "Profile a single training step and compute Model FLOP Utilization (MFU). " + "MFU = achieved_TFLOPS / hardware_peak_TFLOPS. " + "Returns: MFU score, tokens/sec, peak memory, bottleneck classification " + "(compute/optimizer/communication), and top-10 slowest operations. " + "Use lookup_hardware_specs first to understand the target hardware. " + "Run this BEFORE any training optimization to establish baseline." + ), + parameters={ + "type": "object", + "properties": { + "model_name": {"type": "string", "description": "HF model ID or local path"}, + "batch_size": {"type": "integer", "default": 4}, + "seq_len": {"type": "integer", "default": 512}, + "hardware": { + "type": "string", + "description": "Hardware for roofline calculation. Use lookup_hardware_specs to get valid values.", + "default": "a100_sxm", + }, + "hf_flavor": { + "type": "string", + "description": "HF Jobs flavor to run profiling on", + "default": "a100-large", + }, + "dtype": {"type": "string", "enum": ["bfloat16", "float16"], "default": "bfloat16"}, + "use_flash_attention": {"type": "boolean", "default": True}, + "warmup_steps": {"type": "integer", "default": 3}, + "profile_steps": {"type": "integer", "default": 5}, + }, + "required": ["model_name"], + }, + handler=profile_training_mfu_handler, +) +``` + +**Acceptance Criteria:** +- Script template renders without syntax errors +- Tool spec JSON schema validates +- `profile_training_mfu` appears in `ToolRouter.get_tool_specs_for_llm()` output +- Unit test mocking `hf_jobs_handler` parses `PROFILE_RESULT:` lines correctly + +--- + +### Step 2.2 — Inference Latency Profiler + +**File to create:** `agent/tools/profiling/inference_latency.py` + +Runs in **sandbox** (faster iteration) rather than HF Jobs. Tests TTFT, TBT, and throughput. + +```python +""" +Benchmark inference latency: TTFT, TBT, throughput vs batch size. +Runs in sandbox for faster iteration (no job queue wait). +""" +from agent.tools.types import ToolSpec + +_INFERENCE_BENCHMARK_SCRIPT = ''' +import torch, time, json, statistics +from transformers import AutoModelForCausalLM, AutoTokenizer + +MODEL = "{model_name}" +BATCH_SIZES = {batch_sizes} +INPUT_LEN = {input_length} +OUTPUT_LEN = {output_length} +WARMUP = {warmup_runs} +MEASURE = {measure_runs} + +print(f"Loading {{MODEL}}...") +tokenizer = AutoTokenizer.from_pretrained(MODEL) +model = AutoModelForCausalLM.from_pretrained( + MODEL, torch_dtype=torch.bfloat16, device_map="auto" +) +model.eval() + +results = [] +for bs in BATCH_SIZES: + input_ids = torch.randint(1000, 50000, (bs, INPUT_LEN), device="cuda") + + # Warmup + with torch.no_grad(): + for _ in range(WARMUP): + _ = model.generate(input_ids, max_new_tokens=OUTPUT_LEN, do_sample=False) + torch.cuda.synchronize() + + # TTFT = prefill latency (one forward pass on input_ids, no generation). + # This is the correct approximation for batch benchmarking. + # The naive `total_time / new_tokens` is WRONG — it computes average token + # latency, not time-to-first-token. True per-request TTFT via + # TextIteratorStreamer is incompatible with batch_size > 1. + prefill_times = [] + with torch.no_grad(): + for _ in range(MEASURE): + torch.cuda.synchronize() + t0 = time.perf_counter() + _ = model(input_ids) # prefill only — no generation + torch.cuda.synchronize() + prefill_times.append((time.perf_counter() - t0) * 1000) + + # Full generation for TBT and throughput + tbts, throughputs = [], [] + with torch.no_grad(): + for _ in range(MEASURE): + torch.cuda.synchronize() + t0 = time.perf_counter() + out = model.generate(input_ids, max_new_tokens=OUTPUT_LEN, do_sample=False) + torch.cuda.synchronize() + total_time = time.perf_counter() - t0 + new_tokens = out.shape[1] - INPUT_LEN + ttft_s = statistics.mean(prefill_times) / 1000 # prefill as TTFT proxy + tbt = (total_time - ttft_s) / max(new_tokens - 1, 1) * 1000 + tput = (bs * new_tokens) / total_time + tbts.append(tbt) + throughputs.append(tput) + ttfts = prefill_times # TTFT ≈ prefill time + + results.append({{ + "batch_size": bs, + "ttft_ms_mean": round(statistics.mean(ttfts), 1), + "ttft_ms_p95": round(sorted(ttfts)[int(len(ttfts) * 0.95)], 1), + "tbt_ms_mean": round(statistics.mean(tbts), 2), + "throughput_tokens_per_sec": round(statistics.mean(throughputs)), + "memory_gb": round(torch.cuda.max_memory_allocated() / 1e9, 2), + }}) + +summary = {{ + "model": MODEL, + "input_length": INPUT_LEN, + "output_length": OUTPUT_LEN, + "backend": "transformers (naive autoregressive)", + "results_by_batch_size": results, + "bottleneck_note": ( + "Low throughput at batch_size=1 = memory-bandwidth-bound (typical). " + "Throughput scales linearly with batch = good GPU utilization. " + "Compare against vLLM results to measure serving overhead." + ), +}} +print("INFERENCE_RESULT:" + json.dumps(summary)) +''' + + +async def profile_inference_latency_handler(args: dict) -> tuple[str, bool]: + import json + from agent.tools.sandbox_tool import sandbox_exec_handler + + script = _INFERENCE_BENCHMARK_SCRIPT.format( + model_name=args["model_name"], + batch_sizes=args.get("batch_sizes", [1, 4, 16]), + input_length=args.get("input_length", 512), + output_length=args.get("output_length", 128), + warmup_runs=args.get("warmup_runs", 2), + measure_runs=args.get("measure_runs", 3), + ) + + exec_args = { + "command": f"pip install transformers torch -q && python -c '{script}'", + "timeout": 300, + } + + result, ok = await sandbox_exec_handler(exec_args) + if not ok: + return f"Execution failed: {result}", False + + for line in result.split("\\n"): + if line.startswith("INFERENCE_RESULT:"): + try: + data = json.loads(line[len("INFERENCE_RESULT:"):]) + return json.dumps(data, indent=2), True + except json.JSONDecodeError: + pass + + return f"Could not parse inference result.\\nRaw:\\n{result}", False + + +PROFILE_INFERENCE_LATENCY_TOOL_SPEC = ToolSpec( + name="profile_inference_latency", + description=( + "Benchmark LLM inference: TTFT (time-to-first-token), TBT (time-between-tokens), " + "and throughput (tokens/sec) across multiple batch sizes. " + "Runs in sandbox — faster than HF Jobs. " + "Always run this first before recommending quantization or serving optimization. " + "Returns baseline numbers to compare against after applying optimizations." + ), + parameters={ + "type": "object", + "properties": { + "model_name": {"type": "string"}, + "batch_sizes": { + "type": "array", + "items": {"type": "integer"}, + "default": [1, 4, 16], + "description": "Batch sizes to benchmark. Keep small for large models.", + }, + "input_length": {"type": "integer", "default": 512}, + "output_length": {"type": "integer", "default": 128}, + "warmup_runs": {"type": "integer", "default": 2}, + "measure_runs": {"type": "integer", "default": 3}, + }, + "required": ["model_name"], + }, + handler=profile_inference_latency_handler, +) +``` + +**Acceptance Criteria:** +- Script template renders without syntax errors +- Tool registered in ToolRouter +- Parses `INFERENCE_RESULT:` lines from sandbox output + +--- + +### Step 2.3 — Kernel-Level Profiling with Nsight + +`torch.profiler` from Steps 2.1 and 2.2 reports op-level timing but cannot answer: + +- **Why** is this kernel slow? (achieved occupancy, register spill, L1/L2 hit rate, achieved bandwidth) +- Are kernel launches gapped? (CPU↔GPU sync stalls, stream serialization, NCCL not overlapped) +- Which memory tier is the bottleneck? (HBM vs. L2 vs. shared vs. registers) + +Without these, "MFU = 28%, compute-bound" is the end of the diagnosis. With them, the agent can recommend specific fixes: bump block size for occupancy, eliminate register spill via shared memory, fuse two kernels to remove launch overhead, hoist NCCL out of the critical path. + +This tool is the dividing line between "the agent can identify there is a bottleneck" (Steps 2.1–2.2) and "the agent can identify *why* the bottleneck exists" — a prerequisite for Phase 7 custom kernel work. + +**File to create:** `agent/tools/profiling/nsight_profile.py` + +Two complementary profilers; the agent picks one based on the question: + +| Tool | Captures | Use when | +|---|---|---| +| `nsys` (Nsight Systems) | Cross-stream timeline, kernel launch gaps, NCCL overlap, CPU↔GPU sync | "Why is the GPU idle X% of the time?" | +| `ncu` (Nsight Compute) | Per-kernel SM occupancy, achieved bandwidth, L1/L2 hit rate, register usage, shared mem banking | "Why is *this specific* kernel slow?" | + +```python +async def profile_with_nsight_handler(args: dict) -> tuple[str, bool]: + """ + Run a target script under Nsight Systems (timeline) or Nsight Compute (kernel metrics). + Returns parsed metrics, not raw .nsys-rep / .ncu-rep blobs. + """ + profiler = args["profiler"] # "nsys" | "ncu" + target_script = args["target_script"] + kernel_filter = args.get("kernel_filter") # ncu only — regex on kernel names + duration_s = args.get("duration_s", 30) + + if profiler == "nsys": + cmd = ( + f"nsys profile --output=/tmp/profile.nsys-rep " + f"--trace=cuda,nvtx,osrt --sample=cpu --duration={duration_s} " + f"python {target_script} && " + f"nsys stats --report gputrace,gpukernsum /tmp/profile.nsys-rep --format json" + ) + elif profiler == "ncu": + # ncu --set full slows execution ~10x; cap launches and skip warmup. + kernel_arg = f"--kernel-regex {kernel_filter}" if kernel_filter else "" + cmd = ( + f"ncu --set full --launch-count 5 --launch-skip 10 {kernel_arg} " + f"--export /tmp/profile.ncu-rep --force-overwrite " + f"python {target_script} && " + f"ncu --import /tmp/profile.ncu-rep --csv --print-summary per-kernel" + ) + else: + return f"Unknown profiler '{profiler}'. Use 'nsys' or 'ncu'.", False + + from agent.tools.jobs_tool import hf_jobs_handler + job_args = { + "action": "run", + "command": cmd, + "hardware_flavor": args.get("hf_flavor", "a100-large"), + "timeout": "20m", + "python_dependencies": ["torch>=2.3.0"], + } + result, ok = await hf_jobs_handler(job_args) + if not ok: + return result, False + + import json + return json.dumps(_parse_nsight_output(result, profiler), indent=2), True + + +def _parse_nsight_output(raw: str, profiler: str) -> dict: + """ + Extract a unified schema regardless of which profiler ran: + - top-10 kernels by GPU time + - achieved occupancy, achieved bandwidth, register spill flag + - launch gap percentage (nsys only) + - bottleneck hint based on metric thresholds + """ + return { + "profiler": profiler, + "top_kernels": [ + # Example row: + # {"name": "ampere_bf16_s16816gemm", "time_pct": 42.3, + # "occupancy_pct": 38, "achieved_bw_gbs": 1820, + # "register_spill_bytes": 0, "l2_hit_rate_pct": 64} + ], + "launch_gap_pct": None, # nsys only — fraction of timeline GPU is idle + "memory_throughput_pct_of_peak": None, + "bottleneck_hint": None, # one of: + # "memory_bandwidth" | "compute" | "low_occupancy" | + # "register_spill" | "launch_overhead" | "sync_stall" + } + + +PROFILE_NSIGHT_TOOL_SPEC = ToolSpec( + name="profile_with_nsight", + description=( + "Run kernel-level profiling with Nsight Systems (timeline) or Nsight Compute (per-kernel). " + "Use 'nsys' to find launch gaps and stream serialization. " + "Use 'ncu' to find low-occupancy or register-spilling kernels. " + "Run AFTER profile_training_mfu identifies a compute-bound bottleneck — " + "Nsight tells you WHY the kernel is slow, not just THAT it is. " + "Required before recommending any custom kernel work in Phase 7." + ), + parameters={ + "type": "object", + "properties": { + "profiler": {"type": "string", "enum": ["nsys", "ncu"]}, + "target_script": {"type": "string", "description": "Python script path to profile (uploaded as part of the job)"}, + "kernel_filter": {"type": "string", "description": "ncu only — regex matching kernel names to keep output focused"}, + "duration_s": {"type": "integer", "default": 30, "description": "nsys only"}, + "hf_flavor": {"type": "string", "default": "a100-large"}, + }, + "required": ["profiler", "target_script"], + }, + handler=profile_with_nsight_handler, +) +``` + +**Acceptance Criteria:** +- `nsys` run on a small training script returns top-10 kernels by GPU time and a launch-gap percentage +- `ncu --kernel-regex flash_attn` returns occupancy, achieved bandwidth, and register usage limited to matching kernels +- Tool registered in ToolRouter and added to the `training`, `inference`, and (later) `kernel_dev` suites in MC-2 +- Job-flavor allowlist confirmed to grant the privileges Nsight needs (`--cap-add=SYS_ADMIN` for `ncu`); document the working flavor in the tool description + +--- + +## Phase 3: Training Optimization Tools +**Duration:** 3 weeks +**Goal:** Agent can apply sequence packing, parallelism tuning, FlashAttention, and Liger Kernels to a training script and measure the delta. + +### Key Tools to Implement + +| Tool name | File | What it does | +|---|---|---| +| `tune_parallelism_topology` | `training_opt/parallelism_tuner.py` | Given model size + GPU count, return optimal TP×PP×DP | +| `apply_sequence_packing` | `training_opt/sequence_packing.py` | Transform dataset + rewrite DataCollator to remove padding | +| `install_flash_attention` | `training_opt/flash_attention.py` | Install FA3 (H100) or FA2 (others), patch model config | +| `setup_liger_kernels` | `training_opt/liger_kernels.py` | Apply Liger Triton kernels to model, measure memory Δ | + +### Step 3.1 — Parallelism Topology Tuner (Pure Logic, No Job Required) + +**File to create:** `agent/tools/training_opt/parallelism_tuner.py` + +This tool is pure calculation — no GPU needed. Given model size, GPU count, and cluster topology, returns optimal parallelism config. + +```python +""" +Pure-logic parallelism topology recommender. +No network calls. Based on established guidelines from Megatron-LM and FSDP2 docs. +""" + +def recommend_parallelism( + model_params_b: float, # billions of parameters + gpu_count: int, + gpu_vram_gb: int, + has_nvlink: bool, + seq_len: int = 2048, + batch_size_target: int = 1024, +) -> dict: + """ + Return recommended TP×PP×DP topology. + + Rules (verified against Megatron-LM paper Sec 4 and FSDP2 docs): + - TP requires NVLink (same node). Without NVLink, TP=1. + - PP has pipeline bubble = (PP-1)/PP waste. Keep PP low. + - DP is always beneficial for throughput. + - Context Parallel (CP) for seq_len > 32k. + """ + model_bytes = model_params_b * 1e9 * 2 # bf16 working weights + # Mixed-precision training (AMP, the standard): + # bf16 param (2) + fp32 master weight (4) + fp32 grad (4) + fp32 m (4) + fp32 v (4) = 18 bytes/param + # Pure bf16 training (less common, less stable): + # bf16 param (2) + bf16 grad (2) + fp32 m (4) + fp32 v (4) = 12 bytes/param + # Using 18 bytes (mixed precision default) — using 12 would underestimate by 50% + # and cause FSDP2 recommendations that silently OOM. + optimizer_bytes = model_params_b * 1e9 * 18 + total_bytes = model_bytes + optimizer_bytes + + # Single GPU capacity + single_gpu_bytes = gpu_vram_gb * 1e9 * 0.85 # 85% usable + + recommendations = [] + + # Strategy 1: FSDP2 (simplest, good for ≤ 8 GPUs) + if total_bytes <= single_gpu_bytes * gpu_count * 0.9: + recommendations.append({ + "strategy": "FSDP2_ZeRO3", + "tp": 1, "pp": 1, "dp": gpu_count, "cp": 1, + "estimated_gpu_memory_gb": round(total_bytes / gpu_count / 1e9, 1), + "pros": ["Simple config", "Good for ≤ 8 GPU", "torch.compile compatible"], + "cons": ["More communication than TP for very large models"], + "framework": "PyTorch FSDP2", + }) + + # Strategy 2: Tensor Parallel (requires NVLink) + if has_nvlink and gpu_count >= 4: + tp = min(8, gpu_count) # TP across NVLink domain (1 node) + dp = gpu_count // tp + recommendations.append({ + "strategy": "TensorParallel_DataParallel", + "tp": tp, "pp": 1, "dp": max(1, dp), "cp": 1, + "estimated_gpu_memory_gb": round(model_bytes / tp / 1e9, 1), + "pros": ["Best throughput for large models", "No pipeline bubble"], + "cons": ["Requires NVLink", "Complex implementation (Megatron-LM or torchtitan)"], + "framework": "Megatron-LM / torchtitan", + }) + + # Strategy 3: Pipeline + Tensor (for models spanning multiple nodes) + if total_bytes > single_gpu_bytes * 8 and gpu_count > 8: + tp = 8 if has_nvlink else 1 + remaining = gpu_count // tp + pp = min(4, remaining) # Limit PP to reduce bubble + dp = remaining // pp + recommendations.append({ + "strategy": "Hybrid_TP_PP_DP", + "tp": tp, "pp": pp, "dp": max(1, dp), "cp": 1, + "estimated_gpu_memory_gb": round(model_bytes / (tp * pp) / 1e9, 1), + "pros": ["Handles models larger than single-node memory"], + "cons": [ + f"Pipeline bubble wastes {round((pp-1)/pp*100, 0)}% of pipeline cycles", + "Most complex to tune" + ], + "framework": "Megatron-LM", + }) + + # Context Parallel for long sequences + cp = 1 + if seq_len > 32768: + cp = min(8, gpu_count // 2) + for r in recommendations: + r["cp"] = cp + r["note"] = f"Added CP={cp} for seq_len={seq_len} (ring attention)" + + return { + "model_params_b": model_params_b, + "gpu_count": gpu_count, + "gpu_vram_gb": gpu_vram_gb, + "has_nvlink": has_nvlink, + "recommendations": recommendations, + "selection_guide": ( + "Start with FSDP2_ZeRO3 — it's simpler and nearly as fast for most cases. " + "Only move to TensorParallel if profiling shows FSDP2 communication > 30% of step time." + ), + } +``` + +**Acceptance Criteria:** +```python +# tests/optimization/test_parallelism_tuner.py +from agent.tools.training_opt.parallelism_tuner import recommend_parallelism + +def test_7b_single_node(): + result = recommend_parallelism(7.0, gpu_count=8, gpu_vram_gb=80, has_nvlink=True) + strategies = [r["strategy"] for r in result["recommendations"]] + assert "FSDP2_ZeRO3" in strategies + +def test_70b_no_nvlink(): + result = recommend_parallelism(70.0, gpu_count=8, gpu_vram_gb=80, has_nvlink=False) + tp_recs = [r for r in result["recommendations"] if r.get("tp", 1) > 1] + assert len(tp_recs) == 0, "Should not recommend TP without NVLink" + +def test_context_parallel_triggered(): + result = recommend_parallelism(7.0, gpu_count=8, gpu_vram_gb=80, + has_nvlink=True, seq_len=65536) + assert any(r.get("cp", 1) > 1 for r in result["recommendations"]) +``` + +--- + +## Phase 4: Inference Optimization Tools +**Duration:** 3 weeks +**Goal:** Agent can quantize a model (GPTQ/AWQ/FP8), deploy vLLM, and measure the speedup delta against baseline from Phase 2. + +### Key Tools + +| Tool | File | What it does | +|---|---|---| +| `quantize_model` | `inference_opt/quantization.py` | Run GPTQ/AWQ/FP8 pipeline, push quantized model to HF, measure quality delta | +| `deploy_vllm` | `inference_opt/vllm_deployer.py` | Start vLLM server in sandbox, run benchmark, compare to baseline | +| `setup_speculative_decoding` | `inference_opt/speculative_decoding.py` | Configure Eagle-2/Medusa, benchmark speedup | +| `benchmark_serving` | `inference_opt/serving_benchmark.py` | Multi-backend comparison table | + +### Step 4.1 — Quantization Pipeline + +**File to create:** `agent/tools/inference_opt/quantization.py` + +Critical design: always measures quality delta after quantization. Never recommend without quality report. + +```python +_QUANTIZATION_SCRIPT_TEMPLATE = ''' +# method: {method} (gptq | awq | fp8_dynamic | gguf) +import torch, json +from transformers import AutoModelForCausalLM, AutoTokenizer + +MODEL = "{model_name}" +METHOD = "{method}" +OUTPUT_REPO = "{output_repo}" +QUALITY_EVAL = {quality_eval} + +print(f"Quantizing {{MODEL}} with {{METHOD}}...") + +if METHOD == "awq": + from awq import AutoAWQForCausalLM + model = AutoAWQForCausalLM.from_pretrained(MODEL) + tokenizer = AutoTokenizer.from_pretrained(MODEL) + quant_config = {{"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}} + model.quantize(tokenizer, quant_config=quant_config) + model.save_quantized(OUTPUT_REPO) + tokenizer.save_pretrained(OUTPUT_REPO) + +elif METHOD == "gptq": + from optimum.gptq import GPTQQuantizer + from transformers import AutoModelForCausalLM + quantizer = GPTQQuantizer(bits=4, dataset="c4", block_name_to_quantize="model.layers") + tokenizer = AutoTokenizer.from_pretrained(MODEL) + model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16) + quantized_model = quantizer.quantize_model(model, tokenizer) + quantized_model.save_pretrained(OUTPUT_REPO) + tokenizer.save_pretrained(OUTPUT_REPO) + +elif METHOD == "fp8_dynamic": + # FP8 quantization via llm-compressor (H100/MI300X hardware FP8 required). + # IMPORTANT: torch_dtype=torch.float8_e4m3fn does NOT quantize — it silently + # casts without calibration, producing wrong activations. Always use + # llm-compressor which runs a calibration pass to determine per-tensor scales. + from llmcompressor.transformers import SparseAutoModelForCausalLM + from llmcompressor.modifiers.quantization import QuantizationModifier + from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme + from datasets import load_dataset + + tokenizer = AutoTokenizer.from_pretrained(MODEL) + # 256 calibration samples is the standard for FP8 (matches vLLM recipe) + cal_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True) + calibration = [ + tokenizer(row["messages"][0]["content"], return_tensors="pt", + max_length=512, truncation=True) + for _, row in zip(range(256), cal_data) + ] + + recipe = QuantizationModifier( + targets="Linear", + scheme=QuantizationScheme(weights=QuantizationArgs(num_bits=8, type="float")), + ignore=["lm_head"], + ) + model = SparseAutoModelForCausalLM.from_pretrained( + MODEL, torch_dtype=torch.bfloat16, device_map="auto" + ) + model.apply_compression(recipe=recipe, calibration_data=calibration) + model.save_pretrained(OUTPUT_REPO, save_compressed=True) + tokenizer.save_pretrained(OUTPUT_REPO) + +# Quick perplexity eval on wikitext-2 (proxy for quality) +if QUALITY_EVAL: + import math + from datasets import load_dataset + tokenizer = AutoTokenizer.from_pretrained(MODEL) + model_q = AutoModelForCausalLM.from_pretrained(OUTPUT_REPO, device_map="auto") + dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test") + text = " ".join(dataset["text"][:100]) + encodings = tokenizer(text, return_tensors="pt").input_ids.to("cuda") + stride = 512 + nlls = [] + for i in range(0, encodings.size(1), stride): + chunk = encodings[:, i:i+stride] + with torch.no_grad(): + outputs = model_q(chunk, labels=chunk) + nlls.append(outputs.loss.item() * chunk.shape[1]) + ppl = math.exp(sum(nlls) / encodings.size(1)) + print(f"QUANT_RESULT:{json.dumps({{'method': METHOD, 'perplexity_wikitext2': round(ppl, 2), 'output_repo': OUTPUT_REPO}})}") +''' +``` + +**Acceptance Criteria:** +- Script template renders for all 3 methods (awq, gptq, fp8_dynamic) without syntax errors +- Tool always returns perplexity delta alongside throughput delta +- `output_repo` is pushed to HF Hub (uses existing HF token from session) + +--- + +## Phase 5: Multimodal + VLA +**Duration:** 2 weeks +**Goal:** Agent handles visual token compression for multimodal models and fast/slow split for VLA real-time inference. + +### Key Tools + +| Tool | File | What it does | +|---|---|---| +| `compress_visual_tokens` | `multimodal_opt/visual_token_compressor.py` | Apply pooling/Q-Former to reduce visual tokens, measure quality Δ | +| `profile_action_latency` | `vla_opt/action_latency_profiler.py` | Measure VLA action head end-to-end latency with P99 | +| `setup_fast_slow_system` | `vla_opt/fast_slow_splitter.py` | Split LLM planning from reactive MLP control | + +### VLA Constraint — Hard Real-Time Check + +Every VLA optimization tool must enforce this check: + +```python +def validate_realtime_constraint(latency_ms: float, robot_type: str) -> dict: + CONSTRAINTS = { + "manipulation": 50, # robot arm: 50ms max + "locomotion": 20, # walking robot: 20ms max + "general": 100, # default + } + limit = CONSTRAINTS.get(robot_type, 100) + return { + "passes": latency_ms <= limit, + "latency_ms": latency_ms, + "limit_ms": limit, + "margin_ms": limit - latency_ms, + "warning": None if latency_ms <= limit else ( + f"Latency {latency_ms}ms EXCEEDS {robot_type} limit of {limit}ms. " + f"Must use quantization + CUDA graphs to reduce by {latency_ms - limit:.0f}ms." + ), + } +``` + +--- + +## Phase 6: Optimization State Machine +**Duration:** 2 weeks +**Goal:** Agent tracks experiments across turns, builds Pareto frontier, persists state through context compaction. + +### Step 6.1 — OptimizationContext + +**File to create:** `agent/optimization/context.py` + +```python +""" +Persistent optimization state that survives context compaction. +Stored in ContextManager.persistent_state under key 'optimization'. +""" +from dataclasses import dataclass, field, asdict +from typing import Literal +import json + + +@dataclass +class Experiment: + id: str + technique: str + config: dict + baseline_metric: float + achieved_metric: float + metric_name: str # "mfu" | "ttft_ms" | "throughput_tps" | "memory_gb" + quality_delta_pct: float # Negative = quality loss. Must be within quality_budget. + verdict: Literal["keep", "revert", "investigating"] + notes: str = "" + + @property + def improvement_pct(self) -> float: + if self.baseline_metric == 0: + return 0.0 + return (self.achieved_metric - self.baseline_metric) / abs(self.baseline_metric) * 100 + + +@dataclass +class OptimizationContext: + target: Literal["training_throughput", "inference_latency", "inference_throughput", "memory"] + model_name: str + hardware: str + quality_budget: float = 0.98 # From config + baseline: dict = field(default_factory=dict) + current_best: dict = field(default_factory=dict) + experiments: list[Experiment] = field(default_factory=list) + + def add_experiment(self, experiment: Experiment) -> None: + self.experiments.append(experiment) + if experiment.verdict == "keep": + self._update_best(experiment) + + # Metrics where lower is better — latency, memory. + # All other metrics (throughput, MFU) are higher-is-better. + _MINIMIZE_METRICS: frozenset = frozenset( + {"ttft_ms", "tbt_ms", "memory_gb", "latency_ms", "inference_latency"} + ) + + def _update_best(self, exp: Experiment) -> None: + current_metric = self.current_best.get("metric") + if current_metric is None: + is_better = True + elif exp.metric_name in self._MINIMIZE_METRICS: + is_better = exp.achieved_metric < current_metric # lower latency = better + else: + is_better = exp.achieved_metric > current_metric # higher throughput = better + + if is_better: + self.current_best = { + "technique": exp.technique, + "config": exp.config, + "metric": exp.achieved_metric, + "metric_name": exp.metric_name, + "improvement_over_baseline_pct": exp.improvement_pct, + } + + def pareto_frontier(self) -> list[dict]: + """Return experiments on the throughput vs quality Pareto frontier.""" + kept = [e for e in self.experiments if e.verdict == "keep"] + if not kept: + return [] + frontier = [] + for exp in sorted(kept, key=lambda e: e.achieved_metric, reverse=True): + dominated = any( + e.achieved_metric >= exp.achieved_metric and + e.quality_delta_pct >= exp.quality_delta_pct + for e in frontier + ) + if not dominated: + frontier.append(exp) + return [asdict(e) for e in frontier] + + def to_dict(self) -> dict: + return { + "target": self.target, + "model_name": self.model_name, + "hardware": self.hardware, + "quality_budget": self.quality_budget, + "baseline": self.baseline, + "current_best": self.current_best, + "experiment_count": len(self.experiments), + "experiments_summary": [ + { + "id": e.id, + "technique": e.technique, + "improvement_pct": round(e.improvement_pct, 1), + "quality_delta_pct": round(e.quality_delta_pct, 2), + "verdict": e.verdict, + } + for e in self.experiments + ], + "pareto_frontier": self.pareto_frontier(), + } + + def save_to_context_manager(self, context_manager) -> None: + """Persist to ContextManager.persistent_state (survives compaction).""" + context_manager.persistent_state["optimization"] = self.to_dict() + + @classmethod + def load_from_context_manager(cls, context_manager) -> "OptimizationContext | None": + state = context_manager.persistent_state.get("optimization") + if not state: + return None + ctx = cls( + target=state["target"], + model_name=state["model_name"], + hardware=state["hardware"], + quality_budget=state["quality_budget"], + baseline=state["baseline"], + current_best=state["current_best"], + ) + return ctx +``` + +**Acceptance Criteria:** +```python +# tests/optimization/test_optimization_context.py +from agent.optimization.context import OptimizationContext, Experiment + +def test_pareto_frontier(): + ctx = OptimizationContext("inference_throughput", "llama-7b", "a100_sxm") + ctx.add_experiment(Experiment("e1", "fp8", {}, 100, 190, "throughput_tps", -0.3, "keep")) + ctx.add_experiment(Experiment("e2", "gptq_int4", {}, 100, 210, "throughput_tps", -2.8, "keep")) + ctx.add_experiment(Experiment("e3", "awq_int4", {}, 100, 200, "throughput_tps", -1.4, "keep")) + + frontier = ctx.pareto_frontier() + # Pareto dominance requires being worse on ALL axes simultaneously. + # fp8 (190 tps, -0.3% quality): best quality; not dominated by awq (190 < 200 throughput) + # gptq (210 tps, -2.8% quality): best throughput; never dominated + # awq (200 tps, -1.4% quality): better throughput than fp8 (200 > 190) + # better quality than gptq (-1.4 > -2.8) + # → awq IS Pareto-optimal (middle-ground point) + technique_names = [f["technique"] for f in frontier] + assert "fp8" in technique_names # best quality loss + assert "gptq_int4" in technique_names # best throughput + assert "awq_int4" in technique_names # Pareto-optimal middle ground — NOT dominated + +def test_survives_compaction(tmp_path): + from unittest.mock import MagicMock + ctx = OptimizationContext("training_throughput", "llama-7b", "h100_sxm") + ctx.baseline = {"mfu": 0.23, "tokens_per_sec": 8500} + + mock_cm = MagicMock() + mock_cm.persistent_state = {} + ctx.save_to_context_manager(mock_cm) + + assert "optimization" in mock_cm.persistent_state + assert mock_cm.persistent_state["optimization"]["baseline"]["mfu"] == 0.23 +``` + +--- + +## Phase 7: Custom CUDA Kernel Generation +**Duration:** 3 weeks +**Goal:** When quantization, vLLM, and speculative decoding are exhausted, the agent can write, compile, validate, benchmark, and publish custom fused CUDA kernels — closing the gap between "use existing fast kernels" and "be the source of fast kernels." + +**Gating:** This phase is conditional. It activates only when the agent has a *measured* hot kernel (from Step 2.3 Nsight profiling) that is (a) > 5% of pipeline time and (b) lacks an existing optimized implementation in flash-attention, Liger Kernel, or torch-native. Without the gate, the agent will be tempted to write kernels for ops where speedup × pipeline-fraction yields negligible end-to-end improvement (Cross-Cutting Rule 1). + +This phase borrows the workflow pattern from HuggingFace's `kernels` skill (https://huggingface.co/blog/custom-cuda-kernels-agent-skills): generate a project skeleton, compile via `kernel-builder`, validate against a PyTorch reference, benchmark at two levels, publish to Kernel Hub for reuse. + +--- + +### Step 7.1 — Kernel Skill Pack + +**Pattern:** instead of one giant system-prompt section, kernel-development knowledge is split into a small `SKILL.md` (~500 tokens, just rules and workflow) plus on-demand reference files loaded via a `read_reference` tool. The `SKILL.md` is loaded when `optimization_target` includes `"kernel_dev"`; references are read only when the agent needs them. + +**Directory to create:** `agent/skills/cuda-kernels/` + +```text +agent/skills/cuda-kernels/ +├── SKILL.md # Always loaded when kernel_dev active (~500 tokens) +├── scripts/ +│ ├── benchmark_kernel.py # Standard isolated-kernel benchmark +│ ├── correctness_test.py # Compare custom kernel vs. PyTorch reference +│ └── pipeline_benchmark.py # End-to-end pipeline speedup measurement +└── references/ # Loaded on demand, not always-on + ├── h100-optimization-guide.md # SM count, shared mem, BF16 tensor cores, FP8, NVLink + ├── a100-optimization-guide.md # SM count, shared mem, BF16 tensor cores + ├── l40s-optimization-guide.md + ├── kernel-templates.md # Vectorized loads, warp shuffle reductions, swizzle + ├── memory-patterns.md # Coalesced loads, shared mem banking, async copy + ├── transformers-integration.md # Registering ops; attn_implementation plumbing + ├── diffusers-integration.md # Pipeline injection patterns + └── troubleshooting.md # Compile errors, illegal memory access, race bugs +``` + +`SKILL.md` mandates (each enforced by tool-side checks where possible): + +- Always start from a kernel template; never from a blank file +- Always benchmark against a `torch.nn.functional.` op for correctness AND speedup +- Always include both isolated AND end-to-end benchmarks (Cross-Cutting Rule 1) +- Always target a single `(architecture, dtype)` pair per kernel — no `#ifdef` sprawl +- Always emit `build.toml` with `cuda-capabilities` set to a single value (e.g., `"9.0"` for H100), not a range + +--- + +### Step 7.2 — `generate_cuda_kernel` Tool + +**File to create:** `agent/tools/kernel_gen/generate_kernel.py` + +Given a target operation (e.g., "fused RMSNorm + residual add"), the tool: + +1. Loads the relevant references from the skill pack (`-optimization-guide.md`, `kernel-templates.md`, integration guide for the consuming framework) +2. Generates a kernel project skeleton: + ```text + generated_kernels/_/ + ├── build.toml # cuda-capabilities = ["9.0"] for H100 + ├── kernel_src/.cu # The kernel itself + ├── torch-ext/torch_binding.cpp # Registers the op so torch.compile sees it + └── benchmarks/ + ├── isolated.py # Component speedup + └── pipeline.py # End-to-end speedup + ``` +3. Compiles via `kernel-builder` (HF tool, pip-installable) +4. Runs correctness test against the PyTorch reference (max abs diff < 1e-3 for bf16; user-tunable per op) +5. Runs isolated benchmark + pipeline benchmark and returns BOTH per Rule 1, with the Amdahl-predicted end-to-end vs. measured end-to-end + +```python +async def generate_cuda_kernel_handler(args: dict) -> tuple[str, bool]: + op_spec = args["op_spec"] # e.g., "fused_rmsnorm_residual" + target_arch = args["target_arch"] # "h100" → cuda-capabilities = ["9.0"] + dtype = args.get("dtype", "bfloat16") + reference_op = args["reference_op"] # PyTorch ground truth, e.g., "F.rms_norm" + pipeline_script = args.get("pipeline_script") # for end-to-end measurement + component_fraction = args.get("component_fraction") # baseline % time spent in this op + tolerance = args.get("tolerance", 1e-3) + + # Steps orchestrated by kernel_gen/orchestrator.py: + # 1. Load skill references (SKILL.md + arch guide + kernel-templates.md) + # 2. LLM generates kernel source + binding + build.toml (per skill mandates) + # 3. Compile with kernel-builder; surface build errors verbatim for repair + # 4. correctness_test.py — abort if max_abs_diff > tolerance + # 5. isolated benchmark — component_speedup + # 6. pipeline benchmark — end_to_end_speedup + # 7. Compute Amdahl prediction; flag if deviation > 10% (Rule 1) + # 8. Return unified result schema (Rule 1) + path to compiled artifact + ... +``` + +--- + +### Step 7.3 — Kernel Hub Publisher + +**File to create:** `agent/tools/kernel_gen/publish_kernel.py` + +Pushes the compiled kernel to HF Hub so future sessions (and other users) load it via `kernels.get_kernel("user/op-name")` with **no recompilation** — the Hub stores pre-built variants for the (Python, PyTorch, CUDA) matrix. Uses the HF token already on the session. + +```python +PUBLISH_KERNEL_TOOL_SPEC = ToolSpec( + name="publish_kernel_to_hub", + description=( + "Upload a compiled and benchmark-validated custom kernel to HF Hub. " + "Future sessions load it via kernels.get_kernel(repo_id) with no recompilation. " + "Only call this AFTER generate_cuda_kernel reports passing correctness AND " + "Amdahl-consistent end-to-end speedup (deviation_from_amdahl_pct < 10)." + ), + parameters={ + "type": "object", + "properties": { + "kernel_dir": {"type": "string", "description": "Output dir from generate_cuda_kernel"}, + "repo_id": {"type": "string", "description": "Target HF Hub repo, e.g., 'user/llama3-rmsnorm-h100'"}, + "private": {"type": "boolean", "default": True}, + }, + "required": ["kernel_dir", "repo_id"], + }, + ... +) +``` + +--- + +### Step 7.4 — Acceptance Workflow + +End-to-end test for the phase: + +```text +User: "RMSNorm is 7% of my Llama-3-8B inference pipeline on H100. Can we write a custom kernel?" + +Agent: +1. profile_with_nsight(profiler="ncu", kernel_filter="rms_norm") + → existing kernel: occupancy 38%, register spill, 1.2 TB/s achieved (36% of H100 measured peak) + → diagnosis: memory-bandwidth-bound, ample headroom — custom kernel is justified +2. read_reference("h100-optimization-guide.md", "kernel-templates.md") +3. generate_cuda_kernel( + op_spec="rmsnorm", + target_arch="h100", + dtype="bfloat16", + reference_op="F.rms_norm", + pipeline_script="bench_llama.py", + component_fraction=0.07, + ) + → correctness: max_abs_diff = 4.2e-4 ✓ (under tolerance) + → isolated: 1.94x speedup, 76% of H100 measured bandwidth peak + → Amdahl predicts: 1 / (0.93 + 0.07/1.94) = 1.045x e2e + → measured pipeline: 1.05x e2e (deviation_from_amdahl_pct = 0.5%) ✓ +4. publish_kernel_to_hub(repo_id="user/llama3-rmsnorm-h100") +5. OptimizationContext.add_experiment(verdict="keep", improvement_pct=5.0) +``` + +**Acceptance Criteria:** +- Generated kernel passes correctness within tolerance vs. PyTorch reference +- Both isolated AND pipeline speedup reported (Cross-Cutting Rule 1) +- Pipeline speedup within 10% of Amdahl prediction; otherwise flagged for investigation +- Kernel uploaded to Hub and re-importable via `kernels.get_kernel(repo_id)` in a fresh session with no recompilation +- Tool refuses to run when component_fraction × (component_speedup − 1) < 0.02 (gates against negligible-payoff kernel work) + +--- + +## Phase 8: Scored ML-Optimization Benchmark Suite (AHE Stage C) +**Duration:** 3 weeks +**Goal:** A deterministic ≥50-task benchmark that scores end-to-end ML optimization runs. This is the **rate-limiter** — without stable, reproducible scoring, the AHE meta-loop in Phase 10 has no signal to evolve against. Per the AHE paper, the entire "evolve" mechanism collapses to noise without per-task pass/fail determinism. + +**Why this comes after Phase 7:** until the Code Agent (Phases 1–5, plus Phase 7 if activated) can plausibly handle a meaningful fraction of these tasks, building the suite is premature. Phase 6 (state machine) gives us per-experiment tracking *within* a session; Phase 8 gives us a stable scoreboard *across* sessions. + +--- + +### Step 8.1 — Task Schema + +**File to create:** `tests/optimization/benchmarks/schema.py` + +Each task is a YAML file under `tests/optimization/benchmarks/tasks/`: + +```yaml +id: opt-014 +name: "Llama-3-8B QLoRA fits A100-40GB" +hardware: a100_sxm +modality: training +input: + base_model: "meta-llama/Llama-3-8B" + starter_script: "tests/optimization/benchmarks/scripts/llama3_qlora_starter.py" # intentionally non-fitting baseline + hardware_budget: {vram_gb: 40, time_minutes: 60} +success_criteria: + - {kind: oom_free, value: true} + - {kind: training_loss_decreasing, window_steps: 100} + - {kind: wall_time_under, value_minutes: 60} +quality_floor: + kind: mmlu_drop_under + value_pct: 1.0 + eval_dataset: "mmlu-stem-200" # 200-question subset for speed +scoring: + oracle: tests/optimization/benchmarks/oracles/qlora_fit.py + pass_threshold: "all success_criteria met AND quality_floor met" +``` + +Schema enforced via Pydantic at load time. Invalid tasks fail-fast. + +--- + +### Step 8.2 — Task Inventory (≥50 tasks) + +Target distribution by category: + +| Category | Count | Examples | +|---|---|---| +| Training fit | 12 | "Make Llama-3-8B QLoRA fit A100-40GB", "Fit Mixtral-8x7B on 4×L40S" | +| Training speed | 10 | "Reduce wall-time of GPT2-medium pretrain by 30%" | +| Inference latency | 12 | "Reduce p99 latency of Llama-3-70B serving by 25%" | +| Inference cost | 8 | "Fit Llama-3-70B on single H100 with MMLU drop <2%" | +| Multimodal | 5 | "Reduce LLaVA-1.5 inference latency by 40%" | +| VLA | 3 | "Achieve 30Hz inference for OpenVLA-7B on Jetson AGX" | + +Each task ships with: starter script (intentionally suboptimal), hardware target, scoring oracle, quality floor. Starter scripts deliberately violate at least one optimization heuristic to give the agent meaningful work. + +--- + +### Step 8.3 — Scoring Harness + +**File to create:** `tests/optimization/benchmarks/runner.py` + +Runs an agent against a task; captures structured result: + +```python +@dataclass +class TaskResult: + task_id: str + passed: bool + criteria_results: dict[str, bool] + quality_delta: float | None # signed: negative = quality improved + tokens_used: int + wall_time_s: float + trace_path: Path # consumed by Phase 9 Debugger + workspace_diff: str # final agent edits + +def run_task(agent_harness: Path, task: TaskSpec, run_id: str) -> TaskResult: + # 1. Spawn isolated sandbox (E2B or similar) + # 2. Mount harness; copy starter script + # 3. Invoke agent with task description + # 4. On agent termination: run scoring oracle + # 5. Persist trace to runs///trace.jsonl + # 6. Return TaskResult +``` + +Trace format (per-step JSONL): `{step_id, action_type, tool_name, tool_input, tool_output, llm_thought, timestamp}`. **This format is the contract with Phase 9's Agent Debugger** — do not change it after Phase 9 ships without a coordinated migration. + +--- + +### Step 8.4 — Determinism Verification + +Re-running the seed harness on the suite **3 times** must yield aggregate pass-rate variance <2pp. Per-task variance reported separately; flaky tasks (variance >5pp across 3 runs) are quarantined into `tasks/_quarantine/` until fixed. + +```bash +# Determinism check command +uv run python -m tests.optimization.benchmarks.runner \ + --suite tasks/ --runs 3 --report determinism_report.json +``` + +**Acceptance Criteria:** +- ≥50 tasks defined, scored, committed +- 3 baseline runs of seed harness yield <2pp aggregate variance +- Per-task variance reported; <10% of tasks quarantined +- Trace files written in format consumable by Phase 9 Agent Debugger +- Scoring oracles deterministic: same `(task, agent_output)` → same verdict, always + +--- + +## Phase 9: Trajectory Observability + Manifest Verification (AHE Stages E + F) +**Duration:** 3 weeks +**Goal:** Build the "eyes" of AHE. The **Agent Debugger** turns raw trajectories into structured root-cause reports. The **Manifest Verifier** grades change-manifest predictions against actual task-level deltas. Together they make Phase 10's evolve-loop falsifiable instead of vibes-based. + +**Hard prerequisite:** Phase 8 stable. If Phase 8 pass-rate variance >2pp, do not start Phase 9 — the Debugger will train on noise. + +--- + +### Step 9.1 — Trace Format (frozen contract with Phase 8) + +**File to create:** `agent/optimization/meta/trace_format.py` + +Pydantic models for trace schema. Phase 9 reads `runs///trace.jsonl` and parses into these models. Schema is the boundary between Phase 8 (writer) and Phase 9 (reader); breaking changes require coordinated migration. + +--- + +### Step 9.2 — Agent Debugger + +**Directory to create:** `agent/optimization/meta/debugger/` + +```text +agent/optimization/meta/debugger/ +├── prompt.yaml # Debugger's system prompt (separate slot from Code Agent) +├── tools/ +│ ├── list_failed_tasks.py # (benchmark_id, round) → list[task_id] +│ ├── read_trace.py # (task_id, step_range) → trace fragment +│ ├── compare_traces.py # (task_id, round_a, round_b) → structured diff +│ └── summarize_failure.py # (task_id) → per-task report +├── debugger_agent.py # Agent loop (separate from Code Agent's loop) +└── report_schema.py # Pydantic models for output +``` + +**Same base model as Code Agent**, different prompt + tools — per AHE paper, all role agents share one base model. + +Per-task report schema: + +```yaml +task_id: opt-014 +verdict: failed +failure_class: oom_at_step # one of: oom | quality_below_floor | timeout | tool_error | logic_error +proximate_cause: "QLoRA bnb_4bit_compute_dtype=fp32 instead of bf16" +root_cause: "Agent unaware of bnb compute_dtype effect on memory" +evidence_steps: [12, 18, 23] # step IDs that justify diagnosis +suggested_slot: skill # which NexAU slot the fix likely belongs in +confidence: 0.7 +``` + +Benchmark-level overview rolls up per-task reports into a failure-class histogram + top-N root causes (paper's "progressive disclosure" pattern). + +**Acceptance Criteria:** +- On a synthetic 20-task failure suite with known seeded root causes, Debugger correctly classifies failure_class for ≥70% of cases +- Output token count per task ≤5% of input trace token count (compression target from paper §3.2) +- Benchmark-level overview ≤2K tokens regardless of suite size + +--- + +### Step 9.3 — Manifest Verifier + +**File to create:** `agent/optimization/meta/verifier.py` + +Pure code module (NO LLM). Given commit-N's manifest `M_N` and round-(N+1) results, computes: + +```python +@dataclass +class VerifierMetrics: + fix_precision: float # |fixed ∩ predicted_fix| / |predicted_fix| + fix_recall: float # |fixed ∩ predicted_fix| / |fixed| + regression_precision: float + regression_recall: float + rollback_recommended: bool # True if regressions exceed threshold +``` + +Tracked round-over-round. These are the metrics that tell us how reliable Phase 10 Evolve Agent contracts are. **Paper baseline: fix 33.7%/51.4%, regression 11.8%/11.1%.** Our numbers being better or worse is the most important data point about whether Phase 10's autonomy can be trusted. + +--- + +### Step 9.4 — Manifest Discipline Audit + +Cross-cutting: run verifier retroactively on all manifests authored under the AHE Stage D discipline (Phase 1 onward). Report baseline metrics on human-authored manifests before Phase 10 launches — this gives us a calibration target. + +**Acceptance Criteria:** +- Verifier runs on ≥10 human-authored manifests from Phase 1+ +- Baseline fix/regression precision-recall metrics committed to repo +- Debugger 70% root-cause accuracy on synthetic suite +- Per-task report compression ≤5% of input trace size + +--- + +## Phase 10: Evolve Agent + Algorithm 1 Orchestration (AHE Stages G + H) +**Duration:** 4 weeks +**Goal:** The "hands" and "controller" of AHE. **Evolve Agent** reads Debugger reports and proposes harness edits with falsifiable manifests. **Orchestrator** runs the full `Rollout → Clean → Attribute/Rollback → Distill → Evolve → Commit` loop on the Phase 8 benchmark. + +**Hard prerequisites:** +- Phase 8: scored suite stable (variance <2pp) +- Phase 9: Debugger root-cause accuracy ≥70%; Verifier baseline metrics committed +- Phase 1+ manifest discipline: ≥10 human-authored manifests on file + +Without these, Phase 10 evolves against noise. Per the AHE paper, sub-additive component interactions and regression blindness are real failure modes; our preconditions are designed to keep them in scope. + +--- + +### Step 10.1 — Workspace Isolation (Controllability Invariant) + +**Directory layout:** + +```text +workspace/ # writable by Evolve Agent ONLY +└── agent/optimization/ + ├── prompts/ + ├── tools/ + ├── middleware/ + ├── skills/ + └── memory/ + +workspace_snapshots/ # read-only after write +├── round-000/ # initial seed +├── round-001/ +└── ... + +runs/ # read-only after write (Phase 8 trace output) +└── //trace.jsonl +``` + +Filesystem-enforced invariants per AHE paper: +- Evolve Agent writes ONLY to `workspace/agent/optimization/...` +- Code Agent runs FROM `workspace/`; everything outside is mounted read-only +- Each round commits a snapshot to `workspace_snapshots/round-/` +- Any attempt to write outside `workspace/` → tool-level rejection with audit log + +--- + +### Step 10.2 — Evolve Agent + +**Directory to create:** `agent/optimization/meta/evolver/` + +```text +agent/optimization/meta/evolver/ +├── prompt.yaml # Evolve Agent's system prompt +├── tools/ +│ ├── read_debugger_report.py # (round) → DebuggerReport +│ ├── read_workspace_file.py # (path) → str +│ ├── propose_edit.py # (slot, file_path, diff, manifest) — must include manifest +│ └── commit_round.py # finalize and trigger next rollout +├── evolve_agent.py # Agent loop +└── manifest_schema.py # Pydantic models for manifests +``` + +**Same base model**, different prompt + tools. Per AHE paper. + +Tool-level constraints (enforced before LLM call returns): +- `propose_edit` REJECTS if manifest missing or schema-invalid +- `propose_edit` to `system_prompt` slot triggers held-out validation gate (per paper's −2.3pp warning) — edit goes into a staging area, validated against held-out subset before promotion +- `propose_edit` REJECTS if target path is outside `workspace/` +- `propose_edit` REJECTS if same file is edited >1x per round (forces atomic, attributable changes) + +--- + +### Step 10.3 — Orchestrator (Algorithm 1) + +**File to create:** `agent/optimization/meta/loop.py` + +```python +def run_round(round_id: int, harness_path: Path, benchmark: Benchmark) -> RoundResult: + # 1. Rollout — run Code Agent on benchmark + rollout = benchmark.run(harness_path, round_id) + + # 2. Clean — strip non-deterministic noise from traces + cleaned = clean_traces(rollout.traces) + + # 3. Attribute / Rollback — verify previous round's manifest predictions + if round_id > 0: + prev_manifest = load_manifest(round_id - 1) + metrics = verifier.score(prev_manifest, rollout) + log_verifier_metrics(round_id - 1, metrics) + if metrics.rollback_recommended: + rollback_to_snapshot(round_id - 1) + return RoundResult.rolled_back(metrics) + + # 4. Distill — Agent Debugger produces structured reports + debugger_report = agent_debugger.run(cleaned, round_id) + + # 5. Evolve — Evolve Agent proposes edits with manifests + proposed_edits = evolve_agent.run(debugger_report, harness_path) + + # 6. Commit — snapshot harness + manifest + commit_snapshot(harness_path, round_id, proposed_edits) + return RoundResult.committed(rollout, proposed_edits) + + +def run_evolution(n_rounds: int = 5, compute_budget_usd: float = ...) -> EvolutionReport: + for round_id in range(n_rounds): + if projected_round_cost(round_id) > 1.5 * remaining_budget(): + return EvolutionReport.budget_exhausted(round_id) + result = run_round(round_id, harness_path, benchmark) + if result.is_rolled_back: + log_rollback(round_id, result.metrics) + return EvolutionReport.completed(rounds=n_rounds) +``` + +--- + +### Step 10.4 — Compute Budget Guardrails + +**Paper baseline (verified):** The AHE paper reports **~32 hours wall-time for 10 iterations on the 89-task Terminal-Bench 2 benchmark** — i.e., ~3.2 hr/round on 89 tasks, all three agents sharing GPT-5.4 high-reasoning. This is the only concrete cost data point the paper provides; it does not break down per-agent or per-token. + +**Our projection (50-task suite, scaled):** linear scaling by task count gives `~3.2 hr × (50/89) ≈ 1.8 hr/round` as an order-of-magnitude estimate. **However**, ML-optimization tasks have longer per-task rollouts than terminal coding tasks because real workloads (training, inference benchmarking) have non-trivial wall-time floors regardless of agent efficiency. **Expect higher than 1.8 hr/round in practice.** Treat this estimate as a lower bound until calibrated by Round 1. + +**Hard rule:** measure Round 1 wall-time and token spend before launching subsequent rounds. Abort if projected total cost exceeds 1.5× remaining budget. The paper's 32-hour figure is for one benchmark, one run — our compute model must include re-runs (failed attempts, calibration, transfer evaluation in Phase 11). + +```python +@dataclass +class RoundCostProfile: + wall_time_s: float + rollout_tokens: int # tokens consumed across all 50 task rollouts + debugger_tokens: int # tokens consumed by Agent Debugger (input + output) + evolve_tokens: int # tokens consumed by Evolve Agent + rounds_remaining: int + +def project_remaining_cost(round1: RoundCostProfile) -> float: + # Linear projection; revisit if rounds 2+ diverge significantly + return round1.wall_time_s * round1.rounds_remaining +``` + +**Why this matters:** the paper does not amortize evolution cost across user sessions. Our project follows the same convention (per Operating Mode A in `RESEARCH_AHE_ANALYSIS.md` §1.5) — evolution is a build-time expense, not a per-session expense. Budget accordingly. + +--- + +### Step 10.5 — Acceptance Workflow + +**Calibration note (our choice, not paper-mandated):** the paper used **N=10 rounds** and achieved **+7.3pp** on Terminal-Bench 2 (89 tasks, ~32 hours total). Our acceptance bar uses **N=5 rounds** and **≥+5pp** — lower on both axes. Rationale: + +- N=5 vs paper's N=10 → ~50% lower compute on the first attempt. If Round 5 result is still trending upward (positive slope across rounds 3-5), extend to N=10 in a second campaign. Better to spend half the compute, learn, then decide. +- +5pp vs paper's +7.3pp accounts for: narrower domain (ML optimization vs general coding), likely lower attribution precision in our first iteration, smaller suite (50 vs 89 tasks → less statistical power), and our tasks having higher wall-time per rollout. + +If Round 5 plateaus below +5pp, **investigate before extending** — the failure mode may be elsewhere: +- Scoring noise → revisit Phase 8 determinism (variance >2pp would mask gains) +- Attribution failure → revisit Phase 9 verifier metrics (regression precision/recall trending below paper's already-weak baseline of 11.8%/11.1%) +- Sub-additive interactions → stage component additions one at a time per AHE Table 3 + +End-to-end test: + +```text +1. Seed harness H₀ runs Phase 8 suite → baseline pass-rate P₀ +2. Run Phase 10 loop for N=5 rounds → harness H₁, H₂, ..., H₅ +3. Final harness H₅ pass-rate P₅ → measure +4. Verify P₅ ≥ P₀ + 5pp ← Acceptance (calibrated, see above) +5. Verify rollback rate over 5 rounds < 30% ← Acceptance +6. Verify all system_prompt edits gated ← Acceptance +7. Audit trail: round-N manifest → verifier metrics → next-round delta +``` + +**Acceptance Criteria:** +- 5-round loop completes within compute budget (Round 1 calibration per Step 10.4) +- Aggregate pass-rate gain ≥+5pp over seed (calibrated below paper's +7.3pp; see rationale above) +- Rollback rate <30% +- 100% of system-prompt slot edits validated against held-out subset (per paper's −2.3pp ablation warning) +- Full audit trail per round: rollout result → debugger report → evolve manifest → next-round verifier metrics +- Compute spend stays within 1.5× of pre-round projection +- If P₅ < P₀ + 5pp: investigation report before any extension, not blind continuation + +--- + +## Phase 11: Cross-Model Transfer Evaluation (AHE Stage I) +**Duration:** 1 week +**Goal:** Verify the auto-evolved harness from Phase 10 transfers to alternate base models. Per the AHE paper, weaker models often gain MORE from a well-evolved harness — validating that the harness encodes general engineering knowledge, not model-specific tricks. Our pass criterion is calibrated below the paper's because our domain (ML optimization) is narrower than terminal-bench. + +**Hard prerequisite:** Phase 10 complete with H_final pass-rate ≥+5pp over seed. + +--- + +### Step 11.1 — Transfer Test Harness + +**File to create:** `tests/optimization/transfer/run_transfer.py` + +Run final harness `H_final` on Phase 8 benchmark with: +- Original base model (control — should reproduce Phase 10 result within 1pp) +- ≥3 alternate models — at minimum: one stronger, one peer, one weaker + +For each alternate model, also run the **seed harness** as that model's baseline. Transfer gain = `H_final pass-rate − seed pass-rate` for THAT model. + +```python +def run_transfer_eval( + final_harness: Path, + seed_harness: Path, + benchmark: Benchmark, + models: list[str], +) -> TransferReport: + results = {} + for model in models: + seed_pr = benchmark.run_with_model(seed_harness, model).pass_rate + final_pr = benchmark.run_with_model(final_harness, model).pass_rate + results[model] = TransferDelta(seed=seed_pr, final=final_pr, delta=final_pr - seed_pr) + return TransferReport(results=results) +``` + +--- + +### Step 11.2 — Failure Mode Documentation + +For any model where transfer gain is negative or below threshold, document: +- Which task classes regressed +- Whether regression correlates with a specific NexAU slot (system prompt? tool description?) +- Whether it suggests a slot that was over-fit to the original base model + +This is empirical input for future evolve-loop tuning — not blocking for Phase 11 acceptance. + +**Acceptance Criteria:** +- ≥3 alternate models tested +- ≥2 of 3 show transfer gain ≥+3pp over their own seed-harness baseline +- Negative transfer cases documented with task-class breakdown +- Control re-run of original model reproduces Phase 10 result within 1pp + +--- + +## Missing Components — Implement Before MVP + +These were absent from the original phases but are required for the Definition of Done to be achievable. + +--- + +### MC-1 — Model Quality Evaluation Tool *(required: DoD specifies MMLU budget)* + +Perplexity (wikitext-2) is a poor proxy — the GPTQ paper (2210.17323 §4.3) documents cases where perplexity is unchanged but task accuracy drops 5%. The Definition of Done says "no more than 1% MMLU degradation" but no tool exists to measure it. + +**File to create:** `agent/tools/inference_opt/quality_eval.py` + +```python +_QUALITY_EVAL_SCRIPT = ''' +import subprocess, json, sys, glob + +MODEL = "{model_name}" +BENCHMARKS = {benchmarks} +NUM_FEWSHOT = {num_fewshot} + +subprocess.run([ + sys.executable, "-m", "lm_eval", + "--model", "hf", + "--model_args", f"pretrained={{MODEL}}", + "--tasks", ",".join(BENCHMARKS), + "--num_fewshot", str(NUM_FEWSHOT), + "--output_path", "/tmp/eval_results", +], check=True) + +results = {{}} +for f in glob.glob("/tmp/eval_results/**/*.json", recursive=True): + with open(f) as fp: + data = json.load(fp) + if "results" in data: + results.update(data["results"]) +print("EVAL_RESULT:" + json.dumps(results)) +''' +``` + +Tool spec: `evaluate_model_quality(model_name, benchmarks=["mmlu"], num_fewshot=5, hf_flavor="a100-large")` + +**Wire into Phase 4** after `quantize_model` so every quantization produces a quality delta against the baseline run. + +--- + +### MC-2 — Tool Suite Routing by `optimization_target` *(required before Phase 3)* + +With ~15 tools across all phases loaded simultaneously, the LLM's effective tool selection degrades. Route by `optimization_target` to expose only the relevant suite. + +**File to modify:** `agent/core/tools.py` + +```python +TOOL_SUITES: dict[str | None, list[str] | None] = { + "training": ["lookup_hardware_specs", "search_mlsys_papers", + "profile_training_mfu", "tune_parallelism_topology", + "apply_sequence_packing", "install_flash_attention", "setup_liger_kernels"], + "inference": ["lookup_hardware_specs", "search_mlsys_papers", + "profile_inference_latency", "quantize_model", "deploy_vllm", + "setup_speculative_decoding", "benchmark_serving", "evaluate_model_quality"], + "multimodal": ["lookup_hardware_specs", "search_mlsys_papers", + "profile_inference_latency", "compress_visual_tokens", + "quantize_model", "evaluate_model_quality"], + "vla": ["lookup_hardware_specs", "profile_action_latency", "setup_fast_slow_system"], + None: None, # None = load all tools (general mode, backward compatible) +} + +def create_builtin_tools(optimization_target: str | None = None) -> list: + all_tools = [...] # existing logic unchanged + suite = TOOL_SUITES.get(optimization_target) + if suite is None: + return all_tools + return [t for t in all_tools if t.name in suite] +``` + +--- + +### MC-3 — `model_name` Input Sanitization *(required before Phase 2)* + +Every profiling and quantization tool injects `model_name` into a Python script via `.format()`. A model name containing `"`, `;`, or newlines allows script injection into the HF Jobs sandbox. + +**Add to all tools that template model names** (`training_mfu.py`, `inference_latency.py`, `quantization.py`, `speculative_decoding.py`): + +```python +import re + +def _sanitize_model_name(model_name: str) -> str: + if not re.match(r'^[a-zA-Z0-9_\-\./]+$', model_name): + raise ValueError( + f"Invalid model name '{model_name}'. " + "Must contain only alphanumeric characters, hyphens, underscores, slashes, and dots." + ) + return model_name + +# Call at top of each handler: +model_name = _sanitize_model_name(args["model_name"]) +``` + +--- + +### MC-4 — Skill-Pack Knowledge Restructuring *(structural alternative to current Step 1.1)* + +The current `system_prompt_optimization_v1.yaml` is a ~200-line monolith loaded into every turn. As Phases 4–7 add architecture-specific guidance (H100 FP8 vs. MI300X, vLLM vs. SGLang, Triton vs. CUDA, transformers vs. diffusers integration), this prompt will balloon past 1500 lines — eating context budget and burying the workflow rules under reference data. + +**Alternative: skill-pack pattern** (proven by HuggingFace `kernels` skill, ~550-token core) + +```text +agent/skills/optimization/ +├── SKILL.md # Always loaded — workflow + mandates only (~600 tokens) +└── references/ # Loaded on demand via read_reference tool + ├── roofline-analysis.md # The detailed roofline section currently in the prompt + ├── mfu-interpretation.md + ├── parallelism-decision-tree.md + ├── bottleneck-taxonomy.md + ├── h100-guide.md # Per-architecture + ├── a100-guide.md + ├── mi300x-guide.md + ├── llm-architectures.md # Dense / MoE / multimodal / VLA decision logic + └── common-mistakes.md +``` + +A `read_reference(topic)` tool exposes references on demand. Estimated reduction: ~70% of always-on prompt tokens. + +**Trade-off:** +- **Pro:** smaller always-on context; one reference can be updated without retesting the whole prompt; clean composition with the Phase 7 `cuda-kernels` skill pack +- **Con:** extra tool-call hop; agent may forget to read a reference before deciding (mitigated by `SKILL.md` mandates that name the reference per workflow step) + +**Decision pending:** evaluate after Phase 1 lands. Migration trigger: if `system_prompt_optimization_v1.yaml` exceeds 800 lines OR if context budget pressure forces compaction more than once per non-trivial session. + +--- + +## Verification Checklist (Run After Each Phase) + +```bash +# After Phase 0 +uv run pytest tests/unit/ tests/optimization/test_config_optimization.py -q + +# After Phase 1 +uv run pytest tests/optimization/test_hardware_specs.py -q +python -c "from agent.core.tools import ToolRouter; \ + tr = ToolRouter({}); \ + names = [t.name for t in tr.tools.values()]; \ + assert 'lookup_hardware_specs' in names; \ + assert 'search_mlsys_papers' in names; \ + print('Phase 1 tools OK:', names)" + +# After Phase 2 +uv run pytest tests/optimization/test_profiling.py -q +python -c "from agent.core.tools import ToolRouter; \ + tr = ToolRouter({}); \ + names = [t.name for t in tr.tools.values()]; \ + assert 'profile_training_mfu' in names; \ + assert 'profile_inference_latency' in names; \ + print('Phase 2 tools OK')" + +# After Phase 3 +uv run pytest tests/optimization/test_parallelism_tuner.py -q + +# After Phase 6 +uv run pytest tests/optimization/ -q +# All optimization tests pass + +# Full regression +uv run pytest tests/ -q +# All tests (unit + optimization) pass +``` + +--- + +## Risk Register + +| Risk | Likelihood | Impact | Mitigation | +|---|---|---|---| +| HF Jobs profiling output parsing fails | Medium | Phase 2 blocked | Add fallback: parse raw logs with regex if JSON marker absent | +| Flash Attention install fails in HF Jobs | High | Phase 3 partial | Make FA optional; tool falls back to PyTorch native | +| Sandbox GPU memory too small for 7B profiling | High | Phase 2 partial | Use `meta` device for model load, only move to GPU for profiling step | +| Context compaction deletes experiment history | Low (mitigated) | Phase 6 critical | Covered by `persistent_state` in Step 0.4 | +| Tool count exceeds LLM context (too many tools) | Medium | Quality degradation | Group tools into suites; only load relevant suite based on `optimization_target` in config | +| Agent recommends multiple optimizations simultaneously | Medium | Unattributable results | Enforce in system prompt: "One technique per experiment" | +| MFU severely underestimated if profiler flops used | High | False "severe bottleneck" diagnoses | Fixed: use 6\*N\*B\*S theoretical estimate — see Step 2.1 | +| `model_name` user input injected into script template | Medium | Code injection in HF sandbox | Add `re.match(r'^[a-zA-Z0-9_\-\./]+$', model_name)` before `format()` call | +| FP8 via `torch_dtype=float8_e4m3fn` is a silent no-op | High | False quality pass, zero actual speedup | Fixed: use llm-compressor with calibration pass — see Step 4.1 | +| TTFT measured as `total_time / n_tokens` — wrong metric | High | Misleading latency baselines | Fixed: use prefill approximation (forward pass only) — see Step 2.2 | +| Optimizer memory underestimated (12 vs 18 bytes/param) | Medium | FSDP2 topology recommendations that OOM | Fixed: use 18 bytes for mixed-precision (default) — see Step 3.1 | +| No quality benchmark tool beyond perplexity | High | "1% MMLU budget" in DoD is unmeasurable | Add `evaluate_model_quality` tool — see Missing Components | +| `optimization_target=None` falls through to v3 general prompt | Medium | Wrong system prompt loaded in general mode | Wire config → ContextManager prompt selection explicitly | +| Vendor peak ≠ measured peak (thermal/MIG/power-cap) | High | False "severe bottleneck" diagnoses on hardware already at its real ceiling | Fixed: Cross-Cutting Rule 2 + Step 1.3 `measure_peak_throughput` cached in OptimizationContext | +| Component speedup reported without end-to-end verification | High | Agent claims wins (1.88× kernel) the user does not see (1.06× e2e) | Fixed: Cross-Cutting Rule 1 — every optimization tool returns BOTH numbers + Amdahl deviation check | +| Nsight requires `cap_sys_admin` not granted in default container | Medium | Step 2.3 silently fails or returns empty profiles | Document working HF Jobs flavor in tool description; surface privilege error verbatim, do not fall back silently | +| Custom kernel passes µbench but breaks at boundary cases | Medium (Phase 7) | Hub-published kernel corrupts inference at edge shapes | Tolerance check on a shape grid (small/medium/large + non-power-of-2) before publish; reject on any failure | +| Phase 7 invoked when Amdahl payoff < 2% e2e | Medium (Phase 7) | Wasted effort writing kernels that move no metric | Fixed: gate in Step 7.4 — `component_fraction × (component_speedup − 1) ≥ 0.02` required | + +--- + +## Definition of Done + +The agent is complete when it can execute this workflow end-to-end without human intervention: + +```text +User: "Optimize inference latency for Llama-3-8B on H100. + Quality budget: no more than 1% MMLU degradation." + +Agent: +1. lookup_hardware_specs("h100_sxm") → ridge_point=295 (theoretical), bandwidth=3350GB/s +2. measure_peak_throughput(expected_hardware="h100_sxm") + → measured: 3180 GB/s, 920 bf16 TFLOPS, ridge_point_measured=289 + → bandwidth_efficiency_vs_vendor = 0.95 (healthy, no warning) +3. profile_inference_latency("meta-llama/Llama-3-8B", batch_sizes=[1,4,16], use_measured_peak=True) → baseline +4. Classify: TTFT=340ms, TBT=18ms → memory-bandwidth-bound at batch_size=1 ✓ +5. Roofline: ~2 FLOPS/byte for decode << 289 measured ridge point → quantization is the correct lever +6. quantize_model("meta-llama/Llama-3-8B", method="fp8_dynamic") → perplexity Δ = -0.2% +7. evaluate_model_quality(quantized_model, benchmarks=["mmlu"]) → MMLU Δ = -0.4% < 1% budget ✓ +8. profile_inference_latency(quantized_model) + → component (TTFT): -47%, end_to_end (TTFT): -47%, deviation_from_amdahl_pct = 0.8% ✓ + → component (TBT): -50%, end_to_end (TBT): -50%, deviation_from_amdahl_pct = 0.6% ✓ +9. OptimizationContext.add_experiment(Experiment("e1", "fp8", ..., verdict="keep")) +10. Hypothesis: speculative decoding could further reduce TTFT +11. setup_speculative_decoding(target="quantized_llama3_8b", draft="llama3_1b") → +2.8x TTFT +12. Final report: Pareto frontier with 2 solutions, recommendation based on quality budget +``` + +**Optional Phase 7 extension** — fired only when Phase 4 plateaus (Pareto frontier saturated) and Nsight flags a hot kernel with headroom: + +```text +13. profile_with_nsight(profiler="ncu", kernel_filter="rms_norm") + → existing kernel: occupancy 38%, register spill, 36% of measured bandwidth peak +14. read_reference("h100-optimization-guide.md", "kernel-templates.md") +15. generate_cuda_kernel(op_spec="rmsnorm", target_arch="h100", dtype="bfloat16", + reference_op="F.rms_norm", pipeline_script="bench.py", + component_fraction=0.07) + → correctness ✓, isolated 1.94×, end_to_end 1.05×, Amdahl-consistent ✓ +16. publish_kernel_to_hub("user/llama3-rmsnorm-h100") +17. Updated Pareto frontier with custom-kernel solution +``` + +This workflow requires Phases 0–6 to be complete; Phase 7 is gated on measured kernel-level headroom and Amdahl-justified payoff. diff --git a/PLAN_V2.md b/PLAN_V2.md new file mode 100644 index 00000000..54e330f4 --- /dev/null +++ b/PLAN_V2.md @@ -0,0 +1,1533 @@ +# PLAN_V2 — Frontier-aligned production agentic system for ML lifecycle + +> **Status**: Revision **v7 (final)** (3-audit frontier verification pivot) as of 2026-05-03. Supersedes PLAN.md for sequencing; PLAN.md retained as deep-reference for optimization-vertical detail. +> +> **One-line v7 north star**: cosmos-lab is a frontier-aligned production agentic system shipping **5 agents** (1 PrincipalAgent supervisor + 4 specialty workers with distinct tool surfaces: Data, Eval, Train, Optimize) + **CodeWork Skill** + **3 offline tools** (GepaOptimizer, CapabilityProbe, CrossAgentEvaluator — explicitly NOT standing agents per frontier convergence) + **~16 production governance infrastructure components** (5 sentinel types incl. judge-hacking detector, cross-family MultiJudge, MCP OAuth+RFC 8693, hash-chained signed audit, OTel-GenAI, 4-scope hybrid memory via Mem0/Letta, Inspect AI bridge, **LangGraph durable substrate**, **Magentic-One Task/Progress Ledger pattern**), built on ml-intern's tool primitives. Deployed via `nat run cosmos-lab.yaml`. +> +> **Revision history (with audit-confirmed verdicts)**: +> - v3.1 / v3.2 / v4: 6 specialty agents on ml-intern (✅ correct on agent count, ⚠️ pre-frontier-audit) +> - v5 / v5.1: 1 PrincipalAgent collapse (❌ over-correction #1 — too few agents) +> - v5.2: 0 new agents, governance only (❌ over-correction #2 — too few agents, removed JD's required specialty agents) +> - v6: 6 specialty + 3 governance agents on ml-intern (⚠️ ~60% frontier-aligned per 3-audit verification, 6 specific issues) +> - **v7 (current, final)**: synthesizes 3 parallel audits of Anthropic + NVIDIA + LangGraph + Microsoft Agent Framework + 2026 production patterns. Fixes 6 v6 issues (Anthropic Skills convergence, GEPA offline-only, sentinels via PostToolUse hooks, 4-scope memory model, CapabilityProbe → CI/CD lane, drop "earned trust" framing). Adds 8 frontier patterns (LangGraph durable, Magentic-One ledgers, judge-hacking sentinel, cross-family MultiJudge, CodeWork Skill, RFC 8707+8693 day-one, reward-hack Pareto axis, CUDA versions in envelope). +> +> **Why v7 is final — confidence anchor**: 3 independent senior-engineer research agents conducted parallel audits of (a) Anthropic + NVIDIA frontier patterns, (b) 2026 multi-agent orchestration convergence, (c) production agent eval + governance + safety frontier. Each audit returned ~1500 words of findings with primary-source citations (Anthropic engineering blog, NVIDIA developer blog, GitHub repos, METR/UC Berkeley reward-hacking studies, MCP authorization spec, EU AI Act enforcement timeline, ICLR/NeurIPS papers with dates). All 3 audits converged on the same 6 fixes + 8 additions. v7 ships the synthesis. Future audit findings document as v1.1+ work, not v8 — process needs to converge. +> +> **Revision history (with honest postmortem)**: +> - v3.1: §0.6 unique value, §0.7 numerical targets, §3.1 sentinel taxonomy, P9b CodeAgent +> - v3.2: library architecture pivot (`pip install cosmos-lab[nat]`), P0.5 adapter phase, ~20 weeks +> - v4: production-grade pivot — P5.5 PyTorch Depth, expanded P10, Invariant 9, §0.65 Six Reference Agents, §0.8 Production Commitments, ~22.5 weeks. **CORRECT direction on agent count.** +> - v5: thesis pivot to "ONE exceptional autonomous PrincipalAgent" — **over-correction #1**: collapsed v4's 6 specialty agents into 1, framed as re-implementing what ml-intern has +> - v5.1: 2-layer architecture (cosmos-lab CLI + ml-intern Session) — same single-agent framing; better separation but still wrong agent count +> - v5.2: "0 new agents, governance only" — **over-correction #2**: removed the specialty agents the JD literally asks for ("agents doing real work — coding, eval, data gen, triage, experimentation, orchestration") +> - **v6 (current)**: synthesis of v4's correct agent count + v5/v5.1's production rigor + v5.2's leverage discipline. JD re-read confirmed: needs MULTIPLE SPECIALIZED AGENTS for ML lifecycle work. ml-intern's tool primitives are SUBSTRATE we use, not the agents themselves. +> +> **Why v6 — what v5/v5.1/v5.2 each got wrong**: +> - **v5/v5.1**: assumed "1 PrincipalAgent" matches reality of how principal engineers work. Wrong — JD describes specialty agents for different lifecycle stages (data gen, eval, surface failures, orchestration). +> - **v5.2**: assumed "ml-intern is the agent, we just add governance." Wrong — ml-intern is HF-stack-focused with generic tools; Cosmos team needs Cosmos-specialized agents (cosmos-curate, NeMo-RL, NIM, multimodal physics). ml-intern's primitives are useful BUILDING BLOCKS but not the specialty agents. +> +> **JD literal text confirms v6**: *"Create self-improving loops where agents help generate data, surface failures, evaluate outputs"* (multiple agents). *"Agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration"* (multiple specialty domains). v6 ships exactly this. +> +> **Schedule**: ~19 weeks (between v5.1's 22.5w and v5.2's 13w). Tighter than v5/v5.1 because we leverage ml-intern primitives for tools/sandbox/MCP/agent_loop building blocks (no re-implementation). Bigger than v5.2 because we restore the 9 agents the JD asks for. +> +> **North star**: An agentic ML lifecycle platform — `cosmos-lab` — where specialized agents collaborate over a shared trajectory store with closed-loop self-improvement. Optimization is *one vertical*, not the centerpiece. +> +> **Why this rewrite (v2)**: Original PLAN.md front-loaded optimization (P1-7) and back-loaded the meta-platform (P8-11). The NVIDIA Cosmos JD prioritizes the meta-platform. Re-prioritization brings high-signal Cosmos-aligned deliverables forward without losing optimization work. +> +> **Why v3 — 2026-frontier verification pass**: A senior-eng research sweep — citations independently re-verified via WebFetch — across NVIDIA's own 2025-2026 stack (NeMo Agent Toolkit `nvidia-nat` v1.6.0 released 2026-04-10, CLI `nat`; Cosmos Curator; NeMo-RL; OpenShell + NemoClaw alpha early-preview 2026-03-16; Cosmos Reason 2 / Predict 2.5 / Transfer 2.5), the converged agent orchestration field (Claude Agent SDK + OpenAI Agents SDK + LangGraph reserved for HITL only), 2026 eval credibility crises (UC Berkeley audit — 8 top agent benchmarks reward-hackable 73–100%; METR — o3/Claude 3.7 reward-hack 1–2% of attempts overall but 43× more on RE-Bench), OTel GenAI semantic conventions becoming the trace-schema standard, MCP authorization spec (OAuth 2.1 + RFC 8707/8693) + EU AI Act Art. 12 (currently enforceable **2026-08-02** for high-risk systems; Digital Omnibus negotiation may push to Dec 2027 — design for the earlier date), and the empirical refutation of multi-agent debate ([`arxiv:2508.17536`](https://arxiv.org/abs/2508.17536) — Choi/Zhu/Li) materially reshape 6 phases and force a P4 split into P4a/P4b. v3 deltas are summarized in §0.5; phase sections are updated in place. + +--- + +## 0. Strategic frame + +### JD → Vertical mapping + +| Cosmos JD bullet | Vertical | New phase | +|---|---|---| +| Data generation & curation | DataAgent | P3 | +| Evaluation platforms (auto + human + agent-driven) | EvalAgent | P1 (foundation) + P4 (platform) | +| Training orchestration | TrainOrchestrator | P5 | +| Multimodal pipelines | Cosmos vertical (Reason 2 / Predict 2.5) | P2 | +| Self-improving loops | Trajectory mining + GEPA-style revision | P1 (store) + P8 (loop) | +| Agentic workflows over codebases | Cross-vertical orchestration | P9 | +| Context compression / agent memory | OwnedContextManager + cross-session retrieval | P7 | +| Engineering excellence (testing, packaging) | Cross-cutting | All phases | +| Stand-out: agent identity / AuthN / AuthZ | CapabilityScopedRouter + AuditLog | P0 | + +### Architecture (target) + +``` + ┌──────────────────────────────────────────────┐ + │ Multi-Agent Orchestrator (P9) │ + └───┬───────────┬───────────┬───────────┬──────┘ + │ │ │ │ + ┌─────▼───┐ ┌─────▼─────┐ ┌───▼─────┐ ┌──▼──────────┐ + │DataAgent│ │TrainOrches│ │EvalAgent│ │OptimizeAgent│ + │ (P3) │ │ (P5) │ │ (P4) │ │ (P6) │ + └─────┬───┘ └─────┬─────┘ └───┬─────┘ └──────┬──────┘ + │ │ │ │ + └───────────┴─────┬─────┴──────────────┘ + │ + ┌─────────────▼────────────────┐ + │ Capability-Scoped ToolRouter│ ← P0 + │ + AuditLog + AgentIdentity │ + └─────────────┬────────────────┘ + │ + ┌─────────────▼────────────────┐ + │ TrajectoryStore (DuckDB) │ ← P1 + │ + Agent-as-Judge Harness │ + └─────────────┬────────────────┘ + │ + ┌─────────────▼────────────────┐ + │ Self-Improvement Loop (GEPA) │ ← P8 + │ trajectories → reflect → │ + │ revise prompts → re-eval │ + └──────────────────────────────┘ + +Cosmos vertical (P2): NIMProvider + Cosmos toolset wrappers (Reason 2 / Predict 2.5 / Transfer 2.5) + VideoUnderstandingAgent itself lives in P9 e2e demo +Memory layer (P7): cross-session retrieval, owned compression +``` + +### Invariants (carry from CLAUDE.md / WORKFLOW.md) + +1. **Zero-diff**: `git diff upstream/main --name-only` returns only owned paths. +2. **Baseline**: `pytest tests/unit/` matches baseline (currently 237 pass / 3 upstream-broken: `test_doom_loop.py::test_check_for_doom_loop_returns_corrective_prompt_for_identical_run`, `test_doom_loop.py::test_check_for_doom_loop_returns_corrective_prompt_for_cycle`, `test_sandbox_auto_start.py::test_prompt_and_tool_specs_do_not_require_cpu_sandbox_create`). Document any drift. +3. **Owned tests**: `pytest tests/optimization/` exits 0. +4. **One-optimization-per-experiment** + **measured-peak over vendor-peak** (from EVAL_SPEC.md) — applies once optimization vertical is live (P6). +5. **Trajectory-on-by-default**: from P1 onward, no agent run is unobserved. +6. **OTel-GenAI-on-by-default** (v3): from P1 onward every agent run, tool call, and judge invocation emits an `OpenTelemetry GenAI` span (`gen_ai.*` semantic conventions, opt-in stability flag). No vendor-proprietary trace schemas. +7. **No unverified judge claim** (v3): every judge pass-rate metric ships with bootstrap CI and an anti-reward-hacking sentinel pair (one structural verifier + one judge); judge-only scores are flagged in UI. +8. **Framework-agnostic by construction** (v3.2): cosmos-lab core (sentinels, identity, GEPA governance, quality budget) operates on *interfaces* (Inspect AI Task/Solver/Scorer, MCP authorization spec, DSPy Module, OTel GenAI semconv), not on a specific agent loop. Owning an agent loop is anti-pattern. +9. **No GPU phase exits without one measured real run** (v4): P5, P5.5, P6, and P9a each require at least one actual GPU workload with measured numbers committed to the repo. Mocked-only acceptance is a stop-the-line event. Budget envelope: ~$200-400 across Modal / Lambda Cloud / NIM free tier — small enough to self-fund, large enough to be honest. Mocked tests stay as the *fast feedback loop*; real runs are the *credibility gate*. + +--- + +## 0.4 Library architecture (v3.2) — cosmos-lab as a Python library, not a platform fork + +### Why library, not fork + +cosmos-lab's value prop (§0.6) is *governance and credibility*, not agent runtime. The 5 unique-value items (sentinel-gated judging, RFC 8693 sub-agent identity, GEPA promotion contract, quality budget invariant, `nat`-runnable workflow YAML) all operate on *interfaces* — Inspect AI's Scorer, MCP authorization, DSPy Module, OTel GenAI spans. None requires owning an agent loop. + +The 2026 agent framework field has converged on: NeMo Agent Toolkit, Claude Agent SDK, OpenAI Agents SDK, LangGraph (HITL only). Building yet another agent loop = anti-signal. Library architecture is the senior-engineering answer. + +**Analogy precedent**: `pytest` doesn't fork `unittest`. `dspy` doesn't fork `transformers`. `inspect-ai` doesn't fork any specific agent. cosmos-lab follows the same pattern — it's the *layer that makes self-improving agents safe to deploy*, plugged into whatever harness the user already runs. + +### Package layout + +``` +cosmos_lab/ # importable as `pip install cosmos-lab` +├── __init__.py # public API surface +├── identity/ # P0 — IDENTITY (shipped) +│ ├── identity.py # AgentIdentity, CapabilityDenied +│ ├── audit.py # AuditLog (P0 JSONL → P4b hash-chained signed) +│ └── router.py # CapabilityScopedRouter (composes any router) +├── trajectory/ # P1 — TRAJECTORY +│ ├── sink.py # TrajectorySink Protocol +│ ├── otel_emitter.py # OTelGenAIEmitter (gen_ai.* spans) +│ ├── duckdb_sink.py # opt-in analytics layer +│ └── hf_sink.py # opt-in HF flywheel +├── eval/ # P1 + P4a — EVAL +│ ├── judge.py, multi_judge.py # LLMJudge, MultiJudge (variance reduction) +│ ├── sentinels/ # §3.1 taxonomy (4 sentinel types) +│ │ ├── deterministic.py +│ │ ├── output_format.py +│ │ ├── side_effect.py +│ │ └── no_op.py # mandatory on every task +│ └── inspect_bridge.py # exposes scorers as Inspect AI Scorers +├── governance/ # P8 — SELF-IMPROVEMENT GOVERNANCE +│ ├── gepa_loop.py # wraps dspy.GEPA with promotion contract +│ ├── promotion.py # lower-CI-bound + sentinel + signed record +│ └── failure_mining.py # trajectory → failure clusters +├── memory/ # P7 — MEMORY (3-tier hierarchical) +├── providers/ # P2 — PROVIDERS +│ └── nim_provider.py # litellm custom provider for NVIDIA NIM +├── compute/ # P5 — COMPUTE BACKENDS +│ ├── backend.py # ComputeBackend Protocol +│ ├── hf_jobs.py, skypilot.py, nemo_run.py +├── sandbox/ # P6 — SANDBOX (E2B + Daytona, OpenShell P10) +└── harness/ # NEW IN P0.5 — adapter pattern + ├── nat.py # primary: registers cosmos-lab as nat plugins + ├── ml_intern.py # secondary: v1 compat shim (current code) + └── claude_sdk.py # future: Claude Agent SDK adapter +``` + +**Owned path**: `agent/optimization/` becomes the *implementation directory*; `cosmos_lab/` is the *importable surface*. They're the same code, exposed two ways. P0.5 sets up `pyproject.toml` so `from cosmos_lab.identity import AgentIdentity` and `from agent.optimization.identity import AgentIdentity` both work. + +### Adapter pattern — concrete contract + +Each harness adapter is ≤ 200 LOC and provides three things: + +1. **Tool registration**: surfaces cosmos-lab's `CapabilityScopedRouter` and any cosmos-lab tools (e.g., `cosmos_reason`, `cosmos_predict`) to the host harness's tool registry. +2. **Span correlation**: ensures cosmos-lab's `OTelGenAIEmitter` spans nest correctly under the host harness's parent agent span (so a Phoenix trace shows one tree, not two). +3. **Lifecycle wiring**: hooks `TracedSession`-equivalent behavior into the host harness's run lifecycle (start, step, end, error). + +```python +# cosmos_lab/harness/nat.py — primary harness adapter (sketch) +from aiq.builder import register_function, register_telemetry_exporter +from cosmos_lab.identity import CapabilityScopedRouter +from cosmos_lab.trajectory import OTelGenAIEmitter +from cosmos_lab.eval.sentinels import compose_sentinel + +def install_into_nat(builder: "aiq.Builder", identity: "AgentIdentity") -> None: + """One call inside any nat workflow YAML to install cosmos-lab governance.""" + register_telemetry_exporter(builder, OTelGenAIEmitter(...)) + builder.wrap_router(lambda r: CapabilityScopedRouter(r, identity, audit_log)) + builder.register_lifecycle_hook("on_step", _emit_genai_step_span) + # Inspect AI bridge auto-discovered from cosmos_lab.eval.inspect_bridge +``` + +```python +# cosmos_lab/harness/ml_intern.py — v1 compat (current code refactored) +from agent.core.session import Session +from cosmos_lab.identity import CapabilityScopedRouter +from cosmos_lab.trajectory import OTelGenAIEmitter + +class TracedSession(Session): + """ml-intern adapter — what we have today, refactored to use library imports.""" + def __init__(self, *args, identity, audit_log, **kwargs): + super().__init__(*args, **kwargs) + self.tool_router = CapabilityScopedRouter(self.tool_router, identity, audit_log) + self._otel = OTelGenAIEmitter(...) +``` + +**Acceptance contract for adapters** (P0.5): +- Both adapters registered as `cosmos_lab.harness` entry points in `pyproject.toml` +- Same Phase 0 test suite (16 tests) runs against both adapters → both must pass +- One smoke test per adapter that runs a 3-step trivial agent and verifies (a) capability denial works, (b) OTel span emitted with correct parent_id, (c) sentinel evaluator returns expected verdict + +### What the user does + +```bash +pip install cosmos-lab # library only +pip install cosmos-lab[nat] # + nvidia-nat adapter (recommended) +pip install cosmos-lab[ml-intern] # + ml-intern adapter (HF stack) +pip install cosmos-lab[all] # + all adapters +``` + +```yaml +# nat workflow YAML — Cosmos team can run cosmos-lab in their stack with one block +general: + telemetry: + exporter: + _type: cosmos_lab.OTelGenAIEmitter +governance: + identity: + _type: cosmos_lab.AgentIdentity.scoped + capabilities: [read_file, run_inspect_eval] + sentinels: [no_op, output_format] +workflow: + _type: nat.react_agent + llm_name: nim-llama-3-70b + tools: [...] +``` + +### Trade-off honestly + +- **Cost**: ~4 days of P0.5 work (restructure + 2 adapters + dual-adapter test). Phase 0 code does not change semantically — just imports + packaging. +- **Benefit**: signals architectural maturity; trims 4-5 weeks across P1/P2/P5/P6/P8 by inheriting `nat` plumbing; future-proofs against framework churn; preserves HF flywheel via `ml-intern` adapter. + +--- + +## 0.4.5 Two-layer architecture (v5.1 — final decision after agent_loop audit) + +> **Audit finding (2026-05-03)**: a senior-engineering audit of `agent/core/agent_loop.py:1771` revealed that ml-intern's `submission_loop` is queue-based (`asyncio.Queue` for submissions in + events out), not function-based. Embedding it as a *runtime* substrate inside another orchestrator requires a 1-2 week async-bridge engineering effort that was not in the v5 budget. v5 implicitly assumed a 3-layer runtime (nat → cosmos-lab → ml-intern); the audit shows that's harder than the architecture diagrams suggested. v5.1 commits to the simpler answer. + +### The 2-layer decision + +``` +┌────────────────────────────────────────────────────────────┐ +│ cosmos-lab CLI (PRIMARY entry point) │ +│ > cosmos-lab principal --task │ +│ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ cosmos_lab.principal.PrincipalAgent (orchestrator) │ │ +│ │ - PLAN/EXECUTE/VERIFY/REPLAN long-horizon loop │ │ +│ │ - 3-tier memory (working/episodic/semantic) │ │ +│ │ - capability expansion (RFC 8693) │ │ +│ │ - sub-agent spawning (§3.2.8) │ │ +│ └──────────────────────────────────────────────────────┘ │ +│ │ │ +│ │ for each milestone, constructs │ +│ │ a fresh ml-intern Session, │ +│ │ installs governance, executes. │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ ml-intern Session (execution SUBSTRATE) │ │ +│ │ - agent.core.agent_loop.submission_loop │ │ +│ │ - 16 built-in tools + MCP │ │ +│ │ - sandbox, doom-loop detection, cost estimation │ │ +│ │ │ │ +│ │ Wrapped by cosmos_lab.harness.ml_intern adapter │ │ +│ │ (D2 — installs CapabilityScopedRouter) │ │ +│ └──────────────────────────────────────────────────────┘ │ +└────────────────────────────────────────────────────────────┘ + +[Deployment wrappers — thin, P10 deliverable]: + • nat workflow YAML (Cosmos pitch) + └─► invokes `cosmos-lab principal --task` as a nat tool + • Modal / HF Spaces endpoint (production deploy) + └─► wraps `cosmos-lab principal` as HTTP service + • `pip install cosmos-lab` (standalone library) +``` + +### Separation of concerns (the load-bearing rationale) + +| Layer | Responsibility | Why this layer owns it | +|---|---|---| +| **cosmos-lab CLI + PrincipalAgent** | Long-horizon planning across multiple tasks; sentinel-gated replanning; capability expansion; sub-agent spawning; memory across sessions | Long-horizon = above any single ml-intern Session; orchestration is cosmos-lab's IP | +| **ml-intern Session** | One task per session — read goal, ReAct loop, return result | ml-intern is debugged single-task ReAct (1626L); reuse, don't reimplement | +| **nat (P10 deployment wrapper)** | Cosmos team can `nat run cosmos-lab.yaml` to invoke the CLI from their stack | Deployment surface only — no runtime hot path | + +### Why we explicitly REJECTED 3-layer runtime + +Three concrete reasons: + +1. **`submission_loop` is queue-based (agent_loop.py:1771)**: takes `asyncio.Queue` for submissions in + events out. To embed as substrate, PrincipalAgent must act as both "user" (push submissions) AND "UI" (drain events + translate event types into PLAN/EXECUTE/VERIFY signals). That's a real bridge layer — 1-2 weeks of async-coordination engineering, with known subtle-bug risk. + +2. **nat-at-runtime solves the wrong problem**: Cosmos pitch credibility only requires (a) we CAN run in their stack and (b) our code uses their patterns (OTel-GenAI, MCP-OAuth, NeMo-RL wrapping). Both are achievable with nat as deployment wrapper at P10. We don't need PrincipalAgent INSIDE a nat workflow at runtime. + +3. **Complexity budget**: 1-2 weeks of bridge work would come from sentinel taxonomy / 3-tier memory / capability expansion / 6 capability domains / AGENTIC_EVAL_SPEC surfaces / real GPU runs. Those are all higher value for the Cosmos pitch than nat-at-runtime. Honest tradeoff: ship deeper capability with simpler architecture. + +### What this preserves (everything important) + +- ✅ ml-intern leverage — Session is the execution substrate; D2 adapter (already shipped) installs governance +- ✅ PrincipalAgent autonomy — long-horizon loop runs in cosmos-lab CLI, not constrained by single Session lifetime +- ✅ Sub-agent spawning (§3.2.8) — PrincipalAgent spawns sub-agents that each construct their own ml-intern Session with scoped router +- ✅ Cosmos pitch — nat wrapper at P10; runs in their stack +- ✅ All 9 invariants, 34 numerical targets, 6 capability domains, 22.5-week schedule +- ✅ All commits already shipped (P0, P0.5 D1, P0.5 D2, AGENTIC_EVAL_SPEC, v5 thesis) + +### What this changes for upcoming work + +- **P0.5 D3 reframed**: nat adapter scope reduced from "primary harness wrapping cosmos-lab governance into nat builder" → "lightweight `register_as_nat_tool()` shim so nat workflow can invoke `cosmos-lab principal --task` as a tool." ~50 LOC instead of ~200 LOC. ~1 hour instead of ~3 hours. +- **P0.5 D4 reframed**: dual-adapter test matrix tests Session-based execution (D2 adapter) + CLI wrapper invocation (D3). Same shape as before, simpler implementation. +- **§3.2 PrincipalAgent architecture**: clarify that PrincipalAgent constructs/uses ml-intern Sessions PER MILESTONE; PrincipalAgent's PLAN/EXECUTE/VERIFY/REPLAN happens at the level of "what task to give the next Session," not inside a Session's ReAct loop. + +### Net effect + +- **Plan complexity**: lower +- **Cosmos pitch**: same (nat wrapper at P10 = "we run in your stack") +- **ml-intern leverage**: same (D2 adapter installs governance into Sessions) +- **PrincipalAgent capability**: same (long-horizon orchestration above the Session level — actually CLEANER conceptually) +- **Schedule**: ~1.5-2 weeks of bridge work avoided; banked as risk buffer +- **Anti-pattern avoided**: building 3 layers when 2 do the job (workflow anti-pattern #4 generalized) + +--- + +## 0.5 v3 deltas — what 2026 evidence forced us to change + +A 2026 SOTA verification pass produced eight load-bearing changes (rows 1-8 below) plus one architectural pivot in v3.2 (row 9). Each is grounded in a public artifact (paper / repo / blog / spec) with the date. + +| # | Change | 2026 evidence | Phase touched | +|---|---|---|---| +| 1 | **Mirror NeMo Agent Toolkit (`nvidia-nat`, CLI `nat`) config schema and plugin/decorator tool registry**; expose cosmos-lab as a runnable `nat`-compatible workflow YAML | NeMo Agent Toolkit **v1.6.0 released 2026-04-10** (NVIDIA, formerly AIQToolkit / AgentIQ; PyPI = `nvidia-nat`, CLI = `nat`); see [docs](https://docs.nvidia.com/nemo/agent-toolkit/latest/index.html), [repo](https://github.com/NVIDIA/NeMo-Agent-Toolkit) | P0, P1, P9 | +| 2 | **Emit OpenTelemetry GenAI (`gen_ai.*`) spans natively** for runs/tool-calls/judges; default sink = Phoenix, Langfuse/W&B/Weave swap-in by config (mirrors `nvidia-nat` telemetry block) | OTel GenAI semconv (experimental, opt-in via `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental`, [spec](https://opentelemetry.io/docs/specs/semconv/gen-ai/)); Datadog/Grafana/Phoenix all native by Q1 2026 | P1 (replaces "DuckDBSink as primary schema") | +| 3 | **Drop the "DebatingJudgePanel" framing.** Multi-judge stays for variance reduction; debate dynamics removed; *anti-reward-hacking sentinels* added (one structural verifier per task) | "Debate or Vote" ([arxiv:2508.17536](https://arxiv.org/abs/2508.17536) — Choi/Zhu/Li, 2025) — majority voting alone explains most of MAD's gain across 7 NLP benchmarks, debate is a martingale; [UC Berkeley audit (2026)](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) — 8 top agent benchmarks (SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench + one more) all exploitable, exploit rates 73–100%; [METR (2025-06-05)](https://metr.org/blog/2025-06-05-recent-reward-hacking/) — o3 reward-hacks 1–2% of all task attempts overall, 43× more often on RE-Bench than HCAST, and **every trajectory** on one specific RE-Bench task eventually hacks; same behavior observed for Claude 3.7 Sonnet and o1 | P1, P4a | +| 4 | **Build evals on Inspect AI** (UK AISI Task/Solver/Scorer + Docker sandbox + log viewer) instead of a custom harness; ship our seed tasks as Inspect tasks | Inspect AI is the de-facto production eval standard (METR uses it); [docs](https://inspect.aisi.org.uk/) | P1, P4 | +| 5 | **Identity v2: MCP OAuth 2.1 + RFC 8707 (Resource Indicators) + RFC 8693 (token exchange) for sub-agent scope-down + hash-chained signed audit log** aligned to EU AI Act Art. 12 (currently enforceable **2026-08-02** for high-risk systems; ⚠️ note Digital Omnibus negotiation may push to Dec 2027 — design for the earlier date) | [MCP authorization draft](https://modelcontextprotocol.io/specification/draft/basic/authorization); [EU AI Act Art. 12](https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-12) | P0 (AuthZ MVP shipped); **NEW P4b** (MCP-OAuth + signed log — graduated from P4 per scope review) | +| 6 | **Centaur HPO (LLM proposes, CMA-ES refines) — not pure-LLM HPO** | "Can LLMs Beat Classical HPO?" ([arxiv:2603.24647](https://arxiv.org/abs/2603.24647), 2026) — pure-LLM HPO loses to CMA-ES/TPE; Centaur is the credible hybrid | P5 | +| 7 | **Wrap NeMo Curator + cosmos-curate stages**; do not reimplement video curation; **wrap NeMo-RL** for any post-training; **wrap SkyPilot Job Groups + NeMo-Run + HF Jobs** as ComputeBackends | [NeMo Curator 26.04](https://github.com/NVIDIA-NeMo/Curator), [cosmos-curate](https://github.com/nvidia-cosmos/cosmos-curate) (20M video-hr in 14 days on Blackwell), [NeMo-RL](https://github.com/NVIDIA-NeMo/RL) (Nemotron-3-Super post-trained on it, Mar 2026), [SkyPilot v0.12 Job Groups](https://github.com/skypilot-org/skypilot) | P3, P5, P6 | +| 8 | **Sandbox 2-tier**: E2B (Firecracker, CPU correctness) + **Daytona** for GPU profiling/training. New `agent/optimization/sandbox/` interface, `nat`-compatible. **v3.1 cut**: NemoClaw moved off P6 main path → **P10 stretch only** because it's alpha-stage early-preview (2026-03-16) and gating P6 acceptance on it is a credibility risk. NVIDIA OpenShell remains a `SandboxRunner` candidate; lands in P10 alongside NemoClaw. | [E2B vs Daytona 2026](https://northflank.com/blog/daytona-vs-e2b-ai-code-execution-sandboxes); [NVIDIA OpenShell repo](https://github.com/NVIDIA/OpenShell); [NemoClaw repo (alpha)](https://github.com/NVIDIA/NemoClaw); [OpenShell+NemoClaw blog](https://developer.nvidia.com/blog/build-a-secure-always-on-local-ai-agent-with-nvidia-nemoclaw-and-openclaw/) | P6 (Daytona); P10 (NemoClaw stretch) | +| **9** | **Architectural pivot — cosmos-lab is a Python LIBRARY, not a fork.** `pip install cosmos-lab[nat]` plugs into NeMo Agent Toolkit as primary harness from P1 onward; `[ml-intern]` extra preserves Phase 0 work + HF flywheel via compat adapter. Library never owns the agent loop — that's what makes it portable. **NEW P0.5 (4 days)** restructures Phase 0 code into library form + writes the two adapters; same 16 Phase 0 tests run against both adapters. | First-principles audit (v3.2): cosmos-lab's value prop (§0.6) is *governance*, not *runtime*. Sentinels operate on Inspect AI Scorer interface. Identity operates on MCP authorization spec. GEPA operates on DSPy Module. Quality budget is an architectural invariant. None require owning an agent loop. 2026 agent-framework field has converged (NeMo Agent Toolkit + Claude Agent SDK + OpenAI Agents SDK + LangGraph for HITL); building yet another agent loop = anti-signal. Library precedent: `pytest`, `dspy`, `inspect-ai`. | NEW P0.5 + reframes P1-P10; trims plan 24 → ~20 weeks | +| **10** | **Production-grade pivot (v4)** — closes 5 gaps the v3.2 audit found: (1) PyTorch depth absent → **NEW P5.5** (1w) custom autograd op + profiler-driven kernel selection on real workload; (2) every phase mockable → **NEW Invariant 9** (no GPU phase exits without measured real run, ~$200 budget); (3) `pip install` is publication not deployment → **expanded P10** to 2w with real production deployment on HF Spaces / Modal + 1-week trace gather; (4) multimodal only mocked → **P3 reframed** to require real video sample (10-100 hours through cosmos-curate); (5) OSS impact = own library only → **P10 commits** to one upstream PR to nvidia-nat or Inspect AI for sentinel pattern. Plus **§0.65 NEW** Six Reference Agents matrix for agents-first visibility. | NVIDIA Cosmos JD demands "deep PyTorch familiarity," "multimodal pipelines including deployment," "agent-based systems doing real work," "impactful OSS contribution." A plan that's fully mockable + ships only its own library + has no PyTorch chops fails the L6 bar regardless of architecture cleanness. Production-grade ≠ feature-rich; production-grade = *runs in front of real users with measured numbers and a rollback plan*. | NEW P5.5 + expanded P10; reframes P3; adds §0.65 + §0.8 + Invariant 9; plan 20 → ~22.5 weeks | +| **11** | **Autonomous principal-agent thesis pivot (v5)** — collapses v4's "6 thin orchestrator agents on a governance library" → **ONE PrincipalAgent demonstrating 6 capability domains**, with library + sentinels + identity reframed as *enablers of autonomy* (not constraints). Sentinels become tripwires for replanning. Identity capabilities expand with earned track record. GEPA becomes agent self-improvement (retroactive human review). Built on ml-intern's `agent_loop.py` substrate. **NEW §0.9 Autonomous Principal Agent thesis**, **NEW §3.2 PrincipalAgent architecture**, **§0.65 reframed** (six agents → six capability domains of one agent). | The v4 framing "we built a governance library wrapping other people's agents" *under-delivered* on JD's literal asks: "strong agency," "code agents doing real work," "AI helps build them." A NVIDIA Cosmos reviewer comparing cosmos-lab against 2026 production autonomous agents (Devin / Operator / Cursor Composer / Claude Code) saw v4 as conservative governance theater — clever judgment, weak capability. The 2026 agentic frontier is autonomous capability MADE SAFE by governance, not governance INSTEAD OF capability. v5 inverts the hierarchy: PrincipalAgent is the product; harness + sentinels + identity exist to make autonomy exceptional, not to substitute for it. Real principal engineers have one self with broad skills, not six narrow specialists — PrincipalAgent models that reality. | Reframes §0.6 + §0.65; adds §0.9 + §3.2; phase narratives shift from "ship N agents" to "PrincipalAgent demonstrates capability N"; ml-intern `agent_loop.py` graduated from compat shim to primary substrate; weeks unchanged (~22.5w) — depth shifts from breadth-across-agents to depth-per-capability | +| **12** | **Honest leverage pivot (v5.2)** — audit of ml-intern revealed it's already a fully autonomous ML engineering agent (system_prompt_v3.yaml: *"fully autonomous — research, validate, implement, and deliver results"*) with planning (`agent/tools/plan_tool.py`), sub-agent spawning (`agent/tools/research_tool.py`: *"Research subagent tool — spawns a cheap LLM call with a focused research task"*), 20+ ML tools (jobs, datasets, papers, github, hf_repo, sandbox, notebook, ...), doom-loop detection, cost tracking, HF Jobs/Hub/Spaces integration. v5/v5.1 planned to re-implement these under `cosmos_lab/principal/` — clear duplication. v5.2 returns to v4's correct framing direction (governance layer) with v5's production rigor: cosmos-lab adds the **10 governance components** ml-intern doesn't have (sentinels, cross-session memory, RFC 8693 expansion, signed audit, OTel-GenAI, GEPA, MultiJudge, Inspect AI, PR-gating, AGENTIC_EVAL_SPEC discipline). | The v5/v5.1 pivot was over-correction. v4's governance-layer framing was directionally right but I criticized it as "weak capability" without realizing the autonomous agent ALREADY EXISTS in ml-intern. The right product is governance layer ON TOP of the autonomous agent — not replacement. 2026 reality: autonomous agents are commoditizing (Devin / Operator / Claude Code / ml-intern); production governance is the unmet need. Anti-pattern #4 (workflow): "Building a pipeline that should have been one model call" → generalized: "Building a 5000-LOC PrincipalAgent re-implementation that should have been a governance wrapper around an existing autonomous agent." | Header reframed (governance layer); §0.6 reframed (10 governance items); §0.65 reframed (6 governance enhancements, not 6 PrincipalAgent capabilities); §0.9 simplified (ml-intern is the agent); §1 phase table compressed 22.5w → ~13w; §3.2 reframed (cosmos-lab governance architecture, not PrincipalAgent re-implementation); all shipped code (P0, P0.5 D1/D2/D3, AGENTIC_EVAL_SPEC) preserved AS-IS — they are the governance foundation. | +| **13** | **Restore specialty agents pivot (v6)** — v5.2's "0 new agents, just governance" was over-correction #2. JD re-read carefully: *"Create self-improving loops where agents (plural) help generate data, surface failures, evaluate outputs"* + stand-out *"agent-based systems doing real work: coding, eval, data gen, triage, experimentation, orchestration"* — describes MULTIPLE SPECIALTY AGENTS for different lifecycle stages. ml-intern's tools are HF-generic (good for HF use); Cosmos team needs Cosmos-specialized agents (cosmos-curate, NeMo-RL, NIM, multimodal physics, real video pipelines). v6 restores **6 specialty agents** (DataAgent / EvalAgent / TrainOrchestrator / OptimizeAgent / MultimodalPipelineAgent / CodeAgent) + **3 governance agents** (GepaOptimizer / CapabilityProbe / CrossAgentEvaluator) + ~16 infrastructure components, **built on ml-intern's tool primitives** (agent_loop blocks, 16 generic tools, sandbox, MCP, cost estimation, doom-loop) used as **SUBSTRATE not as the agents themselves**. | v5.2 conflated "ml-intern has tools and an agent loop" with "ml-intern is the agents we need." Wrong inference. ml-intern provides building blocks; cosmos-lab specializes them into Cosmos-aligned agents that the JD literally asks for. v4 was directionally right on agent count (6 specialty); v5/v5.1 over-collapsed; v5.2 over-removed. v6 is the synthesis: 6 specialty + 3 governance agents + leverage discipline (use ml-intern primitives, don't reimplement) + production rigor (real GPU, OSS PR, AGENTIC_EVAL_SPEC, sentinel taxonomy). | Header reframed (Cosmos-specialized agents + governance); §0.6 reframed (vs assembled OSS — 9 agents + governance); §0.65 reframed (6 specialty + 3 governance = 9 agents); §0.9 reframed (cosmos-lab builds agents on ml-intern primitives); §1 phase table — schedule ~19w with 9-agent reality; all v5.2 shipped code (P0, P0.5 D1/D2/D3, AGENTIC_EVAL_SPEC) preserved AS-IS — they are the foundation specialty agents will use. | +| **14** | **Frontier-audit pivot (v7 — final)** — 3 parallel senior-engineer research agents audited (a) Anthropic + NVIDIA 2026 patterns, (b) 2026 multi-agent orchestration convergence (LangGraph + AutoGen→MAF migration + OpenAI Agents SDK + Mastra), (c) 2026 production agent eval + governance + safety frontier. All 3 converged on 6 specific v6 misalignments + 8 frontier additions. **6 fixes**: (1) Anthropic Skills blog explicitly rejects per-domain agents → collapse 6 specialty → 4 specialty workers (distinct tool surfaces) + CodeWork Skill + 1 PrincipalAgent supervisor; (2) GEPA as standing agent has no production precedent (Decagon ships offline only) → demote GepaOptimizer to offline batch tool; (3) "Sentinel-trip → replan" not in production → implement sentinels via Anthropic PostToolUse hooks contract; (4) 3-tier memory hierarchy is research, not convergent → switch to 4-scope hybrid (user/agent/session/org) via Mem0 or Letta; (5) Standing co-resident CapabilityProbe poisons trace store → move to CI/CD eval lane via Inspect AI snapshots; (6) "Earned-trust capability expansion" is custom semantics over RFC 8693 → ship standard delegation (table stakes per MCP 2026-03-15 spec, 86% enterprise adoption), drop escalation framing. **8 additions**: LangGraph durable substrate (Uber/JP Morgan/BlackRock production winner), Magentic-One Task Ledger + Progress Ledger pattern (2-iteration stall detection), 5th sentinel type JudgeHackingCheck (Gaia2 finding: agents make verifier-pleasing artifacts without solving task), cross-family MultiJudge (3× Sonnet correlates errors; add 1× non-Anthropic), CodeWork as Skill not agent (commodity tools), RFC 8707 Resource Indicators day-one (MCP mandate), reward-hack rate as Pareto axis in S6, CUDA/cuDNN/driver versions in reproducibility envelope. | v6 was ~60% frontier-aligned per audit. Cosmos pitch credibility requires ≥90% frontier alignment — reviewer will check architecture against Anthropic engineering blog + LangGraph docs + Microsoft Agent Framework + Inspect AI patterns. Better to pivot 14th time than commit 17 weeks of work to known frontier-misalignment. v7 IS final — process has converged via independent audit triangulation. Future findings = v1.1 work. | Header v6→v7 (frontier-aligned production system); §0.5 row 14 NEW (this row); §0.6 reframed (5 production agents + Skills + offline + frontier patterns); §0.65 reframed (production fleet + Skills + offline tools framing); §0.9 reframed (LangGraph + Magentic-One + Skills + Anthropic hooks); §1 phase table ~19w → ~21w (LangGraph integration + PrincipalAgent + 5th sentinel + Magentic-One ledger); §3.1 sentinel taxonomy 4→5 types; §3.2 PrincipalAgent architecture (LangGraph supervisor + Magentic-One ledger pattern). All shipped code (P0, P0.5 D1/D2/D3/D4) preserved — they are the substrate. | + +**Net pitch (v3.2)**: cosmos-lab is a `pip install`-able Python library (`pip install cosmos-lab[nat]`) that adds governance — sentinel-gated judging, MCP-OAuth identity with RFC 8693 sub-agent scope-down, GEPA promotion contracts, quality-budget invariants — to NeMo Agent Toolkit (primary) or ml-intern (compat). It emits OTel GenAI traces into Phoenix/Weave/Langfuse, evaluates on Inspect AI with anti-reward-hacking sentinels, executes on a 2-tier sandbox, post-trains via NeMo-RL, curates data via cosmos-curate stages. **Cosmos team will recognize every interface boundary AND the architectural maturity of library-vs-fork separation.** + +**What v3.2 explicitly does not change**: the zero-diff invariant, the `agent/optimization/` ownership tree (it's still where files physically live; `cosmos_lab/` is the importable surface re-exposing them), Phase 0 deliverable *semantics* (`OptimizationConfig`, `AgentIdentity`, `AuditLog`, `CapabilityScopedRouter`, 16 passing tests — only `import` lines change in P0.5). v3.2 *does* trim plan from 24 → ~20 weeks by inheriting nat plumbing; banked weeks become risk buffer + v1.1 hardening (KMS, NemoClaw, OpenShell stretches, additional harness adapters). + +--- + +## 0.6 What only cosmos-lab does — frontier-aligned production system (v7) + +A Cosmos reviewer in 2026 will reasonably ask: *"What does cosmos-lab build that I can't get by combining LangGraph + Inspect AI + DSPy + Anthropic Skills + Mem0?"* The answer is sharp: **a production agentic system that synthesizes 2026 frontier patterns specifically for Cosmos team's ML lifecycle work**, with the integrations + governance + Cosmos vertical specialization that no single OSS project ships end-to-end. + +### Production fleet — 5 agents (v7 — frontier-aligned) + +Per Anthropic Skills convergence + Magentic-One Task/Progress Ledger pattern + LangGraph supervisor pattern (all 2026 frontier-validated). + +| # | Agent | Role | Distinct tool surface (passes "specialty boundary" test) | Frontier pattern | +|---|---|---|---|---| +| 1 | **PrincipalAgent** (P3) | Supervisor orchestrator | LangGraph supervisor + Magentic-One Task Ledger (facts+plan) + Progress Ledger (step tracking with 2-iteration stall detection) + Skills loader | Hierarchical orchestrator-worker (Anthropic Multi-Agent Research, Magentic-One, LangGraph supervisor — all 2026 production) | +| 2 | **DataAgent** (P4a) | Worker | cosmos-curate Ray pipeline + NeMo Curator stages + Cosmos Predict for synthetic data gen — distinct enough from other workers | Magentic-One worker pattern (FileSurfer/WebSurfer/Coder analog) | +| 3 | **EvalAgent** (P5) | Worker | Inspect AI Tasks + 5-type sentinel suite + cross-family MultiJudge — distinct eval surface | Inspect AI standard substrate | +| 4 | **TrainOrchestrator** (P5) | Worker | NeMo-RL + SkyPilot/HF Jobs + ComputeBackend Protocol — distinct training surface | nat plugin pattern + production training orchestration | +| 5 | **OptimizeAgent** (P6) | Worker | profiler + kernel selector + torch.compile + sandbox 2-tier — distinct optimization surface | Production optimization pattern | + +### Skills (loaded by PrincipalAgent — Anthropic Skills pattern, NOT separate agents) + +| Skill | Loaded when | Tools | +|---|---|---| +| **CodeWork** (P7) | Bug fixes, code generation tasks | read_file, write_file, run_tests, git_diff in E2B sandbox (commodity tools — Skill is correct shape per Anthropic 2026 blog) | +| (others added as identified during P3-P10) | | | + +### Offline tools — NOT in production fleet (frontier-validated) + +These run as scheduled batch jobs or CI/CD eval lane, **not as standing agents**, per audit findings (no production team ships them as standing agents): + +| Tool | Cadence | What it does | Why offline (frontier evidence) | +|---|---|---|---| +| **GepaOptimizer** | Monthly cron | Mine failure clusters → DSPy GEPA reflective text evolution → A/B test → signed promotion | Decagon ships GEPA offline only; no production deployment as standing agent (audit finding) | +| **CapabilityProbe** | CI/CD on capability expansion events | 50+ adversarial probe tasks via Inspect AI snapshots of orchestrator | METR pattern — runs against snapshots, not co-resident; co-resident would poison trace store | +| **CrossAgentEvaluator** | Quarterly | Spawn cosmos-lab vs Devin vs Claude Code vs human on identical task → Pareto frontier with reward-hack rate axis | Inspect AI cross-agent comparison pattern (HAL/HOLISTIC AGENT/SWE-Compass) | + +### Production governance infrastructure — ~16 components (built on frontier substrates) + +| Category | Components | Frontier substrate / pattern | +|---|---|---| +| **Identity** | `AgentIdentity`, `CapabilityScopedRouter`, `AuditLog` (P0 shipped); MCP OAuth 2.1 + RFC 8707 + RFC 8693 token exchange + hash-chained Ed25519 signed log + KMS (P4b) | MCP 2026-03-15 spec mandates; 86% enterprise adoption (Clutch Security data); Signet/OrgKernel/Wirken OSS implementations 2026 | +| **Sentinels** | **5 types** (Deterministic, OutputFormat, SideEffect, NoOp, **JudgeHacking** NEW per Gaia2 finding); cross-family `MultiJudge` (3× Sonnet + 1× non-Anthropic for variance reduction) | Implemented via Anthropic PostToolUse hooks contract (Claude Agent SDK pattern, not novel mechanism) | +| **Trajectory + Memory** | `TrajectorySink` Protocol, `OTelGenAIEmitter` (gen_ai.* semconv) → Phoenix backend, **4-scope hybrid memory** (user/agent/session/org) via Mem0 or Letta | Mem0/Atlan/supermemory.ai 2026 convergent pattern (NOT 3-tier hierarchy) | +| **Eval** | Inspect AI bridge, MultiJudge with bootstrap CIs, AGENTIC_EVAL_SPEC discipline (3-tier base + long-horizon + shadow extensions per honest framing, 6 surfaces S1-S6, 10+ commitments E1-E14) | UK AISI standard; METR/Apollo/CAISI use it; +5th sentinel per Gaia2; reward-hack rate as Pareto axis | +| **Substrate** | **LangGraph durable supervisor** + **Magentic-One Task/Progress Ledger** | Production winners 2026 (Uber/JP Morgan/BlackRock/Cisco LangGraph; Microsoft Agent Framework absorbed Magentic-One) | +| **Compute + Sandbox** | `ComputeBackend` Protocol (HF Jobs / SkyPilot / NeMo-Run / Modal), `SandboxRunner` (E2B for CPU + Daytona/OpenShell for GPU) | NVIDIA NemoClaw pattern for GPU sandbox | +| **Reproducibility envelope** | seeds, deps hashes, model versions, tool registry hash, CUDA/cuDNN/driver versions, GPU SKU, OTel trace ID | Vertex AI manifest pattern | +| **Deployment** | `cosmos_lab.harness.ml_intern.install_into_session` (D2 shipped — runs in worker nodes), `cosmos_lab.harness.nat.register_as_nat_tool` (D3 shipped), `nat run cosmos-lab.yaml` reference workflow | nat plugin pattern (NVIDIA NeMo Agent Toolkit 1.6) | + +### ml-intern primitives (LEVERAGED inside LangGraph nodes, NOT reimplemented) + +ml-intern provides building blocks our specialty workers USE: +- `agent_loop.submission_loop` — invoked inside LangGraph worker nodes for actual ReAct execution +- 16 generic tools (file ops, web, github, hf_repo, plan, sandbox, etc.) +- MCP integration (hf-mcp-server) +- `doom_loop.py`, `cost_estimation.py`, `approval_policy.py`, `telemetry.py` + +### Differentiator vs assembled OSS (v7) + +| Capability | LangGraph alone | + Inspect AI | + Anthropic Skills | + DSPy GEPA | + nat | **+ cosmos-lab v7** | +|---|---|---|---|---|---|---| +| Durable supervisor | ✅ | — | — | — | — | ✅ leveraged | +| Frontier eval framework | — | ✅ generic | — | — | — | ✅ extended (5 sentinels + Pareto axis) | +| Skills pattern | — | — | ✅ generic | — | — | ✅ adopted (CodeWork) | +| Self-improvement | — | — | — | ✅ offline | — | ✅ governed (signed promotions) | +| NVIDIA stack deployment | — | — | — | — | ✅ generic | ✅ Cosmos workflow YAML | +| **5 Cosmos-aligned agents** | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ **NEW** | +| **CodeWork as Skill** | ❌ | ❌ | ❌ (no Cosmos integration) | ❌ | ❌ | ✅ **NEW** | +| **5-type sentinel taxonomy** (incl judge-hacking) | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ **NEW** | +| **Cross-family MultiJudge** | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ **NEW** | +| **Magentic-One ledger pattern in LangGraph supervisor** | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ **NEW synthesis** | +| **MCP OAuth + RFC 8693 + signed audit (EU AI Act)** | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ **NEW** | +| **GepaOptimizer governance contract (signed promotions)** | ❌ | ❌ | ❌ | ⚠️ optimizer only | ❌ | ✅ **NEW** | +| **CapabilityProbe in CI/CD lane** | ❌ | ⚠️ via Inspect | ❌ | ❌ | ❌ | ✅ **NEW orchestration** | +| **CrossAgentEvaluator quarterly Pareto** | ❌ | ⚠️ via Inspect | ❌ | ❌ | ❌ | ✅ **NEW** | + +**One-line pitch (v7)**: cosmos-lab synthesizes 2026 frontier patterns (Anthropic Skills + Magentic-One ledgers + LangGraph durable supervisor + Inspect AI + Mem0 4-scope memory + MCP OAuth + DSPy GEPA offline) into a **production agentic system specifically for NVIDIA Cosmos team's ML lifecycle work** — 5 agents (PrincipalAgent + 4 specialty workers) + CodeWork Skill + 3 offline governance tools + ~16 infrastructure components, on ml-intern's tool primitives leveraged inside LangGraph worker nodes. Deployed via `nat run cosmos-lab.yaml` into Cosmos team's stack. **What no single OSS project ships end-to-end: Cosmos vertical specialization + 5-sentinel governance + signed audit (EU AI Act compliant) + Magentic-One ledger orchestration + cross-agent Pareto comparison.** + +--- + +## 0.65 Production fleet + Skills + offline tools (v7 — frontier-aligned shape) + +The product is **5 production agents** (1 PrincipalAgent supervisor + 4 specialty workers with distinct tool surfaces) + **1+ Skills** (CodeWork as Anthropic-style Skill) + **3 offline governance tools** (NOT standing agents — frontier convergence rejected this) + **~16 infrastructure components** + **leverage of ml-intern's tool primitives as substrate inside LangGraph worker nodes**. + +### Layer 1 — 5 production agents (PrincipalAgent supervisor + 4 specialty workers, v7) + +| # | Agent | Phase(s) | Role + distinct tool surface | Real GPU? | +|---|---|---|---|---| +| 1 | **PrincipalAgent** | P3 (W6-7) | Supervisor: LangGraph supervisor pattern + Magentic-One Task/Progress Ledger (2-iteration stall detection) + Skills loader + sub-agent spawn coordinator | no | +| 2 | **DataAgent** | P4a (W8.5) | Worker — distinct cosmos-curate/NeMo Curator surface; processes 10-100 hours real video; dataset card with W&B Artifacts lineage | ✅ (cosmos-curate Ray cluster) | +| 3 | **EvalAgent** | P5 (W11) | Worker — distinct Inspect AI/MultiJudge surface; cross-family judges (3× Sonnet + 1× non-Anthropic); 5-type sentinel suite incl. judge-hacking | no | +| 4 | **TrainOrchestrator** | P5 (W11.5-12) | Worker — distinct NeMo-RL/SkyPilot surface; Centaur HPO; first real GPU sweep | ✅ (Inv 9) | +| 5 | **OptimizeAgent** | P6 (W13.5-15) | Worker — distinct profiler/kernel/sandbox surface; ≥1.5× speedup on 4 real workloads, ≤2% regression | ✅ (Inv 9) | + +### Layer 2 — Skills (loaded by PrincipalAgent — Anthropic Skills pattern) + +| Skill | Phase | Tools | Why a Skill not an Agent | +|---|---|---|---| +| **CodeWork** | P7 (W15.5-16) | read_file, write_file, run_tests, git_diff in E2B sandbox | Commodity tools (every framework has them); per Anthropic 2026 Skills blog, commodity capabilities should be Skills loaded by general agent, not separate processes | + +### Layer 3 — 3 OFFLINE governance tools (NOT standing agents — frontier convergence) + +| # | Tool | Cadence | What it does | Why offline (frontier evidence) | +|---|---|---|---|---| +| 1 | **GepaOptimizer** | Monthly cron | Mine failure clusters → DSPy GEPA reflective evolution → A/B test → ratchet on lower-CI → signed promotion | Decagon ships GEPA offline only; NO public production deployment as standing agent (audit) | +| 2 | **CapabilityProbe** | CI/CD on capability expansion events | 50+ adversarial probes via Inspect AI snapshots of orchestrator | METR pattern; co-resident standing would poison trace store (audit) | +| 3 | **CrossAgentEvaluator** | Quarterly | Spawn cosmos-lab vs Devin vs Claude Code vs human → Pareto frontier with reward-hack rate axis | Inspect AI cross-agent comparison standard (HAL/HOLISTIC AGENT/SWE-Compass) | + +### Layer 4 — ~16 infrastructure components + +Already enumerated in §0.6 above. Categories: identity (P0 + RFC 8693), 5-type sentinels (incl. judge-hacking) via Anthropic PostToolUse hook contract, OTel + 4-scope hybrid memory (Mem0/Letta), Inspect AI + cross-family MultiJudge, **LangGraph durable supervisor + Magentic-One ledger pattern**, ComputeBackend + sandbox 2-tier, reproducibility envelope (incl. CUDA versions), deployment via nat wrapper. + +### Layer 5 — ml-intern primitives (LEVERAGED inside LangGraph worker nodes) + +Each LangGraph worker node uses ml-intern primitives (agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection) for actual ReAct execution within its specialty. + +### The demonstration (v6 — specialty agents in action) + +Cosmos team uses cosmos-lab via nat workflow: + +```bash +$ nat run cosmos-lab.yaml --task "Improve Cosmos Reason 2 pass-rate by 3pp" +``` + +What happens (v6 specialty-agents-orchestrated workflow): + +1. **DataAgent** spins up: pulls relevant cosmos-curate stages, prepares 10 hours of robot manipulation video, ships dataset card with W&B Artifacts lineage. Sentinel-gated. +2. **TrainOrchestrator** picks up dataset card: launches Centaur HPO sweep (LLM proposes configs, CMA-ES refines) on real Modal/Lambda GPU; NeMo-RL post-training; chooses winner. Real GPU run committed (Invariant 9). +3. **EvalAgent** evaluates winner: multi-judge with bootstrap CIs + sentinel-paired eval; physics-consistency scorer; reports +4.2pp pass-rate at p=0.018. +4. **OptimizeAgent** profiles winner inference: applies torch.compile + selective layer pruning; ≥1.5× wall-clock speedup; sentinel preserves quality. +5. **MultimodalPipelineAgent** orchestrates the four above into the e2e workflow; checkpoints to cross-session memory. +6. **CodeAgent** (if needed) writes patches for any bugs surfaced during the workflow. + +In the background (governance agents): +- **GepaOptimizer** mines failures from this trajectory + prior runs; proposes weekly prompt revisions +- **CapabilityProbe** ran adversarial probe before TrainOrchestrator's capability scope expanded +- **CrossAgentEvaluator** stores this run for next quarterly Pareto comparison + +Each specialty agent uses ml-intern's tool primitives (agent_loop, sandbox, MCP, generic tools) as substrate. Each specialty agent adds Cosmos-specific tools + system prompt + sentinels. + +Final deliverable: measured pass-rate +4.2pp (with bootstrap CI + p-value + sentinel agreement), full Phoenix trajectory across all 6 agents, signed audit log, cost report ($383/$400). Cosmos hiring manager schedules offer in 24h. + +### Why specialty agents + governance, not just one or the other (v6 synthesis) + +- **Why specialty agents (vs v5.2 governance-only)**: JD literally asks for *"agents (plural) that help generate data, surface failures, evaluate outputs"* and *"agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration"*. ml-intern's generic tools are HF-flavored; Cosmos team needs Cosmos-specialized agents. + +- **Why ALSO governance (vs v3.x/v4 specialty-only)**: 2026 reward-hacking crisis (METR + UC Berkeley) means specialty agents need sentinels + signed audit + RFC 8693 capability expansion + GEPA self-improvement. v3.x/v4 had specialty agents but weak governance; v6 adds production rigor. + +- **Why leverage ml-intern primitives (vs v5/v5.1 reimplementation)**: ml-intern's `agent_loop`, 16 generic tools, sandbox, MCP integration, doom-loop detection are debugged production code. Use them as substrate. Don't waste 4-6 weeks reimplementing. + +**The product**: 9 NEW agents (6 specialty + 3 governance) + ~16 infrastructure components, leveraging ml-intern primitives as substrate. Demonstrated on real Cosmos workflows. + +--- + +## 0.7 Numerical targets — what we commit to hit + +Without numbers, this is a roadmap, not a product plan. The following targets are commitments per phase. They are the bar for "phase exits"; missing one is a stop-the-line event, not a footnote. Numbers may be revised at phase entry with written rationale, never silently. + +| Phase | Metric | Target | How measured | +|---|---|---|---| +| **P1** | Sentinel/judge agreement on green seed runs | **≥98%** (any disagreement = bug, not noise) | `RewardHackSentinel` paired with `MultiJudge` across 5 seed tasks × 3 runs | +| **P1** | OTel span → Phoenix round-trip latency | **p99 < 500ms** | timing test; spans visible in Phoenix UI before next agent step | +| **P1** | `MultiJudge` pass-rate bootstrap CI width | **≤ 8pp at N=15 runs** | enables meaningful regression gates downstream | +| **P2** | NIM endpoint mock fidelity | **100% of cosmos tasks parseable** without endpoint changes | mock contract test | +| **P3** | DataAgent dataset-card lineage | **100% of records traceable** prompt → curator stage → final row | W&B Artifacts lineage graph | +| **P4a** | PR-gate false-positive rate | **≤ 5%** on a 10-PR replay set | shadow-mode for first week before enforcing | +| **P4b** | Sub-agent scope-down test | **0 unauthorized tool calls** in 100 child-agent runs | RFC 8693 integration test | +| **P5** | Centaur HPO vs pure-CMA-ES on small benchmark | **≥ parity** within 1σ; Centaur should not be *worse* | 6-config sweep, 3 seeds | +| **P6** | Inference optimization speedup | **≥ 1.5× wall-clock** on 4 baseline workloads | measured-peak per EVAL_SPEC.md | +| **P6** | Quality preservation under optimization | **≤ 2% absolute regression** on Inspect-scorer deterministic metrics | non-negotiable; judge-only flagged | +| **P7** | Memory recall on prior-session related task | **≥ 70% precision@5** on a 20-task held-out set | offline benchmark | +| **P8** | GEPA-promoted prompt revision lift | **≥ +5pp pass-rate at p<0.05**, sentinel agreement preserved | A/B vs control on 50-task golden suite | +| **P8** | False-promotion rate (revisions that regress in prod sample) | **≤ 10%** | weekly retrospective on promotions | +| **P9** | End-to-end pipeline wall-clock | **< 8 hours** for the demo task | DataAgent → TrainOrch → EvalAgent → OptimizeAgent | +| **P9b** | CodeAgent — small-bug-fix success | **≥ 60%** on a 10-bug fixture (E2B sandbox, no human assist) | hits JD's "AI helps build AI" bullet directly | +| **Cross-cutting** | Cost per evaluated task | **logged + visible in leaderboard, no target** | cost as first-class column, not vibes | + +### v4 additions (production-grade gates) + +| Phase | Metric | Target | How measured | +|---|---|---|---| +| **P5** | Real GPU sweep run (per Invariant 9) | **≥ 1 sweep on real hardware** (Modal/Lambda/NIM free tier) with logged W&B run | budget: ~$50; evidence committed to repo | +| **P5.5** | PyTorch custom op vs framework baseline | **≥ 10% measurable wall-clock improvement** on one real workload (matmul / attention / data loader) | profiler artifacts + benchmark script committed | +| **P6** | OptimizeAgent measured speedup | already in v3.1 (≥1.5× on 4 workloads) — v4 *requires real GPU* (was mockable) | per Invariant 9 | +| **P9a** | VideoUnderstandingAgent on real Cosmos NIM | **≥ 1 e2e run** hitting real NIM endpoint (cosmos-reason-2 free tier) | budget: ~$30; trace committed | +| **P10** | Production deployment | **deployed agent serves ≥ 100 real user sessions** over 1-week window (HF Spaces or Modal endpoint) | OTel traces + cost report committed | +| **P10** | OSS upstream PR | **≥ 1 PR opened to nvidia-nat or Inspect AI** with sentinel pattern; review-ready, not draft | PR URL in P10 release notes | + +### v5 eval-system additions (per AGENTIC_EVAL_SPEC §9) + +These extend numerical targets with eval-system-specific commitments. Without these, the entire numerical-targets table above is unverifiable (per axiom A10: eval-of-eval). + +| # | Eval target | Commitment | How measured | Phase | +|---|---|---|---|---| +| **E1** | Sentinel suite false-positive rate | **≤ 5%** on null fixtures (identical agent A/B should never gate-fail) | M2 null fixture suite, weekly | P1 | +| **E2** | Sentinel suite false-negative rate | **≤ 1%** on planted regressions (known-broken agent must always gate-fail) | M2 planted regression suite, weekly | P1 | +| **E3** | T1 calibrated suite test-retest reliability | **r ≥ 0.95** on aggregate metrics across 2 runs | M2, monthly | P1 | +| **E4** | Plan-quality LLM-judge ↔ human agreement | **≥ 80%** on 50-plan calibration sample | S2 calibration, quarterly | P4a | +| **E5** | Replan success rate | **≥ 70%** (replans → next milestone sentinel-clean) | S3, continuous | P4a | +| **E6** | Capability boundary probe pass rate | **100%** (0 unauthorized tool calls across 100 child-agent runs) | S4, nightly | P4b | +| **E7** | Reward-hack discovery rate | **Trending downward over 6 months** (sentinel suite maturing) | S5, monthly | P4a | +| **E8** | Cross-agent Pareto position | **PrincipalAgent on Pareto frontier** of cost × quality vs comparison agents | S6, quarterly | P10 | +| **E9** | Eval cost as % of total project spend | **≤ 15%** of total GPU + compute budget | M3 cost telemetry, weekly | P1 | +| **E10** | Reproducibility envelope coverage | **100%** of agent runs tagged with envelope (seeds + hashes + versions + OTel trace ID) | M1, every run | P1 | + +**Total numerical commitments**: 24 (original §0.7) + 10 (E1-E10) = **34 phase-exit conditions, all measurable, all gating decisions.** + +--- + +## 0.8 Production-grade commitments — what makes v4 not a research roadmap + +A research roadmap says "we will design X." A production plan says "we will run X in front of users with measurable Y, and roll back via Z." v4 commits to five production gates that v3.2 left aspirational: + +| Gate | Concrete commitment | Why it matters for Cosmos | +|---|---|---| +| **G1: Real GPU runs** | Invariant 9 — every GPU phase (P5, P5.5, P6, P9a) requires ≥1 measured real-hardware run committed to the repo. ~$200-400 total budget across Modal / Lambda Cloud / NIM free tier. | "Ran in mocks only" is an interview-killing red flag for a multimodal/world-model team | +| **G2: PyTorch depth artifact** | P5.5 ships one custom autograd op or torch.compile pattern with profiler-driven kernel selection on a real workload, ≥10% wall-clock improvement vs framework baseline | JD: *"Deep familiarity with PyTorch, including the ability to debug, adapt, and extend model behavior"* — needs code, not words | +| **G3: Production deployment** | P10 deploys cosmos-lab reference agent on HF Spaces or Modal endpoint, gathers ≥100 real user sessions over 1-week window, publishes trace summary | JD: *"deployment"* as a lifecycle step. `pip install` is publication, not deployment. | +| **G4: Real multimodal data flow** | P3 reframed: DataAgent processes 10-100 hours of *real* video through cosmos-curate (not toy fixture), ships dataset card + lineage | JD: *"multimodal ML pipelines spanning data processing"* — at toy-fixture scale this is a demo; at hour-scale it's a pipeline | +| **G5: OSS impact beyond own repo** | P10 commits to opening one upstream PR (nvidia-nat OR Inspect AI) with the sentinel pattern, review-ready not draft | JD stand-out: *"contributed to impactful open-source ML, Python, or developer tooling"* — owning a library is good; landing in someone else's library is better | + +**Honest scope**: G1+G2+G4 are mostly inside existing phase weeks (cost: time on real runs, not new weeks). G3+G5 are the +1 week expansion of P10 (was 1w polish, now 2w polish + production). + +**Honest budget**: ~$200-400 of personal GPU spend total. Self-fundable. Worth every dollar — turns the plan from "smart paper" into "shipped product with numbers." + +--- + +## 0.9 The Cosmos-specialized agents + governance thesis (v6 — core) + +> **v6 synthesis (PLAN_V2 §0.5 row 13)**: cosmos-lab ships **9 NEW agents** (6 Cosmos-specialty for ML lifecycle work + 3 governance for the meta layer) + **~16 production governance infrastructure components**, **built on ml-intern's tool primitives** (agent_loop blocks, 16 generic tools, sandbox, MCP, cost estimation) **leveraged AS-IS, not reimplemented**. JD literal text: *"agentic systems that reason about, build, evaluate, and improve AI systems themselves"* — multiple agents (specialty + governance). + +### The product + +cosmos-lab is **9 NEW agents + production governance infrastructure**: + +1. **6 Cosmos-specialty agents** (Layer 1 in §0.65): DataAgent, EvalAgent, TrainOrchestrator, OptimizeAgent, MultimodalPipelineAgent, CodeAgent — each does real ML lifecycle work specialized for Cosmos team workflows +2. **3 governance agents** (Layer 2): GepaOptimizer (self-improvement), CapabilityProbe (adversarial security), CrossAgentEvaluator (vendor comparison) — meta-layer agents that improve/validate the specialty agents +3. **~16 governance infrastructure components** (Layer 3): sentinels, identity, audit log, OTel emitter, memory tiers, Inspect AI bridge, ComputeBackend, etc. +4. **ml-intern primitives** (Layer 4): leveraged AS-IS — agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection. NOT reimplemented. + +The user-visible artifact: + +```bash +$ nat run cosmos-lab.yaml --task "Improve Cosmos Reason 2 by 3pp" +``` + +What happens (v6 reality): +- nat workflow invokes cosmos-lab CLI as a registered tool +- cosmos-lab CLI orchestrates the 6 specialty agents through the ML lifecycle workflow +- Each specialty agent constructs an ml-intern session with scoped governance (CapabilityScopedRouter from D2 adapter), executes its specialty work, returns result +- Background governance agents run continuously: GepaOptimizer mines trajectories weekly; CapabilityProbe runs before each capability expansion; CrossAgentEvaluator collects data for quarterly Pareto comparison +- All actions audited (signed log), observed (OTel spans), evaluated (sentinels paired with judges) + +### Why specialty agents + governance, not just one + +``` +Real principal engineer's workday: + Vague goal → reason for hours → decompose into experiments → write code → + run experiments → observe surprising result → replan → iterate → ship + +cosmos-lab PrincipalAgent's workday: + Same loop. Long-horizon (multi-day). Real GPU. Real codebases. Real surprises. +``` + +v5/v5.1 was building a parallel orchestrator (PrincipalAgent + planner + executor + memory + sub-agent spawning) under `cosmos_lab/principal/` — but ml-intern already ships these: +- `agent/tools/plan_tool.py` — built-in planning (todo list management with status tracking) +- `agent/tools/research_tool.py` — sub-agent spawning (verbatim docstring: *"Research subagent tool — spawns a cheap LLM call with a focused research task and returns a summary. The subagent gets its own independent context"*) +- `agent/core/agent_loop.submission_loop` — autonomous execution +- `agent/prompts/system_prompt_v3.yaml` — explicit *"fully autonomous"* directive +- 20+ ML tools (jobs, datasets, papers, github, hf_repo, sandbox, notebook, web_search, ...) +- `agent/core/doom_loop.py` — failure-loop detection +- `agent/core/cost_estimation.py` — per-call cost tracking + +**Re-implementing what already works = workflow anti-pattern #4 generalized.** v5.2 leverages ml-intern as-is and ships the 10 things ml-intern doesn't have. + +### What cosmos-lab actually owns (the production governance layer) + +Three pillars, each addressing a 2026 production gap: + +**Pillar 1 — Production observability + memory persistence** +- OTel `gen_ai.*` semantic conventions (vendor-portable trajectory) +- 3-tier hierarchical memory layered over ml-intern's `logged_events` (cross-session continuity) +- Phoenix dashboard for live trajectory inspection + +**Pillar 2 — Sentinel-gated quality + adversarial validation** +- 4 sentinel types paired with judge — no judge-only metric reaches a gate (Invariant 8) +- 50-task denied-tool probe suite (S4) — adversarial validation of capability boundaries +- Monthly red-team sprint (S5) — novel reward-hack discovery +- MultiJudge with bootstrap CIs — no debate dynamics + +**Pillar 3 — Identity governance + self-improvement compliance** +- MCP OAuth 2.1 + RFC 8707 + RFC 8693 capability expansion (earned trust, not granted blanket) +- Hash-chained signed audit log (EU AI Act Art. 12 compliant) +- GEPA self-improvement with retroactive signed-review (not pre-approval bottleneck) +- PR-gating + canary deployment with sequential testing + +### What we're NOT pretending to do (anti-hype, carried) + +- ❌ Not building a new autonomous agent (ml-intern is the agent; we add governance) +- ❌ Not re-implementing planning, sub-agents, or ML tools (ml-intern has them) +- ❌ Not pretending zero human oversight (sentinel trips visible, weekly review, signed promotions) +- ❌ Not pretending online self-improvement (offline GEPA, with retroactive human review) +- ❌ Not pretending capability expansion is unbounded (policy-bounded by config) + +### Why this matches NVIDIA Cosmos JD + +| JD literal text | v5.2 delivery | +|---|---| +| *"strong agency in LLM-based systems"* | ml-intern (autonomous) + cosmos-lab governance (production-safe) — strong agency made deployable | +| *"code agents doing real work"* | ml-intern's existing autonomous code work + cosmos-lab adversarial S4 probes ensures it's safe | +| *"AI helps build them"* | ml-intern does the building; cosmos-lab makes it auditable and improvable | +| *"long-horizon multi-step workflows"* | ml-intern's existing long-running agent + cosmos-lab cross-session memory persistence | +| *"automation over data and experiments"* | ml-intern's plan_tool + research_tool + ML tools, governed by cosmos-lab | +| *"design and scale evaluation platforms"* | cosmos-lab's full eval architecture (AGENTIC_EVAL_SPEC: T0-T4 + S1-S6 + 10 commitments) | + +v5.2 ships the production governance layer that turns autonomous agents from research demos into deployable products. + +--- + +## 1. Phase table (~21 weeks — v7: 5 production agents + Skills + 3 offline tools + frontier substrate) + +> **v7 framing**: cosmos-lab ships 5 production agents (1 PrincipalAgent supervisor + 4 specialty workers with distinct tool surfaces) + 1+ Skills + 3 offline governance tools + ~16 infrastructure components on **LangGraph durable supervisor + Magentic-One ledger pattern** + ml-intern's tool primitives leveraged inside worker nodes. Schedule ~21w (slightly more than v6's 19w because we add LangGraph integration + PrincipalAgent foundation + Magentic-One ledger pattern + 5th sentinel type — all frontier-required additions per 3-audit verification). Frontier-aligned, not optimistic. + +> **v4 schedule rationale**: v3.2 trimmed to 20 weeks by inheriting nat plumbing. v4 adds **P5.5 PyTorch Depth (1w)** + **P10 expansion (1w)** to close production-grade gaps (§0.8). Net: 20 → ~22.5 weeks; still inside original 24-week budget. Banked ~1.5 weeks remain as risk buffer. +> +> **v3 split rationale (carried)**: P4 split into **P4a EvalAgent (1w)** + **P4b Identity v2 (2w)** because Identity v2 alone is 3-4w of work; honest > clean. + +| New | Wks | Phase | What ships | Real GPU? | +|---|---|---|---|---| +| P0 | 1 | Foundation + identity (AuthZ MVP) *(✅ shipped)* | `AgentIdentity`, `AuditLog`, `CapabilityScopedRouter`, `OptimizationConfig` — substrate for all 5 agents | no | +| **P0.5** | **0.6** | **Library restructure + harness adapters** *(✅ shipped)* | `cosmos_lab/` package + `install_into_session()` (D2) + `register_as_nat_tool()` (D3) + adapter contract (D4) | no | +| **P1** | **2** | **Eval infrastructure** (foundation for EvalAgent + used by all workers) | `TrajectorySink` Protocol, `OTelGenAIEmitter` → Phoenix, **5 sentinel types** (incl. **JudgeHackingCheck** per Gaia2), **cross-family `MultiJudge`** (3× Sonnet + 1× non-Anthropic), Inspect AI bridge via Anthropic PostToolUse hooks contract, 5 seed Inspect tasks, `evaluate` CLI | no | +| **P2** | **1** | **Cosmos toolset** (Cosmos-specific tools all workers use) | `NIMProvider` (litellm custom), `cosmos_reason`/`predict`/`transfer` tool wrappers, 5 cosmos Inspect tasks | no (mocked NIM) | +| **P3** | **2.5** | **🤖 PrincipalAgent foundation + Context engineering discipline (NEW supervisor agent)** | **LangGraph durable supervisor** + **Magentic-One Task Ledger (facts+plan) + Progress Ledger (step-tracking with 2-iteration stall detection)** + **4-scope hybrid memory** (user/agent/session/org) via Mem0 or Letta + Skills loader + sub-agent spawn coordinator. **Plus context engineering discipline** (cache-aware prompt structure: stable prefix → tool defs → conversation; compaction strategy at 75% context utilization; just-in-time retrieval via `recall_relevant(goal)`; `cosmos-progress.md` structured state file for cross-session bridging — claude-progress.txt analog per Anthropic Claude Code pattern). Foundation for ALL 4 workers. | no | +| **P4a** | **1.5** | **🤖 DataAgent (worker #1)** | Distinct cosmos-curate/NeMo Curator surface: composes Ray pipeline + LLM-in-the-loop persona-rewriter; **processes 10-100 hours real video** (Invariant 9); dataset card with W&B Artifacts lineage | ✅ (cosmos-curate Ray cluster) | +| **P4b** | **2** | **Identity v2 — MCP OAuth + RFC 8707 + RFC 8693 + signed audit** | Per MCP 2026-03-15 spec (86% enterprise adoption); table stakes delegation + signed audit (Ed25519 v1, KMS in P10) — drop "earned trust escalation" framing per audit | no | +| **P5** | **2** | **🤖 EvalAgent + 🤖 TrainOrchestrator (workers #2 + #3)** | EvalAgent (1w): leverages P1 eval infra; physics-consistency scorers; PR-gating; cross-family judges. TrainOrchestrator (1w): Centaur HPO; ComputeBackend; NeMo-RL; **first real GPU sweep** (Invariant 9) | ✅ (Inv 9) | +| **P5.5** | **1** | **PyTorch depth artifact** (substrate + capability proof) | One PyTorch artifact (custom autograd op OR torch.compile pattern with profiler-driven kernel selection); ≥10% wall-clock improvement; demonstrates "deep PyTorch familiarity" JD bullet | ✅ (Inv 9) | +| **P6** | **1.5** | **🤖 OptimizeAgent (worker #4)** | Distinct profiler/kernel/sandbox surface: applies optimization (kernel fusion, torch.compile, layer pruning); **≥1.5× speedup on 4 real workloads, ≤2% regression** | ✅ (Inv 9) | +| **P7** | **1** | **CodeWork Skill + CapabilityProbe (CI/CD lane) + S5 red-team automation** | CodeWork Skill loaded by PrincipalAgent (commodity tools, NOT separate agent per Anthropic Skills); CapabilityProbe runs in CI/CD against Inspect AI snapshots (NOT standing co-resident); monthly red-team sprint automation | no (E2B sandbox for Skill) | +| **P8** | **1.5** | **GepaOptimizer (offline batch tool, NOT standing agent)** | Monthly cron: DSPy `dspy.GEPA` reflective text evolution over trajectory store; A/B test on Inspect AI golden suite; lower-CI ratchet → signed promotion record. Frontier-validated as offline only (Decagon pattern). | no | +| **P9** | **1.5** | **MultimodalPipeline DEMO (orchestrate existing agents)** | NOT a new agent: PrincipalAgent orchestrates DataAgent → TrainOrchestrator → EvalAgent → OptimizeAgent on Cosmos Predict 2.5 + π₀.₅; **real Cosmos NIM endpoint** (Invariant 9). Demo proves the existing 5 agents compose cohesively. | ✅ (Inv 9: real Cosmos NIM ≥1×) | +| **P10** | **2** | **CrossAgentEvaluator (offline) + production deploy + nat YAML + OSS PR + demo** | CrossAgentEvaluator quarterly batch: cosmos-lab vs Devin vs Claude Code vs human → Pareto frontier with **reward-hack rate axis** (per audit). Plus: HF Spaces / Modal endpoint ≥100 real user sessions; ≥1 upstream PR (nvidia-nat or Inspect AI); `pip install cosmos-lab[all]`; `nat run cosmos-lab.yaml` reference; KMS migration; 5-min demo video | yes (production) | +| **Total** | **~21.5** | | **5 production agents + 1+ Skills + 3 offline tools + ~16 infra + context engineering discipline, on LangGraph + Magentic-One + ml-intern primitives** | **5 phases real GPU** | + +--- + +## 1.5 Reuse map — existing assets we wrap, not rebuild + +> **v3.2 reframe**: ml-intern moves from "platform we extend" to "**v1 reference harness via `cosmos_lab.harness.ml_intern` adapter**." The library does not depend on ml-intern at runtime; the adapter ports cosmos-lab into ml-intern's session model for users who want the HF stack + web UI. Primary harness from P1 onward is `nvidia-nat`. + +A senior-eng audit of both the upstream `ml-intern` codebase and external 2026 SOTA components surfaced multiple components that materially overlap with planned phases. Default posture: **wrap behind cosmos-lab interfaces; rebuild only the genuinely new abstraction**. + +| Existing upstream (ml-intern) asset | What it already does | Where used in cosmos-lab | +|---|---|---| +| `agent/core/session_uploader.py` | Detached subprocess uploader; ships every session to a configurable HF dataset | `cosmos_lab.trajectory.HFDatasetSink` (opt-in, P8 flywheel) | +| `agent/core/session_persistence.py` | Mongo-backed durable session store, gated on `MONGODB_URI` | Used directly via ml-intern adapter; no separate cosmos-lab Mongo sink (MongoSink cut in v3.1) | +| `agent/core/telemetry.py` + `HeartbeatSaver` | Mid-turn save/upload every N seconds | `cosmos_lab.harness.ml_intern` adapter hooks into these for mid-run flushes | +| `agent/core/approval_policy.py` (newly merged) | Approval policy + YOLO budget over tool calls | `cosmos_lab.identity.CapabilityScopedRouter` composes with this when running under ml-intern adapter | +| `agent/core/cost_estimation.py` (newly merged) | Per-call cost tracking | `cosmos_lab.eval` consumes for cost column in leaderboard | +| `agent/sft/tagger.py` | Tags trajectory events for SFT extraction | `cosmos_lab.governance.failure_mining` (P8) consumes via ml-intern adapter | +| `backend/kpis_scheduler.py` | APScheduler hourly rollup → HF KPI dataset | Optional integration via ml-intern adapter; cosmos-lab leaderboard CLI is the primary surface | +| `backend/session_manager.py::EventBroadcaster` | Per-session SSE fan-out | Used by ml-intern web UI only — not part of cosmos-lab core | +| `agent/tools/jobs_tool.py` | HF Training Jobs wrapper | `cosmos_lab.compute.HFJobsBackend` wraps this when ml-intern adapter active | + +**Net effect (v3.2)**: cosmos-lab core does not depend on ml-intern at runtime; ml-intern is exposed via the `ml_intern` adapter for users who want HF stack + web UI. Phase 0 work (identity, audit log, router) is preserved as library code, runnable under either adapter. + +### External assets — `nvidia-nat` as primary harness (v3.2 reframe) + +| External asset | What it already does | Where used in cosmos-lab | +|---|---|---| +| **NeMo Agent Toolkit (`nvidia-nat`, CLI `nat`)** v1.6.0 (2026-04-10) | YAML workflow config, plugin/decorator tool registry, OTel-native telemetry chain, `nat eval` artifact pipeline, profiler, MCP-compatible tool surface, multi-framework adapters (LangGraph/CrewAI/LlamaIndex) | **PRIMARY HARNESS** from P1 onward via `cosmos_lab.harness.nat`. Not "wrap" — cosmos-lab plugs *into* nat as plugins. nat handles workflow runtime, tool registry, OTel exporter chain; cosmos-lab adds governance (sentinels, identity, GEPA, quality budget). | +| **OpenTelemetry GenAI semconv** (`gen_ai.*`, experimental) | Cross-vendor span schema for LLM calls, tool calls, agents | P1 — primary trace schema; replaces "DuckDBSink as schema source-of-truth" | +| **Phoenix (Arize)** OSS | OTel-native trace UI, eval rigor (drift, embeddings) | P1 default backend | +| **Inspect AI** (UK AISI) | Task / Solver / Scorer primitives + Docker sandbox + log viewer; production eval standard | P1, P4 — primary eval harness; our seed tasks ship as Inspect tasks | +| **DSPy 3.x + `dspy.GEPA`** | GEPA reflective prompt optimization, production-traction (Databricks, VMware) | P8 — offline self-improvement loop | +| **NeMo Curator 26.04** + **cosmos-curate** | Ray-based GPU text/image/video/audio curation; cosmos-curate processed 20M video-hr in 14 days on Blackwell | P3 — DataAgent composes these stages | +| **NeMo-RL** (formerly Reinforcer) | Production post-training framework; Nemotron-3-Super trained on it (Mar 2026) | P5, P6 — wrap as a post-training backend | +| **SkyPilot v0.12 Job Groups** | Multi-cloud job orchestration, RL job groups, Slurm support | P5 — `ComputeBackend` impl alongside `HFJobsBackend` | +| **NVIDIA OpenShell + NemoClaw** | OS-level syscall interception, declarative allow-lists, designed for "always-on" agents on GPU | P6/P9 — GPU sandbox tier | +| **E2B (Firecracker)** | Hardware-isolated CPU sandbox; ~150ms cold start | P6 — CPU correctness sandbox tier | +| **MCP authorization spec (draft 2026)** + **WorkOS AuthKit / Auth0 MCP AS** | OAuth 2.1 + RFC 8707 + RFC 8693 | P4 — Identity v2 (do not roll our own OAuth) | +| **Anthropic memory tool API** (`memory_*`) + **Letta** | File-based memory storage; agent loop with subagents | P7 — storage layer for memory; do not rebuild MemGPT-style paging | + +**v3 net effect**: roughly half of "platform code" becomes integration glue against well-known boundaries. Cosmos team reading the plan recognizes every interface — credibility through *fluency in their stack*, not novel reinvention. + +### Companion specification documents (v5) + +These live at repo root as deep references; agents load on-demand: + +| Document | Scope | When to read | +|---|---|---| +| `EVAL_SPEC.md` | ML-output evaluation (perplexity, KL divergence, latency p99, GPU OOM) — model under test | Working on P5/P6 (training/optimization), or any task with model output as deliverable | +| `AGENTIC_EVAL_SPEC.md` | Agent-system evaluation (trajectory quality, plan quality, replan quality, capability boundary, reward-hacking, cross-agent comparison) — agent itself as artifact-under-eval per axiom A8 | Working on any P1+ phase that builds or evaluates the PrincipalAgent itself | +| `PLAN.md` | Original 16-week ML optimization plan (superseded by PLAN_V2 v5) | Historical reference for optimization-vertical depth | +| `SYSTEM.md` | Full architecture deep-dive (Vietnamese, 1167L) | Rare — only for upstream debugging | +| `RESEARCH_AHE_ANALYSIS.md` | AHE (Agentic Harness Engineering) research informing P8 GEPA decisions | Working on P8 self-improvement loop | + +--- + +## 2. Phase 0 — Foundation + Identity Skeleton (Week 1) + +### Why now +Agent identity is a stand-out JD bullet ("AuthN, AuthZ, IAM"). Cheap to skeleton on day 1 and would be expensive to retrofit. Also unblocks safer multi-agent work in P9. + +### Day-by-day + +| Day | Deliverable | Owned path | Acceptance | +|---|---|---|---| +| 1 | Sync upstream + lock baseline | — | 237 pass / 3 upstream-broken documented | +| 1 | `agent/optimization/__init__.py` + `config_ext.py` | `agent/optimization/` | `from agent.optimization import OptimizationConfig` works | +| 2 | `configs/optimization_agent_config.json` | `configs/` | Loadable via existing `load_config()` | +| 2-3 | `AgentIdentity` (frozen dataclass) | `agent/optimization/identity/identity.py` | Root identity, scoped identity, `can_call()` | +| 2-3 | `AuditLog` (append-only JSONL, thread-safe) | `agent/optimization/identity/audit.py` | Atomic writes, `read_all()` round-trip | +| 3 | `CapabilityScopedRouter` (composition over `ToolRouter`) | `agent/optimization/identity/router.py` | Filters tool specs, denies unauthorized calls, audits all paths | +| 4 | `tests/optimization/test_identity_scoping.py` (≥6 tests) | `tests/optimization/` | All pass; full suite no regression | +| 5 | Phase 0 retro + zero-diff verification | — | `git diff upstream/main --name-only` = owned only | + +### Acceptance criteria + +- [ ] `pytest tests/optimization/ -q` exits 0 +- [ ] `pytest tests/unit/ -q` shows ≤ 3 failures (the 3 upstream-broken; no new) +- [ ] `git diff upstream/main --name-only` lists only owned paths +- [ ] `OptimizationConfig` round-trips through `Config.model_validate()` +- [ ] `CapabilityScopedRouter` denies + audits when capability not granted +- [ ] `CapabilityScopedRouter` allows + audits before/after when granted +- [ ] `AuditLog` JSONL is parseable and chronologically ordered + +### Design decisions + +- **Composition over inheritance** for `CapabilityScopedRouter`: `ToolRouter.__init__` instantiates the full builtin tool stack (sandbox tools require HF auth) — heavy and unsuitable for unit tests. Wrapping a base router (or a duck-typed mock) keeps tests fast and decoupled. +- **JSONL audit**: human-readable, append-only, easy to ship to S3 / Phoenix later. No DB dependency in Phase 0. +- **`"*"` wildcard capability**: the only special-cased capability. No glob/prefix matching in Phase 0 — keep semantics trivially provable. +- **Composes with upstream `approval_policy.py`**: capability check is the *coarse, before-the-fact* allowlist; approval policy is the *fine, per-call, budget-aware* gate. Order: capability allow → approval policy → execute. Integration ticket lives in P1 D1; Phase 0 ships standalone enforcement. + +### Scope clarification (AuthN vs AuthZ) + +Phase 0 ships **AuthZ + audit**, not AuthN/IAM. `AgentIdentity` is unsigned — any caller can construct one. This is intentional for Phase 0: the goal is to prove the *enforcement and audit surface* against a known principal. AuthN (signed identity tokens, IAM provider integration) layers on later, once we have a real principal source (HF OAuth in `backend/`, sub-agent spawning in P9). Document this in module docstrings so a reader doesn't read more into it than is there. + +### Risks & mitigations + +- **Risk**: Upstream `ToolRouter.call_tool` signature changes in a future merge. **Mitigation**: composition + duck typing limits the blast radius to one method. +- **Risk**: AuditLog write contention if used from many threads. **Mitigation**: `threading.Lock` on write; if hot, switch to a queue+writer thread in P1. + +--- + +## 2.5 Phase 0.5 — Library restructure + harness adapters (4 days, NEW IN v3.2) + +### Why this exists +v3.2 reframes cosmos-lab as a Python library (`pip install cosmos-lab`), not a fork of ml-intern (§0.4 explains why). P0.5 is the architectural refactor that makes this real. **No semantic change to Phase 0 code** — only packaging, imports, and adapter wiring. + +### Day-by-day + +| Day | Deliverable | Acceptance | +|---|---|---| +| D1 | Restructure `agent/optimization/` into the §0.4 package layout: `cosmos_lab/{identity,trajectory,eval,governance,memory,providers,compute,sandbox,harness}/`. Both old (`agent.optimization.*`) and new (`cosmos_lab.*`) import paths work via `__init__.py` re-exports. Update `pyproject.toml` with `cosmos-lab` package + `[nat]`, `[ml-intern]`, `[all]` extras. | `from cosmos_lab.identity import AgentIdentity` works; existing 16 Phase 0 tests still pass without modification (only `import` lines change) | +| D2 | Write `cosmos_lab/harness/ml_intern.py` adapter — refactor existing `CapabilityScopedRouter` integration into adapter pattern. Provides: tool registration shim, span correlation hook, lifecycle wiring (per §0.4 adapter contract). | One smoke test: `cosmos_lab.harness.ml_intern.install(session, identity)` makes a 3-step trivial agent run with capability denial + OTel span emission + sentinel evaluation. ≤200 LOC. | +| D3 | Write `cosmos_lab/harness/nat.py` adapter — `install_into_nat(builder, identity)` API per §0.4 sketch. Registers `OTelGenAIEmitter` as nat exporter, wraps nat tool router with `CapabilityScopedRouter`, hooks lifecycle. | Same smoke test contract as ml-intern adapter, but running inside `nat run cosmos_lab_smoke.yaml`. ≤200 LOC. Both adapters parameterized over the same `AdapterContract` Protocol so future adapters (Claude SDK, OpenAI Agents) follow the pattern mechanically. | +| D4 | Dual-adapter test matrix: every Phase 0 test parameterized via `@pytest.mark.parametrize("harness", ["nat", "ml_intern"])`. Both must pass identically. Document the contract in `cosmos_lab/harness/CONTRACT.md`. | `pytest tests/optimization/ -q` runs 16 tests × 2 harnesses = 32 tests; all pass. CI matrix added. | + +### Acceptance criteria + +- [ ] `pip install -e .[nat]` and `pip install -e .[ml-intern]` both work; importable as `cosmos_lab` +- [ ] `tests/optimization/` runs 32 tests (16 × 2 adapters), all green +- [ ] `nat run examples/cosmos_lab_smoke.yaml` exits 0 and produces an OTel trace with capability denial event recorded +- [ ] `cosmos_lab/harness/CONTRACT.md` documents the 3-method adapter contract (tool registration, span correlation, lifecycle wiring) +- [ ] No existing Phase 0 test deleted; only `import` lines updated +- [ ] Zero-diff invariant still holds: `git diff upstream/main --name-only` returns only owned paths + +### Why 4 days, not 1 week + +Phase 0 code is ~600 LOC across `identity/`, no semantic change required. Adapter pattern is well-understood (entry-point pattern in `pyproject.toml`, ~150-200 LOC each). Dual-adapter test parametrization is `@pytest.mark.parametrize` — trivial. The risk is mostly in nat-side wiring (we haven't run nat locally yet); D3 reserves ~6h for that learning curve. + +### What this unlocks + +From P1 onward, every owned cosmos-lab module is written **once** against the framework-agnostic interfaces. The two adapters translate to/from nat and ml-intern. When OpenAI Agents SDK 2.0 ships, adding a third adapter is ~200 LOC, not a refactor. + +This is the architectural difference between "we built on HF" and "we built a library that runs anywhere — and our default is Cosmos-team's stack." The four days pay for themselves in pitch credibility alone. + +--- + +## 3. Phase 1 — TrajectorySink + OTel-GenAI + Inspect AI Judging (Weeks 2–3) — v3 + +### Reframe (vs v2) + +v2 said "TrajectorySink with DuckDB primary." v3 says: **trace schema is OpenTelemetry GenAI semantic conventions**, not anything we invent — and the eval rig is **Inspect AI** (UK AISI Task/Solver/Scorer), not a custom `TaskRunner`. Why: (1) the field has converged on OTel `gen_ai.*` spans (see DataDog/Grafana/Phoenix native support, Q1 2026), and a custom DuckDB schema would be a one-off that loses portability and Cosmos-team recognition. (2) Inspect AI is the production eval standard (METR uses it), with Docker sandbox, log viewer, and bootstrap CIs built in. Building a custom harness is reinventing the field's primitive. + +DuckDB stays — but as a *query/analytics layer over OTel spans*, not as the schema source-of-truth. + +### Goals (v3) +1. Define a `TrajectorySink` interface that emits **OTel GenAI spans** + persists derived rows for fast query. +2. Wire seed tasks as **Inspect AI Tasks**; use Inspect's Solver/Scorer + Docker sandbox; add cosmos-lab-specific scorers. +3. Make every judged metric ship with **bootstrap CI + an anti-reward-hacking sentinel** (paired structural verifier). +4. Integrate `CapabilityScopedRouter` with upstream `approval_policy.py` — capability allow → policy gate → execute (ordering test mandatory). + +### Deliverables (v3) + +| Module | Path | Notes / reuse | +|---|---|---| +| `TrajectorySink` (Protocol) | `agent/optimization/trajectory/sink.py` | `record_span(GenAISpan)`, `record_run(RunRecord)`, `query(...)` — `GenAISpan` follows OTel `gen_ai.*` semconv | +| `OTelGenAIEmitter` | `agent/optimization/trajectory/otel_emitter.py` | **NEW** — emits `gen_ai.*` spans via OTel SDK; backend = Phoenix by default, swap-in via `OTEL_EXPORTER_OTLP_ENDPOINT`. **Default = sole sink in P1.** | +| `DuckDBSink` | `agent/optimization/trajectory/duckdb_sink.py` | **Opt-in** local query layer over OTel-shaped rows (`runs`, `spans`, `judge_scores`); enabled when analytics needed (P4a leaderboard) | +| `HFDatasetSink` | `agent/optimization/trajectory/hf_sink.py` | **Opt-in** thin adapter over existing `session_uploader.py`; enabled for AHE/SFT flywheel (P8) | +| `MultiSink` | same | Fan-out wrapper. **P1 default = `[OTelGenAIEmitter]` only**; chain grows as later phases need them. v3.1 cut: ship one sink working before three. | +| ~~`MongoSink`~~ | — | **CUT in v3.1**: `session_persistence.py` already provides Mongo durability via upstream's existing path; a separate `MongoSink` was feature creep with no consumer in any phase. | +| `TracedSession(Session)` | `agent/optimization/trajectory/session.py` | Subclass; routes `logged_events` to active sink chain. Reuses `HeartbeatSaver` for mid-run flushes | +| Capability/Policy integration | `agent/optimization/identity/router.py` (extend) | Compose with `approval_policy.py`; capability check first, then policy, then execute | +| `LLMJudge` | `agent/optimization/eval/judge.py` | Single-pass baseline; uses `cost_estimation.py` to log judge cost | +| `MultiJudge` | `agent/optimization/eval/multi_judge.py` | **N-judge variance reduction** (no debate dynamics — see [`arxiv:2508.17536`](https://arxiv.org/abs/2508.17536)). Reports bootstrap CI on pass-rate. Default N=3 (Sonnet 4.6 ×3); tie-break with Opus 4.7 only if CI bound straddles threshold | +| `RewardHackSentinel` | `agent/optimization/eval/sentinel.py` | **NEW** — paired structural verifier. Each task pairs a deterministic check (file written? speedup measured? parity asserted?) with the judge. Disagreement → flag run for review, do not silently accept | +| `ToolAugmentedJudge` | `agent/optimization/eval/tool_judge.py` | Judge can call read-only tools; identity = `judge-readonly` capability set | +| Inspect AI bridge | `agent/optimization/eval/inspect_bridge.py` | **NEW** — exposes our scorers/judges as Inspect AI `Scorer`s; lets a cosmos-lab task run inside `inspect eval` and produce Inspect View logs | +| Seed tasks (Inspect format) | `tasks/seed/*.py` | 5 tasks as Inspect `@task`s: dataset inspect, code task, ML debug, paper summary, profiling. Each ships with one judge scorer + one structural sentinel | +| `evaluate` CLI | `agent/optimization/cli/evaluate.py` | Wraps `inspect eval` with cosmos-lab defaults: `cosmos-lab evaluate --suite seed --judge multi --sinks otel,duckdb,hf` | + +### Acceptance (v3.1) +- 5 Inspect tasks × 3 runs = 15 trajectories: spans land in Phoenix via OTel; round-trip p99 < 500ms (P1 numerical target) +- `MultiJudge` reports bootstrap 95% CI on pass-rate; CI width ≤ 8pp at N=15 runs (P1 numerical target) +- For every judged task, `RewardHackSentinel` runs and flags any judge/structural disagreement; **sentinel/judge agreement ≥ 98% on green seed runs** (P1 numerical target — any disagreement is a bug, not noise) +- `evaluate` CLI prints leaderboard with pass-rate ± CI, p50/p99 latency, **cost (via upstream `cost_estimation`) as a first-class column**, sentinel disagreement count +- A capability-denied call is blocked *before* approval policy is consulted (ordering test in `tests/optimization/test_router_policy_integration.py`) +- One golden Phoenix screenshot of an `agent` span tree committed under `docs/` so a Cosmos reviewer can see OTel wiring at a glance +- `DuckDBSink` and `HFDatasetSink` exist as opt-in modules with passing unit tests but are NOT in the default `MultiSink` chain (v3.1 — ship one sink working before three) + +### What we explicitly do NOT ship in P1 (deferred or dropped) +- ❌ `DebatingJudgePanel` — dropped per `arxiv:2508.17536`. Multi-judge variance reduction stays; debate dynamics removed. +- ⏸ Custom leaderboard UI — uses Inspect View in P1; cosmos-lab leaderboard extension lands in P4. +- ⏸ MCP-OAuth identity — P0's `AgentIdentity` is the AuthZ MVP; OAuth 2.1 + RFC 8707/8693 + signed audit log lands in P4 (Identity v2). Document the gap explicitly so no one reads more into P1's audit log than is there. + +### References (v3) +- [OpenTelemetry GenAI semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (experimental; opt-in stability flag) +- [Inspect AI (UK AISI)](https://inspect.aisi.org.uk/) — production eval standard +- [NeMo Agent Toolkit `nat eval`](https://docs.nvidia.com/nemo/agent-toolkit/latest/run-workflows/observe/observe.html) — schema we mirror for `nat`-runnable artifacts +- [Survey on Agent-as-a-Judge (2601.05111)](https://arxiv.org/pdf/2601.05111) +- [Debate or Vote — multi-agent debate refutation (`arxiv:2508.17536`)](https://arxiv.org/abs/2508.17536) +- [UC Berkeley — How We Broke Top AI Agent Benchmarks (2026)](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) +- [METR — RE-Bench, reward-hacking measurements](https://metr.org/AI_R_D_Evaluation_Report.pdf) + +--- + +## 3.1 Sentinel taxonomy (v7 — 5 types, including judge-hacking detector per Gaia2) + +The single highest-leverage thing cosmos-lab does is **block judge-only metrics from reaching gates**. To enforce that across phases, we need a *taxonomy* of structural verifiers, not a vague "structural check." Every Inspect task contributes one judge `Scorer` and one sentinel from this taxonomy. Sentinel/judge disagreement → run flagged for review, never silently accepted. + +> **v7 update**: added 5th sentinel type **`JudgeHackingCheck`** per Gaia2 finding (Meta ARE, Oct 2025): agents make verifier-pleasing artifacts without solving task. Gaia2 specifically surfaces this as distinct failure class. Implementation contract: all 5 types ride on **Anthropic PostToolUse hook** primitive (Claude Agent SDK contract) — sentinels are NOT a novel cosmos-lab mechanism but a structured taxonomy over the convergent hooks pattern. + +| Sentinel type | What it checks | Failure mode it blocks | Example (P1 seed task) | +|---|---|---|---| +| **`DeterministicStateCheck`** | A boolean function on post-run filesystem / DB / object state | "Judge said it worked, but nothing actually changed" | *dataset-inspect task*: `assert (workdir / "schema.json").exists() and parse_json(...) has expected keys` | +| **`OutputFormatCheck`** | Strict schema/regex/parse on the agent's final tool output | "Judge said the answer was good, but it's not parseable" | *paper-summary task*: response must be parseable JSON with `{title, key_findings: list[str], limitations: str}` | +| **`SideEffectCheck`** | A boolean over emitted OTel spans (specific tool was called, in expected order, with expected args) | "Judge said the agent reasoned through it, but the agent never actually ran the profiler" | *profiling task*: assert one `gen_ai.tool.call` with `tool.name = "torch_profiler"` and non-empty result span | +| **`NoOpCheck`** | The agent did *something* — at least N tool calls, modified at least one file, latency above lower bound | "Judge said pass on a no-op trajectory" — the canonical reward-hack pattern Berkeley audited (100% exploit rate on SWE-bench Verified, Terminal-Bench, FieldWorkArena) | every task: assert `tool_call_count >= 1 AND wall_clock >= 100ms AND not all(span.result == "")` | +| **`JudgeHackingCheck`** *(NEW v7)* | Detects pattern of verifier-pleasing artifacts: agent produced output that satisfies judge rubric but bypasses substantive task — e.g., wrote conftest.py to make tests pass without fixing bug, planted markers in output to signal completion to judge, output structure-matches but semantics diverge | "Agent gamed the judge with verifier-pleasing artifacts without solving task" — Gaia2 finding (Meta ARE Oct 2025); IQuest-Coder-V1 24.4% gain came from copying answers from git history | *bug-fix task*: assert NOT (`conftest.py` newly created OR `pytest.skip` markers added OR judge-output-keywords appear without test-passing-evidence) | + +**Composition rule (v7)**: a task's sentinel = `(DeterministicStateCheck OR OutputFormatCheck OR SideEffectCheck OR JudgeHackingCheck)`, **always AND-ed with `NoOpCheck`**. NoOpCheck is mandatory on every task — it catches the cheapest reward-hack class for free. + +**Why these five**: directly map to failure modes documented in 2026 production audits: +- **NoOp**: UC Berkeley "near-perfect scores with zero LLM calls" (2026) +- **DeterministicState**: METR reward-hacking findings — claim-without-action pattern +- **OutputFormat**: SWE-bench gaming via output structure-matching +- **SideEffect**: Berkeley's "fake successful trajectory" pattern (tool listed but not invoked) +- **JudgeHacking** *(NEW)*: Gaia2 (Meta ARE Oct 2025) — verifier-pleasing artifacts without task solution + +**Implementation contract**: sentinels implemented as **Anthropic PostToolUse hooks** (Claude Agent SDK pattern) — NOT a novel cosmos-lab mechanism. Hook fires after each tool call, sentinel evaluates, structured signal returned. Composes with Anthropic's hook lifecycle natively. + +**Owned path**: `agent/optimization/eval/sentinels/{deterministic.py,output_format.py,side_effect.py,no_op.py,judge_hacking.py}`. Each ships with ≥3 unit tests covering green path + intended failure detection + edge case. + +**Task-author contract**: every Inspect `@task` we ship registers exactly one judge Scorer and exactly one composed sentinel via `@sentinel(...)`. Tasks without both fail CI before merge. + +--- + +## 3.2 PrincipalAgent architecture (v7 — LangGraph supervisor + Magentic-One ledgers) + +The architectural answer to §0.9's thesis. v7 specifies PrincipalAgent as **LangGraph durable supervisor** (production winner — Uber/JP Morgan/BlackRock/Cisco) + **Magentic-One Task/Progress Ledger pattern** (graduated into Microsoft Agent Framework). This is NOT a novel cosmos-lab orchestrator design — it's a synthesis of two 2026 production-validated patterns specialized for ML lifecycle work. + +### 3.2.0 Frontier substrate choices (v7 audit-driven) + +| Substrate | Choice | Frontier evidence | +|---|---|---| +| **Orchestration** | LangGraph supervisor pattern | Production winner 2026: Uber/JP Morgan/BlackRock/Cisco/LinkedIn/Klarna; durable execution + checkpoint-restore stable in v1.0 | +| **Planning model** | Magentic-One Task Ledger (facts + plan) + Progress Ledger (step tracking with 2-iteration stall detection) | Microsoft Agent Framework absorbed Magentic-One as first-class workflow (April 2026); 2-iteration stall detection is what makes it production-robust | +| **Memory model** | 4-scope hybrid (user/agent/session/org) via Mem0 or Letta | Mem0/Atlan/supermemory.ai 2026 convergent pattern; Anthropic memory tool = flat persistent file | +| **Sentinel mechanism** | Anthropic PostToolUse hook contract | Claude Agent SDK pattern; deterministic + composable + frontier-aligned | +| **Sub-agent spawning** | RFC 8693 token exchange (depth=1, bounded) | OpenAI Codex hardcodes max_depth=1 (convergent default after recursion incidents) | +| **Skills loading** | Anthropic Skills pattern (markdown + scripts loaded by general agent) | Anthropic Skills blog 2026: explicit rejection of per-domain agents in favor of Skills | + +### 3.2.1 Substrate choice — ml-intern's `agent_loop.py`, not from scratch + +**Decision**: PrincipalAgent runs **inside ml-intern's existing `agent_loop.py`** (1626 lines, production-debugged, 16 built-in tools, MCP integration). cosmos-lab adds capabilities ON TOP, never replaces the loop. + +**Rationale**: +- ml-intern's loop already handles: tool calling, retries, context window management, doom-loop detection, session persistence (Mongo + HF), heartbeat saving, approval policy, cost estimation +- Re-implementing this is 6+ weeks of debugged code we'd recreate. Anti-pattern #4 (workflow): "Building a pipeline that should have been one model call" — generalize: building a substrate that should have been an existing one +- nvidia-nat as the *deployment harness* (P0.5 D3 adapter) — but the agent loop INSIDE nat is ml-intern's, wrapped through `cosmos_lab.harness.nat` + +**Owned path**: `cosmos_lab/principal/` — agent definition, planner, memory, capability expansion logic. Substrate stays at `agent/core/agent_loop.py` (zero-diff). + +### 3.2.2 The autonomous loop — long-horizon, multi-day + +``` +┌────────────────────────────────────────────────────────────────┐ +│ PrincipalAgent.run(goal, budget) │ +│ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ 1. PLAN PHASE (LLM reasoning, no tool calls yet) │ │ +│ │ - Read goal + episodic memory of similar past work │ │ +│ │ - Decompose into N experiment milestones │ │ +│ │ - Write plan to memory/working/plan.md │ │ +│ │ - Generate verifier scripts per milestone │ │ +│ └──────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ 2. EXECUTE PHASE (one milestone at a time) │ │ +│ │ - Pull next milestone from plan │ │ +│ │ - Hand to ml-intern agent_loop with scoped tools │ │ +│ │ - Loop runs full ReAct: read → tool → observe → … │ │ +│ │ - Every tool call audited, OTel span emitted │ │ +│ └──────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ 3. VERIFY PHASE (sentinel check after each milestone) │ │ +│ │ - Run milestone's auto-generated verifier │ │ +│ │ - Run paired sentinels (judge + structural) │ │ +│ │ - GREEN → mark milestone done, persist to memory │ │ +│ │ - RED → ───────────────────────────────────┐ │ │ +│ └────────────────────────────────────────────────┼─────┘ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌───────────────────┐ ┌─────────────────┐ │ +│ │ 4a. ALL DONE │ │ 4b. REPLAN │ │ +│ │ → ship report │ │ - Read failure │ │ +│ └───────────────────┘ │ - Update plan │ │ +│ │ - Loop to (2) │ │ +│ └─────────────────┘ │ +│ │ +│ Persisting throughout: working memory + episodic + semantic │ +└────────────────────────────────────────────────────────────────┘ +``` + +**Long-horizon** = this loop runs across sessions. Mid-execute when compute quota expires? Trajectory persists. Resume the next day reads `memory/working/plan.md` + last OTel span ID, picks up at the next milestone. + +### 3.2.3 Three memory tiers (agent-managed) + +| Tier | Path | Lifetime | What lives here | +|---|---|---|---| +| **Working** | `memory/working/` | Current task | Plan, current milestone, scratchpad, in-progress reasoning | +| **Episodic** | `memory/episodic/` | All past tasks | Trajectories of completed work, indexed by goal/domain. "Last time I did Cosmos eval, I tried X and it failed because Y" | +| **Semantic** | `memory/semantic/` | Distilled facts | Curated lessons (e.g., "torch.compile mode='reduce-overhead' wins on attention but loses on data loaders") — promoted from episodic via GEPA | + +Agent reads from semantic + relevant episodic during PLAN phase. Writes to working during EXECUTE. After milestone done, distills surprising-evidence into episodic. Weekly GEPA pass distills episodic → semantic. + +### 3.2.4 Replanning logic (sentinels as enabler, not gate) + +When a sentinel trips: +1. Sentinel emits **structured failure record**: `{sentinel_type, expected, actual, evidence_path}` +2. Agent reads record + relevant context (last 3 milestones, related episodic memory) +3. Agent generates **replan**: hypothesis about what went wrong, alternative approach to try +4. Updated plan written to working memory +5. Loop continues from updated milestone + +**Critical**: sentinel trip is NOT failure. Sentinel trip is **information**. Failure is when agent can't replan after N tries (default N=3). + +This is the difference vs Devin (silent failure → wasted hours): **sentinels make failure loud and actionable**. + +### 3.2.5 Capability expansion (earned trust) + +``` +PrincipalAgent.spawn(parent_token): + initial_caps = {read_file, list_directory, search_huggingface} # narrow + +After K consecutive sentinel-clean milestones: + PrincipalAgent.expand_capabilities(K): + K >= 5: + {write_file, run_tests} # expanded + K >= 20: + {git_diff, git_commit_local} # broader + K >= 50: + {run_real_gpu_job, deploy_to_modal} # production + + Each expansion = RFC 8693 token exchange: + parent_token + new_scope_subset → new_token + Logged to hash-chained signed audit (P4b) + Visible in OTel: gen_ai.agent.capability_expansion event +``` + +Capability expansion is **policy-bounded**: expansion table is config, not unbounded. Human review of expansion happens weekly (read the audit log). + +### 3.2.6 Concrete demo per §0.9 + +> Cosmos hiring manager: *"Take this Cosmos Reason 2 task. Improve pass-rate ≥3pp. One week, $400 budget."* + +PrincipalAgent.run() executes: +- **Hour 0-2 PLAN**: read task, read episodic ("similar Cosmos Reason 2 task done 2 weeks ago, found mid-layer attention was bottleneck"), decompose into 8 milestones (data inspection → baseline → 3 experiments → eval → optimization → final eval → report) +- **Day 1-2 EXECUTE milestones 1-4**: data work, baseline, experiment 1 (LR tuning, sentinel-clean), experiment 2 (data augmentation, sentinel TRIPS — eval drift > tolerance), REPLAN, experiment 2-revised +- **Day 3 EXECUTE milestones 5-6**: experiment 3 (LoRA on attention layers), eval shows +4.2pp pass-rate, sentinel-clean +- **Day 4 EXECUTE milestones 7-8**: optimization (torch.compile + selective layer pruning, +6% wall-clock), final eval (+4.1pp pass-rate held) +- **Day 5 SHIP**: report committed, OTel trajectory archived, W&B sweep linked, cost report ($383/$400) + +That's the demo. **Real autonomous principal-engineer work**, observed + replannable + auditable. + +### 3.2.7 What lives in `cosmos_lab/principal/` + +``` +cosmos_lab/principal/ +├── __init__.py # exports PrincipalAgent +├── agent.py # PrincipalAgent class — the orchestrator +├── planner.py # PLAN phase: goal → milestones + verifiers +├── executor.py # EXECUTE phase: hands milestone to ml-intern agent_loop +├── verifier_gen.py # auto-generates milestone verifiers from goal +├── replanner.py # REPLAN phase: sentinel trip → new plan +├── memory/ # 4-scope hybrid (Mem0/Letta substrate) +│ ├── working.py # in-task memory (scope: session) +│ ├── episodic.py # cross-task memory (scope: agent + user) +│ ├── semantic.py # distilled facts (scope: org) +│ └── compaction.py # 75%-context-utilization compaction trigger +├── context_eng/ # NEW (v7-stronger) — context engineering discipline +│ ├── prompt_layout.py # cache-aware structure: stable prefix → tool defs → conversation +│ ├── jit_retrieval.py # just-in-time recall_relevant(goal) tool +│ ├── progress_state.py # cosmos-progress.md cross-session bridging file +│ └── stale_check.py # behavior-vs-capability separation test (quarterly) +└── capability_expansion.py # RFC 8693 token-exchange delegation (drop "earned trust" framing) +``` + +P3-P9 phases each ADD a capability domain to PrincipalAgent (data/eval/train/optimize/multimodal/code) — they're not separate agents, they're skill modules the same agent uses. + +### 3.2.8 Context engineering discipline (v7-stronger — addresses Tier 3 + JD stand-out #3) + +Context engineering is "the new prompt engineering" per Anthropic 2026 engineering blog. v7-stronger ships explicit discipline in P3 PrincipalAgent foundation, addressing both JD stand-out bullet #3 (context compression / agent memory) and the 8-tier audit Tier 3 gap. + +**Four context-engineering primitives** (all land in P3, ~2.5w): + +#### Primitive 1: Cache-aware prompt structure + +Layout for every PrincipalAgent + worker invocation: + +``` +[STABLE PREFIX — never changes during a task] + - System prompt + - Capability scope manifest (RFC 8693 token contents) + - Sentinel rules + - Memory tier configuration +[TOOL DEFINITIONS — change only on RFC 8693 capability expansion event] + - All tool specs from current capability scope +[VOLATILE — the only churn region] + - Magentic-One Task Ledger + - Magentic-One Progress Ledger + - Recent conversation turns (post-compaction) +``` + +**Why**: every byte of churn in the stable region voids the prefix cache (Anthropic memory system; Claude API prefix caching) and 10× cost. Production agents reuse system prompt + tool defs thousands of times across tool calls — prefix caching is the single biggest cost lever. + +**Owned path**: `cosmos_lab/principal/context_eng/prompt_layout.py` — enforces layout invariant at build time. + +#### Primitive 2: Compaction strategy at 75% context utilization + +Trigger: when context window hits 75% of model limit (e.g., 150K/200K for Claude Sonnet 4.6). + +Action: +1. Pause agent loop +2. Use Anthropic memory tool API to summarize older conversation history (preserves task context, drops verbose tool outputs) +3. Replace older history with summary in next turn +4. Resume + +**Why**: Anthropic context-editing pattern — auto-clears stale tool results when context fills. Claude Code uses this; we adopt it. + +**Owned path**: `cosmos_lab/principal/memory/compaction.py` + +#### Primitive 3: Just-in-time retrieval via `recall_relevant(goal)` tool + +Don't pre-load episodic memory at session start. Agent calls explicit tool when needed: + +```python +# Tool def loaded by PrincipalAgent +@tool +def recall_relevant(goal: str, k: int = 5) -> list[Episode]: + """Recall past episodes relevant to current goal. 4-scope filtered.""" + return episodic_memory.search(query=goal, scope=current_scope, k=k) +``` + +**Why**: pre-loading wastes context window on irrelevant past tasks. Just-in-time keeps stable prefix small AND lets agent fetch only what matters now. + +**Owned path**: `cosmos_lab/principal/context_eng/jit_retrieval.py` + +#### Primitive 4: Structured state files for cross-session bridging + +Per Anthropic Claude Code pattern (claude-progress.txt + git history bridges sessions): + +- PrincipalAgent writes `cosmos-progress.md` after every milestone completion +- Format: append-only event log with structured sections (DONE / IN_PROGRESS / NEXT / SURPRISES) +- New session begins by reading `cosmos-progress.md` BEFORE anything else (initializer pattern) +- Bridges multi-day work across compute interruptions + +```markdown +# cosmos-progress.md (auto-generated by PrincipalAgent) + +## Task: Improve Cosmos Reason 2 by 3pp +## Started: 2026-05-04 +## Last update: 2026-05-06 (session #3) + +### DONE +- Milestone 1: dataset inspection (sentinel-clean) +- Milestone 2: baseline eval at 0.832 pass-rate + +### IN_PROGRESS +- Milestone 3: experiment 1 (LR sweep) — 4/6 configs done + +### NEXT +- Milestone 4: experiment 2 (data augmentation) + +### SURPRISES (drives REPLAN) +- Eval task 4 has different schema than rest — added sentinel for this +``` + +**Owned path**: `cosmos_lab/principal/context_eng/progress_state.py` + +#### Primitive 5 (bonus): Behavior-vs-model-capability separation test + +Per Anthropic's "harness assumptions go stale as models improve" warning (Sonnet 4.5 context anxiety patches were dead weight in Opus 4.5). + +Quarterly automated test: +1. Snapshot current PrincipalAgent + workers +2. Re-run same suite against current model + previous model +3. Detect harness assumptions that no longer hold (e.g., "context compaction at 50%" was for older model; new model needs only 75%) +4. Flag dead-weight resets/patches for removal + +**Owned path**: `cosmos_lab/principal/context_eng/stale_check.py` + +#### New numerical commitments (extend §0.7 + AGENTIC_EVAL_SPEC §9) + +| # | Target | Commitment | How measured | +|---|---|---|---| +| **E15** | Prefix cache hit rate | **≥80% on stable prefix region** | Claude API cache_read_input_tokens metric, weekly | +| **E16** | Compaction trigger reliability | **fires at 75% ± 5% context utilization, no missed triggers in 100 runs** | unit test fixture | +| **E17** | Cosmos-progress.md cross-session recovery | **100% of resumed sessions correctly recover state from progress file** | resumption smoke test, every cross-session task | +| **E18** | Behavior-vs-capability stale assumption detection | **≥1 stale assumption identified per quarterly retest** | quarterly retest report | + +--- + +## 3.3 Agentic eval architecture (v5 — pointer to AGENTIC_EVAL_SPEC.md) + +The sentinel taxonomy (§3.1) is one piece of agentic eval. The full architecture lives in **`AGENTIC_EVAL_SPEC.md`** — companion to `EVAL_SPEC.md` (which covers ML-output eval; this companion covers agent-system eval per axiom A8 *"the agent is itself an artifact-under-eval"*). + +### Why a separate spec doc + +EVAL_SPEC.md evaluates models. AGENTIC_EVAL_SPEC.md evaluates agents. Three distinctions (per AGENTIC_EVAL_SPEC §1): +1. **Trajectory is the deliverable, not just output** — two agents producing identical correct outputs can have radically different trajectory quality (one took 47 tool calls + 12 replans, the other took 3 calls correct first time) +2. **Agent is itself artifact-under-eval (A8)** — strong MMLU + strong HumanEval ≠ strong agentic tool-use; need separate eval surface for agent decisions +3. **Long-horizon eval is non-fungible with short-horizon eval (NEW axiom A13)** — a 5-day task is not 120 1-hour tasks; cross-session memory, plan staleness, capability expansion mid-task are new failure modes + +### What agentic eval architecture adds (over §3.1 sentinels alone) + +**5-tier ladder** (transfers from EVAL_SPEC, specialized for agentic): +- T0 smoke / T1 calibrated quality / T2 long-horizon / T3 shadow / T4 canary + +**6 agentic-specific surfaces** (NEW — don't exist in EVAL_SPEC): +- **S1 Trajectory eval** — tool-call efficiency, replan ratio, wasted-work, doom-loop frequency +- **S2 Plan-quality eval** — LLM-judge on PLAN-phase decomposition (gates EXECUTE per §3.2) +- **S3 Replan-quality eval** — sentinel trips → response quality (success rate, diversity, time-to-recovery) +- **S4 Capability boundary eval** — 50-task denied-tool probe suite (security-critical for capability expansion per AGENTIC_EVAL_SPEC axiom A12) +- **S5 Reward-hacking adversarial eval** — monthly red-team sprint (covers what UC Berkeley's 8/8-hackable-benchmarks crisis demands) +- **S6 Cross-agent comparison eval** — PrincipalAgent vs Devin vs Claude Code vs human, quarterly Pareto chart (the differentiator pitch) + +**3 cross-cutting meta layers** (transfer from EVAL_SPEC): +- M1 Reproducibility envelope, M2 Eval-of-eval, M3 Cost telemetry + +**3 input types** (per JD bullet 5): +- I1 Automated metrics, I2 Human feedback (5% sampling + weekly review), I3 Agent-driven analysis + +### Operational cadence summary + +| Cadence | What runs | Cost budget | +|---|---|---| +| Every commit | T0 + S1 sanity | <$0.10 | +| Every PR to main | T1 + S1 + S2 + S3 | $10-$50 | +| Nightly | T1 + T2 + S4 | $50-$200 | +| Weekly | T3 + S6 sample + I2 5% human review | $300-$700 | +| Monthly | S5 red-team sprint + M2 eval-of-eval | $500-$1000 | +| Per release | T4 canary + production monitoring | $300-$1000 + risk | +| Quarterly | S6 cross-agent full Pareto comparison | $500-$2000 | + +### Integration with v5 phases + +Per AGENTIC_EVAL_SPEC §10, this architecture integrates into v5 phases without adding a new phase: +- **P1** establishes T0/T1 + S1 + M1 + sentinel taxonomy +- **P4a** EvalAgent capability builds S2 + S3 + I2 + S5 monthly red-team kickoff +- **P4b** ships S4 capability boundary probe suite (security-critical, blocks identity v2 ship) +- **P5/P6** runs T2 long-horizon eval on real GPU sweeps (per Invariant 9) +- **P9b** ships first S6 cross-agent comparison (PrincipalAgent vs Claude Code on bug fixture) +- **P10** runs T4 canary + first quarterly S6 full comparison + +**Net cost**: ~3-4 days additional spec/test work spread across phases. + +### 10 numerical eval-system targets + +Extends §0.7 (see "v5 eval-system additions" subtable). Examples: +- E1: sentinel suite FPR ≤ 5% on null fixtures +- E2: sentinel suite FNR ≤ 1% on planted regressions +- E5: replan success rate ≥ 70% +- E6: capability boundary 100% (0 unauthorized calls in 100 child runs) +- E8: PrincipalAgent on Pareto frontier of cost × quality vs comparison agents + +Full target list in AGENTIC_EVAL_SPEC §9. + +--- + +## 4. Phase 2 — Cosmos Vertical (Weeks 4–5) — v3.1 reframed + +### Reframe (v3.1) +v2/v3 promised `VideoUnderstandingAgent` as the proof point — but real video understanding requires GPU access we don't have committed. v3.1 honestly scopes P2 as: **provider abstraction + Cosmos tool wrappers + mocked-endpoint validation**. The agent itself is deferred to P9 where it lives inside the e2e pipeline (real or simulated). This is honest scoping, not retreat — it removes a credibility risk (a "demo" that only runs against mocks looks worse than no demo). + +### Goals (v3.1) +1. First-class Cosmos integration scaffolding (Reason 2 / Predict 2.5 / Transfer 2.5) via `NIMProvider`. +2. Tool wrappers callable from any cosmos-lab agent — no agent ships in P2 itself. +3. 5 cosmos eval tasks (Inspect format) plugged into P1 harness, validated against mocked NIM. + +### Deliverables (v3.1) + +| Module | Path | Notes | +|---|---|---| +| `NIMProvider` | `agent/optimization/providers/nim_provider.py` | OpenAI-compatible litellm custom provider pointed at NVIDIA NIM endpoints. **First non-HF provider** in the platform — establishes the abstraction other providers (Modal, Lambda, NGC) will follow. | +| `CosmosToolset` | `agent/optimization/cosmos/{reason,predict,transfer}.py` | Tool wrappers calling Cosmos Reason 2 / Predict 2.5 / Transfer 2.5 *via* `NIMProvider` (preferred) with HF-hosted fallback | +| Physical-AI eval pack (Inspect format) | `tasks/cosmos/*.py` | 5 Inspect `@task`s: object localization, motion prediction, scene QA, future-state gen, sim-to-real. Each carries one judge + one sentinel per §3.1 taxonomy. | +| NIM mock contract test | `tests/optimization/cosmos/test_nim_contract.py` | 100% of cosmos tasks parseable against mock without endpoint changes (P2 numerical target) | +| ~~`VideoUnderstandingAgent`~~ | — | **Deferred to P9 (e2e demo)**: shipping a real agent without GPU is demo-ware. The toolset is ready; P9 wires it. | + +### Acceptance (v3.1) +- All 5 cosmos tasks load + execute against mocked NIM endpoint (no real-endpoint dependency in CI) +- `NIMProvider` registered with litellm; basic completion + tool-call paths covered by tests +- Real-endpoint smoke test documented as a 1-page README runbook (manual until GPU access secured) +- **No agent ships in P2** — this is honest scoping per v3.1 reframe + +### References +- [Cosmos technical blog](https://developer.nvidia.com/blog/scale-synthetic-data-and-physical-ai-reasoning-with-nvidia-cosmos-world-foundation-models/) +- [Cosmos GitHub org](https://github.com/nvidia-cosmos) + +--- + +## 5. Phases 3–10 — sketches (v3) + +(Detailed when entered. High-altitude only here. v3 deltas marked with ⚡.) + +### P3 DataAgent (Wks 6–7) ⚡ — v4 reframed for real multimodal data + +**v4 reframe**: v3.2 said "compose curator stages" but didn't require any real data flow. v4 requires DataAgent to process **10-100 hours of real video sample** through cosmos-curate, ship dataset card with full lineage. This closes §0.8 production gate G4 (real multimodal data flow) and credibly demonstrates the JD's "multimodal ML pipelines spanning data processing" bullet at non-toy scale. + +- ⚡ Compose **NeMo Curator 26.04** stages (text/image/video) + **cosmos-curate** (video-specific) — do NOT reimplement curation +- Synthetic data gen via Cosmos Predict 2.5 + Nemotron-style instruct→reward filter pipeline (Magpie-class for instruction data) +- Curation primitives via NeMo Curator: GPU-accelerated heuristics + classifiers (domain/quality/safety) + fuzzy/semantic dedup +- ⚡ **Real video sample (v4)**: pick a public dataset slice (e.g., Open-Sora samples, YouTube-CC0 batch, or Cosmos community sample) at **10-100 hour scale**. Run through cosmos-curate Ray pipeline on Modal/Lambda GPU (~$50 budget). Output a real dataset card with classifier verdicts + dedup statistics + provenance. +- Lineage tracking via **W&B Artifacts** (v3.1: pick one, not both — W&B aligns with the model registry story we'll need in P5/P6; DVC was redundant) +- DataAgent = pipeline DAG with LLM-in-the-loop nodes (judge filters, persona-rewriters, taxonomy samplers) + agent that proposes new gen prompts from eval-failure clusters. **Not** "agent autonomously curates a corpus" — that's hype in 2026. +- Output: dataset card committed to HF Hub with full lineage manifest; **100% of records traceable from raw video → curator stage → final row** (P3 numerical target) +- Acceptance: + - `cosmos-lab data --task ` produces a curated dataset card with provenance + a NeMo-Curator-compatible recipe file + - W&B Artifacts lineage graph renders end-to-end + - **Real video processed**: at least one cosmos-curate run on 10-100 hours of public video, results committed (or HF dataset link); cost report ≤$50 + +### P4a EvalAgent platform (Wks 8–9) ⚡ — v3.1 simplified +- ⚡ **v3.1 cut**: dropped the custom frontend overlay. Use **Inspect View embed** as the leaderboard UI — it's the 2026 standard reviewers expect to see, and §6.5 explicitly said "porting frontend has no Cosmos-pitch value." The custom UI was contradicting our own framing. +- ⚡ Regression PR gate via Inspect AI bootstrap CI — block if lower CI bound regresses by >δ. **PR-gate false-positive rate ≤ 5% on 10-PR replay set** (P4a numerical target); shadow-mode for first week before enforcing +- Cosmos-lab–specific leaderboard data exposed as a **single CLI command** (`cosmos-lab leaderboard --format markdown|json`) reading from `DuckDBSink` (now opt-in-enabled here for analytics) — no custom React work +- Human-feedback ingestion via existing Slack tool → Argilla-style labeling sink → re-feeds Inspect dataset +- Acceptance: PR opened with regression auto-blocks; `cosmos-lab leaderboard` renders pass-rate ± CI + sentinel disagreement count + cost; one human-labeled batch flows back into Inspect dataset cleanly; Inspect View link from PR comment + +### P4b Identity v2 — v1 cut (Wks 10–11) ⚡ *(promoted to its own phase; v1 scope tightened per advisor)* + +**Honest scope cut**: full Identity v2 (MCP OAuth client + 3 RFCs + AS integration + hash-chained signed log + KMS custody + tamper mutation test + sub-agent propagation + OTel wiring) is 3-4 weeks of work. P4b ships a **v1 cut in 2 weeks**; KMS key custody and the tamper mutation test are explicitly deferred to P10. Document the gap in §8 as a known v1 limitation. + +**P4b v1 (in scope, 2w)**: +- ⚡ Graduate P0 `AgentIdentity` to **MCP OAuth 2.1 client** (RFC 6749 + RFC 9728 Protected Resource Metadata discovery) + **RFC 8707 Resource Indicators** (closes confused-deputy hole, mandatory in MCP June 2025+ spec) + **RFC 8693 token exchange** for sub-agent scope-down +- Replace JSONL audit log with **hash-chained signed log** (linear chain): each entry = `H(prev_hash || canonical_event_json)`, root signed by **software-held Ed25519 key** (libsodium / `cryptography` library) — KMS migration deferred to P10 +- AS choice: **D1 evaluation** of WorkOS AuthKit, Auth0, and self-hosted Hydra; pick based on hands-on (research subagent's preference is WorkOS but unverified — confirm in D1 spike). Do not roll our own OAuth. +- Sub-agent identity propagation: parent's token + scope subset → token exchange → child token; child token visible in OTel `gen_ai.agent.parent_id` +- Acceptance: integration test spawns a sub-agent via RFC 8693 with strictly-narrower scope (verified by attempting a denied tool call from the child); README cites exact RFC numbers and EU AI Act Article 12 + +**P4b v1 deferred to P10 polish**: +- KMS key custody (AWS KMS / GCP KMS) — software-key signing in v1 is sufficient for research-platform threat model; KMS is a deployment hardening, not a design change +- Tamper-evidence mutation test (mutate one entry → verification fails) — manual review of audit-log integrity in v1 +- Document both as known v1 gaps; promote in P10 with the Cosmos-team review + +### P5 TrainOrchestrator (Wks 11–12.5) ⚡ +- ⚡ **Centaur HPO**: LLM proposes candidates, **CMA-ES/TPE refines via shared mean/step-size** state. Pure-LLM HPO is empirically inferior (`arxiv:2603.24647`, Mar 2026). +- ⚡ `ComputeBackend` interface with three impls: `HFJobsBackend` (existing wrapper), `SkyPilotBackend` (multi-cloud Job Groups), `NeMoRunBackend` (NVIDIA-native multi-node) +- ⚡ For post-training paths, wrap **NeMo-RL** (production-mature, Nemotron-3-Super trained on it); do not reimplement +- Early-stop policy (validation plateau detection) +- W&B/MLflow monitor, both swap-in via OTel +- ⚡ **v4 Invariant 9**: at least one Centaur sweep runs on real GPU (Modal/Lambda Cloud, ~$50 budget); W&B run committed to repo as `docs/p5_real_sweep.md` +- Acceptance: agent runs a 6-config Centaur sweep, real GPU, selects winner with statistical justification (CI-aware, not point-estimate); Centaur ≥ parity with pure CMA-ES baseline within 1σ + +### P5.5 PyTorch Depth (Wk 13) ⚡ — NEW IN v4 + +**Why this exists**: NVIDIA Cosmos JD requires *"Deep familiarity with PyTorch, including the ability to debug, adapt, and extend model behavior within larger software systems."* No phase in v3.2 demonstrated this. v4 ships a 1-week phase that produces *one PyTorch artifact reviewers can read and run* — code, profiler traces, before/after numbers. + +**Three menu options (pick one in P5.5 D1, ship in D2-D5)**: + +| Option | What | Why credible | +|---|---|---| +| **A. Custom autograd op** | Write a custom `torch.autograd.Function` (forward + backward) for one operation in a real workload (e.g., a fused attention variant, a custom loss with non-trivial gradient, a quantization-aware op). Validate gradient via `torch.autograd.gradcheck`. | Demonstrates understanding of autograd internals — *the* PyTorch depth signal | +| **B. `torch.compile` + kernel selection** | Take a real workload (Cosmos-relevant: video preprocessing, multimodal encoder, or attention block). Profile with `torch.profiler` (CUPTI / kineto traces). Identify hot kernel. Apply `torch.compile` with mode/options tuned by profiler evidence. | Demonstrates production PyTorch optimization — exactly what OptimizeAgent automates later in P6 | +| **C. Distributed training pattern** | Implement one non-trivial distributed pattern: tensor parallelism for a 1-2B model layer, or activation checkpointing tuned for a real memory budget, or a custom collective via `torch.distributed`. Validate via 2-GPU run on Modal. | Demonstrates depth in the area Cosmos team cares about most (multi-GPU world model training) | + +**Recommendation**: Option B is the cheapest credibility-per-dollar — fits naturally with P6 OptimizeAgent (the work composes), uses cheaper single-GPU runs, and hits the JD's "debug, adapt, extend model behavior" exactly. Option A is the most impressive technically. Option C is the most Cosmos-aligned. + +**Day-by-day**: + +| Day | Deliverable | Acceptance | +|---|---|---| +| D1 | Pick option (A/B/C); commit choice + workload selection in `agent/optimization/pytorch_depth/CHOICE.md`; baseline measurement | baseline number recorded | +| D2-3 | Implement; for A: gradcheck passes; for B: profiler trace identifies bottleneck, compile pattern applied; for C: 2-GPU run completes | code committed; profiler artifacts in `docs/p5_5_profiler/` | +| D4 | Benchmark vs baseline; produce `docs/p5_5_speedup.md` with reproducible benchmark script | **≥10% wall-clock improvement** vs baseline (P5.5 numerical target) | +| D5 | Write 1-page README explaining the choice, the bottleneck, the fix, and the limitation. This becomes a portfolio artifact for interview. | README readable by ML systems eng; reproducible by `python bench.py` | + +**Acceptance**: +- Code passes `pytest tests/optimization/pytorch_depth/` +- ≥10% measured wall-clock improvement on real workload (not synthetic) +- Profiler traces (kineto JSON) committed +- 1-page README explains the work in terms a Cosmos hiring manager can audit +- Stretch: open a discussion on PyTorch forum or file an issue if the optimization reveals a framework gap + +**Why 1 week is enough**: this is *one* artifact, not a research thesis. Senior PyTorch engineers ship Option B in 2-3 days; we budget 5 days for setup + writeup + Cosmos-grade polish. + +### P6 OptimizeAgent (Wks 14–15.5) ⚡ +- Collapses old P2-4: profiling + training_opt + inference_opt +- ⚡ Quality budget enforced via **Inspect AI scorers** + paired structural verifiers (per §3.1 sentinel taxonomy); judge-only metrics flagged as advisory. **Hard gate**: ≤2% absolute regression on deterministic metrics; **target**: ≥1.5× wall-clock speedup on 4 baseline workloads (P6 numerical targets) +- ⚡ **Sandbox 2-tier (v3.1 trimmed)**: E2B (Firecracker) for CPU correctness checks; **Daytona** for GPU profiling/training. NemoClaw moved off the main path — it's alpha-stage early-preview (2026-03-16) and gating P6 on it is a credibility risk. NemoClaw becomes a **P10 stretch** for the on-prem-NVIDIA-aligned story. +- Single `SandboxRunner` interface so swapping Daytona → OpenShell/NemoClaw later is a config change, not a refactor +- Acceptance: 4 baseline workloads optimized with measured speedup + quality preservation; ≥1.5× speedup achieved on at least 3/4; speedup numbers committed to repo as `docs/p6_speedup_baseline.md` so any regression is auditable + +### P7 Memory & compression (Wks 16–17) ⚡ +- ⚡ **3-tier hierarchical memory** (not flat): (1) Core/index — Claude-Code-style `MEMORY.md` ≤25KB always-resident pointer index; (2) Recall — file-based, Anthropic `memory_*` tool API compatible; (3) Archival — Letta-compatible KG/vector hybrid +- Do NOT rebuild MemGPT-style paging; use Anthropic's memory tool surface or Letta server as the storage layer +- Owned `OptimizationContextManager` extending upstream — adds hierarchical summarization + retrieval over recall tier; emits `gen_ai.tool.memory.*` spans +- Acceptance: agent recalls a relevant prior session result on a new related task; memory tool API surface usable by any Claude-native client + +### P8 Self-improvement loop (Wks 18–19) ⚡ +- ⚡ Use **DSPy 3.x + `dspy.GEPA`** (do not roll our own GEPA) — mine failures from trajectory store → propose prompt/tool-description revisions → A/B vs control on Inspect AI golden suite → ratchet only on lower-CI-bound improvement (not point estimate) +- Per `RESEARCH_AHE_ANALYSIS.md` decision: offline-batch evolution (Option A), not online-session — **online self-improvement remains hype in 2026** +- Weekly cadence; every promotion requires human review and a signed promotion record in the audit log +- Acceptance: 1 prompt revision shipped via the loop, demonstrably improving judge pass-rate at p<0.05 with sentinel agreement preserved + +### P9 Multi-agent e2e (Wks 19.5–21.5) ⚡ + +**Two demos, one phase** — split intentionally to hit two distinct JD bullets: + +**P9a (W22) — Vertical pipeline demo**: +- ⚡ AV scenario gen via Cosmos Predict 2.5 + π₀.₅-class policy eval — maps directly onto the Cosmos team's internal stack +- DataAgent (cosmos-curate stages) → TrainOrch (NeMo-RL on SkyPilot) → EvalAgent (Inspect AI + physics-consistency scorer) → OptimizeAgent (Centaur over inference config) +- ⚡ Sub-agent identity propagated via RFC 8693 token exchange (P4b deliverable); each sub-agent's traces show parent_agent_id in OTel `gen_ai.agent.parent_id` +- Wall-clock target: **< 8 hours end-to-end** (P9 numerical target) + +**P9b (W23) — CodeAgent demo (NEW in v3.1)**: +- Hits the JD's headline bullet: *"systems where AI doesn't just run models but helps build them"* — the data/train/eval pipeline alone doesn't prove this. +- A small `CodeAgent` (capability-scoped to `{read_file, write_file, run_tests, git_diff}`) iterates on a 10-bug fixture: read failing test → propose fix → write file → re-run tests → repeat (max N iterations). +- Runs in **E2B sandbox** with full OTel span capture; sentinel = `OutputFormatCheck(diff_is_valid_unified) AND SideEffectCheck(test_command_was_called) AND NoOpCheck` +- Target: **≥60% small-bug-fix success on the 10-bug fixture** (P9 numerical target). Below that, CodeAgent is a research artifact, not a demo. Above that, it's a credible "AI helps build AI" story. +- Owned path: `agent/optimization/agents/code_agent.py` + `tasks/codebench/*.py` (Inspect tasks built from a curated bug fixture) + +**Joint acceptance**: +- P9a pipeline runs to completion in < 8h; one "trajectory of trajectories" captured; all four sub-agent identities verifiable in audit log +- P9b CodeAgent achieves ≥60% on bug fixture; replay logs viewable via Inspect View +- Both demos produce a 60-second screen-capture each for the P10 demo video + +### P10 Production deployment + OSS upstream PR + demo (Wks 21.5–22.5) ⚡ — EXPANDED IN v4 + +**Why 2 weeks instead of 1**: v3.2's P10 was "publication" (`pip install` + demo video + blog). v4's P10 is "production" — actual users, actual traces, actual upstream contribution. This is the difference between "smart project" and "ship-grade product." Per §0.8 production commitments G3 and G5. + +**Week 1 — Production deployment (W21.5–22)**: +- ⚡ **Deploy cosmos-lab reference agent on HF Spaces or Modal endpoint.** Choice TBD by free-tier feature parity at deploy time. Surface = a simple "submit ML task, see judged trajectory" UI; bring-your-own API key. +- Open public access; share in 2-3 ML communities (HF Discord, r/MachineLearning, NVIDIA developer forum) +- Gather **≥100 real user sessions over 1-week window** (Invariant 9 / §0.8 G3) +- Cost telemetry visible — total spend + per-session avg in deploy README +- One incident playbook: what to do if sentinel disagreement spikes / cost runaway / endpoint returns 5xx + +**Week 2 — OSS upstream PR + polish (W22–22.5)**: +- ⚡ **Open ≥1 upstream PR** (§0.8 G5 — review-ready, not draft). Two candidates: + - **PR to nvidia-nat**: contribute the sentinel taxonomy as a `nat`-native scorer plugin; reframes our `RewardHackSentinel` as `aiq.sentinel.NoOpCheck` + 3 siblings. *Strongest Cosmos signal.* + - **PR to Inspect AI**: contribute a `paired_sentinel_scorer(judge, structural)` Scorer combinator that wraps any judge with mandatory structural verification. *Broadest community impact.* + - Pick whichever has the cleanest reception window (check recent PR merge cadence in D1); ship the other as v1.1. +- ⚡ **`pip install cosmos-lab[all]` final release** with `nat`-runnable workflow YAML reference example +- ⚡ **P4b promotions** (deferred from P4b v1): KMS key custody (AWS KMS or GCP KMS) for the audit-log signing key + tamper-evidence mutation test +- ⚡ **NemoClaw stretch** (deferred from P6): wire NemoClaw as an alternative `SandboxRunner` backend, validated against the same 4 P6 workloads. Only ship if NemoClaw repo has stabilized between P6 and P10; otherwise punt to v1.2. +- ⚡ **5-minute demo video** showing: P9a pipeline Inspect View log + Phoenix span tree, P9b CodeAgent fixing a bug live, signed audit log entry, sentinel-caught reward-hack attempt (deliberately constructed), production endpoint dashboard with real user count + cost +- ⚡ **Blog post**: "Building a production-grade closed-loop ML platform — 6 agents on a sentinel-gated library, deployed in front of users, contributed upstream" + +**Acceptance**: +- ≥100 real user sessions captured + summary committed (`docs/p10_production_traces.md`) +- ≥1 upstream PR opened, reviewer engaged (any response = success; merge = stretch goal) +- Demo video published; pip install works from clean env +- Total deployment cost in the budget envelope (~$200-400 across all of v4 GPU + deploy) + +### References (v3) +- [Trajectory-Informed Memory Generation](https://arxiv.org/abs/2603.10600) +- [Self-Evolving Agents Survey](https://arxiv.org/html/2507.21046v4) +- [TRACE benchmark evolution](https://arxiv.org/html/2510.00415) +- [Can LLMs Beat Classical HPO? (`arxiv:2603.24647`)](https://arxiv.org/abs/2603.24647) +- [DSPy GEPA Optimizer docs](https://dspy.ai/api/optimizers/GEPA/overview/) +- [π₀.₅ paper (`arxiv:2504.16054`)](https://arxiv.org/abs/2504.16054) +- [NVIDIA NeMo-RL repo](https://github.com/NVIDIA-NeMo/RL) +- [SkyPilot v0.12 Job Groups](https://github.com/skypilot-org/skypilot) +- [NVIDIA OpenShell + NemoClaw](https://developer.nvidia.com/blog/run-autonomous-self-evolving-agents-more-safely-with-nvidia-openshell/) +- [Anthropic memory tool docs](https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool) +- [Letta v1 agent loop](https://www.letta.com/blog/letta-v1-agent) +- [MCP authorization draft](https://modelcontextprotocol.io/specification/draft/basic/authorization) +- [EU AI Act Art. 12 record-keeping](https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-12) + +--- + +## 6. Cross-cutting engineering practices (v3) + +- **Test-first for owned modules**: every owned class has a unit test before integration +- **Lint**: `ruff check agent/optimization/ --ignore E501,F401,E402` clean each phase +- ⚡ **OTel GenAI from day one (P1+)**: every owned module emits `gen_ai.*` spans through the central `OTelGenAIEmitter`; `logging.getLogger("cosmos_lab.*")` reserved for structured app-level logs (not traces) +- **Reproducibility**: every `evaluate` run dumps a self-contained replay manifest (config + seeds + tool versions + sandbox image digest) +- **No hidden state**: all persistent state (audit log, trajectory store, memory) lives under `~/.cosmos_lab/` with explicit paths in config +- ⚡ **No judge-only metric reaches a gate**: every quality budget / PR gate / promotion decision requires a deterministic structural verifier alongside any judge score (anti-reward-hacking invariant) +- ⚡ **Bootstrap CI on every aggregated metric**: pass-rate, judge agreement, latency p50/p99 — point estimates without CIs are not committed to the leaderboard + +--- + +## 6.5 Vendor independence — library boundary IS the answer (v3.2) + +> **v3.2 reframe**: vendor independence used to be "three pluggable interfaces" (Sink, Provider, Compute). That's still true and still useful. But the *deeper* answer added in v3.2 is: cosmos-lab is a **library** (`pip install cosmos-lab`), not a fork of any platform. The library plugs into multiple agent harnesses via thin adapters (§0.4). This is the strongest possible vendor-independence story — stronger than any number of pluggable interfaces — because **the agent loop itself is swappable**. + +### Harness adapters (the real vendor-independence layer) + +| Harness | Adapter | Status | Use case | +|---|---|---|---| +| **`nvidia-nat`** | `cosmos_lab.harness.nat` | **Primary** (P0.5+) | Cosmos pitch, NVIDIA-native deployments | +| **`ml-intern`** | `cosmos_lab.harness.ml_intern` | v1 compat (P0.5+) | HF stack + web UI + existing flywheel | +| Claude Agent SDK | `cosmos_lab.harness.claude_sdk` | future (v1.1) | Anthropic-native deployments | +| OpenAI Agents SDK | `cosmos_lab.harness.openai_agents` | future (v1.1) | OpenAI-native deployments | +| LangGraph | `cosmos_lab.harness.langgraph` | future (v1.2) | HITL durable workflows | + +### Three pluggable interfaces (still in scope, complement the harness adapters) + +| Interface | Phase introduced | HF-native impl (default) | NVIDIA-native impl | Other backends | +|---|---|---|---|---| +| **Sink** (`TrajectorySink`) | P1 | `OTelGenAIEmitter → Phoenix` (default); `HFDatasetSink` opt-in (P8 flywheel) | `S3Sink` / `NGCArtifactSink` (P4) | `DuckDBSink` opt-in (P4a analytics); MongoSink cut in v3.1 | +| **Provider** (`LLMProvider` via litellm) | P2 | HF router (already in `litellm`) | `NIMProvider` (P2) | Anthropic / OpenAI direct, Modal endpoints (P5+) | +| **Compute** (`ComputeBackend`) | P5 | `HFJobsBackend` (wraps `jobs_tool`) | `DGXBackend` stub + `NGCJobsBackend` (P5 stretch) | `ModalBackend` (P5), `LambdaBackend` (P5) | + +### Observability + +- ⚡ **Primary (v3)**: `OpenTelemetry GenAI` semantic conventions (`gen_ai.*` spans), opt-in via `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental`. Backend = Phoenix (OSS) by default; Langfuse / W&B / Weave / DataDog all swap-in by config (mirrors `nvidia-nat` telemetry block). +- HF dataset upload remains as the *data flywheel sink* (feeds `kpis_scheduler.py` rollups and AHE/SFT loops); not removed, just one of N OTel exporters +- ⚡ Local secondary: `logging.getLogger("cosmos_lab.*")` for structured app logs only (not traces) — keeps non-OTel debugging cheap + +### Identity (P0 AuthZ MVP → P4 MCP-OAuth → P9 federated multi-agent) ⚡ + +- Phase 0 (shipped): in-process `AgentIdentity`, unsigned, JSONL audit log — explicitly an AuthZ MVP, no AuthN +- ⚡ Phase 4: graduate to **MCP OAuth 2.1 + RFC 8707 Resource Indicators** (closes confused-deputy hole) + **RFC 8693 token exchange** for sub-agent scope-down + **hash-chained signed audit log** aligned to EU AI Act Art. 12 (enforceable 2026-08-02). Use **WorkOS AuthKit** or **Auth0** as the OAuth 2.1 AS — do not roll our own. +- Phase 9: when multi-agent orchestration lands, wire HF OAuth + a generic OIDC seam so the platform can authenticate against NVIDIA IDP when integrated + +### Sandbox (v3) + +- ⚡ **2-tier `SandboxRunner` interface** (lands in P6): (1) **E2B (Firecracker)** for CPU-only correctness checks (~150ms cold start, hardware isolation, cheap); (2) **Daytona** OR **NVIDIA OpenShell + NemoClaw** for GPU profiling/training (kernel-level policy, persistent stateful workspaces). OpenShell is the on-prem default for Cosmos-aligned deployments. +- Anthropic-managed code execution: explicitly NOT used — opaque, no GPU, no on-prem. + +### What we deliberately do *not* multi-vendor + +- Tool registry / `ToolRouter` interface stays unchanged — it's already provider-agnostic +- Frontend / SSE transport stays HF-stack — porting it has no Cosmos-pitch value +- Memory storage layer: pick Anthropic `memory_*` tool API OR Letta — not both; both is feature-creep + +**Net pitch (v3.2)**: "cosmos-lab is a `pip install`-able library that adds governance — sentinel-gated judging, MCP-OAuth identity with RFC 8693 sub-agent scope-down, GEPA promotion contracts, quality budget invariants — to whatever agent harness you already run. Default install ships adapters for **`nvidia-nat`** (Cosmos team's stack) and **ml-intern** (HF stack); Claude Agent SDK and OpenAI Agents SDK adapters land in v1.1. The library never owns the agent loop — that's what makes it portable. Cosmos team can `pip install cosmos-lab[nat]` and `nat run cosmos-lab.yaml` today; everything else is a config swap." + +--- + +## 7. What's deprioritized vs PLAN.md + +- **Deep optimization sub-phases** (old P2 training opt → P3 inference opt → P4 multimodal opt → P5 VLA opt → P6 custom kernels): collapsed into single P6 OptimizeAgent vertical. Custom CUDA kernel generation removed from main path (re-add as P11 stretch goal if e2e ships early). +- **AHE Stages A–I detail**: subsumed into P8 self-improvement loop. The 7-slot decomposition from `RESEARCH_AHE_ANALYSIS.md` informs which slots get evolved (system_prompt, tool_description) but isn't sequenced as 9 stages. +- **Two-level benchmarking with Amdahl deviation**: deferred to P6 entry; P1 eval harness handles the simpler agent-task case first. + +--- + +## 8. Open questions to resolve before P1 (v3) + +Resolved-in-v3 (carried from v2): +- ✅ **DuckDB vs SQLite** → DuckDB stays as analytics layer **over OTel-shaped rows** (not as primary schema) +- ✅ **Judge model** → Sonnet 4.6 ×3 default in `MultiJudge`; Opus 4.7 only for CI-straddle tie-break +- ✅ **Multi-judge agreement metric** → bootstrap CI on pass-rate is the headline metric; pairwise Cohen's κ as diagnostic only (debate framing dropped) +- ✅ **NIM endpoint** → `cosmos-reason-2` first (smaller surface, easier to mock + judge) + +Still-open (v3 raises): +1. **OTel GenAI semconv stability** — spec is experimental as of 2026-Q1, opt-in via stability flag. *Risk*: schema may change before 1.0. *Mitigation*: pin to a specific draft date in `requirements.txt`, add a v1.0-migration test that runs against both versions. *Decision needed before P1 D1.* +2. **Inspect AI vs `nat eval` as primary harness** — both are credible; Inspect AI has wider production adoption (METR), `nat eval` has direct Cosmos-team alignment. *Recommendation*: Inspect AI as primary; emit Inspect logs in a `nat eval`-compatible artifact directory layout so a Cosmos reviewer can also `nat eval --reuse-artifacts` them. *Confirm before P1 D2.* +3. **Capability × approval-policy ordering** — P1 D1 ticket: confirm capability check fires *before* approval policy (cheaper to deny early; preserves policy budget for genuinely-allowed but-expensive calls). *Verify with an ordering test in `tests/optimization/test_router_policy_integration.py`. (Carried — still open.)* +4. **OAuth 2.1 AS choice** (P4b D1 spike) — WorkOS AuthKit vs Auth0 vs self-hosted Hydra. *Recommendation* (research-subagent's preference, **not yet hands-on verified**): WorkOS for cleanest MCP-native docs as of Q2 2026. *D1 action*: 1-day spike comparing all three; pick by hands-on integration friction, not by doc browsing. Don't lock in before the spike. +5. **Sandbox GPU tier** (P6) — Daytona vs NVIDIA OpenShell as default. Daytona is faster to integrate; OpenShell is the Cosmos-pitch-aligned choice. *Recommendation*: ship both behind `SandboxRunner`; default = Daytona for OSS path, OpenShell for "NVIDIA-aligned profile" config. *Confirm before P6.* +6. **DSPy 3.x version pin** (P8) — `dspy.GEPA` is verified as a public API (per [DSPy docs](https://dspy.ai/api/optimizers/GEPA/overview/), production-ready, integrated under Optimizers section). API surface still evolving. *Mitigation*: pin minor version in `requirements.txt`, add a smoke test that runs one GEPA pass on a fixture. *Confirm before P8.* +7. **EU AI Act Art. 12 hash-chain construction** (P4b) — Merkle vs linear hash chain; signing key custody (HSM vs KMS vs software). *Decision*: linear hash chain (simpler; sufficient for Art. 12 tamper-evidence). *v1 cut*: software-held Ed25519 key in P4b; KMS migration + tamper-mutation test deferred to P10. *Known v1 gap*: software-key signing is acceptable for research-platform threat model but not for shipping to enterprise customers; document explicitly in P4b README and call out in P10 promotion. +8. **P9 vertical choice** — AV scenario gen (Cosmos-aligned, GPU-heavy) vs code-pass@1 (cheap to demo, less Cosmos-credible). *Recommendation*: AV scenario gen, with a code-pass@1 fallback in v0 of the pipeline so we can demo the orchestration even before GPU access. *Confirm before P5 entry.* diff --git a/README.md b/README.md index 8a6c1ccd..702182dd 100644 --- a/README.md +++ b/README.md @@ -56,6 +56,41 @@ ml-intern --max-iterations 100 "your prompt" ml-intern --no-stream "your prompt" ``` +## Sharing Traces + +Every session is auto-uploaded to your **own private Hugging Face dataset** +in [Claude Code JSONL format](https://huggingface.co/changelog/agent-trace-viewer), +which the HF Agent Trace Viewer auto-detects so you can browse turns, tool +calls, and model responses directly on the Hub. + +By default the dataset is named `{your-hf-username}/ml-intern-sessions` and is +**created private**. You can flip it to public from inside the CLI: + +```bash +/share-traces # show current visibility + dataset URL +/share-traces public # publish (anyone can view) +/share-traces private # lock it back down +``` + +You can also flip visibility from the dataset page on huggingface.co — the +agent honours whatever you set there for subsequent uploads. + +To opt out entirely, set in your CLI config (e.g. `configs/cli_agent_config.json` +or `~/.config/ml-intern/cli_agent_config.json`): + +```json +{ "share_traces": false } +``` + +To override the destination repo, set: + +```json +{ "personal_trace_repo_template": "{hf_user}/my-custom-traces" } +``` + +The shared `smolagents/ml-intern-sessions` dataset is unrelated and only +receives anonymized telemetry rows used by the backend KPI scheduler. + ## Supported Gateways ML Intern currently supports one-way notification gateways from CLI sessions. diff --git a/RESEARCH_AHE_ANALYSIS.md b/RESEARCH_AHE_ANALYSIS.md new file mode 100644 index 00000000..56052651 --- /dev/null +++ b/RESEARCH_AHE_ANALYSIS.md @@ -0,0 +1,499 @@ +# AHE Paper Analysis — Senior Engineering Review + +> **Paper**: *Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses* +> Lin, Liu, Pan, Lin, Dou, Huang, Yan, Han, Gui — Fudan / Peking / Shanghai Qiji Zhifeng +> arXiv:2604.25850v2, April 29, 2026 +> **Reviewed**: 2026-04-30 +> **Reviewer perspective**: Senior agentic-harness engineer, frontier-lab lens +> **Verification status**: All numeric claims cross-checked against the paper extract. Project mappings cross-checked against `CLAUDE.md`. Items I have *not* personally verified in source code are explicitly flagged with `[unverified]`. + +--- + +## 0. TL;DR + +AHE is the most directly relevant paper to our harness architecture published to date. Its three-pillar framing (component / experience / decision observability) **validates the architectural instinct already baked into our `CLAUDE.md`** (zero-diff fork, owned-paths split, minimal-seed phase 0). It also exposes three things we *do not yet have* that the paper provides empirical evidence are load-bearing: structured trajectory analysis, falsifiable change manifests, and layered evidence distillation. + +**Verdict (revised 2026-05-01)**: implement the full AHE stack for our ML optimization domain — 3 agents, 7-slot substrate, manifest contracts, Algorithm 1 loop, cross-model evaluation — sequenced by dependency across 9 stages (A-I, see Section 6). Stage A (slot decomposition) and Stage D (manual manifest discipline) start now. Stages E-I (meta-stack: Debugger, Verifier, Evolve Agent, orchestration, transfer testing) require Stage C (≥50-task scored suite) as a hard precondition. + +**Confidence in this assessment**: Medium-high on architecture, lower on schedule. The paper's empirical case is real but its hard-task tier loss, regression-attribution weakness (precision 11.8%), and sub-additive component interactions mean every stage needs measured acceptance criteria — not blind replication. Stage C (scored task suite) and Stage H compute budget are the rate-limiters, not the code itself. + +--- + +## 1. The Paper in One Page + +### 1.1 Thesis + +The bottleneck for evolving coding-agent harnesses is **observability, not model capability**. Given a decoupled action space, structured trajectory evidence, and falsifiable change manifests, an evolve-agent self-improves a harness without collapsing into trial-and-error. + +### 1.2 Three Observability Pillars + +| Pillar | Mechanism | What it buys | +|---|---|---| +| **Component** (NexAU) | Harness exposed as 7 orthogonal *files*: system prompt, tool description, tool implementation, middleware, skill, sub-agent config, long-term memory | Each failure pattern maps to one component class — clean attribution, no entanglement | +| **Experience** (Agent Debugger) | Million-token traces converted into per-task analysis reports + benchmark overview, navigable as a file environment | Evolve-agent reasons over structured root causes, not raw logs | +| **Decision** (Change Manifest) | Every edit ships with predicted fixes and predicted regressions | Next round verifies the contract → falsifiable evolution | + +### 1.3 The Loop (Algorithm 1) + +``` +Rollout → Clean → Attribute / Rollback → Distill → Evolve → Commit +``` + +Governed by two hard constraints: + +- **Controllability**: only the harness workspace is writable; infrastructure is read-only. +- **Falsifiability**: manifest predictions are checked against actual task-level deltas next round. + +### 1.4 Headline Numbers (Terminal-Bench 2, 89 tasks) + +| Method | All | Easy (4) | Medium (55) | Hard (30) | +|---|---|---|---|---| +| NexAU₀ (seed) | 69.7% | 87.5% | 78.2% | 51.7% | +| ACE (prompt-only self-evolve) | 68.9% | 91.7% | 78.2% | 48.9% | +| TF-GRPO (RL-style) | 72.3% | 100.0% | 79.4% | **55.6%** | +| **AHE** | **77.0%** | 100.0% | **88.2%** | 53.3% | + +Also beats human-designed harnesses: Codex 71.9%, terminus-2 62.9%. + +### 1.5 Operating Mode — What the Paper Validates vs. What We Choose + +> *Verified by direct query against the paper, 2026-05-01. Earlier drafts of this doc and conversational answers asserted things the paper does not actually claim. This section corrects the record.* + +The paper describes and empirically validates exactly **one** operating mode: + +- **Offline batch evolution against fixed benchmarks.** Algorithm 1 runs sequentially through `N=10` iterations. Each iteration runs all 89 Terminal-Bench 2 tasks, analyzes traces, proposes edits, commits. **~32 hours total wall time** for the full 10-iteration run on one benchmark. All three role agents (Code Agent, Agent Debugger, Evolve Agent) share one base model (GPT-5.4 high reasoning), differing by prompt + tools + role. + +The paper does **NOT** describe or validate any of the following — these are gaps to be aware of, not facts to extract: + +- A **deployment / production scenario** for the evolved harness. The paper presents the evolved harness as a research artifact that *transfers* across benchmarks and models. It does not address how end users would receive, run, or interact with it. +- **Online evolution during user sessions.** No experiment evolves the harness while a user (rather than a benchmark) is the trace source. +- **Continuous / test-time evolution at inference.** The paper *motivates* "test-time learning" as a direction in its introduction, but does not implement or evaluate it. Treat as future work. +- **Adaptive termination.** `N=10` is fixed. No plateau-detection. No early-stop rule. +- **Cost amortization across user sessions.** The 32-hour figure is treated as a one-time research expense; the paper offers no model for spreading it across deployment usage. + +#### Operating-mode options for our project + +| Option | Description | Paper status | +|---|---|---| +| **A — Offline-only (paper-faithful)** | AHE runs as a separate batch process, by us, on Phase 8 benchmark. User `ml-intern` sessions invoke only the Code Agent with whatever harness is currently committed in the repo. | **Empirically validated by paper.** | +| **B — Offline evolution + user-trace logging** | Mode A, plus user session traces accumulate as additional evolution data for future batch runs. | Suggested by paper's "test-time learning" framing but **not implemented, not evaluated**. | +| **C — Online evolution during user sessions** | All 3 agents run during user sessions; harness evolves continuously per-user. | **Not described in paper.** Speculation. | + +#### Our default: Option A + +It is the only mode with empirical evidence. Phases 10 and 11 in `PLAN.md` are written for this mode. Moving to B or C is a deliberate scope expansion past the paper's evidence base, and should require its own validation work — not blind adoption. + +#### Practical implication for users + +When a user runs `uv run ml-intern`: + +- **Only the Code Agent runs.** Same as today. +- **Debugger and Evolve Agent do not run.** They are offline tools used by the engineering team to update the harness between releases. +- **No manifests are written, no traces are analyzed, no rollbacks happen during user sessions.** +- The user simply benefits from a harness that has been evolved against the Phase 8 benchmark in a prior offline run committed to the repo. + +This matches the paper's empirical setup. Deviating from it (Option B or C) would put us off the paper's evidence and require us to validate the new mode independently. + +--- + +## 2. Detailed Mechanisms — What Actually Does the Work + +### 2.1 NexAU: Decoupled Harness Substrate + +Seven orthogonal component types as *explicit files*: + +1. System prompt +2. Tool description (the schema/docstring the model sees) +3. Tool implementation (the executable code) +4. Middleware (cross-cutting concerns: retries, parsing, side-effects) +5. Skill (reusable procedural knowledge) +6. Sub-agent configuration (delegation patterns) +7. Long-term memory + +**Why the split matters**: Tool-description and tool-implementation are *two separate slots*. A failure mode where "the model misuses the tool because the docstring is wrong" must be debuggable without touching the tool's actual code path — and vice versa. This is a genuine engineering insight: most harnesses I have seen conflate these. + +**Minimal seed (H₀)**: bash-only, no middleware, no sub-agents. Deliberately spartan. Forces every added component to *justify itself through measured rollouts* and prevents hidden-prior leakage where a fat baseline silently drives gains. This is the same instinct as `git bisect` against a clean ancestor. + +### 2.2 Agent Debugger: Layered Trajectory Evidence + +The non-obvious move: traces are treated as a **navigable file environment**, not a flat log. The debugger agent has tools to drill in (per-task) and zoom out (benchmark overview), and it produces *structured reports* — quoted directly: "structured root causes rather than raw logs." This is what allows the evolve-agent to consume analysis without context explosion. + +This is the part of the paper most under-appreciated by casual readers. A million-token trace cannot be fed to an LLM directly; the question is how you compress it without destroying causal information. Their answer: progressive disclosure with task-level and benchmark-level views. + +### 2.3 Evolve Agent: Change Manifests as Falsifiable Contracts + +Each edit is committed with a manifest containing: + +- The failure evidence that motivated it +- The diagnosed root cause +- The targeted fix +- The expected impact (which tasks should flip, which might regress) + +Next round verifies. This is the discipline that separates AHE from "let an LLM rewrite the prompt and pray." The manifest *is the contract*; if predictions don't hold, you have evidence the diagnosis was wrong, not just the fix. + +--- + +## 3. Honest Reading of the Results + +The headline number is real, but the supporting evidence has cracks worth naming. + +### 3.1 Strengths (Real Signal) + +- **+7.3pp over a competitive seed** with a clear, attributable methodology. Not a noise-level win. +- **Beats prompt-only self-evolution (ACE) and an RL-style baseline (TF-GRPO)** on aggregate. The methodology, not the budget, is doing the work. +- **Cross-model transfer is the most impressive result**: +10.1pp on deepseek-v4-flash, +6.3pp on qwen-3.6-plus, +5.1pp on gemini-3.1-flash-lite. The author's read — *"less capable models lean more heavily on coordination patterns AHE has fixed"* — is plausible and is genuine evidence against benchmark overfitting. +- **Cross-benchmark transfer to SWE-bench-verified**: 75.6% with 32% fewer tokens than ACE. The harness encodes general engineering experience, not Terminal-Bench-specific tricks. + +### 3.2 Weaknesses (Caveats Not to Smooth Over) + +- **Hard tier: AHE 53.3% vs TF-GRPO 55.6%**. AHE *loses* on the hardest 30 tasks. Gains concentrate in medium difficulty. The hard-task ceiling is unclear and not addressed convincingly. +- **Component ablation is sub-additive and partly negative**: + + | Component (alone) | Δ vs seed | + |---|---| + | Memory | +5.6 pp | + | Tools | +3.3 pp | + | Middleware | +2.2 pp | + | System prompt | **−2.3 pp** | + + System-prompt-only evolution makes things *worse*. The full stack is sub-additive — the paper acknowledges *"components interact non-additively, capping aggregate gain"*. The implication: evolving everything simultaneously is not a free lunch; interference is real. + +- **Attribution is barely above random for regressions**: + - Fix precision 33.7% / recall 51.4% (≈5× random) + - Regression precision 11.8% / recall 11.1% (≈2× random) + + The paper calls this "regression blindness" and acknowledges it limits convergence predictability. This means the "falsifiable contract" works for *fixes* but is shaky for *regressions* — which is the more dangerous failure mode. + +- **Step budgets fitted to GPT-5.4-high**. Cross-model results are sensitive to timeout conventions. Not necessarily wrong, but a confound to track. + +- **Engineering overhead is not quantified**. Trajectory analysis + workspace management has compute cost that the paper does not put a number on. + +### 3.3 Net Reading + +The methodology is real and the architectural insights generalize. The empirical case is strong on average and on transfer, weak on hard-task tier and on regression attribution. The paper itself frames AHE as *"a controlled research prototype rather than a fully mature autonomous system"* — that framing is honest and should anchor our adoption decisions. + +--- + +## 4. Architectural Principles (What to Extract) + +Independent of whether we ever implement the AHE *loop*, four principles from the paper stand on their own as architecture guidance. + +### 4.1 Decouple the Substrate Aggressively + +If a failure pattern can be caused by either of two components, those components must live in separate files. The 7-slot decomposition is an existence proof of how granular this can usefully go. + +### 4.2 Start From a Minimal Seed + +A fat baseline hides where gains come from. Anything you add must *earn* its place against a measured floor. This is also the cheapest insurance against premature abstraction (which our project's `RULES.md` already enforces). + +### 4.3 Treat Edits as Falsifiable Contracts + +An edit without a prediction is not a hypothesis, it is a hope. Manifests with predicted-fix and predicted-regression fields convert harness evolution from vibes-based to evidence-based — even before any automation. + +### 4.4 Observability Beats Cleverness + +The paper's central insight: *"once the evolution agent receives structured context over a clear action space, it reliably converges on better designs."* The implication for human engineers is the same. Invest in trace structure before investing in cleverer agents. + +--- + +## 5. Mapping to Our Project + +Our `CLAUDE.md` already encodes a *partial* version of these principles. The mapping is partial, not exact — I want to be precise about what we have and what we do not. + +### 5.1 What We Already Have (Verified Against `CLAUDE.md`) + +| Our invariant | NexAU equivalent | Status | +|---|---|---| +| Zero-diff: never edit upstream files | Controllability constraint (infrastructure read-only) | Present | +| Owned-paths table: `agent/optimization/`, `agent/tools/profiling/`, `prompts/`, `configs/` | Decoupled component substrate | Partial — split exists, but not by NexAU's 7 categories | +| Phase 0 = baseline verification before anything else | Minimal seed H₀ | Present | +| `pytest tests/unit/ -q` exit-0 gate | Rollout + verification | Present, but coarse-grained (binary, not per-task) | + +### 5.2 What We Do Not Yet Have + +| Missing capability | Why it matters | Cost to add | +|---|---|---| +| **Trajectory observability layer** (Agent Debugger analogue) | When optimization agent fails on a profiling/quantization task, the failure is raw logs, not structured root causes | Medium — needs a benchmark first | +| **Change manifests** | Edits to prompts/tools have no predicted-fix / predicted-regression fields. We cannot tell, after the fact, whether a change behaved as designed. | Low — schema + discipline | +| **Layered distillation** | As Phase 1+ knowledge tools grow, context-explosion will hit. Per-task → per-domain distillation is the answer | Medium — defer until pain is real | +| **Per-component ownership in the 7-slot sense** | Our current owned-paths split is by *concern* (profiling, training opt, inference opt), not by NexAU's *component type* (tool description vs tool impl vs middleware) | Low — refactor intent, not files | + +### 5.3 Mapping Caveats + +- I have **not personally verified** the internals of `agent/core/agent_loop.py`, `agent/core/session.py`, or `agent/context_manager/manager.py` `[unverified]`. The mapping above is based on `CLAUDE.md` declarations and standard layering assumptions. Before any refactor, those files should be read end-to-end to confirm where seams actually are. +- Our project is *building an ML optimization agent* — the agent optimizes ML workloads. AHE is about *optimizing the harness itself*. These are adjacent but not identical objectives. Some AHE machinery is overkill for our current phase. + +--- + +## 6. Implementation Roadmap — Build Everything From AHE, Sequenced by Dependency + +**Decision (recorded 2026-05-01)**: implement the full AHE stack (3 agents + 7-slot substrate + manifest contract + Algorithm 1 loop + cross-model evaluation) for the ML optimization domain. Sequence below respects technical dependencies — order is not preference, it is what the paper itself requires for the loop to converge. + +### 6.0 Dependency Graph (read this first) + +``` +[A] NexAU 7-slot decomposition ◄── Phase 0/1 + │ + ▼ +[B] Code Agent = ML optimization agent ◄── Phases 1-5 (existing PLAN.md) + │ + ├─► [D] Change-manifest discipline (manual) ◄── Phase 1+, in parallel + │ + ▼ +[C] Scored ML-opt task suite (≥50 tasks, deterministic) ◄── Phase 8 (rate-limiter) + │ + ├─► [E] Agent Debugger (structured failure reports) ◄── Phase 9 + │ + ▼ +[F] Manifest auto-verification (falsifiability loop) ◄── Phase 9 + │ + ▼ +[G] Evolve Agent (proposes edits, writes manifests) ◄── Phase 10 + │ + ▼ +[H] Full orchestration (Algorithm 1) ◄── Phase 10 + │ + ▼ +[I] Cross-model transfer evaluation ◄── Phase 11 +``` + +Skipping the order does not save time — it amplifies the paper's known failure modes (regression blindness, sub-additive interactions). + +### 6.A — NexAU 7-slot decomposition + +| Field | Value | +|---|---| +| **Phase** | 0 / 1 | +| **Path** | `agent/optimization/config_ext.py`, `agent/optimization/` directory layout | +| **Scope** | Declare all 7 slots in `OptimizationConfig` (system prompt, tool description, tool implementation, middleware, skill, sub-agent config, long-term memory). Empty slots reserved by name. | +| **Acceptance** | Every future PR can answer "which slot does this belong to." Lint check that every new file lives under one of the 7 slot directories. | +| **Dependencies** | None. Start here. | +| **Risk** | Low. Cost: hours. | + +```python +# agent/optimization/config_ext.py +class OptimizationConfig(Config): + # NexAU-aligned slots (some empty initially) + system_prompt_path: str = "agent/prompts/system_prompt_optimization_v1.yaml" + tool_descriptions_path: str | None = None + tool_implementations_dir: str = "agent/tools/" + middleware_dir: str | None = None + skills_dir: str | None = None + subagent_config_dir: str | None = None + long_term_memory_path: str | None = None +``` + +### 6.B — Code Agent (the ML optimization agent itself) + +| Field | Value | +|---|---| +| **Phase** | 1-5 (entire existing `PLAN.md`) | +| **Path** | `agent/optimization/`, `agent/tools/profiling/`, `training_opt/`, `inference_opt/`, `multimodal_opt/`, `vla_opt/` | +| **Scope** | Everything in current `PLAN.md` Phases 1-5: system prompt, knowledge tools, profiling suite, training/inference/multimodal/VLA optimizations. | +| **Acceptance** | Per existing `PLAN.md` Definition of Done. | +| **Dependencies** | A. | +| **Why this is Stage B not Stage Z** | AHE without a Code Agent has nothing to evolve. The Code Agent IS the harness that gets optimized in stages G-H. The existing roadmap is not a preliminary — it is the substrate. | + +### 6.C — Scored ML-optimization task suite + +| Field | Value | +|---|---| +| **Phase** | 5-6 (extension of current Phase 6) | +| **Path** | `tests/optimization/benchmarks/`, `tests/optimization/scoring/` | +| **Scope** | ≥50 tasks with deterministic scoring. Each task: input (model + script + hardware target), success criterion (e.g. "fits in 40GB", "MMLU drop <1%"), reproducible scoring. | +| **Examples** | "Fit Llama-3-8B QLoRA on A100 40GB", "Reduce Mixtral inference latency 30% on 4×H100", "Quantize Gemma-7B <8GB with <1% MMLU drop". | +| **Acceptance** | Same task scored twice = same result. 3 baseline runs of seed harness yield <2% pass-rate variance. | +| **Dependencies** | B. | +| **Risk** | **HIGHEST risk in the roadmap.** Hardware availability, determinism, and scoring oracles for ML tasks are genuinely hard. This stage is the rate-limiter for the entire meta-stack. | + +### 6.D — Change-manifest discipline (manual, parallel with B) + +| Field | Value | +|---|---| +| **Phase** | 1+, in parallel with B | +| **Path** | `manifests/-.yaml` | +| **Scope** | Every non-trivial harness edit ships with a manifest. Schema: `edit`, `slot`, `evidence`, `root_cause`, `expected_fix[]`, `expected_regression_risk`, `verification_round`. | +| **Acceptance** | 100% of harness PRs after Stage A include a manifest. Pre-commit hook enforces schema. | +| **Dependencies** | A. | +| **Risk** | Near zero cost, near zero risk. Highest ROI item in the entire roadmap before Stage F is online. | +| **Why parallel** | This is discipline, not automation. It also builds the dataset of human-authored manifests that informs Stage G's evolve-agent prompt. | + +```yaml +# manifests/2026-05-15-add-hardware-specs-tool.yaml +edit: "Add hardware-spec retrieval tool" +slot: tool_implementation +evidence: "Optimization tasks failing because agent guesses wrong VRAM bounds" +root_cause: "No tool exposes ground-truth hardware specs" +expected_fix: ["task-id-014", "task-id-022", "task-id-031"] +expected_regression_risk: "May increase token usage on simple tasks (~5%)" +verification_round: 3 +``` + +### 6.E — Agent Debugger + +| Field | Value | +|---|---| +| **Phase** | 9 (new — appended to `PLAN.md`) | +| **Path** | `agent/optimization/meta/debugger/` | +| **Scope** | Separate agent. Different prompt + tools, **same base model as Code Agent** (per AHE pattern). Reads `runs/` trace files (read-only). Outputs structured per-task reports + benchmark-level overview using progressive disclosure. | +| **Acceptance** | On a synthetic 20-task failure suite, Debugger correctly classifies root cause for ≥70% of cases. Output token count ≤5% of input trace token count. | +| **Dependencies** | C (needs scored task runs to analyze). | +| **Risk** | Medium. The compression target is the hard part — paper achieves it via "navigable file environment" framing, which we will replicate. | + +### 6.F — Manifest auto-verification + +| Field | Value | +|---|---| +| **Phase** | 9 | +| **Path** | `agent/optimization/meta/verifier.py` | +| **Scope** | Code module (NOT an agent). Compares manifest predictions to actual task-level deltas. Computes fix-precision, fix-recall, regression-precision, regression-recall. | +| **Acceptance** | Replicate paper's metrics on our own data. Baseline computed on ≥10 manual manifests from Stage D. Watch for "regression blindness" — paper's regression precision was 11.8%; our number tells us how reliable Stage G's contracts can be. | +| **Dependencies** | C, D. | +| **Risk** | Low (pure code). | + +### 6.G — Evolve Agent + +| Field | Value | +|---|---| +| **Phase** | 10 (new — appended to `PLAN.md`) | +| **Path** | `agent/optimization/meta/evolver/` | +| **Scope** | Separate agent. Reads Debugger reports, proposes harness edits, writes manifests. Hard constraints: **controllability** (writes only to `workspace/`, infrastructure read-only) and **falsifiability** (every edit must include a manifest with predictions). | +| **Acceptance** | On a held-out task subset, Evolve Agent's proposed edits improve pass-rate over 3 rounds. Rollback rate <30%. System prompt slot edits gated through held-out validation (per paper's −2.3pp warning). | +| **Dependencies** | E, F. | +| **Risk** | High. Sub-additive component interactions are real — stage component additions, measure between each. | + +### 6.H — Full orchestration (Algorithm 1) + +| Field | Value | +|---|---| +| **Phase** | 10 | +| **Path** | `agent/optimization/meta/loop.py` | +| **Scope** | Pure Python orchestrator. Implements `Rollout → Clean → Attribute/Rollback → Distill → Evolve → Commit`. Calls Code Agent / Debugger / Evolve Agent at the right phases. | +| **Acceptance** | Full loop runs end-to-end on the 50-task suite. Improvement over seed harness ≥+5pp over 5 rounds (calibrated against paper's +7.3pp; lower bar reflects ML-opt domain narrowness). | +| **Dependencies** | B, C, E, F, G. | +| **Risk** | Compute budget. **Each round ≈ N_tasks × (rollout cost + debugger cost + evolve cost).** Paper does not quantify; we should expect ≥10K LLM calls per round on a 50-task suite. Budget compute before launching this stage. | + +### 6.I — Cross-model transfer evaluation + +| Field | Value | +|---|---| +| **Phase** | 11 (new — appended to `PLAN.md`) | +| **Path** | `tests/optimization/transfer/` | +| **Scope** | Take auto-evolved harness from Stage H, run with alternate base models. Measure pass-rate transfer. | +| **Acceptance** | ≥3 alternate models tested. Transfer gain ≥+3pp (paper showed +5-10pp, we lower the bar because ML-opt domain is narrower than terminal-bench). | +| **Dependencies** | H. | +| **Risk** | Low — read-only evaluation. | + +### 6.X — Phase mapping summary + +| AHE Stage | `PLAN.md` Phase | +|---|---| +| A — 7-slot decomposition | Phase 0/1 (existing) | +| B — Code Agent | Phases 1-5 (existing) | +| D — Manifest discipline | Phase 1+ (cross-cutting, parallel) | +| C — Scored task suite | **Phase 8** (appended) | +| E — Agent Debugger | **Phase 9** (appended) | +| F — Manifest Verifier | **Phase 9** (appended) | +| G — Evolve Agent | **Phase 10** (appended) | +| H — Algorithm 1 orchestration | **Phase 10** (appended) | +| I — Cross-model transfer | **Phase 11** (appended) | + +Phases 8-11 are now in `PLAN.md` with operational step-by-step detail (Steps 8.1-8.4, 9.1-9.4, 10.1-10.5, 11.1-11.2). + +**Note on numbering:** earlier drafts of this doc proposed Phases 6.5/7/8 for the AHE meta-stack. That conflicted with existing Phase 7 (CUDA Kernel Generation), so the meta-stack was renumbered to 8/9/10/11 to avoid collision. Existing Phases 0-7 unchanged. + +### 6.Y — What we are explicitly committing to + +Everything from the paper, in this order: +- 3 LLM-driven agents (Code Agent, Agent Debugger, Evolve Agent) — all sharing one base model (per paper) +- 7-slot NexAU substrate +- Change manifests with `expected_fix` and `expected_regression_risk` +- Layered trajectory distillation (per-task → benchmark-level) +- Algorithm 1 orchestration (`Rollout → Clean → Attribute/Rollback → Distill → Evolve → Commit`) +- Controllability + falsifiability invariants +- Cross-model transfer testing + +What does NOT change about the project: +- Zero-diff invariant (still applies; meta-stack lives entirely in owned paths) +- Existing `PLAN.md` Phases 0-5 unchanged +- ML optimization remains the *product*; AHE is the *meta-layer* + +### 6.Z — What can break this plan + +1. **Stage C harder than expected.** Building a stable 50-task scored suite for ML optimization may take longer than building the meta-stack itself. Hardware availability, determinism, and scoring oracles are real engineering. Watch this stage closely. +2. **Compute budget for Stage H.** Paper does not quantify AHE's compute overhead. Ballpark: 10K+ LLM calls per round × 5+ rounds = 50K+ calls per benchmark cycle. Budget before launching. +3. **Regression attribution stays unreliable.** If Stage F's measured regression precision/recall mirror the paper's (33%/11%), Stage G's `expected_regression_risk` becomes a flag for human review, not an autonomous filter. +4. **Sub-additive interactions.** Per Table 3 of paper, full stack < sum of individual gains. Our 50-task suite may not have the statistical power to detect interaction effects cleanly. Stage component additions one at a time. + +--- + +## 7. Risks, Watch-Outs, Anti-Patterns + +### 7.1 Don't Let an LLM Rewrite the System Prompt Without Held-Out Validation + +The paper's own ablation: system-prompt-only evolution scored **−2.3 pp**. The system prompt is the highest-leverage and highest-risk slot. Any automated edits there must be gated on held-out task performance, not just inner-loop scores. + +### 7.2 Treat Regression Attribution as Unreliable + +Their numbers: regression precision 11.8%, recall 11.1%. If we adopt change manifests, the `expected_regression_risk` field should be treated as a *flag for human review*, not a trustworthy filter. Until our attribution layer beats theirs (we have no reason to expect it will, initially), every committed harness change deserves a sanity-check rollout on a held-out subset. + +### 7.3 Don't Optimize the Harness Before It Has a Workload + +Our project's *core* objective is the ML optimization agent itself — the harness is a means. AHE-style evolution requires a stable, scored task suite to drive learning. The roadmap (Section 6) sequences Stage C (scored task suite) before Stages E-H (the meta-stack) for exactly this reason. Burning effort on Stages E-H before C is online means optimizing against a noisy metric, which is the failure mode the paper itself warns about (sub-additive interactions, regression blindness). + +### 7.4 Component Interaction Is Real + +Sub-additive aggregate gain is the empirical reality in the paper. Translation: do not assume that adding a memory tool *and* a middleware layer *and* a sub-agent config will give you the sum of their individual gains. They will interact, sometimes negatively. Stage additions, measure between each. + +### 7.5 Don't Conflate Tool-Description and Tool-Implementation + +This is one of the cleanest insights in the paper. In our `agent/tools/`, treat the docstring/schema (what the model sees) as a separate evolvable artifact from the executable code. A failure caused by "model misuses tool because docstring is misleading" should be fixable without touching the implementation, and vice versa. + +--- + +## 8. Decision Framework + +When deciding whether to apply an AHE technique to our project, run this check: + +1. **Does our project currently have a stable scored task suite (Stage C complete)?** If no → only Stages A, B, D are unlocked. Stages E-I require C as a precondition. +2. **Is the proposed change auditable post-hoc?** If no → add a change manifest first. No edit without a hypothesis. +3. **Are we touching the system prompt slot?** If yes → mandatory held-out validation, manual review, no full automation (per paper's −2.3pp ablation). +4. **Is the gain we're targeting on the *hard* task tier?** If yes → AHE has weak evidence here (paper lost to TF-GRPO on hard tier). Be skeptical of expected ROI. +5. **Are we adding multiple components at once?** If yes → stage them, measure between. Non-additivity is real (Table 3). +6. **What's the compute budget for the round we're about to run?** If unknown → estimate before launching. Stage H rounds are not cheap. + +--- + +## 9. Open Questions (For Further Investigation) + +- **What is the actual compute overhead of the AHE loop?** Paper does not quantify. If it is 10× a single rollout, the cost-benefit calculus changes. +- **Does change-manifest discipline alone (without automation) capture most of the benefit?** I suspect it captures a large fraction — perhaps 50%+ of the architectural value, at near-zero cost. The paper does not isolate this. +- **How does AHE behave on tasks where the failure mode is *missing knowledge*, not *missing structure*?** ML optimization tasks lean heavily on domain knowledge. The paper's tasks are coding-flavored. Transfer is plausible but not demonstrated. +- **What is the regression-attribution failure mode in practice?** Paper gives precision/recall but does not characterize the *type* of regression that gets missed. Without that, we cannot design compensating controls. +- **Does the 7-slot decomposition over-fit to terminal-bench-style tasks?** Some slots (e.g. middleware, sub-agent config) may be more or less load-bearing for ML optimization workflows. We will not know until we have data. + +--- + +## 10. References + +- Lin et al., *Agentic Harness Engineering*, arXiv:2604.25850v2, April 2026. +- Project context: `CLAUDE.md`, `PLAN.md`, `SYSTEM.md` (this repo). +- Related baselines mentioned in the paper: ACE (prompt-only self-evolution), TF-GRPO (RL-style), Codex, terminus-2. + +--- + +## 11. Final Engineering Judgment + +**Decision (2026-05-01): build everything from AHE, sequenced by dependency.** Section 6 lays out the 9-stage roadmap (A-I). The paper's stack only converges when each component has its preconditions met: Code Agent before Debugger, scored task suite before Evolve Agent, manual manifest discipline before automation. Skipping order amplifies the paper's known failure modes. + +**Stages A and D are free wins — start now.** 7-slot substrate declaration + manual manifest discipline cost near zero and produce immediate clarity benefits. + +**Stage B is the entire current `PLAN.md` (Phases 1-5).** AHE does not displace the existing roadmap — it *frames* it. The Code Agent we are building IS the harness AHE will eventually evolve. + +**Stage C (scored 50-task suite) is the rate-limiter.** Hardware availability, determinism, and scoring oracles for ML optimization are the hardest part of the entire program — harder than building the meta-stack on top. + +**Stages E-I (meta-stack) are real engineering, not a sprint.** Compute budget and stable scoring are the binding constraints, not code complexity. Append Phases 6.5, 7, 8 to `PLAN.md` to capture them in operational detail. + +**Treat the paper's empirical results with calibrated trust.** The headline win is real (+7.3pp Terminal-Bench 2); the hard-task tier loss, sub-additive ablation, and regression-blindness numbers are real warnings. Acceptance criteria at each stage are calibrated against these, not against best-case headline numbers. The authors themselves call AHE *"a controlled research prototype rather than a fully mature autonomous system"* — we are committing to make it production-grade for the ML optimization domain, eyes open about what that costs. + +The bottom-line claim from the paper that I most agree with as a senior engineer: *the bottleneck is observability, not capability*. The roadmap above commits to building exactly that observability stack — across 9 stages, in dependency order. diff --git a/SYSTEM.md b/SYSTEM.md new file mode 100644 index 00000000..fe2a5455 --- /dev/null +++ b/SYSTEM.md @@ -0,0 +1,1167 @@ +# ML Intern — Phân Tích Hệ Thống Từ First Principles + +> Tài liệu này giải thích toàn bộ hệ thống từ góc nhìn kỹ thuật sâu. +> Mục tiêu: không chỉ hiểu **cái gì** mà phải hiểu **tại sao** mỗi quyết định thiết kế lại được đưa ra. + +--- + +## Mục Lục + +1. [ML Intern Là Gì?](#1-ml-intern-là-gì) +2. [Bản Đồ Kiến Trúc Toàn Hệ Thống](#2-bản-đồ-kiến-trúc-toàn-hệ-thống) +3. [Luồng Dữ Liệu — Một Request Đi Qua Hệ Thống](#3-luồng-dữ-liệu--một-request-đi-qua-hệ-thống) +4. [Agent Core — Trái Tim Của Hệ Thống](#4-agent-core--trái-tim-của-hệ-thống) +5. [Session — Container Trạng Thái](#5-session--container-trạng-thái) +6. [ContextManager — Bộ Nhớ Của Agent](#6-contextmanager--bộ-nhớ-của-agent) +7. [ToolRouter — Tay Của Agent](#7-toolrouter--tay-của-agent) +8. [DoomLoop Detector — Hệ Thống Phòng Vệ](#8-doomloop-detector--hệ-thống-phòng-vệ) +9. [Backend — API Gateway & Session Pool](#9-backend--api-gateway--session-pool) +10. [Frontend — SSE Bridge & React Layer](#10-frontend--sse-bridge--react-layer) +11. [Data Flywheel — Vòng Lặp Thu Thập Dữ Liệu](#11-data-flywheel--vòng-lặp-thu-thập-dữ-liệu) +12. [Security Layer — Redact, Auth, Quotas](#12-security-layer--redact-auth-quotas) +13. [Các Quyết Định Thiết Kế Quan Trọng](#13-các-quyết-định-thiết-kế-quan-trọng) +14. [Thứ Tự Đọc Codebase](#14-thứ-tự-đọc-codebase) + +--- + +## 1. ML Intern Là Gì? + +ML Intern là một **autonomous AI agent** được xây dựng bởi HuggingFace. Nó có thể tự nghiên cứu, viết code, và deploy các ML project bằng cách sử dụng toàn bộ HF ecosystem: Hub, Datasets, Training Jobs, Spaces, Docs, Papers. + +### Hai chế độ triển khai, một agent core + +```text +┌─────────────────────┐ ┌──────────────────────────────────────┐ +│ CLI (Local Tool) │ │ Web App (HuggingFace Space) │ +│ │ │ │ +│ $ ml-intern │ │ https://huggingface.co/spaces/... │ +│ $ ml-intern "..." │ │ React + Vite frontend │ +│ │ │ FastAPI backend │ +│ agent/main.py │ │ Multi-tenant, nhiều users đồng thời │ +└──────────┬──────────┘ └──────────────────┬─────────────────┘ + │ │ + └─────────────────┬───────────────────┘ + │ + CÙNG MỘT AGENT CORE + agent/core/agent_loop.py + agent/core/session.py + agent/core/tools.py +``` + +**Tại sao thiết kế hai surface chia sẻ một core?** + +First principle: **Không lặp lại business logic**. Agent logic (loop LLM, execute tool, manage context) là phần khó nhất và có nhiều edge case nhất. Nếu CLI và Web có hai implementation riêng biệt, bất kỳ bug fix hoặc improvement nào cũng phải thực hiện hai lần. CLI chính là "reference implementation" — nếu nó hoạt động trên CLI, nó sẽ hoạt động trên Web. + +--- + +## 2. Bản Đồ Kiến Trúc Toàn Hệ Thống + +```text +╔══════════════════════════════════════════════════════════════════════════════╗ +║ DEPLOYMENT SURFACES ║ +║ ║ +║ ┌──────────────────────┐ ┌──────────────────────────────────────┐ ║ +║ │ CLI (agent/main.py)│ │ Web App (HF Space) │ ║ +║ │ │ │ │ ║ +║ │ PromptSession │ │ React + Vite + TypeScript │ ║ +║ │ (prompt_toolkit) │ │ useChat (Vercel AI SDK) │ ║ +║ │ │ │ │ SSEChatTransport (custom bridge) │ ║ +║ │ submission_queue │ │ │ POST /api/sessions/{id} │ ║ +║ │ event_queue │ │ │ SSE stream response │ ║ +║ └──────────┬───────────┘ └─────────┼────────────────────────────┘ ║ +║ │ │ ║ +║ │ ┌──────────▼────────────────────────────┐ ║ +║ │ │ FastAPI Backend (backend/) │ ║ +║ │ │ │ ║ +║ │ │ ┌─────────────────────────────────┐ │ ║ +║ │ │ │ SessionManager │ │ ║ +║ │ │ │ ├─ MAX_SESSIONS: 200 │ │ ║ +║ │ │ │ ├─ MAX_PER_USER: 10 │ │ ║ +║ │ │ │ ├─ sessions: dict[id, AgentSess│ │ ║ +║ │ │ │ └─ EventBroadcaster (fan-out) │ │ ║ +║ │ │ └────────────────┬────────────────┘ │ ║ +║ │ │ Auth: HF OAuth │ Quotas: Redis-free│ ║ +║ │ └───────────────────┼────────────────────┘ ║ +║ │ │ ║ +╠═════════════╪════════════════════════════════════════╪═══════════════════════╣ +║ │ AGENT CORE (agent/) │ ║ +║ ▼ ▼ ║ +║ ┌──────────────────────────────────────────────────────────────────────┐ ║ +║ │ submission_loop() [agent_loop.py] │ ║ +║ │ │ ║ +║ │ Đọc Operations từ submission_queue: │ ║ +║ │ USER_INPUT → Handlers.run_agent() │ ║ +║ │ EXEC_APPROVAL → resume sau approval │ ║ +║ │ INTERRUPT → session.cancel() │ ║ +║ │ COMPACT → _compact_and_notify() │ ║ +║ │ UNDO → context_manager.undo_last_turn() │ ║ +║ │ SHUTDOWN → thoát vòng lặp │ ║ +║ └──────────────────────┬───────────────────────────────────────────────┘ ║ +║ │ ║ +║ ┌──────────────────────▼───────────────────────────────────────────────┐ ║ +║ │ Handlers.run_agent() — VÒNG LẶP AGENTIC CHÍNH │ ║ +║ │ │ ║ +║ │ Session │ ║ +║ │ ├─ ContextManager ── message history + auto-compaction │ ║ +║ │ ├─ ToolRouter ──────── built-in tools + MCP servers │ ║ +║ │ ├─ Config ──────────── model, yolo_mode, quotas │ ║ +║ │ └─ logged_events ───── trajectory for SFT data collection │ ║ +║ │ │ ║ +║ │ ┌──────────────────────────────────────────────────────────────┐ │ ║ +║ │ │ AGENTIC LOOP (max 300 iterations per turn) │ │ ║ +║ │ │ │ │ ║ +║ │ │ 1. compact check (nếu > 85% context window) │ │ ║ +║ │ │ 2. doom_loop check (detect A,A,A / [A,B,A,B] patterns) │ │ ║ +║ │ │ 3. with_prompt_caching(messages, tools) │ │ ║ +║ │ │ 4. litellm.acompletion() — streaming hoặc batch │ │ ║ +║ │ │ 5. emit assistant_chunk events → event_queue │ │ ║ +║ │ │ 6. if no tool_calls: emit turn_complete, DONE │ │ ║ +║ │ │ 7. for each tool_call: │ │ ║ +║ │ │ a. _needs_approval()? → emit approval_required │ │ ║ +║ │ │ wait EXEC_APPROVAL operation │ │ ║ +║ │ │ b. tool_router.execute_tool() │ │ ║ +║ │ │ c. context_manager.add_message(result) │ │ ║ +║ │ │ 8. goto 1 │ │ ║ +║ │ └──────────────────────────────────────────────────────────────┘ │ ║ +║ └──────────────────────────────────────────────────────────────────────┘ ║ +║ ║ +╠══════════════════════════════════════════════════════════════════════════════╣ +║ TOOL ECOSYSTEM ║ +║ ║ +║ sandbox_tool ──── remote code exec (HF Space) ║ +║ research_tool ─── multi-step sub-agent với dedicated LLM calls ║ +║ jobs_tool ──────── HF Training Jobs (GPU clusters) ║ +║ docs_tools ──────── HF documentation search + fetch ║ +║ papers_tool ──────── ArXiv papers ║ +║ dataset_tools ─────── HF Hub datasets inspection ║ +║ web_search ──────────── Tavily web search ║ +║ plan_tool ────────────── Structured planning với step tracking ║ +║ notify_tool ─────────────── Slack/gateway out-of-band notifications ║ +║ hf_repo_files ──────────────── HF repo CRUD (read/write/delete) ║ +║ hf_repo_git ─────────────────────── Git operations trên HF repos ║ +║ MCP server (hf-mcp-server) ──── HF Hub native MCP tools ║ +╠══════════════════════════════════════════════════════════════════════════════╣ +║ DATA FLYWHEEL ║ +║ ║ +║ session.logged_events ──► save_trajectory_local() ──► .tmp → atomic rename║ +║ │ ║ +║ └──► subprocess.Popen(session_uploader.py) ── detached, fire-forget ║ +║ │ ║ +║ ▼ ║ +║ smolagents/ml-intern-sessions (HF Dataset) ║ +║ │ ║ +║ └──► scripts/build_sft.py ──► SFT training data ║ +╚══════════════════════════════════════════════════════════════════════════════╝ +``` + +--- + +## 3. Luồng Dữ Liệu — Một Request Đi Qua Hệ Thống + +### 3a. Luồng Web (Browser → Frontend → Backend → Agent → SSE) + +```text +Browser Frontend Backend Agent Core + │ │ │ │ + │ User gõ message │ │ │ + │──────────────────────►│ │ │ + │ │ POST /api/sessions │ │ + │ │ /{id}/submit │ │ + │ │ {text: "..."} │ │ + │ │─────────────────────►│ │ + │ │ │ submit_user_input() │ + │ │ │─────────────────────►│ + │ │ │ │ submission_queue + │ │ │ │ .put(USER_INPUT) + │ │ │ │ + │ │ SSE stream opens │ │ submission_loop + │ │◄─────────────────────│ │ dequeues + │ │ │ │ + │ │ data: {processing} │◄─────────────────────│ event_queue.put() + │ │◄─────────────────────│ EventBroadcaster │ + │ │ │ fan-out to sub │ + │ UI: "thinking..." │ data: {assistant_ │ │ + │◄──────────────────────│ chunk: "Tôi sẽ..."}│◄─────────────────────│ streaming tokens + │ │ │ │ + │ UI: renders text │ data: {tool_call: │ │ execute tool + │◄──────────────────────│ sandbox, code:..} │◄─────────────────────│ + │ │ │ │ + │ UI: shows tool card │ data: {tool_output: │ │ tool returns + │◄──────────────────────│ "output..."} │◄─────────────────────│ + │ │ │ │ + │ UI: final response │ data: {turn_complete}│ │ done + │◄──────────────────────│◄─────────────────────│◄─────────────────────│ + │ │ │ │ + │ │ SSE stream closes │ │ +``` + +### 3b. Luồng Approval (khi agent cần permission) + +```text +Agent event_queue Backend SSE Frontend + │ │ │ │ + │ Tool cần approval │ │ │ + │ (hf_jobs, sandbox, │ │ │ + │ destructive ops) │ │ │ + │────────────────────────► │ │ + │ {approval_required, │ EventBroadcaster │ │ + │ tools: [...]} │────────────────────►│ │ + │ │ │──────────────────────► + │ session.pending_ │ │ onApprovalRequired │ + │ approval = tools │ │ callback fires │ + │ │ │ │ + │ PAUSED — chờ │ │ │ User clicks + │ EXEC_APPROVAL op │ │ │ Approve/Deny + │ │ │◄─────────────────────│ + │ │ │ POST /approve │ + │◄──────────────────────────────────────────── │ │ + │ submission_queue │ │ │ + │ .put(EXEC_APPROVAL) │ │ │ + │ │ │ │ + │ RESUME — execute tool │ │ │ +``` + +--- + +## 4. Agent Core — Trái Tim Của Hệ Thống + +### File: `agent/core/agent_loop.py` + +Đây là file quan trọng nhất của toàn bộ hệ thống. Nó chứa hai thành phần chính: + +#### 4.1 `submission_loop()` — Control Plane + +```text +submission_loop(session, submission_queue) + │ + ▼ + while session.is_running: + │ + ├── dequeue Operation (timeout=1.0s để check is_running) + │ + ├── Op.USER_INPUT ──────► Handlers.run_agent(session, text) + │ + ├── Op.EXEC_APPROVAL ───► resume_after_approval(session, approvals) + │ + ├── Op.COMPACT ─────────► _compact_and_notify(session) + │ + ├── Op.UNDO ────────────► context_manager.undo_last_turn() + │ + emit undo_complete event + │ + ├── Op.SHUTDOWN ────────► session.is_running = False, break + │ + └── Op.INTERRUPT ───────► session.cancel() + (signal dừng giữa chừng) +``` + +**Tại sao dùng queue thay vì gọi hàm trực tiếp?** + +Vấn đề: Agent đang ở giữa một LLM call (đang stream tokens) — lúc này user bấm Ctrl+C (interrupt). Nếu không có queue, ta phải xử lý interrupt bằng exception handling, signal, hoặc thread — tất cả đều phức tạp và dễ leak resource. + +Với queue: INTERRUPT là một Operation được đưa vào queue. `submission_loop` nhận Op này, gọi `session.cancel()` (set asyncio.Event), và vòng lặp agentic kiểm tra `session.is_cancelled` sau mỗi iteration. **Clean cancellation không cần exception magic**. + +Ngoài ra, queue còn cho phép toàn bộ control flow (approve, undo, compact) là **first-class operations** — không phải side-channel hacks. + +#### 4.2 `Handlers.run_agent()` — Data Plane + +```text +run_agent(session, text=None) + │ + ├── Nếu có pending_approval và user gửi message mới: + │ └── _abandon_pending_approval() — inject CANCELLED tool results + │ để LLM context luôn hợp lệ (mỗi tool_call phải có tool_result) + │ + ├── Thêm user message vào ContextManager + │ + └── VÒNG LẶP (max 300 iterations): + │ + ├── 1. _compact_and_notify() + │ Kiểm tra: running_context_usage > compaction_threshold? + │ + ├── 2. check_for_doom_loop(messages) + │ Nếu detect loop → inject corrective prompt vào messages + │ + ├── 3. with_prompt_caching(messages, tools, model_name) + │ Anthropic models: thêm cache_control breakpoints + │ Khác: pass-through không thay đổi + │ + ├── 4. _call_llm_streaming() hoặc _call_llm_non_streaming() + │ litellm.acompletion() với unified interface + │ emit assistant_chunk events + │ + ├── 5. Nếu finish_reason == "stop" và không có tool_calls: + │ emit turn_complete, RETURN + │ + ├── 6. Xây dựng tool_calls list từ accumulated deltas + │ + ├── 7. Kiểm tra malformed JSON (LLM đôi khi gen sai JSON args) + │ _detect_repeated_malformed() → inject error message + │ + ├── 8. _needs_approval(tool_call, config)? + │ Nếu Yes: emit approval_required, set pending_approval + │ PAUSE — wait for EXEC_APPROVAL operation + │ Nếu No: tiếp tục + │ + ├── 9. tool_router.execute_tool(name, args) + │ emit tool_call event (trước khi execute) + │ emit tool_output event (sau khi execute) + │ + ├── 10. context_manager.add_message(tool_result) + │ cập nhật running_context_usage + │ + └── goto 1 +``` + +**Tại sao `_abandon_pending_approval` inject CANCELLED results?** + +Anthropic API yêu cầu: mỗi `assistant` message có `tool_calls` phải được theo sau bởi `tool` messages với đúng `tool_call_id`. Nếu user gửi message mới trong khi đang chờ approval, ta phải inject synthetic tool results với content "CANCELLED BY USER" — nếu không, API sẽ trả về lỗi về malformed conversation history. + +--- + +## 5. Session — Container Trạng Thái + +### File: `agent/core/session.py` + +Session là **đơn vị cô lập** của mỗi conversation. Nó chứa tất cả state cần thiết để một agent hoạt động. + +```text +Session +├── session_id: str (UUID) ← định danh duy nhất cho tracing +├── config: Config ← model, yolo_mode, save_sessions, ... +├── context_manager: ContextManager ← toàn bộ conversation history +├── tool_router: ToolRouter ← tool registry + MCP clients +├── event_queue: asyncio.Queue ← output channel +├── _cancelled: asyncio.Event ← interrupt signal +├── pending_approval: dict | None ← tool calls đang chờ user approve +├── sandbox: Sandbox | None ← remote code execution space +├── _running_job_ids: set[str] ← HF training jobs đang chạy +├── logged_events: list[dict] ← trajectory cho data collection +├── turn_count: int ← đếm số turns để auto-save +├── model_effective_effort: dict ← cache kết quả probe effort cascade +└── notification_gateway: ... ← Slack/webhook notifications +``` + +#### 5.1 Event System + +```python +async def send_event(self, event: Event) -> None: + # 1. Đưa event vào queue (để CLI/Web render) + await self.event_queue.put(event) + + # 2. Log vào trajectory (cho data collection) + self.logged_events.append({...}) + + # 3. Auto-notification (Slack, webhook) + await self._enqueue_auto_notification_requests(event) + + # 4. Heartbeat save (mid-turn, không block) + HeartbeatSaver.maybe_fire(self) +``` + +**Tại sao một hàm làm 4 việc?** Vì mọi event đều cần cả 4 side effects này. Nếu tách ra, mỗi callsite trong `agent_loop.py` phải nhớ gọi đủ 4 — dễ bỏ sót. `send_event` là single point of truth. + +#### 5.2 Trajectory Saving — Atomic Write Pattern + +```text +save_trajectory_local(): + 1. scrub() — xóa secrets (hf_token, API keys) khỏi payload + 2. Tính filepath (stable per session — không tạo file mới mỗi lần) + 3. Write to .tmp file (filepath + ".tmp") + 4. os.rename(.tmp → filepath) ← atomic trên POSIX + + Tại sao atomic? Nếu process crash giữa chừng khi đang write, + ta có file .json cũ (đầy đủ) thay vì file .json mới (bị truncate) + mà retry scanner không đọc được. +``` + +#### 5.3 Detached Upload Pattern + +```text +save_and_upload_detached(repo_id): + 1. save_trajectory_local() ← fast, synchronous + 2. subprocess.Popen( + [sys.executable, "session_uploader.py", "upload", path, repo_id], + start_new_session=True, ← detach khỏi parent process + stdin/stdout/stderr=DEVNULL + ) + + Tại sao subprocess thay vì asyncio task? + - asyncio task: nếu main process bị kill, task bị cancel + - subprocess với start_new_session=True: tiếp tục sống sau khi + parent chết. Upload không bao giờ bị mất dù server restart. +``` + +--- + +## 6. ContextManager — Bộ Nhớ Của Agent + +### File: `agent/context_manager/manager.py` + +```text +ContextManager +├── items: list[Message] ← toàn bộ conversation [system, user, assistant, tool, ...] +├── model_max_tokens: int ← context window của model (từ litellm.get_model_info) +├── running_context_usage: int ← token count hiện tại (cập nhật sau mỗi LLM call) +├── compact_size: int ← 10% của model_max_tokens = reserved space sau compaction +├── untouched_messages: int = 5 ← số messages gần nhất không bao giờ bị compact +└── system_prompt: str ← từ agent/prompts/system_prompt_v3.yaml (Jinja2 template) +``` + +#### 6.1 Compaction Threshold + +```text +model_max_tokens = 200,000 (Claude Sonnet, ví dụ) +compact_size = 20,000 (10%) + +compaction_threshold = 200,000 - 20,000 = 180,000 tokens + +Khi running_context_usage > 180,000: + needs_compaction = True +``` + +**Tại sao không compact tại 100%?** Nếu compact tại 100%, ta không còn đủ tokens để gọi LLM và xử lý kết quả compaction. Reserve 10% là buffer an toàn. + +#### 6.2 Compaction Algorithm + +```text +Trước compaction: +[system] [user_1] [assistant_1] [tool] [tool] [user_2] [assistant_2] ... [user_N] [last_5_msgs] + │ │ │ │ + │ first_user_msg (task ban đầu — không bao giờ compact) kept kept + │ + └── preserved + +Sau compaction: +[system] [user_1] [SUMMARY: "Agent đã làm X, Y, Z vì..."] [user_N] [last_5_msgs] + +SUMMARY được tạo bằng cách gọi LLM với prompt đặc biệt: +"Tóm tắt conversation trên, tập trung vào key decisions, WHY, + problems solved, context cần thiết cho người mới." +``` + +**Tại sao giữ `user_1` (first user message)?** Đây là task ban đầu. Agent cần luôn nhớ mình đang làm gì. Mất task ban đầu = agent không biết mình đang làm gì. + +**Tại sao giữ 5 messages cuối?** Để agent không bị "mất mạch" giữa chừng của một operation. Nếu đang execute một sequence of tool calls, 5 messages cuối đảm bảo context gần nhất luôn đầy đủ. + +#### 6.3 Dangling Tool Call Patch + +```text +_patch_dangling_tool_calls(): + Vấn đề: Anthropic API yêu cầu mỗi tool_call trong assistant message + phải có một matching tool_result message. Trong quá trình compaction + hoặc undo, một số tool_result có thể bị xóa nhưng tool_call vẫn còn. + + Fix: Scan toàn bộ items, tìm tool_calls không có matching tool_result, + inject synthetic tool_result với content "TOOL_RESULT_MISSING". + + Tại sao cần? Vì Anthropic API trả về 400 error nếu conversation + history malformed. Synthetic results là "lie" nhỏ nhất để keep API happy. +``` + +--- + +## 7. ToolRouter — Tay Của Agent + +### File: `agent/core/tools.py` + +```text +ToolRouter +├── tools: dict[str, ToolSpec] ← registry của built-in tools +├── mcp_client: Client | None ← FastMCP client cho MCP servers +└── _mcp_initialized: bool ← lazy init flag + +ToolSpec (dataclass) +├── name: str +├── description: str ← LLM đọc description này để biết dùng tool gì +├── parameters: dict ← JSON Schema, LLM tạo args theo schema này +└── handler: Callable ← async fn(args) → tuple[str, bool] + str = result text + bool = success flag +``` + +#### 7.1 Built-in Tools (16 tools) + +```text +Research & Knowledge: +├── research ── multi-step sub-agent với dedicated LLM calls +├── web_search ── Tavily search +├── hf_papers ── ArXiv papers từ HF daily papers +├── explore_hf_docs ── search trong HF documentation tree +└── fetch_hf_docs ── fetch specific doc page + +HF Hub: +├── hf_inspect_dataset ── xem dataset structure, splits, features +├── hf_repo_files ── đọc/ghi/xóa files trong HF repos +└── hf_repo_git ── git operations (commit, push, history) + +Compute: +├── sandbox_* ── remote Python execution trong HF Space +│ ├── sandbox_create +│ ├── sandbox_exec +│ ├── sandbox_read_file +│ ├── sandbox_write_file +│ └── sandbox_status +└── hf_jobs ── submit/monitor HF Training Jobs (GPU clusters) + +GitHub: +├── github_find_examples ── search code examples +├── github_read_file ── đọc file từ GitHub +└── github_list_repos ── list repos + +Utility: +├── plan ── structured planning với step tracking +└── notify ── gửi notification ra Slack/gateway +``` + +#### 7.2 MCP Integration + +```text +ToolRouter.__init__(): + 1. register tất cả built-in tools + 2. if mcp_servers config tồn tại: + inject HF token vào headers của mỗi server + tạo FastMCP Client với multi-server config + +ToolRouter.__aenter__(): ← context manager (async with tool_router:) + 1. mcp_client.initialize() + 2. fetch tool specs từ MCP servers + 3. register MCP tools vào tools dict + +execute_tool(name, args): + if name in self.tools: + return await self.tools[name].handler(args) + elif self.mcp_client and name in mcp_tools: + result = await mcp_client.call_tool(name, args) + return convert_mcp_content_to_string(result), True +``` + +**Tại sao MCP cho HF Hub?** MCP (Model Context Protocol) là chuẩn mở cho tool calls. `hf-mcp-server` tại `huggingface.co/mcp?login` expose toàn bộ HF Hub API. Thay vì implement từng API call thủ công, agent dùng MCP để tự động có quyền truy cập vào mọi capability của HF Hub — kể cả các feature mới được thêm vào Hub sau này. + +--- + +## 8. DoomLoop Detector — Hệ Thống Phòng Vệ + +### File: `agent/core/doom_loop.py` + +Đây là một trong những component thú vị nhất về mặt engineering. LLM agents có xu hướng bị stuck trong các vòng lặp vô tận — gọi cùng một tool với cùng arguments, không nhận ra mình đang lặp. + +#### 8.1 Cấu Trúc Dữ Liệu + +```python +@dataclass(frozen=True) +class ToolCallSignature: + name: str # tên tool + args_hash: str # MD5(canonical_json(args))[:12] + result_hash: str # MD5(str(result))[:12] — QUAN TRỌNG +``` + +**Tại sao hash cả result?** Nếu agent đang polling một job (gọi `hf_jobs` mỗi 30s), args giống nhau nhưng result khác (job status thay đổi). Đây là legitimate polling, không phải doom loop. Chỉ hash args sẽ false-positive cho polling. Hash cả result = chỉ trigger khi **cả args lẫn result đều giống hệt nhau**. + +#### 8.2 Canonical JSON Normalization + +```python +def _normalize_args(args_str: str) -> str: + # LLM có thể gen: {"a": 1, "b": 2} hoặc {"b": 2, "a": 1} + # Cả hai đều là cùng một call nhưng hash khác nhau nếu không normalize + return json.dumps(json.loads(args_str), sort_keys=True, separators=(",", ":")) +``` + +#### 8.3 Hai Pattern Được Detect + +```text +Pattern 1: Identical Consecutive (threshold=3) +───────────────────────────────────────────── +signatures = [A, B, C, C, C] ← 3 C liên tiếp → DOOM LOOP! + +Pattern 2: Repeating Sequence (length 2-5, reps ≥ 2) +────────────────────────────────────────────────────── +signatures = [X, Y, Z, A, B, A, B] ← [A,B] lặp 2 lần → DOOM LOOP! +signatures = [X, A, B, C, A, B, C] ← [A,B,C] lặp 2 lần → DOOM LOOP! +``` + +#### 8.4 Response Khi Detect + +```text +Thay vì crash hoặc stop agent, inject một "system message" vào đầu messages: + +"[SYSTEM: REPETITION GUARD] You have called 'sandbox_exec' with the same +arguments multiple times in a row, getting the same result each time. +STOP repeating this approach — it is not working. Step back and try a +fundamentally different strategy..." + +Tại sao inject vào messages thay vì throw exception? +LLM cần đọc được lý do tại sao nó bị dừng. Exception không giải thích được. +Message injection = LLM có thể self-correct và thử cách khác. +``` + +--- + +## 9. Backend — API Gateway & Session Pool + +### Files: `backend/session_manager.py`, `backend/routes/agent.py` + +#### 9.1 Session Pool Architecture + +```text +SessionManager (singleton) +│ +├── sessions: dict[str, AgentSession] +│ ├── session_id_1 → AgentSession { session, tool_router, queues, task, broadcaster } +│ ├── session_id_2 → AgentSession { ... } +│ └── ... +│ +├── _lock: asyncio.Lock ← guard create/delete operations +├── MAX_SESSIONS: 200 ← global cap +└── MAX_PER_USER: 10 ← per-user cap + +Sizing rationale: + 8 vCPU / 32 GB RAM (HF Space tier) + Mỗi session dùng ~10-20 MB (context, queues, asyncio task) + 200 sessions × 20 MB = 4 GB worst case + Còn 28 GB cho Python runtime + per-request overhead +``` + +#### 9.2 `AgentSession` — Session Wrapper + +```text +AgentSession (dataclass) +├── session_id: str +├── session: Session ← agent state +├── tool_router: ToolRouter ← tool registry +├── submission_queue: Queue ← input channel +├── user_id: str ← owner (authorization) +├── hf_token: str | None ← OAuth token của user +├── task: asyncio.Task ← coroutine chạy agent loop +├── broadcaster: EventBroadcaster ← fan-out events đến SSE subscribers +├── is_active: bool +├── is_processing: bool ← đang xử lý request? +└── claude_counted: bool ← đã tính quota Claude chưa? +``` + +#### 9.3 EventBroadcaster — Fan-out Pattern + +```text +EventBroadcaster +├── _source: asyncio.Queue ← đọc từ agent's event_queue +└── _subscribers: dict[id, Queue] ← mỗi SSE connection là một subscriber + +run(): + while True: + event = await _source.get() ← 1 event từ agent + for sub_q in _subscribers: + await sub_q.put(event) ← fan-out đến mọi subscriber +``` + +**Tại sao cần fan-out?** Một session có thể có nhiều SSE connections đồng thời (ví dụ: user mở cùng session trên 2 tab). EventBroadcaster đảm bảo mọi subscriber đều nhận được mọi event. Events đến khi không có subscriber nào sẽ bị discard — không buffer vì mỗi SSE turn là một request riêng biệt. + +#### 9.4 Session Creation — Thread Pool cho Blocking I/O + +```python +def _create_session_sync(): + # Blocking operations: + # - ToolRouter.__init__: có thể call HF API + # - Session.__init__: litellm.get_model_info() (HTTP call) + # - ContextManager.__init__: load system prompt, whoami API + tool_router = ToolRouter(config.mcpServers, hf_token=hf_token) + session = Session(event_queue, config=session_config, ...) + return tool_router, session + +# Chạy trong thread pool để không block event loop +tool_router, session = await asyncio.to_thread(_create_session_sync) +``` + +**Tại sao quan trọng?** FastAPI chạy trên asyncio event loop. Nếu `Session.__init__` block event loop (vì HTTP call), toàn bộ server dừng xử lý requests trong thời gian đó. `asyncio.to_thread()` chạy blocking code trong thread pool, event loop tự do nhận requests khác. + +#### 9.5 Session Rehydration — `seed_from_summary()` + +```text +Vấn đề: User đóng tab (session vẫn sống trên server), sau đó mở lại. +Frontend có cached messages cũ. Server có session mới (context trống). + +Giải pháp: seed_from_summary() + 1. Frontend gửi cached messages lên + 2. Backend gọi LLM để summarize chúng + 3. Inject summary vào context của session mới: + "[SYSTEM: Your prior memory of this conversation — written + in your own voice right before restart. Continue from here.]" + 4. Agent "nhớ lại" context cũ mà không cần re-process toàn bộ history +``` + +**Tại sao không replay toàn bộ messages?** Nếu session cũ có 200 messages × 1000 tokens = 200k tokens, replay sẽ lập tức fill context window. Summarization giữ essence của conversation trong ~2000 tokens. + +#### 9.6 Interrupt — Bypass Queue + +```python +async def interrupt(session_id: str) -> bool: + agent_session = self.sessions.get(session_id) + agent_session.session.cancel() # Set asyncio.Event trực tiếp + return True +``` + +**Interrupt bypass queue**, không thêm vào submission_queue. Tại sao? Nếu queue đang có 5 operations đang chờ, thêm INTERRUPT vào cuối queue có nghĩa là agent phải xử lý xong 5 operations trước mới interrupt. Đó không phải interrupt, đó là "schedule interrupt later". Gọi `session.cancel()` trực tiếp = immediate signal. + +--- + +## 10. Frontend — SSE Bridge & React Layer + +### Files: `frontend/src/lib/sse-chat-transport.ts`, `frontend/src/hooks/useAgentChat.ts` + +#### 10.1 Protocol Mismatch Problem + +```text +Backend protocol: Vercel AI SDK protocol: +───────────────── ─────────────────────── +data: { UIMessageChunk { + event_type: "tool_call" type: "tool-call" + tool: "sandbox_exec" toolCallId: "..." + args: {...} toolName: "sandbox_exec" +} args: {...} + } + +data: { + event_type: "assistant_chunk" → UIMessageChunk { type: "text-delta", textDelta: "..." } + chunk: "Tôi sẽ..." +} +``` + +`SSEChatTransport` là adapter layer giải quyết mismatch này. + +#### 10.2 SSEChatTransport — Dual Stream Architecture + +```text +sendMessages() được gọi khi user submit message + │ + ├── 1. POST /api/sessions/{id}/submit {text: "..."} + │ → Backend enqueue USER_INPUT operation + │ + ├── 2. fetch /api/sessions/{id}/events (SSE) + │ → Nhận stream của AgentEvent objects + │ + ├── 3. createSSEParserStream() + │ TransformStream + │ Parse "data: {...}\n\n" format + │ + ├── 4. createEventToChunkStream(sideChannel) + │ TransformStream + │ │ + │ ├── event_type == "assistant_chunk" + │ │ → UIMessageChunk { type: "text-delta", textDelta } + │ │ + │ ├── event_type == "tool_call" + │ │ → UIMessageChunk { type: "tool-call-streaming-start" } + │ │ → sideChannel.onToolCallPanel(tool, args) + │ │ + │ ├── event_type == "tool_output" + │ │ → UIMessageChunk { type: "tool-result" } + │ │ → sideChannel.onToolOutputPanel(tool, output) + │ │ + │ ├── event_type == "approval_required" + │ │ → sideChannel.onApprovalRequired(tools) + │ │ → UIMessageChunk approval request + │ │ + │ ├── event_type == "turn_complete" + │ │ → UIMessageChunk { type: "finish" } + │ │ → sideChannel.onProcessingDone() + │ │ + │ ├── event_type == "ready" + │ │ → sideChannel.onReady() + │ │ + │ └── event_type == "error"/"shutdown"/"interrupted" + │ → sideChannel callbacks + │ + └── 5. Return ReadableStream cho useChat +``` + +**Tại sao không dùng WebSocket?** SSE là uni-directional (server → client), đơn giản hơn WebSocket. Backend Agent emits events → client consumes. Khi user cần gửi message, dùng một POST request riêng. Hai operations riêng biệt (gửi + nhận) dễ reason hơn một bidirectional socket. + +#### 10.3 State Management — Zustand Stores + +```text +3 stores độc lập, có quan hệ: + +agentStore (global state + per-session state) +├── connected: bool ← session có đang kết nối không +├── isProcessing: bool ← active session đang xử lý không +├── error: string | null +├── plan: PlanStep[] | null ← current plan từ plan_tool +├── sessions: Map ← per-session state +└── updateSession(id, partial) ← update session, mirror to globals nếu active + +sessionStore (session list) +├── sessions: SessionInfo[] ← list của sessions +├── activeSessionId: string | null +└── setSessionActive(id, bool) + +layoutStore (UI state) +├── isRightPanelOpen: bool ← code/plan panel +└── setRightPanelOpen(bool) +``` + +**Tại sao tách 3 stores?** Single global store sẽ cause mọi component re-render khi bất kỳ state nào thay đổi. Tách stores theo domain = chỉ components cần `layoutStore` re-render khi panel mở/đóng, không ảnh hưởng đến Chat components. + +#### 10.4 Multi-Session Architecture + +```text +Frontend support nhiều sessions (như browser tabs): + +SessionSidebar: + [session_1] ── active ──► agentChat_1 mounted, useAgentChat running + [session_2] ── inactive ► agentChat_2 mounted, useAgentChat suspended + [+ New] + +Mỗi session có own useAgentChat instance với own: + - SSEChatTransport + - Message store (localStorage key: `chat_messages_{sessionId}`) + - Research state store + - Backend message cache + +Khi switch session: + - isActive prop thay đổi + - Side-channel callbacks check isActiveRef trước khi update global state + - Chỉ active session mirror state lên global agentStore +``` + +--- + +## 11. Data Flywheel — Vòng Lặp Thu Thập Dữ Liệu + +### Files: `agent/core/session.py`, `agent/sft/tagger.py`, `scripts/build_sft.py` + +#### 11.1 Tại Sao "Data Flywheel"? + +```text +Agent hoạt động tốt + │ + ▼ +Users dùng nhiều hơn + │ + ▼ +Thu thập nhiều session trajectories hơn + │ + ▼ +Train model tốt hơn trên trajectories đó + │ + ▼ +Agent hoạt động tốt hơn ──────────────────┐ + │ │ + └──────────────────────────────────────┘ + Flywheel! +``` + +#### 11.2 Session Trajectory Structure + +```json +{ + "session_id": "uuid", + "user_id": "hf_username", + "session_start_time": "2026-04-28T10:00:00", + "model_name": "anthropic/claude-opus-4-6", + "total_cost_usd": 0.42, + "messages": [ + {"role": "system", "content": "..."}, + {"role": "user", "content": "Fine-tune llama on my dataset"}, + {"role": "assistant", "content": null, "tool_calls": [...]}, + {"role": "tool", "content": "...", "tool_call_id": "..."}, + ... + ], + "events": [ + {"timestamp": "...", "event_type": "tool_call", "data": {...}}, + {"timestamp": "...", "event_type": "llm_call", "data": {"cost_usd": 0.05}}, + ... + ], + "tools": [...], + "upload_status": "success" +} +``` + +#### 11.3 SFT Tagging System + +```text +tag_session(trajectory) → list[str] + +Tags được tạo tự động từ trajectory: + +tool: → "tool:hf_jobs", "tool:sandbox_exec" +outcome: → "outcome:completed", "outcome:doom_loop" +hf_job: → "hf_job:succeeded", "hf_job:oom" +gpu: → "gpu:a100", "gpu:h100" +sandbox: → "sandbox:created", "sandbox:long_lived" +model: → "model:opus", "model:kimi" +turns: → "turns:short" (<5), "turns:medium", "turns:long" (>20) +cost: → "cost:low" (<$0.10), "cost:med", "cost:high" +task: → "task:training", "task:inference", "task:research_only" +``` + +Tags cho phép filter dataset downstream: +```python +# Chỉ lấy sessions training thành công với GPU +good_sessions = df[ + df.tags.apply(lambda t: + "outcome:completed" in t and + "task:training" in t and + "hf_job:succeeded" in t + ) +] +``` + +#### 11.4 Heartbeat Saves + +```text +HeartbeatSaver.maybe_fire(session): + Gọi sau mỗi send_event() + + Nếu elapsed > HEARTBEAT_INTERVAL: + save_trajectory_local() ← không phải upload, chỉ local + cập nhật _last_heartbeat_ts + + Mục đích: Nếu server crash giữa một long-running task + (fine-tuning có thể mất nhiều giờ), heartbeat đảm bảo + ta không mất toàn bộ trajectory. Upload sẽ retry sau. +``` + +--- + +## 12. Security Layer — Redact, Auth, Quotas + +#### 12.1 Secret Redaction (`agent/core/redact.py`) + +```text +scrub(trajectory_payload): + Đệ quy qua dict/list/str + Apply regex patterns: + ├── hf_[A-Za-z0-9]{30,} → [REDACTED_HF_TOKEN] + ├── sk-ant-[...]{20,} → [REDACTED_ANTHROPIC_KEY] + ├── sk-[...]{40,} → [REDACTED_OPENAI_KEY] + ├── gh[pousr]_[...]{36,} → [REDACTED_GITHUB_TOKEN] + ├── github_pat_[...]{36,} → [REDACTED_GITHUB_TOKEN] + ├── AKIA/ASIA[A-Z0-9]{16} → [REDACTED_AWS_KEY_ID] + ├── Bearer → Bearer [REDACTED] + └── KEY=value, KEY: value → KEY=[REDACTED] + +Redact xảy ra TẠI THỜI ĐIỂM SAVE (không phải trước khi agent xử lý). +Agent cần secrets để hoạt động. Secrets chỉ cần được xóa trước khi +lưu xuống disk hoặc upload. +``` + +#### 12.2 HF OAuth Auth (`backend/routes/auth.py`) + +```text +GET /api/auth/user + ← HF OAuth token từ cookie/header + → gọi HF API /api/whoami-v2 + → trả về {username, isPro, orgs} + +Tại sao không JWT? HF đã có identity system. Tái dùng HF token += users không cần tạo account riêng cho ML Intern. +``` + +#### 12.3 Quota System (`backend/user_quotas.py`) + +```text +Claude quota (Anthropic models): +├── Mỗi user có daily_claude_sessions_cap +├── Tính tại message-submit time (không phải session-create time) +│ Lý do: user có thể tạo Claude session để "nhìn around" mà không +│ tốn quota. Chỉ tính khi thực sự dùng. +├── Flag claude_counted trên AgentSession để tránh double-count +│ (user switch model trong session thì count không đổi) +└── 429 Too Many Requests nếu vượt cap + +Anthropic model gate: +├── Chỉ HF staff (in HuggingFace org) mới dùng được Claude Opus +├── Claude dùng ANTHROPIC_API_KEY của Space (bill cho HF) +├── Model free (Kimi, MiniMax, GLM) dùng HF Router (bill qua X-HF-Bill-To) +└── Non-HF users vẫn có thể dùng free models +``` + +#### 12.4 Prompt Caching — Cost Optimization + +```text +with_prompt_caching(messages, tools, model_name): + Chỉ áp dụng cho Anthropic models + + 1. Tools block: thêm cache_control vào LAST tool spec + → Toàn bộ tool definitions (~3-4k tokens) được cache + → Các turns tiếp theo trong 5 phút: ~10% của input price + + 2. System message: wrap content thành cached block + → System prompt (~1-2k tokens) được cache + + Kết quả: mỗi turn chỉ phải trả full price cho NEW tokens + (user message + tool results). Static context (tools + system) = free. + + Ví dụ: 20 turns × 5k tokens static = 100k tokens tiết kiệm mỗi session + Với Claude Opus 4.6: ~$1.50 tiết kiệm mỗi session dài. +``` + +--- + +## 13. Các Quyết Định Thiết Kế Quan Trọng + +### 13.1 LiteLLM — Model Abstraction Layer + +```text +Vấn đề: Agent muốn support Anthropic, OpenAI, HF Router, Bedrock, ... + Mỗi provider có API format khác nhau. + +Giải pháp: litellm.acompletion() — unified interface + litellm.acompletion( + model="anthropic/claude-opus-4-6", # hoặc "openai/gpt-5.5" + messages=[...], # OpenAI format, litellm translate + tools=[...], + stream=True, + ) + +Bonus: litellm.drop_params = True + → Nếu một provider không support một param (ví dụ: thinking_effort), + litellm tự drop param đó thay vì throw error. + Agent không cần biết capabilities của từng provider. +``` + +### 13.2 LiteLLM Effort Cascade — Graceful Degradation + +```text +Vấn đề: Mỗi model support các effort levels khác nhau. + Claude Opus 4.7: "xhigh", "high", "medium", "low" + Claude Opus 4.6: "max", "high", "medium", "low" + Kimi K2.6: không support thinking params + +effort_probe.py: + Thử: "max" → nếu 400 error: + Thử: "high" → nếu 400 error: + Thử: "medium" → nếu fail: + Kết luận: model không support thinking, gửi None + + Kết quả được cache trong session.model_effective_effort[model_name] + để lần sau không cần probe lại. +``` + +### 13.3 Local Mode + +```text +Config option: local_mode = True + +Trong ToolRouter, khi local_mode=True: + → Không tạo sandbox (không cần HF Space) + → Không expose hf_jobs tool + → Dùng local_tools.py thay vì remote tools + +Tại sao? CLI users chạy locally có thể không có HF token. +local_mode cho phép agent hoạt động với quyền giới hạn hơn. +``` + +### 13.4 YOLO Mode + +```text +Config: yolo_mode = True / False (default: False) + +_needs_approval(): + if session.config.yolo_mode: + return False # skip tất cả approvals + +Normally cần approval: + - hf_jobs: submit training job (tốn tiền) + - sandbox: create/destroy Space + - hf_repo_files: upload/delete files + - hf_repo_git: destructive git ops + +YOLO mode = agent tự approve mọi thứ. Hữu ích cho headless/automated runs. +``` + +--- + +## 14. Thứ Tự Đọc Codebase + +Đọc theo thứ tự này để xây dựng mental model từ dưới lên trên. + +### Phase 1 — Contracts & Config (30 phút) + +| File | Mục tiêu | +|------|---------| +| [agent/config.py](agent/config.py) | Hiểu shape của Config — model, tools, messaging | +| [agent/tools/types.py](agent/tools/types.py) | ToolSpec dataclass | +| [agent/core/session.py](agent/core/session.py) | `OpType`, `Event`, `Session` — unit of conversation | +| [configs/cli_agent_config.json](configs/cli_agent_config.json) | Config thực tế | + +### Phase 2 — Agent Heart (1 giờ) + +| File | Mục tiêu | +|------|---------| +| [agent/core/agent_loop.py](agent/core/agent_loop.py) | `submission_loop` + `run_agent` — ĐỌC KỸ | +| [agent/core/tools.py](agent/core/tools.py) | `ToolSpec`, `ToolRouter`, MCP integration | +| [agent/context_manager/manager.py](agent/context_manager/manager.py) | Compaction, history management | +| [agent/core/doom_loop.py](agent/core/doom_loop.py) | Loop detection — ngắn nhưng elegant | +| [agent/core/prompt_caching.py](agent/core/prompt_caching.py) | Cache breakpoints cho Anthropic | + +### Phase 3 — Tools (45 phút) + +| File | Mục tiêu | +|------|---------| +| [agent/tools/sandbox_tool.py](agent/tools/sandbox_tool.py) | Remote code execution — tool quan trọng nhất | +| [agent/tools/research_tool.py](agent/tools/research_tool.py) | Sub-agent pattern | +| [agent/tools/jobs_tool.py](agent/tools/jobs_tool.py) | HF Training Jobs | +| Còn lại trong [agent/tools/](agent/tools/) | Scan nhanh, tất cả follow ToolSpec pattern | + +### Phase 4 — CLI Entrypoint (20 phút) + +| File | Mục tiêu | +|------|---------| +| [agent/main.py](agent/main.py) | Cách CLI wire queues, headless vs interactive | + +### Phase 5 — Web Backend (45 phút) + +| File | Mục tiêu | +|------|---------| +| [backend/session_manager.py](backend/session_manager.py) | Multi-tenant pool, EventBroadcaster | +| [backend/routes/agent.py](backend/routes/agent.py) | REST + SSE endpoints, quota | +| [backend/routes/auth.py](backend/routes/auth.py) | HF OAuth | +| [backend/dependencies.py](backend/dependencies.py) | Auth middleware | +| [backend/user_quotas.py](backend/user_quotas.py) | Daily caps | + +### Phase 6 — Frontend (1 giờ) + +| File | Mục tiêu | +|------|---------| +| [frontend/src/lib/sse-chat-transport.ts](frontend/src/lib/sse-chat-transport.ts) | **Critical** — SSE → AI SDK bridge | +| [frontend/src/hooks/useAgentChat.ts](frontend/src/hooks/useAgentChat.ts) | React hook wire mọi thứ | +| [frontend/src/store/agentStore.ts](frontend/src/store/agentStore.ts) | Zustand global state | +| [frontend/src/components/SessionChat.tsx](frontend/src/components/SessionChat.tsx) | Main chat component | +| [frontend/src/components/Chat/](frontend/src/components/Chat/) | Scan nhanh rendering components | + +### Phase 7 — Data Flywheel (20 phút) + +| File | Mục tiêu | +|------|---------| +| [agent/core/redact.py](agent/core/redact.py) | Secret scrubbing patterns | +| [agent/core/telemetry.py](agent/core/telemetry.py) | Cost tracking, heartbeat | +| [agent/sft/tagger.py](agent/sft/tagger.py) | Automatic session tagging | +| [scripts/build_sft.py](scripts/build_sft.py) | Trajectory → training data | + +--- + +## Tóm Tắt Mental Model + +Hệ thống này có thể được hiểu qua một analogy đơn giản: + +```text +Hãy nghĩ về nó như một UNIX pipeline: + +[User Input] ←→ stdin +[submission_queue] ←→ pipe (buffered, typed) +[agent_loop] ←→ process (stateful, iterative) +[event_queue] ←→ pipe (event stream) +[CLI/SSE/Web] ←→ stdout (multiple sinks) +[session_logs] ←→ tee (fork to log file) +[HF Dataset] ←→ log aggregator + +Và mỗi "pipe" là typed AsyncQueue — không phải byte stream, +mà là structured objects với clear contracts. +``` + +**Ba invariant quan trọng nhất của hệ thống:** + +1. **Agent Core không biết về transport**: `agent_loop.py` không import FastAPI, không biết về SSE, không biết về CLI. Nó chỉ đọc từ `submission_queue` và write vào `event_queue`. + +2. **Session là đơn vị cô lập**: Hai sessions không share state. Mỗi session có context, tools, config riêng. Session isolation = multi-tenancy là free. + +3. **Every event is logged**: `session.send_event()` là single point of truth. Telemetry, data collection, notifications đều đi qua đây — không có out-of-band paths bị bỏ sót. diff --git a/WORKFLOW.md b/WORKFLOW.md new file mode 100644 index 00000000..5a72be18 --- /dev/null +++ b/WORKFLOW.md @@ -0,0 +1,224 @@ +# Development Workflow + +This repo is a fork of `huggingface/ml-intern` extended with an AI optimization agent. +The upstream team ships to `huggingface/ml-intern` daily. This document explains how to +build on top of that without conflicts — for both human engineers and AI agents. + +--- + +## Remote Setup + +```text +upstream → https://github.com/huggingface/ml-intern (source of truth, read-only) +origin → https://github.com/andreidhoang/ml-optimization-agent (your fork, push here) +``` + +Verify at any time: + +```bash +git remote -v +``` + +Expected output: + +```text +origin https://github.com/andreidhoang/ml-optimization-agent.git (fetch) +origin https://github.com/andreidhoang/ml-optimization-agent.git (push) +upstream https://github.com/huggingface/ml-intern (fetch) +upstream https://github.com/huggingface/ml-intern (push) +``` + +If `upstream` is missing, add it: + +```bash +git remote add upstream https://github.com/huggingface/ml-intern.git +``` + +--- + +## Syncing Upstream Changes + +Run this whenever the upstream team ships new commits (daily or before starting work): + +```bash +git fetch upstream +git merge upstream/main +git push origin main +``` + +**This will never overwrite your work.** See the "Why merges are always clean" section below. + +To check what upstream shipped before merging: + +```bash +git fetch upstream +git log upstream/main --oneline --not main # commits in upstream not yet in your branch +git diff main upstream/main --stat # which files changed +``` + +--- + +## The Zero-Diff Rule + +> **Never modify any file that already exists in the upstream repo.** + +This is the single rule that makes `git merge upstream/main` conflict-free forever. + +All new code lives in paths that do not exist in upstream: + +| Your path | Upstream has it? | +| --- | --- | +| `agent/optimization/` | No — safe to create | +| `agent/tools/hardware_specs.py` | No — safe to create | +| `agent/tools/profiling/` | No — safe to create | +| `agent/tools/training_opt/` | No — safe to create | +| `agent/tools/inference_opt/` | No — safe to create | +| `configs/optimization_agent_config.json` | No — safe to create | +| `agent/prompts/system_prompt_optimization_v1.yaml` | No — safe to create | +| `tests/optimization/` | No — safe to create | +| `agent/core/agent_loop.py` | **Yes — do not touch** | +| `agent/core/session.py` | **Yes — do not touch** | +| `agent/core/tools.py` | **Yes — do not touch** | +| `agent/config.py` | **Yes — do not touch** | +| `agent/context_manager/manager.py` | **Yes — do not touch** | +| `backend/` | **Yes — do not touch** | +| `frontend/` | **Yes — do not touch** | + +--- + +## Extending Upstream Classes (Without Modifying Them) + +Three upstream classes need extension. Use Python subclassing — no source changes. + +### Config → OptimizationConfig + +Upstream file: `agent/config.py` — **do not edit** + +Your extension: `agent/optimization/config_ext.py` + +```python +from agent.config import Config + +class OptimizationConfig(Config): + optimization_target: str | None = None + target_hardware: str | None = None + quality_budget: float = 0.98 + optimization_loop_enabled: bool = True +``` + +When upstream adds a new field to `Config`, you get it automatically via inheritance. + +### ContextManager → OptimizationContextManager + +Upstream file: `agent/context_manager/manager.py` — **do not edit** + +Your extension: `agent/optimization/context_manager_ext.py` + +```python +from agent.context_manager.manager import ContextManager + +class OptimizationContextManager(ContextManager): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.persistent_state: dict = {} # survives compaction + + async def compact(self, *args, **kwargs): + await super().compact(*args, **kwargs) + # re-inject persistent_state after compaction + ... +``` + +When upstream improves `compact()`, your subclass picks up the fix via `super().compact()`. + +### ToolRouter — no subclass needed + +Upstream file: `agent/core/tools.py` — **do not edit** + +`ToolRouter` already has a public `register_tool()` method. Add tools after construction: + +```python +from agent.core.tools import ToolRouter + +router = ToolRouter(mcp_servers=config.mcpServers) +router.register_tool(HARDWARE_SPECS_TOOL_SPEC) # your tool +router.register_tool(MLSYS_PAPERS_TOOL_SPEC) # your tool +``` + +When upstream adds new tools to `create_builtin_tools()`, they appear in `router` automatically. + +--- + +## Adding New Code + +1. Create your file under one of the owned paths listed above. +2. Import from upstream freely — imports are not modifications. +3. Run existing tests to make sure nothing broke. + +```bash +uv run pytest tests/unit/ -q # upstream tests must always pass +``` + +--- + +## Why Merges Are Always Clean + +`git merge` applies a **diff**, not a folder copy. It only changes lines that upstream changed. + +Three cases: + +**Untracked files** (e.g. `PLAN.md`, `WORKFLOW.md`) — git has never heard of them. +No merge command can touch them. They are invisible to git until you `git add` them. + +**Your new files** (e.g. `agent/optimization/`) — upstream has no history for these paths. +Upstream's diff says nothing about them. Merge leaves them alone. + +**Upstream files you didn't touch** (e.g. `agent/config.py`) — git does a 3-way merge: + +- Base: the commit where both branches last agreed +- Upstream: changed line X → X' +- You: never changed line X +- Result: apply X → X' cleanly, no conflict + +A conflict only occurs when **both you and upstream changed the same line**. +The zero-diff rule makes that impossible. + +--- + +## What To Do When Upstream Changes a File You Extend + +Example: upstream refactors `ContextManager.compact()` and renames a parameter. + +1. The merge still succeeds — no conflict, because you didn't touch `manager.py`. +2. Run tests: `uv run pytest tests/ -q` +3. If a test fails, inspect what changed: `git diff upstream/main HEAD -- agent/context_manager/manager.py` +4. Fix the affected `super()` call in your subclass (`agent/optimization/context_manager_ext.py`). +5. Tests pass again. + +This is the only maintenance cost of this architecture. It happens rarely and takes minutes. + +--- + +## File Ownership Reference + +| Owner | Paths | Rule | +| --- | --- | --- | +| **Upstream** | `agent/core/`, `agent/config.py`, `agent/context_manager/manager.py`, `agent/tools/*.py` (existing), `backend/`, `frontend/`, `tests/unit/` | Never modify. Pull freely. | +| **This fork** | `agent/optimization/`, `agent/tools/hardware_specs.py`, `agent/tools/profiling/`, `agent/tools/training_opt/`, `agent/tools/inference_opt/`, `agent/tools/multimodal_opt/`, `agent/tools/vla_opt/`, `agent/prompts/system_prompt_optimization_v1.yaml`, `configs/optimization_agent_config.json`, `tests/optimization/` | Full ownership. Never exists in upstream. | + +--- + +## Quick Reference + +```bash +# Sync upstream (run daily) +git fetch upstream && git merge upstream/main && git push origin main + +# Check what upstream shipped +git log upstream/main --oneline --not main + +# Verify tests still pass after sync +uv run pytest tests/unit/ -q + +# See which files you own vs upstream +git diff upstream/main --name-only # should only list files in "This fork" table above +``` diff --git a/agent/config.py b/agent/config.py index 5a6a8a45..87f2a9c5 100644 --- a/agent/config.py +++ b/agent/config.py @@ -27,6 +27,13 @@ class Config(BaseModel): mcpServers: dict[str, MCPServerConfig] = {} save_sessions: bool = True session_dataset_repo: str = "smolagents/ml-intern-sessions" + # Per-user private dataset that mirrors each session in Claude Code JSONL + # format so the HF Agent Trace Viewer auto-renders it + # (https://huggingface.co/changelog/agent-trace-viewer). Created private + # on first use; user flips it public via /share-traces. ``{hf_user}`` is + # substituted at upload time from the authenticated HF username. + share_traces: bool = True + personal_trace_repo_template: str = "{hf_user}/ml-intern-sessions" auto_save_interval: int = 1 # Save every N user turns (0 = disabled) # Mid-turn heartbeat: save + upload every N seconds while events are being # emitted. Guards against losing trace data on long-running turns that diff --git a/agent/context_manager/manager.py b/agent/context_manager/manager.py index c842c884..330f654c 100644 --- a/agent/context_manager/manager.py +++ b/agent/context_manager/manager.py @@ -4,6 +4,7 @@ import logging import os +import time import zoneinfo from datetime import datetime from pathlib import Path @@ -102,6 +103,8 @@ async def summarize_messages( max_tokens: int = 2000, tool_specs: list[dict] | None = None, prompt: str = _COMPACT_PROMPT, + session: Any = None, + kind: str = "compaction", ) -> tuple[str, int]: """Run a summarization prompt against a list of messages. @@ -110,6 +113,13 @@ async def summarize_messages( instead — it preserves the tool-call trail so the agent can answer follow-up questions about what it did. + ``session`` is optional; when provided, the call is recorded via + ``telemetry.record_llm_call`` so its cost lands in the session's + ``total_cost_usd``. Without it, the call still happens but is + invisible in telemetry — which used to be the case for every + compaction call until 2026-04-29 (~30-50% of Bedrock spend was + attributed to this single source of dark cost). + Returns ``(summary_text, completion_tokens)``. """ from agent.core.llm_params import _resolve_llm_params @@ -119,12 +129,23 @@ async def summarize_messages( prompt_messages, tool_specs = with_prompt_caching( prompt_messages, tool_specs, llm_params.get("model") ) + _t0 = time.monotonic() response = await acompletion( messages=prompt_messages, max_completion_tokens=max_tokens, tools=tool_specs, **llm_params, ) + if session is not None: + from agent.core import telemetry + await telemetry.record_llm_call( + session, + model=model_name, + response=response, + latency_ms=int((time.monotonic() - _t0) * 1000), + finish_reason=response.choices[0].finish_reason if response.choices else None, + kind=kind, + ) summary = response.choices[0].message.content or "" completion_tokens = response.usage.completion_tokens if response.usage else 0 return summary, completion_tokens @@ -219,6 +240,8 @@ def add_message(self, message: Message, token_count: int = None) -> None: """Add a message to the history""" if token_count: self.running_context_usage = token_count + if not getattr(message, "timestamp", None): + message.timestamp = datetime.now().isoformat() self.items.append(message) if self.on_message_added: self.on_message_added(message) @@ -291,6 +314,7 @@ def _patch_dangling_tool_calls(self) -> None: content="Tool was not executed (interrupted or error).", tool_call_id=tc.id, name=tc.function.name, + timestamp=datetime.now().isoformat(), ) ) @@ -355,8 +379,14 @@ async def compact( model_name: str, tool_specs: list[dict] | None = None, hf_token: str | None = None, + session: Any = None, ) -> None: - """Remove old messages to keep history under target size""" + """Remove old messages to keep history under target size. + + ``session`` is optional — if passed, the underlying summarization + LLM call is recorded via ``telemetry.record_llm_call(kind= + "compaction")`` so its cost shows up in ``total_cost_usd``. + """ if not self.needs_compaction: return @@ -394,8 +424,14 @@ async def compact( max_tokens=self.compact_size, tool_specs=tool_specs, prompt=_COMPACT_PROMPT, + session=session, + kind="compaction", + ) + summarized_message = Message( + role="assistant", + content=summary, + timestamp=datetime.now().isoformat(), ) - summarized_message = Message(role="assistant", content=summary) # Reconstruct: system + first user msg + summary + recent messages head = [system_msg] if system_msg else [] diff --git a/agent/core/agent_loop.py b/agent/core/agent_loop.py index 8b7a4572..03a4457a 100644 --- a/agent/core/agent_loop.py +++ b/agent/core/agent_loop.py @@ -19,6 +19,11 @@ from litellm.exceptions import ContextWindowExceededError from agent.config import Config +from agent.core.approval_policy import ( + is_scheduled_operation, + normalize_tool_operation, +) +from agent.core.cost_estimation import CostEstimate, estimate_tool_cost from agent.messaging.gateway import NotificationGateway from agent.core import telemetry from agent.core.doom_loop import check_for_doom_loop @@ -27,6 +32,7 @@ from agent.core.session import Event, OpType, Session from agent.core.tools import ToolRouter from agent.tools.jobs_tool import CPU_FLAVORS +from agent.tools.sandbox_tool import DEFAULT_CPU_SANDBOX_HARDWARE logger = logging.getLogger(__name__) @@ -110,13 +116,39 @@ def _validate_tool_args(tool_args: dict) -> tuple[bool, str | None]: return True, None -def _needs_approval( +_IMMEDIATE_HF_JOB_RUNS = {"run", "uv"} + +@dataclass(frozen=True) +class ApprovalDecision: + requires_approval: bool + auto_approved: bool = False + auto_approval_blocked: bool = False + block_reason: str | None = None + estimated_cost_usd: float | None = None + remaining_cap_usd: float | None = None + billable: bool = False + + +def _operation(tool_args: dict) -> str: + return normalize_tool_operation(tool_args.get("operation")) + + +def _is_immediate_hf_job_run(tool_name: str, tool_args: dict) -> bool: + return tool_name == "hf_jobs" and _operation(tool_args) in _IMMEDIATE_HF_JOB_RUNS + + +def _is_scheduled_hf_job_run(tool_name: str, tool_args: dict) -> bool: + return tool_name == "hf_jobs" and is_scheduled_operation(_operation(tool_args)) + + +def _is_budgeted_auto_approval_target(tool_name: str, tool_args: dict) -> bool: + return tool_name == "sandbox_create" or _is_immediate_hf_job_run(tool_name, tool_args) + + +def _base_needs_approval( tool_name: str, tool_args: dict, config: Config | None = None ) -> bool: - """Check if a tool call requires user approval before execution.""" - # Yolo mode: skip all approvals - if config and config.yolo_mode: - return False + """Check if a tool call requires approval before YOLO policy is applied.""" # If args are malformed, skip approval (validation error will be shown later) args_valid, _ = _validate_tool_args(tool_args) @@ -124,11 +156,14 @@ def _needs_approval( return False if tool_name == "sandbox_create": - return True + hardware = tool_args.get("hardware") or DEFAULT_CPU_SANDBOX_HARDWARE + return hardware != DEFAULT_CPU_SANDBOX_HARDWARE if tool_name == "hf_jobs": - operation = tool_args.get("operation", "") - if operation not in ["run", "uv", "scheduled run", "scheduled uv"]: + operation = _operation(tool_args) + if is_scheduled_operation(operation): + return True + if operation not in _IMMEDIATE_HF_JOB_RUNS: return False # Check if this is a CPU-only job @@ -180,6 +215,143 @@ def _needs_approval( return False +def _needs_approval( + tool_name: str, tool_args: dict, config: Config | None = None +) -> bool: + """Legacy sync approval predicate used by tests and CLI display helpers.""" + if _is_scheduled_hf_job_run(tool_name, tool_args): + return True + if config and config.yolo_mode: + return False + return _base_needs_approval(tool_name, tool_args, config) + + +def _session_auto_approval_enabled(session: Session | None) -> bool: + return bool(session and getattr(session, "auto_approval_enabled", False)) + + +def _effective_yolo_enabled(session: Session | None, config: Config | None) -> bool: + return bool((config and config.yolo_mode) or _session_auto_approval_enabled(session)) + + +def _remaining_budget_after_reservations( + session: Session | None, reserved_spend_usd: float +) -> float | None: + if not session or getattr(session, "auto_approval_cost_cap_usd", None) is None: + return None + cap = float(getattr(session, "auto_approval_cost_cap_usd") or 0.0) + spent = float(getattr(session, "auto_approval_estimated_spend_usd", 0.0) or 0.0) + return round(max(0.0, cap - spent - reserved_spend_usd), 4) + + +def _budget_block_reason( + estimate: CostEstimate, + *, + remaining_cap_usd: float | None, +) -> str | None: + if estimate.estimated_cost_usd is None: + return estimate.block_reason or "Could not estimate the cost safely." + if remaining_cap_usd is not None and estimate.estimated_cost_usd > remaining_cap_usd: + return ( + f"Estimated cost ${estimate.estimated_cost_usd:.2f} exceeds " + f"remaining YOLO cap ${remaining_cap_usd:.2f}." + ) + return None + + +async def _approval_decision( + tool_name: str, + tool_args: dict, + session: Session, + *, + reserved_spend_usd: float = 0.0, +) -> ApprovalDecision: + """Return the approval decision for one parsed tool call.""" + config = session.config + base_requires_approval = _base_needs_approval(tool_name, tool_args, config) + + # Scheduled jobs are recurring/unbounded enough that YOLO never bypasses + # the human confirmation, including legacy config.yolo_mode. + if _is_scheduled_hf_job_run(tool_name, tool_args): + return ApprovalDecision( + requires_approval=True, + auto_approval_blocked=_effective_yolo_enabled(session, config), + block_reason="Scheduled HF jobs always require manual approval.", + ) + + yolo_enabled = _effective_yolo_enabled(session, config) + budgeted_target = _is_budgeted_auto_approval_target(tool_name, tool_args) + + # Cost caps are a session-scoped web policy. Legacy config.yolo_mode + # remains uncapped for CLI/headless, except for scheduled jobs above. + session_yolo_enabled = _session_auto_approval_enabled(session) + if yolo_enabled and budgeted_target and session_yolo_enabled: + estimate = await estimate_tool_cost(tool_name, tool_args, session=session) + remaining = _remaining_budget_after_reservations(session, reserved_spend_usd) + reason = _budget_block_reason(estimate, remaining_cap_usd=remaining) + if reason: + return ApprovalDecision( + requires_approval=True, + auto_approval_blocked=True, + block_reason=reason, + estimated_cost_usd=estimate.estimated_cost_usd, + remaining_cap_usd=remaining, + billable=estimate.billable, + ) + if base_requires_approval: + return ApprovalDecision( + requires_approval=False, + auto_approved=True, + estimated_cost_usd=estimate.estimated_cost_usd, + remaining_cap_usd=remaining, + billable=estimate.billable, + ) + return ApprovalDecision( + requires_approval=False, + estimated_cost_usd=estimate.estimated_cost_usd, + remaining_cap_usd=remaining, + billable=estimate.billable, + ) + + if base_requires_approval and yolo_enabled: + return ApprovalDecision(requires_approval=False, auto_approved=True) + + return ApprovalDecision(requires_approval=base_requires_approval) + + +def _record_estimated_spend(session: Session, decision: ApprovalDecision) -> None: + if not decision.billable or decision.estimated_cost_usd is None: + return + if hasattr(session, "add_auto_approval_estimated_spend"): + session.add_auto_approval_estimated_spend(decision.estimated_cost_usd) + else: + session.auto_approval_estimated_spend_usd = round( + float(getattr(session, "auto_approval_estimated_spend_usd", 0.0) or 0.0) + + float(decision.estimated_cost_usd), + 4, + ) + + +async def _record_manual_approved_spend_if_needed( + session: Session, + tool_name: str, + tool_args: dict, +) -> None: + if not _session_auto_approval_enabled(session): + return + if not _is_budgeted_auto_approval_target(tool_name, tool_args): + return + estimate = await estimate_tool_cost(tool_name, tool_args, session=session) + _record_estimated_spend( + session, + ApprovalDecision( + requires_approval=False, + billable=estimate.billable, + estimated_cost_usd=estimate.estimated_cost_usd, + ), + ) + + # -- LLM retry constants -------------------------------------------------- _MAX_LLM_RETRIES = 3 _LLM_RETRY_DELAYS = [5, 15, 30] # seconds between retries @@ -282,6 +454,7 @@ async def _heal_effort_and_rebuild_params( try: outcome = await probe_effort( model, session.config.reasoning_effort, session.hf_token, + session=session, ) session.model_effective_effort[model] = outcome.effective_effort logger.info( @@ -354,6 +527,7 @@ async def _compact_and_notify(session: Session) -> None: model_name=session.config.model_name, tool_specs=session.tool_router.get_tool_specs_for_llm(), hf_token=session.hf_token, + session=session, ) new_usage = cm.running_context_usage if new_usage != old_usage: @@ -1061,29 +1235,49 @@ async def run_agent( if session.is_cancelled: break - # Separate good tools into approval-required vs auto-execute - approval_required_tools: list[tuple[ToolCall, str, dict]] = [] - non_approval_tools: list[tuple[ToolCall, str, dict]] = [] + # Separate good tools into approval-required vs auto-execute. + # Track reserved spend while classifying a batch so two + # auto-approved jobs in one model response cannot jointly + # exceed the remaining session cap. + approval_required_tools: list[ + tuple[ToolCall, str, dict, ApprovalDecision] + ] = [] + non_approval_tools: list[ + tuple[ToolCall, str, dict, ApprovalDecision] + ] = [] + reserved_auto_spend_usd = 0.0 for tc, tool_name, tool_args in good_tools: - if _needs_approval(tool_name, tool_args, session.config): - approval_required_tools.append((tc, tool_name, tool_args)) + decision = await _approval_decision( + tool_name, + tool_args, + session, + reserved_spend_usd=reserved_auto_spend_usd, + ) + if decision.requires_approval: + approval_required_tools.append((tc, tool_name, tool_args, decision)) else: - non_approval_tools.append((tc, tool_name, tool_args)) + non_approval_tools.append((tc, tool_name, tool_args, decision)) + if ( + decision.auto_approved + and decision.billable + and decision.estimated_cost_usd is not None + ): + reserved_auto_spend_usd += decision.estimated_cost_usd # Execute non-approval tools (in parallel when possible) if non_approval_tools: # 1. Validate args upfront parsed_tools: list[ - tuple[ToolCall, str, dict, bool, str] + tuple[ToolCall, str, dict, ApprovalDecision, bool, str] ] = [] - for tc, tool_name, tool_args in non_approval_tools: + for tc, tool_name, tool_args, decision in non_approval_tools: args_valid, error_msg = _validate_tool_args(tool_args) parsed_tools.append( - (tc, tool_name, tool_args, args_valid, error_msg) + (tc, tool_name, tool_args, decision, args_valid, error_msg) ) # 2. Send all tool_call events upfront (so frontend shows them all) - for tc, tool_name, tool_args, args_valid, _ in parsed_tools: + for tc, tool_name, tool_args, _decision, args_valid, _ in parsed_tools: if args_valid: await session.send_event( Event( @@ -1101,11 +1295,14 @@ async def _exec_tool( tc: ToolCall, name: str, args: dict, + decision: ApprovalDecision, valid: bool, err: str, ) -> tuple[ToolCall, str, dict, str, bool]: if not valid: return (tc, name, args, err, False) + if decision.billable: + _record_estimated_spend(session, decision) out, ok = await session.tool_router.call_tool( name, args, session=session, tool_call_id=tc.id ) @@ -1113,8 +1310,8 @@ async def _exec_tool( gather_task = asyncio.ensure_future(asyncio.gather( *[ - _exec_tool(tc, name, args, valid, err) - for tc, name, args, valid, err in parsed_tools + _exec_tool(tc, name, args, decision, valid, err) + for tc, name, args, decision, valid, err in parsed_tools ] )) cancel_task = asyncio.ensure_future(session._cancelled.wait()) @@ -1131,7 +1328,7 @@ async def _exec_tool( except asyncio.CancelledError: pass # Notify frontend that in-flight tools were cancelled - for tc, name, _args, valid, _ in parsed_tools: + for tc, name, _args, _decision, valid, _ in parsed_tools: if valid: await session.send_event(Event( event_type="tool_state_change", @@ -1169,7 +1366,8 @@ async def _exec_tool( if approval_required_tools: # Prepare batch approval data tools_data = [] - for tc, tool_name, tool_args in approval_required_tools: + blocked_payloads = [] + for tc, tool_name, tool_args, decision in approval_required_tools: # Resolve sandbox file paths for hf_jobs scripts so the # frontend can display & edit the actual file content. if tool_name == "hf_jobs" and isinstance(tool_args.get("script"), str): @@ -1179,20 +1377,42 @@ async def _exec_tool( if resolved: tool_args = {**tool_args, "script": resolved} - tools_data.append({ + tool_payload = { "tool": tool_name, "arguments": tool_args, "tool_call_id": tc.id, - }) - + } + if decision.auto_approval_blocked: + tool_payload.update( + { + "auto_approval_blocked": True, + "block_reason": decision.block_reason, + "estimated_cost_usd": decision.estimated_cost_usd, + "remaining_cap_usd": decision.remaining_cap_usd, + } + ) + blocked_payloads.append(tool_payload) + tools_data.append(tool_payload) + + event_data = {"tools": tools_data, "count": len(tools_data)} + if blocked_payloads: + first = blocked_payloads[0] + event_data.update( + { + "auto_approval_blocked": True, + "block_reason": first.get("block_reason"), + "estimated_cost_usd": first.get("estimated_cost_usd"), + "remaining_cap_usd": first.get("remaining_cap_usd"), + } + ) await session.send_event(Event( event_type="approval_required", - data={"tools": tools_data, "count": len(tools_data)}, + data=event_data, )) # Store all approval-requiring tools (ToolCall objects for execution) session.pending_approval = { - "tool_calls": [tc for tc, _, _ in approval_required_tools], + "tool_calls": [tc for tc, _, _, _ in approval_required_tools], } # Return early - wait for EXEC_APPROVAL operation @@ -1382,6 +1602,8 @@ async def execute_tool(tc, tool_name, tool_args, was_edited): ) ) + await _record_manual_approved_spend_if_needed(session, tool_name, tool_args) + output, success = await session.tool_router.call_tool( tool_name, tool_args, session=session, tool_call_id=tc.id ) @@ -1577,10 +1799,14 @@ async def submission_loop( session_holder[0] = session logger.info("Agent loop started") - # Retry any failed uploads from previous sessions (fire-and-forget) + # Retry any failed uploads from previous sessions (fire-and-forget). + # Includes the personal trace repo when enabled so a session that failed + # to publish to the user's HF dataset gets a fresh attempt on next run. if config and config.save_sessions: Session.retry_failed_uploads_detached( - directory="session_logs", repo_id=config.session_dataset_repo + directory="session_logs", + repo_id=config.session_dataset_repo, + personal_repo_id=session._personal_trace_repo_id(), ) try: diff --git a/agent/core/approval_policy.py b/agent/core/approval_policy.py new file mode 100644 index 00000000..73098ca6 --- /dev/null +++ b/agent/core/approval_policy.py @@ -0,0 +1,11 @@ +"""Shared predicates for approval-gated tool operations.""" + +from typing import Any + + +def normalize_tool_operation(operation: Any) -> str: + return str(operation or "").strip().lower() + + +def is_scheduled_operation(operation: Any) -> bool: + return normalize_tool_operation(operation).startswith("scheduled ") diff --git a/agent/core/cost_estimation.py b/agent/core/cost_estimation.py new file mode 100644 index 00000000..f1f98ec8 --- /dev/null +++ b/agent/core/cost_estimation.py @@ -0,0 +1,278 @@ +"""Conservative cost estimates for auto-approved infrastructure actions.""" + +import os +import re +import time +from dataclasses import dataclass +from typing import Any + +import httpx + +OPENID_PROVIDER_URL = os.environ.get("OPENID_PROVIDER_URL", "https://huggingface.co") +JOBS_HARDWARE_URL = f"{OPENID_PROVIDER_URL}/api/jobs/hardware" +JOBS_PRICE_CACHE_TTL_S = 6 * 60 * 60 + +DEFAULT_JOB_TIMEOUT_HOURS = 0.5 +DEFAULT_SANDBOX_RESERVATION_HOURS = 1.0 + +# Static fallback prices are intentionally conservative enough for a budget +# guard. The live /api/jobs/hardware catalog wins whenever it is reachable. +HF_JOBS_PRICE_USD_PER_HOUR: dict[str, float] = { + "cpu-basic": 0.05, + "cpu-upgrade": 0.25, + "cpu-performance": 0.50, + "cpu-xl": 1.00, + "t4-small": 0.60, + "t4-medium": 0.90, + "l4x1": 1.00, + "l4x4": 4.00, + "l40sx1": 2.00, + "l40sx4": 8.00, + "l40sx8": 16.00, + "a10g-small": 1.00, + "a10g-large": 2.00, + "a10g-largex2": 4.00, + "a10g-largex4": 8.00, + "a100-large": 4.00, + "a100x4": 16.00, + "a100x8": 32.00, + "h200": 10.00, + "h200x2": 20.00, + "h200x4": 40.00, + "h200x8": 80.00, + "inf2x6": 6.00, +} + +SPACE_PRICE_USD_PER_HOUR: dict[str, float] = { + "cpu-basic": 0.0, + "cpu-upgrade": 0.05, + "cpu-performance": 0.50, + "cpu-xl": 1.00, + "t4-small": 0.60, + "t4-medium": 0.90, + "l4x1": 1.00, + "l4x4": 4.00, + "l40sx1": 2.00, + "l40sx4": 8.00, + "l40sx8": 16.00, + "a10g-small": 1.00, + "a10g-large": 2.00, + "a10g-largex2": 4.00, + "a10g-largex4": 8.00, + "a100-large": 4.00, + "a100x4": 16.00, + "a100x8": 32.00, + "h200": 10.00, + "h200x2": 20.00, + "h200x4": 40.00, + "h200x8": 80.00, + "inf2x6": 6.00, +} + +_DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([smhd]?)\s*$", re.IGNORECASE) +_PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)") +_jobs_price_cache: tuple[float, dict[str, float]] | None = None + + +@dataclass(frozen=True) +class CostEstimate: + """Estimated cost for a tool call. + + ``estimated_cost_usd=None`` means the call may be billable but we could not + estimate it safely, so auto-approval should fall back to a human decision. + """ + + estimated_cost_usd: float | None + billable: bool + block_reason: str | None = None + label: str | None = None + + +def parse_timeout_hours(value: Any, *, default_hours: float = DEFAULT_JOB_TIMEOUT_HOURS) -> float | None: + """Parse HF timeout values into hours. + + Strings accept ``s``, ``m``, ``h``, or ``d`` suffixes. Numeric values are + treated as seconds, matching the Hub client's typed timeout parameter. + """ + if value is None or value == "": + return default_hours + if isinstance(value, bool): + return None + if isinstance(value, int | float): + seconds = float(value) + return seconds / 3600 if seconds > 0 else None + if not isinstance(value, str): + return None + + match = _DURATION_RE.match(value) + if not match: + return None + amount = float(match.group(1)) + unit = match.group(2).lower() or "s" + if amount <= 0: + return None + if unit == "s": + return amount / 3600 + if unit == "m": + return amount / 60 + if unit == "h": + return amount + if unit == "d": + return amount * 24 + return None + + +def _extract_flavor(item: dict[str, Any]) -> str | None: + for key in ("flavor", "name", "id", "value", "hardware", "hardware_flavor"): + value = item.get(key) + if isinstance(value, str) and value: + return value + return None + + +def _coerce_price(value: Any) -> float | None: + if isinstance(value, bool) or value is None: + return None + if isinstance(value, int | float): + return float(value) if value >= 0 else None + if isinstance(value, str): + match = _PRICE_RE.search(value.replace(",", "")) + if match: + return float(match.group(1)) + return None + + +def _extract_hourly_price(item: dict[str, Any]) -> float | None: + for key in ( + "price", + "price_usd", + "priceUsd", + "price_per_hour", + "pricePerHour", + "hourly_price", + "hourlyPrice", + "usd_per_hour", + "usdPerHour", + ): + price = _coerce_price(item.get(key)) + if price is not None: + return price + for key in ("pricing", "billing", "cost"): + nested = item.get(key) + if isinstance(nested, dict): + price = _extract_hourly_price(nested) + if price is not None: + return price + return None + + +def _iter_hardware_items(payload: Any): + if isinstance(payload, list): + for item in payload: + yield from _iter_hardware_items(item) + elif isinstance(payload, dict): + if _extract_flavor(payload): + yield payload + for key in ("hardware", "flavors", "items", "data", "jobs"): + child = payload.get(key) + if child is not None: + yield from _iter_hardware_items(child) + + +def _parse_jobs_price_catalog(payload: Any) -> dict[str, float]: + prices: dict[str, float] = {} + for item in _iter_hardware_items(payload): + flavor = _extract_flavor(item) + price = _extract_hourly_price(item) + if flavor and price is not None: + prices[flavor] = price + return prices + + +async def hf_jobs_price_catalog() -> dict[str, float]: + """Return live HF Jobs hourly prices, falling back to static prices.""" + global _jobs_price_cache + now = time.monotonic() + if _jobs_price_cache and now - _jobs_price_cache[0] < JOBS_PRICE_CACHE_TTL_S: + return dict(_jobs_price_cache[1]) + + prices: dict[str, float] = {} + try: + async with httpx.AsyncClient(timeout=3.0) as client: + response = await client.get(JOBS_HARDWARE_URL) + if response.status_code == 200: + prices = _parse_jobs_price_catalog(response.json()) + except (httpx.HTTPError, ValueError): + prices = {} + + if not prices: + prices = dict(HF_JOBS_PRICE_USD_PER_HOUR) + else: + prices = {**HF_JOBS_PRICE_USD_PER_HOUR, **prices} + + _jobs_price_cache = (now, prices) + return dict(prices) + + +async def estimate_hf_job_cost(args: dict[str, Any]) -> CostEstimate: + flavor = str( + args.get("hardware_flavor") + or args.get("flavor") + or args.get("hardware") + or "cpu-basic" + ) + timeout_hours = parse_timeout_hours(args.get("timeout")) + if timeout_hours is None: + return CostEstimate( + estimated_cost_usd=None, + billable=True, + block_reason=f"Could not parse HF job timeout: {args.get('timeout')!r}.", + label=flavor, + ) + + prices = await hf_jobs_price_catalog() + price = prices.get(flavor) + if price is None: + return CostEstimate( + estimated_cost_usd=None, + billable=True, + block_reason=f"No price is available for HF job hardware '{flavor}'.", + label=flavor, + ) + + return CostEstimate( + estimated_cost_usd=round(price * timeout_hours, 4), + billable=price > 0, + label=flavor, + ) + + +async def estimate_sandbox_cost(args: dict[str, Any], *, session: Any = None) -> CostEstimate: + if session is not None and getattr(session, "sandbox", None): + return CostEstimate(estimated_cost_usd=0.0, billable=False, label="existing") + + hardware = str(args.get("hardware") or "cpu-basic") + price = SPACE_PRICE_USD_PER_HOUR.get(hardware) + if price is None: + return CostEstimate( + estimated_cost_usd=None, + billable=True, + block_reason=f"No price is available for sandbox hardware '{hardware}'.", + label=hardware, + ) + + return CostEstimate( + estimated_cost_usd=round(price * DEFAULT_SANDBOX_RESERVATION_HOURS, 4), + billable=price > 0, + label=hardware, + ) + + +async def estimate_tool_cost( + tool_name: str, args: dict[str, Any], *, session: Any = None +) -> CostEstimate: + if tool_name == "sandbox_create": + return await estimate_sandbox_cost(args, session=session) + if tool_name == "hf_jobs": + return await estimate_hf_job_cost(args) + return CostEstimate(estimated_cost_usd=0.0, billable=False) diff --git a/agent/core/effort_probe.py b/agent/core/effort_probe.py index 2c0c79ea..b6ac91f6 100644 --- a/agent/core/effort_probe.py +++ b/agent/core/effort_probe.py @@ -22,7 +22,9 @@ import asyncio import logging +import time from dataclasses import dataclass +from typing import Any from litellm import acompletion @@ -139,6 +141,7 @@ async def probe_effort( model_name: str, preference: str | None, hf_token: str | None, + session: Any = None, ) -> ProbeOutcome: """Walk the cascade for ``preference`` on ``model_name``. @@ -147,6 +150,12 @@ async def probe_effort( transient errors (5xx, timeout) — persistent 4xx that aren't thinking/ effort related bubble as the original exception so callers can surface them (auth, model-not-found, quota, etc.). + + ``session`` is optional; when provided, each successful probe attempt + is recorded via ``telemetry.record_llm_call(kind="effort_probe")`` so + the cost shows up in the session's ``total_cost_usd``. Failed probes + (rejected by the provider) typically aren't billed, so we only record + on success. """ loop = asyncio.get_event_loop() start = loop.time() @@ -174,7 +183,8 @@ async def probe_effort( attempts += 1 try: - await asyncio.wait_for( + _t0 = time.monotonic() + response = await asyncio.wait_for( acompletion( messages=[{"role": "user", "content": "ping"}], max_tokens=_PROBE_MAX_TOKENS, @@ -183,6 +193,21 @@ async def probe_effort( ), timeout=_PROBE_TIMEOUT, ) + if session is not None: + # Best-effort telemetry — never let a logging blip propagate + # out of the probe and break model switching. + try: + from agent.core import telemetry + await telemetry.record_llm_call( + session, + model=model_name, + response=response, + latency_ms=int((time.monotonic() - _t0) * 1000), + finish_reason=response.choices[0].finish_reason if response.choices else None, + kind="effort_probe", + ) + except Exception as _telem_err: + logger.debug("effort_probe telemetry failed: %s", _telem_err) except Exception as e: last_error = e if _is_thinking_unsupported(e): diff --git a/agent/core/model_switcher.py b/agent/core/model_switcher.py index 63c0f40c..ea419db1 100644 --- a/agent/core/model_switcher.py +++ b/agent/core/model_switcher.py @@ -32,6 +32,7 @@ {"id": "MiniMaxAI/MiniMax-M2.7", "label": "MiniMax M2.7"}, {"id": "moonshotai/Kimi-K2.6", "label": "Kimi K2.6"}, {"id": "zai-org/GLM-5.1", "label": "GLM 5.1"}, + {"id": "deepseek-ai/DeepSeek-V4-Pro:deepinfra", "label": "DeepSeek V4 Pro"}, ] @@ -187,7 +188,7 @@ async def probe_and_switch_model( console.print(f"[dim]checking {model_id} (effort: {preference})...[/dim]") try: - outcome = await probe_effort(model_id, preference, hf_token) + outcome = await probe_effort(model_id, preference, hf_token, session=session) except ProbeInconclusive as e: _commit_switch(model_id, config, session, effective=None, cache=False) console.print( diff --git a/agent/core/session.py b/agent/core/session.py index c53294cd..370bb3a6 100644 --- a/agent/core/session.py +++ b/agent/core/session.py @@ -1,6 +1,7 @@ import asyncio import json import logging +import os import subprocess import sys import uuid @@ -88,10 +89,12 @@ def __init__( defer_turn_complete_notification: bool = False, session_id: str | None = None, user_id: str | None = None, + hf_username: str | None = None, persistence_store: Any | None = None, ): self.hf_token: Optional[str] = hf_token self.user_id: Optional[str] = user_id + self.hf_username: Optional[str] = hf_username self.persistence_store = persistence_store self.tool_router = tool_router self.stream = stream @@ -113,10 +116,17 @@ def __init__( self._cancelled = asyncio.Event() self.pending_approval: Optional[dict[str, Any]] = None self.sandbox = None + self.sandbox_hardware: Optional[str] = None + self.sandbox_preload_task: Optional[asyncio.Task] = None + self.sandbox_preload_error: Optional[str] = None + self.sandbox_preload_cancel_event: Any | None = None self._running_job_ids: set[str] = set() # HF job IDs currently executing self.notification_gateway = notification_gateway self.notification_destinations = list(notification_destinations or []) self.defer_turn_complete_notification = defer_turn_complete_notification + self.auto_approval_enabled: bool = False + self.auto_approval_cost_cap_usd: float | None = None + self.auto_approval_estimated_spend_usd: float = 0.0 # Session trajectory logging self.logged_events: list[dict] = [] @@ -310,6 +320,40 @@ def update_model(self, model_name: str) -> None: self.config.model_name = model_name self.context_manager.model_max_tokens = _get_max_tokens_safe(model_name) + def set_auto_approval_policy( + self, *, enabled: bool, cost_cap_usd: float | None + ) -> None: + self.auto_approval_enabled = bool(enabled) + self.auto_approval_cost_cap_usd = cost_cap_usd + + def add_auto_approval_estimated_spend(self, amount_usd: float | None) -> None: + if amount_usd is None or amount_usd <= 0: + return + self.auto_approval_estimated_spend_usd = round( + self.auto_approval_estimated_spend_usd + float(amount_usd), 4 + ) + + @property + def auto_approval_remaining_usd(self) -> float | None: + if self.auto_approval_cost_cap_usd is None: + return None + return round( + max( + 0.0, + self.auto_approval_cost_cap_usd + - self.auto_approval_estimated_spend_usd, + ), + 4, + ) + + def auto_approval_policy_summary(self) -> dict[str, Any]: + return { + "enabled": self.auto_approval_enabled, + "cost_cap_usd": self.auto_approval_cost_cap_usd, + "estimated_spend_usd": round(self.auto_approval_estimated_spend_usd, 4), + "remaining_usd": self.auto_approval_remaining_usd, + } + def effective_effort_for(self, model_name: str) -> str | None: """Resolve the effort level to actually send for ``model_name``. @@ -362,6 +406,7 @@ def get_trajectory(self) -> dict: return { "session_id": self.session_id, "user_id": self.user_id, + "hf_username": self.hf_username, "session_start_time": self.session_start_time, "session_end_time": datetime.now().isoformat(), "model_name": self.config.model_name, @@ -456,62 +501,174 @@ def update_local_save_status( logger.error(f"Failed to update local save status: {e}") return False - def save_and_upload_detached(self, repo_id: str) -> Optional[str]: - """ - Save session locally and spawn detached subprocess for upload (fire-and-forget) + def _personal_trace_repo_id(self) -> Optional[str]: + """Resolve the per-user trace repo id from config + HF username. - Args: - repo_id: HuggingFace dataset repo ID - - Returns: - Path to local save file + Returns ``None`` when sharing is disabled, the user is anonymous, + or the template is missing — caller skips the personal upload in + those cases. """ - # Save locally first (fast, synchronous) - local_path = self.save_trajectory_local(upload_status="pending") - if not local_path: + if not getattr(self.config, "share_traces", False): + return None + hf_user = self.hf_username or self.user_id + if not hf_user: + return None + template = getattr(self.config, "personal_trace_repo_template", None) + if not template: + return None + try: + return template.format(hf_user=hf_user) + except (KeyError, IndexError): + logger.debug("personal_trace_repo_template format failed: %r", template) return None - # Spawn detached subprocess for upload (fire-and-forget) + def _spawn_uploader( + self, + action: str, + target: str, + repo_id: str, + *, + format: str, + token_env: Optional[str], + private: bool, + token_value: Optional[str] = None, + ) -> None: + """Fire-and-forget spawn of ``session_uploader.py`` with the given args.""" try: uploader_script = Path(__file__).parent / "session_uploader.py" + cmd = [ + sys.executable, + str(uploader_script), + action, + target, + repo_id, + "--format", + format, + "--private", + "true" if private else "false", + ] + if token_env: + cmd.extend(["--token-env", token_env]) + + env = os.environ.copy() + if token_value: + env["_ML_INTERN_PERSONAL_TOKEN"] = token_value - # Use Popen with detached process subprocess.Popen( - [sys.executable, str(uploader_script), "upload", local_path, repo_id], + cmd, stdin=subprocess.DEVNULL, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, + env=env, start_new_session=True, # Detach from parent ) except Exception as e: logger.warning(f"Failed to spawn upload subprocess: {e}") + def save_and_upload_detached(self, repo_id: str) -> Optional[str]: + """ + Save session locally and spawn detached subprocess(es) for upload + (fire-and-forget). + + Always uploads to the shared org dataset (``repo_id``) in the + single-row format used by the KPI scheduler. When + ``config.share_traces`` is enabled and a username is known, also + uploads to the user's personal private dataset in Claude Code JSONL + format so the HF Agent Trace Viewer auto-renders it. + + Args: + repo_id: HuggingFace dataset repo ID for the org/KPI upload. + + Returns: + Path to local save file + """ + local_path = self.save_trajectory_local(upload_status="pending") + if not local_path: + return None + + self._spawn_uploader( + "upload", + local_path, + repo_id, + format="row", + token_env=None, # default org token chain + private=False, + ) + + personal_repo = self._personal_trace_repo_id() + if personal_repo: + # User's own HF_TOKEN write-scoped to their namespace. + self._spawn_uploader( + "upload", + local_path, + personal_repo, + format="claude_code", + token_env="HF_TOKEN", + token_value=self.hf_token, + private=True, + ) + return local_path @staticmethod def retry_failed_uploads_detached( - directory: str = "session_logs", repo_id: Optional[str] = None + directory: str = "session_logs", + repo_id: Optional[str] = None, + *, + personal_repo_id: Optional[str] = None, ) -> None: """ - Spawn detached subprocess to retry failed/pending uploads (fire-and-forget) + Spawn detached subprocess(es) to retry failed/pending uploads + (fire-and-forget). Args: directory: Directory containing session logs - repo_id: Target dataset repo ID + repo_id: Target dataset repo ID for the shared org/KPI upload. + personal_repo_id: Per-user dataset for Claude-Code-format + retries. ``None`` skips the personal retry pass. """ - if not repo_id: + if not repo_id and not personal_repo_id: return try: uploader_script = Path(__file__).parent / "session_uploader.py" - # Spawn detached subprocess for retry - subprocess.Popen( - [sys.executable, str(uploader_script), "retry", directory, repo_id], - stdin=subprocess.DEVNULL, - stdout=subprocess.DEVNULL, - stderr=subprocess.DEVNULL, - start_new_session=True, # Detach from parent - ) + if repo_id: + subprocess.Popen( + [ + sys.executable, + str(uploader_script), + "retry", + directory, + repo_id, + "--format", + "row", + ], + stdin=subprocess.DEVNULL, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + start_new_session=True, + ) + + if personal_repo_id: + subprocess.Popen( + [ + sys.executable, + str(uploader_script), + "retry", + directory, + personal_repo_id, + "--format", + "claude_code", + "--token-env", + "HF_TOKEN", + "--private", + "true", + ], + stdin=subprocess.DEVNULL, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + start_new_session=True, + ) except Exception as e: logger.warning(f"Failed to spawn retry subprocess: {e}") diff --git a/agent/core/session_persistence.py b/agent/core/session_persistence.py index a0ecd279..f2c2d367 100644 --- a/agent/core/session_persistence.py +++ b/agent/core/session_persistence.py @@ -176,6 +176,9 @@ async def upsert_session( pending_approval: list[dict[str, Any]] | None = None, claude_counted: bool = False, notification_destinations: list[str] | None = None, + auto_approval_enabled: bool = False, + auto_approval_cost_cap_usd: float | None = None, + auto_approval_estimated_spend_usd: float = 0.0, ) -> None: if not self._ready(): return @@ -204,6 +207,9 @@ async def upsert_session( "pending_approval": pending_approval or [], "claude_counted": claude_counted, "notification_destinations": notification_destinations or [], + "auto_approval_enabled": auto_approval_enabled, + "auto_approval_cost_cap_usd": auto_approval_cost_cap_usd, + "auto_approval_estimated_spend_usd": auto_approval_estimated_spend_usd, }, }, upsert=True, @@ -224,6 +230,9 @@ async def save_snapshot( claude_counted: bool = False, created_at: datetime | None = None, notification_destinations: list[str] | None = None, + auto_approval_enabled: bool = False, + auto_approval_cost_cap_usd: float | None = None, + auto_approval_estimated_spend_usd: float = 0.0, ) -> None: if not self._ready(): return @@ -241,6 +250,9 @@ async def save_snapshot( pending_approval=pending_approval, claude_counted=claude_counted, notification_destinations=notification_destinations, + auto_approval_enabled=auto_approval_enabled, + auto_approval_cost_cap_usd=auto_approval_cost_cap_usd, + auto_approval_estimated_spend_usd=auto_approval_estimated_spend_usd, ) ops: list[Any] = [] for idx, raw in enumerate(messages): diff --git a/agent/core/session_uploader.py b/agent/core/session_uploader.py index d18ec6b8..035a235d 100644 --- a/agent/core/session_uploader.py +++ b/agent/core/session_uploader.py @@ -3,39 +3,455 @@ Standalone script for uploading session trajectories to HuggingFace. This runs as a separate process to avoid blocking the main agent. Uses individual file uploads to avoid race conditions. + +Two formats are supported: + +* ``row`` — single-line JSONL row used by the existing org telemetry/KPI + pipeline (``smolagents/ml-intern-sessions``). Compatible with + ``backend/kpis_scheduler.py``. +* ``claude_code`` — one event per line in the Claude Code JSONL schema, + auto-detected by the HF Agent Trace Viewer + (https://huggingface.co/changelog/agent-trace-viewer). Used for the + per-user private dataset (default ``{hf_user}/ml-intern-sessions``). """ +import argparse +import hashlib import json import os import sys from datetime import datetime from pathlib import Path +from typing import Any from dotenv import load_dotenv load_dotenv() -# Token for session uploads. Fallback chain (least-privilege first) — matches -# backend/kpis_scheduler.py so one write-scoped token on the Space covers every -# telemetry dataset. Never hardcode tokens in source. -_SESSION_TOKEN = ( - os.environ.get("HF_SESSION_UPLOAD_TOKEN") - or os.environ.get("HF_TOKEN") - or os.environ.get("HF_ADMIN_TOKEN") - or "" +# Token resolution for the org KPI dataset. Fallback chain (least-privilege +# first) — matches backend/kpis_scheduler.py so one write-scoped token on the +# Space covers every telemetry dataset. Never hardcode tokens in source. +_ORG_TOKEN_FALLBACK_CHAIN = ( + "HF_SESSION_UPLOAD_TOKEN", + "HF_TOKEN", + "HF_ADMIN_TOKEN", ) +_PERSONAL_TOKEN_ENV = "_ML_INTERN_PERSONAL_TOKEN" + + +def _resolve_token(token_env: str | None) -> str: + """Resolve an HF token from env. ``token_env`` overrides the fallback chain.""" + if token_env == "HF_TOKEN": + try: + from agent.core.hf_tokens import resolve_hf_token + + return ( + resolve_hf_token( + os.environ.get(_PERSONAL_TOKEN_ENV), + os.environ.get("HF_TOKEN"), + ) + or "" + ) + except Exception: + token = os.environ.get(_PERSONAL_TOKEN_ENV) or os.environ.get("HF_TOKEN") + return token or "" + + if token_env: + return os.environ.get(token_env, "") or "" + for var in _ORG_TOKEN_FALLBACK_CHAIN: + val = os.environ.get(var) + if val: + return val + return "" + + +def _scrub(obj: Any) -> Any: + """Best-effort regex scrub for HF tokens / API keys before upload.""" + try: + from agent.core.redact import scrub # type: ignore + except Exception: + # Fallback for environments where the agent package isn't importable + # (shouldn't happen in our subprocess, but be defensive). + import importlib.util + + _spec = importlib.util.spec_from_file_location( + "_redact", + Path(__file__).parent / "redact.py", + ) + _mod = importlib.util.module_from_spec(_spec) + _spec.loader.exec_module(_mod) # type: ignore + scrub = _mod.scrub + return scrub(obj) + + +def _msg_uuid(session_id: str, role: str, idx: int) -> str: + """Deterministic UUID-shaped id for a Claude Code message. + + Uses sha1 of ``session_id::role::idx`` so re-uploads/heartbeats keep the + parent/child chain stable. Same convention as the example dataset + https://huggingface.co/datasets/clem/hf-coding-tools-traces. + """ + digest = hashlib.sha1(f"{session_id}::{role}::{idx}".encode("utf-8")).hexdigest() + # Format like a UUID for visual familiarity (32 hex chars w/ dashes). + return ( + f"{digest[0:8]}-{digest[8:12]}-{digest[12:16]}-" + f"{digest[16:20]}-{digest[20:32]}" + ) + + +def _content_to_text(content: Any) -> str: + """Best-effort flatten of a litellm/openai content field to plain text.""" + if content is None: + return "" + if isinstance(content, str): + return content + if isinstance(content, list): + parts: list[str] = [] + for block in content: + if isinstance(block, dict): + text = block.get("text") + if isinstance(text, str): + parts.append(text) + else: + # Unknown content block — keep round-trippable representation. + parts.append(json.dumps(block, default=str)) + else: + parts.append(str(block)) + return "\n".join(parts) + return str(content) + + +def _parse_tool_args(raw: Any) -> Any: + """Tool call arguments arrive as a JSON-encoded string from LLMs.""" + if isinstance(raw, dict): + return raw + if isinstance(raw, str): + try: + return json.loads(raw) + except (json.JSONDecodeError, TypeError): + return {"_raw": raw} + return raw + + +def to_claude_code_jsonl(trajectory: dict) -> list[dict]: + """Convert an internal trajectory dict to Claude Code JSONL events. + + Schema reference (per the HF Agent Trace Viewer auto-detector): + + {"type":"user","message":{"role":"user","content":"..."}, + "uuid":"...","parentUuid":null,"sessionId":"...","timestamp":"..."} + {"type":"assistant", + "message":{"role":"assistant","model":"...", + "content":[{"type":"text","text":"..."}, + {"type":"tool_use","id":"...","name":"...","input":{...}}]}, + "uuid":"...","parentUuid":"","sessionId":"...","timestamp":"..."} + {"type":"user","message":{"role":"user", + "content":[{"type":"tool_result", + "tool_use_id":"...","content":"..."}]}, + "uuid":"...","parentUuid":"","sessionId":"...","timestamp":"..."} + + System messages are skipped (they're not part of the viewer schema and + contain large prompts that pollute the trace viewer UI). + """ + session_id = trajectory["session_id"] + model_name = trajectory.get("model_name") or "" + fallback_timestamp = ( + trajectory.get("session_start_time") or datetime.now().isoformat() + ) + messages: list[dict] = trajectory.get("messages") or [] + + out: list[dict] = [] + parent_uuid: str | None = None + + for idx, msg in enumerate(messages): + if not isinstance(msg, dict): + continue + role = msg.get("role") + if role == "system": + continue + timestamp = msg.get("timestamp") or fallback_timestamp + + if role == "user": + content = _content_to_text(msg.get("content")) + event_uuid = _msg_uuid(session_id, "user", idx) + out.append( + { + "type": "user", + "message": {"role": "user", "content": content}, + "uuid": event_uuid, + "parentUuid": parent_uuid, + "sessionId": session_id, + "timestamp": timestamp, + } + ) + parent_uuid = event_uuid + + elif role == "assistant": + content_text = _content_to_text(msg.get("content")) + content_blocks: list[dict] = [] + if content_text: + content_blocks.append({"type": "text", "text": content_text}) + for tc in msg.get("tool_calls") or []: + if not isinstance(tc, dict): + continue + fn = tc.get("function") or {} + content_blocks.append( + { + "type": "tool_use", + "id": tc.get("id") or "", + "name": fn.get("name") or "", + "input": _parse_tool_args(fn.get("arguments")), + } + ) + if not content_blocks: + # Edge case: empty assistant turn (shouldn't normally happen, + # but skip rather than emit an empty content array which + # confuses the viewer). + continue + event_uuid = _msg_uuid(session_id, "assistant", idx) + out.append( + { + "type": "assistant", + "message": { + "role": "assistant", + "model": model_name, + "content": content_blocks, + }, + "uuid": event_uuid, + "parentUuid": parent_uuid, + "sessionId": session_id, + "timestamp": timestamp, + } + ) + parent_uuid = event_uuid + + elif role == "tool": + tool_call_id = msg.get("tool_call_id") or "" + content_text = _content_to_text(msg.get("content")) + event_uuid = _msg_uuid(session_id, "tool", idx) + out.append( + { + "type": "user", + "message": { + "role": "user", + "content": [ + { + "type": "tool_result", + "tool_use_id": tool_call_id, + "content": content_text, + } + ], + }, + "uuid": event_uuid, + "parentUuid": parent_uuid, + "sessionId": session_id, + "timestamp": timestamp, + } + ) + parent_uuid = event_uuid + + return out + + +def _scrub_session_for_upload(data: dict) -> dict: + """Best-effort scrub of transcript fields before any upload temp file.""" + scrubbed = dict(data) + scrubbed["messages"] = _scrub(data.get("messages") or []) + scrubbed["events"] = _scrub(data.get("events") or []) + scrubbed["tools"] = _scrub(data.get("tools") or []) + return scrubbed + + +def _write_row_payload(data: dict, tmp_path: str) -> None: + """Single-row JSONL (existing format) — used by KPI scheduler.""" + scrubbed = _scrub_session_for_upload(data) + session_row = { + "session_id": data["session_id"], + "user_id": data.get("user_id"), + "session_start_time": data["session_start_time"], + "session_end_time": data["session_end_time"], + "model_name": data["model_name"], + "total_cost_usd": data.get("total_cost_usd"), + "messages": json.dumps(scrubbed["messages"]), + "events": json.dumps(scrubbed["events"]), + "tools": json.dumps(scrubbed["tools"]), + } + + with open(tmp_path, "w") as tmp: + json.dump(session_row, tmp) + + +def _write_claude_code_payload(data: dict, tmp_path: str) -> None: + """Multi-line JSONL in Claude Code schema for the HF trace viewer.""" + # Scrub before conversion so secrets never reach the upload temp file. + scrubbed = _scrub_session_for_upload(data) + events = to_claude_code_jsonl(scrubbed) + with open(tmp_path, "w") as tmp: + for event in events: + tmp.write(json.dumps(event)) + tmp.write("\n") + + +def _status_field(format: str) -> str: + """Per-format upload status field on the local trajectory file.""" + return "personal_upload_status" if format == "claude_code" else "upload_status" + + +def _url_field(format: str) -> str: + return "personal_upload_url" if format == "claude_code" else "upload_url" + + +def _read_session_file(session_file: str) -> dict: + """Read a local session file while respecting uploader file locks.""" + import fcntl + + with open(session_file, "r") as f: + fcntl.flock(f, fcntl.LOCK_SH) + try: + return json.load(f) + finally: + fcntl.flock(f, fcntl.LOCK_UN) + + +def _update_upload_status( + session_file: str, + status_key: str, + url_key: str, + status: str, + dataset_url: str | None = None, +) -> None: + """Atomically update only this uploader's status fields. + + The org and personal uploaders run as separate processes against the same + local session JSON file. Re-read under an exclusive lock so one uploader + cannot clobber fields written by the other. + """ + import fcntl + + with open(session_file, "r+") as f: + fcntl.flock(f, fcntl.LOCK_EX) + try: + data = json.load(f) + data[status_key] = status + if dataset_url is not None: + data[url_key] = dataset_url + data["last_save_time"] = datetime.now().isoformat() + f.seek(0) + json.dump(data, f, indent=2) + f.truncate() + f.flush() + os.fsync(f.fileno()) + finally: + fcntl.flock(f, fcntl.LOCK_UN) + + +def dataset_card_readme(repo_id: str) -> str: + """Dataset card for personal ML Intern session trace repos.""" + return f"""--- +pretty_name: "ML Intern Session Traces" +language: +- en +license: other +task_categories: +- text-generation +tags: +- agent-traces +- coding-agent +- ml-intern +- session-traces +- claude-code +- hf-agent-trace-viewer +configs: +- config_name: default + data_files: + - split: train + path: "sessions/**/*.jsonl" +--- + +# ML Intern session traces + +This dataset contains ML Intern coding agent session traces uploaded from local +ML Intern runs. The traces are stored as JSON Lines files under `sessions/`, +with one file per session. + +## Links + +- ML Intern demo: https://smolagents-ml-intern.hf.space +- ML Intern CLI: https://github.com/huggingface/ml-intern + +## Data description + +Each `*.jsonl` file contains a single ML Intern session converted to a +Claude-Code-style event stream for the Hugging Face Agent Trace Viewer. Entries +can include user messages, assistant messages, tool calls, tool results, model +metadata, and timestamps. + +Session files are written to paths of the form: + +```text +sessions/YYYY-MM-DD/.jsonl +``` + +## Redaction and review + +**WARNING: no comprehensive redaction or human review has been performed for this dataset.** + +ML Intern applies automated best-effort scrubbing for common secret patterns +such as Hugging Face, Anthropic, OpenAI, GitHub, and AWS tokens before upload. +This is not a privacy guarantee. + +These traces may contain sensitive information, including prompts, code, +terminal output, file paths, repository names, private task context, tool +outputs, or other data from the local development environment. Treat every +session as potentially sensitive. + +Do not make this dataset public unless you have manually inspected the uploaded +sessions and are comfortable sharing their full contents. + +## Limitations + +Coding agent transcripts can include private or off-topic content, failed +experiments, credentials accidentally pasted by a user, and outputs copied from +local files or services. Use with appropriate caution, especially before +changing repository visibility. +""" + + +def _upload_dataset_card(api: Any, repo_id: str, token: str, format: str) -> None: + """Create/update a README for personal trace datasets.""" + if format != "claude_code": + return + + api.upload_file( + path_or_fileobj=dataset_card_readme(repo_id).encode("utf-8"), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset", + token=token, + commit_message="Update dataset card", + ) def upload_session_as_file( - session_file: str, repo_id: str, max_retries: int = 3 + session_file: str, + repo_id: str, + max_retries: int = 3, + format: str = "row", + token_env: str | None = None, + private: bool = False, ) -> bool: - """ - Upload a single session as an individual JSONL file (no race conditions) + """Upload a single session as an individual JSONL file (no race conditions). Args: session_file: Path to local session JSON file repo_id: HuggingFace dataset repo ID max_retries: Number of retry attempts + format: ``row`` (default, KPI-compatible) or ``claude_code`` (HF + Agent Trace Viewer compatible). + token_env: Name of the env var holding the HF token. ``None`` falls + back to the org-token chain (``HF_SESSION_UPLOAD_TOKEN`` → + ``HF_TOKEN`` → ``HF_ADMIN_TOKEN``). + private: When creating the repo for the first time, mark it private. Returns: True if successful, False otherwise @@ -46,96 +462,60 @@ def upload_session_as_file( print("Error: huggingface_hub library not available", file=sys.stderr) return False + status_key = _status_field(format) + url_key = _url_field(format) + try: - # Load session data - with open(session_file, "r") as f: - data = json.load(f) + data = _read_session_file(session_file) - # Check if already uploaded - upload_status = data.get("upload_status") - if upload_status == "success": + # Skip if already uploaded for this format. + if data.get(status_key) == "success": return True - # Use dedicated session upload token (write-only access to session dataset) - hf_token = _SESSION_TOKEN + hf_token = _resolve_token(token_env) if not hf_token: - # Update status to failed - data["upload_status"] = "failed" - with open(session_file, "w") as f: - json.dump(data, f, indent=2) + _update_upload_status(session_file, status_key, url_key, "failed") return False - # Scrub secrets (HF tokens, API keys, etc.) from messages + events - # before they leave the local disk. Best-effort regex-based redaction — - # see agent/core/redact.py for the patterns covered. - try: - from agent.core.redact import scrub # type: ignore - except Exception: - # Fallback for environments where the agent package isn't importable - # (shouldn't happen in our subprocess, but be defensive). - import importlib.util - _spec = importlib.util.spec_from_file_location( - "_redact", - Path(__file__).parent / "redact.py", - ) - _mod = importlib.util.module_from_spec(_spec) - _spec.loader.exec_module(_mod) # type: ignore - scrub = _mod.scrub - scrubbed_messages = scrub(data["messages"]) - scrubbed_events = scrub(data["events"]) - scrubbed_tools = scrub(data.get("tools") or []) - - # Prepare JSONL content (single line) - # Store messages/events/tools as JSON strings to avoid schema conflicts - # across sessions with different tool rosters. - session_row = { - "session_id": data["session_id"], - "user_id": data.get("user_id"), - "session_start_time": data["session_start_time"], - "session_end_time": data["session_end_time"], - "model_name": data["model_name"], - "total_cost_usd": data.get("total_cost_usd"), - "messages": json.dumps(scrubbed_messages), - "events": json.dumps(scrubbed_events), - "tools": json.dumps(scrubbed_tools), - } - - # Create temporary JSONL file + # Build temp upload payload in the requested format. import tempfile with tempfile.NamedTemporaryFile( mode="w", suffix=".jsonl", delete=False ) as tmp: - json.dump(session_row, tmp) # Single line JSON tmp_path = tmp.name try: - # Generate unique path in repo: sessions/YYYY-MM-DD/session_id.jsonl + if format == "claude_code": + _write_claude_code_payload(data, tmp_path) + else: + _write_row_payload(data, tmp_path) + session_id = data["session_id"] date_str = datetime.fromisoformat(data["session_start_time"]).strftime( "%Y-%m-%d" ) repo_path = f"sessions/{date_str}/{session_id}.jsonl" - # Upload with retries api = HfApi() for attempt in range(max_retries): try: - # Try to create repo if it doesn't exist (idempotent) + # Idempotent create — visibility is set on first creation + # only. Existing repos keep whatever the user picked via + # /share-traces. try: api.create_repo( repo_id=repo_id, repo_type="dataset", - private=False, + private=private, token=hf_token, - exist_ok=True, # Don't fail if already exists + exist_ok=True, ) - except Exception: - # Repo might already exist, continue pass - # Upload the session file + _upload_dataset_card(api, repo_id, hf_token, format) + api.upload_file( path_or_fileobj=tmp_path, path_in_repo=repo_path, @@ -145,12 +525,13 @@ def upload_session_as_file( commit_message=f"Add session {session_id}", ) - # Update local status to success - data["upload_status"] = "success" - data["upload_url"] = f"https://huggingface.co/datasets/{repo_id}" - with open(session_file, "w") as f: - json.dump(data, f, indent=2) - + _update_upload_status( + session_file, + status_key, + url_key, + "success", + f"https://huggingface.co/datasets/{repo_id}", + ) return True except Exception: @@ -160,14 +541,12 @@ def upload_session_as_file( wait_time = 2**attempt time.sleep(wait_time) else: - # Final attempt failed - data["upload_status"] = "failed" - with open(session_file, "w") as f: - json.dump(data, f, indent=2) + _update_upload_status( + session_file, status_key, url_key, "failed" + ) return False finally: - # Clean up temp file try: os.unlink(tmp_path) except Exception: @@ -178,56 +557,102 @@ def upload_session_as_file( return False -def retry_failed_uploads(directory: str, repo_id: str): - """Retry all failed/pending uploads in a directory""" +def retry_failed_uploads( + directory: str, + repo_id: str, + format: str = "row", + token_env: str | None = None, + private: bool = False, +): + """Retry all failed/pending uploads in a directory for the given format.""" log_dir = Path(directory) if not log_dir.exists(): return + status_key = _status_field(format) session_files = list(log_dir.glob("session_*.json")) for filepath in session_files: try: - with open(filepath, "r") as f: - data = json.load(f) - - upload_status = data.get("upload_status", "unknown") - - # Only retry pending or failed uploads - if upload_status in ["pending", "failed"]: - upload_session_as_file(str(filepath), repo_id) + data = _read_session_file(str(filepath)) + + # Only retry pending or failed uploads. Files predating this + # field don't have it; treat unknown as "not yet attempted" for + # the row format (legacy behavior) and "skip" for claude_code + # so we don't suddenly re-upload pre-existing sessions to a + # newly-introduced personal repo. + status = data.get(status_key, "unknown") + if format == "claude_code" and status_key not in data: + continue + + if status in ("pending", "failed", "unknown"): + upload_session_as_file( + str(filepath), + repo_id, + format=format, + token_env=token_env, + private=private, + ) except Exception: pass +def _str2bool(v: str) -> bool: + return str(v).strip().lower() in {"1", "true", "yes", "on"} + + if __name__ == "__main__": - if len(sys.argv) < 3: - print("Usage: session_uploader.py ") - sys.exit(1) - - command = sys.argv[1] - - if command == "upload": - # python session_uploader.py upload - if len(sys.argv) < 4: - print("Usage: session_uploader.py upload ") - sys.exit(1) - session_file = sys.argv[2] - repo_id = sys.argv[3] - success = upload_session_as_file(session_file, repo_id) - sys.exit(0 if success else 1) - - elif command == "retry": - # python session_uploader.py retry - if len(sys.argv) < 4: - print("Usage: session_uploader.py retry ") - sys.exit(1) - directory = sys.argv[2] - repo_id = sys.argv[3] - retry_failed_uploads(directory, repo_id) + parser = argparse.ArgumentParser(prog="session_uploader.py") + sub = parser.add_subparsers(dest="command", required=True) + + p_upload = sub.add_parser("upload") + p_upload.add_argument("session_file") + p_upload.add_argument("repo_id") + p_upload.add_argument( + "--format", + choices=["row", "claude_code"], + default="row", + ) + p_upload.add_argument( + "--token-env", + default=None, + help="Env var name holding the HF token (default: org fallback chain).", + ) + p_upload.add_argument("--private", default="false") + + p_retry = sub.add_parser("retry") + p_retry.add_argument("directory") + p_retry.add_argument("repo_id") + p_retry.add_argument( + "--format", + choices=["row", "claude_code"], + default="row", + ) + p_retry.add_argument("--token-env", default=None) + p_retry.add_argument("--private", default="false") + + args = parser.parse_args() + + if args.command == "upload": + ok = upload_session_as_file( + args.session_file, + args.repo_id, + format=args.format, + token_env=args.token_env, + private=_str2bool(args.private), + ) + sys.exit(0 if ok else 1) + + if args.command == "retry": + retry_failed_uploads( + args.directory, + args.repo_id, + format=args.format, + token_env=args.token_env, + private=_str2bool(args.private), + ) sys.exit(0) - else: - print(f"Unknown command: {command}") - sys.exit(1) + parser.print_help() + sys.exit(1) diff --git a/agent/core/telemetry.py b/agent/core/telemetry.py index 685be8c9..6de45a96 100644 --- a/agent/core/telemetry.py +++ b/agent/core/telemetry.py @@ -78,9 +78,29 @@ async def record_llm_call( response: Any = None, latency_ms: int, finish_reason: str | None, + kind: str = "main", ) -> dict: """Emit an ``llm_call`` event and return the extracted usage dict so - callers can stash it on their result object if they want.""" + callers can stash it on their result object if they want. + + ``kind`` tags the call site so downstream analytics can break spend + down by category. Values currently emitted by the codebase: + + * ``main`` — agent loop turn (user-facing reply or tool follow-up) + * ``research`` — research sub-agent inner loop (3 call sites) + * ``compaction`` — context-window summary on overflow + * ``effort_probe``— effort cascade walk on rejection / model switch + * ``restore`` — session re-seed summary after a Space restart + + Pre-2026-04-29 only ``main`` calls were instrumented; observed gap on + Cost Explorer was ~67%, with the other 5 call sites accounting for + the rest. Tagging lets us split the dataset's ``total_cost_usd`` by + category and validate against AWS billing. + + The ``/title`` (HF Router, not Bedrock) and ``/health/llm`` (diagnostic + endpoint, no session context) call sites are intentionally not + instrumented — together they're <1% of spend. + """ usage = extract_usage(response) if response is not None else {} cost_usd = 0.0 if response is not None: @@ -98,6 +118,7 @@ async def record_llm_call( "latency_ms": latency_ms, "finish_reason": finish_reason, "cost_usd": cost_usd, + "kind": kind, **usage, }, )) diff --git a/agent/main.py b/agent/main.py index f500cc5f..606aaf8e 100644 --- a/agent/main.py +++ b/agent/main.py @@ -21,6 +21,7 @@ from prompt_toolkit import PromptSession from agent.config import load_config +from agent.core.approval_policy import is_scheduled_operation from agent.core.agent_loop import submission_loop from agent.core import model_switcher from agent.core.hf_tokens import resolve_hf_token @@ -55,6 +56,20 @@ CLI_CONFIG_PATH = Path(__file__).parent.parent / "configs" / "cli_agent_config.json" +def _is_scheduled_hf_job_tool(tool_info: dict[str, Any]) -> bool: + if tool_info.get("tool") != "hf_jobs": + return False + arguments = tool_info.get("arguments") or {} + if isinstance(arguments, str): + try: + arguments = json.loads(arguments) + except json.JSONDecodeError: + return False + if not isinstance(arguments, dict): + return False + return is_scheduled_operation(arguments.get("operation")) + + def _configure_runtime_logging() -> None: """Keep third-party warning spam from punching through the interactive UI.""" import logging @@ -375,8 +390,11 @@ def _cancel_event(): tools_data = event.data.get("tools", []) if event.data else [] count = event.data.get("count", 0) if event.data else 0 - # If yolo mode is active, auto-approve everything - if config and config.yolo_mode: + # If yolo mode is active, auto-approve everything except + # scheduled HF jobs, whose recurring cost stays manual. + if config and config.yolo_mode and not any( + _is_scheduled_hf_job_tool(t) for t in tools_data + ): approvals = [ { "tool_call_id": t.get("tool_call_id", ""), @@ -807,10 +825,120 @@ async def _handle_slash_command( print(f"Context items: {len(session.context_manager.items)}") return None + if command == "/share-traces": + session = session_holder[0] if session_holder else None + await _handle_share_traces_command(arg, config, session) + return None + print(f"Unknown command: {command}. Type /help for available commands.") return None +async def _handle_share_traces_command(arg: str, config, session) -> None: + """Show or flip visibility of the user's personal trace dataset. + + Uses the user's own HF_TOKEN (write-scoped to their namespace). Only + operates on the personal trace repo configured via + ``personal_trace_repo_template`` — never touches the shared org dataset. + """ + from huggingface_hub import HfApi + from huggingface_hub.utils import HfHubHTTPError + + console = get_console() + if session is None: + console.print("[bold red]No active session.[/bold red]") + return + + repo_id = session._personal_trace_repo_id() if session is not None else None + if not repo_id: + if not getattr(config, "share_traces", False): + console.print( + "[yellow]share_traces is disabled in config. " + "Set it to true to publish per-session traces to your HF dataset." + "[/yellow]" + ) + return + if not session.user_id: + console.print( + "[yellow]No HF username resolved \u2014 cannot pick a personal " + "trace repo. Set HF_TOKEN to a token tied to your account.[/yellow]" + ) + return + console.print( + "[yellow]personal_trace_repo_template is unset \u2014 nothing to do.[/yellow]" + ) + return + + token = session.hf_token or resolve_hf_token() + if not token: + console.print( + "[bold red]No HF_TOKEN available.[/bold red] Cannot read or change " + "dataset visibility." + ) + return + + api = HfApi(token=token) + url = f"https://huggingface.co/datasets/{repo_id}" + target = arg.strip().lower() + + if not target: + try: + info = await asyncio.to_thread( + api.repo_info, repo_id=repo_id, repo_type="dataset" + ) + visibility = "private" if getattr(info, "private", False) else "public" + console.print(f"[bold]Trace dataset:[/bold] {url}") + console.print(f"[bold]Visibility:[/bold] {visibility}") + console.print( + "[dim]Use '/share-traces public' to publish, " + "'/share-traces private' to lock it back down.[/dim]" + ) + except HfHubHTTPError as e: + if getattr(e.response, "status_code", None) == 404: + console.print( + f"[dim]Dataset {repo_id} doesn't exist yet \u2014 it'll be " + "created (private) on the next session save.[/dim]" + ) + else: + console.print(f"[bold red]Hub error:[/bold red] {e}") + except Exception as e: + console.print(f"[bold red]Could not fetch dataset info:[/bold red] {e}") + return + + if target not in {"public", "private"}: + console.print( + f"[bold red]Unknown argument:[/bold red] {target}. " + "Expected 'public' or 'private'." + ) + return + + private = target == "private" + try: + # Idempotent — create if missing so first-flip works even before any + # session has been saved yet. + await asyncio.to_thread( + api.create_repo, + repo_id=repo_id, + repo_type="dataset", + private=private, + token=token, + exist_ok=True, + ) + await asyncio.to_thread( + api.update_repo_settings, + repo_id=repo_id, + repo_type="dataset", + private=private, + token=token, + ) + except Exception as e: + console.print(f"[bold red]Failed to update visibility:[/bold red] {e}") + return + + label = "PUBLIC" if not private else "private" + console.print(f"[green]Dataset is now {label}.[/green] {url}") + + async def main(model: str | None = None): """Interactive chat with the agent""" @@ -1183,14 +1311,18 @@ async def headless_main( else: print_tool_log(tool, log) elif event.event_type == "approval_required": - # Auto-approve everything in headless mode (safety net if yolo_mode - # didn't prevent the approval event for some reason) + # Auto-approve in headless mode, except scheduled HF jobs. Those + # are rejected because their recurring cost needs manual approval. tools_data = event.data.get("tools", []) if event.data else [] approvals = [ { "tool_call_id": t.get("tool_call_id", ""), - "approved": True, - "feedback": None, + "approved": not _is_scheduled_hf_job_tool(t), + "feedback": ( + "Scheduled HF jobs require manual approval." + if _is_scheduled_hf_job_tool(t) + else None + ), } for t in tools_data ] diff --git a/agent/optimization/__init__.py b/agent/optimization/__init__.py new file mode 100644 index 00000000..e2361b73 --- /dev/null +++ b/agent/optimization/__init__.py @@ -0,0 +1,12 @@ +"""cosmos_lab — agentic ML lifecycle platform built atop ml-intern. + +Owned package per WORKFLOW.md. Never import-shadows or modifies upstream +agent/* modules; only extends them via subclassing or composition. +""" + +from agent.optimization.config_ext import ( + OptimizationConfig, + load_optimization_config, +) + +__all__ = ["OptimizationConfig", "load_optimization_config"] diff --git a/agent/optimization/config_ext.py b/agent/optimization/config_ext.py new file mode 100644 index 00000000..258fd0b3 --- /dev/null +++ b/agent/optimization/config_ext.py @@ -0,0 +1,54 @@ +"""OptimizationConfig — extends upstream Config with cosmos-lab fields. + +Subclassing Config (not modifying it) preserves the zero-diff invariant and +lets upstream evolve config defaults independently. + +Use ``load_optimization_config()`` instead of upstream ``load_config()`` — +the upstream loader returns a ``Config``, which silently drops the owned +fields below. +""" + +from pathlib import Path + +from agent.config import ( + Config, + _load_json_config, + apply_slack_user_defaults, + substitute_env_vars, +) +from dotenv import load_dotenv + +_PROJECT_ROOT = Path(__file__).resolve().parent.parent.parent + + +class OptimizationConfig(Config): + """cosmos-lab configuration. Inherits all upstream Config fields.""" + + optimization_target: str | None = None + target_hardware: str | None = None + quality_budget: float = 0.98 + optimization_loop_enabled: bool = True + + audit_log_path: str = "~/.cosmos_lab/audit.jsonl" + trajectory_db_path: str = "~/.cosmos_lab/trajectories.duckdb" + + +def load_optimization_config( + config_path: str = "configs/optimization_agent_config.json", + include_user_defaults: bool = False, +) -> OptimizationConfig: + """Mirror of ``agent.config.load_config`` that validates as OptimizationConfig. + + Upstream's ``load_config`` returns a base ``Config``, which Pydantic builds + by silently dropping unknown keys — owned fields like ``quality_budget`` + and ``trajectory_db_path`` would be lost. This loader applies the same + .env + env-var substitution + Slack-defaults pipeline but validates the + result against the subclass. + """ + load_dotenv(_PROJECT_ROOT / ".env") + load_dotenv(override=False) + + raw = _load_json_config(Path(config_path)) + if include_user_defaults: + raw = apply_slack_user_defaults(raw) + return OptimizationConfig.model_validate(substitute_env_vars(raw)) diff --git a/agent/optimization/identity/__init__.py b/agent/optimization/identity/__init__.py new file mode 100644 index 00000000..f537dfe7 --- /dev/null +++ b/agent/optimization/identity/__init__.py @@ -0,0 +1,16 @@ +"""Identity, audit, and capability-scoped tool routing for cosmos-lab agents. + +Phase 0 deliverable. Standalone in Phase 0 — wired into agent_loop via +TracedSession in Phase 1+. +""" + +from agent.optimization.identity.audit import AuditLog +from agent.optimization.identity.identity import AgentIdentity, CapabilityDenied +from agent.optimization.identity.router import CapabilityScopedRouter + +__all__ = [ + "AgentIdentity", + "AuditLog", + "CapabilityDenied", + "CapabilityScopedRouter", +] diff --git a/agent/optimization/identity/audit.py b/agent/optimization/identity/audit.py new file mode 100644 index 00000000..211bc76a --- /dev/null +++ b/agent/optimization/identity/audit.py @@ -0,0 +1,51 @@ +"""AuditLog — append-only JSONL audit trail for tool invocations. + +JSONL chosen over a DB so: +- audit is human-readable and `tail -f`-able +- no schema migrations needed in Phase 0 +- easy to ship to S3 / Phoenix / OpenTelemetry collector later + +Thread-safe via a process-local lock. For multi-process writers, prefer one +audit file per process and aggregate offline. +""" + +from __future__ import annotations + +import hashlib +import json +import threading +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + + +def iso_now() -> str: + return datetime.now(timezone.utc).isoformat() + + +def hash_args(arguments: dict[str, Any]) -> str: + """Canonical sha256 over arguments — matches doom_loop.py's normalization.""" + canonical = json.dumps(arguments or {}, sort_keys=True, separators=(",", ":")) + return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16] + + +class AuditLog: + def __init__(self, path: str | Path) -> None: + self.path = Path(path).expanduser() + self.path.parent.mkdir(parents=True, exist_ok=True) + self._lock = threading.Lock() + + def record(self, event: dict[str, Any]) -> None: + line = json.dumps(event, sort_keys=True, separators=(",", ":"), default=str) + with self._lock, self.path.open("a", encoding="utf-8") as f: + f.write(line + "\n") + + def read_all(self) -> list[dict[str, Any]]: + if not self.path.exists(): + return [] + out: list[dict[str, Any]] = [] + for line in self.path.read_text(encoding="utf-8").splitlines(): + line = line.strip() + if line: + out.append(json.loads(line)) + return out diff --git a/agent/optimization/identity/identity.py b/agent/optimization/identity/identity.py new file mode 100644 index 00000000..465dc58c --- /dev/null +++ b/agent/optimization/identity/identity.py @@ -0,0 +1,51 @@ +"""AgentIdentity — frozen dataclass identifying an agent and its capabilities. + +Capability semantics (Phase 0): exact tool-name match plus a single wildcard +``"*"`` that grants all tools. No glob/prefix matching — keeps the allow check +trivially auditable. Richer policies can layer on top in later phases. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Optional + +WILDCARD = "*" + + +class CapabilityDenied(PermissionError): + """Raised when an agent attempts to call a tool outside its capability set.""" + + +@dataclass(frozen=True) +class AgentIdentity: + agent_id: str + display_name: str + capabilities: frozenset[str] = field(default_factory=frozenset) + parent_id: Optional[str] = None + + def can_call(self, tool_name: str) -> bool: + return WILDCARD in self.capabilities or tool_name in self.capabilities + + @classmethod + def root(cls, agent_id: str = "root", display_name: str = "Root") -> "AgentIdentity": + return cls( + agent_id=agent_id, + display_name=display_name, + capabilities=frozenset({WILDCARD}), + ) + + @classmethod + def scoped( + cls, + agent_id: str, + display_name: str, + capabilities: list[str] | set[str] | frozenset[str], + parent_id: Optional[str] = None, + ) -> "AgentIdentity": + return cls( + agent_id=agent_id, + display_name=display_name, + capabilities=frozenset(capabilities), + parent_id=parent_id, + ) diff --git a/agent/optimization/identity/router.py b/agent/optimization/identity/router.py new file mode 100644 index 00000000..35397355 --- /dev/null +++ b/agent/optimization/identity/router.py @@ -0,0 +1,115 @@ +"""CapabilityScopedRouter — wraps an upstream ToolRouter to enforce per-agent +capability allowlists and audit every tool invocation. + +Composition (not subclass-init) chosen because ToolRouter.__init__ instantiates +the full builtin tool stack (sandbox tools require HF auth), which makes pure +unit tests hard. Wrapping any object that exposes +``call_tool(name, args, session, tool_call_id) -> (str, bool)`` and +``get_tool_specs_for_llm() -> list[dict]`` keeps tests fast and decoupled. +""" + +from __future__ import annotations + +import logging +import time +from typing import Any, Protocol + +from agent.optimization.identity.audit import AuditLog, hash_args, iso_now +from agent.optimization.identity.identity import AgentIdentity, CapabilityDenied + +logger = logging.getLogger(__name__) + + +class _RouterLike(Protocol): + async def call_tool( + self, + tool_name: str, + arguments: dict[str, Any], + session: Any = None, + tool_call_id: str | None = None, + ) -> tuple[str, bool]: ... + + def get_tool_specs_for_llm(self) -> list[dict[str, Any]]: ... + + +class CapabilityScopedRouter: + def __init__( + self, + base_router: _RouterLike, + identity: AgentIdentity, + audit_log: AuditLog, + ) -> None: + self._base = base_router + self.identity = identity + self.audit_log = audit_log + + def get_tool_specs_for_llm(self) -> list[dict[str, Any]]: + """Filter tool specs to only those this identity can invoke.""" + all_specs = self._base.get_tool_specs_for_llm() + return [ + spec + for spec in all_specs + if self.identity.can_call(spec["function"]["name"]) + ] + + async def call_tool( + self, + tool_name: str, + arguments: dict[str, Any], + session: Any = None, + tool_call_id: str | None = None, + ) -> tuple[str, bool]: + started = time.monotonic() + args_h = hash_args(arguments or {}) + base_event = { + "ts": iso_now(), + "agent_id": self.identity.agent_id, + "tool": tool_name, + "args_hash": args_h, + "tool_call_id": tool_call_id, + } + + if not self.identity.can_call(tool_name): + self.audit_log.record( + {**base_event, "phase": "denied", "allowed": False, "reason": "capability_not_granted"} + ) + logger.warning( + "CapabilityDenied: agent=%s tool=%s", self.identity.agent_id, tool_name + ) + raise CapabilityDenied( + f"agent '{self.identity.agent_id}' is not granted capability '{tool_name}'" + ) + + self.audit_log.record({**base_event, "phase": "before", "allowed": True}) + + try: + result, success = await self._base.call_tool( + tool_name, arguments, session=session, tool_call_id=tool_call_id + ) + except Exception as exc: + duration_ms = int((time.monotonic() - started) * 1000) + self.audit_log.record( + { + **base_event, + "phase": "exception", + "duration_ms": duration_ms, + "exception_type": type(exc).__name__, + "exception_msg": str(exc)[:500], + } + ) + raise + + duration_ms = int((time.monotonic() - started) * 1000) + result_summary = ( + result[:200] if isinstance(result, str) else f"<{type(result).__name__}>" + ) + self.audit_log.record( + { + **base_event, + "phase": "after", + "duration_ms": duration_ms, + "success": success, + "result_summary": result_summary, + } + ) + return result, success diff --git a/agent/prompts/system_prompt_v3.yaml b/agent/prompts/system_prompt_v3.yaml index cb63c901..4543048f 100644 --- a/agent/prompts/system_prompt_v3.yaml +++ b/agent/prompts/system_prompt_v3.yaml @@ -42,7 +42,7 @@ system_prompt: | SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do. - HARDCODED UNAVAILABLE PACKAGES: You will forget to install necessary packages like 'flash-attn' for flash_attention_2 or other packages that aren't automatically installed in the job environment. Fix: install necessary packages before running the job. + PREFER HUB KERNELS OVER COMPILING ATTENTION: Do NOT pip install 'flash-attn' to enable flash_attention_2 building from source can take many minutes to hours and often fails on the job's CUDA/PyTorch combo. Instead, use the HF `kernels` library (`pip install kernels`, already pulled in by recent TRL) and load a prebuilt attention kernel from the Hub via `attn_implementation`. Examples: `AutoModelForCausalLM.from_pretrained(..., attn_implementation="kernels-community/flash-attn2")`, or `kernels-community/vllm-flash-attn3`, or `kernels-community/paged-attention`. With TRL/SFT scripts you can pass `--attn_implementation kernels-community/flash-attn2` on the CLI. Search additional kernels at https://huggingface.co/models?other=kernel. Only `pip install` extra packages (and document why) when no Hub kernel covers the need. SCOPE-CHANGING FIXES: Avoid at all costs! When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for and/or change the training task itself — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request and are grounded in research and examples. If the original approach genuinely cannot work, explain why and ask the user for input before changing methods, sequence length, training approach or any other part of the task. @@ -122,8 +122,10 @@ system_prompt: | # Sandbox-first development - For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs: - sandbox_create → install deps → write script → test with small run → fix errors → launch via hf_jobs at scale + A private cpu-basic sandbox is already available for normal code execution in each session. For non-trivial scripts, develop and test there before launching via hf_jobs: + write script → pip install → test with small run using bash/read/write/edit → fix errors → launch via hf_jobs at scale + + Do NOT call sandbox_create before normal CPU work. Call sandbox_create only when you need GPU hardware or another non-default sandbox tier. Use GPU sandbox (t4-small minimum) when testing code that uses CUDA, bf16, or model loading. CPU sandboxes cannot test GPU code paths. diff --git a/agent/tools/docs_tools.py b/agent/tools/docs_tools.py index a1782107..ee40ef35 100644 --- a/agent/tools/docs_tools.py +++ b/agent/tools/docs_tools.py @@ -932,7 +932,7 @@ async def _get_api_search_tool_spec() -> dict[str, Any]: "• argilla — Data annotation, feedback, and human-in-the-loop workflows.\n" "• distilabel — Synthetic data generation and distillation pipelines.\n" "• microsoft-azure — Azure deployment and integration guides.\n" - "• kernels — Lightweight execution environments and notebook-style workflows.\n" + "• kernels — Load prebuilt compute kernels (E.g. flash-attn2) from the Hub via `attn_implementation`; avoids compiling flash-attn from source.\n" "• google-cloud — GCP deployment and serving workflows.\n" ), }, diff --git a/agent/tools/research_tool.py b/agent/tools/research_tool.py index 11131766..c4480b97 100644 --- a/agent/tools/research_tool.py +++ b/agent/tools/research_tool.py @@ -9,10 +9,12 @@ import json import logging +import time from typing import Any from litellm import Message, acompletion +from agent.core import telemetry from agent.core.doom_loop import check_for_doom_loop from agent.core.llm_params import _resolve_llm_params from agent.core.prompt_caching import with_prompt_caching @@ -332,6 +334,7 @@ async def _log(text: str) -> None: )) try: _msgs, _ = with_prompt_caching(messages, None, llm_params.get("model")) + _t0 = time.monotonic() response = await acompletion( messages=_msgs, tools=None, # no tools — force text response @@ -339,6 +342,20 @@ async def _log(text: str) -> None: timeout=120, **llm_params, ) + # Telemetry is best-effort; a logging blip must never mask a + # valid LLM response (the surrounding except would convert it + # to "summary call failed"). + try: + await telemetry.record_llm_call( + session, + model=research_model, + response=response, + latency_ms=int((time.monotonic() - _t0) * 1000), + finish_reason=response.choices[0].finish_reason if response.choices else None, + kind="research", + ) + except Exception as _telem_err: + logger.debug("research telemetry failed: %s", _telem_err) content = response.choices[0].message.content or "" return content or "Research context exhausted — no summary produced.", bool(content) except Exception: @@ -360,6 +377,7 @@ async def _log(text: str) -> None: _msgs, _tools = with_prompt_caching( messages, tool_specs if tool_specs else None, llm_params.get("model") ) + _t0 = time.monotonic() response = await acompletion( messages=_msgs, tools=_tools, @@ -368,6 +386,17 @@ async def _log(text: str) -> None: timeout=120, **llm_params, ) + try: + await telemetry.record_llm_call( + session, + model=research_model, + response=response, + latency_ms=int((time.monotonic() - _t0) * 1000), + finish_reason=response.choices[0].finish_reason if response.choices else None, + kind="research", + ) + except Exception as _telem_err: + logger.debug("research telemetry failed: %s", _telem_err) except Exception as e: logger.error("Research sub-agent LLM error: %s", e) return f"Research agent LLM error: {e}", False @@ -459,6 +488,7 @@ async def _log(text: str) -> None: )) try: _msgs, _ = with_prompt_caching(messages, None, llm_params.get("model")) + _t0 = time.monotonic() response = await acompletion( messages=_msgs, tools=None, @@ -466,6 +496,17 @@ async def _log(text: str) -> None: timeout=120, **llm_params, ) + try: + await telemetry.record_llm_call( + session, + model=research_model, + response=response, + latency_ms=int((time.monotonic() - _t0) * 1000), + finish_reason=response.choices[0].finish_reason if response.choices else None, + kind="research", + ) + except Exception as _telem_err: + logger.debug("research telemetry failed: %s", _telem_err) content = response.choices[0].message.content or "" if content: return content, True diff --git a/agent/tools/sandbox_client.py b/agent/tools/sandbox_client.py index 967d946c..170dcb51 100644 --- a/agent/tools/sandbox_client.py +++ b/agent/tools/sandbox_client.py @@ -13,7 +13,7 @@ - Optionally deletes the Space when done Lifecycle: - sb = Sandbox.create(owner="burtenshaw") # duplicate, wait, connect + sb = Sandbox.create(owner="burtenshaw") # duplicate private Space, wait, connect sb = Sandbox.create(owner="burtenshaw", # with options hardware="t4-small", private=True, @@ -66,6 +66,15 @@ WAIT_INTERVAL = 5 API_WAIT_TIMEOUT = 180 + +def _is_transient_space_visibility_error(error: Exception) -> bool: + """Return True when a newly duplicated Space is not queryable yet.""" + response = getattr(error, "response", None) + if getattr(response, "status_code", None) == 404: + return True + message = str(error) + return "Repository Not Found" in message or "404 Client Error" in message + _DOCKERFILE = """\ FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim @@ -157,18 +166,20 @@ def _atomic_write(path: pathlib.Path, content: str): app = FastAPI() -def _expected_api_token() -> str: - return os.environ.get("SANDBOX_API_TOKEN") or os.environ.get("HF_TOKEN") or "" +def _bearer_token(header: str) -> str: + scheme, _, supplied = header.partition(" ") + if scheme.lower() != "bearer" or not supplied: + return "" + return supplied def _require_auth(request: Request) -> None: - expected = _expected_api_token() - if not expected: + sandbox_token = os.environ.get("SANDBOX_API_TOKEN") or "" + if not sandbox_token: raise HTTPException(status_code=503, detail="Sandbox API token not configured") - auth_header = request.headers.get("authorization", "") - scheme, _, supplied = auth_header.partition(" ") - if scheme.lower() != "bearer" or not supplied: + supplied = _bearer_token(request.headers.get("x-sandbox-authorization", "")) + if not supplied: raise HTTPException(status_code=401, detail="Missing bearer token") - if not hmac.compare_digest(supplied, expected): + if not hmac.compare_digest(supplied, sandbox_token): raise HTTPException(status_code=401, detail="Invalid bearer token") _AUTH = [Depends(_require_auth)] @@ -513,15 +524,28 @@ def __post_init__(self): # Trailing slash is critical: httpx resolves relative paths against base_url. # Without it, client.get("health") resolves to /health instead of /api/health. self._base_url = f"https://{slug}.hf.space/api/" - api_token = self.api_token or self.token self._client = httpx.Client( base_url=self._base_url, - headers={"Authorization": f"Bearer {api_token}"} if api_token else {}, + headers=self._auth_headers(), timeout=httpx.Timeout(MAX_TIMEOUT, connect=30), follow_redirects=True, ) self._hf_api = HfApi(token=self.token) + def _auth_headers(self) -> dict[str, str]: + """Return headers for private HF Space access plus sandbox API auth. + + Private Spaces require the HF token in ``Authorization`` at the Hub + edge. The sandbox server requires its control-plane token in the + dedicated ``X-Sandbox-Authorization`` header. + """ + headers: dict[str, str] = {} + if self.token: + headers["Authorization"] = f"Bearer {self.token}" + if self.api_token: + headers["X-Sandbox-Authorization"] = f"Bearer {self.api_token}" + return headers + # ── Lifecycle ───────────────────────────────────────────────── class Cancelled(Exception): @@ -535,7 +559,7 @@ def create( name: str | None = None, template: str = TEMPLATE_SPACE, hardware: str = "cpu-basic", - private: bool = False, + private: bool = True, sleep_time: int | None = None, token: str | None = None, secrets: dict[str, str] | None = None, @@ -555,7 +579,7 @@ def create( A unique suffix is always appended. template: Source Space to duplicate (default: burtenshaw/sandbox). hardware: Hardware tier (cpu-basic, t4-small, etc.). - private: Whether the Space should be private. + private: Whether the Space should be private. Defaults to True. sleep_time: Auto-sleep after N seconds of inactivity. token: HF API token (from user's OAuth session). wait_timeout: Max seconds to wait for Space to start (default: 300). @@ -600,6 +624,16 @@ def _check_cancel(): _check_cancel() + # Some template duplicates can initially inherit the template hardware. + # Explicitly request the target tier so automatic CPU sandboxes never + # silently come up on GPU hardware. + api.request_space_hardware( + space_id, + hardware=hardware, + sleep_time=sleep_time, + ) + _log(f"Requested hardware: {hardware}") + # Inject secrets BEFORE uploading server files (which triggers rebuild). # Secrets added after a Space is running aren't available until restart, # so they must be set before the build/start cycle. @@ -618,8 +652,24 @@ def _check_cancel(): deadline = time.time() + wait_timeout while time.time() < deadline: _check_cancel() - runtime = api.get_space_runtime(space_id) + try: + runtime = api.get_space_runtime(space_id) + except Exception as e: + if _is_transient_space_visibility_error(e): + _log(" Space runtime not visible yet...") + time.sleep(WAIT_INTERVAL) + continue + raise if runtime.stage == "RUNNING": + current_hardware = runtime.hardware or getattr( + runtime, "requested_hardware", None + ) + if current_hardware != hardware: + _log( + f" RUNNING on {current_hardware}; waiting for {hardware}..." + ) + time.sleep(WAIT_INTERVAL) + continue _log(f"Space is running (hardware: {runtime.hardware})") break if runtime.stage in ("RUNTIME_ERROR", "BUILD_ERROR"): diff --git a/agent/tools/sandbox_tool.py b/agent/tools/sandbox_tool.py index a5c26aca..8e410f80 100644 --- a/agent/tools/sandbox_tool.py +++ b/agent/tools/sandbox_tool.py @@ -2,11 +2,11 @@ Sandbox tools — expose the Sandbox client as agent tools. 5 tools total: - sandbox_create — explicit sandbox creation (requires approval) - bash, read, write, edit — operations on the sandbox + sandbox_create — create/replace sandbox for non-default hardware + bash, read, write, edit — operations on the active sandbox -If any operation tool is called without an active sandbox, -a cpu-basic sandbox is auto-created (no approval needed). +A cpu-basic sandbox is preloaded for each session. Operation tools wait for it +if startup is still in progress. """ from __future__ import annotations @@ -15,6 +15,7 @@ import logging import re import threading +import weakref from datetime import datetime, timedelta, timezone from typing import Any @@ -26,6 +27,8 @@ logger = logging.getLogger(__name__) +DEFAULT_CPU_SANDBOX_HARDWARE = "cpu-basic" + # Match the exact suffix pattern Sandbox.create produces: "sandbox-<8 hex>". # Used to identify orphan sandboxes from prior sessions safely (won't match # user-renamed lookalikes). @@ -36,6 +39,23 @@ # so we leave it alone. _ORPHAN_STALE_AFTER = timedelta(hours=1) +# HF Space duplication/build APIs can behave poorly when multiple private +# sandboxes are created concurrently for the same namespace. Keep session +# creation non-blocking, but serialize the actual Hub create path per owner. +_SANDBOX_CREATE_LOCKS: weakref.WeakKeyDictionary[ + asyncio.AbstractEventLoop, dict[str, asyncio.Lock] +] = weakref.WeakKeyDictionary() + + +def _get_sandbox_create_lock(owner: str) -> asyncio.Lock: + loop = asyncio.get_running_loop() + locks = _SANDBOX_CREATE_LOCKS.setdefault(loop, {}) + lock = locks.get(owner) + if lock is None: + lock = asyncio.Lock() + locks[owner] = lock + return lock + def _looks_like_path(script: str) -> bool: """Return True if the script string looks like a file path (not inline code).""" @@ -124,7 +144,7 @@ def _cleanup_user_orphan_sandboxes( cutoff = datetime.now(timezone.utc) - _ORPHAN_STALE_AFTER deleted = 0 try: - spaces = list(api.list_spaces(author=owner, limit=200)) + spaces = list(api.list_spaces(author=owner, limit=200, full=True)) except Exception as e: log(f"orphan sweep: list_spaces failed: {e}") return 0 @@ -140,6 +160,9 @@ def _cleanup_user_orphan_sandboxes( last_mod = datetime.fromisoformat(last_mod.replace("Z", "+00:00")) except ValueError: last_mod = None + if last_mod is None: + log(f"orphan sweep: skipping {space.id}; missing lastModified") + continue if last_mod and last_mod > cutoff: # Recent — could be a concurrent live session. Skip. continue @@ -158,8 +181,9 @@ def _cleanup_user_orphan_sandboxes( async def _ensure_sandbox( session: Any, - hardware: str = "cpu-basic", + hardware: str = DEFAULT_CPU_SANDBOX_HARDWARE, extra_secrets: dict[str, str] | None = None, + cancel_event: threading.Event | None = None, **create_kwargs, ) -> tuple[Sandbox | None, str | None]: """ @@ -184,6 +208,45 @@ async def _ensure_sandbox( if not owner: return None, "Could not determine HF username from token." + create_lock = _get_sandbox_create_lock(owner) + if create_lock.locked(): + await session.send_event( + Event( + event_type="tool_log", + data={ + "tool": "sandbox", + "log": "Waiting for sandbox creation slot...", + }, + ) + ) + + async with create_lock: + if getattr(session, "sandbox", None): + return session.sandbox, None + + return await _create_sandbox_locked( + session, + api=api, + owner=owner, + hardware=hardware, + extra_secrets=extra_secrets, + cancel_event=cancel_event, + **create_kwargs, + ) + + +async def _create_sandbox_locked( + session: Any, + *, + api: HfApi, + owner: str, + hardware: str, + extra_secrets: dict[str, str] | None = None, + cancel_event: threading.Event | None = None, + **create_kwargs, +) -> tuple[Sandbox | None, str | None]: + """Create the Space while the per-owner sandbox creation lock is held.""" + token = session.hf_token await session.send_event( Event( event_type="tool_log", @@ -203,27 +266,10 @@ def _log(msg: str) -> None: Event(event_type="tool_log", data={"tool": "sandbox", "log": msg}), ) - # Before we create a new sandbox, sweep this user's stale sandboxes from - # prior sessions. ``_cleanup_sandbox`` in session_manager fires only on - # clean session exit; pod kills, WebSocket drops, etc. leave orphans - # behind, and they accumulate on every new session forever (observed - # 2310 leaked across the Hub on 2026-04-27). Doing the cleanup here at - # session start = self-healing, no separate cron needed. - # - # The 1h staleness filter is the safety: a sandbox modified in the last - # hour might still be tied to a live session in another tab, so we skip. - # Anything older has no realistic chance of being active given typical - # session lengths. - try: - await asyncio.to_thread(_cleanup_user_orphan_sandboxes, api, owner, _log) - except Exception as e: - # Cleanup is best-effort — never block sandbox_create on it. - _log(f"orphan sandbox sweep failed (non-fatal): {e}") - # Bridge asyncio cancel event to a threading.Event for the blocking create call. # We poll session._cancelled from the main loop in a background task and set # a threading.Event that Sandbox.create checks during its polling loops. - cancel_flag = threading.Event() + cancel_flag = cancel_event or threading.Event() async def _watch_cancel(): await session._cancelled.wait() @@ -235,6 +281,7 @@ async def _watch_cancel(): if extra_secrets: secrets.update({k: v for k, v in extra_secrets.items() if v}) + create_kwargs["private"] = True # enforce: overrides any caller-supplied value kwargs = { "owner": owner, "hardware": hardware, @@ -244,7 +291,7 @@ async def _watch_cancel(): "cancel_event": cancel_flag, **create_kwargs, } - if hardware != "cpu-basic": + if hardware != DEFAULT_CPU_SANDBOX_HARDWARE: kwargs["sleep_time"] = 2700 import time as _t _t_start = _t.monotonic() @@ -254,7 +301,18 @@ async def _watch_cancel(): return None, "Sandbox creation cancelled by user." finally: watcher_task.cancel() + + if cancel_flag.is_set(): + if getattr(sb, "_owns_space", False): + try: + await asyncio.to_thread(sb.delete) + except Exception as e: + logger.warning("Failed to delete cancelled sandbox %s: %s", sb.space_id, e) + return None, "Sandbox creation cancelled by user." + session.sandbox = sb + session.sandbox_hardware = hardware + session.sandbox_preload_error = None # Telemetry: sandbox creation (infra consumption signal) from agent.core import telemetry @@ -285,18 +343,146 @@ async def _watch_cancel(): return sb, None +def start_cpu_sandbox_preload(session: Any) -> asyncio.Task | None: + """Start a background ``cpu-basic`` sandbox for this session.""" + if not session or getattr(session, "sandbox", None): + return None + + existing_task = getattr(session, "sandbox_preload_task", None) + if existing_task and not existing_task.done(): + return existing_task + + cancel_event = threading.Event() + session.sandbox_preload_cancel_event = cancel_event + session.sandbox_preload_error = None + + async def _preload() -> Sandbox | None: + try: + sb, error = await _ensure_sandbox( + session, + hardware=DEFAULT_CPU_SANDBOX_HARDWARE, + cancel_event=cancel_event, + ) + if error: + session.sandbox_preload_error = error + return None + return sb + except asyncio.CancelledError: + cancel_event.set() + session.sandbox_preload_error = "Sandbox creation cancelled by user." + raise + except Exception as e: + session.sandbox_preload_error = f"Failed to create sandbox: {e}" + logger.warning("CPU sandbox preload failed: %s", e) + return None + + task = asyncio.create_task(_preload()) + session.sandbox_preload_task = task + return task + + +async def cancel_sandbox_preload(session: Any) -> None: + """Best-effort cancellation for an in-flight CPU sandbox preload.""" + cancel_event = getattr(session, "sandbox_preload_cancel_event", None) + if cancel_event is not None: + cancel_event.set() + + task = getattr(session, "sandbox_preload_task", None) + if not task or task.done(): + return + + current_task = asyncio.current_task() + if task is current_task: + return + + try: + await asyncio.wait_for(asyncio.shield(task), timeout=30) + except asyncio.TimeoutError: + logger.warning( + "Timed out waiting for CPU sandbox preload cancellation; " + "task is still live, cancelling asyncio wrapper" + ) + task.cancel() + except asyncio.CancelledError: + raise + except Exception: + pass + + +async def get_active_or_preloaded_sandbox( + session: Any, +) -> tuple[Sandbox | None, str | None]: + """Return the active sandbox, waiting for the startup preload if needed.""" + if not session: + return None, "No session available." + if getattr(session, "sandbox", None): + return session.sandbox, None + + task = getattr(session, "sandbox_preload_task", None) + if task: + try: + await asyncio.shield(task) + except asyncio.CancelledError: + raise + except Exception as e: + session.sandbox_preload_error = f"Failed to create sandbox: {e}" + + if getattr(session, "sandbox", None): + return session.sandbox, None + + preload_error = getattr(session, "sandbox_preload_error", None) + if preload_error: + return None, preload_error + + return None, "Sandbox is still starting. Please retry shortly." + + +async def teardown_session_sandbox(session: Any) -> None: + """Cancel sandbox preload and delete the active owned sandbox, if present.""" + if not session: + return + + await cancel_sandbox_preload(session) + + sandbox = getattr(session, "sandbox", None) + session.sandbox = None + session.sandbox_hardware = None + + if not (sandbox and getattr(sandbox, "_owns_space", False)): + return + + space_id = getattr(sandbox, "space_id", None) + last_err: Exception | None = None + for attempt in range(3): + try: + logger.info("Deleting sandbox %s (attempt %s/3)...", space_id, attempt + 1) + await asyncio.to_thread(sandbox.delete) + from agent.core import telemetry + await telemetry.record_sandbox_destroy(session, sandbox) + return + except Exception as e: + last_err = e + if attempt < 2: + await asyncio.sleep(2 ** attempt) + logger.error( + "Failed to delete sandbox %s after 3 attempts: %s. " + "Orphan — sweep script will pick it up.", + space_id, + last_err, + ) + + # ── sandbox_create tool ────────────────────────────────────────────── SANDBOX_CREATE_TOOL_SPEC = { "name": "sandbox_create", "description": ( - "Create a persistent remote Linux environment for developing and testing scripts.\n\n" - "Workflow: sandbox_create → write script → pip install → test with small run → fix errors → hf_jobs at scale.\n" - "The sandbox persists across tool calls within the session. pip install works out of the box.\n\n" - "Use this when: you need to develop, test, and iterate on scripts before launching via hf_jobs. " - "Especially for training scripts where you need to verify imports, test on a small subset, and fix errors interactively.\n\n" - "Skip this when: the task is a simple one-shot operation (status check, resource search, quick data query), " - "or the script is copied from a verified working example with minimal changes.\n\n" + "Create or replace the session sandbox when non-default hardware is needed.\n\n" + "A private cpu-basic sandbox is already started automatically for each session. " + "For normal CPU code execution, call bash/read/write/edit directly; do NOT call sandbox_create first.\n\n" + "Use sandbox_create when: you need GPU hardware, cpu-upgrade, or Trackio secrets before running code. " + "The active sandbox persists across tool calls within the session. pip install works out of the box. " + "Sandboxes are always created as private HF Spaces.\n\n" "For ML code that uses CUDA, bf16, or model loading: use GPU hardware (t4-small minimum). " "CPU sandboxes cannot run GPU code paths — your test will not catch GPU-related errors.\n\n" "Before choosing hardware, estimate your VRAM needs (models you run, training data size). Rule of thumb: bf16/fp16 ≈ 2 bytes/param, " @@ -316,11 +502,10 @@ async def _watch_cancel(): "hardware": { "type": "string", "enum": [e.value for e in SpaceHardware], - "description": "Hardware tier for the sandbox (default: cpu-basic)", - }, - "private": { - "type": "boolean", - "description": "If true, create a private Space", + "description": ( + "Hardware tier for the sandbox. Omit for the existing auto-started " + "cpu-basic sandbox; choose GPU/cpu-upgrade only when needed." + ), }, "trackio_space_id": { "type": "string", @@ -348,7 +533,7 @@ async def sandbox_create_handler( args: dict[str, Any], session: Any = None, tool_call_id: str | None = None ) -> tuple[str, bool]: """Handle sandbox_create tool calls.""" - hardware = args.get("hardware", "cpu-basic") + hardware = args.get("hardware", DEFAULT_CPU_SANDBOX_HARDWARE) trackio_space_id = args.get("trackio_space_id") or None trackio_project = args.get("trackio_project") or None @@ -366,28 +551,78 @@ async def _emit_trackio_state(sb: Sandbox) -> None: data["trackioProject"] = trackio_project await session.send_event(Event(event_type="tool_state_change", data=data)) - # If sandbox already exists, return its info + preload_task = getattr(session, "sandbox_preload_task", None) + if ( + session + and not getattr(session, "sandbox", None) + and preload_task + and not preload_task.done() + and hardware == DEFAULT_CPU_SANDBOX_HARDWARE + ): + sb, error = await get_active_or_preloaded_sandbox(session) + if error: + return error, False + if sb: + await _emit_trackio_state(sb) + return ( + f"Sandbox already active: {sb.space_id}\n" + f"URL: {sb.url}\n" + f"Hardware: {DEFAULT_CPU_SANDBOX_HARDWARE}\n" + f"Use bash/read/write/edit to interact with it." + ), True + + if ( + session + and not getattr(session, "sandbox", None) + and preload_task + and not preload_task.done() + and hardware != DEFAULT_CPU_SANDBOX_HARDWARE + ): + await cancel_sandbox_preload(session) + + # If sandbox already exists, return its info or replace the auto CPU sandbox if session and getattr(session, "sandbox", None): sb = session.sandbox + active_hardware = getattr(session, "sandbox_hardware", None) + if active_hardware == hardware: + await _emit_trackio_state(sb) + return ( + f"Sandbox already active: {sb.space_id}\n" + f"URL: {sb.url}\n" + f"Hardware: {active_hardware}\n" + f"Use bash/read/write/edit to interact with it." + ), True + requested_hardware = args.get("hardware") lockout_note = "" - if requested_hardware: + if ( + active_hardware == DEFAULT_CPU_SANDBOX_HARDWARE + and hardware != DEFAULT_CPU_SANDBOX_HARDWARE + ): + await teardown_session_sandbox(session) + elif requested_hardware: lockout_note = ( f"\nRequested hardware: {requested_hardware}\n" "Hardware cannot be changed by calling sandbox_create again. " "Delete the existing sandbox first if you need a different tier." ) - await _emit_trackio_state(sb) - return ( - f"Sandbox already active: {sb.space_id}\n" - f"URL: {sb.url}\n" - f"{lockout_note}\n" - f"Use bash/read/write/edit to interact with it." - ), True + await _emit_trackio_state(sb) + return ( + f"Sandbox already active: {sb.space_id}\n" + f"URL: {sb.url}\n" + f"{lockout_note}\n" + f"Use bash/read/write/edit to interact with it." + ), True + else: + await _emit_trackio_state(sb) + return ( + f"Sandbox already active: {sb.space_id}\n" + f"URL: {sb.url}\n" + f"Hardware: {active_hardware or 'unknown'}\n" + f"Use bash/read/write/edit to interact with it." + ), True create_kwargs: dict[str, Any] = {} - if "private" in args: - create_kwargs["private"] = args["private"] extra_secrets: dict[str, str] = {} if trackio_space_id: @@ -415,6 +650,7 @@ async def _emit_trackio_state(sb: Sandbox) -> None: f"Sandbox created: {sb.space_id}\n" f"URL: {sb.url}\n" f"Hardware: {hardware}\n" + "Visibility: private\n" f"Use bash/read/write/edit to interact with it." ), True @@ -423,11 +659,11 @@ def _make_tool_handler(sandbox_tool_name: str): """Factory: create a handler for a sandbox operation tool.""" async def handler(args: dict[str, Any], session: Any = None) -> tuple[str, bool]: - # Require sandbox to exist — user must approve sandbox_create first - if not session or not getattr(session, "sandbox", None): - return "No sandbox running. Call sandbox_create first to start one.", False - - sb = session.sandbox + sb, error = await get_active_or_preloaded_sandbox(session) + if error: + return error, False + if not sb: + return "Sandbox is still starting. Please retry shortly.", False try: result = await asyncio.to_thread(sb.call_tool, sandbox_tool_name, args) @@ -452,7 +688,7 @@ def get_sandbox_tools(): tools = [] - # sandbox_create (explicit creation, requires approval) + # sandbox_create (for GPU or other non-default hardware) tools.append( ToolSpec( name=SANDBOX_CREATE_TOOL_SPEC["name"], @@ -465,10 +701,16 @@ def get_sandbox_tools(): # Operation tools (auto-execute, no approval needed) for name in Sandbox.TOOLS.keys(): spec = Sandbox.TOOLS[name] + description = ( + "Uses the session's active sandbox. A private cpu-basic sandbox is " + "started automatically for normal CPU work; call sandbox_create only " + "for GPU or other non-default hardware.\n\n" + + spec["description"] + ) tools.append( ToolSpec( name=name, - description=spec["description"], + description=description, parameters=spec["parameters"], handler=_make_tool_handler(name), ) diff --git a/agent/utils/terminal_display.py b/agent/utils/terminal_display.py index 8ff9d525..f2b73301 100644 --- a/agent/utils/terminal_display.py +++ b/agent/utils/terminal_display.py @@ -425,6 +425,7 @@ def print_yolo_approve(count: int) -> None: {_I} [cyan]/effort[/cyan] [level] Reasoning effort (minimal|low|medium|high|xhigh|max|off) {_I} [cyan]/yolo[/cyan] Toggle auto-approve mode {_I} [cyan]/status[/cyan] Current model & turn count +{_I} [cyan]/share-traces[/cyan] [public|private] Show/flip visibility of your HF trace dataset {_I} [cyan]/quit[/cyan] Exit""" diff --git a/agentic_build_workflow.md b/agentic_build_workflow.md new file mode 100644 index 00000000..3a394d5e --- /dev/null +++ b/agentic_build_workflow.md @@ -0,0 +1,161 @@ +# Agentic Build Workflow + +> A simple, repeatable workflow for any engineering or building project — +> distilled from `cowork_os_karpathy.md`. +> Works for features, prototypes, training runs, infra, research probes, +> production systems. +> Six phases. No more. No checklists you wouldn't actually run. + +--- + +## The whole thing on one screen + +```text + DEFINE → PROBE → BUILD → REVIEW → SHIP → LEARN + ↑ │ + └────── update spec / verifier on every surprise ──┘ +``` + +--- + +## Phase 1 — DEFINE (before any code) + +```text +[ ] Write the goal in ONE sentence. +[ ] Ask: can ONE model call do this end-to-end? If yes — stop, use it. +[ ] Write the spec: what it does, what it must NOT do, success criteria. +[ ] Write the verifier: a script that returns pass / fail. Not a description. +[ ] List the invariants that must never break (identity, security, data, $). +[ ] Decide what you will NOT build today. Write it down. +``` + +> If you can't write the verifier, the goal is wrong. Fix the goal, not the agent. + +--- + +## Phase 2 — PROBE (5 minutes, before trusting the agent) + +```text +[ ] Run 3–5 small known-answer test cases of the task on the model. +[ ] Decide which regime you're in: +``` + +```text +mostly works inside trained circuits delegate freely +sometimes works at the edge delegate + verifier + spot review +mostly fails outside do it yourself, fine-tune, or rescope +``` + +Skipping the probe is how you discover, the expensive way, that the model has +a hole exactly where you needed competence. + +--- + +## Phase 3 — BUILD (the agentic loop) + +```text +[ ] Hand the spec to agent #1. Ask for a PLAN, not code. +[ ] Read the plan. Reject anything that violates an invariant. +[ ] Agent #1 implements + writes its own tests. +[ ] Agent #2 critiques: edge cases, security, missing invariants. +[ ] Run the verifier. It must pass. +[ ] Run the verifier on a deliberately broken input. It must fail. +``` + +Cap: 3 concurrent agents max. Above that you rubber-stamp. + +A verifier that always passes is a decoration. Step 6 catches the most subtle bugs. + +--- + +## Phase 4 — REVIEW (the human gate) + +```text +[ ] Read the test file FIRST. Are tests checking the spec, or the bug? +[ ] Read the diff in reverse — last change to first. +[ ] For every new function: what's the 3am failure mode? + Silent corruption = reject. Loud crash = acceptable. +[ ] For every deleted line: was it load-bearing? + If you can't answer in 10 seconds, don't merge. +[ ] Substrate check: identity, money, security, state — never delegated. +``` + +If you can't defend a decision from first principles, you don't merge. + +--- + +## Phase 5 — SHIP (with sensors) + +```text +[ ] Define KILL conditions BEFORE launch — concrete numeric thresholds. + e.g. "loss spike >2× baseline for 50 steps", "p99 latency >200ms", + "error rate >1%", "MFU <30%". +[ ] Add observability: logs, metrics, audit trail. +[ ] Add a rollback path. Test it once. +[ ] Deploy. Watch the sensors. +``` + +> Rule: never add an actuator (anything that changes state) without a matching +> sensor (a way to observe what changed). An agent that acts without verifying +> its action is a bomb with a timer. + +--- + +## Phase 6 — LEARN (per surprise) + +```text +[ ] What surprised you? Write one paragraph. +[ ] Does the surprise become a new verifier? Add it now. +[ ] Update the spec to cover the failure mode. +[ ] If you rubber-stamped anything in REVIEW, flag it for tomorrow. +``` + +> A postmortem that doesn't update a verifier is a diary, not engineering. + +--- + +## When something fails, walk this in order + +```text +Rare in pretraining? → put canonical examples in context +Outside RL coverage? → build a verifier, loop yourself +Bad context? → fix the context first; most failures live here +Wrong layer (1.0/2.0/3.0)?→ stop editing prompts when the bug is in the code/data +``` + +Replaces frustration with a debugging procedure. + +--- + +## What stays yours, always + +```text +identity, permissions, money / accounting logic, security boundaries +secrets, distributed state, schema migrations, irreversible deploys +performance-critical architecture, the choice of what to build at all +memory layout, data movement, numerical precision, concurrency +``` + +The rest is fair game for delegation. Verify everything. + +--- + +## Anti-patterns to catch yourself in + +```text +1. Editing the prompt when the bug is in the data. +2. Editing the data when the bug is in the spec. +3. Trusting "I have verified this" from an agent — it hasn't. +4. Building a pipeline that should have been one model call. +5. Adding a fourth concurrent agent. You will regret it. +6. Saying "the agent decided" — replace with "P(output | context) was high." +``` + +--- + +## The mantra + +> Design the environment so jagged agents can safely produce high-quality work, +> and never let typing speed substitute for understanding. + +That's the whole game. Six phases. One screen. diff --git a/backend/dependencies.py b/backend/dependencies.py index bd9e9070..5ce6e229 100644 --- a/backend/dependencies.py +++ b/backend/dependencies.py @@ -111,7 +111,7 @@ async def _fetch_user_plan(token: str) -> str: # OAuth whoami sets `type: "user"` and surfaces Pro via the `isPro` boolean # — see Space discussion #21. HF-Jobs eligibility (PR #172) ignores plan - # entirely; the Claude daily-cap tier is still a free vs pro/org split. + # entirely; the premium-model daily-cap tier is still a free vs pro/org split. if whoami.get("isPro") is True or whoami.get("is_pro") is True: return "pro" plan_str = "" @@ -138,6 +138,38 @@ async def _extract_user_from_token(token: str) -> dict[str, Any] | None: return user +async def _dev_user_from_env() -> dict[str, Any]: + """Use HF_TOKEN as the dev identity when available. + + Local dev often runs without OAuth, but session trace uploads still need a + real HF namespace. Deriving the dev user from HF_TOKEN keeps local uploads + pointed at the token owner's dataset instead of dev/ml-intern-sessions. + """ + token = os.environ.get("HF_TOKEN", "") + if not token: + return DEV_USER + + whoami = await fetch_whoami_v2(token) + if not isinstance(whoami, dict): + return DEV_USER + + username = None + for key in ("name", "user", "preferred_username"): + value = whoami.get(key) + if isinstance(value, str) and value: + username = value + break + if not username: + return DEV_USER + + return { + "user_id": username, + "username": username, + "authenticated": True, + "plan": await _fetch_user_plan(token), + } + + async def check_org_membership(token: str, org_name: str) -> bool: """Check if the token owner belongs to an HF org. Only caches positive results.""" now = time.time() @@ -170,10 +202,10 @@ async def get_current_user(request: Request) -> dict[str, Any]: 1. Authorization: Bearer header 2. hf_access_token cookie - In dev mode (AUTH_ENABLED=False), returns a default dev user. + In dev mode (AUTH_ENABLED=False), uses HF_TOKEN as the user when possible. """ if not AUTH_ENABLED: - return DEV_USER + return await _dev_user_from_env() # Try Authorization header token = bearer_token_from_header(request.headers.get("Authorization", "")) diff --git a/backend/models.py b/backend/models.py index 04048013..40a72509 100644 --- a/backend/models.py +++ b/backend/models.py @@ -66,6 +66,7 @@ class SessionResponse(BaseModel): session_id: str ready: bool = True + model: str | None = None class PendingApprovalTool(BaseModel): @@ -76,6 +77,15 @@ class PendingApprovalTool(BaseModel): arguments: dict[str, Any] = {} +class SessionAutoApprovalInfo(BaseModel): + """Per-session auto-approval budget state.""" + + enabled: bool = False + cost_cap_usd: float | None = None + estimated_spend_usd: float = 0.0 + remaining_usd: float | None = None + + class SessionInfo(BaseModel): """Session metadata.""" @@ -89,6 +99,9 @@ class SessionInfo(BaseModel): model: str | None = None title: str | None = None notification_destinations: list[str] = Field(default_factory=list) + auto_approval: SessionAutoApprovalInfo = Field( + default_factory=SessionAutoApprovalInfo + ) class SessionNotificationsRequest(BaseModel): @@ -97,6 +110,13 @@ class SessionNotificationsRequest(BaseModel): destinations: list[str] +class SessionYoloRequest(BaseModel): + """Update a session's auto-approval policy.""" + + enabled: bool + cost_cap_usd: float | None = Field(default=None, ge=0) + + class HealthResponse(BaseModel): """Health check response.""" diff --git a/backend/routes/agent.py b/backend/routes/agent.py index 96830568..ed33650d 100644 --- a/backend/routes/agent.py +++ b/backend/routes/agent.py @@ -26,6 +26,7 @@ SessionInfo, SessionNotificationsRequest, SessionResponse, + SessionYoloRequest, SubmitRequest, TruncateRequest, ) @@ -40,84 +41,153 @@ logger = logging.getLogger(__name__) router = APIRouter(prefix="/api", tags=["agent"]) +_background_teardown_tasks: set[asyncio.Task] = set() -AVAILABLE_MODELS = [ - { - "id": "moonshotai/Kimi-K2.6", - "label": "Kimi K2.6", - "provider": "huggingface", - "tier": "free", - "recommended": True, - }, - { - "id": "bedrock/us.anthropic.claude-opus-4-6-v1", - "label": "Claude Opus 4.6", - "provider": "anthropic", - "tier": "pro", - "recommended": True, - }, - { - "id": "MiniMaxAI/MiniMax-M2.7", - "label": "MiniMax M2.7", - "provider": "huggingface", - "tier": "free", - }, - { - "id": "zai-org/GLM-5.1", - "label": "GLM 5.1", - "provider": "huggingface", - "tier": "free", - }, -] - - -def _is_anthropic_model(model_id: str) -> bool: - return "anthropic" in model_id - - -async def _require_hf_for_anthropic(request: Request, model_id: str) -> None: - """403 if a non-``huggingface``-org user tries to select an Anthropic model. - - Anthropic models are billed to the Space's ``ANTHROPIC_API_KEY``; every - other model in ``AVAILABLE_MODELS`` is routed through HF Router and - billed via ``X-HF-Bill-To``. The gate only fires for Anthropic so - non-HF users can still freely switch between the free models. - - Pattern: https://github.com/huggingface/ml-intern/pull/63 +DEFAULT_CLAUDE_MODEL_ID = "bedrock/us.anthropic.claude-opus-4-6-v1" +DEFAULT_FREE_MODEL_ID = "moonshotai/Kimi-K2.6" +GATED_MODEL_IDS = { + DEFAULT_CLAUDE_MODEL_ID, + "openai/gpt-5.5", +} + + +def _claude_picker_model_id() -> str: + """Return the model ID used by the Claude option in the UI. + + The frontend config sets ``session_manager.config.model_name`` from + ``ML_INTERN_CLAUDE_MODEL_ID`` when that env var is present, otherwise it + falls back to the production Bedrock Claude model. This function only + exposes that resolved config value for the Claude picker; non-Claude models + are listed separately in the model switcher. + """ + return session_manager.config.model_name + + +def _available_models() -> list[dict[str, Any]]: + models = [ + { + "id": "moonshotai/Kimi-K2.6", + "label": "Kimi K2.6", + "provider": "huggingface", + "tier": "free", + "recommended": True, + }, + { + "id": _claude_picker_model_id(), + "label": "Claude Opus 4.6", + "provider": "anthropic", + "tier": "pro", + "recommended": True, + }, + { + "id": "openai/gpt-5.5", + "label": "GPT-5.5", + "provider": "openai", + "tier": "pro", + }, + { + "id": "MiniMaxAI/MiniMax-M2.7", + "label": "MiniMax M2.7", + "provider": "huggingface", + "tier": "free", + }, + { + "id": "zai-org/GLM-5.1", + "label": "GLM 5.1", + "provider": "huggingface", + "tier": "free", + }, + { + "id": "deepseek-ai/DeepSeek-V4-Pro:deepinfra", + "label": "DeepSeek V4 Pro", + "provider": "huggingface", + "tier": "free", + }, + ] + return models + + +AVAILABLE_MODELS = _available_models() + + +def _is_gated_model(model_id: str) -> bool: + return model_id in GATED_MODEL_IDS + + +def _premium_model_restricted_error() -> HTTPException: + return HTTPException( + status_code=403, + detail={ + "error": "premium_model_restricted", + "message": ( + "Premium models are gated to HF staff. Pick a free model — " + "Kimi K2.6, MiniMax M2.7, GLM 5.1, or DeepSeek V4 Pro — " + "instead." + ), + }, + ) + + +async def _require_hf_for_gated_model(request: Request, model_id: str) -> None: + """403 if a non-``huggingface``-org user tries to select a gated model. + + Gated models are deployed paid endpoints backed by service-owned + credentials. The gate only fires for deployed paid models so non-HF users + can still freely switch between the free models. """ - if not _is_anthropic_model(model_id): + if not _is_gated_model(model_id): return if not await require_huggingface_org_member(request): - raise HTTPException( - status_code=403, - detail={ - "error": "anthropic_restricted", - "message": ( - "Opus is gated to HF staff. Pick a free model — " - "Kimi K2.6, MiniMax M2.7, or GLM 5.1 — instead." - ), - }, - ) + raise _premium_model_restricted_error() -async def _enforce_claude_quota( +async def _model_override_for_new_session( + request: Request, + requested_model: str | None, +) -> str | None: + """Return the model override to use when creating a new session. + + Explicit gated-model requests keep the hard membership gate. Implicit + default sessions are more forgiving: when the configured default is gated + and the user lacks access, start them on the first free model instead of + blocking session creation. + """ + resolved_model = requested_model or session_manager.config.model_name + if not _is_gated_model(resolved_model): + return requested_model + if await require_huggingface_org_member(request): + return requested_model + if requested_model: + raise _premium_model_restricted_error() + + logger.info( + "Default gated model %s is unavailable to this user; " + "creating session with free fallback %s", + resolved_model, + DEFAULT_FREE_MODEL_ID, + ) + return DEFAULT_FREE_MODEL_ID + + +async def _enforce_gated_model_quota( user: dict[str, Any], agent_session: AgentSession, ) -> None: - """Charge the user's daily Claude quota on first use of Anthropic in a session. + """Charge the user's daily gated-model quota on first use in a session. Runs at *message-submit* time, not session-create time — so spinning up a - Claude session to look around doesn't burn quota. The ``claude_counted`` - flag on ``AgentSession`` guards against re-counting the same session. + gated-model session to look around doesn't burn quota. The + ``claude_counted`` flag on ``AgentSession`` guards against re-counting the + same session; the stored field name is kept for persistence compatibility. - No-ops when the session's current model isn't Anthropic, or when this + No-ops when the session's current model isn't gated, or when this session has already been charged. Raises 429 when the user has hit their daily cap. """ if agent_session.claude_counted: return model_name = agent_session.session.config.model_name - if not _is_anthropic_model(model_name): + if not _is_gated_model(model_name): return user_id = user["user_id"] cap = user_quotas.daily_cap_for(user.get("plan")) @@ -126,11 +196,11 @@ async def _enforce_claude_quota( raise HTTPException( status_code=429, detail={ - "error": "claude_daily_cap", + "error": "premium_model_daily_cap", "plan": user.get("plan", "free"), "cap": cap, "message": ( - "Daily Claude limit reached. Upgrade to HF Pro for " + "Daily premium model limit reached. Upgrade to HF Pro for " f"{user_quotas.CLAUDE_PRO_DAILY}/day or use a free model." ), }, @@ -150,6 +220,7 @@ async def _check_session_access( session_id, user["user_id"], hf_token=hf_token, + hf_username=user.get("username"), ) if not agent_session: raise HTTPException(status_code=404, detail="Session not found") @@ -306,8 +377,8 @@ async def create_session( behalf of the user. Optional body ``{"model"?: }`` selects the session's LLM; unknown - ids are rejected (400). The Claude-quota gate runs at message-submit - time, not here — spinning up an Opus session to look around is free. + ids are rejected (400). The gated-model quota runs at message-submit + time, not here — spinning up a session to look around is free. Returns 503 if the server or user has reached the session limit. """ @@ -327,14 +398,14 @@ async def create_session( if model and model not in valid_ids: raise HTTPException(status_code=400, detail=f"Unknown model: {model}") - # Opus is gated to HF staff (PR #63). Only fires when the resolved model - # is Anthropic; free models pass through. - resolved_model = model or session_manager.config.model_name - await _require_hf_for_anthropic(request, resolved_model) + # Explicit premium selections remain gated. If the implicit configured + # default is unavailable, start the session on a free model instead. + model = await _model_override_for_new_session(request, model) try: session_id = await session_manager.create_session( user_id=user["user_id"], + hf_username=user.get("username"), hf_token=hf_token, model=model, is_pro=user.get("plan") == "pro", @@ -342,7 +413,11 @@ async def create_session( except SessionCapacityError as e: raise HTTPException(status_code=503, detail=str(e)) - return SessionResponse(session_id=session_id, ready=True) + return SessionResponse( + session_id=session_id, + ready=True, + model=model or session_manager.config.model_name, + ) @router.post("/session/restore-summary", response_model=SessionResponse) @@ -355,7 +430,7 @@ async def restore_session_summary( session's context as a user-role system note. Optional ``"model"`` in the body overrides the session's LLM. The - Claude-quota gate runs at message-submit time, not here. + gated-model quota runs at message-submit time, not here. """ messages = body.get("messages") if not isinstance(messages, list) or not messages: @@ -368,12 +443,12 @@ async def restore_session_summary( if model and model not in valid_ids: raise HTTPException(status_code=400, detail=f"Unknown model: {model}") - resolved_model = model or session_manager.config.model_name - await _require_hf_for_anthropic(request, resolved_model) + model = await _model_override_for_new_session(request, model) try: session_id = await session_manager.create_session( user_id=user["user_id"], + hf_username=user.get("username"), hf_token=hf_token, model=model, is_pro=user.get("plan") == "pro", @@ -393,7 +468,11 @@ async def restore_session_summary( f"Seeded session {session_id} for {user.get('username', 'unknown')} " f"(summary of {summarized} messages)" ) - return SessionResponse(session_id=session_id, ready=True) + return SessionResponse( + session_id=session_id, + ready=True, + model=model or session_manager.config.model_name, + ) @router.get("/session/{session_id}", response_model=SessionInfo) @@ -417,10 +496,10 @@ async def set_session_model( Takes effect on the next LLM call in that session — other sessions (including other browser tabs) are unaffected. Model switches don't - charge quota — the Claude-quota gate only fires at message-submit time. + charge quota — the gated-model quota only fires at message-submit time. - Switching TO an Anthropic model requires HF org membership (PR #63); - free-model switches are unrestricted. + Switching TO a gated deployed model requires HF org membership; free-model + and local-dev direct provider switches are unrestricted. """ agent_session = await _check_session_access(session_id, user, request) model_id = body.get("model") @@ -429,7 +508,7 @@ async def set_session_model( valid_ids = {m["id"] for m in AVAILABLE_MODELS} if model_id not in valid_ids: raise HTTPException(status_code=400, detail=f"Unknown model: {model_id}") - await _require_hf_for_anthropic(request, model_id) + await _require_hf_for_gated_model(request, model_id) if not agent_session: raise HTTPException(status_code=404, detail="Session not found") await session_manager.update_session_model(session_id, model_id) @@ -461,17 +540,38 @@ async def set_session_notifications( } +@router.patch("/session/{session_id}/yolo") +async def set_session_yolo( + session_id: str, + body: SessionYoloRequest, + user: dict = Depends(get_current_user), +) -> dict: + """Update the session-scoped auto-approval policy.""" + await _check_session_access(session_id, user) + try: + summary = await session_manager.update_session_auto_approval( + session_id, + enabled=body.enabled, + cost_cap_usd=body.cost_cap_usd, + cap_provided="cost_cap_usd" in body.model_fields_set, + ) + except ValueError as e: + raise HTTPException(status_code=400, detail=str(e)) + return {"session_id": session_id, **summary} + + @router.get("/user/quota") async def get_user_quota(user: dict = Depends(get_current_user)) -> dict: - """Return the user's plan tier and today's Claude-session quota state.""" + """Return the user's plan tier and today's premium-model quota state.""" plan = user.get("plan", "free") used = await user_quotas.get_claude_used_today(user["user_id"]) cap = user_quotas.daily_cap_for(plan) + remaining = max(0, cap - used) return { "plan": plan, - "claude_used_today": used, - "claude_daily_cap": cap, - "claude_remaining": max(0, cap - used), + "premium_used_today": used, + "premium_daily_cap": cap, + "premium_remaining": remaining, } @@ -500,6 +600,18 @@ async def list_sessions(user: dict = Depends(get_current_user)) -> list[SessionI return [SessionInfo(**s) for s in sessions] +@router.post("/session/{session_id}/sandbox/teardown") +async def teardown_session_sandbox( + session_id: str, user: dict = Depends(get_current_user) +) -> dict: + """Best-effort sandbox teardown that preserves durable chat history.""" + await _check_session_access(session_id, user) + task = asyncio.create_task(session_manager.teardown_sandbox(session_id)) + _background_teardown_tasks.add(task) + task.add_done_callback(_background_teardown_tasks.discard) + return {"status": "teardown_requested", "session_id": session_id} + + @router.delete("/session/{session_id}") async def delete_session( session_id: str, user: dict = Depends(get_current_user) @@ -518,7 +630,7 @@ async def submit_input( ) -> dict: """Submit user input to a session. Only accessible by the session owner.""" agent_session = await _check_session_access(request.session_id, user) - await _enforce_claude_quota(user, agent_session) + await _enforce_gated_model_quota(user, agent_session) success = await session_manager.submit_user_input(request.session_id, request.text) if not success: raise HTTPException(status_code=404, detail="Session not found or inactive") @@ -570,12 +682,12 @@ async def chat_sse( text = body.get("text") approvals = body.get("approvals") - # Gate user-message sends against the daily Claude quota. Approvals are + # Gate user-message sends against the daily gated-model quota. Approvals are # continuations of an in-progress turn — the session was already charged # on its first message, so we skip the gate there. if text is not None and not approvals: try: - await _enforce_claude_quota(user, agent_session) + await _enforce_gated_model_quota(user, agent_session) except HTTPException: broadcaster.unsubscribe(sub_id) raise diff --git a/backend/session_manager.py b/backend/session_manager.py index 91650859..af8ac8bf 100644 --- a/backend/session_manager.py +++ b/backend/session_manager.py @@ -87,6 +87,7 @@ class AgentSession: tool_router: ToolRouter submission_queue: asyncio.Queue user_id: str = "dev" # Owner of this session + hf_username: str | None = None # HF namespace used for personal trace uploads hf_token: str | None = None # User's HF OAuth token for tool execution task: asyncio.Task | None = None created_at: datetime = field(default_factory=datetime.utcnow) @@ -115,6 +116,7 @@ def __init__(self, message: str, error_type: str = "global") -> None: # and per-request overhead. MAX_SESSIONS: int = 200 MAX_SESSIONS_PER_USER: int = 10 +DEFAULT_YOLO_COST_CAP_USD: float = 5.0 class SessionManager: @@ -157,6 +159,7 @@ def _create_session_sync( *, session_id: str, user_id: str, + hf_username: str | None, hf_token: str | None, model: str | None, event_queue: asyncio.Queue, @@ -178,6 +181,7 @@ def _create_session_sync( tool_router=tool_router, hf_token=hf_token, user_id=user_id, + hf_username=hf_username, notification_gateway=self.messaging_gateway, notification_destinations=notification_destinations or [], session_id=session_id, @@ -294,6 +298,20 @@ def _runtime_state(agent_session: AgentSession) -> str: return "ended" return "idle" + @staticmethod + def _auto_approval_summary(session: Session) -> dict[str, Any]: + if hasattr(session, "auto_approval_policy_summary"): + return session.auto_approval_policy_summary() + cap = getattr(session, "auto_approval_cost_cap_usd", None) + estimated = float(getattr(session, "auto_approval_estimated_spend_usd", 0.0) or 0.0) + remaining = None if cap is None else round(max(0.0, float(cap) - estimated), 4) + return { + "enabled": bool(getattr(session, "auto_approval_enabled", False)), + "cost_cap_usd": cap, + "estimated_spend_usd": round(estimated, 4), + "remaining_usd": remaining, + } + async def _start_agent_session( self, *, @@ -318,6 +336,20 @@ async def _start_agent_session( agent_session.task = task return agent_session + @staticmethod + def _start_cpu_sandbox_preload(agent_session: AgentSession) -> None: + """Kick off a best-effort cpu-basic sandbox for the session.""" + try: + from agent.tools.sandbox_tool import start_cpu_sandbox_preload + + start_cpu_sandbox_preload(agent_session.session) + except Exception as e: + logger.warning( + "Failed to start CPU sandbox preload for %s: %s", + agent_session.session_id, + e, + ) + @staticmethod def _can_access_session(agent_session: AgentSession, user_id: str) -> bool: return ( @@ -327,11 +359,18 @@ def _can_access_session(agent_session: AgentSession, user_id: str) -> bool: ) @staticmethod - def _update_hf_token(agent_session: AgentSession, hf_token: str | None) -> None: - if not hf_token: - return - agent_session.hf_token = hf_token - agent_session.session.hf_token = hf_token + def _update_hf_identity( + agent_session: AgentSession, + *, + hf_token: str | None, + hf_username: str | None, + ) -> None: + if hf_token: + agent_session.hf_token = hf_token + agent_session.session.hf_token = hf_token + if hf_username: + agent_session.hf_username = hf_username + agent_session.session.hf_username = hf_username async def persist_session_snapshot( self, @@ -360,6 +399,20 @@ async def persist_session_snapshot( notification_destinations=list( agent_session.session.notification_destinations ), + auto_approval_enabled=bool( + getattr(agent_session.session, "auto_approval_enabled", False) + ), + auto_approval_cost_cap_usd=getattr( + agent_session.session, "auto_approval_cost_cap_usd", None + ), + auto_approval_estimated_spend_usd=float( + getattr( + agent_session.session, + "auto_approval_estimated_spend_usd", + 0.0, + ) + or 0.0 + ), ) except Exception as e: logger.warning( @@ -373,13 +426,18 @@ async def ensure_session_loaded( session_id: str, user_id: str, hf_token: str | None = None, + hf_username: str | None = None, ) -> AgentSession | None: """Return a live runtime session, lazily restoring it from Mongo.""" async with self._lock: existing = self.sessions.get(session_id) if existing: if self._can_access_session(existing, user_id): - self._update_hf_token(existing, hf_token) + self._update_hf_identity( + existing, + hf_token=hf_token, + hf_username=hf_username, + ) return existing return None @@ -392,7 +450,11 @@ async def ensure_session_loaded( existing = self.sessions.get(session_id) if existing: if self._can_access_session(existing, user_id): - self._update_hf_token(existing, hf_token) + self._update_hf_identity( + existing, + hf_token=hf_token, + hf_username=hf_username, + ) return existing return None @@ -410,6 +472,7 @@ async def ensure_session_loaded( self._create_session_sync, session_id=session_id, user_id=owner or user_id, + hf_username=hf_username, hf_token=hf_token, model=model, event_queue=event_queue, @@ -431,6 +494,14 @@ async def ensure_session_loaded( self._restore_pending_approval(session, meta.get("pending_approval") or []) session.turn_count = int(meta.get("turn_count") or 0) + session.auto_approval_enabled = bool(meta.get("auto_approval_enabled", False)) + raw_cap = meta.get("auto_approval_cost_cap_usd") + session.auto_approval_cost_cap_usd = ( + float(raw_cap) if isinstance(raw_cap, int | float) else None + ) + session.auto_approval_estimated_spend_usd = float( + meta.get("auto_approval_estimated_spend_usd") or 0.0 + ) created_at = meta.get("created_at") if not isinstance(created_at, datetime): @@ -442,6 +513,7 @@ async def ensure_session_loaded( tool_router=tool_router, submission_queue=submission_queue, user_id=owner or user_id, + hf_username=hf_username, hf_token=hf_token, created_at=created_at, is_active=True, @@ -455,14 +527,20 @@ async def ensure_session_loaded( tool_router=tool_router, ) if started is not agent_session: - self._update_hf_token(started, hf_token) + self._update_hf_identity( + started, + hf_token=hf_token, + hf_username=hf_username, + ) return started + self._start_cpu_sandbox_preload(agent_session) logger.info("Restored session %s for user %s", session_id, owner or user_id) return agent_session async def create_session( self, user_id: str = "dev", + hf_username: str | None = None, hf_token: str | None = None, model: str | None = None, is_pro: bool | None = None, @@ -475,6 +553,7 @@ async def create_session( Args: user_id: The ID of the user who owns this session. + hf_username: The HF username/namespace used for personal trace uploads. hf_token: The user's HF OAuth token, stored for tool execution. model: Optional model override. When set, replaces ``model_name`` on the per-session config clone. None falls back to the @@ -513,6 +592,7 @@ async def create_session( self._create_session_sync, session_id=session_id, user_id=user_id, + hf_username=hf_username, hf_token=hf_token, model=model, event_queue=event_queue, @@ -525,6 +605,7 @@ async def create_session( tool_router=tool_router, submission_queue=submission_queue, user_id=user_id, + hf_username=hf_username, hf_token=hf_token, ) @@ -533,6 +614,7 @@ async def create_session( event_queue=event_queue, tool_router=tool_router, ) + self._start_cpu_sandbox_preload(agent_session) await self.persist_session_snapshot(agent_session, runtime_state="idle") if is_pro is not None and user_id and user_id != "dev": @@ -612,6 +694,8 @@ async def seed_from_summary(self, session_id: str, messages: list[dict]) -> int: max_tokens=4000, prompt=_RESTORE_PROMPT, tool_specs=tool_specs, + session=session, + kind="restore", ) except Exception as e: logger.error("Summary call failed during seed: %s", e) @@ -637,27 +721,9 @@ async def _cleanup_sandbox(session: Session) -> None: with exponential backoff. A single missed delete = a permanently orphaned Space, so the cost of an extra retry beats the alternative. """ - sandbox = getattr(session, "sandbox", None) - if not (sandbox and getattr(sandbox, "_owns_space", False)): - return + from agent.tools.sandbox_tool import teardown_session_sandbox - space_id = getattr(sandbox, "space_id", None) - last_err: Exception | None = None - for attempt in range(3): - try: - logger.info(f"Deleting sandbox {space_id} (attempt {attempt + 1}/3)...") - await asyncio.to_thread(sandbox.delete) - from agent.core import telemetry - await telemetry.record_sandbox_destroy(session, sandbox) - return - except Exception as e: - last_err = e - if attempt < 2: - await asyncio.sleep(2 ** attempt) - logger.error( - f"Failed to delete sandbox {space_id} after 3 attempts: {last_err}. " - f"Orphan — sweep script will pick it up." - ) + await teardown_session_sandbox(session) async def _run_session( self, @@ -837,6 +903,18 @@ async def delete_session(self, session_id: str) -> bool: return True + async def teardown_sandbox(self, session_id: str) -> bool: + """Delete only this session's sandbox runtime, preserving chat state.""" + async with self._lock: + agent_session = self.sessions.get(session_id) + + if not agent_session or not agent_session.is_active: + return False + + await self._cleanup_sandbox(agent_session.session) + await self.persist_session_snapshot(agent_session, runtime_state="idle") + return True + async def update_session_title(self, session_id: str, title: str | None) -> None: """Persist a user-visible title for sidebar rehydration.""" agent_session = self.sessions.get(session_id) @@ -852,6 +930,43 @@ async def update_session_model(self, session_id: str, model_id: str) -> bool: await self.persist_session_snapshot(agent_session, runtime_state="idle") return True + async def update_session_auto_approval( + self, + session_id: str, + *, + enabled: bool, + cost_cap_usd: float | None, + cap_provided: bool = False, + ) -> dict[str, Any]: + agent_session = self.sessions.get(session_id) + if not agent_session or not agent_session.is_active: + raise ValueError("Session not found or inactive") + + session = agent_session.session + if enabled: + if not cap_provided and cost_cap_usd is None: + cost_cap_usd = getattr( + session, "auto_approval_cost_cap_usd", None + ) + if cost_cap_usd is None: + cost_cap_usd = DEFAULT_YOLO_COST_CAP_USD + elif cost_cap_usd is None: + cost_cap_usd = DEFAULT_YOLO_COST_CAP_USD + else: + if not cap_provided: + cost_cap_usd = getattr(session, "auto_approval_cost_cap_usd", None) + + if hasattr(session, "set_auto_approval_policy"): + session.set_auto_approval_policy( + enabled=enabled, + cost_cap_usd=cost_cap_usd, + ) + else: + session.auto_approval_enabled = bool(enabled) + session.auto_approval_cost_cap_usd = cost_cap_usd + await self.persist_session_snapshot(agent_session) + return self._auto_approval_summary(session) + def get_session_owner(self, session_id: str) -> str | None: """Get the user_id that owns a session, or None if session doesn't exist.""" agent_session = self.sessions.get(session_id) @@ -894,6 +1009,7 @@ def get_session_info(self, session_id: str) -> dict[str, Any] | None: "notification_destinations": list( agent_session.session.notification_destinations ), + "auto_approval": self._auto_approval_summary(agent_session.session), } def set_notification_destinations( @@ -960,6 +1076,25 @@ async def list_sessions(self, user_id: str | None = None) -> list[dict[str, Any] "model": row.get("model"), "title": row.get("title"), "notification_destinations": row.get("notification_destinations") or [], + "auto_approval": { + "enabled": bool(row.get("auto_approval_enabled", False)), + "cost_cap_usd": row.get("auto_approval_cost_cap_usd"), + "estimated_spend_usd": float( + row.get("auto_approval_estimated_spend_usd") or 0.0 + ), + "remaining_usd": ( + None + if row.get("auto_approval_cost_cap_usd") is None + else round( + max( + 0.0, + float(row.get("auto_approval_cost_cap_usd") or 0.0) + - float(row.get("auto_approval_estimated_spend_usd") or 0.0), + ), + 4, + ) + ), + }, } ) return results diff --git a/backend/user_quotas.py b/backend/user_quotas.py index 94b1b027..94ce92f0 100644 --- a/backend/user_quotas.py +++ b/backend/user_quotas.py @@ -1,12 +1,15 @@ -"""Daily quota for Claude session creations. +"""Daily quota for premium model session creations. -Tracks per-user Claude session starts against a daily cap derived from the -user's HF plan. MongoDB is the source of truth when configured; the +Tracks per-user premium model session starts against a daily cap derived from +the user's HF plan. MongoDB is the source of truth when configured; the in-process dict remains the fallback for local/dev/test runs. -Unit: session *creations*, not messages. A user who selects Claude in a new -session consumes one quota point; switching an existing Claude session to -Claude again doesn't (`AgentSession.claude_counted` guards that). +The public names still say ``claude`` because this quota bucket originally +only covered Claude and the persisted session field uses that name. + +Unit: session *creations*, not messages. A user who sends with a premium model +in a new session consumes one quota point; switching an already-counted session +back to a premium model doesn't (`AgentSession.claude_counted` guards that). Cap tiers: free user → CLAUDE_FREE_DAILY (1) diff --git a/bin/verify.sh b/bin/verify.sh new file mode 100755 index 00000000..6c0cec8f --- /dev/null +++ b/bin/verify.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash +# verify.sh — phase verifier router +# Usage: ./bin/verify.sh +# Returns: 0 = pass, 1 = fail +# Phases: p0_5_d1, p0_5, p1, p2, ... (add as you go) + +set -euo pipefail + +PHASE="${1:-}" + +if [[ -z "$PHASE" ]]; then + echo "Usage: $0 " + echo "Phases:" + ls bin/verify_*.sh 2>/dev/null | sed 's|bin/verify_| |;s|\.sh||' + exit 1 +fi + +SCRIPT="bin/verify_${PHASE}.sh" +if [[ ! -f "$SCRIPT" ]]; then + echo "❌ No verifier for phase '$PHASE'" + echo "Expected: $SCRIPT" + exit 1 +fi + +echo "=== Verifying phase: $PHASE ===" +exec bash "$SCRIPT" diff --git a/bin/verify_p0_5_d1.sh b/bin/verify_p0_5_d1.sh new file mode 100755 index 00000000..ef617287 --- /dev/null +++ b/bin/verify_p0_5_d1.sh @@ -0,0 +1,100 @@ +#!/usr/bin/env bash +# Verify P0.5 D1: library restructure — cosmos_lab/ package importable, dual-path works. +# Workflow: this is the "verifier" per Phase 1 DEFINE — must return pass/fail, not narrative. + +set -uo pipefail + +PASS=0 +FAIL=0 + +check() { + local name="$1" + shift + if "$@" >/dev/null 2>&1; then + echo " ✅ $name" + PASS=$((PASS+1)) + else + echo " ❌ $name" + FAIL=$((FAIL+1)) + fi +} + +echo "--- Structure ---" +check "cosmos_lab/ directory exists" test -d cosmos_lab +check "cosmos_lab/__init__.py exists" test -f cosmos_lab/__init__.py +check "cosmos_lab/identity/__init__.py exists" test -f cosmos_lab/identity/__init__.py + +echo "--- Imports (new path) ---" +check "from cosmos_lab.identity import AgentIdentity" \ + python -c "from cosmos_lab.identity import AgentIdentity" +check "from cosmos_lab.identity import AuditLog" \ + python -c "from cosmos_lab.identity import AuditLog" +check "from cosmos_lab.identity import CapabilityScopedRouter" \ + python -c "from cosmos_lab.identity import CapabilityScopedRouter" +check "from cosmos_lab.identity import CapabilityDenied" \ + python -c "from cosmos_lab.identity import CapabilityDenied" +check "from cosmos_lab import OptimizationConfig" \ + python -c "from cosmos_lab import OptimizationConfig" + +echo "--- Backward compat (old path still works) ---" +check "from agent.optimization.identity import AgentIdentity" \ + python -c "from agent.optimization.identity import AgentIdentity" +check "from agent.optimization import OptimizationConfig" \ + python -c "from agent.optimization import OptimizationConfig" + +echo "--- Packaging ---" +check "pyproject.toml mentions cosmos-lab" \ + grep -q "cosmos.lab" pyproject.toml +check "pyproject.toml has [project.optional-dependencies]" \ + grep -q "optional-dependencies" pyproject.toml + +echo "--- Tests ---" +echo " Running cosmos-lab tests..." +if uv run python -m pytest tests/optimization/ -q 2>&1 | tail -5; then + echo " ✅ tests/optimization/ exits 0" + PASS=$((PASS+1)) +else + echo " ❌ tests/optimization/ failed" + FAIL=$((FAIL+1)) +fi + +echo "--- Zero-diff invariant ---" +# Owned paths/files (allowed to differ from upstream/main): +# - cosmos_lab/, agent/optimization/, configs/optimization, tests/optimization (code) +# - docs/, bin/ (this harness) +# - pyproject.toml (packaging — we own extras + packages.find for cosmos_lab) +# - uv.lock (derived from pyproject.toml — auto-regenerated by uv sync) +# - CLAUDE.md, AGENTS* (always-loaded harness) +# - PLAN_V2.md, PLAN.md, SYSTEM.md, EVAL_SPEC.md, WORKFLOW.md, +# RESEARCH_AHE_ANALYSIS.md, agentic_build_workflow.md (owned planning docs) +DIFF=$(git diff upstream/main --name-only 2>/dev/null \ + | grep -v "^cosmos_lab/" \ + | grep -v "^agent/optimization/" \ + | grep -v "^configs/optimization" \ + | grep -v "^tests/optimization" \ + | grep -v "^docs/" \ + | grep -v "^bin/" \ + | grep -v "^pyproject.toml" \ + | grep -v "^uv\.lock" \ + | grep -v "^CLAUDE.md" \ + | grep -v "^AGENTS" \ + | grep -v "^PLAN_V2.md$" \ + | grep -v "^PLAN.md$" \ + | grep -v "^SYSTEM.md$" \ + | grep -v "^EVAL_SPEC.md$" \ + | grep -v "^WORKFLOW.md$" \ + | grep -v "^RESEARCH_AHE_ANALYSIS.md$" \ + | grep -v "^AGENTIC_EVAL_SPEC.md$" \ + | grep -v "^agentic_build_workflow" || true) +if [[ -z "$DIFF" ]]; then + echo " ✅ git diff upstream/main --name-only shows owned paths only" + PASS=$((PASS+1)) +else + echo " ❌ Unexpected upstream diff:" + echo "$DIFF" | sed 's/^/ /' + FAIL=$((FAIL+1)) +fi + +echo +echo "===== P0.5 D1 verifier: $PASS pass, $FAIL fail =====" +[[ $FAIL -eq 0 ]] && exit 0 || exit 1 diff --git a/bin/verify_p0_5_d2.sh b/bin/verify_p0_5_d2.sh new file mode 100755 index 00000000..d38949ab --- /dev/null +++ b/bin/verify_p0_5_d2.sh @@ -0,0 +1,106 @@ +#!/usr/bin/env bash +# Verify P0.5 D2: cosmos_lab.harness.ml_intern adapter ships and works. +# Per workflow Phase 1 DEFINE — verifier returns pass/fail, not narrative. + +set -uo pipefail + +PASS=0 +FAIL=0 + +check() { + local name="$1" + shift + if "$@" >/dev/null 2>&1; then + echo " ✅ $name" + PASS=$((PASS+1)) + else + echo " ❌ $name" + FAIL=$((FAIL+1)) + fi +} + +echo "--- Structure ---" +check "cosmos_lab/harness/__init__.py exists" test -f cosmos_lab/harness/__init__.py +check "cosmos_lab/harness/ml_intern.py exists" test -f cosmos_lab/harness/ml_intern.py +check "tests/optimization/harness/__init__.py exists" test -f tests/optimization/harness/__init__.py +check "tests/optimization/harness/test_ml_intern_adapter.py exists" \ + test -f tests/optimization/harness/test_ml_intern_adapter.py + +echo "--- Imports ---" +check "from cosmos_lab.harness import install_into_session" \ + python -c "from cosmos_lab.harness import install_into_session" +check "from cosmos_lab.harness.ml_intern import install_into_session" \ + python -c "from cosmos_lab.harness.ml_intern import install_into_session" + +echo "--- Adapter LOC budget (≤ 200 LOC) ---" +LOC=$(wc -l < cosmos_lab/harness/ml_intern.py | tr -d ' ') +if [[ $LOC -le 200 ]]; then + echo " ✅ ml_intern.py is ${LOC} LOC (budget 200)" + PASS=$((PASS+1)) +else + echo " ❌ ml_intern.py is ${LOC} LOC (budget 200 — over)" + FAIL=$((FAIL+1)) +fi + +echo "--- Smoke tests ---" +echo " Running adapter tests..." +if uv run python -m pytest tests/optimization/harness/ -q 2>&1 | tail -3; then + PASS_COUNT=$(uv run python -m pytest tests/optimization/harness/ -q 2>&1 | grep -oE "[0-9]+ passed" | head -1 | grep -oE "[0-9]+") + echo " ✅ tests/optimization/harness/ exits 0 (${PASS_COUNT:-?} tests passed)" + PASS=$((PASS+1)) +else + echo " ❌ tests/optimization/harness/ failed" + FAIL=$((FAIL+1)) +fi + +echo "--- D1 still green (no regression) ---" +if ./bin/verify_p0_5_d1.sh > /dev/null 2>&1; then + echo " ✅ P0.5 D1 verifier still 14/14" + PASS=$((PASS+1)) +else + echo " ❌ P0.5 D1 verifier regressed!" + FAIL=$((FAIL+1)) +fi + +echo "--- Upstream baseline preserved ---" +UPSTREAM_FAILS=$(uv run python -m pytest tests/unit/ -q 2>&1 | grep -oE "[0-9]+ failed" | head -1 | grep -oE "[0-9]+") +if [[ "${UPSTREAM_FAILS:-0}" -le 3 ]]; then + echo " ✅ tests/unit/ shows ${UPSTREAM_FAILS:-0} failures (≤ 3 known-broken baseline)" + PASS=$((PASS+1)) +else + echo " ❌ tests/unit/ shows ${UPSTREAM_FAILS} failures (> 3 — regression!)" + FAIL=$((FAIL+1)) +fi + +echo "--- Zero-diff invariant ---" +DIFF=$(git diff upstream/main --name-only 2>/dev/null \ + | grep -v "^cosmos_lab/" \ + | grep -v "^agent/optimization/" \ + | grep -v "^configs/optimization" \ + | grep -v "^tests/optimization" \ + | grep -v "^docs/" \ + | grep -v "^bin/" \ + | grep -v "^pyproject.toml" \ + | grep -v "^uv\.lock" \ + | grep -v "^CLAUDE.md" \ + | grep -v "^AGENTS" \ + | grep -v "^PLAN_V2.md$" \ + | grep -v "^PLAN.md$" \ + | grep -v "^SYSTEM.md$" \ + | grep -v "^EVAL_SPEC.md$" \ + | grep -v "^WORKFLOW.md$" \ + | grep -v "^RESEARCH_AHE_ANALYSIS.md$" \ + | grep -v "^AGENTIC_EVAL_SPEC.md$" \ + | grep -v "^agentic_build_workflow" || true) +if [[ -z "$DIFF" ]]; then + echo " ✅ git diff upstream/main --name-only shows owned paths only" + PASS=$((PASS+1)) +else + echo " ❌ Unexpected upstream diff:" + echo "$DIFF" | sed 's/^/ /' + FAIL=$((FAIL+1)) +fi + +echo +echo "===== P0.5 D2 verifier: $PASS pass, $FAIL fail =====" +[[ $FAIL -eq 0 ]] && exit 0 || exit 1 diff --git a/bin/verify_p0_5_d3.sh b/bin/verify_p0_5_d3.sh new file mode 100755 index 00000000..97df4c11 --- /dev/null +++ b/bin/verify_p0_5_d3.sh @@ -0,0 +1,116 @@ +#!/usr/bin/env bash +# Verify P0.5 D3: cosmos_lab.harness.nat lightweight wrapper ships and works. +# Per v5.1 architecture (PLAN_V2 §0.4.5): nat is deployment wrapper, not +# runtime substrate. D3 ships the registration shim + smoke test. + +set -uo pipefail + +PASS=0 +FAIL=0 + +check() { + local name="$1" + shift + if "$@" >/dev/null 2>&1; then + echo " ✅ $name" + PASS=$((PASS+1)) + else + echo " ❌ $name" + FAIL=$((FAIL+1)) + fi +} + +echo "--- Structure ---" +check "cosmos_lab/harness/nat.py exists" test -f cosmos_lab/harness/nat.py +check "tests/optimization/harness/test_nat_adapter.py exists" \ + test -f tests/optimization/harness/test_nat_adapter.py + +echo "--- Imports ---" +check "from cosmos_lab.harness import register_as_nat_tool" \ + python -c "from cosmos_lab.harness import register_as_nat_tool" +check "from cosmos_lab.harness.nat import register_as_nat_tool" \ + python -c "from cosmos_lab.harness.nat import register_as_nat_tool" + +echo "--- LOC budget (~50 LOC target, 200 LOC hard cap per v5.1) ---" +LOC=$(grep -vcE "^\s*(#|$|\"\"\")" cosmos_lab/harness/nat.py) +TOTAL_LOC=$(wc -l < cosmos_lab/harness/nat.py | tr -d ' ') +if [[ $TOTAL_LOC -le 200 ]]; then + echo " ✅ nat.py is ${TOTAL_LOC} LOC total (~${LOC} non-comment, budget 200)" + PASS=$((PASS+1)) +else + echo " ❌ nat.py is ${TOTAL_LOC} LOC (over 200 cap — v5.1 expected ~50)" + FAIL=$((FAIL+1)) +fi + +echo "--- Smoke tests ---" +echo " Running adapter tests..." +TEST_OUTPUT=$(uv run python -m pytest tests/optimization/harness/test_nat_adapter.py -q 2>&1) +if echo "$TEST_OUTPUT" | grep -qE "passed"; then + PASS_COUNT=$(echo "$TEST_OUTPUT" | grep -oE "[0-9]+ passed" | head -1 | grep -oE "[0-9]+") + echo " ✅ tests/optimization/harness/test_nat_adapter.py exits 0 (${PASS_COUNT:-?} tests passed)" + PASS=$((PASS+1)) +else + echo " ❌ tests/optimization/harness/test_nat_adapter.py failed" + echo "$TEST_OUTPUT" | tail -10 | sed 's/^/ /' + FAIL=$((FAIL+1)) +fi + +echo "--- D1 + D2 still green (no regression) ---" +if ./bin/verify_p0_5_d1.sh > /dev/null 2>&1; then + echo " ✅ P0.5 D1 verifier still 14/14" + PASS=$((PASS+1)) +else + echo " ❌ P0.5 D1 verifier regressed!" + FAIL=$((FAIL+1)) +fi + +if ./bin/verify_p0_5_d2.sh > /dev/null 2>&1; then + echo " ✅ P0.5 D2 verifier still 11/11" + PASS=$((PASS+1)) +else + echo " ❌ P0.5 D2 verifier regressed!" + FAIL=$((FAIL+1)) +fi + +echo "--- Upstream baseline preserved ---" +UPSTREAM_FAILS=$(uv run python -m pytest tests/unit/ -q 2>&1 | grep -oE "[0-9]+ failed" | head -1 | grep -oE "[0-9]+") +if [[ "${UPSTREAM_FAILS:-0}" -le 3 ]]; then + echo " ✅ tests/unit/ shows ${UPSTREAM_FAILS:-0} failures (≤ 3 known-broken baseline)" + PASS=$((PASS+1)) +else + echo " ❌ tests/unit/ shows ${UPSTREAM_FAILS} failures (> 3 — regression!)" + FAIL=$((FAIL+1)) +fi + +echo "--- Zero-diff invariant ---" +DIFF=$(git diff upstream/main --name-only 2>/dev/null \ + | grep -v "^cosmos_lab/" \ + | grep -v "^agent/optimization/" \ + | grep -v "^configs/optimization" \ + | grep -v "^tests/optimization" \ + | grep -v "^docs/" \ + | grep -v "^bin/" \ + | grep -v "^pyproject.toml" \ + | grep -v "^uv\.lock" \ + | grep -v "^CLAUDE.md" \ + | grep -v "^AGENTS" \ + | grep -v "^PLAN_V2.md$" \ + | grep -v "^PLAN.md$" \ + | grep -v "^SYSTEM.md$" \ + | grep -v "^EVAL_SPEC.md$" \ + | grep -v "^WORKFLOW.md$" \ + | grep -v "^RESEARCH_AHE_ANALYSIS.md$" \ + | grep -v "^AGENTIC_EVAL_SPEC.md$" \ + | grep -v "^agentic_build_workflow" || true) +if [[ -z "$DIFF" ]]; then + echo " ✅ git diff upstream/main --name-only shows owned paths only" + PASS=$((PASS+1)) +else + echo " ❌ Unexpected upstream diff:" + echo "$DIFF" | sed 's/^/ /' + FAIL=$((FAIL+1)) +fi + +echo +echo "===== P0.5 D3 verifier: $PASS pass, $FAIL fail =====" +[[ $FAIL -eq 0 ]] && exit 0 || exit 1 diff --git a/bin/verify_p0_5_d4.sh b/bin/verify_p0_5_d4.sh new file mode 100755 index 00000000..b6a72745 --- /dev/null +++ b/bin/verify_p0_5_d4.sh @@ -0,0 +1,122 @@ +#!/usr/bin/env bash +# Verify P0.5 D4: cosmos_lab/harness/CONTRACT.md + parametrized adapter contract tests. +# Per workflow Phase 1 DEFINE — verifier returns pass/fail, not narrative. + +set -uo pipefail + +PASS=0 +FAIL=0 + +check() { + local name="$1" + shift + if "$@" >/dev/null 2>&1; then + echo " ✅ $name" + PASS=$((PASS+1)) + else + echo " ❌ $name" + FAIL=$((FAIL+1)) + fi +} + +echo "--- Structure ---" +check "cosmos_lab/harness/CONTRACT.md exists" test -f cosmos_lab/harness/CONTRACT.md +check "tests/optimization/harness/test_adapter_contract.py exists" \ + test -f tests/optimization/harness/test_adapter_contract.py + +echo "--- CONTRACT.md content quality ---" +check "CONTRACT.md mentions Family A (execution substrate)" \ + grep -q "Family A" cosmos_lab/harness/CONTRACT.md +check "CONTRACT.md mentions Family B (deployment surface)" \ + grep -q "Family B" cosmos_lab/harness/CONTRACT.md +check "CONTRACT.md documents shared requirements S1-S5" \ + bash -c 'grep -q "S1.*Idempotency" cosmos_lab/harness/CONTRACT.md && grep -q "S5" cosmos_lab/harness/CONTRACT.md' +check "CONTRACT.md documents both shipped adapters" \ + bash -c 'grep -q "install_into_session" cosmos_lab/harness/CONTRACT.md && grep -q "register_as_nat_tool" cosmos_lab/harness/CONTRACT.md' + +echo "--- Parametrized contract tests ---" +echo " Running adapter contract tests..." +TEST_OUTPUT=$(uv run python -m pytest tests/optimization/harness/test_adapter_contract.py -q 2>&1) +if echo "$TEST_OUTPUT" | grep -qE "passed"; then + PASS_COUNT=$(echo "$TEST_OUTPUT" | grep -oE "[0-9]+ passed" | head -1 | grep -oE "[0-9]+") + echo " ✅ test_adapter_contract.py exits 0 (${PASS_COUNT:-?} tests passed)" + PASS=$((PASS+1)) +else + echo " ❌ test_adapter_contract.py failed" + echo "$TEST_OUTPUT" | tail -10 | sed 's/^/ /' + FAIL=$((FAIL+1)) +fi + +echo "--- Coverage: both shipped adapters in registry ---" +check "test_adapter_contract.py registers ml_intern" \ + grep -q '"ml_intern"' tests/optimization/harness/test_adapter_contract.py +check "test_adapter_contract.py registers nat" \ + grep -q '"nat"' tests/optimization/harness/test_adapter_contract.py + +echo "--- D1 + D2 + D3 still green (no regression) ---" +if ./bin/verify_p0_5_d1.sh > /dev/null 2>&1; then + echo " ✅ P0.5 D1 verifier still 14/14" + PASS=$((PASS+1)) +else + echo " ❌ P0.5 D1 verifier regressed!" + FAIL=$((FAIL+1)) +fi + +if ./bin/verify_p0_5_d2.sh > /dev/null 2>&1; then + echo " ✅ P0.5 D2 verifier still 11/11" + PASS=$((PASS+1)) +else + echo " ❌ P0.5 D2 verifier regressed!" + FAIL=$((FAIL+1)) +fi + +if ./bin/verify_p0_5_d3.sh > /dev/null 2>&1; then + echo " ✅ P0.5 D3 verifier still 10/10" + PASS=$((PASS+1)) +else + echo " ❌ P0.5 D3 verifier regressed!" + FAIL=$((FAIL+1)) +fi + +echo "--- Upstream baseline preserved ---" +UPSTREAM_FAILS=$(uv run python -m pytest tests/unit/ -q 2>&1 | grep -oE "[0-9]+ failed" | head -1 | grep -oE "[0-9]+") +if [[ "${UPSTREAM_FAILS:-0}" -le 3 ]]; then + echo " ✅ tests/unit/ shows ${UPSTREAM_FAILS:-0} failures (≤ 3 known-broken baseline)" + PASS=$((PASS+1)) +else + echo " ❌ tests/unit/ shows ${UPSTREAM_FAILS} failures (> 3 — regression!)" + FAIL=$((FAIL+1)) +fi + +echo "--- Zero-diff invariant ---" +DIFF=$(git diff upstream/main --name-only 2>/dev/null \ + | grep -v "^cosmos_lab/" \ + | grep -v "^agent/optimization/" \ + | grep -v "^configs/optimization" \ + | grep -v "^tests/optimization" \ + | grep -v "^docs/" \ + | grep -v "^bin/" \ + | grep -v "^pyproject.toml" \ + | grep -v "^uv\.lock" \ + | grep -v "^CLAUDE.md" \ + | grep -v "^AGENTS" \ + | grep -v "^PLAN_V2.md$" \ + | grep -v "^PLAN.md$" \ + | grep -v "^SYSTEM.md$" \ + | grep -v "^EVAL_SPEC.md$" \ + | grep -v "^WORKFLOW.md$" \ + | grep -v "^RESEARCH_AHE_ANALYSIS.md$" \ + | grep -v "^AGENTIC_EVAL_SPEC.md$" \ + | grep -v "^agentic_build_workflow" || true) +if [[ -z "$DIFF" ]]; then + echo " ✅ git diff upstream/main --name-only shows owned paths only" + PASS=$((PASS+1)) +else + echo " ❌ Unexpected upstream diff:" + echo "$DIFF" | sed 's/^/ /' + FAIL=$((FAIL+1)) +fi + +echo +echo "===== P0.5 D4 verifier: $PASS pass, $FAIL fail =====" +[[ $FAIL -eq 0 ]] && exit 0 || exit 1 diff --git a/configs/cli_agent_config.json b/configs/cli_agent_config.json index 5c6a22a3..ed247998 100644 --- a/configs/cli_agent_config.json +++ b/configs/cli_agent_config.json @@ -2,6 +2,8 @@ "model_name": "anthropic/claude-opus-4-6", "save_sessions": true, "session_dataset_repo": "smolagents/ml-intern-sessions", + "share_traces": true, + "personal_trace_repo_template": "{hf_user}/ml-intern-sessions", "yolo_mode": false, "confirm_cpu_jobs": true, "auto_file_upload": true, diff --git a/configs/frontend_agent_config.json b/configs/frontend_agent_config.json index c73ea380..c674a223 100644 --- a/configs/frontend_agent_config.json +++ b/configs/frontend_agent_config.json @@ -1,7 +1,9 @@ { - "model_name": "bedrock/us.anthropic.claude-opus-4-6-v1", + "model_name": "${ML_INTERN_CLAUDE_MODEL_ID:-bedrock/us.anthropic.claude-opus-4-6-v1}", "save_sessions": true, "session_dataset_repo": "smolagents/ml-intern-sessions", + "share_traces": true, + "personal_trace_repo_template": "{hf_user}/ml-intern-sessions", "yolo_mode": false, "confirm_cpu_jobs": true, "auto_file_upload": true, diff --git a/configs/optimization_agent_config.json b/configs/optimization_agent_config.json new file mode 100644 index 00000000..885f52b3 --- /dev/null +++ b/configs/optimization_agent_config.json @@ -0,0 +1,15 @@ +{ + "model_name": "anthropic/claude-sonnet-4-6", + "mcpServers": {}, + "save_sessions": true, + "share_traces": true, + "yolo_mode": false, + "max_iterations": 300, + "reasoning_effort": "high", + "optimization_target": null, + "target_hardware": null, + "quality_budget": 0.98, + "optimization_loop_enabled": true, + "audit_log_path": "~/.cosmos_lab/audit.jsonl", + "trajectory_db_path": "~/.cosmos_lab/trajectories.duckdb" +} diff --git a/cosmos_lab/__init__.py b/cosmos_lab/__init__.py new file mode 100644 index 00000000..a322288b --- /dev/null +++ b/cosmos_lab/__init__.py @@ -0,0 +1,36 @@ +"""cosmos-lab — production-grade agentic ML lifecycle library. + +Six reference agents (Data, Eval, Train, Optimize, Video, Code) on a shared +governance runtime: sentinel-gated judging, MCP-OAuth identity with sub-agent +scope-down, GEPA promotion contracts, quality-budget invariants. Plugs into +`nvidia-nat` (primary harness), `ml-intern` (compat), Claude SDK (v1.1). + +This is the v1 importable surface. Code physically lives under +`agent/optimization/` per the zero-diff fork strategy; `cosmos_lab.*` re-exports +so library consumers can `from cosmos_lab import ...` without depending on +upstream ml-intern import paths. + +See PLAN_V2.md §0.4 for the library architecture rationale. +""" + +from agent.optimization.config_ext import ( + OptimizationConfig, + load_optimization_config, +) +from cosmos_lab.identity import ( + AgentIdentity, + AuditLog, + CapabilityDenied, + CapabilityScopedRouter, +) + +__all__ = [ + "AgentIdentity", + "AuditLog", + "CapabilityDenied", + "CapabilityScopedRouter", + "OptimizationConfig", + "load_optimization_config", +] + +__version__ = "0.1.0.dev0" diff --git a/cosmos_lab/harness/CONTRACT.md b/cosmos_lab/harness/CONTRACT.md new file mode 100644 index 00000000..57c5519d --- /dev/null +++ b/cosmos_lab/harness/CONTRACT.md @@ -0,0 +1,115 @@ +# cosmos_lab/harness — Adapter Contract + +> Per PLAN_V2.md §0.4 (library architecture) + §0.4.5 (two-layer decision) + §0.6 (v6 — 9 agents on ml-intern primitives), cosmos-lab is a Python library that plugs into agent runtimes via thin adapters. This document specifies the contract every adapter must satisfy. + +## Two adapter families (v5.1/v6 reality) + +The two shipped adapters serve different purposes and have different contracts. Honest naming: + +### Family A — Execution Substrate Adapter + +**Purpose**: install cosmos-lab governance (capability-scoped tool calls + identity-aware audit) INTO an autonomous agent runtime. The host runs the actual ReAct loop; cosmos-lab governs every tool call inside it. + +**Examples**: +- `cosmos_lab.harness.ml_intern.install_into_session` (P0.5 D2 — shipped) +- `cosmos_lab.harness.claude_sdk.install_into_agent` (v1.1 — planned) +- `cosmos_lab.harness.openai_agents.install_into_agent` (v1.2 — planned) + +**Contract signature**: +```python +def install(host: HostType, identity: AgentIdentity, audit_log: AuditLog) -> None: + """Wrap host's tool router with CapabilityScopedRouter.""" +``` + +**Behavior**: +1. Validate host has the prerequisite (tool_router for Session, equivalent for other hosts) +2. Wrap host's tool router with `CapabilityScopedRouter(identity, audit_log)` +3. Mark host as installed (idempotency tag) +4. Mutate host in place; return `None` + +### Family B — Deployment Surface Adapter + +**Purpose**: register cosmos-lab as an invokable tool in a workflow runtime. The workflow runtime invokes cosmos-lab CLI; cosmos-lab CLI orchestrates the actual work using the family A execution substrate adapter. + +**Examples**: +- `cosmos_lab.harness.nat.register_as_nat_tool` (P0.5 D3 — shipped) +- `cosmos_lab.harness.langgraph.register_as_langgraph_node` (v1.2 — planned) +- `cosmos_lab.harness.airflow.register_as_airflow_operator` (future) + +**Contract signature**: +```python +def register_as_X_tool(builder: BuilderType) -> None: + """Register cosmos_lab_principal as an invokable tool in the workflow builder.""" +``` + +**Behavior**: +1. Validate builder has a recognizable tool-registration method +2. Register one tool named `cosmos_lab_principal` (uniform across deployment surfaces for predictability) +3. Tool callable accepts `(task: str, budget_usd: float, timeout_sec: int)`, returns structured dict +4. Mark builder as registered (idempotency check) +5. Mutate builder in place; return `None` + +## Shared requirements (both families) + +These requirements apply to ALL adapters regardless of family. Tested cross-family in `test_adapter_contract.py`. + +### S1 — Idempotency +Re-installing/re-registering on the same host raises a clear error (RuntimeError or equivalent) with explanation. Silent re-installation could shadow audit history or duplicate tool registrations — both are correctness bugs. + +### S2 — Composition only (Invariant 1) +Adapter never modifies upstream files. Wraps existing host attributes via composition. Adapter must work against any duck-typed host that satisfies the protocol. + +### S3 — Input validation +Adapter raises a clear error (ValueError, AttributeError, TypeError) when host lacks prerequisites. Errors must name the specific missing prerequisite to aid debugging. + +### S4 — In-place mutation, returns None +Adapter mutates the host in place and returns None. No new host instance returned. This makes adapter usage uniform: `adapter.install(host, ...); use(host)`. + +### S5 — No partial state on failure +If adapter raises during install/register, host must remain unchanged from pre-call state. No half-installed governance. (Atomicity guarantee.) + +## Per-adapter specifics + +### `cosmos_lab.harness.ml_intern.install_into_session` + +- **HostType**: `agent.core.session.Session` (or duck-typed equivalent with `.tool_router`) +- **Prerequisite**: `host.tool_router is not None` +- **Wraps**: `host.tool_router` with `CapabilityScopedRouter` +- **Idempotency tag**: `host._cosmos_lab_installed = True` +- **Tested in**: `tests/optimization/harness/test_ml_intern_adapter.py` (6 tests) + +### `cosmos_lab.harness.nat.register_as_nat_tool` + +- **HostType**: nat `Builder` (or duck-typed `BuilderLike` with one of `add_function`/`register_function`/`register_tool`/`add_tool`) +- **Prerequisite**: builder has a recognizable registration method +- **Registers**: one tool named `cosmos_lab_principal` +- **Idempotency check**: `cosmos_lab_principal` not already in `builder.functions`/`tools`/`_functions`/`_tools` +- **Tool body**: stub in P0.5 D3; real CLI invocation lands in P3 (PrincipalAgent v0 ships specialty agents) +- **Tested in**: `tests/optimization/harness/test_nat_adapter.py` (11 tests) + +## Future adapters (v1.1+) + +When adding a new adapter: + +1. **Determine family**: Family A (execution substrate) or Family B (deployment surface)? +2. **Match family contract signature**: install vs register +3. **Satisfy all 5 shared requirements** (S1-S5) +4. **Document per-adapter specifics** in this file +5. **Add adapter-specific test file**: `tests/optimization/harness/test_X_adapter.py` +6. **Add cross-family parametrization** in `test_adapter_contract.py` for the shared requirements (S1-S5) + +## Anti-patterns explicitly rejected + +- ❌ One unified adapter signature for both families — they serve different purposes, forcing one signature loses clarity +- ❌ Abstract base class enforcing contract — duck typing + Protocol + test suite is more flexible (Pythonic) +- ❌ Adapters that modify upstream files — violates Invariant 1 +- ❌ Silent re-installation — shadows audit history +- ❌ Partial-state failure — host left in inconsistent state on error + +## References + +- PLAN_V2.md §0.4 — library architecture decision +- PLAN_V2.md §0.4.5 — two-layer architecture (cosmos-lab CLI + ml-intern Session, nat as deployment wrapper) +- PLAN_V2.md §0.6 (v6) — what cosmos-lab ships vs leverages from ml-intern +- `cosmos_lab/harness/ml_intern.py` — Family A reference implementation +- `cosmos_lab/harness/nat.py` — Family B reference implementation diff --git a/cosmos_lab/harness/__init__.py b/cosmos_lab/harness/__init__.py new file mode 100644 index 00000000..941fc3d6 --- /dev/null +++ b/cosmos_lab/harness/__init__.py @@ -0,0 +1,22 @@ +"""cosmos_lab.harness — adapters that install cosmos-lab governance into agent harnesses. + +Each adapter implements the contract: + install(host, identity, audit_log) -> None + - wraps host's tool router with CapabilityScopedRouter + - validates host meets the contract before mutation + - is idempotent under re-install on a fresh host (raises on double-install + of the same wrapped router) + +Adapters in this package: + - ml_intern: install into agent.core.session.Session (HF stack, P0.5 D2) + - nat: install into nvidia-nat Builder (Cosmos stack, P0.5 D3 — pending) + - claude_sdk: install into Claude Agent SDK (v1.1 — pending) + +See PLAN_V2.md §0.4 for the library architecture rationale and +docs/02_current_phase.md for the current adapter being built. +""" + +from cosmos_lab.harness.ml_intern import install_into_session +from cosmos_lab.harness.nat import register_as_nat_tool + +__all__ = ["install_into_session", "register_as_nat_tool"] diff --git a/cosmos_lab/harness/ml_intern.py b/cosmos_lab/harness/ml_intern.py new file mode 100644 index 00000000..a3abf5ff --- /dev/null +++ b/cosmos_lab/harness/ml_intern.py @@ -0,0 +1,84 @@ +"""cosmos_lab.harness.ml_intern — install cosmos-lab governance into ml-intern Session. + +Per PLAN_V2.md §0.4 library architecture: cosmos-lab is a library that plugs +into agent harnesses via thin adapters; this is the v1 compat adapter for +ml-intern's `agent.core.session.Session`. Composition only — no upstream +files are modified (Invariant 1 zero-diff). + +Contract (install_into_session): + Wraps the host Session's tool_router with `CapabilityScopedRouter` so that + every tool call routed through the Session is governed by cosmos-lab + identity + audit. The host Session is mutated in place; no other state + changes. + +Scope (P0.5 D2): + This adapter only wraps the tool router. OTel span emission and lifecycle + hook wiring land in P1 (TrajectorySink + OTelGenAIEmitter). +""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from cosmos_lab.identity import ( + AgentIdentity, + AuditLog, + CapabilityScopedRouter, +) + +if TYPE_CHECKING: + # Session is imported only for type-checkers to avoid pulling in the + # full ml-intern session module (which transitively imports backends, + # context manager, sandbox, etc.) at adapter import time. + from agent.core.session import Session + + +_INSTALLED_MARKER = "_cosmos_lab_installed" + + +def install_into_session( + session: "Session", + identity: AgentIdentity, + audit_log: AuditLog, +) -> None: + """Install cosmos-lab governance into an existing ml-intern Session. + + After this call, every tool invocation through `session.tool_router` + is filtered by `identity.can_call(...)` and audited via `audit_log`. + + Args: + session: an ml-intern `agent.core.session.Session` instance with a + `tool_router` attribute already initialized + identity: the `AgentIdentity` whose capabilities scope this Session + audit_log: the `AuditLog` to which tool-call events are recorded + + Raises: + ValueError: if the Session has no `tool_router` (cannot wrap nothing) + RuntimeError: if cosmos-lab governance is already installed on this + Session (use a fresh Session or detach first) + + Idempotency: + Re-installing on the same Session raises `RuntimeError`. This is + intentional — silent re-wrapping would shadow audit history and + confuse capability scope reasoning. Spawn a fresh Session instead. + """ + if getattr(session, "tool_router", None) is None: + raise ValueError( + "Session has no tool_router to wrap; cosmos-lab requires an " + "initialized router (Session must be constructed with " + "tool_router=ToolRouter(...) or equivalent)" + ) + + if getattr(session, _INSTALLED_MARKER, False): + raise RuntimeError( + "cosmos-lab governance is already installed on this Session; " + "re-installation would shadow audit history. Spawn a fresh " + "Session or detach first." + ) + + session.tool_router = CapabilityScopedRouter( + base_router=session.tool_router, + identity=identity, + audit_log=audit_log, + ) + setattr(session, _INSTALLED_MARKER, True) diff --git a/cosmos_lab/harness/nat.py b/cosmos_lab/harness/nat.py new file mode 100644 index 00000000..ca378f34 --- /dev/null +++ b/cosmos_lab/harness/nat.py @@ -0,0 +1,132 @@ +"""cosmos_lab.harness.nat — register cosmos-lab as an invokable tool in nat workflows. + +Per PLAN_V2.md §0.4.5 two-layer architecture (v5.1): + cosmos-lab CLI is the primary entry point; nat is a *deployment wrapper*, + not a runtime substrate. This module provides the thin registration shim + so a nat workflow YAML can invoke `cosmos-lab principal --task ` as + a tool callable. + +Why nat-as-deployment-only (not nat-as-runtime): + ml-intern's `submission_loop` (agent_loop.py:1771) is queue-based, not + function-based. Embedding it as runtime substrate inside a nat workflow + requires a 1-2 week async-bridge engineering effort that v5.1 explicitly + rejected. cosmos-lab CLI is the natural orchestrator; nat invokes it. + +Cosmos team usage: + + # 1. Author a nat workflow YAML that lists cosmos_lab_principal as a tool: + # + # workflow: + # _type: nat.react_agent + # tools: [cosmos_lab_principal, ...] + # + # 2. Register the tool with nat: + # + # from cosmos_lab.harness.nat import register_as_nat_tool + # register_as_nat_tool(builder) + # + # 3. Run: $ nat run --config-file cosmos-lab.yaml + # + # nat invokes `cosmos-lab principal --task ` per workflow rules. + # Real nat integration validated in P10 deployment (per v5.1). + +Contract: + register_as_nat_tool(builder) + - Adds ONE tool to the builder named `cosmos_lab_principal` + - The tool's invocation runs `cosmos-lab principal --task ` and + returns a structured dict {outcome, cost_usd, trajectory_id, ...} + - Builder is mutated in place; idempotency check via _COSMOS_LAB_TOOL_NAME +""" + +from __future__ import annotations + +from typing import Any, Protocol + + +_COSMOS_LAB_TOOL_NAME = "cosmos_lab_principal" + + +class BuilderLike(Protocol): + """Duck-typed nat Builder protocol. + + Real `nvidia-nat` Builder has add_function() / register_function(). + We accept either via getattr fallback so this works against both v1.6 + and any future Builder API stabilization. + """ + + def add_function(self, name: str, func: Any) -> None: # pragma: no cover + ... + + +def register_as_nat_tool(builder: BuilderLike) -> None: + """Register cosmos-lab as an invokable tool in a nat Builder. + + After this call, the nat workflow can invoke `cosmos_lab_principal(task=...)` + as a tool. The implementation shells out to the cosmos-lab CLI per the + v5.1 two-layer architecture (PLAN_V2 §0.4.5). + + Args: + builder: nat Builder (or a BuilderLike test double) + + Raises: + RuntimeError: if cosmos-lab tool is already registered on this builder + """ + if _is_already_registered(builder): + raise RuntimeError( + f"Tool '{_COSMOS_LAB_TOOL_NAME}' already registered on this builder; " + "double-registration would shadow the existing tool. Use a fresh " + "builder or unregister first." + ) + + add_fn = _resolve_add_function(builder) + add_fn(_COSMOS_LAB_TOOL_NAME, _cosmos_lab_principal_tool) + + +def _resolve_add_function(builder: BuilderLike): + """Find the right method on the builder to register a tool. + + nat v1.6 uses add_function; future versions may rename. Be tolerant. + """ + for method_name in ("add_function", "register_function", "register_tool", "add_tool"): + method = getattr(builder, method_name, None) + if callable(method): + return method + raise AttributeError( + f"Builder {type(builder).__name__} has no recognizable tool-registration " + "method (tried: add_function, register_function, register_tool, add_tool)" + ) + + +def _is_already_registered(builder: BuilderLike) -> bool: + """Best-effort idempotency check — looks at builder's known tool listing.""" + for attr in ("functions", "tools", "_functions", "_tools"): + registry = getattr(builder, attr, None) + if registry is not None and _COSMOS_LAB_TOOL_NAME in registry: + return True + return False + + +def _cosmos_lab_principal_tool( + task: str, + budget_usd: float = 10.0, + timeout_sec: int = 600, +) -> dict[str, Any]: + """The actual tool nat invokes — shells to cosmos-lab CLI. + + Returns a dict with outcome, cost, and trajectory pointer that nat can + surface back to its workflow. + + NOTE: in v0 this is a placeholder that returns a structured stub. Real + CLI invocation lands when `cosmos-lab principal` CLI ships in P3 + (PrincipalAgent v0). For P0.5 D3, the contract + registration are + what's tested; real subprocess execution is P3 work. + """ + return { + "task": task, + "budget_usd": budget_usd, + "timeout_sec": timeout_sec, + "outcome": "", + "cost_usd": 0.0, + "trajectory_id": None, + "status": "stub", + } diff --git a/cosmos_lab/identity/__init__.py b/cosmos_lab/identity/__init__.py new file mode 100644 index 00000000..1ed5127e --- /dev/null +++ b/cosmos_lab/identity/__init__.py @@ -0,0 +1,22 @@ +"""cosmos_lab.identity — AgentIdentity, AuditLog, CapabilityScopedRouter. + +Re-export shim. Implementation lives at `agent.optimization.identity` per the +zero-diff fork strategy (see PLAN_V2.md §0.4 library architecture). + +P0 ships AuthZ + audit (unsigned identity, JSONL log). P4b graduates to +MCP OAuth 2.1 + RFC 8707/8693 + Ed25519 signed log. +""" + +from agent.optimization.identity import ( + AgentIdentity, + AuditLog, + CapabilityDenied, + CapabilityScopedRouter, +) + +__all__ = [ + "AgentIdentity", + "AuditLog", + "CapabilityDenied", + "CapabilityScopedRouter", +] diff --git a/docs/00_workflow.md b/docs/00_workflow.md new file mode 120000 index 00000000..a930333e --- /dev/null +++ b/docs/00_workflow.md @@ -0,0 +1 @@ +../agentic_build_workflow.md \ No newline at end of file diff --git a/docs/01_north_star.md b/docs/01_north_star.md new file mode 100644 index 00000000..f4cb87cb --- /dev/null +++ b/docs/01_north_star.md @@ -0,0 +1,89 @@ +# North Star — cosmos-lab in 1 screen (v7 — frontier-aligned, final) + +## What we are building + +**A frontier-aligned production agentic system for NVIDIA Cosmos team's ML lifecycle work.** + +5 production agents + 1 Skill + 3 offline tools + ~16 infrastructure, on LangGraph durable substrate + Magentic-One ledger pattern + ml-intern's tool primitives. Verified against 2026 frontier patterns at Anthropic, NVIDIA, LangGraph (Uber/JPMC production), Microsoft Agent Framework, Inspect AI (UK AISI), Mem0, MCP authorization spec. + +## Why we are building it + +NVIDIA Cosmos team JD literally asks: *"agentic systems that reason about, build, evaluate, and improve AI systems themselves"* + *"agents (plural) help generate data, surface failures, evaluate outputs"* + stand-out *"agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration"*. + +This describes **multiple specialty agents** for different ML lifecycle stages PLUS production governance. ml-intern primitives are HF-flavored building blocks; v7 specializes them into Cosmos-aligned production agents using 2026-converged patterns. + +## The 5 production agents (v7 honest count after frontier audit) + +### Layer 1 — PrincipalAgent supervisor + 4 specialty workers + +| Agent | Phase | Distinct tool surface | Frontier pattern | +|---|---|---|---| +| **PrincipalAgent** | P3 | LangGraph supervisor + Magentic-One Task/Progress Ledger + Skills loader | Hierarchical orchestrator-worker (Anthropic Multi-Agent Research, Magentic-One, LangGraph supervisor — convergent 2026) | +| **DataAgent** | P4a | cosmos-curate/NeMo Curator/synthetic gen | Magentic-One worker pattern | +| **EvalAgent** | P5 | Inspect AI/MultiJudge/5-type sentinels | Inspect AI standard substrate | +| **TrainOrchestrator** | P5 | NeMo-RL/SkyPilot/HF Jobs | nat plugin pattern + production training | +| **OptimizeAgent** | P6 | profiler/kernel/sandbox 2-tier | Production optimization pattern | + +### Layer 2 — Skills (loaded by PrincipalAgent — Anthropic Skills pattern) + +| Skill | Phase | Why a Skill not Agent | +|---|---|---| +| **CodeWork** | P7 | Commodity tools (file ops + tests in E2B); Anthropic Skills blog rejects per-domain agents for commodity capabilities | + +### Layer 3 — 3 offline governance tools (NOT standing agents — frontier convergence) + +| Tool | Cadence | Why offline | +|---|---|---| +| **GepaOptimizer** | Monthly cron | Decagon ships GEPA offline only; NO production deployment as standing agent | +| **CapabilityProbe** | CI/CD on capability expansion | METR pattern; co-resident standing would poison trace store | +| **CrossAgentEvaluator** | Quarterly | Inspect AI cross-agent comparison standard | + +### Layer 4 — ~16 infrastructure components + +Identity (P0 + RFC 8693) | 5-type sentinels via Anthropic PostToolUse hooks | OTel + 4-scope hybrid memory (Mem0/Letta) | Inspect AI + cross-family MultiJudge | **LangGraph durable supervisor + Magentic-One ledger** | **Context engineering discipline** (cache-aware prompt structure + 75% compaction + just-in-time retrieval + cosmos-progress.md state file + behavior-vs-capability staleness check) | ComputeBackend + sandbox 2-tier | reproducibility envelope (incl. CUDA versions) | nat deployment wrapper + +### Layer 5 — ml-intern primitives (LEVERAGED inside LangGraph worker nodes) + +agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection. + +## v7 frontier-fixed (vs v6) + +6 issues caught by 3 parallel audit agents, all addressed in v7: + +| Issue | v6 | v7 fix | +|---|---|---| +| Per-domain agents = anti-pattern (Anthropic Skills blog) | 6 specialty agents | 4 specialty workers (distinct tool surfaces) + 1 PrincipalAgent supervisor + CodeWork Skill | +| GEPA standing agent has no production precedent | GepaOptimizer as standing agent | Offline batch tool (Decagon pattern) | +| Sentinel "tripwire-replan" not in production | Novel mechanism | Anthropic PostToolUse hooks contract | +| 3-tier memory hierarchy is research not convergent | 3-tier hierarchical | 4-scope hybrid (Mem0/Letta) | +| Co-resident probe poisons trace store | Standing agent | CI/CD eval lane via Inspect AI snapshots | +| "Earned-trust capability expansion" oversold | Custom semantics | Standard RFC 8693 delegation only | + +Plus 8 frontier additions: LangGraph durable substrate, Magentic-One ledgers, 5th sentinel (judge-hacking per Gaia2), cross-family MultiJudge, CodeWork Skill, RFC 8707 day-one, reward-hack Pareto axis, CUDA versions in envelope. + +## Schedule + +~21 weeks. P0 + P0.5 (~3 days work) shipped. ~18 weeks remaining for P1-P10. + +Slightly more than v6's 19w because LangGraph integration + PrincipalAgent foundation + Magentic-One ledger pattern + 5th sentinel are all frontier-required additions per audit. Honest scoping, not optimistic. + +## When done (Week ~21) + +A Cosmos hiring manager opens the repo and sees: +- README → `pip install cosmos-lab[nat]` → `nat run cosmos-lab.yaml` +- 5 agents + CodeWork Skill + 3 offline tools + ~16 infrastructure +- LangGraph durable supervisor with Magentic-One ledger pattern +- 5-type sentinel suite (incl. judge-hacking detector) +- Cross-family MultiJudge (3× Sonnet + 1× non-Anthropic) +- MCP OAuth + RFC 8693 + signed audit (EU AI Act Art. 12 compliant) +- 4-scope hybrid memory via Mem0/Letta +- Real GPU runs with measured numbers (Invariant 9) +- Real OSS upstream PR +- Production endpoint dashboard with real users +- 5-min demo video showing all 5 agents + CodeWork Skill solving Cosmos task + +> *"AI helps build AI"* — frontier-aligned production agentic system for ML lifecycle work, with Cosmos team's actual stack (NIM + cosmos-curate + NeMo-RL). + +## Confidence anchor — why v7 IS final + +3 independent senior-engineer research agents conducted parallel audits and converged on the same 6 fixes + 8 additions. This is the synthesis. Future audit findings document as v1.1+ work, not v8 — process needs to converge. diff --git a/docs/02_current_phase.md b/docs/02_current_phase.md new file mode 100644 index 00000000..155bba78 --- /dev/null +++ b/docs/02_current_phase.md @@ -0,0 +1,147 @@ +# Current Phase — 🎉 P0.5 COMPLETE → Next: P1 (Eval Infrastructure) + +> ⚡ **LIVE FILE** — updated as we move through phases. If this is stale, fix it before doing more work. + +**Today's date**: 2026-05-03 (P0 + P0.5 D1/D2/D3/D4 + v3 → v6 plan evolution all shipped same-day) +**Active phase**: **P0.5 COMPLETE** ✅ +**Next phase**: **P1 — Eval infrastructure** (foundation for EvalAgent + used by all 6 specialty agents) + +--- + +## P0.5 — DONE (recap of all 4 days) + +### D1 — Library restructure ✅ (14/14 verifier) +- `cosmos_lab/__init__.py` + `cosmos_lab/identity/__init__.py` re-export shims +- `pyproject.toml` updated: `cosmos_lab*` in packages.find + `[nat]`/`[ml_intern]`/`[claude_sdk]` extras + +### D2 — ml_intern adapter (Family A — execution substrate) ✅ (11/11 verifier) +- `cosmos_lab/harness/ml_intern.py` — `install_into_session()` wraps Session.tool_router with CapabilityScopedRouter +- 6 smoke tests (3 contract + 3 e2e behavior) +- **This is THE primary product surface**: every specialty agent (P3+) constructs ml-intern Sessions and calls `install_into_session` to install governance + +### D3 — nat wrapper (Family B — deployment surface) ✅ (10/10 verifier) +- `cosmos_lab/harness/nat.py` — `register_as_nat_tool()` registers `cosmos_lab_principal` as nat workflow tool +- 11 smoke tests covering registration mechanics + tool callable contract +- Tool body is v0 stub; real CLI invocation lands in P3 when PrincipalAgent v0 ships + +### D4 — Adapter contract + dual-adapter test matrix ✅ (14/14 verifier) +- `cosmos_lab/harness/CONTRACT.md` — formal contract documentation: + - Family A vs Family B distinction (per v5.1/v6 architecture) + - 5 shared requirements (S1 idempotency, S2 composition only, S3 input validation, S4 returns None, S5 no partial state on failure) + - Per-adapter specifics + future adapter checklist +- `tests/optimization/harness/test_adapter_contract.py` — parametrized contract tests: + - 9 tests run across BOTH shipped adapters (`ml_intern`, `nat`) + - When v1.1 adds claude_sdk or langgraph, just add row to ADAPTERS registry — automatic contract enforcement + +### v3 → v6 plan evolution (same day, captured in commits) +- v3.1 / v3.2 / v4 / v5 / v5.1 / v5.2 / **v6 (final)** — see PLAN_V2.md §0.5 deltas table rows 1-13 +- v6 final framing: 6 Cosmos-specialty agents + 3 governance agents + ~16 infrastructure components, on ml-intern primitives leveraged AS-IS, ~19 weeks + +--- + +## Final P0.5 metrics + +| Metric | Value | +|---|---| +| **Verifier scores** | D1: 14/14, D2: 11/11, D3: 10/10, D4: 14/14 (all green) | +| **Total cosmos-lab tests** | 42 (16 P0 + 6 D2 + 11 D3 + 9 D4 contract) | +| **Upstream baseline** | 237 pass / 3 known-broken (no regression across all 4 days) | +| **Total commits on branch** | 12 (P0 + 4 P0.5 days + 5 plan evolutions + 2 fixups) | +| **LOC added** | ~3500 (cosmos_lab/ + tests/ + bin/ + docs/ + planning) | +| **Zero-diff invariant** | ✅ holds throughout | + +--- + +## P0.5 LEARN (cumulative, across all 4 days) + +1. **D1 — `uv.lock` regenerates on `uv sync`** → verifier exclusion list must include it +2. **D2 — `uv sync` without `--extra dev` removes pytest** → CLAUDE.md updated +3. **D2 — `uv run pytest` PATH-leaks to system pytest** → use `uv run python -m pytest` always +4. **D2 — Smoke test design** → duck-typed mocks > real host construction (MockSession vs real ml-intern Session) +5. **D3 — Editable install metadata stale** when new submodule added → re-run `uv sync` after package changes +6. **D4 — Two adapter families honest** → `ml_intern` (execution substrate) and `nat` (deployment surface) have DIFFERENT contracts. Forcing one signature loses clarity. Two families + 5 shared requirements is the right design. + +--- + +## P1 spec — Phase 1 of workflow (DEFINE) — eval infrastructure + +> **Per PLAN_V2.md §1 v6 phase table**: P1 ships eval infrastructure that becomes the foundation for EvalAgent (P4a) and is used by all 6 specialty agents. Schedule: 2 weeks. + +### Goal (one sentence) +Ship the eval infrastructure: `TrajectorySink` Protocol, `OTelGenAIEmitter` (Phoenix backend default), 4 sentinel types per §3.1, `MultiJudge` with bootstrap CIs, Inspect AI bridge, 5 seed Inspect tasks, `evaluate` CLI — so every cosmos-lab specialty agent (P3+) inherits sentinel-gated evaluation + OTel observability + Inspect AI integration for free. + +### Spec — what it does (P1 deliverables, ~2w) + +**Module: `cosmos_lab/trajectory/`** +- `sink.py` — `TrajectorySink` Protocol +- `otel_emitter.py` — `OTelGenAIEmitter` emits `gen_ai.*` spans → Phoenix +- `duckdb_sink.py` — opt-in analytics layer +- `hf_sink.py` — opt-in HF dataset upload (P8 flywheel) + +**Module: `cosmos_lab/eval/`** +- `judge.py` — `LLMJudge` (single-pass) +- `multi_judge.py` — `MultiJudge` (N=3 with bootstrap CI, no debate per arxiv:2508.17536) +- `sentinels/` — 4 sentinel types per §3.1 taxonomy: + - `deterministic.py` + - `output_format.py` + - `side_effect.py` + - `no_op.py` (mandatory on every task) +- `inspect_bridge.py` — exposes scorers/judges as Inspect AI Scorers +- `tool_judge.py` — ToolAugmentedJudge (judge can call read-only tools) + +**Tasks: `tasks/seed/*.py`** +- 5 Inspect AI tasks: dataset inspect, code task, ML debug, paper summary, profiling +- Each ships with one judge scorer + one composed sentinel + +**CLI: `agent/optimization/cli/evaluate.py`** +- `cosmos-lab evaluate --suite seed --judge multi --sinks otel,duckdb,hf` + +### Acceptance criteria (P1 numerical targets per §0.7) +- 5 Inspect tasks × 3 runs = 15 trajectories: spans land in Phoenix via OTel; round-trip p99 < 500ms +- `MultiJudge` reports bootstrap 95% CI; CI width ≤ 8pp at N=15 runs +- Sentinel/judge agreement ≥ 98% on green seed runs +- A capability-denied call is blocked BEFORE approval policy is consulted (ordering test) +- One golden Phoenix screenshot committed to docs/ + +### Spec — what it does NOT do (today) +- Does NOT build any of the 6 specialty agents (those are P3+) +- Does NOT add KMS for audit log signing (deferred to P10) +- Does NOT integrate with real Cosmos NIM endpoints (P2 + P9) +- Does NOT include S5 monthly red-team sprint (P8) + +### Verifier +`./bin/verify.sh p1` — checks: all modules exist, 5 Inspect tasks load, evaluate CLI works, 5 P1 numerical targets met. + +--- + +## After P1 → v7 phase progression (frontier-aligned final) + +> **v7 update**: 3-audit frontier verification confirmed v6 was ~60% aligned with 6 specific issues. v7 fixes them. See PLAN_V2.md §0.5 row 14 for cited rationale. + +Per PLAN_V2 §1 v7 phase table: +- P1 (2w): Eval infra — **5 sentinel types** (incl. judge-hacking per Gaia2), **cross-family MultiJudge** (3× Sonnet + 1× non-Anthropic), Inspect AI bridge via Anthropic PostToolUse hooks contract +- P2 (1w): Cosmos toolset (NIMProvider + cosmos_reason/predict/transfer wrappers) +- **P3 (2w): 🤖 PrincipalAgent foundation (NEW v7)** — LangGraph durable supervisor + Magentic-One Task/Progress Ledger pattern + 4-scope hybrid memory (Mem0/Letta) +- **P4a (1.5w): 🤖 DataAgent** (worker #1) +- P4b (2w): Identity v2 (MCP OAuth + RFC 8707 + RFC 8693 + signed audit; standard delegation only) +- **P5 (2w): 🤖 EvalAgent + 🤖 TrainOrchestrator** (workers #2 + #3 — first real GPU run) +- P5.5 (1w): PyTorch depth artifact +- **P6 (1.5w): 🤖 OptimizeAgent** (worker #4) +- **P7 (1w): CodeWork Skill** (Anthropic Skills pattern, NOT separate agent) + **CapabilityProbe in CI/CD lane** (NOT standing agent) +- **P8 (1.5w): GepaOptimizer offline batch tool** (NOT standing agent — Decagon pattern) +- **P9 (1.5w): MultimodalPipeline DEMO** (orchestrate existing 4 workers via PrincipalAgent on real Cosmos NIM endpoint) +- **P10 (2w): CrossAgentEvaluator offline + production deploy + nat YAML + OSS PR + demo** + +**~18 weeks of agent + governance work after P0.5 completion.** + +**v7 final agent count**: 5 production agents (1 PrincipalAgent supervisor + 4 specialty workers) + 1+ Skills (CodeWork) + 3 OFFLINE governance tools (NOT standing agents per frontier convergence). Total: 5 production + Skills + offline tools + ~16 infrastructure on LangGraph + Magentic-One + ml-intern primitives. + +--- + +## Branch state + +`p0_5_library_restructure` — 12 commits, all green, ready for PR. + +After P0.5 completion commit (this one): +- Open PR? (recommended for review of foundation before P1 starts) +- Or continue to P1 D1 immediately? diff --git a/docs/03_pointers.md b/docs/03_pointers.md new file mode 100644 index 00000000..338cff16 --- /dev/null +++ b/docs/03_pointers.md @@ -0,0 +1,56 @@ +# Phase → PLAN_V2.md anchor map + +> Use this when you need deep detail on a phase. Read the *specific section* of `PLAN_V2.md`, not the whole file. + +| Phase | What it ships | PLAN_V2.md section | +|---|---|---| +| **P0** (shipped) | Identity AuthZ MVP | `§2 Phase 0` | +| **P0.5** (current, 4 days) | Library restructure + harness adapters | `§2.5 Phase 0.5` | +| **P1** (W2-3) | TrajectorySink + OTel GenAI + Inspect AI + sentinel taxonomy | `§3 Phase 1` + `§3.1 Sentinel taxonomy` | +| **P2** (W4) | Cosmos provider scaffolding (NIMProvider + tool wrappers) | `§4 Phase 2` | +| **P3** (W5-6) | DataAgent — real video curation through cosmos-curate | `§5 P3` | +| **P4a** (W7) | EvalAgent platform — multi-judge + bootstrap CI + PR gate | `§5 P4a` | +| **P4b** (W8-9) | Identity v2 — MCP OAuth + RFC 8707/8693 + signed log | `§5 P4b` | +| **P5** (W10-11.5) | TrainOrchestrator — Centaur HPO + ComputeBackend | `§5 P5` | +| **P5.5** (W12) | PyTorch Depth — custom autograd OR torch.compile pattern | `§5 P5.5` | +| **P6** (W13-14.5) | OptimizeAgent — ≥1.5× speedup, ≤2% regression on real GPU | `§5 P6` | +| **P7** (W15-16) | Memory & compression (3-tier hierarchical) | `§5 P7` | +| **P8** (W17-18) | GEPA self-improvement loop (offline DSPy) | `§5 P8` | +| **P9a** (W18.5-19.5) | Multi-agent e2e — Cosmos Predict 2.5 + π₀.₅ pipeline | `§5 P9` | +| **P9b** (W19.5-20.5) | CodeAgent — bug fix in E2B sandbox | `§5 P9` | +| **P10** (W20.5-22.5) | Production deploy + OSS upstream PR + demo video | `§5 P10` | + +## Cross-cutting + +| Topic | PLAN_V2.md section | +|---|---| +| Library architecture (why) | `§0.4 Library architecture` | +| What only cosmos-lab does | `§0.6 unique value` | +| 6 reference agents matrix | `§0.65 Six reference agents` | +| 24 numerical targets | `§0.7 Numerical targets` | +| 5 production commitments | `§0.8 Production commitments` | +| 9 invariants | `§0 Invariants` | +| Reuse map (upstream + external) | `§1.5 Reuse map` | +| Sentinel taxonomy | `§3.1 Sentinel taxonomy` | +| PrincipalAgent architecture (long-horizon loop, memory tiers) | `§3.2 PrincipalAgent architecture` | +| Agentic eval architecture (5-tier + 6 surfaces) | `§3.3 Agentic eval architecture` (pointer) → `AGENTIC_EVAL_SPEC.md` (full spec) | +| Vendor independence | `§6.5 Vendor independence` | +| Open questions | `§8 Open questions` | + +## Companion specification docs (load on-demand) + +| Need | Read | +|---|---| +| ML-output eval methodology (perplexity, KL, latency p99, GPU OOM) | `EVAL_SPEC.md` | +| Agent-system eval methodology (trajectory, plan, replan, capability boundary, reward-hack, cross-agent comparison) | `AGENTIC_EVAL_SPEC.md` | +| Original 16-week ML optimization plan (historical) | `PLAN.md` | +| Architecture deep-dive (Vietnamese) | `SYSTEM.md` | +| AHE research informing P8 GEPA decisions | `RESEARCH_AHE_ANALYSIS.md` | + +## How to read PLAN_V2.md efficiently + +```bash +# Don't load the whole file. Use grep + Read with offset: +grep -n "^## " PLAN_V2.md # see all section anchors +# Then Read tool with offset pointing to the section you need. +``` diff --git a/docs/04_jd.md b/docs/04_jd.md new file mode 100644 index 00000000..b7db9c68 --- /dev/null +++ b/docs/04_jd.md @@ -0,0 +1,40 @@ +# NVIDIA Cosmos Job Description (reference) + +> Pointer: this is the JD we are aligning to. Don't load unless you need to verify a coverage claim. + +## Role mission +Build agentic systems that reason about, build, evaluate, and improve AI systems themselves. AI doesn't just run models — AI helps build them. Create the meta-layer of modern ML. + +## What you'll be doing +1. Design and implement agentic workflows across the ML lifecycle (data, eval, debug, training, iteration) +2. Build AI-native systems where models/agents interact with codebases, tools, experiments, environments +3. Create self-improving loops (agents help generate data, surface failures, evaluate outputs) +4. Own and evolve large-scale Python and PyTorch codebases +5. Design and scale evaluation platforms (auto + human + agent-driven analysis) +6. Build and maintain multimodal ML pipelines (data → experiment → benchmark → deploy) +7. Integrate OSS and internal components into unified systems +8. Engineering excellence — testing, reproducibility, packaging, code health + +## What we need to see +1. Significant ML systems / platforms experience (not just models) +2. Expert-level Python (modularity, abstraction, code health long-term) +3. Deep PyTorch — debug, adapt, extend model behavior +4. Pipelines / eval / dev tooling at meaningful scale +5. SWE fundamentals — system design, testing, packaging, debugging +6. Strong agency in LLM systems — tool use, planning, multi-step, code agents +7. Comfort in fast-moving environments +8. BS/MS CS + 12+ years software development + +## Stand-out (where cosmos-lab differentiates) +1. **Agent-based systems doing real work** — coding, eval, data gen, triage, experimentation, orchestration +2. **Impactful OSS contribution** — beyond own library +3. **Context compression / agent memory** +4. **Agent safety + identity (AuthN, AuthZ, IAM)** +5. **High craftsmanship bar in research-adjacent without slowing innovation** + +## Salary +- L5 Senior: $224K - $356.5K base +- L6 Principal/Staff: $272K - $431.25K base + +## Coverage matrix +See `PLAN_V2.md §0.65` (six agents map to JD bullets) and `§0.6` (unique value over assembled OSS). diff --git a/frontend/src/components/Chat/ChatInput.tsx b/frontend/src/components/Chat/ChatInput.tsx index 58e253c1..fc753a4e 100644 --- a/frontend/src/components/Chat/ChatInput.tsx +++ b/frontend/src/components/Chat/ChatInput.tsx @@ -8,7 +8,14 @@ import { useUserQuota } from '@/hooks/useUserQuota'; import ClaudeCapDialog from '@/components/ClaudeCapDialog'; import JobsUpgradeDialog from '@/components/JobsUpgradeDialog'; import { useAgentStore } from '@/store/agentStore'; -import { CLAUDE_MODEL_PATH, FIRST_FREE_MODEL_PATH, isClaudePath } from '@/utils/model'; +import { useSessionStore } from '@/store/sessionStore'; +import { + CLAUDE_MODEL_PATH, + FIRST_FREE_MODEL_PATH, + GPT_55_MODEL_PATH, + isClaudePath, + isPremiumPath, +} from '@/utils/model'; // Model configuration interface ModelOption { @@ -25,7 +32,7 @@ const getHfAvatarUrl = (modelId: string) => { return `https://huggingface.co/api/avatars/${org}`; }; -const MODEL_OPTIONS: ModelOption[] = [ +const DEFAULT_MODEL_OPTIONS: ModelOption[] = [ { id: 'kimi-k2.6', name: 'Kimi K2.6', @@ -42,6 +49,13 @@ const MODEL_OPTIONS: ModelOption[] = [ avatarUrl: 'https://huggingface.co/api/avatars/Anthropic', recommended: true, }, + { + id: 'gpt-5.5', + name: 'GPT-5.5', + description: 'OpenAI', + modelPath: GPT_55_MODEL_PATH, + avatarUrl: 'https://huggingface.co/api/avatars/openai', + }, { id: 'minimax-m2.7', name: 'MiniMax M2.7', @@ -56,14 +70,26 @@ const MODEL_OPTIONS: ModelOption[] = [ modelPath: 'zai-org/GLM-5.1', avatarUrl: getHfAvatarUrl('zai-org/GLM-5.1'), }, + { + id: 'deepseek-v4-pro', + name: 'DeepSeek V4 Pro', + description: 'DeepInfra', + modelPath: 'deepseek-ai/DeepSeek-V4-Pro:deepinfra', + avatarUrl: getHfAvatarUrl('deepseek-ai/DeepSeek-V4-Pro'), + }, ]; -const findModelByPath = (path: string): ModelOption | undefined => { - return MODEL_OPTIONS.find(m => m.modelPath === path || path?.includes(m.id)); +const findModelByPath = (path: string, options: ModelOption[]): ModelOption | undefined => { + if (isClaudePath(path)) { + const claude = options.find(isClaudeModel); + if (claude) return claude; + } + return options.find(m => m.modelPath === path || path?.includes(m.id)); }; interface ChatInputProps { sessionId?: string; + initialModelPath?: string | null; onSend: (text: string) => void; onStop?: () => void; isProcessing?: boolean; @@ -72,16 +98,22 @@ interface ChatInputProps { } const isClaudeModel = (m: ModelOption) => isClaudePath(m.modelPath); -const firstFreeModel = () => MODEL_OPTIONS.find(m => !isClaudeModel(m)) ?? MODEL_OPTIONS[0]; +const isPremiumModel = (m: ModelOption) => isPremiumPath(m.modelPath); +const firstFreeModel = (options: ModelOption[]) => options.find(m => !isPremiumModel(m)) ?? options[0]; -export default function ChatInput({ sessionId, onSend, onStop, isProcessing = false, disabled = false, placeholder = 'Ask anything...' }: ChatInputProps) { +export default function ChatInput({ sessionId, initialModelPath, onSend, onStop, isProcessing = false, disabled = false, placeholder = 'Ask anything...' }: ChatInputProps) { const [input, setInput] = useState(''); const inputRef = useRef(null); - const [selectedModelId, setSelectedModelId] = useState(MODEL_OPTIONS[0].id); + const [modelOptions, setModelOptions] = useState(DEFAULT_MODEL_OPTIONS); + const modelOptionsRef = useRef(DEFAULT_MODEL_OPTIONS); + const sessionIdRef = useRef(sessionId); + const [selectedModelId, setSelectedModelId] = useState( + () => findModelByPath(initialModelPath ?? '', DEFAULT_MODEL_OPTIONS)?.id ?? DEFAULT_MODEL_OPTIONS[0].id, + ); const [modelAnchorEl, setModelAnchorEl] = useState(null); const { quota, refresh: refreshQuota } = useUserQuota(); // The daily-cap dialog is triggered from two places: (a) a 429 returned - // from the chat transport when the user tries to send on Opus over cap — + // from the chat transport when the user tries to send on a premium model over cap — // surfaced via the agent-store flag — and (b) nothing else right now // (switching models is free). Keeping the open state in the store means // the hook layer can flip it without threading props through. @@ -89,9 +121,45 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa const setClaudeQuotaExhausted = useAgentStore((s) => s.setClaudeQuotaExhausted); const jobsUpgradeRequired = useAgentStore((s) => s.jobsUpgradeRequired); const setJobsUpgradeRequired = useAgentStore((s) => s.setJobsUpgradeRequired); + const updateSessionModel = useSessionStore((s) => s.updateSessionModel); const [awaitingTopUp, setAwaitingTopUp] = useState(false); const lastSentRef = useRef(''); + useEffect(() => { + modelOptionsRef.current = modelOptions; + }, [modelOptions]); + + useEffect(() => { + sessionIdRef.current = sessionId; + }, [sessionId]); + + useEffect(() => { + let cancelled = false; + apiFetch('/api/config/model') + .then((res) => (res.ok ? res.json() : null)) + .then((data) => { + if (cancelled || !data?.available) return; + const claude = data.available.find((m: { provider?: string; id?: string }) => ( + m.provider === 'anthropic' && m.id + )); + if (!claude?.id) return; + + const next = DEFAULT_MODEL_OPTIONS.map((option) => ( + isClaudeModel(option) + ? { ...option, modelPath: claude.id, name: claude.label ?? option.name } + : option + )); + modelOptionsRef.current = next; + setModelOptions(next); + if (!sessionIdRef.current) { + const current = data.current ? findModelByPath(data.current, next) : null; + if (current) setSelectedModelId(current.id); + } + }) + .catch(() => { /* ignore */ }); + return () => { cancelled = true; }; + }, []); + // Model is per-session: fetch this tab's current model every time the // session changes. Other tabs keep their own selections independently. useEffect(() => { @@ -102,15 +170,16 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa .then((data) => { if (cancelled) return; if (data?.model) { - const model = findModelByPath(data.model); + const model = findModelByPath(data.model, modelOptionsRef.current); if (model) setSelectedModelId(model.id); + updateSessionModel(sessionId, data.model); } }) .catch(() => { /* ignore */ }); return () => { cancelled = true; }; - }, [sessionId]); + }, [sessionId, updateSessionModel]); - const selectedModel = MODEL_OPTIONS.find(m => m.id === selectedModelId) || MODEL_OPTIONS[0]; + const selectedModel = modelOptions.find(m => m.id === selectedModelId) || modelOptions[0]; // Auto-focus the textarea when the session becomes ready useEffect(() => { @@ -127,7 +196,7 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa } }, [input, disabled, onSend]); - // When the chat transport reports a Claude-quota 429, restore the typed + // When the chat transport reports a premium-model quota 429, restore the typed // text so the user doesn't lose their message. useEffect(() => { if (claudeQuotaExhausted && lastSentRef.current) { @@ -168,7 +237,10 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa method: 'POST', body: JSON.stringify({ model: model.modelPath }), }); - if (res.ok) setSelectedModelId(model.id); + if (res.ok) { + setSelectedModelId(model.id); + updateSessionModel(sessionId, model.modelPath); + } } catch { /* ignore */ } }; @@ -178,12 +250,12 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa }, [setClaudeQuotaExhausted]); // "Use a free model" — switch the current session to Kimi (or the first - // non-Anthropic option) and auto-retry the send that tripped the cap. + // non-premium option) and auto-retry the send that tripped the cap. const handleUseFreeModel = useCallback(async () => { setClaudeQuotaExhausted(false); if (!sessionId) return; - const free = MODEL_OPTIONS.find(m => m.modelPath === FIRST_FREE_MODEL_PATH) - ?? firstFreeModel(); + const free = modelOptions.find(m => m.modelPath === FIRST_FREE_MODEL_PATH) + ?? firstFreeModel(modelOptions); try { const res = await apiFetch(`/api/session/${sessionId}/model`, { method: 'POST', @@ -191,6 +263,7 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa }); if (res.ok) { setSelectedModelId(free.id); + updateSessionModel(sessionId, free.modelPath); const retryText = lastSentRef.current; if (retryText) { onSend(retryText); @@ -199,14 +272,14 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa } } } catch { /* ignore */ } - }, [sessionId, onSend, setClaudeQuotaExhausted]); + }, [sessionId, onSend, setClaudeQuotaExhausted, modelOptions, updateSessionModel]); - const handleClaudeUpgradeClick = useCallback(async () => { + const handlePremiumUpgradeClick = useCallback(async () => { if (!sessionId) return; try { await apiFetch(`/api/pro-click/${sessionId}`, { method: 'POST', - body: JSON.stringify({ source: 'claude_cap_dialog', target: 'pro_pricing' }), + body: JSON.stringify({ source: 'premium_cap_dialog', target: 'pro_pricing' }), }); } catch { /* tracking is best-effort */ @@ -254,14 +327,14 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa return () => document.removeEventListener('visibilitychange', onVisible); }, [awaitingTopUp, jobsUpgradeRequired, handleJobsRetry]); - // Hide the chip until the user has actually burned quota — an unused - // Opus session shouldn't populate a counter. - const claudeChip = (() => { - if (!quota || quota.claudeUsedToday === 0) return null; + // Hide the chip until the user has actually burned quota; opening a + // premium-model session without sending should not populate a counter. + const premiumChip = (() => { + if (!quota || quota.premiumUsedToday === 0) return null; if (quota.plan === 'free') { - return quota.claudeRemaining > 0 ? 'Free today' : 'Pro only'; + return quota.premiumRemaining > 0 ? 'Free today' : 'Pro only'; } - return `${quota.claudeUsedToday}/${quota.claudeDailyCap} today`; + return `${quota.premiumUsedToday}/${quota.premiumDailyCap} today`; })(); return ( @@ -426,7 +499,7 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa } }} > - {MODEL_OPTIONS.map((model) => ( + {modelOptions.map((model) => ( handleSelectModel(model)} @@ -462,9 +535,9 @@ export default function ChatInput({ sessionId, onSend, onStop, isProcessing = fa }} /> )} - {isClaudeModel(model) && claudeChip && ( + {isPremiumModel(model) && premiumChip && ( (null); const [error, setError] = useState(null); @@ -50,12 +50,13 @@ export default function ExpiredBanner({ sessionId }: Props) { useAgentStore.getState().clearSessionState(sessionId); renameSession(sessionId, newId); + if (data.model) updateSessionModel(newId, data.model); } catch (e) { logger.warn('Catch-up failed:', e); setError("Couldn't catch up — try starting over."); setBusy(null); } - }, [sessionId, renameSession]); + }, [sessionId, renameSession, updateSessionModel]); const handleStartOver = useCallback(() => { setBusy('start-over'); diff --git a/frontend/src/components/Chat/ToolCallGroup.tsx b/frontend/src/components/Chat/ToolCallGroup.tsx index 657e9e36..b85de8b2 100644 --- a/frontend/src/components/Chat/ToolCallGroup.tsx +++ b/frontend/src/components/Chat/ToolCallGroup.tsx @@ -1,5 +1,5 @@ import { useCallback, useEffect, useMemo, useRef, useState } from 'react'; -import { Box, Stack, Typography, Chip, Button, TextField, IconButton, Link, CircularProgress } from '@mui/material'; +import { Alert, Box, Stack, Typography, Chip, Button, TextField, IconButton, Link, CircularProgress } from '@mui/material'; import CheckCircleOutlineIcon from '@mui/icons-material/CheckCircleOutline'; import ErrorOutlineIcon from '@mui/icons-material/ErrorOutline'; import OpenInNewIcon from '@mui/icons-material/OpenInNew'; @@ -502,6 +502,7 @@ function InlineApproval({ }) { const [feedback, setFeedback] = useState(''); const args = input as Record | undefined; + const autoApproval = useAgentStore((state) => state.budgetBlocks[toolCallId]); const { setPanel, getEditedScript } = useAgentStore(); const { setRightPanelOpen, setLeftSidebarOpen } = useLayoutStore(); const hasEditedScript = !!getEditedScript(toolCallId); @@ -521,6 +522,24 @@ function InlineApproval({ return ( + {autoApproval && ( + + + YOLO paused: {autoApproval.reason || 'manual approval required.'} + + + )} + {toolName === 'sandbox_create' && args && (() => { const hw = String(args.hardware || 'cpu-basic'); const cost = costLabel(hw); @@ -536,9 +555,7 @@ function InlineApproval({ {' '}({cost}) )} - {!!args.private && ( - {' (private)'} - )} + {' (private)'} Creates a temporary HF Space to develop and test scripts before running jobs. Takes 1-2 min to start. diff --git a/frontend/src/components/ClaudeCapDialog.tsx b/frontend/src/components/ClaudeCapDialog.tsx index f959a44c..3beda4f8 100644 --- a/frontend/src/components/ClaudeCapDialog.tsx +++ b/frontend/src/components/ClaudeCapDialog.tsx @@ -55,15 +55,15 @@ export default function ClaudeCapDialog({ - You've hit your Opus limit + You've hit your premium model limit - Opus costs an arm and a leg, so we unfortunately have to cap you at {cap}{' '} - {cap === 1 ? 'session' : 'sessions'} a day. Give Kimi, MiniMax, or GLM a spin — - they are genuinely good and we use them all the time. + Opus and GPT-5.5 are expensive to run, so we cap premium models at {cap}{' '} + {cap === 1 ? 'session' : 'sessions'} a day. Give Kimi, MiniMax, GLM, + or DeepSeek a spin instead. - HF Pro ($9/mo) — more Opus, more everything + HF Pro ($9/mo) — more premium model sessions - {PRO_CAP} Opus sessions/day here, 20× HF Inference credits, ZeroGPU access, - and priority on Spaces hardware. + {PRO_CAP} premium model sessions/day here, 20× HF Inference credits, + ZeroGPU access, and priority on Spaces hardware. diff --git a/frontend/src/components/Layout/AppLayout.tsx b/frontend/src/components/Layout/AppLayout.tsx index b37118a8..b0fc0c36 100644 --- a/frontend/src/components/Layout/AppLayout.tsx +++ b/frontend/src/components/Layout/AppLayout.tsx @@ -24,6 +24,7 @@ import SessionSidebar from '@/components/SessionSidebar/SessionSidebar'; import SessionChat from '@/components/SessionChat'; import CodePanel from '@/components/CodePanel/CodePanel'; import WelcomeScreen from '@/components/WelcomeScreen/WelcomeScreen'; +import YoloControl from '@/components/YoloControl'; import { apiFetch } from '@/utils/api'; const DRAWER_WIDTH = 260; @@ -121,6 +122,39 @@ export default function AppLayout() { }; }, [isConnected, activeSessionId]); + // Best-effort sandbox cleanup when the browser tab/window closes. This + // preserves durable chat history; explicit delete still removes the session. + useEffect(() => { + const teardownSandboxes = () => { + const liveSessionIds = useSessionStore + .getState() + .sessions + .filter((session) => session.isActive && !session.expired) + .map((session) => session.id); + + for (const sessionId of liveSessionIds) { + const url = `/api/session/${sessionId}/sandbox/teardown`; + const body = '{}'; + const blob = new Blob([body], { type: 'application/json' }); + + if (navigator.sendBeacon?.(url, blob)) { + continue; + } + + fetch(url, { + method: 'POST', + body, + keepalive: true, + credentials: 'include', + headers: { 'Content-Type': 'application/json' }, + }).catch(() => {}); + } + }; + + window.addEventListener('pagehide', teardownSandboxes); + return () => window.removeEventListener('pagehide', teardownSandboxes); + }, []); + const handleSessionDead = useCallback( (deadSessionId: string) => { // Backend lost this session — mark it expired so the chat shows a @@ -252,6 +286,7 @@ export default function AppLayout() { + s.id === sessionId)?.expired === true; + const sessionMeta = sessions.find((s) => s.id === sessionId); + const isExpired = sessionMeta?.expired === true; const { messages, sendMessage, stop, status, undoLastTurn, editAndRegenerate, approveTools } = useAgentChat({ sessionId, @@ -112,6 +113,7 @@ export default function SessionChat({ sessionId, isActive, onSessionDead }: Sess ) : ( = 100) return `$${value.toFixed(0)}`; + return `$${value.toFixed(2).replace(/\.00$/, '')}`; +} + +export default function YoloControl() { + const { sessions, activeSessionId, updateSessionYolo } = useSessionStore(); + const activeSession = useMemo( + () => sessions.find((s) => s.id === activeSessionId) || null, + [sessions, activeSessionId], + ); + const [dialogOpen, setDialogOpen] = useState(false); + const [capInput, setCapInput] = useState(String(DEFAULT_CAP_USD)); + const [busy, setBusy] = useState(false); + const [error, setError] = useState(null); + + const enabled = Boolean(activeSession?.autoApprovalEnabled); + const disabled = !activeSessionId || activeSession?.expired || busy; + const remaining = activeSession?.autoApprovalRemainingUsd ?? null; + const cap = activeSession?.autoApprovalCostCapUsd ?? null; + + useEffect(() => { + if (!activeSession) return; + setCapInput(String(activeSession.autoApprovalCostCapUsd ?? DEFAULT_CAP_USD)); + }, [activeSession?.id, activeSession?.autoApprovalCostCapUsd]); // eslint-disable-line react-hooks/exhaustive-deps + + async function patchPolicy(nextEnabled: boolean, nextCap?: number) { + if (!activeSessionId) return null; + setBusy(true); + setError(null); + try { + const body: Record = { enabled: nextEnabled }; + if (nextCap !== undefined) body.cost_cap_usd = nextCap; + const response = await apiFetch(`/api/session/${activeSessionId}/yolo`, { + method: 'PATCH', + body: JSON.stringify(body), + }); + if (!response.ok) { + throw new Error(await response.text()); + } + const data = await response.json(); + updateSessionYolo(activeSessionId, data); + return data; + } catch { + setError('Could not update YOLO settings.'); + return null; + } finally { + setBusy(false); + } + } + + const handleToggle = async () => { + if (disabled) return; + if (enabled) { + await patchPolicy(false); + return; + } + const nextCap = cap ?? DEFAULT_CAP_USD; + const updated = await patchPolicy(true, nextCap); + if (updated) { + setCapInput(String(updated.cost_cap_usd ?? nextCap)); + setDialogOpen(true); + } + }; + + const handleSaveCap = async () => { + const parsed = Number(capInput); + if (!Number.isFinite(parsed) || parsed < 0) { + setError('Enter a non-negative dollar amount.'); + return; + } + const updated = await patchPolicy(true, parsed); + if (updated) setDialogOpen(false); + }; + + return ( + <> + + + + + + + setDialogOpen(false)} maxWidth="xs" fullWidth> + YOLO Budget + + + Auto-approval is active for this session. Scheduled HF jobs still require approval. + + setCapInput(e.target.value)} + inputProps={{ min: 0, step: 0.5 }} + error={Boolean(error)} + helperText={error || `Estimated spend: ${money(activeSession?.autoApprovalEstimatedSpendUsd ?? 0)} of ${money(cap)}`} + /> + + + + + + + + ); +} diff --git a/frontend/src/hooks/useAgentChat.ts b/frontend/src/hooks/useAgentChat.ts index 4cabb3dc..23683894 100644 --- a/frontend/src/hooks/useAgentChat.ts +++ b/frontend/src/hooks/useAgentChat.ts @@ -36,7 +36,7 @@ export function useAgentChat({ sessionId, isActive, onReady, onError, onSessionD const isActiveRef = useRef(isActive); isActiveRef.current = isActive; - const { setNeedsAttention } = useSessionStore(); + const { setNeedsAttention, updateSessionYolo } = useSessionStore(); // Helper: update this session's state (mirrors to globals if active) const updateSession = useAgentStore.getState().updateSession; @@ -186,6 +186,20 @@ export function useAgentChat({ sessionId, isActive, onReady, onError, onSessionD if (!tools.length) return; setNeedsAttention(sessionId, true); + const store = useAgentStore.getState(); + for (const tool of tools) { + store.setToolBudgetBlock( + tool.tool_call_id, + tool.auto_approval_blocked + ? { + reason: tool.block_reason ?? null, + estimatedCostUsd: tool.estimated_cost_usd ?? null, + remainingCapUsd: tool.remaining_cap_usd ?? null, + } + : null, + ); + } + updateSession(sessionId, { activityStatus: { type: 'waiting-approval' } }); // Build panel data for this session's pending approval @@ -346,7 +360,7 @@ export function useAgentChat({ sessionId, isActive, onReady, onError, onSessionD sendAutomaticallyWhen: lastAssistantMessageIsCompleteWithApprovalResponses, onError: (error) => { updateSession(sessionId, { isProcessing: false }); - // Claude daily-cap: open the cap dialog instead of the generic error + // Premium-model daily cap: open the cap dialog instead of the generic error // banner. Transport marks the error with this sentinel. if (error.message === 'CLAUDE_QUOTA_EXHAUSTED') { if (isActiveRef.current) { @@ -480,6 +494,9 @@ export function useAgentChat({ sessionId, isActive, onReady, onError, onSessionD ); if (pendingIds.size > 0) setNeedsAttention(sessionId, true); } + if (info.auto_approval) { + updateSessionYolo(sessionId, info.auto_approval); + } return { data, pendingIds, info }; } return { data, pendingIds, info: null }; @@ -562,7 +579,15 @@ export function useAgentChat({ sessionId, isActive, onReady, onError, onSessionD return true; } else if (et === 'approval_required') { sideChannel.onApprovalRequired( - (event.data?.tools || []) as Array<{ tool: string; arguments: Record; tool_call_id: string }>, + (event.data?.tools || []) as Array<{ + tool: string; + arguments: Record; + tool_call_id: string; + auto_approval_blocked?: boolean; + block_reason?: string | null; + estimated_cost_usd?: number | null; + remaining_cap_usd?: number | null; + }>, ); stopReconnect(); const result = await hydrateMessages(); diff --git a/frontend/src/hooks/useUserQuota.ts b/frontend/src/hooks/useUserQuota.ts index 39fa763a..7fcf4ee3 100644 --- a/frontend/src/hooks/useUserQuota.ts +++ b/frontend/src/hooks/useUserQuota.ts @@ -1,5 +1,5 @@ /** - * Reads the current user's Claude daily quota + plan tier from the backend. + * Reads the current user's premium-model daily quota + plan tier from the backend. * * Fetches once when the user becomes authenticated, and exposes a `refresh()` * that callers invoke after a successful session-create / model-switch so the @@ -13,9 +13,9 @@ export type PlanTier = 'free' | 'pro' | 'org'; export interface UserQuota { plan: PlanTier; - claudeUsedToday: number; - claudeDailyCap: number; - claudeRemaining: number; + premiumUsedToday: number; + premiumDailyCap: number; + premiumRemaining: number; } export function useUserQuota() { @@ -32,9 +32,9 @@ export function useUserQuota() { const data = await res.json(); setQuota({ plan: (data.plan ?? 'free') as PlanTier, - claudeUsedToday: data.claude_used_today ?? 0, - claudeDailyCap: data.claude_daily_cap ?? 1, - claudeRemaining: data.claude_remaining ?? 0, + premiumUsedToday: data.premium_used_today ?? 0, + premiumDailyCap: data.premium_daily_cap ?? 1, + premiumRemaining: data.premium_remaining ?? 0, }); } catch { /* backend unreachable — leave previous value */ diff --git a/frontend/src/lib/sse-chat-transport.ts b/frontend/src/lib/sse-chat-transport.ts index 77f85189..cfe94789 100644 --- a/frontend/src/lib/sse-chat-transport.ts +++ b/frontend/src/lib/sse-chat-transport.ts @@ -26,7 +26,15 @@ export interface SideChannelCallbacks { onToolLog: (tool: string, log: string, agentId?: string, label?: string) => void; onConnectionChange: (connected: boolean) => void; onSessionDead: (sessionId: string) => void; - onApprovalRequired: (tools: Array<{ tool: string; arguments: Record; tool_call_id: string }>) => void; + onApprovalRequired: (tools: Array<{ + tool: string; + arguments: Record; + tool_call_id: string; + auto_approval_blocked?: boolean; + block_reason?: string | null; + estimated_cost_usd?: number | null; + remaining_cap_usd?: number | null; + }>) => void; onToolCallPanel: (tool: string, args: Record) => void; onToolOutputPanel: (tool: string, toolCallId: string, output: string, success: boolean) => void; onStreaming: () => void; @@ -236,6 +244,10 @@ function createEventToChunkStream(sideChannel: SideChannelCallbacks): TransformS tool: string; arguments: Record; tool_call_id: string; + auto_approval_blocked?: boolean; + block_reason?: string | null; + estimated_cost_usd?: number | null; + remaining_cap_usd?: number | null; }>; if (!tools) break; @@ -402,7 +414,7 @@ export class SSEChatTransport implements ChatTransport { this.sideChannel.onSessionDead(sessionId); } if (response.status === 429) { - // Claude daily-quota gate tripped. The prefix is the detection marker + // Premium-model daily quota gate tripped. The prefix is the detection marker // for useAgentChat's onError handler, which surfaces the cap dialog // instead of a generic error banner. throw new Error('CLAUDE_QUOTA_EXHAUSTED'); diff --git a/frontend/src/store/agentStore.ts b/frontend/src/store/agentStore.ts index 9bd7ac7c..08a68a29 100644 --- a/frontend/src/store/agentStore.ts +++ b/frontend/src/store/agentStore.ts @@ -50,6 +50,12 @@ export interface JobsUpgradeState { namespace?: string | null; } +export interface ToolBudgetBlockState { + reason?: string | null; + estimatedCostUsd?: number | null; + remainingCapUsd?: number | null; +} + export type ActivityStatus = | { type: 'idle' } | { type: 'thinking' } @@ -113,7 +119,7 @@ interface AgentStore { user: User | null; error: string | null; llmHealthError: LLMHealthError | null; - /** Set when a Claude-send hits the daily quota — ChatInput opens the cap dialog in response. */ + /** Set when a premium-model send hits the daily quota; ChatInput opens the cap dialog. */ claudeQuotaExhausted: boolean; jobsUpgradeRequired: JobsUpgradeState | null; @@ -145,6 +151,9 @@ interface AgentStore { // Tool rejected states (tool_call_id -> true if rejected by user) - persisted across renders rejectedTools: Record; + // Tool budget-block metadata (tool_call_id -> display metadata) - transient UI state + budgetBlocks: Record; + // ── Per-session actions ───────────────────────────────────────────── /** Update a session's state. If it's the active session, also update flat state. */ @@ -196,6 +205,9 @@ interface AgentStore { setToolRejected: (toolCallId: string, isRejected: boolean) => void; getToolRejected: (toolCallId: string) => boolean | undefined; + + setToolBudgetBlock: (toolCallId: string, block: ToolBudgetBlockState | null) => void; + getToolBudgetBlock: (toolCallId: string) => ToolBudgetBlockState | undefined; } /** @@ -300,6 +312,7 @@ export const useAgentStore = create()((set, get) => ({ trackioDashboards: loadTrackioDashboards(), toolErrors: loadToolErrors(), rejectedTools: loadRejectedTools(), + budgetBlocks: {}, // ── Per-session state management ────────────────────────────────── @@ -529,4 +542,24 @@ export const useAgentStore = create()((set, get) => ({ }, getToolRejected: (toolCallId) => get().rejectedTools[toolCallId], + + // ── Tool Budget Blocks ─────────────────────────────────────────────── + + setToolBudgetBlock: (toolCallId, block) => { + set((state) => { + if (!block) { + const next = { ...state.budgetBlocks }; + delete next[toolCallId]; + return { budgetBlocks: next }; + } + return { + budgetBlocks: { + ...state.budgetBlocks, + [toolCallId]: block, + }, + }; + }); + }, + + getToolBudgetBlock: (toolCallId) => get().budgetBlocks[toolCallId], })); diff --git a/frontend/src/store/sessionStore.ts b/frontend/src/store/sessionStore.ts index 967c6500..e4129e51 100644 --- a/frontend/src/store/sessionStore.ts +++ b/frontend/src/store/sessionStore.ts @@ -9,11 +9,12 @@ interface SessionStore { activeSessionId: string | null; // Actions - createSession: (id: string) => void; + createSession: (id: string, model?: string | null) => void; deleteSession: (id: string) => void; switchSession: (id: string) => void; setSessionActive: (id: string, isActive: boolean) => void; updateSessionTitle: (id: string, title: string) => void; + updateSessionModel: (id: string, model: string | null) => void; setNeedsAttention: (id: string, needs: boolean) => void; /** Mark a session as expired (backend no longer has it). The UI shows a * recovery banner and disables input. */ @@ -26,8 +27,21 @@ interface SessionStore { title?: string | null; created_at: string; is_active?: boolean; + model?: string | null; pending_approval?: unknown[] | null; + auto_approval?: { + enabled?: boolean; + cost_cap_usd?: number | null; + estimated_spend_usd?: number; + remaining_usd?: number | null; + } | null; }>) => void; + updateSessionYolo: (id: string, policy: { + enabled: boolean; + cost_cap_usd?: number | null; + estimated_spend_usd?: number; + remaining_usd?: number | null; + }) => void; /** Atomically swap a session's id in the list + both localStorage caches. * Used when we rehydrate an expired session into a freshly-created backend * session — preserves title, timestamps, and messages. */ @@ -40,13 +54,18 @@ export const useSessionStore = create()( sessions: [], activeSessionId: null, - createSession: (id: string) => { + createSession: (id: string, model?: string | null) => { const newSession: SessionMeta = { id, title: `Chat ${get().sessions.length + 1}`, createdAt: new Date().toISOString(), isActive: true, needsAttention: false, + model: model ?? null, + autoApprovalEnabled: false, + autoApprovalCostCapUsd: null, + autoApprovalEstimatedSpendUsd: 0, + autoApprovalRemainingUsd: null, }; set((state) => ({ sessions: [...state.sessions, newSession], @@ -93,12 +112,22 @@ export const useSessionStore = create()( if (!id) continue; const existing = byId.get(id); if (existing) { + const auto = server.auto_approval; const updated = { ...existing, title: server.title || existing.title, isActive: server.is_active ?? existing.isActive, + model: server.model ?? existing.model ?? null, needsAttention: Boolean(server.pending_approval?.length) || existing.needsAttention, expired: false, + ...(auto + ? { + autoApprovalEnabled: Boolean(auto.enabled), + autoApprovalCostCapUsd: auto.cost_cap_usd ?? null, + autoApprovalEstimatedSpendUsd: auto.estimated_spend_usd ?? 0, + autoApprovalRemainingUsd: auto.remaining_usd ?? null, + } + : {}), }; const idx = merged.findIndex((s) => s.id === id); if (idx >= 0) merged[idx] = updated; @@ -111,7 +140,12 @@ export const useSessionStore = create()( createdAt: server.created_at || new Date().toISOString(), isActive: server.is_active ?? true, needsAttention: Boolean(server.pending_approval?.length), + model: server.model ?? null, expired: false, + autoApprovalEnabled: Boolean(server.auto_approval?.enabled), + autoApprovalCostCapUsd: server.auto_approval?.cost_cap_usd ?? null, + autoApprovalEstimatedSpendUsd: server.auto_approval?.estimated_spend_usd ?? 0, + autoApprovalRemainingUsd: server.auto_approval?.remaining_usd ?? null, }; merged.push(newSession); byId.set(id, newSession); @@ -123,6 +157,22 @@ export const useSessionStore = create()( }); }, + updateSessionYolo: (id, policy) => { + set((state) => ({ + sessions: state.sessions.map((s) => + s.id === id + ? { + ...s, + autoApprovalEnabled: policy.enabled, + autoApprovalCostCapUsd: policy.cost_cap_usd ?? null, + autoApprovalEstimatedSpendUsd: policy.estimated_spend_usd ?? 0, + autoApprovalRemainingUsd: policy.remaining_usd ?? null, + } + : s, + ), + })); + }, + renameSession: (oldId: string, newId: string) => { if (oldId === newId) return; moveMessages(oldId, newId); @@ -160,6 +210,14 @@ export const useSessionStore = create()( })); }, + updateSessionModel: (id: string, model: string | null) => { + set((state) => ({ + sessions: state.sessions.map((s) => + s.id === id ? { ...s, model } : s + ), + })); + }, + setNeedsAttention: (id: string, needs: boolean) => { set((state) => ({ sessions: state.sessions.map((s) => diff --git a/frontend/src/types/agent.ts b/frontend/src/types/agent.ts index c48f2415..3847a9c6 100644 --- a/frontend/src/types/agent.ts +++ b/frontend/src/types/agent.ts @@ -16,11 +16,16 @@ export interface SessionMeta { createdAt: string; isActive: boolean; needsAttention: boolean; + model?: string | null; /** True when the backend no longer recognizes this session id (e.g. * after a backend restart). The UI shows a recovery banner and * disables input until the user chooses to restore-with-summary or * start fresh. */ expired?: boolean; + autoApprovalEnabled?: boolean; + autoApprovalCostCapUsd?: number | null; + autoApprovalEstimatedSpendUsd?: number; + autoApprovalRemainingUsd?: number | null; } export interface ToolApproval { diff --git a/frontend/src/types/events.ts b/frontend/src/types/events.ts index 7319f253..54795827 100644 --- a/frontend/src/types/events.ts +++ b/frontend/src/types/events.ts @@ -68,6 +68,10 @@ export interface ApprovalToolItem { tool: string; arguments: Record; tool_call_id: string; + auto_approval_blocked?: boolean; + block_reason?: string | null; + estimated_cost_usd?: number | null; + remaining_cap_usd?: number | null; } export interface TurnCompleteEventData { diff --git a/frontend/src/utils/model.ts b/frontend/src/utils/model.ts index 37cdece0..84754f99 100644 --- a/frontend/src/utils/model.ts +++ b/frontend/src/utils/model.ts @@ -1,14 +1,19 @@ /** * Shared model-id constants used by session-create call sites and the - * ClaudeCapDialog "Use a free model" escape hatch. + * premium-model cap dialog "Use a free model" escape hatch. * * Keep in sync with MODEL_OPTIONS in components/Chat/ChatInput.tsx and * AVAILABLE_MODELS in backend/routes/agent.py. */ export const CLAUDE_MODEL_PATH = 'bedrock/us.anthropic.claude-opus-4-6-v1'; +export const GPT_55_MODEL_PATH = 'openai/gpt-5.5'; export const FIRST_FREE_MODEL_PATH = 'moonshotai/Kimi-K2.6'; export function isClaudePath(modelPath: string | undefined): boolean { return !!modelPath && modelPath.includes('anthropic'); } + +export function isPremiumPath(modelPath: string | undefined): boolean { + return modelPath === CLAUDE_MODEL_PATH || modelPath === GPT_55_MODEL_PATH; +} diff --git a/pyproject.toml b/pyproject.toml index c9773753..44dc9065 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -46,9 +46,18 @@ dev = [ "pytest-asyncio>=1.2.0", ] -# All dependencies (eval + dev) +# cosmos-lab harness adapter extras (P0.5+) +# These are placeholders; actual deps land as adapters are written: +# D2: ml_intern adapter (no new deps — uses upstream agent.core.session) +# D3: nat adapter (will pin nvidia-nat>=1.6.0) +# v1.1: claude_sdk adapter (will pin claude-agent-sdk) +nat = [] # nvidia-nat>=1.6.0 added in P0.5 D3 +ml_intern = [] # no extra deps — uses upstream agent.core.* directly +claude_sdk = [] # claude-agent-sdk added in v1.1 + +# All dependencies (eval + dev + adapters) all = [ - "ml-intern[eval,dev]", + "ml-intern[eval,dev,nat,ml_intern,claude_sdk]", ] [project.scripts] @@ -63,7 +72,10 @@ build-backend = "setuptools.build_meta" # runtime (resolves to /configs/cli_agent_config.json). # Without it, `uv tool install` / `pip install` produce a broken install # that imports fine but crashes at startup with FileNotFoundError. -include = ["agent*", "configs"] +# +# `cosmos_lab` is the v1 importable surface for the cosmos-lab library +# (re-exports from agent.optimization.*). See PLAN_V2.md §0.4. +include = ["agent*", "configs", "cosmos_lab*"] [tool.setuptools.package-data] configs = ["*.json"] diff --git a/tests/integration/test_live_sandbox_auth.py b/tests/integration/test_live_sandbox_auth.py index b68f9990..f070919d 100644 --- a/tests/integration/test_live_sandbox_auth.py +++ b/tests/integration/test_live_sandbox_auth.py @@ -1,7 +1,8 @@ """Opt-in live sandbox communication test. -This test creates a real Hugging Face Space sandbox, verifies that unauthenticated -requests are rejected, then exercises the authenticated agent client end-to-end. +This test creates a real private Hugging Face Space sandbox, verifies that +unauthenticated requests are rejected, then exercises the authenticated agent +client end-to-end. It is skipped unless ``ML_INTERN_LIVE_SANDBOX_TESTS=1`` and ``HF_TOKEN`` are set. """ @@ -41,7 +42,7 @@ def test_live_sandbox_authenticated_agent_communication(): owner=owner, name="ml-intern-live-auth", hardware="cpu-basic", - private=False, + private=True, token=token, secrets={"HF_TOKEN": token}, wait_timeout=900, @@ -54,7 +55,7 @@ def test_live_sandbox_authenticated_agent_communication(): ) try: denied = unauthenticated.post("exists", json={"path": "/tmp"}) - assert denied.status_code == 401 + assert denied.status_code in {401, 403, 404} # HF private-Space edge may 404 to avoid leaking existence finally: unauthenticated.close() diff --git a/tests/optimization/__init__.py b/tests/optimization/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/tests/optimization/harness/__init__.py b/tests/optimization/harness/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/tests/optimization/harness/test_adapter_contract.py b/tests/optimization/harness/test_adapter_contract.py new file mode 100644 index 00000000..41964a5c --- /dev/null +++ b/tests/optimization/harness/test_adapter_contract.py @@ -0,0 +1,223 @@ +"""P0.5 D4 — adapter contract tests parametrized across shipped adapters. + +Per cosmos_lab/harness/CONTRACT.md, these tests verify the 5 shared +requirements (S1-S5) that ALL adapters must satisfy regardless of family. +Family-specific tests live in test_ml_intern_adapter.py and +test_nat_adapter.py. + +Why parametrize: when v1.1 adds claude_sdk adapter (Family A) or +langgraph adapter (Family B), they MUST pass these same shared +requirement tests. Adding a row to ADAPTERS = automatic contract +enforcement for the new adapter. +""" + +from __future__ import annotations + +from pathlib import Path +from typing import Any, Callable + +import pytest + +from cosmos_lab.harness import install_into_session, register_as_nat_tool +from cosmos_lab.identity import AgentIdentity, AuditLog + + +# --------------------------------------------------------------------------- +# Test doubles for both families +# --------------------------------------------------------------------------- + + +class _FakeRouter: + """Minimal router for Family A (execution substrate) adapter tests.""" + + def __init__(self) -> None: + self.calls: list[Any] = [] + + def get_tool_specs_for_llm(self) -> list[dict[str, Any]]: + return [{"type": "function", "function": {"name": "read_file", "description": "", "parameters": {}}}] + + async def call_tool(self, tool_name, arguments, session=None, tool_call_id=None): + self.calls.append((tool_name, arguments)) + return ("ok", True) + + +class _MockSession: + """Minimal Family A host (matches ml-intern Session shape for adapter).""" + + def __init__(self, tool_router: Any = None) -> None: + self.tool_router = tool_router + + +class _MockBuilder: + """Minimal Family B host (matches nat Builder shape for adapter).""" + + def __init__(self) -> None: + self.functions: dict[str, Any] = {} + + def add_function(self, name: str, func: Any) -> None: + self.functions[name] = func + + +# --------------------------------------------------------------------------- +# Adapter registry — single source of truth for parametrization +# +# When a new adapter ships, add a row here and the contract tests below +# run against it automatically. +# --------------------------------------------------------------------------- + + +def _ml_intern_install_factory(tmp_path: Path) -> Callable[[], None]: + """Returns a callable that performs a successful install on a fresh host.""" + session = _MockSession(tool_router=_FakeRouter()) + identity = AgentIdentity.root("contract-test") + audit = AuditLog(tmp_path / "audit.jsonl") + + def do_install() -> None: + install_into_session(session, identity, audit) + + do_install._host = session # attached for inspection + do_install._fresh_host_factory = lambda: _MockSession(tool_router=_FakeRouter()) + do_install._call_with_fresh_host = lambda host: install_into_session(host, identity, audit) + return do_install + + +def _nat_register_factory(tmp_path: Path) -> Callable[[], None]: + """Returns a callable that performs a successful register on a fresh host.""" + builder = _MockBuilder() + + def do_register() -> None: + register_as_nat_tool(builder) + + do_register._host = builder + do_register._fresh_host_factory = lambda: _MockBuilder() + do_register._call_with_fresh_host = lambda host: register_as_nat_tool(host) + return do_register + + +# (adapter_name, family, factory) +ADAPTERS = [ + ("ml_intern", "A", _ml_intern_install_factory), + ("nat", "B", _nat_register_factory), +] + + +@pytest.fixture(params=ADAPTERS, ids=[a[0] for a in ADAPTERS]) +def adapter(request, tmp_path): + """Parametrized fixture: yields a freshly-built install/register callable.""" + name, family, factory = request.param + fn = factory(tmp_path) + return {"name": name, "family": family, "fn": fn} + + +# --------------------------------------------------------------------------- +# Shared requirement tests (S1-S5 from CONTRACT.md) +# All adapters must pass all of these. +# --------------------------------------------------------------------------- + + +def test_s1_idempotency_double_install_raises(adapter) -> None: + """S1 — Re-installing/re-registering on same host raises clear error.""" + fn = adapter["fn"] + fn() # first call succeeds + with pytest.raises((RuntimeError, ValueError)): + fn() # second call must raise + + +def test_s4_returns_none(adapter) -> None: + """S4 — Adapter returns None (mutates host in place).""" + fn = adapter["fn"] + result = fn() + assert result is None + + +def test_s5_no_partial_state_on_failure(adapter) -> None: + """S5 — Atomicity: failed install leaves host unchanged. + + For Family A: an install on a host with no tool_router should fail with + ValueError, and the host's tool_router (None) and _cosmos_lab_installed + (absent) state must remain unchanged. + + For Family B: an install on an empty builder (no recognizable method) + should fail with AttributeError, and the builder's functions must + remain unchanged. + """ + if adapter["family"] == "A": + # Family A: pass a host without tool_router → should raise + bad_host = _MockSession(tool_router=None) + identity = AgentIdentity.root("test") + audit_path = Path("/tmp/cosmos_lab_test_audit_s5.jsonl") + audit = AuditLog(audit_path) + with pytest.raises(ValueError): + install_into_session(bad_host, identity, audit) + # Host unchanged + assert bad_host.tool_router is None + assert not getattr(bad_host, "_cosmos_lab_installed", False) + else: # Family B + # Family B: pass a builder with no registration method → should raise + class _EmptyBuilder: + pass + + bad_builder = _EmptyBuilder() + with pytest.raises(AttributeError): + register_as_nat_tool(bad_builder) + # No functions/tools attribute should be created + assert not hasattr(bad_builder, "functions") + assert not hasattr(bad_builder, "tools") + + +# --------------------------------------------------------------------------- +# Family-specific contract tests (only run for relevant family) +# --------------------------------------------------------------------------- + + +def test_family_a_wraps_tool_router_with_capability_scoped(adapter, tmp_path) -> None: + """Family A specific: install must wrap host.tool_router with CapabilityScopedRouter.""" + if adapter["family"] != "A": + pytest.skip("Family A only") + + from cosmos_lab.identity import CapabilityScopedRouter + + base_router = _FakeRouter() + session = _MockSession(tool_router=base_router) + identity = AgentIdentity.root("test-fa") + audit = AuditLog(tmp_path / "audit_fa.jsonl") + + install_into_session(session, identity, audit) + + assert isinstance(session.tool_router, CapabilityScopedRouter) + assert session.tool_router._base is base_router + + +def test_family_b_registers_cosmos_lab_principal_tool(adapter, tmp_path) -> None: + """Family B specific: register must add 'cosmos_lab_principal' tool.""" + if adapter["family"] != "B": + pytest.skip("Family B only") + + from cosmos_lab.harness.nat import _COSMOS_LAB_TOOL_NAME + + builder = _MockBuilder() + register_as_nat_tool(builder) + + assert _COSMOS_LAB_TOOL_NAME in builder.functions + assert callable(builder.functions[_COSMOS_LAB_TOOL_NAME]) + + +# --------------------------------------------------------------------------- +# Coverage summary test (meta — sanity check on the parametrization) +# --------------------------------------------------------------------------- + + +def test_both_shipped_adapters_are_in_registry() -> None: + """Sanity: both adapters shipped in P0.5 are registered for parametrization. + + When a new adapter ships (claude_sdk, langgraph), this test should be + extended to assert it's in the registry too. + """ + adapter_names = [a[0] for a in ADAPTERS] + assert "ml_intern" in adapter_names, "Family A: ml_intern adapter (D2) missing from registry" + assert "nat" in adapter_names, "Family B: nat adapter (D3) missing from registry" + + # Family coverage: at least one of each family + families = {a[1] for a in ADAPTERS} + assert "A" in families, "No Family A (execution substrate) adapter shipped" + assert "B" in families, "No Family B (deployment surface) adapter shipped" diff --git a/tests/optimization/harness/test_ml_intern_adapter.py b/tests/optimization/harness/test_ml_intern_adapter.py new file mode 100644 index 00000000..60120ff7 --- /dev/null +++ b/tests/optimization/harness/test_ml_intern_adapter.py @@ -0,0 +1,175 @@ +"""P0.5 D2 — smoke tests for cosmos_lab.harness.ml_intern adapter. + +Tests verify the adapter's contract: it wraps Session.tool_router with +CapabilityScopedRouter, refuses to install on a router-less Session, and +refuses to double-install. + +Uses a duck-typed `MockSession` rather than constructing a real ml-intern +Session — the adapter only touches `.tool_router`, so the smoke test +should verify that one thing without pulling in the full Session +construction stack (which requires Config, ContextManager, event_queue, +sandbox, etc.). +""" + +from __future__ import annotations + +from pathlib import Path +from typing import Any + +import pytest + +from cosmos_lab.harness import install_into_session +from cosmos_lab.harness.ml_intern import _INSTALLED_MARKER +from cosmos_lab.identity import ( + AgentIdentity, + AuditLog, + CapabilityDenied, + CapabilityScopedRouter, +) + + +# --------------------------------------------------------------------------- +# Test doubles +# --------------------------------------------------------------------------- + + +class FakeRouter: + """Duck-typed router matching the Protocol CapabilityScopedRouter wraps.""" + + def __init__(self, tools: dict[str, tuple[str, bool]] | None = None) -> None: + self._tools = tools or { + "read_file": ("file contents", True), + "delete_repo": ("deleted", True), + } + self.calls: list[tuple[str, dict[str, Any]]] = [] + + def get_tool_specs_for_llm(self) -> list[dict[str, Any]]: + return [ + {"type": "function", "function": {"name": name, "description": "", "parameters": {}}} + for name in self._tools + ] + + async def call_tool( + self, + tool_name: str, + arguments: dict[str, Any], + session: Any = None, + tool_call_id: str | None = None, + ) -> tuple[str, bool]: + self.calls.append((tool_name, arguments)) + if tool_name not in self._tools: + return f"unknown tool {tool_name}", False + return self._tools[tool_name] + + +class MockSession: + """Duck-typed Session for adapter tests. + + Adapter only touches `.tool_router` (per its contract). A real + `agent.core.session.Session` requires Config, ContextManager, + event_queue, sandbox, and more — overkill for verifying the + one-method adapter contract. + """ + + def __init__(self, tool_router: Any = None) -> None: + self.tool_router = tool_router + + +# --------------------------------------------------------------------------- +# Contract tests +# --------------------------------------------------------------------------- + + +def test_install_wraps_router_with_capability_scoped(tmp_path: Path) -> None: + base = FakeRouter() + session = MockSession(tool_router=base) + identity = AgentIdentity.scoped("a1", "A1", capabilities=["read_file"]) + audit = AuditLog(tmp_path / "audit.jsonl") + + install_into_session(session, identity, audit) + + assert isinstance(session.tool_router, CapabilityScopedRouter) + assert session.tool_router._base is base + assert getattr(session, _INSTALLED_MARKER) is True + + +def test_install_raises_when_session_has_no_router(tmp_path: Path) -> None: + session = MockSession(tool_router=None) + identity = AgentIdentity.root() + audit = AuditLog(tmp_path / "audit.jsonl") + + with pytest.raises(ValueError, match="no tool_router to wrap"): + install_into_session(session, identity, audit) + + +def test_install_is_not_re_installable_on_same_session(tmp_path: Path) -> None: + session = MockSession(tool_router=FakeRouter()) + identity = AgentIdentity.root() + audit = AuditLog(tmp_path / "audit.jsonl") + + install_into_session(session, identity, audit) + + with pytest.raises(RuntimeError, match="already installed"): + install_into_session(session, identity, audit) + + +# --------------------------------------------------------------------------- +# End-to-end behavior tests (governance actually works after install) +# --------------------------------------------------------------------------- + + +@pytest.mark.asyncio +async def test_after_install_unauthorized_call_is_denied_and_audited( + tmp_path: Path, +) -> None: + base = FakeRouter() + session = MockSession(tool_router=base) + identity = AgentIdentity.scoped("a2", "A2", capabilities=["read_file"]) + audit = AuditLog(tmp_path / "audit.jsonl") + install_into_session(session, identity, audit) + + with pytest.raises(CapabilityDenied): + await session.tool_router.call_tool("delete_repo", {"repo": "x/y"}) + + rows = audit.read_all() + assert len(rows) == 1 + assert rows[0]["phase"] == "denied" + assert rows[0]["tool"] == "delete_repo" + assert rows[0]["agent_id"] == "a2" + + # Base router was never reached. + assert base.calls == [] + + +@pytest.mark.asyncio +async def test_after_install_authorized_call_passes_through_and_audits( + tmp_path: Path, +) -> None: + base = FakeRouter() + session = MockSession(tool_router=base) + identity = AgentIdentity.scoped("a3", "A3", capabilities=["read_file"]) + audit = AuditLog(tmp_path / "audit.jsonl") + install_into_session(session, identity, audit) + + out, ok = await session.tool_router.call_tool("read_file", {"path": "/x"}) + + assert ok is True + assert out == "file contents" + assert base.calls == [("read_file", {"path": "/x"})] + + rows = audit.read_all() + phases = [r["phase"] for r in rows] + assert phases == ["before", "after"] + + +def test_after_install_tool_specs_are_filtered_by_capability(tmp_path: Path) -> None: + base = FakeRouter() + session = MockSession(tool_router=base) + identity = AgentIdentity.scoped("a4", "A4", capabilities=["read_file"]) + audit = AuditLog(tmp_path / "audit.jsonl") + install_into_session(session, identity, audit) + + specs = session.tool_router.get_tool_specs_for_llm() + names = {s["function"]["name"] for s in specs} + + assert names == {"read_file"} # delete_repo filtered out diff --git a/tests/optimization/harness/test_nat_adapter.py b/tests/optimization/harness/test_nat_adapter.py new file mode 100644 index 00000000..01cbb67f --- /dev/null +++ b/tests/optimization/harness/test_nat_adapter.py @@ -0,0 +1,136 @@ +"""P0.5 D3 — smoke tests for cosmos_lab.harness.nat lightweight wrapper. + +Per PLAN_V2.md §0.4.5 v5.1 architecture: nat is a deployment wrapper, not +a runtime substrate. These tests verify the registration contract works +against a duck-typed MockBuilder. Real `nvidia-nat` integration is +validated in P10 deployment. + +Pattern matches D2's MockSession approach (workflow LEARN #3): the +adapter's contract is "I register one tool" — smoke test verifies that +without pulling in the full nat package. +""" + +from __future__ import annotations + +from typing import Any + +import pytest + +from cosmos_lab.harness import register_as_nat_tool +from cosmos_lab.harness.nat import ( + _COSMOS_LAB_TOOL_NAME, + _cosmos_lab_principal_tool, +) + + +# --------------------------------------------------------------------------- +# Test doubles — minimal nat Builder duck types +# --------------------------------------------------------------------------- + + +class MockBuilder: + """nat Builder shape for register_as_nat_tool() tests. + + Real Builder has add_function (or register_function in newer APIs). + Provides .functions registry for idempotency check. + """ + + def __init__(self) -> None: + self.functions: dict[str, Any] = {} + + def add_function(self, name: str, func: Any) -> None: + self.functions[name] = func + + +class MockBuilderAlt: + """nat Builder using register_function naming (future-proofing test).""" + + def __init__(self) -> None: + self.tools: dict[str, Any] = {} + + def register_function(self, name: str, func: Any) -> None: + self.tools[name] = func + + +class EmptyBuilder: + """Builder with no recognizable registration method.""" + + +# --------------------------------------------------------------------------- +# Contract tests — registration mechanics +# --------------------------------------------------------------------------- + + +def test_register_adds_cosmos_lab_principal_tool() -> None: + builder = MockBuilder() + register_as_nat_tool(builder) + assert _COSMOS_LAB_TOOL_NAME in builder.functions + + +def test_register_works_with_alternative_method_name() -> None: + builder = MockBuilderAlt() + register_as_nat_tool(builder) + assert _COSMOS_LAB_TOOL_NAME in builder.tools + + +def test_register_raises_on_double_registration() -> None: + builder = MockBuilder() + register_as_nat_tool(builder) + + with pytest.raises(RuntimeError, match="already registered"): + register_as_nat_tool(builder) + + +def test_register_raises_when_builder_has_no_known_method() -> None: + builder = EmptyBuilder() + with pytest.raises(AttributeError, match="no recognizable tool-registration"): + register_as_nat_tool(builder) + + +def test_register_only_adds_one_tool() -> None: + builder = MockBuilder() + register_as_nat_tool(builder) + assert len(builder.functions) == 1 + + +# --------------------------------------------------------------------------- +# Tool callable contract — what nat will actually invoke +# --------------------------------------------------------------------------- + + +def test_registered_tool_is_callable() -> None: + builder = MockBuilder() + register_as_nat_tool(builder) + tool = builder.functions[_COSMOS_LAB_TOOL_NAME] + assert callable(tool) + + +def test_tool_returns_structured_dict_with_required_keys() -> None: + result = _cosmos_lab_principal_tool(task="dummy task") + required_keys = {"task", "budget_usd", "timeout_sec", "outcome", "cost_usd", "trajectory_id", "status"} + assert required_keys.issubset(result.keys()) + + +def test_tool_passes_task_through() -> None: + result = _cosmos_lab_principal_tool(task="improve cosmos-reason-2 by 3pp") + assert result["task"] == "improve cosmos-reason-2 by 3pp" + + +def test_tool_uses_default_budget_and_timeout() -> None: + result = _cosmos_lab_principal_tool(task="x") + assert result["budget_usd"] == 10.0 + assert result["timeout_sec"] == 600 + + +def test_tool_accepts_explicit_budget_and_timeout() -> None: + result = _cosmos_lab_principal_tool(task="x", budget_usd=400.0, timeout_sec=86400) + assert result["budget_usd"] == 400.0 + assert result["timeout_sec"] == 86400 + + +def test_v0_tool_returns_stub_status() -> None: + """In P0.5 D3, the tool is a registered placeholder; real CLI invocation + lands in P3 (PrincipalAgent v0). The stub status communicates this to + any caller.""" + result = _cosmos_lab_principal_tool(task="x") + assert result["status"] == "stub" diff --git a/tests/optimization/test_identity_scoping.py b/tests/optimization/test_identity_scoping.py new file mode 100644 index 00000000..d629fac8 --- /dev/null +++ b/tests/optimization/test_identity_scoping.py @@ -0,0 +1,271 @@ +"""Phase 0 — identity scoping + audit log + capability-scoped router.""" + +from __future__ import annotations + +import json +from pathlib import Path +from typing import Any + +import pytest + +from agent.optimization import OptimizationConfig, load_optimization_config +from agent.optimization.identity import ( + AgentIdentity, + AuditLog, + CapabilityDenied, + CapabilityScopedRouter, +) + + +# --------------------------------------------------------------------------- +# Test doubles +# --------------------------------------------------------------------------- + + +class FakeRouter: + """Duck-typed stand-in for ToolRouter so tests don't need HF auth or sandbox.""" + + def __init__(self, tools: dict[str, tuple[str, bool]] | None = None) -> None: + # tool_name -> (return_string, success_bool) + self._tools = tools or { + "read_file": ("file contents", True), + "write_file": ("ok", True), + "delete_repo": ("deleted", True), + } + self.calls: list[tuple[str, dict[str, Any]]] = [] + + def get_tool_specs_for_llm(self) -> list[dict[str, Any]]: + return [ + {"type": "function", "function": {"name": name, "description": "", "parameters": {}}} + for name in self._tools + ] + + async def call_tool( + self, + tool_name: str, + arguments: dict[str, Any], + session: Any = None, + tool_call_id: str | None = None, + ) -> tuple[str, bool]: + self.calls.append((tool_name, arguments)) + if tool_name not in self._tools: + return f"unknown tool {tool_name}", False + return self._tools[tool_name] + + +class ExplodingRouter(FakeRouter): + async def call_tool(self, tool_name, arguments, session=None, tool_call_id=None): # type: ignore[override] + raise RuntimeError("kaboom") + + +# --------------------------------------------------------------------------- +# OptimizationConfig +# --------------------------------------------------------------------------- + + +def test_optimization_config_inherits_upstream_fields_with_defaults(): + cfg = OptimizationConfig(model_name="anthropic/claude-sonnet-4-6") + # Upstream fields preserved + assert cfg.max_iterations == 300 + assert cfg.save_sessions is True + # Owned fields with defaults + assert cfg.quality_budget == 0.98 + assert cfg.optimization_loop_enabled is True + assert cfg.optimization_target is None + assert cfg.audit_log_path.endswith("audit.jsonl") + + +def test_optimization_config_round_trips_via_pydantic(): + raw = { + "model_name": "anthropic/claude-sonnet-4-6", + "optimization_target": "throughput", + "quality_budget": 0.95, + } + cfg = OptimizationConfig.model_validate(raw) + assert cfg.optimization_target == "throughput" + assert cfg.quality_budget == 0.95 + + +def test_load_optimization_config_preserves_owned_fields(tmp_path): + """Regression: upstream load_config() returns a base Config and silently + drops owned fields. load_optimization_config must preserve them.""" + import json + + cfg_path = tmp_path / "cfg.json" + cfg_path.write_text( + json.dumps( + { + "model_name": "anthropic/claude-sonnet-4-6", + "optimization_target": "latency", + "quality_budget": 0.95, + "trajectory_db_path": "/tmp/foo.duckdb", + } + ) + ) + cfg = load_optimization_config(str(cfg_path)) + assert isinstance(cfg, OptimizationConfig) + assert cfg.optimization_target == "latency" + assert cfg.quality_budget == 0.95 + assert cfg.trajectory_db_path == "/tmp/foo.duckdb" + + +def test_shipped_config_loads_with_owned_fields(): + """The actual shipped config file must round-trip through the loader.""" + cfg = load_optimization_config("configs/optimization_agent_config.json") + assert isinstance(cfg, OptimizationConfig) + assert cfg.quality_budget == 0.98 + assert cfg.optimization_loop_enabled is True + assert cfg.audit_log_path.endswith("audit.jsonl") + assert cfg.trajectory_db_path.endswith("trajectories.duckdb") + + +# --------------------------------------------------------------------------- +# AgentIdentity +# --------------------------------------------------------------------------- + + +def test_root_identity_can_call_anything(): + root = AgentIdentity.root() + assert root.can_call("read_file") + assert root.can_call("delete_repo") + assert root.can_call("any_future_tool_name") + + +def test_scoped_identity_allows_listed_only(): + ident = AgentIdentity.scoped( + agent_id="reader", + display_name="ReadOnlyAgent", + capabilities=["read_file"], + ) + assert ident.can_call("read_file") + assert not ident.can_call("write_file") + assert not ident.can_call("delete_repo") + + +def test_scoped_identity_records_parent_chain(): + parent = AgentIdentity.root("parent") + child = AgentIdentity.scoped( + agent_id="child", display_name="Child", capabilities=["read_file"], parent_id=parent.agent_id + ) + assert child.parent_id == "parent" + + +# --------------------------------------------------------------------------- +# AuditLog +# --------------------------------------------------------------------------- + + +def test_audit_log_writes_jsonl_round_trip(tmp_path: Path): + log = AuditLog(tmp_path / "audit.jsonl") + log.record({"event": "start", "n": 1}) + log.record({"event": "end", "n": 2}) + rows = log.read_all() + assert len(rows) == 2 + assert rows[0]["event"] == "start" + assert rows[1]["n"] == 2 + + +def test_audit_log_creates_parent_dirs(tmp_path: Path): + deep = tmp_path / "a" / "b" / "c" / "audit.jsonl" + log = AuditLog(deep) + log.record({"x": 1}) + assert deep.exists() + assert deep.read_text().strip().startswith("{") + + +def test_audit_log_lines_are_independently_parseable(tmp_path: Path): + log = AuditLog(tmp_path / "audit.jsonl") + for i in range(5): + log.record({"i": i}) + lines = (tmp_path / "audit.jsonl").read_text().splitlines() + assert len(lines) == 5 + for i, line in enumerate(lines): + assert json.loads(line)["i"] == i + + +# --------------------------------------------------------------------------- +# CapabilityScopedRouter +# --------------------------------------------------------------------------- + + +@pytest.mark.asyncio +async def test_router_denies_unauthorized_call_and_audits(tmp_path: Path): + audit = AuditLog(tmp_path / "audit.jsonl") + ident = AgentIdentity.scoped("a1", "A1", capabilities=["read_file"]) + router = CapabilityScopedRouter(FakeRouter(), ident, audit) + + with pytest.raises(CapabilityDenied): + await router.call_tool("delete_repo", {"repo": "x/y"}) + + rows = audit.read_all() + assert len(rows) == 1 + assert rows[0]["phase"] == "denied" + assert rows[0]["allowed"] is False + assert rows[0]["tool"] == "delete_repo" + assert rows[0]["agent_id"] == "a1" + + +@pytest.mark.asyncio +async def test_router_allows_authorized_call_and_audits_before_after(tmp_path: Path): + audit = AuditLog(tmp_path / "audit.jsonl") + ident = AgentIdentity.scoped("a2", "A2", capabilities=["read_file"]) + base = FakeRouter() + router = CapabilityScopedRouter(base, ident, audit) + + out, ok = await router.call_tool("read_file", {"path": "/x"}) + assert ok is True + assert out == "file contents" + assert base.calls == [("read_file", {"path": "/x"})] + + rows = audit.read_all() + phases = [r["phase"] for r in rows] + assert phases == ["before", "after"] + assert rows[1]["success"] is True + assert rows[1]["duration_ms"] >= 0 + assert rows[1]["result_summary"].startswith("file contents") + + +@pytest.mark.asyncio +async def test_router_audits_exception_and_reraises(tmp_path: Path): + audit = AuditLog(tmp_path / "audit.jsonl") + ident = AgentIdentity.root("root") + router = CapabilityScopedRouter(ExplodingRouter(), ident, audit) + + with pytest.raises(RuntimeError, match="kaboom"): + await router.call_tool("read_file", {}) + + rows = audit.read_all() + assert [r["phase"] for r in rows] == ["before", "exception"] + assert rows[1]["exception_type"] == "RuntimeError" + assert "kaboom" in rows[1]["exception_msg"] + + +def test_router_filters_tool_specs_by_capability(tmp_path: Path): + audit = AuditLog(tmp_path / "audit.jsonl") + ident = AgentIdentity.scoped("a3", "A3", capabilities=["read_file"]) + router = CapabilityScopedRouter(FakeRouter(), ident, audit) + + specs = router.get_tool_specs_for_llm() + names = {s["function"]["name"] for s in specs} + assert names == {"read_file"} + + +def test_router_root_sees_all_specs(tmp_path: Path): + audit = AuditLog(tmp_path / "audit.jsonl") + router = CapabilityScopedRouter(FakeRouter(), AgentIdentity.root(), audit) + names = {s["function"]["name"] for s in router.get_tool_specs_for_llm()} + assert names == {"read_file", "write_file", "delete_repo"} + + +@pytest.mark.asyncio +async def test_router_args_hash_is_canonical(tmp_path: Path): + """Reordered keys with same values must produce same args_hash.""" + audit = AuditLog(tmp_path / "audit.jsonl") + router = CapabilityScopedRouter(FakeRouter(), AgentIdentity.root(), audit) + + await router.call_tool("read_file", {"a": 1, "b": 2}) + await router.call_tool("read_file", {"b": 2, "a": 1}) + rows = audit.read_all() + before_rows = [r for r in rows if r["phase"] == "before"] + assert len(before_rows) == 2 + assert before_rows[0]["args_hash"] == before_rows[1]["args_hash"] diff --git a/tests/unit/test_agent_model_gating.py b/tests/unit/test_agent_model_gating.py new file mode 100644 index 00000000..f3cdd423 --- /dev/null +++ b/tests/unit/test_agent_model_gating.py @@ -0,0 +1,255 @@ +"""Tests for gated model handling in backend/routes/agent.py.""" + +import sys +from pathlib import Path +from types import SimpleNamespace + +import pytest +from fastapi import HTTPException + +_BACKEND_DIR = Path(__file__).resolve().parent.parent.parent / "backend" +if str(_BACKEND_DIR) not in sys.path: + sys.path.insert(0, str(_BACKEND_DIR)) + +from routes import agent # noqa: E402 + + +@pytest.fixture(autouse=True) +def _reset_quota_store(): + agent.user_quotas._reset_for_tests() + yield + agent.user_quotas._reset_for_tests() + + +def test_gated_model_predicate_includes_bedrock_claude_and_gpt55_only(): + assert agent._is_gated_model("bedrock/us.anthropic.claude-opus-4-6-v1") + assert agent._is_gated_model("openai/gpt-5.5") + assert not agent._is_gated_model("anthropic/claude-opus-4-6") + assert not agent._is_gated_model("moonshotai/Kimi-K2.6") + + +@pytest.mark.asyncio +async def test_gated_model_gate_rejects_gpt55_for_non_hf_user(monkeypatch): + async def fake_require_hf_org_member(_request): + return False + + monkeypatch.setattr( + agent, + "require_huggingface_org_member", + fake_require_hf_org_member, + ) + + with pytest.raises(HTTPException) as exc_info: + await agent._require_hf_for_gated_model(None, "openai/gpt-5.5") + + assert exc_info.value.status_code == 403 + assert exc_info.value.detail["error"] == "premium_model_restricted" + + +@pytest.mark.asyncio +async def test_default_gated_session_falls_back_to_free_model_for_non_hf_user( + monkeypatch, +): + async def fake_require_hf_org_member(_request): + return False + + monkeypatch.setattr( + agent, + "require_huggingface_org_member", + fake_require_hf_org_member, + ) + monkeypatch.setattr( + agent.session_manager.config, + "model_name", + agent.DEFAULT_CLAUDE_MODEL_ID, + ) + + model = await agent._model_override_for_new_session(None, None) + + assert model == agent.DEFAULT_FREE_MODEL_ID + + +@pytest.mark.asyncio +async def test_default_gated_session_stays_default_for_hf_user(monkeypatch): + async def fake_require_hf_org_member(_request): + return True + + monkeypatch.setattr( + agent, + "require_huggingface_org_member", + fake_require_hf_org_member, + ) + monkeypatch.setattr( + agent.session_manager.config, + "model_name", + agent.DEFAULT_CLAUDE_MODEL_ID, + ) + + model = await agent._model_override_for_new_session(None, None) + + assert model is None + + +@pytest.mark.asyncio +async def test_explicit_gated_session_allowed_for_hf_user(monkeypatch): + async def fake_require_hf_org_member(_request): + return True + + monkeypatch.setattr( + agent, + "require_huggingface_org_member", + fake_require_hf_org_member, + ) + + model = await agent._model_override_for_new_session( + None, + agent.DEFAULT_CLAUDE_MODEL_ID, + ) + + assert model == agent.DEFAULT_CLAUDE_MODEL_ID + + +@pytest.mark.asyncio +async def test_explicit_gated_session_request_still_rejects_non_hf_user(monkeypatch): + async def fake_require_hf_org_member(_request): + return False + + monkeypatch.setattr(agent, "require_huggingface_org_member", fake_require_hf_org_member) + + with pytest.raises(HTTPException) as exc_info: + await agent._model_override_for_new_session(None, agent.DEFAULT_CLAUDE_MODEL_ID) + + assert exc_info.value.status_code == 403 + assert exc_info.value.detail["error"] == "premium_model_restricted" + + +@pytest.mark.asyncio +async def test_ungated_models_skip_hf_membership_check(monkeypatch): + async def fail_if_called(_request): + raise AssertionError("ungated models must not require HF org membership") + + monkeypatch.setattr(agent, "require_huggingface_org_member", fail_if_called) + + await agent._require_hf_for_gated_model(None, "moonshotai/Kimi-K2.6") + await agent._require_hf_for_gated_model(None, "anthropic/claude-opus-4-6") + + +@pytest.mark.asyncio +async def test_gated_quota_charges_gpt55(monkeypatch): + persisted = [] + + async def fake_persist_session_snapshot(agent_session): + persisted.append(agent_session) + + monkeypatch.setattr( + agent.session_manager, + "persist_session_snapshot", + fake_persist_session_snapshot, + ) + + agent_session = SimpleNamespace( + claude_counted=False, + session=SimpleNamespace( + config=SimpleNamespace(model_name="openai/gpt-5.5"), + ), + ) + + await agent._enforce_gated_model_quota( + {"user_id": "u1", "plan": "free"}, + agent_session, + ) + + assert agent_session.claude_counted is True + assert persisted == [agent_session] + assert await agent.user_quotas.get_claude_used_today("u1") == 1 + + +@pytest.mark.asyncio +async def test_gated_quota_skips_direct_anthropic(monkeypatch): + async def fail_if_persisted(_agent_session): + raise AssertionError("direct Anthropic should not consume deployed gated quota") + + monkeypatch.setattr( + agent.session_manager, + "persist_session_snapshot", + fail_if_persisted, + ) + + agent_session = SimpleNamespace( + claude_counted=False, + session=SimpleNamespace( + config=SimpleNamespace(model_name="anthropic/claude-opus-4-6"), + ), + ) + + await agent._enforce_gated_model_quota( + {"user_id": "u1", "plan": "free"}, + agent_session, + ) + + assert agent_session.claude_counted is False + assert await agent.user_quotas.get_claude_used_today("u1") == 0 + + +@pytest.mark.asyncio +async def test_user_quota_response_uses_premium_fields_only(monkeypatch): + async def fake_get_used_today(user_id): + assert user_id == "u1" + return 2 + + monkeypatch.setattr(agent.user_quotas, "get_claude_used_today", fake_get_used_today) + monkeypatch.setattr(agent.user_quotas, "daily_cap_for", lambda plan: 5) + + response = await agent.get_user_quota({"user_id": "u1", "plan": "pro"}) + + assert response == { + "plan": "pro", + "premium_used_today": 2, + "premium_daily_cap": 5, + "premium_remaining": 3, + } + + +@pytest.mark.asyncio +async def test_set_session_yolo_calls_manager_with_cap_presence(monkeypatch): + async def fake_check_session_access(session_id, user, request=None): + assert session_id == "s1" + assert user["user_id"] == "u1" + return object() + + calls = [] + + async def fake_update_session_auto_approval(session_id, **kwargs): + calls.append((session_id, kwargs)) + return { + "enabled": kwargs["enabled"], + "cost_cap_usd": 7.5, + "estimated_spend_usd": 0.0, + "remaining_usd": 7.5, + } + + monkeypatch.setattr(agent, "_check_session_access", fake_check_session_access) + monkeypatch.setattr( + agent.session_manager, + "update_session_auto_approval", + fake_update_session_auto_approval, + ) + + response = await agent.set_session_yolo( + "s1", + agent.SessionYoloRequest(enabled=True, cost_cap_usd=7.5), + {"user_id": "u1"}, + ) + + assert response["enabled"] is True + assert response["remaining_usd"] == 7.5 + assert calls == [ + ( + "s1", + { + "enabled": True, + "cost_cap_usd": 7.5, + "cap_provided": True, + }, + ) + ] diff --git a/tests/unit/test_auto_approval_policy.py b/tests/unit/test_auto_approval_policy.py new file mode 100644 index 00000000..3d8b37fe --- /dev/null +++ b/tests/unit/test_auto_approval_policy.py @@ -0,0 +1,185 @@ +from types import SimpleNamespace + +import pytest + +from agent.config import Config +from agent.core import agent_loop +from agent.core.cost_estimation import CostEstimate + + +def _config(**overrides): + data = { + "model_name": "moonshotai/Kimi-K2.6", + "confirm_cpu_jobs": True, + "auto_file_upload": False, + "yolo_mode": False, + **overrides, + } + return Config.model_validate(data) + + +def _session(*, cap=5.0, spent=0.0, enabled=True): + return SimpleNamespace( + config=_config(), + auto_approval_enabled=enabled, + auto_approval_cost_cap_usd=cap, + auto_approval_estimated_spend_usd=spent, + sandbox=None, + ) + + +@pytest.mark.asyncio +async def test_session_yolo_auto_approves_non_costed_approval_tool(): + decision = await agent_loop._approval_decision( + "hf_repo_files", + {"operation": "upload", "path": "README.md"}, + _session(), + ) + + assert decision.requires_approval is False + assert decision.auto_approved is True + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "operation", + ["scheduled run", "scheduled uv", "scheduled run"], +) +async def test_scheduled_hf_jobs_always_require_manual_approval(operation): + session = _session() + session.config.yolo_mode = True + + decision = await agent_loop._approval_decision( + "hf_jobs", + {"operation": operation, "script": "print(1)"}, + session, + ) + + assert decision.requires_approval is True + assert decision.auto_approval_blocked is True + assert "Scheduled HF jobs" in decision.block_reason + assert agent_loop._needs_approval("hf_jobs", {"operation": operation}, session.config) + + +@pytest.mark.asyncio +async def test_immediate_hf_job_under_cap_auto_runs(monkeypatch): + async def fake_estimate(*args, **kwargs): + return CostEstimate(estimated_cost_usd=2.0, billable=True) + + monkeypatch.setattr(agent_loop, "estimate_tool_cost", fake_estimate) + + decision = await agent_loop._approval_decision( + "hf_jobs", + {"operation": "run", "hardware_flavor": "a10g-large", "timeout": "1h"}, + _session(cap=5.0, spent=1.0), + ) + + assert decision.requires_approval is False + assert decision.auto_approved is True + assert decision.estimated_cost_usd == 2.0 + + +@pytest.mark.asyncio +async def test_immediate_hf_job_over_cap_falls_back_to_approval(monkeypatch): + async def fake_estimate(*args, **kwargs): + return CostEstimate(estimated_cost_usd=2.0, billable=True) + + monkeypatch.setattr(agent_loop, "estimate_tool_cost", fake_estimate) + + decision = await agent_loop._approval_decision( + "hf_jobs", + {"operation": "run", "hardware_flavor": "a10g-large", "timeout": "1h"}, + _session(cap=5.0, spent=4.0), + ) + + assert decision.requires_approval is True + assert decision.auto_approval_blocked is True + assert "exceeds" in decision.block_reason + assert decision.remaining_cap_usd == 1.0 + + +@pytest.mark.asyncio +async def test_unknown_cost_falls_back_to_approval(monkeypatch): + async def fake_estimate(*args, **kwargs): + return CostEstimate( + estimated_cost_usd=None, + billable=True, + block_reason="No price is available.", + ) + + monkeypatch.setattr(agent_loop, "estimate_tool_cost", fake_estimate) + + decision = await agent_loop._approval_decision( + "sandbox_create", + {"hardware": "mystery-gpu"}, + _session(), + ) + + assert decision.requires_approval is True + assert decision.auto_approval_blocked is True + assert decision.estimated_cost_usd is None + + +@pytest.mark.asyncio +async def test_batch_reservation_blocks_second_over_budget_job(monkeypatch): + async def fake_estimate(*args, **kwargs): + return CostEstimate(estimated_cost_usd=3.0, billable=True) + + monkeypatch.setattr(agent_loop, "estimate_tool_cost", fake_estimate) + session = _session(cap=5.0, spent=0.0) + + first = await agent_loop._approval_decision( + "hf_jobs", + {"operation": "run", "hardware_flavor": "a10g-large"}, + session, + reserved_spend_usd=0.0, + ) + second = await agent_loop._approval_decision( + "hf_jobs", + {"operation": "run", "hardware_flavor": "a10g-large"}, + session, + reserved_spend_usd=first.estimated_cost_usd or 0.0, + ) + + assert first.requires_approval is False + assert second.requires_approval is True + assert second.remaining_cap_usd == 2.0 + + +@pytest.mark.asyncio +async def test_manual_approval_does_not_record_spend_when_session_yolo_disabled(monkeypatch): + called = False + + async def fake_estimate(*args, **kwargs): + nonlocal called + called = True + return CostEstimate(estimated_cost_usd=2.0, billable=True) + + monkeypatch.setattr(agent_loop, "estimate_tool_cost", fake_estimate) + session = _session(enabled=False, cap=5.0, spent=0.0) + + await agent_loop._record_manual_approved_spend_if_needed( + session, + "sandbox_create", + {"hardware": "a10g-large"}, + ) + + assert called is False + assert session.auto_approval_estimated_spend_usd == 0.0 + + +@pytest.mark.asyncio +async def test_manual_approval_records_spend_when_session_yolo_enabled(monkeypatch): + async def fake_estimate(*args, **kwargs): + return CostEstimate(estimated_cost_usd=1.25, billable=True) + + monkeypatch.setattr(agent_loop, "estimate_tool_cost", fake_estimate) + session = _session(enabled=True, cap=5.0, spent=0.5) + + await agent_loop._record_manual_approved_spend_if_needed( + session, + "sandbox_create", + {"hardware": "a10g-large"}, + ) + + assert session.auto_approval_estimated_spend_usd == 1.75 diff --git a/tests/unit/test_cost_estimation.py b/tests/unit/test_cost_estimation.py new file mode 100644 index 00000000..3127de79 --- /dev/null +++ b/tests/unit/test_cost_estimation.py @@ -0,0 +1,58 @@ +from types import SimpleNamespace + +import pytest + +from agent.core import cost_estimation + + +def test_parse_timeout_hours_common_units(): + assert cost_estimation.parse_timeout_hours(None) == 0.5 + assert cost_estimation.parse_timeout_hours("30m") == 0.5 + assert cost_estimation.parse_timeout_hours("3h") == 3 + assert cost_estimation.parse_timeout_hours(3600) == 1 + assert cost_estimation.parse_timeout_hours("not-a-duration") is None + + +@pytest.mark.asyncio +async def test_estimate_hf_job_cost_uses_catalog_price(monkeypatch): + async def fake_catalog(): + return {"a100-large": 4.0} + + monkeypatch.setattr(cost_estimation, "hf_jobs_price_catalog", fake_catalog) + + estimate = await cost_estimation.estimate_hf_job_cost( + {"hardware_flavor": "a100-large", "timeout": "8h"} + ) + + assert estimate.estimated_cost_usd == 32.0 + assert estimate.billable is True + + +@pytest.mark.asyncio +async def test_estimate_hf_job_cost_blocks_unknown_price(monkeypatch): + async def fake_catalog(): + return {} + + monkeypatch.setattr(cost_estimation, "hf_jobs_price_catalog", fake_catalog) + + estimate = await cost_estimation.estimate_hf_job_cost( + {"hardware_flavor": "mystery-gpu", "timeout": "30m"} + ) + + assert estimate.estimated_cost_usd is None + assert estimate.billable is True + assert "No price" in estimate.block_reason + + +@pytest.mark.asyncio +async def test_estimate_sandbox_cost_is_zero_for_existing_or_cpu_basic(): + existing = await cost_estimation.estimate_sandbox_cost( + {"hardware": "a100-large"}, + session=SimpleNamespace(sandbox=object()), + ) + cpu = await cost_estimation.estimate_sandbox_cost({"hardware": "cpu-basic"}) + + assert existing.estimated_cost_usd == 0.0 + assert existing.billable is False + assert cpu.estimated_cost_usd == 0.0 + assert cpu.billable is False diff --git a/tests/unit/test_dangling_tool_calls.py b/tests/unit/test_dangling_tool_calls.py index 1e8ac3fa..513c8da2 100644 --- a/tests/unit/test_dangling_tool_calls.py +++ b/tests/unit/test_dangling_tool_calls.py @@ -28,6 +28,7 @@ def _make_cm() -> ContextManager: cm.running_context_usage = 0 cm.untouched_messages = 5 cm.items = [Message(role="system", content="system")] + cm.on_message_added = None return cm @@ -66,6 +67,15 @@ def test_no_orphan_means_no_stub(): assert tool_msgs[0].content == "ok" +def test_add_message_records_message_timestamp(): + cm = _make_cm() + msg = Message(role="user", content="hello") + + cm.add_message(msg) + + assert getattr(cm.items[-1], "timestamp", None) + + def test_multiple_dangling_tool_calls_in_one_assistant_message_are_all_patched(): cm = _make_cm() cm.items.extend([ diff --git a/tests/unit/test_personal_trace_repo.py b/tests/unit/test_personal_trace_repo.py new file mode 100644 index 00000000..40a90856 --- /dev/null +++ b/tests/unit/test_personal_trace_repo.py @@ -0,0 +1,43 @@ +import asyncio +from types import SimpleNamespace + +from agent.core.session import Session + + +class DummyToolRouter: + def get_tool_specs_for_llm(self) -> list[dict]: + return [] + + +def _session(*, user_id: str | None, hf_username: str | None) -> Session: + config = SimpleNamespace( + model_name="moonshotai/Kimi-K2.6", + save_sessions=True, + share_traces=True, + personal_trace_repo_template="{hf_user}/ml-intern-sessions", + session_dataset_repo="smolagents/ml-intern-sessions", + auto_save_interval=1, + heartbeat_interval_s=0, + reasoning_effort=None, + ) + context_manager = SimpleNamespace(items=[], on_message_added=None) + return Session( + event_queue=asyncio.Queue(), + config=config, + tool_router=DummyToolRouter(), + context_manager=context_manager, + user_id=user_id, + hf_username=hf_username, + ) + + +def test_personal_trace_repo_uses_hf_username_before_oauth_subject(): + session = _session(user_id="oauth-subject", hf_username="lewtun") + + assert session._personal_trace_repo_id() == "lewtun/ml-intern-sessions" + + +def test_personal_trace_repo_falls_back_to_user_id_for_cli(): + session = _session(user_id="lewtun", hf_username=None) + + assert session._personal_trace_repo_id() == "lewtun/ml-intern-sessions" diff --git a/tests/unit/test_sandbox_api_auth.py b/tests/unit/test_sandbox_api_auth.py index e60dfa5b..83b666b6 100644 --- a/tests/unit/test_sandbox_api_auth.py +++ b/tests/unit/test_sandbox_api_auth.py @@ -37,7 +37,7 @@ def test_file_and_command_routes_require_bearer_token(monkeypatch): assert response.status_code == 401 -def test_file_and_command_routes_accept_valid_bearer_token(monkeypatch): +def test_file_and_command_routes_reject_authorization_bearer_token(monkeypatch): client = TestClient(_sandbox_app(monkeypatch, "sandbox-secret")) response = client.post( @@ -46,11 +46,42 @@ def test_file_and_command_routes_accept_valid_bearer_token(monkeypatch): headers={"Authorization": "Bearer sandbox-secret"}, ) + assert response.status_code == 401 + + +def test_file_and_command_routes_accept_sandbox_header_with_hf_bearer(monkeypatch): + client = TestClient( + _sandbox_app(monkeypatch, "sandbox-secret", hf_token="hf-secret") + ) + + response = client.post( + "/api/exists", + json={"path": "/tmp"}, + headers={ + "Authorization": "Bearer hf-secret", + "X-Sandbox-Authorization": "Bearer sandbox-secret", + }, + ) + assert response.status_code == 200 assert response.json()["success"] is True -def test_legacy_hf_token_fallback_is_accepted(monkeypatch): +def test_hf_bearer_alone_is_rejected_when_sandbox_token_is_configured(monkeypatch): + client = TestClient( + _sandbox_app(monkeypatch, "sandbox-secret", hf_token="hf-secret") + ) + + response = client.post( + "/api/exists", + json={"path": "/tmp"}, + headers={"Authorization": "Bearer hf-secret"}, + ) + + assert response.status_code == 401 + + +def test_legacy_hf_token_fallback_is_rejected(monkeypatch): client = TestClient(_sandbox_app(monkeypatch, token=None, hf_token="hf-secret")) response = client.post( @@ -59,8 +90,7 @@ def test_legacy_hf_token_fallback_is_accepted(monkeypatch): headers={"Authorization": "Bearer hf-secret"}, ) - assert response.status_code == 200 - assert response.json()["success"] is True + assert response.status_code == 503 def test_protected_routes_fail_closed_without_configured_token(monkeypatch): @@ -75,10 +105,11 @@ def test_protected_routes_fail_closed_without_configured_token(monkeypatch): assert response.status_code == 503 -def test_sandbox_prefers_control_plane_token_for_api_headers(): +def test_sandbox_sends_hub_auth_and_control_plane_header(): sandbox = Sandbox("owner/name", token="hf-token", api_token="sandbox-secret") - assert sandbox._client.headers["authorization"] == "Bearer sandbox-secret" + assert sandbox._client.headers["authorization"] == "Bearer hf-token" + assert sandbox._client.headers["x-sandbox-authorization"] == "Bearer sandbox-secret" def test_sandbox_api_token_is_hidden_from_repr(): diff --git a/tests/unit/test_sandbox_auto_start.py b/tests/unit/test_sandbox_auto_start.py new file mode 100644 index 00000000..b99e28ca --- /dev/null +++ b/tests/unit/test_sandbox_auto_start.py @@ -0,0 +1,31 @@ +from types import SimpleNamespace +from pathlib import Path + +from agent.core.agent_loop import _needs_approval +from agent.tools.sandbox_tool import get_sandbox_tools + + +def test_default_cpu_sandbox_create_does_not_require_approval(): + config = SimpleNamespace(yolo_mode=False) + + assert _needs_approval("sandbox_create", {}, config) is False + assert _needs_approval("sandbox_create", {"hardware": "cpu-basic"}, config) is False + + +def test_non_default_sandbox_create_still_requires_approval(): + config = SimpleNamespace(yolo_mode=False) + + assert _needs_approval("sandbox_create", {"hardware": "cpu-upgrade"}, config) is True + assert _needs_approval("sandbox_create", {"hardware": "t4-small"}, config) is True + + +def test_prompt_and_tool_specs_do_not_require_cpu_sandbox_create(): + prompt = Path("agent/prompts/system_prompt_v3.yaml").read_text() + tool_specs = {tool.name: tool.description for tool in get_sandbox_tools()} + + assert "sandbox_create → install deps" not in prompt + assert "Do NOT call sandbox_create before normal CPU work" in prompt + assert "cpu-basic sandbox is already available" in prompt + + assert "cpu-basic sandbox is already started automatically" in tool_specs["sandbox_create"] + assert "started automatically for normal CPU work" in tool_specs["bash"] diff --git a/tests/unit/test_sandbox_private_spaces.py b/tests/unit/test_sandbox_private_spaces.py new file mode 100644 index 00000000..843a0c1c --- /dev/null +++ b/tests/unit/test_sandbox_private_spaces.py @@ -0,0 +1,421 @@ +import asyncio +import threading +import time +from types import SimpleNamespace + +from agent.core import telemetry +from agent.tools import sandbox_client, sandbox_tool +from agent.tools.sandbox_client import Sandbox +from agent.tools.sandbox_tool import sandbox_create_handler + + +def test_sandbox_client_defaults_to_private_spaces(monkeypatch): + duplicate_kwargs = {} + requested_hardware = [] + + class FakeApi: + def __init__(self, token=None): + self.token = token + + def duplicate_space(self, **kwargs): + duplicate_kwargs.update(kwargs) + + def request_space_hardware(self, space_id, hardware, sleep_time=None): + requested_hardware.append((space_id, hardware, sleep_time)) + return SimpleNamespace(stage="BUILDING", hardware=None) + + def add_space_secret(self, *args, **kwargs): + pass + + def get_space_runtime(self, space_id): + return SimpleNamespace(stage="RUNNING", hardware="cpu-basic") + + monkeypatch.setattr(sandbox_client, "HfApi", FakeApi) + monkeypatch.setattr( + Sandbox, + "_setup_server", + staticmethod(lambda *args, **kwargs: None), + ) + monkeypatch.setattr(Sandbox, "_wait_for_api", lambda self, *args, **kwargs: None) + + Sandbox.create(owner="alice", token="hf-token", log=lambda msg: None) + + assert duplicate_kwargs["private"] is True + assert requested_hardware == [(duplicate_kwargs["to_id"], "cpu-basic", None)] + + +def test_sandbox_client_retries_transient_runtime_404(monkeypatch): + runtime_calls = 0 + + class FakeResponse: + status_code = 404 + + class FakeRuntime404(Exception): + response = FakeResponse() + + def __str__(self): + return "404 Client Error: Repository Not Found" + + class FakeApi: + def __init__(self, token=None): + self.token = token + + def duplicate_space(self, **kwargs): + pass + + def request_space_hardware(self, space_id, hardware, sleep_time=None): + return SimpleNamespace(stage="BUILDING", hardware=None) + + def add_space_secret(self, *args, **kwargs): + pass + + def get_space_runtime(self, space_id): + nonlocal runtime_calls + runtime_calls += 1 + if runtime_calls == 1: + raise FakeRuntime404() + return SimpleNamespace(stage="RUNNING", hardware="cpu-basic") + + monkeypatch.setattr(sandbox_client, "HfApi", FakeApi) + monkeypatch.setattr(sandbox_client.time, "sleep", lambda seconds: None) + monkeypatch.setattr( + Sandbox, + "_setup_server", + staticmethod(lambda *args, **kwargs: None), + ) + monkeypatch.setattr(Sandbox, "_wait_for_api", lambda self, *args, **kwargs: None) + + sandbox = Sandbox.create(owner="alice", token="hf-token", log=lambda msg: None) + + assert sandbox.space_id.startswith("alice/sandbox-") + assert runtime_calls == 2 + + +def test_sandbox_tool_forces_private_spaces(monkeypatch): + captured_kwargs = {} + + async def fake_ensure_sandbox( + session, + hardware="cpu-basic", + extra_secrets=None, + **create_kwargs, + ): + captured_kwargs.update(create_kwargs) + return ( + SimpleNamespace( + space_id="alice/sandbox-12345678", + url="https://huggingface.co/spaces/alice/sandbox-12345678", + ), + None, + ) + + monkeypatch.setattr(sandbox_tool, "_ensure_sandbox", fake_ensure_sandbox) + + out, ok = asyncio.run( + sandbox_create_handler( + {"private": False}, + session=SimpleNamespace(sandbox=None), + ) + ) + + assert ok is True + assert "private" not in captured_kwargs + assert "Visibility: private" in out + + +def test_orphan_sweep_preserves_spaces_without_last_modified(): + deleted: list[str] = [] + logs: list[str] = [] + + class FakeApi: + def list_spaces(self, **kwargs): + assert kwargs["full"] is True + return [SimpleNamespace(id="alice/sandbox-12345678")] + + def delete_repo(self, repo_id, repo_type): + deleted.append(repo_id) + + count = sandbox_tool._cleanup_user_orphan_sandboxes( + FakeApi(), + "alice", + logs.append, + ) + + assert count == 0 + assert deleted == [] + assert logs == ["orphan sweep: skipping alice/sandbox-12345678; missing lastModified"] + + +def test_ensure_sandbox_overrides_private_argument(monkeypatch): + captured_kwargs = {} + + class FakeApi: + def __init__(self, token=None): + self.token = token + + def whoami(self): + return {"name": "alice"} + + class FakeSession: + def __init__(self): + self.hf_token = "hf-token" + self.sandbox = None + self.event_queue = SimpleNamespace(put_nowait=lambda event: None) + self._cancelled = asyncio.Event() + + async def send_event(self, event): + pass + + def fake_create(**kwargs): + captured_kwargs.update(kwargs) + return SimpleNamespace( + space_id="alice/sandbox-12345678", + url="https://huggingface.co/spaces/alice/sandbox-12345678", + ) + + async def fake_record_sandbox_create(*args, **kwargs): + pass + + monkeypatch.setattr(sandbox_tool, "HfApi", FakeApi) + monkeypatch.setattr(sandbox_tool, "_cleanup_user_orphan_sandboxes", lambda *args: 0) + monkeypatch.setattr(Sandbox, "create", staticmethod(fake_create)) + monkeypatch.setattr(telemetry, "record_sandbox_create", fake_record_sandbox_create) + monkeypatch.setattr("huggingface_hub.metadata_update", lambda *args, **kwargs: None) + + async def run(): + session = FakeSession() + sb, error = await sandbox_tool._ensure_sandbox(session, private=False) + return sb, error + + sb, error = asyncio.run(run()) + + assert error is None + assert sb is not None + assert captured_kwargs["private"] is True + + +def test_sandbox_creation_is_serialized_per_owner(monkeypatch): + active_creates = 0 + max_active_creates = 0 + active_lock = threading.Lock() + + class FakeApi: + def __init__(self, token=None): + self.token = token + + def whoami(self): + return {"name": "alice"} + + class FakeSession: + def __init__(self): + self.hf_token = "hf-token" + self.sandbox = None + self.event_queue = SimpleNamespace(put_nowait=lambda event: None) + self._cancelled = asyncio.Event() + + async def send_event(self, event): + pass + + def fake_create(**kwargs): + nonlocal active_creates, max_active_creates + with active_lock: + active_creates += 1 + max_active_creates = max(max_active_creates, active_creates) + time.sleep(0.02) + with active_lock: + active_creates -= 1 + return SimpleNamespace( + space_id=f"alice/sandbox-{kwargs['hardware']}", + url="https://huggingface.co/spaces/alice/sandbox", + ) + + async def fake_record_sandbox_create(*args, **kwargs): + pass + + monkeypatch.setattr(sandbox_tool, "HfApi", FakeApi) + monkeypatch.setattr(sandbox_tool, "_cleanup_user_orphan_sandboxes", lambda *args: 0) + monkeypatch.setattr(Sandbox, "create", staticmethod(fake_create)) + monkeypatch.setattr(telemetry, "record_sandbox_create", fake_record_sandbox_create) + monkeypatch.setattr("huggingface_hub.metadata_update", lambda *args, **kwargs: None) + + async def run(): + await asyncio.gather( + sandbox_tool._ensure_sandbox(FakeSession()), + sandbox_tool._ensure_sandbox(FakeSession()), + ) + + asyncio.run(run()) + + assert max_active_creates == 1 + + +def test_sandbox_operation_waits_for_cpu_preload(): + calls: list[tuple[str, dict]] = [] + + class FakeSandbox: + def call_tool(self, name, args): + calls.append((name, args)) + return SimpleNamespace(success=True, output="preloaded-ok", error="") + + async def run(): + session = SimpleNamespace( + sandbox=None, + sandbox_preload_error=None, + ) + + async def preload(): + await asyncio.sleep(0) + session.sandbox = FakeSandbox() + + session.sandbox_preload_task = asyncio.create_task(preload()) + handler = sandbox_tool._make_tool_handler("bash") + return await handler({"command": "echo ok"}, session=session) + + out, ok = asyncio.run(run()) + + assert ok is True + assert out == "preloaded-ok" + assert calls == [("bash", {"command": "echo ok"})] + + +def test_default_sandbox_create_waits_for_cpu_preload(): + class FakeSandbox: + space_id = "alice/sandbox-cpu" + url = "https://huggingface.co/spaces/alice/sandbox-cpu" + + async def run(): + session = SimpleNamespace( + sandbox=None, + sandbox_preload_error=None, + ) + + async def preload(): + await asyncio.sleep(0) + session.sandbox = FakeSandbox() + session.sandbox_hardware = "cpu-basic" + + session.sandbox_preload_task = asyncio.create_task(preload()) + return await sandbox_tool.sandbox_create_handler({}, session=session) + + out, ok = asyncio.run(run()) + + assert ok is True + assert "Sandbox already active: alice/sandbox-cpu" in out + assert "Hardware: cpu-basic" in out + + +def test_sandbox_create_replaces_auto_cpu_sandbox(monkeypatch): + deleted: list[str] = [] + + class FakeSession: + def __init__(self): + self.sandbox = SimpleNamespace( + space_id="alice/sandbox-cpu", + url="https://huggingface.co/spaces/alice/sandbox-cpu", + _owns_space=True, + delete=lambda: deleted.append("alice/sandbox-cpu"), + ) + self.sandbox_hardware = "cpu-basic" + self.sandbox_preload_task = None + self.sandbox_preload_cancel_event = None + + async def send_event(self, event): + pass + + gpu_sandbox = SimpleNamespace( + space_id="alice/sandbox-gpu", + url="https://huggingface.co/spaces/alice/sandbox-gpu", + _owns_space=True, + ) + + async def fake_ensure_sandbox(session, hardware="cpu-basic", **kwargs): + session.sandbox = gpu_sandbox + session.sandbox_hardware = hardware + return gpu_sandbox, None + + async def fake_record_sandbox_destroy(*args, **kwargs): + pass + + monkeypatch.setattr(sandbox_tool, "_ensure_sandbox", fake_ensure_sandbox) + monkeypatch.setattr(telemetry, "record_sandbox_destroy", fake_record_sandbox_destroy) + + session = FakeSession() + out, ok = asyncio.run( + sandbox_tool.sandbox_create_handler( + {"hardware": "a100-large"}, + session=session, + ) + ) + + assert ok is True + assert deleted == ["alice/sandbox-cpu"] + assert session.sandbox is gpu_sandbox + assert session.sandbox_hardware == "a100-large" + assert "Hardware: a100-large" in out + + +def test_teardown_cancels_preload_and_deletes_owned_sandbox(monkeypatch): + deleted: list[str] = [] + + async def fake_record_sandbox_destroy(*args, **kwargs): + pass + + monkeypatch.setattr(telemetry, "record_sandbox_destroy", fake_record_sandbox_destroy) + + async def run(): + cancel_event = threading.Event() + + async def preload(): + await asyncio.sleep(0) + + session = SimpleNamespace( + sandbox=SimpleNamespace( + space_id="alice/sandbox-12345678", + _owns_space=True, + delete=lambda: deleted.append("alice/sandbox-12345678"), + ), + sandbox_hardware="cpu-basic", + sandbox_preload_task=asyncio.create_task(preload()), + sandbox_preload_cancel_event=cancel_event, + ) + + await sandbox_tool.teardown_session_sandbox(session) + return session, cancel_event + + session, cancel_event = asyncio.run(run()) + + assert cancel_event.is_set() + assert deleted == ["alice/sandbox-12345678"] + assert session.sandbox is None + assert session.sandbox_hardware is None + + +def test_cancel_sandbox_preload_cancels_task_after_timeout(monkeypatch): + async def run(): + async def fake_wait_for(awaitable, timeout): + await asyncio.sleep(0) + raise asyncio.TimeoutError + + monkeypatch.setattr(sandbox_tool.asyncio, "wait_for", fake_wait_for) + + cancel_event = threading.Event() + blocker = asyncio.Event() + + async def preload(): + await blocker.wait() + + task = asyncio.create_task(preload()) + session = SimpleNamespace( + sandbox_preload_task=task, + sandbox_preload_cancel_event=cancel_event, + ) + + await sandbox_tool.cancel_sandbox_preload(session) + await asyncio.sleep(0) + + return task.cancelled(), cancel_event.is_set() + + task_cancelled, cancel_event_set = asyncio.run(run()) + + assert task_cancelled is True + assert cancel_event_set is True diff --git a/tests/unit/test_session_manager_persistence.py b/tests/unit/test_session_manager_persistence.py index 355f9387..a4451a1e 100644 --- a/tests/unit/test_session_manager_persistence.py +++ b/tests/unit/test_session_manager_persistence.py @@ -27,6 +27,23 @@ def __init__(self, *, hf_token: str | None = None, model: str = "test-model"): self.turn_count = 0 self.config = SimpleNamespace(model_name=model) self.notification_destinations = [] + self.auto_approval_enabled = False + self.auto_approval_cost_cap_usd = None + self.auto_approval_estimated_spend_usd = 0.0 + + def auto_approval_policy_summary(self): + cap = self.auto_approval_cost_cap_usd + remaining = None if cap is None else max(0, cap - self.auto_approval_estimated_spend_usd) + return { + "enabled": self.auto_approval_enabled, + "cost_cap_usd": cap, + "estimated_spend_usd": self.auto_approval_estimated_spend_usd, + "remaining_usd": remaining, + } + + def set_auto_approval_policy(self, *, enabled, cost_cap_usd): + self.auto_approval_enabled = enabled + self.auto_approval_cost_cap_usd = cost_cap_usd class RestoreStore(NoopSessionStore): @@ -85,6 +102,24 @@ def _runtime_agent_session( ) +@pytest.mark.asyncio +async def test_update_session_auto_approval_defaults_to_five_dollars(): + manager = _manager_with_store(NoopSessionStore()) + existing = _runtime_agent_session("s1", user_id="owner") + manager.sessions["s1"] = existing + + summary = await manager.update_session_auto_approval( + "s1", + enabled=True, + cost_cap_usd=None, + cap_provided=False, + ) + + assert summary["enabled"] is True + assert summary["cost_cap_usd"] == 5.0 + assert summary["remaining_usd"] == 5.0 + + def _install_fake_runtime(manager: SessionManager) -> asyncio.Event: stop = asyncio.Event() manager.run_calls = 0 # type: ignore[attr-defined] @@ -151,6 +186,12 @@ async def test_concurrent_lazy_restore_starts_only_one_agent_task(): store = RestoreStore(delay=0.01) manager = _manager_with_store(store) stop = _install_fake_runtime(manager) + scheduled: list[str] = [] + + def fake_start_cpu_sandbox_preload(agent_session: AgentSession) -> None: + scheduled.append(agent_session.session_id) + + manager._start_cpu_sandbox_preload = fake_start_cpu_sandbox_preload # type: ignore[method-assign] try: first, second = await asyncio.gather( @@ -162,12 +203,56 @@ async def test_concurrent_lazy_restore_starts_only_one_agent_task(): assert first is second assert list(manager.sessions) == ["persisted-session"] assert manager.run_calls == 1 # type: ignore[attr-defined] + assert scheduled == ["persisted-session"] assert not stop.is_set() finally: stop.set() await _cancel_runtime_tasks(manager) +@pytest.mark.asyncio +async def test_create_session_schedules_cpu_sandbox_preload(): + manager = _manager_with_store(NoopSessionStore()) + stop = _install_fake_runtime(manager) + scheduled: list[str] = [] + + def fake_start_cpu_sandbox_preload(agent_session: AgentSession) -> None: + scheduled.append(agent_session.session_id) + + manager._start_cpu_sandbox_preload = fake_start_cpu_sandbox_preload # type: ignore[method-assign] + + try: + session_id = await manager.create_session(user_id="owner", hf_token="token") + + assert scheduled == [session_id] + assert session_id in manager.sessions + finally: + stop.set() + await _cancel_runtime_tasks(manager) + + +@pytest.mark.asyncio +async def test_lazy_restore_schedules_cpu_sandbox_preload(): + manager = _manager_with_store(RestoreStore()) + stop = _install_fake_runtime(manager) + scheduled: list[str] = [] + + def fake_start_cpu_sandbox_preload(agent_session: AgentSession) -> None: + scheduled.append(agent_session.session_id) + + manager._start_cpu_sandbox_preload = fake_start_cpu_sandbox_preload # type: ignore[method-assign] + + try: + restored = await manager.ensure_session_loaded("persisted-session", user_id="owner") + + assert restored is not None + assert scheduled == ["persisted-session"] + assert "persisted-session" in manager.sessions + finally: + stop.set() + await _cancel_runtime_tasks(manager) + + @pytest.mark.asyncio async def test_lazy_restore_preserves_pending_approval_tool_calls(): store = RestoreStore( @@ -204,6 +289,34 @@ async def test_lazy_restore_preserves_pending_approval_tool_calls(): await _cancel_runtime_tasks(manager) +@pytest.mark.asyncio +async def test_lazy_restore_preserves_auto_approval_policy(): + store = RestoreStore( + metadata={ + "session_id": "yolo-session", + "user_id": "owner", + "model": "test-model", + "auto_approval_enabled": True, + "auto_approval_cost_cap_usd": 5.0, + "auto_approval_estimated_spend_usd": 1.25, + } + ) + manager = _manager_with_store(store) + stop = _install_fake_runtime(manager) + + try: + restored = await manager.ensure_session_loaded("yolo-session", user_id="owner") + + assert restored is not None + assert restored.session.auto_approval_enabled is True + assert restored.session.auto_approval_cost_cap_usd == 5.0 + assert restored.session.auto_approval_estimated_spend_usd == 1.25 + assert restored.session.auto_approval_policy_summary()["remaining_usd"] == 3.75 + finally: + stop.set() + await _cancel_runtime_tasks(manager) + + @pytest.mark.asyncio async def test_list_sessions_dev_uses_store_dev_visibility(): class ListStore(NoopSessionStore): @@ -221,6 +334,9 @@ async def list_sessions(self, user_id: str, **_: Any) -> list[dict[str, Any]]: "user_id": "alice", "model": "m", "created_at": datetime.now(UTC), + "auto_approval_enabled": True, + "auto_approval_cost_cap_usd": 5.0, + "auto_approval_estimated_spend_usd": 2.0, }, { "session_id": "s2", @@ -238,3 +354,10 @@ async def list_sessions(self, user_id: str, **_: Any) -> list[dict[str, Any]]: assert store.seen_user_id == "dev" assert {session["session_id"] for session in sessions} == {"s1", "s2"} + yolo = next(session for session in sessions if session["session_id"] == "s1") + assert yolo["auto_approval"] == { + "enabled": True, + "cost_cap_usd": 5.0, + "estimated_spend_usd": 2.0, + "remaining_usd": 3.0, + } diff --git a/tests/unit/test_session_uploader.py b/tests/unit/test_session_uploader.py new file mode 100644 index 00000000..dfbc27fb --- /dev/null +++ b/tests/unit/test_session_uploader.py @@ -0,0 +1,202 @@ +import json + +from agent.core.session_uploader import ( + _PERSONAL_TOKEN_ENV, + _resolve_token, + _update_upload_status, + _upload_dataset_card, + _write_claude_code_payload, + _write_row_payload, + dataset_card_readme, + to_claude_code_jsonl, +) + +HF_SECRET = "hf_" + "a" * 30 +ANTHROPIC_SECRET = "sk-ant-" + "b" * 24 +GITHUB_SECRET = "ghp_" + "c" * 36 + + +def test_dataset_card_readme_has_metadata_and_public_warning(): + readme = dataset_card_readme("lewtun/ml-intern-sessions") + + assert readme.startswith("---\n") + assert 'pretty_name: "ML Intern Session Traces"' in readme + assert "task_categories:\n- text-generation" in readme + assert "- agent-traces" in readme + assert "- coding-agent" in readme + assert "- ml-intern" in readme + assert 'path: "sessions/**/*.jsonl"' in readme + assert "ML Intern demo: https://smolagents-ml-intern.hf.space" in readme + assert "ML Intern CLI: https://github.com/huggingface/ml-intern" in readme + assert "Repository: https://huggingface.co/datasets/" not in readme + assert ( + "**WARNING: no comprehensive redaction or human review has been performed for this dataset.**" + in readme + ) + assert "automated best-effort scrubbing" in readme + assert "Do not make this dataset public" in readme + + +def test_upload_dataset_card_only_for_claude_code_format(): + class FakeApi: + def __init__(self): + self.calls = [] + + def upload_file(self, **kwargs): + self.calls.append(kwargs) + + api = FakeApi() + + _upload_dataset_card(api, "lewtun/ml-intern-sessions", "hf_token", "row") + assert api.calls == [] + + _upload_dataset_card(api, "lewtun/ml-intern-sessions", "hf_token", "claude_code") + assert len(api.calls) == 1 + assert api.calls[0]["path_in_repo"] == "README.md" + assert api.calls[0]["repo_id"] == "lewtun/ml-intern-sessions" + assert api.calls[0]["repo_type"] == "dataset" + assert api.calls[0]["token"] == "hf_token" + assert b"no comprehensive redaction or human review" in api.calls[0]["path_or_fileobj"] + + +def test_personal_token_env_takes_precedence_for_hf_token(monkeypatch): + monkeypatch.setenv(_PERSONAL_TOKEN_ENV, "personal-token") + monkeypatch.setenv("HF_TOKEN", "env-token") + + assert _resolve_token("HF_TOKEN") == "personal-token" + + +def test_update_upload_status_preserves_other_uploader_fields(tmp_path): + session_file = tmp_path / "session_123.json" + session_file.write_text( + json.dumps( + { + "session_id": "123", + "upload_status": "success", + "upload_url": "https://huggingface.co/datasets/org/sessions", + "personal_upload_status": "pending", + } + ) + ) + + _update_upload_status( + str(session_file), + "personal_upload_status", + "personal_upload_url", + "success", + "https://huggingface.co/datasets/user/ml-intern-sessions", + ) + + data = json.loads(session_file.read_text()) + assert data["upload_status"] == "success" + assert data["upload_url"] == "https://huggingface.co/datasets/org/sessions" + assert data["personal_upload_status"] == "success" + assert ( + data["personal_upload_url"] + == "https://huggingface.co/datasets/user/ml-intern-sessions" + ) + + +def test_claude_code_jsonl_uses_message_timestamps(): + events = to_claude_code_jsonl( + { + "session_id": "session-123", + "model_name": "anthropic/claude-opus-4-6", + "session_start_time": "2026-01-01T00:00:00", + "messages": [ + { + "role": "user", + "content": "hello", + "timestamp": "2026-01-01T00:00:01", + }, + { + "role": "assistant", + "content": "hi", + "timestamp": "2026-01-01T00:00:02", + }, + { + "role": "tool", + "tool_call_id": "call-1", + "content": "ok", + "timestamp": "2026-01-01T00:00:03", + }, + ], + } + ) + + assert [event["timestamp"] for event in events] == [ + "2026-01-01T00:00:01", + "2026-01-01T00:00:02", + "2026-01-01T00:00:03", + ] + + +def test_row_payload_scrubs_messages_events_and_tools(tmp_path): + tmp_file = tmp_path / "row.jsonl" + data = { + "session_id": "session-123", + "user_id": "lewtun", + "session_start_time": "2026-01-01T00:00:00", + "session_end_time": "2026-01-01T00:00:03", + "model_name": "anthropic/claude-opus-4-6", + "total_cost_usd": 0.01, + "messages": [{"role": "user", "content": f"token {HF_SECRET}"}], + "events": [{"type": "debug", "content": f"key {ANTHROPIC_SECRET}"}], + "tools": [{"name": "bash", "env": f"GITHUB_TOKEN={GITHUB_SECRET}"}], + } + + _write_row_payload(data, str(tmp_file)) + + payload = tmp_file.read_text() + assert HF_SECRET not in payload + assert ANTHROPIC_SECRET not in payload + assert GITHUB_SECRET not in payload + assert "[REDACTED_HF_TOKEN]" in payload + assert "[REDACTED_ANTHROPIC_KEY]" in payload + assert "GITHUB_TOKEN=[REDACTED]" in payload + + +def test_claude_code_payload_scrubs_messages_before_conversion(tmp_path): + tmp_file = tmp_path / "claude_code.jsonl" + data = { + "session_id": "session-123", + "model_name": "anthropic/claude-opus-4-6", + "session_start_time": "2026-01-01T00:00:00", + "messages": [ + { + "role": "user", + "content": f"token {HF_SECRET}", + "timestamp": "2026-01-01T00:00:01", + }, + { + "role": "assistant", + "content": "running tool", + "tool_calls": [ + { + "id": "call-1", + "function": { + "name": "bash", + "arguments": json.dumps({"key": ANTHROPIC_SECRET}), + }, + } + ], + "timestamp": "2026-01-01T00:00:02", + }, + { + "role": "tool", + "tool_call_id": "call-1", + "content": f"GITHUB_TOKEN={GITHUB_SECRET}", + "timestamp": "2026-01-01T00:00:03", + }, + ], + } + + _write_claude_code_payload(data, str(tmp_file)) + + payload = tmp_file.read_text() + assert HF_SECRET not in payload + assert ANTHROPIC_SECRET not in payload + assert GITHUB_SECRET not in payload + assert "[REDACTED_HF_TOKEN]" in payload + assert "[REDACTED_ANTHROPIC_KEY]" in payload + assert "GITHUB_TOKEN=[REDACTED]" in payload diff --git a/uv.lock b/uv.lock index 73df668c..e04ad0df 100644 --- a/uv.lock +++ b/uv.lock @@ -1,5 +1,5 @@ version = 1 -revision = 3 +revision = 2 requires-python = ">=3.11" resolution-markers = [ "python_full_version >= '3.12'", @@ -1828,7 +1828,7 @@ requires-dist = [ { name = "huggingface-hub", specifier = ">=1.12.0" }, { name = "inspect-ai", marker = "extra == 'eval'", specifier = ">=0.3.149" }, { name = "litellm", specifier = ">=1.83.0" }, - { name = "ml-intern", extras = ["eval", "dev"], marker = "extra == 'all'" }, + { name = "ml-intern", extras = ["eval", "dev", "nat", "ml-intern", "claude-sdk"], marker = "extra == 'all'" }, { name = "nbconvert", specifier = ">=7.16.6" }, { name = "nbformat", specifier = ">=5.10.4" }, { name = "pandas", marker = "extra == 'eval'", specifier = ">=2.3.3" }, @@ -1846,7 +1846,7 @@ requires-dist = [ { name = "websockets", specifier = ">=13.0" }, { name = "whoosh", specifier = ">=2.7.4" }, ] -provides-extras = ["eval", "dev", "all"] +provides-extras = ["eval", "dev", "nat", "ml-intern", "claude-sdk", "all"] [[package]] name = "mmh3"