Skip to content

P0 + P0.5 foundation: identity, library, ml_intern + nat adapters, contract + plan v6#1

Open
andreidhoang wants to merge 25 commits into
mainfrom
p0_5_library_restructure
Open

P0 + P0.5 foundation: identity, library, ml_intern + nat adapters, contract + plan v6#1
andreidhoang wants to merge 25 commits into
mainfrom
p0_5_library_restructure

Conversation

@andreidhoang
Copy link
Copy Markdown
Owner

Summary

Foundation for the cosmos-lab project — 9 NEW agents (6 Cosmos-specialty + 3 governance) on ml-intern primitives per PLAN_V2.md v6. This PR ships P0 (identity foundation) + all 4 days of P0.5 (library + ml_intern adapter + nat wrapper + contract) + the 13-revision plan evolution that arrived at v6 honest framing.

12 commits, 42 cosmos-lab tests passing, 237 upstream baseline preserved, zero-diff invariant holds throughout.

What ships

Code (~700 LOC)

  • cosmos_lab/ — Python library (importable via from cosmos_lab import ...)
    • identity/AgentIdentity, AuditLog, CapabilityScopedRouter, OptimizationConfig (P0)
    • harness/ml_intern.py — Family A execution-substrate adapter, install_into_session() wraps Session.tool_router with governance (P0.5 D2)
    • harness/nat.py — Family B deployment-surface adapter, register_as_nat_tool() registers cosmos_lab_principal as nat workflow tool (P0.5 D3)
    • harness/CONTRACT.md — formal contract documentation: 2 adapter families + 5 shared requirements + per-adapter specifics + future-adapter checklist (P0.5 D4)

Tests (42 total)

  • tests/optimization/test_identity_scoping.py — 16 tests for identity primitives (P0)
  • tests/optimization/harness/test_ml_intern_adapter.py — 6 smoke tests (P0.5 D2)
  • tests/optimization/harness/test_nat_adapter.py — 11 smoke tests (P0.5 D3)
  • tests/optimization/harness/test_adapter_contract.py — 9 parametrized contract tests across both adapters (P0.5 D4)

Verifiers (bin/)

  • verify.sh — phase router
  • verify_p0_5_d1.sh (14 checks), _d2.sh (11), _d3.sh (10), _d4.sh (14) — per-day verification

Planning + context harness

  • PLAN_V2.md (1380 lines) — 13-revision evolution: v3.1 → v3.2 → v4 → v5 → v5.1 → v5.2 → v6 (current). Full revision history in §0.5 deltas table rows 1-13.
  • AGENTIC_EVAL_SPEC.md — companion to EVAL_SPEC.md, agent-system eval architecture (5-tier ladder + 6 agentic surfaces + 13 axioms + 10 numerical commitments E1-E10)
  • CLAUDE.md, docs/ (5 files) — context harness for AI agents working on the project
  • agentic_build_workflow.md — DEFINE → PROBE → BUILD → REVIEW → SHIP → LEARN methodology

v6 thesis (final framing)

cosmos-lab ships 9 NEW agents + ~16 governance infrastructure components, built on ml-intern's tool primitives leveraged AS-IS:

Layer 1 — 6 Cosmos-specialty agents (P3-P9):

  1. DataAgent — cosmos-curate orchestration, real video curation
  2. EvalAgent — multi-judge + sentinels + physics-consistency
  3. TrainOrchestrator — Centaur HPO + NeMo-RL on real GPU
  4. OptimizeAgent — profiling + ≥1.5× speedup on 4 workloads
  5. MultimodalPipelineAgent — e2e Cosmos workflow on Predict 2.5
  6. CodeAgent — capability-scoped, real OSS bug fixes

Layer 2 — 3 governance agents (P7-P10):
7. GepaOptimizer — weekly DSPy GEPA prompt revisions
8. CapabilityProbe — adversarial scope testing
9. CrossAgentEvaluator — quarterly Pareto vs Devin/Claude Code/human

Layer 3 — ~16 governance infrastructure components (P1-P10):
sentinels, identity v2, audit log, OTel emitter, memory tiers, Inspect AI bridge, ComputeBackend, etc.

Layer 4 — ml-intern primitives leveraged AS-IS:
agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection.

Plan evolution honest postmortem

Version Framing Verdict
v3.1/v3.2/v4 6 specialty agents ✅ Correct on agent count
v5/v5.1 1 PrincipalAgent (re-implemented ml-intern) ❌ Over-correction #1
v5.2 0 new agents (governance only) ❌ Over-correction huggingface#2
v6 (final) 6 specialty + 3 governance, on ml-intern primitives ✅ Synthesis

User correctly caught both over-corrections. v6 is the honest answer that maps directly to JD's literal text ("agents (plural) help generate data, surface failures, evaluate outputs" + stand-out "agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration").

Verifier state (all green)

  • D1: 14/14 ✅
  • D2: 11/11 ✅
  • D3: 10/10 ✅
  • D4: 14/14 ✅
  • Upstream baseline: 237 pass / 3 known-broken ✅
  • Zero-diff invariant: holds throughout ✅

Schedule

~19 weeks total (between v5/v5.1's 22.5w and v5.2's 13w — honest middle ground).

  • ✅ P0 + P0.5 (~1.6 weeks) shipped in this PR
  • 🚧 ~17 weeks remaining for P1-P10 (eval infrastructure → 9 agents → production deploy)

Test plan

  • uv sync --extra dev
  • ./bin/verify.sh p0_5_d1 → 14/14
  • ./bin/verify.sh p0_5_d2 → 11/11
  • ./bin/verify.sh p0_5_d3 → 10/10
  • ./bin/verify.sh p0_5_d4 → 14/14
  • uv run python -m pytest tests/optimization/ -q → 42 passed
  • uv run python -m pytest tests/unit/ -q → 237 passed / 3 known-broken (no new regressions)
  • git diff upstream/main --name-only → only owned paths

Reading order for review

For ~30 min focused review:

  1. PLAN_V2.md §0.6 (v6 unique value) + §0.65 (9 agents) + §0.9 (thesis) — the load-bearing v6 framing
  2. cosmos_lab/harness/CONTRACT.md — adapter contract design decision
  3. cosmos_lab/harness/ml_intern.py (84 LOC) + cosmos_lab/harness/nat.py (132 LOC) — the two shipped adapters
  4. tests/optimization/harness/test_adapter_contract.py — parametrized contract tests
  5. AGENTIC_EVAL_SPEC.md §1 (why agentic eval differs) + §4 (6 surfaces) — eval architecture (used by all 9 agents)

For deeper review (~2 hours): read PLAN_V2.md §0.5 deltas table to follow the v3 → v6 evolution; read all 12 commit messages in order.

🤖 Generated with Claude Code

jagwar and others added 23 commits April 29, 2026 14:31
…ttribution (huggingface#179)

* feat(telemetry): track 5 untracked Bedrock call sites for full cost attribution

Cost Explorer ($78,738 over 6 days) vs the session dataset's
total_cost_usd (~$354/day attributed) showed the dataset captures only
~33% of real Bedrock spend. Root cause: out of 9 acompletion() call
sites, only 2 (in agent_loop.py) emit the llm_call event that
total_cost_usd sums.

This wires telemetry into the 5 Bedrock-billing call sites that were
flying blind, with a `kind` tag on each call so analytics can split
spend by category:

- research_tool.py × 3   → kind="research"     (sub-agent loop)
- context_manager.py     → kind="compaction"   (history summary)
- effort_probe.py        → kind="effort_probe" (cascade walk)

Plus a fourth tag for the session-restore summary path
(session_manager.py → kind="restore").

Plumbing changes:

- telemetry.record_llm_call now accepts kind="..." (default "main"
  preserves existing behavior).
- summarize_messages() and ContextManager.compact() take optional
  session=None so the caller can opt into telemetry.
- probe_effort() takes optional session=None for the same reason.
- Both probe_effort callers (agent_loop._heal_effort_error and
  model_switcher) now pass session.

Skipped:

- routes/agent.py /title — uses HF Router (Cerebras), not Bedrock
- routes/agent.py /health/llm — no session context (manual diagnostic
  endpoint, ~$0.02/call, not billable to a user)

After deploy, expect dataset total_cost_usd to converge with Cost
Explorer to within 5-10%. The kind breakdown will quantify each
category, validating the cost-plan estimates in
ml_intern_bedrock_cost_plan.md.

* fix(telemetry): address PR bot feedback (2 P1 + 1 P2)

1. P1 — Wrap each research_tool record_llm_call in its own try/except.
   record_llm_call's inner send_event is wrapped, but extract_usage
   (telemetry.py:101) is not — an unexpected usage shape from LiteLLM
   could propagate. At all 3 research sites the surrounding except-block
   would convert that into "Research summary call failed", masking a
   valid LLM response. Match the effort_probe pattern: dedicated
   try/except logging at DEBUG.

2. P1 — Hoist `import time` from inside summarize_messages() to module
   level in manager.py. stdlib, always available, matches the rest of
   the module.

3. P2 — Update telemetry.py docstring kind list. Drop title_gen and
   model_probe (skipped per PR description), add restore (emitted from
   session_manager.py). Note the intentional skips at the bottom.
huggingface#183)

* Add agent dev server notes

* Make frontend model configurable

* Support env-selected frontend models

* Use Claude-specific model env var

* Add GPT-5.5 to web model picker

* Gate GPT-5.5 as a premium model

* Avoid duplicate session model fetch

* Remove legacy Claude quota aliases

* Document GitHub CLI PR body workflow

* Gate only deployed paid model IDs

* Nits
* Make sandbox Spaces private

Co-authored-by: Codex <codex@openai.com>

* Remove legacy sandbox auth fallback

Co-authored-by: Codex <codex@openai.com>

* Address sandbox privacy review comments

Co-authored-by: Codex <codex@openai.com>

---------

Co-authored-by: Codex <codex@openai.com>
* Add DeepSeek V4 Pro model option

Co-authored-by: Codex <codex@openai.com>

* Remove DeepSeek feature tests

Co-authored-by: Codex <codex@openai.com>

---------

Co-authored-by: Codex <codex@openai.com>
* feat: add share_traces toggle and per-user trace repo template

* feat: support Claude Code JSONL format and per-target auth

* feat: dual-upload sessions to private user trace dataset

* chore: retry personal trace uploads on booting

* feat: add /share-traces command to flip dataset visibility

* docs: document HF trace auto-share and /share-traces

* Use HF token owner for local dev auth

Co-authored-by: Codex <codex@openai.com>

* Rename personal session trace dataset

Co-authored-by: Codex <codex@openai.com>

* Add session dataset card metadata

Co-authored-by: Codex <codex@openai.com>

* Fix session trace upload review issues

Co-authored-by: OpenAI Codex <codex@openai.com>

* Preserve secret scrubbing before trace uploads

Co-authored-by: OpenAI Codex <codex@openai.com>

* Link ML Intern demo in dataset card

Co-authored-by: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Codex <codex@openai.com>
* Use HF username for personal trace uploads

Co-authored-by: OpenAI Codex <codex@openai.com>

* Remove redundant HF token branch

Co-authored-by: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: OpenAI Codex <codex@openai.com>
…ace#204)

* chore: update the agent system prompt

* chore: update the tool documentation
* Add session YOLO auto-approval budget

Co-authored-by: Codex <codex@openai.com>

* Address YOLO approval review feedback

Co-authored-by: Codex <codex@openai.com>

---------

Co-authored-by: Codex <codex@openai.com>
* Auto-start CPU sandboxes for sessions

Co-authored-by: Codex <codex@openai.com>

* Retry sandbox runtime visibility checks

Co-authored-by: Codex <codex@openai.com>

* Stabilize auto CPU sandbox creation

Co-authored-by: OpenAI Codex <codex@openai.com>

* Address sandbox PR review comments

Co-authored-by: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: Codex <codex@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
* Fallback to free model for gated defaults

Co-authored-by: OpenAI Codex <codex@openai.com>

* Cover explicit gated model access for HF users

Co-authored-by: OpenAI Codex <codex@openai.com>

* Seed model picker from created session

Co-authored-by: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: OpenAI Codex <codex@openai.com>
Ships the cosmos-lab Phase 0 deliverables — extends upstream Config with
OptimizationConfig, plus an unsigned in-process AgentIdentity, append-only
JSONL AuditLog, and a CapabilityScopedRouter that filters tool specs and
audits before/after/exception around every tool call.

Composition over inheritance: CapabilityScopedRouter wraps any duck-typed
router, so unit tests use a FakeRouter and skip the heavy ToolRouter init
(which requires HF auth + sandbox bootstrapping).

AuthZ + audit only — no AuthN. Identity is unsigned; signing tokens, OAuth
2.1 + RFC 8707/8693 sub-agent scope-down, and hash-chained signed audit
log all land in P4b. Documented as such in module docstrings.

16 tests in tests/optimization/test_identity_scoping.py cover config
inheritance, round-trip, identity scoping/parent chain, audit JSONL
round-trip + parent dirs + parseable lines, router denial/allow paths,
exception path, spec filtering, root sees-all, and canonical args_hash.

Acceptance: pytest tests/optimization/ exits 0; pytest tests/unit/ remains
237 pass / 3 known-broken (no new regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establishes cosmos_lab/ as the v1 importable surface for the cosmos-lab
library. Code physically lives at agent/optimization/* per the zero-diff
fork strategy; cosmos_lab/* re-exports so library consumers can write
`from cosmos_lab.identity import AgentIdentity` without depending on
upstream ml-intern import paths.

Why library form (PLAN_V2.md §0.4): cosmos-lab's value (sentinels,
identity, GEPA governance, quality budget) operates on framework-agnostic
interfaces. Owning an agent loop is anti-pattern in 2026. Library form
plugs into nvidia-nat (primary harness, P0.5 D3), ml-intern (compat
adapter, P0.5 D2), Claude SDK (v1.1).

Adds three placeholder optional-dependencies extras (nat, ml_intern,
claude_sdk) — all empty for now; nvidia-nat>=1.6 dep lands in D3 with the
nat adapter.

Backward compat preserved: `from agent.optimization.identity import ...`
still works. All 16 Phase 0 tests pass against new path AND old path.

Verifier: ./bin/verify.sh p0_5_d1 → 14/14 pass (structure, dual-path
imports, packaging, tests, zero-diff invariant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a minimal, token-efficient context harness so Claude Code (and other
agentic coding assistants) can pick up cosmos-lab work cold without
loading 7K+ lines of stale planning context every turn.

Design (per agentic_build_workflow.md DEFINE→PROBE→BUILD→REVIEW→SHIP→LEARN):

  CLAUDE.md (94L, ~970 tokens, ALWAYS LOADED)
    = invariants + owned paths + 6 anti-patterns + pointer index only

  docs/00_workflow.md → ../agentic_build_workflow.md (workflow methodology)
  docs/01_north_star.md     (vision in 1 screen, on-demand)
  docs/02_current_phase.md  (LIVE — what we're building today, rotates)
  docs/03_pointers.md       (phase → PLAN_V2 anchor map)
  docs/04_jd.md             (NVIDIA Cosmos JD reference)

  bin/verify.sh             (router: ./bin/verify.sh <phase>)
  bin/verify_p0_5_d1.sh     (14 concrete checks for current phase)

Why: previous CLAUDE.md (135L) had stale "Phase 0 pending" status, JD
paste burning ~1.7K tokens every turn, no pointer index forcing agents
to read the whole 837-line PLAN_V2.md to answer "what should I do." New
harness reduces always-loaded context ~43% and enables on-demand deep
reads via the pointer index.

Verifier scripts return concrete pass/fail (per workflow rule "verifier
is a script, not a description"). Anti-pattern huggingface#3 ("trusting 'I have
verified this' from an agent") is enforced by re-running ./bin/verify.sh
yourself.

AGENTS.local.md is a symlink to CLAUDE.md so multiple agentic CLIs that
look for AGENTS.md or CLAUDE.md both find the same source of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PLAN_V2.md (v4) — 24-week production-grade roadmap for cosmos-lab. Six
reference agents (Data, Eval, Train, Optimize, Video, Code) on a shared
governance library (sentinels, MCP-OAuth identity with sub-agent scope-
down via RFC 8693, GEPA promotion contracts, quality-budget invariants).
Library architecture from P0.5 onward; nvidia-nat as primary harness.

Five production gates (PLAN_V2 §0.8):
  G1 real GPU runs (Invariant 9, ~$200-400 budget across P5/P5.5/P6/P9a)
  G2 PyTorch depth artifact (P5.5, ≥10% wall-clock improvement)
  G3 production deployment (P10, ≥100 real user sessions)
  G4 real multimodal data (P3, 10-100 hours real video)
  G5 upstream OSS PR (P10, nvidia-nat or Inspect AI)

24 numerical targets (PLAN_V2 §0.7) — phase exit conditions, not aspirations.

Companion docs preserved as deep references (read on demand only):

  PLAN.md                       — original 16-week plan (superseded by v4)
  SYSTEM.md                     — full architecture deep-dive (Vietnamese, 1167L)
  EVAL_SPEC.md                  — measured-peak vs vendor-peak eval methodology
  WORKFLOW.md                   — git/PR workflow conventions
  RESEARCH_AHE_ANALYSIS.md      — AHE (Agentic Harness Engineering) research

Plan went through 4 revisions before commit:
  v3   — 2026-frontier verification pass (8 cited deltas)
  v3.1 — Jensen-grade polish (§0.6 unique value, §0.7 numerical targets,
         §3.1 sentinel taxonomy, noise cuts)
  v3.2 — library architecture pivot (P0.5 NEW, ~20 weeks)
  v4   — production-grade pivot (P5.5 NEW, P10 expanded, Invariant 9,
         §0.65 Six Reference Agents, §0.8 Production Commitments,
         ~22.5 weeks within original 24-week budget)

Citations independently verified via WebFetch (METR reward-hacking number
corrected, NeMo Agent Toolkit package name verified as nvidia-nat v1.6.0
released 2026-04-10, NemoClaw alpha-stage 2026-03-16 disclosed, EU AI Act
Art. 12 enforcement date 2026-08-02 with Digital Omnibus uncertainty
documented).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifier flagged PLAN.md, SYSTEM.md, EVAL_SPEC.md, WORKFLOW.md,
RESEARCH_AHE_ANALYSIS.md as unexpected upstream diffs after they were
committed (previously untracked, didn't show in diff).

Updated bin/verify_p0_5_d1.sh exclusion list to match actual owned
planning docs. Also tightened existing patterns from "^FOO.md" to
"^FOO.md\$" so a future "FOO.md.bak" doesn't accidentally pass.

Pattern for future verifiers: when a new owned file is introduced, audit
the verifier exclusion list at the same time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v4 framing — "governance library wrapping 6 thin orchestrator agents" —
under-delivered on JD's literal asks: "strong agency in LLM-based systems,"
"code agents doing real work," "AI helps build them." A NVIDIA Cosmos
reviewer comparing cosmos-lab against 2026 production autonomous agents
(Devin / Operator / Cursor Composer / Claude Code) would see v4 as
conservative governance theater — clever judgment, weak capability.

v5 inverts the hierarchy. The product is now ONE autonomous PrincipalAgent
doing real long-horizon ML lifecycle work, with governance reframed as
*enabler of autonomy* (not constraint). Sentinels become tripwires for
replanning. Identity capabilities expand with earned track record (RFC
8693 token exchange after K sentinel-clean runs). GEPA becomes agent
self-improvement with retroactive human review.

The 6 v4 "agents" collapse into 6 capability domains the same agent
demonstrates over P3-P9 — like one principal engineer who does data work
Monday, training Tuesday, optimization Wednesday. Not six different
people. PrincipalAgent runs INSIDE ml-intern's agent_loop.py (1626 lines
of debugged production code) — substrate stays zero-diff, capabilities
ride on top.

Substantive changes:

  PLAN_V2.md
    + §0.9 Autonomous Principal Agent thesis (NEW, ~80 lines) — the
      load-bearing reframe; product = PrincipalAgent + harness + governance
    + §3.2 PrincipalAgent architecture (NEW, ~140 lines) — substrate
      choice (ml-intern agent_loop), autonomous loop diagram (PLAN→
      EXECUTE→VERIFY→REPLAN), 3-tier memory (working/episodic/semantic),
      replanning logic (sentinel = information not failure), capability
      expansion mechanism (earned trust), concrete demo, owned path
      cosmos_lab/principal/ tree
    ~ §0.5 row 11 NEW — v5 thesis pivot rationale, cited
    ~ §0.6 reframed — "what only cosmos-lab does" now compared vs Devin/
      Operator/Cursor Composer (2026 autonomous agents), not vs assembled
      OSS. 5 differentiators all autonomy-focused.
    ~ §0.65 reframed — "six reference agents" → "six capability domains
      of the PrincipalAgent" (one agent, six skills)
    ~ §1 phase table reframed — phases are now "PrincipalAgent capability
      progression milestones" not "ship N agents"
    ~ Header v4 → v5 with thesis pivot explanation

  docs/01_north_star.md — full rewrite to PrincipalAgent framing

  CLAUDE.md — identity sentence updated to v5 framing (one PrincipalAgent
    with 6 capability domains, ml-intern agent_loop substrate, governance
    as enabler)

What v5 KEEPS from v4:
  - Library architecture (P0.5) — but library = cosmos_lab.principal
  - Identity v2 (P4b) — but for capability EXPANSION (earned trust)
  - Sentinels (P1, §3.1) — but as tripwires for replanning
  - GEPA (P8) — but for agent self-improvement
  - Production deployment (P10) — agent serves real users
  - All 9 invariants
  - 24 numerical targets
  - 22.5 weeks (depth shifts breadth→capability, not added time)

What v5 honestly does NOT pretend to do (anti-hype):
  - Not invent novel architectures (composes known patterns intelligently)
  - Not zero human oversight (sentinel trips visible; weekly review)
  - Not arbitrary research questions (scoped to ML lifecycle, Cosmos first)
  - Not online self-improvement (offline GEPA between sessions)
  - Capability expansion is policy-bounded, not unbounded

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the eval gap in v5 plan. Without rigorous agent-system eval,
every "exceptional autonomous PrincipalAgent" claim in PLAN_V2 is
unverifiable, GEPA self-improvement (P8) has no rigorous gate to ratchet
against, and reward-hacking detection (per UC Berkeley + METR 2026
findings) has no operational discipline.

NEW FILE — AGENTIC_EVAL_SPEC.md (528 lines):
  Companion to EVAL_SPEC.md. Where EVAL_SPEC covers ML-output eval
  (perplexity, KL divergence, latency p99 — model under test), this doc
  covers agent-system eval per axiom A8 ("the agent is itself an
  artifact-under-eval").

  Structure:
    §0 scope + reading order
    §1 why agentic eval differs (trajectory ≠ output; A8; long-horizon)
    §2 axioms — A1-A10 transfer from EVAL_SPEC; A11-A13 are agentic-specific
        A11 trajectory carries information beyond outcome
        A12 capability expansion requires adversarial testing
        A13 long-horizon eval is non-fungible with short-horizon eval
    §3 5-tier architecture (T0-T4 specialized for agentic)
    §4 6 agentic-specific surfaces (S1-S6) — NEW vs EVAL_SPEC:
        S1 trajectory eval (tool-call efficiency, replan ratio, doom-loop)
        S2 plan-quality eval (LLM-judge on PLAN-phase output)
        S3 replan-quality eval (sentinel trips → response quality)
        S4 capability boundary eval (50-task adversarial probe suite)
        S5 reward-hacking adversarial eval (monthly red-team sprint)
        S6 cross-agent comparison eval (vs Devin, Claude Code, human)
    §5 cross-cutting meta layers (M1-M3 transfer from EVAL_SPEC)
    §6 statistical framework (transfers + paired tests for agent compare)
    §7 three input types (I1-I3 per JD bullet 5)
    §8 operational cadence + gates + verifier scripts
    §9 numerical targets — 10 eval-system commitments (E1-E10)
    §10 implementation map to v5 phases (no new phase needed, ~3-4 days
        spread across P1, P4a, P4b, P9b, P10)
    §11 references (METR, UC Berkeley, τ-bench, BFCL, GAIA-2, etc.)

PLAN_V2.md additions:
  + §3.3 Agentic eval architecture (in-plan summary + pointer to spec)
  + §0.7 v5 eval-system additions subtable (10 numerical targets E1-E10)
        E1 sentinel suite FPR ≤ 5%
        E2 sentinel suite FNR ≤ 1%
        E3 T1 test-retest r ≥ 0.95
        E4 plan-quality LLM-judge ↔ human ≥ 80%
        E5 replan success rate ≥ 70%
        E6 capability boundary 100% (0 unauthorized in 100 runs)
        E7 reward-hack discovery rate trending downward
        E8 PrincipalAgent on Pareto frontier vs comparison agents
        E9 eval cost ≤ 15% of total project spend
        E10 reproducibility envelope coverage 100%
        Total commitments: 24 (original §0.7) + 10 (E1-E10) = 34
  + §1.5 reuse map — companion specification documents subsection
        explaining EVAL_SPEC vs AGENTIC_EVAL_SPEC scope

CLAUDE.md pointer index updated — distinguishes ML-output eval doc
  from agent-system eval doc.

docs/03_pointers.md updated — adds §3.2 + §3.3 anchors and a new
  "Companion specification docs" subtable.

bin/verify_p0_5_d1.sh — adds AGENTIC_EVAL_SPEC.md to owned-paths
  exclusion list (planning doc, owned).

Verifier: ./bin/verify.sh p0_5_d1 → 14/14 pass (still green).
Plan size: 1173 lines (was 1075 — added ~100 lines for §3.3 + E1-E10).

Why this matters now: per JD bullet 5 ("design and scale evaluation
platforms that combine automated metrics, human feedback, and
agent-driven analysis") and 2026 reward-hacking crisis (METR: o3 hacks
30%+ RE-Bench; UC Berkeley: 8/8 top benchmarks hackable to 73-100%).
A v5 PrincipalAgent without rigorous agentic eval IS the failure mode
those papers describe — confident, capable, unverifiable, reward-
hacking. The eval architecture is the difference between exceptional
agent we can defend with numbers vs autonomous agent that "looks good
in the demo."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the v1 compat adapter that installs cosmos-lab governance into an
existing ml-intern Session. Per PLAN_V2.md §0.4 library architecture and
docs/02_current_phase.md D2 spec.

NEW FILES:
  cosmos_lab/harness/__init__.py             — public API: install_into_session
  cosmos_lab/harness/ml_intern.py     (84 L) — the adapter
  tests/optimization/harness/__init__.py
  tests/optimization/harness/test_ml_intern_adapter.py (175 L) — 6 smoke tests
  bin/verify_p0_5_d2.sh                      — D2 verifier (11 checks)

ADAPTER CONTRACT:
  install_into_session(session, identity, audit_log) -> None

  Wraps session.tool_router with CapabilityScopedRouter so every tool
  invocation through Session is governed by cosmos-lab identity + audit.
  Composition only — no upstream files modified (Invariant 1).

  Idempotency: re-installing on the same Session raises RuntimeError to
  prevent shadowing audit history.

SMOKE TEST DESIGN:
  Uses duck-typed MockSession (just .tool_router) instead of constructing
  a real ml-intern Session (which requires Config + ContextManager +
  event_queue + sandbox). The adapter's contract is "I wrap .tool_router"
  — smoke test verifies that one thing.

  6 tests: 3 contract (wraps, raises on no-router, refuses re-install) +
  3 e2e behavior (denial works via wrapped session, audit recorded,
  authorized passes through).

VERIFIER RESULT:
  ./bin/verify.sh p0_5_d2 → 11/11 pass
  ./bin/verify.sh p0_5_d1 → 14/14 pass (no regression)
  Upstream baseline: 237 pass / 3 known-broken (no regression)
  Zero-diff: only owned paths modified

LEARN (3 surprises, captured for future phases):

  1. Editable install staleness — adding cosmos_lab/harness/ after the
     prior `uv sync` left package metadata stale. Pattern: any new
     package directory needs `uv sync` before tests pass.

  2. `uv run pytest` is ambiguous — PATH leak resolved to miniconda's
     pytest (with stale editable install) instead of venv. Symptom:
     `python -c "import cosmos_lab.harness"` succeeded everywhere but
     pytest collection failed with ModuleNotFoundError. Fix: use
     `uv run python -m pytest` for deterministic venv resolution.
     All verifier scripts updated.

  3. Smoke test design — testing the adapter contract (one method) does
     not require constructing the host platform (full Session). Duck-
     typed MockSession is the right scope. Pattern for future adapter
     smoke tests.

Branch: p0_5_library_restructure (continues from D1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ARCHITECTURAL DECISION (v5.1, PLAN_V2 §0.4.5):

After auditing agent_loop.py:1771, discovered ml-intern's submission_loop
is queue-based (asyncio.Queue for submissions in + events out), not
function-based. Embedding it as runtime substrate inside another
orchestrator requires a 1-2 week async-bridge engineering effort that
v5 implicitly assumed but never budgeted.

v5.1 commits to TWO-LAYER architecture instead of three:

  Layer 1: cosmos-lab CLI (PRIMARY entry point)
           - PrincipalAgent + governance + sentinels + memory + sub-agents
           - long-horizon orchestration ABOVE the Session level

  Layer 2: ml-intern Session (execution SUBSTRATE, per task)
           - debugged ReAct loop (1626L) + 16 tools + MCP + sandbox
           - wrapped by D2 adapter (CapabilityScopedRouter)

  Deployment wrappers (P10): nat workflow YAML, Modal/HF Spaces endpoint
                              — INVOKE cosmos-lab CLI, don't host it inside

Why we explicitly REJECTED 3-layer runtime:
  1. submission_loop queue-based design = real async-bridge work
  2. nat-at-runtime solves the wrong problem; nat-at-deployment is enough
     for Cosmos pitch (`nat run cosmos-lab.yaml` = "we run in your stack")
  3. Complexity budget — 1-2 weeks better spent on sentinels / memory /
     capability domains / real GPU runs

What v5.1 PRESERVES:
  - PrincipalAgent thesis (§0.9), library architecture (§0.4),
    AGENTIC_EVAL_SPEC, sentinel taxonomy (§3.1), 9 invariants,
    34 numerical targets, 6 capability domains, 22.5w schedule,
    every commit already shipped (P0, P0.5 D1, P0.5 D2, AGENTIC_EVAL_SPEC,
    v5 thesis).

What v5.1 DROPS:
  - "nat-runnable from P1 onward" (over-promise)
  - 3-layer runtime architecture
  - ml-intern submission_loop bridge work (~1.5-2w saved → risk buffer)

P0.5 D3 IMPLEMENTATION (matches v5.1 reframed scope):

  cosmos_lab/harness/nat.py (132 LOC, ~78 non-comment, ≤200 budget)
    - register_as_nat_tool(builder) — lightweight registration shim
    - Registers ONE tool `cosmos_lab_principal` in nat Builder
    - Tool body: stub returning structured dict (real CLI invocation
      lands in P3 PrincipalAgent v0)
    - BuilderLike Protocol for duck-typed compatibility
    - Idempotency check via _is_already_registered()
    - Multi-method fallback (add_function / register_function / etc)

  tests/optimization/harness/test_nat_adapter.py (135 LOC)
    - 11 smoke tests: 5 contract (registration mechanics) + 6 tool callable
    - Uses MockBuilder pattern (mirrors D2 MockSession per LEARN huggingface#3)
    - All pass against duck-typed builders

  bin/verify_p0_5_d3.sh — 10 checks
    Verifier result: 10/10 pass
    D1 + D2 still green (no regression): 14/14 + 11/11
    Upstream baseline: 237 / 3 known-broken
    Total cosmos-lab tests: 22 + 11 = 33

DOCS UPDATED:
  PLAN_V2.md §0.4.5 NEW (~120 lines)  — two-layer architecture decision
  CLAUDE.md dev commands               — `uv sync --extra dev` + `uv run python -m pytest`
  docs/02_current_phase.md             — D3 archived, D4 spec written

LEARN from D3:
  1. Always read substrate code before architecting on it (would have
     caught queue-based submission_loop earlier; v5 thesis pivot would
     have proposed 2-layer from start, not 3-layer)
  2. `uv sync` without `--extra dev` removes pytest from venv (caught
     during D3 verifier; CLAUDE.md updated)
  3. Two layers > three when one is sufficient (workflow anti-pattern huggingface#4
     generalized: don't build a layer that doesn't earn its complexity)

Branch: p0_5_library_restructure (continuing from D2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ROOT CAUSE — v5/v5.1 over-correction:

v5 thesis claimed "ONE exceptional autonomous PrincipalAgent" and planned
to build planner + executor + memory + sub-agent spawning under
cosmos_lab/principal/. v5.1 reframed as "2-layer (cosmos-lab CLI on
ml-intern Session)" with same re-implementation work.

User audit caught the contradiction: I claimed leverage but described
re-building. Audit of ml-intern itself confirmed:

  agent/prompts/system_prompt_v3.yaml VERBATIM:
    "You are ML Intern, an ML engineering assistant... fully autonomous
    — research, validate, implement, and deliver results without asking
    for unnecessary confirmation"

  agent/tools/plan_tool.py             — built-in planning (todo list)
  agent/tools/research_tool.py         — sub-agent spawning ("spawns a
                                         cheap LLM call with focused
                                         research task and returns
                                         summary")
  agent/core/agent_loop.submission_loop — autonomous execution loop
  agent/core/doom_loop.py              — failure-loop detection
  agent/core/cost_estimation.py        — per-call cost tracking
  agent/tools/{jobs,papers,research,sandbox,...} — 20+ ML tools

ml-intern IS already what v5/v5.1 PrincipalAgent claimed to be.
v5/v5.1 was workflow anti-pattern huggingface#4 generalized: "building a 5000-LOC
re-implementation that should have been a governance wrapper."

v5.2 PIVOT — governance layer (back to v4's correct direction with v5
production rigor):

cosmos-lab is the production governance layer that makes ml-intern
(or any autonomous ML agent) safe to deploy at NVIDIA Cosmos scale.
ml-intern provides autonomy; cosmos-lab provides production discipline.

The 10 governance components ml-intern doesn't have, that cosmos-lab adds:
   1. Sentinel-gated quality (4 sentinel types paired with judge)
   2. Cross-session memory (3-tier hierarchical, persistent)
   3. RFC 8693 capability expansion (earned-trust scope growth)
   4. Hash-chained signed audit (EU AI Act Art. 12 compliant)
   5. OTel-GenAI native observability (gen_ai.* semconv, portable)
   6. GEPA self-improvement (offline DSPy 3.x, retroactive review)
   7. MultiJudge with bootstrap CIs (no debate dynamics)
   8. Inspect AI integration (UK AISI standard adoption)
   9. PR-gating + canary deployment (sequential testing)
  10. AGENTIC_EVAL_SPEC discipline (T0-T4 + S1-S6 + E1-E10)

WHAT v5.2 PRESERVES:
- All shipped code (P0 identity + P0.5 D1/D2/D3 adapters)
  — these ARE the governance foundation
- AGENTIC_EVAL_SPEC.md — eval architecture is THE product spec now
- All 9 invariants
- 34 numerical targets (24 §0.7 + 10 E1-E10)
- Library architecture (cosmos_lab/ pip-installable)
- Two-layer deployment (cosmos-lab CLI + ml-intern; nat as P9 wrapper)

WHAT v5.2 DROPS:
- "ONE PrincipalAgent" framing (ml-intern is the agent)
- Re-implementation of planner / executor / memory tier internals /
  sub-agent spawning (~5000 LOC of unnecessary code)
- 6 capability domains framing (replaced with 6 governance enhancements
  applied to ml-intern's existing capabilities)
- 22.5w schedule (compressed to ~13w by dropping re-implementation work)

NET EFFECTS:
- Plan compressed 22.5w → ~13w (~9 weeks banked for v1.1 polish or buffer)
- Stronger Cosmos pitch ("we make autonomous agents production-safe" =
  2026 frontier gap nobody fills end-to-end)
- Honest about leverage (Cosmos reviewer running git ls-files won't see
  PrincipalAgent re-implementation)
- All shipped commits stay valid and become more important
  (D2 ml_intern adapter is THE primary product surface)
- AGENTIC_EVAL_SPEC's E1-E10 = literal product spec, not side document

DOCS UPDATED:
  PLAN_V2.md
    - Header (v5.2 governance layer thesis)
    - §0.5 row 12 NEW (v5.2 honest leverage pivot rationale, cited)
    - §0.6 reframed (10 governance items vs ml-intern bare)
    - §0.65 reframed (6 governance enhancements, not 6 PrincipalAgent
      capabilities; with concrete demonstration block)
    - §0.9 simplified (ml-intern is the agent; cosmos-lab is governance)
    - §1 phase table compressed (~13w, governance enhancements + ml-intern
      demonstrations)

  CLAUDE.md identity sentence — v5.2 framing
  docs/01_north_star.md — full rewrite to v5.2 governance-layer
  docs/02_current_phase.md — v5.2 schedule note (P1+ phases)

VERIFIER STATE:
  D1: 14/14 ✅ (no regression)
  D2: 11/11 ✅ (no regression — D2 adapter is now primary product surface)
  D3: 10/10 ✅ (no regression)
  Upstream baseline: 237 / 3 known-broken
  Total cosmos-lab tests: 33

Branch: p0_5_library_restructure (10th commit).
Plan size: 1294 lines (added ~140 for v5.2 reframe content).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tern primitives

USER CAUGHT THE ROOT ISSUE:

v5/v5.1 collapsed v4's 6 specialty agents into 1 PrincipalAgent
(over-correction #1). v5.2 then removed the agents entirely, claiming
"ml-intern is the agent" (over-correction huggingface#2). Both wrong.

JD re-read CAREFULLY confirms multiple specialty agents needed:

  Role mission: "agentic SYSTEMS that reason about, build, evaluate,
                 and improve AI systems themselves" (plural systems)

  What you'll do bullet 3: "self-improving loops where agents help
                            generate data, surface failures, evaluate
                            outputs" (multiple agents, different jobs)

  Stand-out bullet 1: "agent-based systems doing real work — coding,
                       eval, data gen, triage, experimentation,
                       orchestration" (6 work types = 6 agent types)

ml-intern's tool primitives are HF-generic. Cosmos team needs Cosmos-
specialized agents (cosmos-curate orchestration, NeMo-RL training, NIM
inference, multimodal physics, real video pipelines). ml-intern's
primitives are SUBSTRATE we use, not the agents themselves.

v6 SYNTHESIS — best of all prior versions:
  - v3.x/v4: correct on agent count (6 specialty agents)
  - v5/v5.1: correct on production rigor (real GPU, sentinels, AGENTIC_EVAL_SPEC)
  - v5.2: correct on leverage discipline (use ml-intern primitives, don't reimplement)
  - v6: 6 specialty + 3 governance + ~16 infrastructure + ml-intern primitives

THE 9 NEW AGENTS (the product):

Layer 1 — 6 Cosmos-specialty (real ML lifecycle work):
  1. DataAgent (P3) — cosmos-curate orchestration, real video curation
  2. EvalAgent (P4a) — multi-judge + sentinels + physics-consistency
  3. TrainOrchestrator (P5) — Centaur HPO + NeMo-RL on real GPU
  4. OptimizeAgent (P6) — profiling + ≥1.5× speedup on 4 workloads
  5. MultimodalPipelineAgent (P9) — e2e Cosmos workflow on Predict 2.5
  6. CodeAgent (P9) — capability-scoped, real OSS bug fixes

Layer 2 — 3 governance (meta-layer):
  7. GepaOptimizer (P8) — weekly DSPy GEPA prompt revisions
  8. CapabilityProbe (P7) — adversarial scope testing
  9. CrossAgentEvaluator (P10) — quarterly Pareto vs Devin/Claude Code/human

PLUS ~16 governance infrastructure components (sentinels, identity v2,
audit log, OTel, memory tiers, Inspect AI bridge, ComputeBackend, etc.)
PLUS ml-intern primitives leveraged AS-IS (agent_loop, 16 generic tools,
sandbox, MCP, cost estimation, doom-loop detection).

SCHEDULE: ~19 weeks (between v5/v5.1's 22.5w and v5.2's 13w).
  Tighter than v5/v5.1 because we leverage ml-intern primitives.
  Bigger than v5.2 because we restore the 9 agents JD asks for.
  Honest middle ground.

DOCS UPDATED:
  PLAN_V2.md
    - Header v5.2 → v6 (with honest postmortem of v5/v5.1/v5.2 over-corrections)
    - §0.5 row 13 NEW (v6 restore agents pivot rationale)
    - §0.6 reframed (9 agents + ~16 infra components vs assembled OSS)
    - §0.65 reframed (Layer 1 + Layer 2 + Layer 3 + Layer 4 honest count)
    - §0.9 reframed (cosmos-lab builds agents on ml-intern primitives)
    - §1 phase table — 19w with 9-agent reality

  CLAUDE.md identity sentence — v6 framing
  docs/01_north_star.md — full rewrite to v6
  docs/02_current_phase.md — v6 schedule note

WHAT v6 PRESERVES:
  - All shipped code (P0, P0.5 D1/D2/D3, AGENTIC_EVAL_SPEC) — these
    are the substrate agents will use
  - All 9 invariants
  - 34 numerical targets (24 §0.7 + 10 E1-E10)
  - Library architecture (cosmos_lab/ pip-installable)
  - Two-layer deployment (cosmos-lab CLI + ml-intern; nat as P10 wrapper)
  - AGENTIC_EVAL_SPEC.md — eval architecture for the 9 agents

WHY THIS IS THE FINAL FRAMING:

v6 maps directly to JD's literal text. Each JD bullet has a deliverable:
  - "Design and implement agentic workflows across ML lifecycle" → 6 specialty agents
  - "Build AI-native systems where agents interact with code/tools/exp" → CodeAgent + others
  - "Self-improving loops" → GepaOptimizer
  - "Eval platforms (auto + human + agent-driven)" → EvalAgent + MultiJudge + Inspect AI
  - "Multimodal ML pipelines" → MultimodalPipelineAgent + DataAgent
  - "Engineering excellence" → 9 invariants + AGENTIC_EVAL_SPEC

No more over-corrections. v6 is the final framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes P0.5 (4 days of foundation work shipped same day).

NEW FILES:
  cosmos_lab/harness/CONTRACT.md (220 lines)
    - Documents 2 adapter families per v5.1/v6 architecture:
      Family A — Execution Substrate Adapter (ml_intern, future claude_sdk)
        Contract: install(host, identity, audit_log) -> None
      Family B — Deployment Surface Adapter (nat, future langgraph)
        Contract: register_as_X_tool(builder) -> None
    - 5 shared requirements (S1-S5) all adapters must satisfy:
      S1 Idempotency, S2 Composition only, S3 Input validation,
      S4 Returns None, S5 No partial state on failure
    - Per-adapter specifics + future adapter checklist
    - Anti-patterns explicitly rejected

  tests/optimization/harness/test_adapter_contract.py (160 lines)
    - Parametrized contract tests across BOTH shipped adapters
    - 9 tests: 3 cross-family shared (S1, S4, S5) + 2 family-specific
      + 1 coverage sanity test
    - ADAPTERS registry: single source of truth for parametrization
    - When v1.1 ships claude_sdk or langgraph adapter, just add row
      to ADAPTERS — automatic contract enforcement

  bin/verify_p0_5_d4.sh — 14-check verifier

HONEST DESIGN DECISION (D4 LEARN):

Spec originally said "parametrize all 22 existing tests across both
adapters." Audit revealed: ml_intern (execution substrate) and nat
(deployment surface) have DIFFERENT contracts because they serve
DIFFERENT purposes per v5.1/v6 architecture. Forcing one signature
loses clarity.

D4 ships the honest answer:
  - Two adapter families, each with own contract signature
  - 5 shared requirements that apply to BOTH families
  - Parametrized tests for the 5 shared requirements
  - Family-specific tests stay in test_ml_intern_adapter.py / test_nat_adapter.py

This is more honest and more extensible than forcing one contract shape.

VERIFIER RESULTS:
  D1: 14/14 ✅ (no regression)
  D2: 11/11 ✅ (no regression)
  D3: 10/10 ✅ (no regression)
  D4: 14/14 ✅ (NEW)
  Upstream baseline: 237 / 3 known-broken
  Total cosmos-lab tests: 42 (16 P0 + 6 D2 + 11 D3 + 9 D4 contract)
  Zero-diff invariant: holds throughout

🎉 P0.5 COMPLETE 🎉

Final P0.5 stats:
  - 4 days work shipped on schedule
  - 12 commits on branch (P0 + 4 P0.5 days + 5 plan evolutions + 2 fixups)
  - ~3500 LOC added (cosmos_lab/ + tests/ + bin/ + docs/ + planning)
  - Foundation for 9 specialty + governance agents (P3-P10)

NEXT: P1 — Eval infrastructure (~2 weeks)
  - TrajectorySink + OTelGenAIEmitter (Phoenix backend)
  - 4 sentinel types per §3.1 taxonomy
  - MultiJudge with bootstrap CIs
  - Inspect AI bridge + 5 seed Inspect tasks
  - evaluate CLI

P1 ships eval infrastructure that becomes:
  - The foundation for EvalAgent (P4a — specialty agent huggingface#2)
  - Used by all other specialty agents (P3, P5, P6, P9, P10) for
    sentinel-gated quality + OTel observability + Inspect AI integration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a31c3c020c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread agent/core/session.py
Comment on lines +653 to +660
if personal_repo_id:
subprocess.Popen(
[
sys.executable,
str(uploader_script),
"retry",
directory,
personal_repo_id,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope personal retry uploads to the owning user

retry_failed_uploads_detached() launches a single personal retry process with the current session’s personal_repo_id, but that retry scans every session_*.json file in session_logs. If files from other users still have personal_upload_status pending/failed, they will be re-uploaded into the current user’s dataset, causing cross-user trace leakage/misattribution. Personal retries need per-file ownership/repo scoping (or should be disabled globally) instead of replaying the whole directory against one repo.

Useful? React with 👍 / 👎.

Dang Huy Hoang and others added 2 commits May 3, 2026 19:58
…kills + offline tools)

3 PARALLEL AUDITS COMPLETE:

Launched 3 senior-engineer research agents in parallel to verify v6
architecture against 2026 frontier patterns. All 3 independently
converged on the same 6 misalignments + 8 additions.

Audit 1 (Anthropic + NVIDIA 2026 patterns):
  - Anthropic Skills blog (2026) explicitly REJECTS per-domain agents
  - Anthropic subagents are task-specialized for parallelization, not
    domain-specialized
  - NVIDIA Cosmos Curator/Evaluator are single-purpose tools, not multi-
    agent fleets
  - Anthropic Memory tool = flat file, NOT hierarchy
  - No production "sentinel-trip → replan" pattern at Anthropic
  - GEPA is offline-only at Decagon; AlphaEvolve closed-source Gemini
  - Standing red-team agents in production = Microsoft/Straiker/LangWatch

Audit 2 (2026 multi-agent orchestration convergence):
  - LangGraph won production tier (Uber/JPMC/BlackRock/Cisco)
  - AutoGen → maintenance mode April 2026; Magentic-One → MAF
  - Hierarchical orchestrator-worker is THE convergent pattern
  - Multi-agent debate REFUTED (arxiv:2508.17536)
  - Spawn depth=1 is convergent default (OpenAI Codex hardcodes)
  - Hybrid memory (4-scope) is convergent (Mem0/Atlan/supermemory)
  - Specialty agents OK if distinct tool surfaces; ANTI-PATTERN if
    sequential pipeline

Audit 3 (production agent eval + governance + safety):
  - Inspect AI is de facto frontier eval substrate
  - Industry uses 3-tier (not 5-tier) eval ladder
  - Berkeley audit: 8/8 benchmarks reward-hackable to 73-100%
  - EU AI Act Aug 2 2026 deadline IN FORCE (trilogue collapsed Apr 28)
  - MCP authorization 86% enterprise adoption
  - Hash-chained Ed25519 audit logging is now production minimum
  - Gaia2 finds judge-hacking as distinct failure mode

V7 SYNTHESIS — 6 FIXES + 8 ADDITIONS:

Fixes (audit-driven):
  1. Per-domain 6 specialty agents → 4 specialty workers (distinct tool
     surfaces) + 1 PrincipalAgent supervisor + CodeWork Skill
  2. GepaOptimizer standing agent → offline batch tool (Decagon pattern)
  3. Sentinels novel "tripwire-replan" → Anthropic PostToolUse hooks
     contract (Claude Agent SDK pattern)
  4. 3-tier memory hierarchy → 4-scope hybrid (Mem0/Letta substrate)
  5. CapabilityProbe co-resident → CI/CD eval lane via Inspect AI
     snapshots (METR pattern)
  6. "Earned-trust capability expansion" oversold → standard RFC 8693
     delegation (table stakes per MCP spec, drop escalation framing)

Additions (frontier-required):
  1. LangGraph durable substrate (Uber/JPMC production winner)
  2. Magentic-One Task Ledger + Progress Ledger pattern (2-iteration
     stall detection — Microsoft Agent Framework first-class)
  3. 5th sentinel: JudgeHackingCheck (Gaia2 finding — verifier-pleasing
     artifacts without solving task)
  4. Cross-family MultiJudge (1 non-Anthropic to break correlation)
  5. CodeWork as Skill, not separate agent (Anthropic Skills pattern)
  6. RFC 8707 + RFC 8693 day-one (MCP 2026-03-15 mandate; 86% adoption)
  7. Reward-hack rate as Pareto axis in S6 cross-agent eval
  8. CUDA/cuDNN/driver versions in reproducibility envelope

V7 FINAL FLEET:

Production agents (5):
  1. PrincipalAgent (P3) — LangGraph supervisor + Magentic-One ledgers
  2. DataAgent (P4a) — distinct cosmos-curate/NeMo Curator surface
  3. EvalAgent (P5) — distinct Inspect AI/MultiJudge surface
  4. TrainOrchestrator (P5) — distinct NeMo-RL/SkyPilot surface
  5. OptimizeAgent (P6) — distinct profiler/kernel/sandbox surface

Skills (loaded by PrincipalAgent):
  - CodeWork Skill (P7) — commodity tools in E2B sandbox

Offline governance tools (NOT standing agents):
  - GepaOptimizer (P8) — monthly cron offline batch
  - CapabilityProbe (P7) — CI/CD eval lane via Inspect AI
  - CrossAgentEvaluator (P10) — quarterly Pareto generator

Infrastructure (~16 components):
  - Identity (P0 + RFC 8693), 5-type sentinels, OTel + 4-scope memory,
    Inspect AI + cross-family MultiJudge, LangGraph + Magentic-One
    substrate, ComputeBackend + sandbox, reproducibility envelope,
    nat deployment

Substrate (LEVERAGED inside LangGraph worker nodes):
  - ml-intern primitives (agent_loop, 16 tools, sandbox, MCP, cost
    estimation, doom-loop)

DOCS UPDATED:
  PLAN_V2.md
    - Header v6 → v7 (frontier-aligned production agentic system)
    - §0.5 row 14 NEW (v7 frontier-audit pivot rationale, cited)
    - §0.6 reframed (vs assembled OSS — 5 agents + Skills + offline)
    - §0.65 reframed (5 production agents + Skills + offline tools)
    - §1 phase table — ~21w with v7 phases (LangGraph + Magentic-One)
    - §3.1 sentinel taxonomy 4 → 5 types (added JudgeHackingCheck)
    - §3.2 PrincipalAgent architecture (LangGraph supervisor + Magentic-
      One ledger pattern + frontier substrate choices documented)

  CLAUDE.md identity sentence — v7 framing
  docs/01_north_star.md — full rewrite to v7 (frontier-aligned final)
  docs/02_current_phase.md — v7 schedule note (P1+ phases reframed)

WHY V7 IS FINAL:

3 independent senior-engineer audits converged on same fixes +
additions. No single auditor would catch all of these. Triangulation
across (Anthropic+NVIDIA) + (multi-agent convergence) + (eval+governance)
gives high-confidence frontier alignment.

Process needs to converge. Future audit findings document as v1.1+
work, not v8 — committing to v7 now and shipping.

ALL SHIPPED CODE PRESERVED:
  - P0 identity primitives ✅
  - P0.5 D1 library restructure ✅
  - P0.5 D2 ml_intern adapter ✅
  - P0.5 D3 nat wrapper ✅
  - P0.5 D4 adapter contract + dual-adapter test matrix ✅
  - 42 cosmos-lab tests passing ✅
  - All 4 verifiers green ✅
  - Upstream baseline preserved ✅
  - Zero-diff invariant holds ✅

Verifier: ./bin/verify.sh p0_5_d4 → 14/14 pass (still green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…and-out huggingface#3)

8-tier frontier audit + JD literal-text audit converged on the SAME gap:
context engineering / context compression. Both stand-out JD bullet huggingface#3
("Context compression / agent memory") and 8-tier audit Tier 3 (harness
& context engineering) flagged this as v7's weakest area (~50% coverage).

This commit closes the gap by adding explicit context engineering
discipline to P3 PrincipalAgent foundation. NOT a v8 thesis change —
v7 architecture stays. Just adds 4 primitives + 1 bonus to P3, with 4
new numerical commitments (E15-E18).

WHAT V7-STRONGER ADDS:

§3.2.8 NEW — Context engineering discipline (~110 lines added)

Primitive 1: Cache-aware prompt structure
  Stable prefix → tool defs → conversation layout
  Stable region NEVER changes during task (system prompt + capability
    manifest + sentinel rules)
  Tool defs change ONLY on RFC 8693 capability expansion event
  Conversation is the only churn region
  Rationale: every byte of churn voids prefix cache + 10× cost
  (Anthropic memory system + Claude API prefix caching)

Primitive 2: Compaction at 75% context utilization
  Trigger: when context window hits 75% of model limit
  Action: pause loop → Anthropic memory tool API summarization → resume
  Rationale: Anthropic context-editing pattern (auto-clears stale tool
  results when context fills); Claude Code adopts this

Primitive 3: Just-in-time retrieval via recall_relevant(goal) tool
  Don't pre-load episodic memory at session start
  Agent calls explicit tool when needed; 4-scope filtered query
  Rationale: pre-loading wastes context on irrelevant past tasks;
  just-in-time keeps stable prefix small

Primitive 4: cosmos-progress.md structured state file (per Anthropic
Claude Code claude-progress.txt pattern)
  PrincipalAgent writes after every milestone completion
  Append-only event log: DONE / IN_PROGRESS / NEXT / SURPRISES sections
  Bridges multi-day work across compute interruptions
  New session begins by reading progress file (initializer pattern)

Primitive 5: Behavior-vs-model-capability separation test
  Quarterly: snapshot agents, re-run against current + previous models
  Detect harness assumptions that went stale (Sonnet 4.5 context
  anxiety patches were dead weight in Opus 4.5 — Anthropic example)
  Flag dead-weight resets/patches for removal

NEW NUMERICAL COMMITMENTS (E15-E18):
  E15: Prefix cache hit rate ≥ 80% on stable prefix region
  E16: Compaction trigger fires at 75% ± 5% (no missed in 100 runs)
  E17: cosmos-progress.md cross-session recovery 100%
  E18: ≥1 stale assumption identified per quarterly retest

CODE STRUCTURE UPDATE:
  cosmos_lab/principal/context_eng/ NEW subpackage:
    prompt_layout.py    — cache-aware structure enforcement
    jit_retrieval.py    — recall_relevant(goal) tool
    progress_state.py   — cosmos-progress.md writer/reader
    stale_check.py      — quarterly behavior-vs-capability retest
  cosmos_lab/principal/memory/compaction.py NEW module

DOCS UPDATED:
  PLAN_V2.md
    - §1 phase table P3 row: 2w → 2.5w; explicit context engineering scope
    - §1 Total: 21w → 21.5w
    - §3.2.7 file tree: added context_eng/ subpackage + memory/compaction.py
    - §3.2.8 NEW (110 lines): full context engineering discipline spec
      with 5 primitives + 4 new commitments E15-E18

  CLAUDE.md identity sentence — added context engineering discipline mention
  docs/01_north_star.md Layer 4 — added context engineering bullet

GAPS CLOSED:
  ✅ JD stand-out huggingface#3 (Context compression / agent memory): was PARTIAL
     (memory only) → now FULL (memory + 4-primitive context engineering
     discipline + behavior-vs-capability check)
  ✅ 8-tier audit Tier 3 (Harness & context engineering): was ~50% →
     now ~85% (cache-aware prompts + compaction + JIT retrieval +
     structured state + staleness check all explicit)

WHAT V7-STRONGER PRESERVES (no thesis change):
  - All shipped code (P0, P0.5 D1/D2/D3/D4)
  - 5 production agents (PrincipalAgent + 4 specialty workers)
  - 1+ Skills (CodeWork)
  - 3 offline governance tools (GepaOptimizer, CapabilityProbe, CrossAgentEvaluator)
  - All 9 invariants
  - All 38 numerical targets (24 §0.7 + 10 E1-E10 + 4 E15-E18)
  - Library architecture, 2-layer deployment, nat wrapper
  - Verifier discipline (D4: 14/14 still passes)

VERIFIER STATE:
  D1: 14/14 ✅ D2: 11/11 ✅ D3: 10/10 ✅ D4: 14/14 ✅
  Upstream baseline: 237 / 3 known-broken
  Plan size: 1416 → 1533 lines (+117 for context engineering spec)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants