P0 + P0.5 foundation: identity, library, ml_intern + nat adapters, contract + plan v6 by andreidhoang · Pull Request #1 · andreidhoang/ml-optimization-agent

andreidhoang · 2026-05-03T12:18:17Z

Summary

Foundation for the cosmos-lab project — 9 NEW agents (6 Cosmos-specialty + 3 governance) on ml-intern primitives per PLAN_V2.md v6. This PR ships P0 (identity foundation) + all 4 days of P0.5 (library + ml_intern adapter + nat wrapper + contract) + the 13-revision plan evolution that arrived at v6 honest framing.

12 commits, 42 cosmos-lab tests passing, 237 upstream baseline preserved, zero-diff invariant holds throughout.

What ships

Code (~700 LOC)

cosmos_lab/ — Python library (importable via from cosmos_lab import ...)
- identity/ — AgentIdentity, AuditLog, CapabilityScopedRouter, OptimizationConfig (P0)
- harness/ml_intern.py — Family A execution-substrate adapter, install_into_session() wraps Session.tool_router with governance (P0.5 D2)
- harness/nat.py — Family B deployment-surface adapter, register_as_nat_tool() registers cosmos_lab_principal as nat workflow tool (P0.5 D3)
- harness/CONTRACT.md — formal contract documentation: 2 adapter families + 5 shared requirements + per-adapter specifics + future-adapter checklist (P0.5 D4)

Tests (42 total)

tests/optimization/test_identity_scoping.py — 16 tests for identity primitives (P0)
tests/optimization/harness/test_ml_intern_adapter.py — 6 smoke tests (P0.5 D2)
tests/optimization/harness/test_nat_adapter.py — 11 smoke tests (P0.5 D3)
tests/optimization/harness/test_adapter_contract.py — 9 parametrized contract tests across both adapters (P0.5 D4)

Verifiers (`bin/`)

verify.sh — phase router
verify_p0_5_d1.sh (14 checks), _d2.sh (11), _d3.sh (10), _d4.sh (14) — per-day verification

Planning + context harness

PLAN_V2.md (1380 lines) — 13-revision evolution: v3.1 → v3.2 → v4 → v5 → v5.1 → v5.2 → v6 (current). Full revision history in §0.5 deltas table rows 1-13.
AGENTIC_EVAL_SPEC.md — companion to EVAL_SPEC.md, agent-system eval architecture (5-tier ladder + 6 agentic surfaces + 13 axioms + 10 numerical commitments E1-E10)
CLAUDE.md, docs/ (5 files) — context harness for AI agents working on the project
agentic_build_workflow.md — DEFINE → PROBE → BUILD → REVIEW → SHIP → LEARN methodology

v6 thesis (final framing)

cosmos-lab ships 9 NEW agents + ~16 governance infrastructure components, built on ml-intern's tool primitives leveraged AS-IS:

Layer 1 — 6 Cosmos-specialty agents (P3-P9):

1. DataAgent — cosmos-curate orchestration, real video curation
2. EvalAgent — multi-judge + sentinels + physics-consistency
3. TrainOrchestrator — Centaur HPO + NeMo-RL on real GPU
4. OptimizeAgent — profiling + ≥1.5× speedup on 4 workloads
5. MultimodalPipelineAgent — e2e Cosmos workflow on Predict 2.5
6. CodeAgent — capability-scoped, real OSS bug fixes

Layer 2 — 3 governance agents (P7-P10):
7. GepaOptimizer — weekly DSPy GEPA prompt revisions
8. CapabilityProbe — adversarial scope testing
9. CrossAgentEvaluator — quarterly Pareto vs Devin/Claude Code/human

Layer 3 — ~16 governance infrastructure components (P1-P10):
sentinels, identity v2, audit log, OTel emitter, memory tiers, Inspect AI bridge, ComputeBackend, etc.

Layer 4 — ml-intern primitives leveraged AS-IS:
agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection.

Plan evolution honest postmortem

Version	Framing	Verdict
v3.1/v3.2/v4	6 specialty agents	✅ Correct on agent count
v5/v5.1	1 PrincipalAgent (re-implemented ml-intern)	❌ Over-correction #1
v5.2	0 new agents (governance only)	❌ Over-correction #2
v6 (final)	6 specialty + 3 governance, on ml-intern primitives	✅ Synthesis

The user correctly caught both over-corrections. v6 is the honest answer that maps directly to the JD's literal text ("agents (plural) help generate data, surface failures, evaluate outputs" + stand-out "agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration").

Verifier state (all green)

D1: 14/14 ✅
D2: 11/11 ✅
D3: 10/10 ✅
D4: 14/14 ✅
Upstream baseline: 237 pass / 3 known-broken ✅
Zero-diff invariant: holds throughout ✅

Schedule

~19 weeks total (between v5/v5.1's 22.5w and v5.2's 13w — honest middle ground).

✅ P0 + P0.5 (~1.6 weeks) shipped in this PR
🚧 ~17 weeks remaining for P1-P10 (eval infrastructure → 9 agents → production deploy)

Test plan

uv sync --extra dev
./bin/verify.sh p0_5_d1 → 14/14
./bin/verify.sh p0_5_d2 → 11/11
./bin/verify.sh p0_5_d3 → 10/10
./bin/verify.sh p0_5_d4 → 14/14
uv run python -m pytest tests/optimization/ -q → 42 passed
uv run python -m pytest tests/unit/ -q → 237 passed / 3 known-broken (no new regressions)
git diff upstream/main --name-only → only owned paths

Reading order for review

For ~30 min focused review:

PLAN_V2.md §0.6 (v6 unique value) + §0.65 (9 agents) + §0.9 (thesis) — the load-bearing v6 framing
cosmos_lab/harness/CONTRACT.md — adapter contract design decision
cosmos_lab/harness/ml_intern.py (84 LOC) + cosmos_lab/harness/nat.py (132 LOC) — the two shipped adapters
tests/optimization/harness/test_adapter_contract.py — parametrized contract tests
AGENTIC_EVAL_SPEC.md §1 (why agentic eval differs) + §4 (6 surfaces) — eval architecture (used by all 9 agents)

For deeper review (~2 hours): read PLAN_V2.md §0.5 deltas table to follow the v3 → v6 evolution; read all 12 commit messages in order.

🤖 Generated with Claude Code

…ttribution (huggingface#179)

* feat(telemetry): track 5 untracked Bedrock call sites for full cost attribution

Cost Explorer ($78,738 over 6 days) vs the session dataset's total_cost_usd (~$354/day attributed) showed the dataset captures only ~33% of real Bedrock spend. Root cause: out of 9 acompletion() call sites, only 2 (in agent_loop.py) emit the llm_call event that total_cost_usd sums.

This wires telemetry into the 5 Bedrock-billing call sites that were flying blind, with a `kind` tag on each call so analytics can split spend by category:
- research_tool.py × 3 → kind="research" (sub-agent loop)
- context_manager.py → kind="compaction" (history summary)
- effort_probe.py → kind="effort_probe" (cascade walk)

Plus a fourth tag for the session-restore summary path (session_manager.py → kind="restore").

Plumbing changes:
- telemetry.record_llm_call now accepts kind="..." (default "main" preserves existing behavior).
- summarize_messages() and ContextManager.compact() take optional session=None so the caller can opt into telemetry.
- probe_effort() takes optional session=None for the same reason.
- Both probe_effort callers (agent_loop._heal_effort_error and model_switcher) now pass session.

Skipped:
- routes/agent.py /title — uses HF Router (Cerebras), not Bedrock
- routes/agent.py /health/llm — no session context (manual diagnostic endpoint, ~$0.02/call, not billable to a user)

After deploy, expect dataset total_cost_usd to converge with Cost Explorer to within 5-10%. The kind breakdown will quantify each category, validating the cost-plan estimates in ml_intern_bedrock_cost_plan.md.

* fix(telemetry): address PR bot feedback (2 P1 + 1 P2)

1. P1 — Wrap each research_tool record_llm_call in its own try/except. record_llm_call's inner send_event is wrapped, but extract_usage (telemetry.py:101) is not — an unexpected usage shape from LiteLLM could propagate. At all 3 research sites the surrounding except-block would convert that into "Research summary call failed", masking a valid LLM response. Match the effort_probe pattern: dedicated try/except logging at DEBUG.
2. P1 — Hoist `import time` from inside summarize_messages() to module level in manager.py. stdlib, always available, matches the rest of the module.
3. P2 — Update telemetry.py docstring kind list. Drop title_gen and model_probe (skipped per PR description), add restore (emitted from session_manager.py). Note the intentional skips at the bottom.
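The first P1 item above describes isolating telemetry so that a failure while recording usage cannot mask a valid LLM response. A minimal sketch of that pattern follows. The `record_llm_call` body, the `FakeSession` class, and the response shape are hypothetical stand-ins for illustration, not the actual ml-intern code.

```python
import logging

logger = logging.getLogger("telemetry_sketch")


class FakeSession:
    """Hypothetical stand-in for a session that collects telemetry events."""

    def __init__(self) -> None:
        self.events = []


def record_llm_call(session, response, kind="main"):
    """Stand-in recorder: raises KeyError if the usage shape is unexpected."""
    usage = response["usage"]
    session.events.append(
        {"event": "llm_call", "kind": kind, "tokens": usage["total_tokens"]}
    )


def research_summary(session, call_model):
    # Outer guard: only a failed model call produces the failure message.
    try:
        response = call_model()
    except Exception:
        return "Research summary call failed"
    # Dedicated guard: a telemetry failure is logged at DEBUG and swallowed,
    # so the valid response below is still returned to the caller.
    try:
        record_llm_call(session, response, kind="research")
    except Exception:
        logger.debug("telemetry failed; keeping LLM response", exc_info=True)
    return response["text"]


ok_session = FakeSession()
ok_text = research_summary(
    ok_session, lambda: {"text": "summary", "usage": {"total_tokens": 12}}
)

bad_session = FakeSession()
bad_text = research_summary(bad_session, lambda: {"text": "summary"})  # no usage
```

With a single shared `try` block, the second call would have returned the failure message even though the model produced a usable summary.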

huggingface#183) * Add agent dev server notes * Make frontend model configurable * Support env-selected frontend models * Use Claude-specific model env var * Add GPT-5.5 to web model picker * Gate GPT-5.5 as a premium model * Avoid duplicate session model fetch * Remove legacy Claude quota aliases * Document GitHub CLI PR body workflow * Gate only deployed paid model IDs * Nits

* Make sandbox Spaces private Co-authored-by: Codex <codex@openai.com> * Remove legacy sandbox auth fallback Co-authored-by: Codex <codex@openai.com> * Address sandbox privacy review comments Co-authored-by: Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>

* Add DeepSeek V4 Pro model option Co-authored-by: Codex <codex@openai.com> * Remove DeepSeek feature tests Co-authored-by: Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>

* feat: add share_traces toggle and per-user trace repo template * feat: support Claude Code JSONL format and per-target auth * feat: dual-upload sessions to private user trace dataset * chore: retry personal trace uploads on booting * feat: add /share-traces command to flip dataset visibility * docs: document HF trace auto-share and /share-traces * Use HF token owner for local dev auth Co-authored-by: Codex <codex@openai.com> * Rename personal session trace dataset Co-authored-by: Codex <codex@openai.com> * Add session dataset card metadata Co-authored-by: Codex <codex@openai.com> * Fix session trace upload review issues Co-authored-by: OpenAI Codex <codex@openai.com> * Preserve secret scrubbing before trace uploads Co-authored-by: OpenAI Codex <codex@openai.com> * Link ML Intern demo in dataset card Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> Co-authored-by: Codex <codex@openai.com>

* Use HF username for personal trace uploads Co-authored-by: OpenAI Codex <codex@openai.com> * Remove redundant HF token branch Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: OpenAI Codex <codex@openai.com>

…ace#204) * chore: update the agent system prompt * chore: update the tool documentation

* Add session YOLO auto-approval budget Co-authored-by: Codex <codex@openai.com> * Address YOLO approval review feedback Co-authored-by: Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>

* Auto-start CPU sandboxes for sessions Co-authored-by: Codex <codex@openai.com> * Retry sandbox runtime visibility checks Co-authored-by: Codex <codex@openai.com> * Stabilize auto CPU sandbox creation Co-authored-by: OpenAI Codex <codex@openai.com> * Address sandbox PR review comments Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>

Co-authored-by: OpenAI Codex <codex@openai.com>

* Fallback to free model for gated defaults Co-authored-by: OpenAI Codex <codex@openai.com> * Cover explicit gated model access for HF users Co-authored-by: OpenAI Codex <codex@openai.com> * Seed model picker from created session Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: OpenAI Codex <codex@openai.com>

Ships the cosmos-lab Phase 0 deliverables — extends upstream Config with OptimizationConfig, plus an unsigned in-process AgentIdentity, append-only JSONL AuditLog, and a CapabilityScopedRouter that filters tool specs and audits before/after/exception around every tool call.

Composition over inheritance: CapabilityScopedRouter wraps any duck-typed router, so unit tests use a FakeRouter and skip the heavy ToolRouter init (which requires HF auth + sandbox bootstrapping).

AuthZ + audit only — no AuthN. Identity is unsigned; signing tokens, OAuth 2.1 + RFC 8707/8693 sub-agent scope-down, and hash-chained signed audit log all land in P4b. Documented as such in module docstrings.

16 tests in tests/optimization/test_identity_scoping.py cover config inheritance, round-trip, identity scoping/parent chain, audit JSONL round-trip + parent dirs + parseable lines, router denial/allow paths, exception path, spec filtering, root sees-all, and canonical args_hash.

Acceptance: pytest tests/optimization/ exits 0; pytest tests/unit/ remains 237 pass / 3 known-broken (no new regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
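To make the wrapping behavior concrete, here is a minimal synchronous sketch of a capability-scoped router over a duck-typed inner router. It is an illustration under assumptions, not the shipped code: the method names `tool_specs` and `call_tool`, the spec shape, the event names, and the use of `PermissionError` are all hypothetical, and the real router may well be asynchronous.

```python
import hashlib
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass(frozen=True)
class AgentIdentity:
    """Unsigned in-process identity; capabilities=None stands for root."""
    name: str
    capabilities: Optional[frozenset] = None

    def allows(self, tool: str) -> bool:
        return self.capabilities is None or tool in self.capabilities


class AuditLog:
    """Append-only JSONL: one JSON object per line, parent dirs created."""

    def __init__(self, path: Path) -> None:
        self.path = path
        path.parent.mkdir(parents=True, exist_ok=True)

    def append(self, **record: Any) -> None:
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, sort_keys=True) + "\n")


def args_hash(args: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CapabilityScopedRouter:
    """Wraps any object exposing tool_specs() and call_tool() (assumed names)."""

    def __init__(self, inner: Any, identity: AgentIdentity, audit: AuditLog) -> None:
        self.inner, self.identity, self.audit = inner, identity, audit

    def tool_specs(self) -> list:
        # Only advertise tools the identity is allowed to call.
        return [s for s in self.inner.tool_specs() if self.identity.allows(s["name"])]

    def call_tool(self, name: str, args: dict) -> Any:
        base = {"agent": self.identity.name, "tool": name, "args_hash": args_hash(args)}
        if not self.identity.allows(name):
            self.audit.append(event="denied", **base)
            raise PermissionError(f"{self.identity.name} may not call {name}")
        self.audit.append(event="before", **base)
        try:
            result = self.inner.call_tool(name, args)
        except Exception as exc:
            self.audit.append(event="exception", error=repr(exc), **base)
            raise
        self.audit.append(event="after", **base)
        return result


class FakeRouter:
    """Duck-typed inner router, in the spirit of the FakeRouter used in tests."""

    def tool_specs(self) -> list:
        return [{"name": "read_file"}, {"name": "delete_repo"}]

    def call_tool(self, name: str, args: dict) -> str:
        return f"{name} ok"


log_path = Path(tempfile.mkdtemp()) / "audit" / "log.jsonl"
identity = AgentIdentity("data-agent", frozenset({"read_file"}))
router = CapabilityScopedRouter(FakeRouter(), identity, AuditLog(log_path))

visible = [s["name"] for s in router.tool_specs()]
result = router.call_tool("read_file", {"path": "a.txt"})
try:
    router.call_tool("delete_repo", {})
    denied = False
except PermissionError:
    denied = True
events = [json.loads(line)["event"] for line in log_path.read_text().splitlines()]
```

Because the wrapper only relies on two methods of the inner object, a test double like `FakeRouter` is enough to exercise the denial, allow, and audit paths without constructing the host platform.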

Establishes cosmos_lab/ as the v1 importable surface for the cosmos-lab library. Code physically lives at agent/optimization/* per the zero-diff fork strategy; cosmos_lab/* re-exports so library consumers can write `from cosmos_lab.identity import AgentIdentity` without depending on upstream ml-intern import paths.

Why library form (PLAN_V2.md §0.4): cosmos-lab's value (sentinels, identity, GEPA governance, quality budget) operates on framework-agnostic interfaces. Owning an agent loop is anti-pattern in 2026. Library form plugs into nvidia-nat (primary harness, P0.5 D3), ml-intern (compat adapter, P0.5 D2), Claude SDK (v1.1).

Adds three placeholder optional-dependencies extras (nat, ml_intern, claude_sdk) — all empty for now; nvidia-nat>=1.6 dep lands in D3 with the nat adapter.

Backward compat preserved: `from agent.optimization.identity import ...` still works. All 16 Phase 0 tests pass against new path AND old path.

Verifier: ./bin/verify.sh p0_5_d1 → 14/14 pass (structure, dual-path imports, packaging, tests, zero-diff invariant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a minimal, token-efficient context harness so Claude Code (and other agentic coding assistants) can pick up cosmos-lab work cold without loading 7K+ lines of stale planning context every turn.

Design (per agentic_build_workflow.md DEFINE→PROBE→BUILD→REVIEW→SHIP→LEARN):
CLAUDE.md (94L, ~970 tokens, ALWAYS LOADED) = invariants + owned paths + 6 anti-patterns + pointer index only
docs/00_workflow.md → ../agentic_build_workflow.md (workflow methodology)
docs/01_north_star.md (vision in 1 screen, on-demand)
docs/02_current_phase.md (LIVE — what we're building today, rotates)
docs/03_pointers.md (phase → PLAN_V2 anchor map)
docs/04_jd.md (NVIDIA Cosmos JD reference)
bin/verify.sh (router: ./bin/verify.sh <phase>)
bin/verify_p0_5_d1.sh (14 concrete checks for current phase)

Why: previous CLAUDE.md (135L) had stale "Phase 0 pending" status, JD paste burning ~1.7K tokens every turn, no pointer index forcing agents to read the whole 837-line PLAN_V2.md to answer "what should I do." New harness reduces always-loaded context ~43% and enables on-demand deep reads via the pointer index.

Verifier scripts return concrete pass/fail (per workflow rule "verifier is a script, not a description"). Anti-pattern #3 ("trusting 'I have verified this' from an agent") is enforced by re-running ./bin/verify.sh yourself.

AGENTS.local.md is a symlink to CLAUDE.md so multiple agentic CLIs that look for AGENTS.md or CLAUDE.md both find the same source of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PLAN_V2.md (v4) — 24-week production-grade roadmap for cosmos-lab. Six reference agents (Data, Eval, Train, Optimize, Video, Code) on a shared governance library (sentinels, MCP-OAuth identity with sub-agent scope- down via RFC 8693, GEPA promotion contracts, quality-budget invariants). Library architecture from P0.5 onward; nvidia-nat as primary harness. Five production gates (PLAN_V2 §0.8): G1 real GPU runs (Invariant 9, ~$200-400 budget across P5/P5.5/P6/P9a) G2 PyTorch depth artifact (P5.5, ≥10% wall-clock improvement) G3 production deployment (P10, ≥100 real user sessions) G4 real multimodal data (P3, 10-100 hours real video) G5 upstream OSS PR (P10, nvidia-nat or Inspect AI) 24 numerical targets (PLAN_V2 §0.7) — phase exit conditions, not aspirations. Companion docs preserved as deep references (read on demand only): PLAN.md — original 16-week plan (superseded by v4) SYSTEM.md — full architecture deep-dive (Vietnamese, 1167L) EVAL_SPEC.md — measured-peak vs vendor-peak eval methodology WORKFLOW.md — git/PR workflow conventions RESEARCH_AHE_ANALYSIS.md — AHE (Agentic Harness Engineering) research Plan went through 4 revisions before commit: v3 — 2026-frontier verification pass (8 cited deltas) v3.1 — Jensen-grade polish (§0.6 unique value, §0.7 numerical targets, §3.1 sentinel taxonomy, noise cuts) v3.2 — library architecture pivot (P0.5 NEW, ~20 weeks) v4 — production-grade pivot (P5.5 NEW, P10 expanded, Invariant 9, §0.65 Six Reference Agents, §0.8 Production Commitments, ~22.5 weeks within original 24-week budget) Citations independently verified via WebFetch (METR reward-hacking number corrected, NeMo Agent Toolkit package name verified as nvidia-nat v1.6.0 released 2026-04-10, NemoClaw alpha-stage 2026-03-16 disclosed, EU AI Act Art. 12 enforcement date 2026-08-02 with Digital Omnibus uncertainty documented). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verifier flagged PLAN.md, SYSTEM.md, EVAL_SPEC.md, WORKFLOW.md, RESEARCH_AHE_ANALYSIS.md as unexpected upstream diffs after they were committed (previously untracked, didn't show in diff). Updated bin/verify_p0_5_d1.sh exclusion list to match actual owned planning docs.

Also tightened existing patterns from "^FOO.md" to "^FOO.md\$" so a future "FOO.md.bak" doesn't accidentally pass.

Pattern for future verifiers: when a new owned file is introduced, audit the verifier exclusion list at the same time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
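The effect of the trailing anchor can be checked with a small sketch. It uses Python's `re` module rather than the shell tooling the verifier presumably uses, and the third pattern (with an escaped dot) is an extra observation that is not part of the commit.

```python
import re

loose = re.compile(r"^FOO.md")     # pattern before the change
tight = re.compile(r"^FOO.md$")    # pattern after the change
strict = re.compile(r"^FOO\.md$")  # not in the commit: also escapes the dot

names = ["FOO.md", "FOO.md.bak", "FOOxmd"]

# For each pattern, record which candidate file names it accepts.
matches = {
    label: [n for n in names if pattern.search(n)]
    for label, pattern in [("loose", loose), ("tight", tight), ("strict", strict)]
}
```

Here `matches["loose"]` contains all three names, `matches["tight"]` drops `FOO.md.bak`, and only the escaped variant also rejects `FOOxmd`, since an unescaped `.` matches any character.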

v4 framing — "governance library wrapping 6 thin orchestrator agents" — under-delivered on JD's literal asks: "strong agency in LLM-based systems," "code agents doing real work," "AI helps build them." A NVIDIA Cosmos reviewer comparing cosmos-lab against 2026 production autonomous agents (Devin / Operator / Cursor Composer / Claude Code) would see v4 as conservative governance theater — clever judgment, weak capability. v5 inverts the hierarchy. The product is now ONE autonomous PrincipalAgent doing real long-horizon ML lifecycle work, with governance reframed as *enabler of autonomy* (not constraint). Sentinels become tripwires for replanning. Identity capabilities expand with earned track record (RFC 8693 token exchange after K sentinel-clean runs). GEPA becomes agent self-improvement with retroactive human review. The 6 v4 "agents" collapse into 6 capability domains the same agent demonstrates over P3-P9 — like one principal engineer who does data work Monday, training Tuesday, optimization Wednesday. Not six different people. PrincipalAgent runs INSIDE ml-intern's agent_loop.py (1626 lines of debugged production code) — substrate stays zero-diff, capabilities ride on top. Substantive changes: PLAN_V2.md + §0.9 Autonomous Principal Agent thesis (NEW, ~80 lines) — the load-bearing reframe; product = PrincipalAgent + harness + governance + §3.2 PrincipalAgent architecture (NEW, ~140 lines) — substrate choice (ml-intern agent_loop), autonomous loop diagram (PLAN→ EXECUTE→VERIFY→REPLAN), 3-tier memory (working/episodic/semantic), replanning logic (sentinel = information not failure), capability expansion mechanism (earned trust), concrete demo, owned path cosmos_lab/principal/ tree ~ §0.5 row 11 NEW — v5 thesis pivot rationale, cited ~ §0.6 reframed — "what only cosmos-lab does" now compared vs Devin/ Operator/Cursor Composer (2026 autonomous agents), not vs assembled OSS. 5 differentiators all autonomy-focused. 
~ §0.65 reframed — "six reference agents" → "six capability domains of the PrincipalAgent" (one agent, six skills) ~ §1 phase table reframed — phases are now "PrincipalAgent capability progression milestones" not "ship N agents" ~ Header v4 → v5 with thesis pivot explanation docs/01_north_star.md — full rewrite to PrincipalAgent framing CLAUDE.md — identity sentence updated to v5 framing (one PrincipalAgent with 6 capability domains, ml-intern agent_loop substrate, governance as enabler) What v5 KEEPS from v4: - Library architecture (P0.5) — but library = cosmos_lab.principal - Identity v2 (P4b) — but for capability EXPANSION (earned trust) - Sentinels (P1, §3.1) — but as tripwires for replanning - GEPA (P8) — but for agent self-improvement - Production deployment (P10) — agent serves real users - All 9 invariants - 24 numerical targets - 22.5 weeks (depth shifts breadth→capability, not added time) What v5 honestly does NOT pretend to do (anti-hype): - Not invent novel architectures (composes known patterns intelligently) - Not zero human oversight (sentinel trips visible; weekly review) - Not arbitrary research questions (scoped to ML lifecycle, Cosmos first) - Not online self-improvement (offline GEPA between sessions) - Capability expansion is policy-bounded, not unbounded Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the eval gap in v5 plan. Without rigorous agent-system eval, every "exceptional autonomous PrincipalAgent" claim in PLAN_V2 is unverifiable, GEPA self-improvement (P8) has no rigorous gate to ratchet against, and reward-hacking detection (per UC Berkeley + METR 2026 findings) has no operational discipline. NEW FILE — AGENTIC_EVAL_SPEC.md (528 lines): Companion to EVAL_SPEC.md. Where EVAL_SPEC covers ML-output eval (perplexity, KL divergence, latency p99 — model under test), this doc covers agent-system eval per axiom A8 ("the agent is itself an artifact-under-eval"). Structure: §0 scope + reading order §1 why agentic eval differs (trajectory ≠ output; A8; long-horizon) §2 axioms — A1-A10 transfer from EVAL_SPEC; A11-A13 are agentic-specific A11 trajectory carries information beyond outcome A12 capability expansion requires adversarial testing A13 long-horizon eval is non-fungible with short-horizon eval §3 5-tier architecture (T0-T4 specialized for agentic) §4 6 agentic-specific surfaces (S1-S6) — NEW vs EVAL_SPEC: S1 trajectory eval (tool-call efficiency, replan ratio, doom-loop) S2 plan-quality eval (LLM-judge on PLAN-phase output) S3 replan-quality eval (sentinel trips → response quality) S4 capability boundary eval (50-task adversarial probe suite) S5 reward-hacking adversarial eval (monthly red-team sprint) S6 cross-agent comparison eval (vs Devin, Claude Code, human) §5 cross-cutting meta layers (M1-M3 transfer from EVAL_SPEC) §6 statistical framework (transfers + paired tests for agent compare) §7 three input types (I1-I3 per JD bullet 5) §8 operational cadence + gates + verifier scripts §9 numerical targets — 10 eval-system commitments (E1-E10) §10 implementation map to v5 phases (no new phase needed, ~3-4 days spread across P1, P4a, P4b, P9b, P10) §11 references (METR, UC Berkeley, τ-bench, BFCL, GAIA-2, etc.) 
PLAN_V2.md additions: + §3.3 Agentic eval architecture (in-plan summary + pointer to spec) + §0.7 v5 eval-system additions subtable (10 numerical targets E1-E10) E1 sentinel suite FPR ≤ 5% E2 sentinel suite FNR ≤ 1% E3 T1 test-retest r ≥ 0.95 E4 plan-quality LLM-judge ↔ human ≥ 80% E5 replan success rate ≥ 70% E6 capability boundary 100% (0 unauthorized in 100 runs) E7 reward-hack discovery rate trending downward E8 PrincipalAgent on Pareto frontier vs comparison agents E9 eval cost ≤ 15% of total project spend E10 reproducibility envelope coverage 100% Total commitments: 24 (original §0.7) + 10 (E1-E10) = 34 + §1.5 reuse map — companion specification documents subsection explaining EVAL_SPEC vs AGENTIC_EVAL_SPEC scope CLAUDE.md pointer index updated — distinguishes ML-output eval doc from agent-system eval doc. docs/03_pointers.md updated — adds §3.2 + §3.3 anchors and a new "Companion specification docs" subtable. bin/verify_p0_5_d1.sh — adds AGENTIC_EVAL_SPEC.md to owned-paths exclusion list (planning doc, owned). Verifier: ./bin/verify.sh p0_5_d1 → 14/14 pass (still green). Plan size: 1173 lines (was 1075 — added ~100 lines for §3.3 + E1-E10). Why this matters now: per JD bullet 5 ("design and scale evaluation platforms that combine automated metrics, human feedback, and agent-driven analysis") and 2026 reward-hacking crisis (METR: o3 hacks 30%+ RE-Bench; UC Berkeley: 8/8 top benchmarks hackable to 73-100%). A v5 PrincipalAgent without rigorous agentic eval IS the failure mode those papers describe — confident, capable, unverifiable, reward- hacking. The eval architecture is the difference between exceptional agent we can defend with numbers vs autonomous agent that "looks good in the demo." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
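Targets E1 and E2 above are thresholds on the sentinel suite's false-positive and false-negative rates. A minimal sketch of how such a gate could be computed from confusion counts follows. The class, the function name, and the counts are illustrative assumptions, not measured results from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from running a sentinel suite on labeled cases."""
    tp: int  # bad runs correctly flagged
    fp: int  # clean runs wrongly flagged
    tn: int  # clean runs correctly passed
    fn: int  # bad runs missed

    @property
    def fpr(self) -> float:
        negatives = self.fp + self.tn
        return self.fp / negatives if negatives else 0.0

    @property
    def fnr(self) -> float:
        positives = self.fn + self.tp
        return self.fn / positives if positives else 0.0


def meets_e1_e2(c: Confusion, max_fpr: float = 0.05, max_fnr: float = 0.01) -> bool:
    """E1: FPR <= 5%; E2: FNR <= 1% (thresholds taken from the commit text)."""
    return c.fpr <= max_fpr and c.fnr <= max_fnr


# Made-up counts for illustration only.
passing = Confusion(tp=199, fp=30, tn=770, fn=1)   # FPR 3.75%, FNR 0.5%
failing = Confusion(tp=190, fp=30, tn=770, fn=10)  # FNR 5% breaks E2
```

The asymmetric thresholds mean a missed bad run is treated as roughly five times more costly than a false alarm.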

Ships the v1 compat adapter that installs cosmos-lab governance into an existing ml-intern Session. Per PLAN_V2.md §0.4 library architecture and docs/02_current_phase.md D2 spec.

NEW FILES:
cosmos_lab/harness/__init__.py — public API: install_into_session
cosmos_lab/harness/ml_intern.py (84 L) — the adapter
tests/optimization/harness/__init__.py
tests/optimization/harness/test_ml_intern_adapter.py (175 L) — 6 smoke tests
bin/verify_p0_5_d2.sh — D2 verifier (11 checks)

ADAPTER CONTRACT:
install_into_session(session, identity, audit_log) -> None
Wraps session.tool_router with CapabilityScopedRouter so every tool invocation through Session is governed by cosmos-lab identity + audit. Composition only — no upstream files modified (Invariant 1).
Idempotency: re-installing on the same Session raises RuntimeError to prevent shadowing audit history.

SMOKE TEST DESIGN:
Uses duck-typed MockSession (just .tool_router) instead of constructing a real ml-intern Session (which requires Config + ContextManager + event_queue + sandbox). The adapter's contract is "I wrap .tool_router" — smoke test verifies that one thing.
6 tests: 3 contract (wraps, raises on no-router, refuses re-install) + 3 e2e behavior (denial works via wrapped session, audit recorded, authorized passes through).

VERIFIER RESULT:
./bin/verify.sh p0_5_d2 → 11/11 pass
./bin/verify.sh p0_5_d1 → 14/14 pass (no regression)
Upstream baseline: 237 pass / 3 known-broken (no regression)
Zero-diff: only owned paths modified

LEARN (3 surprises, captured for future phases):
1. Editable install staleness — adding cosmos_lab/harness/ after the prior `uv sync` left package metadata stale. Pattern: any new package directory needs `uv sync` before tests pass.
2. `uv run pytest` is ambiguous — PATH leak resolved to miniconda's pytest (with stale editable install) instead of venv. Symptom: `python -c "import cosmos_lab.harness"` succeeded everywhere but pytest collection failed with ModuleNotFoundError. Fix: use `uv run python -m pytest` for deterministic venv resolution. All verifier scripts updated.
3. Smoke test design — testing the adapter contract (one method) does not require constructing the host platform (full Session). Duck-typed MockSession is the right scope. Pattern for future adapter smoke tests.

Branch: p0_5_library_restructure (continues from D1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
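The adapter contract described above (wrap `.tool_router`, refuse a second install) can be sketched in a few lines. This is an illustration under assumptions: `GovernedRouter` is a placeholder for CapabilityScopedRouter, and the commit does not say which exception is raised when the session has no router, so `ValueError` here is a guess.

```python
class GovernedRouter:
    """Placeholder for CapabilityScopedRouter; only records what it wraps."""

    def __init__(self, inner, identity, audit_log) -> None:
        self.inner = inner
        self.identity = identity
        self.audit_log = audit_log


def install_into_session(session, identity, audit_log) -> None:
    router = getattr(session, "tool_router", None)
    if router is None:
        # Assumed exception type; the source only says the adapter raises.
        raise ValueError("session has no tool_router to wrap")
    if isinstance(router, GovernedRouter):
        # Per the commit: re-install raises RuntimeError so an existing
        # audit history is not shadowed by a second wrapper.
        raise RuntimeError("governance already installed on this session")
    session.tool_router = GovernedRouter(router, identity, audit_log)


class MockSession:
    """Duck-typed session: the adapter only needs a .tool_router attribute."""

    def __init__(self, tool_router) -> None:
        self.tool_router = tool_router


original_router = object()
session = MockSession(original_router)
install_into_session(session, identity="data-agent", audit_log=[])

try:
    install_into_session(session, identity="data-agent", audit_log=[])
    second_install_error = None
except RuntimeError as exc:
    second_install_error = exc
```

Detecting a prior install by checking the wrapper type keeps the adapter stateless; a marker attribute on the session would be an equally plausible design.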

ARCHITECTURAL DECISION (v5.1, PLAN_V2 §0.4.5):
After auditing agent_loop.py:1771, discovered ml-intern's submission_loop is queue-based (asyncio.Queue for submissions in + events out), not function-based. Embedding it as runtime substrate inside another orchestrator requires a 1-2 week async-bridge engineering effort that v5 implicitly assumed but never budgeted.

v5.1 commits to TWO-LAYER architecture instead of three:
Layer 1: cosmos-lab CLI (PRIMARY entry point)
- PrincipalAgent + governance + sentinels + memory + sub-agents
- long-horizon orchestration ABOVE the Session level
Layer 2: ml-intern Session (execution SUBSTRATE, per task)
- debugged ReAct loop (1626L) + 16 tools + MCP + sandbox
- wrapped by D2 adapter (CapabilityScopedRouter)
Deployment wrappers (P10): nat workflow YAML, Modal/HF Spaces endpoint — INVOKE cosmos-lab CLI, don't host it inside

Why we explicitly REJECTED 3-layer runtime:
1. submission_loop queue-based design = real async-bridge work
2. nat-at-runtime solves the wrong problem; nat-at-deployment is enough for Cosmos pitch (`nat run cosmos-lab.yaml` = "we run in your stack")
3. Complexity budget — 1-2 weeks better spent on sentinels / memory / capability domains / real GPU runs

What v5.1 PRESERVES:
- PrincipalAgent thesis (§0.9), library architecture (§0.4), AGENTIC_EVAL_SPEC, sentinel taxonomy (§3.1), 9 invariants, 34 numerical targets, 6 capability domains, 22.5w schedule, every commit already shipped (P0, P0.5 D1, P0.5 D2, AGENTIC_EVAL_SPEC, v5 thesis).

What v5.1 DROPS:
- "nat-runnable from P1 onward" (over-promise)
- 3-layer runtime architecture
- ml-intern submission_loop bridge work (~1.5-2w saved → risk buffer)

P0.5 D3 IMPLEMENTATION (matches v5.1 reframed scope):
cosmos_lab/harness/nat.py (132 LOC, ~78 non-comment, ≤200 budget)
- register_as_nat_tool(builder) — lightweight registration shim
- Registers ONE tool `cosmos_lab_principal` in nat Builder
- Tool body: stub returning structured dict (real CLI invocation lands in P3 PrincipalAgent v0)
- BuilderLike Protocol for duck-typed compatibility
- Idempotency check via _is_already_registered()
- Multi-method fallback (add_function / register_function / etc)
tests/optimization/harness/test_nat_adapter.py (135 LOC)
- 11 smoke tests: 5 contract (registration mechanics) + 6 tool callable
- Uses MockBuilder pattern (mirrors D2 MockSession per LEARN #3)
- All pass against duck-typed builders
bin/verify_p0_5_d3.sh — 10 checks

Verifier result: 10/10 pass
D1 + D2 still green (no regression): 14/14 + 11/11
Upstream baseline: 237 / 3 known-broken
Total cosmos-lab tests: 22 + 11 = 33

DOCS UPDATED:
PLAN_V2.md §0.4.5 NEW (~120 lines) — two-layer architecture decision
CLAUDE.md dev commands — `uv sync --extra dev` + `uv run python -m pytest`
docs/02_current_phase.md — D3 archived, D4 spec written

LEARN from D3:
1. Always read substrate code before architecting on it (would have caught queue-based submission_loop earlier; v5 thesis pivot would have proposed 2-layer from start, not 3-layer)
2. `uv sync` without `--extra dev` removes pytest from venv (caught during D3 verifier; CLAUDE.md updated)
3. Two layers > three when one is sufficient (workflow anti-pattern #4 generalized: don't build a layer that doesn't earn its complexity)

Branch: p0_5_library_restructure (continuing from D2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
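The registration shim described above (one tool, an idempotency check, and a fallback across several registration method names) might look roughly like the sketch below. It runs only against a duck-typed `MockBuilder`; the `(name, fn)` call shape, the marker attribute, and the no-op behavior on re-registration are assumptions, and the real nvidia-nat Builder signatures are not reproduced here.

```python
from typing import Any, Callable, Dict

TOOL_NAME = "cosmos_lab_principal"
# Candidate method names tried in order (taken from the commit's fallback list).
_REGISTER_METHODS = ("add_function", "register_function")
_MARKER = "_cosmos_lab_registered"  # hypothetical marker attribute


def cosmos_lab_principal(task: str) -> Dict[str, Any]:
    """Stub tool body: returns a structured dict instead of doing real work."""
    return {"tool": TOOL_NAME, "status": "stub", "task": task}


def _is_already_registered(builder: Any) -> bool:
    return bool(getattr(builder, _MARKER, False))


def register_as_nat_tool(builder: Any) -> bool:
    """Register the tool once; return False if it was already registered."""
    if _is_already_registered(builder):
        return False
    for method_name in _REGISTER_METHODS:
        method = getattr(builder, method_name, None)
        if callable(method):
            method(TOOL_NAME, cosmos_lab_principal)  # assumed (name, fn) shape
            setattr(builder, _MARKER, True)
            return True
    raise TypeError("builder exposes no known registration method")


class MockBuilder:
    """Duck-typed builder exposing only register_function."""

    def __init__(self) -> None:
        self.registered: Dict[str, Callable[..., Any]] = {}

    def register_function(self, name: str, fn: Callable[..., Any]) -> None:
        self.registered[name] = fn


builder = MockBuilder()
first = register_as_nat_tool(builder)
second = register_as_nat_tool(builder)
reply = builder.registered[TOOL_NAME]("curate 10 clips")
```

Since `MockBuilder` lacks `add_function`, the loop falls through to `register_function`, which is the fallback path the commit describes.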

ROOT CAUSE - v5/v5.1 over-correction:

v5 thesis claimed "ONE exceptional autonomous PrincipalAgent" and planned to build planner + executor + memory + sub-agent spawning under cosmos_lab/principal/. v5.1 reframed as "2-layer (cosmos-lab CLI on ml-intern Session)" with same re-implementation work. User audit caught the contradiction: I claimed leverage but described re-building.

Audit of ml-intern itself confirmed:
- agent/prompts/system_prompt_v3.yaml VERBATIM: "You are ML Intern, an ML engineering assistant... fully autonomous — research, validate, implement, and deliver results without asking for unnecessary confirmation"
- agent/tools/plan_tool.py - built-in planning (todo list)
- agent/tools/research_tool.py - sub-agent spawning ("spawns a cheap LLM call with focused research task and returns summary")
- agent/core/agent_loop.submission_loop - autonomous execution loop
- agent/core/doom_loop.py - failure-loop detection
- agent/core/cost_estimation.py - per-call cost tracking
- agent/tools/{jobs,papers,research,sandbox,...} - 20+ ML tools

ml-intern IS already what v5/v5.1 PrincipalAgent claimed to be. v5/v5.1 was workflow anti-pattern #4 generalized: "building a 5000-LOC re-implementation that should have been a governance wrapper."

v5.2 PIVOT - governance layer (back to v4's correct direction with v5 production rigor):

cosmos-lab is the production governance layer that makes ml-intern (or any autonomous ML agent) safe to deploy at NVIDIA Cosmos scale. ml-intern provides autonomy; cosmos-lab provides production discipline.

The 10 governance components ml-intern doesn't have, that cosmos-lab adds:
1. Sentinel-gated quality (4 sentinel types paired with judge)
2. Cross-session memory (3-tier hierarchical, persistent)
3. RFC 8693 capability expansion (earned-trust scope growth)
4. Hash-chained signed audit (EU AI Act Art. 12 compliant)
5. OTel-GenAI native observability (gen_ai.* semconv, portable)
6. GEPA self-improvement (offline DSPy 3.x, retroactive review)
7. MultiJudge with bootstrap CIs (no debate dynamics)
8. Inspect AI integration (UK AISI standard adoption)
9. PR-gating + canary deployment (sequential testing)
10. AGENTIC_EVAL_SPEC discipline (T0-T4 + S1-S6 + E1-E10)

WHAT v5.2 PRESERVES:
- All shipped code (P0 identity + P0.5 D1/D2/D3 adapters) - these ARE the governance foundation
- AGENTIC_EVAL_SPEC.md - eval architecture is THE product spec now
- All 9 invariants
- 34 numerical targets (24 §0.7 + 10 E1-E10)
- Library architecture (cosmos_lab/ pip-installable)
- Two-layer deployment (cosmos-lab CLI + ml-intern; nat as P9 wrapper)

WHAT v5.2 DROPS:
- "ONE PrincipalAgent" framing (ml-intern is the agent)
- Re-implementation of planner / executor / memory tier internals / sub-agent spawning (~5000 LOC of unnecessary code)
- 6 capability domains framing (replaced with 6 governance enhancements applied to ml-intern's existing capabilities)
- 22.5w schedule (compressed to ~13w by dropping re-implementation work)

NET EFFECTS:
- Plan compressed 22.5w → ~13w (~9 weeks banked for v1.1 polish or buffer)
- Stronger Cosmos pitch ("we make autonomous agents production-safe" = 2026 frontier gap nobody fills end-to-end)
- Honest about leverage (Cosmos reviewer running git ls-files won't see PrincipalAgent re-implementation)
- All shipped commits stay valid and become more important (D2 ml_intern adapter is THE primary product surface)
- AGENTIC_EVAL_SPEC's E1-E10 = literal product spec, not side document

DOCS UPDATED:
PLAN_V2.md
- Header (v5.2 governance layer thesis)
- §0.5 row 12 NEW (v5.2 honest leverage pivot rationale, cited)
- §0.6 reframed (10 governance items vs ml-intern bare)
- §0.65 reframed (6 governance enhancements, not 6 PrincipalAgent capabilities; with concrete demonstration block)
- §0.9 simplified (ml-intern is the agent; cosmos-lab is governance)
- §1 phase table compressed (~13w, governance enhancements + ml-intern demonstrations)
CLAUDE.md identity sentence - v5.2 framing
docs/01_north_star.md - full rewrite to v5.2 governance-layer
docs/02_current_phase.md - v5.2 schedule note (P1+ phases)

VERIFIER STATE:
D1: 14/14 ✅ (no regression)
D2: 11/11 ✅ (no regression - D2 adapter is now primary product surface)
D3: 10/10 ✅ (no regression)
Upstream baseline: 237 / 3 known-broken
Total cosmos-lab tests: 33

Branch: p0_5_library_restructure (10th commit). Plan size: 1294 lines (added ~140 for v5.2 reframe content).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
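Governance component 4 above (hash-chained audit) rests on a simple idea: each entry commits to the hash of the one before it, so editing any past entry breaks every later link. The sketch below illustrates only the chaining step with SHA-256. It is not the project's AuditLog, all names are hypothetical, and the signing step is omitted on purpose.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    """Hash the previous link together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """Append a payload, linking it to the hash of the current last entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    )


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every link; any edited payload or reordered entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log: List[Dict[str, Any]] = []
append_entry(log, {"agent": "demo", "action": "tool_call", "tool": "sandbox"})
append_entry(log, {"agent": "demo", "action": "tool_result", "ok": True})
```

A signed variant would additionally sign each `hash` value with a private key. That part is left out here so the example needs only the standard library.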

…tern primitives

USER CAUGHT THE ROOT ISSUE:

v5/v5.1 collapsed v4's 6 specialty agents into 1 PrincipalAgent (over-correction #1). v5.2 then removed the agents entirely, claiming "ml-intern is the agent" (over-correction #2). Both wrong.

JD re-read CAREFULLY confirms multiple specialty agents needed:
- Role mission: "agentic SYSTEMS that reason about, build, evaluate, and improve AI systems themselves" (plural systems)
- What you'll do bullet 3: "self-improving loops where agents help generate data, surface failures, evaluate outputs" (multiple agents, different jobs)
- Stand-out bullet 1: "agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration" (6 work types = 6 agent types)

ml-intern's tool primitives are HF-generic. Cosmos team needs Cosmos-specialized agents (cosmos-curate orchestration, NeMo-RL training, NIM inference, multimodal physics, real video pipelines). ml-intern's primitives are SUBSTRATE we use, not the agents themselves.

v6 SYNTHESIS - best of all prior versions:
- v3.x/v4: correct on agent count (6 specialty agents)
- v5/v5.1: correct on production rigor (real GPU, sentinels, AGENTIC_EVAL_SPEC)
- v5.2: correct on leverage discipline (use ml-intern primitives, don't reimplement)
- v6: 6 specialty + 3 governance + ~16 infrastructure + ml-intern primitives

THE 9 NEW AGENTS (the product):

Layer 1 - 6 Cosmos-specialty (real ML lifecycle work):
1. DataAgent (P3) - cosmos-curate orchestration, real video curation
2. EvalAgent (P4a) - multi-judge + sentinels + physics-consistency
3. TrainOrchestrator (P5) - Centaur HPO + NeMo-RL on real GPU
4. OptimizeAgent (P6) - profiling + ≥1.5× speedup on 4 workloads
5. MultimodalPipelineAgent (P9) - e2e Cosmos workflow on Predict 2.5
6. CodeAgent (P9) - capability-scoped, real OSS bug fixes

Layer 2 - 3 governance (meta-layer):
7. GepaOptimizer (P8) - weekly DSPy GEPA prompt revisions
8. CapabilityProbe (P7) - adversarial scope testing
9. CrossAgentEvaluator (P10) - quarterly Pareto vs Devin/Claude Code/human

PLUS ~16 governance infrastructure components (sentinels, identity v2, audit log, OTel, memory tiers, Inspect AI bridge, ComputeBackend, etc.)

PLUS ml-intern primitives leveraged AS-IS (agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection).

SCHEDULE: ~19 weeks (between v5/v5.1's 22.5w and v5.2's 13w). Tighter than v5/v5.1 because we leverage ml-intern primitives. Bigger than v5.2 because we restore the 9 agents JD asks for. Honest middle ground.

DOCS UPDATED:
PLAN_V2.md
- Header v5.2 → v6 (with honest postmortem of v5/v5.1/v5.2 over-corrections)
- §0.5 row 13 NEW (v6 restore agents pivot rationale)
- §0.6 reframed (9 agents + ~16 infra components vs assembled OSS)
- §0.65 reframed (Layer 1 + Layer 2 + Layer 3 + Layer 4 honest count)
- §0.9 reframed (cosmos-lab builds agents on ml-intern primitives)
- §1 phase table - 19w with 9-agent reality
CLAUDE.md identity sentence - v6 framing
docs/01_north_star.md - full rewrite to v6
docs/02_current_phase.md - v6 schedule note

WHAT v6 PRESERVES:
- All shipped code (P0, P0.5 D1/D2/D3, AGENTIC_EVAL_SPEC) - these are the substrate agents will use
- All 9 invariants
- 34 numerical targets (24 §0.7 + 10 E1-E10)
- Library architecture (cosmos_lab/ pip-installable)
- Two-layer deployment (cosmos-lab CLI + ml-intern; nat as P10 wrapper)
- AGENTIC_EVAL_SPEC.md - eval architecture for the 9 agents

WHY THIS IS THE FINAL FRAMING:

v6 maps directly to JD's literal text. Each JD bullet has a deliverable:
- "Design and implement agentic workflows across ML lifecycle" → 6 specialty agents
- "Build AI-native systems where agents interact with code/tools/exp" → CodeAgent + others
- "Self-improving loops" → GepaOptimizer
- "Eval platforms (auto + human + agent-driven)" → EvalAgent + MultiJudge + Inspect AI
- "Multimodal ML pipelines" → MultimodalPipelineAgent + DataAgent
- "Engineering excellence" → 9 invariants + AGENTIC_EVAL_SPEC

No more over-corrections. v6 is the final framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes P0.5 (4 days of foundation work shipped same day).

NEW FILES:

cosmos_lab/harness/CONTRACT.md (220 lines)
- Documents 2 adapter families per v5.1/v6 architecture:
  - Family A - Execution Substrate Adapter (ml_intern, future claude_sdk). Contract: install(host, identity, audit_log) -> None
  - Family B - Deployment Surface Adapter (nat, future langgraph). Contract: register_as_X_tool(builder) -> None
- 5 shared requirements (S1-S5) all adapters must satisfy: S1 Idempotency, S2 Composition only, S3 Input validation, S4 Returns None, S5 No partial state on failure
- Per-adapter specifics + future adapter checklist
- Anti-patterns explicitly rejected

tests/optimization/harness/test_adapter_contract.py (160 lines)
- Parametrized contract tests across BOTH shipped adapters
- 9 tests: 3 cross-family shared (S1, S4, S5) + 2 family-specific + 1 coverage sanity test
- ADAPTERS registry: single source of truth for parametrization
- When v1.1 ships claude_sdk or langgraph adapter, just add row to ADAPTERS - automatic contract enforcement

bin/verify_p0_5_d4.sh - 14-check verifier

HONEST DESIGN DECISION (D4 LEARN):

Spec originally said "parametrize all 22 existing tests across both adapters." Audit revealed: ml_intern (execution substrate) and nat (deployment surface) have DIFFERENT contracts because they serve DIFFERENT purposes per v5.1/v6 architecture. Forcing one signature loses clarity.

D4 ships the honest answer:
- Two adapter families, each with own contract signature
- 5 shared requirements that apply to BOTH families
- Parametrized tests for the 5 shared requirements
- Family-specific tests stay in test_ml_intern_adapter.py / test_nat_adapter.py

This is more honest and more extensible than forcing one contract shape.

VERIFIER RESULTS:
D1: 14/14 ✅ (no regression)
D2: 11/11 ✅ (no regression)
D3: 10/10 ✅ (no regression)
D4: 14/14 ✅ (NEW)
Upstream baseline: 237 / 3 known-broken
Total cosmos-lab tests: 42 (16 P0 + 6 D2 + 11 D3 + 9 D4 contract)
Zero-diff invariant: holds throughout

🎉 P0.5 COMPLETE 🎉

Final P0.5 stats:
- 4 days work shipped on schedule
- 12 commits on branch (P0 + 4 P0.5 days + 5 plan evolutions + 2 fixups)
- ~3500 LOC added (cosmos_lab/ + tests/ + bin/ + docs/ + planning)
- Foundation for 9 specialty + governance agents (P3-P10)

NEXT: P1 - Eval infrastructure (~2 weeks)
- TrajectorySink + OTelGenAIEmitter (Phoenix backend)
- 4 sentinel types per §3.1 taxonomy
- MultiJudge with bootstrap CIs
- Inspect AI bridge + 5 seed Inspect tasks
- evaluate CLI

P1 ships eval infrastructure that becomes:
- The foundation for EvalAgent (P4a - specialty agent #2)
- Used by all other specialty agents (P3, P5, P6, P9, P10) for sentinel-gated quality + OTel observability + Inspect AI integration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
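The ADAPTERS registry idea above (add one row, get the shared checks for free) can be sketched without any test framework. The version below is an illustration under assumptions: the row shape, the fake host, and the two fake install functions are hypothetical stand-ins for the real adapters, and only S1 (idempotency) and S4 (returns None) are checked. The real suite is described as pytest-parametrized; a plain loop is used here so the example stays dependency-free.

```python
from typing import Any, Callable, List, NamedTuple


class AdapterCase(NamedTuple):
    """One registry row: how to build a host and how to install into it."""
    name: str
    family: str  # "A" = execution substrate, "B" = deployment surface
    make_host: Callable[[], Any]
    install: Callable[[Any], Any]


class FakeHost:
    """Hypothetical host that records what has been installed into it."""

    def __init__(self) -> None:
        self.installed: List[str] = []


def fake_install(host: FakeHost) -> None:
    """Well-behaved adapter: installs once, returns None."""
    if "governance" not in host.installed:
        host.installed.append("governance")


def sloppy_install(host: FakeHost) -> bool:
    """Misbehaving adapter: installs on every call and returns a value."""
    host.installed.append("governance")
    return True


# Single source of truth: a new adapter is covered by adding one row here.
ADAPTERS = [
    AdapterCase("ml_intern", "A", FakeHost, fake_install),
    AdapterCase("nat", "B", FakeHost, fake_install),
]


def check_shared_contract(case: AdapterCase) -> List[str]:
    """Return the shared requirements this adapter violates (empty if none)."""
    failures: List[str] = []
    host = case.make_host()
    if case.install(host) is not None:
        failures.append("S4 returns None")
    snapshot = list(host.installed)
    case.install(host)  # second install must not change host state
    if host.installed != snapshot:
        failures.append("S1 idempotency")
    return failures


results = {case.name: check_shared_contract(case) for case in ADAPTERS}
```

Keeping the family-specific signatures out of the registry row (each row just supplies a one-argument `install` callable) is one way to share checks across two families whose real contracts differ.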

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a31c3c020c


chatgpt-codex-connector · 2026-05-03T12:23:58Z

+            if personal_repo_id:
+                subprocess.Popen(
+                    [
+                        sys.executable,
+                        str(uploader_script),
+                        "retry",
+                        directory,
+                        personal_repo_id,


Scope personal retry uploads to the owning user

retry_failed_uploads_detached() launches a single personal retry process with the current session’s personal_repo_id, but that retry scans every session_*.json file in session_logs. If files from other users still have personal_upload_status pending/failed, they will be re-uploaded into the current user’s dataset, causing cross-user trace leakage/misattribution. Personal retries need per-file ownership/repo scoping (or should be disabled globally) instead of replaying the whole directory against one repo.
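One way to apply the per-file scoping the review asks for is to filter the session files before retrying, so a retry for one repo never touches another user's file. The sketch below assumes each `session_*.json` file records its owner in a `personal_repo_id` field. That field, and the function name, are assumptions for illustration; only `personal_upload_status` and the `session_*.json` naming come from the review text.

```python
import json
from pathlib import Path
from typing import List

RETRYABLE_STATUSES = {"pending", "failed"}


def files_to_retry(directory: str, personal_repo_id: str) -> List[Path]:
    """Select only retryable session files owned by the given personal repo."""
    selected: List[Path] = []
    for path in sorted(Path(directory).glob("session_*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupt file: skip rather than guess
        if not isinstance(record, dict):
            continue
        if record.get("personal_upload_status") not in RETRYABLE_STATUSES:
            continue
        # Assumed ownership field: files without it, or owned by someone
        # else, are never replayed into this repo.
        if record.get("personal_repo_id") != personal_repo_id:
            continue
        selected.append(path)
    return selected
```

Files with no recorded owner are skipped rather than attributed to the current session, which errs on the side of not uploading.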


…kills + offline tools)

3 PARALLEL AUDITS COMPLETE:

Launched 3 senior-engineer research agents in parallel to verify v6 architecture against 2026 frontier patterns. All 3 independently converged on the same 6 misalignments + 8 additions.

Audit 1 (Anthropic + NVIDIA 2026 patterns):
- Anthropic Skills blog (2026) explicitly REJECTS per-domain agents
- Anthropic subagents are task-specialized for parallelization, not domain-specialized
- NVIDIA Cosmos Curator/Evaluator are single-purpose tools, not multi-agent fleets
- Anthropic Memory tool = flat file, NOT hierarchy
- No production "sentinel-trip → replan" pattern at Anthropic
- GEPA is offline-only at Decagon; AlphaEvolve closed-source Gemini
- Standing red-team agents in production = Microsoft/Straiker/LangWatch

Audit 2 (2026 multi-agent orchestration convergence):
- LangGraph won production tier (Uber/JPMC/BlackRock/Cisco)
- AutoGen → maintenance mode April 2026; Magentic-One → MAF
- Hierarchical orchestrator-worker is THE convergent pattern
- Multi-agent debate REFUTED (arxiv:2508.17536)
- Spawn depth=1 is convergent default (OpenAI Codex hardcodes)
- Hybrid memory (4-scope) is convergent (Mem0/Atlan/supermemory)
- Specialty agents OK if distinct tool surfaces; ANTI-PATTERN if sequential pipeline

Audit 3 (production agent eval + governance + safety):
- Inspect AI is de facto frontier eval substrate
- Industry uses 3-tier (not 5-tier) eval ladder
- Berkeley audit: 8/8 benchmarks reward-hackable to 73-100%
- EU AI Act Aug 2 2026 deadline IN FORCE (trilogue collapsed Apr 28)
- MCP authorization 86% enterprise adoption
- Hash-chained Ed25519 audit logging is now production minimum
- Gaia2 finds judge-hacking as distinct failure mode

V7 SYNTHESIS - 6 FIXES + 8 ADDITIONS:

Fixes (audit-driven):
1. Per-domain 6 specialty agents → 4 specialty workers (distinct tool surfaces) + 1 PrincipalAgent supervisor + CodeWork Skill
2. GepaOptimizer standing agent → offline batch tool (Decagon pattern)
3. Sentinels novel "tripwire-replan" → Anthropic PostToolUse hooks contract (Claude Agent SDK pattern)
4. 3-tier memory hierarchy → 4-scope hybrid (Mem0/Letta substrate)
5. CapabilityProbe co-resident → CI/CD eval lane via Inspect AI snapshots (METR pattern)
6. "Earned-trust capability expansion" oversold → standard RFC 8693 delegation (table stakes per MCP spec, drop escalation framing)

Additions (frontier-required):
1. LangGraph durable substrate (Uber/JPMC production winner)
2. Magentic-One Task Ledger + Progress Ledger pattern (2-iteration stall detection - Microsoft Agent Framework first-class)
3. 5th sentinel: JudgeHackingCheck (Gaia2 finding - verifier-pleasing artifacts without solving task)
4. Cross-family MultiJudge (1 non-Anthropic to break correlation)
5. CodeWork as Skill, not separate agent (Anthropic Skills pattern)
6. RFC 8707 + RFC 8693 day-one (MCP 2026-03-15 mandate; 86% adoption)
7. Reward-hack rate as Pareto axis in S6 cross-agent eval
8. CUDA/cuDNN/driver versions in reproducibility envelope

V7 FINAL FLEET:

Production agents (5):
1. PrincipalAgent (P3) - LangGraph supervisor + Magentic-One ledgers
2. DataAgent (P4a) - distinct cosmos-curate/NeMo Curator surface
3. EvalAgent (P5) - distinct Inspect AI/MultiJudge surface
4. TrainOrchestrator (P5) - distinct NeMo-RL/SkyPilot surface
5. OptimizeAgent (P6) - distinct profiler/kernel/sandbox surface

Skills (loaded by PrincipalAgent):
- CodeWork Skill (P7) - commodity tools in E2B sandbox

Offline governance tools (NOT standing agents):
- GepaOptimizer (P8) - monthly cron offline batch
- CapabilityProbe (P7) - CI/CD eval lane via Inspect AI
- CrossAgentEvaluator (P10) - quarterly Pareto generator

Infrastructure (~16 components):
- Identity (P0 + RFC 8693), 5-type sentinels, OTel + 4-scope memory, Inspect AI + cross-family MultiJudge, LangGraph + Magentic-One substrate, ComputeBackend + sandbox, reproducibility envelope, nat deployment

Substrate (LEVERAGED inside LangGraph worker nodes):
- ml-intern primitives (agent_loop, 16 tools, sandbox, MCP, cost estimation, doom-loop)

DOCS UPDATED:
PLAN_V2.md
- Header v6 → v7 (frontier-aligned production agentic system)
- §0.5 row 14 NEW (v7 frontier-audit pivot rationale, cited)
- §0.6 reframed (vs assembled OSS - 5 agents + Skills + offline)
- §0.65 reframed (5 production agents + Skills + offline tools)
- §1 phase table - ~21w with v7 phases (LangGraph + Magentic-One)
- §3.1 sentinel taxonomy 4 → 5 types (added JudgeHackingCheck)
- §3.2 PrincipalAgent architecture (LangGraph supervisor + Magentic-One ledger pattern + frontier substrate choices documented)
CLAUDE.md identity sentence - v7 framing
docs/01_north_star.md - full rewrite to v7 (frontier-aligned final)
docs/02_current_phase.md - v7 schedule note (P1+ phases reframed)

WHY V7 IS FINAL:

3 independent senior-engineer audits converged on same fixes + additions. No single auditor would catch all of these. Triangulation across (Anthropic+NVIDIA) + (multi-agent convergence) + (eval+governance) gives high-confidence frontier alignment.

Process needs to converge. Future audit findings document as v1.1+ work, not v8 - committing to v7 now and shipping.

ALL SHIPPED CODE PRESERVED:
- P0 identity primitives ✅
- P0.5 D1 library restructure ✅
- P0.5 D2 ml_intern adapter ✅
- P0.5 D3 nat wrapper ✅
- P0.5 D4 adapter contract + dual-adapter test matrix ✅
- 42 cosmos-lab tests passing ✅
- All 4 verifiers green ✅
- Upstream baseline preserved ✅
- Zero-diff invariant holds ✅

Verifier: ./bin/verify.sh p0_5_d4 → 14/14 pass (still green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
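The "2-iteration stall detection" named in the additions list can be read as a counter over consecutive iterations without progress. The sketch below is a minimal illustration of that reading only. It is not Magentic-One's or Microsoft Agent Framework's implementation, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProgressLedger:
    """Tracks per-iteration progress and flags a stall after N flat iterations."""

    stall_limit: int = 2  # assumed reading of "2-iteration stall detection"
    stalled_iterations: int = 0
    history: List[bool] = field(default_factory=list)

    def record(self, made_progress: bool) -> bool:
        """Record one iteration; return True when a replan should be triggered."""
        self.history.append(made_progress)
        if made_progress:
            self.stalled_iterations = 0
        else:
            self.stalled_iterations += 1
        return self.stalled_iterations >= self.stall_limit


ledger = ProgressLedger()
signals = [ledger.record(p) for p in (True, False, False, True, False)]
```

Resetting the counter on any progress means only consecutive flat iterations count toward a stall, so one slow step between productive ones does not force a replan.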

…and-out #3)

8-tier frontier audit + JD literal-text audit converged on the SAME gap: context engineering / context compression. Both stand-out JD bullet #3 ("Context compression / agent memory") and 8-tier audit Tier 3 (harness & context engineering) flagged this as v7's weakest area (~50% coverage).

This commit closes the gap by adding explicit context engineering discipline to P3 PrincipalAgent foundation. NOT a v8 thesis change - v7 architecture stays. Just adds 4 primitives + 1 bonus to P3, with 4 new numerical commitments (E15-E18).

WHAT V7-STRONGER ADDS:

§3.2.8 NEW - Context engineering discipline (~110 lines added)

Primitive 1: Cache-aware prompt structure
- Stable prefix → tool defs → conversation layout
- Stable region NEVER changes during task (system prompt + capability manifest + sentinel rules)
- Tool defs change ONLY on RFC 8693 capability expansion event
- Conversation is the only churn region
- Rationale: every byte of churn voids prefix cache + 10× cost (Anthropic memory system + Claude API prefix caching)

Primitive 2: Compaction at 75% context utilization
- Trigger: when context window hits 75% of model limit
- Action: pause loop → Anthropic memory tool API summarization → resume
- Rationale: Anthropic context-editing pattern (auto-clears stale tool results when context fills); Claude Code adopts this

Primitive 3: Just-in-time retrieval via recall_relevant(goal) tool
- Don't pre-load episodic memory at session start
- Agent calls explicit tool when needed; 4-scope filtered query
- Rationale: pre-loading wastes context on irrelevant past tasks; just-in-time keeps stable prefix small

Primitive 4: cosmos-progress.md structured state file (per Anthropic Claude Code claude-progress.txt pattern)
- PrincipalAgent writes after every milestone completion
- Append-only event log: DONE / IN_PROGRESS / NEXT / SURPRISES sections
- Bridges multi-day work across compute interruptions
- New session begins by reading progress file (initializer pattern)

Primitive 5: Behavior-vs-model-capability separation test
- Quarterly: snapshot agents, re-run against current + previous models
- Detect harness assumptions that went stale (Sonnet 4.5 context anxiety patches were dead weight in Opus 4.5 - Anthropic example)
- Flag dead-weight resets/patches for removal

NEW NUMERICAL COMMITMENTS (E15-E18):
- E15: Prefix cache hit rate ≥ 80% on stable prefix region
- E16: Compaction trigger fires at 75% ± 5% (no missed in 100 runs)
- E17: cosmos-progress.md cross-session recovery 100%
- E18: ≥1 stale assumption identified per quarterly retest

CODE STRUCTURE UPDATE:

cosmos_lab/principal/context_eng/ NEW subpackage:
- prompt_layout.py - cache-aware structure enforcement
- jit_retrieval.py - recall_relevant(goal) tool
- progress_state.py - cosmos-progress.md writer/reader
- stale_check.py - quarterly behavior-vs-capability retest

cosmos_lab/principal/memory/compaction.py NEW module

DOCS UPDATED:
PLAN_V2.md
- §1 phase table P3 row: 2w → 2.5w; explicit context engineering scope
- §1 Total: 21w → 21.5w
- §3.2.7 file tree: added context_eng/ subpackage + memory/compaction.py
- §3.2.8 NEW (110 lines): full context engineering discipline spec with 5 primitives + 4 new commitments E15-E18
CLAUDE.md identity sentence - added context engineering discipline mention
docs/01_north_star.md Layer 4 - added context engineering bullet

GAPS CLOSED:
✅ JD stand-out #3 (Context compression / agent memory): was PARTIAL (memory only) → now FULL (memory + 4-primitive context engineering discipline + behavior-vs-capability check)
✅ 8-tier audit Tier 3 (Harness & context engineering): was ~50% → now ~85% (cache-aware prompts + compaction + JIT retrieval + structured state + staleness check all explicit)

WHAT V7-STRONGER PRESERVES (no thesis change):
- All shipped code (P0, P0.5 D1/D2/D3/D4)
- 5 production agents (PrincipalAgent + 4 specialty workers)
- 1+ Skills (CodeWork)
- 3 offline governance tools (GepaOptimizer, CapabilityProbe, CrossAgentEvaluator)
- All 9 invariants
- All 38 numerical targets (24 §0.7 + 10 E1-E10 + 4 E15-E18)
- Library architecture, 2-layer deployment, nat wrapper
- Verifier discipline (D4: 14/14 still passes)

VERIFIER STATE:
D1: 14/14 ✅
D2: 11/11 ✅
D3: 10/10 ✅
D4: 14/14 ✅
Upstream baseline: 237 / 3 known-broken
Plan size: 1416 → 1533 lines (+117 for context engineering spec)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
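Primitives 1 and 2 above reduce to two small checks: assemble the prompt in a fixed order so only the conversation region churns, and trigger compaction once utilization reaches 75% of the model limit. The sketch below shows both under assumptions. The function names are hypothetical and are not taken from prompt_layout.py or compaction.py, and token counts are supplied by the caller rather than computed.

```python
import hashlib
from typing import Sequence

COMPACTION_THRESHOLD = 0.75  # from the plan: compact at 75% context utilization


def should_compact(used_tokens: int, context_limit: int,
                   threshold: float = COMPACTION_THRESHOLD) -> bool:
    """True once the context window has reached the compaction threshold."""
    if context_limit <= 0:
        raise ValueError("context_limit must be positive")
    return used_tokens >= threshold * context_limit


def assemble_prompt(stable_prefix: str, tool_defs: Sequence[str],
                    conversation: Sequence[str]) -> str:
    """Lay out the prompt as stable prefix, then tool defs, then conversation."""
    return "\n\n".join(
        [stable_prefix, "\n".join(tool_defs), "\n".join(conversation)]
    )


def cacheable_fingerprint(stable_prefix: str, tool_defs: Sequence[str]) -> str:
    """Hash of the regions that should not change while a task is running."""
    digest = hashlib.sha256()
    digest.update(stable_prefix.encode("utf-8"))
    for tool_def in tool_defs:
        digest.update(b"\x00")  # separator so adjacent defs cannot blur together
        digest.update(tool_def.encode("utf-8"))
    return digest.hexdigest()


prefix = "system prompt + capability manifest + sentinel rules"
tools = ["tool: recall_relevant(goal)"]
before = cacheable_fingerprint(prefix, tools)
prompt = assemble_prompt(prefix, tools, ["user: profile the model", "agent: ok"])
after = cacheable_fingerprint(prefix, tools)  # conversation growth leaves it unchanged
```

Comparing the fingerprint before and after each turn is one cheap way to notice accidental churn in the stable region. How that maps to an actual prefix-cache hit rate depends on the serving API and is not modeled here.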

jagwar and others added 23 commits April 29, 2026 14:31

Add DeepSeek V4 Pro model option (huggingface#189)

1b922dd

* Add DeepSeek V4 Pro model option
  Co-authored-by: Codex <codex@openai.com>
* Remove DeepSeek feature tests
  Co-authored-by: Codex <codex@openai.com>
---------
Co-authored-by: Codex <codex@openai.com>

Steer agent to HF kernels instead of pip install flash-attn (huggingf…

7599843

…ace#204)
* chore: update the agent system prompt
* chore: update the tool documentation

Add session YOLO auto-approval budget (huggingface#201)

77324b8

* Add session YOLO auto-approval budget
  Co-authored-by: Codex <codex@openai.com>
* Address YOLO approval review feedback
  Co-authored-by: Codex <codex@openai.com>
---------
Co-authored-by: Codex <codex@openai.com>

Document HF Space deploy flow (huggingface#207)

7b561e3

Co-authored-by: OpenAI Codex <codex@openai.com>

chatgpt-codex-connector Bot reviewed May 3, 2026


Dang Huy Hoang and others added 2 commits May 3, 2026 19:58

andreidhoang wants to merge 25 commits into main from p0_5_library_restructure

4 participants