P0 + P0.5 foundation: identity, library, ml_intern + nat adapters, contract + plan v6#1
P0 + P0.5 foundation: identity, library, ml_intern + nat adapters, contract + plan v6#1andreidhoang wants to merge 25 commits into
Conversation
…ttribution (huggingface#179) * feat(telemetry): track 5 untracked Bedrock call sites for full cost attribution Cost Explorer ($78,738 over 6 days) vs the session dataset's total_cost_usd (~$354/day attributed) showed the dataset captures only ~33% of real Bedrock spend. Root cause: out of 9 acompletion() call sites, only 2 (in agent_loop.py) emit the llm_call event that total_cost_usd sums. This wires telemetry into the 5 Bedrock-billing call sites that were flying blind, with a `kind` tag on each call so analytics can split spend by category: - research_tool.py × 3 → kind="research" (sub-agent loop) - context_manager.py → kind="compaction" (history summary) - effort_probe.py → kind="effort_probe" (cascade walk) Plus a fourth tag for the session-restore summary path (session_manager.py → kind="restore"). Plumbing changes: - telemetry.record_llm_call now accepts kind="..." (default "main" preserves existing behavior). - summarize_messages() and ContextManager.compact() take optional session=None so the caller can opt into telemetry. - probe_effort() takes optional session=None for the same reason. - Both probe_effort callers (agent_loop._heal_effort_error and model_switcher) now pass session. Skipped: - routes/agent.py /title — uses HF Router (Cerebras), not Bedrock - routes/agent.py /health/llm — no session context (manual diagnostic endpoint, ~$0.02/call, not billable to a user) After deploy, expect dataset total_cost_usd to converge with Cost Explorer to within 5-10%. The kind breakdown will quantify each category, validating the cost-plan estimates in ml_intern_bedrock_cost_plan.md. * fix(telemetry): address PR bot feedback (2 P1 + 1 P2) 1. P1 — Wrap each research_tool record_llm_call in its own try/except. record_llm_call's inner send_event is wrapped, but extract_usage (telemetry.py:101) is not — an unexpected usage shape from LiteLLM could propagate. At all 3 research sites the surrounding except-block would convert that into "Research summary call failed", masking a valid LLM response. Match the effort_probe pattern: dedicated try/except logging at DEBUG. 2. P1 — Hoist `import time` from inside summarize_messages() to module level in manager.py. stdlib, always available, matches the rest of the module. 3. P2 — Update telemetry.py docstring kind list. Drop title_gen and model_probe (skipped per PR description), add restore (emitted from session_manager.py). Note the intentional skips at the bottom.
huggingface#183) * Add agent dev server notes * Make frontend model configurable * Support env-selected frontend models * Use Claude-specific model env var * Add GPT-5.5 to web model picker * Gate GPT-5.5 as a premium model * Avoid duplicate session model fetch * Remove legacy Claude quota aliases * Document GitHub CLI PR body workflow * Gate only deployed paid model IDs * Nits
* Make sandbox Spaces private Co-authored-by: Codex <codex@openai.com> * Remove legacy sandbox auth fallback Co-authored-by: Codex <codex@openai.com> * Address sandbox privacy review comments Co-authored-by: Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>
* Add DeepSeek V4 Pro model option Co-authored-by: Codex <codex@openai.com> * Remove DeepSeek feature tests Co-authored-by: Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>
* feat: add share_traces toggle and per-user trace repo template * feat: support Claude Code JSONL format and per-target auth * feat: dual-upload sessions to private user trace dataset * chore: retry personal trace uploads on booting * feat: add /share-traces command to flip dataset visibility * docs: document HF trace auto-share and /share-traces * Use HF token owner for local dev auth Co-authored-by: Codex <codex@openai.com> * Rename personal session trace dataset Co-authored-by: Codex <codex@openai.com> * Add session dataset card metadata Co-authored-by: Codex <codex@openai.com> * Fix session trace upload review issues Co-authored-by: OpenAI Codex <codex@openai.com> * Preserve secret scrubbing before trace uploads Co-authored-by: OpenAI Codex <codex@openai.com> * Link ML Intern demo in dataset card Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> Co-authored-by: Codex <codex@openai.com>
* Use HF username for personal trace uploads Co-authored-by: OpenAI Codex <codex@openai.com> * Remove redundant HF token branch Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: OpenAI Codex <codex@openai.com>
…ace#204) * chore: update the agent system prompt * chore: update the tool documentation
* Add session YOLO auto-approval budget Co-authored-by: Codex <codex@openai.com> * Address YOLO approval review feedback Co-authored-by: Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>
* Auto-start CPU sandboxes for sessions Co-authored-by: Codex <codex@openai.com> * Retry sandbox runtime visibility checks Co-authored-by: Codex <codex@openai.com> * Stabilize auto CPU sandbox creation Co-authored-by: OpenAI Codex <codex@openai.com> * Address sandbox PR review comments Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: Codex <codex@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
* Fallback to free model for gated defaults Co-authored-by: OpenAI Codex <codex@openai.com> * Cover explicit gated model access for HF users Co-authored-by: OpenAI Codex <codex@openai.com> * Seed model picker from created session Co-authored-by: OpenAI Codex <codex@openai.com> --------- Co-authored-by: OpenAI Codex <codex@openai.com>
Ships the cosmos-lab Phase 0 deliverables — extends upstream Config with OptimizationConfig, plus an unsigned in-process AgentIdentity, append-only JSONL AuditLog, and a CapabilityScopedRouter that filters tool specs and audits before/after/exception around every tool call. Composition over inheritance: CapabilityScopedRouter wraps any duck-typed router, so unit tests use a FakeRouter and skip the heavy ToolRouter init (which requires HF auth + sandbox bootstrapping). AuthZ + audit only — no AuthN. Identity is unsigned; signing tokens, OAuth 2.1 + RFC 8707/8693 sub-agent scope-down, and hash-chained signed audit log all land in P4b. Documented as such in module docstrings. 16 tests in tests/optimization/test_identity_scoping.py cover config inheritance, round-trip, identity scoping/parent chain, audit JSONL round-trip + parent dirs + parseable lines, router denial/allow paths, exception path, spec filtering, root sees-all, and canonical args_hash. Acceptance: pytest tests/optimization/ exits 0; pytest tests/unit/ remains 237 pass / 3 known-broken (no new regressions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establishes cosmos_lab/ as the v1 importable surface for the cosmos-lab library. Code physically lives at agent/optimization/* per the zero-diff fork strategy; cosmos_lab/* re-exports so library consumers can write `from cosmos_lab.identity import AgentIdentity` without depending on upstream ml-intern import paths. Why library form (PLAN_V2.md §0.4): cosmos-lab's value (sentinels, identity, GEPA governance, quality budget) operates on framework-agnostic interfaces. Owning an agent loop is anti-pattern in 2026. Library form plugs into nvidia-nat (primary harness, P0.5 D3), ml-intern (compat adapter, P0.5 D2), Claude SDK (v1.1). Adds three placeholder optional-dependencies extras (nat, ml_intern, claude_sdk) — all empty for now; nvidia-nat>=1.6 dep lands in D3 with the nat adapter. Backward compat preserved: `from agent.optimization.identity import ...` still works. All 16 Phase 0 tests pass against new path AND old path. Verifier: ./bin/verify.sh p0_5_d1 → 14/14 pass (structure, dual-path imports, packaging, tests, zero-diff invariant). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a minimal, token-efficient context harness so Claude Code (and other
agentic coding assistants) can pick up cosmos-lab work cold without
loading 7K+ lines of stale planning context every turn.
Design (per agentic_build_workflow.md DEFINE→PROBE→BUILD→REVIEW→SHIP→LEARN):
CLAUDE.md (94L, ~970 tokens, ALWAYS LOADED)
= invariants + owned paths + 6 anti-patterns + pointer index only
docs/00_workflow.md → ../agentic_build_workflow.md (workflow methodology)
docs/01_north_star.md (vision in 1 screen, on-demand)
docs/02_current_phase.md (LIVE — what we're building today, rotates)
docs/03_pointers.md (phase → PLAN_V2 anchor map)
docs/04_jd.md (NVIDIA Cosmos JD reference)
bin/verify.sh (router: ./bin/verify.sh <phase>)
bin/verify_p0_5_d1.sh (14 concrete checks for current phase)
Why: previous CLAUDE.md (135L) had stale "Phase 0 pending" status, JD
paste burning ~1.7K tokens every turn, no pointer index forcing agents
to read the whole 837-line PLAN_V2.md to answer "what should I do." New
harness reduces always-loaded context ~43% and enables on-demand deep
reads via the pointer index.
Verifier scripts return concrete pass/fail (per workflow rule "verifier
is a script, not a description"). Anti-pattern huggingface#3 ("trusting 'I have
verified this' from an agent") is enforced by re-running ./bin/verify.sh
yourself.
AGENTS.local.md is a symlink to CLAUDE.md so multiple agentic CLIs that
look for AGENTS.md or CLAUDE.md both find the same source of truth.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PLAN_V2.md (v4) — 24-week production-grade roadmap for cosmos-lab. Six
reference agents (Data, Eval, Train, Optimize, Video, Code) on a shared
governance library (sentinels, MCP-OAuth identity with sub-agent scope-
down via RFC 8693, GEPA promotion contracts, quality-budget invariants).
Library architecture from P0.5 onward; nvidia-nat as primary harness.
Five production gates (PLAN_V2 §0.8):
G1 real GPU runs (Invariant 9, ~$200-400 budget across P5/P5.5/P6/P9a)
G2 PyTorch depth artifact (P5.5, ≥10% wall-clock improvement)
G3 production deployment (P10, ≥100 real user sessions)
G4 real multimodal data (P3, 10-100 hours real video)
G5 upstream OSS PR (P10, nvidia-nat or Inspect AI)
24 numerical targets (PLAN_V2 §0.7) — phase exit conditions, not aspirations.
Companion docs preserved as deep references (read on demand only):
PLAN.md — original 16-week plan (superseded by v4)
SYSTEM.md — full architecture deep-dive (Vietnamese, 1167L)
EVAL_SPEC.md — measured-peak vs vendor-peak eval methodology
WORKFLOW.md — git/PR workflow conventions
RESEARCH_AHE_ANALYSIS.md — AHE (Agentic Harness Engineering) research
Plan went through 4 revisions before commit:
v3 — 2026-frontier verification pass (8 cited deltas)
v3.1 — Jensen-grade polish (§0.6 unique value, §0.7 numerical targets,
§3.1 sentinel taxonomy, noise cuts)
v3.2 — library architecture pivot (P0.5 NEW, ~20 weeks)
v4 — production-grade pivot (P5.5 NEW, P10 expanded, Invariant 9,
§0.65 Six Reference Agents, §0.8 Production Commitments,
~22.5 weeks within original 24-week budget)
Citations independently verified via WebFetch (METR reward-hacking number
corrected, NeMo Agent Toolkit package name verified as nvidia-nat v1.6.0
released 2026-04-10, NemoClaw alpha-stage 2026-03-16 disclosed, EU AI Act
Art. 12 enforcement date 2026-08-02 with Digital Omnibus uncertainty
documented).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifier flagged PLAN.md, SYSTEM.md, EVAL_SPEC.md, WORKFLOW.md, RESEARCH_AHE_ANALYSIS.md as unexpected upstream diffs after they were committed (previously untracked, didn't show in diff). Updated bin/verify_p0_5_d1.sh exclusion list to match actual owned planning docs. Also tightened existing patterns from "^FOO.md" to "^FOO.md\$" so a future "FOO.md.bak" doesn't accidentally pass. Pattern for future verifiers: when a new owned file is introduced, audit the verifier exclusion list at the same time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v4 framing — "governance library wrapping 6 thin orchestrator agents" —
under-delivered on JD's literal asks: "strong agency in LLM-based systems,"
"code agents doing real work," "AI helps build them." A NVIDIA Cosmos
reviewer comparing cosmos-lab against 2026 production autonomous agents
(Devin / Operator / Cursor Composer / Claude Code) would see v4 as
conservative governance theater — clever judgment, weak capability.
v5 inverts the hierarchy. The product is now ONE autonomous PrincipalAgent
doing real long-horizon ML lifecycle work, with governance reframed as
*enabler of autonomy* (not constraint). Sentinels become tripwires for
replanning. Identity capabilities expand with earned track record (RFC
8693 token exchange after K sentinel-clean runs). GEPA becomes agent
self-improvement with retroactive human review.
The 6 v4 "agents" collapse into 6 capability domains the same agent
demonstrates over P3-P9 — like one principal engineer who does data work
Monday, training Tuesday, optimization Wednesday. Not six different
people. PrincipalAgent runs INSIDE ml-intern's agent_loop.py (1626 lines
of debugged production code) — substrate stays zero-diff, capabilities
ride on top.
Substantive changes:
PLAN_V2.md
+ §0.9 Autonomous Principal Agent thesis (NEW, ~80 lines) — the
load-bearing reframe; product = PrincipalAgent + harness + governance
+ §3.2 PrincipalAgent architecture (NEW, ~140 lines) — substrate
choice (ml-intern agent_loop), autonomous loop diagram (PLAN→
EXECUTE→VERIFY→REPLAN), 3-tier memory (working/episodic/semantic),
replanning logic (sentinel = information not failure), capability
expansion mechanism (earned trust), concrete demo, owned path
cosmos_lab/principal/ tree
~ §0.5 row 11 NEW — v5 thesis pivot rationale, cited
~ §0.6 reframed — "what only cosmos-lab does" now compared vs Devin/
Operator/Cursor Composer (2026 autonomous agents), not vs assembled
OSS. 5 differentiators all autonomy-focused.
~ §0.65 reframed — "six reference agents" → "six capability domains
of the PrincipalAgent" (one agent, six skills)
~ §1 phase table reframed — phases are now "PrincipalAgent capability
progression milestones" not "ship N agents"
~ Header v4 → v5 with thesis pivot explanation
docs/01_north_star.md — full rewrite to PrincipalAgent framing
CLAUDE.md — identity sentence updated to v5 framing (one PrincipalAgent
with 6 capability domains, ml-intern agent_loop substrate, governance
as enabler)
What v5 KEEPS from v4:
- Library architecture (P0.5) — but library = cosmos_lab.principal
- Identity v2 (P4b) — but for capability EXPANSION (earned trust)
- Sentinels (P1, §3.1) — but as tripwires for replanning
- GEPA (P8) — but for agent self-improvement
- Production deployment (P10) — agent serves real users
- All 9 invariants
- 24 numerical targets
- 22.5 weeks (depth shifts breadth→capability, not added time)
What v5 honestly does NOT pretend to do (anti-hype):
- Not invent novel architectures (composes known patterns intelligently)
- Not zero human oversight (sentinel trips visible; weekly review)
- Not arbitrary research questions (scoped to ML lifecycle, Cosmos first)
- Not online self-improvement (offline GEPA between sessions)
- Capability expansion is policy-bounded, not unbounded
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the eval gap in v5 plan. Without rigorous agent-system eval,
every "exceptional autonomous PrincipalAgent" claim in PLAN_V2 is
unverifiable, GEPA self-improvement (P8) has no rigorous gate to ratchet
against, and reward-hacking detection (per UC Berkeley + METR 2026
findings) has no operational discipline.
NEW FILE — AGENTIC_EVAL_SPEC.md (528 lines):
Companion to EVAL_SPEC.md. Where EVAL_SPEC covers ML-output eval
(perplexity, KL divergence, latency p99 — model under test), this doc
covers agent-system eval per axiom A8 ("the agent is itself an
artifact-under-eval").
Structure:
§0 scope + reading order
§1 why agentic eval differs (trajectory ≠ output; A8; long-horizon)
§2 axioms — A1-A10 transfer from EVAL_SPEC; A11-A13 are agentic-specific
A11 trajectory carries information beyond outcome
A12 capability expansion requires adversarial testing
A13 long-horizon eval is non-fungible with short-horizon eval
§3 5-tier architecture (T0-T4 specialized for agentic)
§4 6 agentic-specific surfaces (S1-S6) — NEW vs EVAL_SPEC:
S1 trajectory eval (tool-call efficiency, replan ratio, doom-loop)
S2 plan-quality eval (LLM-judge on PLAN-phase output)
S3 replan-quality eval (sentinel trips → response quality)
S4 capability boundary eval (50-task adversarial probe suite)
S5 reward-hacking adversarial eval (monthly red-team sprint)
S6 cross-agent comparison eval (vs Devin, Claude Code, human)
§5 cross-cutting meta layers (M1-M3 transfer from EVAL_SPEC)
§6 statistical framework (transfers + paired tests for agent compare)
§7 three input types (I1-I3 per JD bullet 5)
§8 operational cadence + gates + verifier scripts
§9 numerical targets — 10 eval-system commitments (E1-E10)
§10 implementation map to v5 phases (no new phase needed, ~3-4 days
spread across P1, P4a, P4b, P9b, P10)
§11 references (METR, UC Berkeley, τ-bench, BFCL, GAIA-2, etc.)
PLAN_V2.md additions:
+ §3.3 Agentic eval architecture (in-plan summary + pointer to spec)
+ §0.7 v5 eval-system additions subtable (10 numerical targets E1-E10)
E1 sentinel suite FPR ≤ 5%
E2 sentinel suite FNR ≤ 1%
E3 T1 test-retest r ≥ 0.95
E4 plan-quality LLM-judge ↔ human ≥ 80%
E5 replan success rate ≥ 70%
E6 capability boundary 100% (0 unauthorized in 100 runs)
E7 reward-hack discovery rate trending downward
E8 PrincipalAgent on Pareto frontier vs comparison agents
E9 eval cost ≤ 15% of total project spend
E10 reproducibility envelope coverage 100%
Total commitments: 24 (original §0.7) + 10 (E1-E10) = 34
+ §1.5 reuse map — companion specification documents subsection
explaining EVAL_SPEC vs AGENTIC_EVAL_SPEC scope
CLAUDE.md pointer index updated — distinguishes ML-output eval doc
from agent-system eval doc.
docs/03_pointers.md updated — adds §3.2 + §3.3 anchors and a new
"Companion specification docs" subtable.
bin/verify_p0_5_d1.sh — adds AGENTIC_EVAL_SPEC.md to owned-paths
exclusion list (planning doc, owned).
Verifier: ./bin/verify.sh p0_5_d1 → 14/14 pass (still green).
Plan size: 1173 lines (was 1075 — added ~100 lines for §3.3 + E1-E10).
Why this matters now: per JD bullet 5 ("design and scale evaluation
platforms that combine automated metrics, human feedback, and
agent-driven analysis") and 2026 reward-hacking crisis (METR: o3 hacks
30%+ RE-Bench; UC Berkeley: 8/8 top benchmarks hackable to 73-100%).
A v5 PrincipalAgent without rigorous agentic eval IS the failure mode
those papers describe — confident, capable, unverifiable, reward-
hacking. The eval architecture is the difference between exceptional
agent we can defend with numbers vs autonomous agent that "looks good
in the demo."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the v1 compat adapter that installs cosmos-lab governance into an
existing ml-intern Session. Per PLAN_V2.md §0.4 library architecture and
docs/02_current_phase.md D2 spec.
NEW FILES:
cosmos_lab/harness/__init__.py — public API: install_into_session
cosmos_lab/harness/ml_intern.py (84 L) — the adapter
tests/optimization/harness/__init__.py
tests/optimization/harness/test_ml_intern_adapter.py (175 L) — 6 smoke tests
bin/verify_p0_5_d2.sh — D2 verifier (11 checks)
ADAPTER CONTRACT:
install_into_session(session, identity, audit_log) -> None
Wraps session.tool_router with CapabilityScopedRouter so every tool
invocation through Session is governed by cosmos-lab identity + audit.
Composition only — no upstream files modified (Invariant 1).
Idempotency: re-installing on the same Session raises RuntimeError to
prevent shadowing audit history.
SMOKE TEST DESIGN:
Uses duck-typed MockSession (just .tool_router) instead of constructing
a real ml-intern Session (which requires Config + ContextManager +
event_queue + sandbox). The adapter's contract is "I wrap .tool_router"
— smoke test verifies that one thing.
6 tests: 3 contract (wraps, raises on no-router, refuses re-install) +
3 e2e behavior (denial works via wrapped session, audit recorded,
authorized passes through).
VERIFIER RESULT:
./bin/verify.sh p0_5_d2 → 11/11 pass
./bin/verify.sh p0_5_d1 → 14/14 pass (no regression)
Upstream baseline: 237 pass / 3 known-broken (no regression)
Zero-diff: only owned paths modified
LEARN (3 surprises, captured for future phases):
1. Editable install staleness — adding cosmos_lab/harness/ after the
prior `uv sync` left package metadata stale. Pattern: any new
package directory needs `uv sync` before tests pass.
2. `uv run pytest` is ambiguous — PATH leak resolved to miniconda's
pytest (with stale editable install) instead of venv. Symptom:
`python -c "import cosmos_lab.harness"` succeeded everywhere but
pytest collection failed with ModuleNotFoundError. Fix: use
`uv run python -m pytest` for deterministic venv resolution.
All verifier scripts updated.
3. Smoke test design — testing the adapter contract (one method) does
not require constructing the host platform (full Session). Duck-
typed MockSession is the right scope. Pattern for future adapter
smoke tests.
Branch: p0_5_library_restructure (continues from D1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ARCHITECTURAL DECISION (v5.1, PLAN_V2 §0.4.5):
After auditing agent_loop.py:1771, discovered ml-intern's submission_loop
is queue-based (asyncio.Queue for submissions in + events out), not
function-based. Embedding it as runtime substrate inside another
orchestrator requires a 1-2 week async-bridge engineering effort that
v5 implicitly assumed but never budgeted.
v5.1 commits to TWO-LAYER architecture instead of three:
Layer 1: cosmos-lab CLI (PRIMARY entry point)
- PrincipalAgent + governance + sentinels + memory + sub-agents
- long-horizon orchestration ABOVE the Session level
Layer 2: ml-intern Session (execution SUBSTRATE, per task)
- debugged ReAct loop (1626L) + 16 tools + MCP + sandbox
- wrapped by D2 adapter (CapabilityScopedRouter)
Deployment wrappers (P10): nat workflow YAML, Modal/HF Spaces endpoint
— INVOKE cosmos-lab CLI, don't host it inside
Why we explicitly REJECTED 3-layer runtime:
1. submission_loop queue-based design = real async-bridge work
2. nat-at-runtime solves the wrong problem; nat-at-deployment is enough
for Cosmos pitch (`nat run cosmos-lab.yaml` = "we run in your stack")
3. Complexity budget — 1-2 weeks better spent on sentinels / memory /
capability domains / real GPU runs
What v5.1 PRESERVES:
- PrincipalAgent thesis (§0.9), library architecture (§0.4),
AGENTIC_EVAL_SPEC, sentinel taxonomy (§3.1), 9 invariants,
34 numerical targets, 6 capability domains, 22.5w schedule,
every commit already shipped (P0, P0.5 D1, P0.5 D2, AGENTIC_EVAL_SPEC,
v5 thesis).
What v5.1 DROPS:
- "nat-runnable from P1 onward" (over-promise)
- 3-layer runtime architecture
- ml-intern submission_loop bridge work (~1.5-2w saved → risk buffer)
P0.5 D3 IMPLEMENTATION (matches v5.1 reframed scope):
cosmos_lab/harness/nat.py (132 LOC, ~78 non-comment, ≤200 budget)
- register_as_nat_tool(builder) — lightweight registration shim
- Registers ONE tool `cosmos_lab_principal` in nat Builder
- Tool body: stub returning structured dict (real CLI invocation
lands in P3 PrincipalAgent v0)
- BuilderLike Protocol for duck-typed compatibility
- Idempotency check via _is_already_registered()
- Multi-method fallback (add_function / register_function / etc)
tests/optimization/harness/test_nat_adapter.py (135 LOC)
- 11 smoke tests: 5 contract (registration mechanics) + 6 tool callable
- Uses MockBuilder pattern (mirrors D2 MockSession per LEARN huggingface#3)
- All pass against duck-typed builders
bin/verify_p0_5_d3.sh — 10 checks
Verifier result: 10/10 pass
D1 + D2 still green (no regression): 14/14 + 11/11
Upstream baseline: 237 / 3 known-broken
Total cosmos-lab tests: 22 + 11 = 33
DOCS UPDATED:
PLAN_V2.md §0.4.5 NEW (~120 lines) — two-layer architecture decision
CLAUDE.md dev commands — `uv sync --extra dev` + `uv run python -m pytest`
docs/02_current_phase.md — D3 archived, D4 spec written
LEARN from D3:
1. Always read substrate code before architecting on it (would have
caught queue-based submission_loop earlier; v5 thesis pivot would
have proposed 2-layer from start, not 3-layer)
2. `uv sync` without `--extra dev` removes pytest from venv (caught
during D3 verifier; CLAUDE.md updated)
3. Two layers > three when one is sufficient (workflow anti-pattern huggingface#4
generalized: don't build a layer that doesn't earn its complexity)
Branch: p0_5_library_restructure (continuing from D2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ROOT CAUSE — v5/v5.1 over-correction:
v5 thesis claimed "ONE exceptional autonomous PrincipalAgent" and planned
to build planner + executor + memory + sub-agent spawning under
cosmos_lab/principal/. v5.1 reframed as "2-layer (cosmos-lab CLI on
ml-intern Session)" with same re-implementation work.
User audit caught the contradiction: I claimed leverage but described
re-building. Audit of ml-intern itself confirmed:
agent/prompts/system_prompt_v3.yaml VERBATIM:
"You are ML Intern, an ML engineering assistant... fully autonomous
— research, validate, implement, and deliver results without asking
for unnecessary confirmation"
agent/tools/plan_tool.py — built-in planning (todo list)
agent/tools/research_tool.py — sub-agent spawning ("spawns a
cheap LLM call with focused
research task and returns
summary")
agent/core/agent_loop.submission_loop — autonomous execution loop
agent/core/doom_loop.py — failure-loop detection
agent/core/cost_estimation.py — per-call cost tracking
agent/tools/{jobs,papers,research,sandbox,...} — 20+ ML tools
ml-intern IS already what v5/v5.1 PrincipalAgent claimed to be.
v5/v5.1 was workflow anti-pattern huggingface#4 generalized: "building a 5000-LOC
re-implementation that should have been a governance wrapper."
v5.2 PIVOT — governance layer (back to v4's correct direction with v5
production rigor):
cosmos-lab is the production governance layer that makes ml-intern
(or any autonomous ML agent) safe to deploy at NVIDIA Cosmos scale.
ml-intern provides autonomy; cosmos-lab provides production discipline.
The 10 governance components ml-intern doesn't have, that cosmos-lab adds:
1. Sentinel-gated quality (4 sentinel types paired with judge)
2. Cross-session memory (3-tier hierarchical, persistent)
3. RFC 8693 capability expansion (earned-trust scope growth)
4. Hash-chained signed audit (EU AI Act Art. 12 compliant)
5. OTel-GenAI native observability (gen_ai.* semconv, portable)
6. GEPA self-improvement (offline DSPy 3.x, retroactive review)
7. MultiJudge with bootstrap CIs (no debate dynamics)
8. Inspect AI integration (UK AISI standard adoption)
9. PR-gating + canary deployment (sequential testing)
10. AGENTIC_EVAL_SPEC discipline (T0-T4 + S1-S6 + E1-E10)
WHAT v5.2 PRESERVES:
- All shipped code (P0 identity + P0.5 D1/D2/D3 adapters)
— these ARE the governance foundation
- AGENTIC_EVAL_SPEC.md — eval architecture is THE product spec now
- All 9 invariants
- 34 numerical targets (24 §0.7 + 10 E1-E10)
- Library architecture (cosmos_lab/ pip-installable)
- Two-layer deployment (cosmos-lab CLI + ml-intern; nat as P9 wrapper)
WHAT v5.2 DROPS:
- "ONE PrincipalAgent" framing (ml-intern is the agent)
- Re-implementation of planner / executor / memory tier internals /
sub-agent spawning (~5000 LOC of unnecessary code)
- 6 capability domains framing (replaced with 6 governance enhancements
applied to ml-intern's existing capabilities)
- 22.5w schedule (compressed to ~13w by dropping re-implementation work)
NET EFFECTS:
- Plan compressed 22.5w → ~13w (~9 weeks banked for v1.1 polish or buffer)
- Stronger Cosmos pitch ("we make autonomous agents production-safe" =
2026 frontier gap nobody fills end-to-end)
- Honest about leverage (Cosmos reviewer running git ls-files won't see
PrincipalAgent re-implementation)
- All shipped commits stay valid and become more important
(D2 ml_intern adapter is THE primary product surface)
- AGENTIC_EVAL_SPEC's E1-E10 = literal product spec, not side document
DOCS UPDATED:
PLAN_V2.md
- Header (v5.2 governance layer thesis)
- §0.5 row 12 NEW (v5.2 honest leverage pivot rationale, cited)
- §0.6 reframed (10 governance items vs ml-intern bare)
- §0.65 reframed (6 governance enhancements, not 6 PrincipalAgent
capabilities; with concrete demonstration block)
- §0.9 simplified (ml-intern is the agent; cosmos-lab is governance)
- §1 phase table compressed (~13w, governance enhancements + ml-intern
demonstrations)
CLAUDE.md identity sentence — v5.2 framing
docs/01_north_star.md — full rewrite to v5.2 governance-layer
docs/02_current_phase.md — v5.2 schedule note (P1+ phases)
VERIFIER STATE:
D1: 14/14 ✅ (no regression)
D2: 11/11 ✅ (no regression — D2 adapter is now primary product surface)
D3: 10/10 ✅ (no regression)
Upstream baseline: 237 / 3 known-broken
Total cosmos-lab tests: 33
Branch: p0_5_library_restructure (10th commit).
Plan size: 1294 lines (added ~140 for v5.2 reframe content).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tern primitives USER CAUGHT THE ROOT ISSUE: v5/v5.1 collapsed v4's 6 specialty agents into 1 PrincipalAgent (over-correction #1). v5.2 then removed the agents entirely, claiming "ml-intern is the agent" (over-correction huggingface#2). Both wrong. JD re-read CAREFULLY confirms multiple specialty agents needed: Role mission: "agentic SYSTEMS that reason about, build, evaluate, and improve AI systems themselves" (plural systems) What you'll do bullet 3: "self-improving loops where agents help generate data, surface failures, evaluate outputs" (multiple agents, different jobs) Stand-out bullet 1: "agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration" (6 work types = 6 agent types) ml-intern's tool primitives are HF-generic. Cosmos team needs Cosmos- specialized agents (cosmos-curate orchestration, NeMo-RL training, NIM inference, multimodal physics, real video pipelines). ml-intern's primitives are SUBSTRATE we use, not the agents themselves. v6 SYNTHESIS — best of all prior versions: - v3.x/v4: correct on agent count (6 specialty agents) - v5/v5.1: correct on production rigor (real GPU, sentinels, AGENTIC_EVAL_SPEC) - v5.2: correct on leverage discipline (use ml-intern primitives, don't reimplement) - v6: 6 specialty + 3 governance + ~16 infrastructure + ml-intern primitives THE 9 NEW AGENTS (the product): Layer 1 — 6 Cosmos-specialty (real ML lifecycle work): 1. DataAgent (P3) — cosmos-curate orchestration, real video curation 2. EvalAgent (P4a) — multi-judge + sentinels + physics-consistency 3. TrainOrchestrator (P5) — Centaur HPO + NeMo-RL on real GPU 4. OptimizeAgent (P6) — profiling + ≥1.5× speedup on 4 workloads 5. MultimodalPipelineAgent (P9) — e2e Cosmos workflow on Predict 2.5 6. CodeAgent (P9) — capability-scoped, real OSS bug fixes Layer 2 — 3 governance (meta-layer): 7. GepaOptimizer (P8) — weekly DSPy GEPA prompt revisions 8. CapabilityProbe (P7) — adversarial scope testing 9. CrossAgentEvaluator (P10) — quarterly Pareto vs Devin/Claude Code/human PLUS ~16 governance infrastructure components (sentinels, identity v2, audit log, OTel, memory tiers, Inspect AI bridge, ComputeBackend, etc.) PLUS ml-intern primitives leveraged AS-IS (agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection). SCHEDULE: ~19 weeks (between v5/v5.1's 22.5w and v5.2's 13w). Tighter than v5/v5.1 because we leverage ml-intern primitives. Bigger than v5.2 because we restore the 9 agents JD asks for. Honest middle ground. DOCS UPDATED: PLAN_V2.md - Header v5.2 → v6 (with honest postmortem of v5/v5.1/v5.2 over-corrections) - §0.5 row 13 NEW (v6 restore agents pivot rationale) - §0.6 reframed (9 agents + ~16 infra components vs assembled OSS) - §0.65 reframed (Layer 1 + Layer 2 + Layer 3 + Layer 4 honest count) - §0.9 reframed (cosmos-lab builds agents on ml-intern primitives) - §1 phase table — 19w with 9-agent reality CLAUDE.md identity sentence — v6 framing docs/01_north_star.md — full rewrite to v6 docs/02_current_phase.md — v6 schedule note WHAT v6 PRESERVES: - All shipped code (P0, P0.5 D1/D2/D3, AGENTIC_EVAL_SPEC) — these are the substrate agents will use - All 9 invariants - 34 numerical targets (24 §0.7 + 10 E1-E10) - Library architecture (cosmos_lab/ pip-installable) - Two-layer deployment (cosmos-lab CLI + ml-intern; nat as P10 wrapper) - AGENTIC_EVAL_SPEC.md — eval architecture for the 9 agents WHY THIS IS THE FINAL FRAMING: v6 maps directly to JD's literal text. Each JD bullet has a deliverable: - "Design and implement agentic workflows across ML lifecycle" → 6 specialty agents - "Build AI-native systems where agents interact with code/tools/exp" → CodeAgent + others - "Self-improving loops" → GepaOptimizer - "Eval platforms (auto + human + agent-driven)" → EvalAgent + MultiJudge + Inspect AI - "Multimodal ML pipelines" → MultimodalPipelineAgent + DataAgent - "Engineering excellence" → 9 invariants + AGENTIC_EVAL_SPEC No more over-corrections. v6 is the final framing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes P0.5 (4 days of foundation work shipped same day).
NEW FILES:
cosmos_lab/harness/CONTRACT.md (220 lines)
- Documents 2 adapter families per v5.1/v6 architecture:
Family A — Execution Substrate Adapter (ml_intern, future claude_sdk)
Contract: install(host, identity, audit_log) -> None
Family B — Deployment Surface Adapter (nat, future langgraph)
Contract: register_as_X_tool(builder) -> None
- 5 shared requirements (S1-S5) all adapters must satisfy:
S1 Idempotency, S2 Composition only, S3 Input validation,
S4 Returns None, S5 No partial state on failure
- Per-adapter specifics + future adapter checklist
- Anti-patterns explicitly rejected
tests/optimization/harness/test_adapter_contract.py (160 lines)
- Parametrized contract tests across BOTH shipped adapters
- 9 tests: 3 cross-family shared (S1, S4, S5) + 2 family-specific
+ 1 coverage sanity test
- ADAPTERS registry: single source of truth for parametrization
- When v1.1 ships claude_sdk or langgraph adapter, just add row
to ADAPTERS — automatic contract enforcement
bin/verify_p0_5_d4.sh — 14-check verifier
HONEST DESIGN DECISION (D4 LEARN):
Spec originally said "parametrize all 22 existing tests across both
adapters." Audit revealed: ml_intern (execution substrate) and nat
(deployment surface) have DIFFERENT contracts because they serve
DIFFERENT purposes per v5.1/v6 architecture. Forcing one signature
loses clarity.
D4 ships the honest answer:
- Two adapter families, each with own contract signature
- 5 shared requirements that apply to BOTH families
- Parametrized tests for the 5 shared requirements
- Family-specific tests stay in test_ml_intern_adapter.py / test_nat_adapter.py
This is more honest and more extensible than forcing one contract shape.
VERIFIER RESULTS:
D1: 14/14 ✅ (no regression)
D2: 11/11 ✅ (no regression)
D3: 10/10 ✅ (no regression)
D4: 14/14 ✅ (NEW)
Upstream baseline: 237 / 3 known-broken
Total cosmos-lab tests: 42 (16 P0 + 6 D2 + 11 D3 + 9 D4 contract)
Zero-diff invariant: holds throughout
🎉 P0.5 COMPLETE 🎉
Final P0.5 stats:
- 4 days work shipped on schedule
- 12 commits on branch (P0 + 4 P0.5 days + 5 plan evolutions + 2 fixups)
- ~3500 LOC added (cosmos_lab/ + tests/ + bin/ + docs/ + planning)
- Foundation for 9 specialty + governance agents (P3-P10)
NEXT: P1 — Eval infrastructure (~2 weeks)
- TrajectorySink + OTelGenAIEmitter (Phoenix backend)
- 4 sentinel types per §3.1 taxonomy
- MultiJudge with bootstrap CIs
- Inspect AI bridge + 5 seed Inspect tasks
- evaluate CLI
P1 ships eval infrastructure that becomes:
- The foundation for EvalAgent (P4a — specialty agent huggingface#2)
- Used by all other specialty agents (P3, P5, P6, P9, P10) for
sentinel-gated quality + OTel observability + Inspect AI integration
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a31c3c020c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if personal_repo_id: | ||
| subprocess.Popen( | ||
| [ | ||
| sys.executable, | ||
| str(uploader_script), | ||
| "retry", | ||
| directory, | ||
| personal_repo_id, |
There was a problem hiding this comment.
Scope personal retry uploads to the owning user
retry_failed_uploads_detached() launches a single personal retry process with the current session’s personal_repo_id, but that retry scans every session_*.json file in session_logs. If files from other users still have personal_upload_status pending/failed, they will be re-uploaded into the current user’s dataset, causing cross-user trace leakage/misattribution. Personal retries need per-file ownership/repo scoping (or should be disabled globally) instead of replaying the whole directory against one repo.
Useful? React with 👍 / 👎.
…kills + offline tools)
3 PARALLEL AUDITS COMPLETE:
Launched 3 senior-engineer research agents in parallel to verify v6
architecture against 2026 frontier patterns. All 3 independently
converged on the same 6 misalignments + 8 additions.
Audit 1 (Anthropic + NVIDIA 2026 patterns):
- Anthropic Skills blog (2026) explicitly REJECTS per-domain agents
- Anthropic subagents are task-specialized for parallelization, not
domain-specialized
- NVIDIA Cosmos Curator/Evaluator are single-purpose tools, not multi-
agent fleets
- Anthropic Memory tool = flat file, NOT hierarchy
- No production "sentinel-trip → replan" pattern at Anthropic
- GEPA is offline-only at Decagon; AlphaEvolve closed-source Gemini
- Standing red-team agents in production = Microsoft/Straiker/LangWatch
Audit 2 (2026 multi-agent orchestration convergence):
- LangGraph won production tier (Uber/JPMC/BlackRock/Cisco)
- AutoGen → maintenance mode April 2026; Magentic-One → MAF
- Hierarchical orchestrator-worker is THE convergent pattern
- Multi-agent debate REFUTED (arxiv:2508.17536)
- Spawn depth=1 is convergent default (OpenAI Codex hardcodes)
- Hybrid memory (4-scope) is convergent (Mem0/Atlan/supermemory)
- Specialty agents OK if distinct tool surfaces; ANTI-PATTERN if
sequential pipeline
Audit 3 (production agent eval + governance + safety):
- Inspect AI is de facto frontier eval substrate
- Industry uses 3-tier (not 5-tier) eval ladder
- Berkeley audit: 8/8 benchmarks reward-hackable to 73-100%
- EU AI Act Aug 2 2026 deadline IN FORCE (trilogue collapsed Apr 28)
- MCP authorization 86% enterprise adoption
- Hash-chained Ed25519 audit logging is now production minimum
- Gaia2 finds judge-hacking as distinct failure mode
V7 SYNTHESIS — 6 FIXES + 8 ADDITIONS:
Fixes (audit-driven):
1. Per-domain 6 specialty agents → 4 specialty workers (distinct tool
surfaces) + 1 PrincipalAgent supervisor + CodeWork Skill
2. GepaOptimizer standing agent → offline batch tool (Decagon pattern)
3. Sentinels novel "tripwire-replan" → Anthropic PostToolUse hooks
contract (Claude Agent SDK pattern)
4. 3-tier memory hierarchy → 4-scope hybrid (Mem0/Letta substrate)
5. CapabilityProbe co-resident → CI/CD eval lane via Inspect AI
snapshots (METR pattern)
6. "Earned-trust capability expansion" oversold → standard RFC 8693
delegation (table stakes per MCP spec, drop escalation framing)
Additions (frontier-required):
1. LangGraph durable substrate (Uber/JPMC production winner)
2. Magentic-One Task Ledger + Progress Ledger pattern (2-iteration
stall detection — Microsoft Agent Framework first-class)
3. 5th sentinel: JudgeHackingCheck (Gaia2 finding — verifier-pleasing
artifacts without solving task)
4. Cross-family MultiJudge (1 non-Anthropic to break correlation)
5. CodeWork as Skill, not separate agent (Anthropic Skills pattern)
6. RFC 8707 + RFC 8693 day-one (MCP 2026-03-15 mandate; 86% adoption)
7. Reward-hack rate as Pareto axis in S6 cross-agent eval
8. CUDA/cuDNN/driver versions in reproducibility envelope
V7 FINAL FLEET:
Production agents (5):
1. PrincipalAgent (P3) — LangGraph supervisor + Magentic-One ledgers
2. DataAgent (P4a) — distinct cosmos-curate/NeMo Curator surface
3. EvalAgent (P5) — distinct Inspect AI/MultiJudge surface
4. TrainOrchestrator (P5) — distinct NeMo-RL/SkyPilot surface
5. OptimizeAgent (P6) — distinct profiler/kernel/sandbox surface
Skills (loaded by PrincipalAgent):
- CodeWork Skill (P7) — commodity tools in E2B sandbox
Offline governance tools (NOT standing agents):
- GepaOptimizer (P8) — monthly cron offline batch
- CapabilityProbe (P7) — CI/CD eval lane via Inspect AI
- CrossAgentEvaluator (P10) — quarterly Pareto generator
Infrastructure (~16 components):
- Identity (P0 + RFC 8693), 5-type sentinels, OTel + 4-scope memory,
Inspect AI + cross-family MultiJudge, LangGraph + Magentic-One
substrate, ComputeBackend + sandbox, reproducibility envelope,
nat deployment
Substrate (LEVERAGED inside LangGraph worker nodes):
- ml-intern primitives (agent_loop, 16 tools, sandbox, MCP, cost
estimation, doom-loop)
DOCS UPDATED:
PLAN_V2.md
- Header v6 → v7 (frontier-aligned production agentic system)
- §0.5 row 14 NEW (v7 frontier-audit pivot rationale, cited)
- §0.6 reframed (vs assembled OSS — 5 agents + Skills + offline)
- §0.65 reframed (5 production agents + Skills + offline tools)
- §1 phase table — ~21w with v7 phases (LangGraph + Magentic-One)
- §3.1 sentinel taxonomy 4 → 5 types (added JudgeHackingCheck)
- §3.2 PrincipalAgent architecture (LangGraph supervisor + Magentic-
One ledger pattern + frontier substrate choices documented)
CLAUDE.md identity sentence — v7 framing
docs/01_north_star.md — full rewrite to v7 (frontier-aligned final)
docs/02_current_phase.md — v7 schedule note (P1+ phases reframed)
WHY V7 IS FINAL:
3 independent senior-engineer audits converged on same fixes +
additions. No single auditor would catch all of these. Triangulation
across (Anthropic+NVIDIA) + (multi-agent convergence) + (eval+governance)
gives high-confidence frontier alignment.
Process needs to converge. Future audit findings document as v1.1+
work, not v8 — committing to v7 now and shipping.
ALL SHIPPED CODE PRESERVED:
- P0 identity primitives ✅
- P0.5 D1 library restructure ✅
- P0.5 D2 ml_intern adapter ✅
- P0.5 D3 nat wrapper ✅
- P0.5 D4 adapter contract + dual-adapter test matrix ✅
- 42 cosmos-lab tests passing ✅
- All 4 verifiers green ✅
- Upstream baseline preserved ✅
- Zero-diff invariant holds ✅
Verifier: ./bin/verify.sh p0_5_d4 → 14/14 pass (still green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…and-out huggingface#3) 8-tier frontier audit + JD literal-text audit converged on the SAME gap: context engineering / context compression. Both stand-out JD bullet huggingface#3 ("Context compression / agent memory") and 8-tier audit Tier 3 (harness & context engineering) flagged this as v7's weakest area (~50% coverage). This commit closes the gap by adding explicit context engineering discipline to P3 PrincipalAgent foundation. NOT a v8 thesis change — v7 architecture stays. Just adds 4 primitives + 1 bonus to P3, with 4 new numerical commitments (E15-E18). WHAT V7-STRONGER ADDS: §3.2.8 NEW — Context engineering discipline (~110 lines added) Primitive 1: Cache-aware prompt structure Stable prefix → tool defs → conversation layout Stable region NEVER changes during task (system prompt + capability manifest + sentinel rules) Tool defs change ONLY on RFC 8693 capability expansion event Conversation is the only churn region Rationale: every byte of churn voids prefix cache + 10× cost (Anthropic memory system + Claude API prefix caching) Primitive 2: Compaction at 75% context utilization Trigger: when context window hits 75% of model limit Action: pause loop → Anthropic memory tool API summarization → resume Rationale: Anthropic context-editing pattern (auto-clears stale tool results when context fills); Claude Code adopts this Primitive 3: Just-in-time retrieval via recall_relevant(goal) tool Don't pre-load episodic memory at session start Agent calls explicit tool when needed; 4-scope filtered query Rationale: pre-loading wastes context on irrelevant past tasks; just-in-time keeps stable prefix small Primitive 4: cosmos-progress.md structured state file (per Anthropic Claude Code claude-progress.txt pattern) PrincipalAgent writes after every milestone completion Append-only event log: DONE / IN_PROGRESS / NEXT / SURPRISES sections Bridges multi-day work across compute interruptions New session begins by reading progress file (initializer pattern) Primitive 5: Behavior-vs-model-capability separation test Quarterly: snapshot agents, re-run against current + previous models Detect harness assumptions that went stale (Sonnet 4.5 context anxiety patches were dead weight in Opus 4.5 — Anthropic example) Flag dead-weight resets/patches for removal NEW NUMERICAL COMMITMENTS (E15-E18): E15: Prefix cache hit rate ≥ 80% on stable prefix region E16: Compaction trigger fires at 75% ± 5% (no missed in 100 runs) E17: cosmos-progress.md cross-session recovery 100% E18: ≥1 stale assumption identified per quarterly retest CODE STRUCTURE UPDATE: cosmos_lab/principal/context_eng/ NEW subpackage: prompt_layout.py — cache-aware structure enforcement jit_retrieval.py — recall_relevant(goal) tool progress_state.py — cosmos-progress.md writer/reader stale_check.py — quarterly behavior-vs-capability retest cosmos_lab/principal/memory/compaction.py NEW module DOCS UPDATED: PLAN_V2.md - §1 phase table P3 row: 2w → 2.5w; explicit context engineering scope - §1 Total: 21w → 21.5w - §3.2.7 file tree: added context_eng/ subpackage + memory/compaction.py - §3.2.8 NEW (110 lines): full context engineering discipline spec with 5 primitives + 4 new commitments E15-E18 CLAUDE.md identity sentence — added context engineering discipline mention docs/01_north_star.md Layer 4 — added context engineering bullet GAPS CLOSED: ✅ JD stand-out huggingface#3 (Context compression / agent memory): was PARTIAL (memory only) → now FULL (memory + 4-primitive context engineering discipline + behavior-vs-capability check) ✅ 8-tier audit Tier 3 (Harness & context engineering): was ~50% → now ~85% (cache-aware prompts + compaction + JIT retrieval + structured state + staleness check all explicit) WHAT V7-STRONGER PRESERVES (no thesis change): - All shipped code (P0, P0.5 D1/D2/D3/D4) - 5 production agents (PrincipalAgent + 4 specialty workers) - 1+ Skills (CodeWork) - 3 offline governance tools (GepaOptimizer, CapabilityProbe, CrossAgentEvaluator) - All 9 invariants - All 38 numerical targets (24 §0.7 + 10 E1-E10 + 4 E15-E18) - Library architecture, 2-layer deployment, nat wrapper - Verifier discipline (D4: 14/14 still passes) VERIFIER STATE: D1: 14/14 ✅ D2: 11/11 ✅ D3: 10/10 ✅ D4: 14/14 ✅ Upstream baseline: 237 / 3 known-broken Plan size: 1416 → 1533 lines (+117 for context engineering spec) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Foundation for the cosmos-lab project — 9 NEW agents (6 Cosmos-specialty + 3 governance) on ml-intern primitives per PLAN_V2.md v6. This PR ships P0 (identity foundation) + all 4 days of P0.5 (library + ml_intern adapter + nat wrapper + contract) + the 13-revision plan evolution that arrived at v6 honest framing.
12 commits, 42 cosmos-lab tests passing, 237 upstream baseline preserved, zero-diff invariant holds throughout.
What ships
Code (~700 LOC)
cosmos_lab/— Python library (importable viafrom cosmos_lab import ...)identity/—AgentIdentity,AuditLog,CapabilityScopedRouter,OptimizationConfig(P0)harness/ml_intern.py— Family A execution-substrate adapter,install_into_session()wraps Session.tool_router with governance (P0.5 D2)harness/nat.py— Family B deployment-surface adapter,register_as_nat_tool()registerscosmos_lab_principalas nat workflow tool (P0.5 D3)harness/CONTRACT.md— formal contract documentation: 2 adapter families + 5 shared requirements + per-adapter specifics + future-adapter checklist (P0.5 D4)Tests (42 total)
tests/optimization/test_identity_scoping.py— 16 tests for identity primitives (P0)tests/optimization/harness/test_ml_intern_adapter.py— 6 smoke tests (P0.5 D2)tests/optimization/harness/test_nat_adapter.py— 11 smoke tests (P0.5 D3)tests/optimization/harness/test_adapter_contract.py— 9 parametrized contract tests across both adapters (P0.5 D4)Verifiers (
bin/)verify.sh— phase routerverify_p0_5_d1.sh(14 checks),_d2.sh(11),_d3.sh(10),_d4.sh(14) — per-day verificationPlanning + context harness
PLAN_V2.md(1380 lines) — 13-revision evolution: v3.1 → v3.2 → v4 → v5 → v5.1 → v5.2 → v6 (current). Full revision history in §0.5 deltas table rows 1-13.AGENTIC_EVAL_SPEC.md— companion to EVAL_SPEC.md, agent-system eval architecture (5-tier ladder + 6 agentic surfaces + 13 axioms + 10 numerical commitments E1-E10)CLAUDE.md,docs/(5 files) — context harness for AI agents working on the projectagentic_build_workflow.md— DEFINE → PROBE → BUILD → REVIEW → SHIP → LEARN methodologyv6 thesis (final framing)
cosmos-lab ships 9 NEW agents + ~16 governance infrastructure components, built on ml-intern's tool primitives leveraged AS-IS:
Layer 1 — 6 Cosmos-specialty agents (P3-P9):
Layer 2 — 3 governance agents (P7-P10):
7. GepaOptimizer — weekly DSPy GEPA prompt revisions
8. CapabilityProbe — adversarial scope testing
9. CrossAgentEvaluator — quarterly Pareto vs Devin/Claude Code/human
Layer 3 — ~16 governance infrastructure components (P1-P10):
sentinels, identity v2, audit log, OTel emitter, memory tiers, Inspect AI bridge, ComputeBackend, etc.
Layer 4 — ml-intern primitives leveraged AS-IS:
agent_loop, 16 generic tools, sandbox, MCP, cost estimation, doom-loop detection.
Plan evolution honest postmortem
User correctly caught both over-corrections. v6 is the honest answer that maps directly to JD's literal text ("agents (plural) help generate data, surface failures, evaluate outputs" + stand-out "agent-based systems doing real work — coding, eval, data gen, triage, experimentation, orchestration").
Verifier state (all green)
Schedule
~19 weeks total (between v5/v5.1's 22.5w and v5.2's 13w — honest middle ground).
Test plan
uv sync --extra dev./bin/verify.sh p0_5_d1→ 14/14./bin/verify.sh p0_5_d2→ 11/11./bin/verify.sh p0_5_d3→ 10/10./bin/verify.sh p0_5_d4→ 14/14uv run python -m pytest tests/optimization/ -q→ 42 passeduv run python -m pytest tests/unit/ -q→ 237 passed / 3 known-broken (no new regressions)git diff upstream/main --name-only→ only owned pathsReading order for review
For ~30 min focused review:
PLAN_V2.md §0.6(v6 unique value) +§0.65(9 agents) +§0.9(thesis) — the load-bearing v6 framingcosmos_lab/harness/CONTRACT.md— adapter contract design decisioncosmos_lab/harness/ml_intern.py(84 LOC) +cosmos_lab/harness/nat.py(132 LOC) — the two shipped adapterstests/optimization/harness/test_adapter_contract.py— parametrized contract testsAGENTIC_EVAL_SPEC.md §1(why agentic eval differs) +§4(6 surfaces) — eval architecture (used by all 9 agents)For deeper review (~2 hours): read PLAN_V2.md §0.5 deltas table to follow the v3 → v6 evolution; read all 12 commit messages in order.
🤖 Generated with Claude Code