andreidhoang · andreidhoang · Apr 29, 2026 · Apr 30, 2026 · Apr 30, 2026 · Apr 30, 2026
diff --git a/AGENTIC_EVAL_SPEC.md b/AGENTIC_EVAL_SPEC.md
diff --git a/AGENTS.local.md b/AGENTS.local.md
@@ -0,0 +1 @@
+CLAUDE.md
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,37 @@
+# Agent Notes
+
+## Local Dev Servers
+
+- Frontend: from `frontend/`, run `npm ci` if dependencies are missing, then `npm run dev`.
+- Backend: from `backend/`, run `uv run uvicorn main:app --host ::1 --port 7860`.
+- Frontend URL: http://localhost:5173/
+- Backend health check: `curl -g http://[::1]:7860/api`
+- Frontend proxy health check: `curl http://localhost:5173/api`
+
+Notes:
+
+- Vite proxies `/api` and `/auth` to `http://localhost:7860`.
+- If `127.0.0.1:7860` is already owned by another local process, binding the backend to `::1` lets the Vite proxy resolve `localhost` cleanly.
+- Prefer `npm ci` over `npm install` for setup, since `npm install` may rewrite `frontend/package-lock.json` metadata depending on npm version.
+- Production defaults to the Bedrock Claude model. For local development with a personal Anthropic key, set `ANTHROPIC_API_KEY` and `ML_INTERN_CLAUDE_MODEL_ID=anthropic/claude-opus-4-6` before starting the backend. Other models are selected through the app's model switcher.
+
+## GitHub CLI
+
+- For multiline PR descriptions, prefer `gh pr edit <number> --body-file <file>` over inline `--body` so shell quoting, `$` env-var names, backticks, and newlines are preserved correctly.
+
+## Hugging Face Space Deploys
+
+- The Space remote is `space` and points to `https://huggingface.co/spaces/smolagents/ml-intern`.
+- Deploy GitHub `main` to the Space from the local `space-main` branch by merging `origin/main` into `space-main` with a single merge commit, then pushing `space-main:main` to the `space` remote.
+- Keep the Space-only README frontmatter on `space-main`; `.gitattributes` should contain `README.md merge=ours` and the local repo config should include `merge.ours.driver=true`.
+- Recommended deploy flow:
+
+```bash
+git pull --ff-only origin main
+git switch space-main
+git config merge.ours.driver true
+git merge --no-ff origin/main -m "Deploy $(date +%Y-%m-%d)" \
+  -m "Co-authored-by: OpenAI Codex <codex@openai.com>"
+git push space space-main:main
+git switch main
+```
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,98 @@
+# cosmos-lab — Context Harness
+
+> **What this is**: zero-diff fork of `huggingface/ml-intern` that ships the **cosmos-lab** library — a frontier-aligned production agentic system: **5 production agents** (1 PrincipalAgent supervisor + 4 specialty workers: Data, Eval, Train, Optimize) + **CodeWork Skill** (Anthropic Skills pattern) + **3 offline governance tools** (GepaOptimizer, CapabilityProbe, CrossAgentEvaluator — explicitly NOT standing agents per 2026 frontier convergence) + **~16 governance infrastructure components** (5 sentinel types incl. judge-hacking detector, cross-family MultiJudge, MCP OAuth+RFC 8693, hash-chained signed audit, OTel-GenAI, 4-scope hybrid memory via Mem0/Letta, Inspect AI bridge, **LangGraph durable supervisor + Magentic-One Task/Progress Ledger pattern**, **context engineering discipline** — cache-aware prompt structure + 75% compaction + just-in-time retrieval + cosmos-progress.md state file), built on ml-intern's tool primitives leveraged AS-IS inside LangGraph worker nodes. Deployed via `nat run cosmos-lab.yaml`. See `docs/01_north_star.md` for vision; PLAN_V2.md §0.6 + §0.65 + §0.9 + §3.1 + §3.2 (incl. §3.2.8 context engineering) for v7-stronger thesis.
+
+**Current phase** → `docs/02_current_phase.md` (LIVE — read this first when starting work)
+
+---
+
+## Invariants — never break
+
+1. **ZERO-DIFF**: never edit any file existing in `huggingface/ml-intern`. Use subclass / composition / new path.
+2. Never `git commit` or `git push` without explicit user request.
+3. `uv run pytest tests/unit/` must match baseline (237 pass / 3 upstream-broken — see PLAN_V2 invariant 2).
+4. New code only in owned paths (table below).
+5. **One-optimization-per-experiment** + **measured-peak over vendor-peak** (EVAL_SPEC.md, applies P6+).
+6. **Trajectory-on-by-default**: from P1, no agent run is unobserved.
+7. **OTel-GenAI-on-by-default**: from P1, every span uses `gen_ai.*` semconv.
+8. **No judge-only metric reaches a gate**: every quality gate requires (judge, structural-verifier) pair.
+9. **No GPU phase exits without one measured real run** (P5/P5.5/P6/P9a; ~$200-400 budget).
+
+---
+
+## Owned paths
+
+| Write here | Never touch (upstream) |
+|---|---|
+| `cosmos_lab/` (P0.5+) | `agent/core/` |
+| `agent/optimization/` (Phase 0 home) | `agent/config.py` |
+| `agent/tools/{profiling,training_opt,inference_opt,multimodal_opt,vla_opt}/` | `agent/context_manager/manager.py` |
+| `agent/tools/hardware_specs.py` | `agent/tools/*.py` (existing) |
+| `agent/prompts/system_prompt_optimization_*.yaml` | `backend/`, `frontend/`, `tests/unit/` |
+| `configs/optimization_agent_config.json` | |
+| `tests/optimization/` | |
+| `docs/`, `bin/` (this harness) | |
+
+**Verify ownership**: `git diff upstream/main --name-only` must show only owned paths.
+
+---
+
+## Anti-patterns (catch yourself)
+
+1. Editing the prompt when the bug is in the data.
+2. Editing the data when the bug is in the spec.
+3. Trusting "I have verified this" from an agent — re-run the verifier yourself.
+4. Building a pipeline that should have been one model call.
+5. Adding a fourth concurrent agent. You will regret it.
+6. Saying "the agent decided" — replace with "P(output | context) was high."
+
+---
+
+## Workflow (every task)
+
+`DEFINE → PROBE → BUILD → REVIEW → SHIP → LEARN` — see `docs/00_workflow.md`.
+
+**Hard rule**: if you can't write the verifier, the goal is wrong. Fix the goal, not the agent.
+
+---
+
+## Dev commands
+
+```bash
+uv sync --extra dev                                        # install (--extra dev for pytest)
+uv run python -m pytest tests/unit/ -q                     # upstream baseline (must match)
+uv run python -m pytest tests/optimization/ -q             # cosmos-lab tests (must pass)
+PYTHONPATH=. ruff check agent/ --ignore E501,F401,E402     # lint
+./bin/verify.sh <phase>                                    # phase verifier (e.g. p0_5_d3)
+git fetch upstream && git merge upstream/main              # daily upstream sync
+```
+
+Note: use `uv run python -m pytest` (NOT `uv run pytest`) — bare `uv run pytest`
+can resolve to a system pytest with stale package metadata. Captured in P0.5 D2 LEARN.
+
+---
+
+## Pointer index (load on demand)
+
+| Need | Read |
+|---|---|
+| What we're building this week | `docs/02_current_phase.md` |
+| Vision in 1 screen | `docs/01_north_star.md` |
+| Workflow phases | `docs/00_workflow.md` |
+| Phase → PLAN_V2 anchor map | `docs/03_pointers.md` |
+| Full plan (24 sections, 837L) | `PLAN_V2.md` (read specific section, not whole file) |
+| Architecture deep WHY (Vietnamese, 1167L) | `SYSTEM.md` (rare — only for upstream debugging) |
+| Eval spec — ML output (perplexity, KL, latency p99) | `EVAL_SPEC.md` |
+| Eval spec — agent system (trajectory, plan, replan, capability boundary, reward-hack, cross-agent) | `AGENTIC_EVAL_SPEC.md` |
+| Self-improvement research | `RESEARCH_AHE_ANALYSIS.md` |
+| Dev server / deploy notes | `AGENTS.md` |
+| NVIDIA Cosmos JD | `docs/04_jd.md` |
+
+---
+
+## Git remotes
+
+```
+upstream → https://github.com/huggingface/ml-intern  (read-only, never push)
+origin   → https://github.com/andreidhoang/ml-optimization-agent
+```