25 commits
7867a7a
feat(telemetry): track 5 untracked Bedrock call sites for full cost a…
jagwar Apr 29, 2026
5db99fa
Expose OpenAI models in the UI & make Claude model picker configurabl…
lewtun Apr 30, 2026
15d6506
Make sandboxes private (#195)
lewtun Apr 30, 2026
1b922dd
Add DeepSeek V4 Pro model option (#189)
lewtun Apr 30, 2026
0bd7547
Auto-share session traces to a private HF dataset (#193)
DarshanCode2005 May 1, 2026
71e1892
Use HF username for personal trace uploads (#199)
lewtun May 1, 2026
7599843
Steer agent to HF kernels instead of pip install flash-attn (#204)
DarshanCode2005 May 1, 2026
77324b8
Add session YOLO auto-approval budget (#201)
lewtun May 1, 2026
8615c28
Auto-start CPU sandboxes for sessions (#200)
lewtun May 1, 2026
7b561e3
Document HF Space deploy flow (#207)
lewtun May 1, 2026
092f909
Fallback to free model for gated defaults (#208)
lewtun May 1, 2026
3434de4
P0: Foundation + identity (AuthZ MVP)
May 2, 2026
66603f0
P0.5 D1: cosmos_lab library restructure
May 2, 2026
520f137
Context harness for AI agents
May 2, 2026
9e1dc3d
Planning docs — PLAN_V2 v4 + reference
May 2, 2026
d0b598b
Fix verifier exclusion list — add planning docs
May 2, 2026
e455c77
PLAN v5 thesis pivot — autonomous PrincipalAgent (governance as enabler)
May 2, 2026
d24a279
Add AGENTIC_EVAL_SPEC.md + PLAN v5 §3.3 + 10 eval-system targets
May 2, 2026
1727133
P0.5 D2: cosmos_lab.harness.ml_intern adapter
May 2, 2026
5fa91de
v5.1 two-layer architecture decision + P0.5 D3 nat wrapper
May 3, 2026
2426cb3
v5.2 honest leverage pivot — governance layer for autonomous ml-intern
May 3, 2026
972fcb7
v6 restore specialty agents pivot — 9 NEW agents + governance + ml-in…
May 3, 2026
a31c3c0
P0.5 D4 + P0.5 COMPLETE: adapter contract + dual-adapter test matrix
May 3, 2026
264fa8e
v7 frontier-audit pivot — final architecture (5 production agents + S…
May 3, 2026
af8ae9f
v7-stronger: add context engineering discipline to P3 (Tier 3 + JD st…
May 3, 2026
528 changes: 528 additions & 0 deletions AGENTIC_EVAL_SPEC.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions AGENTS.local.md
37 changes: 37 additions & 0 deletions AGENTS.md
@@ -0,0 +1,37 @@
# Agent Notes

## Local Dev Servers

- Frontend: from `frontend/`, run `npm ci` if dependencies are missing, then `npm run dev`.
- Backend: from `backend/`, run `uv run uvicorn main:app --host ::1 --port 7860`.
- Frontend URL: http://localhost:5173/
- Backend health check: `curl -g http://[::1]:7860/api`
- Frontend proxy health check: `curl http://localhost:5173/api`

Notes:

- Vite proxies `/api` and `/auth` to `http://localhost:7860`.
- If `127.0.0.1:7860` is already owned by another local process, binding the backend to `::1` lets the Vite proxy resolve `localhost` cleanly.
- Prefer `npm ci` over `npm install` for setup, since `npm install` may rewrite `frontend/package-lock.json` metadata depending on npm version.
- Production defaults to the Bedrock Claude model. For local development with a personal Anthropic key, set `ANTHROPIC_API_KEY` and `ML_INTERN_CLAUDE_MODEL_ID=anthropic/claude-opus-4-6` before starting the backend. Other models are selected through the app's model switcher.
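
The health-check URLs above differ only in how the host is written: an IPv6 literal such as `::1` has to be wrapped in brackets, while a hostname does not. A minimal sketch of that rule, assuming a hypothetical helper named `api_url` (not part of this repo):

```python
import ipaddress


def api_url(host: str, port: int, path: str = "/api") -> str:
    """Build a health-check URL, bracketing IPv6 literals as URLs require."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal at all (e.g. "localhost"): use it unchanged.
        is_v6 = False
    netloc = f"[{host}]" if is_v6 else host
    return f"http://{netloc}:{port}{path}"
```

With the ports listed above, `api_url("::1", 7860)` gives the backend check and `api_url("localhost", 5173)` gives the proxy check.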

## GitHub CLI

- For multiline PR descriptions, prefer `gh pr edit <number> --body-file <file>` over inline `--body` so shell quoting, `$` env-var names, backticks, and newlines are preserved correctly.

## Hugging Face Space Deploys

- The Space remote is `space` and points to `https://huggingface.co/spaces/smolagents/ml-intern`.
- Deploy GitHub `main` to the Space from the local `space-main` branch by merging `origin/main` into `space-main` with a single merge commit, then pushing `space-main:main` to the `space` remote.
- Keep the Space-only README frontmatter on `space-main`; `.gitattributes` should contain `README.md merge=ours` and the local repo config should include `merge.ours.driver=true`.
- Recommended deploy flow:

```bash
git pull --ff-only origin main
git switch space-main
git config merge.ours.driver true
git merge --no-ff origin/main -m "Deploy $(date +%Y-%m-%d)" \
-m "Co-authored-by: OpenAI Codex <codex@openai.com>"
git push space space-main:main
git switch main
```
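
Before running the deploy flow, it may help to confirm that `.gitattributes` really carries the `README.md merge=ours` rule. The following is a rough sketch only; the function name `readme_keeps_ours` is hypothetical and the parsing covers just the simple `pattern attr ...` line form:

```python
def readme_keeps_ours(gitattributes_text: str) -> bool:
    """Return True if some line assigns merge=ours to README.md."""
    for raw in gitattributes_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if pattern == "README.md" and "merge=ours" in attrs:
            return True
    return False
```

It takes the file contents as a string, so it can be fed from `Path(".gitattributes").read_text()` on the `space-main` branch.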
98 changes: 98 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,98 @@
# cosmos-lab — Context Harness

> **What this is**: a zero-diff fork of `huggingface/ml-intern` that ships the **cosmos-lab** library, a frontier-aligned production agentic system built on ml-intern's tool primitives, which are leveraged AS-IS inside LangGraph worker nodes. It comprises:
>
> - **5 production agents**: 1 PrincipalAgent supervisor + 4 specialty workers (Data, Eval, Train, Optimize).
> - **CodeWork Skill** (Anthropic Skills pattern).
> - **3 offline governance tools**: GepaOptimizer, CapabilityProbe, CrossAgentEvaluator (explicitly NOT standing agents per 2026 frontier convergence).
> - **~16 governance infrastructure components**: 5 sentinel types incl. judge-hacking detector, cross-family MultiJudge, MCP OAuth + RFC 8693, hash-chained signed audit, OTel-GenAI, 4-scope hybrid memory via Mem0/Letta, Inspect AI bridge, **LangGraph durable supervisor + Magentic-One Task/Progress Ledger pattern**, and **context engineering discipline** (cache-aware prompt structure + 75% compaction + just-in-time retrieval + `cosmos-progress.md` state file).
>
> Deployed via `nat run cosmos-lab.yaml`. See `docs/01_north_star.md` for vision; PLAN_V2.md §0.6 + §0.65 + §0.9 + §3.1 + §3.2 (incl. §3.2.8 context engineering) for v7-stronger thesis.

**Current phase** → `docs/02_current_phase.md` (LIVE — read this first when starting work)

---

## Invariants — never break

1. **ZERO-DIFF**: never edit any file existing in `huggingface/ml-intern`. Use subclass / composition / new path.
2. Never `git commit` or `git push` without explicit user request.
3. `uv run python -m pytest tests/unit/` must match baseline (237 pass / 3 upstream-broken; see PLAN_V2 invariant 2).
4. New code only in owned paths (table below).
5. **One-optimization-per-experiment** + **measured-peak over vendor-peak** (EVAL_SPEC.md, applies P6+).
6. **Trajectory-on-by-default**: from P1, no agent run is unobserved.
7. **OTel-GenAI-on-by-default**: from P1, every span uses `gen_ai.*` semconv.
8. **No judge-only metric reaches a gate**: every quality gate requires (judge, structural-verifier) pair.
9. **No GPU phase exits without one measured real run** (P5/P5.5/P6/P9a; ~$200-400 budget).
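
Invariant 8 can be illustrated with a minimal sketch in which a gate passes only when a judge score and a structural verifier both hold. The names `GateResult` and `quality_gate` and the `0.8` threshold are hypothetical, chosen for illustration rather than taken from cosmos-lab:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: str


def quality_gate(
    output: str,
    judge: Callable[[str], float],
    verifier: Callable[[str], bool],
    threshold: float = 0.8,  # hypothetical cut-off, not from the source
) -> GateResult:
    """Pass only if the structural check AND the judge score both hold."""
    if not verifier(output):
        # A high judge score can never rescue a structurally invalid output.
        return GateResult(False, "structural verifier failed")
    score = judge(output)
    if score < threshold:
        return GateResult(False, f"judge score {score:.2f} below {threshold:.2f}")
    return GateResult(True, "judge and verifier agree")
```

Running the verifier first means a judge that has been gamed into a high score still cannot open the gate on its own.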

---

## Owned paths

| Write here | Never touch (upstream) |
|---|---|
| `cosmos_lab/` (P0.5+) | `agent/core/` |
| `agent/optimization/` (Phase 0 home) | `agent/config.py` |
| `agent/tools/{profiling,training_opt,inference_opt,multimodal_opt,vla_opt}/` | `agent/context_manager/manager.py` |
| `agent/tools/hardware_specs.py` | `agent/tools/*.py` (existing) |
| `agent/prompts/system_prompt_optimization_*.yaml` | `backend/`, `frontend/`, `tests/unit/` |
| `configs/optimization_agent_config.json` | |
| `tests/optimization/` | |
| `docs/`, `bin/` (this harness) | |

**Verify ownership**: `git diff upstream/main --name-only` must show only owned paths.
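
The ownership check could be automated roughly as follows. This is a sketch under assumptions: the patterns are transcribed from the table above, the function name `unowned_paths` is hypothetical, and any files allowed outside the table (for example root-level planning docs) would need their own entries:

```python
from fnmatch import fnmatchcase

# Patterns transcribed from the "Write here" column; "*" also matches "/".
OWNED_PATTERNS = (
    "cosmos_lab/*",
    "agent/optimization/*",
    "agent/tools/profiling/*",
    "agent/tools/training_opt/*",
    "agent/tools/inference_opt/*",
    "agent/tools/multimodal_opt/*",
    "agent/tools/vla_opt/*",
    "agent/tools/hardware_specs.py",
    "agent/prompts/system_prompt_optimization_*.yaml",
    "configs/optimization_agent_config.json",
    "tests/optimization/*",
    "docs/*",
    "bin/*",
)


def unowned_paths(changed: list[str]) -> list[str]:
    """Return the changed paths that fall outside every owned pattern."""
    return [
        path
        for path in changed
        if not any(fnmatchcase(path, pattern) for pattern in OWNED_PATTERNS)
    ]
```

Feeding it the lines printed by `git diff upstream/main --name-only` should yield an empty list when the invariant holds.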

---

## Anti-patterns (catch yourself)

1. Editing the prompt when the bug is in the data.
2. Editing the data when the bug is in the spec.
3. Trusting "I have verified this" from an agent — re-run the verifier yourself.
4. Building a pipeline that should have been one model call.
5. Adding a fourth concurrent agent. You will regret it.
6. Saying "the agent decided" — replace with "P(output | context) was high."

---

## Workflow (every task)

`DEFINE → PROBE → BUILD → REVIEW → SHIP → LEARN` — see `docs/00_workflow.md`.

**Hard rule**: if you can't write the verifier, the goal is wrong. Fix the goal, not the agent.

---

## Dev commands

```bash
uv sync --extra dev # install (--extra dev for pytest)
uv run python -m pytest tests/unit/ -q # upstream baseline (must match)
uv run python -m pytest tests/optimization/ -q # cosmos-lab tests (must pass)
PYTHONPATH=. ruff check agent/ --ignore E501,F401,E402 # lint
./bin/verify.sh <phase> # phase verifier (e.g. p0_5_d3)
git fetch upstream && git merge upstream/main # daily upstream sync
```

Note: use `uv run python -m pytest` (NOT `uv run pytest`) — bare `uv run pytest`
can resolve to a system pytest with stale package metadata. Captured in P0.5 D2 LEARN.
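
The "must match" comparison for the upstream baseline can be sketched by parsing the final summary line that pytest prints. Treating the 3 upstream-broken tests as "failed plus errors" is an assumption, and the names `parse_summary` and `matches_baseline` are hypothetical:

```python
import re

# 237 pass / 3 upstream-broken comes from the invariants; counting
# "broken" as failed + errors is an assumption of this sketch.
BASELINE = {"passed": 237, "broken": 3}

_COUNT = re.compile(r"(\d+) (passed|failed|errors?)")


def parse_summary(line: str) -> dict[str, int]:
    """Extract passed and broken counts from a pytest summary line."""
    counts = {"passed": 0, "broken": 0}
    for number, label in _COUNT.findall(line):
        key = "passed" if label == "passed" else "broken"
        counts[key] += int(number)
    return counts


def matches_baseline(line: str) -> bool:
    return parse_summary(line) == BASELINE
```

For instance, `matches_baseline("237 passed, 3 failed in 41.20s")` holds, while a run with one extra failure does not.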

---

## Pointer index (load on demand)

| Need | Read |
|---|---|
| What we're building this week | `docs/02_current_phase.md` |
| Vision in 1 screen | `docs/01_north_star.md` |
| Workflow phases | `docs/00_workflow.md` |
| Phase → PLAN_V2 anchor map | `docs/03_pointers.md` |
| Full plan (24 sections, 837L) | `PLAN_V2.md` (read specific section, not whole file) |
| Architecture deep WHY (Vietnamese, 1167L) | `SYSTEM.md` (rare — only for upstream debugging) |
| Eval spec — ML output (perplexity, KL, latency p99) | `EVAL_SPEC.md` |
| Eval spec — agent system (trajectory, plan, replan, capability boundary, reward-hack, cross-agent) | `AGENTIC_EVAL_SPEC.md` |
| Self-improvement research | `RESEARCH_AHE_ANALYSIS.md` |
| Dev server / deploy notes | `AGENTS.md` |
| NVIDIA Cosmos JD | `docs/04_jd.md` |

---

## Git remotes

```
upstream → https://github.com/huggingface/ml-intern (read-only, never push)
origin → https://github.com/andreidhoang/ml-optimization-agent
```