From 60a6fbf06ac950931cb5d81c66fd1187a1a521cd Mon Sep 17 00:00:00 2001 From: huntharo Date: Sun, 29 Mar 2026 18:03:31 -0400 Subject: [PATCH 1/6] ce:optimize - brainstorm / plan MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Capturing this at the brainstorm / plan You know what this is... don't make me say it 😂 --- ...terative-optimization-loop-requirements.md | 951 ++++++++++++++++++ ...ative-optimization-loop-skill-beta-plan.md | 560 +++++++++++ 2 files changed, 1511 insertions(+) create mode 100644 docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md create mode 100644 docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md diff --git a/docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md b/docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md new file mode 100644 index 00000000..d378d774 --- /dev/null +++ b/docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md @@ -0,0 +1,951 @@ +# Iterative Optimization Loop Skill — Requirements Brainstorm + +## Problem Statement + +CE has strong knowledge-compounding (learn from past work) and multi-agent review (quality gates), but no skill for **metric-driven iterative optimization** — the pattern where you define a measurable goal, build measurement scaffolding, then run an automated loop that tries many approaches, measures each, keeps improvements, and converges toward the best solution. + +### Motivating Example + +A project builds issue/PR clusters for a large open-source repo. Currently only ~20% of issues/PRs land in clusters with >1 item. The suspected achievable target is ~95%. 
Getting there requires testing many hypotheses:
+
+- Extracting signal (unique user-entered text) from noise (PR/issue template boilerplate that makes all vectors too similar)
+- Using issue-to-PR links as a new clustering signal
+- Adjusting similarity thresholds
+- Trying different embedding models or chunking strategies
+- Combining multiple signals (text similarity + link graph + label overlap + author patterns)
+- Pre-filtering or normalizing template sections before embedding
+
+No single hypothesis will get us from 20% to 95%. It requires systematic experimentation — trying dozens or hundreds of variations, measuring each, and building on successes.
+
+## Landscape Analysis
+
+### Karpathy's AutoResearch (March 2026, 21k+ stars)
+
+The simplest and most influential model. Core design:
+
+- **One mutable file** (`train.py`) — the agent edits only this
+- **One immutable evaluator** (`prepare.py`) — the agent cannot touch measurement
+- **One instruction file** (`program.md`) — defines objectives, constraints, stopping criteria
+- **One metric** (`val_bpb`) — scalar, lower is better
+- **Linear keep/revert loop**: modify -> commit -> run -> measure -> if improved keep, else `git reset`
+- **History**: `results.tsv` accumulates all experiment results; git log preserves successful commits
+- **Result**: 700 experiments in 2 days, 20 discovered optimizations, ~12 experiments/hour
+
+**Strengths**: Dead simple. Git-native history. Easy to understand and debug.
+**Weaknesses**: Linear — can't explore multiple directions simultaneously. Single scalar metric. No backtracking to earlier promising states. 
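The core decision rule is small enough to sketch. This is an illustrative Python sketch, not AutoResearch's actual code: the hypothesis names and metric values are made up, and in the real loop each metric comes from running the immutable evaluator while "revert" is a `git reset --hard` back to the last kept commit.

```python
def keep_revert_loop(baseline, experiments, lower_is_better=True):
    """Linear keep/revert: keep an experiment only if it beats the best so far."""
    best = baseline
    log = []  # plays the role of results.tsv: every result is recorded
    for hypothesis, metric in experiments:
        improved = (metric < best) if lower_is_better else (metric > best)
        if improved:
            best = metric  # keep: in the real loop, the commit stays in git history
        # else: revert (git reset) -- but the measurement is still logged
        log.append((hypothesis, metric, "kept" if improved else "reverted"))
    return best, log

# Hypothetical run: starting val_bpb of 3.28, three candidate changes
best, log = keep_revert_loop(
    baseline=3.28,
    experiments=[
        ("fuse qkv projections", 3.30),  # worse than baseline -> reverted
        ("tune lr schedule", 3.21),      # better -> kept, becomes new best
        ("wider mlp", 3.24),             # worse than new best -> reverted
    ],
)
```

Everything the full pattern adds — the instruction file, the results log, stopping criteria — layers on top of this one comparison.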
+ +### AIDE / WecoAI + +- **Tree search** in solution space — each script is a node, LLM patches spawn children +- Can backtrack to any previous node and explore alternatives +- 4x more Kaggle medals than linear agents on MLE-Bench +- More complex but better at escaping local optima + +### Sakana AI Scientist v2 + +- **Agentic tree search** with parallel experiment execution +- VLM feedback for analyzing figures +- Full paper generation with automated peer review +- Overkill for code optimization but shows the value of tree-structured exploration + +### DSPy (Stanford) + +- Automated prompt/weight optimization for LLM programs +- Bayesian optimization (MIPROv2), iterative feedback (GEPA), coordinate ascent (COPRO) +- Shows that different optimization strategies suit different problem shapes + +### Existing Claude Code AutoResearch Forks + +- `uditgoenka/autoresearch` — packages the pattern as a Claude Code skill +- `autoexp` — generalized for any project with a quantifiable metric +- Multiple teams report 50-80% improvements over 30-70 iterations overnight + +## Key Design Decisions + +### 1. Linear vs. Tree Search + +| Approach | Pros | Cons | +|---|---|---| +| Linear (autoresearch) | Simple, easy to understand, git-native | Can't explore multiple directions, stuck in local optima | +| Tree search (AIDE) | Can backtrack, explore alternatives | More complex state management, harder to review | +| Hybrid: linear with manual branch points | Best of both — simple default, user chooses when to fork | Requires user interaction to fork | + +**Recommendation**: Start with linear keep/revert (Karpathy model) as the default. Add optional "branch point" support where the user can snapshot the current best and start a new exploration direction. Each direction is its own branch. This keeps the core loop simple while allowing multi-direction exploration when needed. + +### 2. 
What Gets Measured — The Three-Tier Metric Architecture + +AutoResearch uses a single scalar metric (val_bpb). That works when you have an objective function with clear ground truth. Most real-world optimization problems don't — especially when the quality of the output requires human judgment. + +**Key insight**: Hard scalar metrics are often the wrong optimization target. For clustering, "bigger clusters" isn't inherently better. "Fewer singletons" isn't inherently better. A solution with 35% singletons where every cluster is coherent beats a solution with 5% singletons where clusters are garbage. Hard metrics catch *degenerate* solutions; *quality* requires judgment. + +**Three tiers**: + +1. **Degenerate-case gates** (hard, cheap, fully automated): + - Catch obviously broken solutions before expensive evaluation + - Examples: "all items in 1 cluster" (degenerate merge), "all singletons" (degenerate split), "runtime > 10 minutes" (performance regression) + - These are fast boolean checks: pass/fail. If any gate fails, the experiment is immediately reverted without running the expensive judge + - Think of these as "sanity checks" not "optimization targets" + +2. **LLM-as-judge quality score** (the actual optimization target): + - For problems where quality requires judgment, this IS the primary metric + - Cost-controlled via stratified sampling (not exhaustive) + - Produces a scalar score the loop can optimize against + - Can include multiple dimensions (coherence, granularity, completeness) + - See detailed design below + +3. 
**Diagnostics** (logged for understanding, not gated on): + - Distribution stats, counts, histograms + - Useful for understanding WHY a judge score changed + - Examples: median cluster size, singleton %, largest cluster size, cluster count + - Logged in the experiment record but never used for keep/revert decisions + +**When to use which configuration**: + +| Problem Type | Degenerate Gates | Primary Metric | Example | +|---|---|---|---| +| Objective function exists | Yes | Hard metric (scalar) | Build time, test pass rate, API latency | +| Quality requires judgment | Yes | LLM-as-judge score | Clustering quality, search relevance, content generation | +| Hybrid | Yes | Hard metric + LLM-judge as guard rail | Latency (optimize) + response quality (must not drop) | + +**Recommendation**: Support all three tiers. The user declares whether the primary optimization target is a hard metric or an LLM-judge score. Degenerate gates always run first (cheap). Judge runs only on experiments that pass gates. + +### 3. What the Agent Can Edit + +AutoResearch constrains the agent to one file. This is elegant but too restrictive for most software projects. + +**Recommendation**: Define an explicit allowlist of mutable files/directories and an explicit denylist (measurement harness, test fixtures, evaluation data). The agent operates within the allowlist. The measurement harness is immutable — the agent cannot game the metric by changing how it's measured. + +### 4. Measurement Scaffolding First + +This is critical and distinguishes this from "just run the code in a loop": + +1. **Define the measurement spec** before any optimization begins +2. **Build and validate the measurement harness** — ensure it produces reliable, reproducible results +3. **Establish baseline** — run the harness on the current code to get starting metrics +4. Only then begin the optimization loop + +**Recommendation**: Make this a hard phase gate. 
The skill refuses to enter the optimization loop until the measurement harness passes a validation check (runs successfully, produces expected metric types, baseline is recorded). + +### 5. History and Memory + +What gets remembered across iterations: + +- **Results log**: Every experiment's metrics, hypothesis, and outcome (kept/reverted) +- **Git history**: Successful experiments are commits; branches are preserved +- **Hypothesis log**: What was tried, why, what was learned — prevents re-trying failed approaches +- **Strategy evolution**: As the agent learns what works, it should adapt its exploration strategy + +**Recommendation**: A structured experiment log (YAML or JSON) that captures: iteration number, hypothesis, changes made, metrics before/after, outcome (kept/reverted/error), and learnings. The agent reads this before proposing the next hypothesis. Git branches are preserved for all kept experiments. + +### 6. How Long It Runs + +- AutoResearch runs "indefinitely until manually stopped" +- Real-world needs: time budgets, iteration budgets, metric targets, or "until no improvement for N iterations" + +**Recommendation**: Support multiple stopping criteria (any can trigger stop): +- Target metric reached +- Max iterations +- Max wall-clock time +- No improvement for N consecutive iterations +- Manual stop (user interrupts) + +### 7. Parallelism + +AutoResearch is single-threaded. AIDE and AI Scientist run parallel experiments. For CE: + +- **Phase 1 (v1)**: Single-threaded linear loop. Simple, debuggable, works with git worktrees. +- **Phase 2 (future)**: Parallel experiments using multiple worktrees or Codex sandboxes. Each experiment is independent. + +**Recommendation**: Start single-threaded. Design the experiment log and branching model to support parallelism later. + +### 8. 
Integration with Existing CE Skills + +The optimization loop should compose with existing CE capabilities: + +- **`/ce:ideate`** or **`/ce:brainstorm`** to generate initial hypothesis space +- **Learnings researcher** to check if similar optimization was done before +- **`/ce:compound`** to capture the winning strategy as institutional knowledge after the loop completes +- **`/ce:review`** optionally on the final winning diff before it's merged + +## Proposed Skill: `/ce:optimize` + +### Workflow Phases + +``` +Phase 0: Setup + |-- Read/create optimization spec (target metric, guard rails, mutable files, constraints) + |-- Search learnings for prior related optimization attempts + '-- Validate spec completeness + +Phase 1: Measurement Scaffolding (HARD GATE - user must approve before Phase 2) + |-- If user provides harness: + | |-- Review docs (or document usage if undocumented) + | |-- Run harness once against current implementation + | '-- Confirm baseline measurement is accurate with user + |-- If agent builds harness: + | |-- Build measurement harness (immutable evaluator) + | |-- Run validation: harness executes, produces expected metric types + | '-- Establish baseline metrics + |-- Parallelism readiness probe: + | |-- Check for hardcoded ports -> parameterize via env var + | |-- Check for shared DB files (SQLite, etc.) -> plan copy strategy + | |-- Check for shared external services -> warn user + | |-- Check for exclusive resource needs (GPU, etc.) + | '-- Produce parallel_readiness assessment + |-- Stability validation (if mode: repeat): + | |-- Run harness repeat_count times + | |-- Verify variance is within noise_threshold + | '-- Confirm aggregation method produces stable baseline + '-- GATE: Present baseline + parallel readiness to user. Refuse to proceed until approved. 
+ +Phase 2: Hypothesis Generation + Dependency Approval + |-- Analyze the problem space (read code, understand current approach) + |-- Generate initial hypothesis list (agent + optionally /ce:ideate) + |-- Prioritize by expected impact and feasibility + |-- Identify new dependencies across ALL planned hypotheses + |-- Present dependency list for bulk approval + '-- Record hypothesis backlog (with dep approval status per hypothesis) + +Phase 3: Optimization Loop (repeats in parallel batches) + |-- Select batch of hypotheses (batch_size = min(backlog, max_concurrent)) + | '-- Prefer diversity: mix different hypothesis categories per batch + |-- For each experiment in batch (PARALLEL by default): + | |-- Create worktree or Codex sandbox + | |-- Copy shared resources (DB files, data files) + | |-- Apply parameterization (ports, env vars) + | |-- Implement hypothesis (within mutable scope) + | |-- Run measurement harness (respecting stability config) + | '-- Collect metrics + diff + |-- Wait for batch completion + |-- Evaluate results: + | |-- Rank by primary metric improvement + | |-- Filter by guard rails (reject any that violate) + | |-- If best > current: KEEP (merge to optimization branch) + | |-- If best has unapproved dep: mark deferred_needs_approval + | '-- All others: REVERT (log results, clean up worktrees) + |-- Handle unapproved deps: + | '-- Set aside, don't block pipeline, batch-ask at end or check-in + |-- Update experiment log with ALL results (kept + reverted) + |-- Re-baseline: remaining hypotheses evaluated against new best + |-- Generate new hypotheses based on learnings from this batch + |-- Check stopping criteria + '-- Next batch + +Phase 4: Wrap-Up + |-- Present deferred hypotheses needing dep approval (if any) + |-- Summarize results: baseline -> final metrics, total iterations, kept improvements + |-- Preserve ALL experiment branches for reference + |-- Optionally run /ce:review on cumulative diff + |-- Optionally run /ce:compound to capture 
winning strategy as learning + '-- Report to user +``` + +### Optimization Spec File Format + +See "Updated Spec File Format" in the Resolved Design Decisions section below for the full spec with parallel execution and stability config. + +### Experiment Log Format + +```yaml +# .context/compound-engineering/optimize/experiment-log.yaml +spec: "improve-issue-clustering" + +baseline: + timestamp: "2026-03-29T10:00:00Z" + gates: + largest_cluster_pct: 0.02 + singleton_pct: 0.79 + cluster_count: 342 + runtime_seconds: 45 + diagnostics: + singleton_pct: 0.79 + median_cluster_size: 2 + cluster_count: 342 + avg_cluster_size: 2.8 + p95_cluster_size: 7 + judge: + mean_score: 3.1 + pct_scoring_4plus: 0.33 + mean_distinct_topics: 1.8 + singleton_false_negative_pct: 0.45 # 45% of sampled singletons should be clustered + sample_seed: 42 + judge_cost_usd: 0.42 + +experiments: + - iteration: 1 + batch: 1 + hypothesis: "Remove PR template boilerplate before embedding to reduce noise" + category: "signal-extraction" + changes: + - file: "src/preprocessing/text_cleaner.py" + summary: "Added template detection and removal using common PR template patterns" + gates: + largest_cluster_pct: 0.03 + singleton_pct: 0.62 + cluster_count: 489 + runtime_seconds: 48 + gates_passed: true + diagnostics: + singleton_pct: 0.62 + median_cluster_size: 3 + cluster_count: 489 + avg_cluster_size: 3.4 + judge: + mean_score: 3.8 + pct_scoring_4plus: 0.57 + mean_distinct_topics: 1.4 + singleton_false_negative_pct: 0.31 + judge_cost_usd: 0.38 + outcome: "kept" + primary_delta: "+0.7" # mean_score: 3.1 -> 3.8 + learnings: "Template removal significantly improved coherence. Clusters now group by actual issue content rather than shared boilerplate. Singleton rate dropped 17pp." 
+ commit: "abc123" + + - iteration: 2 + batch: 1 # same batch as iteration 1 (ran in parallel) + hypothesis: "Lower similarity threshold from 0.85 to 0.75" + category: "clustering-algorithm" + changes: + - file: "config/clustering.yaml" + summary: "Changed similarity_threshold from 0.85 to 0.75" + gates: + largest_cluster_pct: 0.08 + singleton_pct: 0.35 + cluster_count: 210 + runtime_seconds: 47 + gates_passed: true + diagnostics: + singleton_pct: 0.35 + median_cluster_size: 5 + cluster_count: 210 + judge: + mean_score: 2.4 + pct_scoring_4plus: 0.13 + mean_distinct_topics: 3.1 # clusters covering too many unrelated topics + singleton_false_negative_pct: 0.12 + judge_cost_usd: 0.41 + outcome: "reverted" + primary_delta: "-0.7" # mean_score: 3.1 -> 2.4 + learnings: "Lower threshold pulled in more items but destroyed coherence. Clusters became grab-bags. The hard metrics looked good (fewer singletons!) but judge correctly identified the quality drop. Validates that singleton_pct alone is a misleading optimization target." + + - iteration: 3 + batch: 2 # new batch, runs on top of iteration 1's changes + hypothesis: "Use issue-to-PR link graph as additional clustering signal" + category: "graph-signals" + changes: + - file: "src/clustering/signals.py" + summary: "Added link-graph signal extraction from issue-PR references" + - file: "src/clustering/merger.py" + summary: "Combined text similarity with link-graph signal using weighted average" + gates: + largest_cluster_pct: 0.04 + singleton_pct: 0.48 + cluster_count: 520 + runtime_seconds: 52 + gates_passed: true + diagnostics: + singleton_pct: 0.48 + median_cluster_size: 3 + cluster_count: 520 + judge: + mean_score: 4.1 + pct_scoring_4plus: 0.70 + mean_distinct_topics: 1.2 + singleton_false_negative_pct: 0.22 + judge_cost_usd: 0.39 + outcome: "kept" + primary_delta: "+0.3" # mean_score: 3.8 -> 4.1 (from iteration 1 baseline) + learnings: "Link graph is a strong complementary signal. 
Issues referencing the same PR are almost always related. Judge scores jumped — 70% of clusters now score 4+. Singleton false negatives dropped further." + commit: "def456" + + - iteration: 4 + batch: 2 + hypothesis: "Add scikit-learn HDBSCAN for hierarchical density clustering" + category: "clustering-algorithm" + changes: [] + gates_passed: false # not evaluated — deferred + outcome: "deferred_needs_approval" + deferred_reason: "Requires unapproved dependency: scikit-learn" + learnings: "Set aside for batch approval at end of loop." + +best: + iteration: 3 + judge: + mean_score: 4.1 + pct_scoring_4plus: 0.70 + total_judge_cost_usd: 1.60 # running total across all experiments +``` + +## Hypothesis Generation Strategies + +For the clustering example, here's the kind of hypothesis space the agent should explore: + +### Signal Extraction +- Remove PR/issue template boilerplate before embedding +- Extract only user-authored text (strip auto-generated sections) +- Weight title more heavily than body +- Use code snippets / file paths mentioned as signals +- Extract error messages and stack traces as high-signal features + +### Graph-Based Signals +- Issue-to-PR links (issues referencing same PR are related) +- Cross-references between issues (`#123` mentions) +- Author patterns (same author filing similar issues) +- Label co-occurrence +- Milestone/project board grouping + +### Embedding & Similarity +- Try different embedding models (different size/quality tradeoffs) +- Chunk long issues before embedding vs. truncate vs. summarize +- Weighted combination of multiple similarity signals +- Asymmetric similarity (issue-to-PR vs. issue-to-issue) + +### Clustering Algorithm +- Adjust similarity thresholds (per-signal or combined) +- Try hierarchical clustering vs. 
graph-based community detection +- Two-pass: coarse clusters then split/merge refinement +- Minimum cluster size constraints +- Handle outlier issues that genuinely don't cluster + +### Pre-processing +- Normalize markdown formatting +- Deduplicate near-identical issues before clustering +- Language detection and translation for multilingual repos +- Time-decay weighting (recent issues weighted more) + +## Resolved Design Decisions + +### D1: Measurement Harness Ownership -> DECIDED: Agent builds, user validates + +The agent builds the measurement harness in Phase 1 and evaluates it against the current implementation. If the user provides an existing harness, the agent documents how to use it (or reviews existing docs), runs it once, and confirms the baseline measurement is accurate. Either way, the user reviews and approves before the loop starts. This is a hard gate. + +### D2: Flaky Metrics -> DECIDED: User-configurable, default stable + +The spec supports a `stability` block: + +```yaml +measurement: + command: "python evaluate.py" + stability: + mode: "stable" # default: run once, trust the result + # mode: "repeat" # run N times, aggregate + # repeat_count: 5 # how many runs + # aggregation: "median" # median | mean | min | max | custom + # noise_threshold: 0.02 # improvement must exceed this to count +``` + +When `mode: repeat`, the harness runs `repeat_count` times. The `aggregation` function reduces results to a single value per metric. The `noise_threshold` prevents accepting improvements within the noise floor. Default is `stable` — run once, trust it. + +### D3: New Dependencies -> DECIDED: Pre-approve expected, defer surprises + +During Phase 2 (Hypothesis Generation), the agent outlines expected new dependencies across all planned variations and gets bulk approval up front. If an experiment during the loop discovers it needs an unapproved dependency, the agent: +1. 
Sets that hypothesis aside (marks it `deferred_needs_approval` in the experiment log) +2. Continues with other hypotheses that don't need new deps +3. At the end of the loop (or at a user check-in), presents the deferred hypotheses and their dep requirements for batch approval +4. If approved, those hypotheses enter the next iteration batch + +This prevents blocking the pipeline on interactive approval during long unattended runs. + +### D4: LLM-as-Judge -> DECIDED: Include in v1 (cost-controlled via sampling) + +LLM-as-judge is essential for problems where quality requires judgment — it's often the *actual* optimization target, not a nice-to-have. Hard metrics catch degenerate cases but can't tell you whether clusters are coherent or search results are relevant. + +**Cost control via stratified sampling**: +- Don't judge every output item — sample a representative set +- Stratified sampling ensures coverage of edge cases (small clusters, large clusters, singletons) +- Default: ~30 samples per evaluation (configurable) +- At ~$0.01-0.03 per judgment call, 30 samples = ~$0.30-0.90 per experiment +- Over 100 experiments = $30-90 total — manageable + +**Sampling strategy**: +```yaml +judge: + sample_size: 30 + stratification: + - bucket: "small" # 2-3 items + count: 10 + - bucket: "medium" # 4-10 items + count: 10 + - bucket: "large" # 11+ items + count: 10 + # For singletons: sample 10 and ask "should any of these be in a cluster?" + singleton_sample: 10 +``` + +**Rubric-based scoring** (user-defined, per problem): +```yaml +judge: + rubric: | + Rate this cluster 1-5: + - 5: All items clearly about the same issue/feature + - 4: Strong theme, minor outliers + - 3: Related but covers 2-3 sub-topics + - 2: Weak connection + - 1: Unrelated items grouped together + + Also answer: + - How many distinct sub-topics does this cluster represent? + - Should any items be removed from this cluster? 
+ + scoring: + primary: "mean_score" # mean of 1-5 ratings + secondary: "pct_scoring_4plus" # % of samples scoring 4 or 5 + output_format: "json" # {"score": 4, "distinct_topics": 1, "remove_items": []} +``` + +**Judge execution order**: +1. Run degenerate-case gates (fast, free) -- reject obviously broken solutions +2. Run hard metrics (fast, free) -- collect diagnostics +3. Only if gates pass: run LLM-as-judge on sampled outputs (slow, costs money) +4. Keep/revert decision uses judge score as primary metric + +**Judge consistency**: +- Use the same sample indices across experiments when possible (same random seed) +- This reduces noise from sample variance — you're comparing the same clusters across runs +- When the output structure changes (different number of clusters), re-sample but log the seed change + +**Judge model selection**: +- Default: Haiku (fast, cheap, good enough for rubric-based scoring) +- Option: Sonnet for nuanced judgment (2-3x cost) +- The judge prompt is part of the immutable measurement harness — the agent cannot modify it + +**Singleton evaluation** (the non-obvious case): +- Low singleton % isn't automatically good. High singleton % isn't automatically bad. +- Sample singletons and ask the judge: "Given these other clusters, should this item be in one of them? Which one? Or is it genuinely unique?" 
+- This catches false-negative clustering (items that should cluster but don't) AND validates true singletons + +### D5: Codex Support -> DECIDED: Include from v1 + +Based on patterns from PRs #364/#365 in the compound-engineering plugin: + +**Dispatch pattern**: Write experiment prompt to a temp file, pipe to `codex exec` via stdin: +```bash +cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1 +``` + +**Security posture**: User selects once per session (same as ce-work-beta): +- Workspace write (`--full-auto`) +- Full access (`--dangerously-bypass-approvals-and-sandbox`) + +**Result collection**: Inspect working directory diff after `codex exec` completes. No structured result format — Codex writes files, orchestrator reads the diff and runs the measurement harness. + +**Guard rails**: +- Check for `CODEX_SANDBOX` / `CODEX_SESSION_ID` env vars to prevent recursive delegation +- 3 consecutive delegate failures auto-disable Codex for remaining experiments +- Orchestrator retains control of git operations, measurement, and keep/revert decisions + +### D6: Parallel Execution -> DECIDED: Parallel by default + +Experiments run in parallel by default. The user can specify serial execution if the system under test requires it. The skill actively probes for parallelism blockers. + +See full parallel execution design below. + +--- + +## Parallel Execution Design + +### Default: Parallel Experiments + +The optimization loop dispatches multiple experiments simultaneously unless the user explicitly requests serial execution. This is the primary throughput lever — running 4-8 experiments in parallel vs. 1 at a time means 4-8x more iterations per hour. + +### Isolation Strategy + +Each parallel experiment needs full filesystem isolation. 
Two mechanisms, selectable per session: + +**Local worktrees** (default): +``` +.claude/worktrees/optimize-exp-001/ # full repo copy +.claude/worktrees/optimize-exp-002/ +.claude/worktrees/optimize-exp-003/ +``` +- Created via `git worktree add` with a unique branch per experiment +- Each worktree gets its own copy of shared resources (see below) +- Cleaned up after measurement: kept experiments merge to the optimization branch, reverted experiments have their worktree removed + +**Codex sandboxes** (opt-in): +- Each experiment dispatched as an independent `codex exec` invocation +- Codex provides built-in filesystem isolation +- Orchestrator collects diffs after completion +- Best for maximizing parallelism (no local resource limits) + +**Hybrid** (future): +- Use Codex for implementation, local worktree for measurement +- Useful when measurement requires local resources (GPU, specific hardware, large datasets) + +### Parallelism Blocker Detection (Phase 1) + +During Phase 1 (Measurement Scaffolding), the skill actively probes for common parallelism blockers: + +**Port conflicts**: +- Run the measurement harness and check if it binds to fixed ports +- Search config and code for hardcoded port numbers +- If found: parameterize via environment variable (e.g., `PORT=0` for random, or `BASE_PORT + experiment_index`) +- Add to spec: `parallel.port_strategy: "parameterized"` with the env var name + +**Shared database files**: +- Check for SQLite databases, local file-based stores +- If found: each experiment gets a copy of the database in its worktree +- Cleanup: remove copies after measurement +- Add to spec: `parallel.shared_files: ["data/clusters.db"]` with copy strategy + +**Shared external services**: +- Check if the system writes to a shared external database, API, or queue +- If found: warn user, suggest serial mode or test database isolation +- This is a hard blocker for parallel unless the user confirms isolation + +**Resource contention**: +- Check for GPU 
usage, large memory requirements +- If the system needs exclusive access to a resource, serial mode is required +- Add to spec: `parallel.exclusive_resources: ["gpu"]` + +**Detection output**: Phase 1 produces a `parallel_readiness` assessment: +```yaml +parallel: + mode: "parallel" # parallel | serial | user-decision + max_concurrent: 4 # default, adjustable + blockers_found: [] # or list of issues + mitigations_applied: + - type: "port_parameterization" + env_var: "EVAL_PORT" + strategy: "base_port_plus_index" + base: 9000 + - type: "database_copy" + source: "data/clusters.db" + strategy: "copy_per_worktree" + blockers_unresolved: [] # these force serial unless user resolves +``` + +### Parallel Loop Mechanics + +``` +Orchestrator (main branch) + | + |-- Batch N experiments from hypothesis backlog + | (batch_size = min(backlog_size, max_concurrent)) + | + |-- For each experiment in batch (parallel): + | |-- Create worktree / Codex sandbox + | |-- Copy shared resources (DB files, etc.) + | |-- Apply parameterization (ports, env vars) + | |-- Implement hypothesis (agent edits mutable files) + | |-- Run measurement harness + | |-- Collect metrics + diff + | |-- Clean up shared resource copies + | + |-- Wait for all experiments in batch to complete + | + |-- Evaluate results: + | |-- Rank by primary metric improvement + | |-- Filter by guard rails + | |-- Select best experiment that passes all guards + | |-- If best > current best: KEEP (merge to optimization branch) + | |-- All others: REVERT (remove worktrees, log results) + | |-- If none improve: log all results, advance to next batch + | + |-- Update experiment log with all results (kept + reverted) + |-- Update hypothesis backlog based on learnings from ALL experiments + |-- Check stopping criteria + |-- Next batch +``` + +### Parallel-Aware Keep/Revert + +With parallel experiments, multiple experiments might improve the metric but conflict with each other (they modify the same files in incompatible ways). 
Resolution strategy:
+
+1. **Non-overlapping changes**: If the best experiment's changes don't overlap with the second-best, consider keeping both (merge sequentially, re-measure after merge to confirm)
+2. **Overlapping changes**: Keep only the best. Log the second-best as "promising but conflicts with experiment N" for potential future retry on top of the new baseline
+3. **Re-baseline**: After keeping an experiment, the reverted experiments in the batch are not re-run; their hypotheses go back into the backlog to be retried against the new baseline
+
+### Experiment Prompt Template (for Codex dispatch)
+
+```markdown
+# Optimization Experiment #{iteration}
+
+## Context
+You are running experiment #{iteration} for optimization target: {spec.name}
+Current best metrics: {current_best_metrics}
+Baseline metrics: {baseline_metrics}
+
+## Your Hypothesis
+{hypothesis.description}
+
+## What To Change
+Modify ONLY files in the mutable scope:
+{spec.scope.mutable}
+
+DO NOT modify:
+{spec.scope.immutable}
+
+## Constraints
+{spec.constraints}
+{approved_dependencies}
+
+## Previous Experiments (for context)
+{recent_experiment_summaries}
+
+## Instructions
+1. Implement the hypothesis
+2. Do NOT run the measurement harness (orchestrator handles this)
+3. Do NOT commit (orchestrator handles this)
+4. 
Run `git diff --stat` when done so the orchestrator can see your changes +``` + +### Concurrency Limits + +```yaml +parallel: + max_concurrent: 4 # default for local worktrees + # max_concurrent: 8 # default for Codex (no local resource limits) + codex_rate_limit: 10 # max Codex invocations per minute + worktree_cleanup: "immediate" # or "batch" (clean up after full batch) +``` + +--- + +## Updated Spec File Format + +### Example A: Hard-Metric Primary (build performance, test pass rate) + +```yaml +# .context/compound-engineering/optimize/spec.yaml +name: "reduce-build-time" +description: "Reduce CI build time while maintaining test pass rate" + +metric: + primary: + type: "hard" # hard | judge + name: "build_time_seconds" + direction: "minimize" + baseline: null # filled by Phase 1 + target: 60 # optional target to stop at + + degenerate_gates: # fast boolean checks, run first + - name: "test_pass_rate" + check: ">= 1.0" # all tests must pass + - name: "build_exits_zero" + check: "== true" + + diagnostics: + - name: "cache_hit_rate" + - name: "slowest_step" + - name: "total_test_count" + +measurement: + command: "python evaluate.py" + timeout_seconds: 600 + output_format: "json" + stability: + mode: "stable" +``` + +### Example B: LLM-Judge Primary (clustering quality, search relevance) + +```yaml +# .context/compound-engineering/optimize/spec.yaml +name: "improve-issue-clustering" +description: "Improve coherence and coverage of issue/PR clusters" + +metric: + primary: + type: "judge" + name: "cluster_coherence" + direction: "maximize" + baseline: null + target: 4.2 # mean judge score (1-5 scale) + + degenerate_gates: # cheap checks that reject obviously broken solutions + - name: "largest_cluster_pct" + description: "% of all items in the single largest cluster" + check: "<= 0.10" # if >10% of items are in one cluster, it's degenerate + - name: "singleton_pct" + description: "% of items that are singletons" + check: "<= 0.80" # if >80% singletons, clustering 
isn't working at all + - name: "cluster_count" + check: ">= 10" # fewer than 10 clusters for 18k items is degenerate + - name: "runtime_seconds" + check: "<= 600" + + diagnostics: # logged for understanding, never gated on + - name: "singleton_pct" # note: same metric can be diagnostic AND gate + - name: "median_cluster_size" + - name: "cluster_count" + - name: "avg_cluster_size" + - name: "p95_cluster_size" + + judge: + model: "haiku" # haiku (cheap) | sonnet (nuanced) + sample_size: 30 + stratification: + - bucket: "small" # 2-3 items per cluster + count: 10 + - bucket: "medium" # 4-10 items + count: 10 + - bucket: "large" # 11+ items + count: 10 + singleton_sample: 10 # also sample singletons to check false negatives + sample_seed: 42 # fixed seed for cross-experiment consistency + rubric: | + Rate this cluster 1-5: + - 5: All items clearly about the same issue/feature + - 4: Strong theme, minor outliers + - 3: Related but covers 2-3 sub-topics + - 2: Weak connection + - 1: Unrelated items grouped together + + Also answer in JSON: + - "score": your 1-5 rating + - "distinct_topics": how many distinct sub-topics this cluster represents + - "outlier_count": how many items don't belong + singleton_rubric: | + This item is currently a singleton (not in any cluster). + Given the cluster titles listed below, should this item be in one of them? 
+ + Answer in JSON: + - "should_cluster": true/false + - "best_cluster_id": cluster ID it belongs in (or null) + - "confidence": 1-5 how confident you are + scoring: + primary: "mean_score" # what the loop optimizes + secondary: + - "pct_scoring_4plus" # % of samples scoring 4+ + - "mean_distinct_topics" # lower is better (tighter clusters) + - "singleton_false_negative_pct" # % of sampled singletons that should be clustered + +measurement: + command: "python evaluate.py" # outputs JSON with gate + diagnostic metrics + timeout_seconds: 600 + output_format: "json" + stability: + mode: "stable" + +scope: + mutable: + - "src/clustering/" + - "src/preprocessing/" + - "config/clustering.yaml" + immutable: + - "evaluate.py" + - "tests/fixtures/" + - "data/" + +execution: + mode: "parallel" + backend: "worktree" + max_concurrent: 4 + codex_security: null + +parallel: + port_strategy: null + shared_files: ["data/clusters.db"] + exclusive_resources: [] + +dependencies: + approved: [] + +constraints: + - "Do not change the output format of clusters" + - "Preserve backward compatibility with existing cluster consumers" + +stopping: + max_iterations: 100 + max_hours: 8 + plateau_iterations: 10 + target_reached: true +``` + +### Evaluation Execution Order (per experiment) + +``` +1. Run measurement command (evaluate.py) + -> Produces JSON with gate metrics + diagnostics + -> Fast, free + +2. Check degenerate gates + -> If ANY gate fails: REVERT immediately, log as "degenerate" + -> Do NOT run the judge (saves money) + +3. If primary type is "judge": Run LLM-as-judge + -> Sample outputs according to stratification config + -> Send each sample to judge model with rubric + -> Aggregate scores per scoring config + -> This is the number the loop optimizes against + +4. Keep/revert decision + -> Based on primary metric (hard or judge score) + -> Must also pass all degenerate gates (already checked in step 2) +``` + +--- + +## Open Questions (Remaining) + +1. 
**Should the agent propose hypotheses, or should the user provide them?** + - Both — agent generates from analysis, user can inject ideas, agent prioritizes + +2. **Judge calibration across experiments** + - LLM judges can drift or be inconsistent across calls + - Should we include "anchor samples" — a fixed set of clusters with known scores — in every judge batch to detect drift? + - If anchor scores shift >0.5 from baseline, re-calibrate or flag for user review + +3. **Judge rubric iteration** + - The rubric itself might need improvement after seeing early results + - But changing the rubric mid-loop invalidates comparisons to earlier experiments + - Solution: if rubric changes, re-judge the current best with the new rubric to re-baseline? + +4. **Relationship to `/lfg` and `/slfg`?** + - `/lfg` is autonomous execution of a single task + - `/ce:optimize` is autonomous execution of an iterative search + - `/ce:optimize` can delegate each experiment to Codex (decided D5) + - Local experiments use subagent dispatch similar to `/ce:review` + +5. **Branch strategy details?** + - Main optimization branch: `optimize/` + - Each kept experiment is a commit on that branch + - Branch points create `optimize//direction-` + - All branches preserved for later reference and comparison + +6. **Batch size adaptation?** + - Should the batch size grow/shrink based on success rate? + - High success rate -> larger batches (more exploration) + - Low success rate -> smaller batches (more focused) + - Or keep it simple and let the user tune `max_concurrent` + +7. **Hypothesis diversity within a batch?** + - Should parallel experiments in the same batch be intentionally diverse? + - E.g., one threshold tweak + one new signal + one preprocessing change + - Or let the prioritization algorithm decide naturally? + +8. **Judge cost budgets?** + - Should the spec include a `max_judge_cost_usd` budget? + - When budget is exhausted, switch to hard-metrics-only mode or stop? 
+ - Or just track cost in the log and let the user decide? + +## What Makes This Different From "Just Using AutoResearch" + +AutoResearch is designed for ML training on a single GPU. CE's version needs to handle: + +1. **Multi-file changes** — real code changes span multiple files +2. **Complex metrics** — not just one scalar, but primary + guard rails + diagnostics +3. **Varied execution environments** — not just `python train.py` but arbitrary commands +4. **Integration with existing workflows** — learnings, review, ideation +5. **User-in-the-loop** — pause for approval on scope-expanding changes, inject new hypotheses +6. **Knowledge capture** — document what worked and why for the team, not just for the agent's context +7. **Non-ML domains** — clustering, search quality, API performance, test coverage, build times, etc. + +## Success Criteria for This Skill + +- User can define an optimization target in <15 minutes +- Measurement scaffolding is validated before the loop starts +- Loop runs unattended for hours, producing measurable improvement +- All experiments are preserved in git for later reference +- The winning strategy is documented as a learning +- A human reviewing the experiment log can understand what was tried and why +- The skill handles failures gracefully (bad experiments don't corrupt state) + +## Next Steps + +1. `/ce:brainstorm` this further to refine the spec format and phase gates +2. Build a prototype with the clustering use case as the driving example +3. Test with a second use case (e.g., prompt optimization, build performance) to validate generality +4. Design the measurement harness validation checks +5. 
Decide on the v1 branch/git strategy diff --git a/docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md b/docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md new file mode 100644 index 00000000..3adab6f4 --- /dev/null +++ b/docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md @@ -0,0 +1,560 @@ +--- +title: "feat(ce-optimize): Add iterative optimization loop skill" +type: feat +status: active +date: 2026-03-29 +origin: docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md +--- + +# feat(ce-optimize): Add iterative optimization loop skill + +## Overview + +Add a new `/ce:optimize` skill that implements metric-driven iterative optimization — the pattern where you define a measurable goal, build measurement scaffolding first, then run an automated loop that tries many parallel experiments, measures each against hard gates and/or LLM-as-judge quality scores, keeps improvements, and converges toward the best solution. Inspired by Karpathy's autoresearch but generalized for multi-file code changes, complex metrics, and non-ML domains. + +## Problem Frame + +CE has knowledge-compounding and quality gates but no skill for systematic experimentation. When a developer needs to improve a measurable outcome (clustering quality, build performance, search relevance), they currently iterate manually — one change at a time, eyeballing results. This skill automates the modify-measure-decide cycle, runs experiments in parallel via worktrees or Codex sandboxes, and preserves all experiment history in git for later reference. (see origin: `docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md`) + +## Requirements Trace + +- R1. User can define an optimization target (spec file) in <15 minutes +- R2. Measurement scaffolding is validated before the loop starts (hard phase gate) +- R3. 
Three-tier metric architecture: degenerate gates (cheap boolean checks) -> LLM-as-judge quality score (sampled, cost-controlled) -> diagnostics (logged, not gated) +- R4. LLM-as-judge with stratified sampling and user-defined rubric is a first-class primary metric type, not deferred +- R5. Experiments run in parallel by default using worktree isolation or Codex sandboxes +- R6. Parallelism blockers (ports, shared DBs, exclusive resources) are actively detected and mitigated during Phase 1 +- R7. Dependencies are pre-approved in bulk during hypothesis generation; unapproved deps defer the hypothesis without blocking the pipeline +- R8. Flaky metrics are configurable (repeat N times, aggregate via median/mean, noise threshold) +- R9. All experiments preserved in git for later reference; experiment log captures hypothesis, metrics, outcome, and learnings +- R10. The winning strategy is documented via `/ce:compound` integration +- R11. Codex support from v1 using established `codex exec` stdin-pipe pattern +- R12. Loop handles failures gracefully (bad experiments don't corrupt state) +- R13. Multiple stopping criteria: target reached, max iterations, max hours, plateau (N iterations no improvement), manual stop + +## Scope Boundaries + +- No tree search / backtracking in v1 — linear keep/revert with optional manual branch points only +- No batch size adaptation — fixed `max_concurrent`, user-tunable +- No LLM-as-judge calibration anchors in v1 — deferred to future iteration +- No rubric mid-loop iteration protocol in v1 +- No judge cost budget enforcement — cost tracked in log, user decides +- This plan covers the skill, reference files, and scripts. 
It does not cover changes to the CLI converter or other targets + +## Context & Research + +### Relevant Code and Patterns + +- **Skill format**: `plugins/compound-engineering/skills/ce-work/SKILL.md` — multi-phase skill with YAML frontmatter, `#$ARGUMENTS` input, parallel subagent dispatch +- **Parallel dispatch**: `plugins/compound-engineering/skills/ce-review/SKILL.md` — spawns N reviewers in parallel, merges structured JSON results +- **Subagent template**: `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` — confidence rubric, false-positive suppression +- **Codex delegation**: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — `codex exec` stdin pipe, security posture, 3-failure auto-disable, environment guard +- **Worktree management**: `plugins/compound-engineering/skills/git-worktree/SKILL.md` + `scripts/worktree-manager.sh` +- **Scratch space**: `.context/compound-engineering//` with per-run subdirs for concurrent runs +- **State file patterns**: YAML frontmatter in plan files, JSON schemas in ce:review references +- **Skill-to-skill references**: `Load the skill` for pass-through; `/ce:compound` slash syntax for published commands + +### Institutional Learnings + +- **State machine design is mandatory** for multi-phase workflows — re-read state after every transition, never carry stale values (`docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`) +- **Script-first for measurement harnesses** — 60-75% token savings by moving mechanical work (parsing, classification, aggregation) into bundled scripts (`docs/solutions/skill-design/script-first-skill-architecture.md`) +- **Confidence rubric pattern** — use 0.0-1.0 scale with explicit suppression threshold (0.60 proven in production), define false-positive categories (`ce:review subagent-template.md`) +- **Pass paths not content to sub-agents** — orchestrator discovers paths, workers read what they need 
(`docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md`) +- **State transitions must be load-bearing** — if experiment states exist (proposed/running/measured/evaluated), at least one consumer must branch on them (`docs/solutions/workflow/todo-status-lifecycle.md`) +- **Branch name sanitization** — `/` to `~` is injective for filesystem paths (`docs/solutions/developer-experience/branch-based-plugin-install-and-testing-2026-03-26.md`) + +## Key Technical Decisions + +- **Linear keep/revert with parallel batches**: Each batch runs N experiments in parallel, best-in-batch is kept if it improves on current best, all others reverted. Simpler than tree search, compatible with git-native workflows. (see origin: Decision 1) +- **Three-tier metrics**: Degenerate gates (fast, free, boolean) -> LLM-as-judge or hard primary metric -> diagnostics (logged only). Gates run first to avoid wasting judge calls on obviously broken solutions. (see origin: Decision 2) +- **LLM-as-judge via stratified sampling**: ~30 samples per evaluation, stratified by output category (small/medium/large clusters), with user-defined rubric. Cost: ~$0.30-0.90 per experiment. Judge prompt is immutable (part of measurement harness). (see origin: D4) +- **Script-first measurement**: Mechanical work (run command, parse JSON, check gates, aggregate repeats) lives in a bundled shell script. The model handles only judgment and decision-making. This follows the script-first architecture learning. +- **Worktree isolation for parallel experiments**: Each experiment gets a git worktree with copied shared resources. Codex sandboxes as opt-in alternative. Orchestrator retains git control. (see origin: D6) +- **Codex dispatch via stdin pipe**: Write prompt to temp file, pipe to `codex exec`, collect diff after completion. Security posture selected once per session. 
(see origin: D5) + +## Open Questions + +### Resolved During Planning + +- **Skill naming**: `ce:optimize` with directory `ce-optimize/`. Uses `ce:` prefix per naming convention. +- **Where does experiment state live**: `.context/compound-engineering/ce-optimize//` — contains spec, experiment log, and per-batch scratch. Cleaned after successful completion except the final experiment log which moves to the optimization branch. +- **How are experiment branches named**: `optimize/` for the main optimization branch. Per-experiment worktree branches: `optimize//exp-`. Sanitized with `/` to `~` for filesystem paths. +- **Judge model selection**: Haiku by default (fast, cheap), Sonnet optional. Specified in spec file. + +### Deferred to Implementation + +- **Exact gate check parsing**: The spec uses operator strings like `">= 0.85"` and `"<= 300"`. Parsing these in the measurement script will be straightforward but the exact implementation depends on what edge cases arise. +- **Codex exec flag compatibility**: The exact `codex exec` flags may change. The skill should check `codex --version` and adapt. +- **Worktree cleanup timing**: Whether to clean up worktrees immediately after each batch or defer to end-of-loop may depend on disk space constraints discovered at runtime. +- **Judge prompt template specifics**: The exact judge dispatch mechanism (subagent vs direct API call) depends on what the model can do within the skill context. The template structure is defined but invocation details are deferred. + +## High-Level Technical Design + +> *This illustrates the intended approach and is directional guidance for review, not implementation specification. 
The implementing agent should treat it as context, not code to reproduce.* + +``` + +-----------------+ + | User provides | + | goal + scope | + +--------+--------+ + | + +--------v--------+ + | Phase 0: Setup | + | Create/load spec| + +--------+--------+ + | + +--------v-----------+ + | Phase 1: Scaffold | + | Build/validate | + | harness + baseline | + | Probe parallelism | + +--------+-----------+ + | + [USER GATE] + | + +--------v-----------+ + | Phase 2: Hypotheses| + | Generate + approve | + | deps in bulk | + +--------+-----------+ + | + +--------------v--------------+ + | Phase 3: Optimize Loop | + | | + | +--- Batch N hypotheses | + | | | + | | +--+ Worktree/Codex | + | | | | per experiment | + | | | | implement | + | | | | measure | + | | | | collect metrics | + | | +--+ | + | | | + | +--- Evaluate batch | + | | gates -> judge -> rank | + | | KEEP best / REVERT | + | | | + | +--- Update log + backlog | + | +--- Check stop criteria | + | +--- Next batch | + +--------------+--------------+ + | + +--------v--------+ + | Phase 4: Wrap-Up| + | Summarize | + | /ce:compound | + | /ce:review | + +--------+--------+ + | + [DONE] +``` + +## Implementation Units + +### Phase A: Reference Files and Scripts (no dependencies between units) + +- [ ] **Unit 1: Optimization spec schema** + +**Goal:** Define the YAML schema for the optimization spec file that users create to configure an optimization run. 
+ +**Requirements:** R1, R3, R4, R5, R8, R13 + +**Dependencies:** None + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml` + +**Approach:** +- Define a commented YAML schema document (not JSON Schema — YAML is more readable for skill-authoring context) that the skill references to validate user-provided specs +- Cover all three metric tiers: `metric.primary` (type: hard|judge), `metric.degenerate_gates`, `metric.diagnostics`, `metric.judge` +- Include `measurement` (command, timeout, stability), `scope` (mutable/immutable), `execution` (mode, backend, max_concurrent), `parallel` (port strategy, shared files, exclusive resources), `dependencies`, `constraints`, `stopping` +- Include inline comments explaining each field, valid values, and defaults +- Use the two example specs from the brainstorm (hard-metric primary and LLM-judge primary) as validation targets + +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-review/references/findings-schema.json` for structured schema reference +- `plugins/compound-engineering/skills/ce-compound/references/schema.yaml` for YAML schema with inline comments + +**Test scenarios:** +- Schema covers all fields from both example specs in the brainstorm +- Required vs optional fields are clearly marked +- Default values are documented for every optional field + +**Verification:** +- A user reading only this file can create a valid spec without consulting other docs + +--- + +- [ ] **Unit 2: Experiment log schema** + +**Goal:** Define the YAML schema for the experiment log that accumulates across the optimization run. 
+ +**Requirements:** R9, R12 + +**Dependencies:** None + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml` + +**Approach:** +- Define the structure: baseline metrics, experiments array (iteration, batch, hypothesis, category, changes, gates, diagnostics, judge, outcome, primary_delta, learnings, commit), and best-so-far summary +- Include all experiment outcome states: `kept`, `reverted`, `degenerate`, `error`, `deferred_needs_approval`, `timeout` +- These states are load-bearing — the loop branches on them (per todo-status-lifecycle learning) + +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-compound/references/schema.yaml` + +**Test scenarios:** +- Schema covers the full experiment log example from the brainstorm +- All outcome states documented with transition rules + +**Verification:** +- An implementer reading this schema can produce or parse an experiment log without ambiguity + +--- + +- [ ] **Unit 3: Experiment worker prompt template** + +**Goal:** Define the prompt template used to dispatch each experiment to a subagent or Codex. 
+ +**Requirements:** R5, R11 + +**Dependencies:** None + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/references/experiment-prompt-template.md` + +**Approach:** +- Template with variable substitution slots: `{iteration}`, `{spec.name}`, `{current_best_metrics}`, `{hypothesis.description}`, `{scope.mutable}`, `{scope.immutable}`, `{constraints}`, `{approved_dependencies}`, `{recent_experiment_summaries}` +- Include explicit instructions: implement only, do NOT run harness, do NOT commit, do NOT modify immutable files +- Include `git diff --stat` instruction at end for orchestrator to collect changes +- Follow the path-not-content pattern — pass file paths for large context, inline only small structural data + +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` for variable substitution pattern and output contract + +**Test scenarios:** +- Template produces a clear, unambiguous prompt when all slots are filled +- Immutable file constraints are prominent and unambiguous +- Works for both subagent and Codex dispatch (no platform-specific assumptions in template body) + +**Verification:** +- An implementer can fill this template and dispatch it without needing to read other reference files + +--- + +- [ ] **Unit 4: Judge evaluation prompt template** + +**Goal:** Define the prompt template for LLM-as-judge evaluation of sampled outputs. 
+ +**Requirements:** R3, R4 + +**Dependencies:** None + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/references/judge-prompt-template.md` + +**Approach:** +- Two template sections: cluster/item evaluation (using the user's rubric from the spec) and singleton evaluation (using the user's singleton_rubric) +- Template includes: the rubric text, the sample data to evaluate, and explicit JSON output format instructions +- Include confidence calibration guidance adapted from ce:review's rubric pattern: each judge call returns a score + structured metadata +- Template is designed for Haiku by default — keep prompts concise and well-structured for smaller models +- Include the false-positive suppression concept: judge should flag if a sample is ambiguous rather than forcing a score + +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` — confidence rubric structure, JSON output contract + +**Test scenarios:** +- Template works with both the cluster coherence rubric and a generic quality rubric +- JSON output format is unambiguous and parseable +- Template handles edge cases: empty clusters, single-item clusters, very large clusters + +**Verification:** +- Filling this template with a rubric and sample data produces a prompt that a model can respond to with valid JSON + +--- + +- [ ] **Unit 5: Measurement runner script** + +**Goal:** Create a script that runs the measurement command, parses JSON output, checks degenerate gates, and handles stability repeats. 
+ +**Requirements:** R2, R3, R8, R12 + +**Dependencies:** None + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh` + +**Approach:** +- Script-first architecture per institutional learning — all mechanical work in the script, model handles only judgment +- Input: spec file path, working directory, experiment index (for port parameterization) +- Steps: apply environment parameterization (ports) -> run measurement command -> capture JSON output -> parse gate checks -> if stability mode is repeat, run N times and aggregate -> output structured result JSON +- Output: JSON with fields: `gates_passed` (bool), `gate_results` (per-gate pass/fail), `metrics` (all metric values), `diagnostics`, `error` (if command failed or timed out) +- Handle: command timeout, non-zero exit, malformed JSON output, missing expected metric fields +- Gate check parsing: support operators `>=`, `<=`, `>`, `<`, `==`, `!=` against numeric or boolean values +- Stability aggregation: support `median`, `mean`, `min`, `max` across repeated runs + +**Patterns to follow:** +- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh` for script structure and error handling +- `plugins/compound-engineering/skills/claude-permissions-optimizer/` for script-first pattern + +**Test scenarios:** +- Command succeeds: JSON parsed, gates evaluated, result returned +- Command fails (non-zero exit): error captured, gates_passed = false +- Command times out: timeout handled, error recorded +- Gate fails: gates_passed = false, specific gate failure identified +- Stability mode: N runs aggregated correctly with median/mean +- Malformed JSON: error captured gracefully + +**Verification:** +- Script can be run standalone with a spec file and produces valid JSON output + +--- + +- [ ] **Unit 6: Parallelism probe script** + +**Goal:** Create a script that detects common parallelism blockers in the target project. 
+ +**Requirements:** R5, R6 + +**Dependencies:** None + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh` + +**Approach:** +- Input: spec file path (for measurement command and mutable scope), project directory +- Checks: + 1. Port detection: search measurement command output and config files for hardcoded port patterns (`:\d{4,5}`, `PORT=`, `--port`, `bind`, `listen`) + 2. Shared file detection: check for SQLite files (`.db`, `.sqlite`, `.sqlite3`), local file stores in mutable/measurement paths + 3. Lock file detection: check for `.lock`, `.pid` files created by the measurement command + 4. Resource contention: check for GPU references (`cuda`, `torch.device`, `gpu`), large memory markers +- Output: JSON with `mode` (parallel|serial|user-decision), `blockers_found` array, `mitigations` array, `unresolved` array +- This is advisory — the skill presents results to the user for approval, does not auto-mitigate + +**Patterns to follow:** +- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh` + +**Test scenarios:** +- No blockers found: mode = parallel +- Port hardcoded: detected and reported with suggested mitigation +- SQLite file in scope: detected and reported +- Multiple blockers: all listed + +**Verification:** +- Script can be run against a sample project directory and produces valid JSON + +--- + +- [ ] **Unit 7: Experiment worktree manager script** + +**Goal:** Create a script that manages experiment worktrees — creation with shared file copying, and cleanup. 
+ +**Requirements:** R5, R6, R12 + +**Dependencies:** None + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh` + +**Approach:** +- Subcommands: `create`, `cleanup`, `cleanup-all` +- `create`: takes spec name, experiment index, list of shared files to copy, base branch + - Creates worktree at `.claude/worktrees/optimize--exp-/` on branch `optimize//exp-` + - Copies shared files from main repo into worktree + - Copies `.env`, `.env.local` if they exist (per existing worktree convention) + - Applies port parameterization if configured (writes env var to worktree's `.env`) + - Returns worktree path +- `cleanup`: removes a single experiment worktree and its branch +- `cleanup-all`: removes all experiment worktrees for a given spec name +- Error handling: verify git repo, check for existing worktrees, handle cleanup of partially created worktrees + +**Patterns to follow:** +- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh` — worktree creation, `.env` copying, branch management + +**Test scenarios:** +- Create worktree: directory exists, branch created, shared files copied +- Create with port parameterization: env var written to worktree +- Cleanup: worktree removed, branch deleted +- Cleanup-all: all experiment worktrees for spec removed +- Partial failure: cleanup handles partially created state + +**Verification:** +- Script can create and clean up worktrees in a test git repo + +--- + +### Phase B: Core Skill (depends on all Phase A units) + +- [ ] **Unit 8: Core SKILL.md** + +**Goal:** Create the main `/ce:optimize` skill file implementing the full four-phase workflow. + +**Requirements:** R1-R13 (all) + +**Dependencies:** Units 1-7 (all reference files and scripts must exist) + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-optimize/SKILL.md` + +**Approach:** + +The SKILL.md is the primary artifact. It orchestrates all phases using the reference files and scripts. 
Key design points: + +*Frontmatter:* +- `name: ce:optimize` +- `description:` — rich description covering what it does (iterative optimization), when to use it (measurable improvement goals), and key capabilities (parallel experiments, LLM-as-judge, git-native history) +- No `disable-model-invocation` — this is a v1 skill, not beta + +*Phase 0: Setup* +- Accept spec file path as argument, or interactively create one guided by the schema reference +- Validate spec against schema (required fields, valid metric types, valid operators) +- Search learnings via `compound-engineering:research:learnings-researcher` for prior optimization work on similar topics +- Create optimization branch: `optimize/` +- Create scratch directory: `.context/compound-engineering/ce-optimize//` + +*Phase 1: Measurement Scaffolding (HARD GATE)* +- If user provides measurement harness: run it once, validate JSON output matches expected metric names, present baseline to user +- If agent must build harness: analyze codebase, build `evaluate.py` (or equivalent), validate it, present baseline to user +- Run parallelism probe script, present results +- If stability mode is repeat: run harness N times, validate variance within noise threshold +- GATE: Present baseline metrics + parallel readiness to user. Use platform question tool. Refuse to proceed until approved. +- State re-read: after gate approval, re-read spec and baseline from disk (per state-machine learning) + +*Phase 2: Hypothesis Generation* +- Analyze mutable scope code to understand current approach +- Generate hypothesis list — optionally via `compound-engineering:research:repo-research-analyst` for deeper codebase analysis +- Categorize hypotheses (signal-extraction, graph-signals, embedding, algorithm, preprocessing, etc.) 
+- Identify new dependencies across all hypotheses +- Present dependency list for bulk approval via platform question tool +- Record hypothesis backlog in experiment log file +- Include user-provided hypotheses if any were given as input + +*Phase 3: Optimization Loop* +- For each batch: + 1. Select hypotheses (batch_size = min(backlog_size, max_concurrent)) + 2. Prefer diversity across categories within each batch + 3. Dispatch experiments in parallel: + - **Worktree backend**: create worktree per experiment (via script), dispatch subagent with experiment prompt template + - **Codex backend**: write prompt to temp file, dispatch via `codex exec` stdin pipe (per ce-work-beta pattern) + - Environment guard: check for `CODEX_SANDBOX`/`CODEX_SESSION_ID` to prevent recursive delegation + 4. Wait for batch completion + 5. For each completed experiment: + - Run measurement script in the experiment's worktree + - If gates pass and primary type is judge: run LLM-as-judge evaluation + - Record all results in experiment log + 6. Evaluate batch: + - Rank by primary metric improvement (hard metric value or judge mean_score) + - If best improves on current best: KEEP (merge experiment branch to optimization branch, commit) + - Handle deferred deps: mark hypothesis, continue pipeline + - All others: REVERT (log, cleanup worktree) + - For non-overlapping runners-up that also improved: consider keeping if merge is clean and re-measurement confirms + 7. Update experiment log with all results + 8. Generate new hypotheses based on learnings from this batch + 9. Check stopping criteria (target reached, max iterations, max hours, plateau, manual stop) + 10. 
State re-read: re-read current best from experiment log before next batch + +*Phase 4: Wrap-Up* +- Present deferred hypotheses needing dep approval (if any) +- Summarize: baseline -> final metrics, total iterations run, kept count, reverted count, judge cost total +- Preserve optimization branch with all commits +- Offer post-completion options via platform question tool: + 1. Run `/ce:review` on cumulative diff (baseline -> final) + 2. Run `/ce:compound` to document the winning strategy + 3. Create PR from optimization branch + 4. Continue with more experiments (re-enter Phase 3) + 5. Done + +*Cross-cutting concerns:* +- **Codex failure cascade**: 3 consecutive delegate failures auto-disable Codex for remaining experiments, fall back to subagent +- **Error handling**: experiment errors (command crash, timeout, malformed output) are logged as `outcome: error` and the experiment is reverted. The loop continues. +- **Progress reporting**: after each batch, report: batch N of ~M, experiments run, current best metric, improvement from baseline +- **Manual stop**: if user interrupts, save current experiment log state and offer wrap-up + +**Execution note:** Execution target: external-delegate for the SKILL.md writing itself (it is large and well-specified) + +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-work/SKILL.md` — multi-phase structure, progress reporting, error handling +- `plugins/compound-engineering/skills/ce-review/SKILL.md` — parallel subagent dispatch, structured result merging +- `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — Codex delegation section +- `plugins/compound-engineering/skills/ce-ideate/SKILL.md` — hypothesis generation, categorization, iterative refinement + +**Test scenarios:** +- Spec with hard primary metric: gates + hard metric evaluation, no judge calls +- Spec with judge primary metric: gates -> judge -> keep/revert based on judge score +- Parallel batch of 4 experiments: all dispatched, results 
collected, best kept, others reverted +- Experiment that violates degenerate gate: immediately reverted, no judge call +- Experiment needing unapproved dep: deferred, pipeline continues +- Codex dispatch failure: fallback to subagent after 3 failures +- Plateau stopping: 10 consecutive batches with no improvement -> stop +- Flaky metric with repeat mode: 5 runs, median aggregation, noise threshold applied + +**Verification:** +- Skill loads without errors in Claude Code +- YAML frontmatter passes `bun test tests/frontmatter.test.ts` +- All reference file paths use backtick syntax (no markdown links) +- Cross-platform question tool pattern used for all user interactions +- Script references use relative paths (`bash scripts/measure.sh`) +- No hardcoded tool names — capability-class descriptions with platform hints + +--- + +### Phase C: Registration (depends on Unit 8) + +- [ ] **Unit 9: Plugin registration and validation** + +**Goal:** Register the new skill in plugin documentation and validate consistency. 
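One of the consistency checks in this unit — the README's claimed skill count matching the actual number of skills — can be sketched mechanically. A hypothetical helper, assuming skills live as one directory each under a `skills/` folder; the authoritative check remains `bun run release:validate`:

```shell
# Hypothetical consistency helper: does the README's claimed skill count
# match the number of skill directories? (Directory layout is an assumption;
# the plan's authoritative check is `bun run release:validate`.)
skill_count_matches() {
  local claimed="$1" skills_dir="$2"
  local actual
  # Count only top-level directories, one per skill
  actual=$(find "$skills_dir" -mindepth 1 -maxdepth 1 -type d | wc -l)
  [ "$claimed" -eq "$actual" ]
}
```

This is only a spot-check sketch; it deliberately ignores README parsing and just compares a claimed number against the directory listing.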
+ +**Requirements:** R1 + +**Dependencies:** Unit 8 + +**Files:** +- Modify: `plugins/compound-engineering/README.md` + +**Approach:** +- Add `ce:optimize` to the skills table in README.md with description +- Update skill count in README.md +- Run `bun run release:validate` to verify plugin consistency +- Do NOT bump version in plugin.json or marketplace.json (per versioning rules) + +**Patterns to follow:** +- Existing skill table entries in `plugins/compound-engineering/README.md` + +**Test scenarios:** +- `bun run release:validate` passes +- Skill count in README matches actual skill count +- Skill table entry is alphabetically placed and has accurate description + +**Verification:** +- `bun run release:validate` exits 0 +- `bun test` passes (especially frontmatter tests) + +## System-Wide Impact + +- **Interaction graph:** The skill dispatches to learnings-researcher (Phase 0), repo-research-analyst (Phase 2), and optionally ce:review and ce:compound (Phase 4). It creates git worktrees and branches. It invokes Codex as an external process. +- **Error propagation:** Experiment failures are contained — each runs in an isolated worktree. Failures are logged and reverted. The optimization branch only advances on successful, validated improvements. +- **State lifecycle risks:** The experiment log is the critical state artifact. It must be written atomically (write to temp, rename) to prevent corruption from mid-write interrupts. Worktree cleanup must handle partial creation states. +- **API surface parity:** This is a new skill, no existing surface to maintain parity with. +- **Integration coverage:** The parallelism readiness probe should be validated against real projects with known blockers (SQLite DBs, hardcoded ports) to ensure detection works. + +## Risks & Dependencies + +- **Codex exec flags may change** — the skill should detect `codex` version and adapt. Mitigate by checking `codex --version` before first dispatch. 
+- **Worktree disk usage** — parallel experiments with large repos consume disk. Mitigate by cleaning up worktrees immediately after measurement and offering a `worktree_cleanup: immediate` option. +- **LLM-as-judge consistency** — judge scores may vary across calls for the same input. Mitigate by using fixed sample seeds and logging per-sample scores for post-hoc analysis. v2 can add anchor-based calibration. +- **Long-running unattended execution** — the loop may run for hours. Mitigate by saving experiment log after every batch (not just at end) and designing for graceful resume from saved state. + +## Documentation / Operational Notes + +- Update `plugins/compound-engineering/README.md` skill table +- No new MCP servers or external dependencies for the plugin itself +- The skill will appear in Claude Code's skill list automatically once the SKILL.md exists + +## Sources & References + +- **Origin document:** [docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md](docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md) +- Related code: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` (Codex delegation), `plugins/compound-engineering/skills/ce-review/SKILL.md` (parallel dispatch) +- Related PRs: #364 (Codex security posture), #365 (Codex exec pitfalls) +- External: Karpathy autoresearch (github.com/karpathy/autoresearch), AIDE/WecoAI (github.com/WecoAI/aideml) +- Learnings: `docs/solutions/skill-design/script-first-skill-architecture.md`, `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`, `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md`, `docs/solutions/workflow/todo-status-lifecycle.md` From 76577cd9c1f2701d12fc088594e4ffc3318cd9d4 Mon Sep 17 00:00:00 2001 From: huntharo Date: Sun, 29 Mar 2026 20:47:00 -0400 Subject: [PATCH 2/6] feat(ce-optimize): add iterative optimization loop skill New /ce:optimize skill for metric-driven iterative 
optimization. Defines a measurable goal, builds measurement scaffolding first, then runs parallel experiments via worktrees or Codex that converge toward the best solution. Key capabilities: - Three-tier metrics: degenerate gates -> LLM-as-judge or hard metric -> diagnostics - Parallel experiments via git worktrees (max 6) or Codex sandboxes - Stratified sampling with user-defined rubrics for LLM-as-judge - Parallelism blocker detection (ports, SQLite, GPU) - Rolling context window + strategy digest for long runs - Git-native history with all experiments preserved - Integration with /ce:compound and /ce:review at wrap-up Includes SKILL.md (4-phase workflow), 4 reference files (spec schema, experiment log schema, experiment prompt template, judge prompt template), and 3 scripts (measure.sh, parallel-probe.sh, experiment-worktree.sh). Co-Authored-By: Claude Opus 4.6 (1M context) --- ...ative-optimization-loop-skill-beta-plan.md | 278 +++++++---- plugins/compound-engineering/README.md | 4 +- .../skills/ce-optimize/SKILL.md | 467 ++++++++++++++++++ .../references/experiment-log-schema.yaml | 241 +++++++++ .../references/experiment-prompt-template.md | 89 ++++ .../references/judge-prompt-template.md | 110 +++++ .../references/optimize-spec-schema.yaml | 383 ++++++++++++++ .../scripts/experiment-worktree.sh | 221 +++++++++ .../skills/ce-optimize/scripts/measure.sh | 47 ++ .../ce-optimize/scripts/parallel-probe.sh | 96 ++++ 10 files changed, 1848 insertions(+), 88 deletions(-) create mode 100644 plugins/compound-engineering/skills/ce-optimize/SKILL.md create mode 100644 plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml create mode 100644 plugins/compound-engineering/skills/ce-optimize/references/experiment-prompt-template.md create mode 100644 plugins/compound-engineering/skills/ce-optimize/references/judge-prompt-template.md create mode 100644 plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml create 
mode 100755 plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh create mode 100755 plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh create mode 100755 plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh diff --git a/docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md b/docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md index 3adab6f4..7bdefba9 100644 --- a/docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md +++ b/docs/plans/2026-03-29-001-feat-iterative-optimization-loop-skill-beta-plan.md @@ -1,9 +1,10 @@ --- title: "feat(ce-optimize): Add iterative optimization loop skill" type: feat -status: active +status: completed date: 2026-03-29 origin: docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md +deepened: 2026-03-29 --- # feat(ce-optimize): Add iterative optimization loop skill @@ -67,26 +68,33 @@ CE has knowledge-compounding and quality gates but no skill for systematic exper - **Linear keep/revert with parallel batches**: Each batch runs N experiments in parallel, best-in-batch is kept if it improves on current best, all others reverted. Simpler than tree search, compatible with git-native workflows. (see origin: Decision 1) - **Three-tier metrics**: Degenerate gates (fast, free, boolean) -> LLM-as-judge or hard primary metric -> diagnostics (logged only). Gates run first to avoid wasting judge calls on obviously broken solutions. (see origin: Decision 2) -- **LLM-as-judge via stratified sampling**: ~30 samples per evaluation, stratified by output category (small/medium/large clusters), with user-defined rubric. Cost: ~$0.30-0.90 per experiment. Judge prompt is immutable (part of measurement harness). (see origin: D4) -- **Script-first measurement**: Mechanical work (run command, parse JSON, check gates, aggregate repeats) lives in a bundled shell script. 
The model handles only judgment and decision-making. This follows the script-first architecture learning. -- **Worktree isolation for parallel experiments**: Each experiment gets a git worktree with copied shared resources. Codex sandboxes as opt-in alternative. Orchestrator retains git control. (see origin: D6) +- **LLM-as-judge via stratified sampling**: ~30 samples per evaluation, stratified by output category (small/medium/large clusters), with user-defined rubric. Cost: ~$0.30-0.90 per experiment. Judge prompt is immutable (part of measurement harness). Judge score requires `minimum_improvement` (default 0.3 on a 1-5 scale) to accept as "better" — this accounts for sample-composition variance when output structure changes between experiments. (see origin: D4) +- **Model-parsed spec, script-executed measurement**: The orchestrating agent reads and parses the YAML spec file directly (agents are natively capable of YAML handling). The measurement script receives flat arguments (command, timeout, working directory), runs the command, and returns raw JSON output. The agent evaluates gates and aggregates stability repeats. This follows the established plugin pattern where no shell scripts parse YAML — the model interprets structure, scripts handle I/O. +- **Parallel-batch merge strategy**: When multiple experiments in a batch improve the metric: (1) Keep the best experiment, merge to optimization branch. (2) For each runner-up that also improved: check **file-level disjointness** with the kept experiment (same file modified by both = overlapping, even if different lines). (3) If disjoint: cherry-pick runner-up onto new baseline, re-run full measurement. (4) If combined measurement is strictly better: keep the cherry-pick. Otherwise revert and log as "promising alone but neutral/harmful in combination." (5) Process runners-up in descending metric order; stop after first failed combination. Config: `max_runner_up_merges_per_batch` (default: 1). 
Rationale: two changes that each independently improve a metric can interfere when combined (e.g., one tightens thresholds while another loosens them). This is expected, not a bug. +- **Worktree isolation for parallel experiments**: Each experiment gets a git worktree under `.worktrees/` (aligned with existing convention) with copied shared resources. Codex sandboxes as opt-in alternative. Orchestrator retains git control. Max concurrent capped at 6 for worktree backend (git performance degrades beyond ~10-15 concurrent worktrees); 8+ only valid for Codex backend. (see origin: D6) - **Codex dispatch via stdin pipe**: Write prompt to temp file, pipe to `codex exec`, collect diff after completion. Security posture selected once per session. (see origin: D5) +- **Context window management via rolling window + strategy digest**: The experiment log grows unboundedly (20-30 lines per experiment). The orchestrator does NOT read the full log each iteration. Instead: (1) maintain a rolling window of the last 10 experiments in working memory, (2) after each batch write a strategy digest summarizing what categories have been tried, what succeeded/failed, and the exploration frontier, (3) read the full log only in filtered sections (e.g., by category) when checking whether a specific hypothesis was already tried. The full log remains the durable ground truth on disk. +- **Judge dispatch via batched parallel sub-agents**: Orchestrator selects samples per stratification config, groups them into batches of `judge.batch_size` (default: 10), dispatches `ceil(sample_size / batch_size)` parallel sub-agents. Each sub-agent evaluates its batch and returns structured JSON scores. Orchestrator aggregates. This follows the ce:review parallel reviewer dispatch pattern and avoids the overhead of spawning one sub-agent per sample. ## Open Questions ### Resolved During Planning - **Skill naming**: `ce:optimize` with directory `ce-optimize/`. Uses `ce:` prefix per naming convention. 
-- **Where does experiment state live**: `.context/compound-engineering/ce-optimize//` — contains spec, experiment log, and per-batch scratch. Cleaned after successful completion except the final experiment log which moves to the optimization branch. +- **Where does experiment state live**: `.context/compound-engineering/ce-optimize//` — contains spec, experiment log, strategy digest, and per-batch scratch. Cleaned after successful completion except the final experiment log which moves to the optimization branch. - **How are experiment branches named**: `optimize/` for the main optimization branch. Per-experiment worktree branches: `optimize//exp-`. Sanitized with `/` to `~` for filesystem paths. - **Judge model selection**: Haiku by default (fast, cheap), Sonnet optional. Specified in spec file. +- **Who parses the YAML spec**: The orchestrating agent (model), not the measurement script. No CE scripts parse YAML — the established pattern is model reads structure, scripts handle I/O. The measurement script receives flat arguments and returns raw JSON. +- **Judge dispatch mechanism**: Batched parallel sub-agents following ce:review pattern. Orchestrator selects samples, groups into batches of `judge.batch_size` (default: 10), dispatches parallel sub-agents, aggregates JSON scores. +- **Branch collision on re-run**: Phase 0 detects existing `optimize/` branch and experiment log. Presents user with choice: resume (inherit existing state, continue from last iteration) or fresh start (archive old branch to `optimize//archived-`, clear log). +- **Judge score comparability**: Add `judge.minimum_improvement` (default: 0.3 on 1-5 scale) as minimum improvement to accept. This accounts for sample-composition variance when output structure changes. Distinct from `noise_threshold` which handles run-to-run flakiness. ### Deferred to Implementation -- **Exact gate check parsing**: The spec uses operator strings like `">= 0.85"` and `"<= 300"`. 
Parsing these in the measurement script will be straightforward but the exact implementation depends on what edge cases arise. +- **Exact gate check evaluation**: The agent interprets operator strings like `">= 0.85"` from the spec and evaluates them against metric values. The exact edge cases depend on what metric shapes users provide. - **Codex exec flag compatibility**: The exact `codex exec` flags may change. The skill should check `codex --version` and adapt. - **Worktree cleanup timing**: Whether to clean up worktrees immediately after each batch or defer to end-of-loop may depend on disk space constraints discovered at runtime. -- **Judge prompt template specifics**: The exact judge dispatch mechanism (subagent vs direct API call) depends on what the model can do within the skill context. The template structure is defined but invocation details are deferred. +- **Harness bug discovered mid-loop**: If the measurement harness itself has a bug discovered during the loop, the user must fix it manually. The harness is immutable by design — the agent cannot modify it. After the fix, the user should re-baseline and resume (or start fresh). The exact UX for this depends on implementation. ## High-Level Technical Design @@ -276,9 +284,9 @@ CE has knowledge-compounding and quality gates but no skill for systematic exper - [ ] **Unit 5: Measurement runner script** -**Goal:** Create a script that runs the measurement command, parses JSON output, checks degenerate gates, and handles stability repeats. +**Goal:** Create a script that runs the measurement command, captures JSON output, and handles timeouts and errors. The orchestrating agent (not this script) evaluates gates and handles stability repeats. 
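To make that division of labor concrete, here is a minimal sketch of the runner's core under the flat-argument contract: the function name, argument order, and use of coreutils `timeout` are assumptions for illustration, not the final implementation.

```shell
# Hypothetical core of measure.sh: flat args in, raw command output out.
# Usage: run_measurement "<command>" <timeout_seconds> <working_dir> [KEY=VALUE ...]
run_measurement() {
  local cmd="$1" timeout_secs="$2" workdir="$3"
  shift 3
  local kv
  # Apply optional env vars, e.g. PORT=4001 for port parameterization
  for kv in "$@"; do export "$kv"; done
  # Run in a subshell so the cd does not leak; stdout (the expected JSON) and
  # stderr pass through untouched, and the command's exit code is preserved
  ( cd "$workdir" && timeout "$timeout_secs" bash -c "$cmd" )
}
```

Per the plan, nothing else happens here: no YAML parsing, no gate evaluation, no repeat aggregation — the orchestrating agent does all of that after reading this output.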
-**Requirements:** R2, R3, R8, R12 +**Requirements:** R2, R12 **Dependencies:** None @@ -286,28 +294,25 @@ CE has knowledge-compounding and quality gates but no skill for systematic exper - Create: `plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh` **Approach:** -- Script-first architecture per institutional learning — all mechanical work in the script, model handles only judgment -- Input: spec file path, working directory, experiment index (for port parameterization) -- Steps: apply environment parameterization (ports) -> run measurement command -> capture JSON output -> parse gate checks -> if stability mode is repeat, run N times and aggregate -> output structured result JSON -- Output: JSON with fields: `gates_passed` (bool), `gate_results` (per-gate pass/fail), `metrics` (all metric values), `diagnostics`, `error` (if command failed or timed out) -- Handle: command timeout, non-zero exit, malformed JSON output, missing expected metric fields -- Gate check parsing: support operators `>=`, `<=`, `>`, `<`, `==`, `!=` against numeric or boolean values -- Stability aggregation: support `median`, `mean`, `min`, `max` across repeated runs +- Division of labor follows established plugin pattern: scripts handle I/O, the model interprets structure +- Input: flat positional arguments only — command to run, timeout in seconds, working directory, optional environment variables (KEY=VALUE pairs for port parameterization) +- Steps: set environment variables -> cd to working directory -> run measurement command with timeout -> capture stdout (expected JSON) and stderr (for error context) -> exit with the command's exit code +- Output: raw JSON from the measurement command to stdout, stderr passed through. 
No post-processing, no YAML parsing, no gate evaluation — the orchestrating agent handles all of that after reading the script's output +- Handle: command timeout (via `timeout` command), non-zero exit (pass through), stderr capture for error diagnosis +- The script does NOT: parse YAML spec files, evaluate gate checks, aggregate stability repeats, or produce structured result envelopes. These are all orchestrator responsibilities. **Patterns to follow:** -- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh` for script structure and error handling -- `plugins/compound-engineering/skills/claude-permissions-optimizer/` for script-first pattern +- `plugins/compound-engineering/skills/git-worktree/scripts/worktree-manager.sh` — flat positional arguments, no structured data parsing +- `plugins/compound-engineering/skills/resolve-pr-feedback/scripts/get-pr-comments` — simple script that runs a command and returns JSON **Test scenarios:** -- Command succeeds: JSON parsed, gates evaluated, result returned -- Command fails (non-zero exit): error captured, gates_passed = false -- Command times out: timeout handled, error recorded -- Gate fails: gates_passed = false, specific gate failure identified -- Stability mode: N runs aggregated correctly with median/mean -- Malformed JSON: error captured gracefully +- Command succeeds: JSON output passed through to stdout +- Command fails (non-zero exit): exit code passed through, stderr available +- Command times out: timeout exit code returned +- Environment variables applied: PORT env var set before command runs **Verification:** -- Script can be run standalone with a spec file and produces valid JSON output +- Script can be run standalone with a command and timeout and returns the command's raw output --- @@ -386,128 +391,220 @@ CE has knowledge-compounding and quality gates but no skill for systematic exper ### Phase B: Core Skill (depends on all Phase A units) -- [ ] **Unit 8: Core SKILL.md** +- [ ] 
**Unit 8: SKILL.md — Phase 0 (Setup) and Phase 1 (Measurement Scaffolding)** -**Goal:** Create the main `/ce:optimize` skill file implementing the full four-phase workflow. +**Goal:** Create the SKILL.md file with frontmatter, Phase 0 (setup, spec validation, run identity, learnings search), and Phase 1 (harness validation, baseline, parallelism probe, clean-tree gate, user approval gate). -**Requirements:** R1-R13 (all) +**Requirements:** R1, R2, R6, R8 -**Dependencies:** Units 1-7 (all reference files and scripts must exist) +**Dependencies:** Units 1-7 **Files:** - Create: `plugins/compound-engineering/skills/ce-optimize/SKILL.md` **Approach:** -The SKILL.md is the primary artifact. It orchestrates all phases using the reference files and scripts. Key design points: - *Frontmatter:* - `name: ce:optimize` - `description:` — rich description covering what it does (iterative optimization), when to use it (measurable improvement goals), and key capabilities (parallel experiments, LLM-as-judge, git-native history) - No `disable-model-invocation` — this is a v1 skill, not beta *Phase 0: Setup* -- Accept spec file path as argument, or interactively create one guided by the schema reference -- Validate spec against schema (required fields, valid metric types, valid operators) +- Accept spec file path as argument, or interactively create one guided by the spec schema reference (`references/optimize-spec-schema.yaml`) +- Agent reads and validates spec (required fields, valid metric types, valid operators). Agent parses YAML natively — no shell script parsing. - Search learnings via `compound-engineering:research:learnings-researcher` for prior optimization work on similar topics -- Create optimization branch: `optimize/` +- **Run identity detection**: Check if `optimize/` branch already exists. If yes, check for existing experiment log. 
Present user with choice via platform question tool: resume (inherit state, continue from last iteration) or fresh start (archive old branch to `optimize//archived-`, clear log) +- Create or switch to optimization branch - Create scratch directory: `.context/compound-engineering/ce-optimize//` *Phase 1: Measurement Scaffolding (HARD GATE)* -- If user provides measurement harness: run it once, validate JSON output matches expected metric names, present baseline to user -- If agent must build harness: analyze codebase, build `evaluate.py` (or equivalent), validate it, present baseline to user +- **Clean-tree gate**: Verify `git status` shows no uncommitted changes to files within `scope.mutable` or `scope.immutable`. If dirty, require commit or stash before proceeding. +- If user provides measurement harness: run it once via measurement script (pass command and timeout as flat args), validate JSON output matches expected metric names, present baseline to user +- If agent must build harness: analyze codebase, build evaluation script, validate it, present baseline to user - Run parallelism probe script, present results -- If stability mode is repeat: run harness N times, validate variance within noise threshold -- GATE: Present baseline metrics + parallel readiness to user. Use platform question tool. Refuse to proceed until approved. +- **Worktree budget check**: Count existing worktrees. Warn if total + `max_concurrent` would exceed 12. +- If stability mode is repeat: run harness `repeat_count` times, agent aggregates results (median/mean/min/max), validate variance within `noise_threshold` +- GATE: Present baseline metrics + parallel readiness + clean-tree status to user. Use platform question tool. Refuse to proceed until approved. 
- State re-read: after gate approval, re-read spec and baseline from disk (per state-machine learning) +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 0 input triage and Phase 1 setup pattern +- `plugins/compound-engineering/skills/ce-plan/SKILL.md` — Phase 0 resume detection pattern + +**Test scenarios:** +- Spec validation catches missing required fields +- Existing optimization branch detected: resume and fresh-start paths both work +- Clean-tree gate: blocks on dirty worktree, passes on clean +- Baseline measurement: harness runs and produces valid JSON +- Parallelism probe: blockers detected and presented + +**Verification:** +- YAML frontmatter passes `bun test tests/frontmatter.test.ts` +- All reference file paths use backtick syntax (no markdown links) +- Cross-platform question tool pattern used for user gate + +--- + +- [ ] **Unit 9: SKILL.md — Phase 2 (Hypothesis Generation)** + +**Goal:** Add Phase 2 to the SKILL.md — hypothesis generation, categorization, dependency pre-approval, and backlog recording. + +**Requirements:** R7 + +**Dependencies:** Unit 8 + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-optimize/SKILL.md` + +**Approach:** + *Phase 2: Hypothesis Generation* - Analyze mutable scope code to understand current approach - Generate hypothesis list — optionally via `compound-engineering:research:repo-research-analyst` for deeper codebase analysis - Categorize hypotheses (signal-extraction, graph-signals, embedding, algorithm, preprocessing, etc.) 
- Identify new dependencies across all hypotheses - Present dependency list for bulk approval via platform question tool -- Record hypothesis backlog in experiment log file +- Record hypothesis backlog in experiment log file (with dep approval status per hypothesis) - Include user-provided hypotheses if any were given as input +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-ideate/SKILL.md` — hypothesis generation, categorization, iterative refinement + +**Test scenarios:** +- Hypotheses generated from codebase analysis +- User-provided hypotheses merged into backlog +- Dependencies identified and presented for bulk approval +- Hypotheses needing unapproved deps marked in backlog + +**Verification:** +- Hypothesis backlog recorded in experiment log with categories and dep status + +--- + +- [ ] **Unit 10: SKILL.md — Phase 3 (Optimization Loop)** + +**Goal:** Add Phase 3 to the SKILL.md — the core parallel batch dispatch, measurement, judge evaluation, keep/revert logic, and stopping criteria. This is the largest and riskiest unit. + +**Requirements:** R3, R4, R5, R9, R11, R12, R13 + +**Dependencies:** Unit 9 + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-optimize/SKILL.md` + +**Approach:** + *Phase 3: Optimization Loop* - For each batch: - 1. Select hypotheses (batch_size = min(backlog_size, max_concurrent)) - 2. Prefer diversity across categories within each batch - 3. Dispatch experiments in parallel: - - **Worktree backend**: create worktree per experiment (via script), dispatch subagent with experiment prompt template + 1. Select hypotheses (batch_size = min(backlog_size, max_concurrent)). Prefer diversity across categories within each batch. + 2. 
Dispatch experiments in parallel: + - **Worktree backend**: create worktree per experiment (via script), dispatch subagent with experiment prompt template (`references/experiment-prompt-template.md`) - **Codex backend**: write prompt to temp file, dispatch via `codex exec` stdin pipe (per ce-work-beta pattern) - Environment guard: check for `CODEX_SANDBOX`/`CODEX_SESSION_ID` to prevent recursive delegation - 4. Wait for batch completion - 5. For each completed experiment: - - Run measurement script in the experiment's worktree - - If gates pass and primary type is judge: run LLM-as-judge evaluation + 3. Wait for batch completion + 4. For each completed experiment: + - Run measurement script in the experiment's worktree (flat args: command, timeout, working dir, env vars) + - Agent reads raw JSON output, evaluates degenerate gates + - If gates pass and primary type is judge: dispatch batched parallel judge sub-agents per judge prompt template (`references/judge-prompt-template.md`). Group samples into batches of `judge.batch_size` (default: 10), dispatch `ceil(sample_size / batch_size)` sub-agents. Aggregate returned JSON scores. + - If gates pass and primary type is hard: use hard metric value directly - Record all results in experiment log - 6. Evaluate batch: - - Rank by primary metric improvement (hard metric value or judge mean_score) - - If best improves on current best: KEEP (merge experiment branch to optimization branch, commit) - - Handle deferred deps: mark hypothesis, continue pipeline + 5. 
Evaluate batch using the parallel-batch merge strategy (see Key Technical Decisions): + - Rank by primary metric improvement (hard metric delta or judge `mean_score` delta, must exceed `minimum_improvement`) + - Best improves on current: KEEP (merge experiment branch to optimization branch) + - Check file-disjoint runners-up: cherry-pick, re-measure, keep if combined is strictly better + - Handle deferred deps: mark hypothesis `deferred_needs_approval`, continue - All others: REVERT (log, cleanup worktree) - - For non-overlapping runners-up that also improved: consider keeping if merge is clean and re-measurement confirms - 7. Update experiment log with all results - 8. Generate new hypotheses based on learnings from this batch + 6. Update experiment log with ALL results from this batch + 7. Write strategy digest summarizing categories tried, successes, failures, exploration frontier + 8. Generate new hypotheses based on learnings from this batch (read rolling window of last 10 experiments + strategy digest, not full log) 9. Check stopping criteria (target reached, max iterations, max hours, plateau, manual stop) 10. State re-read: re-read current best from experiment log before next batch -*Phase 4: Wrap-Up* -- Present deferred hypotheses needing dep approval (if any) -- Summarize: baseline -> final metrics, total iterations run, kept count, reverted count, judge cost total -- Preserve optimization branch with all commits -- Offer post-completion options via platform question tool: - 1. Run `/ce:review` on cumulative diff (baseline -> final) - 2. Run `/ce:compound` to document the winning strategy - 3. Create PR from optimization branch - 4. Continue with more experiments (re-enter Phase 3) - 5. 
Done - *Cross-cutting concerns:* - **Codex failure cascade**: 3 consecutive delegate failures auto-disable Codex for remaining experiments, fall back to subagent - **Error handling**: experiment errors (command crash, timeout, malformed output) are logged as `outcome: error` and the experiment is reverted. The loop continues. -- **Progress reporting**: after each batch, report: batch N of ~M, experiments run, current best metric, improvement from baseline +- **Progress reporting**: after each batch, report: batch N of ~M, experiments run, current best metric, improvement from baseline, cumulative judge cost - **Manual stop**: if user interrupts, save current experiment log state and offer wrap-up +- **Crash recovery**: each experiment writes a `result.yaml` marker in its worktree upon measurement completion. On resume, scan for completed-but-unlogged experiments before starting a new batch. -**Execution note:** Execution target: external-delegate for the SKILL.md writing itself (it is large and well-specified) +**Execution note:** Execution target: external-delegate — this unit is large and well-specified **Patterns to follow:** -- `plugins/compound-engineering/skills/ce-work/SKILL.md` — multi-phase structure, progress reporting, error handling -- `plugins/compound-engineering/skills/ce-review/SKILL.md` — parallel subagent dispatch, structured result merging +- `plugins/compound-engineering/skills/ce-review/SKILL.md` — parallel subagent dispatch (Stage 4), structured result merging (Stage 5) - `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — Codex delegation section -- `plugins/compound-engineering/skills/ce-ideate/SKILL.md` — hypothesis generation, categorization, iterative refinement +- `plugins/compound-engineering/skills/ce-review/references/subagent-template.md` — sub-agent prompt structure and JSON output contract **Test scenarios:** - Spec with hard primary metric: gates + hard metric evaluation, no judge calls -- Spec with judge primary metric: 
gates -> judge -> keep/revert based on judge score +- Spec with judge primary metric: gates -> batched judge sub-agents -> keep/revert based on aggregated judge score - Parallel batch of 4 experiments: all dispatched, results collected, best kept, others reverted -- Experiment that violates degenerate gate: immediately reverted, no judge call +- Experiment that violates degenerate gate: immediately reverted, no judge call, no judge cost - Experiment needing unapproved dep: deferred, pipeline continues - Codex dispatch failure: fallback to subagent after 3 failures - Plateau stopping: 10 consecutive batches with no improvement -> stop -- Flaky metric with repeat mode: 5 runs, median aggregation, noise threshold applied +- Flaky metric with repeat mode: agent runs harness N times, aggregates, applies noise threshold +- Runner-up merge: file-disjoint runner-up cherry-picked, re-measured, combined is better -> kept +- Runner-up merge fails: combined is worse than best-only -> runner-up reverted, logged +- Context management: after 50 experiments, strategy digest used instead of full log **Verification:** -- Skill loads without errors in Claude Code -- YAML frontmatter passes `bun test tests/frontmatter.test.ts` -- All reference file paths use backtick syntax (no markdown links) -- Cross-platform question tool pattern used for all user interactions +- Experiment log updated after every batch (not just at end) +- Strategy digest file written after every batch +- Worktrees cleaned up after measurement +- All reference file paths use backtick syntax - Script references use relative paths (`bash scripts/measure.sh`) -- No hardcoded tool names — capability-class descriptions with platform hints --- -### Phase C: Registration (depends on Unit 8) +- [ ] **Unit 11: SKILL.md — Phase 4 (Wrap-Up)** + +**Goal:** Add Phase 4 to the SKILL.md — deferred hypothesis presentation, result summary, branch preservation, and integration with ce:review and ce:compound. 
+ +**Requirements:** R9, R10 + +**Dependencies:** Unit 10 + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-optimize/SKILL.md` -- [ ] **Unit 9: Plugin registration and validation** +**Approach:** + +*Phase 4: Wrap-Up* +- Present deferred hypotheses needing dep approval (if any) +- Summarize: baseline -> final metrics, total iterations run, kept count, reverted count, judge cost total +- Preserve optimization branch with all commits +- Offer post-completion options via platform question tool: + 1. Run `/ce:review` on cumulative diff (baseline -> final) + 2. Run `/ce:compound` to document the winning strategy + 3. Create PR from optimization branch + 4. Continue with more experiments (re-enter Phase 3) + 5. Done + +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-work/SKILL.md` — Phase 4 (Ship It) post-completion options +- `plugins/compound-engineering/skills/lfg/SKILL.md` — skill-to-skill handoff pattern + +**Test scenarios:** +- Deferred hypotheses presented with dep requirements +- Summary includes all key metrics and cost data +- Each post-completion option works (ce:review, ce:compound, PR creation, continue, done) +- "Continue" re-enters Phase 3 cleanly with state re-read + +**Verification:** +- Optimization branch preserved with full commit history +- Post-completion options use platform question tool pattern + +--- + +### Phase C: Registration (depends on Unit 11) + +- [ ] **Unit 12: Plugin registration and validation** **Goal:** Register the new skill in plugin documentation and validate consistency. **Requirements:** R1 -**Dependencies:** Unit 8 +**Dependencies:** Unit 11 **Files:** - Modify: `plugins/compound-engineering/README.md` @@ -532,18 +629,25 @@ The SKILL.md is the primary artifact. It orchestrates all phases using the refer ## System-Wide Impact -- **Interaction graph:** The skill dispatches to learnings-researcher (Phase 0), repo-research-analyst (Phase 2), and optionally ce:review and ce:compound (Phase 4). 
It creates git worktrees and branches. It invokes Codex as an external process. -- **Error propagation:** Experiment failures are contained — each runs in an isolated worktree. Failures are logged and reverted. The optimization branch only advances on successful, validated improvements. -- **State lifecycle risks:** The experiment log is the critical state artifact. It must be written atomically (write to temp, rename) to prevent corruption from mid-write interrupts. Worktree cleanup must handle partial creation states. +- **Interaction graph:** The skill dispatches to learnings-researcher (Phase 0), repo-research-analyst (Phase 2), parallel judge sub-agents (Phase 3), and optionally ce:review and ce:compound (Phase 4). It creates git worktrees and branches. It invokes Codex as an external process. +- **Error propagation:** Experiment failures are contained — each runs in an isolated worktree. Failures are logged and reverted. The optimization branch only advances on successful, validated improvements. If the orchestrator crashes mid-batch, each completed experiment should have a `result.yaml` marker in its worktree; on resume the orchestrator scans for completed-but-unlogged experiments before starting a new batch. +- **State lifecycle risks:** The experiment log is the critical state artifact. It must be written after each batch (not just at end) to survive crashes. Log atomicity is ensured by the batch-then-evaluate architecture — only the single-threaded orchestrator writes to the log, never concurrent workers. +- **Context window pressure:** The experiment log grows ~25 lines per experiment. At 100 experiments that is ~2,500 lines of YAML. The orchestrator manages this via a rolling summary window (last 10 experiments) + a strategy digest file, never reading the full log unless filtering by category for duplicate-hypothesis detection. +- **Branch collision:** If `optimize/` already exists from a prior run, Phase 0 detects it and offers resume vs. fresh start. 
This prevents accidental overwrites of prior experiment history. +- **Dirty working tree:** Phase 1 includes a clean-tree gate: `git status` must show no uncommitted changes to files within `scope.mutable` or `scope.immutable`. If dirty, require commit or stash before proceeding. This prevents baseline measurement from differing between the main worktree and experiment worktrees. +- **Worktree budget:** Optimization worktrees live under `.worktrees/` (same convention as git-worktree skill). Before creating experiment worktrees, check total worktree count (including non-optimize worktrees from ce:work or ce:review). Refuse to exceed 12 total worktrees to prevent git performance degradation. - **API surface parity:** This is a new skill, no existing surface to maintain parity with. - **Integration coverage:** The parallelism readiness probe should be validated against real projects with known blockers (SQLite DBs, hardcoded ports) to ensure detection works. ## Risks & Dependencies - **Codex exec flags may change** — the skill should detect `codex` version and adapt. Mitigate by checking `codex --version` before first dispatch. -- **Worktree disk usage** — parallel experiments with large repos consume disk. Mitigate by cleaning up worktrees immediately after measurement and offering a `worktree_cleanup: immediate` option. -- **LLM-as-judge consistency** — judge scores may vary across calls for the same input. Mitigate by using fixed sample seeds and logging per-sample scores for post-hoc analysis. v2 can add anchor-based calibration. -- **Long-running unattended execution** — the loop may run for hours. Mitigate by saving experiment log after every batch (not just at end) and designing for graceful resume from saved state. +- **Worktree disk usage** — parallel experiments with large repos consume disk. Mitigate by cleaning up worktrees immediately after measurement, capping at 6 concurrent for worktree backend, and enforcing a 12-worktree budget across all CE skills. 
+- **LLM-as-judge consistency** — judge scores may vary across calls for the same input. Mitigate by using fixed sample seeds, requiring `minimum_improvement` threshold (default 0.3) to accept, and logging per-sample scores for post-hoc analysis. v2 can add anchor-based calibration. +- **Long-running unattended execution** — the loop may run for hours. Mitigate by saving experiment log after every batch, writing per-experiment `result.yaml` markers for crash recovery, and designing for graceful resume from saved state. +- **Context window exhaustion** — experiment log grows ~25 lines per experiment. Mitigate with rolling summary window (last 10 experiments) + strategy digest file. The orchestrator never reads the full log in one pass. +- **Judge API rate limiting** — if using Claude API for judge calls, rate limits could throttle parallel judge evaluation. Mitigate by batching judge calls (10 per sub-agent) to reduce total API calls, and adding a brief delay between judge sub-agent dispatches if rate-limited. +- **Runner-up merge interactions** — two independently beneficial changes can be harmful in combination. Mitigate by re-measuring after every merge, stopping after the first failed combination per batch, and logging interactions as learnings. ## Documentation / Operational Notes diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md index b218bea1..b6beb5fd 100644 --- a/plugins/compound-engineering/README.md +++ b/plugins/compound-engineering/README.md @@ -7,7 +7,8 @@ AI-powered development tools that get smarter with every use. 
Make each unit of | Component | Count | |-----------|-------| | Agents | 35+ | -| Skills | 40+ | +| Skills | 41+ | +| MCP Servers | 1 | ## Skills @@ -24,6 +25,7 @@ The primary entry points for engineering work, invoked as slash commands: | `/ce:work` | Execute work items systematically | | `/ce:compound` | Document solved problems to compound team knowledge | | `/ce:compound-refresh` | Refresh stale or drifting learnings and decide whether to keep, update, replace, or archive them | +| `/ce:optimize` | Run iterative optimization loops with parallel experiments, measurement gates, and LLM-as-judge quality scoring | ### Git Workflow diff --git a/plugins/compound-engineering/skills/ce-optimize/SKILL.md b/plugins/compound-engineering/skills/ce-optimize/SKILL.md new file mode 100644 index 00000000..c691c15f --- /dev/null +++ b/plugins/compound-engineering/skills/ce-optimize/SKILL.md @@ -0,0 +1,467 @@ +--- +name: ce:optimize +description: "Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains." +argument-hint: "[path to optimization spec YAML, or describe the optimization goal]" +--- + +# Iterative Optimization Loop + +Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution. + +## Interaction Method + +Use the platform's blocking question tool when available (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). 
Otherwise, present numbered options in chat and wait for the user's reply before proceeding. + +## Input + + #$ARGUMENTS + +If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file." + +## Optimization Spec Schema + +Reference the spec schema for validation: + +`references/optimize-spec-schema.yaml` + +## Experiment Log Schema + +Reference the experiment log schema for state management: + +`references/experiment-log-schema.yaml` + +--- + +## Phase 0: Setup + +### 0.1 Determine Input Type + +Check whether the input is: +- **A spec file path** (ends in `.yaml` or `.yml`): read and validate it +- **A description of the optimization goal**: help the user create a spec interactively + +### 0.2 Load or Create Spec + +**If spec file provided:** +1. Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing. +2. Validate against `references/optimize-spec-schema.yaml`: + - All required fields present + - `metric.primary.type` is `hard` or `judge` + - If type is `judge`, `metric.judge` section exists with `rubric` and `scoring` + - At least one degenerate gate defined + - `measurement.command` is non-empty + - `scope.mutable` and `scope.immutable` each have at least one entry + - Gate check operators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`) + - `execution.max_concurrent` does not exceed 6 when backend is `worktree` +3. If validation fails, report errors and ask the user to fix them + +**If description provided:** +1. Analyze the project to understand what can be measured +2. Guide the user through creating a spec: + - What is the optimization target? (metric name, direction, type) + - What degenerate cases should be rejected? (gates) + - If judge type: what rubric should the judge use? + - What command runs the measurement? + - What files can be modified? What is immutable? + - Any constraints or dependencies? +3. 
Write the spec to `.context/compound-engineering/ce-optimize//spec.yaml` +4. Present the spec to the user for approval before proceeding + +### 0.3 Search Prior Learnings + +Dispatch `compound-engineering:research:learnings-researcher` to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach. + +### 0.4 Run Identity Detection + +Check if `optimize/` branch already exists: + +```bash +git rev-parse --verify "optimize/" 2>/dev/null +``` + +**If branch exists**, check for an existing experiment log at `.context/compound-engineering/ce-optimize//experiment-log.yaml`. + +Present the user with a choice via the platform question tool: +- **Resume**: inherit existing state, continue from the last iteration number +- **Fresh start**: archive the old branch to `optimize//archived-`, clear the experiment log, start from scratch + +### 0.5 Create Optimization Branch and Scratch Space + +```bash +git checkout -b "optimize/" # or switch to existing if resuming +``` + +Create scratch directory: +```bash +mkdir -p .context/compound-engineering/ce-optimize// +``` + +--- + +## Phase 1: Measurement Scaffolding + +**This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.** + +### 1.1 Clean-Tree Gate + +Verify no uncommitted changes to files within `scope.mutable` or `scope.immutable`: + +```bash +git status --porcelain +``` + +Filter the output against the scope paths. If any in-scope files have uncommitted changes: +- Report which files are dirty +- Ask the user to commit or stash before proceeding +- Do NOT continue until the working tree is clean for in-scope files + +### 1.2 Build or Validate Measurement Harness + +**If user provides a measurement harness** (the `measurement.command` already exists): +1. Run it once via the measurement script: + ```bash + bash scripts/measure.sh "" + ``` +2. 
Validate the JSON output: + - Contains keys for all degenerate gate metric names + - Contains keys for all diagnostic metric names + - Values are numeric or boolean as expected +3. If validation fails, report what is missing and ask the user to fix the harness + +**If agent must build the harness:** +1. Analyze the codebase to understand the current approach and what should be measured +2. Build an evaluation script (e.g., `evaluate.py`, `evaluate.sh`, or equivalent) +3. Add the evaluation script path to `scope.immutable` -- the experiment agent must not modify it +4. Run it once and validate the output +5. Present the harness and its output to the user for review + +### 1.3 Establish Baseline + +Run the measurement harness on the current code. + +**If stability mode is `repeat`:** +1. Run the harness `repeat_count` times +2. Aggregate results using the configured aggregation method (median, mean, min, max) +3. Calculate variance across runs +4. If variance exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count` + +Record the baseline in the experiment log: +```yaml +baseline: + timestamp: "" + gates: + : + ... + diagnostics: + : + ... +``` + +If primary type is `judge`, also run the judge evaluation on baseline output to establish the starting judge score. + +### 1.4 Parallelism Readiness Probe + +Run the parallelism probe script: +```bash +bash scripts/parallel-probe.sh "" "" +``` + +Read the JSON output. Present any blockers to the user with suggested mitigations. 
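The repeat-mode aggregation described in 1.3 can be sketched as follows. This is an illustrative sketch only — `measure_baseline`, the flat-JSON-on-stdout output shape, and the spread-based noise check are assumptions, not part of the skill's actual scripts:

```python
import json
import statistics
import subprocess

def measure_baseline(command, repeat_count=5, aggregation="median", noise_threshold=0.02):
    """Run the measurement harness repeat_count times and aggregate per metric.

    Assumes the harness prints a flat JSON object of numeric metrics to stdout.
    Returns (baseline, warnings) where warnings flag metrics whose run-to-run
    spread exceeds the noise threshold (relative to the aggregated value).
    """
    runs = []
    for _ in range(repeat_count):
        out = subprocess.run(command, shell=True, capture_output=True,
                             text=True, check=True)
        runs.append(json.loads(out.stdout))

    agg_fn = {"median": statistics.median, "mean": statistics.fmean,
              "min": min, "max": max}[aggregation]
    baseline, warnings = {}, []
    for key in runs[0]:
        values = [r[key] for r in runs]
        baseline[key] = agg_fn(values)
        spread = max(values) - min(values)  # crude variance proxy across runs
        if baseline[key] and spread / abs(baseline[key]) > noise_threshold:
            warnings.append(f"{key}: spread {spread:.4g} exceeds noise threshold")
    return baseline, warnings
```

If warnings come back non-empty, that corresponds to the "suggest increasing `repeat_count`" path above.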
+ +### 1.5 Worktree Budget Check + +Count existing worktrees: +```bash +bash scripts/experiment-worktree.sh count +``` + +If count + `execution.max_concurrent` would exceed 12: +- Warn the user +- Suggest cleaning up existing worktrees or reducing `max_concurrent` +- Do NOT block -- the user may proceed at their own risk + +### 1.6 User Approval Gate + +Present to the user via the platform question tool: + +- **Baseline metrics**: all gate values, diagnostic values, and judge scores (if applicable) +- **Parallel readiness**: probe results, any blockers, mitigations applied +- **Clean-tree status**: confirmed clean +- **Worktree budget**: current count and projected usage + +**Options:** +1. **Proceed** -- approve baseline and parallel config, move to Phase 2 +2. **Adjust spec** -- modify spec settings before proceeding +3. **Fix issues** -- user needs to resolve blockers first + +Do NOT proceed to Phase 2 until the user explicitly approves. + +**State re-read:** After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward. + +--- + +## Phase 2: Hypothesis Generation + +### 2.1 Analyze Current Approach + +Read the code within `scope.mutable` to understand: +- The current implementation approach +- Obvious improvement opportunities +- Constraints and dependencies between components + +Optionally dispatch `compound-engineering:research:repo-research-analyst` for deeper codebase analysis if the scope is large or unfamiliar. + +### 2.2 Generate Hypothesis List + +Generate an initial set of hypotheses. 
Each hypothesis should have: +- **Description**: what to try +- **Category**: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category +- **Priority**: high, medium, or low based on expected impact and feasibility +- **Required dependencies**: any new packages or tools needed + +Include user-provided hypotheses if any were given as input. + +Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings. + +### 2.3 Dependency Pre-Approval + +Collect all unique new dependencies across all hypotheses. + +If any hypotheses require new dependencies: +1. Present the full dependency list to the user via the platform question tool +2. Ask for bulk approval +3. Mark each hypothesis's `dep_status` as `approved` or `needs_approval` + +Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval. + +### 2.4 Record Hypothesis Backlog + +Write the initial backlog to the experiment log file: +```yaml +hypothesis_backlog: + - description: "Remove template boilerplate before embedding" + category: "signal-extraction" + priority: high + dep_status: approved + required_deps: [] + - description: "Try HDBSCAN clustering algorithm" + category: "algorithm" + priority: medium + dep_status: needs_approval + required_deps: ["scikit-learn"] +``` + +--- + +## Phase 3: Optimization Loop + +This phase repeats in batches until a stopping criterion is met. 
+ +### 3.1 Batch Selection + +Select hypotheses for this batch: +- `batch_size = min(backlog_size, execution.max_concurrent)` +- Skip hypotheses with `dep_status: needs_approval` +- Prefer diversity: select from different categories when possible +- Within a category, select by priority (high first) + +If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up). + +### 3.2 Dispatch Experiments + +For each hypothesis in the batch, dispatch in parallel: + +**Worktree backend:** +1. Create experiment worktree: + ```bash + WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "" "optimize/" ) + ``` +2. Apply port parameterization if configured (set env vars for the measurement script) +3. Fill the experiment prompt template (`references/experiment-prompt-template.md`) with: + - Iteration number, spec name + - Hypothesis description and category + - Current best and baseline metrics + - Mutable and immutable scope + - Constraints and approved dependencies + - Rolling window of last 10 experiments (concise summaries) +4. Dispatch a subagent with the filled prompt, working in the experiment worktree + +**Codex backend:** +1. Check environment guard -- do NOT delegate if already inside a Codex sandbox: + ```bash + # If these exist, we're already in Codex -- fall back to subagent + test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git + ``` +2. Fill the experiment prompt template +3. Write the filled prompt to a temp file +4. Dispatch via Codex: + ```bash + cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1 + ``` +5. Security posture: use the user's selection (ask once per session if not set in spec) + +### 3.3 Collect Results + +Wait for all experiments in the batch to complete. + +For each completed experiment: + +1. **Run measurement** in the experiment's worktree: + ```bash + bash scripts/measure.sh "" "" + ``` + +2. **Read raw JSON output** from the measurement script + +3. 
**Evaluate degenerate gates**: + - For each gate in `metric.degenerate_gates`, parse the operator and threshold + - Compare the metric value against the threshold + - If ANY gate fails: mark outcome as `degenerate`, skip judge evaluation, save money + +4. **If gates pass AND primary type is `judge`**: + - Read the experiment's output (cluster assignments, search results, etc.) + - Apply stratified sampling per `metric.judge.stratification` config (using `sample_seed`) + - Group samples into batches of `metric.judge.batch_size` + - Fill the judge prompt template (`references/judge-prompt-template.md`) for each batch + - Dispatch `ceil(sample_size / batch_size)` parallel judge sub-agents + - Each sub-agent returns structured JSON scores + - Aggregate scores: compute `scoring.primary` (e.g., mean_score) and `scoring.secondary` values + - If `singleton_sample > 0`: also dispatch singleton evaluation sub-agents + +5. **If gates pass AND primary type is `hard`**: + - Use the metric value directly from the measurement output + +6. **Record results** in the experiment log entry + +### 3.4 Evaluate Batch + +After all experiments in the batch have been measured: + +1. **Rank** experiments by primary metric improvement: + - For hard metrics: compare to current best value + - For judge metrics: compare `mean_score` to current best, must exceed `minimum_improvement` + +2. **Identify the best experiment** that passes all gates and improves the primary metric + +3. **If best improves on current best: KEEP** + - Merge the experiment branch to the optimization branch + - Commit with message: `optimize(): ` + - This is now the new baseline for subsequent batches + +4. **Check file-disjoint runners-up** (up to `max_runner_up_merges_per_batch`): + - For each runner-up that also improved, check file-level disjointness with the kept experiment + - **File-level disjointness**: two experiments are disjoint if they modified completely different files. 
Same file = overlapping, even if different lines. + - If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement + - If combined measurement is strictly better: keep the cherry-pick (outcome: `runner_up_kept`) + - Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome: `runner_up_reverted`) + - Stop after first failed combination + +5. **Handle deferred deps**: experiments that need unapproved dependencies get outcome `deferred_needs_approval` + +6. **Revert all others**: cleanup worktrees, log as `reverted` + +### 3.5 Update State + +1. **Update experiment log** with ALL results from this batch (kept, reverted, degenerate, error, deferred) + +2. **Write strategy digest** to `.context/compound-engineering/ce-optimize//strategy-digest.md`: + - Categories tried so far (with success/failure counts) + - Key learnings from this batch and overall + - Exploration frontier: what categories and approaches remain untried + - Current best metrics and improvement from baseline + +3. 
**Generate new hypotheses** based on learnings: + - Read the rolling window (last 10 experiments from the log) and the strategy digest + - Do NOT read the full experiment log -- use the digest for broad context + - Add new hypotheses to the backlog based on what succeeded (explore further in that direction) and what failed (avoid similar approaches) + +### 3.6 Check Stopping Criteria + +Stop the loop if ANY of these are true: +- **Target reached**: primary metric meets or exceeds `stopping.target` (if set in spec) +- **Max iterations**: total experiments run >= `stopping.max_iterations` +- **Max hours**: wall-clock time since Phase 3 start >= `stopping.max_hours` +- **Plateau**: no improvement for `stopping.plateau_iterations` consecutive experiments +- **Manual stop**: user interrupts (save state and proceed to Phase 4) +- **Empty backlog**: no hypotheses remain and no new ones can be generated + +If no stopping criterion is met, proceed to the next batch (step 3.1). + +### 3.7 Cross-Cutting Concerns + +**Codex failure cascade**: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch. + +**Error handling**: If an experiment's measurement command crashes, times out, or produces malformed output: +- Log as outcome `error` or `timeout` with the error message +- Revert the experiment (cleanup worktree) +- The loop continues with remaining experiments in the batch + +**Progress reporting**: After each batch, report: +- Batch N of estimated M (based on backlog size) +- Experiments run this batch and total +- Current best metric and improvement from baseline +- Cumulative judge cost (if applicable) + +**Crash recovery**: Each experiment should write a small `result.yaml` marker in its worktree upon measurement completion. 
On resume (Phase 0.4 detects existing branch), scan `.worktrees/` for completed-but-unlogged experiment markers before starting a new batch. + +--- + +## Phase 4: Wrap-Up + +### 4.1 Present Deferred Hypotheses + +If any hypotheses were deferred due to unapproved dependencies: +1. List them with their dependency requirements +2. Ask the user whether to approve, skip, or save for a future run +3. If approved: add to backlog and offer to re-enter Phase 3 for one more round + +### 4.2 Summarize Results + +Present a comprehensive summary: + +``` +Optimization: +Duration: +Total experiments: + Kept: (including runner-up merges) + Reverted: + Degenerate: + Errors: + Deferred: + +Baseline -> Final: + : -> () + : ... + : ... + +Judge cost: $ (if applicable) + +Key improvements: + 1. (+) + 2. (+) + ... +``` + +### 4.3 Preserve and Offer Next Steps + +The optimization branch (`optimize/`) is preserved with all commits from kept experiments. + +Present post-completion options via the platform question tool: + +1. **Run `/ce:review`** on the cumulative diff (baseline to final). Load the `ce:review` skill with `mode:autofix` on the optimization branch. +2. **Run `/ce:compound`** to document the winning strategy as an institutional learning. +3. **Create PR** from the optimization branch to the default branch. +4. **Continue** with more experiments: re-enter Phase 3 with the current state. State re-read first. +5. **Done** -- leave the optimization branch for manual review. + +### 4.4 Cleanup + +Clean up scratch space: +```bash +# Keep the experiment log (it moves with the branch) +# Remove temporary batch artifacts +rm -f .context/compound-engineering/ce-optimize//strategy-digest.md +``` + +Do NOT delete the experiment log -- it is part of the optimization branch's history. +Do NOT delete experiment worktrees that are still being referenced. 
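The degenerate-gate check described in Phase 3.3 (with the operator set validated in Phase 0.2) can be sketched as below. The field names `metric`, `op`, and `threshold` are illustrative assumptions — the actual gate keys are defined in `references/optimize-spec-schema.yaml`:

```python
import operator

# Operator set from spec validation: >=, <=, >, <, ==, !=
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq, "!=": operator.ne}

def evaluate_gates(gates, metrics):
    """Return (passed, failures) for a measurement result.

    Each gate is assumed to look like
    {"metric": "pct_in_clusters", "op": ">=", "threshold": 0.2}.
    A metric missing from the measurement output counts as a gate
    failure rather than a crash, so malformed output degrades safely.
    """
    failures = []
    for gate in gates:
        value = metrics.get(gate["metric"])
        if value is None or not OPS[gate["op"]](value, gate["threshold"]):
            failures.append(gate)
    return (not failures), failures
```

If any gate fails, the experiment is marked `degenerate` and no judge sub-agents are dispatched, which is what keeps judge cost at zero for degenerate outputs.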
diff --git a/plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml b/plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml
new file mode 100644
index 00000000..0dc4f9a1
--- /dev/null
+++ b/plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml
@@ -0,0 +1,241 @@
+# Experiment Log Schema
+# This is the canonical schema for the experiment log file that accumulates
+# across an optimization run. The orchestrating agent reads and writes this
+# file after each batch.
+#
+# Location: .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
+#
+# The experiment log is the durable ground truth for all experiments run.
+# The orchestrator does NOT read it in full each iteration -- it uses a
+# rolling window (last 10 experiments) + a strategy digest file.
+
+# ============================================================================
+# TOP-LEVEL STRUCTURE
+# ============================================================================
+
+structure:
+
+  spec:
+    type: string
+    required: true
+    description: "Name of the optimization spec this log belongs to"
+
+  run_id:
+    type: string
+    required: true
+    description: "Unique identifier for this optimization run (timestamp-based). Distinguishes resumed runs from fresh starts."
+ + started_at: + type: string + format: "ISO 8601 timestamp" + required: true + + baseline: + type: object + required: true + description: "Metrics measured on the original code before any optimization" + children: + timestamp: + type: string + format: "ISO 8601 timestamp" + gates: + type: object + description: "Key-value pairs of gate metric names to their baseline values" + diagnostics: + type: object + description: "Key-value pairs of diagnostic metric names to their baseline values" + judge: + type: object + description: "Judge scores on the baseline (only when primary type is 'judge')" + children: + # All fields from the scoring config appear here + # Plus: + sample_seed: + type: integer + judge_cost_usd: + type: number + + experiments: + type: array + required: true + description: "Ordered list of all experiments, including kept, reverted, errored, and deferred" + items: + type: object + # See EXPERIMENT ENTRY below + + best: + type: object + required: true + description: "Summary of the current best result" + children: + iteration: + type: integer + description: "Iteration number of the best experiment" + metrics: + type: object + description: "All metric values from the best experiment" + judge: + type: object + description: "Judge scores from the best experiment (only when primary type is 'judge')" + total_judge_cost_usd: + type: number + description: "Running total of all judge costs across all experiments" + + hypothesis_backlog: + type: array + description: "Remaining hypotheses not yet tested" + items: + type: object + children: + description: + type: string + category: + type: string + priority: + type: string + enum: [high, medium, low] + dep_status: + type: string + enum: [approved, needs_approval, not_applicable] + required_deps: + type: array + items: + type: string + +# ============================================================================ +# EXPERIMENT ENTRY +# ============================================================================ + 
+experiment_entry: + required_children: + + iteration: + type: integer + description: "Sequential experiment number (1-indexed, monotonically increasing)" + + batch: + type: integer + description: "Batch number this experiment was part of. Multiple experiments in the same batch ran in parallel." + + hypothesis: + type: string + description: "Human-readable description of what this experiment tried" + + category: + type: string + description: "Category for grouping and diversity selection (e.g., signal-extraction, graph-signals, embedding, algorithm, preprocessing)" + + outcome: + type: enum + values: + - kept # primary metric improved, gates passed -> merged to optimization branch + - reverted # primary metric did not improve or was worse -> changes discarded + - degenerate # degenerate gate failed -> immediately reverted, no judge evaluation + - error # measurement command crashed, timed out, or produced malformed output + - deferred_needs_approval # experiment needs an unapproved dependency -> set aside for batch approval + - timeout # measurement command exceeded timeout_seconds + - runner_up_kept # file-disjoint runner-up that was cherry-picked and re-measured successfully + - runner_up_reverted # file-disjoint runner-up that was cherry-picked but combined measurement was not better + description: > + Load-bearing state: the loop branches on this value. + 'kept' and 'runner_up_kept' advance the optimization branch. + 'deferred_needs_approval' items are re-presented at wrap-up. + All other states are terminal for that experiment. 
+ + optional_children: + + changes: + type: array + description: "Files modified by this experiment" + items: + type: object + children: + file: + type: string + summary: + type: string + + gates: + type: object + description: "Gate metric values from the measurement command" + + gates_passed: + type: boolean + description: "Whether all degenerate gates passed" + + diagnostics: + type: object + description: "Diagnostic metric values from the measurement command" + + judge: + type: object + description: "Judge evaluation scores (only when primary type is 'judge' and gates passed)" + children: + # All fields from scoring.primary and scoring.secondary appear here + # Plus: + judge_cost_usd: + type: number + description: "Cost of judge calls for this experiment" + + primary_delta: + type: string + description: "Change in primary metric from current best (e.g., '+0.7', '-0.3')" + + learnings: + type: string + description: "What was learned from this experiment. The agent reads these to avoid re-trying similar approaches and to inform new hypothesis generation." 
+ + commit: + type: string + description: "Git commit SHA on the optimization branch (only for 'kept' and 'runner_up_kept' outcomes)" + + deferred_reason: + type: string + description: "Why this experiment was deferred (only for 'deferred_needs_approval' outcome)" + + error_message: + type: string + description: "Error details (only for 'error' and 'timeout' outcomes)" + + merged_with: + type: integer + description: "Iteration number of the experiment this was merged with (only for 'runner_up_kept' and 'runner_up_reverted')" + +# ============================================================================ +# OUTCOME STATE TRANSITIONS +# ============================================================================ +# +# proposed (in hypothesis_backlog) +# -> selected for batch +# -> experiment dispatched +# -> measurement completed +# -> gates failed -> outcome: degenerate +# -> measurement error -> outcome: error +# -> measurement timeout -> outcome: timeout +# -> gates passed +# -> judge evaluated (if type: judge) +# -> best in batch, improved -> outcome: kept +# -> runner-up, file-disjoint -> cherry-pick + re-measure +# -> combined better -> outcome: runner_up_kept +# -> combined not better -> outcome: runner_up_reverted +# -> not improved -> outcome: reverted +# -> needs unapproved dep -> outcome: deferred_needs_approval +# +# Only 'kept' and 'runner_up_kept' produce a commit on the optimization branch. +# Only 'deferred_needs_approval' items are re-presented at wrap-up for approval. 
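+#
+# Illustrative sketch (not part of the schema): the branching above can be
+# read as one decision function. The boolean arguments are hypothetical
+# values the orchestrator computes; only the returned strings are canonical
+# outcome values.
+#
+#   def classify_outcome(needs_unapproved_dep, timed_out, errored,
+#                        gates_passed, is_runner_up, combined_better, improved):
+#       if needs_unapproved_dep:
+#           return "deferred_needs_approval"
+#       if timed_out:
+#           return "timeout"
+#       if errored:
+#           return "error"
+#       if not gates_passed:
+#           return "degenerate"
+#       if is_runner_up:
+#           return "runner_up_kept" if combined_better else "runner_up_reverted"
+#       return "kept" if improved else "reverted"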
+
+# ============================================================================
+# STRATEGY DIGEST (separate file)
+# ============================================================================
+#
+# Written after each batch to:
+#   .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
+#
+# Contains a compressed summary of:
+# - What hypothesis categories have been tried
+# - Which approaches succeeded (kept) and which failed (reverted)
+# - The exploration frontier: what hasn't been tried yet
+# - Key learnings that should inform next hypotheses
+#
+# The orchestrator reads the strategy digest (not the full experiment log)
+# when generating new hypotheses between batches.
diff --git a/plugins/compound-engineering/skills/ce-optimize/references/experiment-prompt-template.md b/plugins/compound-engineering/skills/ce-optimize/references/experiment-prompt-template.md
new file mode 100644
index 00000000..6ae2f466
--- /dev/null
+++ b/plugins/compound-engineering/skills/ce-optimize/references/experiment-prompt-template.md
@@ -0,0 +1,89 @@
+# Experiment Worker Prompt Template
+
+This template is used by the orchestrator to dispatch each experiment to a subagent or Codex. Variable substitution slots are filled at spawn time.
+
+---
+
+## Template
+
+```
+You are an optimization experiment worker.
+
+Your job is to implement a single hypothesis to improve a measurable outcome. You will modify code within a defined scope, then stop. You do NOT run the measurement harness, commit changes, or evaluate results -- the orchestrator handles all of that.
+ + +Experiment: #{iteration} for optimization target: {spec_name} +Hypothesis: {hypothesis_description} +Category: {hypothesis_category} + +Current best metrics: +{current_best_metrics} + +Baseline metrics (before any optimization): +{baseline_metrics} + + + +You MAY modify files in these paths: +{scope_mutable} + +You MUST NOT modify files in these paths: +{scope_immutable} + +CRITICAL: Do not modify any file outside the mutable scope. The measurement harness and evaluation data are immutable by design -- the agent cannot game the metric by changing how it is measured. + + + +{constraints} + + + +You may add or use these dependencies without further approval: +{approved_dependencies} + +If your implementation requires a dependency NOT in this list, STOP and note it in your output. Do not install unapproved dependencies. + + + +Recent experiments and their outcomes (for context -- avoid re-trying approaches that already failed): + +{recent_experiment_summaries} + + + +1. Read and understand the relevant code in the mutable scope +2. Implement the hypothesis described above +3. Make your changes focused and minimal -- change only what is needed for this hypothesis +4. Do NOT run the measurement harness (the orchestrator handles this) +5. Do NOT commit (the orchestrator handles this) +6. Do NOT modify files outside the mutable scope +7. When done, run `git diff --stat` so the orchestrator can see your changes +8. If you discover you need an unapproved dependency, note it and stop + +Focus on implementing the hypothesis well. The orchestrator will measure and evaluate the results. 
+ +``` + +## Variable Reference + +| Variable | Source | Description | +|----------|--------|-------------| +| `{iteration}` | Experiment counter | Sequential experiment number | +| `{spec_name}` | Spec file `name` field | Optimization target identifier | +| `{hypothesis_description}` | Hypothesis backlog | What this experiment should try | +| `{hypothesis_category}` | Hypothesis backlog | Category (signal-extraction, algorithm, etc.) | +| `{current_best_metrics}` | Experiment log `best` section | Current best metric values (compact YAML or key: value pairs) | +| `{baseline_metrics}` | Experiment log `baseline` section | Original baseline before any optimization | +| `{scope_mutable}` | Spec `scope.mutable` | List of files/dirs the worker may modify | +| `{scope_immutable}` | Spec `scope.immutable` | List of files/dirs the worker must not touch | +| `{constraints}` | Spec `constraints` | Free-text constraints to follow | +| `{approved_dependencies}` | Spec `dependencies.approved` | Dependencies approved for use | +| `{recent_experiment_summaries}` | Rolling window (last 10) from experiment log | Compact summaries: hypothesis, outcome, learnings | + +## Notes + +- This template works for both subagent and Codex dispatch. No platform-specific assumptions. +- For Codex dispatch: write the filled template to a temp file and pipe via stdin (`cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1`). +- For subagent dispatch: pass the filled template as the subagent prompt. +- Keep `{recent_experiment_summaries}` concise -- 2-3 lines per experiment, last 10 only. Do not include the full experiment log. +- The worker should NOT read the full experiment log or strategy digest. It receives only what the orchestrator provides. 
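Filling the template is plain keyed substitution from the spec and the experiment log. A minimal sketch of how an orchestrator might do it -- the helper and its leftover-slot check are illustrative, not part of the skill:

```python
def fill_template(template, variables):
    """Substitute {name} slots; fail loudly if any slot is left unfilled."""
    out = template
    for key, value in variables.items():
        out = out.replace("{" + key + "}", str(value))
    # Anything still wrapped in braces is an unfilled slot.
    leftover = [chunk.split("}")[0] for chunk in out.split("{")[1:] if "}" in chunk]
    if leftover:
        raise ValueError(f"unfilled template slots: {leftover}")
    return out


prompt = fill_template(
    "Experiment: #{iteration} for optimization target: {spec_name}",
    {"iteration": 7, "spec_name": "improve-issue-clustering"},
)
# -> "Experiment: #7 for optimization target: improve-issue-clustering"
```

The strict leftover check assumes the template body contains no literal braces outside substitution slots, which holds for the worker template above; a template embedding literal JSON would need a different delimiter.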
diff --git a/plugins/compound-engineering/skills/ce-optimize/references/judge-prompt-template.md b/plugins/compound-engineering/skills/ce-optimize/references/judge-prompt-template.md new file mode 100644 index 00000000..01593c9e --- /dev/null +++ b/plugins/compound-engineering/skills/ce-optimize/references/judge-prompt-template.md @@ -0,0 +1,110 @@ +# Judge Evaluation Prompt Template + +This template is used by the orchestrator to dispatch batched LLM-as-judge evaluation calls. Each judge sub-agent evaluates a batch of sampled output items and returns structured JSON scores. + +The orchestrator: +1. Reads the experiment's output +2. Selects samples per the stratification config (using fixed seed) +3. Groups samples into batches of `judge.batch_size` +4. Dispatches `ceil(sample_size / batch_size)` parallel sub-agents using this template +5. Aggregates returned JSON scores + +--- + +## Item Evaluation Template + +``` +You are a quality judge evaluating output items for an optimization experiment. + +Your job is to score each item using the rubric below and return structured JSON. Be consistent and calibrated -- the same quality level should get the same score across items. + + +{rubric} + + + +{items_json} + + + +Return ONLY a valid JSON array. No prose, no markdown, no explanation outside the JSON. + +Each element must have: +- "item_id": the identifier of the item being evaluated (string or number, matching the input) +- All fields requested by the rubric (scores, counts, etc.) +- "ambiguous": true if you cannot confidently score this item (e.g., insufficient context, borderline case). When ambiguous, still provide your best-guess score but flag it. 
+ +Example output format (adapt field names to match the rubric): +[ + {"item_id": "cluster-42", "score": 4, "distinct_topics": 1, "outlier_count": 0, "ambiguous": false}, + {"item_id": "cluster-17", "score": 2, "distinct_topics": 3, "outlier_count": 2, "ambiguous": false}, + {"item_id": "cluster-99", "score": 3, "distinct_topics": 2, "outlier_count": 1, "ambiguous": true} +] + +Rules: +- Evaluate each item independently +- Score based on the rubric, not on how other items in this batch scored +- If an item is empty or has only 1 element when it should have more, score it based on what is present +- For very large items (many elements), focus on a representative subset and note if quality varies across the item +- Every item in the batch MUST appear in your output + +``` + +## Singleton Evaluation Template + +``` +You are a quality judge evaluating singleton items -- items that are currently NOT in any group/cluster. + +Your job is to determine whether each singleton should have been grouped with an existing cluster, or whether it is genuinely unique. Return structured JSON. + + +{singleton_rubric} + + + +{singletons_json} + + + +A summary of existing clusters for reference (titles/themes only, not full contents): +{cluster_summaries} + + + +Return ONLY a valid JSON array. No prose, no markdown, no explanation outside the JSON. + +Each element must have: +- "item_id": the identifier of the singleton +- All fields requested by the singleton rubric (should_cluster, best_cluster_id, confidence, etc.) 
+ +Example output format (adapt field names to match the rubric): +[ + {"item_id": "issue-1234", "should_cluster": true, "best_cluster_id": "cluster-42", "confidence": 4}, + {"item_id": "issue-5678", "should_cluster": false, "best_cluster_id": null, "confidence": 5} +] + +Rules: +- A singleton that genuinely has no match in existing clusters should get should_cluster: false +- A singleton that clearly belongs in an existing cluster should get should_cluster: true with the cluster ID +- High confidence (4-5) means you are very sure. Low confidence (1-2) means the item is borderline. +- Every singleton in the batch MUST appear in your output + +``` + +## Variable Reference + +| Variable | Source | Description | +|----------|--------|-------------| +| `{rubric}` | Spec `metric.judge.rubric` | User-defined scoring rubric | +| `{items_json}` | Sampled output items | JSON array of items to evaluate (one batch worth) | +| `{singleton_rubric}` | Spec `metric.judge.singleton_rubric` | User-defined rubric for singleton evaluation | +| `{singletons_json}` | Sampled singleton items | JSON array of singleton items to evaluate | +| `{cluster_summaries}` | Experiment output | Summary of existing clusters (titles/themes) for singleton reference | + +## Notes + +- Designed for Haiku by default -- prompts are concise and well-structured for smaller models +- The rubric is part of the immutable measurement harness -- the experiment agent cannot modify it +- The `ambiguous` flag on items helps the orchestrator identify noisy evaluations without forcing bad scores +- For singleton evaluation, the orchestrator provides cluster summaries (not full contents) to keep judge context lean +- Each sub-agent evaluates one batch independently -- sub-agents do not see each other's results diff --git a/plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml b/plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml new file mode 100644 index 
00000000..06dee26f --- /dev/null +++ b/plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml @@ -0,0 +1,383 @@ +# Optimization Spec Schema +# This is the canonical schema for optimization spec files created by users +# to configure a /ce:optimize run. The orchestrating agent validates specs +# against this schema before proceeding. +# +# Usage: Create a YAML file matching this schema and pass it to /ce:optimize. +# The agent reads this spec, validates required fields, and uses it to +# configure the entire optimization run. + +# ============================================================================ +# REQUIRED FIELDS +# ============================================================================ + +required_fields: + + name: + type: string + description: "Unique identifier for this optimization run (kebab-case)" + example: "improve-issue-clustering" + + description: + type: string + description: "Human-readable description of the optimization goal" + example: "Improve coherence and coverage of issue/PR clusters" + + metric: + type: object + description: "Three-tier metric configuration" + required_children: + + primary: + type: object + description: "The metric the loop optimizes against" + required_children: + + type: + type: enum + values: + - hard # scalar metric from measurement command (e.g., build time, test pass rate) + - judge # LLM-as-judge quality score from sampled outputs + description: "Whether the primary metric comes from the measurement command directly or from LLM-as-judge evaluation" + + name: + type: string + description: "Metric name — must match a key in the measurement command's JSON output (for hard type) or a scoring field (for judge type)" + example: "cluster_coherence" + + direction: + type: enum + values: + - maximize + - minimize + description: "Whether higher or lower is better" + + optional_children: + + baseline: + type: number + default: null + description: "Filled automatically during Phase 1 
baseline measurement. Do not set manually." + + target: + type: number + default: null + description: "Optional target value. Loop stops when this is reached." + example: 4.2 + + degenerate_gates: + type: array + description: "Fast boolean checks that reject obviously broken solutions before expensive evaluation. Run first, before the primary metric or judge." + required: true + items: + type: object + required_children: + name: + type: string + description: "Metric name — must match a key in the measurement command's JSON output" + check: + type: string + description: "Comparison operator and threshold. Supported operators: >=, <=, >, <, ==, !=" + example: "<= 0.10" + optional_children: + description: + type: string + description: "Human-readable explanation of what this gate catches" + + optional_children: + + diagnostics: + type: array + default: [] + description: "Metrics logged for understanding but never gated on. Useful for understanding WHY a primary metric changed." + items: + type: object + required_children: + name: + type: string + description: "Metric name — must match a key in the measurement command's JSON output" + + judge: + type: object + description: "LLM-as-judge configuration. Required when metric.primary.type is 'judge'. Ignored when type is 'hard'." + required_when: "metric.primary.type == 'judge'" + required_children: + rubric: + type: string + description: "Multi-line rubric text sent to the judge model. Must instruct the judge to return JSON." 
+ example: | + Rate this cluster 1-5: + - 5: All items clearly about the same issue/feature + - 4: Strong theme, minor outliers + - 3: Related but covers 2-3 sub-topics + - 2: Weak connection + - 1: Unrelated items grouped together + scoring: + type: object + required_children: + primary: + type: string + description: "Field name from judge JSON output to use as the primary optimization target" + example: "mean_score" + optional_children: + secondary: + type: array + default: [] + description: "Additional scoring fields to log (not optimized against)" + optional_children: + model: + type: enum + values: + - haiku + - sonnet + default: haiku + description: "Model to use for judge evaluation. Haiku is cheaper and faster; Sonnet is more nuanced." + sample_size: + type: integer + default: 30 + description: "Total number of output items to sample for judge evaluation per experiment" + stratification: + type: array + default: null + description: "Stratified sampling buckets. If null, uses uniform random sampling." + items: + type: object + required_children: + bucket: + type: string + description: "Bucket name for this stratum" + count: + type: integer + description: "Number of items to sample from this bucket" + singleton_sample: + type: integer + default: 0 + description: "Number of singleton items to sample for false-negative evaluation" + singleton_rubric: + type: string + default: null + description: "Rubric for evaluating sampled singletons. Required if singleton_sample > 0." + sample_seed: + type: integer + default: 42 + description: "Fixed seed for reproducible sampling across experiments" + batch_size: + type: integer + default: 10 + description: "Number of samples per judge sub-agent batch. Controls parallelism vs overhead." + minimum_improvement: + type: number + default: 0.3 + description: "Minimum judge score improvement required to accept an experiment as 'better'. Accounts for sample-composition variance when output structure changes between experiments. 
Distinct from measurement.stability.noise_threshold which handles run-to-run flakiness." + + measurement: + type: object + description: "How to run the measurement harness" + required_children: + command: + type: string + description: "Shell command that runs the evaluation and outputs JSON to stdout. The JSON must contain keys matching all gate names and diagnostic names." + example: "python evaluate.py" + optional_children: + timeout_seconds: + type: integer + default: 600 + description: "Maximum seconds for the measurement command to run before being killed" + output_format: + type: enum + values: + - json + default: json + description: "Format of the measurement command's stdout. Currently only JSON is supported." + working_directory: + type: string + default: "." + description: "Working directory for the measurement command, relative to the repo root" + stability: + type: object + default: { mode: "stable" } + description: "How to handle metric variance across runs" + required_children: + mode: + type: enum + values: + - stable # run once, trust the result + - repeat # run N times, aggregate + default: stable + optional_children: + repeat_count: + type: integer + default: 5 + description: "Number of times to run the harness when mode is 'repeat'" + aggregation: + type: enum + values: + - median + - mean + - min + - max + default: median + description: "How to combine repeated measurements into a single value" + noise_threshold: + type: number + default: 0.02 + description: "Minimum improvement that must exceed this value to count as a real improvement (not noise). Applied to hard metrics only." + + scope: + type: object + description: "What the experiment agent is allowed to modify" + required_children: + mutable: + type: array + description: "Files and directories the agent MAY modify during experiments" + items: + type: string + description: "File path or directory (relative to repo root). Directories match all files within." 
+ example: + - "src/clustering/" + - "src/preprocessing/" + - "config/clustering.yaml" + immutable: + type: array + description: "Files and directories the agent MUST NOT modify. The measurement harness should always be listed here." + items: + type: string + example: + - "evaluate.py" + - "tests/fixtures/" + - "data/" + +# ============================================================================ +# OPTIONAL FIELDS +# ============================================================================ + +optional_fields: + + execution: + type: object + default: { mode: "parallel", backend: "worktree", max_concurrent: 4 } + description: "How experiments are executed" + optional_children: + mode: + type: enum + values: + - parallel # run experiments simultaneously (default) + - serial # run one at a time + default: parallel + backend: + type: enum + values: + - worktree # git worktrees for isolation (default) + - codex # Codex sandboxes for isolation + default: worktree + max_concurrent: + type: integer + default: 4 + description: "Maximum experiments to run in parallel. Capped at 6 for worktree backend. 8+ only valid for Codex backend." + codex_security: + type: enum + values: + - full-auto # --full-auto (workspace write) + - yolo # --dangerously-bypass-approvals-and-sandbox + default: null + description: "Codex security posture. If null, user is asked once per session." + + parallel: + type: object + default: {} + description: "Parallelism configuration discovered or set during Phase 1" + optional_children: + port_strategy: + type: enum + values: + - parameterized # use env var for port + - none # no port parameterization needed + default: null + description: "If null, auto-detected during Phase 1 parallelism probe" + port_env_var: + type: string + default: null + description: "Environment variable name for port parameterization (e.g., EVAL_PORT)" + port_base: + type: integer + default: null + description: "Base port number. 
Each experiment gets port_base + experiment_index." + shared_files: + type: array + default: [] + description: "Files that must be copied into each experiment worktree (e.g., SQLite databases)" + items: + type: string + exclusive_resources: + type: array + default: [] + description: "Resources requiring exclusive access (e.g., 'gpu'). If non-empty, forces serial mode." + items: + type: string + + dependencies: + type: object + default: { approved: [] } + description: "Dependency management for experiments" + optional_children: + approved: + type: array + default: [] + description: "Pre-approved new dependencies that experiments may add" + items: + type: string + + constraints: + type: array + default: [] + description: "Free-text constraints that experiment agents must follow" + items: + type: string + example: + - "Do not change the output format of clusters" + - "Preserve backward compatibility with existing cluster consumers" + + stopping: + type: object + default: { max_iterations: 100, max_hours: 8, plateau_iterations: 10, target_reached: true } + description: "When the optimization loop should stop. Any criterion can trigger a stop." 
+ optional_children: + max_iterations: + type: integer + default: 100 + description: "Stop after this many total experiments" + max_hours: + type: number + default: 8 + description: "Stop after this many hours of wall-clock time" + plateau_iterations: + type: integer + default: 10 + description: "Stop if no improvement for this many consecutive experiments" + target_reached: + type: boolean + default: true + description: "Stop when the primary metric reaches the target value (if set)" + + max_runner_up_merges_per_batch: + type: integer + default: 1 + description: "Maximum number of file-disjoint runner-up experiments to attempt merging per batch after keeping the best experiment" + +# ============================================================================ +# VALIDATION RULES +# ============================================================================ + +validation_rules: + - "All required fields must be present" + - "metric.primary.type must be 'hard' or 'judge'" + - "If metric.primary.type is 'judge', metric.judge must be present with rubric and scoring" + - "metric.degenerate_gates must have at least one entry" + - "measurement.command must be a non-empty string" + - "scope.mutable must have at least one entry" + - "scope.immutable must have at least one entry" + - "Gate check operators must be one of: >=, <=, >, <, ==, !=" + - "execution.max_concurrent must not exceed 6 when execution.backend is 'worktree'" + - "If parallel.exclusive_resources is non-empty, execution.mode should be 'serial'" + - "If metric.judge.singleton_sample > 0, metric.judge.singleton_rubric must be present" + - "stopping must have at least one non-default criterion or use defaults" diff --git a/plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh b/plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh new file mode 100755 index 00000000..5d44e5dd --- /dev/null +++ 
b/plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh
@@ -0,0 +1,221 @@
+#!/bin/bash
+
+# Experiment Worktree Manager
+# Creates, cleans up, and manages worktrees for optimization experiments.
+# Each experiment gets an isolated worktree with copied shared resources.
+#
+# Usage:
+#   experiment-worktree.sh create <spec_name> <exp_index> <base_branch> [shared_file ...]
+#   experiment-worktree.sh cleanup <spec_name> <exp_index>
+#   experiment-worktree.sh cleanup-all <spec_name>
+#   experiment-worktree.sh count
+#
+# Worktrees are created at: .worktrees/optimize-<spec_name>-exp-<NNN>/
+# Branches are named: optimize/<spec_name>/exp-<NNN>
+
+set -euo pipefail
+
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m'
+
+GIT_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) || {
+  echo -e "${RED}Error: Not in a git repository${NC}" >&2
+  exit 1
+}
+
+WORKTREE_DIR="$GIT_ROOT/.worktrees"
+
+ensure_gitignore() {
+  if ! grep -q "^\.worktrees$" "$GIT_ROOT/.gitignore" 2>/dev/null; then
+    echo ".worktrees" >> "$GIT_ROOT/.gitignore"
+  fi
+}
+
+# Create an experiment worktree
+create_worktree() {
+  local spec_name="${1:?Error: spec_name required}"
+  local exp_index="${2:?Error: exp_index required}"
+  local base_branch="${3:?Error: base_branch required}"
+  shift 3
+
+  local padded_index
+  padded_index=$(printf "%03d" "$exp_index")
+  local worktree_name="optimize-${spec_name}-exp-${padded_index}"
+  local branch_name="optimize/${spec_name}/exp-${padded_index}"
+  local worktree_path="$WORKTREE_DIR/$worktree_name"
+
+  # Check if worktree already exists
+  if [[ -d "$worktree_path" ]]; then
+    echo -e "${YELLOW}Worktree already exists: $worktree_path${NC}" >&2
+    echo "$worktree_path"
+    return 0
+  fi
+
+  mkdir -p "$WORKTREE_DIR"
+  ensure_gitignore
+
+  # Create worktree from the base branch
+  git worktree add -b "$branch_name" "$worktree_path" "$base_branch" --quiet 2>/dev/null || {
+    # Branch might already exist (from a previous partial run)
+    git branch -D "$branch_name" 2>/dev/null || true
+    git worktree add -b "$branch_name"
"$worktree_path" "$base_branch" --quiet + } + + # Copy .env files from main repo + for f in "$GIT_ROOT"/.env*; do + if [[ -f "$f" ]]; then + local basename + basename=$(basename "$f") + if [[ "$basename" != ".env.example" ]]; then + cp "$f" "$worktree_path/$basename" + fi + fi + done + + # Copy shared files + for shared_file in "$@"; do + if [[ -f "$GIT_ROOT/$shared_file" ]]; then + local dir + dir=$(dirname "$worktree_path/$shared_file") + mkdir -p "$dir" + cp "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file" + elif [[ -d "$GIT_ROOT/$shared_file" ]]; then + local dir + dir=$(dirname "$worktree_path/$shared_file") + mkdir -p "$dir" + cp -r "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file" + fi + done + + echo "$worktree_path" +} + +# Clean up a single experiment worktree +cleanup_worktree() { + local spec_name="${1:?Error: spec_name required}" + local exp_index="${2:?Error: exp_index required}" + + local padded_index + padded_index=$(printf "%03d" "$exp_index") + local worktree_name="optimize-${spec_name}-exp-${padded_index}" + local branch_name="optimize/${spec_name}/exp-${padded_index}" + local worktree_path="$WORKTREE_DIR/$worktree_name" + + if [[ -d "$worktree_path" ]]; then + git worktree remove "$worktree_path" --force 2>/dev/null || { + # If worktree remove fails, try manual cleanup + rm -rf "$worktree_path" 2>/dev/null || true + git worktree prune 2>/dev/null || true + } + fi + + # Delete the experiment branch + git branch -D "$branch_name" 2>/dev/null || true + + echo -e "${GREEN}Cleaned up: $worktree_name${NC}" >&2 +} + +# Clean up all experiment worktrees for a spec +cleanup_all() { + local spec_name="${1:?Error: spec_name required}" + local prefix="optimize-${spec_name}-exp-" + local count=0 + + if [[ ! 
-d "$WORKTREE_DIR" ]]; then + echo -e "${YELLOW}No worktrees directory found${NC}" >&2 + return 0 + fi + + for worktree_path in "$WORKTREE_DIR"/${prefix}*; do + if [[ -d "$worktree_path" ]]; then + local worktree_name + worktree_name=$(basename "$worktree_path") + # Extract index from name + local index_str="${worktree_name#$prefix}" + + git worktree remove "$worktree_path" --force 2>/dev/null || { + rm -rf "$worktree_path" 2>/dev/null || true + } + + # Delete the branch + local branch_name="optimize/${spec_name}/exp-${index_str}" + git branch -D "$branch_name" 2>/dev/null || true + + count=$((count + 1)) + fi + done + + git worktree prune 2>/dev/null || true + + # Clean up empty worktree directory + if [[ -d "$WORKTREE_DIR" ]] && [[ -z "$(ls -A "$WORKTREE_DIR" 2>/dev/null)" ]]; then + rmdir "$WORKTREE_DIR" 2>/dev/null || true + fi + + echo -e "${GREEN}Cleaned up $count experiment worktree(s) for $spec_name${NC}" >&2 +} + +# Count total worktrees (for budget check) +count_worktrees() { + local count=0 + if [[ -d "$WORKTREE_DIR" ]]; then + for worktree_path in "$WORKTREE_DIR"/*; do + if [[ -d "$worktree_path" ]] && [[ -e "$worktree_path/.git" ]]; then + count=$((count + 1)) + fi + done + fi + echo "$count" +} + +# Main +main() { + local command="${1:-help}" + + case "$command" in + create) + shift + create_worktree "$@" + ;; + cleanup) + shift + cleanup_worktree "$@" + ;; + cleanup-all) + shift + cleanup_all "$@" + ;; + count) + count_worktrees + ;; + help) + cat << 'EOF' +Experiment Worktree Manager + +Usage: + experiment-worktree.sh create <spec_name> <exp_index> <base_branch> [shared_file ...] 
+ experiment-worktree.sh cleanup <spec_name> <exp_index> + experiment-worktree.sh cleanup-all <spec_name> + experiment-worktree.sh count + +Commands: + create Create an experiment worktree with copied shared files + cleanup Remove a single experiment worktree and its branch + cleanup-all Remove all experiment worktrees for a spec + count Count total active worktrees (for budget checking) + +Worktrees: .worktrees/optimize-<spec_name>-exp-<exp_index>/ +Branches: optimize/<spec_name>/exp-<exp_index> +EOF + ;; + *) + echo -e "${RED}Unknown command: $command${NC}" >&2 + exit 1 + ;; + esac +} + +main "$@" diff --git a/plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh b/plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh new file mode 100755 index 00000000..d9a6aaa6 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh @@ -0,0 +1,47 @@ +#!/bin/bash + +# Measurement Runner +# Runs a measurement command, captures JSON output, and handles timeouts. +# The orchestrating agent (not this script) evaluates gates and handles +# stability repeats. +# +# Usage: measure.sh <command> <timeout_seconds> [working_directory] [KEY=VALUE ...] +# +# Arguments: +# command - Shell command to run (e.g., "python evaluate.py") +# timeout_seconds - Maximum seconds before killing the command +# working_directory - Directory to run the command in (default: .) 
+# KEY=VALUE - Optional environment variables to set before running +# +# Output: +# stdout: Raw JSON output from the measurement command +# stderr: Passed through from the measurement command +# exit code: Same as the measurement command (124 for timeout) + +set -euo pipefail + +# Parse arguments +COMMAND="${1:?Error: command argument required}" +TIMEOUT="${2:?Error: timeout_seconds argument required}" +WORKDIR="${3:-.}" + +# Shift past the first 3 arguments to get env vars +shift 3 2>/dev/null || shift $# 2>/dev/null || true + +# Set any KEY=VALUE environment variables +for arg in "$@"; do + if [[ "$arg" == *=* ]]; then + export "$arg" + fi +done + +# Change to working directory +cd "$WORKDIR" || { + echo "Error: cannot cd to $WORKDIR" >&2 + exit 1 +} + +# Run the measurement command with timeout +# timeout returns 124 if the command times out +# We pass stdout and stderr through directly +timeout "$TIMEOUT" bash -c "$COMMAND" diff --git a/plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh b/plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh new file mode 100755 index 00000000..931f3147 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh @@ -0,0 +1,96 @@ +#!/bin/bash + +# Parallelism Probe +# Detects common parallelism blockers in the target project. +# Output is advisory -- the skill presents results to the user for approval. 
+# +# Usage: parallel-probe.sh <project_directory> [measurement_command] +# +# Arguments: +# project_directory - Root directory of the project to probe +# measurement_command - The measurement command from the spec (optional, for port detection) +# +# Output: +# JSON to stdout with: +# mode: "parallel" | "serial" | "user-decision" +# blockers: [ { type, description, suggestion } ] + +set -euo pipefail + +PROJECT_DIR="${1:?Error: project_directory argument required}" +MEASUREMENT_CMD="${2:-}" + +cd "$PROJECT_DIR" || { + echo '{"mode":"serial","blockers":[{"type":"error","description":"Cannot access project directory","suggestion":"Check path"}]}' + exit 0 +} + +BLOCKERS="[]" + +add_blocker() { + local type="$1" + local desc="$2" + local suggestion="$3" + BLOCKERS=$(echo "$BLOCKERS" | python3 -c " +import json, sys +b = json.load(sys.stdin) +b.append({'type': '$type', 'description': '''$desc''', 'suggestion': '''$suggestion'''}) +print(json.dumps(b)) +" 2>/dev/null || echo "$BLOCKERS") +} + +# Check 1: Hardcoded ports in measurement command +if [[ -n "$MEASUREMENT_CMD" ]]; then + # Look for common port patterns in the command itself + if echo "$MEASUREMENT_CMD" | grep -qE '(--port\s+[0-9]+|:\s*[0-9]{4,5}|PORT=[0-9]+|localhost:[0-9]+)'; then + add_blocker "port" "Measurement command contains hardcoded port reference" "Parameterize port via environment variable (e.g., PORT=\$EVAL_PORT)" + fi +fi + +# Check 2: Hardcoded ports in config files +PORT_FILES=$(grep -rl --include='*.yaml' --include='*.yml' --include='*.json' --include='*.toml' --include='*.cfg' --include='*.ini' --include='*.env' -E '(port:\s*[0-9]{4,5}|PORT\s*=\s*[0-9]{4,5}|"port":\s*[0-9]{4,5})' . 2>/dev/null | head -10 || true) +if [[ -n "$PORT_FILES" ]]; then + FILE_COUNT=$(echo "$PORT_FILES" | wc -l | tr -d ' ') + add_blocker "port" "Found hardcoded port numbers in $FILE_COUNT config file(s)" "Parameterize ports via environment variables" +fi + +# Check 3: SQLite databases +SQLITE_FILES=$(find . 
-maxdepth 4 -type f \( -name '*.db' -o -name '*.sqlite' -o -name '*.sqlite3' \) ! -path './.git/*' ! -path './node_modules/*' 2>/dev/null | head -10 || true) +if [[ -n "$SQLITE_FILES" ]]; then + FILE_COUNT=$(echo "$SQLITE_FILES" | wc -l | tr -d ' ') + add_blocker "shared_file" "Found $FILE_COUNT SQLite database file(s)" "Copy database files into each experiment worktree" +fi + +# Check 4: Lock/PID files +LOCK_FILES=$(find . -maxdepth 4 -type f \( -name '*.lock' -o -name '*.pid' \) ! -path './.git/*' ! -path './node_modules/*' ! -name 'package-lock.json' ! -name 'yarn.lock' ! -name 'bun.lockb' ! -name 'Gemfile.lock' ! -name 'poetry.lock' ! -name 'Cargo.lock' 2>/dev/null | head -10 || true) +if [[ -n "$LOCK_FILES" ]]; then + FILE_COUNT=$(echo "$LOCK_FILES" | wc -l | tr -d ' ') + add_blocker "lock_file" "Found $FILE_COUNT lock/PID file(s) that may cause contention" "Ensure measurement command cleans up lock files, or run in serial mode" +fi + +# Check 5: GPU references +GPU_FILES=$(grep -rl --include='*.py' --include='*.rs' --include='*.cpp' --include='*.cu' -E '(torch\.device|cuda|gpu|tensorflow|tf\.config)' . 
2>/dev/null | head -5 || true) +if [[ -n "$GPU_FILES" ]]; then + add_blocker "exclusive_resource" "Found GPU/CUDA references in source files" "GPU is typically an exclusive resource -- consider serial mode or GPU device parameterization" +fi + +# Determine mode +BLOCKER_COUNT=$(echo "$BLOCKERS" | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0") + +if [[ "$BLOCKER_COUNT" == "0" ]]; then + MODE="parallel" +elif echo "$BLOCKERS" | python3 -c "import json,sys; b=json.load(sys.stdin); exit(0 if any(x['type']=='exclusive_resource' for x in b) else 1)" 2>/dev/null; then + MODE="serial" +else + MODE="user-decision" +fi + +# Output JSON result +python3 -c " +import json +print(json.dumps({ + 'mode': '$MODE', + 'blockers': $BLOCKERS, + 'blocker_count': $BLOCKER_COUNT +}, indent=2)) +" From f55aa2621d26b43c4982373e3420022675be0d19 Mon Sep 17 00:00:00 2001 From: huntharo Date: Sun, 29 Mar 2026 23:40:27 -0400 Subject: [PATCH 3/6] fix(ce-optimize): enforce disk-first persistence discipline The skill runs for hours but had no explicit write-immediately-per-experiment rule. Results were batched in memory and written after full batch evaluation, making them vulnerable to context compaction and session crashes. Changes: - Add Persistence Discipline section as a top-level skill principle - Write each experiment result to disk IMMEDIATELY after measurement (step 3.3) instead of deferring to batch evaluation - Enforce re-read-from-disk at every phase boundary and before every decision - Per-experiment result.yaml crash-recovery markers in worktrees - Append-only log during Phase 3 to prevent data loss on interrupted writes - Resume logic explicitly reads all state from disk, not in-memory context - Update experiment log schema header to document the write discipline Follows Karpathy's autoresearch pattern: results.tsv is written after every single experiment, making the file the memory and the agent expendable. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- .../skills/ce-optimize/SKILL.md | 82 +++++++++++++++---- .../references/experiment-log-schema.yaml | 22 +++-- 2 files changed, 84 insertions(+), 20 deletions(-) diff --git a/plugins/compound-engineering/skills/ce-optimize/SKILL.md b/plugins/compound-engineering/skills/ce-optimize/SKILL.md index c691c15f..f8caf1d1 100644 --- a/plugins/compound-engineering/skills/ce-optimize/SKILL.md +++ b/plugins/compound-engineering/skills/ce-optimize/SKILL.md @@ -32,6 +32,43 @@ Reference the experiment log schema for state management: --- +## Persistence Discipline + +**The experiment log on disk is the single source of truth. The agent's in-memory context is expendable.** + +This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters must live on disk, not in the agent's memory. + +### Core Rules + +1. **Write each experiment result to disk IMMEDIATELY after measurement** — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule. + +2. **Re-read from disk at every phase boundary and before every decision** — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk. + +3. **The experiment log is append-only during Phase 3** — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted. + +4. **Per-experiment result markers for crash recovery** — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged. + +5. 
**Strategy digest is written after every batch, before generating new hypotheses** — the agent reads the digest (not its memory) when deciding what to try next. + +### File Locations (all under `.context/compound-engineering/ce-optimize/<spec_name>/`) + +| File | Purpose | Written When | +|------|---------|-------------| +| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 | +| `experiment-log.yaml` | Full history of all experiments | Appended after EACH experiment measurement | +| `strategy-digest.md` | Compressed learnings for hypothesis generation | After each batch completes | +| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before log append | + +### On Resume + +When Phase 0.4 detects an existing run: +1. Read the experiment log from disk — this is the ground truth +2. Scan worktree directories for `result.yaml` markers not yet in the log +3. Recover any measured-but-unlogged experiments +4. Continue from where the log left off + +--- + ## Phase 0: Setup ### 0.1 Determine Input Type @@ -82,7 +119,7 @@ git rev-parse --verify "optimize/<spec_name>" 2>/dev/null **If branch exists**, check for an existing experiment log at `.context/compound-engineering/ce-optimize/<spec_name>/experiment-log.yaml`. Present the user with a choice via the platform question tool: -- **Resume**: inherit existing state, continue from the last iteration number +- **Resume**: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log. - **Fresh start**: archive the old branch to `optimize/<spec_name>/archived-<timestamp>`, clear the experiment log, start from scratch ### 0.5 Create Optimization Branch and Scratch Space @@ -300,25 +337,27 @@ For each hypothesis in the batch, dispatch in parallel: ``` 5. 
Security posture: use the user's selection (ask once per session if not set in spec) -### 3.3 Collect Results +### 3.3 Collect and Persist Results -Wait for all experiments in the batch to complete. +Process experiments as they complete — do NOT wait for the entire batch to finish before writing results. -For each completed experiment: +For each completed experiment, **immediately**: 1. **Run measurement** in the experiment's worktree: ```bash bash scripts/measure.sh "<measurement_command>" "<timeout_seconds>" ``` -2. **Read raw JSON output** from the measurement script +2. **Write crash-recovery marker** — immediately after measurement, write `result.yaml` in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log. + +3. **Read raw JSON output** from the measurement script -3. **Evaluate degenerate gates**: +4. **Evaluate degenerate gates**: - For each gate in `metric.degenerate_gates`, parse the operator and threshold - Compare the metric value against the threshold - If ANY gate fails: mark outcome as `degenerate`, skip judge evaluation, save money -4. **If gates pass AND primary type is `judge`**: +5. **If gates pass AND primary type is `judge`**: - Read the experiment's output (cluster assignments, search results, etc.) - Apply stratified sampling per `metric.judge.stratification` config (using `sample_seed`) - Group samples into batches of `metric.judge.batch_size` @@ -328,10 +367,12 @@ For each completed experiment: - Aggregate scores: compute `scoring.primary` (e.g., mean_score) and `scoring.secondary` values - If `singleton_sample > 0`: also dispatch singleton evaluation sub-agents -5. **If gates pass AND primary type is `hard`**: +6. **If gates pass AND primary type is `hard`**: - Use the metric value directly from the measurement output -6. **Record results** in the experiment log entry +7. **IMMEDIATELY append to experiment log on disk** — do not defer this to batch evaluation. 
Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize/<spec_name>/experiment-log.yaml` right now. The outcome may be preliminary (e.g., `gates_passed` but not yet compared to best) — that is fine. Update the outcome to `kept` or `reverted` in the evaluation step, but the raw metrics are on disk and safe from context compaction. + +**Why immediately?** The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to `results.tsv` after every single experiment — this skill must do the same with the experiment log. ### 3.4 Evaluate Batch @@ -362,18 +403,29 @@ After all experiments in the batch have been measured: ### 3.5 Update State -1. **Update experiment log** with ALL results from this batch (kept, reverted, degenerate, error, deferred) +By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state. -2. **Write strategy digest** to `.context/compound-engineering/ce-optimize/<spec_name>/strategy-digest.md`: +1. **Re-read the experiment log from disk** — do not trust in-memory state. The log is the source of truth. + +2. **Finalize outcomes** — update experiment entries from step 3.4 evaluation (mark `kept`, `reverted`, `runner_up_kept`, etc.). Write these outcome updates to disk immediately. + +3. **Update the `best` section** in the experiment log if a new best was found. Write to disk. + +4. **Write strategy digest** to `.context/compound-engineering/ce-optimize/<spec_name>/strategy-digest.md`: - Categories tried so far (with success/failure counts) - Key learnings from this batch and overall - Exploration frontier: what categories and approaches remain untried - Current best metrics and improvement from baseline -3. 
**Generate new hypotheses** based on learnings: - - Read the rolling window (last 10 experiments from the log) and the strategy digest +5. **Generate new hypotheses** based on learnings: + - Re-read the strategy digest from disk (not from memory) + - Read the rolling window (last 10 experiments from the log on disk) - Do NOT read the full experiment log -- use the digest for broad context - - Add new hypotheses to the backlog based on what succeeded (explore further in that direction) and what failed (avoid similar approaches) + - Add new hypotheses to the backlog and write the updated backlog to disk + +6. **Write updated hypothesis backlog to disk** — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones. + +**Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.** ### 3.6 Check Stopping Criteria @@ -402,7 +454,7 @@ If no stopping criterion is met, proceed to the next batch (step 3.1). - Current best metric and improvement from baseline - Cumulative judge cost (if applicable) -**Crash recovery**: Each experiment should write a small `result.yaml` marker in its worktree upon measurement completion. On resume (Phase 0.4 detects existing branch), scan `.worktrees/` for completed-but-unlogged experiment markers before starting a new batch. +**Crash recovery**: See Persistence Discipline section. Per-experiment `result.yaml` markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any `result.yaml` markers not yet reflected in the log. 
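The resume-time recovery scan described above can be sketched as follows. This is a minimal illustration only: the `result.yaml` marker filename comes from the skill, but the iteration-keyed dictionary layout and the `find_unlogged` helper name are assumptions for the example.

```python
# Sketch of the resume-time recovery scan: find experiments whose
# result.yaml marker exists in a worktree but whose iteration number
# is not yet present in experiment-log.yaml.
def find_unlogged(markers, logged_iterations):
    """markers: {iteration: path to that experiment's result.yaml marker}
    logged_iterations: iterations already recorded in the experiment log"""
    return {it: path for it, path in markers.items()
            if it not in logged_iterations}

markers = {
    7: ".worktrees/optimize-myspec-exp-007/result.yaml",
    8: ".worktrees/optimize-myspec-exp-008/result.yaml",
}
logged = {1, 2, 3, 4, 5, 6, 7}
# Experiment 8 was measured but never appended to the log, so it must
# be recovered before starting a new batch.
print(find_unlogged(markers, logged))
```

The set difference is the whole trick: anything measured-but-unlogged gets appended to the log before new experiments are dispatched, so no wall-clock time spent measuring is ever lost.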
--- diff --git a/plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml b/plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml index 0dc4f9a1..17ee72bb 100644 --- a/plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml +++ b/plugins/compound-engineering/skills/ce-optimize/references/experiment-log-schema.yaml @@ -1,13 +1,25 @@ # Experiment Log Schema # This is the canonical schema for the experiment log file that accumulates -# across an optimization run. The orchestrating agent reads and writes this -# file after each batch. +# across an optimization run. # # Location: .context/compound-engineering/ce-optimize/<spec_name>/experiment-log.yaml # -# The experiment log is the durable ground truth for all experiments run. -# The orchestrator does NOT read it in full each iteration -- it uses a -# rolling window (last 10 experiments) + a strategy digest file. +# PERSISTENCE MODEL: +# The experiment log on disk is the SINGLE SOURCE OF TRUTH. The agent's +# in-memory context is expendable and will be compacted during long runs. +# +# Write discipline: +# - Each experiment entry is APPENDED immediately after its measurement +# completes (SKILL.md step 3.3), before batch evaluation +# - Outcome fields may be updated in-place after batch evaluation (step 3.5) +# - The `best` section is updated after each batch if a new best is found +# - The `hypothesis_backlog` is updated after each batch +# - The agent re-reads this file from disk at every phase boundary +# +# The orchestrator does NOT read the full log each iteration -- it uses a +# rolling window (last 10 experiments) + a strategy digest file for +# hypothesis generation. But the full log exists on disk for resume, +# crash recovery, and post-run analysis. 
# ============================================================================ # TOP-LEVEL STRUCTURE From 72353c2d7d10376e30e3fb41317fb5b2e5cd685a Mon Sep 17 00:00:00 2001 From: huntharo Date: Mon, 30 Mar 2026 08:37:00 -0400 Subject: [PATCH 4/6] fix(ce-optimize): enforce judge mode detection and disk persistence verification First-run testing revealed two critical failures: 1. The skill defaulted to type:hard for a qualitative clustering target, optimizing a proxy metric without ever checking cluster coherence. Phase 0.2 now actively detects qualitative targets, strongly recommends type:judge, and walks users through sampling strategy and rubric design. 2. Experiment results were dumped into the conversation but never written to disk. Added mandatory write-then-verify checkpoints (CP-0 through CP-5) at every phase boundary. The persistence discipline now states: "If you produce a results table without writing to disk first, you have a bug." Also adds first-run lessons to the brainstorm doc. Co-Authored-By: Claude Opus 4.6 (1M context) --- ...terative-optimization-loop-requirements.md | 34 +++- .../skills/ce-optimize/SKILL.md | 151 +++++++++++++++--- 2 files changed, 158 insertions(+), 27 deletions(-) diff --git a/docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md b/docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md index d378d774..c6efe8a9 100644 --- a/docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md +++ b/docs/brainstorms/2026-03-29-iterative-optimization-loop-requirements.md @@ -942,10 +942,36 @@ AutoResearch is designed for ML training on a single GPU. CE's version needs to - A human reviewing the experiment log can understand what was tried and why - The skill handles failures gracefully (bad experiments don't corrupt state) +## Lessons from First Run (2026-03-30) + +The skill was tested on the clustering problem for ~90 minutes. 
Results: + +**What worked:** +- Ran 16 experiments, improved multi_member_pct from 31.4% to 72.1% +- Explored multiple algorithm modes (basic, refine, bounded union-find) +- Correctly identified size-bounded union-find as the winning approach +- Hypothesis diversity across parameter sweeps was reasonable + +**What failed:** + +1. **No LLM-as-judge evaluation** -- The skill defaulted to `type: hard` and optimized `multi_member_pct` as the primary metric. This is a proxy metric that can mislead. A solution that puts 72% of items in clusters is useless if the clusters are incoherent. The Phase 0.2 interactive spec creation did not actively probe whether the target was qualitative or guide toward judge mode. + + **Fix applied**: Phase 0.2 now includes explicit qualitative vs quantitative detection, concrete examples of when to use each type, sampling strategy guidance with walkthrough questions, and rubric design guidance. The skill now strongly recommends `type: judge` for qualitative targets. + +2. **No disk persistence** -- Experiment results existed only in the conversation context (as a table dumped to chat). If the session had been compacted or crashed, all 90 minutes of results would have been lost. This directly contradicts the Karpathy model where `results.tsv` is written after every single experiment. + + **Fix applied**: Added mandatory disk checkpoints (CP-0 through CP-5) at every phase boundary. Each checkpoint requires a write-then-verify cycle: write the file, read it back, confirm the content is present. The persistence discipline section now explicitly states "If you produce a results table in the conversation without writing those results to disk first, you have a bug." + +3. **Sampling strategy not prompted** -- Even if `type: judge` had been used, the skill didn't guide the user through designing a sampling strategy. 
For clustering, the user wants stratified sampling across: top clusters by size (check for mega-clusters), mid-range clusters (representative quality), small clusters (check if connections are real), and singletons (check for false negatives). This domain-specific guidance was missing. + + **Fix applied**: Phase 0.2 now walks through sampling strategy design with concrete questions and domain-specific examples. + +**Key takeaway**: The skill had all the right machinery in the schema and templates but the SKILL.md instructions didn't forcefully enough guide the agent toward using that machinery. Instructions that say "if judge type, do X" are ignored when the skill silently defaults to hard type. Instructions need to actively detect the right path and guide toward it. + ## Next Steps -1. `/ce:brainstorm` this further to refine the spec format and phase gates -2. Build a prototype with the clustering use case as the driving example +1. Re-test with the clustering use case using `type: judge` to validate the judge loop works end-to-end +2. Verify disk persistence works on a long run (2+ hours) with context compaction 3. Test with a second use case (e.g., prompt optimization, build performance) to validate generality -4. Design the measurement harness validation checks -5. Decide on the v1 branch/git strategy +4. Consider adding anchor samples for judge calibration across experiments (Open Question #2) +5. Consider judge cost budgets (Open Question #8) diff --git a/plugins/compound-engineering/skills/ce-optimize/SKILL.md b/plugins/compound-engineering/skills/ce-optimize/SKILL.md index f8caf1d1..38a0dcfb 100644 --- a/plugins/compound-engineering/skills/ce-optimize/SKILL.md +++ b/plugins/compound-engineering/skills/ce-optimize/SKILL.md @@ -34,30 +34,55 @@ Reference the experiment log schema for state management: ## Persistence Discipline -**The experiment log on disk is the single source of truth. 
The agent's in-memory context is expendable.** +**CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.** -This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters must live on disk, not in the agent's memory. +This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory. + +**If you produce a results table in the conversation without writing those results to disk first, you have a bug.** The conversation is for the user's benefit. The experiment log file is for durability. ### Core Rules 1. **Write each experiment result to disk IMMEDIATELY after measurement** — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule. -2. **Re-read from disk at every phase boundary and before every decision** — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk. +2. **VERIFY every critical write** — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes. + +3. **Re-read from disk at every phase boundary and before every decision** — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk. + +4. **The experiment log is append-only during Phase 3** — never rewrite the full file. Append new experiment entries. 
Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted. + +5. **Per-experiment result markers for crash recovery** — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged. + +6. **Strategy digest is written after every batch, before generating new hypotheses** — the agent reads the digest (not its memory) when deciding what to try next. -3. **The experiment log is append-only during Phase 3** — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted. +7. **Never present results to the user without writing them to disk first** — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse. -4. **Per-experiment result markers for crash recovery** — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged. +### Mandatory Disk Checkpoints -5. **Strategy digest is written after every batch, before generating new hypotheses** — the agent reads the digest (not its memory) when deciding what to try next. +These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded. 
+ +| Checkpoint | File Written | Phase | +|---|---|---| +| CP-0: Spec saved | `spec.yaml` | Phase 0, after user approval | +| CP-1: Baseline recorded | `experiment-log.yaml` (initial with baseline) | Phase 1, after baseline measurement | +| CP-2: Hypothesis backlog saved | `experiment-log.yaml` (hypothesis_backlog section) | Phase 2, after hypothesis generation | +| CP-3: Each experiment result | `experiment-log.yaml` (append experiment entry) | Phase 3.3, immediately after each measurement | +| CP-4: Batch summary | `experiment-log.yaml` (outcomes + best) + `strategy-digest.md` | Phase 3.5, after batch evaluation | +| CP-5: Final summary | `experiment-log.yaml` (final state) | Phase 4, at wrap-up | + +**Format of a verification step:** +1. Write the file using the native file-write tool +2. Read the file back using the native file-read tool +3. Confirm the expected content is present +4. If verification fails, retry the write. If it fails twice, alert the user. ### File Locations (all under `.context/compound-engineering/ce-optimize/<spec_name>/`) | File | Purpose | Written When | |------|---------|-------------| -| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 | -| `experiment-log.yaml` | Full history of all experiments | Appended after EACH experiment measurement | -| `strategy-digest.md` | Compressed learnings for hypothesis generation | After each batch completes | -| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before log append | +| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 (CP-0) | +| `experiment-log.yaml` | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 | +| `strategy-digest.md` | Compressed learnings for hypothesis generation | Written at CP-4 after each batch | +| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 | ### On Resume @@ -94,15 +119,81 @@ Check whether the input is: **If description 
provided:** 1. Analyze the project to understand what can be measured -2. Guide the user through creating a spec: - - What is the optimization target? (metric name, direction, type) - - What degenerate cases should be rejected? (gates) - - If judge type: what rubric should the judge use? +2. **Detect whether the optimization target is qualitative or quantitative** — this determines `type: hard` vs `type: judge` and is the single most important spec decision: + + **Use `type: hard`** when: + - The metric is a scalar number with a clear "better" direction + - The metric is objectively measurable (build time, test pass rate, latency, memory usage) + - No human judgment is needed to evaluate "is this result actually good?" + - Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size + + **Use `type: judge`** when: + - The quality of the output requires semantic understanding to evaluate + - A human reviewer would need to look at the results to say "this is better" + - Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters") + - The optimization could produce degenerate solutions that look good on paper + - Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance + + **IMPORTANT**: If the target is qualitative, **strongly recommend `type: judge`**. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach: + - **Degenerate gates** (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step. + - **LLM-as-judge** (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes. + - **Diagnostics** (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed. 
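
The three-tier flow above can be sketched as a single evaluation pass. All names here (`gates`, `judge_item`, `diagnostics`, the result fields) are illustrative assumptions, not the skill's actual measurement harness:

```python
def evaluate(output, gates, judge_item, diagnostics):
    """Three-tier evaluation: cheap gates first, judge only if gates pass.

    Illustrative sketch -- argument names and result shape are assumptions,
    not the skill's real harness.
    """
    # Tier 3 (diagnostics): always computed and logged, never gated.
    result = {"diagnostics": {name: fn(output) for name, fn in diagnostics.items()}}

    # Tier 1 (degenerate gates): hard, cheap, fast. Failing any gate
    # skips the expensive judge step entirely.
    failed = [name for name, check in gates.items() if not check(output)]
    if failed:
        result.update(outcome="gates_failed", failed_gates=failed)
        return result

    # Tier 2 (LLM-as-judge): score the sampled items, aggregate the mean.
    scores = [judge_item(item) for item in output["sample"]]
    result.update(outcome="gates_passed", mean_score=sum(scores) / len(scores))
    return result
```

The ordering is the point: diagnostics are free context, gates are a cheap circuit-breaker, and the judge only spends tokens on outputs that are not obviously broken.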
+ + If the user insists on `type: hard` for a qualitative target, proceed but warn that the results may optimize a misleading proxy. + +3. **Design the sampling strategy** (for `type: judge`): + + Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?" + + Walk through these questions: + - **What does one "item" look like?** (a cluster, a search result page, a summary, etc.) + - **What are the natural size/quality strata?** (e.g., large clusters vs small clusters vs singletons) + - **Where are quality failures most likely?** (e.g., very large clusters may be degenerate merges; singletons may be missed groupings) + - **What total sample size balances cost vs signal?** (default: 30 items, adjust based on output volume) + + Example stratified sampling for clustering: + ```yaml + stratification: + - bucket: "top_by_size" # largest clusters — check for degenerate mega-clusters + count: 10 + - bucket: "mid_range" # middle of non-solo cluster size range — representative quality + count: 10 + - bucket: "small_clusters" # clusters with 2-3 items — check if connections are real + count: 10 + singleton_sample: 15 # singletons — check for false negatives (items that should cluster) + ``` + + The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents". + + **Singleton evaluation is critical when the goal involves coverage** — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings. + +4. **Design the rubric** (for `type: judge`): + + Help the user define the scoring rubric. 
A good rubric: + - Has a 1-5 scale (or similar) with concrete descriptions for each level + - Includes supplementary fields that help diagnose issues (e.g., `distinct_topics`, `outlier_count`) + - Is specific enough that two judges would give similar scores + - Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad + + Example for clustering: + ```yaml + rubric: | + Rate this cluster 1-5: + - 5: All items clearly about the same issue/feature + - 4: Strong theme, minor outliers + - 3: Related but covers 2-3 sub-topics that could reasonably be split + - 2: Weak connection — items share superficial similarity only + - 1: Unrelated items grouped together + Also report: distinct_topics (integer), outlier_count (integer) + ``` + +5. Guide the user through the remaining spec fields: + - What degenerate cases should be rejected? (gates — e.g., "solo_pct <= 0.95" catches all-singletons, "max_cluster_size <= 500" catches mega-clusters) - What command runs the measurement? - What files can be modified? What is immutable? - Any constraints or dependencies? -3. Write the spec to `.context/compound-engineering/ce-optimize//spec.yaml` -4. Present the spec to the user for approval before proceeding +6. Write the spec to `.context/compound-engineering/ce-optimize//spec.yaml` +7. Present the spec to the user for approval before proceeding ### 0.3 Search Prior Learnings @@ -217,11 +308,21 @@ If count + `execution.max_concurrent` would exceed 12: - Suggest cleaning up existing worktrees or reducing `max_concurrent` - Do NOT block -- the user may proceed at their own risk -### 1.6 User Approval Gate +### 1.6 Write Baseline to Disk (CP-1) + +**MANDATORY CHECKPOINT.** Before presenting results to the user, write the initial experiment log with baseline metrics to disk: + +1. Create the experiment log file at `.context/compound-engineering/ce-optimize//experiment-log.yaml` +2. 
Include: spec name, run_id, started_at, baseline section with all gate values, diagnostic values, and judge scores (if applicable) +3. **Verify**: read the file back and confirm the baseline section is present and values match +4. Only THEN present results to the user + +### 1.7 User Approval Gate Present to the user via the platform question tool: - **Baseline metrics**: all gate values, diagnostic values, and judge scores (if applicable) +- **Experiment log location**: show the file path so the user knows where results are saved - **Parallel readiness**: probe results, any blockers, mitigations applied - **Clean-tree status**: confirmed clean - **Worktree budget**: current count and projected usage @@ -271,9 +372,9 @@ If any hypotheses require new dependencies: Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval. -### 2.4 Record Hypothesis Backlog +### 2.4 Record Hypothesis Backlog (CP-2) -Write the initial backlog to the experiment log file: +**MANDATORY CHECKPOINT.** Write the initial backlog to the experiment log file and verify: ```yaml hypothesis_backlog: - description: "Remove template boilerplate before embedding" @@ -370,9 +471,11 @@ For each completed experiment, **immediately**: 6. **If gates pass AND primary type is `hard`**: - Use the metric value directly from the measurement output -7. **IMMEDIATELY append to experiment log on disk** — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize//experiment-log.yaml` right now. The outcome may be preliminary (e.g., `gates_passed` but not yet compared to best) — that is fine. Update the outcome to `kept` or `reverted` in the evaluation step, but the raw metrics are on disk and safe from context compaction. +7. 
**IMMEDIATELY append to experiment log on disk (CP-3)** — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize//experiment-log.yaml` right now. The outcome may be preliminary (e.g., `gates_passed` but not yet compared to best) — that is fine. Update the outcome to `kept` or `reverted` in the evaluation step, but the raw metrics are on disk and safe from context compaction. -**Why immediately?** The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to `results.tsv` after every single experiment — this skill must do the same with the experiment log. +8. **VERIFY the write (CP-3 verification)** — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk. + +**Why immediately + verify?** The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to `results.tsv` after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data. ### 3.4 Evaluate Batch @@ -401,9 +504,9 @@ After all experiments in the batch have been measured: 6. **Revert all others**: cleanup worktrees, log as `reverted` -### 3.5 Update State +### 3.5 Update State (CP-4) -By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state. +**MANDATORY CHECKPOINT.** By this point, individual experiment results are already on disk (written in step 3.3). 
This step updates aggregate state and verifies. 1. **Re-read the experiment log from disk** — do not trust in-memory state. The log is the source of truth. @@ -425,6 +528,8 @@ By this point, individual experiment results are already on disk (written in ste 6. **Write updated hypothesis backlog to disk** — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones. +**CP-4 Verification:** Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the `best` section reflects the current best, (c) the hypothesis backlog is updated. Read `strategy-digest.md` back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check. + **Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.** ### 3.6 Check Stopping Criteria From 601fb285a179822ee357cd7dcdd4a87830b5c55b Mon Sep 17 00:00:00 2001 From: huntharo Date: Mon, 30 Mar 2026 15:03:58 -0400 Subject: [PATCH 5/6] fix(ce-optimize): unblock experiment worktrees and measurement --- .../skills/ce-optimize/SKILL.md | 4 +- .../scripts/experiment-worktree.sh | 22 +++++++--- .../skills/ce-optimize/scripts/measure.sh | 42 ++++++++++++++++++- 3 files changed, 60 insertions(+), 8 deletions(-) diff --git a/plugins/compound-engineering/skills/ce-optimize/SKILL.md b/plugins/compound-engineering/skills/ce-optimize/SKILL.md index 38a0dcfb..b271e871 100644 --- a/plugins/compound-engineering/skills/ce-optimize/SKILL.md +++ b/plugins/compound-engineering/skills/ce-optimize/SKILL.md @@ -211,7 +211,7 @@ git rev-parse --verify "optimize/" 2>/dev/null Present the user with a choice via the platform question tool: - **Resume**: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). 
Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log. -- **Fresh start**: archive the old branch to `optimize//archived-`, clear the experiment log, start from scratch +- **Fresh start**: archive the old branch to `optimize-archive//archived-`, clear the experiment log, start from scratch ### 0.5 Create Optimization Branch and Scratch Space @@ -412,7 +412,7 @@ For each hypothesis in the batch, dispatch in parallel: **Worktree backend:** 1. Create experiment worktree: ```bash - WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "" "optimize/" ) + WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "" "optimize/" ) # creates optimize-exp//exp- ``` 2. Apply port parameterization if configured (set env vars for the measurement script) 3. Fill the experiment prompt template (`references/experiment-prompt-template.md`) with: diff --git a/plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh b/plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh index 5d44e5dd..302e34f5 100755 --- a/plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh +++ b/plugins/compound-engineering/skills/ce-optimize/scripts/experiment-worktree.sh @@ -11,7 +11,7 @@ # experiment-worktree.sh count # # Worktrees are created at: .worktrees/optimize--exp-/ -# Branches are named: optimize//exp- +# Branches are named: optimize-exp//exp- set -euo pipefail @@ -28,6 +28,15 @@ GIT_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) || { WORKTREE_DIR="$GIT_ROOT/.worktrees" +experiment_branch_name() { + local spec_name="${1:?Error: spec_name required}" + local padded_index="${2:?Error: padded_index required}" + + # Keep experiment refs outside optimize/ so they do not collide + # with the long-lived optimization branch namespace. + echo "optimize-exp/${spec_name}/exp-${padded_index}" +} + ensure_gitignore() { if ! 
grep -q "^\.worktrees$" "$GIT_ROOT/.gitignore" 2>/dev/null; then echo ".worktrees" >> "$GIT_ROOT/.gitignore" @@ -44,7 +53,8 @@ create_worktree() { local padded_index padded_index=$(printf "%03d" "$exp_index") local worktree_name="optimize-${spec_name}-exp-${padded_index}" - local branch_name="optimize/${spec_name}/exp-${padded_index}" + local branch_name + branch_name=$(experiment_branch_name "$spec_name" "$padded_index") local worktree_path="$WORKTREE_DIR/$worktree_name" # Check if worktree already exists @@ -101,7 +111,8 @@ cleanup_worktree() { local padded_index padded_index=$(printf "%03d" "$exp_index") local worktree_name="optimize-${spec_name}-exp-${padded_index}" - local branch_name="optimize/${spec_name}/exp-${padded_index}" + local branch_name + branch_name=$(experiment_branch_name "$spec_name" "$padded_index") local worktree_path="$WORKTREE_DIR/$worktree_name" if [[ -d "$worktree_path" ]]; then @@ -141,7 +152,8 @@ cleanup_all() { } # Delete the branch - local branch_name="optimize/${spec_name}/exp-${index_str}" + local branch_name + branch_name=$(experiment_branch_name "$spec_name" "$index_str") git branch -D "$branch_name" 2>/dev/null || true count=$((count + 1)) @@ -208,7 +220,7 @@ Commands: count Count total active worktrees (for budget checking) Worktrees: .worktrees/optimize--exp-/ -Branches: optimize//exp- +Branches: optimize-exp//exp- EOF ;; *) diff --git a/plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh b/plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh index d9a6aaa6..3bb5488a 100755 --- a/plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh +++ b/plugins/compound-engineering/skills/ce-optimize/scripts/measure.sh @@ -41,7 +41,47 @@ cd "$WORKDIR" || { exit 1 } +run_with_timeout() { + if command -v timeout >/dev/null 2>&1; then + timeout "$TIMEOUT" bash -c "$COMMAND" + return + fi + + if command -v gtimeout >/dev/null 2>&1; then + gtimeout "$TIMEOUT" bash -c "$COMMAND" + return + fi + + if command 
-v python3 >/dev/null 2>&1; then + python3 - "$TIMEOUT" "$COMMAND" <<'PY' +import os +import signal +import subprocess +import sys + +timeout_seconds = int(sys.argv[1]) +command = sys.argv[2] +proc = subprocess.Popen(["bash", "-c", command], start_new_session=True) + +try: + sys.exit(proc.wait(timeout=timeout_seconds)) +except subprocess.TimeoutExpired: + os.killpg(proc.pid, signal.SIGTERM) + try: + proc.wait(timeout=5) + except subprocess.TimeoutExpired: + os.killpg(proc.pid, signal.SIGKILL) + proc.wait() + sys.exit(124) +PY + return + fi + + echo "Error: no timeout implementation available (tried timeout, gtimeout, python3)" >&2 + exit 1 +} + # Run the measurement command with timeout # timeout returns 124 if the command times out # We pass stdout and stderr through directly -timeout "$TIMEOUT" bash -c "$COMMAND" +run_with_timeout From 705f124e1e4a551879182b607e4004f973ff42f5 Mon Sep 17 00:00:00 2001 From: huntharo Date: Mon, 30 Mar 2026 16:44:49 -0400 Subject: [PATCH 6/6] fix(ce-optimize): improve first-run experiment guidance --- plugins/compound-engineering/README.md | 2 + .../skills/ce-optimize/SKILL.md | 21 ++++- .../references/example-hard-spec.yaml | 64 +++++++++++++++ .../references/example-judge-spec.yaml | 78 +++++++++++++++++++ .../references/optimize-spec-schema.yaml | 9 ++- .../ce-optimize/scripts/parallel-probe.sh | 53 +++++++++---- 6 files changed, 207 insertions(+), 20 deletions(-) create mode 100644 plugins/compound-engineering/skills/ce-optimize/references/example-hard-spec.yaml create mode 100644 plugins/compound-engineering/skills/ce-optimize/references/example-judge-spec.yaml diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md index b6beb5fd..86e03646 100644 --- a/plugins/compound-engineering/README.md +++ b/plugins/compound-engineering/README.md @@ -27,6 +27,8 @@ The primary entry points for engineering work, invoked as slash commands: | `/ce:compound-refresh` | Refresh stale or drifting learnings and 
decide whether to keep, update, replace, or archive them | | `/ce:optimize` | Run iterative optimization loops with parallel experiments, measurement gates, and LLM-as-judge quality scoring | +For `/ce:optimize`, start from the checked-in example specs under `skills/ce-optimize/references/` and keep the first run serial, small, and cheap until the measurement harness is trustworthy. + ### Git Workflow | Skill | Description | diff --git a/plugins/compound-engineering/skills/ce-optimize/SKILL.md b/plugins/compound-engineering/skills/ce-optimize/SKILL.md index b271e871..a29c4947 100644 --- a/plugins/compound-engineering/skills/ce-optimize/SKILL.md +++ b/plugins/compound-engineering/skills/ce-optimize/SKILL.md @@ -30,6 +30,17 @@ Reference the experiment log schema for state management: `references/experiment-log-schema.yaml` +## Quick Start + +For a first run, optimize for signal and safety, not maximum throughput: + +- Start from `references/example-hard-spec.yaml` when the metric is objective and cheap to measure +- Use `references/example-judge-spec.yaml` only when actual quality requires semantic judgment +- Prefer `execution.mode: serial` and `execution.max_concurrent: 1` +- Cap the first run with `stopping.max_iterations: 4` and `stopping.max_hours: 1` +- Avoid new dependencies until the baseline and measurement harness are trusted +- For judge mode, start with `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` + --- ## Persistence Discipline @@ -192,6 +203,8 @@ Check whether the input is: - What command runs the measurement? - What files can be modified? What is immutable? - Any constraints or dependencies? + - If this is the first run: recommend `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1` + - If `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted 6. 
Write the spec to `.context/compound-engineering/ce-optimize//spec.yaml` 7. Present the spec to the user for approval before proceeding @@ -291,10 +304,10 @@ If primary type is `judge`, also run the judge evaluation on baseline output to Run the parallelism probe script: ```bash -bash scripts/parallel-probe.sh "" "" +bash scripts/parallel-probe.sh "" "" "" ``` -Read the JSON output. Present any blockers to the user with suggested mitigations. +Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository. ### 1.5 Worktree Budget Check @@ -326,6 +339,7 @@ Present to the user via the platform question tool: - **Parallel readiness**: probe results, any blockers, mitigations applied - **Clean-tree status**: confirmed clean - **Worktree budget**: current count and projected usage +- **Judge budget**: estimated per-experiment judge cost and configured `max_total_cost_usd` cap (or an explicit note that spend is uncapped) **Options:** 1. **Proceed** -- approve baseline and parallel config, move to Phase 2 @@ -334,6 +348,8 @@ Present to the user via the platform question tool: Do NOT proceed to Phase 2 until the user explicitly approves. +If primary type is `judge` and `max_total_cost_usd` is null, call that out as uncapped spend and require explicit approval before proceeding. + **State re-read:** After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward. 
--- @@ -538,6 +554,7 @@ Stop the loop if ANY of these are true: - **Target reached**: primary metric meets or exceeds `stopping.target` (if set in spec) - **Max iterations**: total experiments run >= `stopping.max_iterations` - **Max hours**: wall-clock time since Phase 3 start >= `stopping.max_hours` +- **Judge budget exhausted**: cumulative judge spend >= `metric.judge.max_total_cost_usd` (if set) - **Plateau**: no improvement for `stopping.plateau_iterations` consecutive experiments - **Manual stop**: user interrupts (save state and proceed to Phase 4) - **Empty backlog**: no hypotheses remain and no new ones can be generated diff --git a/plugins/compound-engineering/skills/ce-optimize/references/example-hard-spec.yaml b/plugins/compound-engineering/skills/ce-optimize/references/example-hard-spec.yaml new file mode 100644 index 00000000..765d6242 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-optimize/references/example-hard-spec.yaml @@ -0,0 +1,64 @@ +# Minimal first-run template for objective metrics. +# Start here when "better" is a scalar value from the measurement harness. 
+ +name: improve-build-latency +description: Reduce build latency without regressing correctness + +metric: + primary: + type: hard + name: build_seconds + direction: minimize + degenerate_gates: + - name: build_passed + check: "== 1" + description: The build must stay green + - name: test_pass_rate + check: ">= 1.0" + description: Required tests must keep passing + diagnostics: + - name: artifact_size_mb + - name: peak_memory_mb + +measurement: + command: "python evaluate.py" + timeout_seconds: 300 + working_directory: "tools/eval" + stability: + mode: repeat + repeat_count: 3 + aggregation: median + noise_threshold: 0.05 + +scope: + mutable: + - "src/build/" + - "config/build.yaml" + immutable: + - "tools/eval/evaluate.py" + - "tests/fixtures/" + - "scripts/ci/" + +execution: + mode: serial + backend: worktree + max_concurrent: 1 + +parallel: + port_strategy: none + shared_files: [] + +dependencies: + approved: [] + +constraints: + - "Keep output artifacts backward compatible" + - "Do not skip required validation steps" + +stopping: + max_iterations: 4 + max_hours: 1 + plateau_iterations: 3 + target_reached: true + +max_runner_up_merges_per_batch: 0 diff --git a/plugins/compound-engineering/skills/ce-optimize/references/example-judge-spec.yaml b/plugins/compound-engineering/skills/ce-optimize/references/example-judge-spec.yaml new file mode 100644 index 00000000..8d325a32 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-optimize/references/example-judge-spec.yaml @@ -0,0 +1,78 @@ +# Minimal first-run template for qualitative metrics. +# Start here when true quality requires semantic judgment, not a proxy metric. 
+ +name: improve-search-relevance +description: Improve semantic relevance of search results without obvious failures + +metric: + primary: + type: judge + name: mean_score + direction: maximize + degenerate_gates: + - name: result_count + check: ">= 5" + description: Return enough results to judge quality + - name: empty_query_failures + check: "== 0" + description: Empty or trivial queries must not fail + diagnostics: + - name: latency_ms + - name: recall_at_10 + judge: + rubric: | + Rate each result set from 1-5 for relevance: + - 5: Results are directly relevant and well ordered + - 4: Mostly relevant with minor ordering issues + - 3: Mixed relevance or one obvious miss + - 2: Weak relevance, several misses, or poor ordering + - 1: Mostly irrelevant + Also report: ambiguous (boolean) + scoring: + primary: mean_score + secondary: + - ambiguous_rate + model: haiku + sample_size: 10 + batch_size: 5 + sample_seed: 42 + minimum_improvement: 0.2 + max_total_cost_usd: 5 + +measurement: + command: "python eval_search.py" + timeout_seconds: 300 + working_directory: "tools/eval" + +scope: + mutable: + - "src/search/" + - "config/search.yaml" + immutable: + - "tools/eval/eval_search.py" + - "tests/fixtures/" + - "docs/" + +execution: + mode: serial + backend: worktree + max_concurrent: 1 + +parallel: + port_strategy: none + shared_files: [] + +dependencies: + approved: [] + +constraints: + - "Preserve the existing search response shape" + - "Do not add new dependencies on the first run" + +stopping: + max_iterations: 4 + max_hours: 1 + plateau_iterations: 3 + target_reached: true + +max_runner_up_merges_per_batch: 0 diff --git a/plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml b/plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml index 06dee26f..bb26b2f4 100644 --- a/plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml +++ 
b/plugins/compound-engineering/skills/ce-optimize/references/optimize-spec-schema.yaml @@ -134,7 +134,7 @@ required_fields: description: "Model to use for judge evaluation. Haiku is cheaper and faster; Sonnet is more nuanced." sample_size: type: integer - default: 30 + default: 10 description: "Total number of output items to sample for judge evaluation per experiment" stratification: type: array @@ -163,12 +163,16 @@ required_fields: description: "Fixed seed for reproducible sampling across experiments" batch_size: type: integer - default: 10 + default: 5 description: "Number of samples per judge sub-agent batch. Controls parallelism vs overhead." minimum_improvement: type: number default: 0.3 description: "Minimum judge score improvement required to accept an experiment as 'better'. Accounts for sample-composition variance when output structure changes between experiments. Distinct from measurement.stability.noise_threshold which handles run-to-run flakiness." + max_total_cost_usd: + type: number + default: 5 + description: "Stop judge evaluation when cumulative judge spend reaches this cap. This is a first-run safety default; raise it only after the rubric and harness are trustworthy. Set to null only with explicit user approval." 
measurement: type: object @@ -380,4 +384,5 @@ validation_rules: - "execution.max_concurrent must not exceed 6 when execution.backend is 'worktree'" - "If parallel.exclusive_resources is non-empty, execution.mode should be 'serial'" - "If metric.judge.singleton_sample > 0, metric.judge.singleton_rubric must be present" + - "If metric.primary.type is 'judge' and metric.judge.max_total_cost_usd is null, the user should explicitly approve uncapped spend" - "stopping must have at least one non-default criterion or use defaults" diff --git a/plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh b/plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh index 931f3147..913e0a00 100755 --- a/plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh +++ b/plugins/compound-engineering/skills/ce-optimize/scripts/parallel-probe.sh @@ -4,11 +4,13 @@ # Detects common parallelism blockers in the target project. # Output is advisory -- the skill presents results to the user for approval. # -# Usage: parallel-probe.sh [measurement_command] +# Usage: parallel-probe.sh [measurement_command] [measurement_workdir] [shared_file ...] # # Arguments: # project_directory - Root directory of the project to probe # measurement_command - The measurement command from the spec (optional, for port detection) +# measurement_workdir - Measurement working directory relative to project root (default: .) 
+# shared_file - Explicitly declared shared files that parallel runs depend on # # Output: # JSON to stdout with: @@ -19,6 +21,10 @@ set -euo pipefail PROJECT_DIR="${1:?Error: project_directory argument required}" MEASUREMENT_CMD="${2:-}" +MEASUREMENT_WORKDIR="${3:-.}" + +shift 3 2>/dev/null || shift $# 2>/dev/null || true +SHARED_FILES=("$@") cd "$PROJECT_DIR" || { echo '{"mode":"serial","blockers":[{"type":"error","description":"Cannot access project directory","suggestion":"Check path"}]}' @@ -26,6 +32,7 @@ cd "$PROJECT_DIR" || { } BLOCKERS="[]" +SCAN_PATHS=() add_blocker() { local type="$1" @@ -39,6 +46,28 @@ print(json.dumps(b)) " 2>/dev/null || echo "$BLOCKERS") } +add_scan_path() { + local candidate="$1" + + if [[ -z "$candidate" ]]; then + return + fi + + if [[ -e "$candidate" ]]; then + SCAN_PATHS+=("$candidate") + fi +} + +add_scan_path "$MEASUREMENT_WORKDIR" + +for shared_file in "${SHARED_FILES[@]}"; do + add_scan_path "$shared_file" +done + +if [[ ${#SCAN_PATHS[@]} -eq 0 ]]; then + SCAN_PATHS=(".") +fi + # Check 1: Hardcoded ports in measurement command if [[ -n "$MEASUREMENT_CMD" ]]; then # Look for common port patterns in the command itself @@ -47,31 +76,23 @@ if [[ -n "$MEASUREMENT_CMD" ]]; then fi fi -# Check 2: Hardcoded ports in config files -PORT_FILES=$(grep -rl --include='*.yaml' --include='*.yml' --include='*.json' --include='*.toml' --include='*.cfg' --include='*.ini' --include='*.env' -E '(port:\s*[0-9]{4,5}|PORT\s*=\s*[0-9]{4,5}|"port":\s*[0-9]{4,5})' . 2>/dev/null | head -10 || true) -if [[ -n "$PORT_FILES" ]]; then - FILE_COUNT=$(echo "$PORT_FILES" | wc -l | tr -d ' ') - add_blocker "port" "Found hardcoded port numbers in $FILE_COUNT config file(s)" "Parameterize ports via environment variables" -fi - -# Check 3: SQLite databases -SQLITE_FILES=$(find . -maxdepth 4 -type f \( -name '*.db' -o -name '*.sqlite' -o -name '*.sqlite3' \) ! -path './.git/*' ! 
-path './node_modules/*' 2>/dev/null | head -10 || true) +# Check 2: SQLite databases in the measurement workdir or declared shared files +SQLITE_FILES=$(find "${SCAN_PATHS[@]}" -maxdepth 4 -type f \( -name '*.db' -o -name '*.sqlite' -o -name '*.sqlite3' \) ! -path '*/.git/*' ! -path '*/node_modules/*' ! -path '*/.claude/*' ! -path '*/.context/*' ! -path '*/.worktrees/*' 2>/dev/null | head -10 || true) if [[ -n "$SQLITE_FILES" ]]; then FILE_COUNT=$(echo "$SQLITE_FILES" | wc -l | tr -d ' ') add_blocker "shared_file" "Found $FILE_COUNT SQLite database file(s)" "Copy database files into each experiment worktree" fi -# Check 4: Lock/PID files -LOCK_FILES=$(find . -maxdepth 4 -type f \( -name '*.lock' -o -name '*.pid' \) ! -path './.git/*' ! -path './node_modules/*' ! -name 'package-lock.json' ! -name 'yarn.lock' ! -name 'bun.lockb' ! -name 'Gemfile.lock' ! -name 'poetry.lock' ! -name 'Cargo.lock' 2>/dev/null | head -10 || true) +# Check 3: Lock/PID files in the measurement workdir or declared shared files +LOCK_FILES=$(find "${SCAN_PATHS[@]}" -maxdepth 4 -type f \( -name '*.lock' -o -name '*.pid' \) ! -path '*/.git/*' ! -path '*/node_modules/*' ! -path '*/.claude/*' ! -path '*/.context/*' ! -path '*/.worktrees/*' ! -name 'package-lock.json' ! -name 'yarn.lock' ! -name 'bun.lock' ! -name 'bun.lockb' ! -name 'Gemfile.lock' ! -name 'poetry.lock' ! -name 'Cargo.lock' 2>/dev/null | head -10 || true) if [[ -n "$LOCK_FILES" ]]; then FILE_COUNT=$(echo "$LOCK_FILES" | wc -l | tr -d ' ') add_blocker "lock_file" "Found $FILE_COUNT lock/PID file(s) that may cause contention" "Ensure measurement command cleans up lock files, or run in serial mode" fi -# Check 5: GPU references -GPU_FILES=$(grep -rl --include='*.py' --include='*.rs' --include='*.cpp' --include='*.cu' -E '(torch\.device|cuda|gpu|tensorflow|tf\.config)' . 
2>/dev/null | head -5 || true) -if [[ -n "$GPU_FILES" ]]; then - add_blocker "exclusive_resource" "Found GPU/CUDA references in source files" "GPU is typically an exclusive resource -- consider serial mode or GPU device parameterization" +# Check 4: Exclusive resource hints in the measurement command +if [[ -n "$MEASUREMENT_CMD" ]] && echo "$MEASUREMENT_CMD" | grep -qiE '(cuda|gpu|tensorflow|torch|nvidia-smi|CUDA_VISIBLE_DEVICES)'; then + add_blocker "exclusive_resource" "Measurement command appears to use GPU or another exclusive accelerator" "GPU is typically an exclusive resource -- consider serial mode or device parameterization" fi # Determine mode
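
Downstream, the skill reads the probe's JSON and presents it at the approval gate. A minimal consumer sketch, assuming the `{"mode": ..., "blockers": [{type, description, suggestion}]}` shape the script emits (the function name is illustrative):

```python
import json

def summarize_probe(probe_json):
    """Render parallel-probe.sh JSON output for the user approval gate.

    Assumes the shape the script emits: a top-level "mode" plus a
    "blockers" list of {type, description, suggestion} objects.
    """
    probe = json.loads(probe_json)
    lines = [f"recommended mode: {probe['mode']}"]
    for blocker in probe.get("blockers", []):
        lines.append(
            f"- [{blocker['type']}] {blocker['description']}"
            f" (mitigation: {blocker['suggestion']})"
        )
    return "\n".join(lines)
```

Because the probe is advisory, the consumer only formats blockers with their suggested mitigations; the user decides whether to proceed in parallel anyway.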