/ce:optimize - Auto-research loop for tuning system prompts, vector clustering, evaluating different code solutions, etc. (#446)
Results from a 2-hour experiment loop optimizing a system prompt to extract the signal from issue/PR description noise.

The project: https://github.com/pwrdrvr/ghcrawl

Artifacts from the prompt below: pwrdrvr/ghcrawl@5b531b6#diff-518d9117f5847a1ed4c5d6711dd9832f480f540bdf0e7c260c88d3492ef1cb40
- Results summary: table
- Text output: the new prompt
- Final: 3 hours 12 minutes
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c9c18a7147
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```bash
# Run the measurement command with timeout
# timeout returns 124 if the command times out
# We pass stdout and stderr through directly
timeout "$TIMEOUT" bash -c "$COMMAND"
```
Add a portable timeout fallback
This hard-codes GNU timeout, which is not available in common environments like default macOS shells; in those setups the measurement runner exits with code 127 before executing the harness, so /ce:optimize cannot establish baselines or score experiments at all. Please detect and support alternatives (for example gtimeout/Python-based timeout) instead of assuming timeout exists.
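A minimal sketch of such a fallback (the `run_with_timeout` name and the perl last resort are illustrative, not part of the skill's scripts):

```shell
# Pick whichever timeout implementation is available on this machine.
run_with_timeout() {
  local secs="$1"; shift
  if command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"               # GNU coreutils; exits 124 on timeout
  elif command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$secs" "$@"              # coreutils name on macOS (brew install coreutils)
  else
    # Last resort: perl's alarm() kills the child with SIGALRM on timeout
    # (note: the exit code differs from GNU timeout's 124)
    perl -e 'alarm shift; exec @ARGV' "$secs" "$@"
  fi
}
```

Callers that branch on exit code 124 would still need to normalize the perl path's signal-based exit status.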
```bash
fi

# Output JSON result
python3 -c "
```
Gracefully handle missing python3 in probe output
The probe script suppresses some Python failures earlier, but this final unconditional python3 call hard-fails when Python is not installed, aborting Phase 1 parallel-readiness checks instead of returning advisory JSON. In minimal Node/Bun environments this makes the optimization workflow fail before experiments begin; add a no-Python fallback or explicit preflight error handling.
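One possible shape for that fallback, sketched as a hypothetical `emit_probe_json` helper (the JSON fields shown are assumptions, not the probe's actual schema):

```shell
# Emit advisory JSON whether or not python3 is installed, instead of
# hard-failing the Phase 1 parallel-readiness check.
emit_probe_json() {
  if command -v python3 >/dev/null 2>&1; then
    python3 -c "import json; print(json.dumps({'ready': True}))"
  else
    # No Python available: hand-built JSON keeps the probe advisory, not fatal
    printf '{"ready": true, "note": "python3 unavailable"}\n'
  fi
}
```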
```bash
if ! grep -q "^\.worktrees$" "$GIT_ROOT/.gitignore" 2>/dev/null; then
  echo ".worktrees" >> "$GIT_ROOT/.gitignore"
fi
```
Avoid mutating tracked .gitignore during worktree create
Creating a worktree unconditionally appends .worktrees to the repository’s tracked .gitignore when absent, which introduces an unrelated dirty change on first run. That side effect can pollute optimization branches or trip workflows expecting only hypothesis-related file edits; use .git/info/exclude or require explicit user opt-in instead of editing tracked files automatically.
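A sketch of the `.git/info/exclude` approach (helper name illustrative); the exclude file is local to the clone and never committed, so the tracked tree stays clean:

```shell
# Record the ignore rule in .git/info/exclude instead of the tracked .gitignore.
ensure_worktrees_excluded() {
  local exclude_file
  exclude_file="$(git rev-parse --git-dir)/info/exclude"
  mkdir -p "$(dirname "$exclude_file")"
  # Append only if an exact-line match is not already present (idempotent)
  grep -qx '.worktrees' "$exclude_file" 2>/dev/null \
    || echo '.worktrees' >> "$exclude_file"
}
```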
💡 Codex Review (reviewed commit: efbc0919bf)
```markdown
### 3.6 Check Stopping Criteria

Stop the loop if ANY of these are true:
- **Target reached**: primary metric meets or exceeds `stopping.target` (if set in spec)
```
Use metric.primary.target for target stop checks
The stopping logic references stopping.target, but the spec schema defines the numeric target under metric.primary.target (and stopping only has booleans/limits), so a run with a configured target can miss the intended stop condition and continue until other limits trigger. This mismatch will cause unnecessary extra experiments and cost for users relying on target-based termination.
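For reference, the layout the comment describes, sketched as a spec fragment (field names and values are assumptions based on the review's description of the schema):

```yaml
metric:
  primary:
    type: hard
    target: 0.75      # numeric stop target lives here, not under stopping.*
stopping:
  max_iterations: 50  # stopping.* carries booleans and limits only
```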
```bash
elif [[ -d "$GIT_ROOT/$shared_file" ]]; then
  local dir
  dir=$(dirname "$worktree_path/$shared_file")
  mkdir -p "$dir"
  cp -r "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file"
```
Copy shared directories into place instead of nesting them
When a shared resource is a directory that already exists in the worktree, cp -r "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file" creates a nested directory (e.g. .../data/data) rather than refreshing .../data in place. That means experiments can keep reading stale files from the original path, so the intended per-worktree isolation for shared directories silently fails.
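A hedged sketch of an in-place refresh (function name illustrative): deleting the destination first guarantees `cp -R` copies the source as the destination rather than nesting it underneath:

```shell
# Refresh a shared directory inside a worktree without creating .../data/data.
sync_shared_dir() {
  local src="$1" dest="$2"
  mkdir -p "$(dirname "$dest")"
  rm -rf "$dest"        # drop any stale copy so cp cannot nest src under it
  cp -R "$src" "$dest"  # dest is absent, so src is copied AS dest
}
```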
```bash
fi

# Check 5: GPU references
GPU_FILES=$(grep -rl --include='*.py' --include='*.rs' --include='*.cpp' --include='*.cu' -E '(torch\.device|cuda|gpu|tensorflow|tf\.config)' . 2>/dev/null | head -5 || true)
```
Exclude dependency trees from GPU probe scans
The GPU probe recursively greps from . without excluding third-party directories like node_modules, so common dependency source files can trigger exclusive_resource blockers even when the project itself does not use GPUs. Because exclusive_resource forces mode: serial later in the script, this can incorrectly disable parallel experimentation and significantly slow optimization runs.
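A sketch of the probe with dependency trees excluded (the exclusion list and function name are assumptions; `--exclude-dir` is supported by both GNU and BSD grep):

```shell
# GPU probe that skips vendored/third-party trees so dependency source files
# cannot trip the exclusive_resource blocker and force serial mode.
probe_gpu_refs() {
  grep -rl \
    --exclude-dir=node_modules --exclude-dir=.git \
    --exclude-dir=vendor --exclude-dir=.worktrees \
    --include='*.py' --include='*.rs' --include='*.cpp' --include='*.cu' \
    -E '(torch\.device|cuda|gpu|tensorflow|tf\.config)' "${1:-.}" 2>/dev/null \
    | head -5 || true
}
```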
💡 Codex Review (reviewed commit: 91ef486910)
```bash
# Create worktree from the base branch
git worktree add -b "$branch_name" "$worktree_path" "$base_branch" --quiet 2>/dev/null || {
  # Branch might already exist (from a previous partial run)
  git branch -D "$branch_name" 2>/dev/null || true
```
Avoid force-deleting existing experiment branch
When git worktree add -b ... fails because the branch already exists (for example after an interrupted run that removed the worktree but left the branch), this fallback force-deletes that branch before recreating it from base_branch. That discards any unmerged commits or recovery context on the existing experiment branch, so rerunning create can silently destroy prior experiment results. Prefer reusing the existing branch or failing with an explicit cleanup prompt instead of git branch -D here.
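One non-destructive alternative, sketched (function name illustrative): attach to the surviving branch when it exists, and only create a new one from the base otherwise:

```shell
# Reuse a leftover experiment branch instead of force-deleting it, so commits
# from an interrupted run are preserved.
create_experiment_worktree() {
  local branch_name="$1" worktree_path="$2" base_branch="$3"
  if git show-ref --verify --quiet "refs/heads/$branch_name"; then
    # Branch survives from a partial run: attach to it, keeping its commits
    git worktree add --quiet "$worktree_path" "$branch_name"
  else
    git worktree add --quiet -b "$branch_name" "$worktree_path" "$base_branch"
  fi
}
```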
💡 Codex Review (reviewed commit: 7dda6f6f1b)
1. **Run measurement** in the experiment's worktree:

   ```bash
   bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>" <env_vars...>
   ```
Run experiment measurements in configured workdir
Phase 3 measures each experiment with bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>" ..., which ignores measurement.working_directory and always executes from the worktree root. Specs that rely on a non-root working directory (including both checked-in examples using working_directory: "tools/eval") will pass baseline setup but then fail or read wrong files during experiment evaluation, causing experiments to be mis-scored or all marked as errors.
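A minimal sketch of how the invocation could honor a configured working directory (helper name and calling convention are assumptions, not measure.sh's actual interface):

```shell
# Run the measurement command from <worktree>/<measurement.working_directory>,
# in a subshell so the caller's cwd is untouched.
measure_in_workdir() {
  local worktree_path="$1" workdir="$2"; shift 2
  ( cd "$worktree_path/${workdir:-.}" && "$@" )
}
```

For a spec with `working_directory: "tools/eval"`, the harness would then execute from `<worktree>/tools/eval` during both baseline and experiment runs.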
Capturing this at the brainstorm / plan. You know what this is... don't make me say it 😂
New /ce:optimize skill for metric-driven iterative optimization. Defines a measurable goal, builds measurement scaffolding first, then runs parallel experiments via worktrees or Codex that converge toward the best solution.

Key capabilities:
- Three-tier metrics: degenerate gates -> LLM-as-judge or hard metric -> diagnostics
- Parallel experiments via git worktrees (max 6) or Codex sandboxes
- Stratified sampling with user-defined rubrics for LLM-as-judge
- Parallelism blocker detection (ports, SQLite, GPU)
- Rolling context window + strategy digest for long runs
- Git-native history with all experiments preserved
- Integration with /ce:compound and /ce:review at wrap-up

Includes SKILL.md (4-phase workflow), 4 reference files (spec schema, experiment log schema, experiment prompt template, judge prompt template), and 3 scripts (measure.sh, parallel-probe.sh, experiment-worktree.sh).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The skill runs for hours but had no explicit write-immediately-per-experiment rule. Results were batched in memory and written after full batch evaluation, making them vulnerable to context compaction and session crashes.

Changes:
- Add Persistence Discipline section as a top-level skill principle
- Write each experiment result to disk IMMEDIATELY after measurement (step 3.3) instead of deferring to batch evaluation
- Enforce re-read-from-disk at every phase boundary and before every decision
- Per-experiment result.yaml crash-recovery markers in worktrees
- Append-only log during Phase 3 to prevent data loss on interrupted writes
- Resume logic explicitly reads all state from disk, not in-memory context
- Update experiment log schema header to document the write discipline

Follows Karpathy's autoresearch pattern: results.tsv is written after every single experiment, making the file the memory and the agent expendable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…erification

First-run testing revealed two critical failures:

1. The skill defaulted to type:hard for a qualitative clustering target, optimizing a proxy metric without ever checking cluster coherence. Phase 0.2 now actively detects qualitative targets, strongly recommends type:judge, and walks users through sampling strategy and rubric design.
2. Experiment results were dumped into the conversation but never written to disk. Added mandatory write-then-verify checkpoints (CP-0 through CP-5) at every phase boundary. The persistence discipline now states: "If you produce a results table without writing to disk first, you have a bug."

Also adds first-run lessons to the brainstorm doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review (reviewed commit: 705f124e1e)
```markdown
6. **If gates pass AND primary type is `hard`**:
   - Use the metric value directly from the measurement output

7. **IMMEDIATELY append to experiment log on disk (CP-3)** — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml` right now. The outcome may be preliminary (e.g., `gates_passed` but not yet compared to best) — that is fine. Update the outcome to `kept` or `reverted` in the evaluation step, but the raw metrics are on disk and safe from context compaction.
```
Keep experiment outcomes within declared enum
Phase 3.3 explicitly allows writing a preliminary outcome like gates_passed, but references/experiment-log-schema.yaml defines outcome as a closed enum that does not include that value. Because this field is load-bearing for resume/evaluation flow, persisting non-enum states can break schema validation and any logic that branches on valid terminal outcomes. Please either add a documented transitional state to the schema/state machine or require CP-3 writes to use only schema-valid outcomes.
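One way to reconcile the schema, sketched with an explicitly documented transitional state (the enum values beyond `gates_passed`, `kept`, and `reverted` are assumptions about the existing schema):

```yaml
outcome:
  enum:
    - gates_passed   # transitional: metrics written at CP-3, not yet compared to best
    - kept           # terminal
    - reverted       # terminal
    - gate_failed    # terminal (assumed existing value)
    - error          # terminal (assumed existing value)
```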
```yaml
type: integer
default: 4
```
Enforce max_concurrent lower bound
The schema accepts any integer for execution.max_concurrent (including 0 or negative), but Phase 3.1 computes batch_size = min(backlog_size, execution.max_concurrent). With a non-empty backlog and max_concurrent: 0, batch selection yields zero experiments indefinitely, so the loop makes no progress and can spin until a time-based/manual stop. Add a minimum constraint (>=1) and corresponding validation guard in Phase 0.2.
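A hedged sketch of the requested constraint, assuming the schema follows JSON-Schema-style keywords:

```yaml
max_concurrent:
  type: integer
  default: 4
  minimum: 1  # reject 0/negative values that would stall batch selection
```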
## Summary

Adds `/ce:optimize` — an iterative optimization loop skill inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.

The core idea: define a measurable goal, build measurement scaffolding, then run a long-running loop that tries many hypotheses in parallel, measures each, keeps improvements, and converges toward the best solution.
## The Problem

CE has knowledge-compounding and multi-agent review, but no skill for systematic metric-driven optimization — the kind where you need to try 50-100 variations of an approach, measure each, and build on successes. Examples:

These problems share a pattern: no single change gets you there; you need iterative experimentation with memory of what was tried.
## How It Works

### Four Phases
Phase 0 — Setup: Create or load an optimization spec (YAML). The skill actively detects whether the target is qualitative vs quantitative and guides toward the right metric type. For qualitative targets (clustering quality, search relevance), it recommends LLM-as-judge with stratified sampling over misleading proxy metrics.
Phase 1 — Measurement Scaffolding (hard gate): Build or validate the measurement harness, establish baseline, probe for parallelism blockers (port conflicts, shared SQLite DBs, GPU exclusivity). User must approve baseline before any experiments run.
Phase 2 — Hypothesis Generation: Analyze the codebase, generate 10-30 hypotheses, identify required dependencies, get bulk approval for new deps upfront.
Phase 3 — Optimization Loop: Run experiments in parallel batches (up to 6 worktrees or Codex sandboxes). Each experiment: implement hypothesis -> measure -> evaluate gates -> judge (if qualitative) -> keep or revert. Batches repeat until a stopping criterion is met.
Phase 4 — Wrap-Up: Summarize results, preserve the optimization branch, offer `/ce:review` on the cumulative diff and `/ce:compound` to capture the winning strategy.

### Three-Tier Metrics
Not everything worth optimizing has a clean scalar metric. The skill uses a three-tier evaluation architecture:
Degenerate gates — fast, cheap boolean checks that catch obviously broken solutions (all items in 1 cluster, 0% coverage, runtime explosion). If any gate fails, skip expensive evaluation entirely.
Primary metric — either a hard scalar (`type: hard` for build time, test coverage) or an LLM-as-judge quality score (`type: judge` for clustering coherence, search relevance). For judge mode, the skill uses stratified sampling with user-defined rubrics and scores aggregated from parallel Haiku judge calls.

Diagnostics — logged for understanding but never gated on (distribution stats, counts, timing).
### LLM-as-Judge for Qualitative Optimization
For problems like clustering quality, hard metrics alone mislead — "fewer singletons" doesn't mean "better clusters." The judge system:
### Disk-First Persistence
The skill runs for hours. Context windows compact, sessions crash, agents restart. The experiment log on disk is the single source of truth — not the conversation.
Six mandatory write-then-verify checkpoints (CP-0 through CP-5) at every phase boundary ensure no results are lost. Each checkpoint writes the file, reads it back, and confirms the content is present before proceeding. Per-experiment `result.yaml` markers in worktrees provide crash recovery for experiments measured but not yet logged.

### Parallel Execution
Experiments run in parallel by default (up to 6 git worktrees or Codex sandboxes). A Phase 1 parallelism probe detects blockers:
After each batch, file-disjoint runner-up experiments can be cherry-picked onto the winner for compound improvement.
## What's Included

- `SKILL.md`
- `references/optimize-spec-schema.yaml`
- `references/experiment-log-schema.yaml`
- `references/experiment-prompt-template.md`
- `references/judge-prompt-template.md`
- `scripts/measure.sh`
- `scripts/parallel-probe.sh`
- `scripts/experiment-worktree.sh`

## Lessons from First Run
Tested on a clustering optimization problem for ~90 minutes (16 experiments, 31.4% -> 72.1% multi-member coverage). Two critical issues were discovered and fixed:

1. **Judge mode not triggered** — The skill defaulted to `type: hard` for a qualitative target. Fixed by adding active qualitative/quantitative detection in Phase 0.2 with concrete guidance on when to use each type, a sampling strategy walkthrough, and rubric design help.
2. **No disk persistence** — Results existed only in the conversation as a table. Fixed by adding mandatory CP-0 through CP-5 checkpoints with write-then-verify discipline. The persistence section now explicitly states: "If you produce a results table without writing to disk first, you have a bug."
## Design Influences

- Karpathy's autoresearch: `results.tsv` persistence, immutable evaluator. We generalize to multi-file, multi-metric, parallel execution.

## Test Plan
- `bun run release:validate` passes (42 skills, 48 agents) ✅
- `bun test` passes (507 tests) ✅
- `type: hard` on a quantitative target (build time, test coverage)
- `type: judge` on a qualitative target (clustering, search relevance)
- `result.yaml` markers on resume

🤖 Generated with Claude Code