
/ce:optimize - Auto-research loop for tuning system prompts / vector clustering / evaluating different code solution / etc#446

Open
huntharo wants to merge 6 commits into EveryInc:main from huntharo:ce-optimize

Conversation


huntharo commented Mar 29, 2026

Summary

Adds /ce:optimize — an iterative optimization loop skill inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.

The core idea: define a measurable goal, build measurement scaffolding, then run a long-running loop that tries many hypotheses in parallel, measures each, keeps improvements, and converges toward the best solution.

The Problem

CE has knowledge-compounding and multi-agent review, but no skill for systematic metric-driven optimization — the kind where you need to try 50-100 variations of an approach, measure each, and build on successes. Examples:

  • Improving vector clustering quality (20% coverage -> 95%)
  • Tuning system prompts for better output quality
  • Optimizing search relevance scoring
  • Reducing build times through incremental changes
  • Improving code generation accuracy

These problems share a pattern: no single change gets you there, you need iterative experimentation with memory of what was tried.

How It Works

Four Phases

Phase 0 — Setup: Create or load an optimization spec (YAML). The skill actively detects whether the target is qualitative vs quantitative and guides toward the right metric type. For qualitative targets (clustering quality, search relevance), it recommends LLM-as-judge with stratified sampling over misleading proxy metrics.
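For illustration, a minimal spec for a qualitative target might look like the following sketch; the field names loosely follow the schema described elsewhere in this PR (`type`, `metric.primary.target`, `measurement.working_directory`, `execution.max_concurrent`) and the values are hypothetical:

```yaml
# Hypothetical spec sketch; the authoritative schema is
# references/optimize-spec-schema.yaml.
goal: "Improve cluster coherence without collapsing into mega-clusters"
metric:
  primary:
    type: judge            # qualitative target, so LLM-as-judge over a proxy metric
    target: 4.5            # 1-5 rubric scale
measurement:
  command: "bun run eval"  # hypothetical harness entry point
  working_directory: "tools/eval"
execution:
  max_concurrent: 6
```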

Phase 1 — Measurement Scaffolding (hard gate): Build or validate the measurement harness, establish baseline, probe for parallelism blockers (port conflicts, shared SQLite DBs, GPU exclusivity). User must approve baseline before any experiments run.

Phase 2 — Hypothesis Generation: Analyze the codebase, generate 10-30 hypotheses, identify required dependencies, get bulk approval for new deps upfront.

Phase 3 — Optimization Loop: Run experiments in parallel batches (up to 6 worktrees or Codex sandboxes). Each experiment: implement hypothesis -> measure -> evaluate gates -> judge (if qualitative) -> keep or revert. Batches repeat until a stopping criterion is met.

Phase 4 — Wrap-Up: Summarize results, preserve the optimization branch, offer /ce:review on the cumulative diff and /ce:compound to capture the winning strategy.

Three-Tier Metrics

Not everything worth optimizing has a clean scalar metric. The skill uses a three-tier evaluation architecture:

  1. Degenerate gates — fast, cheap boolean checks that catch obviously broken solutions (all items in 1 cluster, 0% coverage, runtime explosion). If any gate fails, skip expensive evaluation entirely.

  2. Primary metric — either a hard scalar (type: hard for build time, test coverage) or an LLM-as-judge quality score (type: judge for clustering coherence, search relevance). For judge mode, the skill uses stratified sampling with user-defined rubrics and scores aggregated from parallel Haiku judge calls.

  3. Diagnostics — logged for understanding but never gated on (distribution stats, counts, timing).
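As a sketch of the tiering, assuming hypothetical metric names, the gate check runs first and short-circuits before any expensive evaluation:

```shell
# Sketch of tier-1 degenerate gates (function and metric names are hypothetical).
# Cheap boolean checks run first; the expensive tier-2 evaluation only happens
# if every gate passes. Tier-3 diagnostics would be logged but never gated on.
check_gates() {
  local cluster_count=$1 coverage_pct=$2
  if [ "$cluster_count" -le 1 ]; then
    echo "gate_failed:single_cluster"; return 1
  fi
  if [ "$coverage_pct" -eq 0 ]; then
    echo "gate_failed:zero_coverage"; return 1
  fi
  echo "gates_passed"   # caller now runs the hard metric or judge scoring
}

check_gates 12 72   # prints "gates_passed"
```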

LLM-as-Judge for Qualitative Optimization

For problems like clustering quality, hard metrics alone mislead — "fewer singletons" doesn't mean "better clusters." The judge system:

  • Samples outputs using stratified buckets (top by size, mid-range, small clusters)
  • Evaluates singletons separately for false-negative detection (items that should be clustered)
  • Uses a user-defined rubric (1-5 scale with concrete level descriptions)
  • Dispatches parallel judge sub-agents in batches
  • Aggregates into a primary score the loop optimizes against
  • Tracks judge cost per experiment and cumulatively

Disk-First Persistence

The skill runs for hours. Context windows compact, sessions crash, agents restart. The experiment log on disk is the single source of truth — not the conversation.

Six mandatory write-then-verify checkpoints (CP-0 through CP-5) at every phase boundary ensure no results are lost. Each checkpoint writes the file, reads it back, and confirms the content is present before proceeding. Per-experiment result.yaml markers in worktrees provide crash recovery for experiments measured but not yet logged.
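A write-then-verify checkpoint can be sketched roughly like this (the function name and log entry are illustrative):

```shell
# Sketch: append an entry, then read the file back and confirm the marker
# exists before proceeding. Failing the read-back means the checkpoint failed.
checkpoint_write() {
  local file=$1 marker=$2 entry=$3
  printf '%s\n' "$entry" >> "$file"          # append-only write
  if ! grep -q "$marker" "$file"; then       # verify by re-reading from disk
    echo "checkpoint failed: $marker missing from $file" >&2
    return 1
  fi
}

log=$(mktemp)
checkpoint_write "$log" "exp-007" "- id: exp-007
  outcome: kept
  metric: 0.72"
```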

Parallel Execution

Experiments run in parallel by default (up to 6 git worktrees or Codex sandboxes). A Phase 1 parallelism probe detects blockers:

  • Hardcoded ports -> parameterize via env vars
  • Shared SQLite databases -> copy per worktree
  • GPU exclusivity -> fall back to serial
  • Shared file locks -> warn user

After each batch, file-disjoint runner-up experiments can be cherry-picked onto the winner for compound improvement.
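The file-disjoint check can be sketched as a comparison of the two branches' changed-file lists; in practice each list would come from `git diff --name-only` against the experiment branch:

```shell
# Sketch: a runner-up is safe to cherry-pick onto the winner only if the two
# sets of changed files do not overlap. Inputs are newline-separated lists.
files_disjoint() {
  local a=$1 b=$2
  [ -z "$(comm -12 <(printf '%s\n' "$a" | sort) <(printf '%s\n' "$b" | sort))" ]
}

if files_disjoint $'src/cluster.py\nsrc/score.py' $'docs/notes.md'; then
  echo "disjoint: cherry-pick the runner-up onto the winner"
fi
```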

What's Included

| File | Purpose |
| --- | --- |
| SKILL.md | 4-phase workflow with checkpoint discipline |
| references/optimize-spec-schema.yaml | Full spec schema with validation rules |
| references/experiment-log-schema.yaml | Experiment log schema with outcome state machine |
| references/experiment-prompt-template.md | Prompt template for experiment worker agents |
| references/judge-prompt-template.md | Prompt templates for judge and singleton evaluation |
| scripts/measure.sh | Measurement harness runner with timeout and JSON extraction |
| scripts/parallel-probe.sh | Parallelism blocker detection |
| scripts/experiment-worktree.sh | Worktree lifecycle management (create/cleanup/count) |

Lessons from First Run

Tested on a clustering optimization problem for ~90 minutes (16 experiments, 31.4% -> 72.1% multi-member coverage). Two critical issues were discovered and fixed:

  1. Judge mode not triggered — The skill defaulted to type: hard for a qualitative target. Fixed by adding active qualitative/quantitative detection in Phase 0.2 with concrete guidance on when to use each type, sampling strategy walkthrough, and rubric design help.

  2. No disk persistence — Results existed only in the conversation as a table. Fixed by adding mandatory CP-0 through CP-5 checkpoints with write-then-verify discipline. The persistence section now explicitly states: "If you produce a results table without writing to disk first, you have a bug."

Design Influences

  • Karpathy's autoresearch: Linear keep/revert loop, results.tsv persistence, immutable evaluator. We generalize to multi-file, multi-metric, parallel execution.
  • AIDE/WecoAI: Tree search in solution space. We take the idea of file-disjoint runner-up merges rather than full tree search.
  • DSPy: User-defined metrics with automated optimization. We adopt the metric-first philosophy.

Test Plan

  • Verify bun run release:validate passes (42 skills, 48 agents) ✅
  • Verify bun test passes (507 tests) ✅
  • Test with type: hard on a quantitative target (build time, test coverage)
  • Test with type: judge on a qualitative target (clustering, search relevance)
  • Verify disk persistence survives context compaction on a 2+ hour run
  • Verify parallelism probe detects port conflicts and SQLite shared files
  • Verify crash recovery from result.yaml markers on resume

🤖 Generated with Claude Code

huntharo changed the title from "ce:optimize - brainstorm / plan capture" to "/ce:optimize - Auto-research loop for tuning system prompts / vector clustering / evaluating different code solution / etc" Mar 30, 2026

huntharo commented Mar 30, 2026

Results from a 2 hour experiment loop for optimizing a system prompt to extract the signal from issue/pr description noise.

The Project

https://github.com/pwrdrvr/ghcrawl

Artifacts from the Prompt Below

pwrdrvr/ghcrawl@5b531b6#diff-518d9117f5847a1ed4c5d6711dd9832f480f540bdf0e7c260c88d3492ef1cb40

The Prompt

/ce:optimize Let's back up... let's actually test and optimize that system prompt for summarizing issues first. I think we should try some variations.
First we should identify some issues/prs to test, perhaps 20 of each. We are looking for lengthy non-spam descriptions for issues/prs that appear in
reasonably sized clusters first. So let's find those and record them. Then let's capture the baseline summary for those (might need to build a
parameter on the CLI that allows summarizing one issue/pr with a specific system prompt override and have it just return the response on stdout). Then
let's judge the baseline with the LLM. Then let's tweak the system prompt maybe 10 times and see if a longer prompt, with some examples of what to
remove, and probably telling it to remove testing notes etc (since those do not determine issue similarity, they determine issue/pr quality) and see what
prompt is the best.

Results Summary - Table

[screenshot of the results table; text version below]

Results Summary - Table Text

⏺ All 11 experiments complete. Here are the final rankings:

  ┌──────┬──────────────────────┬───────┬──────────────┬──────────┬────────┐
  │ Rank │        Prompt        │ Score │ Boilerplate% │ Cluster% │ Tokens │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 1    │ v5-component-focused │ 4.97  │ 0%           │ 100%     │ 45,423 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 1    │ v2-signal-focused    │ 4.97  │ 0%           │ 100%     │ 47,983 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 3    │ v1-strip-boilerplate │ 4.95  │ 0%           │ 100%     │ 46,623 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 3    │ v3-examples          │ 4.95  │ 0%           │ 100%     │ 49,863 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 5    │ v7-two-pass          │ 4.93  │ 3%           │ 100%     │ 48,503 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 6    │ v6-negative-examples │ 4.78  │ 3%           │ 100%     │ 47,183 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 7    │ v9-issue-vs-pr       │ 4.75  │ 3%           │ 100%     │ 46,676 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 8    │ v8-embedding-aware   │ 4.58  │ 5%           │ 100%     │ 47,543 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 9    │ v4-terse             │ 4.47  │ 3%           │ 100%     │ 41,983 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 10   │ v10-minimal-rules    │ 4.10  │ 25%          │ 95%      │ 40,703 │
  ├──────┼──────────────────────┼───────┼──────────────┼──────────┼────────┤
  │ 11   │ v0-baseline          │ 2.65  │ 28%          │ 43%      │ 40,423 │
  └──────┴──────────────────────┴───────┴──────────────┴──────────┴────────┘

  Winner: v5-component-focused — tied with v2 on score (4.97) but uses ~2,500 fewer input tokens. Both achieve 0% boilerplate and 100% clustering
  correctness, a massive improvement over the baseline (2.65 score, 28% boilerplate, 43% clustering).

  Key findings:
  - Any improved prompt dramatically beats the baseline (~2x score improvement)
  - Explicit stripping instructions (v1/v2/v3/v5/v7) outperform conceptual explanations (v8 embedding-aware, v6 negative examples)
  - Shorter prompts (v4-terse, v10-minimal) sacrifice quality — the model needs enough guidance
  - v5's component-focused format ("[Component]: [issue]") provides a good structural constraint

  Now let me save the experiment log to disk and update the system prompt in the codebase.

Output - The new prompt

[screenshot of the new system prompt]

Final - 3 hours 12 minutes

[screenshot of the final run summary]

huntharo marked this pull request as ready for review March 30, 2026 18:54

chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9c18a7147

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

```bash
# Run the measurement command with timeout
# timeout returns 124 if the command times out
# We pass stdout and stderr through directly
timeout "$TIMEOUT" bash -c "$COMMAND"
```

**P1**: Add a portable timeout fallback

This hard-codes GNU timeout, which is not available in common environments like default macOS shells; in those setups the measurement runner exits with code 127 before executing the harness, so /ce:optimize cannot establish baselines or score experiments at all. Please detect and support alternatives (for example gtimeout/Python-based timeout) instead of assuming timeout exists.
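One possible shape for the fallback (untested sketch; `gtimeout` is the Homebrew coreutils name, and the perl alarm is a last resort on stock macOS):

```shell
# Sketch: resolve a usable timeout implementation instead of assuming GNU
# timeout exists on PATH.
run_with_timeout() {
  local secs=$1; shift
  if command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  elif command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$secs" "$@"          # macOS with Homebrew coreutils installed
  else
    # last resort: SIGALRM via perl, available on default macOS
    perl -e 'alarm shift; exec @ARGV' "$secs" "$@"
  fi
}

run_with_timeout 5 echo "harness ok"
```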


```bash
fi

# Output JSON result
python3 -c "
```

**P2**: Gracefully handle missing python3 in probe output

The probe script suppresses some Python failures earlier, but this final unconditional python3 call hard-fails when Python is not installed, aborting Phase 1 parallel-readiness checks instead of returning advisory JSON. In minimal Node/Bun environments this makes the optimization workflow fail before experiments begin; add a no-Python fallback or explicit preflight error handling.


Comment on lines +32 to +34
```bash
if ! grep -q "^\.worktrees$" "$GIT_ROOT/.gitignore" 2>/dev/null; then
  echo ".worktrees" >> "$GIT_ROOT/.gitignore"
fi
```

**P2**: Avoid mutating tracked .gitignore during worktree create

Creating a worktree unconditionally appends .worktrees to the repository’s tracked .gitignore when absent, which introduces an unrelated dirty change on first run. That side effect can pollute optimization branches or trip workflows expecting only hypothesis-related file edits; use .git/info/exclude or require explicit user opt-in instead of editing tracked files automatically.
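A sketch of the repo-local alternative (helper name hypothetical); `.git/info/exclude` behaves like `.gitignore` but is never tracked, so it cannot dirty the branch:

```shell
# Sketch: record the ignore pattern in the untracked per-repo exclude file
# instead of mutating the tracked .gitignore.
add_local_exclude() {
  local git_root=$1 pattern=$2
  local exclude_file="$git_root/.git/info/exclude"
  mkdir -p "$git_root/.git/info"
  if ! grep -qx "$pattern" "$exclude_file" 2>/dev/null; then
    printf '%s\n' "$pattern" >> "$exclude_file"
  fi
}

repo=$(mktemp -d)                  # stand-in for a real repo root
add_local_exclude "$repo" ".worktrees"
```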



chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efbc0919bf


```markdown
### 3.6 Check Stopping Criteria

Stop the loop if ANY of these are true:
- **Target reached**: primary metric meets or exceeds `stopping.target` (if set in spec)
```

**P2**: Use metric.primary.target for target stop checks

The stopping logic references stopping.target, but the spec schema defines the numeric target under metric.primary.target (and stopping only has booleans/limits), so a run with a configured target can miss the intended stop condition and continue until other limits trigger. This mismatch will cause unnecessary extra experiments and cost for users relying on target-based termination.


Comment on lines +95 to +99
```bash
elif [[ -d "$GIT_ROOT/$shared_file" ]]; then
  local dir
  dir=$(dirname "$worktree_path/$shared_file")
  mkdir -p "$dir"
  cp -r "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file"
```

**P2**: Copy shared directories into place instead of nesting them

When a shared resource is a directory that already exists in the worktree, cp -r "$GIT_ROOT/$shared_file" "$worktree_path/$shared_file" creates a nested directory (e.g. .../data/data) rather than refreshing .../data in place. That means experiments can keep reading stale files from the original path, so the intended per-worktree isolation for shared directories silently fails.
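A sketch of the fix (helper name hypothetical): remove the stale destination first so `cp -r` recreates it rather than nesting a copy inside it:

```shell
# Sketch: refresh a shared resource in the worktree in place. When dest
# already exists as a directory, plain `cp -r src dest` nests src inside it.
copy_shared() {
  local src=$1 dest=$2
  mkdir -p "$(dirname "$dest")"
  if [ -d "$src" ]; then
    rm -rf "$dest"                # drop the stale copy so paths stay stable
    cp -r "$src" "$dest"
  else
    cp "$src" "$dest"
  fi
}
```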


```bash
fi

# Check 5: GPU references
GPU_FILES=$(grep -rl --include='*.py' --include='*.rs' --include='*.cpp' --include='*.cu' -E '(torch\.device|cuda|gpu|tensorflow|tf\.config)' . 2>/dev/null | head -5 || true)
```

**P2**: Exclude dependency trees from GPU probe scans

The GPU probe recursively greps from . without excluding third-party directories like node_modules, so common dependency source files can trigger exclusive_resource blockers even when the project itself does not use GPUs. Because exclusive_resource forces mode: serial later in the script, this can incorrectly disable parallel experimentation and significantly slow optimization runs.
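A sketch of the same scan with dependency trees excluded (the extra `--exclude-dir` list is illustrative, not exhaustive):

```shell
# Sketch: wrap the GPU probe so third-party code under node_modules, vendor,
# etc. cannot trigger a false exclusive_resource blocker.
scan_gpu_refs() {
  grep -rl \
    --exclude-dir=node_modules --exclude-dir=.git --exclude-dir=vendor \
    --exclude-dir=target --exclude-dir=.venv \
    --include='*.py' --include='*.rs' --include='*.cpp' --include='*.cu' \
    -E '(torch\.device|cuda|gpu|tensorflow|tf\.config)' "$1" 2>/dev/null \
    | head -5 || true
}
```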



chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91ef486910


```bash
# Create worktree from the base branch
git worktree add -b "$branch_name" "$worktree_path" "$base_branch" --quiet 2>/dev/null || {
  # Branch might already exist (from a previous partial run)
  git branch -D "$branch_name" 2>/dev/null || true
```

**P1**: Avoid force-deleting existing experiment branch

When git worktree add -b ... fails because the branch already exists (for example after an interrupted run that removed the worktree but left the branch), this fallback force-deletes that branch before recreating it from base_branch. That discards any unmerged commits or recovery context on the existing experiment branch, so rerunning create can silently destroy prior experiment results. Prefer reusing the existing branch or failing with an explicit cleanup prompt instead of git branch -D here.



chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7dda6f6f1b



1. **Run measurement** in the experiment's worktree:
```bash
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>" <env_vars...>
```

**P1**: Run experiment measurements in configured workdir

Phase 3 measures each experiment with bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>" ..., which ignores measurement.working_directory and always executes from the worktree root. Specs that rely on a non-root working directory (including both checked-in examples using working_directory: "tools/eval") will pass baseline setup but then fail or read wrong files during experiment evaluation, causing experiments to be mis-scored or all marked as errors.


huntharo and others added 6 commits April 2, 2026 12:34
Capturing this at the brainstorm / plan

You know what this is... don't make me say it 😂
New /ce:optimize skill for metric-driven iterative optimization. Defines
a measurable goal, builds measurement scaffolding first, then runs
parallel experiments via worktrees or Codex that converge toward the
best solution.

Key capabilities:
- Three-tier metrics: degenerate gates -> LLM-as-judge or hard metric -> diagnostics
- Parallel experiments via git worktrees (max 6) or Codex sandboxes
- Stratified sampling with user-defined rubrics for LLM-as-judge
- Parallelism blocker detection (ports, SQLite, GPU)
- Rolling context window + strategy digest for long runs
- Git-native history with all experiments preserved
- Integration with /ce:compound and /ce:review at wrap-up

Includes SKILL.md (4-phase workflow), 4 reference files (spec schema,
experiment log schema, experiment prompt template, judge prompt template),
and 3 scripts (measure.sh, parallel-probe.sh, experiment-worktree.sh).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The skill runs for hours but had no explicit write-immediately-per-experiment
rule. Results were batched in memory and written after full batch evaluation,
making them vulnerable to context compaction and session crashes.

Changes:
- Add Persistence Discipline section as a top-level skill principle
- Write each experiment result to disk IMMEDIATELY after measurement (step 3.3)
  instead of deferring to batch evaluation
- Enforce re-read-from-disk at every phase boundary and before every decision
- Per-experiment result.yaml crash-recovery markers in worktrees
- Append-only log during Phase 3 to prevent data loss on interrupted writes
- Resume logic explicitly reads all state from disk, not in-memory context
- Update experiment log schema header to document the write discipline

Follows Karpathy's autoresearch pattern: results.tsv is written after
every single experiment, making the file the memory and the agent expendable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…erification

First-run testing revealed two critical failures:

1. The skill defaulted to type:hard for a qualitative clustering target,
   optimizing a proxy metric without ever checking cluster coherence.
   Phase 0.2 now actively detects qualitative targets, strongly recommends
   type:judge, and walks users through sampling strategy and rubric design.

2. Experiment results were dumped into the conversation but never written
   to disk. Added mandatory write-then-verify checkpoints (CP-0 through
   CP-5) at every phase boundary. The persistence discipline now states:
   "If you produce a results table without writing to disk first, you
   have a bug."

Also adds first-run lessons to the brainstorm doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 705f124e1e


6. **If gates pass AND primary type is `hard`**:
- Use the metric value directly from the measurement output

7. **IMMEDIATELY append to experiment log on disk (CP-3)** — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml` right now. The outcome may be preliminary (e.g., `gates_passed` but not yet compared to best) — that is fine. Update the outcome to `kept` or `reverted` in the evaluation step, but the raw metrics are on disk and safe from context compaction.

**P1**: Keep experiment outcomes within declared enum

Phase 3.3 explicitly allows writing a preliminary outcome like gates_passed, but references/experiment-log-schema.yaml defines outcome as a closed enum that does not include that value. Because this field is load-bearing for resume/evaluation flow, persisting non-enum states can break schema validation and any logic that branches on valid terminal outcomes. Please either add a documented transitional state to the schema/state machine or require CP-3 writes to use only schema-valid outcomes.


Comment on lines +278 to +279
```yaml
type: integer
default: 4
```

**P2**: Enforce max_concurrent lower bound

The schema accepts any integer for execution.max_concurrent (including 0 or negative), but Phase 3.1 computes batch_size = min(backlog_size, execution.max_concurrent). With a non-empty backlog and max_concurrent: 0, batch selection yields zero experiments indefinitely, so the loop makes no progress and can spin until a time-based/manual stop. Add a minimum constraint (>=1) and corresponding validation guard in Phase 0.2.
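One way the schema could encode the bound (sketch; assumes the schema supports a JSON-Schema-style `minimum` keyword):

```yaml
max_concurrent:
  type: integer
  default: 4
  minimum: 1   # a zero or negative value would make batch_size 0 and stall the loop
```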

