Give an AI agent a file, a metric, and this guide. Walk away. Come back to something better.
📖 Read the full guide · 🚀 How to use it · 💡 Examples · 🏆 Proof it works
This is a 3,000+ line guide that teaches AI coding agents (Claude Code, Cursor, Codex) how to autonomously optimize anything you can measure.
You have a file. You have a way to score it. You want it to be better. This guide makes that happen — automatically, repeatedly, without you sitting there.
The agent reads this guide, understands the loop, and runs experiments on its own: try an idea → measure → keep if better → revert if worse → try next idea → repeat forever.
You don't code. You don't review each change. You set it up, walk away, and come back to results.
Based on Andrej Karpathy's autoresearch pattern — enhanced through multiple rounds of deep research and real-world testing into a self-sufficient, domain-agnostic operating manual.
You can point this at anything with a number attached to it:
| Your situation | What gets optimized | The metric |
|---|---|---|
| "My API is slow" | Your backend code | Response time (ms) |
| "My LLM gives bad answers" | Your system prompt | Eval score |
| "My site loads slowly" | Your frontend code | Lighthouse score |
| "My algorithm is too slow" | Your algorithm implementation | Execution time |
| "My tests don't cover enough" | Your test file | Coverage % |
| "My Docker image is huge" | Your Dockerfile | Image size (MB) |
| "My SQL queries are slow" | Your queries / indexes | Query time (ms) |
| "My emails don't convert" | Your email template | Open/click rate |
| "My config isn't tuned" | Your config file | Throughput / latency |
| "My Rust/C/Go code is slow" | Your source file | Benchmark time (µs) |
If you can run a command and get a number, this guide works.
Ask yourself three questions:
- **What file do I want to improve?** → This is your **target file** (e.g., `src/solver.py`, `prompt.txt`, `Dockerfile`, `nginx.conf`)
- **How do I measure "better"?** → This is your **eval command** (e.g., `python benchmark.py`, `bash test.sh`, `curl -w "%{time_total}" ...`)
- **What number am I optimizing?** → This is your **metric** (e.g., `duration_ms`, `accuracy`, `score`, `size_kb`)
Write a script that runs your benchmark and prints the metric. This script is frozen — the agent must never modify it.
```bash
#!/bin/bash
# eval.sh — example for a Python performance optimization
python3 benchmark.py > run.log 2>&1
grep "^execution_time:" run.log
```

Next, write `program.md` — the instruction file the agent reads. Copy this template and fill in the blanks:
```markdown
# Autoresearch: [your project name]

## Setup
- **Target file**: `[path to the file the agent will modify]`
- **Eval command**: `bash eval.sh > run.log 2>&1`
- **Metric**: `grep "^[your_metric]:" run.log` (lower/higher is better)
- **Constraint**: Only modify the target file. Never touch eval.sh.

## The experiment loop
LOOP FOREVER:
1. Look at current git state
2. Modify the target file with an experimental idea
3. git commit -m "description of what you tried"
4. Run: `bash eval.sh > run.log 2>&1`
5. Read: `grep "^[your_metric]:" run.log`
6. If improved → keep the commit
7. If worse or crashed → `git reset --hard HEAD~1`
8. Log result to results.tsv
9. Repeat. Never stop.

## Strategy hints
- [Add domain-specific tips here]
- [What approaches might work]
- [What to avoid]
```

The full guide has a much more detailed universal template with strategy hints, search strategies, constraint writing, and more.
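The keep/revert decision in steps 5–7 of the loop can be sketched in shell. This is a minimal illustration only — the metric name `duration_ms`, the `best` value, and the contents of `run.log` are all stand-ins for a real eval run:

```bash
# Sketch of loop steps 5-7, assuming eval.sh wrote "duration_ms: <n>" to run.log.
# run.log is faked here so the parsing and comparison are runnable on their own.
best=1245                                    # metric of the best commit so far
echo "duration_ms: 612" > run.log            # a real value would come from eval.sh
new=$(grep "^duration_ms:" run.log | awk '{print $2}')
# awk handles the numeric comparison (lower is better in this example)
if awk -v n="$new" -v b="$best" 'BEGIN { exit !(n < b) }'; then
  echo "improved: keep the commit"           # step 6
else
  echo "worse: git reset --hard HEAD~1"      # step 7 (revert)
fi
```

The agent performs this comparison itself each iteration; the point is that "improved" must be decided by the number in `run.log`, never by intuition.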
```bash
git init
git add .
git commit -m "initial baseline"
```

Open Claude Code (or your preferred AI coding agent) in the project directory and say:
Read program.md and start the experiment loop. Do not stop until I interrupt you.
That's it. The agent starts running experiments autonomously.
Come back in an hour (or overnight). Check results.tsv to see what happened. The agent will have tried dozens or hundreds of ideas, kept the ones that worked, and reverted the ones that didn't.
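That check might look like this — a hypothetical three-column `results.tsv` (timestamp, idea, metric), sorted to surface the best run, assuming lower is better:

```bash
# Hypothetical results.tsv with columns: timestamp, idea, metric (lower is better)
printf '%s\t%s\t%s\n' \
  2025-01-01T00:00Z "baseline"             1245 \
  2025-01-01T00:10Z "vectorize inner loop" 612 \
  2025-01-01T00:20Z "cache lookups"        498 > results.tsv
# Best run so far: numeric sort on the metric column, take the first row
sort -t$'\t' -k3,3n results.tsv | head -1
```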
Your file: src/process.py (data processing pipeline)
Your eval: python benchmark.py → prints "duration_ms: 1245"
Your goal: Lower that number
program.md says: "Optimize src/process.py. Metric is duration_ms (lower is better).
Try vectorization, caching, algorithm changes, data structure swaps."
You tell the agent: "Read program.md and start experimenting."
Agent runs 50 experiments → duration_ms goes from 1245 to 312.
Your file: prompt.txt (system prompt for a customer support bot)
Your eval: python eval_prompt.py → prints "accuracy: 0.72"
Your goal: Raise that number
program.md says: "Optimize prompt.txt. Metric is accuracy (higher is better).
Try different instruction styles, add examples, restructure the persona."
You tell the agent: "Read program.md and start experimenting."
Agent runs 80 experiments → accuracy goes from 0.72 to 0.91.
Your file: src/solver.rs (sudoku solver)
Your eval: bash bench.sh → prints "usec_per_puzzle: 45.3"
Your goal: Lower that number
program.md says: "Optimize src/solver.rs. Metric is usec_per_puzzle (lower is better).
Try SIMD, different data layouts, cache optimization, algorithmic changes."
You tell the agent: "Read program.md and start experimenting."
Agent runs 312 experiments → usec_per_puzzle goes from 6,462,257 to 24.92.
That's a 65,275x speedup. (This actually happened — see proof below.)
Your file: Dockerfile
Your eval: docker build -t test . && docker image inspect test --format '{{.Size}}'
Your goal: Lower the image size
program.md says: "Optimize the Dockerfile. Metric is image size in bytes (lower is better).
Try multi-stage builds, smaller base images, layer optimization."
You tell the agent: "Read program.md and start experimenting."
Agent runs 30 experiments → image size goes from 1.2GB to 89MB.
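The kind of change the agent would try here can be sketched as a multi-stage build. This is an illustrative fragment, not a drop-in file — the Go toolchain, base images, and build command are assumptions; adjust for your stack:

```dockerfile
# Stage 1: build with the full toolchain (assumed Go app)
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Stage 2: ship only the binary on a minimal base image
FROM alpine:3.20
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```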
Your file: nginx.conf
Your eval: wrk -t4 -c100 -d10s http://localhost:8080/ → extract "Requests/sec"
Your goal: Raise requests per second
program.md says: "Optimize nginx.conf. Metric is requests_per_sec (higher is better).
Try worker_processes, keepalive, buffer sizes, gzip, caching headers."
You tell the agent: "Read program.md and start experimenting."
Agent runs 40 experiments → requests/sec goes from 12,000 to 34,000.
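The glue in this example is turning wrk's human-readable report into the one-line metric format the loop expects. A sketch of such an `eval.sh`, with `run.log` faked so the extraction is runnable without a live server:

```bash
# Hypothetical eval.sh for the nginx case. A real run would do:
#   wrk -t4 -c100 -d10s http://localhost:8080/ > run.log 2>&1
# run.log is faked here so the "Requests/sec" extraction can be shown on its own.
printf 'Requests/sec:  12034.55\n' > run.log
echo "requests_per_sec: $(awk '/^Requests\/sec/ {print $2}' run.log)"
```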
The full AUTORESEARCH_COMPLETE_GUIDE.md (3,114 lines) covers everything:
| Section | What you'll learn |
|---|---|
| The three primitives | The minimal setup: program.md + frozen eval + results.tsv |
| Architecture deep dive | How the loop works, why git is essential, state management |
| Writing program.md | The most important skill — how to write instructions that actually work |
| Universal template | Copy-paste template that works for any domain |
| Eval harness cookbook | Full working eval scripts for Python, APIs, LLMs, frontend, configs |
| Metric noise handling | Multiple runs, outlier rejection, confidence intervals for noisy metrics |
| Problem decomposition | How to pick the right metric and avoid Goodhart's Law |
| Pre-flight checklist | Everything to verify before your first experiment |
| Writing constraints | How to tell the agent what it can and can't change |
| Multi-file targets | When your optimization spans more than one file |
| Parallelization | Running multiple agents simultaneously with git worktrees |
| 5 ready-to-use examples | System prompts, API latency, frontend perf, test coverage, config tuning |
| Advanced search strategies | 4-phase protocol: grid scan → hill climb → random search → fine-tune |
| Troubleshooting | 15+ common failures and fixes |
| Cheat sheet | One-page reference for agents already in the loop |
| Hello world walkthrough | End-to-end from zero to first result |
| Agent setup instructions | Exact prompts for Claude Code, Cursor, and Codex |
We used this exact guide to build a sudoku solver that beats the world's #1 and #2 solvers:
| Metric | Result |
|---|---|
| Experiments | 312 autonomous |
| Speedup | 65,275x (6.4 seconds → 99 microseconds) |
| vs Tdoku (#1 since 2019) | 49% faster on main leaderboard |
| vs rust_sudoku (#2) | 82% faster on main leaderboard |
| Datasets won | 4 out of 6 (same hardware, same flags) |
| Human-written solver code | 0 lines |
| Duration | ~18 hours |
The agent independently discovered constraint propagation, hidden singles, SIMD vectorization, band-oriented data structures, and more — techniques the human sudoku community developed over decades. It rewrote its own architecture from scratch 4 times.
Full results: autoresearch-sudoku
Better program.md → Better agent behavior → Better results
A vague instruction like "make it faster" produces mediocre results. A specific instruction with strategy hints, constraints, evaluation details, and domain knowledge produces exceptional results. This guide teaches you how to write the latter.
You are no longer the coder. You are the constraint designer. Your job is to choose the right metric, write clear instructions, set appropriate boundaries, and let the agent do the rest.
| What | Why |
|---|---|
| An AI coding agent | Claude Code, Cursor, Codex — anything with shell access |
| A measurable metric | If it doesn't produce a number, you can't optimize it |
| Git | The agent uses git to checkpoint and revert experiments |
| ~30 minutes | To write your program.md and eval script |
```bash
# The entire pattern in 6 steps:
mkdir my-project && cd my-project
git init
# 1. Put your target file in place (the thing you want optimized)
# 2. Write eval.sh (frozen benchmark — agent never touches this)
# 3. Write program.md (instructions — what to optimize, how to measure)
git add . && git commit -m "initial"
# 4. Open your AI agent in this directory
# 5. Say: "Read program.md and start the experiment loop. Don't stop."
# 6. Walk away. Come back to results.tsv.
```

| Metric | Value |
|---|---|
| Lines | 3,114 |
| Words | ~15,000 |
| Main sections | 24 |
| Code blocks | 97 |
| Ready-to-use examples | 5 |
| Tables | 90+ |
| Resource | Link |
|---|---|
| The full guide | AUTORESEARCH_COMPLETE_GUIDE.md |
| Karpathy's original announcement | Tweet (March 7, 2026) |
| Karpathy's autoresearch repo | github.com/karpathy/autoresearch |
| Claude Code (recommended agent) | docs.anthropic.com |
| Proof of concept (sudoku solver) | autoresearch-sudoku |
MIT — use it for anything, anywhere, commercially or not.
Built by Ritik. Enhanced from Karpathy's autoresearch pattern through deep research and real-world testing.