Give an AI agent a file, a metric, and this guide. Walk away. Come back to something better.
📖 Read the full guide · 🚀 How to use it · 💡 Examples · 🏆 Proof it works
This is a 3,000+ line guide that teaches AI coding agents (Claude Code, Cursor, Codex) how to autonomously optimize anything you can measure.
You have a file. You have a way to score it. You want it to be better. This guide makes that happen — automatically, repeatedly, without you sitting there.
The agent reads this guide, understands the loop, and runs experiments on its own: try an idea → measure → keep if better → revert if worse → try next idea → repeat forever.
You don't code. You don't review each change. You set it up, walk away, and come back to results.
Based on Andrej Karpathy's autoresearch pattern — enhanced through multiple rounds of deep research and real-world testing into a self-sufficient, domain-agnostic operating manual.
You can point this at anything with a number attached to it:
| Your situation | What gets optimized | The metric |
|---|---|---|
| "My API is slow" | Your backend code | Response time (ms) |
| "My LLM gives bad answers" | Your system prompt | Eval score |
| "My site loads slowly" | Your frontend code | Lighthouse score |
| "My algorithm is too slow" | Your algorithm implementation | Execution time |
| "My tests don't cover enough" | Your test file | Coverage % |
| "My Docker image is huge" | Your Dockerfile | Image size (MB) |
| "My SQL queries are slow" | Your queries / indexes | Query time (ms) |
| "My emails don't convert" | Your email template | Open/click rate |
| "My config isn't tuned" | Your config file | Throughput / latency |
| "My Rust/C/Go code is slow" | Your source file | Benchmark time (µs) |
If you can run a command and get a number, this guide works.
Ask yourself three questions:
- **What file do I want to improve?** → This is your **target file** (e.g., `src/solver.py`, `prompt.txt`, `Dockerfile`, `nginx.conf`)
- **How do I measure "better"?** → This is your **eval command** (e.g., `python benchmark.py`, `bash test.sh`, `curl -w "%{time_total}" ...`)
- **What number am I optimizing?** → This is your **metric** (e.g., `duration_ms`, `accuracy`, `score`, `size_kb`)
Write a script that runs your benchmark and prints the metric. This script is frozen — the agent must never modify it.
```bash
#!/bin/bash
# eval.sh — example for a Python performance optimization
python3 benchmark.py > run.log 2>&1
grep "^execution_time:" run.log
```

Next, write `program.md` — the instruction file the agent reads. Copy this template and fill in the blanks:
```markdown
# Autoresearch: [your project name]

## Setup
- **Target file**: `[path to the file the agent will modify]`
- **Eval command**: `bash eval.sh > run.log 2>&1`
- **Metric**: `grep "^[your_metric]:" run.log` (lower/higher is better)
- **Constraint**: Only modify the target file. Never touch eval.sh.

## The experiment loop
LOOP FOREVER:
1. Look at current git state
2. Modify the target file with an experimental idea
3. git commit -m "description of what you tried"
4. Run: `bash eval.sh > run.log 2>&1`
5. Read: `grep "^[your_metric]:" run.log`
6. If improved → keep the commit
7. If worse or crashed → `git reset --hard HEAD~1`
8. Log result to results.tsv
9. Repeat. Never stop.

## Strategy hints
- [Add domain-specific tips here]
- [What approaches might work]
- [What to avoid]
```

The full guide has a much more detailed universal template with strategy hints, search strategies, constraint writing, and more.
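The keep/revert decision in steps 5–7 of the loop can be sketched in shell. This is a minimal illustration only — the metric name `duration_ms`, the `best` value, and the contents of `run.log` are all stand-ins for a real eval run:

```bash
# Sketch of loop steps 5-7, assuming eval.sh wrote "duration_ms: <n>" to run.log.
# run.log is faked here so the parsing and comparison are runnable on their own.
best=1245                                    # metric of the best commit so far
echo "duration_ms: 612" > run.log            # a real value would come from eval.sh
new=$(grep "^duration_ms:" run.log | awk '{print $2}')
# awk handles the numeric comparison (lower is better in this example)
if awk -v n="$new" -v b="$best" 'BEGIN { exit !(n < b) }'; then
  echo "improved: keep the commit"           # step 6
else
  echo "worse: git reset --hard HEAD~1"      # step 7 (revert)
fi
```

The agent performs this comparison itself each iteration; the point is that "improved" must be decided by the number in `run.log`, never by intuition.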
```bash
git init
git add .
git commit -m "initial baseline"
```

Open Claude Code (or your preferred AI coding agent) in the project directory and say:
Read program.md and start the experiment loop. Do not stop until I interrupt you.
That's it. The agent starts running experiments autonomously.
Come back in an hour (or overnight). Check results.tsv to see what happened. The agent will have tried dozens or hundreds of ideas, kept the ones that worked, and reverted the ones that didn't.
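That check might look like this — a hypothetical three-column `results.tsv` (timestamp, idea, metric), sorted to surface the best run, assuming lower is better:

```bash
# Hypothetical results.tsv with columns: timestamp, idea, metric (lower is better)
printf '%s\t%s\t%s\n' \
  2025-01-01T00:00Z "baseline"             1245 \
  2025-01-01T00:10Z "vectorize inner loop" 612 \
  2025-01-01T00:20Z "cache lookups"        498 > results.tsv
# Best run so far: numeric sort on the metric column, take the first row
sort -t$'\t' -k3,3n results.tsv | head -1
```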
Your file: src/process.py (data processing pipeline)
Your eval: python benchmark.py → prints "duration_ms: 1245"
Your goal: Lower that number
program.md says: "Optimize src/process.py. Metric is duration_ms (lower is better).
Try vectorization, caching, algorithm changes, data structure swaps."
You tell the agent: "Read program.md and start experimenting."
Agent runs 50 experiments → duration_ms goes from 1245 to 312.
Your file: prompt.txt (system prompt for a customer support bot)
Your eval: python eval_prompt.py → prints "accuracy: 0.72"
Your goal: Raise that number
program.md says: "Optimize prompt.txt. Metric is accuracy (higher is better).
Try different instruction styles, add examples, restructure the persona."
You tell the agent: "Read program.md and start experimenting."
Agent runs 80 experiments → accuracy goes from 0.72 to 0.91.
Your file: src/solver.rs (sudoku solver)
Your eval: bash bench.sh → prints "usec_per_puzzle: 45.3"
Your goal: Lower that number
program.md says: "Optimize src/solver.rs. Metric is usec_per_puzzle (lower is better).
Try SIMD, different data layouts, cache optimization, algorithmic changes."
You tell the agent: "Read program.md and start experimenting."
Agent runs 312 experiments → usec_per_puzzle goes from 6,462,257 to 24.92.
That's a 65,275x speedup. (This actually happened — see proof below.)
Your file: Dockerfile
Your eval: docker build -t test . && docker image inspect test --format '{{.Size}}'
Your goal: Lower the image size
program.md says: "Optimize the Dockerfile. Metric is image size in bytes (lower is better).
Try multi-stage builds, smaller base images, layer optimization."
You tell the agent: "Read program.md and start experimenting."
Agent runs 30 experiments → image size goes from 1.2GB to 89MB.
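The kind of change the agent would try here can be sketched as a multi-stage build. This is an illustrative fragment, not a drop-in file — the Go toolchain, base images, and build command are assumptions; adjust for your stack:

```dockerfile
# Stage 1: build with the full toolchain (assumed Go app)
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Stage 2: ship only the binary on a minimal base image
FROM alpine:3.20
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```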
Your file: nginx.conf
Your eval: wrk -t4 -c100 -d10s http://localhost:8080/ → extract "Requests/sec"
Your goal: Raise requests per second
program.md says: "Optimize nginx.conf. Metric is requests_per_sec (higher is better).
Try worker_processes, keepalive, buffer sizes, gzip, caching headers."
You tell the agent: "Read program.md and start experimenting."
Agent runs 40 experiments → requests/sec goes from 12,000 to 34,000.
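The glue in this example is turning wrk's human-readable report into the one-line metric format the loop expects. A sketch of such an `eval.sh`, with `run.log` faked so the extraction is runnable without a live server:

```bash
# Hypothetical eval.sh for the nginx case. A real run would do:
#   wrk -t4 -c100 -d10s http://localhost:8080/ > run.log 2>&1
# run.log is faked here so the "Requests/sec" extraction can be shown on its own.
printf 'Requests/sec:  12034.55\n' > run.log
echo "requests_per_sec: $(awk '/^Requests\/sec/ {print $2}' run.log)"
```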
The full AUTORESEARCH_COMPLETE_GUIDE.md (3,114 lines) covers everything:
| Section | What you'll learn |
|---|---|
| The three primitives | The minimal setup: program.md + frozen eval + results.tsv |
| Architecture deep dive | How the loop works, why git is essential, state management |
| Writing program.md | The most important skill — how to write instructions that actually work |
| Universal template | Copy-paste template that works for any domain |
| Eval harness cookbook | Full working eval scripts for Python, APIs, LLMs, frontend, configs |
| Metric noise handling | Multiple runs, outlier rejection, confidence intervals for noisy metrics |
| Problem decomposition | How to pick the right metric and avoid Goodhart's Law |
| Pre-flight checklist | Everything to verify before your first experiment |
| Writing constraints | How to tell the agent what it can and can't change |
| Multi-file targets | When your optimization spans more than one file |
| Parallelization | Running multiple agents simultaneously with git worktrees |
| 5 ready-to-use examples | System prompts, API latency, frontend perf, test coverage, config tuning |
| Advanced search strategies | 4-phase protocol: grid scan → hill climb → random search → fine-tune |
| Troubleshooting | 15+ common failures and fixes |
| Cheat sheet | One-page reference for agents already in the loop |
| Hello world walkthrough | End-to-end from zero to first result |
| Agent setup instructions | Exact prompts for Claude Code, Cursor, and Codex |
We used this exact guide to build a sudoku solver that beats the world's #1 and #2 solvers:
| Metric | Result |
|---|---|
| Experiments | 312 autonomous |
| Speedup | 65,275x (6.4 seconds → 99 microseconds) |
| vs Tdoku (#1 since 2019) | 49% faster on main leaderboard |
| vs rust_sudoku (#2) | 82% faster on main leaderboard |
| Datasets won | 4 out of 6 (same hardware, same flags) |
| Human-written solver code | 0 lines |
| Duration | ~18 hours |
The agent independently discovered constraint propagation, hidden singles, SIMD vectorization, band-oriented data structures, and more — techniques the human sudoku community developed over decades. It rewrote its own architecture from scratch 4 times.
Full results: autoresearch-sudoku
Better program.md → Better agent behavior → Better results
A vague instruction like "make it faster" produces mediocre results. A specific instruction with strategy hints, constraints, evaluation details, and domain knowledge produces exceptional results. This guide teaches you how to write the latter.
You are no longer the coder. You are the constraint designer. Your job is to choose the right metric, write clear instructions, set appropriate boundaries, and let the agent do the rest.
| What | Why |
|---|---|
| An AI coding agent | Claude Code, Cursor, Codex — anything with shell access |
| A measurable metric | If it doesn't produce a number, you can't optimize it |
| Git | The agent uses git to checkpoint and revert experiments |
| ~30 minutes | To write your program.md and eval script |
```bash
# The entire pattern in 6 steps:
mkdir my-project && cd my-project
git init
# 1. Put your target file in place (the thing you want optimized)
# 2. Write eval.sh (frozen benchmark — agent never touches this)
# 3. Write program.md (instructions — what to optimize, how to measure)
git add . && git commit -m "initial"
# 4. Open your AI agent in this directory
# 5. Say: "Read program.md and start the experiment loop. Don't stop."
# 6. Walk away. Come back to results.tsv.
```

| Metric | Value |
|---|---|
| Lines | 3,114 |
| Words | ~15,000 |
| Main sections | 24 |
| Code blocks | 97 |
| Ready-to-use examples | 5 |
| Tables | 90+ |
| Resource | Link |
|---|---|
| The full guide | AUTORESEARCH_COMPLETE_GUIDE.md |
| Karpathy's original announcement | Tweet (March 7, 2026) |
| Karpathy's autoresearch repo | github.com/karpathy/autoresearch |
| Claude Code (recommended agent) | docs.anthropic.com |
| Proof of concept (sudoku solver) | autoresearch-sudoku |
MIT — use it for anything, anywhere, commercially or not.
Built by Ritik. Enhanced from Karpathy's autoresearch pattern through deep research and real-world testing.