Rkcr7/autoresearch-guide

🧠 The Complete Autoresearch Guide

Give an AI agent a file, a metric, and this guide. Walk away. Come back to something better.

📖 Read the full guide  ·  🚀 How to use it  ·  💡 Examples  ·  🏆 Proof it works


What is this?

This is a 3,000+ line guide that teaches AI coding agents (Claude Code, Cursor, Codex) how to autonomously optimize anything you can measure.

You have a file. You have a way to score it. You want it to be better. This guide makes that happen — automatically, repeatedly, without you sitting there.

The agent reads this guide, understands the loop, and runs experiments on its own: try an idea → measure → keep if better → revert if worse → try next idea → repeat forever.

You don't code. You don't review each change. You set it up, walk away, and come back to results.

Based on Andrej Karpathy's autoresearch pattern — enhanced through multiple rounds of deep research and real-world testing into a self-sufficient, domain-agnostic operating manual.


What can it optimize?

Anything with a number attached to it:

| Your situation | What gets optimized | The metric |
| --- | --- | --- |
| "My API is slow" | Your backend code | Response time (ms) |
| "My LLM gives bad answers" | Your system prompt | Eval score |
| "My site loads slowly" | Your frontend code | Lighthouse score |
| "My algorithm is too slow" | Your algorithm implementation | Execution time |
| "My tests don't cover enough" | Your test file | Coverage % |
| "My Docker image is huge" | Your Dockerfile | Image size (MB) |
| "My SQL queries are slow" | Your queries / indexes | Query time (ms) |
| "My emails don't convert" | Your email template | Open/click rate |
| "My config isn't tuned" | Your config file | Throughput / latency |
| "My Rust/C/Go code is slow" | Your source file | Benchmark time (µs) |

If you can run a command and get a number, this guide works.


🚀 How to use it

Step 1: Identify your target

Ask yourself three questions:

  • What file do I want to improve? → This is your target file (e.g., src/solver.py, prompt.txt, Dockerfile, nginx.conf)
  • How do I measure "better"? → This is your eval command (e.g., python benchmark.py, bash test.sh, curl -w "%{time_total}" ...)
  • What number am I optimizing? → This is your metric (e.g., duration_ms, accuracy, score, size_kb)

Step 2: Create your eval script

Write a script that runs your benchmark and prints the metric. This script is frozen — the agent must never modify it.

```bash
#!/bin/bash
# eval.sh — example for a Python performance optimization
python3 benchmark.py > run.log 2>&1
grep "^execution_time:" run.log
```
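Timing-based metrics are noisy, and the full guide has a whole section on handling that. A minimal sketch of one common tactic: run the benchmark several times and report the median. (`benchmark.py` and `execution_time` are the example's names; `median_of` is a hypothetical helper, not part of any tool.)

```shell
#!/bin/bash
# eval_median.sh — sketch of a noise-tolerant eval script: run the
# benchmark five times and report the median execution_time value.

# median_of: read one number per line on stdin, print the median.
median_of() {
  sort -n | awk '{ a[NR] = $1 } END { print a[int((NR + 1) / 2)] }'
}

for _ in 1 2 3 4 5; do
  # Keep only the digits and decimal point from the metric line.
  python3 benchmark.py 2>/dev/null | grep '^execution_time:' | tr -dc '0-9.\n'
done | median_of | sed 's/^/execution_time: /'
```

The median is more robust than the mean here: one slow outlier run (a GC pause, a cold cache) can't drag the reported metric.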

Step 3: Create your program.md

This is the instruction file the agent reads. Copy this template and fill in the blanks:

```markdown
# Autoresearch: [your project name]

## Setup
- **Target file**: `[path to the file the agent will modify]`
- **Eval command**: `bash eval.sh > run.log 2>&1`
- **Metric**: `grep "^[your_metric]:" run.log` (lower/higher is better)
- **Constraint**: Only modify the target file. Never touch eval.sh.

## The experiment loop
LOOP FOREVER:
1. Look at current git state
2. Modify the target file with an experimental idea
3. git commit -m "description of what you tried"
4. Run: `bash eval.sh > run.log 2>&1`
5. Read: `grep "^[your_metric]:" run.log`
6. If improved → keep the commit
7. If worse or crashed → `git reset --hard HEAD~1`
8. Log result to results.tsv
9. Repeat. Never stop.

## Strategy hints
- [Add domain-specific tips here]
- [What approaches might work]
- [What to avoid]
```

The full guide has a much more detailed universal template with strategy hints, search strategies, constraint writing, and more.
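Steps 6 and 7 of the loop boil down to a numeric comparison plus a git revert. A minimal shell sketch of the comparison half (`is_better` is a hypothetical helper; flip the comparison for higher-is-better metrics):

```shell
# is_better NEW BEST — exit 0 when NEW beats BEST (lower is better here).
# awk (with "+ 0" to force numeric context) is used so decimal metrics
# like "312.5" compare correctly; bash's built-in -lt is integer-only.
is_better() {
  awk -v new="$1" -v best="$2" 'BEGIN { exit !(new + 0 < best + 0) }'
}

# Inside the loop the agent would then do roughly:
#   if is_better "$new" "$best"; then best=$new    # keep the commit
#   else git reset --hard HEAD~1                   # revert the experiment
#   fi
```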

Step 4: Set up git

```bash
git init
git add .
git commit -m "initial baseline"
```

Step 5: Give it to your AI agent

Open Claude Code (or your preferred AI coding agent) in the project directory and say:

Read program.md and start the experiment loop. Do not stop until I interrupt you.

That's it. The agent starts running experiments autonomously.

Step 6: Walk away

Come back in an hour (or overnight). Check results.tsv to see what happened. The agent will have tried dozens or hundreds of ideas, kept the ones that worked, and reverted the ones that didn't.


💡 Examples

Example 1: Make my Python code faster

```
Your file:      src/process.py (data processing pipeline)
Your eval:      python benchmark.py → prints "duration_ms: 1245"
Your goal:      Lower that number

program.md says: "Optimize src/process.py. Metric is duration_ms (lower is better).
Try vectorization, caching, algorithm changes, data structure swaps."

You tell the agent: "Read program.md and start experimenting."
Agent runs 50 experiments → duration_ms goes from 1245 to 312.
```

Example 2: Improve my LLM system prompt

```
Your file:      prompt.txt (system prompt for a customer support bot)
Your eval:      python eval_prompt.py → prints "accuracy: 0.72"
Your goal:      Raise that number

program.md says: "Optimize prompt.txt. Metric is accuracy (higher is better).
Try different instruction styles, add examples, restructure the persona."

You tell the agent: "Read program.md and start experimenting."
Agent runs 80 experiments → accuracy goes from 0.72 to 0.91.
```

Example 3: Optimize my Rust algorithm

```
Your file:      src/solver.rs (sudoku solver)
Your eval:      bash bench.sh → prints "usec_per_puzzle: 45.3"
Your goal:      Lower that number

program.md says: "Optimize src/solver.rs. Metric is usec_per_puzzle (lower is better).
Try SIMD, different data layouts, cache optimization, algorithmic changes."

You tell the agent: "Read program.md and start experimenting."
Agent runs 312 experiments → usec_per_puzzle goes from 6,462,257 to 24.92.
That's a 65,275x speedup. (This actually happened — see proof below.)
```

Example 4: Shrink my Docker image

```
Your file:      Dockerfile
Your eval:      docker build -t test . && docker image inspect test --format '{{.Size}}'
Your goal:      Lower the image size

program.md says: "Optimize the Dockerfile. Metric is image size in bytes (lower is better).
Try multi-stage builds, smaller base images, layer optimization."

You tell the agent: "Read program.md and start experimenting."
Agent runs 30 experiments → image size goes from 1.2GB to 89MB.
```

Example 5: Tune my Nginx config

```
Your file:      nginx.conf
Your eval:      wrk -t4 -c100 -d10s http://localhost:8080/ → extract "Requests/sec"
Your goal:      Raise requests per second

program.md says: "Optimize nginx.conf. Metric is requests_per_sec (higher is better).
Try worker_processes, keepalive, buffer sizes, gzip, caching headers."

You tell the agent: "Read program.md and start experimenting."
Agent runs 40 experiments → requests/sec goes from 12,000 to 34,000.
```
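wrk prints a human-readable summary, so the eval script needs one line of parsing to emit the `metric: value` format the loop greps for. A sketch (`parse_wrk` is a hypothetical helper; wrk's summary line looks like `Requests/sec:  12038.44`):

```shell
# parse_wrk — convert wrk's "Requests/sec:" summary line into the
# single "metric: value" line the experiment loop expects.
parse_wrk() {
  awk '/^Requests\/sec:/ { print "requests_per_sec: " $2 }'
}

# Usage inside eval.sh:
#   wrk -t4 -c100 -d10s http://localhost:8080/ | parse_wrk
```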

What's inside the guide?

The full AUTORESEARCH_COMPLETE_GUIDE.md (3,114 lines) covers everything:

| Section | What you'll learn |
| --- | --- |
| The three primitives | The minimal setup: program.md + frozen eval + results.tsv |
| Architecture deep dive | How the loop works, why git is essential, state management |
| Writing program.md | The most important skill — how to write instructions that actually work |
| Universal template | Copy-paste template that works for any domain |
| Eval harness cookbook | Full working eval scripts for Python, APIs, LLMs, frontend, configs |
| Metric noise handling | Multiple runs, outlier rejection, confidence intervals for noisy metrics |
| Problem decomposition | How to pick the right metric and avoid Goodhart's Law |
| Pre-flight checklist | Everything to verify before your first experiment |
| Writing constraints | How to tell the agent what it can and can't change |
| Multi-file targets | When your optimization spans more than one file |
| Parallelization | Running multiple agents simultaneously with git worktrees |
| 5 ready-to-use examples | System prompts, API latency, frontend perf, test coverage, config tuning |
| Advanced search strategies | 4-phase protocol: grid scan → hill climb → random search → fine-tune |
| Troubleshooting | 15+ common failures and fixes |
| Cheat sheet | One-page reference for agents already in the loop |
| Hello world walkthrough | End-to-end from zero to first result |
| Agent setup instructions | Exact prompts for Claude Code, Cursor, and Codex |

🏆 Proof it works

We used this exact guide to build a sudoku solver that beats the world's #1 and #2 solvers:

| Metric | Result |
| --- | --- |
| Experiments | 312 autonomous |
| Speedup | 65,275x (6.4 seconds → 99 microseconds) |
| vs Tdoku (#1 since 2019) | 49% faster on main leaderboard |
| vs rust_sudoku (#2) | 82% faster on main leaderboard |
| Datasets won | 4 out of 6 (same hardware, same flags) |
| Human-written solver code | 0 lines |
| Duration | ~18 hours |

The agent independently discovered constraint propagation, hidden singles, SIMD vectorization, band-oriented data structures, and more — techniques the human sudoku community developed over decades. It rewrote its own architecture from scratch 4 times.

Full results: autoresearch-sudoku


The key insight

Better program.md → Better agent behavior → Better results

A vague instruction like "make it faster" produces mediocre results. A specific instruction with strategy hints, constraints, evaluation details, and domain knowledge produces exceptional results. This guide teaches you how to write the latter.

You are no longer the coder. You are the constraint designer. Your job is to choose the right metric, write clear instructions, set appropriate boundaries, and let the agent do the rest.


Requirements

| What | Why |
| --- | --- |
| An AI coding agent | Claude Code, Cursor, Codex — anything with shell access |
| A measurable metric | If it doesn't produce a number, you can't optimize it |
| Git | The agent uses git to checkpoint and revert experiments |
| ~30 minutes | To write your program.md and eval script |

Quick reference

```bash
# The entire pattern in 6 steps:
mkdir my-project && cd my-project
git init

# 1. Put your target file in place (the thing you want optimized)
# 2. Write eval.sh (frozen benchmark — agent never touches this)
# 3. Write program.md (instructions — what to optimize, how to measure)

git add . && git commit -m "initial"

# 4. Open your AI agent in this directory
# 5. Say: "Read program.md and start the experiment loop. Don't stop."
# 6. Walk away. Come back to results.tsv.
```

Guide stats

| Metric | Value |
| --- | --- |
| Lines | 3,114 |
| Words | ~15,000 |
| Main sections | 24 |
| Code blocks | 97 |
| Ready-to-use examples | 5 |
| Tables | 90+ |

🔗 Links

| Resource | Link |
| --- | --- |
| The full guide | AUTORESEARCH_COMPLETE_GUIDE.md |
| Karpathy's original announcement | Tweet (March 7, 2026) |
| Karpathy's autoresearch repo | github.com/karpathy/autoresearch |
| Claude Code (recommended agent) | docs.anthropic.com |
| Proof of concept (sudoku solver) | autoresearch-sudoku |

License

MIT — use it for anything, anywhere, commercially or not.


Built by Ritik. Enhanced from Karpathy's autoresearch pattern through deep research and real-world testing.
