Get started with autoresearch on any codebase — two commands.
You give it a codebase. It discovers metrics to optimize, sets up the evaluation, and starts running experiments in a loop -- trying things, keeping what improves the score, throwing away what doesn't.
Inspired by Karpathy's autoresearch -- where an LLM runs training experiments autonomously to beat its own best score. Autoresearch is a pure hill climb: try something, keep or revert, repeat on a single branch. Evo adds structure on top of that idea:
- Tree search over greedy hill climb. Multiple directions can fork from any committed node, so exploration doesn't collapse to one path.
- Parallel semi-autonomous agents. Spawn multiple subagents and run them simultaneously, each in its own git worktree. Each subagent reads traces, formulates hypotheses, and can run multiple iterations within its branch.
- Shared state. Failure traces, annotations, and discarded hypotheses are accessible to every agent before it decides what to try next.
- Gating. Regression tests or safety checks can be wired up as a gate. Experiments that don't pass get discarded.
- Observability. A dashboard to monitor your experiments.
- Benchmark discovery. The discover skill explores the repo, figures out what to measure, and instruments the evaluation.
Runs on Claude Code, Codex, OpenClaw, Hermes, Opencode, or Pi. Experiments run locally or on remote sandboxes — Modal, E2B, Daytona, AWS, Azure, SSH.
Two commands:
/evo:discover # one-time code discovery: figures out benchmarks and creates gates against unintended changes
/evo:optimize # run the loop
discover asks what to optimize, the benchmark command, and the metric direction. Skip the questions by seeding the answer:
/evo:discover make the JSON parser at src/parser.py faster
Pass optimize parameters as key=value after the skill name:
/evo:optimize subagents=3 budget=10 stall=3
| Parameter | Default | Description |
|---|---|---|
| subagents | 5 | Parallel subagents per round |
| budget | 5 | Max iterations each subagent can run within its branch |
| stall | 5 | Consecutive rounds with no improvement before auto-stopping |
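As a rough illustration of the parameters above, the sketch below merges key=value arguments with the documented defaults and applies a stall rule. The function names, the error handling, and the assumption that a higher score is better are hypothetical; this is not evo's actual parser or stopping logic.

```python
# Hypothetical sketch, not evo's implementation.
DEFAULTS = {"subagents": 5, "budget": 5, "stall": 5}  # defaults from the table


def parse_optimize_args(arg_string):
    """Turn 'subagents=3 budget=10' into a dict, filling in defaults."""
    params = dict(DEFAULTS)
    for token in arg_string.split():
        key, sep, value = token.partition("=")
        if not sep or key not in DEFAULTS:
            raise ValueError(f"unknown or malformed parameter: {token!r}")
        params[key] = int(value)
    return params


def should_stop(best_per_round, stall):
    """True when the last `stall` rounds failed to beat the earlier best.

    Assumes higher scores are better; flip the comparison for a
    metric that should be minimized.
    """
    if len(best_per_round) <= stall:
        return False
    earlier_best = max(best_per_round[:-stall])
    return max(best_per_round[-stall:]) <= earlier_best
```

For example, `parse_optimize_args("subagents=3 budget=10 stall=3")` overrides all three defaults, while an empty string returns the defaults unchanged.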
Invocation syntax is host-specific: /evo: on Claude Code, $evo on Codex, natural language on Hermes, Opencode, OpenClaw, and Pi.
# 1. evo CLI
uv tool install evo-hq-cli
# 2. Host CLI (if you don't already have it)
npm install -g @anthropic-ai/claude-code # or @openai/codex, openclaw, @earendil-works/pi-coding-agent
# 3. Plugin + host hooks
evo install <host> # claude-code | codex | hermes | opencode | openclaw | pi
evo install <host> installs the plugin into the host's marketplace and stages the hooks evo needs to talk to in-flight subagents. Verify with evo doctor <host>.
For remote backends, install with the matching provider extra: uv tool install 'evo-hq-cli[modal]' (or [e2b], [daytona], [aws], [azure], [all]).
Codex requires manual approval for plugin hooks. After install, run /hooks inside codex to trust evo's hooks — or pass --trust-hooks to evo install codex to skip the prompt.
The orchestrator dispatches subagents in parallel. Each runs in its own isolated workspace, picks up shared state (failure traces, annotations, discarded hypotheses), forms a hypothesis, edits, and runs the benchmark. A subagent with iteration budget remaining continues on its branch within the same round when its prior edit warrants a follow-up.
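The round structure described above can be sketched as follows. This is a simplified model that uses threads in place of real subagents and worktrees; `run_subagent`, `run_round`, and the `propose` callback are hypothetical names, and the real orchestrator is not shown in this document.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(agent_id, budget, propose):
    """Run up to `budget` iterations on one branch.

    `propose(agent_id, iteration)` is a hypothetical callback returning
    (score, wants_follow_up). The subagent stops early when its last
    edit does not warrant a follow-up.
    """
    scores = []
    for iteration in range(budget):
        score, wants_follow_up = propose(agent_id, iteration)
        scores.append(score)
        if not wants_follow_up:
            break
    return scores


def run_round(subagents, budget, propose):
    """Dispatch all subagents in parallel and collect their scores in order."""
    with ThreadPoolExecutor(max_workers=subagents) as pool:
        futures = [
            pool.submit(run_subagent, agent_id, budget, propose)
            for agent_id in range(subagents)
        ]
        return [future.result() for future in futures]
```

Each subagent returns one score per iteration it actually ran, so a subagent that stops after its first edit contributes a single-element list.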
After each round, the orchestrator selects which committed branch to extend next. Available strategies:
- argmax — extend the highest-scoring branch
- top_k — round-robin among the K best
- epsilon_greedy — best most of the time, random sometimes
- softmax — sample weighted by score
- pareto_per_task — keep specialists the aggregate hides, inspired by GEPA
Configure in the dashboard's Frontier tab, which lists each strategy's parameters.
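To make the strategies above concrete, here is a minimal sketch of each selection rule over a dict of branch scores. It assumes higher is better. The parameter names (`k`, `epsilon`, `temperature`) are illustrative rather than evo's configuration keys, and `pareto_front` is a generic non-dominated filter, not necessarily what evo or GEPA implements.

```python
import math
import random


def argmax(scores):
    """Pick the highest-scoring branch."""
    return max(scores, key=scores.get)


def top_k(scores, k, round_index):
    """Round-robin among the k best branches."""
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return best[round_index % len(best)]


def epsilon_greedy(scores, epsilon, rng=random):
    """Best branch most of the time, a uniformly random one otherwise."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))
    return argmax(scores)


def softmax(scores, temperature, rng=random):
    """Sample a branch with probability proportional to exp(score / T)."""
    names = sorted(scores)
    top = max(scores.values())  # subtract the max for numerical stability
    weights = [math.exp((scores[n] - top) / temperature) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


def pareto_front(per_task):
    """Branches that no other branch matches or beats on every task."""
    def dominates(a, b):
        pairs = list(zip(a, b))
        return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

    return [
        name for name, s in per_task.items()
        if not any(dominates(o, s) for other, o in per_task.items() if other != name)
    ]
```

The Pareto filter shows why a per-task view matters: a branch that wins on one task but has a mediocre aggregate survives, while argmax over the aggregate would drop it.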
Between rounds, RLM-inspired scan subagents read trace batches in parallel and surface compound failure patterns: gate-failure intersections, shared root causes across traces. Findings land in shared state, which the next round's subagents read at startup.
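One way to picture a gate-failure intersection is a count of gates that fail together across traces. The sketch below assumes a hypothetical trace shape (a dict with a `failed_gates` list); evo's actual trace format and scan logic are not described here.

```python
from collections import Counter
from itertools import combinations


def gate_failure_intersections(traces, min_count=2):
    """Count pairs of gates that fail in the same trace.

    `traces` is assumed to be a list of dicts with a "failed_gates" list
    (a hypothetical schema). Pairs seen in fewer than `min_count` traces
    are dropped.
    """
    pair_counts = Counter()
    for trace in traces:
        failed = sorted(set(trace["failed_gates"]))
        pair_counts.update(combinations(failed, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}
```

A pair that keeps failing together hints at a shared root cause worth recording in shared state before the next round.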
evo introduces gates: pass/fail checks that run on every experiment. An experiment that fails a gate is discarded even if its score beats the current best. Without gates, the search will find ways to return a constant, skip work, or trade correctness for speed.
Any command that exits zero on pass and non-zero on fail qualifies as a gate: a test suite, an invariant script, a score floor on a held-out slice of the benchmark. Gates inherit down the experiment tree: a gate registered at the root runs on every descendant. Narrower gates can be attached to specific branches.
When discover builds a benchmark from scratch, it attaches a held-out-slice score-floor gate automatically. When the benchmark already exists in the repo, gates are opt-in.
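The gate rules above (exit code decides pass or fail, gates inherit down the tree, a failing gate discards the experiment regardless of score) can be sketched like this. `Node`, `passes_gates`, and `keep_experiment` are hypothetical names for illustration, not evo's API, and the sketch assumes higher scores are better.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One experiment in the tree; each gate is an argv list."""
    name: str
    parent: Optional["Node"] = None
    gates: list = field(default_factory=list)

    def effective_gates(self):
        """Gates inherited from every ancestor, followed by this node's own."""
        inherited = self.parent.effective_gates() if self.parent else []
        return inherited + self.gates


def passes_gates(node, cwd=None):
    """A gate passes when its command exits with status zero."""
    for argv in node.effective_gates():
        if subprocess.run(argv, cwd=cwd).returncode != 0:
            return False
    return True


def keep_experiment(node, score, best_score):
    """Discard on any gate failure, even if the score beats the best."""
    return passes_gates(node) and score > best_score


if __name__ == "__main__":
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    root = Node("root", gates=[ok])
    print(keep_experiment(root, score=2.0, best_score=1.0))
```

In practice a gate's argv would be something like a test runner invocation; the demo uses a trivial Python one-liner so the sketch runs anywhere.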
| Backend | Where | Install |
|---|---|---|
| worktree (default) | local git worktree per experiment | included |
| pool | reuse a fixed set of local workspaces | included |
| ssh | your own SSH host | included |
| modal | Modal serverless cloud | uv tool install 'evo-hq-cli[modal]' |
| e2b | E2B cloud sandboxes | uv tool install 'evo-hq-cli[e2b]' |
| daytona | Daytona cloud workspaces | uv tool install 'evo-hq-cli[daytona]' |
| aws | AWS EC2 sandboxes | uv tool install 'evo-hq-cli[aws]' |
| azure | Azure VMs | uv tool install 'evo-hq-cli[azure]' |
Pick and configure in the dashboard's Backend tab.
The dashboard starts automatically with /evo:discover (or evo init) and prints the URL in chat:
Dashboard live: http://127.0.0.1:8080 (pid 12345)
If 8080 is in use, evo increments to the next free port (8081, 8082, …) and prints it. Subsequent runs reuse the chosen port. Start it manually with:
uv run --project /path/to/evo/plugins/evo evo dashboard --port 8080
To upgrade the host plugin:
evo update <host> # host: claude-code | codex | hermes | opencode | openclaw | pi
evo update <host> --version 0.4.1 # pin to a release
See evo update --help for --force, --scope, and additional flags.
Reinstall the CLI and refresh the host plugin cache:
uv tool install --force evo-hq-cli && evo update --force
--force wipes the host plugin cache and reinstalls, working around anthropics/claude-code#14061: /plugin update returns success but does not replace cached plugin files.
uv and pip skip pre-releases by default. To install an alpha, pin both the CLI version and the host plugin tag:
uv tool install --force 'evo-hq-cli==0.4.1a2' && \
evo update --version 0.4.1-alpha.2 --force
Substitute the target alpha version. The CLI uses PEP 440 form (0.4.1a2); the marketplace tag uses the dash form (v0.4.1-alpha.2).
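If you script upgrades, the two spellings of an alpha version can be derived from one another. The helper below is a small sketch covering only the alpha case shown above (0.4.1a2 and 0.4.1-alpha.2). The function name is hypothetical, and other pre-release kinds are deliberately rejected because their tag spelling is not documented here.

```python
import re


def pep440_to_dash_form(version):
    """Convert '0.4.1a2' to '0.4.1-alpha.2'; final releases pass through.

    Only alpha pre-releases are handled, since that is the only mapping
    this document confirms.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:a(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version: {version!r}")
    release, alpha_number = match.groups()
    if alpha_number is None:
        return release
    return f"{release}-alpha.{alpha_number}"
```

Pass the result to evo update --version; the marketplace tag is the same string with a leading v.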
For development on evo:
git clone https://github.com/evo-hq/evo
cd evo
uv tool install --editable plugins/evo
Apache-2.0. See LICENSE.

