loopscore is a CLI-based evaluation tool designed for developers who need a "gut check" on their agentic workflows. It automates the process of running identical tasks across different agent configurations to see which one actually ships working code and which one just burns tokens.
In the rapidly evolving AI landscape, "vibes-based" testing eventually hits a wall. You need this tool if you are:
- Benchmarking Models: Comparing whether `Claude 3.5 Sonnet` truly outperforms `GPT-4o` for your specific codebase.
- Prompt Engineering: Testing whether a new `AGENTS.md` or system instructions actually improve logic or just change the coding style.
- Tooling ROI: Evaluating whether adding a Model Context Protocol (MCP) server provides a measurable lift in quality or simply adds latency.
Question: Does the caveman skill save tokens without compromising quality? Let's find out with an artificial benchmark!
```sh
npx @sebastianmattar/loopscore init  # initialize the current directory with config and presets
npx @sebastianmattar/loopscore run benchmarks/caveman-skill.config.yaml  # execute the test
```

This will generate a `benchmarks/results` directory containing details of all runs as well as a `summary.md`. This is the relevant table:
| Agent | Variant | Overall | LLM Judge | Checks | Time (ms) | Lines | Est. cost | Runs |
|---|---|---|---|---|---|---|---|---|
| copilot | without-caveman | 2.990 | 0.980 | 5.000 | 153274 | 153 | — | 1 |
| copilot | with-caveman | 2.940 | 0.880 | 5.000 | 104234 | 116 | — | 1 |
Answer: The caveman run is faster and uses fewer tokens (cheaper), but scores lower on quality - interesting!
The tool executes a standardized set of commands across multiple agentic environments. Once the agents complete their tasks, loopscore aggregates the data and runs an evaluation suite.
For each run the process is as follows (see the config sketch below):
- Inject files such as `AGENTS.md` and `REQUIREMENTS.md`
- Execute setup commands, e.g. to install Skills
- Start an agent with your query
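As a rough sketch, a config tying these steps together might look like the following. Only `variants`, `name`, `agent`, and `query` are confirmed by the examples later in this README; the `inject` and `setup` keys are illustrative placeholders for the file-injection and setup-command steps, so check the schema installed by `loopscore init` for the real field names:

```yaml
# Illustrative sketch - `inject` and `setup` are assumed key names,
# not confirmed config fields; see the schema from `loopscore init`.
variants:
  - name: with-caveman
    inject:                         # step 1: files placed into the workspace
      - AGENTS.md
      - REQUIREMENTS.md
    setup:                          # step 2: commands run before the agent starts
      - npx install-caveman-skill   # hypothetical command
    query:                          # step 3: the prompt handed to the agent
      - Implement requirements.md
```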
The generated results will be analyzed according to the following metrics.
| Metric | Description |
|---|---|
| Efficiency | Total tokens consumed vs. wall-clock time taken. (Lower is better) |
| Output Volume | Lines of code (LOC) generated to solve the problem. (Lower is better) |
| Code Complexity | Structural analysis of the resulting code. (Lower is better) |
| Requirement Fit | An Agentic Judge reviews the output against the original prompt to score success. (Higher is better) |
| Shell Test Cases | Run shell commands and evaluate the exit code to generate a score (see the example below). (Higher is better) |
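For instance, a shell test case could be declared as a measure. The `type: shell` and `command` keys below are assumptions modeled on the `judge` measure shown further down, so verify them against the bundled schema:

```yaml
# Assumed keys - modeled on the `judge` measure; check the schema.
measure:
  - type: shell        # hypothetical measure type for a shell test case
    command: npm test  # exit code 0 counts as a pass, per the table above
```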
Benchmark results in the LLM world should be used with caution. Keep the following in mind:
- Benchmarking Complexity: Creating realistic benchmarks is very hard. Models degrade with larger contexts, and simple benchmarks without user interaction capture only a narrow slice of real work.
- Environmental Noise: Results vary due to hardware, seed randomness, and provider-side load balancing.
- Provider Variability: API providers often adjust "thinking budgets" or update system prompts without notice, and caching can mask variance.
- "Benchmaxxing": Some models are fine-tuned specifically to score high on public benchmarks and common frameworks, but may falter in messy, real-world repositories.
- Judge Subjectivity: The "LLM Judge" is itself an LLM, making it susceptible to the same hallucinations and biases as the agents it evaluates.
- Agent Awareness: Agents may become aware that they are being benchmarked and behave differently as a result.
To get started, ensure your environment meets these requirements:
- OS: Anything common and Unix-based: macOS, Linux, or WSL
- Runtime: Node.js 22+
- Agent Access: Target agent CLIs (e.g., `copilot`, `gemini`, `claude`) must be installed and accessible via your `$PATH`. Currently supported:
  - GitHub Copilot
  - Google Gemini
  - OpenCode
- Authentication: Agents must be pre-authenticated. You can manage API keys and secrets using a standard `.env` file in your local directory if you need to (see the example below).
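For example, a minimal `.env` might hold provider credentials. The variable names below depend on the agent CLIs you use; they are common examples, not requirements of loopscore itself:

```sh
# .env - example only; variable names depend on your agent CLIs
GEMINI_API_KEY=your-gemini-key
GH_TOKEN=your-github-token
```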
Initialize a workspace with the schema, VS Code YAML settings, starter skill, and example benchmark:
```sh
npx loopscore init
```

Then run a bundled example benchmark:

```sh
npx loopscore run benchmarks/caveman-skill.config.yaml
```

There is also a dedicated OpenCode example benchmark:

```sh
npx loopscore run benchmarks/opencode.config.yaml
```

loopscore currently supports these built-in agent providers:

- `copilot`
- `gemini`
- `opencode`
You can add an `opencode` variant by overriding the agent on a single benchmark variant:

```yaml
variants:
  - name: copilot-baseline
    query:
      - Implement requirements.md
  - name: opencode-baseline
    agent:
      type: opencode
      model: github-copilot/gpt-5.4
      options:
        agent: build
        variant: high
    query:
      - Implement requirements.md
```

OpenCode runs through `opencode run ... --format json` and supports headless benchmark execution.
The `opencode` provider supports these `agent.options` fields: `agent`, `continue`, `session`, `fork`, `share`, `file`, `title`, `attach`, `password`, `dir`, `port`, `variant`, `thinking`, `dangerouslySkipPermissions`, `command`, `pure`, `logLevel`, `printLogs`.
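As a sketch, a variant exercising a few of these options might look like this; the values are illustrative, and the exact option types (e.g., for `logLevel`) may differ across OpenCode versions:

```yaml
variants:
  - name: opencode-verbose
    agent:
      type: opencode
      model: github-copilot/gpt-5.4
      options:
        agent: build        # OpenCode agent to run
        variant: high       # effort variant, as in the example above
        dir: ./workspace    # working directory for the run (illustrative)
        logLevel: DEBUG     # illustrative value; check `opencode run --help`
        printLogs: true     # surface OpenCode logs in the output
    query:
      - Implement requirements.md
```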
The `model` field for `opencode` should use the CLI's `provider/model` format. The exact provider IDs depend on what your local OpenCode install exposes, so check them with `opencode models`. On this machine, valid examples include `github-copilot/gpt-5.4` and `opencode/gpt-5-nano`.
If your agent CLI exposes real input and output token usage, loopscore can estimate cost from explicit model pricing instead of the older flat token heuristic.
Example:
```yaml
variantDefaults:
  agent:
    type: copilot
    model: gpt-5
    pricing:
      inputCostPerMillionTokens: 1.25
      outputCostPerMillionTokens: 10.0

measure:
  - type: judge
    provider: copilot
    model: gpt-5
    acceptanceCriteria:
      - Builds and runs
```

If pricing is present on the benchmarked agent and token usage is available, loopscore computes run cost from real input and output tokens. If not, it falls back to the legacy `costPerMillionTokens` setting and the existing token estimate heuristic.
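For comparison, a minimal sketch of the legacy flat-rate fallback; the exact placement of `costPerMillionTokens` on the agent is an assumption based on the description above, so confirm it against the bundled schema:

```yaml
variantDefaults:
  agent:
    type: copilot
    model: gpt-5
    # Legacy fallback: one flat rate applied to the estimated token count
    # (placement assumed; without a `pricing` block the heuristic is used)
    costPerMillionTokens: 5.0
```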