agent-benchmark is a comprehensive testing framework for AI agents built on the Model Context Protocol (MCP). It enables systematic testing, validation, and benchmarking of AI agents across different LLM providers with robust assertion capabilities.
- Overview
- Key Features
- Installation
- Command Line Reference
- Test Generation
- Exploratory Testing
- Configuration
- Test/Suite Definition
- Assertions
- Template System
- Data Extraction
- Reports
- Usage Examples
- Best Practices
- Troubleshooting
- CI/CD Integration
- Architecture Notes
- Contributing
agent-benchmark provides a declarative YAML-based approach to testing AI agents that interact with MCP servers. It supports multiple LLM providers, various MCP server types, and comprehensive assertion mechanisms to validate agent behavior.
Test agents across different LLM providers in parallel:
- Google AI (Gemini models)
- Vertex AI (Google Cloud Gemini)
- Anthropic (Claude models)
- OpenAI (GPT models)
- Azure OpenAI
- Groq
Connect to MCP servers via:
- stdio: Run MCP servers as local processes
- SSE: Connect to remote MCP servers via Server-Sent Events
- CLI: Wrap command-line tools as MCP-like servers for testing CLI-based tools
Organize tests into sessions with shared context and message history, simulating real conversational flows.
Run multiple test files with centralized configuration, shared variables, and unified success criteria.
Validate agent behavior with 20+ assertion types covering:
- Tool usage patterns
- Output validation
- Performance metrics
- Boolean combinators (anyOf, allOf, not) for complex logic
Dynamic test generation with Handlebars-style templates supporting:
- Random data generation
- Timestamp manipulation
- Faker integration
- String manipulation
Extract data from tool results using JSONPath to pass between tests in a session.
Generate reports in multiple formats:
- Console output with color-coded results
- HTML reports with performance comparison
- JSON export
- Markdown documentation
Load domain-specific knowledge following the agentskills.io specification:
- Parse SKILL.md files with YAML frontmatter
- Progressive disclosure of reference files
- Template variable {{SKILL_DIR}} for skill paths
Install the latest version with a single command:
Linux/macOS:
curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.sh | bash

Windows (PowerShell):
irm https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.ps1 | iex

Minimal Install (60-70% smaller download)
For slower connections or to save bandwidth, use the UPX-compressed version:
curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install-min.sh | bash

Note: The minimal version may trigger antivirus warnings on some systems, as UPX compression is sometimes flagged by security software.
Manual Installation from Pre-built Binaries
Download the appropriate file for your system from the releases page:
Regular versions (recommended):
- Linux (AMD64): agent-benchmark_vX.X.X_linux_amd64.tar.gz
- Linux (ARM64): agent-benchmark_vX.X.X_linux_arm64.tar.gz
- macOS (Intel): agent-benchmark_vX.X.X_darwin_amd64.tar.gz
- macOS (Apple Silicon): agent-benchmark_vX.X.X_darwin_arm64.tar.gz
- Windows (AMD64): agent-benchmark_vX.X.X_windows_amd64.zip
- Windows (ARM64): agent-benchmark_vX.X.X_windows_arm64.zip
UPX compressed (smaller size, not available for Windows ARM64):
- Linux (AMD64): agent-benchmark_vX.X.X_linux_amd64_upx.tar.gz
- Linux (ARM64): agent-benchmark_vX.X.X_linux_arm64_upx.tar.gz
- macOS (Intel): agent-benchmark_vX.X.X_darwin_amd64_upx.tar.gz
- macOS (Apple Silicon): agent-benchmark_vX.X.X_darwin_arm64_upx.tar.gz
- Windows (AMD64): agent-benchmark_vX.X.X_windows_amd64_upx.zip
Extract and move to your PATH:
# Linux/macOS
tar -xzf agent-benchmark_*.tar.gz
sudo mv agent-benchmark /usr/local/bin/
# Windows
# Extract the ZIP file and add the binary to your PATH

Build from Source
Requirements: Go 1.25 or higher
Linux/macOS:
# Clone the repository
git clone https://github.com/mykhaliev/agent-benchmark
cd agent-benchmark
# Build the binary
go build -o agent-benchmark
# (Optional) Move to your PATH
sudo mv agent-benchmark /usr/local/bin/

Windows (PowerShell):
# Clone the repository
git clone https://github.com/mykhaliev/agent-benchmark
cd agent-benchmark
# Build the binary
.\build.ps1
# or
go build -o agent-benchmark.exe

After installation, verify it works:
agent-benchmark -v

Get AI-powered assistance when writing test configurations in VS Code, Cursor, or other editors with Agent Skills support.
Download agent-benchmark-skills_*.zip from releases and extract:
Linux/macOS:
unzip agent-benchmark-skills_*.zip -d ~/.copilot/skills/

Windows (PowerShell):
Expand-Archive agent-benchmark-skills_*.zip -DestinationPath $env:USERPROFILE\.copilot\skills\

Once installed, your AI assistant will have domain knowledge about:
- Provider configuration (Azure, OpenAI, Anthropic, Google, Vertex AI, Groq)
- All 20+ assertion types with examples
- Template helpers (faker, randomValue, now, etc.)
- Best practices for writing reliable test configs
See skills/README.md for more details.
Run your first benchmark:
agent-benchmark -f tests.yaml -o report.html -verbose

agent-benchmark [options]
Required (one of):
-f <file> Path to test configuration file (YAML)
-s <file> Path to suite configuration file (YAML)
-g <file> Path to generator config file (enables test generation mode)
-e <file> Path to explorer config file (enables exploratory testing mode)
-generate-report <file> Generate HTML report from existing JSON results file
(reads test_file from JSON to load AI summary config)
Generator options (require -g):
--dry-run Preview generated YAML without saving
--output-dir <dir> Directory for generated test files (default: ./generated_tests)
--seed <int> Random seed for deterministic generation
Explorer options (require -e):
(none currently; all settings live in the explorer: YAML block)
Optional:
-o <file> Output report path/filename without extension
Default: <test_dir>/test_results/report
The test_results folder is auto-created and git-ignored
-l <file> Log file path (default: stdout)
-reportType <types> Report format(s): html, json, md (default: html)
Multiple formats supported as comma-separated values
Examples: -reportType html
-reportType html,json
-reportType html,json,md
-verbose Enable verbose logging
-v Show version and exit

Examples:
# Run single test file with verbose output
# Reports saved to: examples/test_results/report.html
./agent-benchmark -f examples/tests.yaml -verbose
# Run test suite with JSON report (custom output path)
./agent-benchmark -s suite.yaml -o ./my-reports/results -reportType json
# Run with custom log file
./agent-benchmark -f tests.yaml -l test-run.log
# Generate Markdown report
./agent-benchmark -f tests.yaml -o report -reportType md
# Generate HTML report from existing JSON results (fast iteration)
# Reads test_file from JSON to load AI summary configuration
./agent-benchmark -generate-report results.json -o new-report
# Generate both JSON and HTML reports (for later regeneration)
./agent-benchmark -f tests.yaml -o results -reportType json,html

Use the -g flag to automatically generate a ready-to-run test suite from a generator config.
The generator connects to your MCP servers, discovers tool schemas, and uses an LLM to produce
test sessions with typed assertions; no test authoring required.
# Preview generated YAML without saving anything
./agent-benchmark -g examples/generator-config.yaml --dry-run
# Generate and save to a timestamped directory under ./generated_tests/
./agent-benchmark -g examples/generator-config.yaml
# Custom output directory and deterministic seed
./agent-benchmark -g gen.yaml --output-dir ./tests --seed 42

The generator runs three sequential phases:
Phase 1: Plan
A focused LLM call produces a compact JSON test plan: session names, test names, expected tools,
and high-level assertion ideas. The plan is validated against the actual tool list before moving
on. If validation fails, the plan is regenerated (up to max_retries times).
Phase 2: Intent
For each test in the plan, a separate LLM call produces a TestIntent: a flat JSON object with
the prompt, typed assertion checks, and optional JSONPath extractors. The intent is validated
(correct assertion types, real tool names, no forward variable references). If validation fails,
one automatic repair attempt is made; if that also fails, the generator retries the full intent
generation. Hard failure only if all max_retries are exhausted.
Phase 3: Build
The validated intents are assembled deterministically into model.Session structs and
serialized to YAML. No further LLM calls are made in this phase.
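For illustration, a validated Phase 2 intent might look roughly like this. The field names below are assumptions for the sake of the example, not the framework's actual TestIntent schema:

```json
{
  "test_name": "create and verify a file",
  "prompt": "Create report.txt containing the word done, then read it back",
  "checks": [
    { "type": "tool_called", "tool": "write_file" },
    { "type": "output_contains", "value": "done" }
  ],
  "extractors": [
    { "name": "created_path", "tool": "write_file", "path": "$.path" }
  ]
}
```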
Each run writes to a timestamped subdirectory under --output-dir (default ./generated_tests):
generated_tests/
└── generated_20260301_120000/
    ├── suite.yaml            ← run this with: agent-benchmark -s
    ├── file-operations.yaml
    └── error-handling.yaml
suite.yaml references every session file and is pre-populated with the original providers,
servers, agents, and variables from the generator config, so the output is immediately runnable:
./agent-benchmark -s generated_tests/generated_20260301_120000/suite.yaml

The generator: block controls generation behaviour:
| Field | Description | Default |
|---|---|---|
| agent | Agent whose LLM is used for generation | first agent |
| test_count | Number of tests to generate across all sessions | 5 |
| complexity | simple \| medium \| complex (see below) | medium |
| include_edge_cases | Include error/boundary condition tests | false |
| max_steps_per_test | Max tool-call steps expected per test | 5 |
| max_retries | Max LLM attempts per phase before giving up | 3 |
| max_tokens | Stop if cumulative LLM tokens exceed this limit (0 = unlimited) | 0 |
| max_iterations | Max LLM conversation turns per generation call | engine default |
| plan_chunk_size | Max tests per plan chunk (0 = use default of 5) | 5 |
| plan_chunk_max_tokens | Max output tokens per plan chunk LLM call (0 = auto) | auto |
| tools | Allowlist of tool names to test (empty = all tools) | all tools |
| goal | Extra instruction injected into the generation prompt | - |
Complexity levels:
| Level | Behaviour |
|---|---|
| simple | One tool call per test; straightforward prompts |
| medium | One to three tool calls; may chain tool results |
| complex | Multi-step workflows; may use anyOf/allOf assertion combinators |
See examples/generator-config.yaml for a fully annotated example.
Use the -e flag to run an autonomous exploration session. The explorer LLM
iteratively decides what test to run next, executes it against the configured
agent, observes the result, and plans the next iteration, all without
predefined test cases.
# Run an exploration session with default HTML report
./agent-benchmark -e examples/explorer-config.yaml
# With verbose logging and custom report path
./agent-benchmark -e explorer.yaml -o ./reports/explore -reportType html,json -verbose

The explorer: block in your config controls the behaviour:
| Field | Description | Default |
|---|---|---|
| goal | What the explorer is trying to test (required) | - |
| max_iterations | Maximum number of test iterations | 10 |
| stop_on_pass_count | Stop after N consecutive passes (0 = run all iterations) | 0 |
| max_retries | LLM retry attempts per iteration if parsing fails | 3 |
| max_tokens | Stop the exploration loop if cumulative tokens exceed this limit (0 = unlimited) | 0 |
| agent | Agent name reference; must match a name in the top-level agents list. Its provider and servers are used for both exploration decisions and test execution. Defaults to the first agent when omitted. | - |
How exploration results appear in reports:
Results are fed into the standard report pipeline; no new report format is needed. Exploration metadata is encoded into existing report fields:
| Metadata | Where it renders |
|---|---|
| Exploration: <goal> | Suite header |
| Exploration Goal: <goal> | Session group header |
| [Iter NN \| prompt-NNN] <test name> | Test group title |
| Explorer LLM reasoning + decision prompt | Conversation history (system message) |
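A minimal explorer config might look like the sketch below. Provider, server, and agent values are placeholders; only the explorer: fields come from the table above, and examples/explorer-config.yaml remains the authoritative reference.

```yaml
# Hypothetical minimal explorer config
providers:
  - name: gemini
    type: GOOGLE
    token: "{{GOOGLE_API_KEY}}"
    model: gemini-2.0-flash
servers:
  - name: filesystem
    type: stdio
    command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
  - name: explore-agent
    provider: gemini
    servers:
      - name: filesystem
explorer:
  goal: "Probe file-creation edge cases (long names, unicode, empty content)"
  max_iterations: 15
  stop_on_pass_count: 3
  max_tokens: 200000
  agent: explore-agent
```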
See examples/explorer-config.yaml for a fully annotated example.
Configuration files use YAML format with six main sections:
providers: # LLM provider configurations
servers: # MCP server definitions
agents: # Agent configurations
sessions: # Test sessions
settings: # Global test settings
variables: # Reusable variables

The framework supports running multiple test files through a suite configuration:
name: "Complete Test Suite"
test_files:
- tests/basic-operations.yaml
- tests/advanced-features.yaml
- tests/edge-cases.yaml
providers:
- name: gemini
type: GOOGLE
token: "{{GOOGLE_API_KEY}}"
model: gemini-2.0-flash
servers:
- name: filesystem
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
- name: test-agent
provider: gemini
servers:
- name: filesystem
settings:
verbose: true
max_iterations: 10
tool_timeout: 30s
test_delay: 2s
variables:
base_path: "/tmp/tests"
timestamp: "{{now format='unix'}}"
criteria:
success_rate: "0.8" # 80% of tests must pass

Suite Configuration Benefits:
- Centralized provider and server definitions
- Shared variables across all test files
- Unified success criteria
- Single command execution for multiple test files
Provider Types:
- GOOGLE - Google AI (Gemini)
- VERTEX - Vertex AI (Google Cloud Gemini)
- ANTHROPIC - Anthropic (Claude)
- OPENAI - OpenAI (GPT)
- AZURE - Azure OpenAI
- GROQ - Groq
Define LLM providers for your agents:
providers:
- name: gemini-flash
type: GOOGLE
token: {{GOOGLE_API_KEY}}
model: gemini-2.0-flash
- name: claude-sonnet
type: ANTHROPIC
token: {{ANTHROPIC_API_KEY}}
model: claude-sonnet-4-20250514
- name: gpt-4
type: OPENAI
token: {{OPENAI_API_KEY}}
model: gpt-4o-mini
baseUrl: https://api.openai.com/v1 # Optional
- name: azure-gpt
type: AZURE
token: {{AZURE_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
- name: azure-entra
type: AZURE
auth_type: entra_id # Use Microsoft Entra ID authentication (passwordless)
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
- name: vertex-ai
type: VERTEX
project_id: "your-gcp-project-id"
location: "us-central1"
credentials_path: "/path/to/service-account.json"
model: gemini-2.0-flash
- name: groq
type: GROQ
token: {{GROQ_API_KEY}}
model: openai/gpt-oss-120b
baseUrl: https://api.groq.com/openai/v1 # Optional

The AZURE provider supports two authentication methods:
API Key Authentication (default):
providers:
- name: azure-apikey
type: AZURE
auth_type: api_key # Optional, this is the default
token: {{AZURE_OPENAI_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview

Microsoft Entra ID Authentication (passwordless):
providers:
- name: azure-entra
type: AZURE
auth_type: entra_id # Uses DefaultAzureCredential
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
# No token required - uses Azure credentials from environment

Entra ID authentication uses Azure's DefaultAzureCredential, which automatically tries multiple authentication methods in order:
- Environment variables: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET
- Workload Identity (for Kubernetes)
- Managed Identity (when running in Azure)
- Azure CLI (az login)
- Azure Developer CLI (azd auth login)
- Azure PowerShell (Connect-AzAccount)
Required RBAC Role:
Your identity must have the "Cognitive Services OpenAI User" role (or higher) assigned on the Azure OpenAI resource. Without this role, you will receive a 401 Unauthorized error.
To assign the role using Azure CLI:
# Get your Azure OpenAI resource ID
az cognitiveservices account show \
--name <your-openai-resource-name> \
--resource-group <your-resource-group> \
--query id -o tsv
# Assign the required role
az role assignment create \
--assignee <your-email-or-principal-id> \
--role "Cognitive Services OpenAI User" \
--scope <resource-id-from-above>

Note: Role assignments can take up to 5-10 minutes to propagate.
For more information, see Azure Identity authentication and Azure OpenAI RBAC roles.
Providers can be configured with rate limits to proactively throttle requests and avoid exceeding API quotas:
providers:
- name: azure-gpt
type: AZURE
token: {{AZURE_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
rate_limits:
tpm: 30000 # Tokens per minute limit (proactive throttling)
rpm: 60 # Requests per minute limit (proactive throttling)
retry:
retry_on_429: true # Enable retry on 429 errors (default: false)
max_retries: 3 # Max retry attempts (default: 3 when enabled)

Configuration Options:
| Option | Description | Default |
|---|---|---|
| rate_limits.tpm | Maximum tokens per minute | No limit |
| rate_limits.rpm | Maximum requests per minute | No limit |
| retry.retry_on_429 | Enable automatic retry on 429 errors | false |
| retry.max_retries | Number of retry attempts | 3 (when enabled) |
How it works:
- Uses token bucket algorithm to proactively throttle requests before sending
- Estimates tokens using tiktoken for OpenAI models
- Falls back to cl100k_base encoding for non-OpenAI models (Claude, Gemini, Llama, etc.)
- Runtime calibration adjusts estimates based on actual API responses
- 429 retry handling provides a safety net when estimates fall short
Best Practice: Enable both rate_limits (proactive) and retry_on_429 (reactive) for defense in depth.
Important: Rate limiting is best-effort, not guaranteed. Token estimation varies by provider. For detailed technical information, see docs/rate-limiting.md.
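The proactive throttling described above is the classic token-bucket algorithm. The following is an illustrative sketch of that general technique, not agent-benchmark's actual implementation: a bucket for a tpm limit refills at tpm/60 tokens per second, and each request blocks until its estimated cost is available.

```python
import time

class TokenBucket:
    """Illustrative token-bucket limiter; a sketch of the general
    algorithm, not agent-benchmark's implementation."""

    def __init__(self, per_minute: int):
        self.capacity = per_minute
        self.tokens = float(per_minute)       # start with a full bucket
        self.refill_rate = per_minute / 60.0  # tokens regained per second
        self.last = time.monotonic()

    def _refill(self):
        # Credit tokens for elapsed time, capped at bucket capacity
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now

    def acquire(self, cost: int) -> float:
        """Block until `cost` tokens are available; return seconds slept."""
        slept = 0.0
        while True:
            self._refill()
            if self.tokens >= cost:
                self.tokens -= cost
                return slept
            wait = (cost - self.tokens) / self.refill_rate
            time.sleep(wait)
            slept += wait

# With tpm: 30000, a 500-token request passes immediately while budget remains.
tpm_bucket = TokenBucket(per_minute=30000)
assert tpm_bucket.acquire(500) == 0.0
```

A real implementation layers the reactive retry_on_429 handling on top of this, since token estimates (tiktoken, or the 4-characters-per-token fallback) can undershoot.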
Configure MCP servers that agents will interact with:
servers:
- name: filesystem-server
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp

servers:
- name: remote-api
type: sse
url: https://api.example.com/mcp/events
headers:
- "Authorization: Bearer {{API_TOKEN}}"
- "X-Custom-Header: value"

Server Types:
- stdio - Standard Input/Output communication
- sse - Server-Sent Events over HTTP
- cli - CLI tool wrapper (see CLI Server below)
Wrap command-line tools as MCP-like servers. Useful for testing CLI-based tools:
servers:
- name: excel-cli
type: cli
command: excel-cli
shell: powershell # Shell: powershell, pwsh, cmd, bash, sh, zsh
working_dir: "{{TEST_DIR}}" # Working directory for CLI commands
tool_prefix: excel # Tool name becomes excel_execute
help_commands: # Help content for LLM context
- "excel-cli --help"

| Option | Description | Default |
|---|---|---|
| command | CLI executable to wrap (required) | - |
| shell | Shell to run commands in | powershell (Windows), bash (Unix) |
| working_dir | Working directory for commands | Current directory |
| tool_prefix | Prefix for generated tool name | cli (tool name: cli_execute) |
| help_commands | Commands to run at startup for CLI help | - |
Key Features:
- Auto-discovery: Automatically discovers subcommands from the COMMANDS: section in help output
- Help content injection: CLI help is included in the tool description for LLM context
- CLI-specific assertions: cli_exit_code_equals, cli_stdout_contains, cli_stdout_regex, cli_stderr_contains
Full CLI Server Documentation - Complete guide with examples, best practices, and troubleshooting
Control server initialization and process delays:
servers:
- name: slow-server
type: stdio
command: python server.py
server_delay: 45s # Wait up to 45s for initialization
process_delay: 1s # Wait 1s after process starts

Delay Parameters:
- server_delay - Maximum time to wait for server initialization (default: 30s)
- process_delay - Delay after starting process before initialization (default: 300ms)
servers:
- name: authenticated-api
type: sse
url: https://api.example.com/mcp/sse
headers:
- "Authorization: Bearer {{API_TOKEN}}"
- "X-API-Version: 2024-01"
- "X-Client-ID: agent-benchmark"

Define agents that combine providers with MCP servers:
agents:
- name: research-agent
provider: gemini-flash
system_prompt: |
You are an autonomous research agent.
Execute tasks directly without asking for clarification.
Use available tools to complete the requested tasks.
servers:
- name: filesystem-server
allowedTools: # Optional: restrict tool access
- read_file
- list_directory
- name: remote-api
- name: coding-agent
provider: claude-sonnet
servers:
- name: filesystem-server # No tool restrictions

Agent Configuration:
- name - Unique agent identifier
- provider - Reference to provider name
- skill - Optional Agent Skill to load (see Agent Skills section)
- system_prompt - Optional system prompt prepended to all conversations (supports templates)
- servers - List of MCP servers
- allowedTools - Optional tool whitelist per server
System Prompt Templates:
The system_prompt field supports template variables for dynamic context:
- {{AGENT_NAME}} - Current agent name
- {{SESSION_NAME}} - Current session name
- {{PROVIDER_NAME}} - Provider name being used
Example:
agents:
- name: test-agent
provider: gemini-flash
system_prompt: |
You are {{AGENT_NAME}} using {{PROVIDER_NAME}}.
Currently running session: {{SESSION_NAME}}.
Execute all tasks autonomously.

Organize tests into sessions with shared conversational context:
sessions:
- name: File Operations
tests:
- name: Create a file
prompt: "Create a file called {{filename}} with content: Hello World"
assertions:
- type: tool_called
tool: write_file
- name: Read the file
prompt: "Read the file {{filename}}"
assertions:
- type: tool_called
tool: read_file
- type: output_contains
value: "Hello World"

Session Features:
- Tests within a session share message history
- Variables persist across tests in a session
- Simulates multi-turn conversations
Agent Skills provide domain-specific knowledge to agents following the agentskills.io specification. Skills are loaded from a directory containing a SKILL.md file, and their content is injected into the agent's system prompt.
agents:
- name: skilled-agent
provider: azure-openai
skill:
path: "./skills/my-skill" # Path to skill directory
system_prompt: |
Additional instructions here...

If the skill has a references/ directory, built-in tools (list_skill_references, read_skill_reference) are automatically added for on-demand access.
For full documentation, see docs/agent-skills.md.
Global configuration for test execution:
settings:
verbose: true # Enable detailed logging
max_iterations: 10 # Maximum agent reasoning loops
timeout: 30s # Tool execution timeout (legacy, use tool_timeout)
tool_timeout: 30s # Tool execution timeout
test_delay: 2s # Delay between tests
session_delay: 30s # Delay between sessions (for COM cleanup, resource release)
variable_policy: suite-only # How suite- and test-level variables are combined (test-only, suite-only, merge-test-priority, merge-suite-priority)

When running tests as part of a test suite, variables can be defined at both the suite level and the test level.
The variable_policy setting controls how these variables are resolved.
Available Policies
| Policy | Description |
|---|---|
| suite-only (default) | Only suite-level variables are used. Test-level variables are ignored. |
| test-only | Only test-level variables are used. Suite-level variables are ignored. |
| merge-test-priority | Suite and test variables are merged. Test variables override suite variables on key conflicts. |
| merge-suite-priority | Suite and test variables are merged. Suite variables override test variables on key conflicts. |
If variable_policy is not set or has an unknown value, it defaults to suite-only.
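The four policies reduce to simple map-merge semantics. A Python sketch of the behaviour as described in the table above (illustrative, not the framework's code):

```python
def resolve_variables(policy: str, suite_vars: dict, test_vars: dict) -> dict:
    """Sketch of the four variable_policy behaviours."""
    if policy == "test-only":
        return dict(test_vars)
    if policy == "merge-test-priority":
        return {**suite_vars, **test_vars}   # test wins on key conflicts
    if policy == "merge-suite-priority":
        return {**test_vars, **suite_vars}   # suite wins on key conflicts
    return dict(suite_vars)                  # suite-only, also the default

suite_vars = {"base_path": "/srv/tests", "timeout": "30s"}
test_vars = {"base_path": "/tmp/tests", "filename": "a.txt"}

print(resolve_variables("merge-test-priority", suite_vars, test_vars))
# {'base_path': '/tmp/tests', 'timeout': '30s', 'filename': 'a.txt'}
```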
Define reusable variables with template support:
variables:
filename: "test-{{randomValue type='ALPHANUMERIC' length=8}}.txt"
timestamp: "{{now format='unix'}}"
user_id: "{{randomInt lower=1000 upper=9999}}"
email: "{{faker 'Internet.email'}}"

Variables can:
- Use template helpers
- Reference environment variables
Delay individual test execution:
tests:
- name: Rate-limited API call
prompt: "Make API request"
start_delay: 5s # Wait 5 seconds before starting
assertions:
- type: tool_called
tool: api_request

Pause between all tests:
settings:
test_delay: 2s # 2 second pause after each test

Use Cases:
- Respect API rate limits
- Allow system state to settle
- Prevent resource exhaustion
Pause between sessions to allow resource cleanup:
settings:
session_delay: 30s # 30 second pause between sessions

Use Cases:
- Allow external applications and resources to fully release between sessions
- Prevent resource contention when tests interact with stateful applications
- Avoid lingering processes from previous sessions affecting new sessions
- Give MCP servers time to cleanly shut down between sessions
Define minimum success rate for test suites:
criteria:
success_rate: 0.75 # 75% pass rate required

Exit Code Behavior:
| Scenario | Exit Code |
|---|---|
| All tests pass / Success rate met | 0 |
| Some tests fail / Success rate not met | 1 |
Reference environment variables in configuration:
providers:
- name: claude
type: ANTHROPIC
token: "{{ANTHROPIC_API_KEY}}"
model: claude-sonnet-4-20250514
servers:
- name: api-server
type: sse
url: "{{API_BASE_URL}}"
headers:
- "Authorization: Bearer {{API_TOKEN}}"
variables:
workspace: "{{WORKSPACE_PATH}}"

Convention:
- Use {{VAR_NAME}} syntax
- Set before running tests
- Common for tokens, URLs, paths
export ANTHROPIC_API_KEY="sk-ant-..."
export API_BASE_URL="https://api.example.com"
export WORKSPACE_PATH="/tmp/workspace"
./agent-benchmark -f tests.yaml

The framework provides built-in variables that are automatically available in template contexts. Variables are divided into two categories based on when they become available:
| Category | Available In | Description |
|---|---|---|
| Static | Everywhere (providers, servers, variables, prompts, assertions) | Available at configuration load time |
| Runtime | Prompts, assertions, system prompts | Available during test execution only |
These variables can be used in server commands, provider configs, user variables, prompts, and assertions:
| Variable | Description |
|---|---|
| {{TEST_DIR}} | Absolute path to the directory containing the test YAML file |
| {{TEMP_DIR}} | System temporary directory (cross-platform: %TEMP% on Windows, /tmp on Linux/macOS) |
| {{RUN_ID}} | Unique UUID v4 for this test run (e.g., 550e8400-e29b-41d4-a716-446655440000) |
| {{ANY_ENV_VAR}} | Any environment variable (e.g., {{HOME}}, {{AZURE_OPENAI_ENDPOINT}}) |
| User-defined variables | Variables defined in the variables: section of your config |
These variables are only available in prompts, assertions, and system prompts, not in server commands or provider configs:
| Variable | Description |
|---|---|
| {{AGENT_NAME}} | Current agent name |
| {{SESSION_NAME}} | Current session name |
| {{PROVIDER_NAME}} | Provider name being used |
Using TEST_DIR for Portable Paths:
{{TEST_DIR}} enables test configurations that work regardless of where the repository is cloned:
variables:
# Paths relative to the test file location
data_dir: "{{TEST_DIR}}/test-data"
output_dir: "{{TEST_DIR}}/../TestResults"
mcp_server: "{{TEST_DIR}}/bin/my-server.exe"
servers:
- name: filesystem
type: stdio
command: npx @modelcontextprotocol/server-filesystem {{output_dir}}
- name: custom-server
type: stdio
command: "{{mcp_server}}"
sessions:
- name: File Tests
tests:
- name: Process test data
prompt: "Read files from {{data_dir}} and save results to {{output_dir}}"

agent-benchmark provides 20+ assertion types to validate agent behavior:
Verify agent only uses available tools:
assertions:
- type: no_hallucinated_tools

Verify a specific tool was invoked:
assertions:
- type: tool_called
tool: create_file

Ensure a tool was NOT invoked:
assertions:
- type: tool_not_called
tool: delete_database

Validate the exact number of tool calls. The tool name is optional; if it is not specified, the total number of tool calls is verified:
assertions:
- type: tool_call_count
tool: search_api
count: 3

Verify tools were called in a specific sequence:
assertions:
- type: tool_call_order
sequence:
- validate_input
- process_data
- save_results

Check tool parameters match exactly:
assertions:
- type: tool_param_equals
tool: create_user
params:
name: "John Doe"
age: 30
email: "john@example.com"
settings.theme: "dark" # Nested parameter with dot notation

Nested Parameter Validation:
Use dot notation for nested parameters:
assertions:
- type: tool_param_equals
tool: create_resource
params:
name: "test-resource"
config.timeout: "30"
config.retry.max_attempts: "3"
config.retry.backoff: "exponential"
metadata.tags.environment: "production"

Dot Notation Rules:
- Navigate nested maps with dots
- Validate deeply nested values
- Compare exact matches at any depth
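Conceptually, dot notation is a plain nested-map walk: split the key on dots and descend one level per segment. A Python sketch of these semantics (illustrative, not the framework's code):

```python
def lookup(params: dict, dotted_key: str):
    """Resolve a dot-notation key like 'config.retry.max_attempts'
    against nested tool-call parameters; None if any segment is missing."""
    node = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

call_params = {
    "name": "test-resource",
    "config": {"retry": {"max_attempts": "3", "backoff": "exponential"}},
}
assert lookup(call_params, "config.retry.max_attempts") == "3"
assert lookup(call_params, "config.missing") is None
```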
Validate parameters with regex patterns:
assertions:
- type: tool_param_matches_regex
tool: send_email
params:
recipient: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

Validate tool results using JSONPath:
assertions:
- type: tool_result_matches_json
tool: get_user
path: "$.data.user.name"
value: "John Doe"

Check if output contains specific text:
assertions:
- type: output_contains
value: "Operation completed successfully"

Ensure output doesn't contain specific text:
assertions:
- type: output_not_contains
value: "error"

Validate output with regex pattern:
assertions:
- type: output_regex
pattern: "^User ID: [0-9]{4,}$"

Limit approximate token usage:
assertions:
- type: max_tokens
value: 1000

Token Estimation:
Token usage for OpenAI, Google, and Anthropic models is taken from GenerationInfo. For other models, the formula is:
tokens = output_length / 4
This approximation:
- Provides rough token counts
- Useful for max_tokens assertions
- Not exact (varies by tokenizer)
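The fallback heuristic is easy to reproduce when sizing a max_tokens budget; a sketch:

```python
def estimate_tokens(output: str) -> int:
    # Fallback heuristic from above: tokens = output_length / 4
    return len(output) // 4

# A 4000-character response estimates to exactly 1000 tokens,
# i.e. right at a max_tokens: 1000 assertion boundary.
assert estimate_tokens("x" * 4000) == 1000
```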
Ensure execution completes within time limit:
assertions:
- type: max_latency_ms
value: 5000 # 5 seconds

Verify execution completed without errors:
assertions:
- type: no_error_messages

Verify the test did not encounter any HTTP 429 rate limit errors:
assertions:
- type: no_rate_limit_errors

This assertion checks if the provider returned any 429 errors during execution. It's useful for:
- Ensuring tests stay within API quotas
- Validating that rate limit configuration is adequate
- Detecting when throttling is needed
Verify the agent executed tasks directly without asking for clarification. Requires clarification_detection to be enabled on the agent:
assertions:
- type: no_clarification_questions

Boolean combinators allow you to create complex assertion logic using JSON Schema-style operators. These are useful when LLMs may achieve the same outcome through different approaches.
Pass if ANY child assertion passes (OR logic):
assertions:
# Pass if the LLM used keyboard_control OR ui_automation
- anyOf:
- type: tool_called
tool: keyboard_control
- type: tool_called
tool: ui_automation

Pass if ALL child assertions pass (AND logic):
assertions:
# Pass if both conditions are met
- allOf:
- type: tool_called
tool: create_file
- type: output_contains
value: "File created successfully"

Pass if the child assertion FAILS (negation):
assertions:
# Pass if output does NOT contain "error" (equivalent to output_not_contains)
- not:
type: output_contains
value: "error"

Combinators can be nested for complex logic:
assertions:
# Pass if: (keyboard OR ui_automation) AND no errors
- allOf:
- anyOf:
- type: tool_called
tool: keyboard_control
- type: tool_called
tool: ui_automation
- type: no_error_messages
# Pass if NOT (error in output AND failed tool)
- not:
allOf:
- type: output_contains
value: "error"
- type: tool_not_called
tool: success_handler

Use Cases:
- Testing LLMs that may use different tools to achieve the same goal
- Validating that at least one of several acceptable outcomes occurred
- Creating exclusion rules (must NOT match a pattern)
- Complex conditional validation logic
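Conceptually, combinators evaluate recursively over the assertion tree. A Python sketch of these semantics, with only two leaf assertion types stubbed in for illustration (not the framework's code):

```python
def evaluate(assertion: dict, context: dict) -> bool:
    """Recursively evaluate anyOf/allOf/not combinators over leaf assertions."""
    if "anyOf" in assertion:
        return any(evaluate(a, context) for a in assertion["anyOf"])
    if "allOf" in assertion:
        return all(evaluate(a, context) for a in assertion["allOf"])
    if "not" in assertion:
        return not evaluate(assertion["not"], context)
    # Leaf assertions: only two types handled in this sketch
    if assertion["type"] == "tool_called":
        return assertion["tool"] in context["tools_called"]
    if assertion["type"] == "output_contains":
        return assertion["value"] in context["output"]
    raise ValueError(f"unknown assertion type: {assertion.get('type')}")

# (keyboard_control OR ui_automation) AND no "error" in output
ctx = {"tools_called": ["ui_automation"], "output": "done"}
combo = {"allOf": [
    {"anyOf": [
        {"type": "tool_called", "tool": "keyboard_control"},
        {"type": "tool_called", "tool": "ui_automation"},
    ]},
    {"not": {"type": "output_contains", "value": "error"}},
]}
assert evaluate(combo, ctx) is True
```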
agent-benchmark includes a powerful template engine based on Handlebars with custom helpers:
Generate random strings:
# Alphanumeric (default)
{{randomValue length=10}}
# Output: aB3xY9kL2m
# Alphabetic only
{{randomValue type='ALPHABETIC' length=8}}
# Output: AbCdEfGh
# Numeric only
{{randomValue type='NUMERIC' length=6}}
# Output: 123456
# Hexadecimal
{{randomValue type='HEXADECIMAL' length=8}}
# Output: 1a2b3c4d
# Alphanumeric with symbols
{{randomValue type='ALPHANUMERIC_AND_SYMBOLS' length=12}}
# Output: aB3@xY9!kL2#
# UUID
{{randomValue type='UUID'}}
# Output: 550e8400-e29b-41d4-a716-446655440000
# Uppercase
{{randomValue type='ALPHABETIC' length=8 uppercase=true}}
# Output: ABCDEFGH

Types:
- ALPHANUMERIC (default) - Letters and numbers
- ALPHABETIC - Letters only
- NUMERIC - Numbers only
- HEXADECIMAL - Hex characters (0-9, a-f)
- ALPHANUMERIC_AND_SYMBOLS - Letters, numbers, and symbols
- UUID - UUID v4
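The behaviour of these types can be approximated in Python; this is a sketch, not the framework's generator, and the symbol pool for ALPHANUMERIC_AND_SYMBOLS is an assumption.

```python
import secrets
import string
import uuid

# Character pools approximating the randomValue types; the symbol set
# below is an assumption, not the framework's exact pool.
POOLS = {
    "ALPHANUMERIC": string.ascii_letters + string.digits,
    "ALPHABETIC": string.ascii_letters,
    "NUMERIC": string.digits,
    "HEXADECIMAL": "0123456789abcdef",
    "ALPHANUMERIC_AND_SYMBOLS": string.ascii_letters + string.digits + "!@#$%&*",
}

def random_value(kind="ALPHANUMERIC", length=10, uppercase=False):
    """Mimic {{randomValue}}: a random string of the given type and length."""
    if kind == "UUID":
        return str(uuid.uuid4())
    value = "".join(secrets.choice(POOLS[kind]) for _ in range(length))
    return value.upper() if uppercase else value

print(random_value("NUMERIC", length=6))
print(random_value("ALPHABETIC", length=8, uppercase=True))
```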
Generate random integers:
# Random int between 0 and 100 (default)
{{randomInt}}
# Custom range
{{randomInt lower=1000 upper=9999}}
# Output: 5847
# Negative range
{{randomInt lower=-100 upper=100}}

Generate random decimal numbers:
# Random decimal between 0.00 and 100.00 (default)
{{randomDecimal}}
# Custom range
{{randomDecimal lower=10.5 upper=99.9}}
# Output: 45.73

Generate timestamps with formatting and offsets:
# Current ISO8601 timestamp (default)
{{now}}
# Output: 2024-01-15T14:30:00Z
# Unix epoch (milliseconds)
{{now format='epoch'}}
# Output: 1705329000000
# Unix timestamp (seconds)
{{now format='unix'}}
# Output: 1705329000
# Custom format (Java SimpleDateFormat style)
{{now format='yyyy-MM-dd HH:mm:ss'}}
# Output: 2024-01-15 14:30:00
# With timezone
{{now timezone='America/New_York'}}
# With offset
{{now offset='3 days'}}
{{now offset='-24 hours'}}
{{now offset='1 years'}}
# Combined
{{now format='yyyy-MM-dd' offset='7 days' timezone='UTC'}}

Offset Units: seconds/second, minutes/minute, hours/hour, days/day, weeks/week, months/month, years/year
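The offset and format handling can be modelled roughly as below. This is a simplified sketch: months and years are omitted because the real helper may apply calendar-aware arithmetic for them.

```python
from datetime import datetime, timedelta, timezone

# Seconds per offset unit (months/years intentionally omitted in this sketch).
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600,
                "day": 86400, "week": 604800}

def now_with_offset(offset=None, fmt="iso"):
    """Mimic {{now}}: current UTC time with an optional offset and format."""
    t = datetime.now(timezone.utc)
    if offset:
        amount, unit = offset.split()
        t += timedelta(seconds=int(amount) * UNIT_SECONDS[unit.rstrip("s")])
    if fmt == "epoch":
        return int(t.timestamp() * 1000)  # milliseconds
    if fmt == "unix":
        return int(t.timestamp())         # seconds
    return t.strftime("%Y-%m-%dT%H:%M:%SZ")

print(now_with_offset("-24 hours"))
print(now_with_offset("7 days", fmt="unix"))
```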
Generate realistic fake data:
# Names
{{faker 'Name.first_name'}} # John
{{faker 'Name.last_name'}} # Smith
{{faker 'Name.full_name'}} # John Smith
{{faker 'Name.prefix'}} # Mr.
{{faker 'Name.suffix'}} # Jr.
# Addresses
{{faker 'Address.street'}} # 123 Main St
{{faker 'Address.city'}} # New York
{{faker 'Address.state'}} # California
{{faker 'Address.state_abbrev'}} # CA
{{faker 'Address.country'}} # United States
{{faker 'Address.postcode'}} # 12345
# Phone
{{faker 'Phone.number'}} # 555-1234
{{faker 'Phone.number_formatted'}} # (555) 123-4567
# Internet
{{faker 'Internet.email'}} # john@example.com
{{faker 'Internet.username'}} # john_doe_123
{{faker 'Internet.url'}} # https://example.com
{{faker 'Internet.ipv4'}} # 192.168.1.1
{{faker 'Internet.ipv6'}} # 2001:0db8:85a3::8a2e:0370:7334
{{faker 'Internet.mac'}} # 00:1B:44:11:3A:B7
# Company
{{faker 'Company.name'}} # Tech Corp
{{faker 'Company.suffix'}} # Inc.
{{faker 'Company.profession'}} # Software Engineer
# Lorem
{{faker 'Lorem.word'}} # ipsum
{{faker 'Lorem.sentence'}} # Lorem ipsum dolor sit amet
{{faker 'Lorem.paragraph'}} # Full paragraph text
# Finance
{{faker 'Finance.credit_card'}} # 4532-1234-5678-9010
{{faker 'Finance.currency'}} # USD
# Misc
{{faker 'Misc.uuid'}} # 550e8400-e29b-41d4-a716-446655440000
{{faker 'Misc.boolean'}} # true/false
{{faker 'Misc.date'}} # 2024-01-15
{{faker 'Misc.time'}} # 14:30:00
{{faker 'Misc.timestamp'}} # 1705329000
{{faker 'Misc.digit'}} # 7

Remove substrings:
{{cut "Hello World" "World"}}
# Output: Hello
{{cut filename ".txt"}}

Replace substrings:
{{replace "Hello World" "World" "Universe"}}
# Output: Hello Universe
{{replace email "@example.com" "@test.com"}}

Extract substrings:
{{substring "Hello World" start=0 end=5}}
# Output: Hello
{{substring text start=6}}
# Output: Rest of string from position 6

Extract data from tool results to use in subsequent tests:
sessions:
- name: User Workflow
tests:
- name: Create user
prompt: "Create a new user"
extractors:
- type: jsonpath
tool: create_user
path: "$.data.user.id"
variable_name: user_id
assertions:
- type: tool_called
tool: create_user
- name: Get user details
prompt: "Get details for user {{user_id}}"
assertions:
- type: tool_called
tool: get_user
- type: tool_param_equals
tool: get_user
params:
id: "{{user_id}}"

Extractor Configuration:
- type - Extraction method (currently: jsonpath)
- tool - Tool name to extract from
- path - JSONPath expression
- variable_name - Variable name for template context
Use Cases:
- Extract IDs from creation operations
- Pass data between sequential tests
- Validate consistency across operations
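Conceptually, the extractor walks a JSONPath through the tool's result and stores the value under the variable name. The sketch below illustrates this for simple dotted paths only (the framework supports full JSONPath; `extract` and its sample data are illustrative, not the real API):

```python
import json

# Minimal illustration of jsonpath-style extraction for plain dotted
# paths like "$.data.user.id" (object keys only, no filters or arrays).

def extract(tool_result_json, path, variable_name, variables):
    """Walk a $.a.b.c path through a tool result and store the value."""
    node = json.loads(tool_result_json)
    for key in path.lstrip("$.").split("."):
        node = node[key]
    variables[variable_name] = node
    return node

variables = {}
result = '{"data": {"user": {"id": "u-42", "name": "Ada"}}}'
extract(result, "$.data.user.id", "user_id", variables)
print(variables["user_id"])  # u-42
```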
agent-benchmark generates comprehensive reports in multiple formats. You can specify the output filename with -o (extension added automatically) and generate multiple formats simultaneously using -reportType with comma-separated values.
View Sample Reports - See example HTML reports covering all test configuration permutations (single/multi agent, single/multi test, sessions, suites).
Report Documentation - Detailed documentation on report hierarchy, sections, and adaptive display.
- Console - Real-time colored output during execution (default, always shown)
- HTML - Rich visual dashboard with charts and metrics
- JSON - Structured data for programmatic analysis
- Markdown - Documentation-friendly format
- Realtime - Streaming NDJSON written line-by-line as each test completes
# Console output only (default)
agent-benchmark -f test.yaml
# Generate HTML report
agent-benchmark -f test.yaml -o my-report -reportType html
# Generate multiple formats
agent-benchmark -f test.yaml -o my-report -reportType html,json,md
# Realtime streaming report (useful for CI/CD pipelines and live dashboards)
agent-benchmark -f test.yaml -o my-report -reportType realtime
# Combine realtime with other formats
agent-benchmark -f test.yaml -o my-report -reportType html,json,realtimeThe realtime report type streams results to a .jsonl (JSON Lines) file as each test completes β without waiting for the full suite to finish. This enables external tools to consume results incrementally.
Output file: <name>.jsonl (e.g. -o my-report β my-report.jsonl)
Format β one JSON object per line:
{"type":"test","data":{...full TestRun...}}
{"type":"test","data":{...full TestRun...}}
{"type":"summary","data":{"total_tests":5,"passed":4,"failed":1,"pass_rate":0.8,"total_duration_ms":12340,"generated_at":"2026-04-08T10:00:00Z"}}
END

Line types:

| Line | Description |
|---|---|
| {"type":"test",...} | One line per completed test, written immediately after assertion evaluation. The data field contains the full TestRun: assertions, timestamps, latency, token counts, tool calls, errors, and more. |
| {"type":"summary",...} | Aggregate stats written once after all tests complete. |
| END | Non-JSON sentinel on the last line. Signals to parsers that the stream is complete. |
Parser pattern:
import json

with open("my-report.jsonl") as f:
    for line in follow(f):  # follow() = any tail -f style line generator
        if line.strip() == "END":
            break  # suite finished
        row = json.loads(line)
        if row["type"] == "test":
            process_test(row["data"])
        elif row["type"] == "summary":
            process_summary(row["data"])

Real-time colored output displayed during test execution with three main sections:
Server Comparison Summary
- Test-by-test comparison across agents
- Pass/fail status with checkmarks
- Duration per agent
- Provider information
- Summary statistics (e.g., "2/2 servers passed")
Detailed Test Results
- Individual test results per agent
- All assertion results with pass/fail indicators
- Detailed metrics for each assertion (expected vs actual values)
- Token usage and latency information
- Error details (if any)
Execution Summary
- Total tests, passed, and failed counts
- Pass rate percentage
- Total tool calls
- Total errors
- Total and average duration
- Total tokens used
Example:
───────────────────────────────────────────────────────────────
SERVER COMPARISON SUMMARY
───────────────────────────────────────────────────────────────
Test: Create file [100% passed]
Summary: 2/2 servers passed
┌────────────────────┬──────────┬──────────┐
│ Server/Agent       │ Status   │ Duration │
├────────────────────┼──────────┼──────────┤
│ gemini-agent       │ ✓ PASS   │ 2.34s    │
│   └─ [GOOGLE]      │          │          │
│ claude-agent       │ ✗ FAIL   │ 3.12s    │
│   └─ [ANTHROPIC]   │          │          │
└────────────────────┴──────────┴──────────┘
───────────────────────────────────────────────────────────────
DETAILED TEST RESULTS
───────────────────────────────────────────────────────────────
Test: Create file
✓ gemini-agent [GOOGLE] (2.34s)
  ✓ tool_called: Tool 'write_file' was called
  ✓ tool_param_equals: Tool called with correct parameters
  ✓ max_latency_ms: Latency: 2340ms (max: 5000ms)
      • actual: 2340
      • max: 5000
✗ claude-agent [ANTHROPIC] (3.12s)
  ✓ tool_called: Tool 'write_file' was called
  ✗ tool_param_equals: Tool called with incorrect parameters
      • expected: {"path": "test.txt", "content": "Hello"}
      • actual: {"path": "test.txt"}
  ✓ max_latency_ms: Latency: 3120ms (max: 5000ms)
      • actual: 3120
      • max: 5000
───────────────────────────────────────────────────────────────
Total: 2 | Passed: 1 | Failed: 1
───────────────────────────────────────────────────────────────
================================================================================
[Summary] Test Execution Summary
================================================================================
Total Tests: 2
Passed: 1 (50.0%)
Failed: 1 (50.0%)
Total Tool Calls: 2
Total Errors: 1
Total Duration: 5460ms (avg: 2730ms per test)
Total Tokens: 350
================================================================================
Rich visual report featuring:
Summary Dashboard
- Total/Passed/Failed test counts
- Overall success rate with color-coded statistics
Agent Performance Comparison
- Statistics by agent with visual metrics
- Success rates with percentage indicators
- Average duration and latency
- Token usage (total and average per test)
- Pass/fail counts per agent
Server Comparison Summary
- Side-by-side test results across agents
- Per-test success rates
- Execution duration comparison
- Failed server details with error messages
Detailed Test Results
- Full execution details per agent
- Individual assertion results with pass/fail status
- Performance metrics (duration, tokens, latency)
- Tool call information and parameters
The HTML report is built from modular, reusable template components. Each report type composes these building blocks differently based on context (single agent vs multi-agent, single file vs suite, etc.).
graph TD
subgraph "Main Layout"
A[report.html] --> B[summary-cards]
A --> C[comparison-matrix]
A --> D[agent-leaderboard]
A --> E[file-summary]
A --> F[session-summary]
A --> G[test-results]
A --> H[fullscreen-overlay]
A --> I[scripts]
end
subgraph "Test Results Container"
G --> J[test-group]
end
subgraph "View Selection"
J -->|"1 agent"| K[single-agent-detail]
J -->|"2+ agents"| L[multi-agent-comparison]
end
subgraph "Single Agent Components"
K --> K1[agent-assertions]
K --> K2[agent-errors]
K --> K3[agent-sequence-diagram]
K --> K4[agent-tool-calls]
K --> K5[agent-messages]
K --> K6[agent-final-output]
end
subgraph "Multi-Agent Components"
L --> L1[comparison-table]
L --> L2[tool-comparison]
L --> L3[errors-comparison]
L --> L4[sequence-comparison]
L --> L5[outputs-comparison]
end
Reports are designed hierarchically, with each level building upon the previous:
| Level | Report Type | Description | Key Components |
|---|---|---|---|
| 1 | Single Agent, Single Test | Simplest case - one agent, one test | Summary cards, single-agent-detail |
| 2 | Single Agent, Multiple Tests | Multiple independent tests, same agent | + test-overview table |
| 3 | Multiple Agents | Compare agents on same tests | + comparison-matrix, agent-leaderboard |
| 4 | Multiple Sessions | Tests grouped by session with shared context | + session-summary |
| 5 | Full Suite | Multi-agent, multi-session, multi-file | All components combined |
Generate sample reports for each level:
go run test/generate_reports.go

This creates hierarchical sample reports in generated_reports/:
- 01_single_agent_single_test.html - Level 1: One agent, one test
- 02_single_agent_multi_test.html - Level 2: One agent, multiple tests
- 03_multi_agent_single_test.html - Level 3: Multiple agents, one test (leaderboard)
- 04_multi_agent_multi_test.html - Level 4: Multiple agents, multiple tests (matrix)
- 05_single_agent_multi_session.html - Level 5: One agent, multiple sessions
- 06_multi_agent_multi_session.html - Level 6: Multiple agents, multiple sessions
- 07_single_agent_multi_file.html - Level 7: One agent, multiple files
- 08_multi_agent_multi_file.html - Level 8: Full suite (multiple agents, sessions, files)
- 09_failed_with_errors.html - Error display example
Single Agent Report - One agent running tests:
graph LR
subgraph "Single Agent Report"
A[summary-cards] --> B[test-results]
B --> C[test-group]
C --> D[single-agent-detail]
D --> D1[assertions]
D --> D2[errors]
D --> D3[sequence-diagram]
D --> D4[tool-calls]
D --> D5[messages]
D --> D6[final-output]
end
Multi-Agent Report - Multiple agents compared on same tests:
graph LR
subgraph "Multi-Agent Report"
A[summary-cards] --> B[comparison-matrix]
B --> C[agent-leaderboard]
C --> D[test-results]
D --> E[test-group]
E --> F[multi-agent-comparison]
F --> F1[comparison-table]
F --> F2[tool-comparison]
F --> F3[errors-comparison]
F --> F4[sequence-comparison]
F --> F5[outputs-comparison]
end
Multi-Session Report - Tests organized by conversation sessions:
graph LR
subgraph "Multi-Session Report"
A[summary-cards] --> B[session-summary]
B --> C[test-results]
C --> D[test-group]
D --> E[single-agent-detail]
end
Full Suite Report - Multiple test files with optional multi-agent:
graph LR
subgraph "Full Suite Report"
A[summary-cards] --> B[comparison-matrix]
B --> C[agent-leaderboard]
C --> D[file-summary]
D --> E[test-results]
E --> F[test-group]
F -->|"1 agent"| G[single-agent-detail]
F -->|"2+ agents"| H[multi-agent-comparison]
end
| Component | Purpose | Used In |
|---|---|---|
| summary-cards | Top-level stats (total/passed/failed/tokens/duration) | All reports |
| comparison-matrix | Test × Agent pass/fail matrix | Multi-agent |
| agent-leaderboard | Ranked agent performance table | Multi-agent |
| file-summary | Test file grouping with stats | Suite runs |
| session-summary | Session grouping with flow diagrams | Multi-session |
| test-results | Container for all test groups | All reports |
| test-group | Single test, decides single vs multi view | All reports |
| single-agent-detail | Detailed expandable view for one agent | Single-agent |
| multi-agent-comparison | Side-by-side comparison table | Multi-agent |
| agent-assertions | Assertion results list | Single-agent |
| agent-errors | Error messages display | Single-agent |
| agent-sequence-diagram | Mermaid execution flow diagram | Single-agent |
| agent-tool-calls | Tool calls timeline with params/results | Single-agent |
| agent-messages | Conversation history | Single-agent |
| agent-final-output | Final agent response | Single-agent |
| tool-comparison | Tool calls side-by-side | Multi-agent |
| errors-comparison | Errors side-by-side | Multi-agent |
| sequence-comparison | Diagrams side-by-side (click to fullscreen) | Multi-agent |
| outputs-comparison | Final outputs side-by-side | Multi-agent |
| fullscreen-overlay | Modal overlay for enlarged diagrams | All reports |
| scripts | Mermaid init, expand/collapse, fullscreen JS | All reports |
Generate an AI-powered executive summary of test results by adding ai_summary to your test YAML:
ai_summary:
enabled: true
judge_provider: azure-gpt # Provider name from your providers section

The analysis appears as an "AI Summary" section in HTML reports with a verdict, trade-offs analysis, notable observations, failure patterns, and actionable recommendations.
Full AI Summary Documentation
Structured test results for programmatic analysis and CI/CD integration:
{
"agent_benchmark_version": "1.0.0",
"generated_at": "2024-01-15T14:30:00Z",
"summary": {
"total": 10,
"passed": 8,
"failed": 2
},
"comparison_summary": {
"Test Name": {
"testName": "Create file",
"serverResults": {
"gemini-agent": {
"agentName": "gemini-agent",
"provider": "GOOGLE",
"passed": true,
"duration": 2340,
"errors": []
},
"claude-agent": {
"agentName": "claude-agent",
"provider": "ANTHROPIC",
"passed": false,
"duration": 3120,
"errors": ["Tool parameter mismatch"]
}
},
"totalRuns": 2,
"passedRuns": 1,
"failedRuns": 1
}
},
"detailed_results": [
{
"execution": {
"testName": "Create file",
"agentName": "gemini-agent",
"providerType": "GOOGLE",
"startTime": "2024-01-15T14:30:00Z",
"endTime": "2024-01-15T14:30:02Z",
"tokensUsed": 150,
"latencyMs": 2340,
"errors": []
},
"assertions": [
{
"type": "tool_called",
"passed": true,
"message": "Tool 'write_file' was called"
},
{
"type": "tool_param_equals",
"passed": true,
"message": "Tool 'write_file' called with correct parameters"
}
],
"passed": true
}
]
}

Key Fields
- summary - Overall test statistics
- comparison_summary - Cross-agent comparison data
- detailed_results - Full execution details with assertions
- agent_benchmark_version - Version of the tool used
- generated_at - Report generation timestamp
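The summary block makes it easy to gate a CI job on a pass-rate threshold. A hedged sketch (field names follow the sample report above; the threshold and file path are arbitrary choices for illustration):

```python
import json

def gate(report_path, min_pass_rate=0.9):
    """Return True if the JSON report's pass rate meets the threshold."""
    with open(report_path) as f:
        report = json.load(f)
    s = report["summary"]
    rate = s["passed"] / s["total"] if s["total"] else 0.0
    print(f"{s['passed']}/{s['total']} passed ({rate:.0%})")
    return rate >= min_pass_rate

# Possible CI usage (paths assumed):
#   agent-benchmark -f test.yaml -o results -reportType json
#   then fail the job when gate("results.json") returns False
```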
Documentation-friendly format ideal for README files, wikis, and technical documentation.

Key Features
- Clean, readable format for documentation
- Summary tables with comparison data
- Detailed assertion results per agent
- Easy to include in GitHub README or wiki pages
- Portable across documentation platforms
- Quick visual identification of pass/fail status
providers:
- name: gemini
type: GOOGLE
token: ${GOOGLE_API_KEY}
model: gemini-2.0-flash
servers:
- name: fs
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
- name: file-agent
provider: gemini
servers:
- name: fs
settings:
verbose: true
max_iterations: 5
variables:
filename: "test-{{randomValue length=8}}.txt"
content: "{{faker 'Lorem.paragraph'}}"
sessions:
- name: File Tests
tests:
- name: Create file
prompt: "Create a file {{filename}} with content: {{content}}"
assertions:
- type: tool_called
tool: write_file
- type: file_created
path: "/tmp/{{filename}}"
- name: Read file
prompt: "Read {{filename}}"
assertions:
- type: tool_called
tool: read_file
- type: output_contains
value: "{{content}}"

Run:
./agent-benchmark -f file-tests.yaml -o results.html -verbose

providers:
- name: claude
type: ANTHROPIC
token: ${ANTHROPIC_API_KEY}
model: claude-sonnet-4-20250514
servers:
- name: api-server
type: sse
url: https://api.example.com/mcp/events
headers:
- "Authorization: Bearer ${API_TOKEN}"
agents:
- name: api-agent
provider: claude
servers:
- name: api-server
settings:
tool_timeout: 10s
max_iterations: 8
variables:
user_id: "{{randomInt lower=1000 upper=9999}}"
email: "{{faker 'Internet.email'}}"
timestamp: "{{now format='unix'}}"
sessions:
- name: User Management
tests:
- name: Create user
prompt: |
Create a new user with:
- ID: {{user_id}}
- Email: {{email}}
- Created: {{timestamp}}
assertions:
- type: tool_called
tool: create_user
- type: tool_param_equals
tool: create_user
params:
id: "{{user_id}}"
email: "{{email}}"
- type: output_json_valid
- type: max_latency_ms
value: 5000
- name: Fetch user
prompt: "Get user {{user_id}}"
assertions:
- type: tool_called
tool: get_user
- type: output_matches_json
path: "$.data.email"
value: "{{email}}"

test:
stage: test
script:
- ./agent-benchmark -s suite.yaml -o results.json -reportType json
artifacts:
when: always
paths:
- results.json
reports:
junit: results.json
variables:
GOOGLE_API_KEY: ${GOOGLE_API_KEY}
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}

Within a session, tests share conversation history:
Session Start
├─ Test 1: "Create file"
│   └─ Messages: [user, assistant, tool_response]
├─ Test 2: "Read file"      # Has Test 1 history
│   └─ Messages: [prev..., user, assistant, tool_response]
└─ Test 3: "Delete file"    # Has Test 1 & 2 history
    └─ Messages: [prev..., user, assistant, tool_response]
1. User sends prompt
2. Agent calls LLM with tools
3. LLM responds with:
   a) Final answer → Done
   b) Tool calls → Execute tools → Back to step 2
4. Repeat until:
- Final answer received
- Max iterations reached
- Context cancelled
- Error occurred
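The loop above can be sketched in a few lines. This is an illustrative model of the described control flow, not the framework's Go implementation; `call_llm` and `run_tool` are hypothetical stand-ins.

```python
# Sketch of the agent loop: call the LLM, execute requested tools,
# feed results back, stop on a final answer or the iteration cap.

def agent_loop(prompt, call_llm, run_tool, max_iterations=5):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_iterations):
        reply = call_llm(messages)          # step 2: LLM call with tools
        messages.append(reply)
        if not reply.get("tool_calls"):     # step 3a: final answer
            return reply["content"], messages
        for call in reply["tool_calls"]:    # step 3b: execute tools
            result = run_tool(call["name"], call.get("args", {}))
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return None, messages                   # step 4: max iterations reached

# Toy stand-ins: the "LLM" requests one tool call, then answers.
def fake_llm(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "done", "tool_calls": []}
    return {"role": "assistant", "content": "",
            "tool_calls": [{"name": "write_file", "args": {"path": "t.txt"}}]}

def fake_tool(name, args):
    return f"{name} ok"

answer, history = agent_loop("Create t.txt", fake_llm, fake_tool)
print(answer)  # done
```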
The agent can detect when an LLM asks for clarification instead of taking action (e.g., "Would you like me to...", "Should I proceed..."). This feature uses LLM-based semantic classification for accurate detection across any language.
agents:
- name: autonomous-agent
provider: my-provider
clarification_detection:
enabled: true
judge_provider: azure-openai-judge # Recommend gpt-4.1 for best accuracy

For full documentation, see docs/clarification-detection.md.
Apache 2.0 License - See LICENSE file for details
Issues: https://github.com/mykhaliev/agent-benchmark/issues
Contributing:
- Fork the repository
- Create feature branch
- Submit pull request