agent-benchmark is a comprehensive testing framework for AI agents built on the Model Context Protocol (MCP). It enables systematic testing, validation, and benchmarking of AI agents across different LLM providers with robust assertion capabilities.
- Overview
- Key Features
- Installation
- Command Line Reference
- Test Generation
- Exploratory Testing
- Configuration
- Test/Suite Definition
- Assertions
- Template System
- Data Extraction
- Reports
- Usage Examples
- Best Practices
- Troubleshooting
- CI/CD Integration
- Architecture Notes
- Contributing
agent-benchmark provides a declarative YAML-based approach to testing AI agents that interact with MCP servers. It supports multiple LLM providers, various MCP server types, and comprehensive assertion mechanisms to validate agent behavior.
Test agents across different LLM providers in parallel:
- Google AI (Gemini models)
- Vertex AI (Google Cloud Gemini)
- Anthropic (Claude models)
- OpenAI (GPT models)
- Azure OpenAI
- Groq
Connect to MCP servers via:
- stdio: Run MCP servers as local processes
- SSE: Connect to remote MCP servers via Server-Sent Events
- CLI: Wrap command-line tools as MCP-like servers for testing CLI-based tools
Organize tests into sessions with shared context and message history, simulating real conversational flows.
Run multiple test files with centralized configuration, shared variables, and unified success criteria.
Validate agent behavior with 20+ assertion types covering:
- Tool usage patterns
- Output validation
- Performance metrics
- Boolean combinators (anyOf, allOf, not) for complex logic
Dynamic test generation with Handlebars-style templates supporting:
- Random data generation
- Timestamp manipulation
- Faker integration
- String manipulation
Extract data from tool results using JSONPath to pass between tests in a session.
Generate reports in multiple formats:
- Console output with color-coded results
- HTML reports with performance comparison
- JSON export
- Markdown documentation
Load domain-specific knowledge following the agentskills.io specification:
- Parse SKILL.md files with YAML frontmatter
- Progressive disclosure of reference files
- Template variable {{SKILL_DIR}} for skill paths
Install the latest version with a single command:
Linux/macOS:
curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.sh | bash

Windows (PowerShell):
irm https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install.ps1 | iex

Minimal Install (60-70% smaller download)
For slower connections or to save bandwidth, use the UPX-compressed version:
curl -fsSL https://raw.githubusercontent.com/mykhaliev/agent-benchmark/master/install-min.sh | bash

Note: The minimal version may trigger antivirus warnings on some systems, as UPX compression is sometimes flagged by security software.
Manual Installation from Pre-built Binaries
Download the appropriate file for your system from the releases page:
Regular versions (recommended):
- Linux (AMD64): agent-benchmark_vX.X.X_linux_amd64.tar.gz
- Linux (ARM64): agent-benchmark_vX.X.X_linux_arm64.tar.gz
- macOS (Intel): agent-benchmark_vX.X.X_darwin_amd64.tar.gz
- macOS (Apple Silicon): agent-benchmark_vX.X.X_darwin_arm64.tar.gz
- Windows (AMD64): agent-benchmark_vX.X.X_windows_amd64.zip
- Windows (ARM64): agent-benchmark_vX.X.X_windows_arm64.zip
UPX compressed (smaller size, not available for Windows ARM64):
- Linux (AMD64): agent-benchmark_vX.X.X_linux_amd64_upx.tar.gz
- Linux (ARM64): agent-benchmark_vX.X.X_linux_arm64_upx.tar.gz
- macOS (Intel): agent-benchmark_vX.X.X_darwin_amd64_upx.tar.gz
- macOS (Apple Silicon): agent-benchmark_vX.X.X_darwin_arm64_upx.tar.gz
- Windows (AMD64): agent-benchmark_vX.X.X_windows_amd64_upx.zip
Extract and move to your PATH:
# Linux/macOS
tar -xzf agent-benchmark_*.tar.gz
sudo mv agent-benchmark /usr/local/bin/
# Windows
# Extract the ZIP file and add the binary to your PATH

Build from Source
Requirements: Go 1.25 or higher
Linux/macOS:
# Clone the repository
git clone https://github.com/mykhaliev/agent-benchmark
cd agent-benchmark
# Build the binary
go build -o agent-benchmark
# (Optional) Move to your PATH
sudo mv agent-benchmark /usr/local/bin/

Windows (PowerShell):
# Clone the repository
git clone https://github.com/mykhaliev/agent-benchmark
cd agent-benchmark
# Build the binary
.\build.ps1
# or
go build -o agent-benchmark.exe

After installation, verify it works:
agent-benchmark -v

Get AI-powered assistance when writing test configurations in VS Code, Cursor, or other editors with Agent Skills support.
Download agent-benchmark-skills_*.zip from releases and extract:
Linux/macOS:
unzip agent-benchmark-skills_*.zip -d ~/.copilot/skills/

Windows (PowerShell):
Expand-Archive agent-benchmark-skills_*.zip -DestinationPath $env:USERPROFILE\.copilot\skills\

Once installed, your AI assistant will have domain knowledge about:
- Provider configuration (Azure, OpenAI, Anthropic, Google, Vertex AI, Groq)
- All 20+ assertion types with examples
- Template helpers (faker, randomValue, now, etc.)
- Best practices for writing reliable test configs
See skills/README.md for more details.
Run your first benchmark:
agent-benchmark -f tests.yaml -o report.html -verbose

agent-benchmark [options]
Required (one of):
-f <file> Path to test configuration file (YAML)
-s <file> Path to suite configuration file (YAML)
-g <file> Path to generator config file (enables test generation mode)
-e <file> Path to explorer config file (enables exploratory testing mode)
-generate-report <file> Generate HTML report from existing JSON results file
(reads test_file from JSON to load AI summary config)
Generator options (require -g):
--dry-run Preview generated YAML without saving
--output-dir <dir> Directory for generated test files (default: ./generated_tests)
--seed <int> Random seed for deterministic generation
Explorer options (require -e):
(none currently; all settings live in the explorer: YAML block)
Optional:
-o <file> Output report path/filename without extension
Default: <test_dir>/test_results/report
The test_results folder is auto-created and git-ignored
-l <file> Log file path (default: stdout)
-reportType <types> Report format(s): html, json, md (default: html)
Multiple formats supported as comma-separated values
Examples: -reportType html
-reportType html,json
-reportType html,json,md
-verbose Enable verbose logging
-v Show version and exit

Examples:
# Run single test file with verbose output
# Reports saved to: examples/test_results/report.html
./agent-benchmark -f examples/tests.yaml -verbose
# Run test suite with JSON report (custom output path)
./agent-benchmark -s suite.yaml -o ./my-reports/results -reportType json
# Run with custom log file
./agent-benchmark -f tests.yaml -l test-run.log
# Generate Markdown report
./agent-benchmark -f tests.yaml -o report -reportType md
# Generate HTML report from existing JSON results (fast iteration)
# Reads test_file from JSON to load AI summary configuration
./agent-benchmark -generate-report results.json -o new-report
# Generate both JSON and HTML reports (for later regeneration)
./agent-benchmark -f tests.yaml -o results -reportType json,html

Use the -g flag to automatically generate a ready-to-run test suite from a generator config.
The generator connects to your MCP servers, discovers tool schemas, and uses an LLM to produce
test sessions with typed assertions; no test authoring required.
# Preview generated YAML without saving anything
./agent-benchmark -g examples/generator-config.yaml --dry-run
# Generate and save to a timestamped directory under ./generated_tests/
./agent-benchmark -g examples/generator-config.yaml
# Custom output directory and deterministic seed
./agent-benchmark -g gen.yaml --output-dir ./tests --seed 42

The generator runs three sequential phases:
Phase 1: Plan
A focused LLM call produces a compact JSON test plan: session names, test names, expected tools,
and high-level assertion ideas. The plan is validated against the actual tool list before moving
on. If validation fails, the plan is regenerated (up to max_retries times).
Phase 2: Intent
For each test in the plan, a separate LLM call produces a TestIntent: a flat JSON object with
the prompt, typed assertion checks, and optional JSONPath extractors. The intent is validated
(correct assertion types, real tool names, no forward variable references). If validation fails,
one automatic repair attempt is made; if that also fails, the generator retries the full intent
generation. Hard failure only if all max_retries are exhausted.
Phase 3: Build
The validated intents are assembled deterministically into model.Session structs and
serialized to YAML. No further LLM calls are made in this phase.
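For illustration, a validated Phase 2 intent might look roughly like this. The field names below are assumptions for the sake of the example, not the framework's actual TestIntent schema:

```json
{
  "test_name": "create and verify a file",
  "prompt": "Create report.txt containing the word done, then read it back",
  "checks": [
    { "type": "tool_called", "tool": "write_file" },
    { "type": "output_contains", "value": "done" }
  ],
  "extractors": [
    { "name": "created_path", "tool": "write_file", "path": "$.path" }
  ]
}
```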
Each run writes to a timestamped subdirectory under --output-dir (default ./generated_tests):
generated_tests/
└── generated_20260301_120000/
    ├── suite.yaml            ← run this with: agent-benchmark -s
    ├── file-operations.yaml
    └── error-handling.yaml
suite.yaml references every session file and is pre-populated with the original providers,
servers, agents, and variables from the generator config, so the output is immediately runnable:
./agent-benchmark -s generated_tests/generated_20260301_120000/suite.yaml

The generator: block controls generation behaviour:
| Field | Description | Default |
|---|---|---|
| agent | Agent whose LLM is used for generation | first agent |
| test_count | Number of tests to generate across all sessions | 5 |
| complexity | simple \| medium \| complex (see below) | medium |
| include_edge_cases | Include error/boundary condition tests | false |
| max_steps_per_test | Max tool-call steps expected per test | 5 |
| max_retries | Max LLM attempts per phase before giving up | 3 |
| max_tokens | Stop if cumulative LLM tokens exceed this limit (0 = unlimited) | 0 |
| max_iterations | Max LLM conversation turns per generation call | engine default |
| plan_chunk_size | Max tests per plan chunk (0 = use default of 5) | 5 |
| plan_chunk_max_tokens | Max output tokens per plan chunk LLM call (0 = auto) | auto |
| tools | Allowlist of tool names to test (empty = all tools) | all tools |
| goal | Extra instruction injected into the generation prompt | - |
Complexity levels:
| Level | Behaviour |
|---|---|
| simple | One tool call per test; straightforward prompts |
| medium | One to three tool calls; may chain tool results |
| complex | Multi-step workflows; may use anyOf/allOf assertion combinators |
See examples/generator-config.yaml for a fully annotated example.
Use the -e flag to run an autonomous exploration session. The explorer LLM
iteratively decides what test to run next, executes it against the configured
agent, observes the result, and plans the next iteration, all without
predefined test cases.
# Run an exploration session with default HTML report
./agent-benchmark -e examples/explorer-config.yaml
# With verbose logging and custom report path
./agent-benchmark -e explorer.yaml -o ./reports/explore -reportType html,json -verbose

The explorer: block in your config controls the behaviour:
| Field | Description | Default |
|---|---|---|
| goal | What the explorer is trying to test (required) | - |
| max_iterations | Maximum number of test iterations | 10 |
| stop_on_pass_count | Stop after N consecutive passes (0 = run all iterations) | 0 |
| max_retries | LLM retry attempts per iteration if parsing fails | 3 |
| max_tokens | Stop the exploration loop if cumulative tokens exceed this limit (0 = unlimited) | 0 |
| agent | Agent name reference; must match a name in the top-level agents list. Its provider and servers are used for both exploration decisions and test execution. Defaults to the first agent when omitted. | - |
How exploration results appear in reports:
Results are fed into the standard report pipeline; no new report format is needed. Exploration metadata is encoded into existing report fields:
| Metadata | Where it renders |
|---|---|
| Exploration: <goal> | Suite header |
| Exploration Goal: <goal> | Session group header |
| [Iter NN \| prompt-NNN] <test name> | Test group title |
| Explorer LLM reasoning + decision prompt | Conversation history (system message) |
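A minimal explorer config might look like the sketch below. Provider, server, and agent values are placeholders; only the explorer: fields come from the table above, and examples/explorer-config.yaml remains the authoritative reference.

```yaml
# Hypothetical minimal explorer config
providers:
  - name: gemini
    type: GOOGLE
    token: "{{GOOGLE_API_KEY}}"
    model: gemini-2.0-flash
servers:
  - name: filesystem
    type: stdio
    command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
  - name: explore-agent
    provider: gemini
    servers:
      - name: filesystem
explorer:
  goal: "Probe file-creation edge cases (long names, unicode, empty content)"
  max_iterations: 15
  stop_on_pass_count: 3
  max_tokens: 200000
  agent: explore-agent
```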
See examples/explorer-config.yaml for a fully annotated example.
Configuration files use YAML format with six main sections:
providers: # LLM provider configurations
servers: # MCP server definitions
agents: # Agent configurations
sessions: # Test sessions
settings: # Global test settings
variables: # Reusable variables

The framework supports running multiple test files through a suite configuration:
name: "Complete Test Suite"
test_files:
- tests/basic-operations.yaml
- tests/advanced-features.yaml
- tests/edge-cases.yaml
providers:
- name: gemini
type: GOOGLE
token: "{{GOOGLE_API_KEY}}"
model: gemini-2.0-flash
servers:
- name: filesystem
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
- name: test-agent
provider: gemini
servers:
- name: filesystem
settings:
verbose: true
max_iterations: 10
tool_timeout: 30s
test_delay: 2s
variables:
base_path: "/tmp/tests"
timestamp: "{{now format='unix'}}"
criteria:
success_rate: "0.8" # 80% of tests must pass

Suite Configuration Benefits:
- Centralized provider and server definitions
- Shared variables across all test files
- Unified success criteria
- Single command execution for multiple test files
Provider Types:
- GOOGLE - Google AI (Gemini)
- VERTEX - Vertex AI (Google Cloud Gemini)
- ANTHROPIC - Anthropic (Claude)
- OPENAI - OpenAI (GPT)
- AZURE - Azure OpenAI
- GROQ - Groq
Define LLM providers for your agents:
providers:
- name: gemini-flash
type: GOOGLE
token: {{GOOGLE_API_KEY}}
model: gemini-2.0-flash
- name: claude-sonnet
type: ANTHROPIC
token: {{ANTHROPIC_API_KEY}}
model: claude-sonnet-4-20250514
- name: gpt-4
type: OPENAI
token: {{OPENAI_API_KEY}}
model: gpt-4o-mini
baseUrl: https://api.openai.com/v1 # Optional
- name: azure-gpt
type: AZURE
token: {{AZURE_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
- name: azure-entra
type: AZURE
auth_type: entra_id # Use Microsoft Entra ID authentication (passwordless)
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
- name: vertex-ai
type: VERTEX
project_id: "your-gcp-project-id"
location: "us-central1"
credentials_path: "/path/to/service-account.json"
model: gemini-2.0-flash
- name: groq
type: GROQ
token: {{GROQ_API_KEY}}
model: openai/gpt-oss-120b
baseUrl: https://api.groq.com/openai/v1 # Optional

The AZURE provider supports two authentication methods:
API Key Authentication (default):
providers:
- name: azure-apikey
type: AZURE
auth_type: api_key # Optional, this is the default
token: {{AZURE_OPENAI_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview

Microsoft Entra ID Authentication (passwordless):
providers:
- name: azure-entra
type: AZURE
auth_type: entra_id # Uses DefaultAzureCredential
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
# No token required - uses Azure credentials from environment

Entra ID authentication uses Azure's DefaultAzureCredential, which automatically tries multiple authentication methods in order:
- Environment variables: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET
- Workload Identity (for Kubernetes)
- Managed Identity (when running in Azure)
- Azure CLI (az login)
- Azure Developer CLI (azd auth login)
- Azure PowerShell (Connect-AzAccount)
Required RBAC Role:
Your identity must have the "Cognitive Services OpenAI User" role (or higher) assigned on the Azure OpenAI resource. Without this role, you will receive a 401 Unauthorized error.
To assign the role using Azure CLI:
# Get your Azure OpenAI resource ID
az cognitiveservices account show \
--name <your-openai-resource-name> \
--resource-group <your-resource-group> \
--query id -o tsv
# Assign the required role
az role assignment create \
--assignee <your-email-or-principal-id> \
--role "Cognitive Services OpenAI User" \
--scope <resource-id-from-above>

Note: Role assignments can take up to 5-10 minutes to propagate.
For more information, see Azure Identity authentication and Azure OpenAI RBAC roles.
Providers can be configured with rate limits to proactively throttle requests and avoid exceeding API quotas:
providers:
- name: azure-gpt
type: AZURE
token: {{AZURE_API_KEY}}
model: gpt-4
baseUrl: https://your-resource.openai.azure.com
version: 2024-02-15-preview
rate_limits:
tpm: 30000 # Tokens per minute limit (proactive throttling)
rpm: 60 # Requests per minute limit (proactive throttling)
retry:
retry_on_429: true # Enable retry on 429 errors (default: false)
max_retries: 3 # Max retry attempts (default: 3 when enabled)

Configuration Options:
| Option | Description | Default |
|---|---|---|
| rate_limits.tpm | Maximum tokens per minute | No limit |
| rate_limits.rpm | Maximum requests per minute | No limit |
| retry.retry_on_429 | Enable automatic retry on 429 errors | false |
| retry.max_retries | Number of retry attempts | 3 (when enabled) |
How it works:
- Uses token bucket algorithm to proactively throttle requests before sending
- Estimates tokens using tiktoken for OpenAI models
- Falls back to cl100k_base encoding for non-OpenAI models (Claude, Gemini, Llama, etc.)
- Runtime calibration adjusts estimates based on actual API responses
- 429 retry handling provides a safety net when estimates fall short
Best Practice: Enable both rate_limits (proactive) and retry_on_429 (reactive) for defense in depth.
Important: Rate limiting is best-effort, not guaranteed. Token estimation varies by provider. For detailed technical information, see docs/rate-limiting.md.
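The proactive throttling described above is the classic token-bucket algorithm. The following is an illustrative sketch of that general technique, not agent-benchmark's actual implementation: a bucket for a tpm limit refills at tpm/60 tokens per second, and each request blocks until its estimated cost is available.

```python
import time

class TokenBucket:
    """Illustrative token-bucket limiter; a sketch of the general
    algorithm, not agent-benchmark's implementation."""

    def __init__(self, per_minute: int):
        self.capacity = per_minute
        self.tokens = float(per_minute)       # start with a full bucket
        self.refill_rate = per_minute / 60.0  # tokens regained per second
        self.last = time.monotonic()

    def _refill(self):
        # Credit tokens for elapsed time, capped at bucket capacity
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now

    def acquire(self, cost: int) -> float:
        """Block until `cost` tokens are available; return seconds slept."""
        slept = 0.0
        while True:
            self._refill()
            if self.tokens >= cost:
                self.tokens -= cost
                return slept
            wait = (cost - self.tokens) / self.refill_rate
            time.sleep(wait)
            slept += wait

# With tpm: 30000, a 500-token request passes immediately while budget remains.
tpm_bucket = TokenBucket(per_minute=30000)
assert tpm_bucket.acquire(500) == 0.0
```

A real implementation layers the reactive retry_on_429 handling on top of this, since token estimates (tiktoken, or the 4-characters-per-token fallback) can undershoot.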
Configure MCP servers that agents will interact with:
servers:
- name: filesystem-server
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp

servers:
- name: remote-api
type: sse
url: https://api.example.com/mcp/events
headers:
- "Authorization: Bearer {{API_TOKEN}}"
- "X-Custom-Header: value"

Server Types:
- stdio - Standard Input/Output communication
- sse - Server-Sent Events over HTTP
- cli - CLI tool wrapper (see CLI Server below)
Wrap command-line tools as MCP-like servers. Useful for testing CLI-based tools:
servers:
- name: excel-cli
type: cli
command: excel-cli
shell: powershell # Shell: powershell, pwsh, cmd, bash, sh, zsh
working_dir: "{{TEST_DIR}}" # Working directory for CLI commands
tool_prefix: excel # Tool name becomes excel_execute
help_commands: # Help content for LLM context
- "excel-cli --help"

| Option | Description | Default |
|---|---|---|
| command | CLI executable to wrap (required) | - |
| shell | Shell to run commands in | powershell (Windows), bash (Unix) |
| working_dir | Working directory for commands | Current directory |
| tool_prefix | Prefix for generated tool name | cli (tool name: cli_execute) |
| help_commands | Commands to run at startup for CLI help | - |
Key Features:
- Auto-discovery: Automatically discovers subcommands from the COMMANDS: section in help output
- Help content injection: CLI help is included in the tool description for LLM context
- CLI-specific assertions: cli_exit_code_equals, cli_stdout_contains, cli_stdout_regex, cli_stderr_contains
Full CLI Server Documentation - Complete guide with examples, best practices, and troubleshooting
Control server initialization and process delays:
servers:
- name: slow-server
type: stdio
command: python server.py
server_delay: 45s # Wait up to 45s for initialization
process_delay: 1s # Wait 1s after process starts

Delay Parameters:
- server_delay - Maximum time to wait for server initialization (default: 30s)
- process_delay - Delay after starting process before initialization (default: 300ms)
servers:
- name: authenticated-api
type: sse
url: https://api.example.com/mcp/sse
headers:
- "Authorization: Bearer {{API_TOKEN}}"
- "X-API-Version: 2024-01"
- "X-Client-ID: agent-benchmark"

Define agents that combine providers with MCP servers:
agents:
- name: research-agent
provider: gemini-flash
system_prompt: |
You are an autonomous research agent.
Execute tasks directly without asking for clarification.
Use available tools to complete the requested tasks.
servers:
- name: filesystem-server
allowedTools: # Optional: restrict tool access
- read_file
- list_directory
- name: remote-api
- name: coding-agent
provider: claude-sonnet
servers:
- name: filesystem-server # No tool restrictions

Agent Configuration:
- name - Unique agent identifier
- provider - Reference to provider name
- skill - Optional Agent Skill to load (see Agent Skills section)
- system_prompt - Optional system prompt prepended to all conversations (supports templates)
- servers - List of MCP servers
- allowedTools - Optional tool whitelist per server
System Prompt Templates:
The system_prompt field supports template variables for dynamic context:
- {{AGENT_NAME}} - Current agent name
- {{SESSION_NAME}} - Current session name
- {{PROVIDER_NAME}} - Provider name being used
Example:
agents:
- name: test-agent
provider: gemini-flash
system_prompt: |
You are {{AGENT_NAME}} using {{PROVIDER_NAME}}.
Currently running session: {{SESSION_NAME}}.
Execute all tasks autonomously.

Organize tests into sessions with shared conversational context:
sessions:
- name: File Operations
tests:
- name: Create a file
prompt: "Create a file called {{filename}} with content: Hello World"
assertions:
- type: tool_called
tool: write_file
- name: Read the file
prompt: "Read the file {{filename}}"
assertions:
- type: tool_called
tool: read_file
- type: output_contains
value: "Hello World"

Session Features:
- Tests within a session share message history
- Variables persist across tests in a session
- Simulates multi-turn conversations
Agent Skills provide domain-specific knowledge to agents following the agentskills.io specification. Skills are loaded from a directory containing a SKILL.md file, and their content is injected into the agent's system prompt.
agents:
- name: skilled-agent
provider: azure-openai
skill:
path: "./skills/my-skill" # Path to skill directory
system_prompt: |
Additional instructions here...

If the skill has a references/ directory, built-in tools (list_skill_references, read_skill_reference) are automatically added for on-demand access.
For full documentation, see docs/agent-skills.md.
Global configuration for test execution:
settings:
verbose: true # Enable detailed logging
max_iterations: 10 # Maximum agent reasoning loops
timeout: 30s # Tool execution timeout (legacy, use tool_timeout)
tool_timeout: 30s # Tool execution timeout
test_delay: 2s # Delay between tests
session_delay: 30s # Delay between sessions (for COM cleanup, resource release)
variable_policy: suite-only # How suite- and test-level variables are combined (test-only, suite-only, merge-test-priority, merge-suite-priority)

When running tests as part of a test suite, variables can be defined at both the suite level and the test level.
The variable_policy setting controls how these variables are resolved.
Available Policies
| Policy | Description |
|---|---|
| suite-only (default) | Only suite-level variables are used. Test-level variables are ignored. |
| test-only | Only test-level variables are used. Suite-level variables are ignored. |
| merge-test-priority | Suite and test variables are merged. Test variables override suite variables on key conflicts. |
| merge-suite-priority | Suite and test variables are merged. Suite variables override test variables on key conflicts. |
If variable_policy is not set or has an unknown value, it defaults to suite-only.
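The four policies reduce to simple map-merge semantics. A Python sketch of the behaviour as described in the table above (illustrative, not the framework's code):

```python
def resolve_variables(policy: str, suite_vars: dict, test_vars: dict) -> dict:
    """Sketch of the four variable_policy behaviours."""
    if policy == "test-only":
        return dict(test_vars)
    if policy == "merge-test-priority":
        return {**suite_vars, **test_vars}   # test wins on key conflicts
    if policy == "merge-suite-priority":
        return {**test_vars, **suite_vars}   # suite wins on key conflicts
    return dict(suite_vars)                  # suite-only, also the default

suite_vars = {"base_path": "/srv/tests", "timeout": "30s"}
test_vars = {"base_path": "/tmp/tests", "filename": "a.txt"}

print(resolve_variables("merge-test-priority", suite_vars, test_vars))
# {'base_path': '/tmp/tests', 'timeout': '30s', 'filename': 'a.txt'}
```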
Define reusable variables with template support:
variables:
filename: "test-{{randomValue type='ALPHANUMERIC' length=8}}.txt"
timestamp: "{{now format='unix'}}"
user_id: "{{randomInt lower=1000 upper=9999}}"
email: "{{faker 'Internet.email'}}"

Variables can:
- Use template helpers
- Reference environment variables
Delay individual test execution:
tests:
- name: Rate-limited API call
prompt: "Make API request"
start_delay: 5s # Wait 5 seconds before starting
assertions:
- type: tool_called
tool: api_request

Pause between all tests:
settings:
test_delay: 2s # 2 second pause after each test

Use Cases:
- Respect API rate limits
- Allow system state to settle
- Prevent resource exhaustion
Pause between sessions to allow resource cleanup:
settings:
session_delay: 30s # 30 second pause between sessions

Use Cases:
- Allow external applications and resources to fully release between sessions
- Prevent resource contention when tests interact with stateful applications
- Avoid lingering processes from previous sessions affecting new sessions
- Give MCP servers time to cleanly shut down between sessions
Define minimum success rate for test suites:
criteria:
success_rate: 0.75 # 75% pass rate required

Exit Code Behavior:
| Scenario | Exit Code |
|---|---|
| All tests pass / Success rate met | 0 |
| Some tests fail / Success rate not met | 1 |
Reference environment variables in configuration:
providers:
- name: claude
type: ANTHROPIC
token: "{{ANTHROPIC_API_KEY}}"
model: claude-sonnet-4-20250514
servers:
- name: api-server
type: sse
url: "{{API_BASE_URL}}"
headers:
- "Authorization: Bearer {{API_TOKEN}}"
variables:
workspace: "{{WORKSPACE_PATH}}"

Convention:
- Use {{VAR_NAME}} syntax
- Set before running tests
- Common for tokens, URLs, paths
export ANTHROPIC_API_KEY="sk-ant-..."
export API_BASE_URL="https://api.example.com"
export WORKSPACE_PATH="/tmp/workspace"
./agent-benchmark -f tests.yaml

The framework provides built-in variables that are automatically available in template contexts. Variables are divided into two categories based on when they become available:
| Category | Available In | Description |
|---|---|---|
| Static | Everywhere (providers, servers, variables, prompts, assertions) | Available at configuration load time |
| Runtime | Prompts, assertions, system prompts | Available during test execution only |
These variables can be used in server commands, provider configs, user variables, prompts, and assertions:
| Variable | Description |
|---|---|
| {{TEST_DIR}} | Absolute path to the directory containing the test YAML file |
| {{TEMP_DIR}} | System temporary directory (cross-platform: %TEMP% on Windows, /tmp on Linux/macOS) |
| {{RUN_ID}} | Unique UUID v4 for this test run (e.g., 550e8400-e29b-41d4-a716-446655440000) |
| {{ANY_ENV_VAR}} | Any environment variable (e.g., {{HOME}}, {{AZURE_OPENAI_ENDPOINT}}) |
| User-defined variables | Variables defined in the variables: section of your config |
These variables are only available in prompts, assertions, and system prompts, not in server commands or provider configs:
| Variable | Description |
|---|---|
| {{AGENT_NAME}} | Current agent name |
| {{SESSION_NAME}} | Current session name |
| {{PROVIDER_NAME}} | Provider name being used |
Using TEST_DIR for Portable Paths:
{{TEST_DIR}} enables test configurations that work regardless of where the repository is cloned:
variables:
# Paths relative to the test file location
data_dir: "{{TEST_DIR}}/test-data"
output_dir: "{{TEST_DIR}}/../TestResults"
mcp_server: "{{TEST_DIR}}/bin/my-server.exe"
servers:
- name: filesystem
type: stdio
command: npx @modelcontextprotocol/server-filesystem {{output_dir}}
- name: custom-server
type: stdio
command: "{{mcp_server}}"
sessions:
- name: File Tests
tests:
- name: Process test data
prompt: "Read files from {{data_dir}} and save results to {{output_dir}}"

agent-benchmark provides 20+ assertion types to validate agent behavior:
Verify agent only uses available tools:
assertions:
- type: no_hallucinated_tools

Verify a specific tool was invoked:
assertions:
- type: tool_called
tool: create_file

Ensure a tool was NOT invoked:
assertions:
- type: tool_not_called
tool: delete_database

Validate the exact number of tool calls. The tool name is optional; if it is not specified, the total number of tool calls is verified:
assertions:
- type: tool_call_count
tool: search_api
count: 3

Verify tools were called in a specific sequence:
assertions:
- type: tool_call_order
sequence:
- validate_input
- process_data
- save_results

Check tool parameters match exactly:
assertions:
- type: tool_param_equals
tool: create_user
params:
name: "John Doe"
age: 30
email: "john@example.com"
settings.theme: "dark" # Nested parameter with dot notation

Nested Parameter Validation:
Use dot notation for nested parameters:
assertions:
- type: tool_param_equals
tool: create_resource
params:
name: "test-resource"
config.timeout: "30"
config.retry.max_attempts: "3"
config.retry.backoff: "exponential"
metadata.tags.environment: "production"

Dot Notation Rules:
- Navigate nested maps with dots
- Validate deeply nested values
- Compare exact matches at any depth
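Conceptually, dot notation is a plain nested-map walk: split the key on dots and descend one level per segment. A Python sketch of these semantics (illustrative, not the framework's code):

```python
def lookup(params: dict, dotted_key: str):
    """Resolve a dot-notation key like 'config.retry.max_attempts'
    against nested tool-call parameters; None if any segment is missing."""
    node = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

call_params = {
    "name": "test-resource",
    "config": {"retry": {"max_attempts": "3", "backoff": "exponential"}},
}
assert lookup(call_params, "config.retry.max_attempts") == "3"
assert lookup(call_params, "config.missing") is None
```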
Validate parameters with regex patterns:
assertions:
- type: tool_param_matches_regex
tool: send_email
params:
recipient: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

Validate tool results using JSONPath:
assertions:
- type: tool_result_matches_json
tool: get_user
path: "$.data.user.name"
value: "John Doe"

Check if output contains specific text:
assertions:
- type: output_contains
value: "Operation completed successfully"

Ensure output doesn't contain specific text:
assertions:
- type: output_not_contains
value: "error"

Validate output with regex pattern:
assertions:
- type: output_regex
pattern: "^User ID: [0-9]{4,}$"

Limit approximate token usage:
assertions:
- type: max_tokens
value: 1000

Token Estimation:
Token usage for OpenAI, Google, and Anthropic models is taken from GenerationInfo. For other models, the formula is:
tokens = output_length / 4
This approximation:
- Provides rough token counts
- Useful for max_tokens assertions
- Not exact (varies by tokenizer)
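The fallback heuristic is easy to reproduce when sizing a max_tokens budget; a sketch:

```python
def estimate_tokens(output: str) -> int:
    # Fallback heuristic from above: tokens = output_length / 4
    return len(output) // 4

# A 4000-character response estimates to exactly 1000 tokens,
# i.e. right at a max_tokens: 1000 assertion boundary.
assert estimate_tokens("x" * 4000) == 1000
```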
Ensure execution completes within time limit:
assertions:
- type: max_latency_ms
value: 5000 # 5 seconds

Verify execution completed without errors:
assertions:
- type: no_error_messages

Verify the test did not encounter any HTTP 429 rate limit errors:
assertions:
- type: no_rate_limit_errors

This assertion checks if the provider returned any 429 errors during execution. It's useful for:
- Ensuring tests stay within API quotas
- Validating that rate limit configuration is adequate
- Detecting when throttling is needed
Verify the agent executed tasks directly without asking for clarification. Requires clarification_detection to be enabled on the agent:
assertions:
- type: no_clarification_questions

Boolean combinators allow you to create complex assertion logic using JSON Schema-style operators. These are useful when LLMs may achieve the same outcome through different approaches.
Pass if ANY child assertion passes (OR logic):
assertions:
# Pass if the LLM used keyboard_control OR ui_automation
- anyOf:
- type: tool_called
tool: keyboard_control
- type: tool_called
tool: ui_automation

Pass if ALL child assertions pass (AND logic):
assertions:
# Pass if both conditions are met
- allOf:
- type: tool_called
tool: create_file
- type: output_contains
value: "File created successfully"

Pass if the child assertion FAILS (negation):
assertions:
# Pass if output does NOT contain "error" (equivalent to output_not_contains)
- not:
type: output_contains
value: "error"

Combinators can be nested for complex logic:
assertions:
# Pass if: (keyboard OR ui_automation) AND no errors
- allOf:
- anyOf:
- type: tool_called
tool: keyboard_control
- type: tool_called
tool: ui_automation
- type: no_error_messages
# Pass if NOT (error in output AND failed tool)
- not:
allOf:
- type: output_contains
value: "error"
- type: tool_not_called
tool: success_handler

Use Cases:
- Testing LLMs that may use different tools to achieve the same goal
- Validating that at least one of several acceptable outcomes occurred
- Creating exclusion rules (must NOT match a pattern)
- Complex conditional validation logic
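Conceptually, combinators evaluate recursively over the assertion tree. A Python sketch of these semantics, with only two leaf assertion types stubbed in for illustration (not the framework's code):

```python
def evaluate(assertion: dict, context: dict) -> bool:
    """Recursively evaluate anyOf/allOf/not combinators over leaf assertions."""
    if "anyOf" in assertion:
        return any(evaluate(a, context) for a in assertion["anyOf"])
    if "allOf" in assertion:
        return all(evaluate(a, context) for a in assertion["allOf"])
    if "not" in assertion:
        return not evaluate(assertion["not"], context)
    # Leaf assertions: only two types handled in this sketch
    if assertion["type"] == "tool_called":
        return assertion["tool"] in context["tools_called"]
    if assertion["type"] == "output_contains":
        return assertion["value"] in context["output"]
    raise ValueError(f"unknown assertion type: {assertion.get('type')}")

# (keyboard_control OR ui_automation) AND no "error" in output
ctx = {"tools_called": ["ui_automation"], "output": "done"}
combo = {"allOf": [
    {"anyOf": [
        {"type": "tool_called", "tool": "keyboard_control"},
        {"type": "tool_called", "tool": "ui_automation"},
    ]},
    {"not": {"type": "output_contains", "value": "error"}},
]}
assert evaluate(combo, ctx) is True
```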
agent-benchmark includes a powerful template engine based on Handlebars with custom helpers:
Generate random strings:
# Alphanumeric (default)
{{randomValue length=10}}
# Output: aB3xY9kL2m
# Alphabetic only
{{randomValue type='ALPHABETIC' length=8}}
# Output: AbCdEfGh
# Numeric only
{{randomValue type='NUMERIC' length=6}}
# Output: 123456
# Hexadecimal
{{randomValue type='HEXADECIMAL' length=8}}
# Output: 1a2b3c4d
# Alphanumeric with symbols
{{randomValue type='ALPHANUMERIC_AND_SYMBOLS' length=12}}
# Output: aB3@xY9!kL2#
# UUID
{{randomValue type='UUID'}}
# Output: 550e8400-e29b-41d4-a716-446655440000
# Uppercase
{{randomValue type='ALPHABETIC' length=8 uppercase=true}}
# Output: ABCDEFGH

Types:
- ALPHANUMERIC (default) - Letters and numbers
- ALPHABETIC - Letters only
- NUMERIC - Numbers only
- HEXADECIMAL - Hex characters (0-9, a-f)
- ALPHANUMERIC_AND_SYMBOLS - Letters, numbers, and symbols
- UUID - UUID v4
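The behaviour of these types can be approximated in Python; this is a sketch, not the framework's generator, and the symbol pool for ALPHANUMERIC_AND_SYMBOLS is an assumption.

```python
import secrets
import string
import uuid

# Character pools approximating the randomValue types; the symbol set
# below is an assumption, not the framework's exact pool.
POOLS = {
    "ALPHANUMERIC": string.ascii_letters + string.digits,
    "ALPHABETIC": string.ascii_letters,
    "NUMERIC": string.digits,
    "HEXADECIMAL": "0123456789abcdef",
    "ALPHANUMERIC_AND_SYMBOLS": string.ascii_letters + string.digits + "!@#$%&*",
}

def random_value(kind="ALPHANUMERIC", length=10, uppercase=False):
    """Mimic {{randomValue}}: a random string of the given type and length."""
    if kind == "UUID":
        return str(uuid.uuid4())
    value = "".join(secrets.choice(POOLS[kind]) for _ in range(length))
    return value.upper() if uppercase else value

print(random_value("NUMERIC", length=6))
print(random_value("ALPHABETIC", length=8, uppercase=True))
```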
Generate random integers:
# Random int between 0 and 100 (default)
{{randomInt}}
# Custom range
{{randomInt lower=1000 upper=9999}}
# Output: 5847
# Negative range
{{randomInt lower=-100 upper=100}}

Generate random decimal numbers:
# Random decimal between 0.00 and 100.00 (default)
{{randomDecimal}}
# Custom range
{{randomDecimal lower=10.5 upper=99.9}}
# Output: 45.73

Generate timestamps with formatting and offsets:
# Current ISO8601 timestamp (default)
{{now}}
# Output: 2024-01-15T14:30:00Z
# Unix epoch (milliseconds)
{{now format='epoch'}}
# Output: 1705329000000
# Unix timestamp (seconds)
{{now format='unix'}}
# Output: 1705329000
# Custom format (Java SimpleDateFormat style)
{{now format='yyyy-MM-dd HH:mm:ss'}}
# Output: 2024-01-15 14:30:00
# With timezone
{{now timezone='America/New_York'}}
# With offset
{{now offset='3 days'}}
{{now offset='-24 hours'}}
{{now offset='1 years'}}
# Combined
{{now format='yyyy-MM-dd' offset='7 days' timezone='UTC'}}

Offset Units: seconds/second, minutes/minute, hours/hour, days/day, weeks/week, months/month, years/year
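The offset and format handling can be modelled roughly as below. This is a simplified sketch: months and years are omitted because the real helper may apply calendar-aware arithmetic for them.

```python
from datetime import datetime, timedelta, timezone

# Seconds per offset unit (months/years intentionally omitted in this sketch).
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600,
                "day": 86400, "week": 604800}

def now_with_offset(offset=None, fmt="iso"):
    """Mimic {{now}}: current UTC time with an optional offset and format."""
    t = datetime.now(timezone.utc)
    if offset:
        amount, unit = offset.split()
        t += timedelta(seconds=int(amount) * UNIT_SECONDS[unit.rstrip("s")])
    if fmt == "epoch":
        return int(t.timestamp() * 1000)  # milliseconds
    if fmt == "unix":
        return int(t.timestamp())         # seconds
    return t.strftime("%Y-%m-%dT%H:%M:%SZ")

print(now_with_offset("-24 hours"))
print(now_with_offset("7 days", fmt="unix"))
```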
Generate realistic fake data:
# Names
{{faker 'Name.first_name'}} # John
{{faker 'Name.last_name'}} # Smith
{{faker 'Name.full_name'}} # John Smith
{{faker 'Name.prefix'}} # Mr.
{{faker 'Name.suffix'}} # Jr.
# Addresses
{{faker 'Address.street'}} # 123 Main St
{{faker 'Address.city'}} # New York
{{faker 'Address.state'}} # California
{{faker 'Address.state_abbrev'}} # CA
{{faker 'Address.country'}} # United States
{{faker 'Address.postcode'}} # 12345
# Phone
{{faker 'Phone.number'}} # 555-1234
{{faker 'Phone.number_formatted'}} # (555) 123-4567
# Internet
{{faker 'Internet.email'}} # john@example.com
{{faker 'Internet.username'}} # john_doe_123
{{faker 'Internet.url'}} # https://example.com
{{faker 'Internet.ipv4'}} # 192.168.1.1
{{faker 'Internet.ipv6'}} # 2001:0db8:85a3::8a2e:0370:7334
{{faker 'Internet.mac'}} # 00:1B:44:11:3A:B7
# Company
{{faker 'Company.name'}} # Tech Corp
{{faker 'Company.suffix'}} # Inc.
{{faker 'Company.profession'}} # Software Engineer
# Lorem
{{faker 'Lorem.word'}} # ipsum
{{faker 'Lorem.sentence'}} # Lorem ipsum dolor sit amet
{{faker 'Lorem.paragraph'}} # Full paragraph text
# Finance
{{faker 'Finance.credit_card'}} # 4532-1234-5678-9010
{{faker 'Finance.currency'}} # USD
# Misc
{{faker 'Misc.uuid'}} # 550e8400-e29b-41d4-a716-446655440000
{{faker 'Misc.boolean'}} # true/false
{{faker 'Misc.date'}} # 2024-01-15
{{faker 'Misc.time'}} # 14:30:00
{{faker 'Misc.timestamp'}} # 1705329000
{{faker 'Misc.digit'}} # 7

Remove substrings:
{{cut "Hello World" "World"}}
# Output: Hello
{{cut filename ".txt"}}

Replace substrings:
{{replace "Hello World" "World" "Universe"}}
# Output: Hello Universe
{{replace email "@example.com" "@test.com"}}

Extract substrings:
{{substring "Hello World" start=0 end=5}}
# Output: Hello
{{substring text start=6}}
# Output: Rest of string from position 6

Extract data from tool results to use in subsequent tests:
sessions:
- name: User Workflow
tests:
- name: Create user
prompt: "Create a new user"
extractors:
- type: jsonpath
tool: create_user
path: "$.data.user.id"
variable_name: user_id
assertions:
- type: tool_called
tool: create_user
- name: Get user details
prompt: "Get details for user {{user_id}}"
assertions:
- type: tool_called
tool: get_user
- type: tool_param_equals
tool: get_user
params:
id: "{{user_id}}"

Extractor Configuration:
- type - Extraction method (currently: jsonpath)
- tool - Tool name to extract from
- path - JSONPath expression
- variable_name - Variable name for template context
Use Cases:
- Extract IDs from creation operations
- Pass data between sequential tests
- Validate consistency across operations
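Conceptually, the extractor walks a JSONPath through the tool's result and stores the value under the variable name. The sketch below illustrates this for simple dotted paths only (the framework supports full JSONPath; `extract` and its sample data are illustrative, not the real API):

```python
import json

# Minimal illustration of jsonpath-style extraction for plain dotted
# paths like "$.data.user.id" (object keys only, no filters or arrays).

def extract(tool_result_json, path, variable_name, variables):
    """Walk a $.a.b.c path through a tool result and store the value."""
    node = json.loads(tool_result_json)
    for key in path.lstrip("$.").split("."):
        node = node[key]
    variables[variable_name] = node
    return node

variables = {}
result = '{"data": {"user": {"id": "u-42", "name": "Ada"}}}'
extract(result, "$.data.user.id", "user_id", variables)
print(variables["user_id"])  # u-42
```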
agent-benchmark generates comprehensive reports in multiple formats. You can specify the output filename with -o (extension added automatically) and generate multiple formats simultaneously using -reportType with comma-separated values.
View Sample Reports - See example HTML reports covering all test configuration permutations (single/multi agent, single/multi test, sessions, suites).
Report Documentation - Detailed documentation on report hierarchy, sections, and adaptive display.
- Console - Real-time colored output during execution (default, always shown)
- HTML - Rich visual dashboard with charts and metrics
- JSON - Structured data for programmatic analysis
- Markdown - Documentation-friendly format
- Realtime - Streaming NDJSON written line-by-line as each test completes
# Console output only (default)
agent-benchmark -f test.yaml
# Generate HTML report
agent-benchmark -f test.yaml -o my-report -reportType html
# Generate multiple formats
agent-benchmark -f test.yaml -o my-report -reportType html,json,md
# Realtime streaming report (useful for CI/CD pipelines and live dashboards)
agent-benchmark -f test.yaml -o my-report -reportType realtime
# Combine realtime with other formats
agent-benchmark -f test.yaml -o my-report -reportType html,json,realtimeThe realtime report type streams results to a .jsonl (JSON Lines) file as each test completes β without waiting for the full suite to finish. This enables external tools to consume results incrementally.
Output file: <name>.jsonl (e.g. -o my-report β my-report.jsonl)
Format β one JSON object per line:
{"type":"test","data":{...full TestRun...}}
{"type":"test","data":{...full TestRun...}}
{"type":"summary","data":{"total_tests":5,"passed":4,"failed":1,"pass_rate":0.8,"total_duration_ms":12340,"generated_at":"2026-04-08T10:00:00Z"}}
END

Line types:

| Line | Description |
|---|---|
| {"type":"test",...} | One line per completed test, written immediately after assertion evaluation. The data field contains the full TestRun: assertions, timestamps, latency, token counts, tool calls, errors, and more. |
| {"type":"summary",...} | Aggregate stats written once after all tests complete. |
| END | Non-JSON sentinel on the last line. Signals to parsers that the stream is complete. |
Parser pattern:
import json

with open("my-report.jsonl") as f:
    for line in follow(f):  # follow() = any tail -f style line generator
        if line.strip() == "END":
            break  # suite finished
        row = json.loads(line)
        if row["type"] == "test":
            process_test(row["data"])
        elif row["type"] == "summary":
            process_summary(row["data"])

Real-time colored output displayed during test execution with three main sections:
Server Comparison Summary
- Test-by-test comparison across agents
- Pass/fail status with checkmarks
- Duration per agent
- Provider information
- Summary statistics (e.g., "2/2 servers passed")
Detailed Test Results
- Individual test results per agent
- All assertion results with pass/fail indicators
- Detailed metrics for each assertion (expected vs actual values)
- Token usage and latency information
- Error details (if any)
Execution Summary
- Total tests, passed, and failed counts
- Pass rate percentage
- Total tool calls
- Total errors
- Total and average duration
- Total tokens used
Example:
───────────────────────────────────────────────────────────────
SERVER COMPARISON SUMMARY
───────────────────────────────────────────────────────────────
Test: Create file [100% passed]
Summary: 2/2 servers passed
┌────────────────────┬──────────┬──────────┐
│ Server/Agent       │ Status   │ Duration │
├────────────────────┼──────────┼──────────┤
│ gemini-agent       │ ✓ PASS   │ 2.34s    │
│   └─ [GOOGLE]      │          │          │
│ claude-agent       │ ✗ FAIL   │ 3.12s    │
│   └─ [ANTHROPIC]   │          │          │
└────────────────────┴──────────┴──────────┘
───────────────────────────────────────────────────────────────
DETAILED TEST RESULTS
───────────────────────────────────────────────────────────────
Test: Create file
✓ gemini-agent [GOOGLE] (2.34s)
  ✓ tool_called: Tool 'write_file' was called
  ✓ tool_param_equals: Tool called with correct parameters
  ✓ max_latency_ms: Latency: 2340ms (max: 5000ms)
      • actual: 2340
      • max: 5000
✗ claude-agent [ANTHROPIC] (3.12s)
  ✓ tool_called: Tool 'write_file' was called
  ✗ tool_param_equals: Tool called with incorrect parameters
      • expected: {"path": "test.txt", "content": "Hello"}
      • actual: {"path": "test.txt"}
  ✓ max_latency_ms: Latency: 3120ms (max: 5000ms)
      • actual: 3120
      • max: 5000
───────────────────────────────────────────────────────────────
Total: 2 | Passed: 1 | Failed: 1
───────────────────────────────────────────────────────────────
================================================================================
[Summary] Test Execution Summary
================================================================================
Total Tests: 2
Passed: 1 (50.0%)
Failed: 1 (50.0%)
Total Tool Calls: 2
Total Errors: 1
Total Duration: 5460ms (avg: 2730ms per test)
Total Tokens: 350
================================================================================
Rich visual report featuring:
Summary Dashboard
- Total/Passed/Failed test counts
- Overall success rate with color-coded statistics
Agent Performance Comparison
- Statistics by agent with visual metrics
- Success rates with percentage indicators
- Average duration and latency
- Token usage (total and average per test)
- Pass/fail counts per agent
Server Comparison Summary
- Side-by-side test results across agents
- Per-test success rates
- Execution duration comparison
- Failed server details with error messages
Detailed Test Results
- Full execution details per agent
- Individual assertion results with pass/fail status
- Performance metrics (duration, tokens, latency)
- Tool call information and parameters
The HTML report is built from modular, reusable template components. Each report type composes these building blocks differently based on context (single agent vs multi-agent, single file vs suite, etc.).
graph TD
subgraph "Main Layout"
A[report.html] --> B[summary-cards]
A --> C[comparison-matrix]
A --> D[agent-leaderboard]
A --> E[file-summary]
A --> F[session-summary]
A --> G[test-results]
A --> H[fullscreen-overlay]
A --> I[scripts]
end
subgraph "Test Results Container"
G --> J[test-group]
end
subgraph "View Selection"
J -->|"1 agent"| K[single-agent-detail]
J -->|"2+ agents"| L[multi-agent-comparison]
end
subgraph "Single Agent Components"
K --> K1[agent-assertions]
K --> K2[agent-errors]
K --> K3[agent-sequence-diagram]
K --> K4[agent-tool-calls]
K --> K5[agent-messages]
K --> K6[agent-final-output]
end
subgraph "Multi-Agent Components"
L --> L1[comparison-table]
L --> L2[tool-comparison]
L --> L3[errors-comparison]
L --> L4[sequence-comparison]
L --> L5[outputs-comparison]
end
Reports are designed hierarchically, with each level building upon the previous:
| Level | Report Type | Description | Key Components |
|---|---|---|---|
| 1 | Single Agent, Single Test | Simplest case - one agent, one test | Summary cards, single-agent-detail |
| 2 | Single Agent, Multiple Tests | Multiple independent tests, same agent | + test-overview table |
| 3 | Multiple Agents | Compare agents on same tests | + comparison-matrix, agent-leaderboard |
| 4 | Multiple Sessions | Tests grouped by session with shared context | + session-summary |
| 5 | Full Suite | Multi-agent, multi-session, multi-file | All components combined |
Generate sample reports for each level:
go run test/generate_reports.go

This creates hierarchical sample reports in generated_reports/:
- 01_single_agent_single_test.html - Level 1: One agent, one test
- 02_single_agent_multi_test.html - Level 2: One agent, multiple tests
- 03_multi_agent_single_test.html - Level 3: Multiple agents, one test (leaderboard)
- 04_multi_agent_multi_test.html - Level 4: Multiple agents, multiple tests (matrix)
- 05_single_agent_multi_session.html - Level 5: One agent, multiple sessions
- 06_multi_agent_multi_session.html - Level 6: Multiple agents, multiple sessions
- 07_single_agent_multi_file.html - Level 7: One agent, multiple files
- 08_multi_agent_multi_file.html - Level 8: Full suite (multiple agents, sessions, files)
- 09_failed_with_errors.html - Error display example
Single Agent Report - One agent running tests:
graph LR
subgraph "Single Agent Report"
A[summary-cards] --> B[test-results]
B --> C[test-group]
C --> D[single-agent-detail]
D --> D1[assertions]
D --> D2[errors]
D --> D3[sequence-diagram]
D --> D4[tool-calls]
D --> D5[messages]
D --> D6[final-output]
end
Multi-Agent Report - Multiple agents compared on same tests:
graph LR
subgraph "Multi-Agent Report"
A[summary-cards] --> B[comparison-matrix]
B --> C[agent-leaderboard]
C --> D[test-results]
D --> E[test-group]
E --> F[multi-agent-comparison]
F --> F1[comparison-table]
F --> F2[tool-comparison]
F --> F3[errors-comparison]
F --> F4[sequence-comparison]
F --> F5[outputs-comparison]
end
Multi-Session Report - Tests organized by conversation sessions:
graph LR
subgraph "Multi-Session Report"
A[summary-cards] --> B[session-summary]
B --> C[test-results]
C --> D[test-group]
D --> E[single-agent-detail]
end
Full Suite Report - Multiple test files with optional multi-agent:
graph LR
subgraph "Full Suite Report"
A[summary-cards] --> B[comparison-matrix]
B --> C[agent-leaderboard]
C --> D[file-summary]
D --> E[test-results]
E --> F[test-group]
F -->|"1 agent"| G[single-agent-detail]
F -->|"2+ agents"| H[multi-agent-comparison]
end
| Component | Purpose | Used In |
|---|---|---|
| summary-cards | Top-level stats (total/passed/failed/tokens/duration) | All reports |
| comparison-matrix | Test × Agent pass/fail matrix | Multi-agent |
| agent-leaderboard | Ranked agent performance table | Multi-agent |
| file-summary | Test file grouping with stats | Suite runs |
| session-summary | Session grouping with flow diagrams | Multi-session |
| test-results | Container for all test groups | All reports |
| test-group | Single test, decides single vs multi view | All reports |
| single-agent-detail | Detailed expandable view for one agent | Single-agent |
| multi-agent-comparison | Side-by-side comparison table | Multi-agent |
| agent-assertions | Assertion results list | Single-agent |
| agent-errors | Error messages display | Single-agent |
| agent-sequence-diagram | Mermaid execution flow diagram | Single-agent |
| agent-tool-calls | Tool calls timeline with params/results | Single-agent |
| agent-messages | Conversation history | Single-agent |
| agent-final-output | Final agent response | Single-agent |
| tool-comparison | Tool calls side-by-side | Multi-agent |
| errors-comparison | Errors side-by-side | Multi-agent |
| sequence-comparison | Diagrams side-by-side (click to fullscreen) | Multi-agent |
| outputs-comparison | Final outputs side-by-side | Multi-agent |
| fullscreen-overlay | Modal overlay for enlarged diagrams | All reports |
| scripts | Mermaid init, expand/collapse, fullscreen JS | All reports |
Generate an AI-powered executive summary of test results by adding ai_summary to your test YAML:
ai_summary:
enabled: true
judge_provider: azure-gpt # Provider name from your providers section

The analysis appears as an "AI Summary" section in HTML reports with a verdict, trade-offs analysis, notable observations, failure patterns, and actionable recommendations.
Full AI Summary Documentation
Structured test results for programmatic analysis and CI/CD integration:
{
"agent_benchmark_version": "1.0.0",
"generated_at": "2024-01-15T14:30:00Z",
"summary": {
"total": 10,
"passed": 8,
"failed": 2
},
"comparison_summary": {
"Test Name": {
"testName": "Create file",
"serverResults": {
"gemini-agent": {
"agentName": "gemini-agent",
"provider": "GOOGLE",
"passed": true,
"duration": 2340,
"errors": []
},
"claude-agent": {
"agentName": "claude-agent",
"provider": "ANTHROPIC",
"passed": false,
"duration": 3120,
"errors": ["Tool parameter mismatch"]
}
},
"totalRuns": 2,
"passedRuns": 1,
"failedRuns": 1
}
},
"detailed_results": [
{
"execution": {
"testName": "Create file",
"agentName": "gemini-agent",
"providerType": "GOOGLE",
"startTime": "2024-01-15T14:30:00Z",
"endTime": "2024-01-15T14:30:02Z",
"tokensUsed": 150,
"latencyMs": 2340,
"errors": []
},
"assertions": [
{
"type": "tool_called",
"passed": true,
"message": "Tool 'write_file' was called"
},
{
"type": "tool_param_equals",
"passed": true,
"message": "Tool 'write_file' called with correct parameters"
}
],
"passed": true
}
]
}

Key Fields
- summary - Overall test statistics
- comparison_summary - Cross-agent comparison data
- detailed_results - Full execution details with assertions
- agent_benchmark_version - Version of the tool used
- generated_at - Report generation timestamp
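The summary block makes it easy to gate a CI job on a pass-rate threshold. A hedged sketch (field names follow the sample report above; the threshold and file path are arbitrary choices for illustration):

```python
import json

def gate(report_path, min_pass_rate=0.9):
    """Return True if the JSON report's pass rate meets the threshold."""
    with open(report_path) as f:
        report = json.load(f)
    s = report["summary"]
    rate = s["passed"] / s["total"] if s["total"] else 0.0
    print(f"{s['passed']}/{s['total']} passed ({rate:.0%})")
    return rate >= min_pass_rate

# Possible CI usage (paths assumed):
#   agent-benchmark -f test.yaml -o results -reportType json
#   then fail the job when gate("results.json") returns False
```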
Documentation-friendly format ideal for README files, wikis, and technical documentation.

Key Features
- Clean, readable format for documentation
- Summary tables with comparison data
- Detailed assertion results per agent
- Easy to include in GitHub README or wiki pages
- Portable across documentation platforms
- Quick visual identification of pass/fail status
providers:
- name: gemini
type: GOOGLE
token: ${GOOGLE_API_KEY}
model: gemini-2.0-flash
servers:
- name: fs
type: stdio
command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
- name: file-agent
provider: gemini
servers:
- name: fs
settings:
verbose: true
max_iterations: 5
variables:
filename: "test-{{randomValue length=8}}.txt"
content: "{{faker 'Lorem.paragraph'}}"
sessions:
- name: File Tests
tests:
- name: Create file
prompt: "Create a file {{filename}} with content: {{content}}"
assertions:
- type: tool_called
tool: write_file
- type: file_created
path: "/tmp/{{filename}}"
- name: Read file
prompt: "Read {{filename}}"
assertions:
- type: tool_called
tool: read_file
- type: output_contains
value: "{{content}}"

Run:
./agent-benchmark -f file-tests.yaml -o results.html -verbose

providers:
- name: claude
type: ANTHROPIC
token: ${ANTHROPIC_API_KEY}
model: claude-sonnet-4-20250514
servers:
- name: api-server
type: sse
url: https://api.example.com/mcp/events
headers:
- "Authorization: Bearer ${API_TOKEN}"
agents:
- name: api-agent
provider: claude
servers:
- name: api-server
settings:
tool_timeout: 10s
max_iterations: 8
variables:
user_id: "{{randomInt lower=1000 upper=9999}}"
email: "{{faker 'Internet.email'}}"
timestamp: "{{now format='unix'}}"
sessions:
- name: User Management
tests:
- name: Create user
prompt: |
Create a new user with:
- ID: {{user_id}}
- Email: {{email}}
- Created: {{timestamp}}
assertions:
- type: tool_called
tool: create_user
- type: tool_param_equals
tool: create_user
params:
id: "{{user_id}}"
email: "{{email}}"
- type: output_json_valid
- type: max_latency_ms
value: 5000
- name: Fetch user
prompt: "Get user {{user_id}}"
assertions:
- type: tool_called
tool: get_user
- type: output_matches_json
path: "$.data.email"
value: "{{email}}"

test:
stage: test
script:
- ./agent-benchmark -s suite.yaml -o results.json -reportType json
artifacts:
when: always
paths:
- results.json
reports:
junit: results.json
variables:
GOOGLE_API_KEY: ${GOOGLE_API_KEY}
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}

Within a session, tests share conversation history:
Session Start
├─ Test 1: "Create file"
│   └─ Messages: [user, assistant, tool_response]
├─ Test 2: "Read file"      # Has Test 1 history
│   └─ Messages: [prev..., user, assistant, tool_response]
└─ Test 3: "Delete file"    # Has Test 1 & 2 history
    └─ Messages: [prev..., user, assistant, tool_response]
1. User sends prompt
2. Agent calls LLM with tools
3. LLM responds with:
   a) Final answer → Done
   b) Tool calls → Execute tools → Back to step 2
4. Repeat until:
- Final answer received
- Max iterations reached
- Context cancelled
- Error occurred
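The loop above can be sketched in a few lines. This is an illustrative model of the described control flow, not the framework's Go implementation; `call_llm` and `run_tool` are hypothetical stand-ins.

```python
# Sketch of the agent loop: call the LLM, execute requested tools,
# feed results back, stop on a final answer or the iteration cap.

def agent_loop(prompt, call_llm, run_tool, max_iterations=5):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_iterations):
        reply = call_llm(messages)          # step 2: LLM call with tools
        messages.append(reply)
        if not reply.get("tool_calls"):     # step 3a: final answer
            return reply["content"], messages
        for call in reply["tool_calls"]:    # step 3b: execute tools
            result = run_tool(call["name"], call.get("args", {}))
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return None, messages                   # step 4: max iterations reached

# Toy stand-ins: the "LLM" requests one tool call, then answers.
def fake_llm(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "done", "tool_calls": []}
    return {"role": "assistant", "content": "",
            "tool_calls": [{"name": "write_file", "args": {"path": "t.txt"}}]}

def fake_tool(name, args):
    return f"{name} ok"

answer, history = agent_loop("Create t.txt", fake_llm, fake_tool)
print(answer)  # done
```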
The agent can detect when an LLM asks for clarification instead of taking action (e.g., "Would you like me to...", "Should I proceed..."). This feature uses LLM-based semantic classification for accurate detection across any language.
agents:
- name: autonomous-agent
provider: my-provider
clarification_detection:
enabled: true
judge_provider: azure-openai-judge # Recommend gpt-4.1 for best accuracy

For full documentation, see docs/clarification-detection.md.
Apache 2.0 License - See LICENSE file for details
Issues: https://github.com/mykhaliev/agent-benchmark/issues
Contributing:
- Fork the repository
- Create feature branch
- Submit pull request