
tracking: adopt Agent Skills evals standard — evals.json support and graduation path #544

@christso

Description


Overview

Adopt the Agent Skills open standard eval format (`evals.json`) as a supported input to AgentV, providing a smooth on-ramp for skill authors who want to graduate from simple string-based assertions to AgentV's full evaluation framework.

Context

The Agent Skills standard (backed by Anthropic, merged PR #203) defines an `evals.json` format for evaluating skill quality. It uses `prompt` (string), `expected_output` (string), and `assertions` (string[]) — simpler than AgentV's `input` (message array), `expected_output` (message array), and `assert` (typed evaluator array), but serving a similar purpose.
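For reference, a minimal evals.json in this shape might look like the following. The top-level array and all field values here are illustrative sketches, not taken from the standard's examples — consult the Agent Skills spec for the exact schema:

```json
[
  {
    "id": 1,
    "skill_name": "invoice-extractor",
    "prompt": "Extract the invoice total from the attached PDF.",
    "expected_output": "The invoice total is $1,234.56.",
    "assertions": [
      "The response states the invoice total",
      "The response does not invent line items"
    ],
    "files": ["fixtures/invoice.pdf"]
  }
]
```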

Rather than competing formats, these are complementary levels of the same progression:

  • evals.json — Quick quality checks for skill authors (simple, no framework to learn)
  • EVAL.yaml — Production agent evaluation (deterministic checks, multi-turn, CI/CD, composite scoring)

AgentV should accept both, with a clear graduation path between them.
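To make the graduation path concrete, here is how a single evals.json check might look once promoted to EVAL.yaml. The exact EVAL.yaml schema is AgentV's; this sketch only applies the promotion rules described later in this issue, and the `tests:` wrapper key is an assumption:

```yaml
# Hypothetical EVAL.yaml equivalent of a one-entry evals.json — a sketch, not the canonical schema.
tests:
  - id: "1"                      # numeric id coerced to string
    name: invoice-extractor      # from skill_name
    input:
      - role: user
        content: "Extract the invoice total from the attached PDF."
    expected_output:
      - role: assistant
        content: "The invoice total is $1,234.56."
    evaluators:
      - type: llm-judge
        prompt: "The response states the invoice total"
      - type: llm-judge
        prompt: "The response does not invent line items"
```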

Research: agentevals-research/research/findings/agentskills-open-standard

Design: Use Existing `--target` for Agent Self-Evaluation

No new flags or execution modes needed. The existing `--target` flag selects the agent harness:

```bash
agentv eval evals.json --target claude
```

The agent self-evaluates naturally — Claude Code loads skills via progressive disclosure, the agent processes the prompt, and AgentV's evaluators score the response. This works because:

  1. The parser (feat: parse Agent Skills evals.json format #538) promotes evals.json to `EvalTest[]`
  2. Format detection (feat: auto-detect evals.json in agentv eval #539) routes `.json` files through the parser
  3. The promoted tests flow through the existing target execution pipeline
  4. `--target claude` (or `copilot-cli`, or any agent target) handles execution
  5. Evaluators (promoted assertions → llm-judge) score the output
  6. Result JSONL + exit codes work for CI/CD

No special agent-mode routing. No `--skill` flag. Just: detect format → promote → run through existing pipeline.
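The detection step can be sketched as a simple structural check. Type and function names below are illustrative, not AgentV's actual API, and the assumption that evals.json is a top-level array is mine:

```typescript
// Sketch of #539-style format detection: route a .json file through the
// Agent Skills parser only when its shape matches the evals.json format.
// Names and the top-level-array assumption are illustrative.

interface AgentSkillsEvalLike {
  prompt: string;
  assertions: string[];
}

function looksLikeAgentSkillsEvals(path: string, raw: unknown): boolean {
  if (!path.endsWith(".json") || !Array.isArray(raw)) return false;
  // Every entry must carry a string prompt and an assertions array.
  return raw.every(
    (e): e is AgentSkillsEvalLike =>
      typeof e === "object" &&
      e !== null &&
      typeof (e as AgentSkillsEvalLike).prompt === "string" &&
      Array.isArray((e as AgentSkillsEvalLike).assertions),
  );
}
```

Anything that fails the check falls through to the existing EVAL.yaml path, so detection is purely additive.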

Agent Mode Parity with Skill-Creator

AgentV and Anthropic's skill-creator use the same execution model — Claude Code subagents dispatched via prompt. There is no special skill-loading API; the skill-creator injects skills by adding "Skill path: /path" to the subagent prompt text.

AgentV achieves 100% feature parity with the skill-creator eval workflow, plus:

  • Deterministic evaluators (is_json, contains, regex) that skill-creator lacks
  • Structured JSONL results vs ad-hoc grading.json
  • CI/CD exit codes vs manual-only review
  • `agentv compare` for before/after delta analysis

Baseline comparison (with/without skill) is a shared platform constraint — Claude Code auto-discovers installed skills and there is no per-session disable mechanism. Both AgentV and skill-creator work around this by keeping skills in development directories outside `.claude/skills/`.

Architecture

```
agentv eval evals.json --target claude
        │
        ▼
┌──────────────┐
│ detect .json │
│ format (#539)│
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ evals.json   │
│ parser (#538)│
└──────┬───────┘
       │ promotes to
       ▼
┌─────────────────────────────────────┐
│ AgentV internal EvalTest            │
│ (same as EVAL.yaml pipeline)        │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Existing target execution pipeline  │
│ (--target claude / copilot-cli      │
│  / any agent target)                │
└──────────────┬──────────────────────┘
               │
               ▼
Result JSONL
+ exit codes (CI/CD)
+ benchmark.json (#542)
```

Promotion Rules

| Agent Skills | AgentV `EvalTest` | Transform |
| --- | --- | --- |
| `prompt` (string) | `input` (TestMessage[]) | Wrap in `[{role: "user", content}]` |
| `expected_output` (string) | `expected_output` (TestMessage[]) | Wrap in `[{role: "assistant", content}]` |
| `assertions` (string[]) | `evaluators` (EvaluatorConfig[]) | Each → `{type: "llm-judge", prompt: text}` |
| `id` (number) | `id` (string) | String coercion |
| `skill_name` | `name` | Direct |
| `files` (string[]) | `metadata.agent_skills_files` | Copy into workspace (#541) |
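The promotion rules above can be sketched as a single pure function. Type names and field optionality are illustrative assumptions, not AgentV's real types:

```typescript
// Sketch of the evals.json → EvalTest promotion rules. All type names
// here are hypothetical stand-ins for AgentV's internal types.

type TestMessage = { role: "user" | "assistant"; content: string };
type EvaluatorConfig = { type: "llm-judge"; prompt: string };

interface AgentSkillsEval {
  id: number;
  skill_name?: string;
  prompt: string;
  expected_output?: string;
  assertions?: string[];
  files?: string[];
}

interface EvalTest {
  id: string;
  name?: string;
  input: TestMessage[];
  expected_output: TestMessage[];
  evaluators: EvaluatorConfig[];
  metadata?: { agent_skills_files: string[] };
}

function promoteEval(e: AgentSkillsEval): EvalTest {
  return {
    id: String(e.id),                             // number → string coercion
    name: e.skill_name,                           // direct mapping
    input: [{ role: "user", content: e.prompt }], // wrap prompt as user message
    expected_output: e.expected_output            // wrap as assistant message
      ? [{ role: "assistant", content: e.expected_output }]
      : [],
    evaluators: (e.assertions ?? []).map((text) => ({
      type: "llm-judge",                          // each assertion → llm-judge
      prompt: text,
    })),
    metadata: e.files ? { agent_skills_files: e.files } : undefined,
  };
}
```

Because the output is an ordinary `EvalTest`, everything downstream (targets, evaluators, JSONL results) needs no changes.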

Sub-Issues

Wave 1 — Foundation

Wave 2 — Integration (depends on Wave 1, parallel within wave)

Wave 3 — Graduation Tools (depends on Wave 2, parallel within wave)

Wave 4 — Documentation (depends on Wave 3)

Optional Enhancement

Merge Order

  1. feat: parse Agent Skills evals.json format #538 first (no dependencies)
  2. feat: auto-detect evals.json in agentv eval #539 and feat: handle Agent Skills evals files[] in workspace #541 in parallel (both depend on feat: parse Agent Skills evals.json format #538)
  3. feat: agentv convert command — evals.json to EVAL.yaml #540 and feat: emit Agent Skills benchmark.json alongside result JSONL #542 in parallel (depend on feat: auto-detect evals.json in agentv eval #539)
  4. docs: Agent Skills evals.json adoption guide and examples #543 last (documents everything)
  5. feat: support evals.json in agentv prompt eval subcommands #545 optional (can be done anytime after feat: parse Agent Skills evals.json format #538)

E2E Verification

```bash
# Wave 1: Parser unit tests (in #538)

# Wave 2: Full pipeline
agentv eval examples/features/agent-skills-evals/evals.json --target claude --out results.jsonl
cat results.jsonl | jq '.test_id'   # "1" and "2"

# Wave 3: Graduation
agentv convert examples/features/agent-skills-evals/evals.json -o converted.yaml
agentv eval converted.yaml --target claude   # equivalent results

# Wave 3: Benchmark
agentv eval evals.json --target claude --benchmark-json benchmark.json
cat benchmark.json | jq '.run_summary'

# CI/CD pattern
agentv eval evals/evals.json --target claude || exit 1
```

Completion Criteria

  • `agentv eval evals.json --target claude` works end-to-end
  • `agentv convert evals.json` produces valid, runnable EVAL.yaml
  • Skill authors can start with evals.json and graduate to EVAL.yaml without rewriting
  • Documentation covers both coding and non-coding agent use cases
  • Each wave has passing E2E tests with realistic examples
