## Overview
Adopt the Agent Skills open standard eval format (`evals.json`) as a supported input to AgentV, providing a smooth on-ramp for skill authors who want to graduate from simple string-based assertions to AgentV's full evaluation framework.
## Context
The Agent Skills standard (backed by Anthropic, merged PR #203) defines an `evals.json` format for evaluating skill quality. It uses `prompt` (string), `expected_output` (string), and `assertions` (string[]) — simpler than AgentV's `input` (message array), `expected_output` (message array), and `assert` (typed evaluator array), but serving a similar purpose.
Rather than competing formats, these are complementary levels of the same progression:
- `evals.json` — quick quality checks for skill authors (simple, no framework to learn)
- `EVAL.yaml` — production agent evaluation (deterministic checks, multi-turn, CI/CD, composite scoring)
AgentV should accept both, with a clear graduation path between them.
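For concreteness, a minimal `evals.json` might look like the following. The field values and skill name are hypothetical, and the top-level shape (a plain array of cases) is an assumption based on the promotion rules below:

```json
[
  {
    "id": 1,
    "skill_name": "pdf-extractor",
    "prompt": "Extract the title from attached.pdf",
    "expected_output": "The title is \"Annual Report 2024\"",
    "assertions": [
      "Response states the document title",
      "Response does not invent page numbers"
    ],
    "files": ["attached.pdf"]
  }
]
```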
Research: agentevals-research/research/findings/agentskills-open-standard
## Design: Use Existing `--target` for Agent Self-Evaluation
No new flags or execution modes needed. The existing `--target` flag selects the agent harness:
```bash
agentv eval evals.json --target claude
```
The agent self-evaluates naturally — Claude Code loads skills via progressive disclosure, the agent processes the prompt, and AgentV's evaluators score the response. This works because:
- The parser (feat: parse Agent Skills evals.json format #538) promotes evals.json to `EvalTest[]`
- Format detection (feat: auto-detect evals.json in agentv eval #539) routes `.json` files through the parser
- The promoted tests flow through the existing target execution pipeline
- `--target claude` (or `copilot-cli`, or any agent target) handles execution
- Evaluators (promoted assertions → llm-judge) score the output
- Result JSONL + exit codes work for CI/CD
No special agent-mode routing. No `--skill` flag. Just: detect format → promote → run through existing pipeline.
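The detect → promote → run flow can be sketched in a few lines. This is an illustrative sketch, not AgentV's actual implementation — the real detection logic lives in #539 and may use different criteria:

```python
import json
from pathlib import Path


def detect_format(path: str) -> str:
    """Route an eval file to the right parser by extension and shape."""
    p = Path(path)
    if p.suffix in (".yaml", ".yml"):
        return "eval-yaml"
    if p.suffix == ".json":
        data = json.loads(p.read_text())
        # Agent Skills evals.json is a list of cases with string prompts.
        cases = data if isinstance(data, list) else data.get("evals", [])
        if cases and isinstance(cases[0].get("prompt"), str):
            return "agent-skills-evals"
    return "unknown"
```

Once detected, the cases are promoted to `EvalTest[]` (#538) and handed to the same execution pipeline that `EVAL.yaml` files use.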
## Agent Mode Parity with Skill-Creator
AgentV and Anthropic's skill-creator use the same execution model — Claude Code subagents dispatched via prompt. There is no special skill-loading API; the skill-creator injects skills by adding "Skill path: /path" to the subagent prompt text.
AgentV achieves 100% feature parity with the skill-creator eval workflow, plus:
- Deterministic evaluators (is_json, contains, regex) that skill-creator lacks
- Structured JSONL results vs ad-hoc grading.json
- CI/CD exit codes vs manual-only review
- `agentv compare` for before/after delta analysis
Baseline comparison (with/without skill) is a shared platform constraint — Claude Code auto-discovers installed skills and there is no per-session disable mechanism. Both AgentV and skill-creator work around this by keeping skills in development directories outside `.claude/skills/`.
## Architecture
```
agentv eval evals.json --target claude
       │
       ▼
┌──────────────┐
│ detect .json │
│ format (#539)│
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ evals.json   │
│ parser (#538)│
└──────┬───────┘
       │ promotes to
       ▼
┌─────────────────────────────────────┐
│ AgentV internal EvalTest            │
│ (same as EVAL.yaml pipeline)        │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Existing target execution pipeline  │
│ (--target claude / copilot-cli      │
│  / any agent target)                │
└──────────────┬──────────────────────┘
               │
               ▼
        Result JSONL
        + exit codes (CI/CD)
        + benchmark.json (#542)
```
## Promotion Rules
| Agent Skills | AgentV `EvalTest` | Transform |
|---|---|---|
| `prompt` (string) | `input` (TestMessage[]) | Wrap in `[{role: "user", content}]` |
| `expected_output` (string) | `expected_output` (TestMessage[]) | Wrap in `[{role: "assistant", content}]` |
| `assertions` (string[]) | `evaluators` (EvaluatorConfig[]) | Each → `{type: "llm-judge", prompt: text}` |
| `id` (number) | `id` (string) | String coercion |
| `skill_name` | `name` | Direct |
| `files` (string[]) | `metadata.agent_skills_files` | Copy into workspace (#541) |
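The table above translates to a small promotion step. The sketch below uses plain dicts for readability; AgentV's actual `EvalTest` and `EvaluatorConfig` types are richer, and the exact field names on the AgentV side are taken from the table:

```python
def promote(case: dict) -> dict:
    """Promote one Agent Skills eval case to an AgentV EvalTest-like dict."""
    test = {
        "id": str(case["id"]),                      # number -> string coercion
        "name": case.get("skill_name", ""),         # skill_name -> name
        # Wrap the string prompt as a single user message.
        "input": [{"role": "user", "content": case["prompt"]}],
        # Each assertion string becomes an llm-judge evaluator.
        "evaluators": [
            {"type": "llm-judge", "prompt": text}
            for text in case.get("assertions", [])
        ],
    }
    if "expected_output" in case:
        # Wrap the string expectation as a single assistant message.
        test["expected_output"] = [
            {"role": "assistant", "content": case["expected_output"]}
        ]
    if "files" in case:
        # Copied into the eval workspace by #541.
        test["metadata"] = {"agent_skills_files": case["files"]}
    return test
```

Note that the promotion is lossless in one direction: every evals.json field maps to a home in `EvalTest`, which is what makes the `agentv convert` graduation path (#540) possible.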
## Sub-Issues

### Wave 1 — Foundation

- #538 — Parse Agent Skills evals.json format (core parser + promotion logic)

### Wave 2 — Integration (depends on Wave 1, parallel within wave)

- #539 — Auto-detect evals.json in `agentv eval` (format detection + routing)
- #541 — Handle `files[]` in workspace (file isolation)

### Wave 3 — Graduation Tools (depends on Wave 2, parallel within wave)

- #540 — `agentv convert` command (evals.json → EVAL.yaml)
- #542 — Emit benchmark.json (Agent Skills-compatible output)

### Wave 4 — Documentation (depends on Wave 3)

- #543 — Adoption guide and examples (docs)

### Optional Enhancement

- #545 — Support evals.json in `agentv prompt eval` (nice-to-have for manual orchestration; primary path is `agentv eval --target`)
## Merge Order

1. #538 (core parser) first; no dependencies
2. #539 (format detection) and #541 (`files[]` handling) in parallel; both depend on #538
3. #540 (`agentv convert`) and #542 (benchmark.json) in parallel; both depend on #539
4. #543 (docs) last; documents everything
5. #545 (`agentv prompt eval` support) optional; can land anytime after #538
## E2E Verification
```bash
# Wave 1: parser unit tests (in #538)

# Wave 2: full pipeline
agentv eval examples/features/agent-skills-evals/evals.json --target claude --out results.jsonl
cat results.jsonl | jq '.test_id'   # "1" and "2"

# Wave 3: graduation
agentv convert examples/features/agent-skills-evals/evals.json -o converted.yaml
agentv eval converted.yaml --target claude   # equivalent results

# Wave 3: benchmark
agentv eval evals.json --target claude --benchmark-json benchmark.json
cat benchmark.json | jq '.run_summary'

# CI/CD pattern
agentv eval evals/evals.json --target claude || exit 1
```
## Completion Criteria
- `agentv eval evals.json --target claude` works end-to-end
- `agentv convert evals.json` produces valid, runnable EVAL.yaml
- Skill authors can start with evals.json and graduate to EVAL.yaml without rewriting
- Documentation covers both coding and non-coding agent use cases
- Each wave has passing E2E tests with realistic examples