
tracking: adopt Agent Skills evals standard — evals.json support and graduation path #544

@christso

Description


Overview

Adopt the Agent Skills open standard eval format (`evals.json`) as a supported input to AgentV, providing a smooth on-ramp for skill authors who want to graduate from simple string-based assertions to AgentV's full evaluation framework.

Context

The Agent Skills standard (backed by Anthropic, merged PR #203) defines an `evals.json` format for evaluating skill quality. It uses `prompt` (string), `expected_output` (string), and `assertions` (string[]) — simpler than AgentV's `input` (message array), `expected_output` (message array), and `assert` (typed evaluator array), but serving a similar purpose.
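For reference, a minimal evals.json in this shape might look like the following. The top-level array and all field values here are illustrative sketches, not taken from the standard's examples — consult the Agent Skills spec for the exact schema:

```json
[
  {
    "id": 1,
    "skill_name": "invoice-extractor",
    "prompt": "Extract the invoice total from the attached PDF.",
    "expected_output": "The invoice total is $1,234.56.",
    "assertions": [
      "The response states the invoice total",
      "The response does not invent line items"
    ],
    "files": ["fixtures/invoice.pdf"]
  }
]
```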

Rather than competing formats, these are complementary levels of the same progression:

  • evals.json — Quick quality checks for skill authors (simple, no framework to learn)
  • EVAL.yaml — Production agent evaluation (deterministic checks, multi-turn, CI/CD, composite scoring)

AgentV should accept both, with a clear graduation path between them.
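To make the graduation path concrete, here is how a single evals.json check might look once promoted to EVAL.yaml. The exact EVAL.yaml schema is AgentV's; this sketch only applies the promotion rules described later in this issue, and the `tests:` wrapper key is an assumption:

```yaml
# Hypothetical EVAL.yaml equivalent of a one-entry evals.json — a sketch, not the canonical schema.
tests:
  - id: "1"                      # numeric id coerced to string
    name: invoice-extractor      # from skill_name
    input:
      - role: user
        content: "Extract the invoice total from the attached PDF."
    expected_output:
      - role: assistant
        content: "The invoice total is $1,234.56."
    evaluators:
      - type: llm-judge
        prompt: "The response states the invoice total"
      - type: llm-judge
        prompt: "The response does not invent line items"
```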

Research: agentevals-research/research/findings/agentskills-open-standard

Design: Use Existing `--target` for Agent Self-Evaluation

No new flags or execution modes needed. The existing `--target` flag selects the agent harness:

```bash
agentv eval evals.json --target claude
```

The agent self-evaluates naturally — Claude Code loads skills via progressive disclosure, the agent processes the prompt, and AgentV's evaluators score the response. This works because:

  1. The parser (feat: parse Agent Skills evals.json format #538) promotes evals.json to `EvalTest[]`
  2. Format detection (feat: auto-detect evals.json in agentv eval #539) routes `.json` files through the parser
  3. The promoted tests flow through the existing target execution pipeline
  4. `--target claude` (or `copilot-cli`, or any agent target) handles execution
  5. Evaluators (promoted assertions → llm-judge) score the output
  6. Result JSONL + exit codes work for CI/CD

No special agent-mode routing. No `--skill` flag. Just: detect format → promote → run through existing pipeline.
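The detection step can be sketched as a simple structural check. Type and function names below are illustrative, not AgentV's actual API, and the assumption that evals.json is a top-level array is mine:

```typescript
// Sketch of #539-style format detection: route a .json file through the
// Agent Skills parser only when its shape matches the evals.json format.
// Names and the top-level-array assumption are illustrative.

interface AgentSkillsEvalLike {
  prompt: string;
  assertions: string[];
}

function looksLikeAgentSkillsEvals(path: string, raw: unknown): boolean {
  if (!path.endsWith(".json") || !Array.isArray(raw)) return false;
  // Every entry must carry a string prompt and an assertions array.
  return raw.every(
    (e): e is AgentSkillsEvalLike =>
      typeof e === "object" &&
      e !== null &&
      typeof (e as AgentSkillsEvalLike).prompt === "string" &&
      Array.isArray((e as AgentSkillsEvalLike).assertions),
  );
}
```

Anything that fails the check falls through to the existing EVAL.yaml path, so detection is purely additive.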

Agent Mode Parity with Skill-Creator

AgentV and Anthropic's skill-creator use the same execution model — Claude Code subagents dispatched via prompt. There is no special skill-loading API; the skill-creator injects skills by adding "Skill path: /path" to the subagent prompt text.

AgentV achieves 100% feature parity with the skill-creator eval workflow, plus:

  • Deterministic evaluators (is_json, contains, regex) that skill-creator lacks
  • Structured JSONL results vs ad-hoc grading.json
  • CI/CD exit codes vs manual-only review
  • `agentv compare` for before/after delta analysis

Baseline comparison (with/without skill) is a shared platform constraint — Claude Code auto-discovers installed skills and there is no per-session disable mechanism. Both AgentV and skill-creator work around this by keeping skills in development directories outside `.claude/skills/`.

Architecture

```
agentv eval evals.json --target claude
        │
        ▼
┌──────────────┐
│ detect .json │
│ format (#539)│
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ evals.json   │
│ parser (#538)│
└──────┬───────┘
       │ promotes to
       ▼
┌─────────────────────────────────────┐
│ AgentV internal EvalTest            │
│ (same as EVAL.yaml pipeline)        │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ Existing target execution pipeline  │
│ (--target claude / copilot-cli      │
│  / any agent target)                │
└──────────────┬──────────────────────┘
               │
               ▼
Result JSONL
+ exit codes (CI/CD)
+ benchmark.json (#542)
```

Promotion Rules

| Agent Skills | AgentV `EvalTest` | Transform |
| --- | --- | --- |
| `prompt` (string) | `input` (TestMessage[]) | Wrap in `[{role: "user", content}]` |
| `expected_output` (string) | `expected_output` (TestMessage[]) | Wrap in `[{role: "assistant", content}]` |
| `assertions` (string[]) | `evaluators` (EvaluatorConfig[]) | Each → `{type: "llm-judge", prompt: text}` |
| `id` (number) | `id` (string) | String coercion |
| `skill_name` | `name` | Direct |
| `files` (string[]) | `metadata.agent_skills_files` | Copy into workspace (#541) |
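The promotion rules above can be sketched as a single pure function. Type names and field optionality are illustrative assumptions, not AgentV's real types:

```typescript
// Sketch of the evals.json → EvalTest promotion rules. All type names
// here are hypothetical stand-ins for AgentV's internal types.

type TestMessage = { role: "user" | "assistant"; content: string };
type EvaluatorConfig = { type: "llm-judge"; prompt: string };

interface AgentSkillsEval {
  id: number;
  skill_name?: string;
  prompt: string;
  expected_output?: string;
  assertions?: string[];
  files?: string[];
}

interface EvalTest {
  id: string;
  name?: string;
  input: TestMessage[];
  expected_output: TestMessage[];
  evaluators: EvaluatorConfig[];
  metadata?: { agent_skills_files: string[] };
}

function promoteEval(e: AgentSkillsEval): EvalTest {
  return {
    id: String(e.id),                             // number → string coercion
    name: e.skill_name,                           // direct mapping
    input: [{ role: "user", content: e.prompt }], // wrap prompt as user message
    expected_output: e.expected_output            // wrap as assistant message
      ? [{ role: "assistant", content: e.expected_output }]
      : [],
    evaluators: (e.assertions ?? []).map((text) => ({
      type: "llm-judge",                          // each assertion → llm-judge
      prompt: text,
    })),
    metadata: e.files ? { agent_skills_files: e.files } : undefined,
  };
}
```

Because the output is an ordinary `EvalTest`, everything downstream (targets, evaluators, JSONL results) needs no changes.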

Sub-Issues

Wave 1 — Foundation

Wave 2 — Integration (depends on Wave 1, parallel within wave)

Wave 3 — Graduation Tools (depends on Wave 2, parallel within wave)

Wave 4 — Documentation (depends on Wave 3)

Optional Enhancement

Merge Order

  1. feat: parse Agent Skills evals.json format #538 first (no dependencies)
  2. feat: auto-detect evals.json in agentv eval #539 and feat: handle Agent Skills evals files[] in workspace #541 in parallel (both depend on feat: parse Agent Skills evals.json format #538)
  3. feat: agentv convert command — evals.json to EVAL.yaml #540 and feat: emit Agent Skills benchmark.json alongside result JSONL #542 in parallel (depend on feat: auto-detect evals.json in agentv eval #539)
  4. docs: Agent Skills evals.json adoption guide and examples #543 last (documents everything)
  5. feat: support evals.json in agentv prompt eval subcommands #545 optional (can be done anytime after feat: parse Agent Skills evals.json format #538)

E2E Verification

```bash
# Wave 1: Parser unit tests (in #538)

# Wave 2: Full pipeline
agentv eval examples/features/agent-skills-evals/evals.json --target claude --out results.jsonl
cat results.jsonl | jq '.test_id'   # "1" and "2"

# Wave 3: Graduation
agentv convert examples/features/agent-skills-evals/evals.json -o converted.yaml
agentv eval converted.yaml --target claude   # equivalent results

# Wave 3: Benchmark
agentv eval evals.json --target claude --benchmark-json benchmark.json
cat benchmark.json | jq '.run_summary'

# CI/CD pattern
agentv eval evals/evals.json --target claude || exit 1
```

Completion Criteria

  • `agentv eval evals.json --target claude` works end-to-end
  • `agentv convert evals.json` produces valid, runnable EVAL.yaml
  • Skill authors can start with evals.json and graduate to EVAL.yaml without rewriting
  • Documentation covers both coding and non-coding agent use cases
  • Each wave has passing E2E tests with realistic examples
