Objective
Add a working example demonstrating multi-turn conversation evaluation using AgentV's existing primitives (`llm-judge` with `output: Message[]` and the `details` return field). No core code changes required.
The example should show that AgentV can evaluate multi-turn conversations today — competing with Azure SDK's 25+ named conversational evaluators and DeepEval's `ConversationalTestCase` — using composable llm-judge prompt templates instead of hardcoded metric classes.
Architecture Boundary
docs-examples — This is a new example under `examples/` with reusable prompt templates. No changes to core runtime, eval loop, or schemas.
What the example should demonstrate
- Multi-turn eval case — An `input` with multiple user/assistant turns, where the final user message requires the agent to recall information from earlier turns
- Conversation-aware llm-judge prompts — Prompt templates that receive the full `{{output}}` `Message[]` array and evaluate conversation-level qualities:
  - Context retention (does the agent remember earlier information?)
  - Conversation relevancy (are assistant responses relevant across turns?)
  - Role adherence (does the agent maintain its persona?)
- Per-turn score breakdown via `details` — The llm-judge prompt should instruct the model to return structured `details` with per-turn scores, e.g.:

  ```json
  {
    "score": 0.75,
    "hits": ["Turn 2: correctly recalled user name", "Turn 4: maintained context"],
    "misses": ["Turn 6: forgot order number from turn 1"],
    "reasoning": "3 of 4 assistant turns demonstrated context retention",
    "details": {
      "scores_per_turn": [1.0, 1.0, 0.0, 1.0],
      "relevant_turns": 3,
      "total_turns": 4
    }
  }
  ```

- Composability — Multiple conversation evaluators combined with deterministic assertions:

  ```yaml
  assert:
    - type: llm-judge
      prompt: ./judges/context-retention.md
      required: true
    - type: llm-judge
      prompt: ./judges/conversation-relevancy.md
      weight: 2
    - type: contains
      value: "order #98765"
  ```
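Putting the pieces above together, a multi-turn eval case might look like the following sketch. The file path, `cases`/`name` keys, and message field names are illustrative assumptions, not confirmed AgentV schema; only `input`, `assert`, and the assertion types come from this issue:

```yaml
# examples/multi-turn/eval.yaml — illustrative sketch; top-level keys are assumptions
cases:
  - name: recalls-order-number
    input:
      - role: user
        content: "Hi, I'm Dana. I'm calling about order #98765."
      - role: assistant
        content: "Thanks, Dana! What would you like to know about order #98765?"
      - role: user
        content: "It arrived damaged."
      - role: assistant
        content: "Sorry to hear that, Dana. I can arrange a replacement."
      - role: user
        content: "Great — which order will the replacement be for?"
    assert:
      - type: llm-judge
        prompt: ./judges/context-retention.md
        required: true
      - type: contains
        value: "order #98765"
```

The final user turn can only be answered correctly by recalling the order number from turn 1, which is what makes the case multi-turn rather than a single-shot QA check.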
Design Latitude
- Directory structure and naming within `examples/` is up to the implementer
- Prompt template content and evaluation criteria are flexible — the above is illustrative
- Can use markdown templates or TypeScript `definePromptTemplate` — whichever produces a clearer example
- Number of test cases and conversation scenarios is flexible
- Whether to use a mock target or a real provider is up to the implementer
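A minimal markdown judge template could look like the sketch below. It assumes only that `{{output}}` interpolates the `Message[]` array, as stated above; everything else (wording, exact JSON shape requested) is illustrative:

```markdown
<!-- judges/context-retention.md — illustrative sketch -->
You are evaluating a multi-turn conversation for context retention.

Conversation (full Message[] array):
{{output}}

For each assistant turn, decide whether it correctly uses information
introduced in earlier turns. Return JSON with this shape:

{
  "score": <fraction of assistant turns that retained context>,
  "hits": ["Turn N: <what was retained>", ...],
  "misses": ["Turn N: <what was forgotten>", ...],
  "reasoning": "<one-sentence summary>",
  "details": {
    "scores_per_turn": [<0 or 1 per assistant turn>],
    "relevant_turns": <count>,
    "total_turns": <count>
  }
}
```

Because the evaluator is just a prompt file, swapping in conversation relevancy or role adherence means writing a sibling markdown file, not adding a named evaluator class.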
Acceptance Signals
- Example runs successfully with `agentv run` against a mock or real target
- At least one llm-judge prompt evaluates multi-turn conversation quality
- The judge returns structured `details` with a per-turn breakdown (not just a flat score)
- `hits` and `misses` reference specific turns
- Multiple assert types are composed together (llm-judge + deterministic)
- Example is self-documenting (clear YAML comments or a short README explaining the pattern)
Non-Goals
- No core runtime changes (no new assertion types, no `conversation:` config block)
- No sliding window or per-turn iteration loop in the runner
- No new named evaluator types (e.g., no `conversation-relevancy` assertion type)
- Not a benchmark — this is a usage example, not a comprehensive test suite
Context
Research in agentevals-research found that AgentV's existing llm-judge + details field already supports multi-turn conversation evaluation without core changes — matching what Azure SDK does with 25+ hardcoded evaluators and what DeepEval does with ConversationalTestCase. The composability advantage (prompt file = evaluator) means users can create any conversation metric in markdown rather than waiting for named evaluator implementations.
Key finding: the `CodeJudgeResultSchema.details` field (`z.record(z.unknown())`) already accepts arbitrary structured data, so per-turn score breakdowns flow through to the output JSONL today.
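To illustrate why this works without schema changes, here is a small self-contained TypeScript sketch. The record check is a hand-rolled stand-in for zod's `z.record(z.unknown())` (to keep the snippet dependency-free), and the JSON round-trip mimics the breakdown surviving into output JSONL:

```typescript
// Stand-in for z.record(z.unknown()): any plain object with string keys
// passes, regardless of the shapes of its values.
function isDetailsRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Per-turn breakdown as a judge would return it in `details`.
const details = {
  scores_per_turn: [1.0, 1.0, 0.0, 1.0],
  relevant_turns: 3,
  total_turns: 4,
};

// The arbitrary breakdown passes the record check...
console.log(isDetailsRecord(details)); // true

// ...and round-trips through JSON serialization (as JSONL output would) intact.
const roundTripped = JSON.parse(JSON.stringify(details));
console.log(roundTripped.relevant_turns); // 3
```

The point is that nothing in the pipeline inspects the keys inside `details`, so `scores_per_turn` and friends reach the output file verbatim.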