
docs: add multi-turn conversation evaluation example with per-turn score breakdown #505

@christso

Description

Objective

Add a working example demonstrating multi-turn conversation evaluation using AgentV's existing primitives: the `llm-judge` assertion, whose `output` is the full `Message[]` conversation, and its structured `details` return field. No core code changes are required.

The example should show that AgentV can evaluate multi-turn conversations today — competing with the Azure SDK's 25+ named conversational evaluators and DeepEval's `ConversationalTestCase` — using composable `llm-judge` prompt templates instead of hardcoded metric classes.

Architecture Boundary

docs-examples — This is a new example under examples/ with reusable prompt templates. No changes to core runtime, eval loop, or schemas.

What the example should demonstrate

  1. Multi-turn eval case — An `input` with multiple user/assistant turns, where the final user message requires the agent to recall information from earlier turns

  2. Conversation-aware llm-judge prompts — Prompt templates that receive the full `{{output}}` Message[] array and evaluate conversation-level qualities:

    • Context retention (does the agent remember earlier information?)
    • Conversation relevancy (are assistant responses relevant across turns?)
    • Role adherence (does the agent maintain its persona?)
  3. Per-turn score breakdown via `details` — The llm-judge prompt should instruct the model to return structured `details` with per-turn scores, e.g.:

    {
      "score": 0.75,
      "hits": ["Turn 2: correctly recalled user name", "Turn 4: maintained context"],
      "misses": ["Turn 6: forgot order number from turn 1"],
      "reasoning": "3 of 4 assistant turns demonstrated context retention",
      "details": {
        "scores_per_turn": [1.0, 1.0, 0.0, 1.0],
        "relevant_turns": 3,
        "total_turns": 4
      }
    }
  4. Composability — Multiple conversation evaluators combined with deterministic assertions:

    assert:
      - type: llm-judge
        prompt: ./judges/context-retention.md
        required: true
      - type: llm-judge
        prompt: ./judges/conversation-relevancy.md
        weight: 2
      - type: contains
        value: "order #98765"
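
For instance, points 1 and 4 might come together in a single eval file along these lines. This is a sketch, not a prescribed layout: the file path, the scenario text, and the exact message shape under `input` are assumptions; only `input`, `assert`, the assertion types, `prompt`, `required`, `weight`, and `value` come from the fields described above.

```yaml
# examples/multi-turn/support-recall.yaml (path illustrative)
# The final user turn can only be answered well by recalling facts
# from turn 1 (the user's name and order number).
- input:
    - role: user
      content: "Hi, I'm Dana. I'm writing about order #98765."
    - role: assistant
      content: "Hi Dana! I can help with order #98765. What's going on?"
    - role: user
      content: "It arrived damaged."
    - role: assistant
      content: "Sorry to hear that, Dana. I'll start a replacement."
    - role: user
      content: "Can you confirm which order this is and who you're helping?"
  assert:
    - type: llm-judge
      prompt: ./judges/context-retention.md
      required: true
    - type: contains
      value: "order #98765"
```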

Design Latitude

  • Directory structure and naming within examples/ is up to the implementer
  • Prompt template content and evaluation criteria are flexible — the above is illustrative
  • Can use markdown templates or TypeScript definePromptTemplate — whichever produces a clearer example
  • Number of test cases and conversation scenarios is flexible
  • Whether to use a mock target or real provider is up to the implementer
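
As one illustration of the markdown-template route, a context-retention judge could be a single prompt file along these lines. The content is illustrative, and it assumes the template receives the conversation via the `{{output}}` placeholder described above:

```markdown
# Context Retention Judge

You are grading a multi-turn conversation. For each assistant turn,
decide whether it correctly uses information introduced in earlier turns.

Conversation:
{{output}}

Return JSON with:
- `score`: fraction of assistant turns that retained context (0.0 to 1.0)
- `hits` / `misses`: findings that reference specific turns ("Turn 3: ...")
- `reasoning`: one sentence summarizing the grade
- `details.scores_per_turn`: one 0/1 score per assistant turn
```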

Acceptance Signals

  • Example runs successfully with `agentv run` against a mock or real target
  • At least one `llm-judge` prompt evaluates multi-turn conversation quality
  • The judge returns structured `details` with a per-turn breakdown (not just a flat score)
  • `hits` and `misses` reference specific turns
  • Multiple `assert` types composed together (`llm-judge` + deterministic)
  • Example is self-documenting (clear YAML comments or a short README explaining the pattern)

Non-Goals

  • No core runtime changes (no new assertion types, no `conversation:` config block)
  • No sliding window or per-turn iteration loop in the runner
  • No new named evaluator types (e.g., no `conversation-relevancy` assertion type)
  • Not a benchmark — this is a usage example, not a comprehensive test suite

Context

Research in agentevals-research found that AgentV's existing llm-judge + details field already supports multi-turn conversation evaluation without core changes — matching what Azure SDK does with 25+ hardcoded evaluators and what DeepEval does with ConversationalTestCase. The composability advantage (prompt file = evaluator) means users can create any conversation metric in markdown rather than waiting for named evaluator implementations.

Key finding: the `CodeJudgeResultSchema.details` field (`z.record(z.unknown())`) already accepts arbitrary structured data, so per-turn score breakdowns flow through to the output JSONL today.
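
Because `details` is an open record, the judge can return any per-turn structure. A minimal TypeScript sketch (names hypothetical, not part of AgentV) of the aggregation the judge prompt is instructed to perform, consistent with the JSON example above:

```typescript
// Hypothetical helper mirroring what the judge prompt is asked to do:
// score each assistant turn 0 or 1, then report the mean as the flat
// score alongside the per-turn breakdown. AgentV itself does no
// aggregation; this just shows the two views stay consistent.
type PerTurnDetails = {
  scores_per_turn: number[];
  relevant_turns: number;
  total_turns: number;
};

function summarize(scoresPerTurn: number[]): { score: number; details: PerTurnDetails } {
  const total = scoresPerTurn.length;
  const relevant = scoresPerTurn.filter((s) => s >= 1).length;
  const score = scoresPerTurn.reduce((sum, s) => sum + s, 0) / total;
  return {
    score,
    details: { scores_per_turn: scoresPerTurn, relevant_turns: relevant, total_turns: total },
  };
}

const result = summarize([1.0, 1.0, 0.0, 1.0]);
console.log(result.score); // 0.75
console.log(result.details.relevant_turns); // 3
```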
