Objective
Add a working example demonstrating multi-turn conversation evaluation using AgentV's existing primitives (`llm-judge` with `output: Message[]` and the `details` return field). No core code changes required.
The example should show that AgentV can evaluate multi-turn conversations today — competing with Azure SDK's 25+ named conversational evaluators and DeepEval's `ConversationalTestCase` — using composable llm-judge prompt templates instead of hardcoded metric classes.
Architecture Boundary
docs-examples — This is a new example under `examples/` with reusable prompt templates. No changes to core runtime, eval loop, or schemas.
What the example should demonstrate
- Multi-turn eval case — An `input` with multiple user/assistant turns, where the final user message requires the agent to recall information from earlier turns
- Conversation-aware llm-judge prompts — Prompt templates that receive the full `{{output}}` `Message[]` array and evaluate conversation-level qualities:
  - Context retention (does the agent remember earlier information?)
  - Conversation relevancy (are assistant responses relevant across turns?)
  - Role adherence (does the agent maintain its persona?)
- Per-turn score breakdown via `details` — The llm-judge prompt should instruct the model to return structured `details` with per-turn scores, e.g.:

  ```json
  {
    "score": 0.75,
    "hits": ["Turn 2: correctly recalled user name", "Turn 4: maintained context"],
    "misses": ["Turn 6: forgot order number from turn 1"],
    "reasoning": "3 of 4 assistant turns demonstrated context retention",
    "details": {
      "scores_per_turn": [1.0, 1.0, 0.0, 1.0],
      "relevant_turns": 3,
      "total_turns": 4
    }
  }
  ```

- Composability — Multiple conversation evaluators combined with deterministic assertions:

  ```yaml
  assert:
    - type: llm-judge
      prompt: ./judges/context-retention.md
      required: true
    - type: llm-judge
      prompt: ./judges/conversation-relevancy.md
      weight: 2
    - type: contains
      value: "order #98765"
  ```
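Putting the pieces above together, a multi-turn eval case might look like the following sketch. The file path, `cases`/`name` keys, and message field names are illustrative assumptions, not confirmed AgentV schema; only `input`, `assert`, and the assertion types come from this issue:

```yaml
# examples/multi-turn/eval.yaml — illustrative sketch; top-level keys are assumptions
cases:
  - name: recalls-order-number
    input:
      - role: user
        content: "Hi, I'm Dana. I'm calling about order #98765."
      - role: assistant
        content: "Thanks, Dana! What would you like to know about order #98765?"
      - role: user
        content: "It arrived damaged."
      - role: assistant
        content: "Sorry to hear that, Dana. I can arrange a replacement."
      - role: user
        content: "Great — which order will the replacement be for?"
    assert:
      - type: llm-judge
        prompt: ./judges/context-retention.md
        required: true
      - type: contains
        value: "order #98765"
```

The final user turn can only be answered correctly by recalling the order number from turn 1, which is what makes the case multi-turn rather than a single-shot QA check.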
Design Latitude
- Directory structure and naming within `examples/` is up to the implementer
- Prompt template content and evaluation criteria are flexible — the above is illustrative
- Can use markdown templates or TypeScript `definePromptTemplate` — whichever produces a clearer example
- Number of test cases and conversation scenarios is flexible
- Whether to use a mock target or a real provider is up to the implementer
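A minimal markdown judge template could look like the sketch below. It assumes only that `{{output}}` interpolates the `Message[]` array, as stated above; everything else (wording, exact JSON shape requested) is illustrative:

```markdown
<!-- judges/context-retention.md — illustrative sketch -->
You are evaluating a multi-turn conversation for context retention.

Conversation (full Message[] array):
{{output}}

For each assistant turn, decide whether it correctly uses information
introduced in earlier turns. Return JSON with this shape:

{
  "score": <fraction of assistant turns that retained context>,
  "hits": ["Turn N: <what was retained>", ...],
  "misses": ["Turn N: <what was forgotten>", ...],
  "reasoning": "<one-sentence summary>",
  "details": {
    "scores_per_turn": [<0 or 1 per assistant turn>],
    "relevant_turns": <count>,
    "total_turns": <count>
  }
}
```

Because the evaluator is just a prompt file, swapping in conversation relevancy or role adherence means writing a sibling markdown file, not adding a named evaluator class.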
Acceptance Signals
- Example runs successfully with `agentv run` against a mock or real target
- At least one llm-judge prompt evaluates multi-turn conversation quality
- The judge returns structured `details` with a per-turn breakdown (not just a flat score)
- `hits` and `misses` reference specific turns
- Multiple assert types are composed together (llm-judge + deterministic)
- Example is self-documenting (clear YAML comments or a short README explaining the pattern)
Non-Goals
- No core runtime changes (no new assertion types, no `conversation:` config block)
- No sliding window or per-turn iteration loop in the runner
- No new named evaluator types (e.g., no `conversation-relevancy` assertion type)
- Not a benchmark — this is a usage example, not a comprehensive test suite
Context
Research in agentevals-research found that AgentV's existing llm-judge + details field already supports multi-turn conversation evaluation without core changes — matching what Azure SDK does with 25+ hardcoded evaluators and what DeepEval does with ConversationalTestCase. The composability advantage (prompt file = evaluator) means users can create any conversation metric in markdown rather than waiting for named evaluator implementations.
Key finding: the `CodeJudgeResultSchema.details` field (`z.record(z.unknown())`) already accepts arbitrary structured data, so per-turn score breakdowns flow through to the output JSONL today.
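To illustrate why this works without schema changes, here is a small self-contained TypeScript sketch. The record check is a hand-rolled stand-in for zod's `z.record(z.unknown())` (to keep the snippet dependency-free), and the JSON round-trip mimics the breakdown surviving into output JSONL:

```typescript
// Stand-in for z.record(z.unknown()): any plain object with string keys
// passes, regardless of the shapes of its values.
function isDetailsRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Per-turn breakdown as a judge would return it in `details`.
const details = {
  scores_per_turn: [1.0, 1.0, 0.0, 1.0],
  relevant_turns: 3,
  total_turns: 4,
};

// The arbitrary breakdown passes the record check...
console.log(isDetailsRecord(details)); // true

// ...and round-trips through JSON serialization (as JSONL output would) intact.
const roundTripped = JSON.parse(JSON.stringify(details));
console.log(roundTripped.relevant_turns); // 3
```

The point is that nothing in the pipeline inspects the keys inside `details`, so `scores_per_turn` and friends reach the output file verbatim.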