feat: add multi-turn conversation eval example with details field (#507)
Open
Conversation
Adds an optional `details` field to the freeform evaluation schema, for consistency with code-judge and score-range rubric mode. This allows llm-judge prompts to return structured, domain-specific metrics alongside score/hits/misses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
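A minimal sketch of the extended result shape and a validator, assuming the schema described above; the interface and function names here are illustrative, not agentv's actual code:

```typescript
// Freeform judge result with the new optional `details` field.
// score/hits/misses come from the existing schema; `details` holds
// structured, domain-specific metrics (e.g. per-turn breakdowns).
interface FreeformJudgeResult {
  score: number;                      // overall score, 0..1
  hits: string[];                     // rubric points satisfied
  misses: string[];                   // rubric points missed
  details?: Record<string, unknown>;  // optional structured metrics
}

// Hypothetical validator: accepts results with or without `details`,
// so existing judge prompts stay valid.
function parseJudgeResult(raw: unknown): FreeformJudgeResult {
  const obj = raw as Record<string, unknown>;
  if (
    typeof obj.score !== "number" ||
    !Array.isArray(obj.hits) ||
    !Array.isArray(obj.misses)
  ) {
    throw new Error("invalid freeform judge result");
  }
  return {
    score: obj.score,
    hits: obj.hits as string[],
    misses: obj.misses as string[],
    ...(obj.details !== undefined
      ? { details: obj.details as Record<string, unknown> }
      : {}),
  };
}
```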
Deploying agentv with Cloudflare Pages

- Latest commit: c22b2a1
- Status: ✅ Deploy successful!
- Preview URL: https://3af40215.agentv.pages.dev
- Branch Preview URL: https://feature-multi-turn-conversat.agentv.pages.dev
Judge templates require {{ answer }} or {{ expected_output }} to pass
prompt validation. Added {{ answer }} section to all 3 judge templates.
Generated baseline from successful e2e run (2/2 tests passing).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
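A minimal sketch of such a section, assuming a Handlebars-style template; the exact wording of the real judge templates is not shown in this thread:

```text
## Candidate answer

{{ answer }}

Evaluate the answer above against the rubric and return JSON with
"score", "hits", and "misses".
```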
Judges were only seeing the agent's single output turn. Added {{ input }}
to all 3 templates so judges evaluate context retention against the full
conversation history. Re-generated baseline with corrected templates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated judge prompts to evaluate ALL assistant turns (from both conversation history and final response), not just the single output turn. Judges now produce proper per-turn breakdowns (e.g., scores_per_turn: [1, 1, 1, 1], total_turns: 4). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The existing {{ input }} variable maps to input_segments (flattened text
without role annotations). For multi-turn evaluation, judges need to
distinguish user turns from assistant turns. Added {{ conversation }}
which serializes evalCase.input (TestMessage[] with role fields).
Updated judge templates to use {{ conversation }} so judges see the full
conversation with system/user/assistant role annotations. Judges now
correctly identify and score individual assistant turns (Turn 1, Turn 2,
Turn 3) across the conversation history.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changed {{ input }} to serialize evalCase.input (TestMessage[] with role
fields) instead of input_segments (flattened text without roles). This
lets llm-judge templates distinguish user/assistant/system turns, which
is essential for multi-turn conversation evaluation.
Removed the {{ conversation }} variable added in the previous commit —
reusing {{ input }} is simpler and more consistent.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
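The serialization described above might look like the following sketch. `TestMessage`'s real shape lives in agentv, so the interface and formatting here are assumptions:

```typescript
// Assumed message shape; the real TestMessage type is defined in agentv.
interface TestMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Serialize a conversation with role annotations, so an llm-judge can
// distinguish user turns from assistant turns (unlike the old flattened
// input_segments text).
function serializeConversation(messages: TestMessage[]): string {
  return messages.map((m) => `[${m.role}] ${m.content}`).join("\n");
}
```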
File segments in input messages only contained paths (e.g. { type: 'file',
value: 'src/app.ts' }) without the actual file content. The resolved content
existed only in input_segments. Now resolveInputWithFileContent() enriches
the serialized input by looking up file text from input_segments, so judges
see the full file content alongside role annotations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
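`resolveInputWithFileContent()` is named in the commit, but its body is not shown; this is a hypothetical sketch of the described enrichment, with assumed segment shapes:

```typescript
// Assumed segment shapes, not agentv's actual types.
type Segment =
  | { type: "text"; value: string }
  | { type: "file"; value: string }; // value is a path, e.g. 'src/app.ts'

interface ResolvedFile {
  path: string;
  content: string; // file text already resolved into input_segments
}

// Replace bare file paths with the resolved file content, so judges see
// the full file text alongside the role-annotated conversation.
function resolveInputWithFileContent(
  segments: Segment[],
  resolved: ResolvedFile[],
): string {
  const byPath = new Map(resolved.map((r) => [r.path, r.content]));
  return segments
    .map((s) =>
      s.type === "file"
        ? `--- ${s.value} ---\n${byPath.get(s.value) ?? "(content unavailable)"}`
        : s.value,
    )
    .join("\n");
}
```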
Summary
- Adds an optional `details` field to the freeform `llm-judge` schema for consistency with code-judge and score-range rubric mode, enabling structured per-turn score breakdowns
- Adds an `examples/features/multi-turn-conversation/` example with 2 multi-turn test cases (customer support, technical troubleshooting) and 3 composable judge prompt templates (context-retention, conversation-relevancy, role-adherence)
- Built from existing `llm-judge` prompts plus deterministic assertions, without new evaluator types

Closes #505
Test plan
- Ran `agentv eval` against the default target: 2/2 tests pass, baseline saved

🤖 Generated with Claude Code