
feat: add multi-turn conversation eval example with details field#507

Open
christso wants to merge 13 commits into main from feature/multi-turn-conversation-eval

Conversation


@christso christso commented Mar 9, 2026

Summary

  • Adds an optional details field to the freeform llm-judge schema for consistency with code-judge and score-range rubric mode, enabling structured per-turn score breakdowns
  • Creates new examples/features/multi-turn-conversation/ example with 2 multi-turn test cases (customer support, technical troubleshooting) and 3 composable judge prompt templates (context-retention, conversation-relevancy, role-adherence)
  • Demonstrates that AgentV can evaluate multi-turn conversations today using composable llm-judge prompts + deterministic assertions, without new evaluator types
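
To make the composable-judge approach concrete, a multi-turn test case in such a dataset might look roughly like this. This is an illustrative sketch only: the field names, assertion types, and template paths are assumptions, not AgentV's actual schema (the real files live under examples/features/multi-turn-conversation/).

```yaml
# Hypothetical multi-turn eval case; field names are illustrative.
tests:
  - name: customer-support-multi-turn
    input:
      - role: system
        content: You are a billing support agent.
      - role: user
        content: I was charged twice this month.
      - role: assistant
        content: I can see the duplicate charge. Shall I refund it?
      - role: user
        content: Yes, please refund it.
    assertions:
      - type: llm-judge
        prompt: prompts/context-retention.md
      - type: contains
        value: refund
```

The point of the example is the shape, not the syntax: role-tagged turns as input, with llm-judge assertions referencing reusable prompt templates rather than a dedicated multi-turn evaluator type.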

Closes #505

Test plan

  • All 53 existing evaluator tests pass (0 regressions)
  • YAML dataset parses correctly (2 tests, with 4 and 3 assertions respectively)
  • Biome lint/format passes
  • Ran agentv eval against the default target: 2/2 tests pass, baseline saved

🤖 Generated with Claude Code

christso and others added 5 commits March 9, 2026 06:46
Adds optional details field to freeform evaluation schema for consistency
with code-judge and score-range rubric mode. Allows llm-judge prompts to
return structured domain-specific metrics alongside score/hits/misses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
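
As a sketch of what this schema change enables (the type and function names below are assumptions for illustration, not AgentV's actual source), a freeform judge verdict with an optional details field could be modeled like this:

```typescript
// Hypothetical model of a freeform llm-judge verdict; AgentV's real
// schema may differ. `details` is an optional free-form object that
// judges can use for structured, domain-specific metrics.
interface FreeformVerdict {
  score: number; // overall score, e.g. 0..1
  hits: string[]; // criteria the answer satisfied
  misses: string[]; // criteria the answer failed
  details?: Record<string, unknown>; // e.g. { scores_per_turn: [1, 0, 1] }
}

// Minimal validation: accept a verdict with or without details.
function parseVerdict(raw: unknown): FreeformVerdict {
  const v = raw as Partial<FreeformVerdict>;
  if (
    typeof v.score !== "number" ||
    !Array.isArray(v.hits) ||
    !Array.isArray(v.misses)
  ) {
    throw new Error("invalid verdict");
  }
  return {
    score: v.score,
    hits: v.hits,
    misses: v.misses,
    ...(v.details !== undefined ? { details: v.details } : {}),
  };
}
```

Keeping details optional means existing freeform judge prompts that return only score/hits/misses keep validating unchanged.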
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 9, 2026

Deploying agentv with Cloudflare Pages

Latest commit: c22b2a1
Status: ✅  Deploy successful!
Preview URL: https://3af40215.agentv.pages.dev
Branch Preview URL: https://feature-multi-turn-conversat.agentv.pages.dev


christso and others added 8 commits March 9, 2026 08:41
Judge templates require {{ answer }} or {{ expected_output }} to pass
prompt validation. Added {{ answer }} section to all 3 judge templates.
Generated baseline from successful e2e run (2/2 tests passing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Judges were only seeing the agent's single output turn. Added {{ input }}
to all 3 templates so judges evaluate context retention against the full
conversation history. Re-generated baseline with corrected templates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated judge prompts to evaluate ALL assistant turns (from both
conversation history and final response), not just the single output
turn. Judges now produce proper per-turn breakdowns (e.g.,
scores_per_turn: [1, 1, 1, 1], total_turns: 4).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
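
A per-turn breakdown like scores_per_turn: [1, 1, 1, 1] naturally reduces to an overall score. As a sketch (the function name and return shape here are assumptions, not project code):

```typescript
// Hypothetical aggregation of per-turn judge scores into an overall
// score, as a judge emitting scores_per_turn in `details` might do.
function aggregateTurnScores(
  scoresPerTurn: number[],
): { score: number; totalTurns: number } {
  if (scoresPerTurn.length === 0) {
    return { score: 0, totalTurns: 0 };
  }
  const sum = scoresPerTurn.reduce((acc, s) => acc + s, 0);
  // Mean of per-turn scores; a judge could equally weight later
  // turns more heavily if context retention matters most at the end.
  return { score: sum / scoresPerTurn.length, totalTurns: scoresPerTurn.length };
}
```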
The existing {{ input }} variable maps to input_segments (flattened text
without role annotations). For multi-turn evaluation, judges need to
distinguish user turns from assistant turns. Added {{ conversation }}
which serializes evalCase.input (TestMessage[] with role fields).

Updated judge templates to use {{ conversation }} so judges see the full
conversation with system/user/assistant role annotations. Judges now
correctly identify and score individual assistant turns (Turn 1, Turn 2,
Turn 3) across the conversation history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changed {{ input }} to serialize evalCase.input (TestMessage[] with role
fields) instead of input_segments (flattened text without roles). This
lets llm-judge templates distinguish user/assistant/system turns, which
is essential for multi-turn conversation evaluation.

Removed the {{ conversation }} variable added in the previous commit —
reusing {{ input }} is simpler and more consistent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
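
A role-annotated serialization along the lines described might look like this. The message shape and function name are assumptions for illustration; AgentV's actual TestMessage type and serializer may differ.

```typescript
// Hypothetical message shape standing in for AgentV's TestMessage.
interface TestMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Render a conversation with role annotations so a judge can tell
// user turns apart from assistant turns, unlike flattened text.
function serializeConversation(messages: TestMessage[]): string {
  return messages.map((m) => `[${m.role}] ${m.content}`).join("\n");
}
```

With this, {{ input }} gives the judge lines like "[assistant] …" for each turn, which is what lets it score Turn 1, Turn 2, and Turn 3 individually.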
File segments in input messages only contained paths (e.g. { type: 'file',
value: 'src/app.ts' }) without the actual file content. The resolved content
existed only in input_segments. Now resolveInputWithFileContent() enriches
the serialized input by looking up file text from input_segments, so judges
see the full file content alongside role annotations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
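
The enrichment step might look roughly like the sketch below, assuming segments carry a type/value pair and resolved file text is available in a path-to-content lookup (standing in for input_segments). The segment shapes and function name are illustrative assumptions; resolveInputWithFileContent's real signature may differ.

```typescript
// Hypothetical segment shapes; AgentV's actual types may differ.
interface FileSegment { type: "file"; value: string } // path only
interface TextSegment { type: "text"; value: string }
type Segment = FileSegment | TextSegment;

// Replace path-only file segments with their resolved content so
// judges see file text alongside the role-annotated conversation.
function enrichFileSegments(
  segments: Segment[],
  resolved: Map<string, string>, // path -> resolved file content
): Segment[] {
  return segments.map((seg) => {
    if (seg.type === "file") {
      const content = resolved.get(seg.value);
      if (content !== undefined) {
        return { type: "text", value: `${seg.value}:\n${content}` };
      }
    }
    // Leave text segments, and unresolvable paths, untouched.
    return seg;
  });
}
```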


Development

Successfully merging this pull request may close these issues.

docs: add multi-turn conversation evaluation example with per-turn score breakdown
