feat: add multi-turn conversation eval example with details field (#507)
Open
Conversation
Adds an optional `details` field to the freeform evaluation schema, for consistency with code-judge and score-range rubric mode. This allows llm-judge prompts to return structured, domain-specific metrics alongside score/hits/misses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
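A minimal sketch of the extended result shape and a validator, assuming the schema described above; the interface and function names here are illustrative, not agentv's actual code:

```typescript
// Freeform judge result with the new optional `details` field.
// score/hits/misses come from the existing schema; `details` holds
// structured, domain-specific metrics (e.g. per-turn breakdowns).
interface FreeformJudgeResult {
  score: number;                      // overall score, 0..1
  hits: string[];                     // rubric points satisfied
  misses: string[];                   // rubric points missed
  details?: Record<string, unknown>;  // optional structured metrics
}

// Hypothetical validator: accepts results with or without `details`,
// so existing judge prompts stay valid.
function parseJudgeResult(raw: unknown): FreeformJudgeResult {
  const obj = raw as Record<string, unknown>;
  if (
    typeof obj.score !== "number" ||
    !Array.isArray(obj.hits) ||
    !Array.isArray(obj.misses)
  ) {
    throw new Error("invalid freeform judge result");
  }
  return {
    score: obj.score,
    hits: obj.hits as string[],
    misses: obj.misses as string[],
    ...(obj.details !== undefined
      ? { details: obj.details as Record<string, unknown> }
      : {}),
  };
}
```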
Deploying agentv with Cloudflare Pages

- Latest commit: c22b2a1
- Status: ✅ Deploy successful!
- Preview URL: https://3af40215.agentv.pages.dev
- Branch Preview URL: https://feature-multi-turn-conversat.agentv.pages.dev
Judge templates require {{ answer }} or {{ expected_output }} to pass
prompt validation. Added {{ answer }} section to all 3 judge templates.
Generated baseline from successful e2e run (2/2 tests passing).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
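A minimal sketch of such a section, assuming a Handlebars-style template; the exact wording of the real judge templates is not shown in this thread:

```text
## Candidate answer

{{ answer }}

Evaluate the answer above against the rubric and return JSON with
"score", "hits", and "misses".
```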
Judges were only seeing the agent's single output turn. Added {{ input }}
to all 3 templates so judges evaluate context retention against the full
conversation history. Re-generated baseline with corrected templates.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated judge prompts to evaluate ALL assistant turns (from both conversation history and final response), not just the single output turn. Judges now produce proper per-turn breakdowns (e.g., scores_per_turn: [1, 1, 1, 1], total_turns: 4). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The existing {{ input }} variable maps to input_segments (flattened text
without role annotations). For multi-turn evaluation, judges need to
distinguish user turns from assistant turns. Added {{ conversation }}
which serializes evalCase.input (TestMessage[] with role fields).
Updated judge templates to use {{ conversation }} so judges see the full
conversation with system/user/assistant role annotations. Judges now
correctly identify and score individual assistant turns (Turn 1, Turn 2,
Turn 3) across the conversation history.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changed {{ input }} to serialize evalCase.input (TestMessage[] with role
fields) instead of input_segments (flattened text without roles). This
lets llm-judge templates distinguish user/assistant/system turns, which
is essential for multi-turn conversation evaluation.
Removed the {{ conversation }} variable added in the previous commit —
reusing {{ input }} is simpler and more consistent.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
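The serialization described above might look like the following sketch. `TestMessage`'s real shape lives in agentv, so the interface and formatting here are assumptions:

```typescript
// Assumed message shape; the real TestMessage type is defined in agentv.
interface TestMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Serialize a conversation with role annotations, so an llm-judge can
// distinguish user turns from assistant turns (unlike the old flattened
// input_segments text).
function serializeConversation(messages: TestMessage[]): string {
  return messages.map((m) => `[${m.role}] ${m.content}`).join("\n");
}
```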
File segments in input messages only contained paths (e.g. { type: 'file',
value: 'src/app.ts' }) without the actual file content. The resolved content
existed only in input_segments. Now resolveInputWithFileContent() enriches
the serialized input by looking up file text from input_segments, so judges
see the full file content alongside role annotations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
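`resolveInputWithFileContent()` is named in the commit, but its body is not shown; this is a hypothetical sketch of the described enrichment, with assumed segment shapes:

```typescript
// Assumed segment shapes, not agentv's actual types.
type Segment =
  | { type: "text"; value: string }
  | { type: "file"; value: string }; // value is a path, e.g. 'src/app.ts'

interface ResolvedFile {
  path: string;
  content: string; // file text already resolved into input_segments
}

// Replace bare file paths with the resolved file content, so judges see
// the full file text alongside the role-annotated conversation.
function resolveInputWithFileContent(
  segments: Segment[],
  resolved: ResolvedFile[],
): string {
  const byPath = new Map(resolved.map((r) => [r.path, r.content]));
  return segments
    .map((s) =>
      s.type === "file"
        ? `--- ${s.value} ---\n${byPath.get(s.value) ?? "(content unavailable)"}`
        : s.value,
    )
    .join("\n");
}
```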
Summary
- Adds an optional `details` field to the freeform `llm-judge` schema for consistency with code-judge and score-range rubric mode, enabling structured per-turn score breakdowns
- Adds an `examples/features/multi-turn-conversation/` example with 2 multi-turn test cases (customer support, technical troubleshooting) and 3 composable judge prompt templates (context-retention, conversation-relevancy, role-adherence)
- Built from existing `llm-judge` prompts plus deterministic assertions, without new evaluator types

Closes #505
Test plan
- Ran `agentv eval` against the default target: 2/2 tests pass, baseline saved

🤖 Generated with Claude Code