Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 37 additions & 0 deletions examples/features/multi-turn-conversation/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Multi-Turn Conversation Evaluation

Demonstrates evaluating multi-turn conversation quality using composable
`llm-judge` prompt templates with per-turn score breakdowns.

## What this shows

1. **Multi-turn input** — conversations with 4+ user/assistant turns where
context retention matters
2. **Conversation-aware judge prompts** — markdown templates that receive the
full `{{output}}` Message[] array and evaluate conversation-level qualities
3. **Per-turn score breakdown** — judges return structured `details` with
per-turn scores, not just a flat conversation score
4. **Composability** — multiple llm-judge evaluators combined with
deterministic assertions (e.g., `contains`)

## Judge dimensions

| Judge | What it evaluates |
|-------|-------------------|
| `context-retention.md` | Does the agent remember information from earlier turns? |
| `conversation-relevancy.md` | Are responses relevant to the current request and conversation? |
| `role-adherence.md` | Does the agent maintain its assigned persona? |

## Running

```bash
agentv run default --filter multi-turn-conversation
```

## Creating your own conversation evaluator

1. Create a markdown file in `judges/`
2. Use `{{ output }}` to receive the full conversation Message[] array
3. Use `{{ criteria }}` for the test-specific evaluation criteria
4. Instruct the judge to return `details` with per-turn metrics
5. Reference it in your YAML: `type: llm-judge` / `prompt: ./judges/your-judge.md`
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
{"timestamp":"2026-03-09T10:16:33.509Z","test_id":"support-context-retention","dataset":"dataset.eval","score":0.8,"hits":["Turn 3: remembers and uses customer name Sarah","Turn 3: correctly references order number #98765","Turn 4: acknowledges urgency and Friday deadline","Final Turn: confirms express shipping upgrade and adds gift wrapping as requested","Turn 1: directly acknowledged Sarah by name and referenced her order number","Turn 2: offered relevant expedited shipping solutions aligned with the Friday deadline","Turn 3: confirmed express shipping upgrade and gift wrapping as requested","Turn 3: invited further input for a gift message and additional support","Turn 1: professional and empathetic tone matching support agent role","Turn 2: maintains context—addresses Sarah by name and references order #98765, delivers solution-focused options","Turn 3: addresses Sarah by name, updates her on order status, and offers further personalized support","Turn 3: maintains professional, empathetic, and solution-oriented persona throughout"],"misses":["Output does not contain \"#98765\""],"target":"default","reasoning":"context_retention: The assistant consistently recalls all key context throughout the conversation: Sarah's name, order number, delivery deadline, and gift wrapping request. Professional persona maintained across all turns. | conversation_relevancy: Every assistant turn is contextually relevant, addresses Sarah's immediate needs, and maintains a professional, empathetic support persona. The agent consistently remembers key details and offers solutions directly tied to the conversation. | role_adherence: All assistant turns consistently follow the assigned professional and empathetic support persona, maintain context, and reference customer name, order number, and deadline without breaking character. | contains-#98765: Output does not contain \"#98765\"","input":[{"role":"system","content":"You are a customer support agent for ShopFast. Be professional, empathetic,\nand solution-oriented. Always address customers by name once introduced."},{"role":"user","content":"Hi, my name is Sarah. I placed an order last week and it still hasn't arrived.\nMy order number is #98765."},{"role":"assistant","content":"Hello Sarah! I'm sorry to hear your order hasn't arrived yet. Let me look up\norder #98765 for you right away. I can see it was shipped on March 3rd via\nstandard delivery. It appears there's been a delay at the regional sorting\nfacility. Would you like me to check on expedited options?"},{"role":"user","content":"Yes please. I really need it by Friday — it's a birthday gift for my daughter."},{"role":"assistant","content":"I completely understand the urgency, Sarah. A birthday gift definitely can't\nbe late! I've flagged order #98765 for priority handling. I can offer you two\noptions: (1) upgrade to express shipping at no extra cost, estimated delivery\nThursday, or (2) reroute to our nearest pickup location where it could be\navailable as early as Wednesday."},{"role":"user","content":"The express shipping sounds great. Also, can you add gift wrapping?"}],"scores":[{"name":"context_retention","type":"llm-judge","score":1,"weight":1,"verdict":"pass","hits":["Turn 3: remembers and uses customer name Sarah","Turn 3: correctly references order number #98765","Turn 4: acknowledges urgency and Friday deadline","Final Turn: confirms express shipping upgrade and adds gift wrapping as requested"],"misses":[],"reasoning":"The assistant consistently recalls all key context throughout the conversation: Sarah's name, order number, delivery deadline, and gift wrapping request. Professional persona maintained across all turns.","details":{"scores_per_turn":[1,1,1],"relevant_turns":3,"total_turns":3},"token_usage":{"input":904,"output":166},"duration_ms":3625,"started_at":"2026-03-09T10:16:26.989Z","ended_at":"2026-03-09T10:16:30.614Z"},{"name":"conversation_relevancy","type":"llm-judge","score":1,"weight":2,"verdict":"pass","hits":["Turn 1: directly acknowledged Sarah by name and referenced her order number","Turn 2: offered relevant expedited shipping solutions aligned with the Friday deadline","Turn 3: confirmed express shipping upgrade and gift wrapping as requested","Turn 3: invited further input for a gift message and additional support"],"misses":[],"reasoning":"Every assistant turn is contextually relevant, addresses Sarah's immediate needs, and maintains a professional, empathetic support persona. The agent consistently remembers key details and offers solutions directly tied to the conversation.","details":{"scores_per_turn":[1,1,1],"on_topic_turns":3,"total_turns":3},"token_usage":{"input":925,"output":184},"duration_ms":1456,"started_at":"2026-03-09T10:16:30.614Z","ended_at":"2026-03-09T10:16:32.070Z"},{"name":"role_adherence","type":"llm-judge","score":1,"weight":1,"verdict":"pass","hits":["Turn 1: professional and empathetic tone matching support agent role","Turn 2: maintains context—addresses Sarah by name and references order #98765, delivers solution-focused options","Turn 3: addresses Sarah by name, updates her on order status, and offers further personalized support","Turn 3: maintains professional, empathetic, and solution-oriented persona throughout"],"misses":[],"reasoning":"All assistant turns consistently follow the assigned professional and empathetic support persona, maintain context, and reference customer name, order number, and deadline without breaking character.","details":{"scores_per_turn":[1,1,1],"consistent_turns":3,"total_turns":3},"token_usage":{"input":939,"output":190},"duration_ms":1438,"started_at":"2026-03-09T10:16:32.070Z","ended_at":"2026-03-09T10:16:33.508Z"},{"name":"contains-#98765","type":"contains","score":0,"weight":1,"verdict":"fail","hits":[],"misses":["Output does not contain \"#98765\""],"reasoning":"Output does not contain \"#98765\"","duration_ms":0,"started_at":"2026-03-09T10:16:33.508Z","ended_at":"2026-03-09T10:16:33.508Z"}],"execution_status":"ok"}
{"timestamp":"2026-03-09T10:16:34.697Z","test_id":"support-troubleshooting-flow","dataset":"dataset.eval","score":1,"hits":["Turn 1: Acknowledged prior router restart, skipped that step","Turn 2: Referred to orange blinking light user described in turn 4","Turn 2: Kept instructions simple, matching stated technical comfort level","Turn 2: Suggested watching WiFi and router lights, building on earlier troubleshooting steps","Turn 1: acknowledged router restart, avoided suggesting a repeat, and asked a clear follow-up for diagnostics","Turn 2: matched user's technical level and provided a simple, new troubleshooting step focusing on cable seating","Turn 3: offered next steps if issue recurs, avoided repeating restarted router, gave beginner-friendly instructions","Turn 1: empathetic, professional tone matching technical support role and adapted questions to beginner level","Turn 2: offered an easy next step and gave clear, simple instructions without repeating router restart","Turn 3: maintained helpful, patient tone; provided simple explanations and continued to avoid repeated steps"],"misses":[],"target":"default","reasoning":"context_retention: The assistant consistently retains and builds on context: it avoids repeating already-tried solutions, references the user's technical comfort, tracks symptoms, and provides sequenced, simple next steps. | conversation_relevancy: All assistant turns respond directly to the user's request, build appropriately on prior troubleshooting, match the user's technical comfort, and do not repeat already-tried solutions. | role_adherence: The assistant maintained the technical support persona consistently, adjusted complexity for the user's beginner level, and never repeated previously attempted steps across all turns.","input":[{"role":"system","content":"You are a technical support agent for NetConnect ISP. Adapt your\nexplanations to the customer's technical level. Never suggest steps\nthe customer has already tried."},{"role":"user","content":"My WiFi keeps dropping every 10 minutes. I already restarted my router\ntwice today. I'm not very tech-savvy so please keep it simple."},{"role":"assistant","content":"I'm sorry you're dealing with that — WiFi dropping out is really\nfrustrating! Since you've already restarted the router, let's skip that\nstep. Could you tell me: does the light on the front of your router stay\nsolid green, or does it blink or turn a different color when the WiFi drops?"},{"role":"user","content":"It blinks orange for a few seconds then goes back to green."},{"role":"assistant","content":"That orange blinking usually means the router is briefly losing its\nconnection to our network. Let's try something simple: can you find the\nsmall cable plugged into the back of your router labeled \"WAN\" or\n\"Internet\"? Just unplug it, wait 10 seconds, and plug it back in firmly.\nSometimes it just needs a better connection."},{"role":"user","content":"OK I did that. What if it happens again?"}],"scores":[{"name":"context_retention","type":"llm-judge","score":1,"weight":1,"verdict":"pass","hits":["Turn 1: Acknowledged prior router restart, skipped that step","Turn 2: Referred to orange blinking light user described in turn 4","Turn 2: Kept instructions simple, matching stated technical comfort level","Turn 2: Suggested watching WiFi and router lights, building on earlier troubleshooting steps"],"misses":[],"reasoning":"The assistant consistently retains and builds on context: it avoids repeating already-tried solutions, references the user's technical comfort, tracks symptoms, and provides sequenced, simple next steps.","details":{"scores_per_turn":[1,1],"relevant_turns":2,"total_turns":2},"token_usage":{"input":1004,"output":181},"duration_ms":2908,"started_at":"2026-03-09T10:16:28.589Z","ended_at":"2026-03-09T10:16:31.497Z"},{"name":"conversation_relevancy","type":"llm-judge","score":1,"weight":2,"verdict":"pass","hits":["Turn 1: acknowledged router restart, avoided suggesting a repeat, and asked a clear follow-up for diagnostics","Turn 2: matched user's technical level and provided a simple, new troubleshooting step focusing on cable seating","Turn 3: offered next steps if issue recurs, avoided repeating restarted router, gave beginner-friendly instructions"],"misses":[],"reasoning":"All assistant turns respond directly to the user's request, build appropriately on prior troubleshooting, match the user's technical comfort, and do not repeat already-tried solutions.","details":{"scores_per_turn":[1,1,1],"on_topic_turns":3,"total_turns":3},"token_usage":{"input":1025,"output":180},"duration_ms":1801,"started_at":"2026-03-09T10:16:31.497Z","ended_at":"2026-03-09T10:16:33.298Z"},{"name":"role_adherence","type":"llm-judge","score":1,"weight":1,"verdict":"pass","hits":["Turn 1: empathetic, professional tone matching technical support role and adapted questions to beginner level","Turn 2: offered an easy next step and gave clear, simple instructions without repeating router restart","Turn 3: maintained helpful, patient tone; provided simple explanations and continued to avoid repeated steps"],"misses":[],"reasoning":"The assistant maintained the technical support persona consistently, adjusted complexity for the user's beginner level, and never repeated previously attempted steps across all turns.","details":{"scores_per_turn":[1,1,1],"consistent_turns":3,"total_turns":3},"token_usage":{"input":1039,"output":170},"duration_ms":1399,"started_at":"2026-03-09T10:16:33.298Z","ended_at":"2026-03-09T10:16:34.697Z"}],"execution_status":"ok"}
Loading