42 commits
e11b96f
feat: add multi-turn dataset manager with flat JSONL support
tianmu-li Apr 23, 2026
4a135ff
feat: add ConversationManager and MultiTurnStrategy
tianmu-li Apr 23, 2026
eb99f58
test: add multi-turn unit and integration tests
tianmu-li Apr 24, 2026
75b64d6
feat: wire multi-turn into benchmark execution pipeline
tianmu-li Apr 24, 2026
1a41869
docs: add multi-turn quickstart, examples, and conversion scripts
tianmu-li Apr 25, 2026
00310b4
fix: replace hardcoded /model/ path in validate_jsonl_schema.py docst…
tianmu-li Apr 25, 2026
2961de5
chore: move multi_turn_dataset_schema.json into scripts/ and update d…
tianmu-li Apr 25, 2026
039f72c
fix: address PR #285 review comments for multi-turn implementation
tianmu-li Apr 28, 2026
c53e5d5
fix: improve multi-turn PromptData text and add concurrent stress test
tianmu-li Apr 28, 2026
7495a45
refactor: replace semaphore with worker-pool concurrency in MultiTurn…
tianmu-li May 4, 2026
7aa45f5
fix: address remaining PR #285 review comments for multi-turn impleme…
tianmu-li May 4, 2026
aedbbe6
fix: address remaining PR #285 review comments
tianmu-li May 4, 2026
c3cd497
refactor: replace worker-pool with event-driven model in MultiTurnStr…
tianmu-li May 4, 2026
c2ab3a7
fix: address PR #285 review comments for multi-turn implementation
tianmu-li May 6, 2026
adaa8b4
Import fix
tianmu-li May 6, 2026
38d0ef0
fix: revert out-of-scope live-history tool_call_id rewriting
tianmu-li May 6, 2026
d2dace8
Fix issue with tool call accumulation and reasoning content
tianmu-li May 6, 2026
a7ef9e5
feat: account for tool-call tokens in OSL / TPOT / TPS metrics
tianmu-li May 7, 2026
452da2f
fix: correct chat-template tokenization for tool-call messages
tianmu-li May 7, 2026
7bde10b
docs: fix stale references and tool-row format in multi-turn docs
tianmu-li May 8, 2026
408ed21
feat: pre-compute ISL token counts for multi-turn dataset-history mode
tianmu-li May 8, 2026
a003c9a
fix: unwrap BatchEncoding from apply_chat_template for Qwen3 tokenizer
tianmu-li May 8, 2026
72c20f5
fix: accuracy phases now inherit configured load pattern instead of f…
tianmu-li May 11, 2026
5b8f515
Fix pre-commit
tianmu-li May 13, 2026
9ad9612
Fix CI error for completion
tianmu-li May 13, 2026
857db5b
Change to SSE choice for test completion
tianmu-li May 13, 2026
75aa9e2
fix: address PR #285 review deficiencies in multi-turn stack
tianmu-li May 13, 2026
8abfc30
fix: close residual PR #285 review deficiencies
tianmu-li May 13, 2026
5169265
fix: drop jinja2 import and fix test mocks for ISL precompute
tianmu-li May 13, 2026
191c320
fix: address Copilot review comments on multi-turn implementation
tianmu-li May 14, 2026
51d37dd
refactor: typed ConversationMetadata dataclass, single build in load(…
tianmu-li May 14, 2026
dda44bb
fix: address PR #285 round-4 review comments
tianmu-li May 14, 2026
595faf4
Send conversation_id and turn number
tianmu-li May 14, 2026
1272386
Address perf concerns
tianmu-li May 14, 2026
acb7464
Merge remote-tracking branch 'origin/main' into feat/tool_sequences
tianmu-li May 14, 2026
af035d9
fix: address PR #285 round-5 Copilot review comments
tianmu-li May 14, 2026
dd47796
fix: update test_schema.py error message assertions for Fix 5
tianmu-li May 14, 2026
15a4108
Fix ci test failure post merge
tianmu-li May 14, 2026
2ac66be
fix: address PR #285 round-6 Copilot review comments
tianmu-li May 15, 2026
4442364
Fix double-firing of timed-out turns
tianmu-li May 15, 2026
03b96b9
feat: stamp conversation_id and turn on EventRecord pipeline
tianmu-li May 15, 2026
fb2ff32
fix: sum_sq overflow, streaming conv stamping, tool-call metric coverage
tianmu-li May 15, 2026
280 changes: 280 additions & 0 deletions docs/MULTI_TURN_QUICKSTART.md
@@ -0,0 +1,280 @@
# Multi-Turn Conversation Benchmarking - Quick Start Guide

## Quick Start in 5 Minutes

### 1. Prepare Your Dataset

Create a JSONL file with your conversations. All rows for a given `conversation_id` must appear
**consecutively** in the file (no interleaving with other conversations):

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hello!", "system": "You are a helpful assistant"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hi! How can I help?"}
{"conversation_id": "c1", "turn": 3, "role": "user", "content": "What's 2+2?"}
{"conversation_id": "c1", "turn": 4, "role": "assistant", "content": "2+2 equals 4."}
```

**Rules**:

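- Alternate between `user` and `assistant` roles (agentic datasets may also contain `tool` rows; see the valid role sequences under Troubleshooting)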
- Start with "user" role
- Sequential turn numbers (1, 2, 3, ...)
- Same `conversation_id` for all turns in a conversation
- All rows for the same `conversation_id` must be grouped together
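
To make the ordering rules concrete, here is a minimal sketch of a pre-flight check for grouping, turn numbering, and the first role. It is not the project's validator: the function name `check_rows` and its messages are hypothetical, and it assumes every row carries `conversation_id`, `turn`, and `role`.

```python
import json
from typing import Iterable


def check_rows(lines: Iterable[str]) -> list[str]:
    """Return human-readable problems found in flat multi-turn JSONL rows.

    Hypothetical helper for illustration; the real checks live in the loader.
    """
    problems: list[str] = []
    last_turn: dict[str, int] = {}  # last turn number seen per conversation
    current_id = None               # conversation_id of the block being read

    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        conv_id, turn, role = row["conversation_id"], row["turn"], row["role"]

        if conv_id != current_id:
            if conv_id in last_turn:
                # We have seen this conversation before, then left it: interleaved.
                problems.append(f"line {line_no}: rows for {conv_id!r} are not consecutive")
            elif role != "user":
                problems.append(f"line {line_no}: {conv_id!r} does not start with a user row")
            current_id = conv_id

        expected = last_turn.get(conv_id, 0) + 1
        if turn != expected:
            problems.append(f"line {line_no}: {conv_id!r} has turn {turn}, expected {expected}")
        last_turn[conv_id] = turn

    return problems
```

Calling `check_rows(open("conversations.jsonl"))` returns an empty list when these three rules hold. Role alternation is deliberately left out of this sketch.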

### 2. Create Configuration File

Save as `multi_turn_config.yaml`:

```yaml
name: "my-multi-turn-benchmark"
version: "1.0"
type: "online"

model_params:
  name: "your-model-name"
  temperature: 0.7
  max_new_tokens: 256

datasets:
  - name: my_conversations
    type: performance
    path: path/to/your/conversations.jsonl
    multi_turn: # ← Presence of this block enables multi-turn mode
      turn_timeout_s: 300 # ← Max wait for prev turn

settings:
  load_pattern:
    type: multi_turn # ← Use multi-turn scheduler
    target_concurrency: 32 # ← Required: max simultaneous conversations

  client:
    num_workers: 4

endpoint_config:
  endpoints:
    - "http://your-endpoint:8000"
  api_type: openai

report_dir: logs/my_multi_turn_benchmark
```

Results are written to `report_dir` (here: `logs/my_multi_turn_benchmark/`).

### 3. Run Benchmark

```bash
inference-endpoint benchmark from-config --config multi_turn_config.yaml
```

That's it! Your benchmark will now:

- ✅ Enforce turn ordering (turn N+1 waits for turn N)
- ✅ Include conversation history in each request
- ✅ Log all issued (client) turns to events.jsonl — scripted assistant rows are context only and do not produce sample events

---

## Understanding Results

After the benchmark completes, check the directory configured via `report_dir`:

### Events Log

The `events.jsonl` file contains one JSON record per line, with the standard
`sample_uuid`, `event_type`, and `timestamp_ns` fields. Events are keyed by
`sample_uuid` only. To correlate events with conversations, join through
`sample_idx_map.json` (written next to `events.jsonl`) and the multi-turn
dataset's `conversation_metadata["samples"]`, which maps sample indices to
`(conversation_id, turn)` tuples.
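
The join described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes `sample_idx_map.json` decodes to a mapping from `sample_uuid` to an integer sample index, and that `conversation_metadata["samples"]` can be indexed by that integer to get a `(conversation_id, turn)` pair. Neither shape is confirmed here, and `attach_conversations` is a hypothetical helper name.

```python
from typing import Any, Iterable, Iterator, Mapping


def attach_conversations(
    events: Iterable[Mapping[str, Any]],
    uuid_to_idx: Mapping[str, int],  # assumed shape of sample_idx_map.json
    samples: Any,                    # assumed: samples[idx] -> (conversation_id, turn)
) -> Iterator[tuple[str, int, Mapping[str, Any]]]:
    """Yield (conversation_id, turn, event) for every event that can be resolved."""
    for event in events:
        idx = uuid_to_idx.get(event["sample_uuid"])
        if idx is None:
            continue  # event does not belong to a mapped sample
        conversation_id, turn = samples[idx]
        yield conversation_id, turn, event
```

Sorting the yielded tuples by `(conversation_id, turn, event["timestamp_ns"])` then gives a per-conversation timeline.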

### Metrics

Currently available:

- **Per-turn metrics**: Latency, TTFT, TPOT for each turn
- **Conversation tracking**: events are keyed by `sample_uuid` only; correlate any event back to a conversation by joining through `sample_idx_map.json` and `conversation_metadata["samples"]`

_Note: Per-conversation aggregation (e.g., "conversations/sec") is coming in a future update._

---

## Concurrency Control

`target_concurrency` is **required** for the `multi_turn` load pattern. It controls how many
conversations are active simultaneously. Each active conversation has exactly one in-flight turn
at a time — a worker issues turn N, waits for the response, then issues turn N+1. A new
conversation starts only after a worker finishes all turns of its current one.

```yaml
settings:
  load_pattern:
    type: multi_turn
    target_concurrency: 32 # ← 32 conversations active simultaneously
```
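
The semantics described above (at most `target_concurrency` conversations active, one in-flight turn per conversation) can be illustrated with a small `asyncio` sketch. It models the behavior only and is not the scheduler's actual implementation; `run_conversations` and `send_turn` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_conversations(
    conversations: Sequence[tuple[str, Sequence[int]]],  # (conversation_id, turns)
    target_concurrency: int,
    send_turn: Callable[[str, int], Awaitable[None]],
) -> None:
    """Run conversations so that at most `target_concurrency` are active at once."""
    queue: asyncio.Queue = asyncio.Queue()
    for conversation in conversations:
        queue.put_nowait(conversation)

    async def worker() -> None:
        while True:
            try:
                conv_id, turns = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no conversations left for this worker
            for turn in turns:
                # Turn N+1 is not issued until turn N has returned.
                await send_turn(conv_id, turn)

    # One worker per concurrency slot; each picks up a new conversation
    # only after finishing every turn of its current one.
    await asyncio.gather(*(worker() for _ in range(target_concurrency)))
```

With `target_concurrency=1` the conversations run strictly one after another, which matches the single-conversation debugging setup.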

---

## Troubleshooting

### Validate Your Dataset Before Running

Use the bundled validation script to check your JSONL file for schema errors before benchmarking:

```bash
python scripts/validate_jsonl_schema.py path/to/your/conversations.jsonl
```

This catches per-row schema errors (missing required fields, wrong types,
malformed `tool_results`). Cross-row invariants (consecutive turn numbers,
valid role sequences, grouped conversations) are enforced by
`MultiTurnDataset` at load time and will surface at benchmark startup.

### "Conversation has invalid role sequence"

**Problem**: Your dataset doesn't follow a valid role sequence.

**Fix**: Check your JSONL. Valid sequences:

- Plain chat: `user → assistant → user → assistant → ...`
- Agentic (tool-use): `user → assistant → tool → assistant → tool → ... → user`

Conversations may also end with a `tool` row (the model's response to the final tool call is the benchmark target).
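
The sequences above can be expressed as a small transition table. The sketch below is an illustration, not the loader's actual rule set: it assumes `tool` is always followed by `assistant` and that consecutive `tool` rows are invalid, which may be stricter than `MultiTurnDataset`. The names `ALLOWED_NEXT` and `first_invalid_role` are hypothetical.

```python
from typing import Optional, Sequence

# Allowed next roles, keyed by the previous role (None means start of conversation).
# Assumption: tool -> assistant only; the real loader may accept more.
ALLOWED_NEXT = {
    None: {"user"},
    "user": {"assistant"},
    "assistant": {"user", "tool"},
    "tool": {"assistant"},
}


def first_invalid_role(roles: Sequence[str]) -> Optional[int]:
    """Return the index of the first role that breaks the sequence, or None."""
    prev = None
    for i, role in enumerate(roles):
        if role not in ALLOWED_NEXT[prev]:
            return i
        prev = role
    return None
```

Because only transitions are checked, a conversation that ends on a `tool` row passes, consistent with the note above.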

### "Rows for conversation X are not consecutive"

**Problem**: Rows for the same `conversation_id` are interleaved with rows from other conversations.

**Fix**: Sort your JSONL so all rows for each conversation appear together.
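
One way to do this is sketched below, assuming the file fits in memory and every row has `conversation_id` and `turn`. The helper name `regroup_rows` is hypothetical; it keeps conversations in the order they first appear and sorts each conversation by turn number.

```python
import json
from typing import Iterable


def regroup_rows(lines: Iterable[str]) -> list[str]:
    """Group rows by conversation_id (first-seen order) and sort each group by turn."""
    groups: dict[str, list[dict]] = {}  # dicts preserve insertion order
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        groups.setdefault(row["conversation_id"], []).append(row)

    out: list[str] = []
    for rows in groups.values():
        rows.sort(key=lambda r: r["turn"])
        out.extend(json.dumps(r, ensure_ascii=False) for r in rows)
    return out


# Example usage (paths are placeholders):
# with open("conversations.jsonl") as src, open("grouped.jsonl", "w") as dst:
#     dst.write("\n".join(regroup_rows(src)) + "\n")
```

Writing to a new file rather than overwriting the input makes it easy to diff the result before benchmarking.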

### "Turn timed out waiting for prev turn"

**Problem**: Previous turn took longer than `turn_timeout_s`.

**Fixes**:

1. Increase `turn_timeout_s` in config
2. Check if your endpoint is slow or unresponsive
3. Look for errors in the endpoint logs

### Dataset not loading

**Problem**: MultiTurnDataset not recognized.

**Fix**: Ensure `multi_turn:` block is present in the dataset config. The file format
is auto-detected from the `.jsonl` extension — no `format` field is needed:

```yaml
datasets:
  - path: your_file.jsonl
    multi_turn: {}
```

---

## Example Datasets

### Simple 2-Turn Conversation

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hi"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hello!"}
```

### With System Prompt

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Who won?", "system": "You are a sports expert"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "The Lakers won."}
```

### Multiple Conversations

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hi"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hello!"}
{"conversation_id": "c2", "turn": 1, "role": "user", "content": "Hey"}
{"conversation_id": "c2", "turn": 2, "role": "assistant", "content": "Hi there!"}
```

### With Model Override

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "Summarize this", "model": "gpt-4"}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Here's the summary..."}
```

---

## Testing Your Setup

### 1. Use the Example Dataset

```bash
# Run from the repository root — dataset paths in the bundled YAML are
# repo-relative (e.g. examples/09_MultiTurn/customer_support_conversations.jsonl).
inference-endpoint benchmark from-config \
--config examples/09_MultiTurn/multi_turn_benchmark.yaml
```

### 2. Check the Logs

```bash
cat logs/multi_turn_test/benchmark.log
# Look for: "Turn X of conversation_id issued"
```

### 3. Verify Event Recording

```bash
# List all sample UUIDs in the events log
jq -r '.sample_uuid' logs/multi_turn_test/events.jsonl | sort -u
# Should show UUIDs; correlate to conversations via sample_idx_map.json
```

---

## Tips & Best Practices

### Dataset Design

- **Keep conversations realistic**: 2-10 turns is typical
- **Test edge cases**: include 1-turn conversations and very long conversations
- **Include system prompts**: they help the model understand the context

### Performance

- **Workers**: `client.num_workers` controls HTTP worker processes, independent of `target_concurrency`. The default (`-1`) auto-tunes based on NUMA topology.
- **Timeout**: Set `turn_timeout_s` to about 2x your longest expected turn latency (for example, `120` if the slowest turn takes about 60 seconds)
- **Memory**: Roughly 1 KB per turn; plan accordingly for large datasets

### Debugging

- **Start small**: Test with 1-2 conversations first
- **Single conversation**: Use `target_concurrency: 1`
- **Check events.jsonl**: Verify turn ordering with `jq`

---

## More Information

- **Full Documentation**: See `examples/09_MultiTurn/README.md`
- **Architecture**: See `AGENTS.md` (Multi-Turn section)

---

## Checklist

Before running your first multi-turn benchmark:

- [ ] Dataset follows format (user/assistant alternation, or agentic user→assistant→tool sequences)
- [ ] All rows for each conversation_id are grouped together
- [ ] Config has `multi_turn:` block in the dataset section
- [ ] Config has `load_pattern.type: multi_turn`
- [ ] Endpoint is running and reachable
- [ ] File uses `.jsonl` extension (format is auto-detected)
- [ ] Conversation IDs are unique per conversation
- [ ] Turn numbers are sequential (1, 2, 3, ...)
- [ ] Dataset is configured as `type: performance` (accuracy evaluation of multi-turn datasets is not yet supported)

Happy benchmarking!