Feat/tool sequences #285
Open
tianmu-li wants to merge 42 commits into mlcommons:main from tianmu-li:feat/tool_sequences
Commits
e11b96f feat: add multi-turn dataset manager with flat JSONL support (tianmu-li)
4a135ff feat: add ConversationManager and MultiTurnStrategy (tianmu-li)
eb99f58 test: add multi-turn unit and integration tests (tianmu-li)
75b64d6 feat: wire multi-turn into benchmark execution pipeline (tianmu-li)
1a41869 docs: add multi-turn quickstart, examples, and conversion scripts (tianmu-li)
00310b4 fix: replace hardcoded /model/ path in validate_jsonl_schema.py docst… (tianmu-li)
2961de5 chore: move multi_turn_dataset_schema.json into scripts/ and update d… (tianmu-li)
039f72c fix: address PR #285 review comments for multi-turn implementation (tianmu-li)
c53e5d5 fix: improve multi-turn PromptData text and add concurrent stress test (tianmu-li)
7495a45 refactor: replace semaphore with worker-pool concurrency in MultiTurn… (tianmu-li)
7aa45f5 fix: address remaining PR #285 review comments for multi-turn impleme… (tianmu-li)
aedbbe6 fix: address remaining PR #285 review comments (tianmu-li)
c3cd497 refactor: replace worker-pool with event-driven model in MultiTurnStr… (tianmu-li)
c2ab3a7 fix: address PR #285 review comments for multi-turn implementation (tianmu-li)
adaa8b4 Import fix (tianmu-li)
38d0ef0 fix: revert out-of-scope live-history tool_call_id rewriting (tianmu-li)
d2dace8 Fix issue with tool call accumulation and reasoning content (tianmu-li)
a7ef9e5 feat: account for tool-call tokens in OSL / TPOT / TPS metrics (tianmu-li)
452da2f fix: correct chat-template tokenization for tool-call messages (tianmu-li)
7bde10b docs: fix stale references and tool-row format in multi-turn docs (tianmu-li)
408ed21 feat: pre-compute ISL token counts for multi-turn dataset-history mode (tianmu-li)
a003c9a fix: unwrap BatchEncoding from apply_chat_template for Qwen3 tokenizer (tianmu-li)
72c20f5 fix: accuracy phases now inherit configured load pattern instead of f… (tianmu-li)
5b8f515 Fix pre-commit (tianmu-li)
9ad9612 Fix CI error for completion (tianmu-li)
857db5b Change to SSE choice for test completion (tianmu-li)
75aa9e2 fix: address PR #285 review deficiencies in multi-turn stack (tianmu-li)
8abfc30 fix: close residual PR #285 review deficiencies (tianmu-li)
5169265 fix: drop jinja2 import and fix test mocks for ISL precompute (tianmu-li)
191c320 fix: address Copilot review comments on multi-turn implementation (tianmu-li)
51d37dd refactor: typed ConversationMetadata dataclass, single build in load(… (tianmu-li)
dda44bb fix: address PR #285 round-4 review comments (tianmu-li)
595faf4 Send conversation_id and turn number (tianmu-li)
1272386 Address perf concerns (tianmu-li)
acb7464 Merge remote-tracking branch 'origin/main' into feat/tool_sequences (tianmu-li)
af035d9 fix: address PR #285 round-5 Copilot review comments (tianmu-li)
dd47796 fix: update test_schema.py error message assertions for Fix 5 (tianmu-li)
15a4108 Fix ci test failure post merge (tianmu-li)
2ac66be fix: address PR #285 round-6 Copilot review comments (tianmu-li)
4442364 Fix double-firing of timed-out turns (tianmu-li)
03b96b9 feat: stamp conversation_id and turn on EventRecord pipeline (tianmu-li)
fb2ff32 fix: sum_sq overflow, streaming conv stamping, tool-call metric coverage (tianmu-li)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,280 @@ | ||
| # Multi-Turn Conversation Benchmarking - Quick Start Guide | ||
|
|
||
| ## Quick Start in 5 Minutes | ||
|
|
||
| ### 1. Prepare Your Dataset | ||
|
|
||
| Create a JSONL file with your conversations. All rows for a given `conversation_id` must appear | ||
| **consecutively** in the file (no interleaving with other conversations): | ||
|
|
||
| ```jsonl | ||
| {"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hello!", "system": "You are a helpful assistant"} | ||
| {"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hi! How can I help?"} | ||
| {"conversation_id": "c1", "turn": 3, "role": "user", "content": "What's 2+2?"} | ||
| {"conversation_id": "c1", "turn": 4, "role": "assistant", "content": "2+2 equals 4."} | ||
| ``` | ||
|
|
||
| **Rules**: | ||
|
|
||
| - Alternate between "user" and "assistant" roles (agentic datasets may also include "tool" rows after an assistant turn) | ||
| - Start with "user" role | ||
| - Sequential turn numbers (1, 2, 3, ...) | ||
| - Same `conversation_id` for all turns in a conversation | ||
| - All rows for the same `conversation_id` must be grouped together | ||
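
The cross-row rules above can be checked before a run with a few lines of Python. This is a minimal sketch, not the project's own validator: `check_rows` is a hypothetical helper name, and it only covers grouping, sequential turn numbers, and the "start with user" rule, using the field names shown in the example rows.

```python
import json


def check_rows(lines):
    """Return a list of problems found in an iterable of JSONL lines.

    Hypothetical helper: checks grouping, turn numbering, and first role only.
    """
    problems = []
    finished = set()   # conversation_ids whose block of rows has already ended
    current = None     # conversation_id of the block currently being read
    expected_turn = 1
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        cid = row["conversation_id"]
        if cid != current:
            if current is not None:
                finished.add(current)
            current = cid
            if cid in finished:
                problems.append(f"line {lineno}: rows for {cid!r} are not consecutive")
                expected_turn = row["turn"]  # avoid cascading turn-number errors
            else:
                expected_turn = 1
                if row.get("role") != "user":
                    problems.append(f"line {lineno}: {cid!r} does not start with a user row")
        if row["turn"] != expected_turn:
            problems.append(
                f"line {lineno}: {cid!r} expected turn {expected_turn}, got {row['turn']}"
            )
        expected_turn = row["turn"] + 1
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        for problem in check_rows(handle):
            print(problem)
```

Run it as a script with the dataset path as the only argument; no output means none of the three checks failed.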
|
|
||
| ### 2. Create Configuration File | ||
|
|
||
| Save as `multi_turn_config.yaml`: | ||
|
|
||
| ```yaml | ||
| name: "my-multi-turn-benchmark" | ||
| version: "1.0" | ||
| type: "online" | ||
|
|
||
| model_params: | ||
|   name: "your-model-name" | ||
|   temperature: 0.7 | ||
|   max_new_tokens: 256 | ||
|
|
||
| datasets: | ||
|   - name: my_conversations | ||
|     type: performance | ||
|     path: path/to/your/conversations.jsonl | ||
|     multi_turn: # ← Presence of this block enables multi-turn mode | ||
|       turn_timeout_s: 300 # ← Max seconds to wait for the previous turn | ||
|
|
||
| settings: | ||
|   load_pattern: | ||
|     type: multi_turn # ← Use multi-turn scheduler | ||
|     target_concurrency: 32 # ← Required: max simultaneous conversations | ||
|
|
||
|   client: | ||
|     num_workers: 4 | ||
|
|
||
| endpoint_config: | ||
|   endpoints: | ||
|     - "http://your-endpoint:8000" | ||
|   api_type: openai | ||
|
|
||
| report_dir: logs/my_multi_turn_benchmark | ||
| ``` | ||
|
|
||
| Results are written to `report_dir` (here: `logs/my_multi_turn_benchmark/`). | ||
|
|
||
| ### 3. Run Benchmark | ||
|
|
||
| ```bash | ||
| inference-endpoint benchmark from-config --config multi_turn_config.yaml | ||
| ``` | ||
|
|
||
| That's it! Your benchmark will now: | ||
|
|
||
| - ✅ Enforce turn ordering (turn N+1 waits for turn N) | ||
| - ✅ Include conversation history in each request | ||
| - ✅ Log all issued (client) turns to events.jsonl — scripted assistant rows are context only and do not produce sample events | ||
|
|
||
| --- | ||
|
|
||
| ## Understanding Results | ||
|
|
||
| After the benchmark completes, check the directory configured via `report_dir`: | ||
|
|
||
| ### Events Log | ||
|
|
||
| The `events.jsonl` file contains one JSON record per line, with the standard | ||
| `sample_uuid`, `event_type`, and `timestamp_ns` fields. Events are keyed by | ||
| `sample_uuid` only. To correlate events with conversations, join through | ||
| `sample_idx_map.json` (written next to `events.jsonl`) and the multi-turn | ||
| dataset's `conversation_metadata["samples"]`, which maps sample indices to | ||
| `(conversation_id, turn)` tuples. | ||
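
The join described above might look like the following sketch. The exact on-disk layout of `sample_idx_map.json` is not shown in this guide, so the function takes two already-loaded mappings whose shapes are assumptions: `uuid_to_index` as `{sample_uuid: sample_index}` and `index_to_turn` as `{sample_index: (conversation_id, turn)}`. The helper name is hypothetical; only the `sample_uuid`, `event_type`, and `timestamp_ns` fields come from the text.

```python
import json
from collections import defaultdict


def group_events_by_conversation(event_lines, uuid_to_index, index_to_turn):
    """Group event records by conversation_id, sorted by (turn, timestamp_ns).

    uuid_to_index: ASSUMED shape {sample_uuid: sample_index}
    index_to_turn: ASSUMED shape {sample_index: (conversation_id, turn)}
    """
    grouped = defaultdict(list)
    for line in event_lines:
        if not line.strip():
            continue
        event = json.loads(line)
        index = uuid_to_index.get(event.get("sample_uuid"))
        if index is None or index not in index_to_turn:
            continue  # event cannot be correlated with a dataset row
        conversation_id, turn = index_to_turn[index]
        grouped[conversation_id].append(
            (turn, event["timestamp_ns"], event["event_type"])
        )
    return {cid: sorted(rows) for cid, rows in grouped.items()}
```

If the real files use different shapes, adapt the two lookups; the grouping and sorting logic stays the same.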
|
|
||
| ### Metrics | ||
|
|
||
| Currently available: | ||
|
|
||
| - **Per-turn metrics**: Latency, TTFT, TPOT for each turn | ||
| - **Conversation tracking**: events are keyed by `sample_uuid` only; correlate any event back to a conversation by joining through `sample_idx_map.json` and `conversation_metadata["samples"]` | ||
|
|
||
|
|
||
| _Note: Per-conversation aggregation (e.g., "conversations/sec") is coming in a future update._ | ||
|
|
||
| --- | ||
|
|
||
| ## Concurrency Control | ||
|
|
||
| `target_concurrency` is **required** for the `multi_turn` load pattern. It controls how many | ||
| conversations are active simultaneously. Each active conversation has exactly one in-flight turn | ||
| at a time — a worker issues turn N, waits for the response, then issues turn N+1. A new | ||
| conversation starts only after a worker finishes all turns of its current one. | ||
|
|
||
| ```yaml | ||
| settings: | ||
|   load_pattern: | ||
|     type: multi_turn | ||
|     target_concurrency: 32 # ← 32 conversations active simultaneously | ||
| ``` | ||
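
The semantics described in this section can be illustrated with a small `asyncio` sketch. It is a model of the behaviour only, not the benchmark's scheduler (the commit history mentions an event-driven implementation); `run_conversations` and `send_turn` are hypothetical names, and `send_turn` stands in for whatever issues a request and waits for its response.

```python
import asyncio


async def run_conversations(conversations, target_concurrency, send_turn):
    """Run conversations with at most `target_concurrency` active at once.

    conversations: list of (conversation_id, [turn payloads])
    send_turn: async callable (conversation_id, turn_number, payload)
    """
    queue = asyncio.Queue()
    for conversation in conversations:
        queue.put_nowait(conversation)

    async def worker():
        while True:
            try:
                conversation_id, turns = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no conversations left for this worker
            for turn_number, payload in enumerate(turns, start=1):
                # One in-flight turn per conversation: turn N+1 is issued
                # only after the response to turn N has arrived.
                await send_turn(conversation_id, turn_number, payload)

    await asyncio.gather(*(worker() for _ in range(target_concurrency)))
```

With `target_concurrency` workers, at most that many turns are in flight at any moment, and a worker picks up a new conversation only after finishing every turn of its current one.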
|
|
||
| --- | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Validate Your Dataset Before Running | ||
|
|
||
| Use the bundled validation script to check your JSONL file for schema errors before benchmarking: | ||
|
|
||
| ```bash | ||
| python scripts/validate_jsonl_schema.py path/to/your/conversations.jsonl | ||
| ``` | ||
|
|
||
| This catches per-row schema errors (missing required fields, wrong types, | ||
| malformed `tool_results`). Cross-row invariants (consecutive turn numbers, | ||
| valid role sequences, grouped conversations) are enforced by | ||
| `MultiTurnDataset` at load time and will surface at benchmark startup. | ||
|
|
||
| ### "Conversation has invalid role sequence" | ||
|
|
||
| **Problem**: Your dataset doesn't follow a valid role sequence. | ||
|
|
||
| **Fix**: Check your JSONL. Valid sequences: | ||
|
|
||
|
|
||
| - Plain chat: `user → assistant → user → assistant → ...` | ||
| - Agentic (tool-use): `user → assistant → tool → assistant → tool → ... → user` | ||
|
|
||
| Conversations may also end with a `tool` row (the model's response to the final tool call is the benchmark target). | ||
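
To find the offending row quickly, the sequences above can be encoded as a small transition table. This is a sketch under assumptions rather than the loader's actual rule set: it assumes a `tool` row is always followed by an `assistant` row (or ends the conversation), and it places no restriction on the final role. `ALLOWED_NEXT` and `first_invalid_role` are hypothetical names.

```python
# Allowed role for the next row, keyed by the previous row's role.
# None means "start of conversation". ASSUMED from the sequences listed above.
ALLOWED_NEXT = {
    None: {"user"},
    "user": {"assistant"},
    "assistant": {"user", "tool"},
    "tool": {"assistant"},
}


def first_invalid_role(roles):
    """Return the index of the first role that breaks the sequence, or None."""
    previous = None
    for index, role in enumerate(roles):
        if role not in ALLOWED_NEXT[previous]:
            return index
        previous = role
    return None
```

For example, `first_invalid_role(["user", "assistant", "tool"])` returns `None`, while `first_invalid_role(["user", "user"])` returns `1`, the position of the second consecutive `user` row.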
|
|
||
| ### "Rows for conversation X are not consecutive" | ||
|
|
||
| **Problem**: Rows for the same `conversation_id` are interleaved with rows from other conversations. | ||
|
|
||
| **Fix**: Sort your JSONL so all rows for each conversation appear together. | ||
|
|
||
| ### "Turn timed out waiting for prev turn" | ||
|
|
||
| **Problem**: The previous turn took longer than `turn_timeout_s` seconds. | ||
|
|
||
| **Fixes**: | ||
|
|
||
| 1. Increase `turn_timeout_s` in config | ||
| 2. Check if your endpoint is slow or unresponsive | ||
| 3. Look for errors in the endpoint logs | ||
|
|
||
| ### Dataset not loading | ||
|
|
||
| **Problem**: `MultiTurnDataset` is not recognized. | ||
|
|
||
| **Fix**: Ensure the `multi_turn:` block is present in the dataset config. The file format | ||
| is auto-detected from the `.jsonl` extension, so no `format` field is needed: | ||
|
|
||
| ```yaml | ||
| datasets: | ||
|   - path: your_file.jsonl | ||
|     multi_turn: {} | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Example Datasets | ||
|
|
||
| ### Simple 2-Turn Conversation | ||
|
|
||
| ```jsonl | ||
| {"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hi"} | ||
| {"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hello!"} | ||
| ``` | ||
|
|
||
| ### With System Prompt | ||
|
|
||
| ```jsonl | ||
| {"conversation_id": "c1", "turn": 1, "role": "user", "content": "Who won?", "system": "You are a sports expert"} | ||
| {"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "The Lakers won."} | ||
| ``` | ||
|
|
||
| ### Multiple Conversations | ||
|
|
||
| ```jsonl | ||
| {"conversation_id": "c1", "turn": 1, "role": "user", "content": "Hi"} | ||
| {"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Hello!"} | ||
| {"conversation_id": "c2", "turn": 1, "role": "user", "content": "Hey"} | ||
| {"conversation_id": "c2", "turn": 2, "role": "assistant", "content": "Hi there!"} | ||
| ``` | ||
|
|
||
| ### With Model Override | ||
|
|
||
| ```jsonl | ||
| {"conversation_id": "c1", "turn": 1, "role": "user", "content": "Summarize this", "model": "gpt-4"} | ||
| {"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "Here's the summary..."} | ||
| ``` | ||
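
Datasets in the shapes shown above can be generated from in-memory conversations with a short helper. This is a minimal sketch: `conversations_to_jsonl` is a hypothetical name, and attaching `system` to the first row only mirrors the examples in this guide rather than a documented requirement.

```python
import json


def conversations_to_jsonl(conversations, system_prompt=None):
    """Serialize {conversation_id: [(role, content), ...]} to JSONL text.

    Rows for each conversation are emitted together with turns numbered from 1.
    """
    lines = []
    for conversation_id, messages in conversations.items():
        for turn, (role, content) in enumerate(messages, start=1):
            row = {
                "conversation_id": conversation_id,
                "turn": turn,
                "role": role,
                "content": content,
            }
            if system_prompt is not None and turn == 1:
                row["system"] = system_prompt  # mirrors the examples above
            lines.append(json.dumps(row))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    data = {
        "c1": [("user", "Hi"), ("assistant", "Hello!")],
        "c2": [("user", "Hey"), ("assistant", "Hi there!")],
    }
    with open("conversations.jsonl", "w", encoding="utf-8") as handle:
        handle.write(conversations_to_jsonl(data))
```

Because each conversation is written in one pass, the grouping and sequential-turn rules hold by construction.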
|
|
||
| --- | ||
|
|
||
| ## Testing Your Setup | ||
|
|
||
| ### 1. Use the Example Dataset | ||
|
|
||
| ```bash | ||
| # Run from the repository root — dataset paths in the bundled YAML are | ||
| # repo-relative (e.g. examples/09_MultiTurn/customer_support_conversations.jsonl). | ||
| inference-endpoint benchmark from-config \ | ||
|   --config examples/09_MultiTurn/multi_turn_benchmark.yaml | ||
| ``` | ||
|
|
||
| ### 2. Check the Logs | ||
|
|
||
| ```bash | ||
| cat logs/multi_turn_test/benchmark.log | ||
| # Look for: "Turn X of conversation_id issued" | ||
| ``` | ||
|
|
||
| ### 3. Verify Event Recording | ||
|
|
||
| ```bash | ||
| # List all sample UUIDs in the events log | ||
| jq -r '.sample_uuid' logs/multi_turn_test/events.jsonl | sort -u | ||
| # Should show UUIDs; correlate to conversations via sample_idx_map.json | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Tips & Best Practices | ||
|
|
||
| ### Dataset Design | ||
|
|
||
| - **Keep conversations realistic**: 2-10 turns is typical | ||
| - **Test edge cases**: 1-turn conversations, very long conversations | ||
| - **Include system prompts**: They help the model understand context | ||
|
|
||
| ### Performance | ||
|
|
||
| - **Workers**: `client.num_workers` controls HTTP worker processes, independent of `target_concurrency`. The default (`-1`) auto-tunes based on NUMA topology. | ||
| - **Timeout**: Set `turn_timeout_s` to about 2x your longest expected turn latency | ||
| - **Memory**: Roughly 1 KB per turn; plan accordingly for large datasets | ||
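
The timeout and memory rules of thumb above reduce to simple arithmetic. The sketch below only restates them; `plan_run` is a hypothetical helper, and the 2x factor and 1 KB per turn figures are the approximations from this list, not measured values.

```python
def plan_run(turn_latencies_s, total_turns, bytes_per_turn=1024, safety_factor=2.0):
    """Return (suggested turn_timeout_s, estimated dataset memory in MiB)."""
    timeout_s = safety_factor * max(turn_latencies_s)
    memory_mib = total_turns * bytes_per_turn / (1024 * 1024)
    return timeout_s, memory_mib


# Observed latencies of 12.0 s, 45.5 s and 80.0 s, and a 500,000-turn dataset:
timeout_s, memory_mib = plan_run([12.0, 45.5, 80.0], total_turns=500_000)
print(timeout_s)   # 160.0
print(memory_mib)  # 488.28125
```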
|
|
||
| ### Debugging | ||
|
|
||
| - **Start small**: Test with 1-2 conversations first | ||
| - **Single conversation**: Use `target_concurrency: 1` | ||
| - **Check events.jsonl**: Verify turn ordering with `jq` | ||
|
|
||
| --- | ||
|
|
||
| ## More Information | ||
|
|
||
| - **Full Documentation**: See `examples/09_MultiTurn/README.md` | ||
| - **Architecture**: See `AGENTS.md` (Multi-Turn section) | ||
|
|
||
| --- | ||
|
|
||
| ## Checklist | ||
|
|
||
| Before running your first multi-turn benchmark: | ||
|
|
||
| - [ ] Dataset follows format (user/assistant alternation, or agentic user→assistant→tool sequences) | ||
| - [ ] All rows for each conversation_id are grouped together | ||
| - [ ] Config has `multi_turn:` block in the dataset section | ||
| - [ ] Config has `load_pattern.type: multi_turn` | ||
| - [ ] Endpoint is running and reachable | ||
| - [ ] File uses `.jsonl` extension (format is auto-detected) | ||
| - [ ] Conversation IDs are unique per conversation | ||
| - [ ] Turn numbers are sequential (1, 2, 3, ...) | ||
| - [ ] Dataset is configured as `type: performance` (accuracy evaluation of multi-turn datasets is not yet supported) | ||
|
|
||
| Happy benchmarking! | ||