
Feat/tool sequences #285

Open

tianmu-li wants to merge 40 commits into mlcommons:main from tianmu-li:feat/tool_sequences

Conversation

@tianmu-li
Collaborator

What does this PR do?

Updated multi-turn implementation for #232. Added tool sequencing and fixed the scheduler for concurrent requests.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@github-actions

github-actions Bot commented Apr 17, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive multi-turn conversation benchmarking framework, including a new MultiTurnScheduler, ConversationManager, and MultiTurnDataset. These additions enable benchmarking of conversational AI workloads with turn sequencing, conversation history management, and optional concurrency control. My review identified potential issues regarding the usage of sentinel objects in the scheduler and the robustness of the timeout logic in the conversation manager.

Comment thread src/inference_endpoint/load_generator/scheduler.py Outdated
Comment thread src/inference_endpoint/load_generator/scheduler.py Outdated
Comment thread src/inference_endpoint/load_generator/conversation_manager.py Outdated
Collaborator

@arekay-nv arekay-nv left a comment


Partial review - will complete later.

Comment thread examples/09_MultiTurn/README.md Outdated
Comment thread examples/09_MultiTurn/README.md Outdated
Comment thread examples/09_MultiTurn/README.md Outdated
Comment thread examples/09_MultiTurn/agentic_coding_benchmark.yaml Outdated
Comment thread examples/09_MultiTurn/agentic_coding_benchmark.yaml Outdated
Comment thread examples/09_MultiTurn/agentic_workflow_benchmark.yaml
Comment thread docs/MULTI_TURN_QUICKSTART.md
Comment thread MULTI_TURN_QUICKSTART.md Outdated
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Comment thread MULTI_TURN_QUICKSTART.md Outdated
tianmu-li added a commit to tianmu-li/endpoints that referenced this pull request Apr 22, 2026
- Remove sequential conversation mode (redundant with target_concurrency=1)
- Remove `enabled` field from MultiTurnConfig; presence of multi_turn: block implies enabled
- Add conversation grouping validation to MultiTurnDataset (raises InputValidationError if rows for a conversation_id are not consecutive)
- Update YAML example configs: model placeholder, relative dataset paths, removed redundant metrics.collect
- Move MULTI_TURN_QUICKSTART.md to docs/
- Update all documentation to remove sequential mode references

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tianmu-li added a commit to tianmu-li/endpoints that referenced this pull request Apr 22, 2026
…wer comments

- Remove dead constant BLOCK_ON_PREVIOUS_TURN = -1 from scheduler.py
- Remove redundant outer with state.condition: in mark_turn_complete
- Remove ConversationMode import and explicit mode= args from integration tests
- Fix format: jsonl → format: ".jsonl" in example YAMLs and docs
- Add target_concurrency: 1 clarification to quickstart (preserves turn ordering)
- Remove broken HYBRID_SCHEDULER_GUIDE.md reference from quickstart

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tianmu-li tianmu-li force-pushed the feat/tool_sequences branch 2 times, most recently from b127845 to 0a7ad37 Compare April 27, 2026 20:09
@arekay-nv
Collaborator

arekay-nv commented Apr 27, 2026

⚠️ Superseded. This comment was posted while a pending review was blocking inline-comment delivery. The review has since been re-posted with all 14 inline comments attached: see review #4184417968 and the updated summary. The findings are unchanged.

Collaborator

@arekay-nv arekay-nv left a comment


Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex review failed with a CLI config error — invalid 'features' requirement 'browser_use' from cloud requirements) | Depth: thorough

Found 15 issues (0 critical, 2 high, 5 medium, 8 low). 14 posted as inline comments, 1 in summary table only (line outside diff hunk).

Comment thread src/inference_endpoint/openai/openai_msgspec_adapter.py Outdated
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py Outdated
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py Outdated
Comment thread src/inference_endpoint/load_generator/session.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
Comment thread tests/integration/test_multi_turn.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated
Comment thread src/inference_endpoint/load_generator/conversation_manager.py Outdated
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py
@arekay-nv
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Claude | Depth: thorough | Commit: 0a7ad37

14 inline comments posted in review #4184417968. 1 finding (openai_adapter.py:131) is summary-only because the line falls outside the diff hunk.

Found 15 issues (0 critical, 2 high, 5 medium, 8 low). Codex review failed with a CLI config error (invalid features requirement browser_use from cloud requirements) — Claude-only review.

🟠 Must Fix (high)

Issues that will cause incorrect behavior users will hit in normal usage.

| # | Location | Category | Summary |
| --- | --- | --- | --- |
| 1 | src/inference_endpoint/openai/openai_msgspec_adapter.py:222 | bug | from_endpoint_response violates the QueryResult.metadata type contract by passing metadata=None when no… |
| 2 | src/inference_endpoint/load_generator/multi_turn_strategy.py:137 | error-handling | await asyncio.gather(*tasks, return_exceptions=True) silently swallows every exception raised by a _conv_pipeline coroutine. The result… |

🟡 Should Fix (medium)

Real issues that trigger under specific conditions, or design flaws that will compound.

| # | Location | Category | Summary |
| --- | --- | --- | --- |
| 3 | src/inference_endpoint/load_generator/multi_turn_strategy.py:164 | bug | When a turn times out waiting for the previous turn's response, the pipeline does state.failed_turns += 1 but never increments… |
| 4 | src/inference_endpoint/load_generator/session.py:206 | data-integrity | PromptData.text = data.get("prompt") and token_ids = data.get("input_tokens") or data.get("token_ids") — neither is set on multi-turn… |
| 5 | src/inference_endpoint/dataset_manager/multi_turn_dataset.py:212 | bug | system_prompts_by_conv is keyed by str(conv_id) (line 212), but pre_built_messages_by_key and current_turn_messages_by_key are… |
| 6 | src/inference_endpoint/load_generator/multi_turn_strategy.py:180 | bug | In live-history mode (use_dataset_history=False), the per-turn tool message reuses the dataset's hardcoded tool_call_id (e.g.… |
| 7 | src/inference_endpoint/load_generator/conversation_manager.py:142 | error-handling | mark_turn_complete and mark_turn_failed raise KeyError if conversation_id is missing. These are invoked from… |

🔵 Consider (low)

Valid improvements that could be follow-ups.

| # | Location | Category | Summary |
| --- | --- | --- | --- |
| 8 | src/inference_endpoint/config/runtime_settings.py:200 | design | self.load_pattern.type.value == "multi_turn" compares the enum value as a string literal instead of the enum:… |
| 9 | src/inference_endpoint/config/schema.py:253 | design | MultiTurnConfig uses model_config = {"extra": "forbid"} (raw dict) and is missing frozen=True, while every other config model in this… |
| 10 | src/inference_endpoint/dataset_manager/multi_turn_dataset.py:222 | bug | When iterating prior_rows to build pre-built messages, a tool row whose tool_results field is an empty list (e.g. tool_results: [])… |
| 11 | tests/integration/test_multi_turn.py:559 | testing | The "tool use" integration tests verify that tools, tool_calls, and tool_results are forwarded to the endpoint and that all client… |
| 12 | src/inference_endpoint/dataset_manager/multi_turn_dataset.py:142 | performance | _validate_conversation_structure, _validate_turn_numbering, and _build_metadata each call self.dataframe.groupby("conversation_id") |
| 13 | src/inference_endpoint/load_generator/conversation_manager.py:45 | concurrency | turn_done: asyncio.Event = field(default_factory=asyncio.Event) is constructed at ConversationState instantiation time, before… |
| 14 | src/inference_endpoint/load_generator/multi_turn_strategy.py:95 | design | self._inflight: dict[str, str] = {} is mutated from two execution contexts (the per-conversation pipeline tasks at line 190, and… |
| 15 | src/inference_endpoint/openai/openai_adapter.py:131 (summary-only) | api-contract | OpenAIAdapter.from_endpoint_response (the non-msgspec adapter, used when callers explicitly select it) does not extract tool_calls,… |

tianmu-li added a commit to tianmu-li/endpoints that referenced this pull request Apr 28, 2026
…tation

Fix 15 review issues across severity levels:
- HIGH: metadata=None crash in msgspec adapter, silent exception swallowing in gather
- MEDIUM: timeout state consistency, conv_id canonicalization, PromptData fallback, conv_id guard
- LOW: enum comparison, frozen config, empty tool_results warning, adapter metadata extraction,
  groupby deduplication, live-history tool warning, asyncio.Event docs, test TODO

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tianmu-li
Collaborator Author

Hi @arekay-nv, I've addressed the comments. I'd appreciate it if you could take another look.

@tianmu-li tianmu-li marked this pull request as ready for review May 1, 2026 19:16
@tianmu-li tianmu-li requested review from a team and Copilot May 1, 2026 19:16

Copilot AI left a comment


Pull request overview

This PR updates the benchmarking system to support multi-turn conversational workloads (including tool-calling sequences), adds a dedicated multi-turn dataset format + conversion/validation tooling, and wires a new multi-turn load strategy into the benchmarking session and OpenAI adapters.

Changes:

  • Add MultiTurnDataset (flat-row JSONL format) with validation, metadata precomputation, and adapter-default handling for per-turn parameters/tools.
  • Add MultiTurnStrategy + ConversationManager to enforce per-conversation turn sequencing with optional global concurrency limiting, and integrate it into BenchmarkSession.
  • Extend OpenAI request/response handling for messages, tools, tool-call metadata, and streaming tool-call accumulation; add extensive unit/integration tests and multi-turn docs/examples/scripts.

Reviewed changes

Copilot reviewed 43 out of 44 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/unit/openai/test_openai_adapter.py | New unit tests for OpenAIAdapter tool serialization and tools forwarding. |
| tests/unit/openai/test_msgspec_adapter.py | New unit tests for msgspec OpenAI adapter tool-call fields and message dict conversion. |
| tests/unit/load_generator/test_multi_turn_strategy.py | New unit tests for turn sequencing, concurrency semaphore behavior, and metadata propagation. |
| tests/unit/load_generator/test_multi_turn_conversation_manager.py | New unit tests for conversation state bookkeeping and event gating. |
| tests/unit/dataset_manager/test_transforms.py | Add coverage for the new AddDefaultColumns transform. |
| tests/unit/dataset_manager/test_multi_turn_dataset.py | Comprehensive tests for MultiTurnDataset, including tool sequences and metadata correctness. |
| tests/unit/core/test_types.py | Add tests for QueryResult.with_metadata() and Query.metadata round-tripping. |
| tests/unit/config/test_schema.py | Add tests for multi-turn config validation and multi-turn sample counting logic. |
| tests/integration/test_multi_turn.py | End-to-end integration tests exercising dataset-history/live-history modes and tool-use conversations. |
| src/inference_endpoint/openai/types.py | Extend msgspec OpenAI types to include tool_calls, tool_call_id, and tools. |
| src/inference_endpoint/openai/openai_msgspec_adapter.py | Support messages input, tool-call fields, tools forwarding, and richer response metadata. |
| src/inference_endpoint/openai/openai_adapter.py | Support messages input, tools forwarding, and return richer response metadata. |
| src/inference_endpoint/openai/accumulator.py | Accumulate streamed tool_calls + finish_reason into final QueryResult.metadata. |
| src/inference_endpoint/load_generator/strategy.py | Extend PhaseIssuerProtocol.issue() to accept an optional data_override. |
| src/inference_endpoint/load_generator/session.py | Allow injecting a per-phase strategy; support data overrides in sample issuance. |
| src/inference_endpoint/load_generator/multi_turn_strategy.py | New multi-turn strategy implementing per-conversation sequencing + global concurrency limiting. |
| src/inference_endpoint/load_generator/conversation_manager.py | New synchronous conversation state manager used by multi-turn strategy. |
| src/inference_endpoint/endpoint_client/worker.py | Propagate Query.metadata through requests and merge into results. |
| src/inference_endpoint/endpoint_client/http.py | Add query_metadata field to InFlightRequest. |
| src/inference_endpoint/endpoint_client/adapter_protocol.py | Generalize SSE decoding/parse APIs to return adapter-specific chunk objects. |
| src/inference_endpoint/dataset_manager/transforms.py | Add AddDefaultColumns (fill-missing-only) transform. |
| src/inference_endpoint/dataset_manager/multi_turn_dataset.py | New multi-turn dataset implementation with tool-sequence handling and metadata building. |
| src/inference_endpoint/dataset_manager/factory.py | Select MultiTurnDataset when dataset config includes multi_turn; skip prompt-based transforms for it. |
| src/inference_endpoint/dataset_manager/__init__.py | Export MultiTurnDataset and AddDefaultColumns. |
| src/inference_endpoint/core/types.py | Add Query.metadata and QueryResult.with_metadata(). |
| src/inference_endpoint/config/templates/online_template_full.yaml | Expose multi_turn dataset block and multi_turn load pattern option in template. |
| src/inference_endpoint/config/templates/online_template.yaml | Expose multi_turn load pattern option in template. |
| src/inference_endpoint/config/templates/offline_template_full.yaml | Expose multi_turn dataset block and load pattern option in template. |
| src/inference_endpoint/config/templates/concurrency_template_full.yaml | Expose multi_turn dataset block and load pattern option in template. |
| src/inference_endpoint/config/templates/concurrency_template.yaml | Expose multi_turn load pattern option in template. |
| src/inference_endpoint/config/schema.py | Add multi-turn schema objects and cross-validate dataset.multi_turn ↔ load_pattern.type. |
| src/inference_endpoint/config/runtime_settings.py | Make multi-turn sample count issue all dataset client turns (min-sample-count aware). |
| src/inference_endpoint/commands/benchmark/execute.py | Instantiate and wire MultiTurnStrategy automatically when using MultiTurnDataset. |
| scripts/validate_jsonl_schema.py | New CLI script to validate multi-turn JSONL rows against schema. |
| scripts/multi_turn_dataset_schema.json | New JSON Schema for multi-turn flat-row JSONL datasets. |
| scripts/convert_agentic_snapshot.py | New conversion+verification script from snapshot-style agentic datasets to flat-row JSONL. |
| examples/09_MultiTurn/multi_turn_with_concurrency.yaml | Example config: multi-turn with global concurrency limiting. |
| examples/09_MultiTurn/multi_turn_benchmark.yaml | Example config: basic multi-turn benchmark. |
| examples/09_MultiTurn/datasets/.gitkeep | Placeholder for converted example datasets. |
| examples/09_MultiTurn/customer_support_conversations.jsonl | Example multi-turn dataset. |
| examples/09_MultiTurn/agentic_workflow_benchmark.yaml | Example config for converted agentic workflow dataset. |
| examples/09_MultiTurn/agentic_coding_benchmark.yaml | Example config for converted agentic coding dataset. |
| examples/09_MultiTurn/README.md | Multi-turn feature documentation and agentic conversion guidance. |
| docs/MULTI_TURN_QUICKSTART.md | Quickstart guide for running multi-turn benchmarks. |


Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py
Comment thread src/inference_endpoint/openai/openai_adapter.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated
Collaborator

@arekay-nv arekay-nv left a comment


There is some confusion around the issue policy; let's sync today.

Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Comment thread docs/MULTI_TURN_QUICKSTART.md
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py Outdated
Comment thread tests/unit/openai/test_openai_adapter.py
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py Outdated
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py Outdated
Comment thread src/inference_endpoint/load_generator/session.py Outdated
tianmu-li added a commit to tianmu-li/endpoints that referenced this pull request May 4, 2026
…n implementation

- openai_adapter: normalize null content to "" instead of literal "None"
  to avoid polluting conversation history in tool-calling responses
- multi_turn_dataset: validate tool_results entries have required
  tool_call_id and content fields; raise InputValidationError at load time
- multi_turn_dataset: remove unused "index" field from samples metadata
- multi_turn_strategy: wrap mark_turn_complete/mark_turn_failed in
  try/except KeyError in on_sample_complete
- multi_turn_strategy: clear _inflight at end of execute() with warning
  if entries remain (transport failure or session abort)
- docs: remove prescriptive concurrency sizing guide; replace with
  definition of what target_concurrency controls
- docs: rename "Long Conversations" to "Conversations with Many Turns"
- docs: add dataset validation utility reference in Troubleshooting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tianmu-li added a commit to tianmu-li/endpoints that referenced this pull request May 4, 2026
- Fix refusal field set to literal string "None" instead of "" in
  openai_adapter.py — made downstream refusal checks incorrectly truthy
- Add test_pipeline_error_propagated to verify execute() re-raises
  worker exceptions instead of swallowing them via gather(return_exceptions=True)
- Clarify MultiTurnStrategy docstring and MULTI_TURN_QUICKSTART.md:
  target_concurrency = simultaneous conversations (not requests);
  each active conversation has exactly 1 in-flight turn at a time
- Remove unjustified "Common Configurations" section from quickstart
- Correct misleading "workers = concurrent conversations" tip; clarify
  client.workers and target_concurrency are independent layers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tianmu-li added a commit to tianmu-li/endpoints that referenced this pull request May 4, 2026
…ategy

Rewrites MultiTurnStrategy to issue subsequent turns synchronously inside
on_sample_complete() (zero event-loop delay), removing pre-spawned worker
tasks and per-conversation asyncio.Event waiting. ConversationState no
longer holds an asyncio.Event; sequencing is driven entirely by the
strategy. Addresses PR mlcommons#285 reviewer request to move turn issuance into
the sample-complete handler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 22:17

Copilot AI left a comment


Pull request overview

Copilot reviewed 43 out of 44 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

src/inference_endpoint/endpoint_client/adapter_protocol.py:134

  • parse_sse_chunk now appends whatever decode_sse_message returns, including None (e.g., when an SSE message has no choices). This contradicts the docstring (“Silently ignores non-content SSE messages”) and forces downstream accumulators to defensively handle None. Consider filtering out None return values here (and/or handling exceptions per JSON doc) so call sites only see meaningful chunk objects.
    def parse_sse_chunk(cls, buffer: bytes, end_pos: int) -> list[Any]:
        """
        Parse SSE chunk and extract all chunk objects.

        Extracts JSON documents from SSE stream and decodes them to chunk objects.
        Silently ignores non-content SSE messages (role, finish_reason, etc).

        Args:
            buffer: Byte buffer containing SSE data
            end_pos: End position in buffer to parse up to

        Returns:
            List of chunk objects extracted from the SSE chunk
        """
        json_docs = cls.SSE_DATA_PATTERN.findall(buffer[:end_pos])
        parsed_contents = []

        try:
            for json_doc in json_docs:
                content = cls.decode_sse_message(json_doc)
                parsed_contents.append(content)
        except Exception:
            # Normal for non-content SSE messages (role, finish_reason, etc)
            pass

        return parsed_contents

src/inference_endpoint/core/types.py:242

  • The Query docstring’s gc=False note only mentions data/headers, but Query now also has a mutable metadata dict. To avoid future misuse (and to match the more explicit QueryResult guidance), consider updating this note to include metadata as well and/or adding an AT-RISK (gc=False) warning that data/metadata/headers must not be mutated to introduce cycles.
    Attributes:
        id: Unique identifier for this query (auto-generated UUID).
        data: Request payload as a dictionary (typically contains prompt, model, etc.).
        metadata: Internal metadata that round-trips through transport (e.g., conversation_id).
        headers: HTTP headers to include in the request (e.g., authorization).
        created_at: Timestamp when query was created (seconds since epoch).

    Example:
        >>> query = Query(
        ...     data={"prompt": "Hello", "model": "Qwen/Qwen3-8B", "max_tokens": 100},
        ...     headers={"Authorization": "Bearer token123"},
        ... )

    Note:
        gc=False: Safe because data/headers are simple key-value pairs without cycles.
        Do NOT store self-referential or cyclic structures in data/headers fields.
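
A possible rewording of that note, extended to cover the new metadata field (suggested wording only, not the PR's final text):

        Note:
            gc=False: Safe because data/metadata/headers are simple key-value pairs
            without cycles. Do NOT store self-referential or cyclic structures in the
            data, metadata, or headers fields.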



Comment thread src/inference_endpoint/openai/types.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
@tianmu-li
Collaborator Author

Hi @arekay-nv, I've addressed the comments. The main change is refactoring to an event-based model and limiting the number of active conversations to the configured concurrency. I'd appreciate it if you could take another look.

Collaborator

@arekay-nv arekay-nv left a comment


Almost there - the only issue is the stickiness of the client-turns.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/dataset_manager/transforms.py Outdated
Comment thread src/inference_endpoint/core/types.py Outdated
Comment thread tests/unit/config/test_schema.py Outdated
Comment thread tests/unit/config/test_schema.py Outdated
@arekay-nv
Collaborator

@viraatc can you check the perf implications - there might be non-negligible overhead for non-agentic/multi-turn workloads.

tianmu-li added a commit to tianmu-li/endpoints that referenced this pull request May 6, 2026
…tation

- Remove ConversationMode enum (single-member) and mode field from
  MultiTurnConfig; drop mode: independent from YAML examples and docs
- Merge AddDefaultColumns into AddStaticColumns(overwrite=False)
- Replace per-call strategy check with construct-time branch in execute.py
- Normalize None tool-calling content to "" in openai_adapter.py
- Delete unused Query.metadata, QueryResult.with_metadata, and
  InFlightRequest.query_metadata plumbing
- Add role-specific validation in _validate_conversation_structure:
  tool rows require non-empty tool_results, assistant rows require
  content or tool_calls
- Backfill explicit sample_index into conversation_metadata["samples"];
  MultiTurnStrategy reads sample_meta["sample_index"] instead of enumerate
- Add AT-RISK gc=False docstring notes to openai/types.py structs with
  mutable container fields
- Rewrite dataset tool_call_ids with model-generated ids in live-history
  mode; add test_live_history_remaps_tool_call_id integration test
- Lift inline imports to top of test_schema.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 23:38
self._validate_conversation_grouping()
self._validate_conversation_structure()
self._validate_turn_numbering()
self.conversation_metadata = self._build_metadata()
Collaborator


conversation_metadata: dict[str, Any] is load-bearing across two files but has no schema. Consumers in multi_turn_strategy.py read metadata["samples"], .get("system_prompts_by_conv", {}), .get("current_turn_messages_by_key", {}), and .get("pre_built_messages_by_key", {}) — typos return silent empty dicts. Worse, this dict is two-phase-initialized: _build_metadata() populates entries without sample_index, and load() mutates them in place to add it (see line 451). Any consumer that reads metadata between __init__ and load() gets a different shape — tests at test_multi_turn_dataset.py:188 already rely on the partial shape, locking in the temporal coupling.

Three concrete risks:

  1. Stringly-typed (conv_id, turn) tuple keys — multi_turn_strategy.py:201 does .get((conv_id, turn)). If conv_id is anything but str (e.g. an int from JSON), it silently returns None → empty messages.
  2. Samples are silently dropped at line 450 if their post-transform key isn't in key_to_sample_index — no log, no count.
  3. The aligned-indices invariant between samples[i]["sample_index"] and self.data is tribal knowledge.

Fix: replace with a typed ConversationMetadata dataclass (frozen, slots) produced once by load(). This eliminates the two-phase shape, makes key types explicit, and gives consumers attribute access with type-checker coverage. This is the single highest-impact refactor available in the PR.
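
A minimal sketch of what such a dataclass could look like, assuming the four consumer-visible fields named above, string conversation ids, and integer turn numbers (all names here are illustrative, not the PR's final shape):

    from dataclasses import dataclass, field
    from typing import Any

    @dataclass(frozen=True, slots=True)
    class ConversationSampleEntry:
        conversation_id: str
        turn: int
        sample_index: int  # index into the post-transform dataframe, set once by load()

    @dataclass(frozen=True, slots=True)
    class ConversationMetadata:
        samples: tuple[ConversationSampleEntry, ...]
        system_prompts_by_conv: dict[str, str] = field(default_factory=dict)
        pre_built_messages_by_key: dict[tuple[str, int], list[dict[str, Any]]] = field(default_factory=dict)
        current_turn_messages_by_key: dict[tuple[str, int], list[dict[str, Any]]] = field(default_factory=dict)

Consumers would then write metadata.pre_built_messages_by_key[(conv_id, turn)] and get an AttributeError (or a mypy error) on a typo instead of a silent empty dict.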

@@ -0,0 +1,794 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator


Integration tests don't validate metrics for multi-turn. This file has no assertion that the metrics aggregator produces ISL/OSL/TPOT values for a multi-turn run. Given the PR adds _precompute_isl_for_multi_turn specifically to populate input_tokens for the hot-path IslTrigger, and OslTrigger/TpotTrigger now have _extract_message paths for tool-call responses (metrics_table.py:312-407), at least one end-to-end test should assert that:

  • IslTrigger reports the precomputed token count (not the text-fallback count)
  • OslTrigger/TpotTrigger produce a value for a streamed tool-call response
  • For a SSE response with tool_calls only (no text content), the TTFT sentinel chunk fires

Without this, a regression in either trigger goes unnoticed.

@arekay-nv
Collaborator

Additional findings outside the PR diff

These can't be posted inline because they reference lines that aren't part of this change, but they're directly worsened or surfaced by it.

src/inference_endpoint/endpoint_client/adapter_protocol.py:126-132 — broad except Exception: pass swallows tool-call decode failures. The block is pre-existing, but its blast radius has grown dramatically in this PR: decode_sse_message now goes through msgspec.json.decode(..., type=SSEMessage) with an expanded SSEDelta / SSEChoice schema that includes tool_calls: list[dict[str, Any]] | None. The catch now silently hides:

  • msgspec.DecodeError from any malformed SSE payload
  • msgspec.ValidationError when a streamed tool-call partial doesn't conform (server sends tool_calls as an object, non-string id, schema drift)
  • The comment claims this is "normal for non-content SSE messages (role, finish_reason, etc)" — but the new accumulator handles those explicitly, so the rationale is stale and the catch now hides real bugs.

A loop-level try means one decode failure mid-batch silently truncates the chunk list — surviving documents in the same SSE flush are lost. Tool-call accumulation can drop the suffix of a streamed response, producing invalid JSON in conversation history without any log. Recommend: narrow the catch to msgspec.DecodeError/ValidationError, log at warning with query_id and the offending payload prefix, and don't break out of the loop on a single bad doc — continue past it.
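
As a rough illustration of that recommendation (the helper name and log format are made up, and the PR ultimately keeps a single batch-level try for performance — see the later discussion):

    import logging

    import msgspec

    logger = logging.getLogger(__name__)

    def decode_sse_documents(json_docs: list[bytes], decode) -> list:
        """Per-document decoding sketch: a malformed frame is logged and skipped
        instead of truncating the rest of the SSE flush. `decode` stands in for
        the adapter's decode_sse_message."""
        parsed = []
        for json_doc in json_docs:
            try:
                content = decode(json_doc)
            except (msgspec.DecodeError, msgspec.ValidationError) as exc:
                logger.warning(
                    "skipping malformed SSE document (%s): %r",
                    type(exc).__name__,
                    json_doc[:80],
                )
                continue
            if content is not None:
                parsed.append(content)
        return parsed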

src/inference_endpoint/load_generator/session.py:465-472 — inflight count drifts negative after timed-out late response. _handle_timeout in MultiTurnStrategy removes the entry from MultiTurnStrategy._inflight, but phase_issuer.uuid_to_index retains the query id. If the late response eventually arrives, _handle_response still hits the query_id in phase_issuer.uuid_to_index branch and decrements phase_issuer.inflight — potentially driving it below zero. The <= 0 check in the drain logic still fires (already <= 0), so no symptom in normal runs, but the invariant violation will silently mask future bugs. Recommend: clear uuid_to_index[query_id] from _handle_timeout, or guard the decrement.

tests/unit/core/test_types.py — missing class-level @pytest.mark.unit markers. TestErrorData (line 36), TestQuerySerialization (50), TestQueryResultSerialization (189), TestStreamChunkSerialization (470), TestQueryResultWorkerPatterns (592), TestTextAfterFirstChunk (693), TestMixedTypeSerialization (747) lack the marker. Only TestTextAfterFirstChunkToolCalls near line 896 has it. This breaks pytest -m unit filtering — these tests would not run in a -m unit CI lane.

Test gaps worth adding (the existing tests catch them but not as explicit cases):

  • _handle_timeout failure bookkeeping (state.failed_turns increments for the timed-out turn and every remaining un-issued turn, state.is_complete() becomes True, _fill_slot() runs)
  • phase_issuer.issue() returning None mid-stream — the FakePhaseIssuer in test_multi_turn_strategy.py defines stop_after but no test ever passes a non-None value, so the session-stop branch in multi_turn_strategy.py:219-223 is untested
  • KeyError paths in on_sample_complete and ConversationManager.mark_turn_*
  • _expand_tool_results validation paths (tool_call_id is None, content is None) and the "empty tool_results list" warning
  • _PROPAGATED_KEYS = {"model", "max_completion_tokens", "max_new_tokens", "stream", "tools"} propagation from first user row — current tests supply tools on every relevant row, so propagation logic isn't exercised
  • Accumulator: parallel tool_calls at different indices (only idx=0 tested), out-of-order indices (currently sorted at finalize, never tested), partial without index key (defaults to 0)
  • SSE streaming end-to-end with real tool-call deltas — the echo server doesn't emit tool_calls deltas, so the new SSEDelta.tool_calls decoding path is only covered by unit tests, not the real transport pipeline

Type tightening — ChatMessage.role: str could be Literal["user", "assistant", "system", "tool"]. msgspec accepts Literal types and would reject unknown values on decode — a free encapsulation win for the wire-format type. The dataset validator already enforces this at multi_turn_dataset.py:154-159; the API boundary doesn't.
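
A sketch of that tightening, assuming ChatMessage is a msgspec.Struct as elsewhere in this PR (the content type is simplified here):

    from typing import Literal

    import msgspec

    class ChatMessage(msgspec.Struct):
        # msgspec validates Literal fields on decode, so an unknown role is
        # rejected at the API boundary instead of passing through as an
        # arbitrary string.
        role: Literal["user", "assistant", "system", "tool"]
        content: str | None = None  # simplified; the real field is ChatMessageContent | None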


Generated by a multi-agent review pass (code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer). See inline comments for the in-diff findings.

tianmu-li and others added 3 commits May 13, 2026 17:24
Correctness fixes:
- multi_turn_strategy: derive response_text from TextModelOutput.output
  directly so tool-call JSON is not duplicated into assistant history
- multi_turn_strategy: KeyError path in on_sample_complete now pops
  _active_iters and calls _fill_slot() to prevent session hang
- multi_turn_strategy: logger.exception preserves traceback on issuance failure
- multi_turn_strategy: raise InputValidationError at __init__ when
  live-history mode (use_dataset_history=False) is combined with tool turns;
  removes the has_tool_msg warning that sent unissued tool_call_ids
- multi_turn_strategy: _handle_timeout synthesises a failure QueryResult and
  routes it through the composite callback so accuracy collector and event
  logger see timed-out turns; counts and logs dropped downstream turns;
  late responses get a debug log instead of silent drop
- execute.py: each hook in _on_sample_complete wrapped independently so a
  strategy failure cannot suppress accuracy collection
- execute.py: move AutoTokenizer import to top-level; narrow ISL precompute
  exception to (TemplateError, KeyError, ValueError, TypeError); raise
  RuntimeError when all samples fail
- token_metrics: join fallback parts with "\n" to avoid cross-boundary
  token merging; logger.exception for baseline computation failure
- adapter_protocol: per-document SSE try/except so one bad frame does not
  drop the rest of the buffer; filter None returns from decode_sse_message
- openai/types: add role field to SSEDelta to accept streaming first frame
- dataset_manager/factory: skip ColumnRemap for MultiTurnDataset
- multi_turn_dataset: warn and skip samples missing pre-built messages
- config/schema: add gt=0 validator on MultiTurnConfig.turn_timeout_s

Docs:
- examples/09_MultiTurn/README.md: correct concurrency/timeout semantics,
  mark per-conversation metrics as planned
- examples/10_CollectOutputs: add example for output collection

Tests:
- test_live_history_rejects_tool_turns: asserts InputValidationError at init
- test_isl_precomputed_for_dataset_history: guards ISL precompute hot path
- annotate bare except blocks in CapturingEchoServer with explanatory comment

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- execute.py: guard accuracy phase against MULTI_TURN load pattern on
  non-MultiTurnDataset (currently unreachable, made explicit)
- runtime_settings.py: clamp multi-turn sample count to dataset size;
  warn when min_sample_count exceeds client-turn count
- token_metrics.py: emit one-shot per-(tokenizer, exc-class) warning
  when apply_chat_template falls back to whitespace tokenization
- strategy.py, schema.py, README.md: fix stale docstrings/docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Top-level `import jinja2` in execute.py required an undeclared runtime
dep, aborting CI at collection time. Revert the narrow except tuple back
to `except Exception:` — `logger.exception` already delivers the traceback
visibility the reviewer asked for. Fix `test_precompute_isl.py` patch
targets (broken since AutoTokenizer moved to module top) and guard the
all-failed check against the zero-samples-with-messages edge case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 23:38

Copilot AI left a comment


Pull request overview

Copilot reviewed 56 out of 57 changed files in this pull request and generated 22 comments.

Comment on lines +263 to +289
    def _handle_timeout(self, query_id: str, conv_id: str) -> None:
        """Called by the event loop when a turn response does not arrive in time."""
        if self._inflight.pop(query_id, None) is None:
            return
        self._timeout_handles.pop(query_id, None)

        logger.warning(
            "Turn timed out for conversation %s (query=%s)", conv_id, query_id
        )

        self._conv_manager.mark_turn_failed(
            conv_id, store_in_history=self._store_in_history
        )

        # Route a synthetic failure result so the accuracy collector and event
        # logger see the timed-out turn.
        if self._session_on_sample_complete is not None:
            timeout_result = QueryResult(
                id=query_id,
                error=ErrorData(
                    error_type="TurnTimeout",
                    error_message=f"turn timeout after {self._turn_timeout_s}s",
                ),
            )
            try:
                self._session_on_sample_complete(timeout_result)
            except Exception:
Comment on lines +321 to +324
raw = tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
Comment on lines +216 to +227
if choice.message.reasoning_content:
metadata["reasoning_content"] = choice.message.reasoning_content

tool_calls_tuple = (
tuple(choice.message.tool_calls) if choice.message.tool_calls else None
)
return QueryResult(
id=result_id or response.id,
response_output=TextModelOutput(output=response.choices[0].message.content),
response_output=TextModelOutput(
output=choice.message.content or "",
tool_calls=tool_calls_tuple,
),
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread examples/10_CollectOutputs/benchmark_with_output_collection.yaml Outdated
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py Outdated
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
Comment on lines +267 to +273
msg: dict[str, Any] = {}
for key in ("role", "content", "tool_calls", "tool_results"):
val = prior_row.get(key)
if val is not None and not (
isinstance(val, float) and pd.isna(val)
):
msg[key] = val
Comment thread src/inference_endpoint/load_generator/multi_turn_strategy.py Outdated
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated
tianmu-li and others added 2 commits May 14, 2026 00:53
Code fixes:
- Timeout handler now decrements PhaseIssuer.inflight, preventing
  _drain_inflight() from hanging when any turn times out (#1)
- _precompute_isl_for_multi_turn normalizes tool_calls arguments
  before apply_chat_template, fixing Hermes-style template failures
  on tool-call-bearing samples (mlcommons#2)
- Non-streaming reasoning_content now set on TextModelOutput.reasoning
  so OSL/TPOT tokenization includes it (mlcommons#3)
- live-history data_override clears stale input_tokens and token_ids (mlcommons#18)
- execute() cleanup runs in finally block, surviving CancelledError (mlcommons#21)
- Validator rejects plain-assistant->tool (missing tool_calls) (mlcommons#13)
- Message builders include 'name' field for prior and current turns (mlcommons#19, mlcommons#20)
- max_new_tokens sets all three adapter aliases (mlcommons#22)
- Multi-turn accuracy datasets raise InputValidationError (not yet supported) (mlcommons#4)

Examples and docs:
- Remove ineffective samples: 10 from multi-turn YAML examples (mlcommons#8, mlcommons#9)
- Fix events.jsonl docs (no conversation_id/turn_number fields) (mlcommons#6, mlcommons#7, mlcommons#14)
- Fix workers -> num_workers in quickstart config snippet (mlcommons#10)
- Fix agentic role-sequence grammar in README (mlcommons#15)
- Document that multi-turn accuracy datasets are not yet supported
- Delete examples/10_CollectOutputs/ (mlcommons#11, mlcommons#12, mlcommons#16, mlcommons#17)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…), tool-call ISL test

- R1: Move _build_metadata() from __init__ to load() so pre_built_messages_by_key
  is always built against the post-transform dataframe (latent desync bug).
  conversation_metadata is None until load() is called.

- R2: Replace conversation_metadata dict[str, Any] with a ConversationMetadata dataclass
  (+ ConversationSampleEntry). Attribute access in MultiTurnStrategy replaces .get()
  calls; mypy now flags typos at the call site.

- R3: Append unit test to TestPrecomputeIslForMultiTurn asserting that
  _normalize_tool_calls_for_template converts tool_calls[].function.arguments
  from JSON string to dict before apply_chat_template (previously untested branch).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 14, 2026 04:58

Copilot AI left a comment


Pull request overview

Copilot reviewed 54 out of 55 changed files in this pull request and generated 12 comments.

Comment on lines +190 to +195
    VALID_NEXT: dict[str, set[str]] = {
        "start": {"user"},
        "user": {"assistant"},
        "assistant": {"tool", "user"},
        "tool": {"assistant", "user"},
    }
Comment on lines +285 to +296
# Route a synthetic failure result so the accuracy collector and event
# logger see the timed-out turn.
if self._session_on_sample_complete is not None:
timeout_result = QueryResult(
id=query_id,
error=ErrorData(
error_type="TurnTimeout",
error_message=f"turn timeout after {self._turn_timeout_s}s",
),
)
try:
self._session_on_sample_complete(timeout_result)
Comment on lines +453 to +472
@@ -379,7 +469,7 @@ def _build_phases(ctx: BenchmarkContext) -> list[PhaseConfig]:
min_sample_count=acc_ds.num_samples() * acc_ds.repeats,
rng_sched=ctx.rt_settings.rng_sched,
rng_sample_index=ctx.rt_settings.rng_sample_index,
load_pattern=LoadPattern(type=LoadPatternType.MAX_THROUGHPUT),
load_pattern=acc_load_pattern,
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Currently available:

- **Per-turn metrics**: Latency, TTFT, TPOT for each turn
- **Conversation tracking**: All events tagged with conversation_id
self._phase_issuer is not None
and hasattr(self._phase_issuer, "uuid_to_index")
and query_id in self._phase_issuer.uuid_to_index # type: ignore[attr-defined]
):
Comment on lines +152 to +154
self._conv_groups = dict(
list(self.dataframe.groupby("conversation_id", sort=False))
)
f"Conversation {conv_id} turn {row['turn']}: "
"assistant rows must have non-empty 'content' or non-empty 'tool_calls'"
)
prev_assistant_had_tool_calls = has_tool_calls
Comment on lines +101 to +105
if any(m.get("role") == "tool" for m in msgs)
]
if tool_turn_keys:
raise InputValidationError(
"Multi-turn with tool turns requires use_dataset_history=True. "
Comment on lines +334 to +337
raw = tokenizer.apply_chat_template(
normalized_messages,
tokenize=True,
add_generation_prompt=True,
Comment on lines +230 to +233
tool_calls = row.get("tool_calls")
has_tool_calls = (
isinstance(tool_calls, list) and len(tool_calls) > 0
)
tianmu-li and others added 2 commits May 14, 2026 05:37
- Assert conversation_metadata is not None before MultiTurnStrategy
  construction (fixes mypy CI failure at execute.py:634 and
  test_multi_turn.py:89 introduced by R1 moving build to load())
- Narrow SSE decode exception catch from bare Exception to
  msgspec.DecodeError/ValidationError and raise log level to warning
- Clear uuid_to_index entry on turn timeout to prevent inflight counter
  going negative when a late response arrives after timeout
- Add @pytest.mark.unit to 7 unmarked test classes in test_types.py so
  they are included in pytest -m unit CI lane

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Collaborator

@viraatc viraatc left a comment


Looks great, thanks for the MR!
(Completed a first look; will be doing another pass.)

try:
for json_doc in json_docs:
parsed: list[Any] = []
for json_doc in json_docs:
Collaborator


We intentionally put the try/except outside the for-loop for perf reasons; it has a large impact in streaming mode.

Collaborator Author


Moved the try/except back outside the for-loop.

if "messages" in query.data and isinstance(query.data["messages"], list):
messages = []
for message in query.data["messages"]:
if not isinstance(message, dict):
Collaborator


Adapter code should prioritize performance; wondering if it's possible to avoid the isinstance checks and propagate the error directly at runtime.

Collaborator Author


Fixed. isinstance check is removed.


role: str
content: ChatMessageContent
content: ChatMessageContent | None = None
Collaborator


Q: why do we need to have mutable content? Does a chat message with None content make sense?

Collaborator Author


Yes, a tool-dispatching assistant message needs to carry content=null.
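
For illustration, a tool-dispatching assistant message in the OpenAI chat format looks roughly like this (shown as a Python dict; the tool-call id and function are invented for the example):

    assistant_msg = {
        "role": "assistant",
        "content": None,  # no text: the model is dispatching a tool call instead
        "tool_calls": [
            {
                "id": "call_abc123",  # illustrative id
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "arguments": "{\"location\": \"San Francisco\"}",  # JSON-encoded string
                },
            }
        ],
    }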

Copilot AI review requested due to automatic review settings May 14, 2026 21:15

Copilot AI left a comment


Pull request overview

Copilot reviewed 56 out of 57 changed files in this pull request and generated 15 comments.

Comments suppressed due to low confidence (3)

tests/unit/load_generator/test_multi_turn_strategy.py:110

  • This monkey-patched issue method replaces FakePhaseIssuer.issue but also lacks the conversation_id and turn keyword parameters that MultiTurnStrategy passes, so this test will fail with TypeError on the first issued turn.

tests/unit/load_generator/test_multi_turn_strategy.py:174

  • MultiTurnStrategy passes conversation_id and turn into issue(); this TimedIssuer signature does not accept those keywords, so test_turn_ordering_enforced fails before it can validate ordering.

tests/unit/load_generator/test_multi_turn_strategy.py:393

  • MultiTurnStrategy calls issue with conversation_id and turn keyword arguments. This ErrorIssuer signature rejects those keywords, so the test raises a TypeError instead of the intended simulated pipeline error.

self.issued: list[int] = []
self.issued_count = 0

def issue(self, sample_index: int, data_override: dict | None = None) -> str | None:
Comment on lines +190 to +195
    VALID_NEXT: dict[str, set[str]] = {
        "start": {"user"},
        "user": {"assistant"},
        "assistant": {"tool", "user"},
        "tool": {"assistant", "user"},
    }
Comment on lines +275 to +282
        if (
            self._phase_issuer is not None
            and hasattr(self._phase_issuer, "uuid_to_index")
            and query_id in self._phase_issuer.uuid_to_index  # type: ignore[attr-defined]
        ):
            self._phase_issuer.inflight -= 1  # type: ignore[attr-defined]
            del self._phase_issuer.uuid_to_index[query_id]  # type: ignore[attr-defined]
            self._phase_issuer.uuid_to_conv_info.pop(query_id, None)  # type: ignore[attr-defined]
Comment on lines +292 to +303
# Route a synthetic failure result so the accuracy collector and event
# logger see the timed-out turn.
if self._session_on_sample_complete is not None:
timeout_result = QueryResult(
id=query_id,
error=ErrorData(
error_type="TurnTimeout",
error_message=f"turn timeout after {self._turn_timeout_s}s",
),
)
try:
self._session_on_sample_complete(timeout_result)
Comment on lines +310 to +317
it = self._active_iters.pop(conv_id, None)
dropped = 0
if it is not None:
for _ in it:
self._conv_manager.mark_turn_failed(
conv_id, store_in_history=self._store_in_history
)
dropped += 1
stream=query.data.get("stream", False),
max_completion_tokens=query.data.get("max_completion_tokens", 100),
temperature=query.data.get("temperature", 0.7),
tools=query.data.get("tools"),
Comment on lines +825 to +827
# Should not raise error or create prompt column
assert "prompt" not in result.columns
assert "unrelated" in result.columns

Comment on lines +230 to +239
tool_calls = row.get("tool_calls")
has_tool_calls = (
isinstance(tool_calls, list) and len(tool_calls) > 0
)
if is_empty_content and not has_tool_calls:
raise InputValidationError(
f"Conversation {conv_id} turn {row['turn']}: "
"assistant rows must have non-empty 'content' or non-empty 'tool_calls'"
)
prev_assistant_had_tool_calls = has_tool_calls
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated

- ✅ Enforce turn ordering (turn N+1 waits for turn N)
- ✅ Include conversation history in each request
- ✅ Log all turns to events.jsonl
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment on lines +629 to +640
        # Cross-validate load_pattern.type=multi_turn ↔ dataset.multi_turn config
        has_multi_turn_dataset = any(
            d.multi_turn is not None for d in (self.datasets or [])
        )
        if lp.type == LoadPatternType.MULTI_TURN and not has_multi_turn_dataset:
            raise ValueError(
                "load_pattern.type=multi_turn requires at least one dataset with multi_turn config"
            )
        if has_multi_turn_dataset and lp.type != LoadPatternType.MULTI_TURN:
            raise ValueError(
                f"Datasets with multi_turn config require load_pattern.type=multi_turn, got '{lp.type}'"
            )
tianmu-li and others added 2 commits May 14, 2026 22:56
- Revert 595faf4 (conversation_id/turn in EventRecord/PhaseIssuer/session)
- Validate user row content and reject assistant(tool_calls)→user transition
  without intervening tool row in _validate_conversation_structure
- Reject malformed tool_calls (present but not a non-empty list) on assistant rows
- Guard _expand_tool_results against non-dict entries in tool_results list
- Add "tools" to ColumnFilter optional_columns in both OpenAI adapters so
  single-turn datasets with a tools column are not silently stripped
- Replace RuntimeError with logger.warning in _precompute_isl_for_multi_turn
  so template-incompatible tokenizers fall back instead of aborting the benchmark
- Fix schema cross-validation to check only the performance dataset for
  multi_turn config instead of any dataset
- Move seeding loop inside try/finally in MultiTurnStrategy.execute so
  cleanup runs even if _start_conversation raises
- Add inflight/uuid_to_index to FakePhaseIssuer for _handle_timeout coverage
- Strengthen test_no_matching_columns to assert unrelated columns are preserved
- Update MULTI_TURN_QUICKSTART.md to accurately describe which turns produce
  sample events and how to correlate events back to conversations

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Match new error messages from the multi_turn schema cross-validation
fix that scopes validation to the performance dataset specifically.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 14, 2026 23:12
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 54 out of 55 changed files in this pull request and generated 11 comments.

target_concurrency=perf_lp.target_concurrency,
)
else:
acc_load_pattern = perf_lp
and query_id in self._phase_issuer.uuid_to_index # type: ignore[attr-defined]
):
self._phase_issuer.inflight -= 1 # type: ignore[attr-defined]
del self._phase_issuer.uuid_to_index[query_id] # type: ignore[attr-defined]
Comment on lines +286 to +297
# Route a synthetic failure result so the accuracy collector and event
# logger see the timed-out turn.
if self._session_on_sample_complete is not None:
timeout_result = QueryResult(
id=query_id,
error=ErrorData(
error_type="TurnTimeout",
error_message=f"turn timeout after {self._turn_timeout_s}s",
),
)
try:
self._session_on_sample_complete(timeout_result)
Comment on lines 131 to +137
try:
for json_doc in json_docs:
content = cls.decode_sse_message(json_doc)
parsed_contents.append(content)
except Exception:
# Normal for non-content SSE messages (role, finish_reason, etc)
pass

return parsed_contents
if content is not None:
parsed.append(content)
except (msgspec.DecodeError, msgspec.ValidationError) as exc:
logger.warning("skipping malformed SSE batch (%s)", type(exc).__name__)
Comment on lines +56 to +67
c = m.get("content")
if isinstance(c, str) and c:
parts.append(c)
elif isinstance(c, list):
parts.extend(
p["text"]
for p in c
if isinstance(p, dict)
and p.get("type") == "text"
and isinstance(p.get("text"), str)
)
return "\n".join(parts) if parts else None
Only affects dataset-history turns; live-history turns override 'messages'
at runtime so the stored input_tokens are stale (acceptable approximation).
"""
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
Comment on lines +358 to +368
if result.error is not None:
self._conv_manager.mark_turn_failed(
conv_id, store_in_history=self._store_in_history
)
else:
self._conv_manager.mark_turn_complete(
conv_id,
response_text or "",
store_in_history=self._store_in_history,
metadata=result.metadata,
)
Comment on lines +330 to +334
raw = tokenizer.apply_chat_template(
normalized_messages,
tokenize=True,
add_generation_prompt=True,
)
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Comment on lines +214 to +215
cd examples/09_MultiTurn
inference-endpoint benchmark from-config --config multi_turn_benchmark.yaml
Comment on lines +516 to +524
        # Normalize max-tokens across all adapter aliases.
        max_tokens_val = (
            sample.pop("max_new_tokens", None)
            or sample.get("max_completion_tokens")
            or 128
        )
        sample["max_new_tokens"] = max_tokens_val
        sample["max_completion_tokens"] = max_tokens_val
        sample["max_tokens"] = max_tokens_val
tianmu-li and others added 2 commits May 15, 2026 02:02
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>