fix: OpenAI response decoding defaults, multimodal prompt handling, and tokenizer_name override #307
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a tokenizer_name configuration option to allow overriding the tokenizer source, which is particularly useful when the serving model uses a local path or a quantized checkpoint. It also enhances the robustness of OpenAI response parsing by adding default values for fields frequently omitted by inference servers and updates the load generator to handle multimodal prompt data. A suggestion was made to improve the tokenizer existence check by verifying local file system paths before querying the Hugging Face Hub, ensuring local tokenizers are correctly identified without triggering API errors.
Force-pushed from 5d35e99 to a33690b
The override works for the case where the local checkpoint genuinely doesn't ship tokenizer files, but I think it's papering over a probe bug rather than fixing the root cause.
_check_tokenizer_exists (execute.py:181) calls huggingface_hub.model_info(model_name) first, which only resolves repo IDs. Pass it a path like /root/.cache/huggingface/hub/... and it raises → except Exception → returns False → tokenizer_name becomes None at execute.py:305 → execute.py:465 never passes --tokenizer to the metrics aggregator subprocess → ISL/OSL/TPOT silently disabled.
But the downstream loader at token_metrics.py:79 is AutoTokenizer.from_pretrained(...), which accepts a local directory containing tokenizer.json / tokenizer_config.json just fine. So there are really two cases:
1. Local NVFP4 snapshot that contains tokenizer files — the typical case, since vLLM needs the tokenizer to serve, so it's almost always present in the snapshot. The probe is just being too strict; a short-circuit like "if `Path(model_name).is_dir()` and a tokenizer file exists, return True" would make everything work with no user-visible knob.
2. Local path with no tokenizer files — rare (weights-only directories). Here `tokenizer_name` is the right escape hatch, but this case rarely comes up in practice.
This PR ships (2) but leaves (1) broken — even when the served directory contains a perfectly usable `tokenizer.json`, the user is still forced to set `tokenizer_name`. Could we extend `_check_tokenizer_exists` to check whether tokenizer files exist when the model checkpoint is a local path?
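For concreteness, a minimal sketch of that short-circuit, assuming the existing Hub probe stays as the fallback; the function name and call chain come from the comment above, while the exact tokenizer file names and the simplified fallback are assumptions, not the repository's actual implementation:

```python
from pathlib import Path

import huggingface_hub


def _check_tokenizer_exists(model_name: str) -> bool:
    """Return True if a tokenizer can be loaded for `model_name` (sketch)."""
    local_path = Path(model_name)
    if local_path.is_dir():
        # Local checkpoint: AutoTokenizer.from_pretrained() only needs the
        # tokenizer files to be present in the directory, so check for them
        # directly instead of asking the Hub about a path it cannot resolve.
        return any(
            (local_path / fname).is_file()
            for fname in ("tokenizer.json", "tokenizer_config.json")
        )
    try:
        # Repo ID: keep the existing Hub probe (simplified here).
        huggingface_hub.model_info(model_name)
        return True
    except Exception:
        return False
```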
Force-pushed from 26ab49b to b94e164
Force-pushed from b94e164 to b783529
wangshangsam left a comment
Nit, but otherwise LGTM!
Co-authored-by: Shang Wang <shangw@nvidia.com>
Summary
Fixes three issues encountered when running the Qwen3-VL-235B-A22B benchmark against vLLM/Dynamo with the Shopify multimodal dataset:
- `openai/types.py` — vLLM omits optional response fields (`content`, `refusal`, `finish_reason`, `usage`, `system_fingerprint`, etc.) that our `msgspec` structs required to be present. Added safe defaults so decoding succeeds regardless of what the server populates (see the first sketch after this list).
- `session.py` — The `ShopifyMultimodalFormatter` produces `prompt` as `list[dict]` (OpenAI vision content parts), which the HTTP adapter handles correctly. However, `PhaseIssuer`'s metrics tracking assigned it directly to `PromptData.text` (typed `str | None`), causing a decode failure. Added a type guard to coerce non-string prompts to `None` (see the second sketch after this list).
- `config/schema.py` + `execute.py` — When the model is served from a container-local path (e.g. an NVFP4 snapshot under `/root/.cache/huggingface/hub/...`), tokenizer loading fails because the path isn't a valid HF repo ID. Added `model_params.tokenizer_name` to decouple the tokenizer source from the serving model name.
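First, a hedged sketch of what safe defaults on the response structs might look like. The field names come from the summary above; the struct names, nesting, and defaults are illustrative rather than the repository's actual definitions:

```python
import msgspec


class Usage(msgspec.Struct):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


class Message(msgspec.Struct):
    role: str = "assistant"
    # vLLM may omit these fields entirely, so default them instead of
    # requiring them to be present in the decoded JSON.
    content: str | None = None
    refusal: str | None = None


class Choice(msgspec.Struct):
    index: int = 0
    message: Message | None = None
    finish_reason: str | None = None


class ChatCompletionResponse(msgspec.Struct):
    id: str = ""
    choices: list[Choice] = msgspec.field(default_factory=list)
    usage: Usage | None = None
    system_fingerprint: str | None = None


# Decoding succeeds even if the server sends only a minimal payload.
resp = msgspec.json.decode(b'{"choices": [{"index": 0}]}', type=ChatCompletionResponse)
```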
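Second, a similarly hedged sketch of the type guard in the metrics path. Only `PromptData.text` and its `str | None` typing are taken from the summary; the helper function and its call site are hypothetical:

```python
import msgspec


class PromptData(msgspec.Struct):
    text: str | None = None


def to_prompt_text(prompt: str | list[dict] | None) -> str | None:
    # Multimodal prompts arrive as list[dict] (OpenAI vision content parts);
    # only plain strings belong in PromptData.text, so coerce anything else
    # to None rather than letting decoding fail later.
    return prompt if isinstance(prompt, str) else None


vision_prompt = [{"type": "image_url", "image_url": {"url": "..."}}]
assert PromptData(text=to_prompt_text(vision_prompt)).text is None
```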
Testing
Validated end-to-end on GB300 (offline scenario, 1000 samples): all samples completed successfully with no decode errors and full latency/throughput metrics generated.
Note: The example YAMLs assume the server is started with `--served-model-name` matching `model_params.name`. When serving from a local NVFP4 checkpoint (the typical NVIDIA submission setup), set `model_params.name` to the actual served path, `tokenizer_name` to the upstream HF repo ID (e.g. `Qwen/Qwen3-VL-235B-A22B-Instruct`), and `endpoint_config.endpoints` to the node IP.
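For illustration, a config fragment under those assumptions. Only `model_params.name`, `model_params.tokenizer_name`, and `endpoint_config.endpoints` are keys named in this PR; the surrounding structure, the snapshot path, and the endpoint URL format are placeholders:

```yaml
model_params:
  name: /root/.cache/huggingface/hub/<local-nvfp4-snapshot>  # actual served path (placeholder)
  tokenizer_name: Qwen/Qwen3-VL-235B-A22B-Instruct           # upstream HF repo ID
endpoint_config:
  endpoints:
    - http://<node-ip>:8000/v1                               # node serving the model (placeholder)
```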