fix: OpenAI response decoding defaults, multimodal prompt handling, and tokenizer_name override #307
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a tokenizer_name configuration option to allow overriding the tokenizer source, which is particularly useful when the serving model uses a local path or a quantized checkpoint. It also enhances the robustness of OpenAI response parsing by adding default values for fields frequently omitted by inference servers and updates the load generator to handle multimodal prompt data. A suggestion was made to improve the tokenizer existence check by verifying local file system paths before querying the Hugging Face Hub, ensuring local tokenizers are correctly identified without triggering API errors.
Force-pushed from 5d35e99 to a33690b
The override works for the case where the local checkpoint genuinely doesn't ship tokenizer files, but I think it's papering over a probe bug rather than fixing the root cause.
_check_tokenizer_exists (execute.py:181) calls huggingface_hub.model_info(model_name) first, which only resolves repo IDs. Pass it a path like /root/.cache/huggingface/hub/... and it raises → except Exception → returns False → tokenizer_name becomes None at execute.py:305 → execute.py:465 never passes --tokenizer to the metrics aggregator subprocess → ISL/OSL/TPOT silently disabled.
But the downstream loader at token_metrics.py:79 is AutoTokenizer.from_pretrained(...), which accepts a local directory containing tokenizer.json / tokenizer_config.json just fine. So there are really two cases:
1. Local NVFP4 snapshot that contains tokenizer files — the typical case, since vLLM needs the tokenizer to serve, so it's almost always present in the snapshot. The probe is just being too strict; a short-circuit like "if `Path(model_name).is_dir()` and a tokenizer file exists, return True" would make everything work with no user-visible knob.
2. Local path with no tokenizer files — rare (weights-only directories). Here `tokenizer_name` is the right escape hatch, but this case rarely comes up in practice.
This PR ships (2) but leaves (1) broken — even when the served directory contains a perfectly usable `tokenizer.json`, the user is still forced to set `tokenizer_name`. Could we extend `_check_tokenizer_exists` to check whether tokenizer files exist when the model checkpoint is a local path?
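For concreteness, a minimal sketch of that short-circuit, assuming the existing Hub probe stays as the fallback; the function name and call chain come from the comment above, while the exact tokenizer file names and the simplified fallback are assumptions, not the repository's actual implementation:

```python
from pathlib import Path

import huggingface_hub


def _check_tokenizer_exists(model_name: str) -> bool:
    """Return True if a tokenizer can be loaded for `model_name` (sketch)."""
    local_path = Path(model_name)
    if local_path.is_dir():
        # Local checkpoint: AutoTokenizer.from_pretrained() only needs the
        # tokenizer files to be present in the directory, so check for them
        # directly instead of asking the Hub about a path it cannot resolve.
        return any(
            (local_path / fname).is_file()
            for fname in ("tokenizer.json", "tokenizer_config.json")
        )
    try:
        # Repo ID: keep the existing Hub probe (simplified here).
        huggingface_hub.model_info(model_name)
        return True
    except Exception:
        return False
```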
Force-pushed from 26ab49b to b94e164
Force-pushed from b94e164 to b783529
wangshangsam left a comment
Nit, but otherwise LGTM!
Co-authored-by: Shang Wang <shangw@nvidia.com>
Summary
Fixes three issues encountered when running the Qwen3-VL-235B-A22B benchmark against vLLM/Dynamo with the Shopify multimodal dataset:
- `openai/types.py` — vLLM omits optional response fields (`content`, `refusal`, `finish_reason`, `usage`, `system_fingerprint`, etc.) that our `msgspec` structs required to be present. Added safe defaults so decoding succeeds regardless of what the server populates (see the first sketch after this list).
- `session.py` — The `ShopifyMultimodalFormatter` produces `prompt` as `list[dict]` (OpenAI vision content parts), which the HTTP adapter handles correctly. However, `PhaseIssuer`'s metrics tracking assigned it directly to `PromptData.text` (typed `str | None`), causing a decode failure. Added a type guard to coerce non-string prompts to `None` (see the second sketch after this list).
- `config/schema.py` + `execute.py` — When the model is served from a container-local path (e.g. an NVFP4 snapshot under `/root/.cache/huggingface/hub/...`), tokenizer loading fails because the path isn't a valid HF repo ID. Added `model_params.tokenizer_name` to decouple the tokenizer source from the serving model name.
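First, a hedged sketch of what safe defaults on the response structs might look like. The field names come from the summary above; the struct names, nesting, and defaults are illustrative rather than the repository's actual definitions:

```python
import msgspec


class Usage(msgspec.Struct):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


class Message(msgspec.Struct):
    role: str = "assistant"
    # vLLM may omit these fields entirely, so default them instead of
    # requiring them to be present in the decoded JSON.
    content: str | None = None
    refusal: str | None = None


class Choice(msgspec.Struct):
    index: int = 0
    message: Message | None = None
    finish_reason: str | None = None


class ChatCompletionResponse(msgspec.Struct):
    id: str = ""
    choices: list[Choice] = msgspec.field(default_factory=list)
    usage: Usage | None = None
    system_fingerprint: str | None = None


# Decoding succeeds even if the server sends only a minimal payload.
resp = msgspec.json.decode(b'{"choices": [{"index": 0}]}', type=ChatCompletionResponse)
```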
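Second, a similarly hedged sketch of the type guard in the metrics path. Only `PromptData.text` and its `str | None` typing are taken from the summary; the helper function and its call site are hypothetical:

```python
import msgspec


class PromptData(msgspec.Struct):
    text: str | None = None


def to_prompt_text(prompt: str | list[dict] | None) -> str | None:
    # Multimodal prompts arrive as list[dict] (OpenAI vision content parts);
    # only plain strings belong in PromptData.text, so coerce anything else
    # to None rather than letting decoding fail later.
    return prompt if isinstance(prompt, str) else None


vision_prompt = [{"type": "image_url", "image_url": {"url": "..."}}]
assert PromptData(text=to_prompt_text(vision_prompt)).text is None
```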
Testing
Validated end-to-end on GB300 (offline scenario, 1000 samples): all samples completed successfully with no decode errors and full latency/throughput metrics generated.
Note: The example YAMLs assume the server is started with `--served-model-name` matching `model_params.name`. When serving from a local NVFP4 checkpoint (the typical NVIDIA submission setup), set `model_params.name` to the actual served path, `tokenizer_name` to the upstream HF repo ID (e.g. `Qwen/Qwen3-VL-235B-A22B-Instruct`), and `endpoint_config.endpoints` to the node IP.
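For illustration, a config fragment under those assumptions. Only `model_params.name`, `model_params.tokenizer_name`, and `endpoint_config.endpoints` are keys named in this PR; the surrounding structure, the snapshot path, and the endpoint URL format are placeholders:

```yaml
model_params:
  name: /root/.cache/huggingface/hub/<local-nvfp4-snapshot>  # actual served path (placeholder)
  tokenizer_name: Qwen/Qwen3-VL-235B-A22B-Instruct           # upstream HF repo ID
endpoint_config:
  endpoints:
    - http://<node-ip>:8000/v1                               # node serving the model (placeholder)
```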