fix: OpenAI response decoding defaults, multimodal prompt handling, and tokenizer_name override #307
Conversation
Code Review
This pull request introduces a tokenizer_name configuration option to allow overriding the tokenizer source, which is particularly useful when the serving model uses a local path or a quantized checkpoint. It also enhances the robustness of OpenAI response parsing by adding default values for fields frequently omitted by inference servers and updates the load generator to handle multimodal prompt data. A suggestion was made to improve the tokenizer existence check by verifying local file system paths before querying the Hugging Face Hub, ensuring local tokenizers are correctly identified without triggering API errors.
```python
tokenizer_source = config.model_params.tokenizer_name or model_name
tokenizer_name = (
    tokenizer_source if _check_tokenizer_exists(tokenizer_source) else None
)
```
The current implementation of _check_tokenizer_exists relies on huggingface_hub.model_info, which is designed for repository IDs and will raise an exception (and log a warning) when provided with a local file system path. Since tokenizer_name is explicitly documented in the schema to support local paths, the logic should check for local existence first. This avoids incorrect warnings and ensures local tokenizers are correctly identified and passed to the metrics aggregator.
Suggested change:

```diff
 tokenizer_source = config.model_params.tokenizer_name or model_name
 tokenizer_name = (
-    tokenizer_source if _check_tokenizer_exists(tokenizer_source) else None
+    tokenizer_source
+    if Path(tokenizer_source).exists() or _check_tokenizer_exists(tokenizer_source)
+    else None
 )
```
Three independent issues blocked the Qwen3-VL-235B-A22B benchmark
against a real vLLM/Dynamo server. They are bundled here because
they all surface on the same flow (Shopify multimodal dataset →
OpenAI chat completions on vLLM):
1. **OpenAI response decoding too strict (`openai/types.py`).**
The msgspec structs that decode `/v1/chat/completions` responses
declared `content`, `refusal`, `finish_reason`, `created`,
`model`, `usage`, and `system_fingerprint` as required fields
with no defaults. vLLM (and the OpenAI spec itself) treats these
as nullable/optional — vLLM omits `usage` on non-final SSE
chunks, rarely sets `system_fingerprint`, and may omit `content`
on tool-only responses. Decoding any such response raised a
msgspec validation error before the adapter could return a
QueryResult. Added safe defaults so the decoder accepts whatever
the server legitimately sends (see the first sketch after this list).
2. **Multimodal prompt crashes metrics tracking (`session.py`).**
`PhaseIssuer.issue()` constructs a `PromptData(text=..., ...)`
for the ISSUED event. `PromptData.text` is typed `str | None`,
but the Shopify multimodal formatter sets `prompt` to OpenAI
vision content parts (`[{"type": "text", ...},
{"type": "image_url", ...}]`). msgspec rejected the list and the
issuer raised, killing the run before any sample was sent. The
HTTP adapter handles list-shaped prompts correctly — only the
metrics tracking path needed fixing. Type-guard the assignment
so non-strings fall through to None and ISL is computed from
`token_ids` (which the multimodal datasets already pre-tokenize);
see the second sketch after this list.
3. **Tokenizer cannot be loaded from container-local model paths
(`schema.py`, `execute.py`).** When the model is served from a
local cache path (e.g. `/root/.cache/huggingface/hub/models--…`)
instead of a HF repo ID, `_check_tokenizer_exists(model_name)`
fails its `model_info()` probe and the aggregator runs without
ISL/OSL/TPOT metrics. Added a `tokenizer_name` override field
to `ModelParams` so the tokenizer source can be set explicitly
while the served model name remains a path. The example YAML
configs document the new field with a commented hint; the third
sketch after this list shows the override.
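
A minimal sketch of fix 1, assuming msgspec structs shaped like the OpenAI chat-completions schema; the field names and nesting here are illustrative rather than the exact definitions in `openai/types.py`:

```python
import msgspec


class Usage(msgspec.Struct):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


class Message(msgspec.Struct):
    role: str = "assistant"
    content: str | None = None   # omitted on tool-only responses
    refusal: str | None = None


class Choice(msgspec.Struct):
    index: int = 0
    message: Message | None = None
    finish_reason: str | None = None  # null until the final chunk


class ChatCompletionResponse(msgspec.Struct):
    id: str = ""
    object: str = "chat.completion"
    created: int = 0
    model: str = ""
    choices: list[Choice] = []              # msgspec copies mutable defaults per instance
    usage: Usage | None = None              # vLLM omits this on non-final SSE chunks
    system_fingerprint: str | None = None   # rarely populated by vLLM


# A minimal vLLM-style payload with most optional fields absent now decodes
# instead of raising a validation error.
decoder = msgspec.json.Decoder(ChatCompletionResponse)
resp = decoder.decode(b'{"id": "chatcmpl-1", "choices": [{"index": 0}]}')
assert resp.usage is None and resp.system_fingerprint is None
```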
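
A minimal sketch of fix 2; `PromptData` follows the description above, but the builder function and its arguments are simplified assumptions rather than the real `PhaseIssuer.issue()` code:

```python
import msgspec


class PromptData(msgspec.Struct):
    """Cut-down stand-in for the metrics-tracking struct described above."""
    text: str | None = None
    token_ids: list[int] | None = None


def build_prompt_data(prompt: str | list[dict], token_ids: list[int]) -> PromptData:
    # Multimodal formatters emit OpenAI vision content parts as list[dict];
    # only plain strings are recorded as text, anything else falls through to
    # None so input length comes from the pre-tokenized token_ids instead.
    return PromptData(
        text=prompt if isinstance(prompt, str) else None,
        token_ids=token_ids,
    )


# Both prompt shapes construct cleanly.
build_prompt_data("Describe the product.", token_ids=[101, 102, 103])
build_prompt_data(
    [
        {"type": "text", "text": "Describe the product."},
        {"type": "image_url", "image_url": {"url": "https://example.com/img.png"}},
    ],
    token_ids=[101, 102, 103],
)
```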
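
A minimal sketch of fix 3, assuming a msgspec-style `ModelParams` and a Hub probe similar to `_check_tokenizer_exists` in `execute.py`; the local-path check mirrors the review suggestion above, and the values in the usage example are placeholders:

```python
from pathlib import Path

import msgspec


class ModelParams(msgspec.Struct):
    name: str                          # served model name; may be a container-local path
    tokenizer_name: str | None = None  # optional tokenizer source (HF repo ID or local path)


def _check_tokenizer_exists(repo_id: str) -> bool:
    # Stand-in for the Hub probe in execute.py (huggingface_hub.model_info).
    try:
        from huggingface_hub import model_info

        model_info(repo_id)
        return True
    except Exception:
        return False


def resolve_tokenizer_name(params: ModelParams) -> str | None:
    # Prefer the explicit override, then fall back to the served model name;
    # check the local filesystem before probing the Hub so local checkpoints
    # don't trigger a failing model_info() call.
    source = params.tokenizer_name or params.name
    if Path(source).exists() or _check_tokenizer_exists(source):
        return source
    return None


# Served from a local snapshot, tokenizer pulled from the upstream repo ID.
params = ModelParams(
    name="/root/.cache/huggingface/hub/models--...",  # placeholder local path
    tokenizer_name="Qwen/Qwen3-VL-235B-A22B-Instruct",
)
tokenizer_name = resolve_tokenizer_name(params)
```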
Validated all three changes interactively with msgspec/pydantic:
- `ChatCompletionResponse` decodes minimal vLLM responses (most
fields missing) and full responses with usage.
- `PromptData(text=…)` accepts both string prompts and multimodal
`list[dict]` prompts.
- `ModelParams` round-trips through YAML with and without
`tokenizer_name`, and both Q3VL example configs load cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes three issues encountered when running the Qwen3-VL-235B-A22B benchmark against vLLM/Dynamo with the Shopify multimodal dataset:
- `openai/types.py` — vLLM omits optional response fields (`content`, `refusal`, `finish_reason`, `usage`, `system_fingerprint`, etc.) that our `msgspec` structs required to be present. Added safe defaults so decoding succeeds regardless of what the server populates.
- `session.py` — The `ShopifyMultimodalFormatter` produces `prompt` as `list[dict]` (OpenAI vision content parts), which the HTTP adapter handles correctly. However, `PhaseIssuer`'s metrics tracking assigned it directly to `PromptData.text` (typed `str | None`), causing a decode failure. Added a type guard to coerce non-string prompts to `None`.
- `config/schema.py` + `execute.py` — When the model is served from a container-local path (e.g. an NVFP4 snapshot under `/root/.cache/huggingface/hub/...`), tokenizer loading fails because the path isn't a valid HF repo ID. Added `model_params.tokenizer_name` to decouple the tokenizer source from the serving model name.

Testing
Validated end-to-end on GB300 (offline scenario, 1000 samples): all samples completed successfully with no decode errors and full latency/throughput metrics generated.
Note: The example YAMLs assume the server is started with `--served-model-name` matching `model_params.name`. When serving from a local NVFP4 checkpoint (the typical NVIDIA submission setup), set `model_params.name` to the actual served path, `tokenizer_name` to the upstream HF repo ID (e.g. `Qwen/Qwen3-VL-235B-A22B-Instruct`), and `endpoint_config.endpoints` to the node IP.