
fix: OpenAI response decoding defaults, multimodal prompt handling, and tokenizer_name override#307

Open
BolinSNLHM wants to merge 1 commit into mlcommons:main from BolinSNLHM:fix/openai-decoding-and-tokenizer-override

Conversation

@BolinSNLHM
Collaborator

Summary

Fixes three issues encountered when running the Qwen3-VL-235B-A22B benchmark against vLLM/Dynamo with the Shopify multimodal dataset; a minimal sketch of each fix follows the list:

  1. openai/types.py — vLLM omits optional response fields (content, refusal, finish_reason, usage, system_fingerprint, etc.) that our msgspec structs required to be present. Added safe defaults so decoding succeeds regardless of what the server populates.

  2. session.py — The ShopifyMultimodalFormatter produces prompt as list[dict] (OpenAI vision content parts), which the HTTP adapter handles correctly. However, PhaseIssuer's metrics tracking assigned it directly to PromptData.text (typed str | None), causing a msgspec validation error. Added a type guard to coerce non-string prompts to None.

  3. config/schema.py + execute.py — When the model is served from a container-local path (e.g. an NVFP4 snapshot under /root/.cache/huggingface/hub/...), tokenizer loading fails because the path isn't a valid HF repo ID. Added model_params.tokenizer_name to decouple the tokenizer source from the serving model name.
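
To make item 1 concrete, here is a minimal sketch of the defaulting approach. The field set is trimmed and the struct layout is illustrative, not the repo's actual definitions in openai/types.py:

```python
import msgspec


class Usage(msgspec.Struct):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


class Message(msgspec.Struct):
    role: str = "assistant"
    content: str | None = None        # may be omitted on tool-only responses
    refusal: str | None = None


class Choice(msgspec.Struct):
    index: int = 0
    message: Message | None = None
    finish_reason: str | None = None  # absent on streamed chunks


class ChatCompletionResponse(msgspec.Struct):
    id: str = ""
    created: int = 0
    model: str = ""
    choices: list[Choice] = []
    usage: Usage | None = None             # vLLM omits usage on non-final SSE chunks
    system_fingerprint: str | None = None  # rarely populated by vLLM


# A minimal vLLM-style payload now decodes instead of raising a validation error.
minimal = b'{"id": "chatcmpl-1", "choices": [{"index": 0, "finish_reason": null}]}'
resp = msgspec.json.decode(minimal, type=ChatCompletionResponse)
```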
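
For item 2, a hedged sketch of the type guard; PromptData is a trimmed stand-in here and the helper function is purely illustrative:

```python
import msgspec


class PromptData(msgspec.Struct):
    # Trimmed stand-in for the repo's PromptData; only the fields relevant here.
    text: str | None = None
    token_ids: list[int] | None = None


def to_prompt_data(prompt: str | list[dict], token_ids: list[int]) -> PromptData:
    # Multimodal formatters emit OpenAI vision content parts (list[dict]); only
    # string prompts are stored as text, everything else falls through to None
    # and ISL is computed from the pre-tokenized token_ids.
    return PromptData(
        text=prompt if isinstance(prompt, str) else None,
        token_ids=token_ids,
    )


# Both prompt shapes now validate:
to_prompt_data("plain text prompt", [1, 2, 3])
to_prompt_data(
    [{"type": "text", "text": "Describe the product."},
     {"type": "image_url", "image_url": {"url": "https://example.com/img.png"}}],
    [1, 2, 3],
)
```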
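
For item 3, a sketch of the override field and the resolution order it enables; the real ModelParams schema has more fields and the path value is a placeholder:

```python
import msgspec


class ModelParams(msgspec.Struct):
    # Trimmed stand-in for the repo's ModelParams schema.
    name: str                          # served model name; may be a container-local path
    tokenizer_name: str | None = None  # optional override: HF repo ID or local tokenizer dir


params = ModelParams(
    name="/root/.cache/huggingface/hub/models--...",  # placeholder local path
    tokenizer_name="Qwen/Qwen3-VL-235B-A22B-Instruct",
)

# execute.py-style resolution: prefer the explicit override, fall back to the served name.
tokenizer_source = params.tokenizer_name or params.name
```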

Testing

Validated end-to-end on GB300 (offline scenario, 1000 samples): all samples completed successfully with no decode errors and full latency/throughput metrics generated.

Note: The example YAMLs assume the server is started with --served-model-name matching model_params.name. When serving from a local NVFP4 checkpoint (the typical NVIDIA submission setup), set model_params.name to the actual served path, tokenizer_name to the upstream HF repo ID (e.g. Qwen/Qwen3-VL-235B-A22B-Instruct), and endpoint_config.endpoints to the node IP.
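
An illustrative config fragment for that setup. The key names follow the note above; the path, repo ID, and endpoint value are placeholders, and the exact endpoint format may differ from the real example YAMLs:

```yaml
model_params:
  # Must match the --served-model-name the server was started with; for a local
  # NVFP4 checkpoint this is the container-local path.
  name: /root/.cache/huggingface/hub/models--.../snapshots/...
  # Upstream HF repo ID (or local dir) used only for tokenization and ISL/OSL metrics.
  tokenizer_name: Qwen/Qwen3-VL-235B-A22B-Instruct
endpoint_config:
  endpoints:
    - "http://<node-ip>:8000/v1"  # placeholder; point at the serving node
```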

@BolinSNLHM requested a review from a team on May 5, 2026 at 21:30
@github-actions

github-actions Bot commented May 5, 2026

MLCommons CLA bot:
Thank you very much for your submission; we really appreciate it. Before we can accept your contribution,
we ask that you sign the MLCommons CLA (Apache 2). Please submit your GitHub ID to our onboarding form to initiate
authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA.
If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact
support@mlcommons.org.
0 out of 1 committers have signed the MLCommons CLA.
@BoLin Sun
Bolin Sun does not appear to be a GitHub user. You need a GitHub account after you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting `recheck` in this Pull Request.


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a tokenizer_name configuration option to allow overriding the tokenizer source, which is particularly useful when the serving model uses a local path or a quantized checkpoint. It also enhances the robustness of OpenAI response parsing by adding default values for fields frequently omitted by inference servers and updates the load generator to handle multimodal prompt data. A suggestion was made to improve the tokenizer existence check by verifying local file system paths before querying the Hugging Face Hub, ensuring local tokenizers are correctly identified without triggering API errors.

Comment on lines +309 to +312
tokenizer_source = config.model_params.tokenizer_name or model_name
tokenizer_name = (
    tokenizer_source if _check_tokenizer_exists(tokenizer_source) else None
)


Severity: high

The current implementation of _check_tokenizer_exists relies on huggingface_hub.model_info, which is designed for repository IDs and will raise an exception (and log a warning) when provided with a local file system path. Since tokenizer_name is explicitly documented in the schema to support local paths, the logic should check for local existence first. This avoids incorrect warnings and ensures local tokenizers are correctly identified and passed to the metrics aggregator.

Suggested change
-   tokenizer_source = config.model_params.tokenizer_name or model_name
-   tokenizer_name = (
-       tokenizer_source if _check_tokenizer_exists(tokenizer_source) else None
-   )
+   tokenizer_source = config.model_params.tokenizer_name or model_name
+   tokenizer_name = (
+       tokenizer_source
+       if Path(tokenizer_source).exists() or _check_tokenizer_exists(tokenizer_source)
+       else None
+   )

Three independent issues blocked the Qwen3-VL-235B-A22B benchmark
against a real vLLM/Dynamo server. They are bundled here because
they all surface on the same flow (Shopify multimodal dataset →
OpenAI chat completions on vLLM):

1. **OpenAI response decoding too strict (`openai/types.py`).**
   The msgspec structs that decode `/v1/chat/completions` responses
   declared `content`, `refusal`, `finish_reason`, `created`,
   `model`, `usage`, and `system_fingerprint` as required fields
   with no defaults. vLLM (and the OpenAI spec itself) treats these
   as nullable/optional — vLLM omits `usage` on non-final SSE
   chunks, rarely sets `system_fingerprint`, and may omit `content`
   on tool-only responses. Decoding any such response raised a
   msgspec validation error before the adapter could return a
   QueryResult. Added safe defaults so the decoder accepts whatever
   the server legitimately sends.

2. **Multimodal prompt crashes metrics tracking (`session.py`).**
   `PhaseIssuer.issue()` constructs a `PromptData(text=..., ...)`
   for the ISSUED event. `PromptData.text` is typed `str | None`,
   but the Shopify multimodal formatter sets `prompt` to OpenAI
   vision content parts (`[{"type": "text", ...},
   {"type": "image_url", ...}]`). msgspec rejected the list and the
   issuer raised, killing the run before any sample was sent. The
   HTTP adapter handles list-shaped prompts correctly — only the
   metrics tracking path needed fixing. Type-guard the assignment
   so non-strings fall through to None and ISL is computed from
   `token_ids` (which the multimodal datasets already pre-tokenize).

3. **Tokenizer cannot be loaded from container-local model paths
   (`schema.py`, `execute.py`).** When the model is served from a
   local cache path (e.g. `/root/.cache/huggingface/hub/models--…`)
   instead of a HF repo ID, `_check_tokenizer_exists(model_name)`
   fails its `model_info()` probe and the aggregator runs without
   ISL/OSL/TPOT metrics. Added a `tokenizer_name` override field
   to `ModelParams` so the tokenizer source can be set explicitly
   while the served model name remains a path. The example YAML
   configs document the new field with a commented hint.

Validated all three changes interactively with msgspec/pydantic:
- `ChatCompletionResponse` decodes minimal vLLM responses (most
  fields missing) and full responses with usage.
- `PromptData(text=…)` accepts both string prompts and multimodal
  `list[dict]` prompts.
- `ModelParams` round-trips through YAML with and without
  `tokenizer_name`, and both Q3VL example configs load cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BolinSNLHM force-pushed the fix/openai-decoding-and-tokenizer-override branch from 5d35e99 to a33690b on May 5, 2026 at 21:35