From 7335f778d63b268ff5caf483b703f916c9b36dba Mon Sep 17 00:00:00 2001
From: Mingyuan Ma
Date: Thu, 21 May 2026 10:59:08 -0700
Subject: [PATCH 1/5] Update for v6.1

---
 multimodal/qwen3-vl/README.md | 135 ++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index eb032372c2..bf7ab8595c 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -1,3 +1,138 @@
+# Reference Implementation for the Qwen3-VL (Q3VL) Benchmark
+
+For the MLPerf Inference v6.1 round, benchmarking uses a decoupled load generator client ([endpoints](https://github.com/mlcommons/endpoints#)), a model server (for example, [vLLM](https://github.com/vllm-project/vllm)), and the dataset/configuration described below.
+
+## Quick Start
+
+### Start the model server
+
+The model server can run in its own environment (for example, a Docker container). Start vLLM as you would for any standard OpenAI-compatible deployment:
+
+```bash
+export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct
+export HF_TOKEN= # Optional for public models
+export HF_HOME=
+
+docker run --runtime nvidia --gpus all \
+  -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  -v ${HF_HOME}:/root/.cache/huggingface \
+  vllm/vllm-openai:latest \
+  --model ${MODEL_NAME} \
+  --tensor-parallel-size 4 \
+  --max-model-len=32768 \
+  --async-scheduling \
+  --limit-mm-per-prompt.video 0 \
+  --no-enable-prefix-caching
+```
+
+### Set up endpoints
+
+After the server is listening for requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node—or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**:
+
+```bash
+git clone https://github.com/mlcommons/endpoints.git
+cd endpoints
+uv sync
+```
+
+or **pip** in a virtual environment:
+
+```bash
+python3.12 -m venv venv && source venv/bin/activate
+pip install .
+```
+
+### Configure the benchmark
+
+Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example). Set the endpoint URL in the YAML file to match your server address and port:
+
+```yaml
+endpoint_config:
+  endpoints:
+    - "http://localhost:8000"
+```
+
+### Run the benchmark
+
+Launch offline or server (online) scenarios:
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/offline_qwen3_vl_235b_a22b_shopify.yaml
+```
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/online_qwen3_vl_235b_a22b_shopify.yaml
+```
+
+Launch the interactive scenario:
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/interactive_qwen3_vl_235b_a22b_shopify_8k.yaml
+```
+
+## Compliance test
+
+Each example benchmark config includes an accuracy test that queries the same server backend. You do not need a separate accuracy-mode run. Reported accuracy must meet the minimum thresholds in [Reference Implementation Specification](#reference-implementation-specification) below.
+
+## Reference Implementation Specification
+
+### v6.1 round
+
+- **vLLM version:** [a65093c](https://github.com/vllm-project/vllm/tree/a65093c1a39a8ddd8455365128ecbe259350e22c)
+- **endpoints version:** [381d13bbd27d6d52306813a51dc4e44295222d7e](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e)
+- **Model:**
+  - [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct)
+  - Commit SHA: [710c13861be6c466e66de3f484069440b8f31389](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct/tree/710c13861be6c466e66de3f484069440b8f31389)
+- **Dataset:**
+  - **Offline/Server scenario:**
+    - [Shopify/product-catalogue](https://huggingface.co/datasets/Shopify/product-catalogue)
+    - Commit SHA: [d5c517c509f5aca99053897ef1de797d6d7e5aa5](https://huggingface.co/datasets/Shopify/product-catalogue/tree/d5c517c509f5aca99053897ef1de797d6d7e5aa5)
+    - Both the `train` and `test` splits are used, concatenated in that order.
+    - Total number of samples: `48289`.
+  - **Interactive scenario:**
+    - [nvidia/Shopify-product-catalogue-8k](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k)
+    - Commit SHA: [2bc8c6c4b6ebd27b880b0cba519cb45d09867045](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k/commit/2bc8c6c4b6ebd27b880b0cba519cb45d09867045)
+    - Total number of samples: `8000`.
+- **Guided decoding:** not used.
+- **Sampling parameters:**
+  - Frequency penalty: `None` (mathematically equivalent to `0.0`).
+  - Presence penalty: `None` (mathematically equivalent to `0.0`).
+  - Temperature: `None` (mathematically equivalent to `1.0`).
+  - Top-P: `None` (mathematically equivalent to `1.0`).
+  - Top-K: `None` (mathematically equivalent to `0`).
+  - Min-P: `None` (mathematically equivalent to `0.0`).
+  - Repetition penalty: `None` (mathematically equivalent to `1.0`).
+- **Constraints:**
+  - **Model quality:**
+    - **Offline/Server scenario:**
+      - Category Hierarchical F1 score ≥ `0.7824`. This is the 99% recovery of `0.7903037`, the mean category hierarchical F1 score across 10 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 10 runs is `0.0002250412555`.
+    - **Interactive scenario:**
+      - Category Hierarchical F1 score ≥ `0.7799`. This is the 99% recovery of `0.7878`, the mean category hierarchical F1 score across 5 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 5 runs is `0.000535724`.
+  - **Server scenario:**
+    - Target latency is the constraint (not Time to First Token (TTFT) or Time per Output Token (TPOT)).
+    - Target latency percentile: `0.99`.
+    - Target latency ≤ 12 seconds.
+    - Performance sample count: `48289`.
+  - **Offline scenario:**
+    - Number of samples in the query ≥ `48289` (every sample in the dataset is sent to the VLM endpoint at least once).
+    - Performance sample count: `48289`.
+  - **Interactive scenario:**
+    - Target latency is the constraint (not TTFT or TPOT).
+    - Target latency percentile: `0.99`.
+    - Target latency ≤ 1.5 seconds.
+    - Performance sample count: `8000`.
+  - Testing duration ≥ 10 minutes.
+  - Sample concatenation permutation is enabled.
+  - You must explicitly set `--no-enable-prefix-caching` for vLLM.
+
+
+> **MLPerf Inference v6.0 round only.** The following section and the Qwen3-VL reference under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round. They are **deprecated** for newer rounds; use the current MLPerf Inference docs and repository layout for later versions.
+
 # Reference Implementation for the Qwen3-VL (Q3VL) Benchmark
 
 ## Automated command to run the benchmark via MLCFlow

From 57b820719b7bf18a2279e6f9ecbd1123f9fe430a Mon Sep 17 00:00:00 2001
From: Mingyuan Ma
Date: Thu, 21 May 2026 11:04:05 -0700
Subject: [PATCH 2/5] update

---
 multimodal/qwen3-vl/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index bf7ab8595c..9a4b07d4dd 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -6,7 +6,7 @@ For the MLPerf Inference v6.1 round, benchmarking uses a decoupled load generato
 
 ### Start the model server
 
-The model server can run in its own environment (for example, a Docker container). Start vLLM as you would for any standard OpenAI-compatible deployment:
+The model server can run in its own environment. Using vLLM as example, start vLLM as you would for any standard OpenAI-compatible deployment:
 
 ```bash
 export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct
@@ -24,12 +24,12 @@ docker run --runtime nvidia --gpus all \
   --max-model-len=32768 \
   --async-scheduling \
   --limit-mm-per-prompt.video 0 \
-  --no-enable-prefix-caching
+  --no-enable-prefix-caching ## Must have this flag as the rule forbids prefix caching
 ```
 
 ### Set up endpoints
 
-After the server is listening for requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node—or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**:
+After the server is ready to listen for requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node—or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**:
 
 ```bash
 git clone https://github.com/mlcommons/endpoints.git
@@ -56,7 +56,7 @@ endpoint_config:
 
 ### Run the benchmark
 
-Launch offline or server (online) scenarios:
+Launch offline or server scenarios:
 
 ```bash
 uv run inference-endpoint benchmark from-config \

From 001917fcca4941f9e04f18c8acc1b23fba3eea98 Mon Sep 17 00:00:00 2001
From: Mingyuan Ma <111467530+Victor49152@users.noreply.github.com>
Date: Fri, 22 May 2026 18:24:13 -0700
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: Shang Wang
---
 multimodal/qwen3-vl/README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index 9a4b07d4dd..f10e36edd7 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -75,7 +75,7 @@ uv run inference-endpoint benchmark from-config \
   -c examples/08_Qwen3-VL-235B-A22B_Example/interactive_qwen3_vl_235b_a22b_shopify_8k.yaml
 ```
 
-## Compliance test
+## Compliance Test
 
 Each example benchmark config includes an accuracy test that queries the same server backend. You do not need a separate accuracy-mode run. Reported accuracy must meet the minimum thresholds in [Reference Implementation Specification](#reference-implementation-specification) below.
 
@@ -131,7 +131,9 @@ Each example benchmark config includes an accuracy test that queries the same se
   - You must explicitly set `--no-enable-prefix-caching` for vLLM.
 
 
-> **MLPerf Inference v6.0 round only.** The following section and the Qwen3-VL reference under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round. They are **deprecated** for newer rounds; use the current MLPerf Inference docs and repository layout for later versions.
+> [!CAUTION]
+> **MLPerf Inference v6.0 round only.**
+> The following sections and the Qwen3-VL reference implementation under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round, and they are **deprecated** for the newer rounds. Please use the above documentation along with [mlcommons/endpoints](https://github.com/mlcommons/endpoints) for the newer rounds.
 
 # Reference Implementation for the Qwen3-VL (Q3VL) Benchmark
 

From 12c0e9ba8d30dea17f2081d9bb597160815914eb Mon Sep 17 00:00:00 2001
From: Mingyuan Ma
Date: Fri, 22 May 2026 18:50:12 -0700
Subject: [PATCH 4/5] Add details on endpoints config

---
 multimodal/qwen3-vl/README.md | 48 ++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index f10e36edd7..e76c07166e 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -46,7 +46,18 @@ pip install .
 
 ### Configure the benchmark
 
-Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example). Set the endpoint URL in the YAML file to match your server address and port:
+Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example).
+
+#### Fields that the submitter **should** update to match their server status:
+
+- Served model name:
+
+```yaml
+model_params:
+  name: "Qwen/Qwen3-VL-235B-A22B-Instruct"
+```
+
+- Endpoints url and port:
 
 ```yaml
 endpoint_config:
@@ -54,18 +65,49 @@ endpoint_config:
     - "http://localhost:8000"
 ```
 
+#### Fields that the submitter **may** customize for performance tuning:
+
+- Target_qps (for server and interactive mode):
+
+```yaml
+  load_pattern:
+    type: "poisson"
+    target_qps: 6.5
+```
+
+- Client worker related settings:
+
+```yaml
+client:
+  num_workers: 5
+  transport:
+    type: zmq
+    recv_buffer_size: 16777216
+    send_buffer_size: 16777216
+  max_connections: 1000
+  worker_initialization_timeout: 120
+```
+
+#### Fileds that the submitter **MUST NOT** change for valid results:
+
+- Sampling parameters that specified in scection [Reference Implementation Specification](#reference-implementation-specification)
+
+- Datasets (Neither performance or accuracy dataset)
+
 ### Run the benchmark
 
-Launch offline or server scenarios:
+Launch the offline scenario:
 
 ```bash
 uv run inference-endpoint benchmark from-config \
   -c examples/08_Qwen3-VL-235B-A22B_Example/offline_qwen3_vl_235b_a22b_shopify.yaml
 ```
 
+Launch the server scenario:
+
 ```bash
 uv run inference-endpoint benchmark from-config \
-  -c examples/08_Qwen3-VL-235B-A22B_Example/online_qwen3_vl_235b_a22b_shopify.yaml
+  -c examples/08_Qwen3-VL-235B-A22B_Example/server_qwen3_vl_235b_a22b_shopify.yaml
 ```
 
 Launch the interactive scenario:

From b78098363dc19f260360662b5821846668f847b3 Mon Sep 17 00:00:00 2001
From: Mingyuan Ma <111467530+Victor49152@users.noreply.github.com>
Date: Mon, 25 May 2026 19:24:59 -0700
Subject: [PATCH 5/5] Apply suggestions from code review

Co-authored-by: Shang Wang
---
 multimodal/qwen3-vl/README.md | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index e76c07166e..96438cce43 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -50,14 +50,14 @@ Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](ht
 
 #### Fields that the submitter **should** update to match their server status:
 
-- Served model name:
+- Served model name (to match the actual, probably quantized, model checkpoint):
 
 ```yaml
 model_params:
   name: "Qwen/Qwen3-VL-235B-A22B-Instruct"
 ```
 
-- Endpoints url and port:
+- The URL and port number of the endpoint:
 
 ```yaml
 endpoint_config:
@@ -67,13 +67,8 @@ endpoint_config:
 
 #### Fields that the submitter **may** customize for performance tuning:
 
-- Target_qps (for server and interactive mode):
+- Target QPS (for the server and interactive scenarios):
 
-```yaml
-  load_pattern:
-    type: "poisson"
-    target_qps: 6.5
-```
 
 - Client worker related settings:
 
@@ -90,9 +85,9 @@ client:
 
 #### Fileds that the submitter **MUST NOT** change for valid results:
 
-- Sampling parameters that specified in scection [Reference Implementation Specification](#reference-implementation-specification)
+- Sampling parameters specified in the section [Reference Implementation Specification](#reference-implementation-specification)
 
-- Datasets (Neither performance or accuracy dataset)
+- Datasets (neither for performance evaluation nor for accuracy evaluation)
 
 ### Run the benchmark
 
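
The README added in this series describes each accuracy threshold as the "99% recovery" of a mean category hierarchical F1 score measured on the BF16 model. The arithmetic can be checked with a short Python sketch. The helper name `recovery_threshold` and the choice of rounding to four decimals are assumptions made here for illustration; they are not part of the patch or of the endpoints tooling.

```python
# Sanity check of the "99% recovery" thresholds quoted in the README.
# NOTE: `recovery_threshold` and the 4-decimal rounding are illustrative
# assumptions, not something defined by the patch series.

def recovery_threshold(mean_f1: float, recovery: float = 0.99, digits: int = 4) -> float:
    """Return `recovery` times the reference mean, rounded to `digits` decimals."""
    return round(mean_f1 * recovery, digits)


# Reference means taken from the README text in PATCH 1/5.
offline_server = recovery_threshold(0.7903037)  # mean over 10 BF16 runs
interactive = recovery_threshold(0.7878)        # mean over 5 BF16 runs

print(f"Offline/Server threshold: {offline_server:.4f}")
print(f"Interactive threshold:    {interactive:.4f}")
```

Under these assumptions the two results agree with the `0.7824` and `0.7799` minimums stated in the specification.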