From 7335f778d63b268ff5caf483b703f916c9b36dba Mon Sep 17 00:00:00 2001
From: Mingyuan Ma <mingyuanm@nvidia.com>
Date: Thu, 21 May 2026 10:59:08 -0700
Subject: [PATCH 1/5] Update for v6.1

---
 multimodal/qwen3-vl/README.md | 135 ++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)
diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index eb032372c2..bf7ab8595c 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -1,3 +1,138 @@
+# Reference Implementation for the Qwen3-VL (Q3VL) Benchmark
+
+For the MLPerf Inference v6.1 round, benchmarking uses a decoupled load generator client ([endpoints](https://github.com/mlcommons/endpoints#)), a model server (for example, [vLLM](https://github.com/vllm-project/vllm)), and the dataset/configuration described below.
+
+## Quick Start
+
+### Start the model server
+
+The model server can run in its own environment (for example, a Docker container). Start vLLM as you would for any standard OpenAI-compatible deployment:
+
+```bash
+export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct
+export HF_TOKEN=<your Hugging Face token>  # Optional for public models
+export HF_HOME=<path to Hugging Face cache, e.g. ~/.cache/huggingface>
+
+docker run --runtime nvidia --gpus all \
+  -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  -v ${HF_HOME}:/root/.cache/huggingface \
+  vllm/vllm-openai:latest \
+  --model ${MODEL_NAME} \
+  --tensor-parallel-size 4 \
+  --max-model-len=32768 \
+  --async-scheduling \
+  --limit-mm-per-prompt.video 0 \
+  --no-enable-prefix-caching
+```
+
+### Set up endpoints
+
+After the server is listening for requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node—or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**:
+
+```bash
+git clone https://github.com/mlcommons/endpoints.git
+cd endpoints
+uv sync
+```
+
+or **pip** in a virtual environment:
+
+```bash
+python3.12 -m venv venv && source venv/bin/activate
+pip install .
+```
+
+### Configure the benchmark
+
+Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example). Set the endpoint URL in the YAML file to match your server address and port:
+
+```yaml
+endpoint_config:
+  endpoints:
+    - "http://localhost:8000"
+```
+
+### Run the benchmark
+
+Launch offline or server (online) scenarios:
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/offline_qwen3_vl_235b_a22b_shopify.yaml
+```
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/online_qwen3_vl_235b_a22b_shopify.yaml
+```
+
+Launch the interactive scenario:
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/interactive_qwen3_vl_235b_a22b_shopify_8k.yaml
+```
+
+## Compliance test
+
+Each example benchmark config includes an accuracy test that queries the same server backend. You do not need a separate accuracy-mode run. Reported accuracy must meet the minimum thresholds in [Reference Implementation Specification](#reference-implementation-specification) below.
+
+## Reference Implementation Specification
+
+### v6.1 round
+
+- **vLLM version:** [a65093c](https://github.com/vllm-project/vllm/tree/a65093c1a39a8ddd8455365128ecbe259350e22c)
+- **endpoints version:** [381d13bbd27d6d52306813a51dc4e44295222d7e](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e)
+- **Model:**
+  - [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct)
+  - Commit SHA: [710c13861be6c466e66de3f484069440b8f31389](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct/tree/710c13861be6c466e66de3f484069440b8f31389)
+- **Dataset:**
+  - **Offline/Server scenario:**
+    - [Shopify/product-catalogue](https://huggingface.co/datasets/Shopify/product-catalogue)
+    - Commit SHA: [d5c517c509f5aca99053897ef1de797d6d7e5aa5](https://huggingface.co/datasets/Shopify/product-catalogue/tree/d5c517c509f5aca99053897ef1de797d6d7e5aa5)
+    - Both the `train` and `test` splits are used, concatenated in that order.
+    - Total number of samples: `48289`.
+  - **Interactive scenario:**
+    - [nvidia/Shopify-product-catalogue-8k](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k)
+    - Commit SHA: [2bc8c6c4b6ebd27b880b0cba519cb45d09867045](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k/commit/2bc8c6c4b6ebd27b880b0cba519cb45d09867045)
+    - Total number of samples: `8000`.
+- **Guided decoding:** not used.
+- **Sampling parameters:**
+  - Frequency penalty: `None` (mathematically equivalent to `0.0`).
+  - Presence penalty: `None` (mathematically equivalent to `0.0`).
+  - Temperature: `None` (mathematically equivalent to `1.0`).
+  - Top-P: `None` (mathematically equivalent to `1.0`).
+  - Top-K: `None` (mathematically equivalent to `0`).
+  - Min-P: `None` (mathematically equivalent to `0.0`).
+  - Repetition penalty: `None` (mathematically equivalent to `1.0`).
+- **Constraints:**
+  - **Model quality:**
+    - **Offline/Server scenario:**
+      - Category Hierarchical F1 score ≥ `0.7824`. This is the 99% recovery of `0.7903037`, the mean category hierarchical F1 score across 10 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 10 runs is `0.0002250412555`.
+    - **Interactive scenario:**
+      - Category Hierarchical F1 score ≥ `0.7799`. This is the 99% recovery of `0.7878`, the mean category hierarchical F1 score across 5 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 5 runs is `0.000535724`.
+  - **Server scenario:**
+    - Target latency is the constraint (not Time to First Token (TTFT) or Time per Output Token (TPOT)).
+    - Target latency percentile: `0.99`.
+    - Target latency ≤ 12 seconds.
+    - Performance sample count: `48289`.
+  - **Offline scenario:**
+    - Number of samples in the query ≥ `48289` (every sample in the dataset is sent to the VLM endpoint at least once).
+    - Performance sample count: `48289`.
+  - **Interactive scenario:**
+    - Target latency is the constraint (not TTFT or TPOT).
+    - Target latency percentile: `0.99`.
+    - Target latency ≤ 1.5 seconds.
+    - Performance sample count: `8000`.
+  - Testing duration ≥ 10 minutes.
+  - Sample concatenation permutation is enabled.
+  - You must explicitly set `--no-enable-prefix-caching` for vLLM.
+
+
+> **MLPerf Inference v6.0 round only.** The following section and the Qwen3-VL reference under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round. They are **deprecated** for newer rounds; use the current MLPerf Inference docs and repository layout for later versions.
+
 # Reference Implementation for the Qwen3-VL (Q3VL) Benchmark 
 
 ## Automated command to run the benchmark via MLCFlow

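Note on the model-quality constraints added in this patch: the two minimum F1 scores can be reproduced from the reported BF16 means as a quick sanity check. The sketch below is only an illustration. The helper name `recovery_threshold` is hypothetical, and rounding to four decimal places is an assumption inferred from the published values, not something the README states.

```python
from decimal import Decimal, ROUND_HALF_UP


def recovery_threshold(mean_f1: str, recovery: str = "0.99") -> Decimal:
    """Return recovery * mean_f1, rounded to four decimal places.

    Inputs are strings so that Decimal parses them exactly. The four-place
    rounding is an assumption based on the thresholds quoted in the README.
    """
    product = Decimal(mean_f1) * Decimal(recovery)
    return product.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


# Mean category hierarchical F1 scores quoted in the specification.
offline_server = recovery_threshold("0.7903037")  # Offline/Server scenarios
interactive = recovery_threshold("0.7878")        # Interactive scenario

print(f"Offline/Server minimum F1: {offline_server}")
print(f"Interactive minimum F1:    {interactive}")
```

A submission's measured score would then be compared against these values with `>=`, matching the constraints listed in the specification.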
From 57b820719b7bf18a2279e6f9ecbd1123f9fe430a Mon Sep 17 00:00:00 2001
From: Mingyuan Ma <mingyuanm@nvidia.com>
Date: Thu, 21 May 2026 11:04:05 -0700
Subject: [PATCH 2/5] update

---
 multimodal/qwen3-vl/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index bf7ab8595c..9a4b07d4dd 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -6,7 +6,7 @@ For the MLPerf Inference v6.1 round, benchmarking uses a decoupled load generato
 
 ### Start the model server
 
-The model server can run in its own environment (for example, a Docker container). Start vLLM as you would for any standard OpenAI-compatible deployment:
+The model server can run in its own environment. Using vLLM as an example, start it as you would for any standard OpenAI-compatible deployment:
 
 ```bash
 export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct
@@ -24,12 +24,12 @@ docker run --runtime nvidia --gpus all \
   --max-model-len=32768 \
   --async-scheduling \
   --limit-mm-per-prompt.video 0 \
-  --no-enable-prefix-caching
+  --no-enable-prefix-caching  # Required: the benchmark rules forbid prefix caching
 ```
 
 ### Set up endpoints
 
-After the server is listening for requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node—or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**:
+After the server is ready to accept requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node, or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**:
 
 ```bash
 git clone https://github.com/mlcommons/endpoints.git
@@ -56,7 +56,7 @@ endpoint_config:
 
 ### Run the benchmark
 
-Launch offline or server (online) scenarios:
+Launch offline or server scenarios:
 
 ```bash
 uv run inference-endpoint benchmark from-config \

From 001917fcca4941f9e04f18c8acc1b23fba3eea98 Mon Sep 17 00:00:00 2001
From: Mingyuan Ma <111467530+Victor49152@users.noreply.github.com>
Date: Fri, 22 May 2026 18:24:13 -0700
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: Shang Wang <shangw@nvidia.com>
---
 multimodal/qwen3-vl/README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index 9a4b07d4dd..f10e36edd7 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -75,7 +75,7 @@ uv run inference-endpoint benchmark from-config \
   -c examples/08_Qwen3-VL-235B-A22B_Example/interactive_qwen3_vl_235b_a22b_shopify_8k.yaml
 ```
 
-## Compliance test
+## Compliance Test
 
 Each example benchmark config includes an accuracy test that queries the same server backend. You do not need a separate accuracy-mode run. Reported accuracy must meet the minimum thresholds in [Reference Implementation Specification](#reference-implementation-specification) below.
 
@@ -131,7 +131,9 @@ Each example benchmark config includes an accuracy test that queries the same se
   - You must explicitly set `--no-enable-prefix-caching` for vLLM.
 
 
-> **MLPerf Inference v6.0 round only.** The following section and the Qwen3-VL reference under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round. They are **deprecated** for newer rounds; use the current MLPerf Inference docs and repository layout for later versions.
+> [!CAUTION]
+> **MLPerf Inference v6.0 round only.**
+> The following sections and the Qwen3-VL reference implementation under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round, and they are **deprecated** for the newer rounds. Please use the above documentation along with [mlcommons/endpoints](https://github.com/mlcommons/endpoints) for the newer rounds.
 
 # Reference Implementation for the Qwen3-VL (Q3VL) Benchmark 
 

From 12c0e9ba8d30dea17f2081d9bb597160815914eb Mon Sep 17 00:00:00 2001
From: Mingyuan Ma <mingyuanm@nvidia.com>
Date: Fri, 22 May 2026 18:50:12 -0700
Subject: [PATCH 4/5] Add details on endpoints config

---
 multimodal/qwen3-vl/README.md | 48 ++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index f10e36edd7..e76c07166e 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -46,7 +46,18 @@ pip install .
 
 ### Configure the benchmark
 
-Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example). Set the endpoint URL in the YAML file to match your server address and port:
+Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example). 
+
+#### Fields that the submitter **should** update to match their server status:
+
+- Served model name:
+
+```yaml
+model_params:
+  name: "Qwen/Qwen3-VL-235B-A22B-Instruct"
+```
+
+- Endpoints url and port:
 
 ```yaml
 endpoint_config:
@@ -54,18 +65,49 @@ endpoint_config:
     - "http://localhost:8000"
 ```
 
+#### Fields that the submitter **may** customize for performance tuning:
+
+- Target_qps (for server and interactive mode):
+
+```yaml
+  load_pattern:
+    type: "poisson"
+    target_qps: 6.5
+```
+
+- Client worker related settings:
+
+```yaml
+client:
+    num_workers: 5
+    transport:
+      type: zmq
+      recv_buffer_size: 16777216
+      send_buffer_size: 16777216
+    max_connections: 1000
+    worker_initialization_timeout: 120
+```
+
+#### Fields that the submitter **MUST NOT** change for valid results:
+
+- Sampling parameters that specified in scection [Reference Implementation Specification](#reference-implementation-specification)
+
+- Datasets (Neither performance or accuracy dataset)
+
 ### Run the benchmark
 
-Launch offline or server scenarios:
+Launch the offline scenario:
 
 ```bash
 uv run inference-endpoint benchmark from-config \
   -c examples/08_Qwen3-VL-235B-A22B_Example/offline_qwen3_vl_235b_a22b_shopify.yaml
 ```
 
+Launch the server scenario:
+
 ```bash
 uv run inference-endpoint benchmark from-config \
-  -c examples/08_Qwen3-VL-235B-A22B_Example/online_qwen3_vl_235b_a22b_shopify.yaml
+  -c examples/08_Qwen3-VL-235B-A22B_Example/server_qwen3_vl_235b_a22b_shopify.yaml
 ```
 
 Launch the interactive scenario:

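Note on the `load_pattern` setting introduced in this patch: with `type: "poisson"`, `target_qps` is the mean arrival rate, so the gaps between consecutive queries are exponentially distributed with mean `1 / target_qps`. The sketch below only illustrates that relationship. It is not the scheduler used by endpoints, and the function name `poisson_arrival_times` and the fixed seed are hypothetical choices made for the example.

```python
import random


def poisson_arrival_times(target_qps: float, num_queries: int, seed: int = 0) -> list[float]:
    """Return cumulative send times (in seconds) for a Poisson arrival process.

    Inter-arrival gaps are drawn from an exponential distribution whose
    rate is target_qps, so the mean gap is 1 / target_qps seconds.
    """
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    now = 0.0
    times = []
    for _ in range(num_queries):
        now += rng.expovariate(target_qps)
        times.append(now)
    return times


# 6.5 is the example value from the config snippet in this patch.
times = poisson_arrival_times(target_qps=6.5, num_queries=10_000)
observed_qps = len(times) / times[-1]
print(f"observed rate over {len(times)} queries: {observed_qps:.2f} queries/s")
```

Over a long enough run the observed rate converges to `target_qps`, while individual gaps vary, which is why the server and interactive scenarios constrain a latency percentile rather than a single request.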
From b78098363dc19f260360662b5821846668f847b3 Mon Sep 17 00:00:00 2001
From: Mingyuan Ma <111467530+Victor49152@users.noreply.github.com>
Date: Mon, 25 May 2026 19:24:59 -0700
Subject: [PATCH 5/5] Apply suggestions from code review

Co-authored-by: Shang Wang <shangw@nvidia.com>
---
 multimodal/qwen3-vl/README.md | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md
index e76c07166e..96438cce43 100644
--- a/multimodal/qwen3-vl/README.md
+++ b/multimodal/qwen3-vl/README.md
@@ -50,14 +50,14 @@ Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](ht
 
 #### Fields that the submitter **should** update to match their server status:
 
-- Served model name:
+- Served model name (to match the checkpoint actually being served, which may be quantized):
 
 ```yaml
 model_params:
   name: "Qwen/Qwen3-VL-235B-A22B-Instruct"
 ```
 
-- Endpoints url and port:
+- The URL and port number of the endpoint:
 
 ```yaml
 endpoint_config:
@@ -67,13 +67,8 @@ endpoint_config:
 
 #### Fields that the submitter **may** customize for performance tuning:
 
-- Target_qps (for server and interactive mode):
+- Target QPS, set as `target_qps` under `load_pattern` (for the server and interactive scenarios).
 
-```yaml
-  load_pattern:
-    type: "poisson"
-    target_qps: 6.5
-```
 
 - Client worker related settings:
 
@@ -90,9 +85,9 @@ client:
 
 #### Fields that the submitter **MUST NOT** change for valid results:
 
-- Sampling parameters that specified in scection [Reference Implementation Specification](#reference-implementation-specification)
+- Sampling parameters specified in the section [Reference Implementation Specification](#reference-implementation-specification)
 
-- Datasets (Neither performance or accuracy dataset)
+- Datasets (for neither performance evaluation nor accuracy evaluation)
 
 ### Run the benchmark