mlcommons · Victor49152 · May 21, 2026 · May 21, 2026 · May 23, 2026 · May 23, 2026
@@ -1,3 +1,177 @@
+# Reference Implementation for the Qwen3-VL (Q3VL) Benchmark
+
+For the MLPerf Inference v6.1 round, benchmarking uses a decoupled load generator client ([endpoints](https://github.com/mlcommons/endpoints#)), a model server (for example, [vLLM](https://github.com/vllm-project/vllm)), and the dataset/configuration described below.
+
+## Quick Start
+
+### Start the model server
+
+The model server can run in its own environment. Using vLLM as example, start vLLM as you would for any standard OpenAI-compatible deployment:
+
+```bash
+export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct
+export HF_TOKEN=<your Hugging Face token>  # Optional for public models
+export HF_HOME=<path to Hugging Face cache, e.g. ~/.cache/huggingface>
+
+docker run --runtime nvidia --gpus all \
+  -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  -v ${HF_HOME}:/root/.cache/huggingface \
+  vllm/vllm-openai:latest \
+  --model ${MODEL_NAME} \
+  --tensor-parallel-size 4 \
+  --max-model-len=32768 \
+  --async-scheduling \
+  --limit-mm-per-prompt.video 0 \
+  --no-enable-prefix-caching  ## Must have this flag as the rule forbids prefix caching
+```
+
+### Set up endpoints
+
+After the server is ready to listen for requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node—or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**:
+
+```bash
+git clone https://github.com/mlcommons/endpoints.git
+cd endpoints
+uv sync
+```
+
+or **pip** in a virtual environment:
+
+```bash
+python3.12 -m venv venv && source venv/bin/activate
+pip install .
+```
+
+### Configure the benchmark
+
+Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example). 
+
+#### Fields that the submitter **should** update to match their server status:
+
+- Served model name (to match the actual, probably quantized, model checkpoint):
+
+```yaml
+model_params:
+  name: "Qwen/Qwen3-VL-235B-A22B-Instruct"
+```
+
+- The URL and port number of the endpoint:
+
+```yaml
+endpoint_config:
+  endpoints:
+    - "http://localhost:8000"
+```
+
+#### Fields that the submitter **may** customize for performance tuning:
+
+- Target QPS (for the server and interactive scenarios):
+
+
+- Client worker related settings:
+
+```yaml
+client:
+    num_workers: 5
+    transport:
+      type: zmq
+      recv_buffer_size: 16777216
+      send_buffer_size: 16777216
+    max_connections: 1000
+    worker_initialization_timeout: 120
+```
+
+#### Fileds that the submitter **MUST NOT** change for valid results:
+
+- Sampling parameters specified in the section [Reference Implementation Specification](#reference-implementation-specification)
+
+- Datasets (neither for performance evaluation nor for accuracy evaluation)
+
+### Run the benchmark
+
+Launch the offline scenario:
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/offline_qwen3_vl_235b_a22b_shopify.yaml
+```
+
+Launch the server scenario:
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/server_qwen3_vl_235b_a22b_shopify.yaml
+```
+
+Launch the interactive scenario:
+
+```bash
+uv run inference-endpoint benchmark from-config \
+  -c examples/08_Qwen3-VL-235B-A22B_Example/interactive_qwen3_vl_235b_a22b_shopify_8k.yaml
+```
+
+## Compliance Test
+
+Each example benchmark config includes an accuracy test that queries the same server backend. You do not need a separate accuracy-mode run. Reported accuracy must meet the minimum thresholds in [Reference Implementation Specification](#reference-implementation-specification) below.
+
+## Reference Implementation Specification
+
+### v6.1 round
+
+- **vLLM version:** [a65093c](https://github.com/vllm-project/vllm/tree/a65093c1a39a8ddd8455365128ecbe259350e22c)
+- **endpoints version:** [381d13bbd27d6d52306813a51dc4e44295222d7e](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e)
+- **Model:**
+  - [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct)
+  - Commit SHA: [710c13861be6c466e66de3f484069440b8f31389](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct/tree/710c13861be6c466e66de3f484069440b8f31389)
+- **Dataset:**
+  - **Offline/Server scenario:**
+    - [Shopify/product-catalogue](https://huggingface.co/datasets/Shopify/product-catalogue)
+    - Commit SHA: [d5c517c509f5aca99053897ef1de797d6d7e5aa5](https://huggingface.co/datasets/Shopify/product-catalogue/tree/d5c517c509f5aca99053897ef1de797d6d7e5aa5)
+    - Both the `train` and `test` splits are used, concatenated in that order.
+    - Total number of samples: `48289`.
+  - **Interactive scenario:**
+    - [nvidia/Shopify-product-catalogue-8k](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k)
+    - Commit SHA: [2bc8c6c4b6ebd27b880b0cba519cb45d09867045](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k/commit/2bc8c6c4b6ebd27b880b0cba519cb45d09867045)
+    - Total number of samples: `8000`.
+- **Guided decoding:** not used.
+- **Sampling parameters:**
+  - Frequency penalty: `None` (mathematically equivalent to `0.0`).
+  - Presence penalty: `None` (mathematically equivalent to `0.0`).
+  - Temperature: `None` (mathematically equivalent to `1.0`).
+  - Top-P: `None` (mathematically equivalent to `1.0`).
+  - Top-K: `None` (mathematically equivalent to `0`).
+  - Min-P: `None` (mathematically equivalent to `0.0`).
+  - Repetition penalty: `None` (mathematically equivalent to `1.0`).
+- **Constraints:**
+  - **Model quality:**
+    - **Offline/Server scenario:**
+      - Category Hierarchical F1 score ≥ `0.7824`. This is the 99% recovery of `0.7903037`, the mean category hierarchical F1 score across 10 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 10 runs is `0.0002250412555`.
+    - **Interactive scenario:**
+      - Category Hierarchical F1 score ≥ `0.7799`. This is the 99% recovery of `0.7878`, the mean category hierarchical F1 score across 5 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 5 runs is `0.000535724`.
+  - **Server scenario:**
+    - Target latency is the constraint (not Time to First Token (TTFT) or Time per Output Token (TPOT)).
+    - Target latency percentile: `0.99`.
+    - Target latency ≤ 12 seconds.
+    - Performance sample count: `48289`.
+  - **Offline scenario:**
+    - Number of samples in the query ≥ `48289` (every sample in the dataset is sent to the VLM endpoint at least once).
+    - Performance sample count: `48289`.
+  - **Interactive scenario:**
+    - Target latency is the constraint (not TTFT or TPOT).
+    - Target latency percentile: `0.99`.
+    - Target latency ≤ 1.5 seconds.
+    - Performance sample count: `8000`.
+  - Testing duration ≥ 10 minutes.
+  - Sample concatenation permutation is enabled.
+  - You must explicitly set `--no-enable-prefix-caching` for vLLM.
+
+
+> [!CAUTION]
+> **MLPerf Inference v6.0 round only.**
+> The following sections and the Qwen3-VL reference implementation under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round, and they are **deprecated** for the newer rounds. Please use the above documentation along with [mlcommons/endpoints](https://github.com/mlcommons/endpoints) for the newer rounds.
+
 # Reference Implementation for the Qwen3-VL (Q3VL) Benchmark 
 
 ## Automated command to run the benchmark via MLCFlow