diff --git a/multimodal/qwen3-vl/README.md b/multimodal/qwen3-vl/README.md index eb032372c2..96438cce43 100644 --- a/multimodal/qwen3-vl/README.md +++ b/multimodal/qwen3-vl/README.md @@ -1,3 +1,177 @@ +# Reference Implementation for the Qwen3-VL (Q3VL) Benchmark + +For the MLPerf Inference v6.1 round, benchmarking uses a decoupled load generator client ([endpoints](https://github.com/mlcommons/endpoints#)), a model server (for example, [vLLM](https://github.com/vllm-project/vllm)), and the dataset/configuration described below. + +## Quick Start + +### Start the model server + +The model server can run in its own environment. Using vLLM as example, start vLLM as you would for any standard OpenAI-compatible deployment: + +```bash +export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct +export HF_TOKEN= # Optional for public models +export HF_HOME= + +docker run --runtime nvidia --gpus all \ + -p 8000:8000 \ + --ipc=host \ + --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ + -v ${HF_HOME}:/root/.cache/huggingface \ + vllm/vllm-openai:latest \ + --model ${MODEL_NAME} \ + --tensor-parallel-size 4 \ + --max-model-len=32768 \ + --async-scheduling \ + --limit-mm-per-prompt.video 0 \ + --no-enable-prefix-caching ## Must have this flag as the rule forbids prefix caching +``` + +### Set up endpoints + +After the server is ready to listen for requests, clone [endpoints](https://github.com/mlcommons/endpoints#) on the same node—or on any host that can reach the server over HTTP. Follow the [endpoints quick start](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e#quick-start) and install with either **uv**: + +```bash +git clone https://github.com/mlcommons/endpoints.git +cd endpoints +uv sync +``` + +or **pip** in a virtual environment: + +```bash +python3.12 -m venv venv && source venv/bin/activate +pip install . +``` + +### Configure the benchmark + +Example configs live under [endpoints/examples/08_Qwen3-VL-235B-A22B_Example](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e/examples/08_Qwen3-VL-235B-A22B_Example). + +#### Fields that the submitter **should** update to match their server status: + +- Served model name (to match the actual, probably quantized, model checkpoint): + +```yaml +model_params: + name: "Qwen/Qwen3-VL-235B-A22B-Instruct" +``` + +- The URL and port number of the endpoint: + +```yaml +endpoint_config: + endpoints: + - "http://localhost:8000" +``` + +#### Fields that the submitter **may** customize for performance tuning: + +- Target QPS (for the server and interactive scenarios): + + +- Client worker related settings: + +```yaml +client: + num_workers: 5 + transport: + type: zmq + recv_buffer_size: 16777216 + send_buffer_size: 16777216 + max_connections: 1000 + worker_initialization_timeout: 120 +``` + +#### Fileds that the submitter **MUST NOT** change for valid results: + +- Sampling parameters specified in the section [Reference Implementation Specification](#reference-implementation-specification) + +- Datasets (neither for performance evaluation nor for accuracy evaluation) + +### Run the benchmark + +Launch the offline scenario: + +```bash +uv run inference-endpoint benchmark from-config \ + -c examples/08_Qwen3-VL-235B-A22B_Example/offline_qwen3_vl_235b_a22b_shopify.yaml +``` + +Launch the server scenario: + +```bash +uv run inference-endpoint benchmark from-config \ + -c examples/08_Qwen3-VL-235B-A22B_Example/server_qwen3_vl_235b_a22b_shopify.yaml +``` + +Launch the interactive scenario: + +```bash +uv run inference-endpoint benchmark from-config \ + -c examples/08_Qwen3-VL-235B-A22B_Example/interactive_qwen3_vl_235b_a22b_shopify_8k.yaml +``` + +## Compliance Test + +Each example benchmark config includes an accuracy test that queries the same server backend. You do not need a separate accuracy-mode run. Reported accuracy must meet the minimum thresholds in [Reference Implementation Specification](#reference-implementation-specification) below. + +## Reference Implementation Specification + +### v6.1 round + +- **vLLM version:** [a65093c](https://github.com/vllm-project/vllm/tree/a65093c1a39a8ddd8455365128ecbe259350e22c) +- **endpoints version:** [381d13bbd27d6d52306813a51dc4e44295222d7e](https://github.com/mlcommons/endpoints/tree/381d13bbd27d6d52306813a51dc4e44295222d7e) +- **Model:** + - [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct) + - Commit SHA: [710c13861be6c466e66de3f484069440b8f31389](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct/tree/710c13861be6c466e66de3f484069440b8f31389) +- **Dataset:** + - **Offline/Server scenario:** + - [Shopify/product-catalogue](https://huggingface.co/datasets/Shopify/product-catalogue) + - Commit SHA: [d5c517c509f5aca99053897ef1de797d6d7e5aa5](https://huggingface.co/datasets/Shopify/product-catalogue/tree/d5c517c509f5aca99053897ef1de797d6d7e5aa5) + - Both the `train` and `test` splits are used, concatenated in that order. + - Total number of samples: `48289`. + - **Interactive scenario:** + - [nvidia/Shopify-product-catalogue-8k](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k) + - Commit SHA: [2bc8c6c4b6ebd27b880b0cba519cb45d09867045](https://huggingface.co/datasets/nvidia/Shopify-product-catalogue-8k/commit/2bc8c6c4b6ebd27b880b0cba519cb45d09867045) + - Total number of samples: `8000`. +- **Guided decoding:** not used. +- **Sampling parameters:** + - Frequency penalty: `None` (mathematically equivalent to `0.0`). + - Presence penalty: `None` (mathematically equivalent to `0.0`). + - Temperature: `None` (mathematically equivalent to `1.0`). + - Top-P: `None` (mathematically equivalent to `1.0`). + - Top-K: `None` (mathematically equivalent to `0`). + - Min-P: `None` (mathematically equivalent to `0.0`). + - Repetition penalty: `None` (mathematically equivalent to `1.0`). +- **Constraints:** + - **Model quality:** + - **Offline/Server scenario:** + - Category Hierarchical F1 score ≥ `0.7824`. This is the 99% recovery of `0.7903037`, the mean category hierarchical F1 score across 10 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 10 runs is `0.0002250412555`. + - **Interactive scenario:** + - Category Hierarchical F1 score ≥ `0.7799`. This is the 99% recovery of `0.7878`, the mean category hierarchical F1 score across 5 runs on [the BF16 version of the model](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). The standard deviation across those 5 runs is `0.000535724`. + - **Server scenario:** + - Target latency is the constraint (not Time to First Token (TTFT) or Time per Output Token (TPOT)). + - Target latency percentile: `0.99`. + - Target latency ≤ 12 seconds. + - Performance sample count: `48289`. + - **Offline scenario:** + - Number of samples in the query ≥ `48289` (every sample in the dataset is sent to the VLM endpoint at least once). + - Performance sample count: `48289`. + - **Interactive scenario:** + - Target latency is the constraint (not TTFT or TPOT). + - Target latency percentile: `0.99`. + - Target latency ≤ 1.5 seconds. + - Performance sample count: `8000`. + - Testing duration ≥ 10 minutes. + - Sample concatenation permutation is enabled. + - You must explicitly set `--no-enable-prefix-caching` for vLLM. + + +> [!CAUTION] +> **MLPerf Inference v6.0 round only.** +> The following sections and the Qwen3-VL reference implementation under `multimodal/qwen3-vl` were maintained for the **v6.0** submission round, and they are **deprecated** for the newer rounds. Please use the above documentation along with [mlcommons/endpoints](https://github.com/mlcommons/endpoints) for the newer rounds. + # Reference Implementation for the Qwen3-VL (Q3VL) Benchmark ## Automated command to run the benchmark via MLCFlow