feat: Support DSV4 example. #304

arekay-nv wants to merge 5 commits into main from arekay/dsv4_example

# DeepSeek-V4-Pro Benchmark

End-to-end example for benchmarking `deepseek-ai/DeepSeek-V4-Pro` with vLLM on 8×B200 (or 8×B300), covering performance throughput and accuracy evaluation (AIME 2025, GPQA, and the full MLPerf Inference accuracy suite).

## Hardware

| Requirement | Details |
| ------------ | ---------------------------------------------------------- |
| GPUs | 8× NVIDIA B200 or B300 |
| System RAM | ≥ 256 GB |
| Docker image | `vllm/vllm-openai:deepseekv4-cu130` |
| Startup time | ~22 minutes (weight loading + TileLang kernel compilation) |

The recipe is taken from the [vLLM DeepSeek V4 blog post](https://github.com/vllm-project/vllm-project.github.io/blob/main/_posts/2026-04-24-deepseek-v4.md).

## Environment Setup

```bash
export MODEL_PATH=/path/to/DeepSeek-V4-Pro  # local weight directory
export HF_HOME=~/.cache/huggingface
export HF_TOKEN=<your HuggingFace token>
```

## Launching the Server

```bash
bash examples/09_DeepSeek-V4-Pro_Example/launch_server.sh
```

The script mounts `$MODEL_PATH` into the container at `/model`, sets `VLLM_ENGINE_READY_TIMEOUT_S=3600`, and polls `/health` until the server is ready.

### Why `VLLM_ENGINE_READY_TIMEOUT_S=3600` is required

The default value is 600 s (10 min). Loading DeepSeek-V4-Pro's 64 safetensor shards and compiling TileLang kernels (`mhc_pre_big_fuse_tilelang` etc.) across 8 DP workers takes ~22 min on 8×B200. With the default timeout, the `ApiServer_0` process raises a `TimeoutError` and exits even though all 8 engine workers completed successfully, crashing the container. Setting the timeout to 3600 s avoids this entirely.

### Key launch flags

| Flag | Purpose |
| ------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
| `--data-parallel-size 8` | Expert parallelism across 8 GPUs (no TP needed for MoE) |
| `--enable-expert-parallel` | Required alongside `--data-parallel-size` |
| `--kv-cache-dtype fp8` | Matches DeepSeek V4's hybrid c4a / c128a KV cache design |
| `--block-size 256` | Unified 256-token logical block across all compression layers |
| `--attention_config.use_fp4_indexer_cache=True` | FP4 indexer for ~2x additional KV savings |
| `--tokenizer-mode deepseek_v4` | Required for the V4 chat template |
| `--reasoning-parser deepseek_v4` | Strips `<think>…</think>` into `reasoning_content` |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Enables TileLang kernel fusions |
| `VLLM_ENGINE_READY_TIMEOUT_S=3600` | Prevents premature `ApiServer_0` timeout during startup |

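With the reasoning parser enabled, chat-completion responses carry the parsed `<think>` block in a separate `reasoning_content` field next to the final `content`. A minimal sketch of reading both fields from a response payload (the payload below is a hypothetical example in the OpenAI-compatible shape, not captured server output):

```python
# Hypothetical chat-completions response payload. With
# --reasoning-parser deepseek_v4, vLLM moves the <think> block into
# "reasoning_content" and leaves the final answer in "content".
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "First check small cases...",
                "content": "\\boxed{42}",
            }
        }
    ]
}

message = response["choices"][0]["message"]
answer = message["content"]
reasoning = message.get("reasoning_content", "")

print(answer)  # -> \boxed{42}
```

Clients that only want the final answer can ignore `reasoning_content` entirely.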
## Performance Benchmark

```bash
uv run inference-endpoint benchmark from-config \
  -c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_perf.yaml
```

Config: [`vllm_dsv4pro_perf.yaml`](vllm_dsv4pro_perf.yaml)

- 2-minute minimum run at concurrency 32
- Metrics: throughput, latency, TTFT, TPOT

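For reference, TTFT and TPOT relate to end-to-end latency in the usual way: TPOT is the average gap between output tokens after the first one. A quick sketch of that arithmetic (the numbers are illustrative, not measured results from this benchmark):

```python
# Illustrative numbers, not measured results.
ttft_s = 0.8          # time to first token, seconds
e2e_latency_s = 14.0  # full request latency, seconds
output_tokens = 512

# TPOT: average time per output token after the first one.
tpot_s = (e2e_latency_s - ttft_s) / (output_tokens - 1)

print(f"TPOT ≈ {tpot_s * 1000:.1f} ms/token")  # ≈ 25.8 ms/token
```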
## Accuracy Benchmark (AIME 2025 + GPQA)

```bash
uv run inference-endpoint benchmark from-config \
  -c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_accuracy.yaml
```

Config: [`vllm_dsv4pro_accuracy.yaml`](vllm_dsv4pro_accuracy.yaml)

| Dataset | Samples | Repeats | Extractor | Scorer |
| ------------ | ------- | ------- | ---------------------- | ----------- |
| AIME 2025 | 30 | 8 | `boxed_math_extractor` | `pass_at_1` |
| GPQA Diamond | 198 | 5 | `abcd_extractor` | `pass_at_1` |

### Concurrency note

`target_concurrency: 4` is intentional. With `max_model_len=65536` and `max_new_tokens=32768`, each in-flight request can occupy up to 32k tokens of KV cache. Four concurrent requests fit within the fp8 KV cache budget without preemption on 8×B200.

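The budget arithmetic behind that choice can be sketched directly from the numbers above (block math assumes the 256-token logical block size from the launch flags):

```python
max_new_tokens = 32768      # per-request generation budget
target_concurrency = 4      # in-flight requests
block_size = 256            # unified logical KV block size

# Worst case: every in-flight request holds its full generation budget in KV cache.
peak_kv_tokens = max_new_tokens * target_concurrency
peak_blocks = peak_kv_tokens // block_size

print(peak_kv_tokens)  # 131072
print(peak_blocks)     # 512
```

Raising concurrency beyond this point would push peak KV usage past the budget and trigger preemption, which is why the config pins it at 4.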
### Thinking mode and `budget_tokens`

The `aime25::gptoss_budget_20k` preset enables DeepSeek's thinking mode (`chat_template_kwargs: {thinking: True, budget_tokens: 20000}`). Without `budget_tokens`, the model can spend all 32k tokens in the `<think>` block and return an empty boxed answer; this was observed on ~85% of responses in early testing. Setting `budget_tokens=20000` caps the reasoning phase and forces a final answer.

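As a sketch, the preset's template kwargs correspond to a request body like the following (the message content is a placeholder; the `chat_template_kwargs` field mirrors the preset, assuming vLLM's extra-parameter extension to the OpenAI-compatible API):

```python
import json

# Request-body sketch for the OpenAI-compatible endpoint. chat_template_kwargs
# is vLLM's extension for passing chat-template options such as thinking budgets.
payload = {
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "<AIME problem text here>"}],
    "max_tokens": 32768,
    "chat_template_kwargs": {"thinking": True, "budget_tokens": 20000},
}

body = json.dumps(payload)
print("budget_tokens" in body)  # True
```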
### Measured results (8×B200, `deepseekv4-cu130`)

| Dataset | Score |
| ---------------- | ------------------------------------------ |
| AIME 2025 pass@1 | **55.4%** (8 repeats, budget_tokens=20000) |

## MLPerf Inference Accuracy Suite

The MLPerf DeepSeek-R1 accuracy check uses 5 sub-datasets (4388 total samples):

| Sub-dataset | Samples | Metric | File |
| ------------- | ------- | ------------------- | ------------------------------------------ |
| AIME 1983 | 932 | exact_match | `mlperf_deepseek_r1_math_accuracy.parquet` |
| MATH-500 | 499 | exact_match | `mlperf_deepseek_r1_math_accuracy.parquet` |
| GPQA | 198 | exact_match | `mlperf_deepseek_r1_mcq_accuracy.parquet` |
| MMLU-Pro | 2410 | exact_match | extracted by `extract_mlperf_subsets.py` |
| LiveCodeBench | 349 | code_execute_verify | extracted by `extract_mlperf_subsets.py` |

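As a sanity check, the per-subset counts in the table sum to the 4388-sample total:

```python
# Per-subset sample counts from the table above.
samples = {
    "AIME 1983": 932,
    "MATH-500": 499,
    "GPQA": 198,
    "MMLU-Pro": 2410,
    "LiveCodeBench": 349,
}

total = sum(samples.values())
print(total)  # 4388
```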
**Golden accuracy (fp32):** `exact_match = 81.3582%`, `TOKENS_PER_SAMPLE = 3886.2`
**MLPerf pass threshold:** ≥ 80.52% exact_match (99% of golden), tokens within ±10%

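A sketch of the pass/fail check implied by those thresholds (the function name and exact tolerance handling are illustrative, not the MLPerf reference implementation):

```python
GOLDEN_EXACT_MATCH = 81.3582  # fp32 golden accuracy, percent
GOLDEN_TOKENS = 3886.2        # golden TOKENS_PER_SAMPLE
MIN_EXACT_MATCH = 80.52       # stated pass threshold (99% of golden)


def mlperf_accuracy_passes(exact_match: float, tokens_per_sample: float) -> bool:
    # Accuracy must meet the threshold, and the mean token count must
    # stay within ±10% of the golden value.
    tokens_ok = 0.9 * GOLDEN_TOKENS <= tokens_per_sample <= 1.1 * GOLDEN_TOKENS
    return exact_match >= MIN_EXACT_MATCH and tokens_ok


print(mlperf_accuracy_passes(81.0, 3900.0))  # True
print(mlperf_accuracy_passes(79.9, 3900.0))  # False (accuracy below threshold)
```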
### Step 1 — Extract missing subsets

```bash
uv run python examples/09_DeepSeek-V4-Pro_Example/extract_mlperf_subsets.py
```

This writes:

- `datasets/deepseek/mlperf_deepseek_r1_mmlu_pro_accuracy.parquet`
- `datasets/deepseek/mlperf_deepseek_r1_livecodebench_accuracy.parquet`

### Step 2 — Run math + MCQ accuracy

Make sure the `mlperf-mmlu-pro` dataset entry is enabled (uncommented) in [`vllm_dsv4pro_mlperf_accuracy.yaml`](vllm_dsv4pro_mlperf_accuracy.yaml), then:

```bash
uv run inference-endpoint benchmark from-config \
  -c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_mlperf_accuracy.yaml
```

### Step 3 — Run LiveCodeBench accuracy

LiveCodeBench requires the `lcb-service` container, which executes generated Python code in an isolated environment. See the [LiveCodeBench README](../../src/inference_endpoint/evaluation/livecodebench/README.md) for container setup. Once it is running on port 13835, uncomment the `mlperf-livecodebench` dataset in `vllm_dsv4pro_mlperf_accuracy.yaml` and re-run.

## Troubleshooting

**Container exits immediately or health check never passes**

```bash
docker logs <container_id> | tail -40
```

Common causes:

- `TimeoutError: Timed out waiting for engine core processes to start` — set `VLLM_ENGINE_READY_TIMEOUT_S=3600` (already set in `launch_server.sh`)
- OOM during weight loading — verify `--max-model-len` is not too large for available GPU memory
- `MODEL_PATH` not mounted correctly — check that `/model/config.json` exists inside the container

**`At least one performance dataset required`**

Every benchmark config must include at least one `type: performance` dataset entry, even for accuracy-only runs. Use the perf-warmup entry with `n_samples_to_issue: 1`.

**Empty boxed answers / low AIME accuracy**

The model exhausted `max_new_tokens` in the thinking phase. Add `budget_tokens` to the preset:

```yaml
- name: aime25::gptoss_budget_20k  # uses budget_tokens=20000
```

**`uv: cannot execute binary file: Exec format error`**

The `uv` binary in `~/.local/bin/uv` has the wrong architecture. Use the venv directly:

```bash
.venv/bin/inference-endpoint benchmark from-config -c <config.yaml>
```

`examples/09_DeepSeek-V4-Pro_Example/extract_mlperf_subsets.py` (59 additions)

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Extract MMLU-Pro and LiveCodeBench subsets from the combined MLPerf accuracy parquet.

The combined parquet (mlperf_deepseek_r1_dataset_4388_fp8_eval_accuracy.parquet) contains
5 sub-datasets identified by the 'dataset' column. Pre-split files already exist for Math
and GPQA; this script extracts the remaining two:

    datasets/deepseek/mlperf_deepseek_r1_mmlu_pro_accuracy.parquet       (2410 rows)
    datasets/deepseek/mlperf_deepseek_r1_livecodebench_accuracy.parquet  (349 rows)

Usage:
    uv run python examples/09_DeepSeek-V4-Pro_Example/extract_mlperf_subsets.py
"""

from pathlib import Path

import pandas as pd

SRC = Path(
    "datasets/deepseek/mlperf_deepseek_r1_dataset_4388_fp8_eval_accuracy.parquet"
)
OUT_DIR = Path("datasets/deepseek")

SUBSETS = {
    "mmlu_pro": "mlperf_deepseek_r1_mmlu_pro_accuracy.parquet",
    "livecodebench": "mlperf_deepseek_r1_livecodebench_accuracy.parquet",
}


def main() -> None:
    df = pd.read_parquet(SRC)
    print(f"Loaded {len(df)} rows from {SRC}")
    print("Sub-dataset breakdown:")
    print(df.groupby(["dataset", "metric"]).size().to_string())
    print()

    for dataset_name, out_filename in SUBSETS.items():
        subset = df[df["dataset"] == dataset_name].reset_index(drop=True)
        out_path = OUT_DIR / out_filename
        subset.to_parquet(out_path, index=False)
        print(f"Wrote {len(subset)} rows → {out_path}")


if __name__ == "__main__":
    main()
```

`examples/09_DeepSeek-V4-Pro_Example/launch_server.sh` (94 additions)

```bash
#!/usr/bin/env bash
# Launch DeepSeek-V4-Pro with vLLM on 8×B200 / 8×B300.
#
# Key flags vs. a standard vLLM launch:
#   --data-parallel-size 8                          Expert parallelism across 8 GPUs (no TP)
#   --enable-expert-parallel                        Required for MoE data-parallel dispatch
#   --kv-cache-dtype fp8                            DeepSeek V4's hybrid KV cache (c4a/c128a)
#   --block-size 256                                Matches the 256-native-token logical block size
#   --attention_config.use_fp4_indexer_cache=True   FP4 indexer for 2x KV savings
#   --tokenizer-mode deepseek_v4                    Custom tokenizer for V4 chat template
#   --reasoning-parser deepseek_v4                  Strips <think>…</think> into reasoning_content
#   --compilation-config …                          FULL_AND_PIECEWISE cudagraph + all custom fusions
#
# Startup time note:
#   Model weight loading (64 shards) + TileLang kernel compilation takes ~22 min
#   on 8×B200. The default VLLM_ENGINE_READY_TIMEOUT_S=600 (10 min) is too short
#   and will crash the API server with a TimeoutError even though the workers are
#   fine. Always set VLLM_ENGINE_READY_TIMEOUT_S=3600 for this model.

set -euo pipefail

: "${MODEL_PATH:?Set MODEL_PATH to the directory containing the DeepSeek-V4-Pro weights}"
PORT="${PORT:-8000}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-65536}"

if [[ ! -d "${MODEL_PATH}" ]]; then
  echo "ERROR: MODEL_PATH=${MODEL_PATH} does not exist."
  echo "Set MODEL_PATH to the directory containing the DeepSeek-V4-Pro weights."
  exit 1
fi

echo "Launching DeepSeek-V4-Pro on port ${PORT} (model: ${MODEL_PATH})"
echo "Startup takes ~22 minutes for weight loading + TileLang kernel compilation."
echo ""

CONTAINER_ID=$(docker run -d \
  --gpus all \
  --shm-size 32g \
  --net host \
  --ipc host \
  -v "${MODEL_PATH}:/model" \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}:/root/.cache/huggingface" \
  --env HF_TOKEN="${HF_TOKEN:-}" \
  --env VLLM_WORKER_MULTIPROC_METHOD=spawn \
  --env VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu130 \
  --model /model \
  --served-model-name deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --max-model-len "${MAX_MODEL_LEN}" \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --attention_config.use_fp4_indexer_cache=True \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --disable-log-stats \
  --disable-uvicorn-access-log \
  --port "${PORT}")

echo "Container started: ${CONTAINER_ID:0:12}"
echo ""
echo "Polling http://localhost:${PORT}/health ..."

TIMEOUT=2400
START=$(date +%s)
while true; do
  if curl -sf "http://localhost:${PORT}/health" > /dev/null 2>&1; then
    ELAPSED=$(( $(date +%s) - START ))
    echo "Server healthy after ${ELAPSED}s. Ready to benchmark."
    break
  fi
  ELAPSED=$(( $(date +%s) - START ))
  if [[ ${ELAPSED} -ge ${TIMEOUT} ]]; then
    echo "ERROR: server not healthy after ${TIMEOUT}s"
    docker logs "${CONTAINER_ID}" | tail -40
    exit 1
  fi
  if [[ "$(docker inspect -f '{{.State.Running}}' "${CONTAINER_ID}" 2>/dev/null)" != "true" ]]; then
    echo "ERROR: container exited unexpectedly"
    docker logs "${CONTAINER_ID}" | tail -40
    exit 1
  fi
  echo "  Waiting... (${ELAPSED}s)"
  sleep 15
done

echo ""
echo "Container ID : ${CONTAINER_ID:0:12}"
echo "Stop with    : docker stop ${CONTAINER_ID:0:12}"
```

**Review comment:** This section instructs to "Uncomment MMLU-Pro" in the MLPerf config, but `mlperf-mmlu-pro` is already enabled in `vllm_dsv4pro_mlperf_accuracy.yaml`. Please adjust the instruction (or comment out that dataset in the YAML) to keep the example steps consistent.