177 changes: 177 additions & 0 deletions examples/09_DeepSeek-V4-Pro_Example/README.md
@@ -0,0 +1,177 @@
# DeepSeek-V4-Pro Benchmark

End-to-end example for benchmarking `deepseek-ai/DeepSeek-V4-Pro` with vLLM on 8×B200 (or 8×B300), covering both performance (throughput, latency) and accuracy evaluation (AIME 2025, GPQA, and the full MLPerf Inference accuracy suite).

## Hardware

| Requirement | Details |
| ------------ | ---------------------------------------------------------- |
| GPUs | 8× NVIDIA B200 or B300 |
| System RAM | ≥ 256 GB |
| Docker image | `vllm/vllm-openai:deepseekv4-cu130` |
| Startup time | ~22 minutes (weight loading + TileLang kernel compilation) |

The recipe is taken from the [vLLM DeepSeek V4 blog post](https://github.com/vllm-project/vllm-project.github.io/blob/main/_posts/2026-04-24-deepseek-v4.md).

## Environment Setup

```bash
export MODEL_PATH=/path/to/DeepSeek-V4-Pro # local weight directory
export HF_HOME=~/.cache/huggingface
export HF_TOKEN=<your HuggingFace token>
```
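
Before launching, a quick sanity check that the weight directory is complete — a sketch assuming the 64-shard safetensors layout described below:

```bash
# Both checks should pass before committing to the ~22 min startup.
test -f "${MODEL_PATH}/config.json" || echo "missing config.json"
ls "${MODEL_PATH}"/*.safetensors 2>/dev/null | wc -l   # expect 64 shards
```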

## Launching the Server

```bash
bash examples/09_DeepSeek-V4-Pro_Example/launch_server.sh
```

The script mounts `$MODEL_PATH` into the container at `/model`, sets
`VLLM_ENGINE_READY_TIMEOUT_S=3600`, and polls `/health` until the server is ready.
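
Once the health check passes, a quick smoke test against the OpenAI-compatible API (both endpoints are standard in the `vllm/vllm-openai` images; adjust the port if `PORT` was overridden):

```bash
curl -sf http://localhost:8000/health && echo "healthy"
# Should list deepseek-ai/DeepSeek-V4-Pro as the served model.
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```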

### Why `VLLM_ENGINE_READY_TIMEOUT_S=3600` is required

The default value is 600 s (10 min). Loading DeepSeek-V4-Pro's 64 safetensor shards plus
compiling TileLang kernels (`mhc_pre_big_fuse_tilelang` etc.) across 8 DP workers takes
~22 min on 8×B200. With the default timeout the `ApiServer_0` process raises a `TimeoutError`
and exits — even though all 8 engine workers completed successfully — causing the container to
crash. Setting the timeout to 3600 s avoids this entirely.

### Key launch flags

| Flag | Purpose |
| ------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
| `--data-parallel-size 8` | Expert parallelism across 8 GPUs (no TP needed for MoE) |
| `--enable-expert-parallel` | Required alongside `--data-parallel-size` |
| `--kv-cache-dtype fp8` | Matches DeepSeek V4's hybrid c4a / c128a KV cache design |
| `--block-size 256` | Unified 256-token logical block across all compression layers |
| `--attention_config.use_fp4_indexer_cache=True` | FP4 indexer for ~2x additional KV savings |
| `--tokenizer-mode deepseek_v4` | Required for the V4 chat template |
| `--reasoning-parser deepseek_v4` | Strips `<think>…</think>` into `reasoning_content` |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Enables TileLang kernel fusions |
| `VLLM_ENGINE_READY_TIMEOUT_S=3600` | Prevents premature `ApiServer_0` timeout during startup |

## Performance Benchmark

```bash
uv run inference-endpoint benchmark from-config \
-c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_perf.yaml
```

Config: [`vllm_dsv4pro_perf.yaml`](vllm_dsv4pro_perf.yaml)

- 2-minute minimum run at concurrency 32
- Metrics: throughput, latency, TTFT, TPOT
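
The config schema isn't reproduced in this README; as a rough sketch, the performance entry could look like the following, reusing the `target_concurrency` and `n_samples_to_issue` field names referenced elsewhere in this example — the dataset name and the duration field are placeholders, not verified against the schema:

```yaml
# Hypothetical sketch of the performance entry (field names not all verified).
datasets:
  - name: perf-random            # placeholder dataset name
    type: performance
    target_concurrency: 32       # the concurrency-32 run described above
    min_duration_s: 120          # placeholder for the 2-minute minimum run
```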

## Accuracy Benchmark (AIME 2025 + GPQA)

```bash
uv run inference-endpoint benchmark from-config \
-c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_accuracy.yaml
```

Config: [`vllm_dsv4pro_accuracy.yaml`](vllm_dsv4pro_accuracy.yaml)

| Dataset | Samples | Repeats | Extractor | Scorer |
| ------------ | ------- | ------- | ---------------------- | ----------- |
| AIME 2025 | 30 | 8 | `boxed_math_extractor` | `pass_at_1` |
| GPQA Diamond | 198 | 5 | `abcd_extractor` | `pass_at_1` |
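
A hedged sketch of how the two accuracy entries might be expressed, combining the preset name, extractor, and scorer shown above — `type` and `chat_template_kwargs` appear elsewhere in this README, but the remaining field names and the placement of `target_concurrency` are assumptions:

```yaml
# Hypothetical sketch (not verified against the config schema).
target_concurrency: 4            # placement assumed; see the concurrency note below
datasets:
  - name: aime25::gptoss_budget_20k
    type: accuracy               # assumed counterpart of `type: performance`
    repeats: 8
    extractor: boxed_math_extractor
    scorer: pass_at_1
    chat_template_kwargs: {thinking: True, budget_tokens: 20000}
```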

### Concurrency note

`target_concurrency: 4` is intentional. With `max_model_len=65536` and `max_new_tokens=32768`,
each in-flight request can occupy up to 32k tokens of KV cache. Four concurrent requests
fit within the fp8 KV cache budget without preemption on 8×B200.

### Thinking mode and `budget_tokens`

The `aime25::gptoss_budget_20k` preset enables DeepSeek's thinking mode
(`chat_template_kwargs: {thinking: True, budget_tokens: 20000}`). Without `budget_tokens`,
the model can spend all 32k tokens in the `<think>` block and return an empty boxed answer —
observed on ~85% of responses in early testing. Setting `budget_tokens=20000` caps the
reasoning phase and forces a final answer.
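
To see the effect per-request, the same kwargs can be passed through vLLM's OpenAI-compatible chat endpoint (`chat_template_kwargs` is a vLLM extension to the standard API; the kwargs themselves are taken from the preset above):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": "What is 17^2?"}],
        "max_tokens": 32768,
        "chat_template_kwargs": {"thinking": true, "budget_tokens": 20000}
      }'
```

With `--reasoning-parser deepseek_v4` active, the reasoning text comes back in `reasoning_content` rather than in `content`.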

### Measured results (8×B200, `deepseekv4-cu130`)

| Dataset | Score |
| ---------------- | ------------------------------------------ |
| AIME 2025 pass@1 | **55.4%** (8 repeats, budget_tokens=20000) |

## MLPerf Inference Accuracy Suite

The MLPerf DeepSeek-R1 accuracy check uses 5 sub-datasets (4388 total samples):

| Sub-dataset | Samples | Metric | File |
| ------------- | ------- | ------------------- | ------------------------------------------ |
| AIME 1983 | 932 | exact_match | `mlperf_deepseek_r1_math_accuracy.parquet` |
| MATH-500 | 499 | exact_match | `mlperf_deepseek_r1_math_accuracy.parquet` |
| GPQA | 198 | exact_match | `mlperf_deepseek_r1_mcq_accuracy.parquet` |
| MMLU-Pro | 2410 | exact_match | extracted by `extract_mlperf_subsets.py` |
| LiveCodeBench | 349 | code_execute_verify | extracted by `extract_mlperf_subsets.py` |

- **Golden accuracy (fp32):** `exact_match = 81.3582%`, `TOKENS_PER_SAMPLE = 3886.2`
- **MLPerf pass threshold:** `exact_match` ≥ 80.52% (99% of golden), tokens per sample within ±10%

### Step 1 — Extract missing subsets

```bash
uv run python examples/09_DeepSeek-V4-Pro_Example/extract_mlperf_subsets.py
```

This writes:

- `datasets/deepseek/mlperf_deepseek_r1_mmlu_pro_accuracy.parquet`
- `datasets/deepseek/mlperf_deepseek_r1_livecodebench_accuracy.parquet`
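
A quick check that the extraction produced the expected row counts (2410 and 349, per the table above):

```bash
uv run python - <<'EOF'
import pandas as pd

for name, expected in [("mmlu_pro", 2410), ("livecodebench", 349)]:
    path = f"datasets/deepseek/mlperf_deepseek_r1_{name}_accuracy.parquet"
    n = len(pd.read_parquet(path))
    print(f"{path}: {n} rows", "OK" if n == expected else f"(expected {expected})")
EOF
```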

### Step 2 — Run math + MCQ accuracy

Uncomment MMLU-Pro in [`vllm_dsv4pro_mlperf_accuracy.yaml`](vllm_dsv4pro_mlperf_accuracy.yaml), then:

```bash
uv run inference-endpoint benchmark from-config \
-c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_mlperf_accuracy.yaml
```

> **Copilot AI** commented on lines +127 to +134 (May 1, 2026):
>
> This section instructs to "Uncomment MMLU-Pro" in the MLPerf config, but `mlperf-mmlu-pro` is already enabled in `vllm_dsv4pro_mlperf_accuracy.yaml`. Please adjust the instruction (or comment out that dataset in the YAML) to keep the example steps consistent.

### Step 3 — Run LiveCodeBench accuracy

LiveCodeBench requires the `lcb-service` container (executes generated Python code in an
isolated environment). See the
[LiveCodeBench README](../../src/inference_endpoint/evaluation/livecodebench/README.md) for
container setup. Once running on port 13835, uncomment the `mlperf-livecodebench` dataset in
`vllm_dsv4pro_mlperf_accuracy.yaml` and re-run.
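
Before re-running, it's worth confirming that something is listening on that port; a bash-only probe (the service's own health endpoint isn't documented here, so this only checks TCP reachability):

```bash
# Succeeds only if port 13835 accepts TCP connections.
(exec 3<>/dev/tcp/localhost/13835) 2>/dev/null && echo "lcb-service port open"
```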

## Troubleshooting

**Container exits immediately or health check never passes**

```bash
docker logs <container_id> | tail -40
```

Common causes:

- `TimeoutError: Timed out waiting for engine core processes to start` — set `VLLM_ENGINE_READY_TIMEOUT_S=3600` (already set in `launch_server.sh`)
- OOM during weight loading — verify `--max-model-len` is not too large for available GPU memory
- `MODEL_PATH` not mounted correctly — check that `/model/config.json` exists inside the container

**`At least one performance dataset required`**

Every benchmark config must include at least one `type: performance` dataset entry, even for
accuracy-only runs. Use the perf-warmup entry with `n_samples_to_issue: 1`.
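
A minimal warmup entry might look like this — only `type: performance` and `n_samples_to_issue` are field names confirmed by this README; the dataset name is illustrative:

```yaml
datasets:
  - name: perf-warmup            # illustrative name
    type: performance
    n_samples_to_issue: 1        # satisfies the check without a full perf run
```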

**Empty boxed answers / low AIME accuracy**

The model exhausted `max_new_tokens` in the thinking phase. Add `budget_tokens` to the preset:

```yaml
- name: aime25::gptoss_budget_20k # uses budget_tokens=20000
```

**`uv: cannot execute binary file: Exec format error`**

The `uv` binary in `~/.local/bin/uv` has the wrong architecture. Use the venv directly:

```bash
.venv/bin/inference-endpoint benchmark from-config -c <config.yaml>
```
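
To confirm the mismatch, compare the binary's architecture against the host's:

```bash
file ~/.local/bin/uv   # reports the binary's architecture, e.g. aarch64 vs x86-64
uname -m               # host architecture
```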
59 changes: 59 additions & 0 deletions examples/09_DeepSeek-V4-Pro_Example/extract_mlperf_subsets.py
@@ -0,0 +1,59 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Extract MMLU-Pro and LiveCodeBench subsets from the combined MLPerf accuracy parquet.

The combined parquet (mlperf_deepseek_r1_dataset_4388_fp8_eval_accuracy.parquet) contains
5 sub-datasets identified by the 'dataset' column. Pre-split files already exist for Math
and GPQA; this script extracts the remaining two:

datasets/deepseek/mlperf_deepseek_r1_mmlu_pro_accuracy.parquet (2410 rows)
datasets/deepseek/mlperf_deepseek_r1_livecodebench_accuracy.parquet (349 rows)

Usage:
uv run python examples/09_DeepSeek-V4-Pro_Example/extract_mlperf_subsets.py
"""

from pathlib import Path

import pandas as pd

SRC = Path(
"datasets/deepseek/mlperf_deepseek_r1_dataset_4388_fp8_eval_accuracy.parquet"
)
OUT_DIR = Path("datasets/deepseek")

SUBSETS = {
"mmlu_pro": "mlperf_deepseek_r1_mmlu_pro_accuracy.parquet",
"livecodebench": "mlperf_deepseek_r1_livecodebench_accuracy.parquet",
}


def main() -> None:
df = pd.read_parquet(SRC)
print(f"Loaded {len(df)} rows from {SRC}")
print("Sub-dataset breakdown:")
print(df.groupby(["dataset", "metric"]).size().to_string())
print()

for dataset_name, out_filename in SUBSETS.items():
subset = df[df["dataset"] == dataset_name].reset_index(drop=True)
out_path = OUT_DIR / out_filename
subset.to_parquet(out_path, index=False)
print(f"Wrote {len(subset)} rows → {out_path}")


if __name__ == "__main__":
main()
94 changes: 94 additions & 0 deletions examples/09_DeepSeek-V4-Pro_Example/launch_server.sh
@@ -0,0 +1,94 @@
#!/usr/bin/env bash
# Launch DeepSeek-V4-Pro with vLLM on 8×B200 / 8×B300.
#
# Key flags vs. a standard vLLM launch:
# --data-parallel-size 8 Expert parallelism across 8 GPUs (no TP)
# --enable-expert-parallel Required for MoE data-parallel dispatch
# --kv-cache-dtype fp8 DeepSeek V4's hybrid KV cache (c4a/c128a)
# --block-size 256 Matches the 256-native-token logical block size
# --attention_config.use_fp4_indexer_cache=True FP4 indexer for 2x KV savings
# --tokenizer-mode deepseek_v4 Custom tokenizer for V4 chat template
# --reasoning-parser deepseek_v4 Strips <think>…</think> into reasoning_content
# --compilation-config … FULL_AND_PIECEWISE cudagraph + all custom fusions
#
# Startup time note:
# Model weight loading (64 shards) + TileLang kernel compilation takes ~22 min
# on 8×B200. The default VLLM_ENGINE_READY_TIMEOUT_S=600 (10 min) is too short
# and will crash the API server with a TimeoutError even though the workers are
# fine. Always set VLLM_ENGINE_READY_TIMEOUT_S=3600 for this model.

set -euo pipefail

: "${MODEL_PATH:?Set MODEL_PATH to the directory containing the DeepSeek-V4-Pro weights}"
PORT="${PORT:-8000}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-65536}"

if [[ ! -d "${MODEL_PATH}" ]]; then
echo "ERROR: MODEL_PATH=${MODEL_PATH} does not exist."
echo "Set MODEL_PATH to the directory containing the DeepSeek-V4-Pro weights."
exit 1
fi

echo "Launching DeepSeek-V4-Pro on port ${PORT} (model: ${MODEL_PATH})"
echo "Startup takes ~22 minutes for weight loading + TileLang kernel compilation."
echo ""

CONTAINER_ID=$(docker run -d \
--gpus all \
--shm-size 32g \
--net host \
--ipc host \
-v "${MODEL_PATH}:/model" \
-v "${HF_HOME:-${HOME}/.cache/huggingface}:/root/.cache/huggingface" \
--env HF_TOKEN="${HF_TOKEN:-}" \
--env VLLM_WORKER_MULTIPROC_METHOD=spawn \
--env VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 \
--model /model \
--served-model-name deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-size 8 \
--max-model-len "${MAX_MODEL_LEN}" \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--disable-log-stats \
--disable-uvicorn-access-log \
--port "${PORT}")

echo "Container started: ${CONTAINER_ID:0:12}"
echo ""
echo "Polling http://localhost:${PORT}/health ..."

TIMEOUT=2400   # health-poll budget in seconds (40 min; startup is ~22 min on 8×B200)
START=$(date +%s)
while true; do
if curl -sf "http://localhost:${PORT}/health" > /dev/null 2>&1; then
ELAPSED=$(( $(date +%s) - START ))
echo "Server healthy after ${ELAPSED}s. Ready to benchmark."
break
fi
ELAPSED=$(( $(date +%s) - START ))
if [[ ${ELAPSED} -ge ${TIMEOUT} ]]; then
echo "ERROR: server not healthy after ${TIMEOUT}s"
docker logs "${CONTAINER_ID}" | tail -40
exit 1
fi
if [[ "$(docker inspect -f '{{.State.Running}}' "${CONTAINER_ID}" 2>/dev/null)" != "true" ]]; then
echo "ERROR: container exited unexpectedly"
docker logs "${CONTAINER_ID}" | tail -40
exit 1
fi
echo " Waiting... (${ELAPSED}s)"
sleep 15
done

echo ""
echo "Container ID : ${CONTAINER_ID:0:12}"
echo "Stop with : docker stop ${CONTAINER_ID:0:12}"