Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/content/docs/index.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,7 @@ Decode is captured as a CUDA graph with pre-allocated buffers, keeping per-token

| Model | Architecture |
| --- | --- |
| [Qwen3-4B / 8B](/models/qwen3-4b/) | Full attention, tensor parallel |
| [Qwen3-4B / 8B / 32B](/models/qwen3-4b/) | Full attention, tensor parallel |
| Qwen3.5-4B | Hybrid: 24 linear + 8 full attention layers |
| DeepSeek-V4 | MoE + compressor + indexer, 8-GPU |
| DeepSeek-V2-Lite | MoE + expert parallelism, 2-GPU |
Expand Down
56 changes: 53 additions & 3 deletions src/content/docs/models/qwen3-4b.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,25 @@ layers) and runs on the same single GPU — just point `--model-path` at the
cargo run --release -- --model-path models/Qwen3-8B
```

### Qwen3-32B

Qwen3-32B's BF16 weights (~63 GB) need a single large-VRAM GPU
(GH200/H200 class).

```bash
huggingface-cli download Qwen/Qwen3-32B --local-dir models/Qwen3-32B

cargo run --release -- --model-path models/Qwen3-32B
```

Tool calling goes through `/v1/chat/completions` with a `tools` array; a
`get_weather` round-trip returns:

```json
{"choices":[{"message":{"role":"assistant","tool_calls":[{"function":{"name":"get_weather",
"arguments":"{\"city\": \"Paris\"}"}}]},"finish_reason":"tool_calls"}]}
```

## Performance

Measured on **1x RTX 5090 32GB**, driver 590.48.01, CUDA 13.1 build,
Expand Down Expand Up @@ -126,6 +145,36 @@ around QPS 8:
| 4 | 498.6 | 498.5 | 88.1 ms | 103.6 ms | 16.08 ms | 16.24 ms |
| 8 | 991.9 | 990.4 | 148.0 ms | 235.1 ms | 30.97 ms | 35.56 ms |

### Qwen3-32B Serving Load

Measured on **1x GH200 120GB** (aarch64, sm_90), openinfer main
`5959f05`, Qwen3-32B BF16, TP1, CUDA Graph on. Load to HTTP-ready is
46 s; the profiled KV budget is 21.4 GB (5360 blocks) next to the 63 GB
of weights. QPS sweep with `vllm-bench`, Poisson arrivals, 1024-token
prompts, 128-token outputs, greedy, seed 42 — reproducible via
`tools/bench/run_serving_bench.sh` in the repo.

| load | req/s | out tok/s | TTFT p50 / p99 | TPOT p50 / p99 |
| ---: | ---: | ---: | ---: | ---: |
| c=1 | 0.35 | 45 | 134 / 137 ms | 21.1 / 21.1 ms |
| QPS 1 | 0.95 | 122 | 154 / 316 ms | 25.3 / 29.3 ms |
| QPS 2 | 1.91 | 244 | 103 / 289 ms | 24.9 / 34.3 ms |
| c=4 | 1.24 | 159 | 286 / 296 ms | 24.0 / 25.8 ms |
| QPS 4 | 3.77 | 482 | 202 / 593 ms | 59.4 / 74.8 ms |
| c=8 | 2.02 | 258 | 294 / 462 ms | 28.9 / 30.0 ms |
| QPS 8 | 5.22 | 668 | 16.2 / 28.0 s | 107.2 / 107.5 ms |

`c=N` rows hold N requests in flight; `QPS n` rows are Poisson arrivals.
The single GPU saturates around 5.3 req/s and 680 output tok/s at this
shape; past that (QPS 10–16) throughput stays flat and TTFT grows with
queueing.

Greedy output matches HF `transformers` (bf16, same GPU)
token-for-token on 4 of 5 test prompts over the first 20 tokens. The
fifth diverges at the second generated token, where HF's own top-4
logits sit within a 0.375 spread and openinfer emits HF's second-ranked
token, 0.25 below the top.

### Warm Prefix-Cache TTFT

For multi-turn chat and agent workloads, most of the prompt often lands as
Expand Down Expand Up @@ -221,9 +270,10 @@ DSpark is the recommended drafter for Qwen3-4B.
## Architecture Notes

- Full attention with grouped-query attention: 32 query heads, 8 KV heads,
head dim 128, 36 layers.
- Qwen3-4B and Qwen3-8B are the default pure Rust + CUDA build, with no
Python build dependency.
head dim 128, 36 layers. Qwen3-32B scales this to 64 query heads and 64
layers (GQA group 8).
- Qwen3-4B, Qwen3-8B, and Qwen3-32B are the default pure Rust + CUDA
build, with no Python build dependency.
- Paged KV cache uses full-lifetime admission, so requests that cannot fit
are rejected instead of hanging under memory pressure.
- Prefix cache is on by default; `--no-prefix-cache` disables GPU prefix
Expand Down