# Blackwell LLM Deployment Lab
49 benchmarked | 14 working | 29 pending | 17 failed — across 3 GPU architectures
## Canonical Docs and Public Attribution

- README.md is the canonical root document and the only root doc intended for publication.
- Local mirrors CLAUDE.md, AGENTS.md, and AGENT.md are symlinks to README.md and are git-ignored.
- Recreate local mirrors with `./scripts/sync_local_docs.sh`.
- Public attribution identity for this repo is dev@primitivecontext.com.
- Local git hooks (in `.githooks/`) enforce attribution and block public AI co-author tags.
- Work is executed by a large agent swarm (up to ~20 concurrent) in a full-freedom dev sandbox.
- Research tools are valuable but sometimes stale or contradictory; agent triage can diverge.
- Timeline estimates are not taken seriously here; work is performed right now, with agents scaled appropriately.
| Target | Chip | Products | Memory |
|---|---|---|---|
| SM100 | B200 | Datacenter Blackwell | TMEM + HBM |
| SM120 | GB202 | RTX 5090, RTX Pro 6000 | 96 GB GDDR7 |
| SM121 | GB10 | DGX Spark | 128 GB unified LPDDR5x |

| GPU | Arch | Count |
|---|---|---|
| RTX Pro 6000 (96 GB GDDR7) | SM120 | 46 deployments |
| DGX Spark (128 GB LPDDR5x) | SM121 | 41 deployments |
| RTX 4090 (24 GB GDDR6X) | SM89 | 22 deployments |
## Hardware: SM100 vs SM120/SM121

| Capability | SM100 | SM120/SM121 | Impact |
|---|---|---|---|
| Tensor Memory | 256 KB TMEM/SM | None | Operands must live in registers |
| TMA Multicast | Yes | Disabled | Cluster shape locked to 1×1×1 |
| Instruction Set | `tcgen05.mma` | `mma.sync.aligned.block_scale` | Different memory models, layouts incompatible |
| Operand Location | SMEM/TMEM | Registers | Register pressure is the bottleneck |

**Bottom line:** SM100 kernels will not run on SM120/SM121. Code compiled for `sm_100a` traps on consumer Blackwell.
## Platforms: vLLM vs TensorRT-LLM vs Custom CUTLASS

| Platform | SM120 NVFP4 Status | Limitation |
|---|---|---|
| vLLM | Partial | SM120 FP4 kernel detection fails (#31085); falls back to Marlin (40-50% perf loss) |
| TensorRT-LLM | Partial | NVFP4 KV cache support requested but not shipped (#10241) |
| SGLang | Blocked | All attention backends fail for MoE on SM120 (Triton SMEM overflow) |
| Custom CUTLASS | Required | Only path to NVFP4 weights + NVFP4 KV cache on SM120/SM121 |

**Bottom line:** No off-the-shelf stack gives NVFP4 model weights + NVFP4 KV cache on consumer Blackwell.
## Quantization: NVFP4 vs MXFP4 vs FP8

| Format | Data Bits | Scale Type | Block Size | Use Case |
|---|---|---|---|---|
| NVFP4 | E2M1 (4-bit) | UE4M3 (unsigned) | 16 elements | Weights + KV cache (2× mem reduction) |
| MXFP4 | E2M1 (4-bit) | UE8M0 (power-of-2) | 32 elements | OCP-compliant, coarser scaling |
| FP8 | E4M3/E5M2 | N/A | N/A | Intermediate compute precision |

**Bottom line:** NVFP4 with UE4M3 scales gives finer granularity (16-element vs 32-element blocks) and 137× more dynamic range than signed E4M3.
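As a back-of-envelope check on what those block sizes imply for storage (a sketch that counts only the 4-bit payload and the per-block scale, ignoring per-tensor scales and padding):

```cuda
// Effective storage cost per element, from the block sizes and scale widths above.
// NVFP4: 16 x 4-bit E2M1 values share one 8-bit UE4M3 scale.
// MXFP4: 32 x 4-bit E2M1 values share one 8-bit UE8M0 scale.
constexpr double kNvfp4BitsPerElem = (16 * 4 + 8) / 16.0;  // 4.5 bits/element
constexpr double kMxfp4BitsPerElem = (32 * 4 + 8) / 32.0;  // 4.25 bits/element
// Example: a 70B-parameter dense model at 4.5 bits/element is roughly
// 70e9 * 4.5 / 8 bytes ≈ 39 GB of weights, versus ~140 GB at BF16.
```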
**Why a custom kernel path is required:**

- vLLM and TRT-LLM don't support NVFP4 KV cache on SM120
- SM120 lacks TMEM, so SM100 kernels fail
- NVFP4 weights + NVFP4 KV cache = maximum memory efficiency
- MoE + GQA patterns need specialized attention
**Compile targets** (build sketch below):

- Use `sm_120a` for RTX 5090 / RTX Pro 6000
- Use `sm_121a` for DGX Spark
- Base `sm_120` cannot compile the block-scaled instructions
- PTX ISA 8.7 added `sm_120a`; PTX ISA 8.8 added `sm_121a`
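A minimal build-and-guard sketch, assuming a recent CUDA toolkit that ships the `compute_120a`/`compute_121a` targets; the file and kernel names are placeholders:

```cuda
// nvfp4_gemm.cu — build with the family-specific targets, e.g.:
//   nvcc -gencode arch=compute_120a,code=sm_120a \
//        -gencode arch=compute_121a,code=sm_121a \
//        -o nvfp4_gemm nvfp4_gemm.cu
//
// __CUDA_ARCH__ is 1200 on SM120 and 1210 on SM121; the "a" suffix on the build
// target is what actually unlocks the block-scaled MMA instructions.
__global__ void nvfp4_gemm_kernel(/* ... */) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ == 1200 || __CUDA_ARCH__ == 1210)
    // consumer-Blackwell path: block-scaled mma.sync with register-resident operands
#else
    // other architectures: fall back to a different kernel
#endif
}
```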
**MMA instruction and operand residency** (register-pressure sketch below):

- Use `mma.sync.aligned.kind::mxf4nvf4.block_scale`
- All operands (A/B/ACC/SFA/SFB) reside in registers
- Cluster shape fixed to 1×1×1 (no multicast)
- TMA available for GMEM↔SMEM movement
- No TMEM access (SM100-only)
- TMA loads to shared memory, not tensor memory
- The register file is the primary operand store
- Manage register pressure carefully
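Because every MMA operand (including the scale factors) is register-resident on SM12x, occupancy trades directly against spilling. A small sketch of the usual knobs, with illustrative (untuned) numbers rather than values from this repo:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative only: cap threads/block and request a minimum residency so ptxas
// budgets registers for the A/B/ACC/SFA/SFB fragments instead of spilling them
// to local memory. Compile with -Xptxas -v to see register counts and spills.
__global__ void __launch_bounds__(128, 2)
nvfp4_mma_tile_kernel(const unsigned* __restrict__ a_packed,
                      const unsigned* __restrict__ b_packed,
                      float* __restrict__ acc) {
    // Per-thread fragments of A, B, the accumulator, and the block scales live
    // here, in registers; shared memory is only a staging buffer filled via TMA.
    (void)a_packed; (void)b_packed; (void)acc;
}

int main() {
    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, nvfp4_mma_tile_kernel);  // inspect register usage
    std::printf("registers/thread: %d, local bytes: %zu\n", attr.numRegs, attr.localSizeBytes);
    return 0;
}
```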
**NVFP4 → FP8 dequant dataflow:**

```
[NVFP4 weights] ──┐
                  ├──► [FP8 dequant] ──► [FP8 compute] ──► [FP8 result]
[NVFP4 KV cache] ─┘                                             │
                                                                ▼
                                                     [FP8→NVFP4 quant] ──► [NVFP4 KV append]
```

- NVFP4 → FP8 dequant before attention compute
- FP8 is the mandatory intermediate precision
- New K/V quantized to NVFP4 before cache append
**FP8 conversion on SM120** (illustrative software conversion below):

- SM120 uses emulation for scalar FP8; native FP8 exists only as tensor-core MMA
- The `__nv_cvt_float_to_fp8()` intrinsic produces a wrong exponent on SM120 (off by +6)
- Use the software conversion `float_to_fp8_e4m3_sw()` in `nvfp4_kv_dequantize.cu`
- Applies to all NVFP4→FP8 dequant paths
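A minimal sketch of what a software float→E4M3 (FN) conversion can look like: round-to-nearest-even, saturation to ±448, NaN = 0x7F. This is an illustrative stand-in written for this README, not the repo's `float_to_fp8_e4m3_sw()` itself:

```cuda
#include <cstdint>
#include <cmath>

// Illustrative software float -> FP8 E4M3 (FN variant: no inf, NaN = 0x7F, max normal = 448).
// Round-to-nearest-even on the E4M3 grid; avoids the hardware intrinsic entirely.
__host__ __device__ inline uint8_t float_to_e4m3_sw(float x) {
    const uint8_t sign = (copysignf(1.0f, x) < 0.0f) ? 0x80 : 0x00;
    const float a = fabsf(x);
    if (isnan(x))    return sign | 0x7F;           // NaN
    if (a > 448.0f)  return sign | 0x7E;           // saturate to +/-448
    // Quantization step: 2^(e-3) for normals (e >= -6), 2^-9 in the subnormal band.
    int e = 0;
    frexpf(a, &e);                                 // a = m * 2^e, m in [0.5, 1)
    e = (e - 1 < -6) ? -6 : e - 1;                 // e = floor(log2(a)), clamped for subnormals
    const float step = ldexpf(1.0f, e - 3);
    const float q = rintf(a / step) * step;        // round-to-nearest-even on the grid
    if (q < 0.015625f) {                           // below 2^-6: subnormal (exponent field 0)
        return sign | (uint8_t)(q * 512.0f);       // mantissa = q / 2^-9
    }
    int qe = 0;
    frexpf(q, &qe);                                // q = m * 2^qe, m in [0.5, 1)
    qe -= 1;
    const uint8_t exp  = (uint8_t)(qe + 7);                                 // bias 7
    const uint8_t mant = (uint8_t)((q * ldexpf(1.0f, -qe) - 1.0f) * 8.0f);  // 3 mantissa bits
    return sign | (uint8_t)(exp << 4) | mant;
}
```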
**NVFP4 scale layout** (decode sketch below):

- NVFP4: 1 UE4M3 scale per 16 E2M1 values
- Scale layout matches SM100 (portable between architectures)
- Two-level scaling: per-block E4M3 + per-tensor FP32
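On the decode side, the two-level scheme is just two multiplies per element. A sketch of unpacking one 16-element block; the E2M1 magnitude table is the standard FP4 value set, but the nibble packing order and the function names here are assumptions, not the repo's layout:

```cuda
#include <cstdint>

// E2M1 (FP4) magnitudes for codes 0..7; bit 3 of the nibble is the sign.
__constant__ float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode one 16-element NVFP4 block:
//   value = e2m1(code) * ue4m3(block_scale) * tensor_scale
// where block_scale is the per-16-element UE4M3 scale and tensor_scale is the
// per-tensor FP32 scale (two-level scaling).
__device__ void decode_nvfp4_block(const uint8_t packed[8],   // 16 x 4-bit codes
                                   float block_scale,          // UE4M3 scale, already widened to float
                                   float tensor_scale,         // per-tensor FP32 scale
                                   float out[16]) {
    for (int i = 0; i < 16; ++i) {
        const uint8_t byte   = packed[i >> 1];
        const uint8_t nibble = (i & 1) ? (byte >> 4) : (byte & 0x0F);  // assumed packing order
        const float mag  = kE2M1[nibble & 0x7];
        const float sign = (nibble & 0x8) ? -1.0f : 1.0f;
        out[i] = sign * mag * block_scale * tensor_scale;
    }
}
// Storage: 16*4 bits of data + 8 bits of scale = 4.5 bits/element, before the per-tensor scale.
```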
**Interconnect:**

- RTX 5090: no NVLink, PCIe only
- DGX Spark: 200 Gbps ConnectX-7 (Ethernet, no RDMA)
- PCIe x8, single GPU: <2% perf impact
- PCIe x8, multi-GPU TP: 20-40% perf loss
- Optimize for single-GPU first

Full citations: `verified_facts.json`
## Documentation Resource Map

Local documentation library at `docs/` (~2.4 GB).
### Reference (`docs/reference/` — 1.5 GB, static)

| Section | Path | Contents |
|---|---|---|
| GPU Whitepapers | `reference/gpu/whitepapers/` | 9 PDFs — Blackwell RTX, Blackwell DC, Ada, Ampere, Turing, Volta, Pascal |
| GPU Datasheets | `reference/gpu/datasheets/` | 17 PDFs — RTX 5090, RTX Pro 6000, DGX Spark, B200, A100, H100, H200, L4, T4, V100 |
| CUDA Toolkit | `reference/cuda/pdf/` | 53 PDFs — Programming Guide, PTX ISA 9.1, Blackwell/Hopper/Ada tuning & compat guides, cuBLAS, cuSPARSE, NVVM IR, etc. |
| CUDA HTML Docs | `reference/cuda/` | 89 subdirs — parallel-thread-execution, cuda-c-programming-guide, blackwell-tuning-guide, inline-ptx-assembly, etc. |
| cuDNN | `reference/cudnn/` | api/, backend/, frontend/, developer-guide/, installation/, latest/ |
| CUTLASS | `reference/cutlass/latest/` | Scraped HTML docs — overview, changelog, index |
| DGX | `reference/dgx/` | 27 product manuals — DGX Spark (HTML + PDF), DGX B200/B300, DGX A100/H100/GB200 |
| NVIDIA Blog | `reference/blog/blog/` | Saved posts — NVFP4 intro, NVFP4 training, Blackwell Ultra, MoE expert parallelism, PTQ optimization |
### GPU Architecture Whitepapers

| Generation | Whitepaper |
|---|---|
| Blackwell | `reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf` |
| Ada | `reference/gpu/whitepapers/ada-lovelace-architecture-whitepaper.pdf` |
| Ampere | `reference/gpu/whitepapers/nvidia-ampere-architecture-whitepaper.pdf` |
| Turing | `reference/gpu/whitepapers/nvidia-turing-architecture-whitepaper.pdf` |
| Volta | `reference/gpu/whitepapers/volta-architecture-whitepaper.pdf` |
| Pascal | `reference/gpu/whitepapers/pascal-architecture-whitepaper.pdf` |
| Need | File |
|---|---|
| PTX ISA | `reference/cuda/pdf/ptx_isa_9.1.pdf` |
| Blackwell tuning | `reference/cuda/pdf/Blackwell_Tuning_Guide.pdf` |
| Blackwell compat | `reference/cuda/pdf/Blackwell_Compatibility_Guide.pdf` |
| CUDA C Programming | `reference/cuda/pdf/CUDA_C_Programming_Guide.pdf` |
| RTX Blackwell arch | `reference/gpu/whitepapers/nvidia-rtx-blackwell-gpu-architecture.pdf` |
| Blackwell DC arch | `reference/gpu/whitepapers/blackwell-architecture-hardwareand.pdf` |
| Blackwell microbench | `reference/gpu/datasheets/blackwell-microbenchmarks-arxiv.pdf` |
| DGX Spark guide | `reference/dgx/dgx-spark/` |
| NVFP4 blog | `reference/blog/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/` |
| NVFP4 training blog | `reference/blog/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/` |
### Source Snapshots (`docs/source/` — 885 MB, dated git repos)

| Snapshot | Version | Purpose |
|---|---|---|
| `20260124_cutlass-v4.3.0/` | CUTLASS 4.3.0 | Kernel primitives, SM120 examples (77, 81, 82) |
| `20260125_vllm-v0.14.0rc1/` | vLLM 0.14.0-rc1 | Serving framework, quantization backends |
| `20260125_sglang-v0.5.8/` | SGLang 0.5.8 | Structured generation, attention backends |
| `20260125_flashinfer-v0.6.2/` | FlashInfer 0.6.2 | Attention kernel library |
| `20260125_flash-attention-v2.8.3/` | FlashAttention 2.8.3 | Reference attention implementation |
| `20260125_exo-explore-exo/` | Exo | Distributed inference |
| `20260122_dgx-spark-playbooks/` | — | DGX Spark deployment playbooks |
| `20251210_avarok-vllm-dgx-spark/` | — | vLLM DGX Spark adaptation |
| `20251201_dgx-spark-config-v1.0.1/` | — | DGX Spark system config |
| `20251124_rtx-ai-toolkit-archived/` | — | RTX AI Toolkit (archived) |
| `20251017_rtx-pro-6000-vs-dgx-spark/` | — | RTX Pro 6000 vs DGX Spark benchmarks/comparison |
| `vggt/` | VGGT | Visual Geometry Grounded Transformer |
### Other (`docs/sources/` — 30 MB)

| Repo | Purpose |
|---|---|
| `streaming-vlm/` | Streaming VLM inference (git repo with training + inference code) |
## SM12x Deployment Reference

Sorted by single-stream throughput. TPS = tokens/second.
### RTX Pro 6000 (SM120)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | TP | VRAM (GB) |
|---|---|---|---|---|---|---|---|---|
| 255 | 6,396 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | vllm | 1 | - |
| 198 | 8,463 | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | trt-llm | 1 | 6.71 |
| 173 | 6,545 | gpt-oss-120b-MXFP4 | 120B/5.1B | MoE | MXFP4 | vllm | 1 | 95 |
| 131 | 4,778 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 44.2 |
| 105 | 404 | MiniMax-M2.1-AWQ-4bit | 456B/45B | MoE | AWQ | vllm | 2 | - |
| 78.0 | 1,004 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | sglang | 1 | 82 |
| 75.2 | 1,355 | MiniMax-M2.1-NVFP4 | 456B/45B | MoE | NVFP4 | vllm | 2 | 122 |
| 61.5 | 1,630 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 109B/17B | MoE | NVFP4 | vllm | 1 | 89.7 |
| 58.3 | 1,304 | Qwen3-235B-A22B-NVFP4 | 235B/22B | MoE | NVFP4 | vllm | 2 | 95.5 |
| 43.6 | 2,510 | Nemotron-Nano-12B-v2-VL-NVFP4-QAD | 12B | VLM-Dense | NVFP4 | trt-llm | 1 | 9.87 |
| 31.0 | — | chandra | 8B | Dense | BF16 | vllm | 1 | - |
| 26.9 | 1,812 | Nemotron-Super-49B-v1_5-NVFP4 | 49B | Dense | NVFP4 | vllm | 1 | 89.6 |
| 25.8 | 603 | Qwen3-VL-30B-A3B-Instruct-NVFP4 | 30B/3B | VLM-MoE | NVFP4 | vllm | 1 | 18.2 |
| 21.5 | 1,959 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | 90.7 |
| 18.8 | 1,225 | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 1 | 89.7 |
| 14.0 | — | olmOCR-2-7B-1025-FP8 | 7B | Dense | FP8 | vllm | 1 | - |
| 7.1 | 2,064 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 1 | - |
| 6.1 | 304 | Qwen3-VL-235B-A22B-Instruct-NVFP4 | 235B/22B | VLM-MoE | NVFP4 | vllm | 2 | 94 |
### DGX Spark (SM121)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM (GB) |
|---|---|---|---|---|---|---|---|
| 180 | — | Qwen3-0.6B-NVFP4 | 600M | Dense | NVFP4 | vllm | 0.60 |
| 99.7 | — | Qwen3-1.7B-NVFP4 | 2B | Dense | NVFP4 | vllm | 1.00 |
| 60.1 | — | Qwen3-30B-A3B-NVFP4 | 30B/3.3B | MoE | NVFP4 | tensorrt-llm | 16.85 |
| 58.0 | — | Phi-4-multimodal-instruct-NVFP4 | 14B | VLM-Dense | NVFP4 | tensorrt-llm | 19.40 |
| 55.6 | — | Qwen3-4B-Instruct-2507-NVFP4 | 4B | Dense | NVFP4 | vllm | 3.00 |
| 42.7 | 188 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | 74.89 |
| 41.3 | — | Qwen2.5-VL-7B-Instruct-NVFP4 | 7B | VLM-Dense | NVFP4 | tensorrt-llm | 22.00 |
| 39.9 | — | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | - |
| 38.6 | 1,128 | Qwen3-30B-A3B-FP4 | 30B/3B | MoE | NVFP4 | trt-llm | 19.2 |
| 28.5 | 28.5 | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | vllm | - |
| 28.2 | — | Qwen3-8B-NVFP4 | 8B | Dense | NVFP4 | tensorrt-llm | 7.00 |
| 27.6 | 441 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | trt-llm | 44.2 |
| 27.3 | — | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | 9B | Dense | NVFP4 | vllm | 7.34 |
| 20.2 | — | Qwen3-14B-NVFP4 | 14B | Dense | NVFP4 | vllm | 9.00 |
| 19.6 | 19.6 | Phi-4-reasoning-plus-NVFP4 | 14B | Dense | NVFP4 | trt-llm | - |
| 19.4 | — | Phi-4-reasoning-plus-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 17.0 | 17.0 | Qwen3-14B-FP4 | 14.8B | Dense | NVFP4 | trt-llm | - |
| 17.0 | — | Qwen3-14B-NVFP4 | 15B | Dense | NVFP4 | tensorrt-llm | 10.00 |
| 16.3 | — | Llama-4-Scout-17B-16E-Instruct-NVFP4 | 56B/17.0B | MoE | NVFP4 | tensorrt-llm | 114.23 |
| 10.2 | — | Qwen3-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 10.0 | — | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | 32B | Dense | NVFP4 | vllm | 19.00 |
| 8.2 | — | Qwen3-32B-NVFP4 | 33B | Dense | NVFP4 | tensorrt-llm | 105.00 |
| 4.8 | 398 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | 80B/3B | MoE | NVFP4 | vllm | 44.3 |
| 4.8 | 75.8 | Qwen3-Next-80B-A3B-Instruct-FP8 | 80B/3B | MoE | FP8 | vllm | - |
| 4.6 | — | Llama-3.1-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | vllm | 41.00 |
| 3.1 | — | Llama-3.3-70B-Instruct-NVFP4 | 70B | Dense | NVFP4 | tensorrt-llm | 60.00 |
### RTX 4090 (SM89)

| Single TPS | Batched TPS | Model | Params | Type | Quant | Platform | VRAM (GB) |
|---|---|---|---|---|---|---|---|
| 152 | 5,026 | Llama-3.2-3B-Instruct | 3B | Dense | FP8 | vllm | 5 |
| 88.3 | 3,566 | DeepSeek-R1-Distill-Qwen-7B | 7B | Dense | W8A8 | vllm | 10 |
| 81.0 | 197 | gpt-oss-20b-MXFP4 | 20.9B/3.6B | MoE | MXFP4 | ollama | 16 |
| 60.1 | 2,586 | Qwen2.5-7B-Instruct | 7B | Dense | BF16 | vllm | - |
| 48.0 | 1,174 | DeepSeek-R1-Distill-Qwen-14B | 14B | Dense | W8A8 | vllm | 14 |
## Working (not yet benchmarked)

| GPU | Model | Type | Quant | Platform | Notes |
|---|---|---|---|---|---|
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | 104 single, 1951 batched (ctx=512 par=64) |
| RTX Pro 6000 | DeepSeek-OCR | Dense | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX Pro 6000 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| RTX 4090 | personaplex-7b-v1 | Dense | BF16 | pytorch | Moshi voice model |
| RTX 4090 | Qwen3-Embedding-8B | Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-Reranker-8B | Reranker | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Embedding-8B | VL-Embedding | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-Reranker-8B | VL-Reranker | BF16 | vllm | |
| DGX Spark | Nemotron-Super-49B-v1_5-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, 3 deployment attempts in logs, served requests successfully |
| DGX Spark | Llama-4-Scout-17B-16E-Instruct-NVFP4 | MoE | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, port 8042, 75.22 GiB model+CUDA |
| DGX Spark | Qwen3-32B-NVFP4 | Dense | NVFP4 | trt-llm | TRT-LLM 1.1.0rc3, loaded from logs |
## Pending

| GPU | Model | Quant | Platform | Notes |
|---|---|---|---|---|
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-Thinking-2507-AWQ | AWQ | sglang | |
| RTX Pro 6000 | Qwen3-235B-A22B-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Thinking-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | gpt-oss-20B-NVFP4A16-BF16 | NVFP4 | vllm | |
| RTX Pro 6000 | gpt-oss-20b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | trt-llm | |
| RTX Pro 6000 | twitter-roberta-base-sentiment-latest | BF16 | transformers | Sentiment analysis, 1556 samples/sec |
| RTX Pro 6000 | bart-large-cnn | BF16 | transformers | Text summarization, 86 samples/sec |
| RTX Pro 6000 | bart-large-mnli | BF16 | transformers | Zero-shot classification, 415 samples/sec |
| RTX Pro 6000 | gpt-oss-120b | MXFP4 | vllm | Older directory naming, config.json present |
| RTX Pro 6000 | Qwen3-Omni-30B-A3B-Abliterated-Voice | BF16 | custom | Voice-enabled Qwen3 Omni, custom setup |
| RTX Pro 6000 | NVFP4-GEMM+Attention | NVFP4 | cutlass | Custom CUTLASS kernels: GEMM 38% peak, attention, KV quant/dequant, MoE GEMM. In development. |
| RTX Pro 6000 | NVFP4-KV-Cache | NVFP4 | trt-llm | TRT-LLM fork adding NVFP4 KV cache (issue #10241). MMHA integration complete, awaiting test. |
| RTX 4090 | Qwen2.5-32B-Instruct-AWQ | AWQ | vllm | |
| RTX 4090 | Qwen3-30B-A3B-GPTQ-Int4 | GPTQ | vllm | |
| RTX 4090 | Qwen3-Coder-30B-A3B-Instruct | GGUF | vllm | |
| RTX 4090 | personaplex-7b-v1 | BF16 | pytorch | Second instance of Moshi voice model |
| RTX 4090 | Qwen3-Embedding-0.6B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B-vllm-W8A8 | W8A8 | vllm | |
| RTX 4090 | Qwen3VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-8B-FP8-Dynamic | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-Embedding-0.6B | BF16 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| RTX 4090 | Qwen3-VL-8B-Instruct-FP8 | FP8 | vllm | |
| DGX Spark | Qwen3-14B-FP4 | NVFP4 | trt-llm | |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| DGX Spark | gpt-oss-120b-MXFP4 | MXFP4 | vllm | |
## Failed

| GPU | Model | Quant | Platform | Reason |
|---|---|---|---|---|
| RTX Pro 6000 | personaplex-7b-v1 | | moshi | |
| RTX Pro 6000 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | sglang | |
| RTX Pro 6000 | gpt-oss-120b-MXFP4+eagle3 | MXFP4 | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-7B-Instruct-NVFP4 | NVFP4 | vllm | vLLM variant; TRT-LLM version works (8463 TPS) |
| RTX Pro 6000 | Qwen2.5-VL-32B-Instruct-AWQ | AWQ | vllm | |
| RTX Pro 6000 | Qwen2.5-VL-72B-Instruct-AWQ | AWQ | vllm | OOM or incompatible |
| RTX Pro 6000 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | NVFP4 | trt-llm | Chunked attention kernel SM90-only |
| RTX Pro 6000 | sam2-hiera-large | BF16 | pytorch | SAM2 vision encoder |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | trt-llm | NotImplementedError: compressed-tensors nvfp4-pack-quantized format unsupported |
| DGX Spark | Qwen3-30B-A3B-NVFP4 | NVFP4 | vllm | MoE: 30.5B total, 3.3B active. Model loads but gets stuck in the scheduler loop. GB10 sm_121 MoE scheduler incompatibility. |
| DGX Spark | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | MoE: 80B total, 3.9B active. Gated DeltaNet hybrid attention not supported. No workaround for GB10. |
| DGX Spark | Mistral-Small-3.2-24B-Instruct-2506-NVFP4 | NVFP4 | vllm | Tokenizer error: MistralTokenizer has no convert_tokens_to_ids. Tekken tokenizer incompatible. |
| DGX Spark | NVIDIA-Nemotron-Nano-9B-v2-NVFP4 | NVFP4 | tensorrt-llm | NemotronHForCausalLM architecture not supported in TRT-LLM. Use vLLM instead (27.27 TPS). |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | tensorrt-llm | Community quant uses compressed-tensors format, not official NVIDIA NVFP4. TRT-LLM cannot parse. |
| DGX Spark | Seed-OSS-36B-Instruct-NVFP4 | NVFP4 | vllm | CUDA driver init error: Error 35. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | vllm | Mamba hybrid architecture requires HybridMambaAttentionDecoderLayer, which is not supported. |
| DGX Spark | nemotron3-nano-nvfp4-w4a16 | NVFP4-W4A16 | tensorrt-llm | Mamba hybrid architecture not supported by TRT-LLM NVFP4 backend. |
## Scaling Efficiency (RTX Pro 6000)

| TPS | Model | Quant | Scaling Eff. | Knee (parallel requests) | Knee TPS |
|---|---|---|---|---|---|
| 5,176 | gpt-oss-20b-MXFP4 | MXFP4 | 36.5% | 16 | 1,783 |
| 3,269 | Nemotron-Super-49B-NVFP4 | NVFP4 | 104.1% | 128 | 3,269 |
| 2,165 | gpt-oss-120b-MXFP4 | MXFP4 | 26.3% | 8 | 610 |
| 1,557 | Llama-4-Scout-17B-16E-NVFP4 | NVFP4 | 43.2% | 16 | 492 |
## Quant Comparison (SM120, same model)

| Quant | Model | TPS | Relative |
|---|---|---|---|
| NVFP4 | MiniMax-M2.1-NVFP4 | 1,355 | 1.0x |
| AWQ | MiniMax-M2.1-AWQ-4bit | 405 | 0.3x |
| NVFP4 | Qwen3-Next-80B-NVFP4 | 4,778 | 1.0x |
| FP8 | Qwen3-Next-80B-FP8 | 781 | 0.16x |
```
User Models --> ModelOpt (NVFP4/FP8 quantization)
                     |
         +-----------+-----------+
         v                       v
       vLLM                   TRT-LLM
    (community)               (NVIDIA)
         |                       |
         v                       v
    FlashInfer            TRT-LLM Plugins
   (UW Academic)           (C++ kernels)
         |                       |
         +-----------+-----------+
                     v
                  CUTLASS
                  (NVIDIA)
         GEMM + NVFP4 primitives
         (no attention/KV cache)
                     |
           +---------+---------+
           v                   v
         SM120               SM121
     RTX Pro 6000          DGX Spark
```
Use the `/deploy` skill for deploying containers. Quick ref: `./deploy <dir> [--dry-run|--generate|--down|--shared-tenancy]`
| SM | GPU | Stack | Reason |
|---|---|---|---|
| SM120 | RTX Pro 6000 | vLLM + FlashInfer | TRT-LLM missing chunked attention SM120 kernel |
| SM121 | DGX Spark | TRT-LLM | 1.1x faster than vLLM on same model |
| SM89 | RTX 4090 | vLLM | Dense models only, 24 GB limit |
| Model | Platform | Issue | Workaround |
|---|---|---|---|
| Llama-4-Scout-17B-16E | TRT-LLM | Chunked attention kernel SM90-only | Use vLLM |
| Qwen3-Next-80B (Mamba hybrid) | TRT-LLM | Mamba hybrid overhead | `enable_block_reuse=false` |
| Qwen3-Next-80B-NVFP4 | SGLang | SMEM overflow on SM120 | Use vLLM |
| gpt-oss-120b-MXFP4+eagle3 | vLLM | Speculative decoding broken | Use without spec decode |
| Issue | Repo | Description | Status |
|---|---|---|---|
| #10241 | TensorRT-LLM | NVFP4 KV cache support request | Open |
| #5018 | TensorRT-LLM | SM120 (5090) NVFP4 support | Confirmed in 0.20.0rc3+ |
| #31085 | vLLM | SM120 FP4 kernel detection fails | Unknown |
| Project | URL |
|---|---|
| CUTLASS | github.com/NVIDIA/cutlass |
| TensorRT-LLM | github.com/NVIDIA/TensorRT-LLM |
| vLLM | github.com/vllm-project/vllm |
| FlashInfer | github.com/flashinfer-ai/flashinfer |
| FlashAttention | github.com/Dao-AILab/flash-attention |
| ModelOpt | github.com/NVIDIA/TensorRT-Model-Optimizer |