Blackwell Tensor Core Kernels

Hand-written CUDA GEMM kernels targeting Tensor Cores, benchmarked on both Hopper (H100, sm_90) and Blackwell (RTX Pro 6000, sm_120), with a path toward Blackwell's FP8/FP4 microscaling formats.

The point is to connect kernel-level choices — tiling, Tensor Core fragment shapes, occupancy — to measured TFLOP/s as a fraction of the cuBLAS ceiling, on real Hopper and Blackwell silicon.

What this is

A naive baseline, a shared-memory tiled GEMM, a WMMA Tensor Core GEMM, and a hand-written raw mma.sync Tensor Core GEMM ablation ladder (FP16 in / FP32 accumulate).
A harness that checks correctness against cuBLAS and reports TFLOP/s and % of the cuBLAS ceiling, across a precision ladder of cuBLAS baselines on the same card: cublas = cublasSgemm (FP32, CUDA cores) → cublas_tf32 = cublasGemmEx (FP32 in, TF32 compute, Tensor Cores) → cublas_tc = cublasGemmEx (FP16 in / FP32 acc, Tensor Cores — the same precision as the WMMA/mma.sync kernels, hence the honest ceiling).
Builds for sm_90 (H100) and sm_120 (Blackwell RTX Pro 6000) so the same kernels can be profiled across two generations.

Measurement status: both generations are now populated in results/bench.csv — Blackwell RTX PRO 6000 Max-Q (sm_120) and H100 80GB SXM5 (sm_90) (run on a separate 8×H100 box, pinned to a single idle GPU). Three headline findings:

(Phase 2, sm_120) A hand-written raw mma.sync kernel reaches 106% of cuBLAS-TC on sm_120 (243 vs 229 TFLOP/s @ 8192³) — beating the cutlass_80 kernel cuBLAS dispatches on this card, with the same standard-optimization stack Boehm used to reach ~94% on CUDA cores. The WMMA wrapper, not the recipe, was what kept the Phase 1 kernel at 45%.

(Phase 1) The cross-generation comparison: the same WMMA kernel that reaches 45% of cuBLAS-TC on sm_120 reaches only 8% on H100 — not because the kernel runs slower per-SM, but because Hopper's Tensor Core ceiling is only reachable through wgmma warpgroup instructions that the WMMA API cannot emit.

(Phase 2.5, sm_90) That conclusion is now closed constructively: the same GEMM rebuilt with CUTLASS 3.x (TMA + wgmma) reaches 85.5% of cuBLAS-TC on H100 — a 10.5× recovery from the instruction class alone. Findings 1+3 are the same lesson on two architectures: reach the current generation's native instruction or lose the ceiling. Third-party ceilings (DeepGEMM, ThunderKittens, FlashAttention-3, Stream-K) are also measured on this box as reference rows — all within ±12% of their published numbers.

What this is NOT

Not a cuBLAS replacement — cuBLAS is the ceiling measured against, honestly.
Not tcgen05 — tensor-memory / CTA-pair instructions are sm_100 (B200) only. This card's FP8/FP4 path is the regular mma instruction set, and that is what Phase 3 measures.

Hardware

NVIDIA RTX Pro 6000 (Blackwell, sm_120) and/or H100 (Hopper, sm_90). CUDA 12.x+.

Layout

src/gemm_naive.cu      # one-thread-per-output baseline (correctness anchor)
src/gemm_tiled.cu      # shared-memory tiled + register-blocked SIMT GEMM
src/gemm_wmma.cu       # WMMA 16x16x16 Tensor Core GEMM (FP16 in, FP32 acc)
src/gemm_mma.cu        # hand-written raw mma.sync m16n8k16 GEMM — 5-step ablation ladder (Phase 2)
src/cutlass_sm90.cu    # CUTLASS 3.x TMA + wgmma GEMM, Hopper only (Phase 2.5)
src/gemm_mma_fp8.cu    # FP8 (E4M3) / FP4 (E2M1) / MXFP4 GEMM via the sm_120 mma path (Phase 3)
src/cublaslt_fp8.cu    # cuBLASLt FP8 baseline — the library FP8 ceiling (Phase 3)
src/mma_rate_probe.cu  # FP32-acc vs FP16-acc tensor rate microbenchmark — measures the peak (Phase 4)
include/mma_ptx.cuh    # the PTX building blocks: mma.sync, ldmatrix, XOR swizzle, cp.async
src/reference.cu       # cuBLAS FP32 ceiling (cublasSgemm, CUDA cores)
src/cublas_tc.cu       # cuBLAS Tensor Core ceiling (cublasGemmEx, FP16 in / FP32 acc) — same precision as wmma/mma
src/main.cu            # correctness + benchmark driver -> results/bench.csv
include/util.cuh       # timing, init, max-abs-error check
scripts/sweep_mma_stages.sh  # cp.async pipeline-depth sweep -> results/mma_stage_sweep.csv
scripts/run_reference_benches.sh  # DeepGEMM / ThunderKittens / FA3 reference runs (Phase 4)
analysis/reference_summary.py     # parse reference logs -> summary.md + reference_benches.png
CMakeLists.txt         # builds for sm_90 and sm_120 (+ sm_90a CUTLASS via -DCUTLASS_DIR)
docs/design-decisions.md
docs/roadmap.md

Quick start (on the Blackwell / H100 box)

One command does the full capture — size sweep + Nsight + plots + report:

make capture            # build -> sweep sizes -> ncu/nsys -> results/report.md + PNGs

The benchmark CSV records the device and SM of each run, so to compare both generations you run the sweep once on each GPU and the rows accumulate:

# on the H100 box:           ARCH=90  make bench
# on the Blackwell box:      ARCH=120 make bench
make analyze                 # merge both -> results/report.md + results/tflops_sm*.png

Single run / one kernel, by hand:

make build
./build/gemm_bench 4096 4096 4096 results/bench.csv   # M N K [out_csv]
bash scripts/profile.sh 4096                           # ncu + nsys for gemm_wmma

Results

make capture produces, in results/:

bench.csv — per kernel × size × GPU: ms, TFLOP/s, % of FP32 cuBLAS (pct_of_cublas, vs cublasSgemm), max abs err. Now includes a cublas_tc row per size.
tflops_sm120.png, pct_tc_sm120.png, roofline_sm120.png — the three charts below.
ncu_wmma_*.ncu-rep, nsys_*.nsys-rep — Nsight Compute / Systems captures.
report.md — the summary table (both % of FP32 cuBLAS and % of cuBLAS-TC) + charts.

Measured on RTX PRO 6000 Blackwell Max-Q (sm_120, CUDA 12.8)

Throughput across the precision ladder (FP32 → TF32 → FP16, all real, one card):

Each kernel as a fraction of the honest same-precision ceiling (cuBLAS FP16-TC = 100%):

Throughput at the largest (most steady-state) size, M=N=K=8192:

The gemm_wmma kernel is shared-memory tiled with per-warp 2×2 register tiling and a 3-stage cp.async pipeline, size-dispatched: 64×64 tile for N<1536 (better occupancy), 128×128 tile for N≥1536 (more reuse). Numbers at each size (TFLOP/s and fraction of the same-precision cuBLAS FP16-TC ceiling):

size	wmma	tf32-TC	cublas_tc (FP16-TC)	wmma % of cuBLAS-TC	tf32 % of cuBLAS-TC
512	16.2	27.8	33.7	48.2%	82.6%
1024	38.7	90.7	133.8	29.0%	67.8%
2048	68.6	128.8	213.6	32.1%	60.3%
4096	96.1	144.4	235.1	40.9%	61.4%
8192	100.2	150.5	225.8	44.4%	66.7%

Read the % of cuBLAS-TC column — the honest same-precision (FP16-in/FP32-acc, Tensor Core) ceiling. Across two optimization passes (shared-mem + cp.async, then register tiling + deeper pipeline + size dispatch) the WMMA kernel went from a naive 17.3% to 44.4% of cuBLAS-TC at 8192 (1.64× the single-buffer version), and no longer decays at scale. Pipeline depth is tuned (3 stages > 4 > 5) and warp specialization was tried but did not beat the multi-stage pipeline — the expected outcome per CudaDMA (Bauer et al., SC'11), since warp specialization needs large tiles + async-transfer hardware + register reallocation as prerequisites, none of which a 512-thread WMMA tile supplies on its own; see results/nsys_profile.md for the full before/after and the WS experiment. The % of FP32 cuBLAS column is precision-mismatched (kept for continuity); its >100% rows are FP16-TC vs FP32-CUDA-core, not the kernel beating cuBLAS. cublas_tc is 4.2–23× faster than cublasSgemm, confirming the Tensor Core path.

Phase 2: hand-written mma.sync — past the WMMA ceiling, past cuBLAS (sm_120)

Phase 1 ended with a diagnosis: the WMMA kernel is feed-bound (MIO-queue-full), and the fixes — swizzled shared memory, deeper register tiling — are exactly what the WMMA API hides. Phase 2 tests that diagnosis constructively: rewrite the kernel on raw mma.sync.aligned.m16n8k16 PTX + ldmatrix (src/gemm_mma.cu), then re-apply the standard optimization stack one step at a time — the same ablation discipline (and the same recipe) as Simon Boehm's FP32 ladder, which reaches ~94% of cuBLAS on CUDA cores. The roadmap question: does that recipe survive the move to Tensor Cores, which consume operands ~8× faster?

It does — and overshoots:

step	adds	TFLOP/s @ 8192³	% of cuBLAS-TC
`wmma` (Phase 1 best)	–	103.1	45.0%
`mma_base`	raw mma.sync + ldmatrix, 128×128 CTA, 2×2 warp tiling, scalar loads	47.4	20.7%
`mma_swizzle`	+ XOR-swizzled (bank-conflict-free) smem	58.7	25.6%
`mma_vec`	+ 16-byte vectorized `cp.async`	165.3	72.2%
`mma_pipe`	+ 2-stage pipeline (sweep winner: 2 > 3 > 4)	178.0	77.7%
`mma_warptile`	+ 64×64 per-warp register tile	243.2	106.1%
`cublas_tc` (ceiling)	`cutlass_80_tensorop_s16816gemm_f16_128x64`	229.0	100%

Four things make the 106% credible (full validation in results/mma_ablation.md): (1) identical max_abs_err to cuBLAS-TC at every size (same math, verified standalone); (2) the gap is visible inside the GPU timeline — nsys shows 4.52 ms vs 4.79 ms per kernel instance (results/nsys_kern_sum_mma_8192.txt); (3) it is a same-instruction contest — cuBLAS's kernel name (s16816) says it uses the same m16n8k16 instruction; we just tile and pipeline it better for this card (128×128 CTA / 64×64 warp tile / 2 stages, 196 reg/thread, 0 spills); (4) ncu shows why: our kernel runs the Tensor pipe at 85.0% utilization vs cuBLAS's 74.0%, with the same top-stall signature ncu attributes to "already highly optimized kernels" (results/ncu_sm120_*_8192.txt, captured via scripts/profile_mma_ncu.sh).

The ncu side-by-side also closes the Phase 1 → Phase 2 loop at the stall level:

ncu @ 8192³	wmma (Phase 1)	mma_warptile (Phase 2)	cuBLAS-TC (cutlass_80)
Tensor pipe utilization	24.7%	85.0%	74.0%
Warp cycles / issued instruction	49.6	5.8	6.6
Top stall	MIO queue full (34.2%)	fixed-latency dep (2.9 cyc)	fixed-latency dep (4.0 cyc)
Occupancy / registers	65.3% / 64	16.6% / 186	8.3% / 154

The MIO-queue-full stall that defined the WMMA kernel is gone; the kernel now has the low-occupancy / high-register / math-bound signature of the winning kernels — and runs its Tensor pipe 11 points hotter than the kernel cuBLAS picked.

The honest framing: this beats the kernel cuBLAS dispatches on sm_120 — an Ampere-era CUTLASS kernel chosen by heuristics that don't yet have Blackwell-native FP16 tuning — not the hardware peak. But that is precisely the repo's thesis demonstrated in reverse: the ceiling on any given card is set by which instruction class and tile shape the software can reach. On sm_120, raw mma.sync + the standard recipe reaches all of it. (On Hopper the same kernel would cap at the ~63–65%-of-peak mma.sync instruction ceiling — wgmma territory, roadmap Phase 2.5.)

What each step says (ablation reading, details in results/mma_ablation.md):

The baseline starts below WMMA on purpose — every memory-path optimization is stripped so each can be priced separately. Raw Tensor Core instructions alone buy nothing (20.7%).
Vectorized cp.async is the biggest single jump (2.8×) — on Tensor Cores, feed inefficiency is amplified ~8× vs Boehm's CUDA-core numbers.
Warptiling is the decisive step (77.7% → 106.1%), same as in Boehm's ladder (his: 78.4% → 93.7%). 64×64 warp tiles double arithmetic intensity per ldmatrix (4:1 mma:ldmatrix vs 2:1) — the direct fix for the Phase 1 MIO-queue-full stall.
The pipeline sweep flips vs WMMA: 2 stages beat 3 (WMMA preferred 3). With ldmatrix feeding mma.sync directly, deeper buffering just costs occupancy.

Phase 3: FP8 / FP4 — the Blackwell formats, through the sm_120 mma path

Same kernel skeleton as Phase 2 (128×128 CTA, 64×64 warp tile, swizzled smem, 2-stage cp.async), with the instruction swapped per precision — src/gemm_mma_fp8.cu, built for sm_120a (the FP4 kinds need it). tcgen05 is not available on this card (B200/sm_100 only); these are the precisions reachable through the plain mma path:

kernel	format / instruction	TFLOP/s @ 8192³	vs FP16	max_abs_err
`mma_warptile`	FP16 · `m16n8k16` (HMMA)	239.2	1.00×	0.0112
`mma_fp8`	FP8 E4M3 · `m16n8k32` (QMMA)	501.6	2.10×	1.4
`mma_fp4`	FP4-in-8bit · `kind::f8f6f4` (QMMA)	517.8	2.16×	5.97
`mma_mxfp4` ¹	packed FP4 + block scale · `kind::mxf4` (OMMA.SF)	988.3	4.13×	5.97
`cublas_tc`	FP16 (cuBLAS)	225.3	—	0.0112
`cublaslt_fp8`	FP8 E4M3 (cuBLASLt)	552.4	2.31×	1.4

The ladder delivers the 5th-gen Tensor Core spec almost exactly — 2.10× for FP8, 4.13× for packed FP4 — and the Pareto chart shows the price: each 2× costs roughly a decimal digit of accuracy (max_abs_err 0.011 → 1.4 → 6.0 at K=8192). Three findings worth quoting (full analysis in results/phase3_lowprec.md):

¹ The 4.13× is a throughput result. The MXFP4 block-scale factors are fed as 1.0 (UE8M0 = 127), i.e. per-tensor, not per-32-element-block, scaling — the hardware does identical work and the OMMA.SF rate is real, but the numerics do not exercise real per-block MXFP4 scaling (for the uniform test matrices per-block scales would be ≈ identical anyway; real-world activation outliers would widen the accuracy gap). See results/phase3_lowprec.md.

Unpacked FP4 is pointless. kind::f8f6f4 stores E2M1 in 8-bit containers and shares the QMMA pipeline → FP8 speed at 4× FP8's error. The 2×-over-FP8 exists only in the packed, block-scaled kind::mxf4 path (OMMA.SF in SASS).
Our FP8 is 90.8% of cuBLASLt FP8 (502 vs 552) — and cuBLASLt's kernel uses the same tile, warp layout and instruction as ours. Larger tiles and deeper pipelines measured as dead ends; the gap is xmma's instruction-level scheduling. Register-pipelined fragments (this kernel's mainloop) buy 493 → 502; the last 9% needs finer interleaving.
cuBLAS has no FP4 GEMM on sm_120 (no public ceiling exists) — the 988 TFLOP/s MXFP4 row is this card's first-hand FP4 data point, at 56.1% of the measured FP4 peak (4× the directly measured FP16 peak — see the Phase 4 rate probe below; throughput data point only, see footnote ¹ on the per-tensor block-scale caveat).

Measured on H100 80GB SXM5 (sm_90, CUDA 13.1) — the cross-generation result

The same source, built with ARCH=90, run on one idle GPU of an 8×H100 box (nvcr.io/nvidia/pytorch:26.02-py3 container; the box's busy production GPU was never touched):

size	wmma (TFLOP/s)	cublas_tc (TFLOP/s)	wmma % of cuBLAS-TC	same kernel on sm_120
2048	56.6	555.0	10.2%	31.8%
4096	59.4	732.6	8.1%	40.4%
8192	60.9	761.7	8.0%	44.4%

Two things happened at once, and nsys/ncu separate them cleanly (see results/nsys_profile.md for the full profiles):

The ceiling moved. On H100, cublas_tc dispatches nvjet_sm90_hss_320x128 — a Hopper-native warpgroup (wgmma) kernel that reaches 762 TFLOP/s ≈ 77% of H100's 989 TFLOP/s FP16 dense peak. On the RTX Pro 6000, cuBLAS-TC tops out at 226 TFLOP/s (an sm_80-style cutlass_80_tensorop kernel). The H100 ceiling is 3.3× higher in absolute terms.
Our kernel cannot follow it — for two stacked reasons. (a) Architectural: the WMMA API lowers to per-warp mma.sync, and on Hopper mma.sync cannot reach the Tensor Core peak — instruction-level microbenchmarks (arXiv:2501.12084) measure the mma path at ~63–65% of peak (494 of 757 TFLOPS on H800) vs ~96% for wgmma. (b) Kernel: our kernel doesn't even reach that mma.sync ceiling — it reaches 6% of peak, because its shared-memory operand pipeline (tile sizes and 3-stage cp.async depth tuned on sm_120, where the Tensor Cores are 3.3× slower) cannot feed Hopper's TCs; the ncu profile below shows it is MIO/feed-bound, not math-bound. Closing (b) with Hopper-tuned staging would still leave (a) — the wgmma-only gap to cuBLAS.

ncu --set full on both kernels at 8192³ quantifies the gap (full table in results/nsys_profile.md):

metric (ncu, 8192³)	`gemm_wmma_t<128,128,3>` (ours)	`nvjet_sm90` (cuBLAS-TC)
Duration	22.9 ms	1.56 ms
SM compute throughput	26.7%	91.9%
DRAM throughput	6.2%	28.9%
Achieved occupancy	49.4%	14.8%
Registers / thread	64	168
Top stall reason	MIO queue full (shared-mem traffic)	WARPGROUP.ARRIVES (wgmma sync)

The signature is unmistakable: the winning kernel runs at low occupancy with huge register state and near-peak tensor utilization (the warpgroup model), while the WMMA kernel burns its issue slots on shared-memory ld/st (MIO stalls) feeding fragments to mma.sync — high occupancy, low utilization.

What transfers across generations and what doesn't: the optimization story (tiling, cp.async pipelining, register tiling) transfers — the kernel's absolute TFLOP/s scales only with SM count × clock (100 vs 61 TFLOP/s; the spec SM×clock ratio is 1.88×, measured 1.70× — the Max-Q power limit accounts for the difference). What does not transfer is the fraction of the ceiling: each architecture generation moves the ceiling behind a new instruction (Volta mma.sync → Ampere mma.sync+cp.async → Hopper wgmma+TMA → Blackwell tcgen05), and a kernel written against the previous generation's abstraction silently keeps its absolute speed while losing its relative one. That is the actual lesson an SA needs when a partner asks "why is my custom kernel slow on the new GPUs?"

On the two cuBLAS baselines (read before quoting any "% of cuBLAS")

% of FP32 cuBLAS (e.g. 194% @ 512, 185% @ 8192) compares FP16-on-Tensor-Cores WMMA against cublasSgemm FP32 on CUDA cores — precision-mismatched, not a Tensor Core ceiling; a >100% row reflects that mismatch (plus small-size launch overhead), not a kernel beating cuBLAS. % of cuBLAS-TC compares against cublas_tc (cublasGemmEx, FP16 in / FP32 accumulate) in src/cublas_tc.cu — same precision, same timing methodology (FP16 cast staged once outside the timed loop; cuBLAS handle created once). Against this honest ceiling the optimized WMMA lands at ~44% at large sizes. The remaining gap is not TMA or warp specialization — nsys shows cuBLAS dispatches cutlass_80_tensorop_s16816gemm_f16_128x64, an Ampere-generation (sm_80) kernel that uses cp.async multistage pipelining, not TMA and not warp specialization (both are sm_90/Hopper+ features). The gap comes from cuBLAS's larger 128×64 CTA tile, deeper register-level warp/thread tiling, vectorized loads, swizzle/rasterization, and more aggressive multistage cp.async pipelining — the kernel-name string itself (cutlass_80_*) is the evidence. This is consistent with Boehm's published GEMM ablation, where 2D blocktiling + vectorized loads + warptiling reaches ~94% of cuBLAS without TMA or warp specialization.

Phase 2.5 — breaking the WMMA ceiling with CUTLASS 3.x `wgmma` (constructive proof)

The H100 result above says the WMMA API cannot follow Hopper's ceiling. That conclusion rested on literature + ncu evidence; this closes it constructively. src/cutlass_sm90.cu builds the same FP16-in/FP32-accumulate GEMM with CUTLASS 3.x's CollectiveBuilder (TMA loads + wgmma warpgroup MMA, warp-specialized cooperative schedule), added as kernel row cutlass_sm90 (build: cmake -DCMAKE_CUDA_ARCHITECTURES=90a -DCUTLASS_DIR=…; raw rows in results/bench_sm90a.csv, same one-idle-H100 methodology):

size	wmma	cutlass_sm90	cublas_tc (nvjet)	cutlass % of cuBLAS-TC	wmma %
2048	56.0	373.9	540.9	69.1%	10.4%
4096	58.9	552.7	726.4	76.1%	8.1%
8192	61.0	640.9	749.9	85.5%	8.1%

Switching instruction class (mma.sync → wgmma) recovers 10.5× at 8192³ — from 8% to 85.5% of the cuBLAS-TC ceiling — with identical numerical error (0.0112 max abs, FP16 input rounding). The ncu --set full signature confirms the kernel actually crossed regimes (results/ncu_cutlass_8192.txt):

metric (ncu, 8192³)	wmma (ours)	cutlass_sm90 (ours)	nvjet (cuBLAS-TC)
SM compute throughput	26.7%	74.4%	91.9%
Achieved occupancy	49.4%	14.1%	14.8%
Registers / thread	64	168	168
Top stall reason	MIO queue full (memory feed)	CTA barrier (warpgroup sync)	WARPGROUP.ARRIVES

The CUTLASS kernel inherits nvjet's shape: low occupancy, 168 registers/thread, stalls on warpgroup synchronization instead of operand feeding. The remaining 14.5% to nvjet is tile-shape/cluster/rasterization autotuning (nvjet picks 320×128 tiles vs our fixed 128×256) — a tuning gap, not an instruction-class gap. Also note the reversal this documents: warp specialization, a measured negative on sm_120 (experiments/wmma_ws_probe.cu), is mandatory on Hopper.

Third-party reference baselines on this box (Phase 4)

What do the leading open-source kernels actually achieve on this H100, run with their own benchmarks, unmodified? (scripts/run_reference_benches.sh; raw logs + clock state in results/reference/; parsed by analysis/reference_summary.py)

kernel	measured TFLOPS	% of H100 peak	published	measured / published
DeepGEMM FP8 (DeepSeek)	1523	77%	1358 (H800)	112%
DeepGEMM BF16	830	84%	—	—
ThunderKittens FP8	1465	74%	~1500	98%
ThunderKittens BF16	775	78%	~760	102%
ThunderKittens FP8 (per-block scaled)	985	50%	—	—
FlashAttention-3 FP16 fwd	757	77%	740	102%
FlashAttention-2 fwd (same file), context	388	39%	—	—
cuDNN attention fwd (same file), context	689	70%	—	—
cuBLAS FP16-TC (nvjet), context	750	76%	—	—
this repo's WMMA kernel, context	61	6%	—	—
this repo's CUTLASS wgmma kernel, context	641	65%	—	—

Two readings: (1) the published numbers are honest — every project reproduces within ~±10% on our shared box at stock clocks, so the repo's own gaps are real, not environmental; (2) the FP8 ceiling (~1500 TFLOPS measured) is the target Phase 3's own FP8 kernel work gets measured against.

One published claim that does not reproduce: Stream-K's "up to 6.7×" (arXiv:2301.03598). CUTLASS example 47 on this H100 lands at 0.94×–1.05× vs data-parallel tiling, even on shapes built to have a 48%-idle final wave (results/reference/streamk.txt). The 6.7× was measured on A100 (108 SMs) against constructed worst cases; H100's 132 SMs shrink the relative tail, and the Stream-K reduction overhead eats what remains. Negative reproductions are reported as such — that is what the reference rows are for.

The peak denominator, measured: FP32-accumulate is full-rate on this card (Phase 4)

Every "% of theoretical peak" in this repo divides by an FP16 peak that assumes the workstation GB202 runs FP16-in/FP32-acc at full rate. The consumer GB202 (RTX 5090) runs it at half rate, and no public spec exists for the RTX PRO 6000 — so the repo measures it directly: a register-resident, memory-free mma.sync.m16n8k16 loop (one block per SM, 8 warps, 8 independent accumulator chains per warp), in two variants that differ only in accumulator type (src/mma_rate_probe.cu; run: scripts/run_rate_probe.sh; raw data: results/mma_rate_probe.csv).

variant	FLOP/SM/clk (from in-kernel `clock64()`)	wall-clock TFLOP/s @ 300 W cap
FP16-in / FP16-acc	1023.7	440.3
FP16-in / FP32-acc	1023.8	426.7

Both variants hit the 1024 FLOP/SM/clk spec rate exactly: FP32 accumulation is full-rate on the RTX PRO 6000 (per-clock ratio 1.000), unlike the consumer RTX 5090. The only cost of the wider accumulator is power — under the 300 W Max-Q cap the FP32-acc variant sustains ~3% lower clocks. This pins the peak denominators used throughout: FP16 dense ≈ 440 TFLOP/s at the sustained power-capped clock, FP8 ≈ 880, MXFP4 ≈ 1761 — so mma_warptile (243.2) = 55.2% of peak, mma_fp8 (501.6) = 57.0%, mma_mxfp4 (988.3) = 56.1%, cuBLAS-TC (225.75) = 51.3%.

References

NVIDIA CUTLASS — the production reference for Tensor Core GEMM.
WMMA API (CUDA C++ Programming Guide) — the API used by gemm_wmma.cu.
Simon Boehm, "How to Optimize a CUDA Matmul Kernel" — quantified step-by-step ablation: 2D blocktiling alone reaches 68.7% of cuBLAS, +vectorized loads 78.4%, +warptiling 93.7%, all without TMA or warp specialization.
CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization (Bauer et al., SC'11) — origin of warp specialization; it pays off only with large tiles + async memory hardware + register reallocation.
CUTLASS Efficient GEMM docs — CTA/warp/thread tiling, multistage pipelining, and kernel naming.

Disclaimer

Personal project for learning and benchmarking. Views and results are my own and do not represent any employer.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Blackwell Tensor Core Kernels

What this is

What this is NOT

Hardware

Layout

Quick start (on the Blackwell / H100 box)

Results

Measured on RTX PRO 6000 Blackwell Max-Q (sm_120, CUDA 12.8)

Phase 2: hand-written mma.sync — past the WMMA ceiling, past cuBLAS (sm_120)

Phase 3: FP8 / FP4 — the Blackwell formats, through the sm_120 mma path

Measured on H100 80GB SXM5 (sm_90, CUDA 13.1) — the cross-generation result

On the two cuBLAS baselines (read before quoting any "% of cuBLAS")

Phase 2.5 — breaking the WMMA ceiling with CUTLASS 3.x `wgmma` (constructive proof)

Third-party reference baselines on this box (Phase 4)

The peak denominator, measured: FP32-accumulate is full-rate on this card (Phase 4)

References

Disclaimer

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
analysis		analysis
docs		docs
experiments		experiments
include		include
results		results
scripts		scripts
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
VALIDATION.md		VALIDATION.md

Folders and files

Latest commit

History

Repository files navigation

Blackwell Tensor Core Kernels

What this is

What this is NOT

Hardware

Layout

Quick start (on the Blackwell / H100 box)

Results

Measured on RTX PRO 6000 Blackwell Max-Q (sm_120, CUDA 12.8)

Phase 2: hand-written mma.sync — past the WMMA ceiling, past cuBLAS (sm_120)

Phase 3: FP8 / FP4 — the Blackwell formats, through the sm_120 mma path

Measured on H100 80GB SXM5 (sm_90, CUDA 13.1) — the cross-generation result

On the two cuBLAS baselines (read before quoting any "% of cuBLAS")

Phase 2.5 — breaking the WMMA ceiling with CUTLASS 3.x wgmma (constructive proof)

Third-party reference baselines on this box (Phase 4)

The peak denominator, measured: FP32-accumulate is full-rate on this card (Phase 4)

References

Disclaimer

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Phase 2.5 — breaking the WMMA ceiling with CUTLASS 3.x `wgmma` (constructive proof)

Packages