Hand-written CUDA GEMM kernels targeting Tensor Cores, benchmarked on both Hopper (H100, sm_90) and Blackwell (RTX Pro 6000, sm_120), with a path toward Blackwell's FP8/FP4 microscaling formats.
The point is to connect kernel-level choices — tiling, Tensor Core fragment shapes, occupancy — to measured TFLOP/s as a fraction of the cuBLAS ceiling, on real Hopper and Blackwell silicon.
- A naive baseline, a shared-memory tiled GEMM, a WMMA Tensor Core GEMM, and a
hand-written raw
mma.syncTensor Core GEMM ablation ladder (FP16 in / FP32 accumulate). - A harness that checks correctness against cuBLAS and reports TFLOP/s and % of the cuBLAS ceiling,
across a precision ladder of cuBLAS baselines on the same card:
cublas=cublasSgemm(FP32, CUDA cores) →cublas_tf32=cublasGemmEx(FP32 in, TF32 compute, Tensor Cores) →cublas_tc=cublasGemmEx(FP16 in / FP32 acc, Tensor Cores — the same precision as the WMMA/mma.sync kernels, hence the honest ceiling). - Builds for sm_90 (H100) and sm_120 (Blackwell RTX Pro 6000) so the same kernels can be profiled across two generations.
Measurement status: both generations are now populated in
results/bench.csv— Blackwell RTX PRO 6000 Max-Q (sm_120) and H100 80GB SXM5 (sm_90) (run on a separate 8×H100 box, pinned to a single idle GPU). Three headline findings:
- (Phase 2, sm_120) A hand-written raw
mma.synckernel reaches 106% of cuBLAS-TC on sm_120 (243 vs 229 TFLOP/s @ 8192³) — beating thecutlass_80kernel cuBLAS dispatches on this card, with the same standard-optimization stack Boehm used to reach ~94% on CUDA cores. The WMMA wrapper, not the recipe, was what kept the Phase 1 kernel at 45%.- (Phase 1) The cross-generation comparison: the same WMMA kernel that reaches 45% of cuBLAS-TC on sm_120 reaches only 8% on H100 — not because the kernel runs slower per-SM, but because Hopper's Tensor Core ceiling is only reachable through
wgmmawarpgroup instructions that the WMMA API cannot emit.- (Phase 2.5, sm_90) That conclusion is now closed constructively: the same GEMM rebuilt with CUTLASS 3.x (TMA +
wgmma) reaches 85.5% of cuBLAS-TC on H100 — a 10.5× recovery from the instruction class alone. Findings 1+3 are the same lesson on two architectures: reach the current generation's native instruction or lose the ceiling. Third-party ceilings (DeepGEMM, ThunderKittens, FlashAttention-3, Stream-K) are also measured on this box as reference rows — all within ±12% of their published numbers.
- Not a cuBLAS replacement — cuBLAS is the ceiling measured against, honestly.
- Not tcgen05 — tensor-memory / CTA-pair instructions are sm_100 (B200) only. This card's
FP8/FP4 path is the regular
mmainstruction set, and that is what Phase 3 measures.
- NVIDIA RTX Pro 6000 (Blackwell, sm_120) and/or H100 (Hopper, sm_90). CUDA 12.x+.
src/gemm_naive.cu # one-thread-per-output baseline (correctness anchor)
src/gemm_tiled.cu # shared-memory tiled + register-blocked SIMT GEMM
src/gemm_wmma.cu # WMMA 16x16x16 Tensor Core GEMM (FP16 in, FP32 acc)
src/gemm_mma.cu # hand-written raw mma.sync m16n8k16 GEMM — 5-step ablation ladder (Phase 2)
src/cutlass_sm90.cu # CUTLASS 3.x TMA + wgmma GEMM, Hopper only (Phase 2.5)
src/gemm_mma_fp8.cu # FP8 (E4M3) / FP4 (E2M1) / MXFP4 GEMM via the sm_120 mma path (Phase 3)
src/cublaslt_fp8.cu # cuBLASLt FP8 baseline — the library FP8 ceiling (Phase 3)
src/mma_rate_probe.cu # FP32-acc vs FP16-acc tensor rate microbenchmark — measures the peak (Phase 4)
include/mma_ptx.cuh # the PTX building blocks: mma.sync, ldmatrix, XOR swizzle, cp.async
src/reference.cu # cuBLAS FP32 ceiling (cublasSgemm, CUDA cores)
src/cublas_tc.cu # cuBLAS Tensor Core ceiling (cublasGemmEx, FP16 in / FP32 acc) — same precision as wmma/mma
src/main.cu # correctness + benchmark driver -> results/bench.csv
include/util.cuh # timing, init, max-abs-error check
scripts/sweep_mma_stages.sh # cp.async pipeline-depth sweep -> results/mma_stage_sweep.csv
scripts/run_reference_benches.sh # DeepGEMM / ThunderKittens / FA3 reference runs (Phase 4)
analysis/reference_summary.py # parse reference logs -> summary.md + reference_benches.png
CMakeLists.txt # builds for sm_90 and sm_120 (+ sm_90a CUTLASS via -DCUTLASS_DIR)
docs/design-decisions.md
docs/roadmap.md
One command does the full capture — size sweep + Nsight + plots + report:
make capture # build -> sweep sizes -> ncu/nsys -> results/report.md + PNGsThe benchmark CSV records the device and SM of each run, so to compare both generations you run the sweep once on each GPU and the rows accumulate:
# on the H100 box: ARCH=90 make bench
# on the Blackwell box: ARCH=120 make bench
make analyze # merge both -> results/report.md + results/tflops_sm*.pngSingle run / one kernel, by hand:
make build
./build/gemm_bench 4096 4096 4096 results/bench.csv # M N K [out_csv]
bash scripts/profile.sh 4096 # ncu + nsys for gemm_wmmamake capture produces, in results/:
bench.csv— per kernel × size × GPU: ms, TFLOP/s, % of FP32 cuBLAS (pct_of_cublas, vscublasSgemm), max abs err. Now includes acublas_tcrow per size.tflops_sm120.png,pct_tc_sm120.png,roofline_sm120.png— the three charts below.ncu_wmma_*.ncu-rep,nsys_*.nsys-rep— Nsight Compute / Systems captures.report.md— the summary table (both % of FP32 cuBLAS and % of cuBLAS-TC) + charts.
Throughput across the precision ladder (FP32 → TF32 → FP16, all real, one card):
Each kernel as a fraction of the honest same-precision ceiling (cuBLAS FP16-TC = 100%):
Throughput at the largest (most steady-state) size, M=N=K=8192:
The gemm_wmma kernel is shared-memory tiled with per-warp 2×2 register tiling and a 3-stage
cp.async pipeline, size-dispatched: 64×64 tile for N<1536 (better occupancy), 128×128 tile
for N≥1536 (more reuse). Numbers at each size (TFLOP/s and fraction of the same-precision
cuBLAS FP16-TC ceiling):
| size | wmma | tf32-TC | cublas_tc (FP16-TC) | wmma % of cuBLAS-TC | tf32 % of cuBLAS-TC |
|---|---|---|---|---|---|
| 512 | 16.2 | 27.8 | 33.7 | 48.2% | 82.6% |
| 1024 | 38.7 | 90.7 | 133.8 | 29.0% | 67.8% |
| 2048 | 68.6 | 128.8 | 213.6 | 32.1% | 60.3% |
| 4096 | 96.1 | 144.4 | 235.1 | 40.9% | 61.4% |
| 8192 | 100.2 | 150.5 | 225.8 | 44.4% | 66.7% |
Read the % of cuBLAS-TC column — the honest same-precision (FP16-in/FP32-acc, Tensor Core)
ceiling. Across two optimization passes (shared-mem + cp.async, then register tiling + deeper
pipeline + size dispatch) the WMMA kernel went from a naive 17.3% to 44.4% of cuBLAS-TC
at 8192 (1.64× the single-buffer version), and no longer decays at scale. Pipeline depth is
tuned (3 stages > 4 > 5) and warp specialization was tried but did not beat the multi-stage
pipeline — the expected outcome per CudaDMA (Bauer et al., SC'11), since warp specialization
needs large tiles + async-transfer hardware + register reallocation as prerequisites, none of
which a 512-thread WMMA tile supplies on its own; see results/nsys_profile.md for the full
before/after and the WS experiment.
The % of FP32 cuBLAS column is precision-mismatched (kept for continuity); its >100% rows
are FP16-TC vs FP32-CUDA-core, not the kernel beating cuBLAS. cublas_tc is 4.2–23× faster than
cublasSgemm, confirming the Tensor Core path.
Phase 1 ended with a diagnosis: the WMMA kernel is feed-bound (MIO-queue-full), and the
fixes — swizzled shared memory, deeper register tiling — are exactly what the WMMA API hides.
Phase 2 tests that diagnosis constructively: rewrite the kernel on raw
mma.sync.aligned.m16n8k16 PTX + ldmatrix (src/gemm_mma.cu), then re-apply the standard
optimization stack one step at a time — the same ablation discipline (and the same recipe) as
Simon Boehm's FP32 ladder, which reaches ~94% of
cuBLAS on CUDA cores. The roadmap question: does that recipe survive the move to Tensor Cores,
which consume operands ~8× faster?
It does — and overshoots:
| step | adds | TFLOP/s @ 8192³ | % of cuBLAS-TC |
|---|---|---|---|
wmma (Phase 1 best) |
– | 103.1 | 45.0% |
mma_base |
raw mma.sync + ldmatrix, 128×128 CTA, 2×2 warp tiling, scalar loads | 47.4 | 20.7% |
mma_swizzle |
+ XOR-swizzled (bank-conflict-free) smem | 58.7 | 25.6% |
mma_vec |
+ 16-byte vectorized cp.async |
165.3 | 72.2% |
mma_pipe |
+ 2-stage pipeline (sweep winner: 2 > 3 > 4) | 178.0 | 77.7% |
mma_warptile |
+ 64×64 per-warp register tile | 243.2 | 106.1% |
cublas_tc (ceiling) |
cutlass_80_tensorop_s16816gemm_f16_128x64 |
229.0 | 100% |
Four things make the 106% credible (full validation in
results/mma_ablation.md):
(1) identical max_abs_err to cuBLAS-TC at every size (same math, verified standalone);
(2) the gap is visible inside the GPU timeline — nsys shows 4.52 ms vs 4.79 ms per kernel
instance (results/nsys_kern_sum_mma_8192.txt); (3) it is a same-instruction contest —
cuBLAS's kernel name (s16816) says it uses the same m16n8k16 instruction; we just tile and
pipeline it better for this card (128×128 CTA / 64×64 warp tile / 2 stages, 196 reg/thread,
0 spills); (4) ncu shows why: our kernel runs the Tensor pipe at 85.0% utilization vs
cuBLAS's 74.0%, with the same top-stall signature ncu attributes to "already highly optimized
kernels" (results/ncu_sm120_*_8192.txt, captured via scripts/profile_mma_ncu.sh).
The ncu side-by-side also closes the Phase 1 → Phase 2 loop at the stall level:
| ncu @ 8192³ | wmma (Phase 1) | mma_warptile (Phase 2) | cuBLAS-TC (cutlass_80) |
|---|---|---|---|
| Tensor pipe utilization | 24.7% | 85.0% | 74.0% |
| Warp cycles / issued instruction | 49.6 | 5.8 | 6.6 |
| Top stall | MIO queue full (34.2%) | fixed-latency dep (2.9 cyc) | fixed-latency dep (4.0 cyc) |
| Occupancy / registers | 65.3% / 64 | 16.6% / 186 | 8.3% / 154 |
The MIO-queue-full stall that defined the WMMA kernel is gone; the kernel now has the low-occupancy / high-register / math-bound signature of the winning kernels — and runs its Tensor pipe 11 points hotter than the kernel cuBLAS picked.
The honest framing: this beats the kernel cuBLAS dispatches on sm_120 — an Ampere-era
CUTLASS kernel chosen by heuristics that don't yet have Blackwell-native FP16 tuning — not the
hardware peak. But that is precisely the repo's thesis demonstrated in reverse: the ceiling on
any given card is set by which instruction class and tile shape the software can reach. On
sm_120, raw mma.sync + the standard recipe reaches all of it. (On Hopper the same kernel would
cap at the ~63–65%-of-peak mma.sync instruction ceiling — wgmma territory, roadmap
Phase 2.5.)
What each step says (ablation reading, details in
results/mma_ablation.md):
- The baseline starts below WMMA on purpose — every memory-path optimization is stripped so each can be priced separately. Raw Tensor Core instructions alone buy nothing (20.7%).
- Vectorized
cp.asyncis the biggest single jump (2.8×) — on Tensor Cores, feed inefficiency is amplified ~8× vs Boehm's CUDA-core numbers. - Warptiling is the decisive step (77.7% → 106.1%), same as in Boehm's ladder (his: 78.4% →
93.7%). 64×64 warp tiles double arithmetic intensity per
ldmatrix(4:1 mma:ldmatrix vs 2:1) — the direct fix for the Phase 1 MIO-queue-full stall. - The pipeline sweep flips vs WMMA: 2 stages beat 3 (WMMA preferred 3). With ldmatrix feeding mma.sync directly, deeper buffering just costs occupancy.
Same kernel skeleton as Phase 2 (128×128 CTA, 64×64 warp tile, swizzled smem, 2-stage
cp.async), with the instruction swapped per precision — src/gemm_mma_fp8.cu, built for
sm_120a (the FP4 kinds need it). tcgen05 is not available on this card (B200/sm_100
only); these are the precisions reachable through the plain mma path:
| kernel | format / instruction | TFLOP/s @ 8192³ | vs FP16 | max_abs_err |
|---|---|---|---|---|
mma_warptile |
FP16 · m16n8k16 (HMMA) |
239.2 | 1.00× | 0.0112 |
mma_fp8 |
FP8 E4M3 · m16n8k32 (QMMA) |
501.6 | 2.10× | 1.4 |
mma_fp4 |
FP4-in-8bit · kind::f8f6f4 (QMMA) |
517.8 | 2.16× | 5.97 |
mma_mxfp4 ¹ |
packed FP4 + block scale · kind::mxf4 (OMMA.SF) |
988.3 | 4.13× | 5.97 |
cublas_tc |
FP16 (cuBLAS) | 225.3 | — | 0.0112 |
cublaslt_fp8 |
FP8 E4M3 (cuBLASLt) | 552.4 | 2.31× | 1.4 |
The ladder delivers the 5th-gen Tensor Core spec almost exactly — 2.10× for FP8, 4.13× for
packed FP4 — and the Pareto chart shows the price: each 2× costs roughly a decimal digit
of accuracy (max_abs_err 0.011 → 1.4 → 6.0 at K=8192). Three findings worth quoting
(full analysis in results/phase3_lowprec.md):
¹ The 4.13× is a throughput result. The MXFP4 block-scale factors are fed as 1.0 (UE8M0 = 127), i.e. per-tensor, not per-32-element-block, scaling — the hardware does identical work and the OMMA.SF rate is real, but the numerics do not exercise real per-block MXFP4 scaling (for the uniform test matrices per-block scales would be ≈ identical anyway; real-world activation outliers would widen the accuracy gap). See
results/phase3_lowprec.md.
- Unpacked FP4 is pointless.
kind::f8f6f4stores E2M1 in 8-bit containers and shares the QMMA pipeline → FP8 speed at 4× FP8's error. The 2×-over-FP8 exists only in the packed, block-scaledkind::mxf4path (OMMA.SF in SASS). - Our FP8 is 90.8% of cuBLASLt FP8 (502 vs 552) — and cuBLASLt's kernel uses the same tile, warp layout and instruction as ours. Larger tiles and deeper pipelines measured as dead ends; the gap is xmma's instruction-level scheduling. Register-pipelined fragments (this kernel's mainloop) buy 493 → 502; the last 9% needs finer interleaving.
- cuBLAS has no FP4 GEMM on sm_120 (no public ceiling exists) — the 988 TFLOP/s MXFP4 row is this card's first-hand FP4 data point, at 56.1% of the measured FP4 peak (4× the directly measured FP16 peak — see the Phase 4 rate probe below; throughput data point only, see footnote ¹ on the per-tensor block-scale caveat).
The same source, built with ARCH=90, run on one idle GPU of an 8×H100 box
(nvcr.io/nvidia/pytorch:26.02-py3 container; the box's busy production GPU was never touched):
| size | wmma (TFLOP/s) | cublas_tc (TFLOP/s) | wmma % of cuBLAS-TC | same kernel on sm_120 |
|---|---|---|---|---|
| 2048 | 56.6 | 555.0 | 10.2% | 31.8% |
| 4096 | 59.4 | 732.6 | 8.1% | 40.4% |
| 8192 | 60.9 | 761.7 | 8.0% | 44.4% |
Two things happened at once, and nsys/ncu separate them cleanly
(see results/nsys_profile.md for the full profiles):
- The ceiling moved. On H100,
cublas_tcdispatchesnvjet_sm90_hss_320x128— a Hopper-native warpgroup (wgmma) kernel that reaches 762 TFLOP/s ≈ 77% of H100's 989 TFLOP/s FP16 dense peak. On the RTX Pro 6000, cuBLAS-TC tops out at 226 TFLOP/s (an sm_80-stylecutlass_80_tensoropkernel). The H100 ceiling is 3.3× higher in absolute terms. - Our kernel cannot follow it — for two stacked reasons. (a) Architectural: the WMMA
API lowers to per-warp
mma.sync, and on Hoppermma.synccannot reach the Tensor Core peak — instruction-level microbenchmarks (arXiv:2501.12084) measure themmapath at ~63–65% of peak (494 of 757 TFLOPS on H800) vs ~96% forwgmma. (b) Kernel: our kernel doesn't even reach that mma.sync ceiling — it reaches 6% of peak, because its shared-memory operand pipeline (tile sizes and 3-stagecp.asyncdepth tuned on sm_120, where the Tensor Cores are 3.3× slower) cannot feed Hopper's TCs; the ncu profile below shows it is MIO/feed-bound, not math-bound. Closing (b) with Hopper-tuned staging would still leave (a) — the wgmma-only gap to cuBLAS.
ncu --set full on both kernels at 8192³ quantifies the gap (full table in
results/nsys_profile.md):
| metric (ncu, 8192³) | gemm_wmma_t<128,128,3> (ours) |
nvjet_sm90 (cuBLAS-TC) |
|---|---|---|
| Duration | 22.9 ms | 1.56 ms |
| SM compute throughput | 26.7% | 91.9% |
| DRAM throughput | 6.2% | 28.9% |
| Achieved occupancy | 49.4% | 14.8% |
| Registers / thread | 64 | 168 |
| Top stall reason | MIO queue full (shared-mem traffic) | WARPGROUP.ARRIVES (wgmma sync) |
The signature is unmistakable: the winning kernel runs at low occupancy with huge register
state and near-peak tensor utilization (the warpgroup model), while the WMMA kernel burns its
issue slots on shared-memory ld/st (MIO stalls) feeding fragments to mma.sync — high
occupancy, low utilization.
What transfers across generations and what doesn't: the optimization story (tiling,
cp.async pipelining, register tiling) transfers — the kernel's absolute TFLOP/s scales only
with SM count × clock (100 vs 61 TFLOP/s; the spec SM×clock ratio is 1.88×, measured 1.70× —
the Max-Q power limit accounts for the difference). What does not transfer is the
fraction of the ceiling:
each architecture generation moves the ceiling behind a new instruction (Volta mma.sync →
Ampere mma.sync+cp.async → Hopper wgmma+TMA → Blackwell tcgen05), and a kernel written
against the previous generation's abstraction silently keeps its absolute speed while losing
its relative one. That is the actual lesson an SA needs when a partner asks "why is my custom
kernel slow on the new GPUs?"
% of FP32 cuBLAS (e.g. 194% @ 512, 185% @ 8192) compares FP16-on-Tensor-Cores WMMA against
cublasSgemmFP32 on CUDA cores — precision-mismatched, not a Tensor Core ceiling; a>100%row reflects that mismatch (plus small-size launch overhead), not a kernel beating cuBLAS. % of cuBLAS-TC compares againstcublas_tc(cublasGemmEx, FP16 in / FP32 accumulate) insrc/cublas_tc.cu— same precision, same timing methodology (FP16 cast staged once outside the timed loop; cuBLAS handle created once). Against this honest ceiling the optimized WMMA lands at ~44% at large sizes. The remaining gap is not TMA or warp specialization — nsys shows cuBLAS dispatchescutlass_80_tensorop_s16816gemm_f16_128x64, an Ampere-generation (sm_80) kernel that usescp.asyncmultistage pipelining, not TMA and not warp specialization (both are sm_90/Hopper+ features). The gap comes from cuBLAS's larger 128×64 CTA tile, deeper register-level warp/thread tiling, vectorized loads, swizzle/rasterization, and more aggressive multistagecp.asyncpipelining — the kernel-name string itself (cutlass_80_*) is the evidence. This is consistent with Boehm's published GEMM ablation, where 2D blocktiling + vectorized loads + warptiling reaches ~94% of cuBLAS without TMA or warp specialization.
The H100 result above says the WMMA API cannot follow Hopper's ceiling. That conclusion
rested on literature + ncu evidence; this closes it constructively. src/cutlass_sm90.cu
builds the same FP16-in/FP32-accumulate GEMM with CUTLASS 3.x's CollectiveBuilder
(TMA loads + wgmma warpgroup MMA, warp-specialized cooperative schedule), added as kernel
row cutlass_sm90 (build: cmake -DCMAKE_CUDA_ARCHITECTURES=90a -DCUTLASS_DIR=…; raw rows
in results/bench_sm90a.csv, same one-idle-H100 methodology):
| size | wmma | cutlass_sm90 | cublas_tc (nvjet) | cutlass % of cuBLAS-TC | wmma % |
|---|---|---|---|---|---|
| 2048 | 56.0 | 373.9 | 540.9 | 69.1% | 10.4% |
| 4096 | 58.9 | 552.7 | 726.4 | 76.1% | 8.1% |
| 8192 | 61.0 | 640.9 | 749.9 | 85.5% | 8.1% |
Switching instruction class (mma.sync → wgmma) recovers 10.5× at 8192³ — from 8% to
85.5% of the cuBLAS-TC ceiling — with identical numerical error (0.0112 max abs, FP16 input
rounding). The ncu --set full signature confirms the kernel actually crossed regimes
(results/ncu_cutlass_8192.txt):
| metric (ncu, 8192³) | wmma (ours) | cutlass_sm90 (ours) | nvjet (cuBLAS-TC) |
|---|---|---|---|
| SM compute throughput | 26.7% | 74.4% | 91.9% |
| Achieved occupancy | 49.4% | 14.1% | 14.8% |
| Registers / thread | 64 | 168 | 168 |
| Top stall reason | MIO queue full (memory feed) | CTA barrier (warpgroup sync) | WARPGROUP.ARRIVES |
The CUTLASS kernel inherits nvjet's shape: low occupancy, 168 registers/thread, stalls on
warpgroup synchronization instead of operand feeding. The remaining 14.5% to nvjet is
tile-shape/cluster/rasterization autotuning (nvjet picks 320×128 tiles vs our fixed 128×256) —
a tuning gap, not an instruction-class gap. Also note the reversal this documents: warp
specialization, a measured negative on sm_120 (experiments/wmma_ws_probe.cu), is
mandatory on Hopper.
What do the leading open-source kernels actually achieve on this H100, run with their own
benchmarks, unmodified? (scripts/run_reference_benches.sh; raw logs + clock state in
results/reference/; parsed by analysis/reference_summary.py)
| kernel | measured TFLOPS | % of H100 peak | published | measured / published |
|---|---|---|---|---|
| DeepGEMM FP8 (DeepSeek) | 1523 | 77% | 1358 (H800) | 112% |
| DeepGEMM BF16 | 830 | 84% | — | — |
| ThunderKittens FP8 | 1465 | 74% | ~1500 | 98% |
| ThunderKittens BF16 | 775 | 78% | ~760 | 102% |
| ThunderKittens FP8 (per-block scaled) | 985 | 50% | — | — |
| FlashAttention-3 FP16 fwd | 757 | 77% | 740 | 102% |
| FlashAttention-2 fwd (same file), context | 388 | 39% | — | — |
| cuDNN attention fwd (same file), context | 689 | 70% | — | — |
| cuBLAS FP16-TC (nvjet), context | 750 | 76% | — | — |
| this repo's WMMA kernel, context | 61 | 6% | — | — |
| this repo's CUTLASS wgmma kernel, context | 641 | 65% | — | — |
Two readings: (1) the published numbers are honest — every project reproduces within ~±10% on our shared box at stock clocks, so the repo's own gaps are real, not environmental; (2) the FP8 ceiling (~1500 TFLOPS measured) is the target Phase 3's own FP8 kernel work gets measured against.
One published claim that does not reproduce: Stream-K's "up to 6.7×"
(arXiv:2301.03598). CUTLASS example 47 on this H100 lands at 0.94×–1.05× vs data-parallel
tiling, even on shapes built to have a 48%-idle final wave (results/reference/streamk.txt).
The 6.7× was measured on A100 (108 SMs) against constructed worst cases; H100's 132 SMs shrink
the relative tail, and the Stream-K reduction overhead eats what remains. Negative
reproductions are reported as such — that is what the reference rows are for.
Every "% of theoretical peak" in this repo divides by an FP16 peak that assumes the workstation
GB202 runs FP16-in/FP32-acc at full rate. The consumer GB202 (RTX 5090) runs it at half
rate, and no public spec exists for the RTX PRO 6000 — so the repo measures it directly: a
register-resident, memory-free mma.sync.m16n8k16 loop (one block per SM, 8 warps, 8 independent
accumulator chains per warp), in two variants that differ only in accumulator type
(src/mma_rate_probe.cu; run: scripts/run_rate_probe.sh; raw data:
results/mma_rate_probe.csv).
| variant | FLOP/SM/clk (from in-kernel clock64()) |
wall-clock TFLOP/s @ 300 W cap |
|---|---|---|
| FP16-in / FP16-acc | 1023.7 | 440.3 |
| FP16-in / FP32-acc | 1023.8 | 426.7 |
Both variants hit the 1024 FLOP/SM/clk spec rate exactly: FP32 accumulation is full-rate on
the RTX PRO 6000 (per-clock ratio 1.000), unlike the consumer RTX 5090. The only cost of the
wider accumulator is power — under the 300 W Max-Q cap the FP32-acc variant sustains ~3% lower
clocks. This pins the peak denominators used throughout: FP16 dense ≈ 440 TFLOP/s at the
sustained power-capped clock, FP8 ≈ 880, MXFP4 ≈ 1761 — so mma_warptile (243.2) = 55.2%
of peak, mma_fp8 (501.6) = 57.0%, mma_mxfp4 (988.3) = 56.1%, cuBLAS-TC (225.75) = 51.3%.
- NVIDIA CUTLASS — the production reference for Tensor Core GEMM.
- WMMA API (CUDA C++ Programming Guide) — the API used by
gemm_wmma.cu. - Simon Boehm, "How to Optimize a CUDA Matmul Kernel" — quantified step-by-step ablation: 2D blocktiling alone reaches 68.7% of cuBLAS, +vectorized loads 78.4%, +warptiling 93.7%, all without TMA or warp specialization.
- CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization (Bauer et al., SC'11) — origin of warp specialization; it pays off only with large tiles + async memory hardware + register reallocation.
- CUTLASS Efficient GEMM docs — CTA/warp/thread tiling, multistage pipelining, and kernel naming.
Personal project for learning and benchmarking. Views and results are my own and do not represent any employer.







