Make Triton benchmark roofline GPU-aware by peter941221 · Pull Request #5 · psmarter/mini-infer

peter941221 · 2026-05-27T04:06:20Z

This removes a hard-coded RTX 4090 assumption from the Triton decode-attention benchmark.

Failure mechanism:

benchmarks/benchmark_triton.py hard-coded RTX 4090 roofline numbers into roofline_analysis()
on other GPUs, especially Blackwell-class consumer GPUs like RTX 5090, the reported ridge point and bandwidth-bound latency are still derived from RTX 4090 specs, so they are inaccurate even when the benchmark itself runs correctly
that weakens the benchmark's explanatory value for anyone validating kernels on newer hardware
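
To make the failure concrete, here is a minimal sketch of the two roofline quantities named above. The `Roofline` class and its field names are hypothetical, not the code in benchmarks/benchmark_triton.py, and the numbers are round illustrative values rather than the specs of any real GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roofline:
    """Hypothetical container for the two hardware numbers a roofline needs."""

    peak_tflops: float    # peak FP16 throughput in TFLOP/s
    bandwidth_gbs: float  # peak memory bandwidth in GB/s

    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) where the compute and memory roofs meet."""
        return (self.peak_tflops * 1e12) / (self.bandwidth_gbs * 1e9)

    def bandwidth_bound_latency_us(self, bytes_moved: float) -> float:
        """Lower bound on latency (microseconds) if the kernel only moved bytes."""
        return bytes_moved / (self.bandwidth_gbs * 1e9) * 1e6


if __name__ == "__main__":
    # Illustrative values only: two made-up devices, one kernel moving 64 MiB.
    assumed = Roofline(peak_tflops=100.0, bandwidth_gbs=1000.0)
    actual = Roofline(peak_tflops=200.0, bandwidth_gbs=2000.0)
    kv_bytes = 64 * 1024 * 1024
    for name, hw in (("assumed", assumed), ("actual", actual)):
        print(name, hw.ridge_point(), hw.bandwidth_bound_latency_us(kv_bytes))
```

Because both quantities depend directly on the hardware numbers, analysing a measurement taken on one device with another device's constants shifts the reported bound even though the measured latency is correct.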

Semantic change:

add known roofline presets for RTX 4090 and RTX 5090
resolve the preset from torch.cuda.get_device_name(0)
allow manual CLI overrides for FP16 TFLOPS and bandwidth when running on an unrecognized GPU or using a custom assumption set
add a CPU-safe test covering preset selection, override behavior, and roofline-analysis output wiring
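
The resolution order described above could look roughly like the sketch below. Everything here is an assumption made for illustration: the names `RooflineSpec`, `PRESETS`, `resolve_roofline`, and `build_parser`, the flag names `--fp16-tflops` and `--bandwidth-gbs`, the choice to fall back to the RTX 4090 preset, and the preset numbers, which are placeholders and not the values used in the PR. The device name is passed in as a plain string so the logic can be tested without a GPU; in the benchmark it would come from `torch.cuda.get_device_name(0)`.

```python
import argparse
from typing import NamedTuple, Optional


class RooflineSpec(NamedTuple):
    label: str
    fp16_tflops: float
    bandwidth_gbs: float


# PLACEHOLDER numbers for illustration only; replace with vendor specs.
PRESETS = {
    "rtx 4090": RooflineSpec("RTX 4090", 100.0, 1000.0),
    "rtx 5090": RooflineSpec("RTX 5090", 200.0, 2000.0),
}
FALLBACK_KEY = "rtx 4090"  # assumed fallback, labeled explicitly below


def resolve_roofline(
    device_name: str,
    fp16_tflops: Optional[float] = None,
    bandwidth_gbs: Optional[float] = None,
) -> RooflineSpec:
    """Pick a preset by substring match, then apply any manual overrides."""
    name = device_name.lower()
    spec = next((p for key, p in PRESETS.items() if key in name), None)
    if spec is None:
        base = PRESETS[FALLBACK_KEY]
        # Make the fallback visible instead of silently reusing the numbers.
        spec = base._replace(
            label=f"{device_name} (unrecognized; using {base.label} fallback)"
        )
    if fp16_tflops is not None or bandwidth_gbs is not None:
        spec = RooflineSpec(
            label=f"{spec.label} + manual override",
            fp16_tflops=spec.fp16_tflops if fp16_tflops is None else fp16_tflops,
            bandwidth_gbs=spec.bandwidth_gbs if bandwidth_gbs is None else bandwidth_gbs,
        )
    return spec


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI flags for the manual overrides."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fp16-tflops", type=float, default=None)
    parser.add_argument("--bandwidth-gbs", type=float, default=None)
    return parser
```

Keeping the lookup free of any `torch` import is what makes a CPU-only test straightforward: the test can pass device-name strings directly and assert on the returned label and numbers.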

Preserved invariant:

benchmark execution flow is unchanged
unknown GPUs still run, now with an explicit fallback label instead of silently implying RTX 4090 numbers are universal

Testing:

peter941221 added 2 commits May 27, 2026 12:05

Make Triton roofline GPU-aware

0779927

Keep Triton roofline benchmark backward compatible

7a8cc3f

peter941221 marked this pull request as ready for review May 29, 2026 10:16
