
Make Triton benchmark roofline GPU-aware #5

Open
peter941221 wants to merge 2 commits into psmarter:main from peter941221:feat/benchmark-gpu-aware-roofline

Conversation

@peter941221


This removes a hard-coded RTX 4090 assumption from the Triton decode-attention benchmark.

Failure mechanism:

  • benchmarks/benchmark_triton.py hard-coded RTX 4090 roofline numbers into roofline_analysis()
  • on other GPUs, especially Blackwell-class consumer GPUs like the RTX 5090, the reported ridge point and bandwidth-bound latency are derived from RTX 4090 peak figures and are therefore inaccurate, even when the benchmark itself runs correctly
  • that weakens the benchmark's explanatory value for anyone validating kernels on newer hardware
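
To make the failure concrete, here is a minimal sketch of the two roofline quantities mentioned above. The helper names `ridge_point` and `bandwidth_bound_latency_us` are hypothetical and are not taken from `benchmarks/benchmark_triton.py`; the inputs in the usage lines are round illustrative numbers, not measured or vendor figures.

```python
def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where the compute roof meets the memory roof."""
    peak_flops_per_s = peak_tflops * 1e12
    bytes_per_s = bandwidth_gbs * 1e9
    return peak_flops_per_s / bytes_per_s


def bandwidth_bound_latency_us(bytes_moved: float, bandwidth_gbs: float) -> float:
    """Lower bound on latency, in microseconds, if the kernel is limited only by memory traffic."""
    bytes_per_s = bandwidth_gbs * 1e9
    return bytes_moved / bytes_per_s * 1e6


# Illustrative inputs only: the same kernel traffic evaluated against two
# different bandwidth assumptions gives two different latency floors.
traffic_bytes = 1e9
floor_slow = bandwidth_bound_latency_us(traffic_bytes, bandwidth_gbs=1000.0)
floor_fast = bandwidth_bound_latency_us(traffic_bytes, bandwidth_gbs=2000.0)
```

Both quantities depend directly on the assumed peak throughput and bandwidth, so constants chosen for one GPU shift the ridge point and the latency floor when the benchmark runs on another.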

Semantic change:

  • add known roofline presets for RTX 4090 and RTX 5090
  • resolve the preset from torch.cuda.get_device_name(0)
  • allow manual CLI overrides for FP16 TFLOPS and bandwidth when running on an unrecognized GPU or when using a custom set of assumptions
  • add a CPU-safe test covering preset selection, override behavior, and roofline-analysis output wiring
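
The preset lookup and override behavior listed above could look roughly like the sketch below. This is not the code in the PR: `RooflineSpec`, `PRESETS`, `FALLBACK`, `resolve_roofline_spec`, `parse_overrides`, and the `--fp16-tflops` / `--bandwidth-gbs` flag names are all hypothetical, the throughput and bandwidth figures are placeholders, and reusing one default spec under a fallback label is an assumption about how unknown GPUs are handled. The resolver takes the device name as a plain string (in the benchmark it would come from `torch.cuda.get_device_name(0)`), which keeps it testable without a GPU.

```python
import argparse
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class RooflineSpec:
    label: str
    fp16_tflops: float
    bandwidth_gbs: float


# Placeholder figures for illustration; not the values used in the PR.
PRESETS = {
    "RTX 4090": RooflineSpec("RTX 4090", fp16_tflops=165.0, bandwidth_gbs=1008.0),
    "RTX 5090": RooflineSpec("RTX 5090", fp16_tflops=210.0, bandwidth_gbs=1792.0),
}

# Assumption: unknown GPUs reuse a default spec but carry an explicit label.
FALLBACK = RooflineSpec("unknown GPU (fallback assumptions)", 165.0, 1008.0)


def resolve_roofline_spec(
    device_name: str,
    fp16_tflops: Optional[float] = None,
    bandwidth_gbs: Optional[float] = None,
) -> RooflineSpec:
    """Pick a preset by substring match on the device name, then apply overrides."""
    spec = next((p for key, p in PRESETS.items() if key in device_name), FALLBACK)
    if fp16_tflops is not None:
        spec = replace(spec, fp16_tflops=fp16_tflops)
    if bandwidth_gbs is not None:
        spec = replace(spec, bandwidth_gbs=bandwidth_gbs)
    return spec


def parse_overrides(argv: Sequence[str]) -> argparse.Namespace:
    """Parse the (hypothetical) override flags; both default to None, meaning 'use the preset'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fp16-tflops", type=float, default=None)
    parser.add_argument("--bandwidth-gbs", type=float, default=None)
    return parser.parse_args(argv)


# Usage: an unrecognized device name plus a manual bandwidth override.
args = parse_overrides(["--bandwidth-gbs", "900"])
spec = resolve_roofline_spec("Some Other GPU", args.fp16_tflops, args.bandwidth_gbs)
```

Separating name resolution from the CUDA query means preset selection and overrides can be checked on a CPU-only machine.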

Preserved invariant:

  • benchmark execution flow is unchanged
  • unknown GPUs still run, now with an explicit fallback label instead of silently implying RTX 4090 numbers are universal

Testing:

  • pytest tests/test_benchmark_triton_roofline.py -q

@peter941221 peter941221 marked this pull request as ready for review May 29, 2026 10:16
