FAME v0: MiniMax M2.7 attention kernel profiling by vedularaghu · Pull Request #104 · fw-ai/benchmark

vedularaghu · 2026-05-01T20:09:02Z

Summary

- Adds the FAME (Fireworks Attention/MoE) v0 profiling framework for single-layer MiniMax M2.7 attention
- Adds an end-to-end Jupyter notebook that orchestrates nsys trace collection, ncu kernel profiling, and roofline report generation via Docker
- Includes profiling results for prefill (seq_len=4096) and decode (past_kv=4096) on a B200 GPU

What's included

| Path | Description |
| --- | --- |
| `fame/v0/configs/profile.yaml` | Profiling configuration |
| `fame/v0/load_layer.py` | Builds one transformer layer from HF weights |
| `fame/v0/run_attention.py` | Runs prefill/decode with NVTX instrumentation |
| `fame/v0/fame_v0.ipynb` | Main notebook: nsys/ncu orchestration + report |
| `fame/v0/workspace/*_nsys.json` | Per-kernel nsys timing results |
| `fame/v0/workspace/*_ncu.{csv,json}` | ncu metrics (DRAM bytes, durations) |
| `fame/v0/workspace/report.{md,json}` | Final roofline report |
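As a rough illustration of what NVTX instrumentation in a script like `run_attention.py` could look like, here is a minimal sketch. It is not the code from this PR: the `nvtx_range` helper, the `RANGE_LOG` list, and the range names are hypothetical. It relies only on `torch.cuda.nvtx.range_push` / `range_pop`, and falls back to a no-op when PyTorch or CUDA is unavailable.

```python
from contextlib import contextmanager

try:
    import torch
    # The NVTX bindings are only usable on a CUDA build with a visible GPU.
    _USE_NVTX = torch.cuda.is_available()
except ImportError:
    torch = None
    _USE_NVTX = False

# Names of the ranges entered, in order. Hypothetical helper state so the
# sketch can be checked on a machine without a GPU.
RANGE_LOG = []


@contextmanager
def nvtx_range(name):
    """Wrap a block in an NVTX range so nsys can group kernels under `name`."""
    RANGE_LOG.append(name)
    if _USE_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _USE_NVTX:
            torch.cuda.nvtx.range_pop()


def run_steps(steps):
    """Run (name, callable) pairs, each inside its own NVTX range."""
    for name, fn in steps:
        with nvtx_range(name):
            fn()


if __name__ == "__main__":
    # Range names below are illustrative only, not the PR's actual sub-kernels.
    run_steps([("qkv_proj", lambda: None), ("attn_core", lambda: None)])
    print(RANGE_LOG)
```

Because kernel launches are asynchronous, the CPU-side range can close before its kernels finish on the GPU, which is why a separate mapping step is needed when attributing kernels to ranges.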

Notes

- On B200, the `sm__ops_path_tensor` FLOPs counters return 0, so the report uses analytical FLOPs computed from model dimensions instead
- `ncu` must run as root inside the Docker container because the host sets `RmProfilingAdminOnly=1`
- Kernel-to-NVTX mapping uses a launch-overlap heuristic to handle asynchronous GPU kernel dispatch
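The PR does not spell out the launch-overlap heuristic, so the following is only one plausible reading of it: because kernels execute asynchronously, each kernel is attributed to the NVTX range that overlaps most with the CPU-side launch call rather than with the GPU execution window. The record types, field names, and timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NvtxRange:
    name: str
    start_ns: int  # CPU timestamp when the range was pushed
    end_ns: int    # CPU timestamp when the range was popped


@dataclass
class KernelLaunch:
    kernel: str
    launch_start_ns: int  # CPU timestamp when the launch API call began
    launch_end_ns: int    # CPU timestamp when the launch API call returned


def overlap_ns(a0, a1, b0, b1):
    """Length of the intersection of [a0, a1] and [b0, b1], or 0 if disjoint."""
    return max(0, min(a1, b1) - max(a0, b0))


def map_kernels_to_ranges(launches, ranges):
    """Return (kernel, range_name or None) pairs using maximum launch overlap."""
    mapping = []
    for k in launches:
        best_name, best_overlap = None, 0
        for r in ranges:
            ov = overlap_ns(k.launch_start_ns, k.launch_end_ns,
                            r.start_ns, r.end_ns)
            if ov > best_overlap:
                best_name, best_overlap = r.name, ov
        mapping.append((k.kernel, best_name))
    return mapping


if __name__ == "__main__":
    ranges = [NvtxRange("qkv_proj", 0, 100), NvtxRange("attn_core", 100, 200)]
    launches = [
        KernelLaunch("gemm_kernel", 10, 20),      # fully inside qkv_proj
        KernelLaunch("flash_kernel", 95, 110),    # straddles both ranges
        KernelLaunch("stray_kernel", 300, 310),   # outside every range
    ]
    for kernel, name in map_kernels_to_ranges(launches, ranges):
        print(kernel, "->", name)
```

Matching on launch time instead of execution time keeps a kernel with its originating range even when the GPU runs it after the range has already closed on the CPU.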

Test plan

- Notebook cells 1-13 execute end-to-end in the Docker container
- nsys Phase A produces per-kernel timing for all 7 attention sub-kernels
- ncu Phase B collects DRAM bytes and kernel durations
- Phase C report shows analytical FLOPs, measured DRAM bytes, and arithmetic intensity
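To make the Phase C quantities concrete, here is a minimal sketch of how analytical FLOPs, measured bytes, and arithmetic intensity relate in a roofline model. It is not the report's actual code: the function names are hypothetical, the FLOPs formula covers only the non-causal QK^T and PV matmuls (2 FLOPs per multiply-add), and the peak compute and bandwidth values are placeholders rather than B200 specifications.

```python
def attention_core_flops(n_heads, q_len, kv_len, head_dim):
    """Analytical FLOPs for QK^T plus PV, ignoring softmax and causal masking."""
    qk = 2 * n_heads * q_len * kv_len * head_dim  # scores = Q @ K^T
    pv = 2 * n_heads * q_len * kv_len * head_dim  # out = softmax(scores) @ V
    return qk + pv


def arithmetic_intensity(flops, dram_bytes):
    """FLOPs per byte of DRAM traffic (bytes would come from ncu in Phase B)."""
    return flops / dram_bytes


def attainable_flops(intensity, peak_flops, peak_bw):
    """Roofline bound: min(compute roof, intensity * memory bandwidth)."""
    return min(peak_flops, intensity * peak_bw)


def is_memory_bound(intensity, peak_flops, peak_bw):
    """A kernel left of the ridge point (peak_flops / peak_bw) is memory bound."""
    return intensity < peak_flops / peak_bw


if __name__ == "__main__":
    # Toy dimensions and placeholder hardware limits, chosen for easy arithmetic.
    flops = attention_core_flops(n_heads=2, q_len=4, kv_len=4, head_dim=8)
    ai = arithmetic_intensity(flops, dram_bytes=256)
    print(flops, ai, attainable_flops(ai, peak_flops=100.0, peak_bw=10.0))
```

For decode, `q_len` is 1 while `kv_len` is the past KV length (4096 in this PR), so the intensity of the attention core is typically far lower than in prefill.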

Made with Cursor

- profile.yaml: profiling config for MiniMax M2.7 attention block
- load_layer.py: builds one transformer layer from HF weights
- run_attention.py: runs prefill/decode with NVTX instrumentation

Co-authored-by: Cursor <cursoragent@cursor.com>

vedularaghu and others added 5 commits April 17, 2026 14:00

Add FAME v0 notebook with nsys/ncu profiling pipeline

a80d889

End-to-end Jupyter notebook that orchestrates:
- Docker-based nsys trace collection (Phase A)
- ncu kernel profiling for FLOPs/bytes (Phase B)
- Roofline report generation (Phase C)

Co-authored-by: Cursor <cursoragent@cursor.com>

Add nsys Phase A results for MiniMax M2.7 attention

c873361

Per-NVTX-range kernel timing data for prefill (seq_len=4096) and decode (past_kv=4096) in both single-stream and multi-stream modes.

Co-authored-by: Cursor <cursoragent@cursor.com>

Add ncu Phase B results and fix kernel mapping for B200

4d4fcf2

- ncu CSV/JSON for prefill and decode kernel metrics
- B200 sm__ops_path_tensor counters return 0; use analytical FLOPs
- Kernel-to-NVTX mapping fixed for async GPU kernel launches

Co-authored-by: Cursor <cursoragent@cursor.com>

Add combined roofline report for MiniMax M2.7 attention

72212f4

Final report with analytical FLOPs, measured DRAM bytes, and ncu kernel durations for all attention sub-kernels on B200.

Co-authored-by: Cursor <cursoragent@cursor.com>

vedularaghu wants to merge 5 commits into `main` from `vedula/fame-v0-attention-profile`
