FAME v0: MiniMax M2.7 attention kernel profiling#104

Open
vedularaghu wants to merge 5 commits into main from vedula/fame-v0-attention-profile

@vedularaghu
Contributor

Summary

  • Adds FAME (Fireworks Attention/MoE) v0 profiling framework for MiniMax M2.7 single-layer attention
  • End-to-end Jupyter notebook that orchestrates nsys trace collection, ncu kernel profiling, and roofline report generation via Docker
  • Includes profiling results for prefill (seq_len=4096) and decode (past_kv=4096) on B200 GPU

What's included

| Path | Description |
| --- | --- |
| fame/v0/configs/profile.yaml | Profiling configuration |
| fame/v0/load_layer.py | Builds one transformer layer from HF weights |
| fame/v0/run_attention.py | Runs prefill/decode with NVTX instrumentation |
| fame/v0/fame_v0.ipynb | Main notebook: nsys/ncu orchestration + report |
| fame/v0/workspace/*_nsys.json | Per-kernel nsys timing results |
| fame/v0/workspace/*_ncu.{csv,json} | ncu metrics (DRAM bytes, durations) |
| fame/v0/workspace/report.{md,json} | Final roofline report |

Notes

  • On B200, the sm__ops_path_tensor FLOPs counters return 0, so the report uses analytical FLOPs computed from model dimensions instead
  • ncu must be run as root inside the Docker container because RmProfilingAdminOnly=1 is set on the host
  • Kernel-to-NVTX mapping uses a launch-overlap heuristic to handle async GPU kernel dispatch
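
For reviewers unfamiliar with the mapping problem: a kernel launched inside an NVTX range can execute on the GPU after the range has already closed on the CPU, so matching on GPU execution time misattributes kernels. The sketch below shows one way a launch-overlap heuristic could work, matching each kernel by the CPU-side launch interval instead. It is an illustration only, not the code in the notebook; the `NvtxRange` and `KernelLaunch` types, their field names, and the range names in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class NvtxRange:
    """CPU-side NVTX range (hypothetical record shape)."""
    name: str
    start_ns: int
    end_ns: int


@dataclass(frozen=True)
class KernelLaunch:
    """One kernel with its CPU launch interval and GPU execution interval."""
    kernel_name: str
    launch_start_ns: int  # CPU timestamp when the launch API call began
    launch_end_ns: int    # CPU timestamp when the launch API call returned
    gpu_start_ns: int     # GPU execution start (may be after the range closed)
    gpu_end_ns: int


def overlap_ns(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two intervals, or 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def map_kernels_to_ranges(
    ranges: Sequence[NvtxRange], launches: Sequence[KernelLaunch]
) -> Dict[str, List[KernelLaunch]]:
    """Assign each kernel to the range that overlaps its launch call the most.

    Kernels whose launch overlaps no range are left unassigned.
    """
    mapping: Dict[str, List[KernelLaunch]] = {r.name: [] for r in ranges}
    for kernel in launches:
        best_range = None
        best_overlap = 0
        for r in ranges:
            ov = overlap_ns(
                kernel.launch_start_ns, kernel.launch_end_ns, r.start_ns, r.end_ns
            )
            if ov > best_overlap:
                best_range, best_overlap = r, ov
        if best_range is not None:
            mapping[best_range.name].append(kernel)
    return mapping


# Usage with made-up timestamps: kernel "gemm" is launched inside "q_proj"
# but only runs on the GPU while "attn" is open on the CPU.
example_ranges = [NvtxRange("q_proj", 0, 100), NvtxRange("attn", 100, 200)]
example_launches = [
    KernelLaunch("gemm", 10, 20, 150, 300),
    KernelLaunch("flash", 120, 130, 300, 400),
    KernelLaunch("stray", 250, 260, 400, 410),
]
example_mapping = map_kernels_to_ranges(example_ranges, example_launches)
```

In this example `gemm` is attributed to `q_proj` even though it executes during `attn`, which is the behavior async dispatch requires; `stray` is launched outside every range and stays unassigned.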

Test plan

  • Notebook cells 1-13 execute end-to-end in Docker container
  • nsys Phase A produces per-kernel timing for all 7 attention sub-kernels
  • ncu Phase B collects DRAM bytes and kernel durations
  • Phase C report shows analytical FLOPs, measured bytes, and arithmetic intensity
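
To make the Phase C quantities concrete, here is a minimal sketch of how analytical FLOPs and arithmetic intensity can be computed for the attention core (the QK^T and PV matmuls). It assumes the usual 2·m·n·k count per matmul, ignores softmax and any causal-mask savings, and is not taken from the notebook. All function names are hypothetical, and the dimensions in the usage lines are toy values, not MiniMax M2.7's actual configuration.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul, counting multiply and add."""
    return 2 * m * n * k


def attention_core_flops(
    batch: int, num_heads: int, q_len: int, kv_len: int, head_dim: int
) -> int:
    """Analytical FLOPs for QK^T plus PV, summed over batch and heads.

    Assumption: no causal-mask discount and no softmax cost.
    """
    qk = matmul_flops(q_len, kv_len, head_dim)  # scores: (q_len x kv_len)
    pv = matmul_flops(q_len, head_dim, kv_len)  # output: (q_len x head_dim)
    return batch * num_heads * (qk + pv)


def arithmetic_intensity(flops: float, dram_bytes: float) -> float:
    """FLOPs per DRAM byte; dram_bytes would come from the ncu measurement."""
    if dram_bytes <= 0:
        raise ValueError("dram_bytes must be positive")
    return flops / dram_bytes


def roofline_bound(
    intensity: float, peak_flops_per_s: float, peak_bytes_per_s: float
) -> float:
    """Attainable FLOP/s under the roofline model for a given intensity."""
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


# Toy usage (dimensions and peaks are placeholders, not hardware specs).
toy_flops = attention_core_flops(batch=1, num_heads=2, q_len=4, kv_len=4, head_dim=8)
toy_intensity = arithmetic_intensity(toy_flops, dram_bytes=512)
toy_bound = roofline_bound(toy_intensity, peak_flops_per_s=100.0, peak_bytes_per_s=10.0)
```

Under this formulation, prefill corresponds to `q_len == kv_len` and decode to `q_len == 1` with `kv_len` set by the cached context, which is why decode typically lands much further into the memory-bound region of the roofline.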

Made with Cursor

vedularaghu and others added 5 commits April 17, 2026 14:00
- profile.yaml: profiling config for MiniMax M2.7 attention block
- load_layer.py: builds one transformer layer from HF weights
- run_attention.py: runs prefill/decode with NVTX instrumentation

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end Jupyter notebook that orchestrates:
- Docker-based nsys trace collection (Phase A)
- ncu kernel profiling for FLOPs/bytes (Phase B)
- Roofline report generation (Phase C)

Co-authored-by: Cursor <cursoragent@cursor.com>
Per-NVTX-range kernel timing data for prefill (seq_len=4096) and
decode (past_kv=4096) in both single-stream and multi-stream modes.

Co-authored-by: Cursor <cursoragent@cursor.com>
- ncu CSV/JSON for prefill and decode kernel metrics
- B200 sm__ops_path_tensor counters return 0; use analytical FLOPs
- Kernel-to-NVTX mapping fixed for async GPU kernel launches

Co-authored-by: Cursor <cursoragent@cursor.com>
Final report with analytical FLOPs, measured DRAM bytes, and
ncu kernel durations for all attention sub-kernels on B200.

Co-authored-by: Cursor <cursoragent@cursor.com>