Add CUDA graph kernel annotations tutorial #3915

Draft
yushangdi wants to merge 10 commits into main from cudagraph_annotation

@yushangdi
Contributor

This tutorial demonstrates how to use CUDA graph kernel annotations for semantic profiling traces with custom visualization lanes.

Features:

  • End-to-end workflow from graph capture to visualization
  • Transformer block example with annotated regions
  • Post-processing to merge annotations into profiler traces
  • Custom stream assignments for semantic organization
  • Version checking for cuda-bindings compatibility
  • Clear error messages with upgrade instructions

The tutorial includes:

  • mark_kernels() context manager usage
  • Graph capture with enable_annotations=True
  • Profiling and trace post-processing
  • Before/after comparison
  • Troubleshooting guide

Fixes #ISSUE_NUMBER

Description

Checklist

  • The issue that is being fixed is referred to in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request.

@pytorch-bot

pytorch-bot Bot commented Jun 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3915

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 72d4e71 with merge base cdc645a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the cla signed label Jun 2, 2026
This tutorial demonstrates how to use CUDA graph kernel annotations
for semantic profiling traces with custom visualization lanes.

Features:
- End-to-end workflow from graph capture to visualization
- Transformer block example with annotated regions
- Post-processing to merge annotations into profiler traces
- Custom stream assignments for semantic organization
- Version checking for cuda-bindings compatibility
- Clear error messages with upgrade instructions

The tutorial includes:
- mark_kernels() context manager usage
- Graph capture with enable_annotations=True
- Profiling and trace post-processing
- Before/after comparison
- Troubleshooting guide

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@yushangdi yushangdi force-pushed the cudagraph_annotation branch from 4a6f9d9 to c39bac5 on June 2, 2026 20:32
yushangdi and others added 8 commits June 2, 2026 20:33
This is required for the CUDA graph annotations tutorial to work
with full annotation support. The cudaGraphNodeGetToolsId API was
added in cuda-bindings 13.3.0.
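
For readers who want to check their environment by hand, the version requirement above can be sketched as follows. This is a minimal illustration, not code from the PR: the helper names `parse_version`, `meets_minimum`, and `cuda_bindings_status` are hypothetical, and the only fact taken from this PR is the 13.3.0 minimum. It uses only the standard library's `importlib.metadata`, and it treats pre-release suffixes such as `rc1` as equal to the final release.

```python
from importlib import metadata

# Minimum version named in this PR for cudaGraphNodeGetToolsId support.
MIN_CUDA_BINDINGS = (13, 3, 0)


def parse_version(text):
    """Turn '13.3.0' or '13.3.0.post1' into a tuple of leading integer parts."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_CUDA_BINDINGS):
    """Compare after zero-padding so that '13.3' counts as '13.3.0'."""
    got = parse_version(installed)
    want = tuple(minimum)
    width = max(len(got), len(want))
    got += (0,) * (width - len(got))
    want += (0,) * (width - len(want))
    return got >= want


def cuda_bindings_status():
    """Hypothetical helper: describe whether cuda-bindings is new enough."""
    try:
        installed = metadata.version("cuda-bindings")
    except metadata.PackageNotFoundError:
        return "cuda-bindings is not installed; try: pip install 'cuda-bindings>=13.3.0'"
    if meets_minimum(installed):
        return f"cuda-bindings {installed} is new enough"
    return f"cuda-bindings {installed} is too old; try: pip install -U 'cuda-bindings>=13.3.0'"


if __name__ == "__main__":
    print(cuda_bindings_status())
```

A standalone check like this only reports the installed package version; it does not confirm that the annotation API itself is usable at runtime.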

- Removed check_cuda_bindings_version() function since PyTorch core
  now provides the warning via _probe_tools_id()
- Updated PyTorch requirement from 2.0+ to 2.13+ (required for
  the annotation APIs used in this tutorial)
- Simplified error messaging to reference PyTorch's built-in warnings

Changed the overview to emphasize:
- Ability to add semantic labels to kernels
- Understanding what each kernel does during profiling
- Labeling and organizing kernels by function

Rather than focusing on splitting kernels across streams,
the overview now centers on the annotation feature itself.

Updated the prerequisites card at the top to show PyTorch 2.12+
(was still showing 2.0). Also updated cuda-python to cuda-bindings
for consistency.

Added chrome://tracing screenshots showing:
- Before: All 65 kernels on single stream with auto-generated names
- After: Kernels organized into semantic lanes (streams 61, 62)
  with meaningful labels (attention, mlp)

Screenshots demonstrate the value of kernel annotations for
understanding execution structure and identifying components.
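
The before/after difference described above can be approximated with a small post-processing pass over a Chrome-trace JSON file. The sketch below is an illustration under stated assumptions, not the tutorial's implementation: the function `merge_annotations`, the `ANNOTATIONS` table, and the `"graph node id"` args key are all hypothetical, and the sample events are made up. Only the lane numbers (61, 62) and labels (attention, mlp) come from the commit message.

```python
import json

# Hypothetical annotation table: graph-node id -> semantic label.
# Real annotations would come from the captured graph; these are invented.
ANNOTATIONS = {101: "attention", 102: "attention", 103: "mlp"}

# Target lane (tid) per label; 61 and 62 mirror the streams in the screenshots.
LANES = {"attention": 61, "mlp": 62}


def merge_annotations(trace, annotations=ANNOTATIONS, lanes=LANES,
                      id_key="graph node id"):
    """Return a copy of a Chrome-trace dict with annotated kernels relabeled.

    `id_key` is an assumed name for the args field identifying the graph node.
    """
    merged = dict(trace)
    events = []
    for event in trace.get("traceEvents", []):
        event = dict(event)  # shallow copy so the input trace is not mutated
        node_id = event.get("args", {}).get(id_key)
        label = annotations.get(node_id)
        if event.get("cat") == "kernel" and label is not None:
            # Preserve the auto-generated kernel name in args.
            event["args"] = {**event["args"], "kernel": event["name"]}
            event["name"] = f"{label}: {event['name']}"
            event["tid"] = lanes[label]  # move the event into its semantic lane
        events.append(event)
    merged["traceEvents"] = events
    return merged


# Made-up sample trace: two kernels on one stream plus one CPU op.
trace = {"traceEvents": [
    {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "pid": 0, "tid": 7,
     "ts": 0, "dur": 5, "args": {"graph node id": 101}},
    {"ph": "X", "cat": "kernel", "name": "elementwise_kernel", "pid": 0, "tid": 7,
     "ts": 5, "dur": 2, "args": {"graph node id": 103}},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "pid": 1, "tid": 1,
     "ts": 0, "dur": 9},
]}
result = merge_annotations(trace)
print(json.dumps(result["traceEvents"][0], indent=2))
```

Events without a matching annotation, including the CPU op here, pass through unchanged, so the merged file should still load in chrome://tracing like the original.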

Move `if __name__ == "__main__": main()` to immediately after the
main() function definition (line ~404) so it executes during the
Sphinx Gallery build process.

Sphinx Gallery requires the execution guard to be positioned right
after the function definition, not at the end of the file, to properly
capture and execute the tutorial code during documentation generation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
matplotlib
librosa
torch==2.12
cuda-bindings>=13.3.0 # Required for CUDA graph annotations tutorial

You should be able to use an earlier version?

These files were accidentally included in the previous commit:
- traces/ directory (both root and advanced_source/)
- Screenshot PNG files
- CUDA_GRAPH_TUTORIAL_README.md

These are build artifacts and temporary files that should not be
committed to the repository.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>