Add callbacks for driver cuGraphLaunch #21

Merged: gnurizen merged 2 commits into main from driver-cugraphlaunch-callbacks on Jun 26, 2026
@gnurizen gnurizen commented Jun 26, 2026

Contributor

Problem

parcagpu subscribed to CUPTI callbacks for eager kernel launches (cuLaunchKernel) and runtime graph launches (cudaGraphLaunch), but not the driver-API graph launch (cuGraphLaunch / cuGraphLaunch_ptsz). setLaunchCallbacks covers only kernel-launch cbids and setRuntimeCallbacks only the runtime domain, so the driver graph-launch cbid was never enabled.

C++ runtimes such as TensorRT-LLM replay CUDA graphs through the driver API. On those workloads the cuda_correlation USDT never fired for graph launches, no CUDA_KERNEL frame was pushed, and the GPU sample pipeline produced nothing in graph mode, while eager launches kept working. This was confirmed on a 4×B200 TRT-LLM node: a trace_pipe capture in graph mode contained zero FRAME_MARKER_CUDA_KERNEL frames.

Fix

Enable only the two driver graph-launch cbids (cuGraphLaunch and cuGraphLaunch_ptsz) in CuptiProfiler::initialize, with symmetric teardown. This deliberately avoids setGraphCallbacks(), which also subscribes to the capture (cuStreamBegin/EndCapture) and graph-resource cbids; those would make callbackHandler emit correlation events for non-executing capture calls that never receive kernel timing. The existing isGraphLaunch handler already had all the logic to process these callbacks; it just never received them.
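As a rough illustration of the selective subscription described above, here is a minimal C sketch. It is not the code from this PR: the helper name `set_driver_graph_launch_callbacks` is hypothetical, and the types and the `cuptiEnableCallback` body are stand-ins so the snippet compiles without the CUDA toolkit (real code would include `<cupti.h>` and call the real function, which has the same parameter order). The cbid values 514 and 515 are assumed from the `-514/-515` mentioned in the test notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for declarations that normally come from <cupti.h>. */
typedef enum { CUPTI_SUCCESS = 0 } CUptiResult;
typedef enum { CUPTI_CB_DOMAIN_DRIVER_API = 1 } CUpti_CallbackDomain;
typedef uint32_t CUpti_CallbackId;
typedef struct CUpti_Subscriber_st *CUpti_SubscriberHandle;

/* Assumed values, inferred from the signed cbids quoted in the PR text. */
enum {
    CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch = 514,
    CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch_ptsz = 515
};

/* Stub that records the last enable state per cbid and counts calls. */
#define STUB_MAX_CBID 1024u
static uint8_t stub_enabled[STUB_MAX_CBID];
static unsigned stub_call_count;

CUptiResult cuptiEnableCallback(uint32_t enable,
                                CUpti_SubscriberHandle subscriber,
                                CUpti_CallbackDomain domain,
                                CUpti_CallbackId cbid)
{
    (void)subscriber;
    assert(domain == CUPTI_CB_DOMAIN_DRIVER_API && cbid < STUB_MAX_CBID);
    stub_enabled[cbid] = enable ? 1 : 0;
    stub_call_count++;
    return CUPTI_SUCCESS;
}

/* Hypothetical helper: enable (1) or disable (0) only the two driver
 * graph-launch cbids, leaving capture and graph-resource cbids untouched. */
CUptiResult set_driver_graph_launch_callbacks(CUpti_SubscriberHandle subscriber,
                                              uint32_t enable)
{
    static const CUpti_CallbackId cbids[] = {
        CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch,
        CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch_ptsz
    };
    for (size_t i = 0; i < sizeof cbids / sizeof cbids[0]; i++) {
        CUptiResult rc = cuptiEnableCallback(enable, subscriber,
                                             CUPTI_CB_DOMAIN_DRIVER_API,
                                             cbids[i]);
        if (rc != CUPTI_SUCCESS)
            return rc; /* stop at the first failure */
    }
    return CUPTI_SUCCESS;
}
```

Calling the helper with `enable = 1` during initialization and `enable = 0` during teardown would give the symmetric behavior the description mentions. Listing the cbids explicitly, rather than enabling a whole domain, is what keeps the capture calls out of the handler.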

Test coverage

The mock harness previously dispatched the shim's callback unconditionally, so a missing subscription was invisible to tests. Changes:

  • mock_cupti.c: cuptiEnableCallback now records the subscribed (domain, cbid) set.
  • test_cupti_prof.c: callbacks are dispatched only if subscribed, mirroring real CUPTI gating.
  • test-pc-mock-graph.sh: asserts that driver cuGraphLaunch correlation events fire (signed cbid -514/-515). The workload splits graph launches 50/50 between runtime and driver, so the runtime half emits events regardless; only this check catches the regression. Wired to a new make test-pc-mock-graph target.
  • graph_repro.cu / graph-repro-real.sh: real-GPU reproducer that replays a captured graph via driver cuGraphLaunch and verifies correlation events fire.

Verified the mock guard goes red without the fix and green with it.
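
The gating idea behind the mock changes can be sketched in a few lines of C. This is an illustrative reconstruction, not the contents of mock_cupti.c or test_cupti_prof.c: every name below (`mock_enable_callback`, `mock_is_subscribed`, `mock_dispatch`, `count_hit`) is hypothetical, and the fixed-size table is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_SUBSCRIPTIONS 64

typedef void (*mock_callback_fn)(uint32_t domain, uint32_t cbid, void *userdata);

struct mock_subscription {
    uint32_t domain;
    uint32_t cbid;
};

/* The recorded (domain, cbid) set. */
static struct mock_subscription g_subs[MOCK_MAX_SUBSCRIPTIONS];
static size_t g_sub_count;

bool mock_is_subscribed(uint32_t domain, uint32_t cbid)
{
    for (size_t i = 0; i < g_sub_count; i++)
        if (g_subs[i].domain == domain && g_subs[i].cbid == cbid)
            return true;
    return false;
}

/* Stand-in for the mock's enable call: enable adds the pair to the set,
 * disable removes it. Returns 0 on success, -1 if the table is full. */
int mock_enable_callback(uint32_t enable, uint32_t domain, uint32_t cbid)
{
    if (enable) {
        if (mock_is_subscribed(domain, cbid))
            return 0;
        if (g_sub_count == MOCK_MAX_SUBSCRIPTIONS)
            return -1;
        g_subs[g_sub_count].domain = domain;
        g_subs[g_sub_count].cbid = cbid;
        g_sub_count++;
        return 0;
    }
    for (size_t i = 0; i < g_sub_count; i++) {
        if (g_subs[i].domain == domain && g_subs[i].cbid == cbid) {
            g_subs[i] = g_subs[--g_sub_count]; /* swap-remove */
            return 0;
        }
    }
    return 0;
}

/* Harness side: invoke the callback only if the pair was subscribed.
 * Returns true when the callback actually ran. */
bool mock_dispatch(mock_callback_fn fn, uint32_t domain, uint32_t cbid,
                   void *userdata)
{
    if (!mock_is_subscribed(domain, cbid))
        return false;
    fn(domain, cbid, userdata);
    return true;
}

/* Example callback: counts invocations through an int passed as userdata. */
void count_hit(uint32_t domain, uint32_t cbid, void *userdata)
{
    (void)domain;
    (void)cbid;
    ++*(int *)userdata;
}
```

With this shape, a dispatch for a cbid that was never enabled is silently dropped, so a missing subscription in the code under test shows up as a missing event rather than being masked by the harness.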

parcagpu subscribed to CUPTI callbacks for eager launches (cuLaunchKernel) and
runtime graph launches (cudaGraphLaunch), but not the driver-API cuGraphLaunch /
cuGraphLaunch_ptsz. C++ runtimes like TensorRT-LLM replay CUDA graphs through the
driver API, so the cuda_correlation USDT never fired in graph mode and no GPU
samples were produced, while eager launches kept working (confirmed on a 4xB200
TRT-LLM node).

Fix: enable the two driver graph-launch cbids in CuptiProfiler::initialize (with
symmetric teardown). Not setGraphCallbacks(), which also subscribes capture cbids
that would emit correlation events for non-executing capture calls.

Tests: the mock harness dispatched callbacks unconditionally, hiding the missing
subscription. mock_cupti now records the subscribed (domain,cbid) set and the
harness only dispatches subscribed callbacks. test-pc-mock-graph (new make
target) asserts driver cuGraphLaunch correlation events fire; graph_repro.cu /
graph-repro-real.sh add a real-GPU reproducer. Verified the guard goes red
without the fix and green with it.
@gnurizen gnurizen force-pushed the driver-cugraphlaunch-callbacks branch from d698a08 to 633dc6b Compare June 26, 2026 13:24
@gnurizen gnurizen requested review from brancz and umanwizard June 26, 2026 13:30
@gnurizen gnurizen merged commit 474be16 into main Jun 26, 2026
3 checks passed