Conversation
- Add hipGraphGetRootNodes to the ROCm driver wrapper - Add empty traced graph detection in RocmCommandBuffer::Trace() to prevent crashes when a captured stream produces no operations, matching the existing CUDA guard - Implement PrepareFinalization to insert an empty node into empty body graphs, preventing potential issues when graph conditionals are supported on HIP in the future - Add four new platform-agnostic test cases for CreateEmptyNode: EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and EmptyNodeAsDependencyBarrier
| TEST(GpuCommandBufferTest, EmptyNodeOnly) { | ||
| Platform* platform = GpuPlatform(); | ||
| StreamExecutor* executor = platform->ExecutorForDevice(0).value(); | ||
|
|
||
| TF_ASSERT_OK_AND_ASSIGN(auto stream, executor->CreateStream()); | ||
|
|
||
| TF_ASSERT_OK_AND_ASSIGN(auto cmd_buffer, | ||
| executor->CreateCommandBuffer(primary)); | ||
| TF_ASSERT_OK_AND_ASSIGN(auto* empty_cmd, cmd_buffer->CreateEmptyCmd({})); | ||
| (void)empty_cmd; | ||
| TF_ASSERT_OK(cmd_buffer->Finalize()); | ||
|
|
||
| TF_ASSERT_OK(cmd_buffer->Submit(stream.get())); | ||
| TF_ASSERT_OK(stream->BlockHostUntilDone()); | ||
| } |
There was a problem hiding this comment.
Misleading test comment: The comment says this test "exercises PrepareFinalization on HIP," but it doesn't actually hit the empty-graph insertion path. CreateEmptyCmd({}) adds one node, so by the time Finalize() → PrepareFinalization() runs, GetNodeCount() returns 1 and the method early-returns at the node_count > 0 check.
To truly exercise the PrepareFinalization empty-graph path, you'd need a command buffer with zero nodes at finalization time (no CreateEmptyCmd call). Consider either:
- Updating the comment to reflect what this test actually covers (that a single empty node can be finalized and executed), or
- Adding a separate test that calls
Finalize()on a command buffer with zero user-added nodes.
| absl::Status RocmCommandBuffer::PrepareFinalization() { | ||
| TF_ASSIGN_OR_RETURN(auto node_count, GetNodeCount()); | ||
| if (node_count > 0) { | ||
| return absl::OkStatus(); | ||
| } | ||
|
|
||
| // HIP graph conditional nodes (once supported) may not handle empty body | ||
| // graphs. Insert an empty node so the graph is non-empty, analogous to | ||
| // CUDA's NoOp kernel insertion for the same case. | ||
| hipGraphNode_t node_handle = nullptr; | ||
| TF_RETURN_IF_ERROR( | ||
| ToStatus(wrap::hipGraphAddEmptyNode(&node_handle, graph_, | ||
| /*pDependencies=*/nullptr, | ||
| /*numDependencies=*/0), | ||
| "Failed to add empty node in PrepareFinalization")); | ||
| return absl::OkStatus(); |
There was a problem hiding this comment.
Untracked graph node: The node_handle created here is added directly to the HIP graph via the raw API but is never returned to or tracked by the GpuCommandBuffer base class. Other node-creation methods (CreateEmptyNode, CreateKernelNode, etc.) return a GraphNodeHandle that the base class stores in its internal command structure.
This mirrors the CUDA behavior (which also adds a NoOp kernel directly), so it's consistent today. However, if any future code relies on the command list being a complete representation of the graph nodes, this node will be invisible. Worth a brief comment noting this is intentional.
| absl::Status SetPriority(StreamPriority priority) override { | ||
| return absl::UnimplementedError("Not implemented."); | ||
| // HIP does not expose an API to set per-node priority on graph kernel nodes | ||
| // (no equivalent of cuGraphKernelNodeSetAttribute with | ||
| // CU_LAUNCH_ATTRIBUTE_PRIORITY). Silently ignore priority requests so that | ||
| // collectives (which use StreamPriority::Highest) can be captured into HIP | ||
| // graphs without crashing. | ||
| return absl::OkStatus(); | ||
| } |
There was a problem hiding this comment.
Nit / observability: The silent no-op is well-documented in the comment, and this is the right approach given HIP's API limitations. Consider adding a VLOG(2) when a non-default priority is silently ignored, so it's visible in debug logs when someone is troubleshooting priority-related behavior:
| absl::Status SetPriority(StreamPriority priority) override { | |
| return absl::UnimplementedError("Not implemented."); | |
| // HIP does not expose an API to set per-node priority on graph kernel nodes | |
| // (no equivalent of cuGraphKernelNodeSetAttribute with | |
| // CU_LAUNCH_ATTRIBUTE_PRIORITY). Silently ignore priority requests so that | |
| // collectives (which use StreamPriority::Highest) can be captured into HIP | |
| // graphs without crashing. | |
| return absl::OkStatus(); | |
| } | |
| // HIP does not expose an API to set per-node priority on graph kernel nodes | |
| // (no equivalent of cuGraphKernelNodeSetAttribute with | |
| // CU_LAUNCH_ATTRIBUTE_PRIORITY). Silently ignore priority requests so that | |
| // collectives (which use StreamPriority::Highest) can be captured into HIP | |
| // graphs without crashing. | |
| if (priority != StreamPriority::Default) { | |
| VLOG(2) << "Ignoring StreamPriority on HIP (no equivalent API)"; | |
| } | |
| return absl::OkStatus(); |
| TF_ASSERT_OK(cmd_buffer->CreateLaunch(add, ThreadDim(), BlockDim(4), | ||
| {empty_cmd}, a, b, c) | ||
| .status()); |
There was a problem hiding this comment.
Minor style inconsistency: CreateLaunch is called with .status() to extract only the status here, while in EmptyNodeChain (line 854) and EmptyNodeAsDependencyBarrier (line 898), TF_ASSERT_OK is called directly on the StatusOr return. Both work, but for consistency within the new tests, consider using the same pattern throughout.
Review SummaryGood PR — correctly implements the missing ROCm safeguards to match CUDA behavior for HIP command buffers. The 4 inline comments posted covering: a misleading test comment about 🤖 Generated with Claude Code |
The CUDA implementation of XLA's command buffer infrastructure has two safeguards for empty graph scenarios that the HIP (ROCm) implementation was missing:
Empty traced graph detection -- When stream capture produces a graph with zero operations (e.g., a custom call that launches nothing), CUDA detects this and returns a clear error. Without this guard on HIP, the runtime would crash during graph instantiation with an opaque driver error.
Empty body graph handling in PrepareFinalization -- CUDA injects a NoOp kernel into empty conditional body graphs (for CUDA < 12.8) to prevent crashes. HIP's PrepareFinalization was a no-op, leaving it unprotected for when graph conditionals are supported on HIP in the future.
Additionally, CreateEmptyNode (via hipGraphAddEmptyNode) was recently implemented in commit b561aaf to fix collective crashes, but had no test coverage on the HIP/ROCm backend -- all existing empty node tests were gated behind CUDA 12.3+ version checks.
This PR adds the missing safeguards and platform-agnostic test coverage.