Skip to content

Phambinh/implement emptynode#680

Open
phambinhfin wants to merge 2 commits intorocm-jaxlib-v0.8.2from
phambinh/implement-emptynode
Open

Phambinh/implement emptynode#680
phambinhfin wants to merge 2 commits intorocm-jaxlib-v0.8.2from
phambinh/implement-emptynode

Conversation

@phambinhfin
Copy link

@phambinhfin phambinhfin commented Mar 18, 2026

The CUDA implementation of XLA's command buffer infrastructure has two safeguards for empty graph scenarios that the HIP (ROCm) implementation was missing:

Empty traced graph detection -- When stream capture produces a graph with zero operations (e.g., a custom call that launches nothing), CUDA detects this and returns a clear error. Without this guard on HIP, the runtime would crash during graph instantiation with an opaque driver error.

Empty body graph handling in PrepareFinalization -- CUDA injects a NoOp kernel into empty conditional body graphs (for CUDA < 12.8) to prevent crashes. HIP's PrepareFinalization was a no-op, leaving it unprotected for when graph conditionals are supported on HIP in the future.

Additionally, CreateEmptyNode (via hipGraphAddEmptyNode) was recently implemented in commit b561aaf to fix collective crashes, but had no test coverage on the HIP/ROCm backend -- all existing empty node tests were gated behind CUDA 12.3+ version checks.

This PR adds the missing safeguards and platform-agnostic test coverage.

ScXfjiang and others added 2 commits March 18, 2026 12:18
- Add hipGraphGetRootNodes to the ROCm driver wrapper
- Add empty traced graph detection in RocmCommandBuffer::Trace() to
  prevent crashes when a captured stream produces no operations,
  matching the existing CUDA guard
- Implement PrepareFinalization to insert an empty node into empty body
  graphs, preventing potential issues when graph conditionals are
  supported on HIP in the future
- Add four new platform-agnostic test cases for CreateEmptyNode:
  EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and
  EmptyNodeAsDependencyBarrier
Comment on lines +812 to +826
TEST(GpuCommandBufferTest, EmptyNodeOnly) {
Platform* platform = GpuPlatform();
StreamExecutor* executor = platform->ExecutorForDevice(0).value();

TF_ASSERT_OK_AND_ASSIGN(auto stream, executor->CreateStream());

TF_ASSERT_OK_AND_ASSIGN(auto cmd_buffer,
executor->CreateCommandBuffer(primary));
TF_ASSERT_OK_AND_ASSIGN(auto* empty_cmd, cmd_buffer->CreateEmptyCmd({}));
(void)empty_cmd;
TF_ASSERT_OK(cmd_buffer->Finalize());

TF_ASSERT_OK(cmd_buffer->Submit(stream.get()));
TF_ASSERT_OK(stream->BlockHostUntilDone());
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading test comment: The comment says this test "exercises PrepareFinalization on HIP," but it doesn't actually hit the empty-graph insertion path. CreateEmptyCmd({}) adds one node, so by the time Finalize()PrepareFinalization() runs, GetNodeCount() returns 1 and the method early-returns at the node_count > 0 check.

To truly exercise the PrepareFinalization empty-graph path, you'd need a command buffer with zero nodes at finalization time (no CreateEmptyCmd call). Consider either:

  1. Updating the comment to reflect what this test actually covers (that a single empty node can be finalized and executed), or
  2. Adding a separate test that calls Finalize() on a command buffer with zero user-added nodes.

Comment on lines 437 to 452
absl::Status RocmCommandBuffer::PrepareFinalization() {
TF_ASSIGN_OR_RETURN(auto node_count, GetNodeCount());
if (node_count > 0) {
return absl::OkStatus();
}

// HIP graph conditional nodes (once supported) may not handle empty body
// graphs. Insert an empty node so the graph is non-empty, analogous to
// CUDA's NoOp kernel insertion for the same case.
hipGraphNode_t node_handle = nullptr;
TF_RETURN_IF_ERROR(
ToStatus(wrap::hipGraphAddEmptyNode(&node_handle, graph_,
/*pDependencies=*/nullptr,
/*numDependencies=*/0),
"Failed to add empty node in PrepareFinalization"));
return absl::OkStatus();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Untracked graph node: The node_handle created here is added directly to the HIP graph via the raw API but is never returned to or tracked by the GpuCommandBuffer base class. Other node-creation methods (CreateEmptyNode, CreateKernelNode, etc.) return a GraphNodeHandle that the base class stores in its internal command structure.

This mirrors the CUDA behavior (which also adds a NoOp kernel directly), so it's consistent today. However, if any future code relies on the command list being a complete representation of the graph nodes, this node will be invisible. Worth a brief comment noting this is intentional.

Comment on lines 150 to 157
absl::Status SetPriority(StreamPriority priority) override {
return absl::UnimplementedError("Not implemented.");
// HIP does not expose an API to set per-node priority on graph kernel nodes
// (no equivalent of cuGraphKernelNodeSetAttribute with
// CU_LAUNCH_ATTRIBUTE_PRIORITY). Silently ignore priority requests so that
// collectives (which use StreamPriority::Highest) can be captured into HIP
// graphs without crashing.
return absl::OkStatus();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / observability: The silent no-op is well-documented in the comment, and this is the right approach given HIP's API limitations. Consider adding a VLOG(2) when a non-default priority is silently ignored, so it's visible in debug logs when someone is troubleshooting priority-related behavior:

Suggested change
absl::Status SetPriority(StreamPriority priority) override {
return absl::UnimplementedError("Not implemented.");
// HIP does not expose an API to set per-node priority on graph kernel nodes
// (no equivalent of cuGraphKernelNodeSetAttribute with
// CU_LAUNCH_ATTRIBUTE_PRIORITY). Silently ignore priority requests so that
// collectives (which use StreamPriority::Highest) can be captured into HIP
// graphs without crashing.
return absl::OkStatus();
}
// HIP does not expose an API to set per-node priority on graph kernel nodes
// (no equivalent of cuGraphKernelNodeSetAttribute with
// CU_LAUNCH_ATTRIBUTE_PRIORITY). Silently ignore priority requests so that
// collectives (which use StreamPriority::Highest) can be captured into HIP
// graphs without crashing.
if (priority != StreamPriority::Default) {
VLOG(2) << "Ignoring StreamPriority on HIP (no equivalent API)";
}
return absl::OkStatus();

Comment on lines +795 to +797
TF_ASSERT_OK(cmd_buffer->CreateLaunch(add, ThreadDim(), BlockDim(4),
{empty_cmd}, a, b, c)
.status());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style inconsistency: CreateLaunch is called with .status() to extract only the status here, while in EmptyNodeChain (line 854) and EmptyNodeAsDependencyBarrier (line 898), TF_ASSERT_OK is called directly on the StatusOr return. Both work, but for consistency within the new tests, consider using the same pattern throughout.

@claude
Copy link

claude bot commented Mar 18, 2026

Review Summary

Good PR — correctly implements the missing ROCm safeguards to match CUDA behavior for HIP command buffers. The CreateEmptyNode, Trace() empty-graph check, PrepareFinalization() empty-node insertion, and SetPriority silent-OK are all implemented consistently with the CUDA side. Tests cover the key scenarios well.

4 inline comments posted covering: a misleading test comment about PrepareFinalization coverage, an untracked graph node in PrepareFinalization, a suggestion to add VLOG for silently-ignored priority, and a minor style inconsistency in test assertions.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants