Compute theoretical occupancy for ROCm profiler kernels #669

magaonka-amd wants to merge 1 commit into ROCm:main
Conversation
- Replace broken hipOccupancy API path (func_ptr was always null) with direct computation from kernel code object metadata (VGPR/SGPR counts) and rocprofiler-sdk agent properties (SIMD/CU configuration)
- Add kTheoreticalOccupancyPct as a separate XEvent stat for the Kernel Stats table
- Use arch-based lookup for VGPR/SGPR allocation limits, matching HIP CLR
Opening here to see what Claude thinks about my patch. No need to review here; the plan is to open this upstream.
```cpp
// Get SIMD properties for theoretical occupancy calculation.
uint32_t simd_per_cu = 0, max_waves = 0, gfx_ver = 0;
if (rocm_tracer_->GetGpuSimdInfo(0, &simd_per_cu, &max_waves, &gfx_ver)) {
```
Bug (multi-GPU): GetGpuSimdInfo(0, ...) hardcodes device ordinal 0. The resulting SIMD properties are stored in RocmTraceCollectorOptions and passed identically to every per-device collector in Export(). In a heterogeneous multi-GPU system (e.g., mixing gfx90a and gfx942), all devices would use GPU 0's SIMD properties, producing incorrect occupancy for the other GPUs.
Consider querying per-device, or at least documenting the single-device assumption.
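A minimal sketch of the per-device variant. The `SimdInfo` struct, the free-function `GetGpuSimdInfo` stub, and its return values are hypothetical stand-ins for the real tracer call, which would query rocprofiler-sdk agent properties for the given ordinal:

```cpp
#include <cstdint>
#include <map>

// Hypothetical container for the three properties the tracer exposes.
struct SimdInfo {
  uint32_t simd_per_cu = 0;
  uint32_t max_waves_per_simd = 0;
  uint32_t gfx_target_version = 0;
};

// Stub standing in for RocmTracer::GetGpuSimdInfo. Pretend ordinal 0 is a
// gfx90a-like device and ordinal 1 is gfx942-like (values illustrative).
bool GetGpuSimdInfo(int ordinal, SimdInfo* out) {
  if (ordinal == 0) {
    *out = {4, 8, 90010};
  } else if (ordinal == 1) {
    *out = {4, 8, 90402};
  } else {
    return false;
  }
  return true;
}

// Query once per device and key the results by ordinal, so each
// per-device collector receives its own SIMD properties rather than
// everyone inheriting GPU 0's.
std::map<int, SimdInfo> QueryAllDevices(int num_gpus) {
  std::map<int, SimdInfo> per_device;
  for (int ordinal = 0; ordinal < num_gpus; ++ordinal) {
    SimdInfo info;
    if (GetGpuSimdInfo(ordinal, &info)) per_device[ordinal] = info;
  }
  return per_device;
}
```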
```cpp
uint32_t gpu_index = 0;
for (const auto& [handle, agent] : agents_) {
  if (agent.type == ROCPROFILER_AGENT_TYPE_GPU) {
    if (gpu_index == device_ordinal) {
      *simd_per_cu = agent.simd_per_cu;
      *max_waves_per_simd = agent.max_waves_per_simd;
      *gfx_target_version = agent.gfx_target_version;
      return true;
    }
    gpu_index++;
  }
}
```
Issue (multi-GPU correctness): agents_ is an absl::flat_hash_map, whose iteration order is non-deterministic. The sequential gpu_index counter assumes a stable ordering, but on multi-GPU systems the mapping from device_ordinal to a specific agent could differ across runs and may not match hipGetDeviceProperties' ordinal assignment.
Consider sorting agents by a stable key (e.g., agent.logical_node_id or PCI bus ID) before the ordinal lookup, or using a different lookup strategy.
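A sketch of the sorted-lookup approach, using a hypothetical `Agent` struct in place of the rocprofiler agent type (only the fields needed here are modeled):

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the subset of rocprofiler agent fields used.
struct Agent {
  bool is_gpu = false;
  uint32_t logical_node_id = 0;
};

// Collect GPU agents and sort by a stable key before applying the
// ordinal, so the device_ordinal -> agent mapping no longer depends on
// the unspecified iteration order of the hash map.
std::vector<const Agent*> StableGpuOrder(
    const std::unordered_map<uint64_t, Agent>& agents) {
  std::vector<const Agent*> gpus;
  for (const auto& [handle, agent] : agents) {
    if (agent.is_gpu) gpus.push_back(&agent);
  }
  std::sort(gpus.begin(), gpus.end(), [](const Agent* a, const Agent* b) {
    return a->logical_node_id < b->logical_node_id;
  });
  return gpus;
}
```

Whether `logical_node_id` actually matches hipGetDeviceProperties' ordinal assignment would still need verifying against the runtime; PCI bus ID is the other candidate key.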
```cpp
#include "rocm/include/rocprofiler-sdk/internal_threading.h"
#include "rocm/include/rocprofiler-sdk/registration.h"
#include "rocm/include/rocprofiler-sdk/rocprofiler.h"
#include <unistd.h>
```
Nit (include order): #include <unistd.h> is a C system header placed between third-party (rocprofiler-sdk) and project includes. Per Google C++ style, system headers should appear in the system headers group before third-party headers, not interleaved here.
```cpp
double PerDeviceCollector::ComputeTheoreticalOccupancy(
    const KernelDetails& ki) {
  int block_size = ki.workgroup_x * ki.workgroup_y * ki.workgroup_z;
```
Nit (type safety): ki.workgroup_x/y/z are uint32_t but block_size is int. The multiplication is performed in uint32_t arithmetic (could theoretically overflow for very large workgroups), and the implicit conversion to int is implementation-defined if the value exceeds INT_MAX. Consider using int64_t or keeping it as uint32_t consistently.
Also on line 226: int alu_waves = simd_per_cu_ * std::min(...) has the same uint32_t-to-int narrowing.
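To make the wraparound concern concrete, here is the product computed both ways. The workgroup dimensions are deliberately extreme (far beyond what HIP permits) just to make the 32-bit wrap visible, so this is a robustness point rather than a practical bug:

```cpp
#include <cstdint>

// Compute the workgroup thread count in 32-bit arithmetic, where the
// product is reduced mod 2^32 and can silently wrap.
uint32_t NarrowBlockSize(uint32_t wx, uint32_t wy, uint32_t wz) {
  return wx * wy * wz;
}

// Widen one operand before multiplying, so the whole product is carried
// out in 64-bit arithmetic and stays exact.
int64_t WideBlockSize(uint32_t wx, uint32_t wy, uint32_t wz) {
  return int64_t{wx} * wy * wz;
}
```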
```cpp
double PerDeviceCollector::ComputeTheoreticalOccupancy(
    const KernelDetails& ki) {
  int block_size = ki.workgroup_x * ki.workgroup_y * ki.workgroup_z;
  OccupancyKey key{ki.arch_vgpr_count, ki.sgpr_count, ki.group_segment_size,
                   static_cast<uint32_t>(block_size)};
  auto it = theoretical_occupancy_cache_.find(key);
  if (it != theoretical_occupancy_cache_.end()) {
    return it->second;
  }
  if (block_size <= 0 || device_properties_.warpSize <= 0 ||
      simd_per_cu_ == 0 || max_waves_per_simd_ == 0) {
    return 0.0;
  }
```
```diff
-  if (err != hipError_t::hipSuccess) {
-    return {};
+  int wave_size = device_properties_.warpSize;
+  uint32_t gfx_major = gfx_target_version_ / 10000;
```
```cpp
  // Derive VGPRs per SIMD from device properties.
  // (see HIP CLR: rocclr/device/rocm/rocdevice.cpp)
  uint32_t vgprs_per_simd =
      device_properties_.regsPerMultiprocessor / wave_size / simd_per_cu_;
```
```cpp
  // VGPR allocation granularity per thread.
  // gfx10+ adjusts for wave64 mode by halving the granularity.
  // (see HIP CLR: hipamd/src/hip_platform.cpp)
  uint32_t vgpr_alloc_granularity;
  if (gfx_major <= 9) {
    vgpr_alloc_granularity = 8;
  } else if (gfx_major <= 10) {
    vgpr_alloc_granularity = wave_size == 64 ? 4 : 8;
  } else {
    vgpr_alloc_granularity = wave_size == 64 ? 6 : 12;
  }
```
```diff
-  stats.occupancy_pct = number_of_active_blocks * params.block_size * 100;
-  stats.occupancy_pct /= device_properties_.maxThreadsPerMultiProcessor;
+  // VGPR limit: how many waves fit per SIMD given VGPR usage.
+  uint32_t used_vgprs = ki.arch_vgpr_count;
+  uint32_t vgpr_waves = max_waves_per_simd_;
+  if (used_vgprs > 0) {
+    uint32_t alloc_vgprs =
+        ((used_vgprs + vgpr_alloc_granularity - 1) / vgpr_alloc_granularity) *
+        vgpr_alloc_granularity;
+    vgpr_waves = vgprs_per_simd / alloc_vgprs;
+  }
+  if (vgpr_waves == 0) {
+    return 0.0;
+  }
```
```diff
-  err = hipOccupancyMaxPotentialBlockSize(
-      &stats.min_grid_size, &stats.suggested_block_size,
-      static_cast<const void*>(params.func_ptr), params.dynamic_smem_size, 0);
+  // SGPRs per SIMD: shared between waves on gfx8/gfx9, unlimited on gfx10+.
+  // (see HIP CLR: rocclr/device/rocm/rocdevice.cpp)
+  uint32_t sgprs_per_simd;
+  if (gfx_major < 8) {
+    sgprs_per_simd = 512;
+  } else if (gfx_major < 10) {
+    sgprs_per_simd = 800;
+  } else {
+    sgprs_per_simd = std::numeric_limits<uint32_t>::max();
+  }
```
```diff
-  if (err != hipError_t::hipSuccess) {
-    return {};
+  // SGPR limit
+  uint32_t gpr_waves = vgpr_waves;
+  if (ki.sgpr_count > 0 &&
+      sgprs_per_simd != std::numeric_limits<uint32_t>::max()) {
+    uint32_t alloc_sgprs =
+        ((ki.sgpr_count + kSgprAllocGranularity - 1) / kSgprAllocGranularity) *
+        kSgprAllocGranularity;
+    uint32_t sgpr_waves = sgprs_per_simd / alloc_sgprs;
+    gpr_waves = std::min(gpr_waves, sgpr_waves);
+  }
```
```diff
-  return stats;
+  // ALU occupancy: total threads per CU limited by registers
+  int alu_waves = simd_per_cu_ * std::min(max_waves_per_simd_, gpr_waves);
+  int alu_limited_threads = alu_waves * wave_size;
```
```cpp
  // LDS limit: how many workgroups fit given LDS usage
  int lds_limited_wgs = std::numeric_limits<int>::max();
  if (ki.group_segment_size > 0 &&
      device_properties_.maxSharedMemoryPerMultiProcessor > 0) {
    lds_limited_wgs =
        static_cast<int>(device_properties_.maxSharedMemoryPerMultiProcessor /
                         ki.group_segment_size);
  }
```
```cpp
  // Block size aligned to wavefront
  int aligned_block = ((block_size + wave_size - 1) / wave_size) * wave_size;
  int max_blocks = alu_limited_threads / aligned_block;
  max_blocks = std::min(max_blocks, lds_limited_wgs);
  max_blocks =
      std::min(max_blocks, device_properties_.maxBlocksPerMultiProcessor);
  max_blocks = std::max(max_blocks, 0);
```
```cpp
  int max_threads = device_properties_.maxThreadsPerMultiProcessor;
  if (max_threads <= 0) {
    return 0.0;
  }
```
```cpp
  double result =
      std::min(100.0, 100.0 * max_blocks * block_size / max_threads);
  theoretical_occupancy_cache_[key] = result;
```
Testing gap: ComputeTheoreticalOccupancy is a non-trivial computation with architecture-dependent branches (gfx8/9/10/11), multiple resource constraints (VGPR, SGPR, LDS, max blocks), and a cache. A unit test constructing KernelDetails with known register counts and validating the output against expected occupancy values would greatly help catch regressions.
Note also that rocm_collector_test.cc (not in this diff) references api_event.kernel_info.func_ptr which was removed from KernelDetails in this PR — it will need updating to set the new sgpr_count / arch_vgpr_count fields or the test target will fail to build.
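A sketch of such a test, using a standalone reimplementation of the formula from this diff so it stays self-contained. All device constants below model a hypothetical gfx9-like GPU (4 SIMDs/CU, 8 waves/SIMD, wave64, 256 VGPRs per SIMD, 2048 threads/CU, 64 KiB LDS), and `kSgprAllocGranularity` is assumed to be 16 as in HIP CLR; the real test would exercise `PerDeviceCollector::ComputeTheoreticalOccupancy` directly:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical gfx9-like device constants for the test.
constexpr uint32_t kSimdPerCu = 4;
constexpr uint32_t kMaxWavesPerSimd = 8;
constexpr int kWaveSize = 64;
constexpr uint32_t kVgprsPerSimd = 256;  // regsPerMultiprocessor / 64 / 4
constexpr uint32_t kVgprAllocGranularity = 8;   // gfx9
constexpr uint32_t kSgprsPerSimd = 800;         // gfx8/gfx9
constexpr uint32_t kSgprAllocGranularity = 16;  // assumed, per HIP CLR
constexpr int kMaxThreadsPerCu = 2048;
constexpr int kMaxBlocksPerCu = 32;
constexpr uint32_t kLdsPerCu = 65536;

// Standalone reimplementation of the PR's occupancy formula: waves per
// SIMD limited by VGPRs, then SGPRs, then workgroups limited by LDS and
// the per-CU block cap.
double TheoreticalOccupancyPct(uint32_t vgprs, uint32_t sgprs,
                               uint32_t lds_bytes, int block_size) {
  uint32_t vgpr_waves = kMaxWavesPerSimd;
  if (vgprs > 0) {
    uint32_t alloc = ((vgprs + kVgprAllocGranularity - 1) /
                      kVgprAllocGranularity) * kVgprAllocGranularity;
    vgpr_waves = kVgprsPerSimd / alloc;
  }
  if (vgpr_waves == 0) return 0.0;

  uint32_t gpr_waves = vgpr_waves;
  if (sgprs > 0) {
    uint32_t alloc = ((sgprs + kSgprAllocGranularity - 1) /
                      kSgprAllocGranularity) * kSgprAllocGranularity;
    gpr_waves = std::min(gpr_waves, kSgprsPerSimd / alloc);
  }

  int alu_threads = static_cast<int>(
      kSimdPerCu * std::min(kMaxWavesPerSimd, gpr_waves) * kWaveSize);
  int lds_wgs = lds_bytes > 0 ? static_cast<int>(kLdsPerCu / lds_bytes)
                              : std::numeric_limits<int>::max();

  int aligned = ((block_size + kWaveSize - 1) / kWaveSize) * kWaveSize;
  int max_blocks = std::min({alu_threads / aligned, lds_wgs, kMaxBlocksPerCu});
  max_blocks = std::max(max_blocks, 0);
  return std::min(100.0, 100.0 * max_blocks * block_size / kMaxThreadsPerCu);
}
```

Worked case: 64 VGPRs allocate in one 64-register block, so 256/64 = 4 waves/SIMD; 4 SIMDs * 4 waves * 64 lanes = 1024 register-limited threads; with a 256-thread block that is 4 blocks, i.e. 1024/2048 = 50% occupancy.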
Claude Review Summary

Overall: Good approach — replacing the broken hipOccupancy API path with direct computation from code object metadata.

Key findings (details in inline comments):
Submission Checklist