
Compute theoretical occupancy for ROCm profiler kernels #669

Draft
magaonka-amd wants to merge 1 commit into ROCm:main from magaonka-amd:feature/occupancy-stats

Conversation

@magaonka-amd

  • Replace broken hipOccupancy API path (func_ptr was always null) with direct computation from kernel code object metadata (VGPR/SGPR counts) and rocprofiler-sdk agent properties (SIMD/CU configuration)
  • Add kTheoreticalOccupancyPct as separate XEvent stat for Kernel Stats table
  • Use arch-based lookup for VGPR/SGPR allocation limits matching HIP CLR

@magaonka-amd
Author

Opening here to see what Claude thinks about my patch. No need to review here; the plan is to open this upstream.


// Get SIMD properties for theoretical occupancy calculation.
uint32_t simd_per_cu = 0, max_waves = 0, gfx_ver = 0;
if (rocm_tracer_->GetGpuSimdInfo(0, &simd_per_cu, &max_waves, &gfx_ver)) {

Bug (multi-GPU): GetGpuSimdInfo(0, ...) hardcodes device ordinal 0. The resulting SIMD properties are stored in RocmTraceCollectorOptions and passed identically to every per-device collector in Export(). In a heterogeneous multi-GPU system (e.g., mixing gfx90a and gfx942), all devices would use GPU 0's SIMD properties, producing incorrect occupancy for the other GPUs.

Consider querying per-device, or at least documenting the single-device assumption.
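The per-device alternative could be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: the `GetGpuSimdInfo` stand-in below fabricates values (and its signature mirrors the PR's method), and storing one `SimdInfo` per ordinal instead of a single triple is the suggested change.

```cpp
#include <cstdint>
#include <vector>

// Per-device SIMD properties instead of one global triple queried from GPU 0.
struct SimdInfo {
  uint32_t simd_per_cu = 0;
  uint32_t max_waves_per_simd = 0;
  uint32_t gfx_target_version = 0;
};

// Stand-in for the real query: returns fabricated values for two devices so
// the sketch is self-contained and testable.
bool GetGpuSimdInfo(uint32_t ordinal, uint32_t* simd, uint32_t* waves,
                    uint32_t* gfx) {
  static const SimdInfo kFake[] = {{4, 8, 90010}, {4, 8, 94200}};
  if (ordinal >= 2) return false;
  *simd = kFake[ordinal].simd_per_cu;
  *waves = kFake[ordinal].max_waves_per_simd;
  *gfx = kFake[ordinal].gfx_target_version;
  return true;
}

// Query every device once, so each per-device collector gets its own
// properties even on heterogeneous multi-GPU systems.
std::vector<SimdInfo> QueryAllDevices(uint32_t num_gpus) {
  std::vector<SimdInfo> infos(num_gpus);
  for (uint32_t i = 0; i < num_gpus; ++i) {
    GetGpuSimdInfo(i, &infos[i].simd_per_cu, &infos[i].max_waves_per_simd,
                   &infos[i].gfx_target_version);
  }
  return infos;
}
```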

Comment on lines +146 to +157
uint32_t gpu_index = 0;
for (const auto& [handle, agent] : agents_) {
if (agent.type == ROCPROFILER_AGENT_TYPE_GPU) {
if (gpu_index == device_ordinal) {
*simd_per_cu = agent.simd_per_cu;
*max_waves_per_simd = agent.max_waves_per_simd;
*gfx_target_version = agent.gfx_target_version;
return true;
}
gpu_index++;
}
}

Issue (multi-GPU correctness): agents_ is an absl::flat_hash_map, whose iteration order is non-deterministic. The sequential gpu_index counter assumes a stable ordering, but on multi-GPU systems the mapping from device_ordinal to a specific agent could differ across runs and may not match hipGetDeviceProperties' ordinal assignment.

Consider sorting agents by a stable key (e.g., agent.logical_node_id or PCI bus ID) before the ordinal lookup, or using a different lookup strategy.
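A minimal sketch of the stable-ordering suggestion, using stand-in types (the `Agent` struct here only mirrors the fields the PR uses; `logical_node_id` as the sort key is the reviewer's assumption about a stable identifier):

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal stand-in for the rocprofiler agent record.
struct Agent {
  int type;                  // 1 == GPU, anything else skipped
  uint32_t logical_node_id;  // assumed stable key
  uint32_t simd_per_cu;
};

// Collect GPU agents, sort by a stable key, then index by ordinal, so the
// ordinal->agent mapping no longer depends on hash-map iteration order.
const Agent* FindGpuAgent(const std::unordered_map<uint64_t, Agent>& agents,
                          uint32_t device_ordinal) {
  std::vector<const Agent*> gpus;
  for (const auto& [handle, agent] : agents) {
    if (agent.type == 1) gpus.push_back(&agent);
  }
  std::sort(gpus.begin(), gpus.end(), [](const Agent* a, const Agent* b) {
    return a->logical_node_id < b->logical_node_id;
  });
  return device_ordinal < gpus.size() ? gpus[device_ordinal] : nullptr;
}
```

Whether `logical_node_id` matches `hipGetDeviceProperties`' ordinal assignment would still need verification; the point is only that the key must be deterministic across runs.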

#include "rocm/include/rocprofiler-sdk/internal_threading.h"
#include "rocm/include/rocprofiler-sdk/registration.h"
#include "rocm/include/rocprofiler-sdk/rocprofiler.h"
#include <unistd.h>

Nit (include order): #include <unistd.h> is a C system header placed between third-party (rocprofiler-sdk) and project includes. Per Google C++ style, system headers should appear in the system headers group before third-party headers, not interleaved here.


double PerDeviceCollector::ComputeTheoreticalOccupancy(
const KernelDetails& ki) {
int block_size = ki.workgroup_x * ki.workgroup_y * ki.workgroup_z;

Nit (type safety): ki.workgroup_x/y/z are uint32_t but block_size is int. The multiplication is performed in uint32_t arithmetic (could theoretically overflow for very large workgroups), and the implicit conversion to int is implementation-defined if the value exceeds INT_MAX. Consider using int64_t or keeping it as uint32_t consistently.

Also on line 226: int alu_waves = simd_per_cu_ * std::min(...) has the same uint32_t-to-int narrowing.

Comment on lines +156 to +253
double PerDeviceCollector::ComputeTheoreticalOccupancy(
const KernelDetails& ki) {
int block_size = ki.workgroup_x * ki.workgroup_y * ki.workgroup_z;
OccupancyKey key{ki.arch_vgpr_count, ki.sgpr_count, ki.group_segment_size,
static_cast<uint32_t>(block_size)};
auto it = theoretical_occupancy_cache_.find(key);
if (it != theoretical_occupancy_cache_.end()) {
return it->second;
}
if (block_size <= 0 || device_properties_.warpSize <= 0 ||
simd_per_cu_ == 0 || max_waves_per_simd_ == 0) {
return 0.0;
}

int wave_size = device_properties_.warpSize;
uint32_t gfx_major = gfx_target_version_ / 10000;

// Derive VGPRs per SIMD from device properties.
// (see HIP CLR: rocclr/device/rocm/rocdevice.cpp)
uint32_t vgprs_per_simd =
device_properties_.regsPerMultiprocessor / wave_size / simd_per_cu_;

// VGPR allocation granularity per thread.
// gfx10+ adjusts for wave64 mode by halving the granularity.
// (see HIP CLR: hipamd/src/hip_platform.cpp)
uint32_t vgpr_alloc_granularity;
if (gfx_major <= 9) {
vgpr_alloc_granularity = 8;
} else if (gfx_major <= 10) {
vgpr_alloc_granularity = wave_size == 64 ? 4 : 8;
} else {
vgpr_alloc_granularity = wave_size == 64 ? 6 : 12;
}

// VGPR limit: how many waves fit per SIMD given VGPR usage.
uint32_t used_vgprs = ki.arch_vgpr_count;
uint32_t vgpr_waves = max_waves_per_simd_;
if (used_vgprs > 0) {
uint32_t alloc_vgprs =
((used_vgprs + vgpr_alloc_granularity - 1) / vgpr_alloc_granularity) *
vgpr_alloc_granularity;
vgpr_waves = vgprs_per_simd / alloc_vgprs;
}
if (vgpr_waves == 0) {
return 0.0;
}

// SGPRs per SIMD: shared between waves on gfx8/gfx9, unlimited on gfx10+.
// (see HIP CLR: rocclr/device/rocm/rocdevice.cpp)
uint32_t sgprs_per_simd;
if (gfx_major < 8) {
sgprs_per_simd = 512;
} else if (gfx_major < 10) {
sgprs_per_simd = 800;
} else {
sgprs_per_simd = std::numeric_limits<uint32_t>::max();
}

// SGPR limit
uint32_t gpr_waves = vgpr_waves;
if (ki.sgpr_count > 0 &&
sgprs_per_simd != std::numeric_limits<uint32_t>::max()) {
uint32_t alloc_sgprs =
((ki.sgpr_count + kSgprAllocGranularity - 1) / kSgprAllocGranularity) *
kSgprAllocGranularity;
uint32_t sgpr_waves = sgprs_per_simd / alloc_sgprs;
gpr_waves = std::min(gpr_waves, sgpr_waves);
}

// ALU occupancy: total threads per CU limited by registers
int alu_waves = simd_per_cu_ * std::min(max_waves_per_simd_, gpr_waves);
int alu_limited_threads = alu_waves * wave_size;

// LDS limit: how many workgroups fit given LDS usage
int lds_limited_wgs = std::numeric_limits<int>::max();
if (ki.group_segment_size > 0 &&
device_properties_.maxSharedMemoryPerMultiProcessor > 0) {
lds_limited_wgs =
static_cast<int>(device_properties_.maxSharedMemoryPerMultiProcessor /
ki.group_segment_size);
}

// Block size aligned to wavefront
int aligned_block = ((block_size + wave_size - 1) / wave_size) * wave_size;
int max_blocks = alu_limited_threads / aligned_block;
max_blocks = std::min(max_blocks, lds_limited_wgs);
max_blocks =
std::min(max_blocks, device_properties_.maxBlocksPerMultiProcessor);
max_blocks = std::max(max_blocks, 0);

int max_threads = device_properties_.maxThreadsPerMultiProcessor;
if (max_threads <= 0) {
return 0.0;
}

double result =
std::min(100.0, 100.0 * max_blocks * block_size / max_threads);
theoretical_occupancy_cache_[key] = result;

Testing gap: ComputeTheoreticalOccupancy is a non-trivial computation with architecture-dependent branches (gfx8/9/10/11), multiple resource constraints (VGPR, SGPR, LDS, max blocks), and a cache. A unit test constructing KernelDetails with known register counts and validating the output against expected occupancy values would greatly help catch regressions.

Note also that rocm_collector_test.cc (not in this diff) references api_event.kernel_info.func_ptr which was removed from KernelDetails in this PR — it will need updating to set the new sgpr_count / arch_vgpr_count fields or the test target will fail to build.
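One way to make the VGPR-limit branch testable is to extract the core math into a free function, as in this sketch. The constants assume gfx9 with wave64 (VGPR allocation granularity 8, 256 VGPRs per SIMD, derived from the formulas in the diff); `VgprLimitedWaves` is a hypothetical helper name, not code from this PR.

```cpp
#include <algorithm>
#include <cstdint>

// VGPR-limited wave count per SIMD, mirroring the PR's computation:
// round the kernel's VGPR usage up to the allocation granularity, then see
// how many such allocations fit in the SIMD's register file.
uint32_t VgprLimitedWaves(uint32_t used_vgprs, uint32_t max_waves_per_simd) {
  constexpr uint32_t kVgprsPerSimd = 256;    // gfx9: 65536 regs / 64 lanes / 4 SIMDs
  constexpr uint32_t kVgprGranularity = 8;   // gfx9 allocation granularity
  if (used_vgprs == 0) return max_waves_per_simd;
  uint32_t alloc_vgprs =
      ((used_vgprs + kVgprGranularity - 1) / kVgprGranularity) *
      kVgprGranularity;
  return std::min(max_waves_per_simd, kVgprsPerSimd / alloc_vgprs);
}
```

A unit test can then assert known values: 32 VGPRs leaves the full 8 waves, 33 VGPRs rounds up to 40 and drops to 6, and anything over 256 yields 0 (which the caller must treat as zero occupancy rather than divide by).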

@claude

claude bot commented Mar 16, 2026

Claude Review Summary

Overall: Good approach — replacing the broken hipOccupancy path (where func_ptr was always null) with a direct computation from kernel metadata is the right call. The architecture-based lookup tables match HIP CLR logic well.

Key findings (details in inline comments):

  • Multi-GPU correctness (2 issues): GetGpuSimdInfo(0, ...) hardcodes device 0, and agents_ (flat_hash_map) iteration order is non-deterministic for ordinal lookup
  • Build breakage: rocm_collector_test.cc references func_ptr which was removed from KernelDetails — needs updating to use the new fields
  • Testing gap: No unit tests for the new ComputeTheoreticalOccupancy logic with its architecture-dependent branches
  • Minor: include-ordering nit, uint32_t-to-int narrowing in a couple of places

@magaonka-amd magaonka-amd marked this pull request as draft March 17, 2026 02:46
