Compute theoretical occupancy for ROCm profiler kernels #669

magaonka-amd wants to merge 1 commit into ROCm:main
Conversation
- Replace broken hipOccupancy API path (func_ptr was always null) with direct computation from kernel code object metadata (VGPR/SGPR counts) and rocprofiler-sdk agent properties (SIMD/CU configuration)
- Add kTheoreticalOccupancyPct as a separate XEvent stat for the Kernel Stats table
- Use arch-based lookup for VGPR/SGPR allocation limits, matching HIP CLR
Opening here to see what Claude thinks about my patch. No need to review here; the plan is to open this upstream.
```cpp
// Get SIMD properties for theoretical occupancy calculation.
uint32_t simd_per_cu = 0, max_waves = 0, gfx_ver = 0;
if (rocm_tracer_->GetGpuSimdInfo(0, &simd_per_cu, &max_waves, &gfx_ver)) {
```
Bug (multi-GPU): GetGpuSimdInfo(0, ...) hardcodes device ordinal 0. The resulting SIMD properties are stored in RocmTraceCollectorOptions and passed identically to every per-device collector in Export(). In a heterogeneous multi-GPU system (e.g., mixing gfx90a and gfx942), all devices would use GPU 0's SIMD properties, producing incorrect occupancy for the other GPUs.
Consider querying per-device, or at least documenting the single-device assumption.
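A minimal sketch of the per-device variant. The `SimdInfo` struct, the free-function `GetGpuSimdInfo` stub, and its return values are hypothetical stand-ins for the real tracer call, which would query rocprofiler-sdk agent properties for the given ordinal:

```cpp
#include <cstdint>
#include <map>

// Hypothetical container for the three properties the tracer exposes.
struct SimdInfo {
  uint32_t simd_per_cu = 0;
  uint32_t max_waves_per_simd = 0;
  uint32_t gfx_target_version = 0;
};

// Stub standing in for RocmTracer::GetGpuSimdInfo. Pretend ordinal 0 is a
// gfx90a-like device and ordinal 1 is gfx942-like (values illustrative).
bool GetGpuSimdInfo(int ordinal, SimdInfo* out) {
  if (ordinal == 0) {
    *out = {4, 8, 90010};
  } else if (ordinal == 1) {
    *out = {4, 8, 90402};
  } else {
    return false;
  }
  return true;
}

// Query once per device and key the results by ordinal, so each
// per-device collector receives its own SIMD properties rather than
// everyone inheriting GPU 0's.
std::map<int, SimdInfo> QueryAllDevices(int num_gpus) {
  std::map<int, SimdInfo> per_device;
  for (int ordinal = 0; ordinal < num_gpus; ++ordinal) {
    SimdInfo info;
    if (GetGpuSimdInfo(ordinal, &info)) per_device[ordinal] = info;
  }
  return per_device;
}
```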
```cpp
uint32_t gpu_index = 0;
for (const auto& [handle, agent] : agents_) {
  if (agent.type == ROCPROFILER_AGENT_TYPE_GPU) {
    if (gpu_index == device_ordinal) {
      *simd_per_cu = agent.simd_per_cu;
      *max_waves_per_simd = agent.max_waves_per_simd;
      *gfx_target_version = agent.gfx_target_version;
      return true;
    }
    gpu_index++;
  }
}
```
Issue (multi-GPU correctness): agents_ is an absl::flat_hash_map, whose iteration order is non-deterministic. The sequential gpu_index counter assumes a stable ordering, but on multi-GPU systems the mapping from device_ordinal to a specific agent could differ across runs and may not match hipGetDeviceProperties' ordinal assignment.
Consider sorting agents by a stable key (e.g., agent.logical_node_id or PCI bus ID) before the ordinal lookup, or using a different lookup strategy.
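A sketch of the sorted-lookup approach, using a hypothetical `Agent` struct in place of the rocprofiler agent type (only the fields needed here are modeled):

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the subset of rocprofiler agent fields used.
struct Agent {
  bool is_gpu = false;
  uint32_t logical_node_id = 0;
};

// Collect GPU agents and sort by a stable key before applying the
// ordinal, so the device_ordinal -> agent mapping no longer depends on
// the unspecified iteration order of the hash map.
std::vector<const Agent*> StableGpuOrder(
    const std::unordered_map<uint64_t, Agent>& agents) {
  std::vector<const Agent*> gpus;
  for (const auto& [handle, agent] : agents) {
    if (agent.is_gpu) gpus.push_back(&agent);
  }
  std::sort(gpus.begin(), gpus.end(), [](const Agent* a, const Agent* b) {
    return a->logical_node_id < b->logical_node_id;
  });
  return gpus;
}
```

Whether `logical_node_id` actually matches hipGetDeviceProperties' ordinal assignment would still need verifying against the runtime; PCI bus ID is the other candidate key.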
```cpp
#include "rocm/include/rocprofiler-sdk/internal_threading.h"
#include "rocm/include/rocprofiler-sdk/registration.h"
#include "rocm/include/rocprofiler-sdk/rocprofiler.h"
#include <unistd.h>
```
Nit (include order): #include <unistd.h> is a C system header placed between third-party (rocprofiler-sdk) and project includes. Per Google C++ style, system headers should appear in the system headers group before third-party headers, not interleaved here.
```cpp
double PerDeviceCollector::ComputeTheoreticalOccupancy(
    const KernelDetails& ki) {
  int block_size = ki.workgroup_x * ki.workgroup_y * ki.workgroup_z;
```
Nit (type safety): ki.workgroup_x/y/z are uint32_t but block_size is int. The multiplication is performed in uint32_t arithmetic (could theoretically overflow for very large workgroups), and the implicit conversion to int is implementation-defined if the value exceeds INT_MAX. Consider using int64_t or keeping it as uint32_t consistently.
Also on line 226: int alu_waves = simd_per_cu_ * std::min(...) has the same uint32_t-to-int narrowing.
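To make the wraparound concern concrete, here is the product computed both ways. The workgroup dimensions are deliberately extreme (far beyond what HIP permits) just to make the 32-bit wrap visible, so this is a robustness point rather than a practical bug:

```cpp
#include <cstdint>

// Compute the workgroup thread count in 32-bit arithmetic, where the
// product is reduced mod 2^32 and can silently wrap.
uint32_t NarrowBlockSize(uint32_t wx, uint32_t wy, uint32_t wz) {
  return wx * wy * wz;
}

// Widen one operand before multiplying, so the whole product is carried
// out in 64-bit arithmetic and stays exact.
int64_t WideBlockSize(uint32_t wx, uint32_t wy, uint32_t wz) {
  return int64_t{wx} * wy * wz;
}
```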
```cpp
double PerDeviceCollector::ComputeTheoreticalOccupancy(
    const KernelDetails& ki) {
  int block_size = ki.workgroup_x * ki.workgroup_y * ki.workgroup_z;
  OccupancyKey key{ki.arch_vgpr_count, ki.sgpr_count, ki.group_segment_size,
                   static_cast<uint32_t>(block_size)};
  auto it = theoretical_occupancy_cache_.find(key);
  if (it != theoretical_occupancy_cache_.end()) {
    return it->second;
  }
  if (block_size <= 0 || device_properties_.warpSize <= 0 ||
      simd_per_cu_ == 0 || max_waves_per_simd_ == 0) {
    return 0.0;
  }
```
```diff
-  if (err != hipError_t::hipSuccess) {
-    return {};
+  int wave_size = device_properties_.warpSize;
+  uint32_t gfx_major = gfx_target_version_ / 10000;
```
```cpp
  // Derive VGPRs per SIMD from device properties.
  // (see HIP CLR: rocclr/device/rocm/rocdevice.cpp)
  uint32_t vgprs_per_simd =
      device_properties_.regsPerMultiprocessor / wave_size / simd_per_cu_;
```
```cpp
  // VGPR allocation granularity per thread.
  // gfx10+ adjusts for wave64 mode by halving the granularity.
  // (see HIP CLR: hipamd/src/hip_platform.cpp)
  uint32_t vgpr_alloc_granularity;
  if (gfx_major <= 9) {
    vgpr_alloc_granularity = 8;
  } else if (gfx_major <= 10) {
    vgpr_alloc_granularity = wave_size == 64 ? 4 : 8;
  } else {
    vgpr_alloc_granularity = wave_size == 64 ? 6 : 12;
  }
```
```diff
-  stats.occupancy_pct = number_of_active_blocks * params.block_size * 100;
-  stats.occupancy_pct /= device_properties_.maxThreadsPerMultiProcessor;
+  // VGPR limit: how many waves fit per SIMD given VGPR usage.
+  uint32_t used_vgprs = ki.arch_vgpr_count;
+  uint32_t vgpr_waves = max_waves_per_simd_;
+  if (used_vgprs > 0) {
+    uint32_t alloc_vgprs =
+        ((used_vgprs + vgpr_alloc_granularity - 1) / vgpr_alloc_granularity) *
+        vgpr_alloc_granularity;
+    vgpr_waves = vgprs_per_simd / alloc_vgprs;
+  }
+  if (vgpr_waves == 0) {
+    return 0.0;
+  }
```
```diff
-  err = hipOccupancyMaxPotentialBlockSize(
-      &stats.min_grid_size, &stats.suggested_block_size,
-      static_cast<const void*>(params.func_ptr), params.dynamic_smem_size, 0);
+  // SGPRs per SIMD: shared between waves on gfx8/gfx9, unlimited on gfx10+.
+  // (see HIP CLR: rocclr/device/rocm/rocdevice.cpp)
+  uint32_t sgprs_per_simd;
+  if (gfx_major < 8) {
+    sgprs_per_simd = 512;
+  } else if (gfx_major < 10) {
+    sgprs_per_simd = 800;
+  } else {
+    sgprs_per_simd = std::numeric_limits<uint32_t>::max();
+  }
```
```diff
-  if (err != hipError_t::hipSuccess) {
-    return {};
+  // SGPR limit
+  uint32_t gpr_waves = vgpr_waves;
+  if (ki.sgpr_count > 0 &&
+      sgprs_per_simd != std::numeric_limits<uint32_t>::max()) {
+    uint32_t alloc_sgprs =
+        ((ki.sgpr_count + kSgprAllocGranularity - 1) / kSgprAllocGranularity) *
+        kSgprAllocGranularity;
+    uint32_t sgpr_waves = sgprs_per_simd / alloc_sgprs;
+    gpr_waves = std::min(gpr_waves, sgpr_waves);
+  }
```
```diff
-  return stats;
+  // ALU occupancy: total threads per CU limited by registers
+  int alu_waves = simd_per_cu_ * std::min(max_waves_per_simd_, gpr_waves);
+  int alu_limited_threads = alu_waves * wave_size;
```
```cpp
  // LDS limit: how many workgroups fit given LDS usage
  int lds_limited_wgs = std::numeric_limits<int>::max();
  if (ki.group_segment_size > 0 &&
      device_properties_.maxSharedMemoryPerMultiProcessor > 0) {
    lds_limited_wgs =
        static_cast<int>(device_properties_.maxSharedMemoryPerMultiProcessor /
                         ki.group_segment_size);
  }
```
```cpp
  // Block size aligned to wavefront
  int aligned_block = ((block_size + wave_size - 1) / wave_size) * wave_size;
  int max_blocks = alu_limited_threads / aligned_block;
  max_blocks = std::min(max_blocks, lds_limited_wgs);
  max_blocks =
      std::min(max_blocks, device_properties_.maxBlocksPerMultiProcessor);
  max_blocks = std::max(max_blocks, 0);
```
```cpp
  int max_threads = device_properties_.maxThreadsPerMultiProcessor;
  if (max_threads <= 0) {
    return 0.0;
  }
```
```cpp
  double result =
      std::min(100.0, 100.0 * max_blocks * block_size / max_threads);
  theoretical_occupancy_cache_[key] = result;
```
Testing gap: ComputeTheoreticalOccupancy is a non-trivial computation with architecture-dependent branches (gfx8/9/10/11), multiple resource constraints (VGPR, SGPR, LDS, max blocks), and a cache. A unit test constructing KernelDetails with known register counts and validating the output against expected occupancy values would greatly help catch regressions.
Note also that rocm_collector_test.cc (not in this diff) references api_event.kernel_info.func_ptr which was removed from KernelDetails in this PR — it will need updating to set the new sgpr_count / arch_vgpr_count fields or the test target will fail to build.
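A sketch of such a test, using a standalone reimplementation of the formula from this diff so it stays self-contained. All device constants below model a hypothetical gfx9-like GPU (4 SIMDs/CU, 8 waves/SIMD, wave64, 256 VGPRs per SIMD, 2048 threads/CU, 64 KiB LDS), and `kSgprAllocGranularity` is assumed to be 16 as in HIP CLR; the real test would exercise `PerDeviceCollector::ComputeTheoreticalOccupancy` directly:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical gfx9-like device constants for the test.
constexpr uint32_t kSimdPerCu = 4;
constexpr uint32_t kMaxWavesPerSimd = 8;
constexpr int kWaveSize = 64;
constexpr uint32_t kVgprsPerSimd = 256;  // regsPerMultiprocessor / 64 / 4
constexpr uint32_t kVgprAllocGranularity = 8;   // gfx9
constexpr uint32_t kSgprsPerSimd = 800;         // gfx8/gfx9
constexpr uint32_t kSgprAllocGranularity = 16;  // assumed, per HIP CLR
constexpr int kMaxThreadsPerCu = 2048;
constexpr int kMaxBlocksPerCu = 32;
constexpr uint32_t kLdsPerCu = 65536;

// Standalone reimplementation of the PR's occupancy formula: waves per
// SIMD limited by VGPRs, then SGPRs, then workgroups limited by LDS and
// the per-CU block cap.
double TheoreticalOccupancyPct(uint32_t vgprs, uint32_t sgprs,
                               uint32_t lds_bytes, int block_size) {
  uint32_t vgpr_waves = kMaxWavesPerSimd;
  if (vgprs > 0) {
    uint32_t alloc = ((vgprs + kVgprAllocGranularity - 1) /
                      kVgprAllocGranularity) * kVgprAllocGranularity;
    vgpr_waves = kVgprsPerSimd / alloc;
  }
  if (vgpr_waves == 0) return 0.0;

  uint32_t gpr_waves = vgpr_waves;
  if (sgprs > 0) {
    uint32_t alloc = ((sgprs + kSgprAllocGranularity - 1) /
                      kSgprAllocGranularity) * kSgprAllocGranularity;
    gpr_waves = std::min(gpr_waves, kSgprsPerSimd / alloc);
  }

  int alu_threads = static_cast<int>(
      kSimdPerCu * std::min(kMaxWavesPerSimd, gpr_waves) * kWaveSize);
  int lds_wgs = lds_bytes > 0 ? static_cast<int>(kLdsPerCu / lds_bytes)
                              : std::numeric_limits<int>::max();

  int aligned = ((block_size + kWaveSize - 1) / kWaveSize) * kWaveSize;
  int max_blocks = std::min({alu_threads / aligned, lds_wgs, kMaxBlocksPerCu});
  max_blocks = std::max(max_blocks, 0);
  return std::min(100.0, 100.0 * max_blocks * block_size / kMaxThreadsPerCu);
}
```

Worked case: 64 VGPRs allocate in one 64-register block, so 256/64 = 4 waves/SIMD; 4 SIMDs * 4 waves * 64 lanes = 1024 register-limited threads; with a 256-thread block that is 4 blocks, i.e. 1024/2048 = 50% occupancy.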
Claude Review Summary

Overall: Good approach — replacing the broken hipOccupancy API path with direct computation from code object metadata.

Key findings (details in inline comments):
Submission Checklist