memory heap set / alloc test #32
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid an influx of comments on new commits. This behavior is configurable in the settings.
📝 Walkthrough
This PR refactors the CPU GPU-emulation layer from a monolithic header into three separate modules (`def.hpp`, `api.hpp`, `cls.hpp`), relocates launcher and pointer infrastructure, and adds enhanced device property accessors across the CUDA and ROCm backends with null-pointer error handling. Configuration files are updated, test paths reorganized, and new memory allocation tests added.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
inc/mkn/gpu/cpu.hpp (1)
65-65: ⚠️ Potential issue | 🟡 Minor: Typo: "Psuedo" should be "Pseudo".
Minor spelling issue in the info message.
Fix
```diff
-void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Psuedo GPU in use"; }
+void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Pseudo GPU in use"; }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu.hpp` at line 65, Fix the spelling in the prinfo function: change the info string in prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Psuedo GPU in use"; } to use "Pseudo" instead of "Psuedo" so the message reads "Pseudo GPU in use".
🧹 Nitpick comments (6)
inc/mkn/gpu/rocm/def.hpp (1)
46-48: Consider reusing `getDeviceProperties()` inside `getWarpSize()`.
This avoids duplicate HIP property-fetch logic in the same file.
♻️ Suggested simplification
```diff
-  hipDeviceProp_t devProp;
-  MKN_GPU_ASSERT(hipGetDeviceProperties(&devProp, dev));
-  return devProp.warpSize;
+  return getDeviceProperties(dev).warpSize;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/rocm/def.hpp` around lines 46 - 48, getWarpSize() duplicates HIP property-fetch logic by directly calling hipGetDeviceProperties; instead call the existing getDeviceProperties(dev) helper and return its warpSize to centralize error handling and avoid duplication. Replace the hipDeviceProp_t devProp / MKN_GPU_ASSERT(hipGetDeviceProperties(&devProp, dev)) block in getWarpSize() with a call to getDeviceProperties(dev) and return its .warpSize so that MKN_GPU_ASSERT and property retrieval are reused consistently.

inc/mkn/gpu/cpu/def.hpp (2)
57-67: DeviceProperties defaults may cause issues in tests.
`totalGlobalMem = 0` means CPU-mode tests using `getDeviceProperties()` will compute `limit = 0 * percent = 0`, resulting in zero-element allocations. The test file returns early for CPU mode, but if other code relies on these values, consider providing non-zero defaults or documenting this behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/def.hpp` around lines 57 - 67, DeviceProperties currently sets totalGlobalMem to 0 which can cause zero-sized allocations when code calls getDeviceProperties() and computes limits from totalGlobalMem; update DeviceProperties (struct name DeviceProperties) to provide a sensible non-zero default for totalGlobalMem (and optionally sharedMemPerBlock/maxThreadsPerBlock) or add a clear comment/docstring near the struct and the getDeviceProperties() usage documenting that CPU-mode defaults are zero and callers must handle that case; touch the struct definition in def.hpp (DeviceProperties) and/or callers of getDeviceProperties() to ensure tests or runtime code don't end up allocating zero-sized buffers.
41-41: `#if !defined(dim3)` checks for a macro, not the struct.
Since `dim3` is defined as a struct (not a `#define`), `!defined(dim3)` is always true. If the intent is to allow external code to predefine `dim3`, consider using the existing `MKN_CPU_DO_NOT_DEFINE_DIM3` guard alone.

Simplified guard
```diff
-#if !defined(dim3) and !MKN_CPU_DO_NOT_DEFINE_DIM3
+#if !MKN_CPU_DO_NOT_DEFINE_DIM3
 struct dim3 {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/def.hpp` at line 41, The preprocessor check uses !defined(dim3) which tests a macro, not the struct named dim3, so it always evaluates true; update the include guard to rely only on the existing guard symbol MKN_CPU_DO_NOT_DEFINE_DIM3 (e.g., replace the current condition using defined(dim3) with a single check that skips defining the struct when MKN_CPU_DO_NOT_DEFINE_DIM3 is set) so the code defines the struct dim3 locally only when MKN_CPU_DO_NOT_DEFINE_DIM3 is not defined.

inc/mkn/gpu/cpu/cls.hpp (3)
18-32: Unused `_fn` member in StreamEvent.
The `_fn` member is declared but never assigned: `fn` is invoked immediately in `operator()`, so storing a copy seems unnecessary. Consider removing the unused member if callbacks don't need to be retained.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/cls.hpp` around lines 18 - 32, StreamEvent declares an unused member _fn that is never assigned or used; remove the _fn data member and any references to it (keep the constructor StreamEvent(Stream&) and operator()(std::function<void()> fn = {}) behavior if immediate invocation is desired) or, if you want to retain callbacks, store fn into _fn inside operator() and invoke the stored callback instead; update the class definition (StreamEvent, operator(), and fin usage) accordingly to eliminate the dead member or to consistently use it.
77-79: Arguments not forwarded in `Launcher::operator()`.
The `args...` are passed by lvalue reference to `launch`, losing perfect forwarding. Compare with the ROCm version at `inc/mkn/gpu/rocm/cls.hpp:107-108`, which uses `std::forward<F>(f)`.

Proposed fix
```diff
 template <typename F, typename... Args>
 void operator()(F&& f, Args&&... args) {
-  launch(f, g, b, ds, s, args...);
+  launch(std::forward<F>(f), g, b, ds, s, std::forward<Args>(args)...);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/cls.hpp` around lines 77 - 79, The operator() template in the Launcher class forwards arguments by lvalue, losing perfect forwarding; modify Launcher::operator() so it forwards the callable and parameter pack to launch using std::forward (e.g., call launch(std::forward<F>(f), g, b, ds, s, std::forward<Args>(args)...)) so both F and Args... preserve value category and move semantics.
87-95: Parameter `s` shadows inherited member `s` (stream).
The constructor parameter `s` (representing count/size) shadows the inherited `Launcher::s` member (stream). This works but reduces clarity. Consider renaming to `_count` or `sz` to match the stored `count` field.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/cls.hpp` around lines 87 - 95, The constructor parameter named s in GLauncher shadows the inherited Launcher::s member; rename the parameter (e.g., to sz or _count) and update its use in the GLauncher constructor to initialize count and compute g.x (keep field GLauncher::count unchanged) so there is no shadowing of Launcher::s; update the GLauncher(std::size_t ...) signature and all references to that parameter inside the constructor accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/mem/base.hpp`:
- Line 1: File base.hpp currently contains an invalid token ("base.hpp") as its
entire content; replace that token with a valid C++ header by adding either
`#pragma once` or include guards (e.g., BASE_HPP) and the appropriate declarations
for the module (types, forward declarations, function/class prototypes expected
by code referencing base.hpp, such as any Base class or related symbols). Ensure
the header compiles standalone and exposes the identifiers used by the rest of
the codebase.
In `@test/mem/test_alloc.cpp`:
- Around line 28-32: The test can underflow when size == 0 (size = limit /
sizeof(double)), so add a guard before constructing or indexing
ManagedVector<double>: check if size == 0 and skip the element-access assertion
(e.g., return success or skip the block) to avoid evaluating
vec.data()[vec.size() - 1]; ensure the guard references the same symbols (size,
limit, ManagedVector<double>, vec.size(), vec.data()) so the test only performs
the tail-index check when size > 0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6c9dccd5-e982-4fbd-a2d4-9cc3939cb1ae
📒 Files selected for processing (15)
- inc/mkn/gpu/cpu.hpp
- inc/mkn/gpu/cpu/api.hpp
- inc/mkn/gpu/cpu/cls.hpp
- inc/mkn/gpu/cpu/def.hpp
- inc/mkn/gpu/cuda.hpp
- inc/mkn/gpu/cuda/cls.hpp
- inc/mkn/gpu/cuda/def.hpp
- inc/mkn/gpu/rocm.hpp
- inc/mkn/gpu/rocm/cls.hpp
- inc/mkn/gpu/rocm/def.hpp
- mkn.yaml
- res/mkn/clang_cuda.yaml
- res/mkn/hipcc.yaml
- test/mem/base.hpp
- test/mem/test_alloc.cpp
Actionable comments posted: 7
🧹 Nitpick comments (1)
inc/mkn/gpu/cpu/def.hpp (1)
71-73: Honor `_MKN_GPU_WARP_SIZE_` in the CPU shim too.
Hardcoding `1` here means backend-agnostic code and tests cannot use the same compile-time override that CUDA/ROCm expose through `warp_size`.

Possible fix

```diff
 std::uint32_t inline getWarpSize(size_t /*dev */ = 0) {
+#ifdef _MKN_GPU_WARP_SIZE_
+  return _MKN_GPU_WARP_SIZE_;
+#else
   return 1;
+#endif
 }
```

Based on learnings, `getWarpSize()` is designed to provide a default warp size with compile-time override via `_MKN_GPU_WARP_SIZE_`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/def.hpp` around lines 71 - 73, The CPU shim currently hardcodes warp size to 1; change getWarpSize() and the warp_size variable to honor the compile-time override _MKN_GPU_WARP_SIZE_ by returning that macro value when defined (and falling back to 1 otherwise), and initialize static const warp_size from getWarpSize() so the same compile-time override used by CUDA/ROCm affects CPU tests and backend-agnostic code; update references to getWarpSize() and warp_size accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@inc/mkn/gpu/cpu/cls.hpp`:
- Around line 53-67: The CPU launch loop in launch(...) only advances
detail::idx and never updates the global CUDA-style indices, so populate
threadIdx and blockIdx per iteration: inside the for loop over i compute flat
thread index within a block and block index from i using the grid (g) and block
(b) dimensions, set global threadIdx and blockIdx accordingly before calling
f(params...), and then increment detail::idx as before; ensure you restore/reset
detail::idx (and optionally threadIdx/blockIdx) after the loop. reference
symbols: launch, detail::idx, threadIdx, blockIdx, devmem_replace.
- Around line 4-5: The header cpu/cls.hpp (and any includes it pulls in like
inc/mkn/gpu/any/inc/device.ipp) uses several STL facilities but doesn't include
the corresponding headers; add the missing standard headers: `#include <functional>`
(for std::function, std::reference_wrapper), `#include <tuple>` (for
std::tuple, std::apply, std::forward_as_tuple), `#include <utility>` (for
std::make_index_sequence), `#include <type_traits>` (for std::is_base_of_v,
std::decay_t), and `#include <memory>` (for std::shared_ptr) so symbols used by
functions/classes in cpu/cls.hpp (and referenced in device.ipp) are declared.
In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 75-81: getLimitMallocHeapSize() always returns 0 and
setLimitMallocHeapSize() is a no-op, so implement a real storage-backed accessor
pair: add a static (preferably atomic) variable inside the CPU backend scope to
hold the heap limit, have setLimitMallocHeapSize(std::size_t const& bytes)
assign that variable, and have getLimitMallocHeapSize() return its current
value; reference the functions getLimitMallocHeapSize and setLimitMallocHeapSize
when making the change and ensure the storage has appropriate linkage/lifetime
for use across callers (use std::atomic<std::size_t> if concurrent access is
possible).
- Around line 114-123: The CPU shim destroy and destroy_host currently take T*&
which breaks parity with the CUDA/ROCm backends that accept T* by value; change
the signatures of destroy and destroy_host to take T* (plain pointer) instead of
T*& so temporaries and pointer expressions compile consistently with GPU
backends, keep the KLOG(TRC) and std::free(p) behavior but do not attempt to
null the caller's pointer (since we no longer have a reference).
- Around line 4-8: The header is not self-contained: add the missing includes so
symbols used in def.hpp are defined — specifically include <cstdint> for
std::uint32_t, <cstdlib> for std::malloc and std::free, <cstring> for
std::memcpy, and the logging header that defines KLOG/TRC (e.g. mkn/kul/log.hpp)
so KLOG(TRC) is available; update inc/mkn/gpu/cpu/def.hpp to include these
headers near the top so consumers via cpu/api.hpp compile without undefined
symbol errors.
In `@inc/mkn/gpu/rocm/cls.hpp`:
- Around line 64-71: The Pointer(T* _t) constructor calls
hipPointerGetAttributes unconditionally and compares attributes.type to magic
numbers; change it to guard against nullptr by initializing attributes to a safe
default and only calling hipPointerGetAttributes(&attributes, t) and
MKN_GPU_ASSERT(...) when t != nullptr, update
is_unregistered_ptr/is_host_ptr/is_device_ptr/is_managed_ptr to compare
attributes.type against the HIP enum constants (e.g., hipMemoryTypeUndefined,
hipMemoryTypeHost, hipMemoryTypeDevice, hipMemoryTypeManaged) and remove the
unreachable t != nullptr fallback in is_host_ptr so the logic relies on the
guarded attributes value instead; reference the Pointer constructor,
hipPointerGetAttributes, and the
is_unregistered_ptr/is_host_ptr/is_device_ptr/is_managed_ptr methods when making
these changes.
In `@test/mem/test_alloc.cpp`:
- Around line 20-33: The code uses floating-point arithmetic for byte/count
values causing precision and comparison issues: change the computation of limit
from double to an integral byte type by computing limit =
static_cast<std::size_t>(mem) * static_cast<std::size_t>(percent) or, if percent
is fractional, compute limit = static_cast<std::size_t>(std::round(mem *
percent)) but store and pass it as std::size_t when calling
mkn::gpu::setLimitMallocHeapSize(limit) and comparing against
getLimitMallocHeapSize(); then compute size as size_t size = limit /
sizeof(double) (used for the vector allocation) so all heap-limit math and
comparisons use integer types (refer to variables/functions: limit, mem,
percent, mkn::gpu::setLimitMallocHeapSize, mkn::gpu::getLimitMallocHeapSize, and
size).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1069f3b4-2a2e-4d5a-a57a-bd9c74c21258
📒 Files selected for processing (14)
- inc/mkn/gpu/cpu.hpp
- inc/mkn/gpu/cpu/api.hpp
- inc/mkn/gpu/cpu/cls.hpp
- inc/mkn/gpu/cpu/def.hpp
- inc/mkn/gpu/cuda.hpp
- inc/mkn/gpu/cuda/cls.hpp
- inc/mkn/gpu/cuda/def.hpp
- inc/mkn/gpu/rocm.hpp
- inc/mkn/gpu/rocm/cls.hpp
- inc/mkn/gpu/rocm/def.hpp
- mkn.yaml
- res/mkn/clang_cuda.yaml
- res/mkn/hipcc.yaml
- test/mem/test_alloc.cpp
🚧 Files skipped from review as they are similar to previous changes (4)
- inc/mkn/gpu/cuda/cls.hpp
- mkn.yaml
- res/mkn/hipcc.yaml
- inc/mkn/gpu/cuda.hpp
♻️ Duplicate comments (1)
test/mem/test_alloc.cpp (1)
20-32: ⚠️ Potential issue | 🟡 Minor: Use integral arithmetic for heap-limit calculations.
The computation `mem * percent` promotes to `double`, which can lose precision for large memory sizes (multi-GB). The subsequent implicit conversion when passing to `setLimitMallocHeapSize()` and the mixed-type comparison at line 25 may cause subtle mismatches.

Suggested fix
```diff
-  auto const limit = mem * percent;
+  auto const limit = static_cast<std::size_t>(mem * percent);
   if (set_heap_limit) mkn::gpu::setLimitMallocHeapSize(limit);
   auto const heapLim = mkn::gpu::getLimitMallocHeapSize();
   KLOG(DBG) << heapLim;
   if (set_heap_limit and limit > heapLim) {
     KOUT(NON) << "Cannot set heap limit! " << limit << " " << heapLim;
     return 1;
   }
   if (limit < sizeof(double)) throw std::runtime_error("limit < sizeof(double)");
-  auto const size = limit / sizeof(double);
+  auto const size = limit / sizeof(double);  // now integral division
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/mem/test_alloc.cpp` around lines 20 - 32, The code currently computes limit with floating-point math (auto const limit = mem * percent) which can lose precision; change to integer arithmetic: compute limit as an unsigned integer (e.g., uint64_t limit) using integer multiplication/division (for example using a 128-bit intermediate like (uint64_t)((__int128)mem * (uint64_t)percent / percent_scale) or mem * percent / 100 depending on how percent is represented), then pass that integral limit to mkn::gpu::setLimitMallocHeapSize and compare against mkn::gpu::getLimitMallocHeapSize (also cast/get as uint64_t) so the comparison uses the same integer type; ensure size = limit / sizeof(double) remains integer division. Reference symbols: limit, mem, percent, setLimitMallocHeapSize, getLimitMallocHeapSize, size.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 554557a9-30bd-41cf-8fdd-b6db34ac5eb0
📒 Files selected for processing (14)
- inc/mkn/gpu/cpu.hpp
- inc/mkn/gpu/cpu/api.hpp
- inc/mkn/gpu/cpu/cls.hpp
- inc/mkn/gpu/cpu/def.hpp
- inc/mkn/gpu/cuda.hpp
- inc/mkn/gpu/cuda/cls.hpp
- inc/mkn/gpu/cuda/def.hpp
- inc/mkn/gpu/rocm.hpp
- inc/mkn/gpu/rocm/cls.hpp
- inc/mkn/gpu/rocm/def.hpp
- mkn.yaml
- res/mkn/clang_cuda.yaml
- res/mkn/hipcc.yaml
- test/mem/test_alloc.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
- res/mkn/clang_cuda.yaml
- inc/mkn/gpu/rocm.hpp
- inc/mkn/gpu/cpu/cls.hpp
- res/mkn/hipcc.yaml
- inc/mkn/gpu/rocm/cls.hpp
Actionable comments posted: 5
♻️ Duplicate comments (2)
inc/mkn/gpu/rocm/cls.hpp (1)
68-70: ⚠️ Potential issue | 🟠 Major: Revert ROCm pointer-type checks to compatibility-safe literals.
Lines 68-70 switched back to named HIP memory-type constants, which can regress compatibility on older ROCm headers this repo targets. Please keep literal type checks here (and, for parity with CUDA behavior, treat unregistered as host).

Suggested patch

```diff
-  bool is_host_ptr() const { return attributes.type == hipMemoryTypeHost; }
-  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == hipMemoryTypeDevice; }
-  bool is_managed_ptr() const { return attributes.type == hipMemoryTypeManaged; }
+  bool is_host_ptr() const { return attributes.type == 1 || attributes.type == 0; }  // host or unregistered
+  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == 2; }
+  bool is_managed_ptr() const { return attributes.type == 3; }
```

🌐 Web query: Which ROCm/HIP versions guarantee availability of hipMemoryTypeHost, hipMemoryTypeDevice, and hipMemoryTypeManaged in public headers, and are there versions where projects use integer fallbacks for compatibility?

Based on learnings: in `inc/mkn/gpu/rocm/cls.hpp`, raw literal comparisons for `hipPointerGetAttributes().type` are intentional for ROCm-version compatibility and should be preserved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/rocm/cls.hpp` around lines 68 - 70, Revert the pointer-type checks in is_host_ptr, is_device_ptr, and is_managed_ptr back to the compatibility-safe raw integer literal comparisons on attributes.type (i.e., restore the literal values used in the prior implementation instead of hipMemoryTypeHost/Device/Managed), and ensure is_host_ptr also treats the "unregistered"/unknown value as host for parity with CUDA; update the implementations of is_host_ptr(), is_device_ptr(), and is_managed_ptr() accordingly.

inc/mkn/gpu/cpu/cls.hpp (1)
63-75: ⚠️ Potential issue | 🟠 Major: Still unresolved: populate `threadIdx` and `blockIdx` before each CPU kernel call.
The loop still only advances `detail::idx` and increments `blockIdx.x` after the invocation. `threadIdx` never changes, and the first call sees the `dim3` defaults instead of `(0,0,0)`, so CUDA-style indexing stays wrong on the CPU path.

Possible fix
```diff
     std::apply(
         [&](auto&&... params) {
+          auto const threads_per_block = b.x * b.y * b.z;
           for (std::size_t i = 0; i < N; ++i) {
+            auto const linear_thread = i % threads_per_block;
+            auto const linear_block = i / threads_per_block;
+
+            threadIdx.x = linear_thread % b.x;
+            threadIdx.y = (linear_thread / b.x) % b.y;
+            threadIdx.z = linear_thread / (b.x * b.y);
+
+            blockIdx.x = linear_block % g.x;
+            blockIdx.y = (linear_block / g.x) % g.y;
+            blockIdx.z = linear_block / (g.x * g.y);
+
+            detail::idx = i;
             f(params...);
-            ++blockIdx.x;
-            ++detail::idx;
           }
         },
         devmem_replace(std::forward_as_tuple(args...),
                        std::make_index_sequence<sizeof...(Args)>()));
     detail::idx = 0;
-    blockIdx.x = 0;
+    threadIdx = dim3(0, 0, 0);
+    blockIdx = dim3(0, 0, 0);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/cls.hpp` around lines 63 - 75, Before invoking the CPU kernel inside the std::apply lambda (the block that calls f(params...)), compute and assign CUDA-style indices to threadIdx and blockIdx based on detail::idx and any blockDim/threadDim semantics so each call sees correct indices; specifically, set threadIdx (and threadIdx.x/y/z) and blockIdx (and blockIdx.x/y/z) immediately before f(params...) using detail::idx and the kernel launch dimensions, then call f(params...), and only after the call increment detail::idx and advance blockIdx as needed. Locate the lambda passed to std::apply (the code using devmem_replace(std::forward_as_tuple(args...), std::make_index_sequence<...>())) and update it to populate threadIdx and blockIdx prior to f(params...) so CPU path matches CUDA indexing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@inc/mkn/gpu/cpu/cls.hpp`:
- Around line 77-83: The size-based Launcher constructors use integer floor
division which can produce zero or too-small grid sizes; modify the two
constructors that take (w,h,tpx,tpy) and (x,y,z,tpx,tpy,tpz) to compute grid
dimensions with ceiling division (e.g., (w + tpx - 1) / tpx) for each axis and
ensure the resulting dim3 g components are at least 1 so no work is dropped;
update the initializer calls to use these ceiling-divided values when
constructing g (while keeping b as the tile/thread-per-block dim3).
- Around line 28-31: The operator() implementation unconditionally invokes the
optional std::function parameter which can be empty; change operator() (the
functor operator() in cls.hpp) to check whether the std::function<void()> is
callable before calling it (e.g., if (fn) invoke fn()), leaving the default
parameter fn = {} intact and still returning *this; ensure you only call fn()
when it evaluates true to avoid std::bad_function_call.
In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 61-77: DeviceProperties.warpSize is initialized to 0 while
getWarpSize() and warp_size return 1, causing inconsistent warp size on CPU;
update the CPU-side DeviceProperties (returned by getDeviceProperties) to set
warpSize to 1 (or otherwise mirror getWarpSize()) so that code reading
DeviceProperties.warpSize, the getWarpSize() helper, and the warp_size constant
all report the same non-zero warp size; modify the DeviceProperties initializer
or the getDeviceProperties() factory to assign warpSize = 1 (and keep
getWarpSize()/warp_size unchanged).
- Around line 45-55: threadIdx and blockIdx are shared globals causing race
conditions between host threads; make them thread-local. Change the declaration
of the dim3 shims (symbol dim3 and the instances threadIdx, blockIdx) so the
instances are declared with thread_local (e.g. thread_local inline dim3
threadIdx, blockIdx) instead of the current shared static inline declaration.
In `@test/mem/test_alloc.cpp`:
- Around line 20-38: The test sets the device heap limit via
setLimitMallocHeapSize but only creates a ManagedVector (which uses
alloc_managed()), so it never exercises device-side malloc inside kernels;
change the test to launch a small CUDA kernel that performs device-side
malloc/free (or uses the library's device allocation path) to actually trigger
heap-limit enforcement: keep the setLimitMallocHeapSize(limit) call, then
replace or augment the ManagedVector allocation with a kernel that calls
malloc(sizeof(double) * N) (or the project's device allocator) and reports
success/failure back to host so the test fails if the kernel-side allocation
succeeds when it should be prevented. Reference symbols: ManagedVector,
setLimitMallocHeapSize, alloc_managed (do not use alloc_managed for this check).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9da73f88-9676-4afd-b405-4dc2dfe7dc81
📒 Files selected for processing (14)
- inc/mkn/gpu/cpu.hpp
- inc/mkn/gpu/cpu/api.hpp
- inc/mkn/gpu/cpu/cls.hpp
- inc/mkn/gpu/cpu/def.hpp
- inc/mkn/gpu/cuda.hpp
- inc/mkn/gpu/cuda/cls.hpp
- inc/mkn/gpu/cuda/def.hpp
- inc/mkn/gpu/rocm.hpp
- inc/mkn/gpu/rocm/cls.hpp
- inc/mkn/gpu/rocm/def.hpp
- mkn.yaml
- res/mkn/clang_cuda.yaml
- res/mkn/hipcc.yaml
- test/mem/test_alloc.cpp
✅ Files skipped from review due to trivial changes (4)
- inc/mkn/gpu/rocm.hpp
- inc/mkn/gpu/cpu/api.hpp
- mkn.yaml
- res/mkn/hipcc.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- res/mkn/clang_cuda.yaml
- inc/mkn/gpu/cuda.hpp
- inc/mkn/gpu/cuda/cls.hpp
- inc/mkn/gpu/rocm/def.hpp
```cpp
auto& operator()(std::function<void()> fn = {}) {
  fn();
  return *this;
}
```
🧩 Analysis chain
🌐 Web query:
What happens in C++ when `std::function<void()> fn{}` is invoked with `fn()`?
💡 Result:
When `std::function<void()> fn{}` is invoked with `fn()`, it throws a `std::bad_function_call` exception. The default constructor `std::function<void()> fn{}` creates an empty `std::function` with no callable target. Invoking `operator()` on an empty `std::function` results in this exception, as specified in the C++ standard library.
Citations:
- 1: http://en.cppreference.com/w/cpp/utility/functional/function.html
- 2: https://en.cppreference.com/w/cpp/utility/functional/function/function.html
- 3: https://en.cppreference.com/w/cpp/utility/functional/function/operator().html
Check if std::function is callable before invoking it.
When called without arguments, this operator uses the default parameter fn = {}, which creates an empty std::function object. Unconditionally invoking fn() on an empty function throws std::bad_function_call. The signature allows no-arg invocation, so it should behave as a no-op in this case.
Suggested fix

```diff
 auto& operator()(std::function<void()> fn = {}) {
-  fn();
+  if (fn) fn();
   return *this;
 }
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
auto& operator()(std::function<void()> fn = {}) {
  if (fn) fn();
  return *this;
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@inc/mkn/gpu/cpu/cls.hpp` around lines 28 - 31, The operator() implementation
unconditionally invokes the optional std::function parameter which can be empty;
change operator() (the functor operator() in cls.hpp) to check whether the
std::function<void()> is callable before calling it (e.g., if (fn) invoke fn()),
leaving the default parameter fn = {} intact and still returning *this; ensure
you only call fn() when it evaluates true to avoid std::bad_function_call.
```cpp
struct Launcher {
  Launcher(dim3 _g, dim3 _b) : g{_g}, b{_b} {}
  Launcher(std::size_t w, std::size_t h, std::size_t tpx, std::size_t tpy)
      : Launcher{dim3(w / tpx, h / tpy), dim3(tpx, tpy)} {}
  Launcher(std::size_t x, std::size_t y, std::size_t z, std::size_t tpx, std::size_t tpy,
           std::size_t tpz)
      : Launcher{dim3(x / tpx, y / tpy, z / tpz), dim3(tpx, tpy, tpz)} {}
```
Use ceiling division in the size-based Launcher constructors.
These constructors floor-divide by the tile size. If w < tpx or the dimensions are not exact multiples, g becomes zero or too small and the CPU launch drops work.
Possible fix

```diff
   Launcher(std::size_t w, std::size_t h, std::size_t tpx, std::size_t tpy)
-      : Launcher{dim3(w / tpx, h / tpy), dim3(tpx, tpy)} {}
+      : g{}, b{} {
+    KASSERT(tpx && tpy);
+    g = dim3((w + tpx - 1) / tpx, (h + tpy - 1) / tpy);
+    b = dim3(tpx, tpy);
+  }
   Launcher(std::size_t x, std::size_t y, std::size_t z, std::size_t tpx, std::size_t tpy,
            std::size_t tpz)
-      : Launcher{dim3(x / tpx, y / tpy, z / tpz), dim3(tpx, tpy, tpz)} {}
+      : g{}, b{} {
+    KASSERT(tpx && tpy && tpz);
+    g = dim3((x + tpx - 1) / tpx, (y + tpy - 1) / tpy, (z + tpz - 1) / tpz);
+    b = dim3(tpx, tpy, tpz);
+  }
```

📝 Committable suggestion

```cpp
struct Launcher {
  Launcher(dim3 _g, dim3 _b) : g{_g}, b{_b} {}
  Launcher(std::size_t w, std::size_t h, std::size_t tpx, std::size_t tpy)
      : g{}, b{} {
    KASSERT(tpx && tpy);
    g = dim3((w + tpx - 1) / tpx, (h + tpy - 1) / tpy);
    b = dim3(tpx, tpy);
  }
  Launcher(std::size_t x, std::size_t y, std::size_t z, std::size_t tpx, std::size_t tpy,
           std::size_t tpz)
      : g{}, b{} {
    KASSERT(tpx && tpy && tpz);
    g = dim3((x + tpx - 1) / tpx, (y + tpy - 1) / tpy, (z + tpz - 1) / tpz);
    b = dim3(tpx, tpy, tpz);
  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@inc/mkn/gpu/cpu/cls.hpp` around lines 77 - 83, The size-based Launcher
constructors use integer floor division which can produce zero or too-small grid
sizes; modify the two constructors that take (w,h,tpx,tpy) and
(x,y,z,tpx,tpy,tpz) to compute grid dimensions with ceiling division (e.g., (w +
tpx - 1) / tpx) for each axis and ensure the resulting dim3 g components are at
least 1 so no work is dropped; update the initializer calls to use these
ceiling-divided values when constructing g (while keeping b as the
tile/thread-per-block dim3).
```cpp
#if !defined(dim3) and !MKN_CPU_DO_NOT_DEFINE_DIM3
struct dim3 {
  dim3() {}
  dim3(std::size_t x) : x{x} {}
  dim3(std::size_t x, std::size_t y) : x{x}, y{y} {}
  dim3(std::size_t x, std::size_t y, std::size_t z) : x{x}, y{y}, z{z} {}

  std::size_t x = 1, y = 1, z = 1;
};
dim3 static inline threadIdx, blockIdx;
#endif  // MKN_CPU_DO_NOT_DEFINE_DIM3
```
Keep the CPU index shims thread-local.
detail::idx in inc/mkn/gpu/cpu/cls.hpp:55-57 is already thread_local, but threadIdx and blockIdx are shared globals here. Two host threads launching CPU kernels concurrently will race on these coordinates.
Possible fix

```diff
-dim3 static inline threadIdx, blockIdx;
+static inline thread_local dim3 threadIdx{0, 0, 0}, blockIdx{0, 0, 0};
```
Verify each finding against the current code and only fix it if needed.
In `@inc/mkn/gpu/cpu/def.hpp` around lines 45 - 55, threadIdx and blockIdx are
shared globals causing race conditions between host threads; make them
thread-local. Change the declaration of the dim3 shims (symbol dim3 and the
instances threadIdx, blockIdx) so the instances are declared with thread_local
(e.g. thread_local inline dim3 threadIdx, blockIdx) instead of the current
shared static inline declaration.
```cpp
auto const limit = mem * percent;
if (set_heap_limit) mkn::gpu::setLimitMallocHeapSize(limit);

auto const heapLim = mkn::gpu::getLimitMallocHeapSize();
KLOG(DBG) << heapLim;
if (set_heap_limit and limit > heapLim) {
  KOUT(NON) << "Cannot set heap limit! " << limit << " " << heapLim;
  return 1;
}

if (limit < sizeof(double)) throw std::runtime_error("limit < sizeof(double)");

auto const size = limit / sizeof(double);

if (size == 0) throw std::runtime_error("size == 0");

{
  ManagedVector<double> vec(size, 2);
  if (vec.data()[vec.size() - 1] != 2) return 1;
```
🧩 Analysis chain
🌐 Web query:
Does `cudaLimitMallocHeapSize`/`hipLimitMallocHeapSize` affect `cudaMallocManaged`/`hipMallocManaged`, or only device-side `malloc`/`new` inside kernels?
💡 Result:
No, cudaLimitMallocHeapSize and hipLimitMallocHeapSize affect only device-side malloc and new inside kernels, not cudaMallocManaged or hipMallocManaged. These limits control the size of the runtime heap used exclusively by device-side dynamic memory allocation functions like malloc and free (or new/delete) called from within CUDA/HIP kernels. Official NVIDIA CUDA documentation explicitly states: "CU_LIMIT_MALLOC_HEAP_SIZE controls the size of the heap used by the malloc and free device system calls" (source 15,17). Similarly, HIP documentation confirms hipLimitMallocHeapSize sets "the limit value of the heap used by the malloc/free calls" (source 21,67). In contrast, cudaMallocManaged/hipMallocManaged are host-side runtime API calls that allocate Unified Managed Memory (UMM), which is managed separately by the driver through virtual memory management or page migration mechanisms, not the per-context device runtime heap (sources 2,3,6,18). No official documentation links these limits to managed memory allocations, and searches for direct connections returned no evidence. Device malloc draws from a fixed-size heap (default 8MB, configurable via these limits before kernel launch), while managed allocations are backed by device global memory or oversubscribed via paging, independent of the runtime heap.
Citations:
- 1: https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDA__CTX_g0651954dfb9788173e60a9af7201e65a.html
- 2: https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDART__DEVICE_g50e098f554dbfca0ef054e7b7d39d06e.html
- 3: https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDART__DEVICE_gb8a22e29cc297634b0702db0b69f59e7.html
- 4: https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/doxygen/html/group___device.html
- 5: https://stackoverflow.com/questions/34794481/cuda-stack-and-heap
- 6: https://stackoverflow.com/questions/44901330/cuda-c-using-malloc-in-kernel-functions-gives-strange-results
- 7: https://stackoverflow.com/questions/23260074/allocating-malloc-a-double-in-cuda-device-function
- 8: How to set the malloc heap size in CuPy? cupy/cupy#8693
- 9: https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hip_runtime_api/modules/device_management.html
This test does not exercise the heap limit it sets.
ManagedVector allocates through the managed-allocation path (alloc_managed() in inc/mkn/gpu/cuda/def.hpp), while setLimitMallocHeapSize() controls only device-side malloc/free inside kernels. A green run here only shows managed allocation succeeded, not that heap-limit enforcement works.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/mem/test_alloc.cpp` around lines 20 - 38, The test sets the device heap
limit via setLimitMallocHeapSize but only creates a ManagedVector (which uses
alloc_managed()), so it never exercises device-side malloc inside kernels;
change the test to launch a small CUDA kernel that performs device-side
malloc/free (or uses the library's device allocation path) to actually trigger
heap-limit enforcement: keep the setLimitMallocHeapSize(limit) call, then
replace or augment the ManagedVector allocation with a kernel that calls
malloc(sizeof(double) * N) (or the project's device allocator) and reports
success/failure back to host so the test fails if the kernel-side allocation
succeeds when it should be prevented. Reference symbols: ManagedVector,
setLimitMallocHeapSize, alloc_managed (do not use alloc_managed for this check).
have you tested that it's not the same for cudaMalloc as I have, and it's required on this machine
you can see the comment in the test to check on a case by case basis
@PhilipDeegan, that's a fair point — I hadn't tested it on hardware and was relying on the formal spec, which says cudaLimitMallocHeapSize controls only device-side malloc/free inside kernels. However, the NVIDIA documentation itself acknowledges that "the features and behavior of unified memory vary between operating systems, kernel versions on Linux, GPU hardware, and the GPU-CPU interconnect", so it's plausible that certain driver/hardware combinations require the heap limit to be raised before large managed allocations succeed.
Given your empirical testing and the self-documenting set_heap_limit = true; // see if you need it by setting false after reboot flag, the test design already accounts for this variance. I'll withdraw the concern and note this for future reviews.
✏️ Learnings added
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 32
File: test/mem/test_alloc.cpp:20-38
Timestamp: 2026-03-22T12:49:36.974Z
Learning: In `test/mem/test_alloc.cpp` (mkn/mkn.gpu), `setLimitMallocHeapSize` is called before `cudaMallocManaged`-backed `ManagedVector` allocations. While the formal CUDA spec says `cudaLimitMallocHeapSize` only controls the device-side kernel malloc heap, the author (PhilipDeegan) has empirically verified that on their hardware/driver/OS combination this limit must be raised for large managed allocations to succeed. The `set_heap_limit` compile-time flag (default `true`) allows testing whether it is needed on a given machine. Do not flag this as incorrect usage.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 32
File: inc/mkn/gpu/cpu/def.hpp:75-81
Timestamp: 2026-03-14T14:51:24.672Z
Learning: In `inc/mkn/gpu/cpu/def.hpp`, `getLimitMallocHeapSize()` returning 0 and `setLimitMallocHeapSize()` being a no-op are intentional: these are CUDA/ROCm device-heap concepts with no CPU equivalent. The CPU backend uses the system allocator directly. The heap-limit test in `test/mem/test_alloc.cpp` is also explicitly skipped on the CPU path.
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-06-10T19:38:12.663Z
Learning: PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.
PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
inc/mkn/gpu/cuda/cls.hpp (1)
4-6: ⚠️ Potential issue | 🟠 Major

Include `<stdexcept>` in this header.

The `Pointer` constructor at line 68 throws `std::runtime_error`, but `<stdexcept>` is not included. This violates header self-containment: consumers may fail to compile if they don't transitively include `<stdexcept>` before this header.

Suggested fix

```diff
 #include <cstddef>
 #include <functional>
+#include <stdexcept>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cuda/cls.hpp` around lines 4 - 6, The header is missing <stdexcept>, but the Pointer constructor throws std::runtime_error (symbol: Pointer::Pointer), which breaks header self-containment; add `#include` <stdexcept> to the top of the header alongside the existing includes so std::runtime_error is declared for any translation unit that includes this header.
♻️ Duplicate comments (1)
inc/mkn/gpu/cpu/def.hpp (1)
45-55: ⚠️ Potential issue | 🟠 Major

Make the CPU launch coordinates thread-local.

`threadIdx` and `blockIdx` are still shared globals, so two host threads launching CPU kernels at the same time can stomp each other's coordinates.

Suggested fix

```diff
-dim3 static inline threadIdx, blockIdx;
+static inline thread_local dim3 threadIdx, blockIdx;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inc/mkn/gpu/cpu/def.hpp` around lines 45 - 55, The CPU launch coordinates threadIdx and blockIdx are currently shared globals and must be made thread-local to avoid races; change the declarations of threadIdx and blockIdx (the variables declared after struct dim3) to use thread-local storage (e.g., mark them thread_local inline) so each host thread gets its own dim3 instances while keeping the dim3 struct and the surrounding MKN_CPU_DO_NOT_DEFINE_DIM3 guard unchanged.
🧹 Nitpick comments (2)
tst/any/managed.cpp (1)
106-115: Assert each transfer in `test_copy()`.

`dev0` is never read, so the first `mkn::gpu::copy(dev0, hst0)` can regress without failing the test.

Suggested improvement

```diff
 uint32_t test_copy() {
   std::vector<float> hst0(NUM, 1), hst1(NUM, 2);
   ManagedVector<float> dev0(NUM), dev1(NUM);
   // copy(T0* dst, T1* src, Size const size)
   mkn::gpu::copy(dev0, hst0);
+  if (dev0.back() != 1) return 1;
   mkn::gpu::copy(hst1, hst0);
+  if (hst1.back() != 1) return 1;
   mkn::gpu::copy(dev1, hst1);
-  return dev1[NUM - 1] != 1;
+  return dev1.back() != 1;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tst/any/managed.cpp` around lines 106 - 115, The test_copy() currently performs three transfers but only checks dev1, so the first transfer (mkn::gpu::copy(dev0, hst0)) can silently fail; update test_copy to assert each transfer: after mkn::gpu::copy(dev0, hst0) read/verify dev0 matches hst0, after mkn::gpu::copy(hst1, hst0) verify hst1 elements equal the original hst0 values, and after mkn::gpu::copy(dev1, hst1) verify dev1 matches hst1 (and return non-zero on any mismatch). Use the existing identifiers test_copy, dev0, dev1, hst0, and hst1 to locate where to add these checks.

mkn.yaml (1)
14-16: Keep the formatter pointed at the new test tree.

The active test sources moved to `tst/...`, but `clang.format` still only walks `test`, so the new tests can drift out of formatting checks.

Suggested fix

```diff
- clang.format{init{style: file, paths: inc test}}
+ clang.format{init{style: file, paths: inc test tst}}
```

Also applies to: 22-23, 30-35, 39-39
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mkn.yaml` around lines 14 - 16, The formatter configuration still targets the old "test" tree; update the formatting globs/patterns so clang.format (or the formatter config referenced in mkn.yaml) also includes the new "tst/..." test tree—replace or augment occurrences of patterns like "test: tst/any/(\w).cpp" and the other affected pattern blocks (lines noted in the review) so they point to or include "tst/..." test paths instead of only "test/..."; ensure all instances (the groups at the other affected locations) are updated so the new tests are covered by formatting checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@inc/mkn/gpu/any/inc/alloc.ipp`:
- Around line 136-140: The size-check in copy(auto& dst, auto const& src) is
inverted and the copy length is wrong; change the guard to if (dst.size() <
src.size()) and throw a clear "dst too small" error, then perform the copy for
src.size() elements (i.e. call copy(dst.data(), src.data(), src.size())); also
handle empty containers up front (if (src.empty() || dst.empty()) return;) to
avoid dereferencing null/invalid pointers; update the error message/context
accordingly while keeping the existing copy(dst.data(), src.data(), ...) call
site.
In `@inc/mkn/gpu/def.hpp`:
- Around line 10-21: The macro guard logic for
MKN_GPU_ROCM/MKN_GPU_CUDA/MKN_GPU_CPU can emit redefinition warnings when users
pass explicit -D flags; update the blocks that set the mutually-exclusive macros
(the sections referencing MKN_GPU_ROCM, MKN_GPU_CUDA and MKN_GPU_CPU) to undef
any target macro before redefining it (or use `#if` defined(...) to only define
when unset) so that redefinitions are suppressed; specifically modify the
conditional sections that currently `#define` MKN_GPU_CUDA, MKN_GPU_ROCM and
MKN_GPU_CUDA=0 to perform an `#undef` on the macro being set prior to the `#define`
(or skip defining if already defined) to avoid warnings.
---
Outside diff comments:
In `@inc/mkn/gpu/cuda/cls.hpp`:
- Around line 4-6: The header is missing <stdexcept>, but the Pointer
constructor throws std::runtime_error (symbol: Pointer::Pointer), which breaks
header self-containment; add `#include` <stdexcept> to the top of the header
alongside the existing includes so std::runtime_error is declared for any
translation unit that includes this header.
---
Duplicate comments:
In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 45-55: The CPU launch coordinates threadIdx and blockIdx are
currently shared globals and must be made thread-local to avoid races; change
the declarations of threadIdx and blockIdx (the variables declared after struct
dim3) to use thread-local storage (e.g., mark them thread_local inline) so each
host thread gets its own dim3 instances while keeping the dim3 struct and the
surrounding MKN_CPU_DO_NOT_DEFINE_DIM3 guard unchanged.
---
Nitpick comments:
In `@mkn.yaml`:
- Around line 14-16: The formatter configuration still targets the old "test"
tree; update the formatting globs/patterns so clang.format (or the formatter
config referenced in mkn.yaml) also includes the new "tst/..." test tree—replace
or augment occurrences of patterns like "test: tst/any/(\w).cpp" and the other
affected pattern blocks (lines noted in the review) so they point to or include
"tst/..." test paths instead of only "test/..."; ensure all instances (the
groups at the other affected locations) are updated so the new tests are covered
by formatting checks.
In `@tst/any/managed.cpp`:
- Around line 106-115: The test_copy() currently performs three transfers but
only checks dev1, so the first transfer (mkn::gpu::copy(dev0, hst0)) can
silently fail; update test_copy to assert each transfer: after
mkn::gpu::copy(dev0, hst0) read/verify dev0 matches hst0, after
mkn::gpu::copy(hst1, hst0) verify hst1 elements equal the original hst0 values,
and after mkn::gpu::copy(dev1, hst1) verify dev1 matches hst1 (and return
non-zero on any mismatch). Use the existing identifiers test_copy, dev0, dev1,
hst0, and hst1 to locate where to add these checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7bbfdf73-24aa-477a-bb6b-97071a3b0314
📒 Files selected for processing (25)
- inc/mkn/gpu/any/inc/alloc.ipp
- inc/mkn/gpu/cpu/def.hpp
- inc/mkn/gpu/cuda/cls.hpp
- inc/mkn/gpu/def.hpp
- inc/mkn/gpu/rocm/cls.hpp
- mkn.yaml
- tst/any/__share__.hpp
- tst/any/add.cpp
- tst/any/array.cpp
- tst/any/asio.cpp
- tst/any/async_streaming.cpp
- tst/any/class1.cpp
- tst/any/class2.cpp
- tst/any/construct.cpp
- tst/any/info.cpp
- tst/any/managed.cpp
- tst/any/struct.cpp
- tst/cpu/namespace.cpp
- tst/cuda/add.cpp
- tst/cuda/async.cpp
- tst/cuda/atomic.cpp
- tst/hip/add.cpp
- tst/hip/async.cpp
- tst/hip/async.ppc
- tst/mem/test_alloc.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- inc/mkn/gpu/rocm/cls.hpp
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Chores