
memory heap set / alloc test #32

Merged
PhilipDeegan merged 3 commits into master from next
Mar 22, 2026

Conversation

@PhilipDeegan (Member) commented Mar 14, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added device property and memory limit query functions for GPU configuration inspection.
  • Bug Fixes

    • Improved null pointer validation with descriptive error messages instead of silent failures.
    • Enhanced pointer type detection accuracy on GPU backends.
  • Tests

    • Added GPU memory allocation testing infrastructure to validate memory management.
  • Chores

    • Reorganized GPU compatibility layer structure for improved modularity.
    • Updated build configuration and test paths.
    • Enhanced device information output formatting and diagnostic reporting.

@coderabbitai Bot commented Mar 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

This PR refactors the CPU GPU-emulation layer from a monolithic header into three separate modules (def.hpp, api.hpp, cls.hpp), relocates launcher and pointer infrastructure, and adds enhanced device property accessors across CUDA and ROCm backends with null-pointer error handling. Configuration files are updated, test paths reorganized, and new memory allocation tests added.

Changes

Cohort / File(s) Summary
CPU GPU Emulation Refactor
inc/mkn/gpu/cpu.hpp, inc/mkn/gpu/cpu/def.hpp, inc/mkn/gpu/cpu/api.hpp, inc/mkn/gpu/cpu/cls.hpp
Moved CPU GPU-emulation primitives from monolithic cpu.hpp into split headers: def.hpp provides compatibility macros and allocation/deallocation; api.hpp includes def.hpp; cls.hpp implements Stream, StreamEvent, Pointer, launch template, Launcher, and GLauncher with thread_local idx state tracking. Corrected typo "Psuedo" → "Pseudo" in prinfo().
CUDA Backend Updates
inc/mkn/gpu/cuda.hpp, inc/mkn/gpu/cuda/def.hpp, inc/mkn/gpu/cuda/cls.hpp
Added getDeviceProperties() and getLimitMallocHeapSize() accessors; updated getWarpSize() to use getDeviceProperties(). Enhanced Pointer constructor to throw std::runtime_error on null input; replaced is_unregistered_ptr() with type() accessor; refactored pointer-kind predicates to use cudaMemoryType enums. Updated prinfo() formatting and launch trace logging.
ROCm Backend Updates
inc/mkn/gpu/rocm.hpp, inc/mkn/gpu/rocm/def.hpp, inc/mkn/gpu/rocm/cls.hpp
Added getDeviceProperties() and getLimitMallocHeapSize(); wrapped hipGetDeviceProperties with MKN_GPU_ASSERT. Updated Pointer to throw on null, replaced is_unregistered_ptr() with type() accessor, refactored pointer-kind predicates using hipMemoryType enums. Added HIP version output to prinfo().
Core GPU Configuration
inc/mkn/gpu/def.hpp
Enforced mutual exclusivity of MKN_GPU_ROCM and MKN_GPU_CUDA at compile-time; added auto-detection error when both CUDA and HIP headers are detectable without explicit backend selection; normalized undefined backends to 0 when CPU mode is enabled.
Memory Management
inc/mkn/gpu/any/inc/alloc.ipp
Refactored copy() pointer-kind predicates using is_dev/is_hst lambdas that treat managed pointers as both device and host. Added new copy(auto&, auto const&) overload with size validation.
Build Configuration
mkn.yaml, res/mkn/clang_cuda.yaml, res/mkn/hipcc.yaml
Updated test directory paths from test/ to tst/; added parameterized sm.arch property to clang_cuda.yaml replacing hardcoded sm_120; added /opt/rocm/lib path to hipcc.yaml; introduced mem_test profile with tst/mem/test_alloc.cpp main.
Test Suite
tst/any/add.cpp, tst/any/managed.cpp, tst/mem/test_alloc.cpp
Replaced assert() with mkn::kul::abort_if_not() for pointer validation. Extended managed.cpp with test_copy() function. Added new comprehensive memory allocation test (test_alloc.cpp) probing device heap limits across multiple percentages.
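
As a rough orientation for the CUDA and ROCm rows above (not the library's actual implementation), accessors like getDeviceProperties(), getLimitMallocHeapSize(), and setLimitMallocHeapSize() are typically thin wrappers over the CUDA runtime calls cudaGetDeviceProperties, cudaDeviceGetLimit, and cudaDeviceSetLimit. A minimal sketch, assuming a throw-on-error policy rather than the library's MKN_GPU_ASSERT:

#include <cstddef>
#include <stdexcept>
#include <cuda_runtime.h>

// Sketch only: names mirror the walkthrough above, not the actual mkn.gpu sources.
inline cudaDeviceProp getDeviceProperties(int const dev = 0) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
    throw std::runtime_error("cudaGetDeviceProperties failed");
  return prop;  // exposes totalGlobalMem, warpSize, etc.
}

inline std::size_t getLimitMallocHeapSize() {
  std::size_t bytes = 0;  // current device-side malloc heap limit
  if (cudaDeviceGetLimit(&bytes, cudaLimitMallocHeapSize) != cudaSuccess)
    throw std::runtime_error("cudaDeviceGetLimit failed");
  return bytes;
}

inline void setLimitMallocHeapSize(std::size_t const bytes) {
  // must run before kernels that rely on device-side malloc
  if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes) != cudaSuccess)
    throw std::runtime_error("cudaDeviceSetLimit failed");
}

The ROCm path is analogous via hipGetDeviceProperties and hipDeviceGetLimit/hipDeviceSetLimit with hipLimitMallocHeapSize.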

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • updates #30: Directly refactors CPU GPU-surface into split def/api/cls headers with identical structural reorganization of Stream, Pointer, launcher, and memory operations.
  • just cpu things #27: Modifies inc/mkn/gpu/cpu.hpp removing CUDA keyword shims (__device__, __host__, dim3, threadIdx) and CPU idx() accessor mechanics.
  • host group mutex #23: Updates CPU-side GPU emulation types and functions (Stream, StreamEvent, launch semantics) with overlapping interface changes.

Poem

🐰 Headers split in threes so neat,
CPU GPU, now discrete!
Pointers checked with NULL's defeat,
Memory tests make work complete,
Refactored code—chef's kiss—sweet! 🍀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 5.06%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly summarizes the main change: adding heap memory limit setter functionality and allocation tests across GPU backends.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch next

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
inc/mkn/gpu/cpu.hpp (1)

65-65: ⚠️ Potential issue | 🟡 Minor

Typo: "Psuedo" should be "Pseudo".

Minor spelling issue in the info message.

Fix
-void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Psuedo GPU in use"; }
+void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Pseudo GPU in use"; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu.hpp` at line 65, Fix the spelling in the prinfo function:
change the info string in prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Psuedo
GPU in use"; } to use "Pseudo" instead of "Psuedo" so the message reads "Pseudo
GPU in use".
🧹 Nitpick comments (6)
inc/mkn/gpu/rocm/def.hpp (1)

46-48: Consider reusing getDeviceProperties() inside getWarpSize().

This avoids duplicate HIP property-fetch logic in the same file.

♻️ Suggested simplification
-  hipDeviceProp_t devProp;
-  MKN_GPU_ASSERT(hipGetDeviceProperties(&devProp, dev));
-  return devProp.warpSize;
+  return getDeviceProperties(dev).warpSize;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/rocm/def.hpp` around lines 46 - 48, getWarpSize() duplicates HIP
property-fetch logic by directly calling hipGetDeviceProperties; instead call
the existing getDeviceProperties(dev) helper and return its warpSize to
centralize error handling and avoid duplication. Replace the hipDeviceProp_t
devProp / MKN_GPU_ASSERT(hipGetDeviceProperties(&devProp, dev)) block in
getWarpSize() with a call to getDeviceProperties(dev) and return its .warpSize
so that MKN_GPU_ASSERT and property retrieval are reused consistently.
inc/mkn/gpu/cpu/def.hpp (2)

57-67: DeviceProperties defaults may cause issues in tests.

totalGlobalMem = 0 means CPU-mode tests using getDeviceProperties() will compute limit = 0 * percent = 0, resulting in zero-element allocations. The test file returns early for CPU mode, but if other code relies on these values, consider providing non-zero defaults or documenting this behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/def.hpp` around lines 57 - 67, DeviceProperties currently
sets totalGlobalMem to 0 which can cause zero-sized allocations when code calls
getDeviceProperties() and computes limits from totalGlobalMem; update
DeviceProperties (struct name DeviceProperties) to provide a sensible non-zero
default for totalGlobalMem (and optionally sharedMemPerBlock/maxThreadsPerBlock)
or add a clear comment/docstring near the struct and the getDeviceProperties()
usage documenting that CPU-mode defaults are zero and callers must handle that
case; touch the struct definition in def.hpp (DeviceProperties) and/or callers
of getDeviceProperties() to ensure tests or runtime code don't end up allocating
zero-sized buffers.

41-41: #if !defined(dim3) checks for a macro, not the struct.

Since dim3 is defined as a struct (not a #define), !defined(dim3) is always true. If the intent is to allow external code to predefine dim3, consider using the existing MKN_CPU_DO_NOT_DEFINE_DIM3 guard alone.

Simplified guard
-#if !defined(dim3) and !MKN_CPU_DO_NOT_DEFINE_DIM3
+#if !MKN_CPU_DO_NOT_DEFINE_DIM3
 struct dim3 {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/def.hpp` at line 41, The preprocessor check uses
!defined(dim3) which tests a macro, not the struct named dim3, so it always
evaluates true; update the include guard to rely only on the existing guard
symbol MKN_CPU_DO_NOT_DEFINE_DIM3 (e.g., replace the current condition using
defined(dim3) with a single check that skips defining the struct when
MKN_CPU_DO_NOT_DEFINE_DIM3 is set) so the code defines the struct dim3 locally
only when MKN_CPU_DO_NOT_DEFINE_DIM3 is not defined.
inc/mkn/gpu/cpu/cls.hpp (3)

18-32: Unused _fn member in StreamEvent.

The _fn member is declared but never assigned—fn is invoked immediately in operator(), so storing a copy seems unnecessary. Consider removing the unused member if callbacks don't need to be retained.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/cls.hpp` around lines 18 - 32, StreamEvent declares an unused
member _fn that is never assigned or used; remove the _fn data member and any
references to it (keep the constructor StreamEvent(Stream&) and
operator()(std::function<void()> fn = {}) behavior if immediate invocation is
desired) or, if you want to retain callbacks, store fn into _fn inside
operator() and invoke the stored callback instead; update the class definition
(StreamEvent, operator(), and fin usage) accordingly to eliminate the dead
member or to consistently use it.

77-79: Arguments not forwarded in Launcher::operator().

The args... are passed by lvalue reference to launch, losing perfect forwarding. Compare with the ROCm version at inc/mkn/gpu/rocm/cls.hpp:107-108 which uses std::forward<F>(f).

Proposed fix
   template <typename F, typename... Args>
   void operator()(F&& f, Args&&... args) {
-    launch(f, g, b, ds, s, args...);
+    launch(std::forward<F>(f), g, b, ds, s, std::forward<Args>(args)...);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/cls.hpp` around lines 77 - 79, The operator() template in the
Launcher class forwards arguments by lvalue, losing perfect forwarding; modify
Launcher::operator() so it forwards the callable and parameter pack to launch
using std::forward (e.g., call launch(std::forward<F>(f), g, b, ds, s,
std::forward<Args>(args)...)) so both F and Args... preserve value category and
move semantics.

87-95: Parameter s shadows inherited member s (stream).

The constructor parameter s (representing count/size) shadows the inherited Launcher::s member (stream). This works but reduces clarity. Consider renaming to _count or sz to match the stored count field.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/cls.hpp` around lines 87 - 95, The constructor parameter
named s in GLauncher shadows the inherited Launcher::s member; rename the
parameter (e.g., to sz or _count) and update its use in the GLauncher
constructor to initialize count and compute g.x (keep field GLauncher::count
unchanged) so there is no shadowing of Launcher::s; update the
GLauncher(std::size_t ...) signature and all references to that parameter inside
the constructor accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/mem/base.hpp`:
- Line 1: File base.hpp currently contains an invalid token ("base.hpp") as its
entire content; replace that token with a valid C++ header by adding either a
`#pragma` once or include guards (e.g., BASE_HPP) and the appropriate declarations
for the module (types, forward declarations, function/class prototypes expected
by code referencing base.hpp, such as any Base class or related symbols). Ensure
the header compiles standalone and exposes the identifiers used by the rest of
the codebase.

In `@test/mem/test_alloc.cpp`:
- Around line 28-32: The test can underflow when size == 0 (size = limit /
sizeof(double)), so add a guard before constructing or indexing
ManagedVector<double>: check if size == 0 and skip the element-access assertion
(e.g., return success or skip the block) to avoid evaluating
vec.data()[vec.size() - 1]; ensure the guard references the same symbols (size,
limit, ManagedVector<double>, vec.size(), vec.data()) so the test only performs
the tail-index check when size > 0.

---

Outside diff comments:
In `@inc/mkn/gpu/cpu.hpp`:
- Line 65: Fix the spelling in the prinfo function: change the info string in
prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Psuedo GPU in use"; } to use
"Pseudo" instead of "Psuedo" so the message reads "Pseudo GPU in use".

---

Nitpick comments:
In `@inc/mkn/gpu/cpu/cls.hpp`:
- Around line 18-32: StreamEvent declares an unused member _fn that is never
assigned or used; remove the _fn data member and any references to it (keep the
constructor StreamEvent(Stream&) and operator()(std::function<void()> fn = {})
behavior if immediate invocation is desired) or, if you want to retain
callbacks, store fn into _fn inside operator() and invoke the stored callback
instead; update the class definition (StreamEvent, operator(), and fin usage)
accordingly to eliminate the dead member or to consistently use it.
- Around line 77-79: The operator() template in the Launcher class forwards
arguments by lvalue, losing perfect forwarding; modify Launcher::operator() so
it forwards the callable and parameter pack to launch using std::forward (e.g.,
call launch(std::forward<F>(f), g, b, ds, s, std::forward<Args>(args)...)) so
both F and Args... preserve value category and move semantics.
- Around line 87-95: The constructor parameter named s in GLauncher shadows the
inherited Launcher::s member; rename the parameter (e.g., to sz or _count) and
update its use in the GLauncher constructor to initialize count and compute g.x
(keep field GLauncher::count unchanged) so there is no shadowing of Launcher::s;
update the GLauncher(std::size_t ...) signature and all references to that
parameter inside the constructor accordingly.

In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 57-67: DeviceProperties currently sets totalGlobalMem to 0 which
can cause zero-sized allocations when code calls getDeviceProperties() and
computes limits from totalGlobalMem; update DeviceProperties (struct name
DeviceProperties) to provide a sensible non-zero default for totalGlobalMem (and
optionally sharedMemPerBlock/maxThreadsPerBlock) or add a clear
comment/docstring near the struct and the getDeviceProperties() usage
documenting that CPU-mode defaults are zero and callers must handle that case;
touch the struct definition in def.hpp (DeviceProperties) and/or callers of
getDeviceProperties() to ensure tests or runtime code don't end up allocating
zero-sized buffers.
- Line 41: The preprocessor check uses !defined(dim3) which tests a macro, not
the struct named dim3, so it always evaluates true; update the include guard to
rely only on the existing guard symbol MKN_CPU_DO_NOT_DEFINE_DIM3 (e.g., replace
the current condition using defined(dim3) with a single check that skips
defining the struct when MKN_CPU_DO_NOT_DEFINE_DIM3 is set) so the code defines
the struct dim3 locally only when MKN_CPU_DO_NOT_DEFINE_DIM3 is not defined.

In `@inc/mkn/gpu/rocm/def.hpp`:
- Around line 46-48: getWarpSize() duplicates HIP property-fetch logic by
directly calling hipGetDeviceProperties; instead call the existing
getDeviceProperties(dev) helper and return its warpSize to centralize error
handling and avoid duplication. Replace the hipDeviceProp_t devProp /
MKN_GPU_ASSERT(hipGetDeviceProperties(&devProp, dev)) block in getWarpSize()
with a call to getDeviceProperties(dev) and return its .warpSize so that
MKN_GPU_ASSERT and property retrieval are reused consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6c9dccd5-e982-4fbd-a2d4-9cc3939cb1ae

📥 Commits

Reviewing files that changed from the base of the PR and between e5ee749 and dbc2581.

📒 Files selected for processing (15)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cpu/api.hpp
  • inc/mkn/gpu/cpu/cls.hpp
  • inc/mkn/gpu/cpu/def.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/cuda/cls.hpp
  • inc/mkn/gpu/cuda/def.hpp
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/rocm/cls.hpp
  • inc/mkn/gpu/rocm/def.hpp
  • mkn.yaml
  • res/mkn/clang_cuda.yaml
  • res/mkn/hipcc.yaml
  • test/mem/base.hpp
  • test/mem/test_alloc.cpp

Comment thread test/mem/base.hpp Outdated
Comment thread tst/mem/test_alloc.cpp
@coderabbitai Bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (1)
inc/mkn/gpu/cpu/def.hpp (1)

71-73: Honor _MKN_GPU_WARP_SIZE_ in the CPU shim too.

Hardcoding 1 here means backend-agnostic code and tests cannot use the same compile-time override that CUDA/ROCm expose through warp_size.

Possible fix
 std::uint32_t inline getWarpSize(size_t /*dev */ = 0) {
+#ifdef _MKN_GPU_WARP_SIZE_
+  return _MKN_GPU_WARP_SIZE_;
+#else
   return 1;
+#endif
 }

Based on learnings, getWarpSize() is designed to provide a default warp size with compile-time override via _MKN_GPU_WARP_SIZE_.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/def.hpp` around lines 71 - 73, The CPU shim currently
hardcodes warp size to 1; change getWarpSize() and the warp_size variable to
honor the compile-time override _MKN_GPU_WARP_SIZE_ by returning that macro
value when defined (and falling back to 1 otherwise), and initialize static
const warp_size from getWarpSize() so the same compile-time override used by
CUDA/ROCm affects CPU tests and backend-agnostic code; update references to
getWarpSize() and warp_size accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@inc/mkn/gpu/cpu/cls.hpp`:
- Around line 53-67: The CPU launch loop in launch(...) only advances
detail::idx and never updates the global CUDA-style indices, so populate
threadIdx and blockIdx per iteration: inside the for loop over i compute flat
thread index within a block and block index from i using the grid (g) and block
(b) dimensions, set global threadIdx and blockIdx accordingly before calling
f(params...), and then increment detail::idx as before; ensure you restore/reset
detail::idx (and optionally threadIdx/blockIdx) after the loop. reference
symbols: launch, detail::idx, threadIdx, blockIdx, devmem_replace.
- Around line 4-5: The header cpu/cls.hpp (and any includes it pulls in like
inc/mkn/gpu/any/inc/device.ipp) uses several STL facilities but doesn't include
the corresponding headers; add the missing standard headers: `#include`
<functional> (for std::function, std::reference_wrapper), `#include` <tuple> (for
std::tuple, std::apply, std::forward_as_tuple), `#include` <utility> (for
std::make_index_sequence), `#include` <type_traits> (for std::is_base_of_v,
std::decay_t), and `#include` <memory> (for std::shared_ptr) so symbols used by
functions/classes in cpu/cls.hpp (and referenced in device.ipp) are declared.

In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 75-81: getLimitMallocHeapSize() always returns 0 and
setLimitMallocHeapSize() is a no-op, so implement a real storage-backed accessor
pair: add a static (preferably atomic) variable inside the CPU backend scope to
hold the heap limit, have setLimitMallocHeapSize(std::size_t const& bytes)
assign that variable, and have getLimitMallocHeapSize() return its current
value; reference the functions getLimitMallocHeapSize and setLimitMallocHeapSize
when making the change and ensure the storage has appropriate linkage/lifetime
for use across callers (use std::atomic<std::size_t> if concurrent access is
possible).
- Around line 114-123: The CPU shim destroy and destroy_host currently take T*&
which breaks parity with the CUDA/ROCm backends that accept T* by value; change
the signatures of destroy and destroy_host to take T* (plain pointer) instead of
T*& so temporaries and pointer expressions compile consistently with GPU
backends, keep the KLOG(TRC) and std::free(p) behavior but do not attempt to
null the caller's pointer (since we no longer have a reference).
- Around line 4-8: The header is not self-contained: add the missing includes so
symbols used in def.hpp are defined — specifically include <cstdint> for
std::uint32_t, <cstdlib> for std::malloc and std::free, <cstring> for
std::memcpy, and the logging header that defines KLOG/TRC (e.g. mkn/kul/log.hpp)
so KLOG(TRC) is available; update inc/mkn/gpu/cpu/def.hpp to include these
headers near the top so consumers via cpu/api.hpp compile without undefined
symbol errors.

In `@inc/mkn/gpu/rocm/cls.hpp`:
- Around line 64-71: The Pointer(T* _t) constructor calls
hipPointerGetAttributes unconditionally and compares attributes.type to magic
numbers; change it to guard against nullptr by initializing attributes to a safe
default and only calling hipPointerGetAttributes(&attributes, t) and
MKN_GPU_ASSERT(...) when t != nullptr, update
is_unregistered_ptr/is_host_ptr/is_device_ptr/is_managed_ptr to compare
attributes.type against the HIP enum constants (e.g., hipMemoryTypeUndefined,
hipMemoryTypeHost, hipMemoryTypeDevice, hipMemoryTypeManaged) and remove the
unreachable t != nullptr fallback in is_host_ptr so the logic relies on the
guarded attributes value instead; reference the Pointer constructor,
hipPointerGetAttributes, and the
is_unregistered_ptr/is_host_ptr/is_device_ptr/is_managed_ptr methods when making
these changes.

In `@test/mem/test_alloc.cpp`:
- Around line 20-33: The code uses floating-point arithmetic for byte/count
values causing precision and comparison issues: change the computation of limit
from double to an integral byte type by computing limit =
static_cast<std::size_t>(mem) * static_cast<std::size_t>(percent) or, if percent
is fractional, compute limit = static_cast<std::size_t>(std::round(mem *
percent)) but store and pass it as std::size_t when calling
mkn::gpu::setLimitMallocHeapSize(limit) and comparing against
getLimitMallocHeapSize(); then compute size as size_t size = limit /
sizeof(double) (used for the vector allocation) so all heap-limit math and
comparisons use integer types (refer to variables/functions: limit, mem,
percent, mkn::gpu::setLimitMallocHeapSize, mkn::gpu::getLimitMallocHeapSize, and
size).

---

Nitpick comments:
In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 71-73: The CPU shim currently hardcodes warp size to 1; change
getWarpSize() and the warp_size variable to honor the compile-time override
_MKN_GPU_WARP_SIZE_ by returning that macro value when defined (and falling back
to 1 otherwise), and initialize static const warp_size from getWarpSize() so the
same compile-time override used by CUDA/ROCm affects CPU tests and
backend-agnostic code; update references to getWarpSize() and warp_size
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1069f3b4-2a2e-4d5a-a57a-bd9c74c21258

📥 Commits

Reviewing files that changed from the base of the PR and between dbc2581 and 0d15ea2.

📒 Files selected for processing (14)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cpu/api.hpp
  • inc/mkn/gpu/cpu/cls.hpp
  • inc/mkn/gpu/cpu/def.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/cuda/cls.hpp
  • inc/mkn/gpu/cuda/def.hpp
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/rocm/cls.hpp
  • inc/mkn/gpu/rocm/def.hpp
  • mkn.yaml
  • res/mkn/clang_cuda.yaml
  • res/mkn/hipcc.yaml
  • test/mem/test_alloc.cpp
🚧 Files skipped from review as they are similar to previous changes (4)
  • inc/mkn/gpu/cuda/cls.hpp
  • mkn.yaml
  • res/mkn/hipcc.yaml
  • inc/mkn/gpu/cuda.hpp

Comment thread inc/mkn/gpu/cpu/cls.hpp
Comment thread inc/mkn/gpu/cpu/cls.hpp
Comment thread inc/mkn/gpu/cpu/def.hpp
Comment thread inc/mkn/gpu/cpu/def.hpp
Comment thread inc/mkn/gpu/cpu/def.hpp
Comment thread inc/mkn/gpu/rocm/cls.hpp Outdated
Comment thread tst/mem/test_alloc.cpp
@coderabbitai Bot left a comment

♻️ Duplicate comments (1)
test/mem/test_alloc.cpp (1)

20-32: ⚠️ Potential issue | 🟡 Minor

Use integral arithmetic for heap-limit calculations.

The computation mem * percent promotes to double, which can lose precision for large memory sizes (multi-GB). The subsequent implicit conversion when passing to setLimitMallocHeapSize() and the mixed-type comparison at line 25 may cause subtle mismatches.

Suggested fix
-  auto const limit = mem * percent;
+  auto const limit = static_cast<std::size_t>(mem * percent);
   if (set_heap_limit) mkn::gpu::setLimitMallocHeapSize(limit);

   auto const heapLim = mkn::gpu::getLimitMallocHeapSize();
   KLOG(DBG) << heapLim;
   if (set_heap_limit and limit > heapLim) {
     KOUT(NON) << "Cannot set heap limit! " << limit << " " << heapLim;
     return 1;
   }

   if (limit < sizeof(double)) throw std::runtime_error("limit < sizeof(double)");

-  auto const size = limit / sizeof(double);
+  auto const size = limit / sizeof(double);  // now integral division
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/mem/test_alloc.cpp` around lines 20 - 32, The code currently computes
limit with floating-point math (auto const limit = mem * percent) which can lose
precision; change to integer arithmetic: compute limit as an unsigned integer
(e.g., uint64_t limit) using integer multiplication/division (for example using
a 128-bit intermediate like (uint64_t)((__int128)mem * (uint64_t)percent /
percent_scale) or mem * percent / 100 depending on how percent is represented),
then pass that integral limit to mkn::gpu::setLimitMallocHeapSize and compare
against mkn::gpu::getLimitMallocHeapSize (also cast/get as uint64_t) so the
comparison uses the same integer type; ensure size = limit / sizeof(double)
remains integer division. Reference symbols: limit, mem, percent,
setLimitMallocHeapSize, getLimitMallocHeapSize, size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/mem/test_alloc.cpp`:
- Around line 20-32: The code currently computes limit with floating-point math
(auto const limit = mem * percent) which can lose precision; change to integer
arithmetic: compute limit as an unsigned integer (e.g., uint64_t limit) using
integer multiplication/division (for example using a 128-bit intermediate like
(uint64_t)((__int128)mem * (uint64_t)percent / percent_scale) or mem * percent /
100 depending on how percent is represented), then pass that integral limit to
mkn::gpu::setLimitMallocHeapSize and compare against
mkn::gpu::getLimitMallocHeapSize (also cast/get as uint64_t) so the comparison
uses the same integer type; ensure size = limit / sizeof(double) remains integer
division. Reference symbols: limit, mem, percent, setLimitMallocHeapSize,
getLimitMallocHeapSize, size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 554557a9-30bd-41cf-8fdd-b6db34ac5eb0

📥 Commits

Reviewing files that changed from the base of the PR and between 0d15ea2 and 8ac124a.

📒 Files selected for processing (14)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cpu/api.hpp
  • inc/mkn/gpu/cpu/cls.hpp
  • inc/mkn/gpu/cpu/def.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/cuda/cls.hpp
  • inc/mkn/gpu/cuda/def.hpp
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/rocm/cls.hpp
  • inc/mkn/gpu/rocm/def.hpp
  • mkn.yaml
  • res/mkn/clang_cuda.yaml
  • res/mkn/hipcc.yaml
  • test/mem/test_alloc.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • res/mkn/clang_cuda.yaml
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/cpu/cls.hpp
  • res/mkn/hipcc.yaml
  • inc/mkn/gpu/rocm/cls.hpp

@coderabbitai Bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (2)
inc/mkn/gpu/rocm/cls.hpp (1)

68-70: ⚠️ Potential issue | 🟠 Major

Revert ROCm pointer-type checks to compatibility-safe literals.

Line 68–70 switched back to named HIP memory-type constants, which can regress compatibility on older ROCm headers this repo targets. Please keep literal type checks here (and, for parity with CUDA behavior, treat unregistered as host).

Suggested patch
-  bool is_host_ptr() const { return attributes.type == hipMemoryTypeHost; }
-  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == hipMemoryTypeDevice; }
-  bool is_managed_ptr() const { return attributes.type == hipMemoryTypeManaged; }
+  bool is_host_ptr() const { return attributes.type == 1 || attributes.type == 0; }  // host or unregistered
+  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == 2; }
+  bool is_managed_ptr() const { return attributes.type == 3; }
Which ROCm/HIP versions guarantee availability of hipMemoryTypeHost, hipMemoryTypeDevice, and hipMemoryTypeManaged in public headers, and are there versions where projects use integer fallbacks for compatibility?

Based on learnings: in inc/mkn/gpu/rocm/cls.hpp, raw literal comparisons for hipPointerGetAttributes().type are intentional for ROCm-version compatibility and should be preserved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/rocm/cls.hpp` around lines 68 - 70, Revert the pointer-type
checks in is_host_ptr, is_device_ptr, and is_managed_ptr back to the
compatibility-safe raw integer literal comparisons on attributes.type (i.e.,
restore the literal values used in the prior implementation instead of
hipMemoryTypeHost/Device/Managed), and ensure is_host_ptr also treats the
"unregistered"/unknown value as host for parity with CUDA; update the
implementations of is_host_ptr(), is_device_ptr(), and is_managed_ptr()
accordingly.
inc/mkn/gpu/cpu/cls.hpp (1)

63-75: ⚠️ Potential issue | 🟠 Major

Still unresolved: populate threadIdx and blockIdx before each CPU kernel call.

The loop still only advances detail::idx and increments blockIdx.x after the invocation. threadIdx never changes, and the first call sees the dim3 defaults instead of (0,0,0), so CUDA-style indexing stays wrong on the CPU path.

Possible fix
   std::apply(
       [&](auto&&... params) {
+        auto const threads_per_block = b.x * b.y * b.z;
         for (std::size_t i = 0; i < N; ++i) {
+          auto const linear_thread = i % threads_per_block;
+          auto const linear_block = i / threads_per_block;
+
+          threadIdx.x = linear_thread % b.x;
+          threadIdx.y = (linear_thread / b.x) % b.y;
+          threadIdx.z = linear_thread / (b.x * b.y);
+
+          blockIdx.x = linear_block % g.x;
+          blockIdx.y = (linear_block / g.x) % g.y;
+          blockIdx.z = linear_block / (g.x * g.y);
+
+          detail::idx = i;
           f(params...);
-          ++blockIdx.x;
-          ++detail::idx;
         }
       },
       devmem_replace(std::forward_as_tuple(args...), std::make_index_sequence<sizeof...(Args)>()));
 
   detail::idx = 0;
-  blockIdx.x = 0;
+  threadIdx = dim3(0, 0, 0);
+  blockIdx = dim3(0, 0, 0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/cls.hpp` around lines 63 - 75, Before invoking the CPU kernel
inside the std::apply lambda (the block that calls f(params...)), compute and
assign CUDA-style indices to threadIdx and blockIdx based on detail::idx and any
blockDim/threadDim semantics so each call sees correct indices; specifically,
set threadIdx (and threadIdx.x/y/z) and blockIdx (and blockIdx.x/y/z)
immediately before f(params...) using detail::idx and the kernel launch
dimensions, then call f(params...), and only after the call increment
detail::idx and advance blockIdx as needed. Locate the lambda passed to
std::apply (the code using devmem_replace(std::forward_as_tuple(args...),
std::make_index_sequence<...>())) and update it to populate threadIdx and
blockIdx prior to f(params...) so CPU path matches CUDA indexing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@inc/mkn/gpu/cpu/cls.hpp`:
- Around line 77-83: The size-based Launcher constructors use integer floor
division which can produce zero or too-small grid sizes; modify the two
constructors that take (w,h,tpx,tpy) and (x,y,z,tpx,tpy,tpz) to compute grid
dimensions with ceiling division (e.g., (w + tpx - 1) / tpx) for each axis and
ensure the resulting dim3 g components are at least 1 so no work is dropped;
update the initializer calls to use these ceiling-divided values when
constructing g (while keeping b as the tile/thread-per-block dim3).
- Around line 28-31: The operator() implementation unconditionally invokes the
optional std::function parameter which can be empty; change operator() (the
functor operator() in cls.hpp) to check whether the std::function<void()> is
callable before calling it (e.g., if (fn) invoke fn()), leaving the default
parameter fn = {} intact and still returning *this; ensure you only call fn()
when it evaluates true to avoid std::bad_function_call.

In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 61-77: DeviceProperties.warpSize is initialized to 0 while
getWarpSize() and warp_size return 1, causing inconsistent warp size on CPU;
update the CPU-side DeviceProperties (returned by getDeviceProperties) to set
warpSize to 1 (or otherwise mirror getWarpSize()) so that code reading
DeviceProperties.warpSize, the getWarpSize() helper, and the warp_size constant
all report the same non-zero warp size; modify the DeviceProperties initializer
or the getDeviceProperties() factory to assign warpSize = 1 (and keep
getWarpSize()/warp_size unchanged).
- Around line 45-55: threadIdx and blockIdx are shared globals causing race
conditions between host threads; make them thread-local. Change the declaration
of the dim3 shims (symbol dim3 and the instances threadIdx, blockIdx) so the
instances are declared with thread_local (e.g. thread_local inline dim3
threadIdx, blockIdx) instead of the current shared static inline declaration.

In `@test/mem/test_alloc.cpp`:
- Around line 20-38: The test sets the device heap limit via
setLimitMallocHeapSize but only creates a ManagedVector (which uses
alloc_managed()), so it never exercises device-side malloc inside kernels;
change the test to launch a small CUDA kernel that performs device-side
malloc/free (or uses the library's device allocation path) to actually trigger
heap-limit enforcement: keep the setLimitMallocHeapSize(limit) call, then
replace or augment the ManagedVector allocation with a kernel that calls
malloc(sizeof(double) * N) (or the project's device allocator) and reports
success/failure back to host so the test fails if the kernel-side allocation
succeeds when it should be prevented. Reference symbols: ManagedVector,
setLimitMallocHeapSize, alloc_managed (do not use alloc_managed for this check).

---

Duplicate comments:
In `@inc/mkn/gpu/cpu/cls.hpp`:
- Around line 63-75: Before invoking the CPU kernel inside the std::apply lambda
(the block that calls f(params...)), compute and assign CUDA-style indices to
threadIdx and blockIdx based on detail::idx and any blockDim/threadDim semantics
so each call sees correct indices; specifically, set threadIdx (and
threadIdx.x/y/z) and blockIdx (and blockIdx.x/y/z) immediately before
f(params...) using detail::idx and the kernel launch dimensions, then call
f(params...), and only after the call increment detail::idx and advance blockIdx
as needed. Locate the lambda passed to std::apply (the code using
devmem_replace(std::forward_as_tuple(args...), std::make_index_sequence<...>()))
and update it to populate threadIdx and blockIdx prior to f(params...) so CPU
path matches CUDA indexing.

In `@inc/mkn/gpu/rocm/cls.hpp`:
- Around line 68-70: Revert the pointer-type checks in is_host_ptr,
is_device_ptr, and is_managed_ptr back to the compatibility-safe raw integer
literal comparisons on attributes.type (i.e., restore the literal values used in
the prior implementation instead of hipMemoryTypeHost/Device/Managed), and
ensure is_host_ptr also treats the "unregistered"/unknown value as host for
parity with CUDA; update the implementations of is_host_ptr(), is_device_ptr(),
and is_managed_ptr() accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9da73f88-9676-4afd-b405-4dc2dfe7dc81

📥 Commits

Reviewing files that changed from the base of the PR and between 8ac124a and debe35e.

📒 Files selected for processing (14)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cpu/api.hpp
  • inc/mkn/gpu/cpu/cls.hpp
  • inc/mkn/gpu/cpu/def.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/cuda/cls.hpp
  • inc/mkn/gpu/cuda/def.hpp
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/rocm/cls.hpp
  • inc/mkn/gpu/rocm/def.hpp
  • mkn.yaml
  • res/mkn/clang_cuda.yaml
  • res/mkn/hipcc.yaml
  • test/mem/test_alloc.cpp
✅ Files skipped from review due to trivial changes (4)
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/cpu/api.hpp
  • mkn.yaml
  • res/mkn/hipcc.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • res/mkn/clang_cuda.yaml
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/cuda/cls.hpp
  • inc/mkn/gpu/rocm/def.hpp

Comment thread inc/mkn/gpu/cpu/cls.hpp
Comment on lines +28 to +31
  auto& operator()(std::function<void()> fn = {}) {
    fn();
    return *this;
  }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What happens in C++ when std::function<void()> fn{} is invoked with fn()?

💡 Result:

When an empty std::function<void()> fn{} is invoked with fn(), it throws a std::bad_function_call exception. The default constructor std::function<void()> fn{} creates an empty std::function with no callable target. Invoking operator() on an empty std::function results in this exception, as specified in the C++ standard library.



Check if std::function is callable before invoking it.

When called without arguments, this operator uses the default parameter fn = {}, which creates an empty std::function object. Unconditionally invoking fn() on an empty function throws std::bad_function_call. The signature allows no-arg invocation, so it should behave as a no-op in this case.

Suggested fix
   auto& operator()(std::function<void()> fn = {}) {
-    fn();
+    if (fn) fn();
     return *this;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  auto& operator()(std::function<void()> fn = {}) {
-    fn();
-    return *this;
-  }
+  auto& operator()(std::function<void()> fn = {}) {
+    if (fn) fn();
+    return *this;
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/cls.hpp` around lines 28 - 31, The operator() implementation
unconditionally invokes the optional std::function parameter which can be empty;
change operator() (the functor operator() in cls.hpp) to check whether the
std::function<void()> is callable before calling it (e.g., if (fn) invoke fn()),
leaving the default parameter fn = {} intact and still returning *this; ensure
you only call fn() when it evaluates true to avoid std::bad_function_call.

Comment thread inc/mkn/gpu/cpu/cls.hpp
Comment on lines +77 to +83
struct Launcher {
  Launcher(dim3 _g, dim3 _b) : g{_g}, b{_b} {}
  Launcher(std::size_t w, std::size_t h, std::size_t tpx, std::size_t tpy)
      : Launcher{dim3(w / tpx, h / tpy), dim3(tpx, tpy)} {}
  Launcher(std::size_t x, std::size_t y, std::size_t z, std::size_t tpx, std::size_t tpy,
           std::size_t tpz)
      : Launcher{dim3(x / tpx, y / tpy, z / tpz), dim3(tpx, tpy, tpz)} {}

⚠️ Potential issue | 🟠 Major

Use ceiling division in the size-based Launcher constructors.

These constructors floor-divide by the tile size. If w < tpx or the dimensions are not exact multiples, g becomes zero or too small and the CPU launch drops work.

Possible fix
   Launcher(std::size_t w, std::size_t h, std::size_t tpx, std::size_t tpy)
-      : Launcher{dim3(w / tpx, h / tpy), dim3(tpx, tpy)} {}
+      : g{}, b{} {
+    KASSERT(tpx && tpy);
+    g = dim3((w + tpx - 1) / tpx, (h + tpy - 1) / tpy);
+    b = dim3(tpx, tpy);
+  }
   Launcher(std::size_t x, std::size_t y, std::size_t z, std::size_t tpx, std::size_t tpy,
            std::size_t tpz)
-      : Launcher{dim3(x / tpx, y / tpy, z / tpz), dim3(tpx, tpy, tpz)} {}
+      : g{}, b{} {
+    KASSERT(tpx && tpy && tpz);
+    g = dim3((x + tpx - 1) / tpx, (y + tpy - 1) / tpy, (z + tpz - 1) / tpz);
+    b = dim3(tpx, tpy, tpz);
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 struct Launcher {
   Launcher(dim3 _g, dim3 _b) : g{_g}, b{_b} {}
   Launcher(std::size_t w, std::size_t h, std::size_t tpx, std::size_t tpy)
-      : Launcher{dim3(w / tpx, h / tpy), dim3(tpx, tpy)} {}
+      : g{}, b{} {
+    KASSERT(tpx && tpy);
+    g = dim3((w + tpx - 1) / tpx, (h + tpy - 1) / tpy);
+    b = dim3(tpx, tpy);
+  }
   Launcher(std::size_t x, std::size_t y, std::size_t z, std::size_t tpx, std::size_t tpy,
            std::size_t tpz)
-      : Launcher{dim3(x / tpx, y / tpy, z / tpz), dim3(tpx, tpy, tpz)} {}
+      : g{}, b{} {
+    KASSERT(tpx && tpy && tpz);
+    g = dim3((x + tpx - 1) / tpx, (y + tpy - 1) / tpy, (z + tpz - 1) / tpz);
+    b = dim3(tpx, tpy, tpz);
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/cls.hpp` around lines 77 - 83, The size-based Launcher
constructors use integer floor division which can produce zero or too-small grid
sizes; modify the two constructors that take (w,h,tpx,tpy) and
(x,y,z,tpx,tpy,tpz) to compute grid dimensions with ceiling division (e.g., (w +
tpx - 1) / tpx) for each axis and ensure the resulting dim3 g components are at
least 1 so no work is dropped; update the initializer calls to use these
ceiling-divided values when constructing g (while keeping b as the
tile/thread-per-block dim3).

Comment thread inc/mkn/gpu/cpu/def.hpp
Comment on lines +45 to +55
#if !defined(dim3) and !MKN_CPU_DO_NOT_DEFINE_DIM3
struct dim3 {
  dim3() {}
  dim3(std::size_t x) : x{x} {}
  dim3(std::size_t x, std::size_t y) : x{x}, y{y} {}
  dim3(std::size_t x, std::size_t y, std::size_t z) : x{x}, y{y}, z{z} {}

  std::size_t x = 1, y = 1, z = 1;
};
dim3 static inline threadIdx, blockIdx;
#endif  // MKN_CPU_DO_NOT_DEFINE_DIM3

⚠️ Potential issue | 🟠 Major

Keep the CPU index shims thread-local.

detail::idx in inc/mkn/gpu/cpu/cls.hpp:55-57 is already thread_local, but threadIdx and blockIdx are shared globals here. Two host threads launching CPU kernels concurrently will race on these coordinates.

Possible fix
-dim3 static inline threadIdx, blockIdx;
+static inline thread_local dim3 threadIdx{0, 0, 0}, blockIdx{0, 0, 0};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/def.hpp` around lines 45 - 55, threadIdx and blockIdx are
shared globals causing race conditions between host threads; make them
thread-local. Change the declaration of the dim3 shims (symbol dim3 and the
instances threadIdx, blockIdx) so the instances are declared with thread_local
(e.g. thread_local inline dim3 threadIdx, blockIdx) instead of the current
shared static inline declaration.

Comment thread inc/mkn/gpu/cpu/def.hpp
Comment thread tst/mem/test_alloc.cpp
Comment on lines +20 to +38
  auto const limit = mem * percent;
  if (set_heap_limit) mkn::gpu::setLimitMallocHeapSize(limit);

  auto const heapLim = mkn::gpu::getLimitMallocHeapSize();
  KLOG(DBG) << heapLim;
  if (set_heap_limit and limit > heapLim) {
    KOUT(NON) << "Cannot set heap limit! " << limit << " " << heapLim;
    return 1;
  }

  if (limit < sizeof(double)) throw std::runtime_error("limit < sizeof(double)");

  auto const size = limit / sizeof(double);

  if (size == 0) throw std::runtime_error("size == 0");

  {
    ManagedVector<double> vec(size, 2);
    if (vec.data()[vec.size() - 1] != 2) return 1;
@coderabbitai Bot commented Mar 22, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does cudaLimitMallocHeapSize/hipLimitMallocHeapSize affect cudaMallocManaged/hipMallocManaged, or only device-side malloc/new inside kernels?

💡 Result:

No, cudaLimitMallocHeapSize and hipLimitMallocHeapSize affect only device-side malloc and new inside kernels, not cudaMallocManaged or hipMallocManaged. These limits control the size of the runtime heap used exclusively by device-side dynamic memory allocation functions like malloc and free (or new/delete) called from within CUDA/HIP kernels. Official NVIDIA CUDA documentation explicitly states: "CU_LIMIT_MALLOC_HEAP_SIZE controls the size of the heap used by the malloc and free device system calls" (source 15,17). Similarly, HIP documentation confirms hipLimitMallocHeapSize sets "the limit value of the heap used by the malloc/free calls" (source 21,67). In contrast, cudaMallocManaged/hipMallocManaged are host-side runtime API calls that allocate Unified Managed Memory (UMM), which is managed separately by the driver through virtual memory management or page migration mechanisms, not the per-context device runtime heap (sources 2,3,6,18). No official documentation links these limits to managed memory allocations, and searches for direct connections returned no evidence. Device malloc draws from a fixed-size heap (default 8MB, configurable via these limits before kernel launch), while managed allocations are backed by device global memory or oversubscribed via paging, independent of the runtime heap.



This test does not exercise the heap limit it sets.

ManagedVector allocates through the managed-allocation path (alloc_managed() in inc/mkn/gpu/cuda/def.hpp), while setLimitMallocHeapSize() controls only device-side malloc/free inside kernels. A green run here only shows managed allocation succeeded, not that heap-limit enforcement works.
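
For reference, a probe that would exercise the device-side heap limit directly looks roughly like the standalone CUDA sketch below. This is a hypothetical illustration, not code from this PR; the kernel name, sizes, and flow are invented for the example.

#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe: device-side malloc draws from the runtime heap that
// cudaLimitMallocHeapSize governs, unlike cudaMallocManaged allocations.
__global__ void try_device_malloc(std::size_t bytes, int* ok) {
  void* p = malloc(bytes);
  *ok = (p != nullptr);
  if (p) free(p);
}

int main() {
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 << 20);  // 16 MiB device heap
  int* ok{};
  cudaMallocManaged(&ok, sizeof(int));
  try_device_malloc<<<1, 1>>>(32 << 20, ok);  // larger than the heap: expected to fail
  cudaDeviceSynchronize();
  std::printf("device-side malloc %s\n", *ok ? "succeeded" : "failed as expected");
  cudaFree(ok);
}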

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/mem/test_alloc.cpp` around lines 20 - 38, The test sets the device heap
limit via setLimitMallocHeapSize but only creates a ManagedVector (which uses
alloc_managed()), so it never exercises device-side malloc inside kernels;
change the test to launch a small CUDA kernel that performs device-side
malloc/free (or uses the library's device allocation path) to actually trigger
heap-limit enforcement: keep the setLimitMallocHeapSize(limit) call, then
replace or augment the ManagedVector allocation with a kernel that calls
malloc(sizeof(double) * N) (or the project's device allocator) and reports
success/failure back to host so the test fails if the kernel-side allocation
succeeds when it should be prevented. Reference symbols: ManagedVector,
setLimitMallocHeapSize, alloc_managed (do not use alloc_managed for this check).

@PhilipDeegan (Member, Author) commented

Have you tested that it's not the same for cudaMalloc, as I have? It's required on this machine.
You can see the comment in the test to check on a case-by-case basis.

@coderabbitai Bot commented

@PhilipDeegan, that's a fair point — I hadn't tested it on hardware and was relying on the formal spec, which says cudaLimitMallocHeapSize controls only device-side malloc/free inside kernels. However, the NVIDIA documentation itself acknowledges that "the features and behavior of unified memory vary between operating systems, kernel versions on Linux, GPU hardware, and the GPU-CPU interconnect", so it's plausible that certain driver/hardware combinations require the heap limit to be raised before large managed allocations succeed.

Given your empirical testing and the self-documenting set_heap_limit = true; // see if you need it by setting false after reboot flag, the test design already accounts for this variance. I'll withdraw the concern and note this for future reviews.


✏️ Learnings added
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 32
File: test/mem/test_alloc.cpp:20-38
Timestamp: 2026-03-22T12:49:36.974Z
Learning: In `test/mem/test_alloc.cpp` (mkn/mkn.gpu), `setLimitMallocHeapSize` is called before `cudaMallocManaged`-backed `ManagedVector` allocations. While the formal CUDA spec says `cudaLimitMallocHeapSize` only controls the device-side kernel malloc heap, the author (PhilipDeegan) has empirically verified that on their hardware/driver/OS combination this limit must be raised for large managed allocations to succeed. The `set_heap_limit` compile-time flag (default `true`) allows testing whether it is needed on a given machine. Do not flag this as incorrect usage.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 32
File: inc/mkn/gpu/cpu/def.hpp:75-81
Timestamp: 2026-03-14T14:51:24.672Z
Learning: In `inc/mkn/gpu/cpu/def.hpp`, `getLimitMallocHeapSize()` returning 0 and `setLimitMallocHeapSize()` being a no-op are intentional: these are CUDA/ROCm device-heap concepts with no CPU equivalent. The CPU backend uses the system allocator directly. The heap-limit test in `test/mem/test_alloc.cpp` is also explicitly skipped on the CPU path.

Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-06-10T19:38:12.663Z
Learning: PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.

@coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
inc/mkn/gpu/cuda/cls.hpp (1)

4-6: ⚠️ Potential issue | 🟠 Major

Include <stdexcept> in this header.

The Pointer constructor at line 68 throws std::runtime_error, but <stdexcept> is not included. This violates header self-containment—consumers may fail to compile if they don't transitively include <stdexcept> before this header.

Suggested fix
 #include <cstddef>
 #include <functional>
+#include <stdexcept>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cuda/cls.hpp` around lines 4 - 6, The header is missing
<stdexcept>, but the Pointer constructor throws std::runtime_error (symbol:
Pointer::Pointer), which breaks header self-containment; add `#include`
<stdexcept> to the top of the header alongside the existing includes so
std::runtime_error is declared for any translation unit that includes this
header.
♻️ Duplicate comments (1)
inc/mkn/gpu/cpu/def.hpp (1)

45-55: ⚠️ Potential issue | 🟠 Major

Make the CPU launch coordinates thread-local.

threadIdx and blockIdx are still shared globals, so two host threads launching CPU kernels at the same time can stomp each other’s coordinates.

Suggested fix
-dim3 static inline threadIdx, blockIdx;
+static inline thread_local dim3 threadIdx, blockIdx;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/gpu/cpu/def.hpp` around lines 45 - 55, The CPU launch coordinates
threadIdx and blockIdx are currently shared globals and must be made
thread-local to avoid races; change the declarations of threadIdx and blockIdx
(the variables declared after struct dim3) to use thread-local storage (e.g.,
mark them thread_local inline) so each host thread gets its own dim3 instances
while keeping the dim3 struct and the surrounding MKN_CPU_DO_NOT_DEFINE_DIM3
guard unchanged.
🧹 Nitpick comments (2)
tst/any/managed.cpp (1)

106-115: Assert each transfer in test_copy().

dev0 is never read, so the first mkn::gpu::copy(dev0, hst0) can regress without failing the test.

Suggested improvement
 uint32_t test_copy() {
   std::vector<float> hst0(NUM, 1), hst1(NUM, 2);
   ManagedVector<float> dev0(NUM), dev1(NUM);
 
   // copy(T0* dst, T1* src, Size const size)
   mkn::gpu::copy(dev0, hst0);
+  if (dev0.back() != 1) return 1;
   mkn::gpu::copy(hst1, hst0);
+  if (hst1.back() != 1) return 1;
   mkn::gpu::copy(dev1, hst1);
 
-  return dev1[NUM - 1] != 1;
+  return dev1.back() != 1;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tst/any/managed.cpp` around lines 106 - 115, The test_copy() currently
performs three transfers but only checks dev1, so the first transfer
(mkn::gpu::copy(dev0, hst0)) can silently fail; update test_copy to assert each
transfer: after mkn::gpu::copy(dev0, hst0) read/verify dev0 matches hst0, after
mkn::gpu::copy(hst1, hst0) verify hst1 elements equal the original hst0 values,
and after mkn::gpu::copy(dev1, hst1) verify dev1 matches hst1 (and return
non-zero on any mismatch). Use the existing identifiers test_copy, dev0, dev1,
hst0, and hst1 to locate where to add these checks.
mkn.yaml (1)

14-16: Keep the formatter pointed at the new test tree.

The active test sources moved to tst/..., but clang.format still only walks test, so the new tests can drift out of formatting checks.

Suggested fix
-    clang.format{init{style: file, paths: inc test}}
+    clang.format{init{style: file, paths: inc test tst}}

Also applies to: 22-23, 30-35, 39-39

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mkn.yaml` around lines 14 - 16, The formatter configuration still targets the
old "test" tree; update the formatting globs/patterns so clang.format (or the
formatter config referenced in mkn.yaml) also includes the new "tst/..." test
tree—replace or augment occurrences of patterns like "test: tst/any/(\w).cpp"
and the other affected pattern blocks (lines noted in the review) so they point
to or include "tst/..." test paths instead of only "test/..."; ensure all
instances (the groups at the other affected locations) are updated so the new
tests are covered by formatting checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@inc/mkn/gpu/any/inc/alloc.ipp`:
- Around line 136-140: The size-check in copy(auto& dst, auto const& src) is
inverted and the copy length is wrong; change the guard to if (dst.size() <
src.size()) and throw a clear "dst too small" error, then perform the copy for
src.size() elements (i.e. call copy(dst.data(), src.data(), src.size())); also
handle empty containers up front (if (src.empty() || dst.empty()) return;) to
avoid dereferencing null/invalid pointers; update the error message/context
accordingly while keeping the existing copy(dst.data(), src.data(), ...) call
site.

In `@inc/mkn/gpu/def.hpp`:
- Around line 10-21: The macro guard logic for
MKN_GPU_ROCM/MKN_GPU_CUDA/MKN_GPU_CPU can emit redefinition warnings when users
pass explicit -D flags; update the blocks that set the mutually-exclusive macros
(the sections referencing MKN_GPU_ROCM, MKN_GPU_CUDA and MKN_GPU_CPU) to undef
any target macro before redefining it (or use `#if` defined(...) to only define
when unset) so that redefinitions are suppressed; specifically modify the
conditional sections that currently `#define` MKN_GPU_CUDA, MKN_GPU_ROCM and
MKN_GPU_CUDA=0 to perform an `#undef` on the macro being set prior to the `#define`
(or skip defining if already defined) to avoid warnings.

---

Outside diff comments:
In `@inc/mkn/gpu/cuda/cls.hpp`:
- Around line 4-6: The header is missing <stdexcept>, but the Pointer
constructor throws std::runtime_error (symbol: Pointer::Pointer), which breaks
header self-containment; add `#include` <stdexcept> to the top of the header
alongside the existing includes so std::runtime_error is declared for any
translation unit that includes this header.

---

Duplicate comments:
In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 45-55: The CPU launch coordinates threadIdx and blockIdx are
currently shared globals and must be made thread-local to avoid races; change
the declarations of threadIdx and blockIdx (the variables declared after struct
dim3) to use thread-local storage (e.g., mark them thread_local inline) so each
host thread gets its own dim3 instances while keeping the dim3 struct and the
surrounding MKN_CPU_DO_NOT_DEFINE_DIM3 guard unchanged.

---

Nitpick comments:
In `@mkn.yaml`:
- Around line 14-16: The formatter configuration still targets the old "test"
tree; update the formatting globs/patterns so clang.format (or the formatter
config referenced in mkn.yaml) also includes the new "tst/..." test tree—replace
or augment occurrences of patterns like "test: tst/any/(\w).cpp" and the other
affected pattern blocks (lines noted in the review) so they point to or include
"tst/..." test paths instead of only "test/..."; ensure all instances (the
groups at the other affected locations) are updated so the new tests are covered
by formatting checks.

In `@tst/any/managed.cpp`:
- Around line 106-115: The test_copy() currently performs three transfers but
only checks dev1, so the first transfer (mkn::gpu::copy(dev0, hst0)) can
silently fail; update test_copy to assert each transfer: after
mkn::gpu::copy(dev0, hst0) read/verify dev0 matches hst0, after
mkn::gpu::copy(hst1, hst0) verify hst1 elements equal the original hst0 values,
and after mkn::gpu::copy(dev1, hst1) verify dev1 matches hst1 (and return
non-zero on any mismatch). Use the existing identifiers test_copy, dev0, dev1,
hst0, and hst1 to locate where to add these checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7bbfdf73-24aa-477a-bb6b-97071a3b0314

📥 Commits

Reviewing files that changed from the base of the PR and between debe35e and 7a239a7.

📒 Files selected for processing (25)
  • inc/mkn/gpu/any/inc/alloc.ipp
  • inc/mkn/gpu/cpu/def.hpp
  • inc/mkn/gpu/cuda/cls.hpp
  • inc/mkn/gpu/def.hpp
  • inc/mkn/gpu/rocm/cls.hpp
  • mkn.yaml
  • tst/any/__share__.hpp
  • tst/any/add.cpp
  • tst/any/array.cpp
  • tst/any/asio.cpp
  • tst/any/async_streaming.cpp
  • tst/any/class1.cpp
  • tst/any/class2.cpp
  • tst/any/construct.cpp
  • tst/any/info.cpp
  • tst/any/managed.cpp
  • tst/any/struct.cpp
  • tst/cpu/namespace.cpp
  • tst/cuda/add.cpp
  • tst/cuda/async.cpp
  • tst/cuda/atomic.cpp
  • tst/hip/add.cpp
  • tst/hip/async.cpp
  • tst/hip/async.ppc
  • tst/mem/test_alloc.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • inc/mkn/gpu/rocm/cls.hpp

Comment thread inc/mkn/gpu/any/inc/alloc.ipp
Comment thread inc/mkn/gpu/def.hpp
@PhilipDeegan merged commit f32caeb into master Mar 22, 2026
1 check passed
@PhilipDeegan deleted the next branch March 22, 2026 14:59