Migrate RMM usage to CCCL MR design#940

Open
bdice wants to merge 9 commits intorapidsai:mainfrom
bdice:rmm-cccl-migration

Conversation

Contributor

@bdice bdice commented Apr 3, 2026

Summary

  • Rewrite RmmResourceAdaptor as a thin shell inheriting cuda::mr::shared_resource<detail::RmmResourceAdaptorImpl>, with all mutable state in the impl class for copyable shared ownership.
  • Replace device_memory_resource* with rmm::device_async_resource_ref for non-owning references and cuda::mr::any_resource for owning storage.
  • Remove rmm::mr::owning_wrapper usage (removed in RMM 26.06).
  • Update pool_memory_resource usage (no longer a template; upstream passed as device_async_resource_ref).
  • Replace set_current_device_resource(ptr) with set_current_device_resource_ref(ref).
  • Update test resources to satisfy CCCL resource concept (allocate_sync, deallocate_sync, operator==, get_property).
  • Update Cython bindings (.pxd/.pyx) to use device_async_resource_ref instead of device_memory_resource.
  • Point get_cucascade.cmake to local cuCascade source directory (companion cuCascade PR: Migrate RMM usage to CCCL MR design NVIDIA/cuCascade#98).

Notes

RMM 26.06 removes rmm::mr::device_memory_resource base class and adopts
CCCL-native memory resource concepts. This migrates rapidsmpf accordingly:

- RmmResourceAdaptor rewritten as thin shell inheriting
  cuda::mr::shared_resource<detail::RmmResourceAdaptorImpl>, with all
  mutable state in the impl class for copyable shared ownership.
- Replace device_memory_resource* with rmm::device_async_resource_ref for
  non-owning references and cuda::mr::any_resource for owning storage.
- Remove rmm::mr::owning_wrapper usage.
- pool_memory_resource is no longer a template; upstream passed as
  device_async_resource_ref.
- Replace set_current_device_resource(ptr) with
  set_current_device_resource_ref(ref).
- Update test resources to satisfy CCCL resource concept (allocate_sync,
  deallocate_sync, operator==, get_property).
- Update Cython bindings to use device_async_resource_ref instead of
  device_memory_resource.
- Point get_cucascade.cmake to local cuCascade source directory.

copy-pr-bot bot commented Apr 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Comment thread cpp/benchmarks/utils/rmm_stack.hpp Outdated
- * @return A owning memory resource, which must be kept alive.
+ * @return An owning memory resource, which must be kept alive.
*/
[[nodiscard]] inline std::shared_ptr<rapidsmpf::RmmResourceAdaptor>
Contributor

this could be a cuda::mr::shared_resource, couldn't it?

Comment on lines +31 to +32
class RmmResourceAdaptor
: public cuda::mr::shared_resource<detail::RmmResourceAdaptorImpl> {
Contributor

Do all RMM adaptors follow this pattern now?
As opposed to moving RmmResourceAdaptorImpl into RmmResourceAdaptor and, whenever we want to use the adaptor, doing:

auto adapter = cuda::mr::make_shared_resource<RmmResourceAdaptor>(mr);

Contributor Author

Really we have two types of memory resources:

  • Stateless: trivially copyable, movable, etc. Example: cuda_memory_resource{}.
  • Stateful: adaptors and/or nontrivial base resources. Examples: any adaptor (statistics, logging, pool) or cuda_async_memory_resource which holds a CUDA driver pool object.

So yes, all adaptors use this pattern, but it's not just adaptors, it's anything that has shared state. The way we handle stateful resources is PIMPL-like, with some nuances.

  • AdaptorImpl is a private/detail class that holds the upstream any_resource and/or adaptor state as members,
  • Adaptor is the public API, it uses cuda::mr::shared_resource which forwards allocation methods to the Impl class.

With that split, copying stateful resources is cheap (just the cost of copying a shared_ptr). Crucially, we don't actually use shared_ptr, we use cuda::mr::shared_resource<Impl> and that does the magic for us. It has its own shared-pointer-like implementation.

Comment on lines +31 to +37
// Manual comparison of optionals to avoid recursive constraint satisfaction in
// CCCL 3.2. std::optional::operator== triggers infinite concept checking when the
// wrapped type (rmm::device_async_resource_ref) inherits from CCCL's concept-based
// resource_ref.
// TODO: Revert this after the RMM resource ref types are replaced with
// plain cuda::mr ref types. This depends on
// https://github.com/rapidsai/rmm/issues/2011.
Contributor

is this comment still valid?

Contributor Author

Good catch, this can be removed now. I'm pretty sure this can be reduced to:

bool RmmResourceAdaptorImpl::operator==(
    RmmResourceAdaptorImpl const& other
) const noexcept {
    return this == std::addressof(other);
}

See this example pattern.

RmmResourceAdaptor mr(primary_mr);

- EXPECT_THROW(mr.allocate_sync(8_MiB), rmm::out_of_memory);
+ EXPECT_THROW((void)mr.allocate_sync(8_MiB), rmm::out_of_memory);
Contributor

nit

Suggested change
- EXPECT_THROW((void)mr.allocate_sync(8_MiB), rmm::out_of_memory);
+ EXPECT_THROW(std::ignore = mr.allocate_sync(8_MiB), rmm::out_of_memory);

RmmResourceAdaptor mr(primary_mr, fallback_mr);

- EXPECT_THROW(mr.allocate_sync(2_MiB), std::logic_error);
+ EXPECT_THROW((void)mr.allocate_sync(2_MiB), std::logic_error);
Contributor

Suggested change
- EXPECT_THROW((void)mr.allocate_sync(2_MiB), std::logic_error);
+ EXPECT_THROW(std::ignore = mr.allocate_sync(2_MiB), std::logic_error);

Contributor Author

This same proposal came up recently and I did some research: rapidsai/rmm#2301 (comment)

I concluded that the current state is the best option we have.

I did some reading on this. Apparently std::ignore is really only meant to ignore members during tuple unpacking. The cppreference docs state:

std::ignore is intended for use with std::tie when unpacking a std::tuple, as a placeholder for the arguments that are not used, but can be used for any unwanted assignment.

Some code guides recommend using std::ignore to avoid warnings from unused return values of [[nodiscard]] functions, even though an assignment isn't required.

For ignoring values not requiring assignment, one may cast to void. For variables that have names, but whose value is unused, one may cast those to void or declare those variables with [[maybe_unused]].

The std::ignore name even requires including <tuple>, suggesting to me that this is probably the wrong direction. I confirmed with a little more googling: we might actually have the best thing with (void).

bdice added 3 commits April 14, 2026 19:18
…BufferResource

Replace non-owning device_async_resource_ref with owning
cuda::mr::any_resource<device_accessible> for constructor parameters
and member storage in RmmResourceAdaptor, RmmResourceAdaptorImpl, and
BufferResource. Getters still return device_async_resource_ref via
const_cast, matching the pattern in RMM's prefetch_resource_adaptor.

Forward the alignment parameter in RmmResourceAdaptorImpl::allocate
and deallocate to fix CCCL deprecation warnings.

Update Cython declarations to use any_resource and RMM's
make_any_device_resource/make_device_async_resource_ref helpers,
replacing the custom make_rapidsmpf_resource_ref inline function.
@nirandaperera added the breaking (Introduces a breaking change) and improvement (Improves an existing functionality) labels Apr 16, 2026
bdice added 4 commits April 17, 2026 00:05
throw_at_limit_resource is split into a non-copyable impl holding the
mutable allocs set, and a shared_resource<impl> wrapper. When
any_resource copies the resource, copies now share the same underlying
state via ref counting, so test assertions on the original object
correctly observe allocations made through the adaptor's internal copy.
@bdice bdice force-pushed the rmm-cccl-migration branch from 0b13b4b to 1d28b47 Compare April 17, 2026 05:26
Contributor Author

rapidsmpf should be pinned to a specific commit of cuCascade (the tip of PR 98) until NVIDIA/cuCascade#98 is merged.

@bdice bdice force-pushed the rmm-cccl-migration branch from 1d28b47 to fe393cb Compare April 17, 2026 06:31
Contributor Author

@bdice bdice left a comment

@nirandaperera Thanks for your help with this so far. I would like to ask if you can help with the remaining refactoring steps here. I will handle the cuCascade pinning soon (just need to get cuDF artifacts built and verify that the cuCascade tests pass, then I will pin to that commit).

create_context(ProgramOptions& arguments, RmmResourceAdaptor* mr) {
rmm::mr::set_current_device_resource(mr);
rmm::mr::set_current_device_resource_ref(mr);
rmm::mr::set_current_device_resource(*mr);
Contributor Author

We may be able to pass RmmResourceAdaptor by value rather than by pointer here.

if (args.device_mem_limit_mb >= 0) {
memory_available[rapidsmpf::MemoryType::DEVICE] = rapidsmpf::LimitAvailableMemory{
- stat_enabled_mr.get(), args.device_mem_limit_mb << 20
+ &stat_enabled_mr, args.device_mem_limit_mb << 20
Contributor Author

We may be able to pass by value rather than by pointer here.

Contributor Author

We don't really need this to be a "stack" anymore. Previously that was required to keep the MRs alive but now adaptors will keep their upstreams alive.

Action: more renaming / cleanup?

@@ -151,7 +151,7 @@ void BM_DeviceToHostCopyInclAlloc(benchmark::State& state) {
auto device_mr = std::make_unique<rmm::mr::cuda_memory_resource>();
Contributor Author

This probably doesn't need to be owned by unique_ptr anymore.

(Pattern repeats.)



// ---------------------------------------------------------------------------
namespace detail {

RmmResourceAdaptorImpl::RmmResourceAdaptorImpl(
Contributor Author

RMM uses a separate .cpp file for the Impl classes to avoid accidental overlap between private and public APIs. Either way is fine.

[[nodiscard]] bool operator==(
throw_at_limit_resource_impl const& other
) const noexcept {
return this == &other;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- return this == &other;
+ return this == std::addressof(other);

- # need to keep this alive here.
+ # Stored for the Python device_mr/pinned_mr property accessors.
+ # The C++ BufferResource owns the resource via any_resource.
self._device_mr = device_mr
Contributor Author

Previously these members were for keeping the MR alive. However, now it seems like they're used for the property accessors. I updated the comment.


@bdice bdice force-pushed the rmm-cccl-migration branch 2 times, most recently from a8012b7 to a871b98 Compare April 17, 2026 07:09
@bdice bdice marked this pull request as ready for review April 17, 2026 07:13
@bdice bdice requested review from a team as code owners April 17, 2026 07:13
@bdice bdice requested a review from msarahan April 17, 2026 07:13
@bdice bdice force-pushed the rmm-cccl-migration branch 4 times, most recently from 2350135 to de72e5c Compare April 17, 2026 20:51
@bdice bdice force-pushed the rmm-cccl-migration branch from de72e5c to 1c157af Compare April 17, 2026 21:07
@bdice bdice force-pushed the rmm-cccl-migration branch from 1c157af to 1b8e20a Compare April 18, 2026 15:52

Labels

breaking: Introduces a breaking change
improvement: Improves an existing functionality
