Add support for using CCCL for CUDA buffers #641
Draft
vyasr wants to merge 12 commits into rapidsai:main from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Introduces cpp/cmake/thirdparty/get_cccl.cmake using rapids_cpm_cccl() and adds a UCXX_ENABLE_CCCL option to cpp/CMakeLists.txt with conditional compile definitions, CCCL linkage, CUDA language enablement, and registration of the buffer_cccl.cu source.
Adds a BufferType::CCCL enum value and a CCCLBuffer class to buffer.h using the PIMPL pattern (opaque CCCLBufferImpl* in the header, full definition in buffer_cccl.cu). The implementation uses cuda::buffer<cuda::std::byte, cuda::mr::device_accessible> with a cudaFree(0) guard to initialize the CUDA context.
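A minimal sketch of the PIMPL layout this describes, assuming simplified constructor and accessor signatures (the real UCXX Buffer hierarchy is larger):

```cpp
// Hypothetical sketch of the PIMPL layout described above. The method set
// and signatures are assumptions, not the actual UCXX API.
#include <cstddef>

namespace ucxx {

class CCCLBufferImpl;  // defined only in buffer_cccl.cu, so CCCL headers
                       // never appear in the public header

class CCCLBuffer {
 public:
  explicit CCCLBuffer(size_t size);  // allocates via cuda::buffer in the .cu file
  ~CCCLBuffer();                     // defined where CCCLBufferImpl is complete

  void* data();            // device-accessible pointer
  size_t getSize() const;  // allocation size in bytes

 private:
  CCCLBufferImpl* _impl{nullptr};  // opaque pointer; a later commit converts
                                   // this to std::unique_ptr<CCCLBufferImpl>
  size_t _size{0};
};

}  // namespace ucxx
```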
Adds a CCCL branch to the allocateBuffer() factory in buffer.cpp and adds #if UCXX_ENABLE_CCCL / #elif UCXX_ENABLE_RMM / #else dispatch to request_tag_multi.cpp so that isCUDA allocations prefer CCCLBuffer when CCCL is enabled.
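A sketch of how that preprocessor dispatch might read; the enclosing function name is invented for illustration, and the ucxx buffer classes are assumed from this PR rather than shown in full:

```cpp
// Hypothetical sketch of the #if/#elif/#else dispatch described above;
// allocateForTransfer() is an invented name.
#include <memory>
#include <stdexcept>

std::shared_ptr<ucxx::Buffer> allocateForTransfer(bool isCUDA, size_t size)
{
  if (isCUDA) {
#if UCXX_ENABLE_CCCL
    return std::make_shared<ucxx::CCCLBuffer>(size);  // CCCL preferred when enabled
#elif UCXX_ENABLE_RMM
    return std::make_shared<ucxx::RMMBuffer>(size);   // RMM fallback
#else
    throw std::runtime_error("no CUDA buffer backend was enabled at build time");
#endif
  }
  return std::make_shared<ucxx::HostBuffer>(size);
}
```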
Mirrors the existing RMM parameterized test suite for CCCLBuffer: TestType, TestSize, TestData, TestThrowAfterRelease, all guarded by UCXX_ENABLE_CCCL. 36 tests in total (12 Host + 12 RMM + 12 CCCL).
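A sketch of what one such parameterized test might look like in GoogleTest, assuming an allocateBuffer() factory keyed on BufferType (the real fixtures differ):

```cpp
// Hypothetical GoogleTest sketch of a buffer-type-parameterized test; the
// fixture name and the ucxx::allocateBuffer() signature are assumptions
// based on the commit message.
#include <gtest/gtest.h>

class BufferTest : public ::testing::TestWithParam<ucxx::BufferType> {};

TEST_P(BufferTest, TestType)
{
  auto buffer = ucxx::allocateBuffer(GetParam(), 4096);
  ASSERT_EQ(buffer->getType(), GetParam());
}

INSTANTIATE_TEST_SUITE_P(HostAndRmm,
                         BufferTest,
                         ::testing::Values(ucxx::BufferType::Host, ucxx::BufferType::RMM));

#if UCXX_ENABLE_CCCL
// The CCCL variant only exists when the build enables it.
INSTANTIATE_TEST_SUITE_P(Cccl, BufferTest, ::testing::Values(ucxx::BufferType::CCCL));
#endif
```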
Adds CcclBufferInterfaceBase and four CCCL memory-type variants (cccl-device, cccl-shared, cccl-cuda-async, cccl-cuda-async-managed) to buffer_interface.hpp, a UCXX_BENCHMARKS_ENABLE_CCCL CMake option, and isCcclMemoryType() dispatch in perftest.cpp.
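A sketch of the string dispatch an isCcclMemoryType() helper might perform; the variant names come from this commit, but the body is an assumption:

```cpp
// Hypothetical sketch of memory-type dispatch for the benchmark CLI; only
// the four variant strings are taken from the commit message.
#include <string>

bool isCcclMemoryType(const std::string& memoryType)
{
  return memoryType == "cccl-device" || memoryType == "cccl-shared" ||
         memoryType == "cccl-cuda-async" || memoryType == "cccl-cuda-async-managed";
}
```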
Updates the basic.cpp example with CCCL buffer type parsing and allocation. Adds a CCCLBuffer Cython class declaration (ucxx_api.pxd), _CCCLBufferWrapper, _get_cccl_buffer(), _cccl_am_allocator(), and UCXWorker AM allocator registration under IF UCXX_ENABLE_CCCL (libucxx.pyx). Propagates the UCXX_ENABLE_CCCL flag to Cython compilation in both CMakeLists.txt files.
Adds a CCCL Memory Support subsection to the benchmark documentation showing cccl-device, cccl-shared, cccl-cuda-async, and cccl-cuda-async-managed memory-type examples and requirements.
- Use std::unique_ptr<CCCLBufferImpl> internally (RAII-safe)
- Destructor becomes = default (compiler-generated via unique_ptr)
- Constructor uses std::make_unique (exception-safe)
- release() uses _impl.release() to return a raw CCCLBufferImpl* (preserves the C/Cython ABI; callers cannot consume a unique_ptr; see the sketch after this list)
- Add Doxygen @code examples to constructor, release(), data()
- Add six explanatory comments at intentional CCCL/RMM deviation points: PIMPL pattern rationale, cudaFree(0) context init, static_cast necessity, CCCL priority over RMM for the AM allocator, no Python DeviceBuffer equivalent, nullptr stream in benchmarks
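A sketch of the resulting ownership layout under these assumptions (abbreviated; not the actual UCXX header):

```cpp
// Hypothetical sketch of the unique_ptr-based PIMPL described in the list
// above; the class is abbreviated and the signatures are assumptions.
#include <cstddef>
#include <memory>

namespace ucxx {

class CCCLBufferImpl;  // still only forward-declared in the header

class CCCLBuffer {
 public:
  explicit CCCLBuffer(size_t size);  // std::make_unique in the .cu file
  ~CCCLBuffer();  // '= default' in buffer_cccl.cu, where the impl is complete

  // Hands ownership to the caller as a raw pointer, because C/Cython
  // callers cannot consume a std::unique_ptr across the ABI boundary.
  // (A later commit removes this from the public API; see below.)
  CCCLBufferImpl* release() { return _impl.release(); }

 private:
  std::unique_ptr<CCCLBufferImpl> _impl;  // RAII-owned opaque implementation
};

}  // namespace ucxx
```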
CCCLBufferImpl is an internal type that requires nvcc to instantiate; it should not be part of the public C++ or Cython API. No external consumer (Python, benchmarks, production code) calls release() on CCCLBuffer; only tests used it for cleanup. RMMBuffer keeps its release() because rmm.DeviceBuffer.c_from_unique_ptr() is a real consumer in the RMM Python ecosystem. The PIMPL pattern is preserved (forward declaration plus unique_ptr member), as the compilation firewall still serves a purpose. Only the public methods that leaked the impl type are removed.
Contributor
Author
/ok to test
1 similar comment
Contributor
Author
/ok to test
CCCL support is now assumed present when building the Python library, matching the RMM pattern. Remove IF UCXX_ENABLE_CCCL guards and the CMake flag propagation that was required to define them.
Add UCXX_ENABLE_CCCL to build.sh matching the RMM pattern. Set CUDA architectures via rapids_cuda_set_architectures(RAPIDS) when CCCL is enabled, fixing conda builds that lack a GPU for native detection.
Contributor
Author
/ok to test
pentschev reviewed Apr 29, 2026
Comment on lines +261 to +262
assert(cudaMemcpy(host_buffer.data(), buffer->data(), buffer->getSize(), cudaMemcpyDefault) ==
       cudaSuccess);
Member
Why drop the async memcpy API?
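For context, a sketch of the asynchronous variant the question refers to, with the copy wrapped in a hypothetical helper and an explicit synchronization point (the dedicated stream argument is an assumption):

```cpp
// Hypothetical sketch of the asynchronous copy the reviewer is asking
// about; the helper name and local host buffer are invented for
// illustration and mirror the names in the diff hunk above.
#include <cassert>
#include <cstddef>
#include <cuda_runtime_api.h>
#include <vector>

void copyDeviceToHostAsync(const void* src, size_t size, cudaStream_t stream)
{
  std::vector<char> host_buffer(size);
  assert(cudaMemcpyAsync(host_buffer.data(), src, size, cudaMemcpyDefault, stream) ==
         cudaSuccess);
  // The copy is only complete once the stream is synchronized.
  assert(cudaStreamSynchronize(stream) == cudaSuccess);
}
```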
Member
Do we really need a .cu file? I would very much like to keep UCXX free from requiring nvcc.
#if UCXX_ENABLE_CCCL
  auto buffer = std::dynamic_pointer_cast<ucxx::CCCLBuffer>(_buffer);
  ASSERT_EQ(buffer->getType(), _type);
  return;  // CCCLBuffer does not expose release(); post-release assertions do not apply
buffer_cccl only uses host APIs (cudaFree, cuda::buffer), and CCCL headers conditionally define __host__/__device__ to nothing outside nvcc. Rename .cu to .cpp, and replace enable_language(CUDA) and rapids_cuda_set_architectures with find_package(CUDAToolkit) and CUDA::cudart, eliminating the CUDA_ARCHITECTURES requirement.
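A sketch of the host-only pattern that makes this possible; the helper name is invented, but everything here is host code that g++ or clang++ can compile when linked against CUDA::cudart:

```cpp
// Hypothetical sketch: the translation unit only touches host-side CUDA
// runtime APIs, so it needs cudart headers and linkage but not nvcc.
#include <cassert>
#include <cuda_runtime_api.h>

void ensureCudaContext()
{
  // cudaFree(nullptr) is a documented no-op that forces lazy CUDA context
  // initialization; it is a plain host API call.
  assert(cudaFree(nullptr) == cudaSuccess);
}
```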
Contributor
Author
/ok to test
This PR adds support for using CCCL instead of RMM for handling device memory.