
Add cuda_buffer_backend and torch_buffer_backend for rosidl::Buffer #1

Open
yuanknv wants to merge 3 commits into ros2:main from yuanknv:native_buffer_backends

Conversation


@yuanknv yuanknv commented Apr 7, 2026

Description

This pull request adds CUDA and PyTorch buffer backend implementations for rosidl::Buffer, enabling zero-copy GPU memory sharing between ROS 2 publishers and subscribers.

CUDA buffer backend: Enables fully asynchronous, zero-copy GPU data transport, so data can stay on the GPU across ROS nodes. allocate_msg allocates from a CUDA Virtual Memory Management (VMM) based IPC memory pool; each block carries a pre-exported POSIX FD for zero-overhead IPC reuse. from_buffer returns a WriteHandle/ReadHandle that manages GPU stream ordering via CUDA events (no cudaStreamSynchronize in the pipeline). On transmit, the plugin checks locality via a shared-memory endpoint registry: for same-host, same-GPU peers it sends the block's FD over a Unix socket along with an IPC event handle for cross-process GPU sync; otherwise it falls back to CPU serialization. On receive, the block is imported and mapped (cached per source block), with a shared-memory refcount and UID validation to prevent stale reuse. A background recycler thread handles event synchronization and block reclamation off the callback thread.

Torch buffer backend: A device-agnostic layer on top of device buffer backends (e.g. cuda_buffer_backend) that lets users work with torch::Tensor directly. allocate_msg creates a TorchBufferImpl wrapping a buffer with tensor metadata (shape, strides, dtype); the device is auto-detected at compile time, and if no accelerated buffer backend is installed it falls back to CPU. from_buffer returns a torch::Tensor view backed by the device buffer's handle (write or read, captured in the tensor deleter for event lifetime safety). to_buffer copies a pre-existing torch tensor into the allocated buffer. On transmit, the TorchBufferDescriptor carries tensor metadata alongside a nested device_data field that RMW serializes via whichever device backend plugin is registered.

This pull request consists of the following key components:

cuda_buffer: Core CUDA buffer library providing a VMM-backed CUDA IPC memory pool, a host endpoint manager for locality discovery over shared memory, and user-facing allocate_msg/from_buffer/to_buffer APIs with RAII, CUDA-event-based GPU synchronization (ReadHandle/WriteHandle).
cuda_buffer_backend: BufferBackend plugin registered via pluginlib. Handles endpoint discovery, CudaBufferDescriptor serialization with VMM IPC handles, IPC refcount lifecycle, and automatic CPU fallback when CUDA IPC is unavailable.
cuda_buffer_backend_msgs: ROS 2 message definition for CudaBufferDescriptor.
torch_buffer: PyTorch buffer library wrapping device buffers with tensor metadata (shape, strides, dtype). Provides allocate_msg/from_buffer/to_buffer APIs that auto-detect device backend at compile time.
torch_buffer_backend: BufferBackend plugin for PyTorch tensors. Handles TorchBufferDescriptor serialization with nested device buffer delegation.
torch_buffer_backend_msgs: ROS 2 message definition for TorchBufferDescriptor.

Is this a user-facing behavior change?

No.

Did you use Generative AI?

Yes. Claude (claude-4.6-opus) via Cursor was used to assist with creating an initial prototype version of the changes contained in this PR.

Additional Information

This PR is part of the broader ROS 2 native buffer feature introduced in this post.

@@ -0,0 +1,84 @@
cmake_minimum_required(VERSION 3.8)

Suggested change
cmake_minimum_required(VERSION 3.8)
cmake_minimum_required(VERSION 3.20)

@@ -0,0 +1,24 @@
cmake_minimum_required(VERSION 3.8)

Suggested change
cmake_minimum_required(VERSION 3.8)
cmake_minimum_required(VERSION 3.20)

<?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
<package format="3">
<name>torch_buffer_backend_msgs</name>
<version>0.1.0</version>

Suggested change
<version>0.1.0</version>
<version>0.0.0</version>

<?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
<package format="3">
<name>cuda_buffer</name>
<version>0.1.0</version>

Suggested change
<version>0.1.0</version>
<version>0.0.0</version>

<description>CUDA buffer implementation (CudaBuffer, CudaBufferImpl, CUDAMemoryPool, IPC)
for the ROS2 Buffer backend system. Contains both headers and compiled sources
for IPC manager and host endpoint manager.</description>


Missing <author> tag.

at::ScalarType dtype = at::kByte)
{
if (buffer.empty()) {return {};}
const auto * impl = static_cast<const TorchBufferImpl<uint8_t> *>(buffer.get_impl());

Suggested change
const auto * impl = static_cast<const TorchBufferImpl<uint8_t> *>(buffer.get_impl());
const auto * impl = detail::get_torch_impl<uint8_t>(buffer);

Comment on lines +61 to +62
const auto * torch_impl =
static_cast<const torch_buffer_backend::TorchBufferImpl<uint8_t> *>(impl);

Suggested change
const auto * torch_impl =
static_cast<const torch_buffer_backend::TorchBufferImpl<uint8_t> *>(impl);
const auto * torch_impl = dynamic_cast<const torch_buffer_backend::TorchBufferImpl<uint8_t> *>(
static_cast<const rosidl::BufferImplBase<uint8_t> *>(impl));
if (!torch_impl) {
return nullptr;
}

{
(void)endpoint_info;
(void)existing_endpoints;
(void)endpoint_supported_backends;

Check whether the backend exists in endpoint_supported_backends?


std::unique_ptr<rosidl::BufferImplBase<T>> to_cpu() const override
{
if (device_buffer_.empty()) {return nullptr;}

Consistent with CudaBufferImpl::to_cpu()

Suggested change
if (device_buffer_.empty()) {return nullptr;}
if (device_buffer_.empty()) {return std::make_unique<rosidl::CpuBufferImpl<T>>();}


int64_t numel = 1;
for (auto s : shape) {
numel *= s;

Suggested change
numel *= s;
if (s < 0) {
throw std::runtime_error(
"allocate_msg: negative shape dimension (" + std::to_string(s) + ")");
}
numel *= s;

find_package(cuda_buffer_backend_msgs REQUIRED)
find_package(rmw REQUIRED)
find_package(rcutils REQUIRED)
find_package(CUDAToolkit REQUIRED)

We need to include this dependency in the package.xml; the requirement is probably nvcc?
Is this key enough: https://github.com/ros/rosdistro/blob/master/rosdep/base.yaml#L8367C1-L8367C12 ?

<depend>rcutils</depend>
<depend>rmw</depend>
<depend>rosidl_buffer</depend>


Suggested change
<depend>nvidia-cuda</depend>

set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_CXX_COMPILER}")
endif()

find_package(Torch REQUIRED)

Not sure about this one. Is there an Ubuntu package for this?

wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
unzip libtorch-shared-with-deps-latest.zip

then I used -DCMAKE_PREFIX_PATH=<path to pytorch>


I ended up using version 11.8, but nvcc is 12.0 in Ubuntu:

https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.7.1%2Bcu118.zip

Maybe we should consider whether it's worth adding a vendor package to install libtorch.


@ahcorde ahcorde left a comment


I also detected some linter failures.

test_cuda_image_cpu_fallback_fastrtps_launch is also not passing, with this error:

6: FAIL: test_cpu_fallback_paths (cuda_buffer_backend.TestCudaImageCpuFallbackFastRTPS.test_cpu_fallback_paths)
6: Test all CPU fallback paths and normal IPC simultaneously over FastRTPS.
6: ----------------------------------------------------------------------
6: Traceback (most recent call last):
6:   File "/tmp/ws/src/rosidl_buffer_backends/cuda_buffer_backend/cuda_buffer_backend/test/test_cuda_image_cpu_fallback_fastrtps_launch.py", line 203, in test_cpu_fallback_paths
6:     self.assertTrue(
6: AssertionError: False is not true : Cross-device fallback validation failed (expected backend="cpu")

