feat(infra): add SharedMemory IPC fast-path for RTensor#1133
guozhihao-224 wants to merge 2 commits into inclusionAI:main
Conversation
Code Review
This pull request introduces a SharedMemory-based IPC fast-path for RTensor to optimize same-node tensor transfers, significantly reducing network overhead. Key changes include the addition of IPC helpers for encoding/decoding tensor metadata, partitioning fetch requests into local and remote groups, and implementing shared memory segment creation and cleanup. Feedback focuses on further optimizing memory usage by avoiding unnecessary copies during serialization and deserialization, addressing potential race conditions during shared memory segment creation, and clarifying the 'zero-copy' claim given the use of .clone() in the local fetch path.
```python
try:
    shm = _SharedMemory(name=name, create=True, size=total_size)
except FileExistsError:
    # Leftover from a previous run — unlink and recreate.
    try:
        old = _SharedMemory(name=name, create=False)
        old.close()
        old.unlink()
    except FileNotFoundError:
        pass
    shm = _SharedMemory(name=name, create=True, size=total_size)
```
There is a potential race condition if multiple processes attempt to create a SharedMemory segment for the same shard_id simultaneously. Both might receive FileExistsError, and both might attempt to unlink() the segment that the other just created, leading to a failure on the subsequent create=True call. While shard_id is typically a UUID, this could be an issue if deterministic IDs are used or in high-concurrency scenarios. Consider handling FileExistsError more robustly or attempting to open the existing segment if creation fails.
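One way to avoid the unlink race is a create-or-attach loop: on `FileExistsError`, attach to the existing segment instead of unlinking it out from under the other process. This is a hypothetical sketch (the helper name `create_or_attach` is not from the PR), using the standard-library `multiprocessing.shared_memory` API:

```python
from multiprocessing.shared_memory import SharedMemory

def create_or_attach(name: str, size: int) -> SharedMemory:
    """Create the segment, or attach to an existing one of sufficient size.

    If two processes race on the same name, the loser attaches to the
    winner's segment rather than unlinking it. Only a too-small stale
    segment (e.g. from a previous run) is unlinked and recreated.
    """
    while True:
        try:
            return SharedMemory(name=name, create=True, size=size)
        except FileExistsError:
            try:
                existing = SharedMemory(name=name, create=False)
            except FileNotFoundError:
                continue  # the other process unlinked it; retry creation
            if existing.size >= size:
                return existing
            existing.close()
            existing.unlink()
```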
```python
# which becomes invalid after shm.close() below
raw = torch.frombuffer(
    shm.buf[header_size:], dtype=torch.uint8, count=nbytes
).clone()
```
The PR description claims "zero-copy inter-process tensor transfer", but the use of .clone() here introduces a full memory copy. While this simplifies lifecycle management by allowing the SharedMemory segment to be closed immediately, it negates the performance benefits of zero-copy for large tensors. If true zero-copy is a goal, consider keeping the SharedMemory handle alive (e.g., in the _fetch_buffer) and returning a view. If the copy is intentional for safety, please update the PR description to reflect that it is a low-overhead IPC path rather than zero-copy.
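A true zero-copy variant would keep the `SharedMemory` handle alive in a registry and hand out a view into the buffer. The sketch below is hypothetical (`_open_segments`, `fetch_view`, and `release` are illustrative names, and a raw `memoryview` stands in for the `torch.frombuffer` wrapper the real code would use):

```python
from multiprocessing.shared_memory import SharedMemory

# Registry keeping segments mapped while views into them exist.
_open_segments: dict[str, SharedMemory] = {}

def fetch_view(name: str, header_size: int, nbytes: int) -> memoryview:
    shm = SharedMemory(name=name, create=False)
    _open_segments[name] = shm  # keep the mapping alive for the view
    return shm.buf[header_size:header_size + nbytes]  # zero-copy view

def release(name: str) -> None:
    # Caller must release all views into the segment first.
    _open_segments.pop(name).close()
```

The trade-off is explicit lifecycle management: the segment cannot be closed until every consumer has dropped its view, which is exactly what `.clone()` sidesteps.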
```python
# Create SharedMemory segment for IPC access
_create_shm(shard_id, tensor)
```
Creating a shm segment for each tensor will incur a huge overhead. The better solution should be allocating a large buffer in advance for all tensors, not just tensors with the same shard id. We can use HTTP to communicate the buffer indices to read.
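The suggestion amounts to a single pre-allocated arena plus an offset table that readers can be pointed at over HTTP. A minimal sketch (the `TensorArena` class and its method names are hypothetical, not part of the PR):

```python
import threading
from multiprocessing.shared_memory import SharedMemory

class TensorArena:
    """One pre-allocated shm segment shared by all tensors.

    Each write records (offset, nbytes) for its shard; only that small
    index tuple needs to travel over HTTP, not the tensor bytes.
    """

    def __init__(self, name: str, size: int) -> None:
        self.shm = SharedMemory(name=name, create=True, size=size)
        self._offsets: dict[str, tuple[int, int]] = {}
        self._next = 0
        self._lock = threading.Lock()

    def write(self, shard_id: str, payload: bytes) -> tuple[int, int]:
        with self._lock:
            start = self._next
            if start + len(payload) > self.shm.size:
                raise MemoryError("arena exhausted")
            self.shm.buf[start:start + len(payload)] = payload
            self._next = start + len(payload)
            self._offsets[shard_id] = (start, len(payload))
            return self._offsets[shard_id]

    def read(self, shard_id: str) -> bytes:
        start, nbytes = self._offsets[shard_id]
        return bytes(self.shm.buf[start:start + nbytes])

    def close(self) -> None:
        self.shm.close()
        self.shm.unlink()
```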
```python
# SharedMemory IPC handles: shard_id -> SharedMemory object
_shm_handles: dict[str, _SharedMemory] = {}
```
It's kind of ad-hoc if we use global variables to manage shm. We can abstract the functions and variables in a dedicated RTensorShmBuffer class and call the methods of the buffer. The buffer should self-manage its resources like exceptions, shm handle, and the lock.
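A skeleton of that abstraction could look like the following. This is a hypothetical sketch of the reviewer's suggestion (the class name comes from the comment; the methods are illustrative): the module-level dict and lock move into one object that owns its handles, lock, and error handling.

```python
import threading
from typing import Optional
from multiprocessing.shared_memory import SharedMemory

class RTensorShmBuffer:
    """Self-managing container for shm segments, their lock, and cleanup."""

    def __init__(self) -> None:
        self._handles: dict[str, SharedMemory] = {}
        self._lock = threading.Lock()

    def create(self, shard_id: str, size: int) -> SharedMemory:
        with self._lock:
            shm = SharedMemory(name=f"rtensor_{shard_id}", create=True, size=size)
            self._handles[shard_id] = shm
            return shm

    def get(self, shard_id: str) -> Optional[SharedMemory]:
        with self._lock:
            return self._handles.get(shard_id)

    def close(self) -> None:
        with self._lock:
            for shm in self._handles.values():
                try:
                    shm.close()
                    shm.unlink()
                except FileNotFoundError:
                    pass  # already removed by another cleanup path
            self._handles.clear()
```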
```python
try:
    shm = _SharedMemory(name=name, create=True, size=total_size)
except FileExistsError:
    # Leftover from a previous run — unlink and recreate.
    try:
        old = _SharedMemory(name=name, create=False)
        old.close()
        old.unlink()
    except FileNotFoundError:
        pass
    shm = _SharedMemory(name=name, create=True, size=total_size)
```
This exception handling is weird because the shard id (and hence the shm name) is created from uuid.uuid4(), which should not duplicate across different runs. The current approach of unlinking the previous shm and creating a new one is too brute-force.
```python
if not _storage_lock.acquire(timeout=2.0):
    return
```
Use a separate lock to manage shm.
```python
try:
    local_tensors = _fetch_local(local_shard_ids)
    for (original_index, _), tensor in zip(
        local_grouped, local_tensors, strict=True
    ):
        results[original_index] = tensor
except (KeyError, ValueError, struct.error):
    # SharedMemory segment not found, corrupted header, or
    # truncated buffer. Fall back to HTTP for these shards.
    logger.debug("IPC fetch failed for local shards, falling back to HTTP")
    for idx, shard in local_grouped:
        remote_by_node[shard.node_addr].append((idx, shard))
```
The fallback should skip the tensors that were already fetched and use HTTP only for the remaining shards.
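A per-shard version of the fallback could look like this. Hypothetical sketch: the names mirror the quoted snippet (`local_grouped`, `results`, `remote_by_node`), but `fetch_one` is an illustrative stand-in for a single-shard IPC fetch, and `struct.error` is omitted for self-containment.

```python
def fetch_with_fallback(local_grouped, results, remote_by_node, fetch_one):
    """Fetch shards one at a time so a failure reroutes only the
    failing shard (and those after it) to HTTP, keeping earlier results."""
    for idx, shard in local_grouped:
        try:
            results[idx] = fetch_one(shard)
        except (KeyError, ValueError):
            # Only this shard falls back; already-fetched tensors are kept.
            remote_by_node.setdefault(shard.node_addr, []).append((idx, shard))
```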
@garrett4wade Thank you for your review. I have two questions here:
```python
slab_quotas = {
    16 * MB: 16,    # 256 MB
    64 * MB: 8,     # 512 MB
    256 * MB: 2,    # 512 MB
    1024 * MB: 1,   # 1 GB
}
```
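For reference, assuming `MB = 1 << 20` (the constant is not shown in the quoted snippet), the quota table above pre-allocates 2.25 GB in total, matching the per-slab comments:

```python
MB = 1 << 20  # assumed definition; not shown in the quoted snippet

slab_quotas = {
    16 * MB: 16,    # 16 slabs of 16 MB  -> 256 MB
    64 * MB: 8,     # 8 slabs of 64 MB   -> 512 MB
    256 * MB: 2,    # 2 slabs of 256 MB  -> 512 MB
    1024 * MB: 1,   # 1 slab of 1 GB     -> 1 GB
}

total = sum(size * count for size, count in slab_quotas.items())
assert total == 2304 * MB  # 256 + 512 + 512 + 1024 MB = 2.25 GB
```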
@guozhihao-224 Those are great questions!
Add RTensorShmPool with bump allocation for same-node tensor IPC, bypassing HTTP overhead when client and server share a machine. The pool is writer-owned with a reserve-write-publish protocol and automatic reclamation via try_reset() at shard removal.

Key changes:
- RTensorShmPool: pre-allocated shm segment with 64-byte aligned bump allocator, in-flight tracking, and _closing flag for safe concurrent shutdown
- TensorShardInfo: pool metadata fields for direct IPC routing
- HttpRTensorBackend: pool-aware store/fetch with fallback to HTTP
- rpc_server: pool lifecycle with atexit + cleanup-hook dual safety
- is_local_addr: cached local-address detection for IPC routing
- Comprehensive unit and integration tests for pool lifecycle

Refs: inclusionAI#1117
Force-pushed from 7bd8d3a to 4b43edf
@garrett4wade Hi, this PR has been updated based on the previous review feedback. Replaced per-tensor SharedMemory segments with a single pre-allocated pool (RTensorShmPool) using a bump allocator, as discussed.
garrett4wade left a comment
I think the current implementation unnecessarily complicates the data transfer logic. We may defer this optimization until it really becomes a bottleneck in training.
```python
# Step 1: reserve space (under lock)
with self._lock:
    if self._closing:
        return False
    aligned = (self._next_offset + 63) & ~63
    if aligned + nbytes > self._pool_size:
        return False
    self._next_offset = aligned + nbytes
    self._in_flight += 1
```
Is writing properly locked across different processes?
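As a side note on the reservation arithmetic in the snippet above: `(offset + 63) & ~63` rounds an offset up to the next 64-byte boundary. A quick standalone check:

```python
def align64(offset: int) -> int:
    # Add 63, then clear the low six bits: rounds up to a multiple of 64.
    return (offset + 63) & ~63

assert align64(0) == 0
assert align64(1) == 64
assert align64(64) == 64
assert align64(100) == 128
```

Note that a `threading.Lock` like `self._lock` only serializes threads within one process; the commit message describes the pool as writer-owned, which would make a single-process lock sufficient for the write path.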
@garrett4wade Thank you for your review. I agree that doing so would increase complexity; let's postpone the optimization for now.
Description
Add zero-copy inter-process tensor transfer using POSIX SharedMemory
for same-node RTensor shards. This bypasses HTTP overhead when the
client and server are on the same machine.
The IPC path is automatically used when shard.node_addr points to localhost, 127.0.0.1, or the machine's hostname/IP, falling back to HTTP for remote shards.
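The local-address check described above could be sketched as follows. This is a hypothetical implementation of the `is_local_addr()` utility, not the PR's actual code; it assumes a plain `host` or `host:port` string and uses `lru_cache` for the cached detection the commit message mentions.

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=None)
def is_local_addr(addr: str) -> bool:
    """Return True if addr refers to this machine (loopback or own IP)."""
    # Strip a single ":port" suffix; leave bare hosts and IPv6 alone.
    host = addr.rsplit(":", 1)[0] if addr.count(":") == 1 else addr
    if host in ("localhost", "127.0.0.1", "::1"):
        return True
    try:
        return socket.gethostbyname(host) == socket.gethostbyname(
            socket.gethostname()
        )
    except OSError:
        return False
```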
Key changes:
- _store_local() with header encoding (dtype, shape)
- _fetch_local() to read tensors from shared memory segments
- is_local_addr() utility to detect local node addresses
- HttpRTensorBackend.fetch() now routes local shards via IPC, remote via HTTP

Related Issue
#1117
Type of Change
Checklist
Breaking Change Details (if applicable):
N/A
Additional Context
Files changed:
- areal/infra/rpc/rtensor.py: SharedMemory IPC implementation (+334 lines)
- areal/utils/network.py: is_local_addr() utility (+50 lines)
- tests/test_rtensor.py: comprehensive IPC tests (+453 lines)

Total: +802 lines, -35 lines