From 647f06b056b2c8a8c12f23bae016636abf1385cf Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Sat, 7 Mar 2026 01:22:39 +0000
Subject: [PATCH 01/17] docs: add ishmem4py plan

---
 ishmem4py/README.md | 226 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 ishmem4py/README.md

diff --git a/ishmem4py/README.md b/ishmem4py/README.md
new file mode 100644
index 0000000..56f73a7
--- /dev/null
+++ b/ishmem4py/README.md
@@ -0,0 +1,226 @@
+# ishmem4py
+
+`ishmem4py` is a new Python binding layer for Intel SHMEM being developed in-tree in the
+`ishmem` repository. The immediate target is a CPU-initiated MVP that can initialize Intel
+SHMEM, allocate symmetric memory, and perform basic `put` and `get` operations from Python.
+
+## Why A New Shim Layer
+
+Intel SHMEM's installed interface is not a drop-in target for a pure-Python FFI layer:
+
+- the public header is C++, not C
+- the installed library is a static archive (`libishmem.a`)
+- exported symbols are C++ mangled
+
+For `ishmem4py`, the binding strategy is therefore:
+
+1. add a very small C ABI runtime shim linked against Intel SHMEM
+2. load that shim from Python
+3. keep the Python layer thin and name APIs after Intel SHMEM/OpenSHMEM where possible
+
+This keeps the MVP simple and makes future growth predictable.
+
+## MVP Scope
+
+The first deliverable is intentionally narrow:
+
+- library setup and teardown
+  - `init`
+  - `finalize`
+  - `my_pe`
+  - `n_pes`
+  - `barrier_all`
+  - `quiet`
+- symmetric memory management
+  - `malloc`
+  - `calloc`
+  - `free`
+- CPU-initiated RMA
+  - `putmem`
+  - `getmem`
+- Python memory object support sufficient to:
+  - allocate symmetric memory from Python
+  - keep symmetric allocations as explicit handles
+  - stage local data in and out from Python buffers
+  - run multi-PE ring-style smoke tests
+
+Out of scope for the initial MVP:
+
+- device-initiated communication
+- SYCL queue or work-group extensions
+- full typed API coverage (`ishmem_int_put`, `ishmem_float_get`, and similar)
+- team management beyond world-team queries
+- atomics, signaling, reductions, collectives, or wait/test routines
+
+## Memory Model For The MVP
+
+The MVP treats symmetric allocations as opaque Python objects, not as directly dereferenceable
+Python buffers.
+
+That choice matches Intel SHMEM's default memory model more closely:
+
+- the symmetric heap may reside in device memory
+- host loads and stores are not generally valid on symmetric heap pointers
+- host `putmem` and `getmem` are the portable way to move data between Python-visible memory
+  and symmetric objects
+
+In practice, the MVP uses normal Python buffer-protocol objects (`bytes`, `bytearray`,
+`memoryview`, and similar) as the local source or destination for host-initiated RMA calls.
+
+`ISHMEM_ENABLE_ACCESSIBLE_HOST_HEAP=1` remains useful as an optional mode for debugging or
+future richer views, but it is not required by the first implementation.
+
+## Planned Layout
+
+The implementation is expected to live under `ishmem/ishmem4py/` with roughly this structure:
+
+```text
+ishmem4py/
+  CMakeLists.txt
+  README.md
+  csrc/
+    ishmem4py_runtime.cpp
+    ishmem4py_runtime.h
+  python/
+    ishmem4py/
+      __init__.py
+      _lib.py
+      core.py
+  test/
+    smoke_test.py
+    ring_test.py
+  examples/
+    init_fini.py
+    ring_put_get.py
+```
+
+## API Shape
+
+The Python API will stay close to Intel SHMEM/OpenSHMEM naming at the low level.
+
+Expected MVP entry points:
+
+```python
+import ishmem4py as ishmem
+
+ishmem.init()
+pe = ishmem.my_pe()
+npes = ishmem.n_pes()
+
+buf = ishmem.malloc(1024)
+zeros = ishmem.calloc(256, 4)
+
+ishmem.putmem(remote_buf, local_buf, pe=1)
+ishmem.getmem(local_buf, remote_buf, pe=1)
+
+ishmem.barrier_all()
+ishmem.quiet()
+ishmem.free(buf)
+ishmem.finalize()
+```
+
+The Python `SymmetricMemory` object exposes the symmetric address and allocation size, along with
+convenience `read()` and `write()` helpers implemented in terms of same-PE `getmem` and `putmem`.
+
+## Build Integration Plan
+
+The preferred integration is an optional top-level CMake target, for example:
+
+```text
+-DBUILD_PYTHON_BINDINGS=ON
+```
+
+Expected responsibilities:
+
+- build the C ABI runtime shim with `icpx`
+- link it against the existing Intel SHMEM library objects or installed library
+- place the resulting shared library beside the Python package
+- keep the rest of the core Intel SHMEM build unchanged when Python bindings are disabled
+
+## Current Build And Test Flow
+
+The current branch has been validated with an in-tree build and source-tree Python package.
+
+Build:
+
+```bash
+source /opt/intel/oneapi/setvars.sh
+
+cmake -S /docker-mount/ishmem \
+  -B /docker-mount/ishmem/build-ishmem4py-icpx-noaot \
+  -DCMAKE_C_COMPILER=icx \
+  -DCMAKE_CXX_COMPILER=icpx \
+  -DENABLE_OPENSHMEM=ON \
+  -DSHMEM_DIR=/home/xiii/pkg/SOS-2026-03-06 \
+  -DBUILD_PYTHON_BINDINGS=ON \
+  -DBUILD_UNIT_TESTS=OFF \
+  -DBUILD_PERF_TESTS=OFF \
+  -DBUILD_EXAMPLES=OFF \
+  -DBUILD_APPS=OFF \
+  -DENABLE_AOT_COMPILATION=OFF
+
+cmake --build /docker-mount/ishmem/build-ishmem4py-icpx-noaot --target ishmem4py -j4
+```
+
+Run environment:
+
+```bash
+unset ISHMEM_DIR
+export LD_LIBRARY_PATH=/home/xiii/pkg/SOS-2026-03-06/lib:$LD_LIBRARY_PATH
+export PYTHONPATH=/docker-mount/ishmem/ishmem4py/python
+export ISHMEM4PY_RUNTIME_LIBRARY=/docker-mount/ishmem/build-ishmem4py-icpx-noaot/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
+export ISHMEM_RUNTIME=OPENSHMEM
+```
+
+Smoke test:
+
+```bash
+mpiexec -n 1 /docker-mount/ishmem/scripts/ishmrun \
+  python3 /docker-mount/ishmem/ishmem4py/test/smoke_test.py
+```
+
+Two-PE ring test:
+
+```bash
+mpiexec -n 2 /docker-mount/ishmem/scripts/ishmrun \
+  python3 /docker-mount/ishmem/ishmem4py/test/ring_test.py
+```
+
+Important notes:
+
+- Running without `ishmrun` is not reliable on multi-device nodes because Intel SHMEM needs the
+  normal PE-to-device mapping.
+- The current Intel SHMEM lifecycle should be treated as one-shot per process: initialize once,
+  finalize once, and do not attempt to reinitialize after finalization.
+
+## Test Plan
+
+The MVP needs both local smoke coverage and multi-PE verification:
+
+- single-PE smoke test
+  - init/finalize
+  - malloc/calloc/free
+  - basic local typed view checks
+- two-PE RMA test
+  - each PE allocates symmetric source and destination buffers
+  - PE `i` writes a known pattern
+  - `putmem` and `getmem` exchange data around a ring
+  - barriers validate global completion and correctness
+
+Tests should be runnable through the same launcher expectations as the rest of `ishmem`.
+
+## Known Risks / Open Questions
+
+- Actual runtime verification depends on available Intel GPU + Intel SHMEM backend environment.
+- The installed Intel SHMEM package is static-only, so distribution/packaging needs extra care.
+- Full typed API coverage will likely benefit from code generation once the MVP lands.
+
+## Short-Term Deliverables
+
+Within the current time box, the target is:
+
+1. land the build scaffolding and written plan
+2. implement the runtime shim for the MVP API set
+3. add the thin Python package
+4. add at least one example and basic tests
+5. verify as much as the current environment permits and document any gaps

From a2a6dbc9d79f8c1281dd31e6878a8785d45c64d3 Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Sat, 7 Mar 2026 01:22:43 +0000
Subject: [PATCH 02/17] build: add ishmem4py runtime shim

---
 CMakeLists.txt                       |  6 ++
 ishmem4py/CMakeLists.txt             | 46 ++++++++++++++
 ishmem4py/csrc/ishmem4py_runtime.cpp | 91 ++++++++++++++++++++++++++++
 ishmem4py/csrc/ishmem4py_runtime.h   | 42 +++++++++++++
 4 files changed, 185 insertions(+)
 create mode 100644 ishmem4py/CMakeLists.txt
 create mode 100644 ishmem4py/csrc/ishmem4py_runtime.cpp
 create mode 100644 ishmem4py/csrc/ishmem4py_runtime.h

diff --git a/CMakeLists.txt b/CMakeLists.txt
index abb9ab4..ab6e2ba 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -27,6 +27,7 @@ option(BUILD_UNIT_TESTS "Build unit tests" FALSE)
 option(BUILD_PERF_TESTS "Build performance tests" FALSE)
 option(BUILD_EXAMPLES "Build examples" FALSE)
 option(BUILD_APPS "Build apps" FALSE)
+option(BUILD_PYTHON_BINDINGS "Build Python bindings" FALSE)
 option(BUILD_CMAKE_CONFIG "Build CMake config files" TRUE)
 option(ENABLE_ERROR_CHECKING "Validate API inputs" FALSE)
 option(ENABLE_DLMALLOC "Enable dlmalloc for shared heap" TRUE)
@@ -79,6 +80,7 @@ message(STATUS "Build unit tests:            ${BUILD_UNIT_TESTS}")
 message(STATUS "Build performance tests:     ${BUILD_PERF_TESTS}")
 message(STATUS "Build examples:              ${BUILD_EXAMPLES}")
 message(STATUS "Build apps:                  ${BUILD_APPS}")
+message(STATUS "Build Python bindings:       ${BUILD_PYTHON_BINDINGS}")
 message(STATUS "Build CMake configs:         ${BUILD_CMAKE_CONFIG}")
 message(STATUS "Enable input validation:     ${ENABLE_ERROR_CHECKING}")
 message(STATUS "Enable dlmalloc:             ${ENABLE_DLMALLOC}")
@@ -161,6 +163,10 @@ add_subdirectory(src)
 # Build options for tests are handled directly in test/CMakeLists.txt
 add_subdirectory(test)
 
+if (BUILD_PYTHON_BINDINGS)
+    add_subdirectory(ishmem4py)
+endif()
+
 if (BUILD_EXAMPLES)
     add_subdirectory(examples)
 endif()
diff --git a/ishmem4py/CMakeLists.txt b/ishmem4py/CMakeLists.txt
new file mode 100644
index 0000000..6cfd379
--- /dev/null
+++ b/ishmem4py/CMakeLists.txt
@@ -0,0 +1,46 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.17)
+
+set(PROJECT_NAME "ishmem4py")
+set(PROJECT_FULL_NAME "Intel(R) SHMEM Python bindings")
+
+project(${PROJECT_NAME})
+
+find_package(Python3 COMPONENTS Interpreter REQUIRED)
+
+set(ISHMEM4PY_BUILD_ROOT "${CMAKE_CURRENT_BINARY_DIR}")
+set(ISHMEM4PY_PYTHON_BUILD_ROOT "${ISHMEM4PY_BUILD_ROOT}/python")
+set(ISHMEM4PY_PKG_BUILD_DIR "${ISHMEM4PY_PYTHON_BUILD_ROOT}/ishmem4py")
+
+file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/python/ DESTINATION ${ISHMEM4PY_PYTHON_BUILD_ROOT})
+file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/test/ DESTINATION ${ISHMEM4PY_BUILD_ROOT})
+file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/examples/ DESTINATION ${ISHMEM4PY_BUILD_ROOT})
+
+add_library(ishmem4py-runtime SHARED
+    csrc/ishmem4py_runtime.cpp)
+
+target_include_directories(ishmem4py-runtime PRIVATE
+    "${SYCL_INCLUDE_SYCL_DIR}"
+    "${LEVEL_ZERO_INCLUDEDIR}"
+    "${ISHMEM_BUILD_ROOT}/include"
+    "${PROJECT_SOURCE_DIR}/src"
+    "${CMAKE_CURRENT_SOURCE_DIR}/csrc")
+
+target_link_libraries(ishmem4py-runtime PRIVATE ishmem-static)
+set_target_properties(ishmem4py-runtime PROPERTIES
+    OUTPUT_NAME "_ishmem4py_runtime"
+    PREFIX ""
+    LIBRARY_OUTPUT_DIRECTORY "${ISHMEM4PY_PKG_BUILD_DIR}"
+    RUNTIME_OUTPUT_DIRECTORY "${ISHMEM4PY_PKG_BUILD_DIR}"
+    POSITION_INDEPENDENT_CODE 1)
+
+add_custom_target(ishmem4py ALL
+    DEPENDS ishmem4py-runtime)
+
+message(STATUS "\n======= ishmem4py summary ========")
+message(STATUS "Python interpreter:          ${Python3_EXECUTABLE}")
+message(STATUS "Python package root:         ${ISHMEM4PY_PYTHON_BUILD_ROOT}")
+message(STATUS "Runtime library output:      ${ISHMEM4PY_PKG_BUILD_DIR}")
+message(STATUS "==================================\n")
diff --git a/ishmem4py/csrc/ishmem4py_runtime.cpp b/ishmem4py/csrc/ishmem4py_runtime.cpp
new file mode 100644
index 0000000..ef0270a
--- /dev/null
+++ b/ishmem4py/csrc/ishmem4py_runtime.cpp
@@ -0,0 +1,91 @@
+/* Copyright (C) 2026 Intel Corporation
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#include "ishmem4py_runtime.h"
+
+#include <ishmem.h>
+
+extern "C" {
+
+void ishmem4py_init(void)
+{
+    ishmem_init();
+}
+
+void ishmem4py_finalize(void)
+{
+    ishmem_finalize();
+}
+
+int ishmem4py_my_pe(void)
+{
+    return ishmem_my_pe();
+}
+
+int ishmem4py_n_pes(void)
+{
+    return ishmem_n_pes();
+}
+
+void ishmem4py_info_get_version(int *major, int *minor)
+{
+    ishmem_info_get_version(major, minor);
+}
+
+void ishmem4py_barrier_all(void)
+{
+    ishmem_barrier_all();
+}
+
+void ishmem4py_sync_all(void)
+{
+    ishmem_sync_all();
+}
+
+void ishmem4py_fence(void)
+{
+    ishmem_fence();
+}
+
+void ishmem4py_quiet(void)
+{
+    ishmem_quiet();
+}
+
+void *ishmem4py_malloc(size_t size)
+{
+    return ishmem_malloc(size);
+}
+
+void *ishmem4py_calloc(size_t count, size_t size)
+{
+    return ishmem_calloc(count, size);
+}
+
+void ishmem4py_free(void *ptr)
+{
+    ishmem_free(ptr);
+}
+
+void ishmem4py_putmem(void *dest, const void *src, size_t nbytes, int pe)
+{
+    ishmem_putmem(dest, src, nbytes, pe);
+}
+
+void ishmem4py_getmem(void *dest, const void *src, size_t nbytes, int pe)
+{
+    ishmem_getmem(dest, src, nbytes, pe);
+}
+
+uintptr_t ishmem4py_cast_ptr_to_uintptr(const void *ptr)
+{
+    return reinterpret_cast<uintptr_t>(ptr);
+}
+
+void *ishmem4py_cast_uintptr_to_ptr(uintptr_t value)
+{
+    return reinterpret_cast<void *>(value);
+}
+
+}  // extern "C"
diff --git a/ishmem4py/csrc/ishmem4py_runtime.h b/ishmem4py/csrc/ishmem4py_runtime.h
new file mode 100644
index 0000000..197d7fa
--- /dev/null
+++ b/ishmem4py/csrc/ishmem4py_runtime.h
@@ -0,0 +1,42 @@
+/* Copyright (C) 2026 Intel Corporation
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#ifndef ISHMEM4PY_RUNTIME_H
+#define ISHMEM4PY_RUNTIME_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+#if defined(__GNUC__)
+#define ISHMEM4PY_EXPORT __attribute__((visibility("default")))
+#else
+#define ISHMEM4PY_EXPORT
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+ISHMEM4PY_EXPORT void ishmem4py_init(void);
+ISHMEM4PY_EXPORT void ishmem4py_finalize(void);
+ISHMEM4PY_EXPORT int ishmem4py_my_pe(void);
+ISHMEM4PY_EXPORT int ishmem4py_n_pes(void);
+ISHMEM4PY_EXPORT void ishmem4py_info_get_version(int *major, int *minor);
+ISHMEM4PY_EXPORT void ishmem4py_barrier_all(void);
+ISHMEM4PY_EXPORT void ishmem4py_sync_all(void);
+ISHMEM4PY_EXPORT void ishmem4py_fence(void);
+ISHMEM4PY_EXPORT void ishmem4py_quiet(void);
+ISHMEM4PY_EXPORT void *ishmem4py_malloc(size_t size);
+ISHMEM4PY_EXPORT void *ishmem4py_calloc(size_t count, size_t size);
+ISHMEM4PY_EXPORT void ishmem4py_free(void *ptr);
+ISHMEM4PY_EXPORT void ishmem4py_putmem(void *dest, const void *src, size_t nbytes, int pe);
+ISHMEM4PY_EXPORT void ishmem4py_getmem(void *dest, const void *src, size_t nbytes, int pe);
+ISHMEM4PY_EXPORT uintptr_t ishmem4py_cast_ptr_to_uintptr(const void *ptr);
+ISHMEM4PY_EXPORT void *ishmem4py_cast_uintptr_to_ptr(uintptr_t value);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif

From dc16e2dd6600a6af6d3fa96da516701d9e4cc62e Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Sat, 7 Mar 2026 01:22:48 +0000
Subject: [PATCH 03/17] python: add ishmem4py MVP package and tests

---
 ishmem4py/examples/init_fini.py        |  17 ++
 ishmem4py/examples/ring_put_get.py     |  45 ++++
 ishmem4py/python/ishmem4py/__init__.py |  46 ++++
 ishmem4py/python/ishmem4py/_lib.py     | 102 +++++++++
 ishmem4py/python/ishmem4py/core.py     | 299 +++++++++++++++++++++++++
 ishmem4py/test/ring_test.py            |  55 +++++
 ishmem4py/test/smoke_test.py           |  40 ++++
 7 files changed, 604 insertions(+)
 create mode 100644 ishmem4py/examples/init_fini.py
 create mode 100644 ishmem4py/examples/ring_put_get.py
 create mode 100644 ishmem4py/python/ishmem4py/__init__.py
 create mode 100644 ishmem4py/python/ishmem4py/_lib.py
 create mode 100644 ishmem4py/python/ishmem4py/core.py
 create mode 100644 ishmem4py/test/ring_test.py
 create mode 100644 ishmem4py/test/smoke_test.py

diff --git a/ishmem4py/examples/init_fini.py b/ishmem4py/examples/init_fini.py
new file mode 100644
index 0000000..b0c4a56
--- /dev/null
+++ b/ishmem4py/examples/init_fini.py
@@ -0,0 +1,17 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+import ishmem4py as ishmem
+
+
+def main() -> None:
+    ishmem.init()
+    try:
+        major, minor = ishmem.info_get_version()
+        print(f"PE {ishmem.my_pe()} / {ishmem.n_pes()} using Intel SHMEM {major}.{minor}")
+    finally:
+        ishmem.finalize()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/ishmem4py/examples/ring_put_get.py b/ishmem4py/examples/ring_put_get.py
new file mode 100644
index 0000000..1d3fdea
--- /dev/null
+++ b/ishmem4py/examples/ring_put_get.py
@@ -0,0 +1,45 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+import struct
+
+import ishmem4py as ishmem
+
+
+def main() -> None:
+    ishmem.init()
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        next_pe = (my_pe + 1) % npes
+        prev_pe = (my_pe + npes - 1) % npes
+
+        src = ishmem.malloc(4)
+        dst = ishmem.calloc(1, 4)
+        try:
+            src.write(struct.pack("=i", my_pe))
+            ishmem.barrier_all()
+
+            ishmem.putmem(dst, src.read(4), pe=next_pe)
+            ishmem.barrier_all()
+
+            received = struct.unpack("=i", dst.read(4))[0]
+
+            fetched = bytearray(4)
+            ishmem.getmem(fetched, src, pe=next_pe)
+            fetched_value = struct.unpack("=i", fetched)[0]
+
+            print(
+                f"PE {my_pe}: dst after putmem={received} (expected {prev_pe}), "
+                f"getmem from PE {next_pe} returned {fetched_value}"
+            )
+            ishmem.barrier_all()
+        finally:
+            ishmem.free(dst)
+            ishmem.free(src)
+    finally:
+        ishmem.finalize()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/ishmem4py/python/ishmem4py/__init__.py b/ishmem4py/python/ishmem4py/__init__.py
new file mode 100644
index 0000000..facce7c
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/__init__.py
@@ -0,0 +1,46 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from .core import (
+    IshmemError,
+    IshmemStateError,
+    SymmetricMemory,
+    barrier_all,
+    calloc,
+    fence,
+    finalize,
+    free,
+    getmem,
+    info_get_version,
+    init,
+    is_initialized,
+    malloc,
+    my_pe,
+    n_pes,
+    putmem,
+    quiet,
+    sync_all,
+)
+
+__all__ = [
+    "IshmemError",
+    "IshmemStateError",
+    "SymmetricMemory",
+    "barrier_all",
+    "calloc",
+    "fence",
+    "finalize",
+    "free",
+    "getmem",
+    "info_get_version",
+    "init",
+    "is_initialized",
+    "malloc",
+    "my_pe",
+    "n_pes",
+    "putmem",
+    "quiet",
+    "sync_all",
+]
+
+__version__ = "0.1.0a0"
diff --git a/ishmem4py/python/ishmem4py/_lib.py b/ishmem4py/python/ishmem4py/_lib.py
new file mode 100644
index 0000000..6be7071
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/_lib.py
@@ -0,0 +1,102 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+import ctypes
+import os
+from pathlib import Path
+
+_RUNTIME_ENVVAR = "ISHMEM4PY_RUNTIME_LIBRARY"
+
+
+def _candidate_paths():
+    env_path = os.environ.get(_RUNTIME_ENVVAR)
+    if env_path:
+        yield Path(env_path)
+
+    package_dir = Path(__file__).resolve().parent
+    yield package_dir / "_ishmem4py_runtime.so"
+
+
+def _load_runtime():
+    last_error = None
+    for candidate in _candidate_paths():
+        if not candidate.exists():
+            continue
+        try:
+            runtime = ctypes.CDLL(str(candidate))
+        except OSError as exc:
+            last_error = exc
+            continue
+
+        runtime.ishmem4py_init.argtypes = []
+        runtime.ishmem4py_init.restype = None
+
+        runtime.ishmem4py_finalize.argtypes = []
+        runtime.ishmem4py_finalize.restype = None
+
+        runtime.ishmem4py_my_pe.argtypes = []
+        runtime.ishmem4py_my_pe.restype = ctypes.c_int
+
+        runtime.ishmem4py_n_pes.argtypes = []
+        runtime.ishmem4py_n_pes.restype = ctypes.c_int
+
+        runtime.ishmem4py_info_get_version.argtypes = [
+            ctypes.POINTER(ctypes.c_int),
+            ctypes.POINTER(ctypes.c_int),
+        ]
+        runtime.ishmem4py_info_get_version.restype = None
+
+        runtime.ishmem4py_barrier_all.argtypes = []
+        runtime.ishmem4py_barrier_all.restype = None
+
+        runtime.ishmem4py_sync_all.argtypes = []
+        runtime.ishmem4py_sync_all.restype = None
+
+        runtime.ishmem4py_fence.argtypes = []
+        runtime.ishmem4py_fence.restype = None
+
+        runtime.ishmem4py_quiet.argtypes = []
+        runtime.ishmem4py_quiet.restype = None
+
+        runtime.ishmem4py_malloc.argtypes = [ctypes.c_size_t]
+        runtime.ishmem4py_malloc.restype = ctypes.c_void_p
+
+        runtime.ishmem4py_calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
+        runtime.ishmem4py_calloc.restype = ctypes.c_void_p
+
+        runtime.ishmem4py_free.argtypes = [ctypes.c_void_p]
+        runtime.ishmem4py_free.restype = None
+
+        runtime.ishmem4py_putmem.argtypes = [
+            ctypes.c_void_p,
+            ctypes.c_void_p,
+            ctypes.c_size_t,
+            ctypes.c_int,
+        ]
+        runtime.ishmem4py_putmem.restype = None
+
+        runtime.ishmem4py_getmem.argtypes = [
+            ctypes.c_void_p,
+            ctypes.c_void_p,
+            ctypes.c_size_t,
+            ctypes.c_int,
+        ]
+        runtime.ishmem4py_getmem.restype = None
+
+        return runtime
+
+    if last_error is not None:
+        raise RuntimeError(
+            "Found ishmem4py runtime library candidate but failed to load it. "
+            "Ensure oneAPI runtime libraries are in LD_LIBRARY_PATH and set "
+            f"{_RUNTIME_ENVVAR} if needed."
+        ) from last_error
+
+    raise RuntimeError(
+        "Could not locate _ishmem4py_runtime.so. Build ishmem with "
+        "-DBUILD_PYTHON_BINDINGS=ON and add the resulting build-tree python directory to "
+        "PYTHONPATH, or set ISHMEM4PY_RUNTIME_LIBRARY explicitly."
+    )
+
+
+RUNTIME = _load_runtime()
diff --git a/ishmem4py/python/ishmem4py/core.py b/ishmem4py/python/ishmem4py/core.py
new file mode 100644
index 0000000..dbb8b3a
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/core.py
@@ -0,0 +1,299 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+import ctypes
+from dataclasses import dataclass
+from typing import Optional
+
+from ._lib import RUNTIME
+
+
+class IshmemError(RuntimeError):
+    pass
+
+
+class IshmemStateError(IshmemError):
+    pass
+
+
+@dataclass(frozen=True)
+class _PointerInfo:
+    ptr: int
+    size: int
+    keepalive: object
+
+
+class SymmetricMemory:
+    def __init__(self, ptr: int, size: int):
+        self._ptr = int(ptr)
+        self._size = int(size)
+        self._freed = False
+
+    @property
+    def ptr(self) -> int:
+        return self._ptr
+
+    @property
+    def size(self) -> int:
+        return self._size
+
+    @property
+    def freed(self) -> bool:
+        return self._freed
+
+    def _mark_freed(self) -> None:
+        self._freed = True
+
+    def read(self, size: Optional[int] = None, *, offset: int = 0, pe: Optional[int] = None) -> bytes:
+        if pe is None:
+            pe = my_pe()
+        length = _normalize_span(self, size=size, offset=offset)
+        dst = bytearray(length)
+        getmem(dst, self, pe=pe, size=length, src_offset=offset)
+        return bytes(dst)
+
+    def write(self, data, *, offset: int = 0, pe: Optional[int] = None) -> int:
+        if pe is None:
+            pe = my_pe()
+        return putmem(self, data, pe=pe, dest_offset=offset)
+
+    def __int__(self) -> int:
+        return self._ptr
+
+    def __repr__(self) -> str:
+        state = "freed" if self._freed else "active"
+        return f"SymmetricMemory(ptr=0x{self._ptr:x}, size={self._size}, state={state})"
+
+
+_initialized = False
+_finalized = False
+_live_allocations = {}
+
+
+def _require_initialized() -> None:
+    if not _initialized:
+        raise IshmemStateError("Intel SHMEM is not initialized")
+
+
+def _require_active(symm: SymmetricMemory) -> None:
+    if not isinstance(symm, SymmetricMemory):
+        raise TypeError("expected a SymmetricMemory instance")
+    if symm.freed:
+        raise IshmemStateError("symmetric allocation has already been freed")
+
+
+def _normalize_span(obj, *, size: Optional[int], offset: int) -> int:
+    if offset < 0:
+        raise ValueError("offset must be >= 0")
+
+    available = obj.size - offset
+    if available < 0:
+        raise ValueError("offset is past the end of the object")
+
+    if size is None:
+        return available
+    if size < 0:
+        raise ValueError("size must be >= 0")
+    if size > available:
+        raise ValueError("requested size extends past the end of the object")
+    return size
+
+
+def _ensure_contiguous_memoryview(obj, *, writable: bool):
+    mv = memoryview(obj)
+    if not mv.contiguous:
+        raise ValueError("buffer must be contiguous")
+    if writable and mv.readonly:
+        raise ValueError("destination buffer must be writable")
+    return mv
+
+
+def _pointer_from_local_buffer(obj, *, writable: bool, offset: int = 0) -> _PointerInfo:
+    if isinstance(obj, SymmetricMemory):
+        _require_active(obj)
+        size = _normalize_span(obj, size=None, offset=offset)
+        return _PointerInfo(ptr=obj.ptr + offset, size=size, keepalive=obj)
+
+    mv = _ensure_contiguous_memoryview(obj, writable=writable)
+
+    if offset < 0:
+        raise ValueError("offset must be >= 0")
+    if offset > mv.nbytes:
+        raise ValueError("offset is past the end of the buffer")
+
+    if mv.readonly:
+        if writable:
+            raise ValueError("destination buffer must be writable")
+        copied = (ctypes.c_char * mv.nbytes).from_buffer_copy(mv)
+        return _PointerInfo(ptr=ctypes.addressof(copied) + offset, size=mv.nbytes - offset, keepalive=copied)
+
+    raw = (ctypes.c_char * mv.nbytes).from_buffer(mv)
+    return _PointerInfo(ptr=ctypes.addressof(raw) + offset, size=mv.nbytes - offset, keepalive=(mv, raw))
+
+
+def _symmetric_target(obj) -> SymmetricMemory:
+    _require_active(obj)
+    return obj
+
+
+def is_initialized() -> bool:
+    return _initialized
+
+
+def init() -> None:
+    global _initialized
+    if _initialized:
+        raise IshmemStateError("Intel SHMEM is already initialized")
+    if _finalized:
+        raise IshmemStateError(
+            "Intel SHMEM has already been finalized in this process and cannot be reinitialized"
+        )
+    RUNTIME.ishmem4py_init()
+    _initialized = True
+
+
+def finalize() -> None:
+    global _initialized, _finalized
+    if not _initialized:
+        raise IshmemStateError("Intel SHMEM is not initialized")
+    if _live_allocations:
+        leaked = ", ".join(f"0x{ptr:x}" for ptr in sorted(_live_allocations))
+        raise IshmemStateError(
+            "cannot finalize while symmetric allocations are still live; "
+            f"free them first: {leaked}"
+        )
+    RUNTIME.ishmem4py_finalize()
+    _initialized = False
+    _finalized = True
+
+
+def my_pe() -> int:
+    _require_initialized()
+    return int(RUNTIME.ishmem4py_my_pe())
+
+
+def n_pes() -> int:
+    _require_initialized()
+    return int(RUNTIME.ishmem4py_n_pes())
+
+
+def info_get_version() -> tuple[int, int]:
+    major = ctypes.c_int()
+    minor = ctypes.c_int()
+    RUNTIME.ishmem4py_info_get_version(ctypes.byref(major), ctypes.byref(minor))
+    return int(major.value), int(minor.value)
+
+
+def barrier_all() -> None:
+    _require_initialized()
+    RUNTIME.ishmem4py_barrier_all()
+
+
+def sync_all() -> None:
+    _require_initialized()
+    RUNTIME.ishmem4py_sync_all()
+
+
+def fence() -> None:
+    _require_initialized()
+    RUNTIME.ishmem4py_fence()
+
+
+def quiet() -> None:
+    _require_initialized()
+    RUNTIME.ishmem4py_quiet()
+
+
+def malloc(size: int) -> SymmetricMemory:
+    _require_initialized()
+    if size < 0:
+        raise ValueError("size must be >= 0")
+    ptr = RUNTIME.ishmem4py_malloc(size)
+    if size > 0 and not ptr:
+        raise IshmemError(f"ishmem_malloc({size}) returned NULL")
+    result = SymmetricMemory(ptr=int(ptr or 0), size=size)
+    _live_allocations[result.ptr] = result
+    return result
+
+
+def calloc(count: int, size: int) -> SymmetricMemory:
+    _require_initialized()
+    if count < 0 or size < 0:
+        raise ValueError("count and size must be >= 0")
+    ptr = RUNTIME.ishmem4py_calloc(count, size)
+    total_size = count * size
+    if total_size > 0 and not ptr:
+        raise IshmemError(f"ishmem_calloc({count}, {size}) returned NULL")
+    result = SymmetricMemory(ptr=int(ptr or 0), size=total_size)
+    _live_allocations[result.ptr] = result
+    return result
+
+
+def free(symm: SymmetricMemory) -> None:
+    _require_initialized()
+    _require_active(symm)
+    RUNTIME.ishmem4py_free(ctypes.c_void_p(symm.ptr))
+    _live_allocations.pop(symm.ptr, None)
+    symm._mark_freed()
+
+
+def putmem(
+    dest: SymmetricMemory,
+    src,
+    *,
+    pe: int,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    _require_initialized()
+    dest = _symmetric_target(dest)
+    src_info = _pointer_from_local_buffer(src, writable=False, offset=src_offset)
+    dest_size = _normalize_span(dest, size=None, offset=dest_offset)
+    nbytes = min(dest_size, src_info.size) if size is None else size
+
+    if nbytes < 0:
+        raise ValueError("size must be >= 0")
+    if nbytes > dest_size:
+        raise ValueError("size extends past the destination symmetric object")
+    if nbytes > src_info.size:
+        raise ValueError("size extends past the source buffer")
+
+    RUNTIME.ishmem4py_putmem(
+        ctypes.c_void_p(dest.ptr + dest_offset),
+        ctypes.c_void_p(src_info.ptr),
+        nbytes,
+        pe,
+    )
+    return nbytes
+
+
+def getmem(
+    dest,
+    src: SymmetricMemory,
+    *,
+    pe: int,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    _require_initialized()
+    src = _symmetric_target(src)
+    dest_info = _pointer_from_local_buffer(dest, writable=True, offset=dest_offset)
+    src_size = _normalize_span(src, size=None, offset=src_offset)
+    nbytes = min(dest_info.size, src_size) if size is None else size
+
+    if nbytes < 0:
+        raise ValueError("size must be >= 0")
+    if nbytes > dest_info.size:
+        raise ValueError("size extends past the destination buffer")
+    if nbytes > src_size:
+        raise ValueError("size extends past the source symmetric object")
+
+    RUNTIME.ishmem4py_getmem(
+        ctypes.c_void_p(dest_info.ptr),
+        ctypes.c_void_p(src.ptr + src_offset),
+        nbytes,
+        pe,
+    )
+    return nbytes
diff --git a/ishmem4py/test/ring_test.py b/ishmem4py/test/ring_test.py
new file mode 100644
index 0000000..8a45fe6
--- /dev/null
+++ b/ishmem4py/test/ring_test.py
@@ -0,0 +1,55 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+import struct
+import sys
+
+import ishmem4py as ishmem
+
+
+def _expect_equal(label: str, actual: int, expected: int) -> None:
+    if actual != expected:
+        raise AssertionError(f"{label}: expected {expected}, got {actual}")
+
+
+def main() -> int:
+    ishmem.init()
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        if npes < 2:
+            raise RuntimeError("ring_test.py requires at least 2 PEs")
+
+        next_pe = (my_pe + 1) % npes
+        prev_pe = (my_pe + npes - 1) % npes
+
+        src = ishmem.malloc(4)
+        dst = ishmem.calloc(1, 4)
+        try:
+            src.write(struct.pack("=i", my_pe))
+
+            ishmem.barrier_all()
+
+            ishmem.putmem(dst, src.read(4), pe=next_pe)
+            ishmem.barrier_all()
+
+            received = struct.unpack("=i", dst.read(4))[0]
+            _expect_equal("putmem ring result", received, prev_pe)
+
+            host_value = bytearray(4)
+            ishmem.getmem(host_value, src, pe=next_pe)
+            fetched = struct.unpack("=i", host_value)[0]
+            _expect_equal("getmem ring result", fetched, next_pe)
+
+            ishmem.barrier_all()
+        finally:
+            ishmem.free(dst)
+            ishmem.free(src)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/ishmem4py/test/smoke_test.py b/ishmem4py/test/smoke_test.py
new file mode 100644
index 0000000..882d43b
--- /dev/null
+++ b/ishmem4py/test/smoke_test.py
@@ -0,0 +1,40 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+import struct
+import unittest
+
+import ishmem4py as ishmem
+
+
+class SmokeTest(unittest.TestCase):
+    @classmethod
+    def setUpClass(cls):
+        ishmem.init()
+
+    @classmethod
+    def tearDownClass(cls):
+        if ishmem.is_initialized():
+            ishmem.finalize()
+
+    def test_version_and_world_info(self):
+        major, minor = ishmem.info_get_version()
+        self.assertGreaterEqual(major, 1)
+        self.assertGreaterEqual(minor, 0)
+        self.assertEqual(ishmem.n_pes(), 1)
+        self.assertEqual(ishmem.my_pe(), 0)
+
+    def test_malloc_read_write_free(self):
+        buf = ishmem.malloc(16)
+        buf.write(struct.pack("=4i", 1, 2, 3, 4))
+        self.assertEqual(struct.unpack("=4i", buf.read(16)), (1, 2, 3, 4))
+        ishmem.free(buf)
+
+    def test_calloc_zero_initialized(self):
+        buf = ishmem.calloc(4, 4)
+        self.assertEqual(buf.read(16), b"\x00" * 16)
+        ishmem.free(buf)
+
+
+if __name__ == "__main__":
+    unittest.main()

From 02438425163871960d3eb3d59f2b999f6a75b25b Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Tue, 10 Mar 2026 20:56:56 +0000
Subject: [PATCH 04/17] python: add editable ishmem4py packaging

---
 .gitignore                      |  4 ++++
 ishmem4py/python/README.md      | 21 +++++++++++++++++++++
 ishmem4py/python/pyproject.toml | 27 +++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)
 create mode 100644 ishmem4py/python/README.md
 create mode 100644 ishmem4py/python/pyproject.toml

diff --git a/.gitignore b/.gitignore
index d329731..6278da7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,8 @@
 .*.sw*
 build
+build-*/
 tags
 ishmem/config.h
+*.pyc
+__pycache__/
+*.egg-info/
diff --git a/ishmem4py/python/README.md b/ishmem4py/python/README.md
new file mode 100644
index 0000000..8b005f8
--- /dev/null
+++ b/ishmem4py/python/README.md
@@ -0,0 +1,21 @@
+# ishmem4py Python Package
+
+This directory contains the installable Python package for the current `ishmem4py` MVP.
+
+The package is intentionally thin:
+
+- it installs the pure-Python wrapper module
+- it expects the runtime shared library to already exist
+- it locates that runtime through `ISHMEM4PY_RUNTIME_LIBRARY`
+
+Typical editable install:
+
+```bash
+pip install -e /docker-mount/ishmem/ishmem4py/python
+```
+
+At runtime, set:
+
+```bash
+export ISHMEM4PY_RUNTIME_LIBRARY=/docker-mount/ishmem/build-ishmem4py-icpx-noaot/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
+```
diff --git a/ishmem4py/python/pyproject.toml b/ishmem4py/python/pyproject.toml
new file mode 100644
index 0000000..8ffc893
--- /dev/null
+++ b/ishmem4py/python/pyproject.toml
@@ -0,0 +1,27 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "ishmem4py"
+version = "0.1.0a0"
+description = "Python bindings for Intel SHMEM"
+readme = "README.md"
+requires-python = ">=3.10"
+authors = [
+  { name = "Intel SHMEM contributors" }
+]
+license = { text = "BSD-3-Clause" }
+classifiers = [
+  "Development Status :: 3 - Alpha",
+  "License :: OSI Approved :: BSD License",
+  "Programming Language :: Python :: 3",
+  "Programming Language :: Python :: 3 :: Only",
+]
+
+[tool.setuptools]
+include-package-data = false
+
+[tool.setuptools.packages.find]
+where = ["."]
+include = ["ishmem4py"]

From 8467a36ddc3986f990068d033c7e44a90aaaaa0c Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Wed, 11 Mar 2026 21:51:24 +0000
Subject: [PATCH 05/17] python: bundle ishmem4py runtime for standard installs

---
 ishmem4py/CMakeLists.txt           | 21 +++++++++++++++++++--
 ishmem4py/python/README.md         | 28 +++++++++++++++++++---------
 ishmem4py/python/ishmem4py/_lib.py |  6 +++---
 ishmem4py/python/pyproject.toml    |  7 +++++++
 ishmem4py/python/setup.py          | 30 ++++++++++++++++++++++++++++++
 5 files changed, 78 insertions(+), 14 deletions(-)
 create mode 100644 ishmem4py/python/setup.py

diff --git a/ishmem4py/CMakeLists.txt b/ishmem4py/CMakeLists.txt
index 6cfd379..a1b0e8e 100644
--- a/ishmem4py/CMakeLists.txt
+++ b/ishmem4py/CMakeLists.txt
@@ -13,11 +13,28 @@ find_package(Python3 COMPONENTS Interpreter REQUIRED)
 set(ISHMEM4PY_BUILD_ROOT "${CMAKE_CURRENT_BINARY_DIR}")
 set(ISHMEM4PY_PYTHON_BUILD_ROOT "${ISHMEM4PY_BUILD_ROOT}/python")
 set(ISHMEM4PY_PKG_BUILD_DIR "${ISHMEM4PY_PYTHON_BUILD_ROOT}/ishmem4py")
+set(ISHMEM4PY_SOURCE_PYTHON_ROOT "${CMAKE_CURRENT_SOURCE_DIR}/python")
+set(ISHMEM4PY_SOURCE_PKG_DIR "${ISHMEM4PY_SOURCE_PYTHON_ROOT}/ishmem4py")
 
-file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/python/ DESTINATION ${ISHMEM4PY_PYTHON_BUILD_ROOT})
 file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/test/ DESTINATION ${ISHMEM4PY_BUILD_ROOT})
 file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/examples/ DESTINATION ${ISHMEM4PY_BUILD_ROOT})
 
+add_custom_target(ishmem4py-package-files
+    COMMAND ${CMAKE_COMMAND} -E make_directory "${ISHMEM4PY_PYTHON_BUILD_ROOT}"
+    COMMAND ${CMAKE_COMMAND} -E copy_if_different
+        "${ISHMEM4PY_SOURCE_PYTHON_ROOT}/pyproject.toml"
+        "${ISHMEM4PY_PYTHON_BUILD_ROOT}/pyproject.toml"
+    COMMAND ${CMAKE_COMMAND} -E copy_if_different
+        "${ISHMEM4PY_SOURCE_PYTHON_ROOT}/README.md"
+        "${ISHMEM4PY_PYTHON_BUILD_ROOT}/README.md"
+    COMMAND ${CMAKE_COMMAND} -E copy_if_different
+        "${ISHMEM4PY_SOURCE_PYTHON_ROOT}/setup.py"
+        "${ISHMEM4PY_PYTHON_BUILD_ROOT}/setup.py"
+    COMMAND ${CMAKE_COMMAND} -E copy_directory
+        "${ISHMEM4PY_SOURCE_PKG_DIR}"
+        "${ISHMEM4PY_PKG_BUILD_DIR}"
+    VERBATIM)
+
 add_library(ishmem4py-runtime SHARED
     csrc/ishmem4py_runtime.cpp)
 
@@ -37,7 +54,7 @@ set_target_properties(ishmem4py-runtime PROPERTIES
     POSITION_INDEPENDENT_CODE 1)
 
 add_custom_target(ishmem4py ALL
-    DEPENDS ishmem4py-runtime)
+    DEPENDS ishmem4py-package-files ishmem4py-runtime)
 
 message(STATUS "\n======= ishmem4py summary ========")
 message(STATUS "Python interpreter:          ${Python3_EXECUTABLE}")
diff --git a/ishmem4py/python/README.md b/ishmem4py/python/README.md
index 8b005f8..54c38a8 100644
--- a/ishmem4py/python/README.md
+++ b/ishmem4py/python/README.md
@@ -1,21 +1,31 @@
 # ishmem4py Python Package
 
-This directory contains the installable Python package for the current `ishmem4py` MVP.
+This directory supports two install modes.
 
-The package is intentionally thin:
+## Standard install
 
-- it installs the pure-Python wrapper module
-- it expects the runtime shared library to already exist
-- it locates that runtime through `ISHMEM4PY_RUNTIME_LIBRARY`
+Build `ishmem` with `-DBUILD_PYTHON_BINDINGS=ON`, then install the package from the build tree:
 
-Typical editable install:
+```bash
+pip install /path/to/ishmem-build/ishmem4py/python
+```
+
+That build-tree package includes `_ishmem4py_runtime.so`, so `ishmem4py` can load its own runtime
+without `ISHMEM4PY_RUNTIME_LIBRARY`.
+
+## Editable / dev install
+
+Install the source tree in editable mode when you are actively changing the Python code:
 
 ```bash
-pip install -e /docker-mount/ishmem/ishmem4py/python
+pip install -e /path/to/ishmem-src/ishmem4py/python
 ```
 
-At runtime, set:
+For editable installs, point the package at the runtime produced by your CMake build:
 
 ```bash
-export ISHMEM4PY_RUNTIME_LIBRARY=/docker-mount/ishmem/build-ishmem4py-icpx-noaot/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
+export ISHMEM4PY_RUNTIME_LIBRARY=/path/to/ishmem-build/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
 ```
+
+In both modes, external runtime dependencies such as oneAPI, PTI, and the OpenSHMEM backend still
+need to be visible through your normal environment setup.
diff --git a/ishmem4py/python/ishmem4py/_lib.py b/ishmem4py/python/ishmem4py/_lib.py
index 6be7071..e87d034 100644
--- a/ishmem4py/python/ishmem4py/_lib.py
+++ b/ishmem4py/python/ishmem4py/_lib.py
@@ -93,9 +93,9 @@ def _load_runtime():
         ) from last_error
 
     raise RuntimeError(
-        "Could not locate _ishmem4py_runtime.so. Build ishmem with "
-        "-DBUILD_PYTHON_BINDINGS=ON and add the resulting build-tree python directory to "
-        "PYTHONPATH, or set ISHMEM4PY_RUNTIME_LIBRARY explicitly."
+        "Could not locate _ishmem4py_runtime.so. For a standard install, build ishmem with "
+        "-DBUILD_PYTHON_BINDINGS=ON and pip install the build-tree package directory. "
+        f"For an editable/dev install, set {_RUNTIME_ENVVAR} to the built runtime library."
     )
 
 
diff --git a/ishmem4py/python/pyproject.toml b/ishmem4py/python/pyproject.toml
index 8ffc893..3243a03 100644
--- a/ishmem4py/python/pyproject.toml
+++ b/ishmem4py/python/pyproject.toml
@@ -25,3 +25,10 @@ include-package-data = false
 [tool.setuptools.packages.find]
 where = ["."]
 include = ["ishmem4py"]
+
+[tool.setuptools.package-data]
+ishmem4py = [
+  "_ishmem4py_runtime*.so",
+  "_ishmem4py_runtime*.dylib",
+  "_ishmem4py_runtime*.pyd",
+]
diff --git a/ishmem4py/python/setup.py b/ishmem4py/python/setup.py
new file mode 100644
index 0000000..57568da
--- /dev/null
+++ b/ishmem4py/python/setup.py
@@ -0,0 +1,30 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from setuptools import setup
+from setuptools.dist import Distribution
+
+try:
+    from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
+except ImportError:
+    _bdist_wheel = None
+
+
+class BinaryDistribution(Distribution):
+    def has_ext_modules(self):
+        return True
+
+
+cmdclass = {}
+
+if _bdist_wheel is not None:
+    class bdist_wheel(_bdist_wheel):
+        def finalize_options(self):
+            super().finalize_options()
+            self.root_is_pure = False
+
+
+    cmdclass["bdist_wheel"] = bdist_wheel
+
+
+setup(distclass=BinaryDistribution, cmdclass=cmdclass)

From 93c23f49ea28a5f86bb2a8dd2396972aa3022723 Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Thu, 12 Mar 2026 19:59:59 +0000
Subject: [PATCH 06/17] Add validated ishmem4py host API surface

---
 ishmem4py/csrc/ishmem4py_runtime.cpp     | 408 ++++++++++++++++++++
 ishmem4py/csrc/ishmem4py_runtime.h       |  65 ++++
 ishmem4py/python/ishmem4py/__init__.py   |  46 +--
 ishmem4py/python/ishmem4py/_common.py    | 450 +++++++++++++++++++++++
 ishmem4py/python/ishmem4py/_lib.py       | 278 +++++++++++---
 ishmem4py/python/ishmem4py/collective.py | 288 +++++++++++++++
 ishmem4py/python/ishmem4py/core.py       | 339 +++--------------
 ishmem4py/python/ishmem4py/init_fini.py  | 107 ++++++
 ishmem4py/python/ishmem4py/memory.py     |  88 +++++
 ishmem4py/python/ishmem4py/rma.py        | 113 ++++++
 ishmem4py/python/ishmem4py/teams.py      |  56 +++
 ishmem4py/python/ishmem4py/version.py    |   4 +
 12 files changed, 1852 insertions(+), 390 deletions(-)
 create mode 100644 ishmem4py/python/ishmem4py/_common.py
 create mode 100644 ishmem4py/python/ishmem4py/collective.py
 create mode 100644 ishmem4py/python/ishmem4py/init_fini.py
 create mode 100644 ishmem4py/python/ishmem4py/memory.py
 create mode 100644 ishmem4py/python/ishmem4py/rma.py
 create mode 100644 ishmem4py/python/ishmem4py/teams.py
 create mode 100644 ishmem4py/python/ishmem4py/version.py

diff --git a/ishmem4py/csrc/ishmem4py_runtime.cpp b/ishmem4py/csrc/ishmem4py_runtime.cpp
index ef0270a..0bc62a9 100644
--- a/ishmem4py/csrc/ishmem4py_runtime.cpp
+++ b/ishmem4py/csrc/ishmem4py_runtime.cpp
@@ -6,6 +6,89 @@
 
 #include <ishmem.h>
 
+#include <cstring>
+
+namespace {
+
+template <typename T>
+uint64_t ishmem4py_to_bits(T value)
+{
+    uint64_t bits = 0;
+    static_assert(sizeof(T) <= sizeof(bits));
+    std::memcpy(&bits, &value, sizeof(T));
+    return bits;
+}
+
+template <typename T>
+T ishmem4py_from_bits(uint64_t bits)
+{
+    T value{};
+    std::memcpy(&value, &bits, sizeof(T));
+    return value;
+}
+
+const ishmem_team_config_t *ishmem4py_convert_team_config(const ishmem4py_team_config_t *config,
+                                                          ishmem_team_config_t *tmp)
+{
+    if (config == nullptr) return nullptr;
+    tmp->num_contexts = config->num_contexts;
+    return tmp;
+}
+
+#define ISHMEM4PY_ATOMIC_FETCH_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX)                                   \
+    case DTYPE_ENUM:                                                                                 \
+        return ishmem4py_to_bits(                                                                    \
+            ishmem_##SUFFIX##_atomic_fetch(reinterpret_cast<CPP_TYPE *>(source), pe))
+
+#define ISHMEM4PY_ATOMIC_SET_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX)                                     \
+    case DTYPE_ENUM:                                                                                 \
+        ishmem_##SUFFIX##_atomic_set(reinterpret_cast<CPP_TYPE *>(dest),                             \
+                                     ishmem4py_from_bits<CPP_TYPE>(value_bits), pe);                \
+        return
+
+#define ISHMEM4PY_ATOMIC_SWAP_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX)                                    \
+    case DTYPE_ENUM:                                                                                 \
+        return ishmem4py_to_bits(                                                                    \
+            ishmem_##SUFFIX##_atomic_swap(reinterpret_cast<CPP_TYPE *>(dest),                        \
+                                          ishmem4py_from_bits<CPP_TYPE>(value_bits), pe))
+
+#define ISHMEM4PY_ATOMIC_COMPARE_SWAP_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX)                            \
+    case DTYPE_ENUM:                                                                                 \
+        return ishmem4py_to_bits(                                                                    \
+            ishmem_##SUFFIX##_atomic_compare_swap(                                                   \
+                reinterpret_cast<CPP_TYPE *>(dest), ishmem4py_from_bits<CPP_TYPE>(cond_bits),       \
+                ishmem4py_from_bits<CPP_TYPE>(value_bits), pe))
+
+#define ISHMEM4PY_ATOMIC_FETCH_INC_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX)                               \
+    case DTYPE_ENUM:                                                                                 \
+        return ishmem4py_to_bits(                                                                    \
+            ishmem_##SUFFIX##_atomic_fetch_inc(reinterpret_cast<CPP_TYPE *>(dest), pe))
+
+#define ISHMEM4PY_ATOMIC_INC_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX)                                     \
+    case DTYPE_ENUM:                                                                                 \
+        ishmem_##SUFFIX##_atomic_inc(reinterpret_cast<CPP_TYPE *>(dest), pe);                       \
+        return
+
+#define ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX, OP)                        \
+    case DTYPE_ENUM:                                                                                 \
+        return ishmem4py_to_bits(                                                                    \
+            ishmem_##SUFFIX##_atomic_fetch_##OP(reinterpret_cast<CPP_TYPE *>(dest),                  \
+                                                ishmem4py_from_bits<CPP_TYPE>(value_bits), pe))
+
+#define ISHMEM4PY_ATOMIC_BINARY_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX, OP)                              \
+    case DTYPE_ENUM:                                                                                 \
+        ishmem_##SUFFIX##_atomic_##OP(reinterpret_cast<CPP_TYPE *>(dest),                            \
+                                      ishmem4py_from_bits<CPP_TYPE>(value_bits), pe);               \
+        return
+
+#define ISHMEM4PY_REDUCE_CASE(DTYPE_ENUM, CPP_TYPE, SUFFIX, OP)                                     \
+    case DTYPE_ENUM:                                                                                 \
+        return ishmem_##SUFFIX##_##OP##_reduce(                                                      \
+            static_cast<ishmem_team_t>(team), reinterpret_cast<CPP_TYPE *>(dest),                   \
+            reinterpret_cast<const CPP_TYPE *>(src), count)
+
+}  // namespace
+
 extern "C" {
 
 void ishmem4py_init(void)
@@ -33,6 +116,18 @@ void ishmem4py_info_get_version(int *major, int *minor)
     ishmem_info_get_version(major, minor);
 }
 
+void ishmem4py_info_get_name(char *name)
+{
+    ishmem_info_get_name(name);
+}
+
+void ishmem4py_vendor_get_version(int *major, int *minor, int *patch)
+{
+    if (major != nullptr) *major = ISHMEM_MAJOR_VERSION;
+    if (minor != nullptr) *minor = ISHMEM_MINOR_VERSION;
+    if (patch != nullptr) *patch = ISHMEM_PATCH_VERSION;
+}
+
 void ishmem4py_barrier_all(void)
 {
     ishmem_barrier_all();
@@ -78,6 +173,319 @@ void ishmem4py_getmem(void *dest, const void *src, size_t nbytes, int pe)
     ishmem_getmem(dest, src, nbytes, pe);
 }
 
+void *ishmem4py_ptr(const void *dest, int pe)
+{
+    return ishmem_ptr(dest, pe);
+}
+
+int ishmem4py_team_my_pe(int team)
+{
+    return ishmem_team_my_pe(static_cast<ishmem_team_t>(team));
+}
+
+int ishmem4py_team_n_pes(int team)
+{
+    return ishmem_team_n_pes(static_cast<ishmem_team_t>(team));
+}
+
+int ishmem4py_team_translate_pe(int src_team, int src_pe, int dest_team)
+{
+    return ishmem_team_translate_pe(static_cast<ishmem_team_t>(src_team), src_pe,
+                                    static_cast<ishmem_team_t>(dest_team));
+}
+
+int ishmem4py_team_sync(int team)
+{
+    return ishmem_team_sync(static_cast<ishmem_team_t>(team));
+}
+
+int ishmem4py_team_split_strided(int parent_team, int start, int stride, int size,
+                                 const ishmem4py_team_config_t *config, long config_mask,
+                                 int *new_team)
+{
+    ishmem_team_t result_team = ISHMEM_TEAM_INVALID;
+    ishmem_team_config_t tmp_config{};
+    const ishmem_team_config_t *converted_config =
+        ishmem4py_convert_team_config(config, &tmp_config);
+    int result = ishmem_team_split_strided(static_cast<ishmem_team_t>(parent_team), start, stride,
+                                           size, converted_config, config_mask, &result_team);
+    if (new_team != nullptr) *new_team = result_team;
+    return result;
+}
+
+int ishmem4py_team_split_2d(int parent_team, int xrange,
+                            const ishmem4py_team_config_t *xaxis_config, long xaxis_mask,
+                            int *xaxis_team, const ishmem4py_team_config_t *yaxis_config,
+                            long yaxis_mask, int *yaxis_team)
+{
+    ishmem_team_t xaxis_result = ISHMEM_TEAM_INVALID;
+    ishmem_team_t yaxis_result = ISHMEM_TEAM_INVALID;
+    ishmem_team_config_t xaxis_tmp{};
+    ishmem_team_config_t yaxis_tmp{};
+    const ishmem_team_config_t *converted_xaxis =
+        ishmem4py_convert_team_config(xaxis_config, &xaxis_tmp);
+    const ishmem_team_config_t *converted_yaxis =
+        ishmem4py_convert_team_config(yaxis_config, &yaxis_tmp);
+
+    int result = ishmem_team_split_2d(static_cast<ishmem_team_t>(parent_team), xrange,
+                                      converted_xaxis, xaxis_mask, &xaxis_result, converted_yaxis,
+                                      yaxis_mask, &yaxis_result);
+    if (xaxis_team != nullptr) *xaxis_team = xaxis_result;
+    if (yaxis_team != nullptr) *yaxis_team = yaxis_result;
+    return result;
+}
+
+void ishmem4py_team_destroy(int team)
+{
+    ishmem_team_destroy(static_cast<ishmem_team_t>(team));
+}
+
+int ishmem4py_broadcastmem(int team, void *dest, const void *src, size_t nbytes, int root)
+{
+    return ishmem_broadcastmem(static_cast<ishmem_team_t>(team), dest, src, nbytes, root);
+}
+
+int ishmem4py_collectmem(int team, void *dest, const void *src, size_t nbytes)
+{
+    return ishmem_collectmem(static_cast<ishmem_team_t>(team), dest, src, nbytes);
+}
+
+int ishmem4py_fcollectmem(int team, void *dest, const void *src, size_t nbytes)
+{
+    return ishmem_fcollectmem(static_cast<ishmem_team_t>(team), dest, src, nbytes);
+}
+
+int ishmem4py_alltoallmem(int team, void *dest, const void *src, size_t nbytes)
+{
+    return ishmem_alltoallmem(static_cast<ishmem_team_t>(team), dest, src, nbytes);
+}
+
+int ishmem4py_reduce(int op, int dtype, int team, void *dest, const void *src, size_t count)
+{
+    switch (op) {
+        case ISHMEM4PY_REDUCE_SUM:
+            switch (dtype) {
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, sum);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, sum);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, sum);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, sum);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT32, float, float, sum);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT64, double, double, sum);
+            }
+            break;
+        case ISHMEM4PY_REDUCE_PROD:
+            switch (dtype) {
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, prod);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, prod);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, prod);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, prod);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT32, float, float, prod);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT64, double, double, prod);
+            }
+            break;
+        case ISHMEM4PY_REDUCE_AND:
+            switch (dtype) {
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, and);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, and);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, and);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, and);
+            }
+            break;
+        case ISHMEM4PY_REDUCE_OR:
+            switch (dtype) {
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, or);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, or);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, or);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, or);
+            }
+            break;
+        case ISHMEM4PY_REDUCE_XOR:
+            switch (dtype) {
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, xor);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, xor);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, xor);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, xor);
+            }
+            break;
+        case ISHMEM4PY_REDUCE_MIN:
+            switch (dtype) {
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, min);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, min);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, min);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, min);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT32, float, float, min);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT64, double, double, min);
+            }
+            break;
+        case ISHMEM4PY_REDUCE_MAX:
+            switch (dtype) {
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, max);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, max);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, max);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, max);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT32, float, float, max);
+                ISHMEM4PY_REDUCE_CASE(ISHMEM4PY_DTYPE_FLOAT64, double, double, max);
+            }
+            break;
+    }
+    return -1;
+}
+
+uint64_t ishmem4py_atomic_fetch(int dtype, void *source, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_FETCH_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32);
+        ISHMEM4PY_ATOMIC_FETCH_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64);
+        ISHMEM4PY_ATOMIC_FETCH_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32);
+        ISHMEM4PY_ATOMIC_FETCH_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64);
+        ISHMEM4PY_ATOMIC_FETCH_CASE(ISHMEM4PY_DTYPE_FLOAT32, float, float);
+        ISHMEM4PY_ATOMIC_FETCH_CASE(ISHMEM4PY_DTYPE_FLOAT64, double, double);
+    }
+    return 0;
+}
+
+void ishmem4py_atomic_set(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_SET_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32);
+        ISHMEM4PY_ATOMIC_SET_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64);
+        ISHMEM4PY_ATOMIC_SET_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32);
+        ISHMEM4PY_ATOMIC_SET_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64);
+        ISHMEM4PY_ATOMIC_SET_CASE(ISHMEM4PY_DTYPE_FLOAT32, float, float);
+        ISHMEM4PY_ATOMIC_SET_CASE(ISHMEM4PY_DTYPE_FLOAT64, double, double);
+    }
+}
+
+uint64_t ishmem4py_atomic_swap(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_SWAP_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32);
+        ISHMEM4PY_ATOMIC_SWAP_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64);
+        ISHMEM4PY_ATOMIC_SWAP_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32);
+        ISHMEM4PY_ATOMIC_SWAP_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64);
+        ISHMEM4PY_ATOMIC_SWAP_CASE(ISHMEM4PY_DTYPE_FLOAT32, float, float);
+        ISHMEM4PY_ATOMIC_SWAP_CASE(ISHMEM4PY_DTYPE_FLOAT64, double, double);
+    }
+    return 0;
+}
+
+uint64_t ishmem4py_atomic_compare_swap(int dtype, void *dest, uint64_t cond_bits,
+                                       uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_COMPARE_SWAP_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32);
+        ISHMEM4PY_ATOMIC_COMPARE_SWAP_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64);
+        ISHMEM4PY_ATOMIC_COMPARE_SWAP_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32);
+        ISHMEM4PY_ATOMIC_COMPARE_SWAP_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64);
+    }
+    return 0;
+}
+
+uint64_t ishmem4py_atomic_fetch_inc(int dtype, void *dest, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_FETCH_INC_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32);
+        ISHMEM4PY_ATOMIC_FETCH_INC_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64);
+        ISHMEM4PY_ATOMIC_FETCH_INC_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32);
+        ISHMEM4PY_ATOMIC_FETCH_INC_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64);
+    }
+    return 0;
+}
+
+void ishmem4py_atomic_inc(int dtype, void *dest, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_INC_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32);
+        ISHMEM4PY_ATOMIC_INC_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64);
+        ISHMEM4PY_ATOMIC_INC_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32);
+        ISHMEM4PY_ATOMIC_INC_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64);
+    }
+}
+
+uint64_t ishmem4py_atomic_fetch_add(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, add);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, add);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, add);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, add);
+    }
+    return 0;
+}
+
+void ishmem4py_atomic_add(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, add);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, add);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, add);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, add);
+    }
+}
+
+uint64_t ishmem4py_atomic_fetch_and(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, and);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, and);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, and);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, and);
+    }
+    return 0;
+}
+
+void ishmem4py_atomic_and(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, and);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, and);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, and);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, and);
+    }
+}
+
+uint64_t ishmem4py_atomic_fetch_or(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, or);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, or);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, or);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, or);
+    }
+    return 0;
+}
+
+void ishmem4py_atomic_or(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, or);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, or);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, or);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, or);
+    }
+}
+
+uint64_t ishmem4py_atomic_fetch_xor(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, xor);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, xor);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, xor);
+        ISHMEM4PY_ATOMIC_FETCH_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, xor);
+    }
+    return 0;
+}
+
+void ishmem4py_atomic_xor(int dtype, void *dest, uint64_t value_bits, int pe)
+{
+    switch (dtype) {
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT32, int32_t, int32, xor);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_INT64, int64_t, int64, xor);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT32, uint32_t, uint32, xor);
+        ISHMEM4PY_ATOMIC_BINARY_CASE(ISHMEM4PY_DTYPE_UINT64, uint64_t, uint64, xor);
+    }
+}
+
 uintptr_t ishmem4py_cast_ptr_to_uintptr(const void *ptr)
 {
     return reinterpret_cast<uintptr_t>(ptr);
diff --git a/ishmem4py/csrc/ishmem4py_runtime.h b/ishmem4py/csrc/ishmem4py_runtime.h
index 197d7fa..332fb9c 100644
--- a/ishmem4py/csrc/ishmem4py_runtime.h
+++ b/ishmem4py/csrc/ishmem4py_runtime.h
@@ -18,11 +18,36 @@
 extern "C" {
 #endif
 
+typedef enum ishmem4py_dtype_t {
+    ISHMEM4PY_DTYPE_INT32 = 0,
+    ISHMEM4PY_DTYPE_INT64 = 1,
+    ISHMEM4PY_DTYPE_UINT32 = 2,
+    ISHMEM4PY_DTYPE_UINT64 = 3,
+    ISHMEM4PY_DTYPE_FLOAT32 = 4,
+    ISHMEM4PY_DTYPE_FLOAT64 = 5,
+} ishmem4py_dtype_t;
+
+typedef enum ishmem4py_reduce_op_t {
+    ISHMEM4PY_REDUCE_SUM = 0,
+    ISHMEM4PY_REDUCE_PROD = 1,
+    ISHMEM4PY_REDUCE_AND = 2,
+    ISHMEM4PY_REDUCE_OR = 3,
+    ISHMEM4PY_REDUCE_XOR = 4,
+    ISHMEM4PY_REDUCE_MIN = 5,
+    ISHMEM4PY_REDUCE_MAX = 6,
+} ishmem4py_reduce_op_t;
+
+typedef struct ishmem4py_team_config_t {
+    int num_contexts;
+} ishmem4py_team_config_t;
+
 ISHMEM4PY_EXPORT void ishmem4py_init(void);
 ISHMEM4PY_EXPORT void ishmem4py_finalize(void);
 ISHMEM4PY_EXPORT int ishmem4py_my_pe(void);
 ISHMEM4PY_EXPORT int ishmem4py_n_pes(void);
 ISHMEM4PY_EXPORT void ishmem4py_info_get_version(int *major, int *minor);
+ISHMEM4PY_EXPORT void ishmem4py_info_get_name(char *name);
+ISHMEM4PY_EXPORT void ishmem4py_vendor_get_version(int *major, int *minor, int *patch);
 ISHMEM4PY_EXPORT void ishmem4py_barrier_all(void);
 ISHMEM4PY_EXPORT void ishmem4py_sync_all(void);
 ISHMEM4PY_EXPORT void ishmem4py_fence(void);
@@ -32,6 +57,46 @@ ISHMEM4PY_EXPORT void *ishmem4py_calloc(size_t count, size_t size);
 ISHMEM4PY_EXPORT void ishmem4py_free(void *ptr);
 ISHMEM4PY_EXPORT void ishmem4py_putmem(void *dest, const void *src, size_t nbytes, int pe);
 ISHMEM4PY_EXPORT void ishmem4py_getmem(void *dest, const void *src, size_t nbytes, int pe);
+ISHMEM4PY_EXPORT void *ishmem4py_ptr(const void *dest, int pe);
+ISHMEM4PY_EXPORT int ishmem4py_team_my_pe(int team);
+ISHMEM4PY_EXPORT int ishmem4py_team_n_pes(int team);
+ISHMEM4PY_EXPORT int ishmem4py_team_translate_pe(int src_team, int src_pe, int dest_team);
+ISHMEM4PY_EXPORT int ishmem4py_team_sync(int team);
+ISHMEM4PY_EXPORT int ishmem4py_team_split_strided(int parent_team, int start, int stride, int size,
+                                                  const ishmem4py_team_config_t *config,
+                                                  long config_mask, int *new_team);
+ISHMEM4PY_EXPORT int ishmem4py_team_split_2d(int parent_team, int xrange,
+                                             const ishmem4py_team_config_t *xaxis_config,
+                                             long xaxis_mask, int *xaxis_team,
+                                             const ishmem4py_team_config_t *yaxis_config,
+                                             long yaxis_mask, int *yaxis_team);
+ISHMEM4PY_EXPORT void ishmem4py_team_destroy(int team);
+ISHMEM4PY_EXPORT int ishmem4py_broadcastmem(int team, void *dest, const void *src, size_t nbytes,
+                                            int root);
+ISHMEM4PY_EXPORT int ishmem4py_collectmem(int team, void *dest, const void *src, size_t nbytes);
+ISHMEM4PY_EXPORT int ishmem4py_fcollectmem(int team, void *dest, const void *src, size_t nbytes);
+ISHMEM4PY_EXPORT int ishmem4py_alltoallmem(int team, void *dest, const void *src, size_t nbytes);
+ISHMEM4PY_EXPORT int ishmem4py_reduce(int op, int dtype, int team, void *dest, const void *src,
+                                      size_t count);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_fetch(int dtype, void *source, int pe);
+ISHMEM4PY_EXPORT void ishmem4py_atomic_set(int dtype, void *dest, uint64_t value_bits, int pe);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_swap(int dtype, void *dest, uint64_t value_bits, int pe);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_compare_swap(int dtype, void *dest, uint64_t cond_bits,
+                                                        uint64_t value_bits, int pe);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_fetch_inc(int dtype, void *dest, int pe);
+ISHMEM4PY_EXPORT void ishmem4py_atomic_inc(int dtype, void *dest, int pe);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_fetch_add(int dtype, void *dest, uint64_t value_bits,
+                                                     int pe);
+ISHMEM4PY_EXPORT void ishmem4py_atomic_add(int dtype, void *dest, uint64_t value_bits, int pe);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_fetch_and(int dtype, void *dest, uint64_t value_bits,
+                                                     int pe);
+ISHMEM4PY_EXPORT void ishmem4py_atomic_and(int dtype, void *dest, uint64_t value_bits, int pe);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_fetch_or(int dtype, void *dest, uint64_t value_bits,
+                                                    int pe);
+ISHMEM4PY_EXPORT void ishmem4py_atomic_or(int dtype, void *dest, uint64_t value_bits, int pe);
+ISHMEM4PY_EXPORT uint64_t ishmem4py_atomic_fetch_xor(int dtype, void *dest, uint64_t value_bits,
+                                                     int pe);
+ISHMEM4PY_EXPORT void ishmem4py_atomic_xor(int dtype, void *dest, uint64_t value_bits, int pe);
 ISHMEM4PY_EXPORT uintptr_t ishmem4py_cast_ptr_to_uintptr(const void *ptr);
 ISHMEM4PY_EXPORT void *ishmem4py_cast_uintptr_to_ptr(uintptr_t value);
 
diff --git a/ishmem4py/python/ishmem4py/__init__.py b/ishmem4py/python/ishmem4py/__init__.py
index facce7c..be5d295 100644
--- a/ishmem4py/python/ishmem4py/__init__.py
+++ b/ishmem4py/python/ishmem4py/__init__.py
@@ -1,46 +1,10 @@
 # Copyright (C) 2026 Intel Corporation
 # SPDX-License-Identifier: BSD-3-Clause
 
-from .core import (
-    IshmemError,
-    IshmemStateError,
-    SymmetricMemory,
-    barrier_all,
-    calloc,
-    fence,
-    finalize,
-    free,
-    getmem,
-    info_get_version,
-    init,
-    is_initialized,
-    malloc,
-    my_pe,
-    n_pes,
-    putmem,
-    quiet,
-    sync_all,
-)
+"""Public top-level imports for the Intel SHMEM Python bindings."""
 
-__all__ = [
-    "IshmemError",
-    "IshmemStateError",
-    "SymmetricMemory",
-    "barrier_all",
-    "calloc",
-    "fence",
-    "finalize",
-    "free",
-    "getmem",
-    "info_get_version",
-    "init",
-    "is_initialized",
-    "malloc",
-    "my_pe",
-    "n_pes",
-    "putmem",
-    "quiet",
-    "sync_all",
-]
+from . import core
+from .core import *
+from .version import __version__
 
-__version__ = "0.1.0a0"
+__all__ = list(core.__all__) + ["__version__"]
diff --git a/ishmem4py/python/ishmem4py/_common.py b/ishmem4py/python/ishmem4py/_common.py
new file mode 100644
index 0000000..64efca4
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/_common.py
@@ -0,0 +1,450 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Shared types and validation helpers for the public ``ishmem4py`` API."""
+
+from __future__ import annotations
+
+import ctypes
+import sys
+from dataclasses import dataclass
+from enum import IntEnum
+from typing import Optional
+
+
+class IshmemError(RuntimeError):
+    """Base exception for ``ishmem4py`` failures."""
+
+    pass
+
+
+class IshmemStateError(IshmemError):
+    """Raised when Intel SHMEM is used in an invalid process state."""
+
+    pass
+
+
+class Comparison(IntEnum):
+    """Comparison operators for wait/test style APIs."""
+
+    EQ = 1
+    NE = 2
+    GT = 3
+    GE = 4
+    LT = 5
+    LE = 6
+
+
+class SignalOp(IntEnum):
+    """Signal operations retained for API compatibility."""
+
+    SET = 0
+    ADD = 1
+
+
+class InitStatus(IntEnum):
+    """Process-local Intel SHMEM initialization state."""
+
+    UNINITIALIZED = 0
+    INITIALIZED = 1
+    FINALIZED = 2
+
+
+@dataclass(frozen=True)
+class Version:
+    """Combined OpenSHMEM, Intel SHMEM, and Python binding version metadata."""
+
+    openshmem_spec_version: str
+    ishmem4py_version: str
+    libishmem_version: str
+    vendor_name: str
+
+
+@dataclass(frozen=True)
+class Team:
+    """Python wrapper for an Intel SHMEM team handle."""
+
+    handle: int
+    name: Optional[str] = None
+
+    def __int__(self) -> int:
+        return self.handle
+
+    @property
+    def valid(self) -> bool:
+        return self.handle != -1
+
+    def __repr__(self) -> str:
+        label = self.name if self.name is not None else f"TEAM_{self.handle}"
+        return f"Team(handle={self.handle}, name={label!r})"
+
+
+@dataclass(frozen=True)
+class TeamConfig:
+    """Subset of ``ishmem_team_config_t`` exposed by the MVP bindings."""
+
+    num_contexts: int = 0
+
+
+@dataclass(frozen=True)
+class MemoryPointer:
+    """A PE-relative address returned by ``ptr`` or ``ishmem_ptr``."""
+
+    address: int
+    size: int
+    pe: int
+    base: "SymmetricMemory"
+
+    def __int__(self) -> int:
+        return self.address
+
+    def __bool__(self) -> bool:
+        return self.address != 0
+
+
+@dataclass(frozen=True)
+class _DTypeInfo:
+    name: str
+    code: int
+    ctype: type[ctypes._SimpleCData]
+    itemsize: int
+
+
+_DTYPES = {
+    "int32": _DTypeInfo("int32", 0, ctypes.c_int32, 4),
+    "int64": _DTypeInfo("int64", 1, ctypes.c_int64, 8),
+    "uint32": _DTypeInfo("uint32", 2, ctypes.c_uint32, 4),
+    "uint64": _DTypeInfo("uint64", 3, ctypes.c_uint64, 8),
+    "float32": _DTypeInfo("float32", 4, ctypes.c_float, 4),
+    "float64": _DTypeInfo("float64", 5, ctypes.c_double, 8),
+}
+
+_DTYPE_ALIASES = {
+    "i": "int32",
+    "int": "int32",
+    "int32": "int32",
+    "l": "int64",
+    "q": "int64",
+    "long": "int64",
+    "int64": "int64",
+    "i4": "int32",
+    "i8": "int64",
+    "u": "uint32",
+    "I": "uint32",
+    "uint": "uint32",
+    "uint32": "uint32",
+    "Q": "uint64",
+    "L": "uint64",
+    "ulong": "uint64",
+    "uint64": "uint64",
+    "u4": "uint32",
+    "u8": "uint64",
+    "f": "float32",
+    "float": "float32",
+    "float32": "float32",
+    "f4": "float32",
+    "d": "float64",
+    "double": "float64",
+    "float64": "float64",
+    "f8": "float64",
+}
+
+_CTYPE_ALIASES = {
+    ctypes.c_int32: "int32",
+    ctypes.c_int64: "int64",
+    ctypes.c_uint32: "uint32",
+    ctypes.c_uint64: "uint64",
+    ctypes.c_float: "float32",
+    ctypes.c_double: "float64",
+}
+
+_ATOMIC_FETCH_DTYPES = frozenset(_DTYPES)
+_ATOMIC_STANDARD_DTYPES = frozenset({"int32", "int64", "uint32", "uint64"})
+_ATOMIC_BITWISE_DTYPES = frozenset({"int32", "int64", "uint32", "uint64"})
+_REDUCTION_DTYPES = {
+    "sum": frozenset(_DTYPES),
+    "prod": frozenset(_DTYPES),
+    "and": frozenset({"int32", "int64", "uint32", "uint64"}),
+    "or": frozenset({"int32", "int64", "uint32", "uint64"}),
+    "xor": frozenset({"int32", "int64", "uint32", "uint64"}),
+    "min": frozenset(_DTYPES),
+    "max": frozenset(_DTYPES),
+}
+_REDUCTION_CODES = {
+    "sum": 0,
+    "prod": 1,
+    "and": 2,
+    "or": 3,
+    "xor": 4,
+    "min": 5,
+    "max": 6,
+}
+
+
+class SymmetricMemory:
+    """An opaque handle to a symmetric heap allocation."""
+
+    def __init__(self, ptr: int, size: int):
+        self._ptr = int(ptr)
+        self._size = int(size)
+        self._freed = False
+
+    @property
+    def ptr(self) -> int:
+        return self._ptr
+
+    @property
+    def size(self) -> int:
+        return self._size
+
+    @property
+    def freed(self) -> bool:
+        return self._freed
+
+    def _mark_freed(self) -> None:
+        self._freed = True
+
+    def read(self, size: Optional[int] = None, *, offset: int = 0, pe: Optional[int] = None) -> bytes:
+        """Fetch bytes from this symmetric allocation into a new ``bytes`` object."""
+        from .rma import get
+        from .init_fini import my_pe
+
+        if pe is None:
+            pe = my_pe()
+        length = _normalize_span(self, size=size, offset=offset)
+        dst = bytearray(length)
+        get(dst, self, pe=pe, size=length, dest_offset=0, src_offset=offset)
+        return bytes(dst)
+
+    def write(self, data, *, offset: int = 0, pe: Optional[int] = None) -> int:
+        """Store bytes from a Python buffer into this symmetric allocation."""
+        from .rma import put
+        from .init_fini import my_pe
+
+        if pe is None:
+            pe = my_pe()
+        return put(self, data, pe=pe, dest_offset=offset)
+
+    def __int__(self) -> int:
+        return self._ptr
+
+    def __repr__(self) -> str:
+        state = "freed" if self._freed else "active"
+        return f"SymmetricMemory(ptr=0x{self._ptr:x}, size={self._size}, state={state})"
+
+
+TEAM_INVALID = Team(-1, "TEAM_INVALID")
+TEAM_WORLD = Team(0, "TEAM_WORLD")
+TEAM_SHARED = Team(1, "TEAM_SHARED")
+
+
+_initialized = False
+_finalized = False
+_live_allocations: dict[int, SymmetricMemory] = {}
+
+
+def _register_allocation(symm: SymmetricMemory) -> None:
+    _live_allocations[id(symm)] = symm
+
+
+def _unregister_allocation(symm: SymmetricMemory) -> None:
+    _live_allocations.pop(id(symm), None)
+
+
+def _require_initialized() -> None:
+    if not _initialized:
+        raise IshmemStateError("Intel SHMEM is not initialized")
+
+
+def _set_initialized() -> None:
+    global _initialized
+    _initialized = True
+
+
+def _set_finalized() -> None:
+    global _initialized, _finalized
+    _initialized = False
+    _finalized = True
+
+
+def _get_init_status() -> InitStatus:
+    if _initialized:
+        return InitStatus.INITIALIZED
+    if _finalized:
+        return InitStatus.FINALIZED
+    return InitStatus.UNINITIALIZED
+
+
+def _check_can_init() -> None:
+    if _initialized:
+        raise IshmemStateError("Intel SHMEM is already initialized")
+    if _finalized:
+        raise IshmemStateError(
+            "Intel SHMEM has already been finalized in this process and cannot be reinitialized"
+        )
+
+
+def _check_can_finalize() -> None:
+    if not _initialized:
+        raise IshmemStateError("Intel SHMEM is not initialized")
+    if _live_allocations:
+        leaked = ", ".join(
+            f"0x{symm.ptr:x}" for symm in sorted(_live_allocations.values(), key=lambda item: item.ptr)
+        )
+        raise IshmemStateError(
+            "cannot finalize while symmetric allocations are still live; free them first: " f"{leaked}"
+        )
+
+
+def _require_active(symm: SymmetricMemory) -> None:
+    if not isinstance(symm, SymmetricMemory):
+        raise TypeError("expected a SymmetricMemory instance")
+    if symm.freed:
+        raise IshmemStateError("symmetric allocation has already been freed")
+
+
+def _normalize_span(obj, *, size: Optional[int], offset: int) -> int:
+    if offset < 0:
+        raise ValueError("offset must be >= 0")
+
+    available = obj.size - offset
+    if available < 0:
+        raise ValueError("offset is past the end of the object")
+
+    if size is None:
+        return available
+    if size < 0:
+        raise ValueError("size must be >= 0")
+    if size > available:
+        raise ValueError("requested size extends past the end of the object")
+    return size
+
+
+def _ensure_contiguous_memoryview(obj, *, writable: bool):
+    mv = memoryview(obj)
+    if not mv.contiguous:
+        raise ValueError("buffer must be contiguous")
+    if writable and mv.readonly:
+        raise ValueError("destination buffer must be writable")
+    return mv
+
+
+@dataclass(frozen=True)
+class _PointerInfo:
+    ptr: int
+    size: int
+    keepalive: object
+
+
+def _pointer_from_local_buffer(obj, *, writable: bool, offset: int = 0) -> _PointerInfo:
+    if isinstance(obj, SymmetricMemory):
+        _require_active(obj)
+        size = _normalize_span(obj, size=None, offset=offset)
+        return _PointerInfo(ptr=obj.ptr + offset, size=size, keepalive=obj)
+
+    mv = _ensure_contiguous_memoryview(obj, writable=writable)
+
+    if offset < 0:
+        raise ValueError("offset must be >= 0")
+    if offset > mv.nbytes:
+        raise ValueError("offset is past the end of the buffer")
+
+    if mv.readonly:
+        if writable:
+            raise ValueError("destination buffer must be writable")
+        copied = (ctypes.c_char * mv.nbytes).from_buffer_copy(mv)
+        return _PointerInfo(ptr=ctypes.addressof(copied) + offset, size=mv.nbytes - offset, keepalive=copied)
+
+    raw = (ctypes.c_char * mv.nbytes).from_buffer(mv)
+    return _PointerInfo(ptr=ctypes.addressof(raw) + offset, size=mv.nbytes - offset, keepalive=(mv, raw))
+
+
+def _symmetric_target(obj) -> SymmetricMemory:
+    _require_active(obj)
+    return obj
+
+
+def _check_status(status: int | None, opname: str) -> None:
+    if status not in (0, None):
+        raise IshmemError(f"{opname} failed with status {status}")
+
+
+def _normalize_team(team: Team | int | None) -> int:
+    if team is None:
+        return TEAM_WORLD.handle
+    if isinstance(team, Team):
+        return team.handle
+    return int(team)
+
+
+def _team_from_handle(handle: int, *, name: Optional[str] = None) -> Team | None:
+    if int(handle) == TEAM_INVALID.handle:
+        return None
+    return Team(int(handle), name)
+
+
+def _normalize_dtype(dtype) -> _DTypeInfo:
+    if isinstance(dtype, _DTypeInfo):
+        return dtype
+    if dtype in _CTYPE_ALIASES:
+        return _DTYPES[_CTYPE_ALIASES[dtype]]
+
+    if isinstance(dtype, str):
+        alias = dtype.strip()
+        if alias.startswith("="):
+            alias = alias[1:]
+        normalized = _DTYPE_ALIASES.get(alias.lower(), _DTYPE_ALIASES.get(alias))
+        if normalized is not None:
+            return _DTYPES[normalized]
+
+    raise TypeError("unsupported dtype; use one of int32, int64, uint32, uint64, float32, float64")
+
+
+def _require_atomic_dtype(dtype, allowed: frozenset[str], opname: str) -> _DTypeInfo:
+    info = _normalize_dtype(dtype)
+    if info.name not in allowed:
+        supported = ", ".join(sorted(allowed))
+        raise TypeError(f"{opname} does not support dtype {info.name}; supported dtypes: {supported}")
+    return info
+
+
+def _require_reduction_dtype(op: str, dtype) -> _DTypeInfo:
+    normalized_op = op.strip().lower()
+    if normalized_op not in _REDUCTION_CODES:
+        supported = ", ".join(sorted(_REDUCTION_CODES))
+        raise ValueError(f"unsupported reduction op {op!r}; supported ops: {supported}")
+
+    info = _normalize_dtype(dtype)
+    allowed = _REDUCTION_DTYPES[normalized_op]
+    if info.name not in allowed:
+        supported = ", ".join(sorted(allowed))
+        raise TypeError(f"reduction {normalized_op!r} does not support dtype {info.name}; supported dtypes: {supported}")
+    return info
+
+
+def _normalize_reduction_op(op: str) -> int:
+    return _REDUCTION_CODES[op.strip().lower()]
+
+
+def _dtype_element_count(src: SymmetricMemory, dest: SymmetricMemory, dtype, count: Optional[int]) -> tuple[_DTypeInfo, int]:
+    info = _normalize_dtype(dtype)
+    available = min(src.size, dest.size) // info.itemsize
+    if count is None:
+        return info, available
+    if count < 0:
+        raise ValueError("count must be >= 0")
+    if count > available:
+        raise ValueError("count exceeds the available symmetric storage")
+    return info, count
+
+
+def _scalar_to_bits(value, dtype_info: _DTypeInfo) -> int:
+    c_value = dtype_info.ctype(value)
+    raw = ctypes.string_at(ctypes.byref(c_value), dtype_info.itemsize)
+    return int.from_bytes(raw, sys.byteorder, signed=False)
+
+
+def _bits_to_scalar(bits: int, dtype_info: _DTypeInfo):
+    raw = int(bits).to_bytes(dtype_info.itemsize, sys.byteorder, signed=False)
+    value = dtype_info.ctype.from_buffer_copy(raw)
+    return value.value
diff --git a/ishmem4py/python/ishmem4py/_lib.py b/ishmem4py/python/ishmem4py/_lib.py
index e87d034..0b81a93 100644
--- a/ishmem4py/python/ishmem4py/_lib.py
+++ b/ishmem4py/python/ishmem4py/_lib.py
@@ -1,6 +1,8 @@
 # Copyright (C) 2026 Intel Corporation
 # SPDX-License-Identifier: BSD-3-Clause
 
+from __future__ import annotations
+
 import ctypes
 import os
 from pathlib import Path
@@ -17,73 +19,230 @@ def _candidate_paths():
     yield package_dir / "_ishmem4py_runtime.so"
 
 
-def _load_runtime():
-    last_error = None
-    for candidate in _candidate_paths():
-        if not candidate.exists():
-            continue
-        try:
-            runtime = ctypes.CDLL(str(candidate))
-        except OSError as exc:
-            last_error = exc
-            continue
+def _configure_runtime(runtime):
+    runtime.ishmem4py_init.argtypes = []
+    runtime.ishmem4py_init.restype = None
+
+    runtime.ishmem4py_finalize.argtypes = []
+    runtime.ishmem4py_finalize.restype = None
+
+    runtime.ishmem4py_my_pe.argtypes = []
+    runtime.ishmem4py_my_pe.restype = ctypes.c_int
+
+    runtime.ishmem4py_n_pes.argtypes = []
+    runtime.ishmem4py_n_pes.restype = ctypes.c_int
+
+    runtime.ishmem4py_info_get_version.argtypes = [
+        ctypes.POINTER(ctypes.c_int),
+        ctypes.POINTER(ctypes.c_int),
+    ]
+    runtime.ishmem4py_info_get_version.restype = None
+
+    runtime.ishmem4py_info_get_name.argtypes = [ctypes.c_void_p]
+    runtime.ishmem4py_info_get_name.restype = None
+
+    runtime.ishmem4py_vendor_get_version.argtypes = [
+        ctypes.POINTER(ctypes.c_int),
+        ctypes.POINTER(ctypes.c_int),
+        ctypes.POINTER(ctypes.c_int),
+    ]
+    runtime.ishmem4py_vendor_get_version.restype = None
+
+    runtime.ishmem4py_barrier_all.argtypes = []
+    runtime.ishmem4py_barrier_all.restype = None
+
+    runtime.ishmem4py_sync_all.argtypes = []
+    runtime.ishmem4py_sync_all.restype = None
+
+    runtime.ishmem4py_fence.argtypes = []
+    runtime.ishmem4py_fence.restype = None
+
+    runtime.ishmem4py_quiet.argtypes = []
+    runtime.ishmem4py_quiet.restype = None
+
+    runtime.ishmem4py_malloc.argtypes = [ctypes.c_size_t]
+    runtime.ishmem4py_malloc.restype = ctypes.c_void_p
+
+    runtime.ishmem4py_calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
+    runtime.ishmem4py_calloc.restype = ctypes.c_void_p
+
+    runtime.ishmem4py_free.argtypes = [ctypes.c_void_p]
+    runtime.ishmem4py_free.restype = None
+
+    runtime.ishmem4py_putmem.argtypes = [
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+        ctypes.c_int,
+    ]
+    runtime.ishmem4py_putmem.restype = None
+
+    runtime.ishmem4py_getmem.argtypes = [
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+        ctypes.c_int,
+    ]
+    runtime.ishmem4py_getmem.restype = None
 
-        runtime.ishmem4py_init.argtypes = []
-        runtime.ishmem4py_init.restype = None
+    runtime.ishmem4py_ptr.argtypes = [ctypes.c_void_p, ctypes.c_int]
+    runtime.ishmem4py_ptr.restype = ctypes.c_void_p
 
-        runtime.ishmem4py_finalize.argtypes = []
-        runtime.ishmem4py_finalize.restype = None
+    runtime.ishmem4py_team_my_pe.argtypes = [ctypes.c_int]
+    runtime.ishmem4py_team_my_pe.restype = ctypes.c_int
 
-        runtime.ishmem4py_my_pe.argtypes = []
-        runtime.ishmem4py_my_pe.restype = ctypes.c_int
+    runtime.ishmem4py_team_n_pes.argtypes = [ctypes.c_int]
+    runtime.ishmem4py_team_n_pes.restype = ctypes.c_int
 
-        runtime.ishmem4py_n_pes.argtypes = []
-        runtime.ishmem4py_n_pes.restype = ctypes.c_int
+    runtime.ishmem4py_team_translate_pe.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int]
+    runtime.ishmem4py_team_translate_pe.restype = ctypes.c_int
 
-        runtime.ishmem4py_info_get_version.argtypes = [
-            ctypes.POINTER(ctypes.c_int),
-            ctypes.POINTER(ctypes.c_int),
-        ]
-        runtime.ishmem4py_info_get_version.restype = None
+    runtime.ishmem4py_team_sync.argtypes = [ctypes.c_int]
+    runtime.ishmem4py_team_sync.restype = ctypes.c_int
 
-        runtime.ishmem4py_barrier_all.argtypes = []
-        runtime.ishmem4py_barrier_all.restype = None
+    runtime.ishmem4py_team_split_strided.argtypes = [
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_long,
+        ctypes.POINTER(ctypes.c_int),
+    ]
+    runtime.ishmem4py_team_split_strided.restype = ctypes.c_int
 
-        runtime.ishmem4py_sync_all.argtypes = []
-        runtime.ishmem4py_sync_all.restype = None
+    runtime.ishmem4py_team_split_2d.argtypes = [
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_long,
+        ctypes.POINTER(ctypes.c_int),
+        ctypes.c_void_p,
+        ctypes.c_long,
+        ctypes.POINTER(ctypes.c_int),
+    ]
+    runtime.ishmem4py_team_split_2d.restype = ctypes.c_int
 
-        runtime.ishmem4py_fence.argtypes = []
-        runtime.ishmem4py_fence.restype = None
+    runtime.ishmem4py_team_destroy.argtypes = [ctypes.c_int]
+    runtime.ishmem4py_team_destroy.restype = None
 
-        runtime.ishmem4py_quiet.argtypes = []
-        runtime.ishmem4py_quiet.restype = None
+    runtime.ishmem4py_broadcastmem.argtypes = [
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+        ctypes.c_int,
+    ]
+    runtime.ishmem4py_broadcastmem.restype = ctypes.c_int
 
-        runtime.ishmem4py_malloc.argtypes = [ctypes.c_size_t]
-        runtime.ishmem4py_malloc.restype = ctypes.c_void_p
+    runtime.ishmem4py_collectmem.argtypes = [
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+    ]
+    runtime.ishmem4py_collectmem.restype = ctypes.c_int
 
-        runtime.ishmem4py_calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
-        runtime.ishmem4py_calloc.restype = ctypes.c_void_p
+    runtime.ishmem4py_fcollectmem.argtypes = [
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+    ]
+    runtime.ishmem4py_fcollectmem.restype = ctypes.c_int
 
-        runtime.ishmem4py_free.argtypes = [ctypes.c_void_p]
-        runtime.ishmem4py_free.restype = None
+    runtime.ishmem4py_alltoallmem.argtypes = [
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+    ]
+    runtime.ishmem4py_alltoallmem.restype = ctypes.c_int
 
-        runtime.ishmem4py_putmem.argtypes = [
-            ctypes.c_void_p,
-            ctypes.c_void_p,
-            ctypes.c_size_t,
-            ctypes.c_int,
-        ]
-        runtime.ishmem4py_putmem.restype = None
+    runtime.ishmem4py_reduce.argtypes = [
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+    ]
+    runtime.ishmem4py_reduce.restype = ctypes.c_int
 
-        runtime.ishmem4py_getmem.argtypes = [
-            ctypes.c_void_p,
-            ctypes.c_void_p,
-            ctypes.c_size_t,
-            ctypes.c_int,
-        ]
-        runtime.ishmem4py_getmem.restype = None
+    runtime.ishmem4py_atomic_fetch.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
+    runtime.ishmem4py_atomic_fetch.restype = ctypes.c_uint64
 
-        return runtime
+    runtime.ishmem4py_atomic_set.argtypes = [
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_uint64,
+        ctypes.c_int,
+    ]
+    runtime.ishmem4py_atomic_set.restype = None
+
+    runtime.ishmem4py_atomic_swap.argtypes = [
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_uint64,
+        ctypes.c_int,
+    ]
+    runtime.ishmem4py_atomic_swap.restype = ctypes.c_uint64
+
+    runtime.ishmem4py_atomic_compare_swap.argtypes = [
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_uint64,
+        ctypes.c_uint64,
+        ctypes.c_int,
+    ]
+    runtime.ishmem4py_atomic_compare_swap.restype = ctypes.c_uint64
+
+    runtime.ishmem4py_atomic_fetch_inc.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
+    runtime.ishmem4py_atomic_fetch_inc.restype = ctypes.c_uint64
+
+    runtime.ishmem4py_atomic_inc.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
+    runtime.ishmem4py_atomic_inc.restype = None
+
+    for name in (
+        "ishmem4py_atomic_fetch_add",
+        "ishmem4py_atomic_fetch_and",
+        "ishmem4py_atomic_fetch_or",
+        "ishmem4py_atomic_fetch_xor",
+    ):
+        func = getattr(runtime, name)
+        func.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_uint64, ctypes.c_int]
+        func.restype = ctypes.c_uint64
+
+    for name in (
+        "ishmem4py_atomic_add",
+        "ishmem4py_atomic_and",
+        "ishmem4py_atomic_or",
+        "ishmem4py_atomic_xor",
+    ):
+        func = getattr(runtime, name)
+        func.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_uint64, ctypes.c_int]
+        func.restype = None
+
+    runtime.ishmem4py_cast_ptr_to_uintptr.argtypes = [ctypes.c_void_p]
+    runtime.ishmem4py_cast_ptr_to_uintptr.restype = ctypes.c_uint64
+
+    runtime.ishmem4py_cast_uintptr_to_ptr.argtypes = [ctypes.c_uint64]
+    runtime.ishmem4py_cast_uintptr_to_ptr.restype = ctypes.c_void_p
+
+    return runtime
+
+
+def _load_runtime():
+    last_error = None
+    for candidate in _candidate_paths():
+        if not candidate.exists():
+            continue
+        try:
+            runtime = ctypes.CDLL(str(candidate))
+        except OSError as exc:
+            last_error = exc
+            continue
+        return _configure_runtime(runtime)
 
     if last_error is not None:
         raise RuntimeError(
@@ -99,4 +258,17 @@ def _load_runtime():
     )
 
 
-RUNTIME = _load_runtime()
+class _RuntimeProxy:
+    def __init__(self):
+        self._runtime = None
+
+    def _get_runtime(self):
+        if self._runtime is None:
+            self._runtime = _load_runtime()
+        return self._runtime
+
+    def __getattr__(self, name):
+        return getattr(self._get_runtime(), name)
+
+
+RUNTIME = _RuntimeProxy()
diff --git a/ishmem4py/python/ishmem4py/collective.py b/ishmem4py/python/ishmem4py/collective.py
new file mode 100644
index 0000000..4c9c868
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/collective.py
@@ -0,0 +1,288 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Host-side collective operations over symmetric allocations."""
+
+from __future__ import annotations
+
+import ctypes
+from typing import Optional
+
+from ._common import (
+    IshmemError,
+    SymmetricMemory,
+    TEAM_WORLD,
+    _check_status,
+    _dtype_element_count,
+    _normalize_reduction_op,
+    _normalize_span,
+    _normalize_team,
+    _require_active,
+    _require_initialized,
+    _require_reduction_dtype,
+)
+from ._lib import RUNTIME
+from .teams import team_my_pe, team_n_pes, team_sync
+
+__all__ = [
+    "alltoall",
+    "barrier",
+    "barrier_all",
+    "broadcast",
+    "collect",
+    "fcollect",
+    "reduce",
+    "reducescatter",
+    "sync",
+    "sync_all",
+]
+
+
+def _collective_span(
+    dest: SymmetricMemory,
+    src: SymmetricMemory,
+    *,
+    size: Optional[int],
+    dest_offset: int,
+    src_offset: int,
+) -> tuple[int, int, int]:
+    _require_active(dest)
+    _require_active(src)
+    dest_size = _normalize_span(dest, size=None, offset=dest_offset)
+    src_size = _normalize_span(src, size=None, offset=src_offset)
+    nbytes = min(dest_size, src_size) if size is None else size
+    if nbytes < 0:
+        raise ValueError("size must be >= 0")
+    if nbytes > dest_size:
+        raise ValueError("size extends past the destination symmetric object")
+    if nbytes > src_size:
+        raise ValueError("size extends past the source symmetric object")
+    return dest_size, src_size, nbytes
+
+
+def barrier_all() -> None:
+    """Block until all PEs in ``TEAM_WORLD`` arrive."""
+    _require_initialized()
+    RUNTIME.ishmem4py_barrier_all()
+
+
+def sync_all() -> None:
+    """Synchronize all PEs in ``TEAM_WORLD``."""
+    _require_initialized()
+    RUNTIME.ishmem4py_sync_all()
+
+
+def sync(team=None) -> None:
+    """Synchronize all PEs in ``team``."""
+    team_sync(team)
+
+
+def barrier(team=None) -> None:
+    """Barrier on ``team``."""
+    if team is None or _normalize_team(team) == TEAM_WORLD.handle:
+        barrier_all()
+        return
+    team_sync(team)
+
+
+def broadcast(
+    dest: SymmetricMemory,
+    src: SymmetricMemory,
+    *,
+    root: int,
+    team=None,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    """Broadcast ``src`` from ``root`` into ``dest`` across ``team``."""
+    _require_initialized()
+    _, _, nbytes = _collective_span(dest, src, size=size, dest_offset=dest_offset, src_offset=src_offset)
+    status = RUNTIME.ishmem4py_broadcastmem(
+        _normalize_team(team),
+        ctypes.c_void_p(dest.ptr + dest_offset),
+        ctypes.c_void_p(src.ptr + src_offset),
+        nbytes,
+        root,
+    )
+    _check_status(status, "ishmem_broadcastmem")
+    return nbytes
+
+
+def collect(
+    dest: SymmetricMemory,
+    src: SymmetricMemory,
+    *,
+    team=None,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    """Concatenate each PE's contribution from ``src`` into ``dest``."""
+    _require_initialized()
+    participants = team_n_pes(team)
+    dest_size = _normalize_span(dest, size=None, offset=dest_offset)
+    src_size = _normalize_span(src, size=None, offset=src_offset)
+    per_pe = min(src_size, dest_size // max(1, participants)) if size is None else size
+    if per_pe < 0:
+        raise ValueError("size must be >= 0")
+    if per_pe > src_size:
+        raise ValueError("size extends past the source symmetric object")
+    if per_pe * max(1, participants) > dest_size:
+        raise ValueError("destination symmetric object is too small for collect")
+    status = RUNTIME.ishmem4py_collectmem(
+        _normalize_team(team),
+        ctypes.c_void_p(dest.ptr + dest_offset),
+        ctypes.c_void_p(src.ptr + src_offset),
+        per_pe,
+    )
+    _check_status(status, "ishmem_collectmem")
+    return per_pe
+
+
+def fcollect(
+    dest: SymmetricMemory,
+    src: SymmetricMemory,
+    *,
+    team=None,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    """Fixed-size collect from ``src`` into ``dest``."""
+    _require_initialized()
+    participants = team_n_pes(team)
+    dest_size = _normalize_span(dest, size=None, offset=dest_offset)
+    src_size = _normalize_span(src, size=None, offset=src_offset)
+    per_pe = min(src_size, dest_size // max(1, participants)) if size is None else size
+    if per_pe < 0:
+        raise ValueError("size must be >= 0")
+    if per_pe > src_size:
+        raise ValueError("size extends past the source symmetric object")
+    if per_pe * max(1, participants) > dest_size:
+        raise ValueError("destination symmetric object is too small for fcollect")
+    status = RUNTIME.ishmem4py_fcollectmem(
+        _normalize_team(team),
+        ctypes.c_void_p(dest.ptr + dest_offset),
+        ctypes.c_void_p(src.ptr + src_offset),
+        per_pe,
+    )
+    _check_status(status, "ishmem_fcollectmem")
+    return per_pe
+
+
+def alltoall(
+    dest: SymmetricMemory,
+    src: SymmetricMemory,
+    *,
+    team=None,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    """Exchange equally sized blocks among all PEs in ``team``."""
+    _require_initialized()
+    participants = team_n_pes(team)
+    if participants <= 0:
+        raise IshmemError("alltoall requires a valid team")
+    dest_size = _normalize_span(dest, size=None, offset=dest_offset)
+    src_size = _normalize_span(src, size=None, offset=src_offset)
+    per_pe = min(src_size // participants, dest_size // participants) if size is None else size
+    if per_pe < 0:
+        raise ValueError("size must be >= 0")
+    if per_pe * participants > src_size:
+        raise ValueError("source symmetric object is too small for alltoall")
+    if per_pe * participants > dest_size:
+        raise ValueError("destination symmetric object is too small for alltoall")
+    status = RUNTIME.ishmem4py_alltoallmem(
+        _normalize_team(team),
+        ctypes.c_void_p(dest.ptr + dest_offset),
+        ctypes.c_void_p(src.ptr + src_offset),
+        per_pe,
+    )
+    _check_status(status, "ishmem_alltoallmem")
+    return per_pe
+
+
+def reduce(
+    op: str,
+    dest: SymmetricMemory,
+    src: SymmetricMemory,
+    *,
+    dtype,
+    count: Optional[int] = None,
+    team=None,
+) -> int:
+    """Reduce ``src`` into ``dest`` across ``team`` using ``op`` and ``dtype``."""
+    _require_initialized()
+    _require_active(dest)
+    _require_active(src)
+    info = _require_reduction_dtype(op, dtype)
+    _, count = _dtype_element_count(src, dest, info, count)
+    status = RUNTIME.ishmem4py_reduce(
+        _normalize_reduction_op(op),
+        info.code,
+        _normalize_team(team),
+        ctypes.c_void_p(dest.ptr),
+        ctypes.c_void_p(src.ptr),
+        count,
+    )
+    _check_status(status, f"ishmem_{info.name}_{op}_reduce")
+    return count
+
+
+def reducescatter(
+    op: str,
+    dest: SymmetricMemory,
+    src: SymmetricMemory,
+    *,
+    dtype,
+    count: Optional[int] = None,
+    team=None,
+) -> int:
+    """Reduce ``src`` across ``team`` and return this PE's block in ``dest``.
+
+    Intel SHMEM does not currently expose a host-side reducescatter routine.
+    ``ishmem4py`` therefore implements the operation as a small software
+    fallback: perform a full ``reduce`` into a temporary symmetric buffer, then
+    copy the local team's slice into ``dest``.
+    """
+    from .init_fini import my_pe
+    from .memory import free, malloc
+    from .rma import get
+
+    _require_initialized()
+    _require_active(dest)
+    _require_active(src)
+    info = _require_reduction_dtype(op, dtype)
+    participants = team_n_pes(team)
+    if participants <= 0:
+        raise IshmemError("reducescatter requires a valid team")
+
+    dest_available = dest.size // info.itemsize
+    src_available = src.size // info.itemsize
+    if count is None:
+        count = min(dest_available, src_available // participants)
+    if count < 0:
+        raise ValueError("count must be >= 0")
+    if count > dest_available:
+        raise ValueError("count exceeds the available destination symmetric storage")
+
+    total_count = count * participants
+    if total_count > src_available:
+        raise ValueError("count exceeds the available source symmetric storage for reducescatter")
+
+    scratch = malloc(total_count * info.itemsize)
+    try:
+        reduce(op, scratch, src, dtype=info, count=total_count, team=team)
+        get(
+            dest,
+            scratch,
+            pe=my_pe(),
+            size=count * info.itemsize,
+            src_offset=team_my_pe(team) * count * info.itemsize,
+        )
+    finally:
+        free(scratch)
+
+    return count
diff --git a/ishmem4py/python/ishmem4py/core.py b/ishmem4py/python/ishmem4py/core.py
index dbb8b3a..f27c926 100644
--- a/ishmem4py/python/ishmem4py/core.py
+++ b/ishmem4py/python/ishmem4py/core.py
@@ -1,299 +1,46 @@
 # Copyright (C) 2026 Intel Corporation
 # SPDX-License-Identifier: BSD-3-Clause
 
-import ctypes
-from dataclasses import dataclass
-from typing import Optional
-
-from ._lib import RUNTIME
-
-
-class IshmemError(RuntimeError):
-    pass
-
-
-class IshmemStateError(IshmemError):
-    pass
-
-
-@dataclass(frozen=True)
-class _PointerInfo:
-    ptr: int
-    size: int
-    keepalive: object
-
-
-class SymmetricMemory:
-    def __init__(self, ptr: int, size: int):
-        self._ptr = int(ptr)
-        self._size = int(size)
-        self._freed = False
-
-    @property
-    def ptr(self) -> int:
-        return self._ptr
-
-    @property
-    def size(self) -> int:
-        return self._size
-
-    @property
-    def freed(self) -> bool:
-        return self._freed
-
-    def _mark_freed(self) -> None:
-        self._freed = True
-
-    def read(self, size: Optional[int] = None, *, offset: int = 0, pe: Optional[int] = None) -> bytes:
-        if pe is None:
-            pe = my_pe()
-        length = _normalize_span(self, size=size, offset=offset)
-        dst = bytearray(length)
-        getmem(dst, self, pe=pe, size=length, src_offset=offset)
-        return bytes(dst)
-
-    def write(self, data, *, offset: int = 0, pe: Optional[int] = None) -> int:
-        if pe is None:
-            pe = my_pe()
-        return putmem(self, data, pe=pe, dest_offset=offset)
-
-    def __int__(self) -> int:
-        return self._ptr
-
-    def __repr__(self) -> str:
-        state = "freed" if self._freed else "active"
-        return f"SymmetricMemory(ptr=0x{self._ptr:x}, size={self._size}, state={state})"
-
-
-_initialized = False
-_finalized = False
-_live_allocations = {}
-
-
-def _require_initialized() -> None:
-    if not _initialized:
-        raise IshmemStateError("Intel SHMEM is not initialized")
-
-
-def _require_active(symm: SymmetricMemory) -> None:
-    if not isinstance(symm, SymmetricMemory):
-        raise TypeError("expected a SymmetricMemory instance")
-    if symm.freed:
-        raise IshmemStateError("symmetric allocation has already been freed")
-
-
-def _normalize_span(obj, *, size: Optional[int], offset: int) -> int:
-    if offset < 0:
-        raise ValueError("offset must be >= 0")
-
-    available = obj.size - offset
-    if available < 0:
-        raise ValueError("offset is past the end of the object")
-
-    if size is None:
-        return available
-    if size < 0:
-        raise ValueError("size must be >= 0")
-    if size > available:
-        raise ValueError("requested size extends past the end of the object")
-    return size
-
-
-def _ensure_contiguous_memoryview(obj, *, writable: bool):
-    mv = memoryview(obj)
-    if not mv.contiguous:
-        raise ValueError("buffer must be contiguous")
-    if writable and mv.readonly:
-        raise ValueError("destination buffer must be writable")
-    return mv
-
-
-def _pointer_from_local_buffer(obj, *, writable: bool, offset: int = 0) -> _PointerInfo:
-    if isinstance(obj, SymmetricMemory):
-        _require_active(obj)
-        size = _normalize_span(obj, size=None, offset=offset)
-        return _PointerInfo(ptr=obj.ptr + offset, size=size, keepalive=obj)
-
-    mv = _ensure_contiguous_memoryview(obj, writable=writable)
-
-    if offset < 0:
-        raise ValueError("offset must be >= 0")
-    if offset > mv.nbytes:
-        raise ValueError("offset is past the end of the buffer")
-
-    if mv.readonly:
-        if writable:
-            raise ValueError("destination buffer must be writable")
-        copied = (ctypes.c_char * mv.nbytes).from_buffer_copy(mv)
-        return _PointerInfo(ptr=ctypes.addressof(copied) + offset, size=mv.nbytes - offset, keepalive=copied)
-
-    raw = (ctypes.c_char * mv.nbytes).from_buffer(mv)
-    return _PointerInfo(ptr=ctypes.addressof(raw) + offset, size=mv.nbytes - offset, keepalive=(mv, raw))
-
-
-def _symmetric_target(obj) -> SymmetricMemory:
-    _require_active(obj)
-    return obj
-
-
-def is_initialized() -> bool:
-    return _initialized
-
-
-def init() -> None:
-    global _initialized
-    if _initialized:
-        raise IshmemStateError("Intel SHMEM is already initialized")
-    if _finalized:
-        raise IshmemStateError(
-            "Intel SHMEM has already been finalized in this process and cannot be reinitialized"
-        )
-    RUNTIME.ishmem4py_init()
-    _initialized = True
-
-
-def finalize() -> None:
-    global _initialized, _finalized
-    if not _initialized:
-        raise IshmemStateError("Intel SHMEM is not initialized")
-    if _live_allocations:
-        leaked = ", ".join(f"0x{ptr:x}" for ptr in sorted(_live_allocations))
-        raise IshmemStateError(
-            "cannot finalize while symmetric allocations are still live; "
-            f"free them first: {leaked}"
-        )
-    RUNTIME.ishmem4py_finalize()
-    _initialized = False
-    _finalized = True
-
-
-def my_pe() -> int:
-    _require_initialized()
-    return int(RUNTIME.ishmem4py_my_pe())
-
-
-def n_pes() -> int:
-    _require_initialized()
-    return int(RUNTIME.ishmem4py_n_pes())
-
-
-def info_get_version() -> tuple[int, int]:
-    major = ctypes.c_int()
-    minor = ctypes.c_int()
-    RUNTIME.ishmem4py_info_get_version(ctypes.byref(major), ctypes.byref(minor))
-    return int(major.value), int(minor.value)
-
-
-def barrier_all() -> None:
-    _require_initialized()
-    RUNTIME.ishmem4py_barrier_all()
-
-
-def sync_all() -> None:
-    _require_initialized()
-    RUNTIME.ishmem4py_sync_all()
-
-
-def fence() -> None:
-    _require_initialized()
-    RUNTIME.ishmem4py_fence()
-
-
-def quiet() -> None:
-    _require_initialized()
-    RUNTIME.ishmem4py_quiet()
-
-
-def malloc(size: int) -> SymmetricMemory:
-    _require_initialized()
-    if size < 0:
-        raise ValueError("size must be >= 0")
-    ptr = RUNTIME.ishmem4py_malloc(size)
-    if size > 0 and not ptr:
-        raise IshmemError(f"ishmem_malloc({size}) returned NULL")
-    result = SymmetricMemory(ptr=int(ptr or 0), size=size)
-    _live_allocations[result.ptr] = result
-    return result
-
-
-def calloc(count: int, size: int) -> SymmetricMemory:
-    _require_initialized()
-    if count < 0 or size < 0:
-        raise ValueError("count and size must be >= 0")
-    ptr = RUNTIME.ishmem4py_calloc(count, size)
-    total_size = count * size
-    if total_size > 0 and not ptr:
-        raise IshmemError(f"ishmem_calloc({count}, {size}) returned NULL")
-    result = SymmetricMemory(ptr=int(ptr or 0), size=total_size)
-    _live_allocations[result.ptr] = result
-    return result
-
-
-def free(symm: SymmetricMemory) -> None:
-    _require_initialized()
-    _require_active(symm)
-    RUNTIME.ishmem4py_free(ctypes.c_void_p(symm.ptr))
-    _live_allocations.pop(symm.ptr, None)
-    symm._mark_freed()
-
-
-def putmem(
-    dest: SymmetricMemory,
-    src,
-    *,
-    pe: int,
-    size: Optional[int] = None,
-    dest_offset: int = 0,
-    src_offset: int = 0,
-) -> int:
-    _require_initialized()
-    dest = _symmetric_target(dest)
-    src_info = _pointer_from_local_buffer(src, writable=False, offset=src_offset)
-    dest_size = _normalize_span(dest, size=None, offset=dest_offset)
-    nbytes = min(dest_size, src_info.size) if size is None else size
-
-    if nbytes < 0:
-        raise ValueError("size must be >= 0")
-    if nbytes > dest_size:
-        raise ValueError("size extends past the destination symmetric object")
-    if nbytes > src_info.size:
-        raise ValueError("size extends past the source buffer")
-
-    RUNTIME.ishmem4py_putmem(
-        ctypes.c_void_p(dest.ptr + dest_offset),
-        ctypes.c_void_p(src_info.ptr),
-        nbytes,
-        pe,
-    )
-    return nbytes
-
-
-def getmem(
-    dest,
-    src: SymmetricMemory,
-    *,
-    pe: int,
-    size: Optional[int] = None,
-    dest_offset: int = 0,
-    src_offset: int = 0,
-) -> int:
-    _require_initialized()
-    src = _symmetric_target(src)
-    dest_info = _pointer_from_local_buffer(dest, writable=True, offset=dest_offset)
-    src_size = _normalize_span(src, size=None, offset=src_offset)
-    nbytes = min(dest_info.size, src_size) if size is None else size
-
-    if nbytes < 0:
-        raise ValueError("size must be >= 0")
-    if nbytes > dest_info.size:
-        raise ValueError("size extends past the destination buffer")
-    if nbytes > src_size:
-        raise ValueError("size extends past the source symmetric object")
-
-    RUNTIME.ishmem4py_getmem(
-        ctypes.c_void_p(dest_info.ptr),
-        ctypes.c_void_p(src.ptr + src_offset),
-        nbytes,
-        pe,
-    )
-    return nbytes
+"""Aggregated public ``ishmem4py`` API."""
+
+from . import collective, init_fini, memory, rma, teams
+from ._common import (
+    Comparison,
+    InitStatus,
+    IshmemError,
+    IshmemStateError,
+    MemoryPointer,
+    SignalOp,
+    SymmetricMemory,
+    TEAM_INVALID,
+    TEAM_SHARED,
+    TEAM_WORLD,
+    Team,
+    Version,
+)
+from .collective import *
+from .init_fini import *
+from .memory import *
+from .rma import *
+from .teams import *
+
+__all__ = [
+    "Comparison",
+    "InitStatus",
+    "IshmemError",
+    "IshmemStateError",
+    "MemoryPointer",
+    "SignalOp",
+    "SymmetricMemory",
+    "TEAM_INVALID",
+    "TEAM_SHARED",
+    "TEAM_WORLD",
+    "Team",
+    "Version",
+]
+
+__all__ += collective.__all__
+__all__ += init_fini.__all__
+__all__ += memory.__all__
+__all__ += rma.__all__
+__all__ += teams.__all__
diff --git a/ishmem4py/python/ishmem4py/init_fini.py b/ishmem4py/python/ishmem4py/init_fini.py
new file mode 100644
index 0000000..f74ac12
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/init_fini.py
@@ -0,0 +1,107 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Initialization, finalization, and runtime queries."""
+
+from __future__ import annotations
+
+import ctypes
+
+from ._common import (
+    InitStatus,
+    Version,
+    _check_can_finalize,
+    _check_can_init,
+    _get_init_status,
+    _require_initialized,
+    _set_finalized,
+    _set_initialized,
+)
+from ._lib import RUNTIME
+from .version import __version__
+
+_INFO_NAME_BYTES = 256
+
+__all__ = [
+    "InitStatus",
+    "Version",
+    "finalize",
+    "get_version",
+    "info_get_name",
+    "info_get_version",
+    "init",
+    "init_status",
+    "is_initialized",
+    "my_pe",
+    "n_pes",
+]
+
+
+def init() -> None:
+    """Initialize Intel SHMEM for the current process."""
+    _check_can_init()
+    RUNTIME.ishmem4py_init()
+    _set_initialized()
+
+
+def finalize() -> None:
+    """Finalize Intel SHMEM for the current process."""
+    _check_can_finalize()
+    RUNTIME.ishmem4py_finalize()
+    _set_finalized()
+
+
+def init_status() -> InitStatus:
+    """Return the process-local initialization state."""
+    return _get_init_status()
+
+
+def is_initialized() -> bool:
+    """Return ``True`` if Intel SHMEM is currently initialized."""
+    return init_status() == InitStatus.INITIALIZED
+
+
+def my_pe() -> int:
+    """Return the calling PE's global rank."""
+    _require_initialized()
+    return int(RUNTIME.ishmem4py_my_pe())
+
+
+def n_pes() -> int:
+    """Return the number of PEs in ``TEAM_WORLD``."""
+    _require_initialized()
+    return int(RUNTIME.ishmem4py_n_pes())
+
+
+def info_get_version() -> tuple[int, int]:
+    """Return the OpenSHMEM specification version reported by the runtime."""
+    major = ctypes.c_int()
+    minor = ctypes.c_int()
+    RUNTIME.ishmem4py_info_get_version(ctypes.byref(major), ctypes.byref(minor))
+    return int(major.value), int(minor.value)
+
+
+def info_get_name() -> str:
+    """Return the runtime vendor name."""
+    name = ctypes.create_string_buffer(_INFO_NAME_BYTES)
+    RUNTIME.ishmem4py_info_get_name(ctypes.cast(name, ctypes.c_void_p))
+    return name.value.decode("utf-8")
+
+
+def get_version() -> Version:
+    """Return combined OpenSHMEM, Intel SHMEM, and ``ishmem4py`` version metadata."""
+    spec_major, spec_minor = info_get_version()
+    lib_major = ctypes.c_int()
+    lib_minor = ctypes.c_int()
+    lib_patch = ctypes.c_int()
+    RUNTIME.ishmem4py_vendor_get_version(
+        ctypes.byref(lib_major),
+        ctypes.byref(lib_minor),
+        ctypes.byref(lib_patch),
+    )
+    return Version(
+        openshmem_spec_version=f"{spec_major}.{spec_minor}",
+        ishmem4py_version=__version__,
+        libishmem_version=f"{lib_major.value}.{lib_minor.value}.{lib_patch.value}",
+        vendor_name=info_get_name(),
+    )
diff --git a/ishmem4py/python/ishmem4py/memory.py b/ishmem4py/python/ishmem4py/memory.py
new file mode 100644
index 0000000..048ed9b
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/memory.py
@@ -0,0 +1,88 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Symmetric memory allocation and address helpers."""
+
+from __future__ import annotations
+
+import ctypes
+
+from ._common import (
+    IshmemError,
+    MemoryPointer,
+    SymmetricMemory,
+    _register_allocation,
+    _require_active,
+    _require_initialized,
+    _unregister_allocation,
+)
+from ._lib import RUNTIME
+
+__all__ = [
+    "MemoryPointer",
+    "SymmetricMemory",
+    "buffer",
+    "calloc",
+    "free",
+    "ishmem_ptr",
+    "malloc",
+    "ptr",
+]
+
+
+def malloc(size: int) -> SymmetricMemory:
+    """Allocate ``size`` bytes from the symmetric heap."""
+    _require_initialized()
+    if size < 0:
+        raise ValueError("size must be >= 0")
+    ptr_value = int(RUNTIME.ishmem4py_malloc(size) or 0)
+    if size > 0 and not ptr_value:
+        raise IshmemError(f"ishmem_malloc({size}) returned NULL")
+    result = SymmetricMemory(ptr=ptr_value, size=size)
+    _register_allocation(result)
+    return result
+
+
+def calloc(count: int, size: int) -> SymmetricMemory:
+    """Allocate ``count * size`` zeroed bytes from the symmetric heap."""
+    _require_initialized()
+    if count < 0 or size < 0:
+        raise ValueError("count and size must be >= 0")
+    total_size = count * size
+    ptr_value = int(RUNTIME.ishmem4py_calloc(count, size) or 0)
+    if total_size > 0 and not ptr_value:
+        raise IshmemError(f"ishmem_calloc({count}, {size}) returned NULL")
+    result = SymmetricMemory(ptr=ptr_value, size=total_size)
+    _register_allocation(result)
+    return result
+
+
+def buffer(size: int, *, zero: bool = False) -> SymmetricMemory:
+    """Allocate a symmetric byte buffer."""
+    if zero:
+        return calloc(1, size)
+    return malloc(size)
+
+
+def free(symm: SymmetricMemory) -> None:
+    """Free a symmetric allocation previously returned by ``malloc`` or ``calloc``."""
+    _require_initialized()
+    _require_active(symm)
+    RUNTIME.ishmem4py_free(ctypes.c_void_p(symm.ptr))
+    _unregister_allocation(symm)
+    symm._mark_freed()
+
+
+def ptr(symm: SymmetricMemory, *, pe: int) -> MemoryPointer | None:
+    """Return a directly addressable pointer for ``symm`` on ``pe`` when available."""
+    _require_initialized()
+    _require_active(symm)
+    address = int(RUNTIME.ishmem4py_ptr(ctypes.c_void_p(symm.ptr), pe) or 0)
+    if address == 0:
+        return None
+    return MemoryPointer(address=address, size=symm.size, pe=pe, base=symm)
+
+
+def ishmem_ptr(symm: SymmetricMemory, *, pe: int) -> MemoryPointer | None:
+    """Alias for ``ptr`` for Intel SHMEM naming compatibility."""
+    return ptr(symm, pe=pe)
diff --git a/ishmem4py/python/ishmem4py/rma.py b/ishmem4py/python/ishmem4py/rma.py
new file mode 100644
index 0000000..6351226
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/rma.py
@@ -0,0 +1,113 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Host-initiated one-sided RMA operations."""
+
+from __future__ import annotations
+
+import ctypes
+from typing import Optional
+
+from ._common import (
+    SymmetricMemory,
+    _normalize_span,
+    _pointer_from_local_buffer,
+    _require_initialized,
+    _symmetric_target,
+)
+from ._lib import RUNTIME
+
+__all__ = [
+    "fence",
+    "get",
+    "getmem",
+    "put",
+    "putmem",
+    "quiet",
+]
+
+
+def fence() -> None:
+    """Order previously issued put-like operations."""
+    _require_initialized()
+    RUNTIME.ishmem4py_fence()
+
+
+def quiet() -> None:
+    """Wait for completion of previously issued RMA operations."""
+    _require_initialized()
+    RUNTIME.ishmem4py_quiet()
+
+
+def put(
+    dest: SymmetricMemory,
+    src,
+    *,
+    pe: int,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    """Copy bytes from a local buffer into symmetric memory on ``pe``."""
+    _require_initialized()
+    dest = _symmetric_target(dest)
+    src_info = _pointer_from_local_buffer(src, writable=False, offset=src_offset)
+    dest_size = _normalize_span(dest, size=None, offset=dest_offset)
+    nbytes = min(dest_size, src_info.size) if size is None else size
+
+    if nbytes < 0:
+        raise ValueError("size must be >= 0")
+    if nbytes > dest_size:
+        raise ValueError("size extends past the destination symmetric object")
+    if nbytes > src_info.size:
+        raise ValueError("size extends past the source buffer")
+
+    RUNTIME.ishmem4py_putmem(
+        ctypes.c_void_p(dest.ptr + dest_offset),
+        ctypes.c_void_p(src_info.ptr),
+        nbytes,
+        pe,
+    )
+    return nbytes
+
+
+def get(
+    dest,
+    src: SymmetricMemory,
+    *,
+    pe: int,
+    size: Optional[int] = None,
+    dest_offset: int = 0,
+    src_offset: int = 0,
+) -> int:
+    """Copy bytes from symmetric memory on ``pe`` into a local buffer."""
+    _require_initialized()
+    src = _symmetric_target(src)
+    dest_info = _pointer_from_local_buffer(dest, writable=True, offset=dest_offset)
+    src_size = _normalize_span(src, size=None, offset=src_offset)
+    nbytes = min(dest_info.size, src_size) if size is None else size
+
+    if nbytes < 0:
+        raise ValueError("size must be >= 0")
+    if nbytes > dest_info.size:
+        raise ValueError("size extends past the destination buffer")
+    if nbytes > src_size:
+        raise ValueError("size extends past the source symmetric object")
+
+    RUNTIME.ishmem4py_getmem(
+        ctypes.c_void_p(dest_info.ptr),
+        ctypes.c_void_p(src.ptr + src_offset),
+        nbytes,
+        pe,
+    )
+    return nbytes
+
+
+def putmem(*args, **kwargs) -> int:
+    """Alias for ``put`` for Intel SHMEM naming compatibility."""
+    return put(*args, **kwargs)
+
+
+def getmem(*args, **kwargs) -> int:
+    """Alias for ``get`` for Intel SHMEM naming compatibility."""
+    return get(*args, **kwargs)
diff --git a/ishmem4py/python/ishmem4py/teams.py b/ishmem4py/python/ishmem4py/teams.py
new file mode 100644
index 0000000..ae5d79d
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/teams.py
@@ -0,0 +1,56 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Team queries and synchronization helpers."""
+
+from __future__ import annotations
+
+from ._common import (
+    TEAM_SHARED,
+    TEAM_WORLD,
+    Team,
+    _check_status,
+    _normalize_team,
+    _require_initialized,
+)
+from ._lib import RUNTIME
+
+__all__ = [
+    "TEAM_SHARED",
+    "TEAM_WORLD",
+    "Team",
+    "team_my_pe",
+    "team_n_pes",
+    "team_sync",
+    "team_translate_pe",
+]
+
+
+def team_my_pe(team: Team | int | None = None) -> int:
+    """Return the calling PE's rank within ``team``."""
+    _require_initialized()
+    return int(RUNTIME.ishmem4py_team_my_pe(_normalize_team(team)))
+
+
+def team_n_pes(team: Team | int | None = None) -> int:
+    """Return the number of PEs in ``team``."""
+    _require_initialized()
+    return int(RUNTIME.ishmem4py_team_n_pes(_normalize_team(team)))
+
+
+def team_translate_pe(src_team: Team | int | None, src_pe: int, dest_team: Team | int | None) -> int:
+    """Translate ``src_pe`` from ``src_team`` numbering into ``dest_team`` numbering."""
+    _require_initialized()
+    return int(
+        RUNTIME.ishmem4py_team_translate_pe(
+            _normalize_team(src_team),
+            src_pe,
+            _normalize_team(dest_team),
+        )
+    )
+
+
+def team_sync(team: Team | int | None = None) -> None:
+    """Synchronize all PEs in ``team``."""
+    _require_initialized()
+    _check_status(RUNTIME.ishmem4py_team_sync(_normalize_team(team)), "ishmem_team_sync")
diff --git a/ishmem4py/python/ishmem4py/version.py b/ishmem4py/python/ishmem4py/version.py
new file mode 100644
index 0000000..1ddab53
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/version.py
@@ -0,0 +1,4 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+__version__ = "0.2.0a0"

From 607f9066f0d0dc86387b1742ff6ce9b4633d724d Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Thu, 12 Mar 2026 20:00:03 +0000
Subject: [PATCH 07/17] Add ishmem4py smoke, ring, and collective coverage

---
 ishmem4py/CMakeLists.txt           |  9 ++-
 ishmem4py/examples/ring_put_get.py |  8 +--
 ishmem4py/test/collective_test.py  | 88 ++++++++++++++++++++++++++++++
 ishmem4py/test/ring_test.py        | 26 +++++----
 ishmem4py/test/smoke_test.py       | 73 +++++++++++++++++--------
 ishmem4py/test/utils.py            | 46 ++++++++++++++++
 6 files changed, 207 insertions(+), 43 deletions(-)
 create mode 100644 ishmem4py/test/collective_test.py
 create mode 100644 ishmem4py/test/utils.py

diff --git a/ishmem4py/CMakeLists.txt b/ishmem4py/CMakeLists.txt
index a1b0e8e..f32b2e5 100644
--- a/ishmem4py/CMakeLists.txt
+++ b/ishmem4py/CMakeLists.txt
@@ -16,9 +16,6 @@ set(ISHMEM4PY_PKG_BUILD_DIR "${ISHMEM4PY_PYTHON_BUILD_ROOT}/ishmem4py")
 set(ISHMEM4PY_SOURCE_PYTHON_ROOT "${CMAKE_CURRENT_SOURCE_DIR}/python")
 set(ISHMEM4PY_SOURCE_PKG_DIR "${ISHMEM4PY_SOURCE_PYTHON_ROOT}/ishmem4py")
 
-file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/test/ DESTINATION ${ISHMEM4PY_BUILD_ROOT})
-file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/examples/ DESTINATION ${ISHMEM4PY_BUILD_ROOT})
-
 add_custom_target(ishmem4py-package-files
     COMMAND ${CMAKE_COMMAND} -E make_directory "${ISHMEM4PY_PYTHON_BUILD_ROOT}"
     COMMAND ${CMAKE_COMMAND} -E copy_if_different
@@ -30,6 +27,12 @@ add_custom_target(ishmem4py-package-files
     COMMAND ${CMAKE_COMMAND} -E copy_if_different
         "${ISHMEM4PY_SOURCE_PYTHON_ROOT}/setup.py"
         "${ISHMEM4PY_PYTHON_BUILD_ROOT}/setup.py"
+    COMMAND ${CMAKE_COMMAND} -E copy_directory
+        "${CMAKE_CURRENT_SOURCE_DIR}/test"
+        "${ISHMEM4PY_BUILD_ROOT}/test"
+    COMMAND ${CMAKE_COMMAND} -E copy_directory
+        "${CMAKE_CURRENT_SOURCE_DIR}/examples"
+        "${ISHMEM4PY_BUILD_ROOT}/examples"
     COMMAND ${CMAKE_COMMAND} -E copy_directory
         "${ISHMEM4PY_SOURCE_PKG_DIR}"
         "${ISHMEM4PY_PKG_BUILD_DIR}"
diff --git a/ishmem4py/examples/ring_put_get.py b/ishmem4py/examples/ring_put_get.py
index 1d3fdea..c9c2427 100644
--- a/ishmem4py/examples/ring_put_get.py
+++ b/ishmem4py/examples/ring_put_get.py
@@ -20,18 +20,18 @@ def main() -> None:
             src.write(struct.pack("=i", my_pe))
             ishmem.barrier_all()
 
-            ishmem.putmem(dst, src.read(4), pe=next_pe)
+            ishmem.put(dst, src.read(4), pe=next_pe)
             ishmem.barrier_all()
 
             received = struct.unpack("=i", dst.read(4))[0]
 
             fetched = bytearray(4)
-            ishmem.getmem(fetched, src, pe=next_pe)
+            ishmem.get(fetched, src, pe=next_pe)
             fetched_value = struct.unpack("=i", fetched)[0]
 
             print(
-                f"PE {my_pe}: dst after putmem={received} (expected {prev_pe}), "
-                f"getmem from PE {next_pe} returned {fetched_value}"
+                f"PE {my_pe}: dst after put={received} (expected {prev_pe}), "
+                f"get from PE {next_pe} returned {fetched_value}"
             )
             ishmem.barrier_all()
         finally:
diff --git a/ishmem4py/test/collective_test.py b/ishmem4py/test/collective_test.py
new file mode 100644
index 0000000..e17c8d1
--- /dev/null
+++ b/ishmem4py/test/collective_test.py
@@ -0,0 +1,88 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import sys
+
+import ishmem4py as ishmem
+
+from utils import expect_equal, pack_int32, unpack_int32, unpack_int32_list
+
+
+def main() -> int:
+    ishmem.init()
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        if npes < 2:
+            raise RuntimeError("collective_test.py requires at least 2 PEs")
+
+        broadcast_src = ishmem.malloc(4)
+        broadcast_dst = ishmem.calloc(1, 4)
+        collect_src = ishmem.malloc(4)
+        collect_dst = ishmem.calloc(npes, 4)
+        fcollect_dst = ishmem.calloc(npes, 4)
+        alltoall_src = ishmem.malloc(npes * 4)
+        alltoall_dst = ishmem.calloc(npes, 4)
+        reduce_src = ishmem.malloc(4)
+        reduce_sum_dst = ishmem.calloc(1, 4)
+        reduce_max_dst = ishmem.calloc(1, 4)
+        reducescatter_src = ishmem.malloc(npes * 4)
+        reducescatter_dst = ishmem.calloc(1, 4)
+        try:
+            broadcast_src.write(pack_int32(1000 + my_pe))
+            collect_src.write(pack_int32(my_pe))
+            alltoall_src.write(pack_int32(*[(my_pe * 100) + peer for peer in range(npes)]))
+            reduce_src.write(pack_int32(my_pe + 1))
+            reducescatter_src.write(pack_int32(*[(my_pe + 1) * (peer + 1) for peer in range(npes)]))
+
+            ishmem.barrier_all()
+            ishmem.sync_all()
+
+            ishmem.broadcast(broadcast_dst, broadcast_src, root=0)
+            ishmem.collect(collect_dst, collect_src)
+            ishmem.fcollect(fcollect_dst, collect_src)
+            ishmem.alltoall(alltoall_dst, alltoall_src)
+            ishmem.reduce("sum", reduce_sum_dst, reduce_src, dtype="int32")
+            ishmem.reduce("max", reduce_max_dst, reduce_src, dtype="int32")
+            ishmem.reducescatter("sum", reducescatter_dst, reducescatter_src, dtype="int32", count=1)
+
+            ishmem.barrier_all()
+
+            expect_equal("broadcast result", unpack_int32(broadcast_dst.read(4)), 1000)
+            expect_equal("collect result", unpack_int32_list(collect_dst.read(npes * 4)), list(range(npes)))
+            expect_equal("fcollect result", unpack_int32_list(fcollect_dst.read(npes * 4)), list(range(npes)))
+            expect_equal(
+                "alltoall result",
+                unpack_int32_list(alltoall_dst.read(npes * 4)),
+                [(peer * 100) + my_pe for peer in range(npes)],
+            )
+            expect_equal("sum reduce result", unpack_int32(reduce_sum_dst.read(4)), npes * (npes + 1) // 2)
+            expect_equal("max reduce result", unpack_int32(reduce_max_dst.read(4)), npes)
+            expect_equal(
+                "reducescatter result",
+                unpack_int32(reducescatter_dst.read(4)),
+                (my_pe + 1) * (npes * (npes + 1) // 2),
+            )
+        finally:
+            ishmem.free(reducescatter_dst)
+            ishmem.free(reducescatter_src)
+            ishmem.free(reduce_max_dst)
+            ishmem.free(reduce_sum_dst)
+            ishmem.free(reduce_src)
+            ishmem.free(alltoall_dst)
+            ishmem.free(alltoall_src)
+            ishmem.free(fcollect_dst)
+            ishmem.free(collect_dst)
+            ishmem.free(collect_src)
+            ishmem.free(broadcast_dst)
+            ishmem.free(broadcast_src)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/ishmem4py/test/ring_test.py b/ishmem4py/test/ring_test.py
index 8a45fe6..76b1ac8 100644
--- a/ishmem4py/test/ring_test.py
+++ b/ishmem4py/test/ring_test.py
@@ -1,15 +1,13 @@
 # Copyright (C) 2026 Intel Corporation
 # SPDX-License-Identifier: BSD-3-Clause
 
-import struct
+from __future__ import annotations
+
 import sys
 
 import ishmem4py as ishmem
 
-
-def _expect_equal(label: str, actual: int, expected: int) -> None:
-    if actual != expected:
-        raise AssertionError(f"{label}: expected {expected}, got {actual}")
+from utils import expect_equal, pack_int32, unpack_int32
 
 
 def main() -> int:
@@ -26,20 +24,24 @@ def main() -> int:
         src = ishmem.malloc(4)
         dst = ishmem.calloc(1, 4)
         try:
-            src.write(struct.pack("=i", my_pe))
+            src.write(pack_int32(my_pe))
 
             ishmem.barrier_all()
 
-            ishmem.putmem(dst, src.read(4), pe=next_pe)
+            ishmem.put(dst, src.read(4), pe=next_pe)
             ishmem.barrier_all()
 
-            received = struct.unpack("=i", dst.read(4))[0]
-            _expect_equal("putmem ring result", received, prev_pe)
+            received = unpack_int32(dst.read(4))
+            expect_equal("put ring result", received, prev_pe)
 
             host_value = bytearray(4)
-            ishmem.getmem(host_value, src, pe=next_pe)
-            fetched = struct.unpack("=i", host_value)[0]
-            _expect_equal("getmem ring result", fetched, next_pe)
+            ishmem.get(host_value, src, pe=next_pe)
+            expect_equal("get ring result", unpack_int32(host_value), next_pe)
+
+            remote_ptr = ishmem.ishmem_ptr(src, pe=next_pe)
+            if remote_ptr is not None:
+                expect_equal("ptr metadata size", remote_ptr.size, src.size)
+                expect_equal("ptr metadata pe", remote_ptr.pe, next_pe)
 
             ishmem.barrier_all()
         finally:
diff --git a/ishmem4py/test/smoke_test.py b/ishmem4py/test/smoke_test.py
index 882d43b..49abead 100644
--- a/ishmem4py/test/smoke_test.py
+++ b/ishmem4py/test/smoke_test.py
@@ -1,40 +1,65 @@
 # Copyright (C) 2026 Intel Corporation
 # SPDX-License-Identifier: BSD-3-Clause
 
-import struct
-import unittest
+from __future__ import annotations
+
+import sys
 
 import ishmem4py as ishmem
 
+from utils import expect_equal, expect_true, pack_int32, unpack_int32, unpack_int32_list
+
 
-class SmokeTest(unittest.TestCase):
-    @classmethod
-    def setUpClass(cls):
-        ishmem.init()
+def main() -> int:
+    version = ishmem.get_version()
+    expect_true("version vendor string is populated", bool(version.vendor_name))
+    expect_true("version strings are populated", bool(version.libishmem_version))
+    expect_equal("pre-init status", ishmem.init_status(), ishmem.InitStatus.UNINITIALIZED)
 
-    @classmethod
-    def tearDownClass(cls):
-        if ishmem.is_initialized():
-            ishmem.finalize()
+    ishmem.init()
+    try:
+        expect_equal("post-init status", ishmem.init_status(), ishmem.InitStatus.INITIALIZED)
+        expect_equal("n_pes", ishmem.n_pes(), 1)
+        expect_equal("my_pe", ishmem.my_pe(), 0)
 
-    def test_version_and_world_info(self):
         major, minor = ishmem.info_get_version()
-        self.assertGreaterEqual(major, 1)
-        self.assertGreaterEqual(minor, 0)
-        self.assertEqual(ishmem.n_pes(), 1)
-        self.assertEqual(ishmem.my_pe(), 0)
+        expect_true("OpenSHMEM spec major version", major >= 1)
+        expect_true("OpenSHMEM spec minor version", minor >= 0)
+        expect_true("library name", ishmem.info_get_name().startswith("Intel"))
 
-    def test_malloc_read_write_free(self):
         buf = ishmem.malloc(16)
-        buf.write(struct.pack("=4i", 1, 2, 3, 4))
-        self.assertEqual(struct.unpack("=4i", buf.read(16)), (1, 2, 3, 4))
-        ishmem.free(buf)
+        zeroed = ishmem.calloc(4, 4)
+        alias = ishmem.buffer(8, zero=True)
+        try:
+            buf.write(pack_int32(1, 2, 3, 4))
+            expect_equal("local put/get", unpack_int32_list(buf.read(16)), [1, 2, 3, 4])
+            expect_equal("calloc zeroed", zeroed.read(16), b"\x00" * 16)
+            expect_equal("buffer(zero=True)", alias.read(8), b"\x00" * 8)
+
+            host_value = bytearray(4)
+            ishmem.get(host_value, buf, pe=0, size=4, src_offset=8)
+            expect_equal("get into host buffer", unpack_int32(host_value), 3)
+
+            ishmem.put(buf, pack_int32(9), pe=0, dest_offset=4)
+            expect_equal("put alias", unpack_int32(buf.read(4, offset=4)), 9)
+
+            local_ptr = ishmem.ptr(buf, pe=0)
+            expect_true("ptr result type", local_ptr is None or local_ptr.size == buf.size)
+            if local_ptr is not None:
+                expect_equal("ptr pe", local_ptr.pe, 0)
+                expect_true("ptr address", int(local_ptr) != 0)
+
+            ishmem.barrier_all()
+        finally:
+            ishmem.free(alias)
+            ishmem.free(zeroed)
+            ishmem.free(buf)
+    finally:
+        ishmem.finalize()
 
-    def test_calloc_zero_initialized(self):
-        buf = ishmem.calloc(4, 4)
-        self.assertEqual(buf.read(16), b"\x00" * 16)
-        ishmem.free(buf)
+    expect_equal("post-finalize status", ishmem.init_status(), ishmem.InitStatus.FINALIZED)
+    return 0
 
 
 if __name__ == "__main__":
-    unittest.main()
+    sys.exit(main())
diff --git a/ishmem4py/test/utils.py b/ishmem4py/test/utils.py
new file mode 100644
index 0000000..2e97a9b
--- /dev/null
+++ b/ishmem4py/test/utils.py
@@ -0,0 +1,46 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import struct
+
+
+def expect_equal(label: str, actual, expected) -> None:
+    if actual != expected:
+        raise AssertionError(f"{label}: expected {expected!r}, got {actual!r}")
+
+
+def expect_true(label: str, condition: bool) -> None:
+    if not condition:
+        raise AssertionError(label)
+
+
+def pack_int32(*values: int) -> bytes:
+    return struct.pack(f"={len(values)}i", *values)
+
+
+def pack_uint32(*values: int) -> bytes:
+    return struct.pack(f"={len(values)}I", *values)
+
+
+def unpack_int32(data: bytes) -> int:
+    return struct.unpack("=i", data)[0]
+
+
+def unpack_uint32(data: bytes) -> int:
+    return struct.unpack("=I", data)[0]
+
+
+def unpack_int32_list(data: bytes) -> list[int]:
+    if len(data) % 4 != 0:
+        raise ValueError("int32 payload size must be a multiple of 4 bytes")
+    count = len(data) // 4
+    return list(struct.unpack(f"={count}i", data))
+
+
+def unpack_uint32_list(data: bytes) -> list[int]:
+    if len(data) % 4 != 0:
+        raise ValueError("uint32 payload size must be a multiple of 4 bytes")
+    count = len(data) // 4
+    return list(struct.unpack(f"={count}I", data))

From 6da4ba0c0cfe0ccbe5172046483f7a7b742f9943 Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Thu, 12 Mar 2026 20:00:19 +0000
Subject: [PATCH 08/17] Document validated ishmem4py bindings

---
 docs/source/conf.py        |   9 ++
 docs/source/index.rst      |   6 +
 docs/source/ishmem4py.rst  | 208 +++++++++++++++++++++++++++++++
 ishmem4py/README.md        | 247 +++++++++++--------------------------
 ishmem4py/python/README.md |  42 +++++--
 5 files changed, 328 insertions(+), 184 deletions(-)
 create mode 100644 docs/source/ishmem4py.rst

diff --git a/docs/source/conf.py b/docs/source/conf.py
index d60fa83..47df9cd 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -10,6 +10,8 @@
 # -- Packages ----------------------------------------------------------------
 
 import datetime
+import os
+import sys
 
 # -- Project information -----------------------------------------------------
 
@@ -24,9 +26,16 @@
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
+sys.path.insert(0, os.path.abspath('../../ishmem4py/python'))
+
 extensions = [
+    'sphinx.ext.autodoc',
+    'sphinx.ext.autosummary',
 ]
 
+autosummary_generate = False
+autodoc_member_order = 'bysource'
+
 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.
 # This pattern also affects html_static_path and html_extra_path.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 6520a62..bc509f3 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -45,6 +45,12 @@ Intel® SHMEM
    memory_ordering
    utility
 
+.. toctree::
+   :maxdepth: 2
+   :caption: Python Bindings
+
+   ishmem4py
+
 .. ---------------
 .. Team Management
 .. ---------------
diff --git a/docs/source/ishmem4py.rst b/docs/source/ishmem4py.rst
new file mode 100644
index 0000000..9fa134a
--- /dev/null
+++ b/docs/source/ishmem4py.rst
@@ -0,0 +1,208 @@
+.. _ishmem4py:
+
+--------------------
+ishmem4py Bindings
+--------------------
+
+``ishmem4py`` is the in-tree Python interface for Intel(R) SHMEM.
+It targets the same host-side workflow that the C and C++ Intel SHMEM APIs already support,
+while adopting a small, Python-oriented surface inspired by ``nvshmem4py`` where the models
+overlap.
+
+The current binding is intentionally synchronous and host-driven.
+It does not yet expose Intel SHMEM queue-based extensions or device-initiated Python APIs.
+
+.. currentmodule:: ishmem4py
+
+Import styles:
+
+.. code-block:: python
+
+   import ishmem4py as ishmem
+   # or
+   import ishmem4py.core as ishmem
+
+API Reference
+^^^^^^^^^^^^^
+
+Initialization and Queries
+""""""""""""""""""""""""""
+
+.. autosummary::
+
+   init
+   finalize
+   init_status
+   is_initialized
+   my_pe
+   n_pes
+   info_get_version
+   info_get_name
+   get_version
+   Version
+
+Memory Management
+"""""""""""""""""
+
+.. autosummary::
+
+   malloc
+   calloc
+   buffer
+   free
+   ptr
+   ishmem_ptr
+   SymmetricMemory
+   MemoryPointer
+
+Remote Memory Access
+""""""""""""""""""""
+
+.. autosummary::
+
+   put
+   get
+   putmem
+   getmem
+   fence
+   quiet
+
+Collectives
+"""""""""""
+
+.. autosummary::
+
+   barrier_all
+   sync_all
+   barrier
+   sync
+   broadcast
+   collect
+   fcollect
+   alltoall
+   reduce
+   reducescatter
+
+Teams
+"""""
+
+.. autosummary::
+
+   TEAM_WORLD
+   TEAM_SHARED
+   TEAM_INVALID
+   Team
+   team_my_pe
+   team_n_pes
+   team_translate_pe
+   team_sync
+
+Basic Example
+^^^^^^^^^^^^^
+
+.. literalinclude:: ../../ishmem4py/examples/init_fini.py
+   :language: python
+
+RMA Example
+^^^^^^^^^^^
+
+.. literalinclude:: ../../ishmem4py/examples/ring_put_get.py
+   :language: python
+
+Memory Model
+^^^^^^^^^^^^
+
+``ishmem4py`` keeps symmetric allocations as explicit Python objects instead of exposing the
+symmetric heap as a general Python buffer.
+That is deliberate:
+
+- Intel SHMEM symmetric memory may not be directly host-accessible in every runtime mode.
+- Explicit ``put`` and ``get`` calls are the portable host-side mechanism.
+- Python can still use ordinary local buffers such as ``bytes``, ``bytearray``, and
+  ``memoryview`` as the source or destination for host-initiated RMA.
+
+Example:
+
+.. code-block:: python
+
+   ishmem.init()
+   try:
+       buf = ishmem.malloc(16)
+       buf.write(b"abcd")
+
+       host = bytearray(16)
+       ishmem.get(host, buf, pe=ishmem.my_pe())
+   finally:
+       ishmem.free(buf)
+       ishmem.finalize()
+
+Collectives and Teams
+^^^^^^^^^^^^^^^^^^^^^
+
+World-team collectives are available directly from Python:
+
+.. code-block:: python
+
+   reduce_src = ishmem.malloc(4)
+   reduce_dst = ishmem.calloc(1, 4)
+
+   reduce_src.write((ishmem.my_pe() + 1).to_bytes(4, "little", signed=True))
+   ishmem.reduce("sum", reduce_dst, reduce_src, dtype="int32")
+
+``reducescatter`` is also available for API compatibility with ``nvshmem4py``.
+Intel SHMEM does not currently expose a host-side primitive for it, so ``ishmem4py`` implements
+that operation as a small software fallback on top of ``reduce``.
+
+The binding also exposes team objects and team query or synchronization routines.
+Those APIs follow Intel SHMEM handles rather than CUDA-object wrappers.
+
+Building and Installing
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Build Intel SHMEM with Python bindings enabled:
+
+.. code-block:: bash
+
+   cmake -S /path/to/ishmem -B /path/to/build -DBUILD_PYTHON_BINDINGS=ON ...
+   cmake --build /path/to/build --target ishmem4py -j4
+
+Install from the build tree:
+
+.. code-block:: bash
+
+   pip install /path/to/build/ishmem4py/python
+
+For editable development, install the source tree and point it at the built runtime:
+
+.. code-block:: bash
+
+   pip install -e /path/to/ishmem/ishmem4py/python
+   export ISHMEM4PY_RUNTIME_LIBRARY=/path/to/build/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
+
+Testing
+^^^^^^^
+
+Typical source-tree test environment:
+
+.. code-block:: bash
+
+   unset ISHMEM_DIR
+   export PYTHONPATH=/path/to/ishmem/ishmem4py/python:/path/to/ishmem/ishmem4py/test
+   export ISHMEM4PY_RUNTIME_LIBRARY=/path/to/build/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
+   export LD_LIBRARY_PATH=/path/to/openshmem/lib:$LD_LIBRARY_PATH
+   export ISHMEM_RUNTIME=OPENSHMEM
+
+Examples:
+
+.. code-block:: bash
+
+   mpiexec -n 1 ishmrun python3 /path/to/ishmem/ishmem4py/test/smoke_test.py
+   mpiexec -n 2 ishmrun python3 /path/to/ishmem/ishmem4py/test/ring_test.py
+   mpiexec -n 2 ishmrun python3 /path/to/ishmem/ishmem4py/test/collective_test.py
+
+Current Scope
+^^^^^^^^^^^^^
+
+The exported Python API is limited to functionality that has been validated in this environment.
+Host atomics and dynamic team-split helpers were intentionally removed from the public bindings
+after backend-level failures in the current OpenSHMEM runtime.
diff --git a/ishmem4py/README.md b/ishmem4py/README.md
index 56f73a7..7db42fe 100644
--- a/ishmem4py/README.md
+++ b/ishmem4py/README.md
@@ -1,147 +1,78 @@
 # ishmem4py
 
-`ishmem4py` is a new Python binding layer for Intel SHMEM being developed in-tree in the
-`ishmem` repository. The immediate target is a CPU-initiated MVP that can initialize Intel
-SHMEM, allocate symmetric memory, and perform basic `put` and `get` operations from Python.
-
-## Why A New Shim Layer
-
-Intel SHMEM's installed interface is not a drop-in target for a pure-Python FFI layer:
-
-- the public header is C++, not C
-- the installed library is a static archive (`libishmem.a`)
-- exported symbols are C++ mangled
-
-For `ishmem4py`, the binding strategy is therefore:
-
-1. add a very small C ABI runtime shim linked against Intel SHMEM
-2. load that shim from Python
-3. keep the Python layer thin and name APIs after Intel SHMEM/OpenSHMEM where possible
-
-This keeps the MVP simple and makes future growth predictable.
-
-## MVP Scope
-
-The first deliverable is intentionally narrow:
-
-- library setup and teardown
-  - `init`
-  - `finalize`
-  - `my_pe`
-  - `n_pes`
-  - `barrier_all`
-  - `quiet`
-- symmetric memory management
-  - `malloc`
-  - `calloc`
-  - `free`
-- CPU-initiated RMA
-  - `putmem`
-  - `getmem`
-- Python memory object support sufficient to:
-  - allocate symmetric memory from Python
-  - keep symmetric allocations as explicit handles
-  - stage local data in and out from Python buffers
-  - run multi-PE ring-style smoke tests
-
-Out of scope for the initial MVP:
-
-- device-initiated communication
-- SYCL queue or work-group extensions
-- full typed API coverage (`ishmem_int_put`, `ishmem_float_get`, and similar)
-- team management beyond world-team queries
-- atomics, signaling, reductions, collectives, or wait/test routines
-
-## Memory Model For The MVP
-
-The MVP treats symmetric allocations as opaque Python objects, not as directly dereferenceable
-Python buffers.
-
-That choice matches Intel SHMEM's default memory model more closely:
-
-- the symmetric heap may reside in device memory
-- host loads and stores are not generally valid on symmetric heap pointers
-- host `putmem` and `getmem` are the portable way to move data between Python-visible memory
-  and symmetric objects
-
-In practice, the MVP uses normal Python buffer-protocol objects (`bytes`, `bytearray`,
-`memoryview`, and similar) as the local source or destination for host-initiated RMA calls.
-
-`ISHMEM_ENABLE_ACCESSIBLE_HOST_HEAP=1` remains useful as an optional mode for debugging or
-future richer views, but it is not required by the first implementation.
-
-## Planned Layout
-
-The implementation is expected to live under `ishmem/ishmem4py/` with roughly this structure:
-
-```text
-ishmem4py/
-  CMakeLists.txt
-  README.md
-  csrc/
-    ishmem4py_runtime.cpp
-    ishmem4py_runtime.h
-  python/
-    ishmem4py/
-      __init__.py
-      _lib.py
-      core.py
-  test/
-    smoke_test.py
-    ring_test.py
-  examples/
-    init_fini.py
-    ring_put_get.py
-```
-
-## API Shape
+`ishmem4py` is the in-tree Python binding layer for Intel SHMEM.
 
-The Python API will stay close to Intel SHMEM/OpenSHMEM naming at the low level.
+The binding stays close to Intel SHMEM/OpenSHMEM naming where that produces a clear Python API,
+but it also mirrors the high-level shape of `nvshmem4py` where practical:
 
-Expected MVP entry points:
-
-```python
-import ishmem4py as ishmem
+- synchronous host-side `put` / `get`
+- symmetric-memory handles managed from Python
+- world/team queries and team objects
+- host collectives over symmetric buffers
 
-ishmem.init()
-pe = ishmem.my_pe()
-npes = ishmem.n_pes()
+The current implementation is intentionally host-driven. Stream-based and device-initiated APIs
+remain future work.
 
-buf = ishmem.malloc(1024)
-zeros = ishmem.calloc(256, 4)
+## Public API
 
-ishmem.putmem(remote_buf, local_buf, pe=1)
-ishmem.getmem(local_buf, remote_buf, pe=1)
+Main imports:
 
-ishmem.barrier_all()
-ishmem.quiet()
-ishmem.free(buf)
-ishmem.finalize()
+```python
+import ishmem4py as ishmem
+# or
+import ishmem4py.core as ishmem
 ```
 
-The Python `SymmetricMemory` object exposes the symmetric address and allocation size, along with
-convenience `read()` and `write()` helpers implemented in terms of same-PE `getmem` and `putmem`.
-
-## Build Integration Plan
+Primary entry points:
+
+- setup and queries
+  - `init`, `finalize`, `init_status`, `is_initialized`
+  - `my_pe`, `n_pes`
+  - `info_get_version`, `info_get_name`, `get_version`
+- symmetric memory
+  - `malloc`, `calloc`, `buffer`, `free`
+  - `SymmetricMemory`
+  - `ptr`, `ishmem_ptr`
+- RMA
+  - `put`, `get`, `quiet`, `fence`
+  - `putmem`, `getmem` compatibility aliases
+- collectives
+  - `barrier_all`, `sync_all`
+  - `barrier`, `sync`
+  - `broadcast`, `collect`, `fcollect`, `alltoall`, `reduce`, `reducescatter`
+  - `reducescatter` is implemented as a software fallback on top of `reduce`
+- teams
+  - `TEAM_WORLD`, `TEAM_SHARED`, `TEAM_INVALID`
+  - `Team`
+  - `team_my_pe`, `team_n_pes`, `team_translate_pe`, `team_sync`
+
+Supported reduction dtypes:
+
+- `int32`
+- `int64`
+- `uint32`
+- `uint64`
+- `float32` and `float64` for arithmetic reductions
+
+## Design Notes
+
+`ishmem4py` keeps symmetric allocations as explicit Python objects instead of exposing the
+symmetric heap as a direct Python buffer. That matches Intel SHMEM more closely than pretending
+every symmetric allocation is safely CPU-dereferenceable.
+
+Data movement is therefore centered on ordinary Python buffers and explicit SHMEM operations:
 
-The preferred integration is an optional top-level CMake target, for example:
+```python
+buf = ishmem.malloc(16)
+buf.write(b"abcd")
 
-```text
--DBUILD_PYTHON_BINDINGS=ON
+host = bytearray(16)
+ishmem.get(host, buf, pe=0)
 ```
 
-Expected responsibilities:
-
-- build the C ABI runtime shim with `icpx`
-- link it against the existing Intel SHMEM library objects or installed library
-- place the resulting shared library beside the Python package
-- keep the rest of the core Intel SHMEM build unchanged when Python bindings are disabled
-
-## Current Build And Test Flow
-
-The current branch has been validated with an in-tree build and source-tree Python package.
+## Build
 
-Build:
+Typical build:
 
 ```bash
 source /opt/intel/oneapi/setvars.sh
@@ -162,65 +93,33 @@ cmake -S /docker-mount/ishmem \
 cmake --build /docker-mount/ishmem/build-ishmem4py-icpx-noaot --target ishmem4py -j4
 ```
 
-Run environment:
+## Run / Test Environment
+
+Source-tree development:
 
 ```bash
 unset ISHMEM_DIR
-export LD_LIBRARY_PATH=/home/xiii/pkg/SOS-2026-03-06/lib:$LD_LIBRARY_PATH
-export PYTHONPATH=/docker-mount/ishmem/ishmem4py/python
+export PYTHONPATH=/docker-mount/ishmem/ishmem4py/python:/docker-mount/ishmem/ishmem4py/test
 export ISHMEM4PY_RUNTIME_LIBRARY=/docker-mount/ishmem/build-ishmem4py-icpx-noaot/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
+export LD_LIBRARY_PATH=/home/xiii/pkg/SOS-2026-03-06/lib:$LD_LIBRARY_PATH
 export ISHMEM_RUNTIME=OPENSHMEM
 ```
 
-Smoke test:
+Example test runs:
 
 ```bash
 mpiexec -n 1 /docker-mount/ishmem/scripts/ishmrun \
   python3 /docker-mount/ishmem/ishmem4py/test/smoke_test.py
-```
-
-Two-PE ring test:
 
-```bash
 mpiexec -n 2 /docker-mount/ishmem/scripts/ishmrun \
   python3 /docker-mount/ishmem/ishmem4py/test/ring_test.py
-```
-
-Important notes:
-
-- Running without `ishmrun` is not reliable on multi-device nodes because Intel SHMEM needs the
-  normal PE-to-device mapping.
-- The current Intel SHMEM lifecycle should be treated as one-shot per process: initialize once,
-  finalize once, and do not attempt to reinitialize after finalization.
-
-## Test Plan
 
-The MVP needs both local smoke coverage and multi-PE verification:
-
-- single-PE smoke test
-  - init/finalize
-  - malloc/calloc/free
-  - basic local typed view checks
-- two-PE RMA test
-  - each PE allocates symmetric source and destination buffers
-  - PE `i` writes a known pattern
-  - `putmem` and `getmem` exchange data around a ring
-  - barriers validate global completion and correctness
-
-Tests should be runnable through the same launcher expectations as the rest of `ishmem`.
-
-## Known Risks / Open Questions
-
-- Actual runtime verification depends on available Intel GPU + Intel SHMEM backend environment.
-- The installed Intel SHMEM package is static-only, so distribution/packaging needs extra care.
-- Full typed API coverage will likely benefit from code generation once the MVP lands.
-
-## Short-Term Deliverables
+mpiexec -n 2 /docker-mount/ishmem/scripts/ishmrun \
+  python3 /docker-mount/ishmem/ishmem4py/test/collective_test.py
+```
 
-Within the current time box, the target is:
+## Current Scope
 
-1. land the build scaffolding and written plan
-2. implement the runtime shim for the MVP API set
-3. add the thin Python package
-4. add at least one example and basic tests
-5. verify as much as the current environment permits and document any gaps
+The public Python surface is limited to functionality that has been validated in this environment.
+Host atomics and dynamic team-split helpers were intentionally removed from the exported API after
+backend-level failures in the current OpenSHMEM runtime.
diff --git a/ishmem4py/python/README.md b/ishmem4py/python/README.md
index 54c38a8..b42c56d 100644
--- a/ishmem4py/python/README.md
+++ b/ishmem4py/python/README.md
@@ -1,31 +1,53 @@
 # ishmem4py Python Package
 
-This directory supports two install modes.
+`ishmem4py` supports two install modes.
 
-## Standard install
+## Standard Install
 
-Build `ishmem` with `-DBUILD_PYTHON_BINDINGS=ON`, then install the package from the build tree:
+Build `ishmem` with `-DBUILD_PYTHON_BINDINGS=ON`, then install from the build tree:
 
 ```bash
 pip install /path/to/ishmem-build/ishmem4py/python
 ```
 
-That build-tree package includes `_ishmem4py_runtime.so`, so `ishmem4py` can load its own runtime
-without `ISHMEM4PY_RUNTIME_LIBRARY`.
+That package directory includes `_ishmem4py_runtime.so`, so `ISHMEM4PY_RUNTIME_LIBRARY` is not
+needed.
 
-## Editable / dev install
+## Editable / Dev Install
 
-Install the source tree in editable mode when you are actively changing the Python code:
+For active development:
 
 ```bash
 pip install -e /path/to/ishmem-src/ishmem4py/python
 ```
 
-For editable installs, point the package at the runtime produced by your CMake build:
+Point the editable package at the CMake-built runtime:
 
 ```bash
 export ISHMEM4PY_RUNTIME_LIBRARY=/path/to/ishmem-build/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
 ```
 
-In both modes, external runtime dependencies such as oneAPI, PTI, and the OpenSHMEM backend still
-need to be visible through your normal environment setup.
+## Runtime Environment
+
+The package still depends on the surrounding Intel SHMEM runtime setup:
+
+```bash
+source /opt/intel/oneapi/setvars.sh
+unset ISHMEM_DIR
+export ISHMEM_RUNTIME=OPENSHMEM
+export LD_LIBRARY_PATH=/path/to/openshmem/lib:$LD_LIBRARY_PATH
+```
+
+For source-tree testing:
+
+```bash
+export PYTHONPATH=/path/to/ishmem-src/ishmem4py/python:/path/to/ishmem-src/ishmem4py/test
+```
+
+## Notes
+
+- `import ishmem4py` and `import ishmem4py.core` both expose the full public API.
+- The runtime library is loaded lazily, so importing the package for documentation or static
+  inspection does not require a live Intel SHMEM runtime.
+- The exported API is limited to functionality that has been validated in the current
+  OpenSHMEM-backed test environment.

From 099a6b606f8a14a1a28c0ee7dfd1118e2f79607f Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Thu, 12 Mar 2026 21:50:27 +0000
Subject: [PATCH 09/17] ishmem4py: add torch xpu symmetric tensors

---
 CMakeLists.txt                           |   2 +
 ishmem4py/CMakeLists.txt                 |  54 ++++++
 ishmem4py/csrc/ishmem4py_torch.cpp       |  68 +++++++
 ishmem4py/python/ishmem4py/_common.py    |  42 ++++-
 ishmem4py/python/ishmem4py/core.py       |  32 ++++
 ishmem4py/python/ishmem4py/torch.py      | 227 +++++++++++++++++++++++
 ishmem4py/python/pyproject.toml          |   3 +
 ishmem4py/test/torch_ring_tensor_test.py |  57 ++++++
 ishmem4py/test/torch_tensor_test.py      |  52 ++++++
 9 files changed, 530 insertions(+), 7 deletions(-)
 create mode 100644 ishmem4py/csrc/ishmem4py_torch.cpp
 create mode 100644 ishmem4py/python/ishmem4py/torch.py
 create mode 100644 ishmem4py/test/torch_ring_tensor_test.py
 create mode 100644 ishmem4py/test/torch_tensor_test.py

diff --git a/CMakeLists.txt b/CMakeLists.txt
index ab6e2ba..add742b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -28,6 +28,7 @@ option(BUILD_PERF_TESTS "Build performance tests" FALSE)
 option(BUILD_EXAMPLES "Build examples" FALSE)
 option(BUILD_APPS "Build apps" FALSE)
 option(BUILD_PYTHON_BINDINGS "Build Python bindings" FALSE)
+option(ISHMEM4PY_BUILD_TORCH_INTEROP "Build optional Torch/XPU interop for ishmem4py" FALSE)
 option(BUILD_CMAKE_CONFIG "Build CMake config files" TRUE)
 option(ENABLE_ERROR_CHECKING "Validate API inputs" FALSE)
 option(ENABLE_DLMALLOC "Enable dlmalloc for shared heap" TRUE)
@@ -81,6 +82,7 @@ message(STATUS "Build performance tests:     ${BUILD_PERF_TESTS}")
 message(STATUS "Build examples:              ${BUILD_EXAMPLES}")
 message(STATUS "Build apps:                  ${BUILD_APPS}")
 message(STATUS "Build Python bindings:       ${BUILD_PYTHON_BINDINGS}")
+message(STATUS "ishmem4py Torch interop:     ${ISHMEM4PY_BUILD_TORCH_INTEROP}")
 message(STATUS "Build CMake configs:         ${BUILD_CMAKE_CONFIG}")
 message(STATUS "Enable input validation:     ${ENABLE_ERROR_CHECKING}")
 message(STATUS "Enable dlmalloc:             ${ENABLE_DLMALLOC}")
diff --git a/ishmem4py/CMakeLists.txt b/ishmem4py/CMakeLists.txt
index f32b2e5..0de2aae 100644
--- a/ishmem4py/CMakeLists.txt
+++ b/ishmem4py/CMakeLists.txt
@@ -56,11 +56,65 @@ set_target_properties(ishmem4py-runtime PROPERTIES
     RUNTIME_OUTPUT_DIRECTORY "${ISHMEM4PY_PKG_BUILD_DIR}"
     POSITION_INDEPENDENT_CODE 1)
 
+if (ISHMEM4PY_BUILD_TORCH_INTEROP)
+    find_package(Python3 COMPONENTS Development.Module REQUIRED)
+
+    execute_process(
+        COMMAND ${Python3_EXECUTABLE} -c "import torch; print(torch.utils.cmake_prefix_path)"
+        OUTPUT_VARIABLE TORCH_CMAKE_PREFIX
+        OUTPUT_STRIP_TRAILING_WHITESPACE
+        RESULT_VARIABLE TORCH_PREFIX_RESULT)
+    if (NOT TORCH_PREFIX_RESULT EQUAL 0)
+        message(FATAL_ERROR
+            "ISHMEM4PY_BUILD_TORCH_INTEROP=ON requires PyTorch to be importable from "
+            "Python3_EXECUTABLE (${Python3_EXECUTABLE})")
+    endif()
+
+    execute_process(
+        COMMAND ${Python3_EXECUTABLE} -c
+                "import torch.utils.cpp_extension as ce; print(ce.library_paths()[0])"
+        OUTPUT_VARIABLE TORCH_LIBRARY_DIR
+        OUTPUT_STRIP_TRAILING_WHITESPACE
+        RESULT_VARIABLE TORCH_LIBRARY_RESULT)
+    if (NOT TORCH_LIBRARY_RESULT EQUAL 0)
+        message(FATAL_ERROR "failed to determine the PyTorch library directory")
+    endif()
+
+    list(PREPEND CMAKE_PREFIX_PATH "${TORCH_CMAKE_PREFIX}")
+    find_package(Torch REQUIRED)
+
+    add_library(ishmem4py-torch MODULE
+        csrc/ishmem4py_torch.cpp)
+
+    target_include_directories(ishmem4py-torch PRIVATE
+        "${CMAKE_CURRENT_SOURCE_DIR}/csrc")
+    target_link_directories(ishmem4py-torch PRIVATE "${TORCH_LIBRARY_DIR}")
+    target_link_libraries(ishmem4py-torch PRIVATE
+        Python3::Module
+        ishmem4py-runtime
+        torch_python
+        ${TORCH_LIBRARIES})
+    target_compile_features(ishmem4py-torch PRIVATE cxx_std_17)
+    target_compile_options(ishmem4py-torch PRIVATE ${TORCH_CXX_FLAGS})
+    set_target_properties(ishmem4py-torch PROPERTIES
+        OUTPUT_NAME "_ishmem4py_torch"
+        PREFIX ""
+        BUILD_RPATH "${TORCH_LIBRARY_DIR}"
+        LIBRARY_OUTPUT_DIRECTORY "${ISHMEM4PY_PKG_BUILD_DIR}"
+        RUNTIME_OUTPUT_DIRECTORY "${ISHMEM4PY_PKG_BUILD_DIR}"
+        POSITION_INDEPENDENT_CODE 1)
+endif()
+
 add_custom_target(ishmem4py ALL
     DEPENDS ishmem4py-package-files ishmem4py-runtime)
 
+if (ISHMEM4PY_BUILD_TORCH_INTEROP)
+    add_dependencies(ishmem4py ishmem4py-torch)
+endif()
+
 message(STATUS "\n======= ishmem4py summary ========")
 message(STATUS "Python interpreter:          ${Python3_EXECUTABLE}")
 message(STATUS "Python package root:         ${ISHMEM4PY_PYTHON_BUILD_ROOT}")
 message(STATUS "Runtime library output:      ${ISHMEM4PY_PKG_BUILD_DIR}")
+message(STATUS "Torch/XPU interop enabled:   ${ISHMEM4PY_BUILD_TORCH_INTEROP}")
 message(STATUS "==================================\n")
diff --git a/ishmem4py/csrc/ishmem4py_torch.cpp b/ishmem4py/csrc/ishmem4py_torch.cpp
new file mode 100644
index 0000000..38be51c
--- /dev/null
+++ b/ishmem4py/csrc/ishmem4py_torch.cpp
@@ -0,0 +1,68 @@
+/* Copyright (C) 2026 Intel Corporation
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#include "ishmem4py_runtime.h"
+
+#include <ATen/ATen.h>
+#include <torch/extension.h>
+
+#include <functional>
+#include <numeric>
+#include <vector>
+
+namespace {
+
+size_t ishmem4py_numel(c10::IntArrayRef sizes)
+{
+    return std::accumulate(sizes.begin(), sizes.end(), static_cast<size_t>(1),
+                           std::multiplies<size_t>());
+}
+
+at::Tensor ishmem4py_wrap_tensor(void *ptr, c10::IntArrayRef sizes, c10::ScalarType scalar_type,
+                                 c10::DeviceIndex device_index)
+{
+    const auto device = c10::Device(c10::DeviceType::XPU, device_index);
+    const auto options = at::TensorOptions().dtype(scalar_type).device(device);
+    return at::for_blob(ptr, sizes).options(options).target_device(device).make_tensor();
+}
+
+}  // namespace
+
+at::Tensor ishmem4py_alloc_tensor(const std::vector<int64_t> &sizes, int64_t scalar_type,
+                                  int64_t device_index)
+{
+    TORCH_CHECK(device_index >= 0, "device index must be >= 0");
+    const auto dtype = static_cast<c10::ScalarType>(scalar_type);
+    const auto nbytes = ishmem4py_numel(sizes) * c10::elementSize(dtype);
+    if (nbytes == 0) {
+        return at::empty(sizes, at::TensorOptions().dtype(dtype).device(c10::DeviceType::XPU,
+                                                                        device_index));
+    }
+
+    void *ptr = ishmem4py_malloc(nbytes);
+    TORCH_CHECK(ptr != nullptr, "ishmem_malloc(", nbytes, ") returned NULL");
+    return ishmem4py_wrap_tensor(ptr, sizes, dtype, static_cast<c10::DeviceIndex>(device_index));
+}
+
+at::Tensor ishmem4py_tensor_from_ptr(uint64_t ptr, const std::vector<int64_t> &sizes,
+                                     int64_t scalar_type, int64_t device_index)
+{
+    TORCH_CHECK(device_index >= 0, "device index must be >= 0");
+    return ishmem4py_wrap_tensor(reinterpret_cast<void *>(ptr), sizes,
+                                 static_cast<c10::ScalarType>(scalar_type),
+                                 static_cast<c10::DeviceIndex>(device_index));
+}
+
+void ishmem4py_free_ptr(uint64_t ptr)
+{
+    if (ptr == 0) return;
+    ishmem4py_free(reinterpret_cast<void *>(ptr));
+}
+
+PYBIND11_MODULE(_ishmem4py_torch, m)
+{
+    m.def("alloc_tensor", &ishmem4py_alloc_tensor);
+    m.def("tensor_from_ptr", &ishmem4py_tensor_from_ptr);
+    m.def("free_ptr", &ishmem4py_free_ptr);
+}
diff --git a/ishmem4py/python/ishmem4py/_common.py b/ishmem4py/python/ishmem4py/_common.py
index 64efca4..17f987a 100644
--- a/ishmem4py/python/ishmem4py/_common.py
+++ b/ishmem4py/python/ishmem4py/_common.py
@@ -287,10 +287,17 @@ def _check_can_init() -> None:
 def _check_can_finalize() -> None:
     if not _initialized:
         raise IshmemStateError("Intel SHMEM is not initialized")
-    if _live_allocations:
-        leaked = ", ".join(
-            f"0x{symm.ptr:x}" for symm in sorted(_live_allocations.values(), key=lambda item: item.ptr)
-        )
+    leaked_ptrs = [
+        f"0x{symm.ptr:x}" for symm in sorted(_live_allocations.values(), key=lambda item: item.ptr)
+    ]
+    try:
+        from . import torch as _ishmem_torch
+
+        leaked_ptrs.extend(f"0x{ptr:x}" for ptr in _ishmem_torch._live_tensor_pointers())
+    except Exception:
+        pass
+    if leaked_ptrs:
+        leaked = ", ".join(leaked_ptrs)
         raise IshmemStateError(
             "cannot finalize while symmetric allocations are still live; free them first: " f"{leaked}"
         )
@@ -342,6 +349,15 @@ def _pointer_from_local_buffer(obj, *, writable: bool, offset: int = 0) -> _Poin
         size = _normalize_span(obj, size=None, offset=offset)
         return _PointerInfo(ptr=obj.ptr + offset, size=size, keepalive=obj)
 
+    try:
+        from . import torch as _ishmem_torch
+
+        tensor_info = _ishmem_torch._pointer_from_tensor(obj, writable=writable, offset=offset)
+        if tensor_info is not None:
+            return tensor_info
+    except Exception:
+        pass
+
     mv = _ensure_contiguous_memoryview(obj, writable=writable)
 
     if offset < 0:
@@ -359,9 +375,21 @@ def _pointer_from_local_buffer(obj, *, writable: bool, offset: int = 0) -> _Poin
     return _PointerInfo(ptr=ctypes.addressof(raw) + offset, size=mv.nbytes - offset, keepalive=(mv, raw))
 
 
-def _symmetric_target(obj) -> SymmetricMemory:
-    _require_active(obj)
-    return obj
+def _symmetric_target(obj):
+    if isinstance(obj, SymmetricMemory):
+        _require_active(obj)
+        return obj
+
+    try:
+        from . import torch as _ishmem_torch
+
+        target = _ishmem_torch._symmetric_tensor_target(obj)
+        if target is not None:
+            return target
+    except Exception:
+        pass
+
+    raise TypeError("expected a SymmetricMemory instance or a live ishmem4py symmetric tensor")
 
 
 def _check_status(status: int | None, opname: str) -> None:
diff --git a/ishmem4py/python/ishmem4py/core.py b/ishmem4py/python/ishmem4py/core.py
index f27c926..dff7c88 100644
--- a/ishmem4py/python/ishmem4py/core.py
+++ b/ishmem4py/python/ishmem4py/core.py
@@ -44,3 +44,35 @@
 __all__ += memory.__all__
 __all__ += rma.__all__
 __all__ += teams.__all__
+
+
+def tensor(*args, **kwargs):
+    from . import torch as torch_interop
+
+    return torch_interop.tensor(*args, **kwargs)
+
+
+def free_tensor(*args, **kwargs):
+    from . import torch as torch_interop
+
+    return torch_interop.free_tensor(*args, **kwargs)
+
+
+def tensor_base(*args, **kwargs):
+    from . import torch as torch_interop
+
+    return torch_interop.tensor_base(*args, **kwargs)
+
+
+def is_symmetric_tensor(*args, **kwargs):
+    from . import torch as torch_interop
+
+    return torch_interop.is_symmetric_tensor(*args, **kwargs)
+
+
+__all__ += [
+    "free_tensor",
+    "is_symmetric_tensor",
+    "tensor",
+    "tensor_base",
+]
diff --git a/ishmem4py/python/ishmem4py/torch.py b/ishmem4py/python/ishmem4py/torch.py
new file mode 100644
index 0000000..8574e54
--- /dev/null
+++ b/ishmem4py/python/ishmem4py/torch.py
@@ -0,0 +1,227 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Optional Torch/XPU interoperability helpers for ``ishmem4py``."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Iterable
+
+from ._common import IshmemStateError, _PointerInfo, _require_initialized
+
+try:
+    import torch as _torch
+except Exception:
+    _torch = None
+
+try:
+    from . import _ishmem4py_torch as _TORCH_EXT
+except Exception:
+    _TORCH_EXT = None
+
+__all__ = [
+    "free_tensor",
+    "is_symmetric_tensor",
+    "tensor",
+    "tensor_base",
+]
+
+
+_DTYPE_TO_SCALAR_TYPE = {
+    _torch.uint8 if _torch is not None else object(): 0,
+    _torch.int8 if _torch is not None else object(): 1,
+    _torch.int16 if _torch is not None else object(): 2,
+    _torch.int32 if _torch is not None else object(): 3,
+    _torch.int64 if _torch is not None else object(): 4,
+    _torch.float16 if _torch is not None else object(): 5,
+    _torch.float32 if _torch is not None else object(): 6,
+    _torch.float64 if _torch is not None else object(): 7,
+    _torch.bool if _torch is not None else object(): 11,
+    _torch.bfloat16 if _torch is not None else object(): 15,
+}
+
+
+@dataclass
+class _TensorAllocation:
+    tensor: "_torch.Tensor"
+    ptr: int
+    size_bytes: int
+
+
+_live_tensor_allocations: dict[int, _TensorAllocation] = {}
+
+
+def _require_torch() -> None:
+    if _torch is None:
+        raise RuntimeError("Torch/XPU interop requires PyTorch to be installed")
+    if _TORCH_EXT is None:
+        raise RuntimeError(
+            "Torch/XPU interop is unavailable in this ishmem4py build; rebuild ishmem with "
+            "-DISHMEM4PY_BUILD_TORCH_INTEROP=ON using a Python executable from a PyTorch XPU environment"
+        )
+
+
+def _normalize_shape(size) -> tuple[int, ...]:
+    if isinstance(size, int):
+        return (int(size),)
+    if isinstance(size, Iterable):
+        return tuple(int(dim) for dim in size)
+    raise TypeError("size must be an int or an iterable of ints")
+
+
+def _normalize_device(device) -> "_torch.device":
+    if _torch is None:
+        raise RuntimeError("PyTorch is not available")
+    if device is None:
+        return _torch.device("xpu", _torch.xpu.current_device())
+    normalized = _torch.device(device)
+    if normalized.type != "xpu":
+        raise ValueError(f"ishmem4py.torch.tensor only supports XPU devices, got {normalized}")
+    if normalized.index is None:
+        return _torch.device("xpu", _torch.xpu.current_device())
+    return normalized
+
+
+def _tensor_nbytes(tensor: "_torch.Tensor") -> int:
+    return int(tensor.numel()) * int(tensor.element_size())
+
+
+def _register_tensor(base: "_torch.Tensor") -> None:
+    base_ptr = int(base.data_ptr())
+    _live_tensor_allocations[base_ptr] = _TensorAllocation(
+        tensor=base,
+        ptr=base_ptr,
+        size_bytes=_tensor_nbytes(base),
+    )
+    base._ishmem_alloc = True  # type: ignore[attr-defined]
+    base._ishmem_base = base  # type: ignore[attr-defined]
+
+
+def _allocation_from_base(base: "_torch.Tensor") -> _TensorAllocation:
+    allocation = _live_tensor_allocations.get(int(base.data_ptr()))
+    if allocation is None:
+        raise IshmemStateError("symmetric tensor allocation is not live")
+    return allocation
+
+
+def _find_allocation_for_tensor(tensor: "_torch.Tensor") -> _TensorAllocation | None:
+    ptr = int(tensor.data_ptr())
+    size_bytes = _tensor_nbytes(tensor)
+    for allocation in _live_tensor_allocations.values():
+        start = allocation.ptr
+        end = allocation.ptr + allocation.size_bytes
+        if ptr < start:
+            continue
+        if ptr > end:
+            continue
+        if size_bytes > 0 and ptr + size_bytes > end:
+            continue
+        return allocation
+    return None
+
+
+def _assert_contiguous_tensor(tensor: "_torch.Tensor") -> None:
+    if tensor.numel() != 0 and not tensor.is_contiguous():
+        raise ValueError("tensor must be contiguous")
+
+
+def _pointer_from_tensor(tensor, *, writable: bool, offset: int = 0) -> _PointerInfo | None:
+    if _torch is None or not isinstance(tensor, _torch.Tensor):
+        return None
+    _assert_contiguous_tensor(tensor)
+    size_bytes = _tensor_nbytes(tensor)
+    if offset < 0:
+        raise ValueError("offset must be >= 0")
+    if offset > size_bytes:
+        raise ValueError("offset is past the end of the tensor")
+    if writable and not tensor.is_leaf and tensor.requires_grad:
+        raise ValueError("destination tensor must be writable")
+    return _PointerInfo(ptr=int(tensor.data_ptr()) + offset, size=size_bytes - offset, keepalive=tensor)
+
+
+def _symmetric_tensor_target(obj) -> _PointerInfo | None:
+    if _torch is None or not isinstance(obj, _torch.Tensor):
+        return None
+    _assert_contiguous_tensor(obj)
+    allocation = _find_allocation_for_tensor(obj)
+    if allocation is None:
+        return None
+    return _PointerInfo(ptr=int(obj.data_ptr()), size=_tensor_nbytes(obj), keepalive=obj)
+
+
+def _live_tensor_pointers() -> list[int]:
+    return sorted(_live_tensor_allocations)
+
+
+def tensor(
+    size,
+    *,
+    dtype=None,
+    device=None,
+    requires_grad: bool = False,
+):
+    """Allocate a symmetric XPU tensor from the Intel SHMEM heap."""
+    _require_initialized()
+    _require_torch()
+    shape = _normalize_shape(size)
+    if dtype is None:
+        dtype = _torch.get_default_dtype()
+    scalar_type = _DTYPE_TO_SCALAR_TYPE.get(dtype)
+    if scalar_type is None:
+        raise TypeError(f"unsupported torch dtype for ishmem4py.torch.tensor: {dtype}")
+
+    normalized_device = _normalize_device(device)
+    with _torch.xpu.device(normalized_device):
+        result = _TORCH_EXT.alloc_tensor(shape, scalar_type, normalized_device.index)
+
+    if result.numel() > 0:
+        _register_tensor(result)
+    else:
+        result._ishmem_alloc = True  # type: ignore[attr-defined]
+        result._ishmem_base = result  # type: ignore[attr-defined]
+    if requires_grad:
+        result.requires_grad_(True)
+    return result
+
+
+def tensor_base(tensor):
+    """Return the base symmetric allocation for ``tensor`` when available."""
+    if _torch is None or not isinstance(tensor, _torch.Tensor):
+        raise TypeError("expected a torch.Tensor")
+    base = getattr(tensor, "_ishmem_base", None)
+    if base is not None:
+        return base
+    allocation = _find_allocation_for_tensor(tensor)
+    if allocation is None:
+        return tensor
+    return allocation.tensor
+
+
+def is_symmetric_tensor(tensor) -> bool:
+    """Return ``True`` when ``tensor`` references live symmetric memory."""
+    if _torch is None or not isinstance(tensor, _torch.Tensor):
+        return False
+    return _find_allocation_for_tensor(tensor) is not None or bool(
+        getattr(tensor, "_ishmem_alloc", False)
+    )
+
+
+def free_tensor(tensor) -> None:
+    """Free a symmetric tensor previously returned by :func:`tensor`."""
+    _require_initialized()
+    _require_torch()
+    if not isinstance(tensor, _torch.Tensor):
+        raise TypeError("expected a torch.Tensor")
+    base = tensor_base(tensor)
+    if not getattr(base, "_ishmem_alloc", False):
+        raise IshmemStateError("tensor is not backed by ishmem4py symmetric memory")
+    base_ptr = int(base.data_ptr())
+    allocation = _live_tensor_allocations.pop(base_ptr, None)
+    if allocation is None:
+        if base_ptr == 0:
+            base._ishmem_alloc = False  # type: ignore[attr-defined]
+            return
+        raise IshmemStateError("symmetric tensor allocation has already been freed")
+    _TORCH_EXT.free_ptr(base_ptr)
+    allocation.tensor._ishmem_alloc = False  # type: ignore[attr-defined]
diff --git a/ishmem4py/python/pyproject.toml b/ishmem4py/python/pyproject.toml
index 3243a03..899e5ed 100644
--- a/ishmem4py/python/pyproject.toml
+++ b/ishmem4py/python/pyproject.toml
@@ -31,4 +31,7 @@ ishmem4py = [
   "_ishmem4py_runtime*.so",
   "_ishmem4py_runtime*.dylib",
   "_ishmem4py_runtime*.pyd",
+  "_ishmem4py_torch*.so",
+  "_ishmem4py_torch*.dylib",
+  "_ishmem4py_torch*.pyd",
 ]
diff --git a/ishmem4py/test/torch_ring_tensor_test.py b/ishmem4py/test/torch_ring_tensor_test.py
new file mode 100644
index 0000000..d5523f5
--- /dev/null
+++ b/ishmem4py/test/torch_ring_tensor_test.py
@@ -0,0 +1,57 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import sys
+
+import torch
+
+import ishmem4py as ishmem
+
+from utils import expect_equal
+
+
+def main() -> int:
+    ishmem.init()
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        if npes < 2:
+            raise RuntimeError("torch_ring_tensor_test.py requires at least 2 PEs")
+
+        next_pe = (my_pe + 1) % npes
+        prev_pe = (my_pe + npes - 1) % npes
+
+        src = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        dst = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        recv = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        try:
+            src.fill_(float(my_pe + 10))
+            dst.zero_()
+            recv.zero_()
+            torch.xpu.synchronize()
+
+            ishmem.barrier_all()
+
+            ishmem.put(dst, src, pe=next_pe)
+
+            ishmem.barrier_all()
+            torch.xpu.synchronize()
+            expect_equal("symmetric tensor put ring", dst.cpu().tolist(), [float(prev_pe + 10)] * 4)
+
+            ishmem.get(recv, src, pe=next_pe)
+            torch.xpu.synchronize()
+            expect_equal("symmetric tensor get ring", recv.cpu().tolist(), [float(next_pe + 10)] * 4)
+        finally:
+            ishmem.free_tensor(recv)
+            ishmem.free_tensor(dst)
+            ishmem.free_tensor(src)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/ishmem4py/test/torch_tensor_test.py b/ishmem4py/test/torch_tensor_test.py
new file mode 100644
index 0000000..532ca16
--- /dev/null
+++ b/ishmem4py/test/torch_tensor_test.py
@@ -0,0 +1,52 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import sys
+
+import torch
+
+import ishmem4py as ishmem
+
+from utils import expect_equal, expect_true
+
+
+def main() -> int:
+    ishmem.init()
+    try:
+        tensor = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        mirror = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        try:
+            tensor.fill_(3.0)
+            torch.xpu.synchronize()
+
+            expect_true("tensor allocation is symmetric", ishmem.is_symmetric_tensor(tensor))
+            expect_true("tensor base returns self for base allocation", ishmem.tensor_base(tensor) is tensor)
+            expect_equal("tensor contents", tensor.cpu().tolist(), [3.0, 3.0, 3.0, 3.0])
+
+            stream = torch.xpu.Stream()
+            ishmem.put(mirror, tensor, pe=ishmem.my_pe(), queue=stream)
+            ishmem.quiet(queue=stream)
+            expect_equal("queued self put", mirror.cpu().tolist(), [3.0, 3.0, 3.0, 3.0])
+
+            recv = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+            try:
+                recv.zero_()
+                stream = torch.xpu.Stream()
+                ishmem.get(recv, tensor, pe=ishmem.my_pe(), queue=stream)
+                ishmem.quiet(queue=stream)
+                expect_equal("queued self get", recv.cpu().tolist(), [3.0, 3.0, 3.0, 3.0])
+            finally:
+                ishmem.free_tensor(recv)
+        finally:
+            ishmem.free_tensor(mirror)
+            ishmem.free_tensor(tensor)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())

From aab01a0e998eec24200c46cbdd81a697e2bec23e Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Thu, 12 Mar 2026 21:50:36 +0000
Subject: [PATCH 10/17] ishmem4py: add queue-aware xpu rma

---
 docs/source/ishmem4py.rst              | 71 +++++++++++++++++++++++++-
 ishmem4py/csrc/ishmem4py_runtime.cpp   | 25 +++++++++
 ishmem4py/csrc/ishmem4py_runtime.h     |  6 +++
 ishmem4py/python/README.md             | 27 +++++++++-
 ishmem4py/python/ishmem4py/_lib.py     | 24 +++++++++
 ishmem4py/python/ishmem4py/rma.py      | 69 ++++++++++++++++++++-----
 ishmem4py/test/torch_queue_get_test.py | 50 ++++++++++++++++++
 ishmem4py/test/torch_queue_put_test.py | 52 +++++++++++++++++++
 8 files changed, 308 insertions(+), 16 deletions(-)
 create mode 100644 ishmem4py/test/torch_queue_get_test.py
 create mode 100644 ishmem4py/test/torch_queue_put_test.py

diff --git a/docs/source/ishmem4py.rst b/docs/source/ishmem4py.rst
index 9fa134a..5efe7ec 100644
--- a/docs/source/ishmem4py.rst
+++ b/docs/source/ishmem4py.rst
@@ -9,8 +9,9 @@ It targets the same host-side workflow that the C and C++ Intel SHMEM APIs alrea
 while adopting a small, Python-oriented surface inspired by ``nvshmem4py`` where the models
 overlap.
 
-The current binding is intentionally synchronous and host-driven.
-It does not yet expose Intel SHMEM queue-based extensions or device-initiated Python APIs.
+The current binding is intentionally host-driven.
+It now exposes queue-based host ``put`` and ``get`` operations for XPU workflows, but it does
+not yet expose device-initiated Python APIs.
 
 .. currentmodule:: ishmem4py
 
@@ -50,6 +51,10 @@ Memory Management
    calloc
    buffer
    free
+   tensor
+   free_tensor
+   tensor_base
+   is_symmetric_tensor
    ptr
    ishmem_ptr
    SymmetricMemory
@@ -136,6 +141,51 @@ Example:
        ishmem.free(buf)
        ishmem.finalize()
 
+Torch/XPU interop is also available as an optional build-time feature.
+Those helpers allocate symmetric memory directly as ``torch.Tensor`` objects backed by the
+Intel SHMEM heap, which is useful for host-driven one-sided XPU workflows:
+
+.. code-block:: python
+
+   import torch
+   import ishmem4py as ishmem
+
+   ishmem.init()
+   try:
+       src = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+       dst = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+
+       src.fill_(7.0)
+       stream = torch.xpu.Stream()
+       ishmem.put(dst, src, pe=ishmem.my_pe(), queue=stream)
+       ishmem.quiet(queue=stream)
+   finally:
+       ishmem.free_tensor(dst)
+       ishmem.free_tensor(src)
+       ishmem.finalize()
+
+When torch interop is built, the top-level module exposes ``tensor``, ``free_tensor``,
+``tensor_base``, and ``is_symmetric_tensor`` lazily so that plain ``import ishmem4py`` does
+not require importing PyTorch up front.
+
+Queue-Based RMA
+^^^^^^^^^^^^^^^
+
+``ishmem4py.put`` and ``ishmem4py.get`` accept an optional ``queue=...`` argument for
+host-initiated queue-based Intel SHMEM operations.
+This is intended for XPU workflows that pass a ``torch.xpu.Stream`` or a raw SYCL queue handle.
+
+Completion remains host-driven: after issuing queue-based RMA, wait for the queue and then call
+``ishmem.quiet(queue=stream)``.
+
+Example:
+
+.. code-block:: python
+
+   stream = torch.xpu.Stream()
+   ishmem.get(dst, src, pe=peer, queue=stream)
+   ishmem.quiet(queue=stream)
+
 Collectives and Teams
 ^^^^^^^^^^^^^^^^^^^^^
 
@@ -166,6 +216,17 @@ Build Intel SHMEM with Python bindings enabled:
    cmake -S /path/to/ishmem -B /path/to/build -DBUILD_PYTHON_BINDINGS=ON ...
    cmake --build /path/to/build --target ishmem4py -j4
 
+To include optional Torch/XPU symmetric tensor support in the package, enable the extra build:
+
+.. code-block:: bash
+
+   cmake -S /path/to/ishmem -B /path/to/build \
+     -DBUILD_PYTHON_BINDINGS=ON \
+     -DISHMEM4PY_BUILD_TORCH_INTEROP=ON \
+     -DPython3_EXECUTABLE=/path/to/python-from-torch-xpu-env \
+     -DCMAKE_PREFIX_PATH="$(python -c 'import torch; print(torch.utils.cmake_prefix_path)')" ...
+   cmake --build /path/to/build --target ishmem4py -j4
+
 Install from the build tree:
 
 .. code-block:: bash
@@ -199,6 +260,12 @@ Examples:
    mpiexec -n 1 ishmrun python3 /path/to/ishmem/ishmem4py/test/smoke_test.py
    mpiexec -n 2 ishmrun python3 /path/to/ishmem/ishmem4py/test/ring_test.py
    mpiexec -n 2 ishmrun python3 /path/to/ishmem/ishmem4py/test/collective_test.py
+   mpiexec -n 1 ishmrun python3 /path/to/ishmem/ishmem4py/test/torch_tensor_test.py
+   mpiexec -n 2 ishmrun python3 /path/to/ishmem/ishmem4py/test/torch_queue_get_test.py
+
+Use real script files for multi-PE validation rather than ``python - <<'PY'`` heredoc launches;
+the latter can appear to hang under ``mpiexec`` because of stdin interaction with the process
+manager.
 
 Current Scope
 ^^^^^^^^^^^^^
diff --git a/ishmem4py/csrc/ishmem4py_runtime.cpp b/ishmem4py/csrc/ishmem4py_runtime.cpp
index 0bc62a9..419322a 100644
--- a/ishmem4py/csrc/ishmem4py_runtime.cpp
+++ b/ishmem4py/csrc/ishmem4py_runtime.cpp
@@ -5,6 +5,7 @@
 #include "ishmem4py_runtime.h"
 
 #include <ishmem.h>
+#include <ishmemx.h>
 
 #include <cstring>
 
@@ -173,6 +174,30 @@ void ishmem4py_getmem(void *dest, const void *src, size_t nbytes, int pe)
     ishmem_getmem(dest, src, nbytes, pe);
 }
 
+void ishmem4py_putmem_on_queue(void *dest, const void *src, size_t nbytes, int pe, void *queue)
+{
+    auto *q = reinterpret_cast<sycl::queue *>(queue);
+    ishmemx_putmem_on_queue(dest, src, nbytes, pe, *q, {});
+}
+
+void ishmem4py_getmem_on_queue(void *dest, const void *src, size_t nbytes, int pe, void *queue)
+{
+    auto *q = reinterpret_cast<sycl::queue *>(queue);
+    ishmemx_getmem_on_queue(dest, src, nbytes, pe, *q, {});
+}
+
+void ishmem4py_quiet_on_queue(void *queue)
+{
+    auto *q = reinterpret_cast<sycl::queue *>(queue);
+    ishmemx_quiet_on_queue(*q, {});
+}
+
+void ishmem4py_queue_sync(void *queue)
+{
+    auto *q = reinterpret_cast<sycl::queue *>(queue);
+    q->wait_and_throw();
+}
+
 void *ishmem4py_ptr(const void *dest, int pe)
 {
     return ishmem_ptr(dest, pe);
diff --git a/ishmem4py/csrc/ishmem4py_runtime.h b/ishmem4py/csrc/ishmem4py_runtime.h
index 332fb9c..f15ced2 100644
--- a/ishmem4py/csrc/ishmem4py_runtime.h
+++ b/ishmem4py/csrc/ishmem4py_runtime.h
@@ -57,6 +57,12 @@ ISHMEM4PY_EXPORT void *ishmem4py_calloc(size_t count, size_t size);
 ISHMEM4PY_EXPORT void ishmem4py_free(void *ptr);
 ISHMEM4PY_EXPORT void ishmem4py_putmem(void *dest, const void *src, size_t nbytes, int pe);
 ISHMEM4PY_EXPORT void ishmem4py_getmem(void *dest, const void *src, size_t nbytes, int pe);
+ISHMEM4PY_EXPORT void ishmem4py_putmem_on_queue(void *dest, const void *src, size_t nbytes,
+                                                int pe, void *queue);
+ISHMEM4PY_EXPORT void ishmem4py_getmem_on_queue(void *dest, const void *src, size_t nbytes,
+                                                int pe, void *queue);
+ISHMEM4PY_EXPORT void ishmem4py_quiet_on_queue(void *queue);
+ISHMEM4PY_EXPORT void ishmem4py_queue_sync(void *queue);
 ISHMEM4PY_EXPORT void *ishmem4py_ptr(const void *dest, int pe);
 ISHMEM4PY_EXPORT int ishmem4py_team_my_pe(int team);
 ISHMEM4PY_EXPORT int ishmem4py_team_n_pes(int team);
diff --git a/ishmem4py/python/README.md b/ishmem4py/python/README.md
index b42c56d..52e8c6f 100644
--- a/ishmem4py/python/README.md
+++ b/ishmem4py/python/README.md
@@ -1,6 +1,8 @@
 # ishmem4py Python Package
 
-`ishmem4py` supports two install modes.
+`ishmem4py` supports two install modes. The base package provides host-side Intel SHMEM
+allocation, RMA, collectives, and team APIs. An optional Torch/XPU extension adds symmetric
+`torch.Tensor` allocation plus queue-aware XPU RMA.
 
 ## Standard Install
 
@@ -13,6 +15,19 @@ pip install /path/to/ishmem-build/ishmem4py/python
 That package directory includes `_ishmem4py_runtime.so`, so `ISHMEM4PY_RUNTIME_LIBRARY` is not
 needed.
 
+To include the optional Torch/XPU interop module in the package, build with
+`-DISHMEM4PY_BUILD_TORCH_INTEROP=ON` and use a Python executable from an environment where
+PyTorch XPU is already installed:
+
+```bash
+cmake -S /path/to/ishmem-src -B /path/to/ishmem-build \
+  -DBUILD_PYTHON_BINDINGS=ON \
+  -DISHMEM4PY_BUILD_TORCH_INTEROP=ON \
+  -DPython3_EXECUTABLE=/path/to/python \
+  -DCMAKE_PREFIX_PATH="$(python -c 'import torch; print(torch.utils.cmake_prefix_path)')"
+cmake --build /path/to/ishmem-build --target ishmem4py -j4
+```
+
 ## Editable / Dev Install
 
 For active development:
@@ -27,6 +42,9 @@ Point the editable package at the CMake-built runtime:
 export ISHMEM4PY_RUNTIME_LIBRARY=/path/to/ishmem-build/ishmem4py/python/ishmem4py/_ishmem4py_runtime.so
 ```
 
+If Torch/XPU interop is enabled in the build tree, the editable package will also find
+`_ishmem4py_torch.so` from that package directory through `PYTHONPATH`.
+
 ## Runtime Environment
 
 The package still depends on the surrounding Intel SHMEM runtime setup:
@@ -47,7 +65,14 @@ export PYTHONPATH=/path/to/ishmem-src/ishmem4py/python:/path/to/ishmem-src/ishme
 ## Notes
 
 - `import ishmem4py` and `import ishmem4py.core` both expose the full public API.
+- Torch/XPU interop is imported lazily through `ishmem4py.tensor(...)`,
+  `ishmem4py.free_tensor(...)`, `ishmem4py.tensor_base(...)`, and
+  `ishmem4py.is_symmetric_tensor(...)` so that plain package import does not eagerly import
+  PyTorch.
 - The runtime library is loaded lazily, so importing the package for documentation or static
   inspection does not require a live Intel SHMEM runtime.
+- `put`, `get`, and `quiet` accept `queue=` for queue-based host-initiated XPU RMA. Supported
+  queue objects are `torch.xpu.Stream`, `ctypes.c_void_p`, and raw integer SYCL queue pointers.
+- The current Torch/XPU MVP is focused on XPU device memory and XPU-to-XPU one-sided transfers.
 - The exported API is limited to functionality that has been validated in the current
   OpenSHMEM-backed test environment.
diff --git a/ishmem4py/python/ishmem4py/_lib.py b/ishmem4py/python/ishmem4py/_lib.py
index 0b81a93..6c7870b 100644
--- a/ishmem4py/python/ishmem4py/_lib.py
+++ b/ishmem4py/python/ishmem4py/_lib.py
@@ -85,6 +85,30 @@ def _configure_runtime(runtime):
     ]
     runtime.ishmem4py_getmem.restype = None
 
+    runtime.ishmem4py_putmem_on_queue.argtypes = [
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+        ctypes.c_int,
+        ctypes.c_void_p,
+    ]
+    runtime.ishmem4py_putmem_on_queue.restype = None
+
+    runtime.ishmem4py_getmem_on_queue.argtypes = [
+        ctypes.c_void_p,
+        ctypes.c_void_p,
+        ctypes.c_size_t,
+        ctypes.c_int,
+        ctypes.c_void_p,
+    ]
+    runtime.ishmem4py_getmem_on_queue.restype = None
+
+    runtime.ishmem4py_quiet_on_queue.argtypes = [ctypes.c_void_p]
+    runtime.ishmem4py_quiet_on_queue.restype = None
+
+    runtime.ishmem4py_queue_sync.argtypes = [ctypes.c_void_p]
+    runtime.ishmem4py_queue_sync.restype = None
+
     runtime.ishmem4py_ptr.argtypes = [ctypes.c_void_p, ctypes.c_int]
     runtime.ishmem4py_ptr.restype = ctypes.c_void_p
 
diff --git a/ishmem4py/python/ishmem4py/rma.py b/ishmem4py/python/ishmem4py/rma.py
index 6351226..5f25bbb 100644
--- a/ishmem4py/python/ishmem4py/rma.py
+++ b/ishmem4py/python/ishmem4py/rma.py
@@ -27,15 +27,38 @@
 ]
 
 
+def _queue_handle(queue) -> int:
+    if queue is None:
+        raise ValueError("queue must not be None")
+    if isinstance(queue, int):
+        handle = int(queue)
+    elif isinstance(queue, ctypes.c_void_p):
+        handle = int(queue.value or 0)
+    elif hasattr(queue, "sycl_queue"):
+        handle = int(queue.sycl_queue)
+    elif hasattr(queue, "_as_parameter_"):
+        handle = int(getattr(queue._as_parameter_, "value", queue._as_parameter_))
+    else:
+        raise TypeError("queue must be a torch.xpu.Stream, ctypes.c_void_p, or integer queue pointer")
+    if handle == 0:
+        raise ValueError("queue pointer must be non-zero")
+    return handle
+
+
 def fence() -> None:
     """Order previously issued put-like operations."""
     _require_initialized()
     RUNTIME.ishmem4py_fence()
 
 
-def quiet() -> None:
+def quiet(queue=None) -> None:
     """Wait for completion of previously issued RMA operations."""
     _require_initialized()
+    if queue is not None:
+        queue_ptr = ctypes.c_void_p(_queue_handle(queue))
+        RUNTIME.ishmem4py_queue_sync(queue_ptr)
+        RUNTIME.ishmem4py_quiet()
+        return
     RUNTIME.ishmem4py_quiet()
 
 
@@ -47,6 +70,7 @@ def put(
     size: Optional[int] = None,
     dest_offset: int = 0,
     src_offset: int = 0,
+    queue=None,
 ) -> int:
     """Copy bytes from a local buffer into symmetric memory on ``pe``."""
     _require_initialized()
@@ -62,12 +86,21 @@ def put(
     if nbytes > src_info.size:
         raise ValueError("size extends past the source buffer")
 
-    RUNTIME.ishmem4py_putmem(
-        ctypes.c_void_p(dest.ptr + dest_offset),
-        ctypes.c_void_p(src_info.ptr),
-        nbytes,
-        pe,
-    )
+    if queue is None:
+        RUNTIME.ishmem4py_putmem(
+            ctypes.c_void_p(dest.ptr + dest_offset),
+            ctypes.c_void_p(src_info.ptr),
+            nbytes,
+            pe,
+        )
+    else:
+        RUNTIME.ishmem4py_putmem_on_queue(
+            ctypes.c_void_p(dest.ptr + dest_offset),
+            ctypes.c_void_p(src_info.ptr),
+            nbytes,
+            pe,
+            ctypes.c_void_p(_queue_handle(queue)),
+        )
     return nbytes
 
 
@@ -79,6 +112,7 @@ def get(
     size: Optional[int] = None,
     dest_offset: int = 0,
     src_offset: int = 0,
+    queue=None,
 ) -> int:
     """Copy bytes from symmetric memory on ``pe`` into a local buffer."""
     _require_initialized()
@@ -94,12 +128,21 @@ def get(
     if nbytes > src_size:
         raise ValueError("size extends past the source symmetric object")
 
-    RUNTIME.ishmem4py_getmem(
-        ctypes.c_void_p(dest_info.ptr),
-        ctypes.c_void_p(src.ptr + src_offset),
-        nbytes,
-        pe,
-    )
+    if queue is None:
+        RUNTIME.ishmem4py_getmem(
+            ctypes.c_void_p(dest_info.ptr),
+            ctypes.c_void_p(src.ptr + src_offset),
+            nbytes,
+            pe,
+        )
+    else:
+        RUNTIME.ishmem4py_getmem_on_queue(
+            ctypes.c_void_p(dest_info.ptr),
+            ctypes.c_void_p(src.ptr + src_offset),
+            nbytes,
+            pe,
+            ctypes.c_void_p(_queue_handle(queue)),
+        )
     return nbytes
 
 
diff --git a/ishmem4py/test/torch_queue_get_test.py b/ishmem4py/test/torch_queue_get_test.py
new file mode 100644
index 0000000..a444cca
--- /dev/null
+++ b/ishmem4py/test/torch_queue_get_test.py
@@ -0,0 +1,50 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import sys
+
+import torch
+
+import ishmem4py as ishmem
+
+from utils import expect_equal
+
+
+def main() -> int:
+    ishmem.init()
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        if npes < 2:
+            raise RuntimeError("torch_queue_get_test.py requires at least 2 PEs")
+
+        next_pe = (my_pe + 1) % npes
+
+        src = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        recv = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        try:
+            src.fill_(float(my_pe + 10))
+            recv.zero_()
+            torch.xpu.synchronize()
+
+            ishmem.barrier_all()
+
+            stream = torch.xpu.Stream()
+            ishmem.get(recv, src, pe=next_pe, queue=stream)
+            ishmem.quiet(queue=stream)
+            torch.xpu.synchronize()
+
+            expect_equal("queued symmetric tensor get ring", recv.cpu().tolist(), [float(next_pe + 10)] * 4)
+        finally:
+            ishmem.free_tensor(recv)
+            ishmem.free_tensor(src)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/ishmem4py/test/torch_queue_put_test.py b/ishmem4py/test/torch_queue_put_test.py
new file mode 100644
index 0000000..e960d83
--- /dev/null
+++ b/ishmem4py/test/torch_queue_put_test.py
@@ -0,0 +1,52 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import sys
+
+import torch
+
+import ishmem4py as ishmem
+
+from utils import expect_equal
+
+
+def main() -> int:
+    ishmem.init()
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        if npes < 2:
+            raise RuntimeError("torch_queue_put_test.py requires at least 2 PEs")
+
+        next_pe = (my_pe + 1) % npes
+        prev_pe = (my_pe + npes - 1) % npes
+
+        src = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        dst = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        try:
+            src.fill_(float(my_pe + 10))
+            dst.zero_()
+            torch.xpu.synchronize()
+
+            ishmem.barrier_all()
+
+            stream = torch.xpu.Stream()
+            ishmem.put(dst, src, pe=next_pe, queue=stream)
+            ishmem.quiet(queue=stream)
+
+            ishmem.barrier_all()
+            torch.xpu.synchronize()
+            expect_equal("queued symmetric tensor put ring", dst.cpu().tolist(), [float(prev_pe + 10)] * 4)
+        finally:
+            ishmem.free_tensor(dst)
+            ishmem.free_tensor(src)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())

From 83cfd50379a01b155c9fd61d5ca981998d6fcdcd Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Tue, 17 Mar 2026 06:14:52 +0000
Subject: [PATCH 11/17] ishmrun: count level-zero devices correctly

---
 scripts/ishmrun | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/scripts/ishmrun b/scripts/ishmrun
index 1986807..92877b9 100755
--- a/scripts/ishmrun
+++ b/scripts/ishmrun
@@ -179,8 +179,8 @@ gpu_binding()
 
     # Determine how to set ZE_AFFINITY_MASK
     if [ $which_sycl_ls -eq 0 ]; then
-        local root_count=$(ONEAPI_DEVICE_SELECTOR=level_zero:* sycl-ls 2>/dev/null | wc -l)
-        local sub_count=$(ONEAPI_DEVICE_SELECTOR=level_zero:*.* sycl-ls 2>/dev/null | wc -l)
+        local root_count=$(ONEAPI_DEVICE_SELECTOR=level_zero:* sycl-ls 2>/dev/null | grep -c "\[level_zero")
+        local sub_count=$(ONEAPI_DEVICE_SELECTOR=level_zero:*.* sycl-ls 2>/dev/null | grep -c "\[level_zero")
     elif [ $which_clinfo -eq 0 ]; then
         local platforms=$(clinfo -l | grep -i Platform | wc -l)
 
@@ -313,6 +313,8 @@ gpu_env()
         # Intel(R) Arc(TM) B-Series GPU Family
         # Necessary for GPU IPC
         export RenderCompressedBuffersEnabled=0
+        # Atomic ops from the GPU require Xe-Links - instead route atomic ops through the host
+        export ISHMEM_ENABLE_DEVICE_ATOMICS=0
     elif [ $(echo $output | grep -Po "Intel.*Data.*Center.*GPU" | wc -l) -gt 0 ]; then
         # Intel(R) Data Center GPU Max Series
         # No support for Implicit Scaling

From 14ba5bd2fbc09ed3d88607bd7464d2d32b94d05d Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Tue, 17 Mar 2026 15:04:24 +0000
Subject: [PATCH 12/17] ishmem: add explicit init_attr device selection

---
 README.md                                |   4 +-
 docs/source/execution_model.rst          |   5 +-
 docs/source/library_setup_exit_query.rst |   8 +-
 examples/3_library_apis.cpp              |   6 +-
 src/accelerator.cpp                      | 112 ++++++++++++++++++-----
 src/accelerator.h                        |   6 +-
 src/ishmem.cpp                           |   2 +-
 src/ishmemx.h                            |   2 +
 src/memory.cpp                           |   2 +-
 src/on_queue.h                           |   3 +
 10 files changed, 115 insertions(+), 35 deletions(-)

diff --git a/README.md b/README.md
index b2822e1..3cedcd2 100644
--- a/README.md
+++ b/README.md
@@ -146,8 +146,8 @@ ISHMEM_RUNTIME=<backend> mpiexec.hydra -n 2 -hosts <allocated_node_id> ./scripts
 
 - *Note:* Current supported launchers include: MPI process launchers (i.e. `mpiexec`, `mpiexec.hydra`, `mpirun`, etc.), Slurm (i.e. `srun`, `salloc`, etc.), and PBS (i.e. `qsub`).
 
-- *Note:* Intel® SHMEM execution model requires applications to use a 1:1 mapping between PEs and GPU devices. Attempting to run an application without the `ishmrun` launch script may result in failure if this mapping is not maintained.
-  - For further details on device selection, please see [the ONEAPI_DEVICE_SELECTOR](https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector).
+- *Note:* Intel® SHMEM execution model requires applications to use a 1:1 mapping between PEs and GPU devices. The default `ishmrun` launcher maintains that mapping by restricting each PE to a single visible GPU device. When multiple GPUs remain visible to a PE, set `ishmemx_attr_t.device_id` explicitly before calling `ishmemx_init_attr`.
+  - For further details on launcher-based device selection, please see [the ONEAPI_DEVICE_SELECTOR](https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector).
 
 3. Validate the application ran successfully; example output:
 
diff --git a/docs/source/execution_model.rst b/docs/source/execution_model.rst
index f9bfdcf..06bd01b 100644
--- a/docs/source/execution_model.rst
+++ b/docs/source/execution_model.rst
@@ -21,7 +21,9 @@ library<ishmemx_runtime_type_t>`.
    that assigns the environment variable **ZE_AFFINITY_MASK** so that each
    PE is assigned a single SYCL device. Usage of this script is described in
    Section :ref:`Compiling and Running
-   Programs<compiling_and_running_programs>`.
+   Programs<compiling_and_running_programs>`. When multiple GPUs remain
+   visible to a PE, applications must select one explicitly with
+   ``ishmemx_attr_t.device_id`` before calling ``ishmemx_init_attr``.
 
 .. note:: Intel® Data Center GPU Max Series devices utilize a multi-tile
    architecture (as of Intel®  SHMEM v1.0.0 with 1 or 2 tiles).  By default,
@@ -110,4 +112,3 @@ operations, see :ref:`Memory Ordering<memory_ordering>`.
 .. segment.
 .. For example, objects located in the symmetric data segment and objects
 .. located in the symmetric heap can be provided as arguments to the same OpenSHMEM operation.
-
diff --git a/docs/source/library_setup_exit_query.rst b/docs/source/library_setup_exit_query.rst
index 3374fd1..222acb7 100644
--- a/docs/source/library_setup_exit_query.rst
+++ b/docs/source/library_setup_exit_query.rst
@@ -58,6 +58,7 @@ OpenSHMEM, MPI, or PMI.
   .. c:var:: ishmemx_runtime_type_t runtime
   .. c:var:: bool initialize_runtime = true
   .. c:var:: bool gpu = true
+  .. c:var:: int device_id = -1
   .. c:var:: void *mpi_comm
 
 **Description:**
@@ -68,7 +69,11 @@ library.
 By default, the parallel runtime is initialized by Intel® SHMEM
 (**initialize_runtime** default is ``true``).
 The **gpu** boolean indicates whether to use GPU memory for the symmetric
-heap (default is ``true``). **mpi_comm** is a pointer to the corresponding
+heap (default is ``true``). ``device_id`` selects the visible GPU ordinal to
+use when multiple GPU devices are visible to a PE. The default value of ``-1``
+preserves the traditional single-visible-device behavior: Intel® SHMEM will
+auto-select the device only when exactly one GPU is visible. **mpi_comm** is a
+pointer to the corresponding
 MPI communicator for representing ``ISHMEM_TEAM_WORLD`` when used with
 ``ISHMEM_RUNTIME_MPI`` (default is ``MPI_COMM_WORLD``).
 
@@ -283,4 +288,3 @@ character.  If the **name** memory buffer is provided with size less than
 ISHMEM_MAX_NAME_LEN, behavior is undefined. For a given library
 implementation, the vendor string returned is consistent with the library
 constant ISHMEM_VENDOR_STRING.
-
diff --git a/examples/3_library_apis.cpp b/examples/3_library_apis.cpp
index a25836a..33758b2 100644
--- a/examples/3_library_apis.cpp
+++ b/examples/3_library_apis.cpp
@@ -16,8 +16,10 @@ int main()
               << std::endl;
 
     /* Initialize ISHMEM
-     * The ISHMEM launch script will set things up so that ishmem uses
-     * the same GPU device as the SYCL queue above
+     * The default ishmrun launch script will set things up so that ishmem uses
+     * the same GPU device as the SYCL queue above. When multiple GPUs are
+     * visible to each PE, use ishmemx_attr_t.device_id to select the device
+     * explicitly before calling ishmemx_init_attr().
      */
     ishmem_init();
 
diff --git a/src/accelerator.cpp b/src/accelerator.cpp
index 26f0274..158a95f 100644
--- a/src/accelerator.cpp
+++ b/src/accelerator.cpp
@@ -6,6 +6,7 @@
 #include "accelerator.h"
 #include <level_zero/ze_api.h>
 #include <atomic>
+#include <vector>
 
 /* TODO: Workaround to resolve compiler limitation. Need to be fixed later */
 #if __INTEL_CLANG_COMPILER <= 20210400
@@ -15,12 +16,21 @@
 #endif
 
 namespace {
+    struct ishmemi_visible_gpu_t {
+        ze_driver_handle_t driver = nullptr;
+        ze_device_handle_t device = nullptr;
+        ze_device_properties_t properties = {};
+        uint32_t driver_idx = 0;
+    };
+
     /* L0 driver */
     ze_driver_handle_t *all_drivers = nullptr;
     ze_device_handle_t **all_devices = nullptr;
     uint32_t driver_count = 0;
     uint32_t driver_idx = 0;
     bool driver_found = false;
+    std::vector<ishmemi_visible_gpu_t> visible_gpus;
+    int selected_device_id = -1;
 
     /* L0 device */
     ze_device_properties_t device_properties = {};
@@ -95,10 +105,30 @@ static inline uint32_t get_next_link_index()
     return index;
 }
 
+sycl::device ishmemi_get_selected_sycl_device()
+{
+    return sycl::make_device<sycl::backend::ext_oneapi_level_zero>(ishmemi_gpu_device);
+}
+
+void ishmemi_validate_queue_device(const sycl::queue &q)
+{
+    try {
+        auto queue_device =
+            sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());
+        if (queue_device != ishmemi_gpu_device) {
+            RAISE_ERROR_MSG(
+                "Queue device does not match the selected ISHMEM device. Set "
+                "ishmemx_attr_t.device_id to the queue device ordinal.\n");
+        }
+    } catch (const sycl::exception &) {
+        RAISE_ERROR_MSG("Queue device is not a Level Zero GPU device\n");
+    }
+}
+
 int ishmemi_accelerator_preinit()
 {
     int ret = 0;
-    uint32_t i;
+    uint32_t i, j;
     uint32_t device_count = 0;
     ze_init_flag_t flags = ZE_INIT_FLAG_GPU_ONLY;
 
@@ -134,44 +164,44 @@ int ishmemi_accelerator_preinit()
     ZE_CHECK(zeDriverGet(&driver_count, all_drivers));
     ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
 
-    /* Parse the drivers for a suitable driver */
+    visible_gpus.clear();
+
+    /* Parse the drivers for visible GPU devices */
     for (i = 0; i < driver_count; i++) {
         device_count = 0;
         ZE_CHECK(zeDeviceGet(all_drivers[i], &device_count, nullptr));
         ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
         if (device_count == 0) continue;
 
-        /* Ensure a single device is detected */
-        ISHMEM_CHECK_GOTO_MSG(device_count != 1, fn_fail, "Detected more than one device\n");
         all_devices[i] = (ze_device_handle_t *) ::malloc(device_count * sizeof(ze_device_handle_t));
-        ISHMEM_CHECK_GOTO_MSG(all_devices == nullptr, fn_fail,
+        ISHMEM_CHECK_GOTO_MSG(all_devices[i] == nullptr, fn_fail,
                               "Allocation of all_drivers[%d] failed\n", i);
 
         ZE_CHECK(zeDeviceGet(all_drivers[i], &device_count, all_devices[i]));
         ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
 
-        ZE_CHECK(zeDeviceGetProperties(all_devices[i][0], &device_properties));
-        ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
-
-        if (ZE_DEVICE_TYPE_GPU == device_properties.type && !driver_found) {
-            ishmemi_gpu_driver = all_drivers[i];
-            driver_idx = i;
-            driver_found = true;
+        for (j = 0; j < device_count; ++j) {
+            ze_device_properties_t props = {};
+            ZE_CHECK(zeDeviceGetProperties(all_devices[i][j], &props));
+            ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
+
+            if (ZE_DEVICE_TYPE_GPU == props.type) {
+                visible_gpus.push_back({
+                    .driver = all_drivers[i],
+                    .device = all_devices[i][j],
+                    .properties = props,
+                    .driver_idx = i,
+                });
+            }
         }
     }
 
-    if (!driver_found) {
+    if (visible_gpus.empty()) {
         ISHMEM_ERROR_MSG("No ZE driver found for GPU\n");
         ret = ISHMEMI_NO_DEVICES;
         goto fn_fail;
     }
 
-    /* Create the ZE context */
-    ishmemi_ze_context_desc.stype = ZE_STRUCTURE_TYPE_CONTEXT_DESC;
-
-    ZE_CHECK(zeContextCreate(ishmemi_gpu_driver, &ishmemi_ze_context_desc, &ishmemi_ze_context));
-    ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
-
 fn_exit:
     ishmemi_accelerator_preinitialized = true;
     return ret;
@@ -181,7 +211,7 @@ int ishmemi_accelerator_preinit()
     goto fn_exit;
 }
 
-int ishmemi_accelerator_init()
+int ishmemi_accelerator_init(const ishmemx_attr_t *attr)
 {
     int ret = 0;
     uint32_t i, j;
@@ -192,9 +222,37 @@ int ishmemi_accelerator_init()
     ret = ishmemi_accelerator_preinit();
     ISHMEMI_CHECK_RESULT(ret, 0, fn_exit);
 
-    if (driver_found) {
-        /* Set the default GPU */
-        ishmemi_gpu_device = all_devices[driver_idx][0];
+    if (!ishmemi_accelerator_initialized) {
+        ISHMEM_CHECK_GOTO_MSG(attr == nullptr, fn_fail,
+                              "Accelerator initialization requires non-null attributes\n");
+        ISHMEM_CHECK_GOTO_MSG(attr->device_id < -1, fn_fail,
+                              "Invalid device_id %d provided in ishmemx_attr_t\n",
+                              attr->device_id);
+
+        if (attr->device_id == -1) {
+            ISHMEM_CHECK_GOTO_MSG(
+                visible_gpus.size() != 1, fn_fail,
+                "Detected %zu visible GPU devices. Set ishmemx_attr_t.device_id to select one.\n",
+                visible_gpus.size());
+            selected_device_id = 0;
+        } else {
+            ISHMEM_CHECK_GOTO_MSG(
+                static_cast<size_t>(attr->device_id) >= visible_gpus.size(), fn_fail,
+                "Requested device_id %d is out of range for %zu visible GPU devices\n",
+                attr->device_id, visible_gpus.size());
+            selected_device_id = attr->device_id;
+        }
+
+        const auto &selected_device = visible_gpus[static_cast<size_t>(selected_device_id)];
+        ishmemi_gpu_driver = selected_device.driver;
+        ishmemi_gpu_device = selected_device.device;
+        device_properties = selected_device.properties;
+        driver_idx = selected_device.driver_idx;
+        driver_found = true;
+
+        ishmemi_ze_context_desc.stype = ZE_STRUCTURE_TYPE_CONTEXT_DESC;
+        ZE_CHECK(zeContextCreate(ishmemi_gpu_driver, &ishmemi_ze_context_desc, &ishmemi_ze_context));
+        ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
 
         /* Discover command queue groups */
         ZE_CHECK(
@@ -289,10 +347,11 @@ int ishmemi_accelerator_init()
     ISHMEMI_CHECK_RESULT(ret, 0, fn_fail);
 
 fn_exit:
-    ishmemi_accelerator_initialized = true;
+    ishmemi_accelerator_initialized = (ret == 0);
     return ret;
 fn_fail:
     ishmemi_accelerator_fini();
+    if (!ret) ret = 1;
     goto fn_exit;
 }
 
@@ -326,12 +385,17 @@ int ishmemi_accelerator_fini(void)
         ISHMEMI_FREE(::free, all_devices[i]);
     ISHMEMI_FREE(::free, all_devices);
     ISHMEMI_FREE(::free, all_drivers);
+    visible_gpus.clear();
 
     ishmemi_accelerator_preinitialized = false;
     ishmemi_accelerator_initialized = false;
     driver_found = false;
     driver_idx = 0;
     driver_count = 0;
+    selected_device_id = -1;
+    ishmemi_gpu_driver = nullptr;
+    ishmemi_gpu_device = nullptr;
+    device_properties = {};
 
     if (ishmemi_ze_context) {
         ZE_CHECK(zeContextDestroy(ishmemi_ze_context));
diff --git a/src/accelerator.h b/src/accelerator.h
index 9512e9e..fa138b9 100644
--- a/src/accelerator.h
+++ b/src/accelerator.h
@@ -59,12 +59,16 @@ static inline void ishmemi_print_device_properties(const ze_device_properties_t
 
 /* Initialize accelerator */
 int ishmemi_accelerator_preinit(void);
-int ishmemi_accelerator_init(void);
+int ishmemi_accelerator_init(const ishmemx_attr_t *attr);
 
 /* Finalize accelerator */
 int ishmemi_accelerator_fini(void);
 int ishmemi_accelerator_postfini(void);
 
+/* Selected-device helpers */
+sycl::device ishmemi_get_selected_sycl_device();
+void ishmemi_validate_queue_device(const sycl::queue &q);
+
 /* Query allocation memory type */
 int ishmemi_get_memory_type(const void *ptr, ze_memory_type_t *type);
 
diff --git a/src/ishmem.cpp b/src/ishmem.cpp
index 44b6571..42d8d76 100644
--- a/src/ishmem.cpp
+++ b/src/ishmem.cpp
@@ -280,7 +280,7 @@ static void ishmemi_init(ishmemx_attr_t *attr, bool user_attr)
     ishmemi_cpu_info->n_pes = ishmemi_n_pes;
 
     if (attr->gpu) {
-        ret = ishmemi_accelerator_init();
+        ret = ishmemi_accelerator_init(attr);
         if (ret == ISHMEMI_NO_DEVICE_ACCESS) {
             attr->gpu = false;
             /* TODO need to enable SHARED HEAP config */
diff --git a/src/ishmemx.h b/src/ishmemx.h
index de87a1c..e988287 100644
--- a/src/ishmemx.h
+++ b/src/ishmemx.h
@@ -25,6 +25,8 @@ typedef struct ishmemx_attr_t {
     bool initialize_runtime = true;
     /* By default, gpu is used */
     bool gpu = true;
+    /* By default, select the only visible GPU device */
+    int device_id = -1;
     /* By default, the base team/comm is uninitialized, representing the default global team/comm */
     union {
         /* TODO: add support for user-provided shmem_team as global team */
diff --git a/src/memory.cpp b/src/memory.cpp
index 4f35408..b650be4 100644
--- a/src/memory.cpp
+++ b/src/memory.cpp
@@ -98,7 +98,7 @@ int ishmemi_memory_init()
 
     /* SYCL queue to initialize global_info */
     try {
-        sycl::queue q;
+        sycl::queue q(ishmemi_get_selected_sycl_device());
         q.copy(&ishmemi_gpu_info, global_info).wait_and_throw();
     } catch (...) {
         ret = -1;
diff --git a/src/on_queue.h b/src/on_queue.h
index 2fe3e6a..ba02181 100644
--- a/src/on_queue.h
+++ b/src/on_queue.h
@@ -5,6 +5,7 @@
 #ifndef ISHMEM_ON_QUEUE_H
 #define ISHMEM_ON_QUEUE_H
 
+#include "accelerator.h"
 #include <map>
 
 struct ishmemi_on_queue_map_entry_t {
@@ -38,6 +39,8 @@ class ishmemi_on_queue_map : public std::map<sycl::queue *, ishmemi_on_queue_map
 
     inline ishmemi_on_queue_map::iterator get_entry_info(sycl::queue &q, bool &entry_already_exists)
     {
+        ishmemi_validate_queue_device(q);
+
         auto iter = find(&q);
         if (iter != end()) {
             entry_already_exists = true;

From 28aa34e8095c000191143fd87b53d2a2a5bfefff Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Tue, 17 Mar 2026 15:04:43 +0000
Subject: [PATCH 13/17] ishmem: add device selection regression tests

---
 scripts/ctest/all_visible_wrapper          |   8 ++
 test/cmake/common.cmake                    |   1 +
 test/unit/CMakeLists.txt                   |  18 +++-
 test/unit/init_attr_device_id.cpp          | 113 +++++++++++++++++++++
 test/unit/init_attr_device_id_required.cpp |  53 ++++++++++
 5 files changed, 192 insertions(+), 1 deletion(-)
 create mode 100755 scripts/ctest/all_visible_wrapper
 create mode 100644 test/unit/init_attr_device_id.cpp
 create mode 100644 test/unit/init_attr_device_id_required.cpp

diff --git a/scripts/ctest/all_visible_wrapper b/scripts/ctest/all_visible_wrapper
new file mode 100755
index 0000000..90e59a4
--- /dev/null
+++ b/scripts/ctest/all_visible_wrapper
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+export ZE_FLAT_DEVICE_HIERARCHY=FLAT
+export SYCL_DEVICE_FILTER=:gpu
+unset ZE_AFFINITY_MASK
+unset ONEAPI_DEVICE_SELECTOR
+
+exec "$@"
diff --git a/test/cmake/common.cmake b/test/cmake/common.cmake
index 8f56d60..5c71f98 100644
--- a/test/cmake/common.cmake
+++ b/test/cmake/common.cmake
@@ -88,6 +88,7 @@ endif()
 
 list(APPEND ISHMEM_TEST_INCLUDE_DIRS
     "${ISHMEM_INCLUDE}"
+    "${ISHMEM_ROOT_DIR}/src"
     "${ISHMEM_TEST_ROOT_DIR}/include"
     "${CMAKE_CURRENT_BINARY_DIR}/include")
 
diff --git a/test/unit/CMakeLists.txt b/test/unit/CMakeLists.txt
index 632f119..a4ff106 100644
--- a/test/unit/CMakeLists.txt
+++ b/test/unit/CMakeLists.txt
@@ -29,7 +29,9 @@ target_link_libraries(ishmem-test-common-unit PUBLIC ${ISHMEM_TEST_LINK_LIBS})
 # Setup default test values
 
 set(ISHMEM_PE_COUNTS_UNIT_TESTS "2" CACHE STRING "Number of PEs to use for each test")
-set(ISHMEM_NON_STANDARD_CTEST_FILES "")
+set(ISHMEM_NON_STANDARD_CTEST_FILES
+    init_attr_device_id
+    init_attr_device_id_required)
 set(ISHMEM_SKIP_CTEST_FILES "")
 
 set(ISHMEM_TESTER_MODES host_device_device device)
@@ -104,6 +106,8 @@ set(ISHMEM_TESTER_ON_QUEUE_TESTS
     wait_until_some
     wait_until_some_vector)
 
+set(ISHMEM_ALL_VISIBLE_RUN_SCRIPT "${SCRIPTS_DIR}/ctest/all_visible_wrapper")
+
 enable_testing()
 
 # -------------------------------------------------------------------
@@ -130,6 +134,18 @@ foreach(TEST_SOURCE_FILE ${TEST_SOURCE_FILES})
     target_link_libraries(${EXE} PRIVATE ishmem-test-common-unit)
 endforeach()
 
+foreach (N ${ISHMEM_PE_COUNTS_UNIT_TESTS})
+    add_test(NAME init_attr_device_id-flat-${N} COMMAND ${CTEST_WRAPPER} ${N}
+        ${ISHMEM_ALL_VISIBLE_RUN_SCRIPT} ./init_attr_device_id${CMAKE_EXECUTABLE_SUFFIX})
+    set_tests_properties(init_attr_device_id-flat-${N} PROPERTIES SKIP_RETURN_CODE 77)
+endforeach()
+
+add_test(NAME init_attr_device_id_required-flat COMMAND ${CMAKE_COMMAND} -E env
+    ZE_FLAT_DEVICE_HIERARCHY=FLAT
+    SYCL_DEVICE_FILTER=:gpu
+    ${CMAKE_CURRENT_BINARY_DIR}/init_attr_device_id_required${CMAKE_EXECUTABLE_SUFFIX})
+set_tests_properties(init_attr_device_id_required-flat PROPERTIES SKIP_RETURN_CODE 77)
+
 # -------------------------------------------------------------------
 # Add ctests
 
diff --git a/test/unit/init_attr_device_id.cpp b/test/unit/init_attr_device_id.cpp
new file mode 100644
index 0000000..19ff538
--- /dev/null
+++ b/test/unit/init_attr_device_id.cpp
@@ -0,0 +1,113 @@
+/* Copyright (C) 2025 Intel Corporation
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#include <common.h>
+#include <cstdlib>
+#include <vector>
+
+namespace {
+std::vector<sycl::device> get_visible_level_zero_gpus()
+{
+    std::vector<sycl::device> devices;
+
+    for (const auto &platform : sycl::platform::get_platforms()) {
+        if (platform.get_backend() != sycl::backend::ext_oneapi_level_zero) continue;
+
+        for (const auto &device : platform.get_devices()) {
+            if (device.is_gpu()) {
+                devices.push_back(device);
+            }
+        }
+    }
+
+    return devices;
+}
+
+int get_local_rank()
+{
+    constexpr const char *env_names[] = {
+        "MPI_LOCALRANKID",
+        "OMPI_COMM_WORLD_LOCAL_RANK",
+        "PMI_LOCAL_RANK",
+        "SLURM_LOCALID",
+    };
+
+    for (const char *name : env_names) {
+        const char *value = std::getenv(name);
+        if (value != nullptr) {
+            return std::atoi(value);
+        }
+    }
+
+    return 0;
+}
+}  // namespace
+
+int main()
+{
+    validate_runtime();
+
+    auto devices = get_visible_level_zero_gpus();
+    if (devices.size() < 2) {
+        std::cout << "Skipping explicit device_id test because fewer than 2 visible Level Zero GPUs "
+                     "were detected"
+                  << std::endl;
+        return 77;
+    }
+
+    int local_rank = get_local_rank();
+    if ((local_rank < 0) || (static_cast<size_t>(local_rank) >= devices.size())) {
+        std::cerr << "Invalid local rank " << local_rank << " for " << devices.size()
+                  << " visible GPU devices" << std::endl;
+        return EXIT_FAILURE;
+    }
+
+    ishmemx_attr_t attr;
+    attr.initialize_runtime = true;
+    attr.runtime = ishmemi_test_runtime->get_type();
+    attr.device_id = local_rank;
+    ishmemx_init_attr(&attr);
+
+    int my_pe = ishmem_my_pe();
+    int npes = ishmem_n_pes();
+    int peer = (my_pe + 1) % npes;
+
+    sycl::queue q(devices[static_cast<size_t>(local_rank)]);
+
+    std::cout << "PE " << my_pe << " selected device_id " << local_rank << ": "
+              << q.get_device().get_info<sycl::info::device::name>() << std::endl;
+
+    int *source = (int *) ishmem_malloc(sizeof(int));
+    CHECK_ALLOC(source);
+    int *target = (int *) ishmem_malloc(sizeof(int));
+    CHECK_ALLOC(target);
+    int *host_value = sycl::malloc_host<int>(1, q);
+    CHECK_ALLOC(host_value);
+
+    q.fill(source, my_pe, 1).wait_and_throw();
+    q.fill(target, -1, 1).wait_and_throw();
+    ishmem_barrier_all();
+
+    auto e = ishmemx_int_get_on_queue(target, source, 1, peer, q);
+    e.wait_and_throw();
+    ishmemx_quiet_on_queue(q).wait_and_throw();
+    q.copy(target, host_value, 1).wait_and_throw();
+
+    int rc = EXIT_SUCCESS;
+    if (*host_value != peer) {
+        std::cerr << "PE " << my_pe << " expected " << peer << " but received " << *host_value
+                  << std::endl;
+        rc = EXIT_FAILURE;
+    } else if (my_pe == 0) {
+        std::cout << "Test Passed" << std::endl;
+    }
+
+    ishmem_barrier_all();
+    sycl::free(host_value, q);
+    ishmem_free(target);
+    ishmem_free(source);
+    ishmem_finalize();
+
+    return rc;
+}
diff --git a/test/unit/init_attr_device_id_required.cpp b/test/unit/init_attr_device_id_required.cpp
new file mode 100644
index 0000000..938f93e
--- /dev/null
+++ b/test/unit/init_attr_device_id_required.cpp
@@ -0,0 +1,53 @@
+/* Copyright (C) 2025 Intel Corporation
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#include "accelerator.h"
+#include <cstdlib>
+#include <iostream>
+#include <vector>
+
+namespace {
+std::vector<sycl::device> get_visible_level_zero_gpus()
+{
+    std::vector<sycl::device> devices;
+
+    for (const auto &platform : sycl::platform::get_platforms()) {
+        if (platform.get_backend() != sycl::backend::ext_oneapi_level_zero) continue;
+
+        for (const auto &device : platform.get_devices()) {
+            if (device.is_gpu()) {
+                devices.push_back(device);
+            }
+        }
+    }
+
+    return devices;
+}
+}  // namespace
+
+int main()
+{
+    auto devices = get_visible_level_zero_gpus();
+    if (devices.size() < 2) {
+        std::cout << "Skipping device_id-required test because fewer than 2 visible Level Zero "
+                     "GPUs were detected"
+                  << std::endl;
+        return 77;
+    }
+
+    ishmemx_attr_t attr;
+    int ret = ishmemi_accelerator_init(&attr);
+
+    if (ret == 0) {
+        std::cerr << "Expected accelerator initialization to fail when multiple GPUs are visible "
+                     "and device_id is not set"
+                  << std::endl;
+        ishmemi_accelerator_fini();
+        return EXIT_FAILURE;
+    }
+
+    std::cout << "Detected required explicit device selection with " << devices.size()
+              << " visible GPUs" << std::endl;
+    return EXIT_SUCCESS;
+}

From d8fa12ef32533bf4d1a1ea04ae4a278f469525e3 Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Tue, 17 Mar 2026 16:09:29 +0000
Subject: [PATCH 14/17] ishmem4py: add init device_id support

---
 docs/source/ishmem4py.rst               |  8 +++
 ishmem4py/csrc/ishmem4py_runtime.cpp    |  7 ++
 ishmem4py/csrc/ishmem4py_runtime.h      |  1 +
 ishmem4py/python/README.md              |  2 +
 ishmem4py/python/ishmem4py/_lib.py      |  3 +
 ishmem4py/python/ishmem4py/init_fini.py | 18 ++++-
 ishmem4py/test/init_device_id_test.py   | 89 +++++++++++++++++++++++++
 7 files changed, 125 insertions(+), 3 deletions(-)
 create mode 100644 ishmem4py/test/init_device_id_test.py

diff --git a/docs/source/ishmem4py.rst b/docs/source/ishmem4py.rst
index 5efe7ec..a3db0af 100644
--- a/docs/source/ishmem4py.rst
+++ b/docs/source/ishmem4py.rst
@@ -141,6 +141,14 @@ Example:
        ishmem.free(buf)
        ishmem.finalize()
 
+When multiple XPU devices are visible to a process, pass ``device_id=...`` to
+``ishmem.init(...)`` to select the visible device ordinal for Intel SHMEM:
+
+.. code-block:: python
+
+   local_rank = int(os.environ["MPI_LOCALRANKID"])
+   ishmem.init(device_id=local_rank)
+
 Torch/XPU interop is also available as an optional build-time feature.
 Those helpers allocate symmetric memory directly as ``torch.Tensor`` objects backed by the
 Intel SHMEM heap, which is useful for host-driven one-sided XPU workflows:
diff --git a/ishmem4py/csrc/ishmem4py_runtime.cpp b/ishmem4py/csrc/ishmem4py_runtime.cpp
index 419322a..25dc1b8 100644
--- a/ishmem4py/csrc/ishmem4py_runtime.cpp
+++ b/ishmem4py/csrc/ishmem4py_runtime.cpp
@@ -97,6 +97,13 @@ void ishmem4py_init(void)
     ishmem_init();
 }
 
+void ishmem4py_init_with_device(int device_id)
+{
+    ishmemx_attr_t attr{};
+    attr.device_id = device_id;
+    ishmemx_init_attr(&attr);
+}
+
 void ishmem4py_finalize(void)
 {
     ishmem_finalize();
diff --git a/ishmem4py/csrc/ishmem4py_runtime.h b/ishmem4py/csrc/ishmem4py_runtime.h
index f15ced2..3f60107 100644
--- a/ishmem4py/csrc/ishmem4py_runtime.h
+++ b/ishmem4py/csrc/ishmem4py_runtime.h
@@ -42,6 +42,7 @@ typedef struct ishmem4py_team_config_t {
 } ishmem4py_team_config_t;
 
 ISHMEM4PY_EXPORT void ishmem4py_init(void);
+ISHMEM4PY_EXPORT void ishmem4py_init_with_device(int device_id);
 ISHMEM4PY_EXPORT void ishmem4py_finalize(void);
 ISHMEM4PY_EXPORT int ishmem4py_my_pe(void);
 ISHMEM4PY_EXPORT int ishmem4py_n_pes(void);
diff --git a/ishmem4py/python/README.md b/ishmem4py/python/README.md
index 52e8c6f..89d1802 100644
--- a/ishmem4py/python/README.md
+++ b/ishmem4py/python/README.md
@@ -71,6 +71,8 @@ export PYTHONPATH=/path/to/ishmem-src/ishmem4py/python:/path/to/ishmem-src/ishme
   PyTorch.
 - The runtime library is loaded lazily, so importing the package for documentation or static
   inspection does not require a live Intel SHMEM runtime.
+- `ishmem.init(device_id=...)` selects an explicit visible XPU ordinal for multi-device launch
+  modes. Leaving `device_id` unset preserves the single-visible-device behavior.
 - `put`, `get`, and `quiet` accept `queue=` for queue-based host-initiated XPU RMA. Supported
   queue objects are `torch.xpu.Stream`, `ctypes.c_void_p`, and raw integer SYCL queue pointers.
 - The current Torch/XPU MVP is focused on XPU device memory and XPU-to-XPU one-sided transfers.
diff --git a/ishmem4py/python/ishmem4py/_lib.py b/ishmem4py/python/ishmem4py/_lib.py
index 6c7870b..f44b984 100644
--- a/ishmem4py/python/ishmem4py/_lib.py
+++ b/ishmem4py/python/ishmem4py/_lib.py
@@ -23,6 +23,9 @@ def _configure_runtime(runtime):
     runtime.ishmem4py_init.argtypes = []
     runtime.ishmem4py_init.restype = None
 
+    runtime.ishmem4py_init_with_device.argtypes = [ctypes.c_int]
+    runtime.ishmem4py_init_with_device.restype = None
+
     runtime.ishmem4py_finalize.argtypes = []
     runtime.ishmem4py_finalize.restype = None
 
diff --git a/ishmem4py/python/ishmem4py/init_fini.py b/ishmem4py/python/ishmem4py/init_fini.py
index f74ac12..3e67a1e 100644
--- a/ishmem4py/python/ishmem4py/init_fini.py
+++ b/ishmem4py/python/ishmem4py/init_fini.py
@@ -37,10 +37,22 @@
 ]
 
 
-def init() -> None:
-    """Initialize Intel SHMEM for the current process."""
+def init(*, device_id: int | None = None) -> None:
+    """Initialize Intel SHMEM for the current process.
+
+    When ``device_id`` is provided, Intel SHMEM selects that visible XPU ordinal
+    through ``ishmemx_init_attr``. The default ``None`` preserves the legacy
+    single-visible-device behavior.
+    """
     _check_can_init()
-    RUNTIME.ishmem4py_init()
+    if device_id is None:
+        RUNTIME.ishmem4py_init()
+    else:
+        if isinstance(device_id, bool) or not isinstance(device_id, int):
+            raise TypeError("device_id must be an int or None")
+        if device_id < -1:
+            raise ValueError("device_id must be >= -1")
+        RUNTIME.ishmem4py_init_with_device(int(device_id))
     _set_initialized()
 
 
diff --git a/ishmem4py/test/init_device_id_test.py b/ishmem4py/test/init_device_id_test.py
new file mode 100644
index 0000000..562f69f
--- /dev/null
+++ b/ishmem4py/test/init_device_id_test.py
@@ -0,0 +1,89 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import os
+import subprocess
+import sys
+
+import ishmem4py as ishmem
+
+from utils import expect_equal, pack_int32, unpack_int32
+
+
+def _visible_level_zero_gpu_count() -> int:
+    try:
+        output = subprocess.check_output(["sycl-ls"], text=True)
+    except Exception:
+        return 0
+    return sum(1 for line in output.splitlines() if "[level_zero:" in line)
+
+
+def _local_rank() -> int:
+    for name in ("MPI_LOCALRANKID", "OMPI_COMM_WORLD_LOCAL_RANK", "PMI_LOCAL_RANK", "SLURM_LOCALID"):
+        value = os.environ.get(name)
+        if value is not None:
+            return int(value)
+    return 0
+
+
+def main() -> int:
+    try:
+        ishmem.init(device_id="0")  # type: ignore[arg-type]
+    except TypeError:
+        pass
+    else:
+        raise AssertionError("init(device_id='0') should raise TypeError")
+
+    try:
+        ishmem.init(device_id=-2)
+    except ValueError:
+        pass
+    else:
+        raise AssertionError("init(device_id=-2) should raise ValueError")
+
+    visible_gpus = _visible_level_zero_gpu_count()
+    local_rank = _local_rank()
+    if visible_gpus < 2:
+        print("Skipping init_device_id_test.py because fewer than 2 visible Level Zero GPUs were found")
+        return 77
+    if local_rank >= visible_gpus:
+        print(
+            f"Skipping init_device_id_test.py because local rank {local_rank} exceeds "
+            f"visible Level Zero GPU count {visible_gpus}"
+        )
+        return 77
+
+    ishmem.init(device_id=local_rank)
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        if npes < 2:
+            raise RuntimeError("init_device_id_test.py requires at least 2 PEs")
+
+        next_pe = (my_pe + 1) % npes
+        prev_pe = (my_pe + npes - 1) % npes
+
+        src = ishmem.malloc(4)
+        dst = ishmem.calloc(1, 4)
+        try:
+            src.write(pack_int32(my_pe))
+
+            ishmem.barrier_all()
+            ishmem.put(dst, src.read(4), pe=next_pe)
+            ishmem.barrier_all()
+
+            received = unpack_int32(dst.read(4))
+            expect_equal("put ring result with explicit device_id", received, prev_pe)
+        finally:
+            ishmem.free(dst)
+            ishmem.free(src)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())

From 1ea7cc2ed50fd9478ce4956fcfe2d90251e296eb Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Tue, 17 Mar 2026 16:12:30 +0000
Subject: [PATCH 15/17] ishmem4py: accept xpu device syntax in init

---
 docs/source/ishmem4py.rst               |  4 +-
 ishmem4py/python/README.md              |  3 +-
 ishmem4py/python/ishmem4py/init_fini.py | 66 ++++++++++++++++++++++---
 ishmem4py/test/init_device_id_test.py   | 39 +++++++++++++--
 4 files changed, 101 insertions(+), 11 deletions(-)

diff --git a/docs/source/ishmem4py.rst b/docs/source/ishmem4py.rst
index a3db0af..1484a77 100644
--- a/docs/source/ishmem4py.rst
+++ b/docs/source/ishmem4py.rst
@@ -142,7 +142,9 @@ Example:
        ishmem.finalize()
 
 When multiple XPU devices are visible to a process, pass ``device_id=...`` to
-``ishmem.init(...)`` to select the visible device ordinal for Intel SHMEM:
+``ishmem.init(...)`` to select the visible device ordinal for Intel SHMEM.
+``device_id`` may be an integer visible-device ordinal, a string like
+``"xpu:3"``, or a ``torch.device("xpu", 3)``:
 
 .. code-block:: python
 
diff --git a/ishmem4py/python/README.md b/ishmem4py/python/README.md
index 89d1802..9266225 100644
--- a/ishmem4py/python/README.md
+++ b/ishmem4py/python/README.md
@@ -72,7 +72,8 @@ export PYTHONPATH=/path/to/ishmem-src/ishmem4py/python:/path/to/ishmem-src/ishme
 - The runtime library is loaded lazily, so importing the package for documentation or static
   inspection does not require a live Intel SHMEM runtime.
 - `ishmem.init(device_id=...)` selects an explicit visible XPU ordinal for multi-device launch
-  modes. Leaving `device_id` unset preserves the single-visible-device behavior.
+  modes. `device_id` may be an integer, an XPU device string such as `xpu:1`, or a
+  `torch.device("xpu", 1)`. Leaving it unset preserves the single-visible-device behavior.
 - `put`, `get`, and `quiet` accept `queue=` for queue-based host-initiated XPU RMA. Supported
   queue objects are `torch.xpu.Stream`, `ctypes.c_void_p`, and raw integer SYCL queue pointers.
 - The current Torch/XPU MVP is focused on XPU device memory and XPU-to-XPU one-sided transfers.
diff --git a/ishmem4py/python/ishmem4py/init_fini.py b/ishmem4py/python/ishmem4py/init_fini.py
index 3e67a1e..58d4312 100644
--- a/ishmem4py/python/ishmem4py/init_fini.py
+++ b/ishmem4py/python/ishmem4py/init_fini.py
@@ -6,6 +6,7 @@
 from __future__ import annotations
 
 import ctypes
+from typing import Any
 
 from ._common import (
     InitStatus,
@@ -37,7 +38,7 @@
 ]
 
 
-def init(*, device_id: int | None = None) -> None:
+def init(*, device_id: Any = None) -> None:
     """Initialize Intel SHMEM for the current process.
 
     When ``device_id`` is provided, Intel SHMEM selects that visible XPU ordinal
@@ -45,15 +46,68 @@ def init(*, device_id: int | None = None) -> None:
     single-visible-device behavior.
     """
     _check_can_init()
-    if device_id is None:
+    normalized_device_id = _normalize_device_id(device_id)
+    if normalized_device_id is None:
         RUNTIME.ishmem4py_init()
     else:
-        if isinstance(device_id, bool) or not isinstance(device_id, int):
-            raise TypeError("device_id must be an int or None")
+        RUNTIME.ishmem4py_init_with_device(normalized_device_id)
+    _set_initialized()
+
+
+def _normalize_device_id(device_id: Any) -> int | None:
+    if device_id is None:
+        return None
+
+    if isinstance(device_id, bool):
+        raise TypeError("device_id must be an int, an XPU device string, a torch.device, or None")
+
+    if isinstance(device_id, int):
         if device_id < -1:
             raise ValueError("device_id must be >= -1")
-        RUNTIME.ishmem4py_init_with_device(int(device_id))
-    _set_initialized()
+        return int(device_id)
+
+    if isinstance(device_id, str):
+        if device_id == "xpu":
+            return _current_torch_xpu_device()
+        if device_id.startswith("xpu:"):
+            index_str = device_id.split(":", 1)[1]
+            if not index_str:
+                raise ValueError("device string 'xpu:' is missing a device index")
+            try:
+                index = int(index_str)
+            except ValueError as exc:
+                raise ValueError(f"invalid XPU device string: {device_id!r}") from exc
+            if index < 0:
+                raise ValueError("device_id must be >= -1")
+            return index
+        raise ValueError(f"unsupported device string for ishmem.init: {device_id!r}")
+
+    torch = _import_torch()
+    if torch is not None and isinstance(device_id, torch.device):
+        if device_id.type != "xpu":
+            raise ValueError(f"unsupported torch.device for ishmem.init: {device_id}")
+        if device_id.index is None:
+            return _current_torch_xpu_device()
+        if device_id.index < 0:
+            raise ValueError("device_id must be >= -1")
+        return int(device_id.index)
+
+    raise TypeError("device_id must be an int, an XPU device string, a torch.device, or None")
+
+
+def _import_torch():
+    try:
+        import torch
+    except Exception:
+        return None
+    return torch
+
+
+def _current_torch_xpu_device() -> int:
+    torch = _import_torch()
+    if torch is None:
+        raise RuntimeError("normalizing an XPU device requires PyTorch to be installed")
+    return int(torch.xpu.current_device())
 
 
 def finalize() -> None:
diff --git a/ishmem4py/test/init_device_id_test.py b/ishmem4py/test/init_device_id_test.py
index 562f69f..db89679 100644
--- a/ishmem4py/test/init_device_id_test.py
+++ b/ishmem4py/test/init_device_id_test.py
@@ -8,9 +8,15 @@
 import sys
 
 import ishmem4py as ishmem
+from ishmem4py import init_fini as _init_fini
 
 from utils import expect_equal, pack_int32, unpack_int32
 
+try:
+    import torch
+except Exception:
+    torch = None
+
 
 def _visible_level_zero_gpu_count() -> int:
     try:
@@ -31,10 +37,10 @@ def _local_rank() -> int:
 def main() -> int:
     try:
         ishmem.init(device_id="0")  # type: ignore[arg-type]
-    except TypeError:
+    except ValueError:
         pass
     else:
-        raise AssertionError("init(device_id='0') should raise TypeError")
+        raise AssertionError("init(device_id='0') should raise ValueError")
 
     try:
         ishmem.init(device_id=-2)
@@ -43,6 +49,13 @@ def main() -> int:
     else:
         raise AssertionError("init(device_id=-2) should raise ValueError")
 
+    try:
+        ishmem.init(device_id="cuda:0")
+    except ValueError:
+        pass
+    else:
+        raise AssertionError("init(device_id='cuda:0') should raise ValueError")
+
     visible_gpus = _visible_level_zero_gpu_count()
     local_rank = _local_rank()
     if visible_gpus < 2:
@@ -55,7 +68,27 @@ def main() -> int:
         )
         return 77
 
-    ishmem.init(device_id=local_rank)
+    if torch is not None:
+        with torch.xpu.device(local_rank):
+            expect_equal("xpu string current device normalization", _init_fini._normalize_device_id("xpu"), local_rank)
+            expect_equal(
+                "torch.device current device normalization",
+                _init_fini._normalize_device_id(torch.device("xpu")),
+                local_rank,
+            )
+            expect_equal(
+                "torch.device explicit normalization",
+                _init_fini._normalize_device_id(torch.device("xpu", local_rank)),
+                local_rank,
+            )
+    expect_equal(
+        "xpu explicit string normalization",
+        _init_fini._normalize_device_id(f"xpu:{local_rank}"),
+        local_rank,
+    )
+
+    init_device = torch.device("xpu", local_rank) if torch is not None else f"xpu:{local_rank}"
+    ishmem.init(device_id=init_device)
     try:
         my_pe = ishmem.my_pe()
         npes = ishmem.n_pes()

From 482003376452357c403ca666a2e8e55220f2da67 Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Thu, 19 Mar 2026 18:03:53 +0000
Subject: [PATCH 16/17] ishmem4py: add public peer tensor helper

---
 ishmem4py/python/ishmem4py/core.py       |  7 +++
 ishmem4py/python/ishmem4py/torch.py      | 66 ++++++++++++++++++++++--
 ishmem4py/test/torch_peer_tensor_test.py | 61 ++++++++++++++++++++++
 ishmem4py/test/torch_tensor_test.py      | 18 ++++++-
 4 files changed, 147 insertions(+), 5 deletions(-)
 create mode 100644 ishmem4py/test/torch_peer_tensor_test.py

diff --git a/ishmem4py/python/ishmem4py/core.py b/ishmem4py/python/ishmem4py/core.py
index dff7c88..cb427c8 100644
--- a/ishmem4py/python/ishmem4py/core.py
+++ b/ishmem4py/python/ishmem4py/core.py
@@ -58,6 +58,12 @@ def free_tensor(*args, **kwargs):
     return torch_interop.free_tensor(*args, **kwargs)
 
 
+def get_peer_tensor(*args, **kwargs):
+    from . import torch as torch_interop
+
+    return torch_interop.get_peer_tensor(*args, **kwargs)
+
+
 def tensor_base(*args, **kwargs):
     from . import torch as torch_interop
 
@@ -72,6 +78,7 @@ def is_symmetric_tensor(*args, **kwargs):
 
 __all__ += [
     "free_tensor",
+    "get_peer_tensor",
     "is_symmetric_tensor",
     "tensor",
     "tensor_base",
diff --git a/ishmem4py/python/ishmem4py/torch.py b/ishmem4py/python/ishmem4py/torch.py
index 8574e54..fb2c52f 100644
--- a/ishmem4py/python/ishmem4py/torch.py
+++ b/ishmem4py/python/ishmem4py/torch.py
@@ -5,10 +5,12 @@
 
 from __future__ import annotations
 
+import ctypes
 from dataclasses import dataclass
 from typing import Iterable
 
 from ._common import IshmemStateError, _PointerInfo, _require_initialized
+from ._lib import RUNTIME
 
 try:
     import torch as _torch
@@ -22,6 +24,7 @@
 
 __all__ = [
     "free_tensor",
+    "get_peer_tensor",
     "is_symmetric_tensor",
     "tensor",
     "tensor_base",
@@ -96,6 +99,7 @@ def _register_tensor(base: "_torch.Tensor") -> None:
     )
     base._ishmem_alloc = True  # type: ignore[attr-defined]
     base._ishmem_base = base  # type: ignore[attr-defined]
+    base._ishmem_free_owner = True  # type: ignore[attr-defined]
 
 
 def _allocation_from_base(base: "_torch.Tensor") -> _TensorAllocation:
@@ -126,6 +130,27 @@ def _assert_contiguous_tensor(tensor: "_torch.Tensor") -> None:
         raise ValueError("tensor must be contiguous")
 
 
+def _scalar_type_from_dtype(dtype) -> int:
+    scalar_type = _DTYPE_TO_SCALAR_TYPE.get(dtype)
+    if scalar_type is None:
+        raise TypeError(f"unsupported torch dtype for ishmem4py Torch interop: {dtype}")
+    return scalar_type
+
+
+def _remote_tensor_ptr(base: "_torch.Tensor", *, pe: int) -> int:
+    address = int(RUNTIME.ishmem4py_ptr(ctypes.c_void_p(int(base.data_ptr())), int(pe)) or 0)
+    if address == 0:
+        raise IshmemStateError(f"ishmem_ptr returned NULL for PE {pe}")
+    return address
+
+
+def _mark_tensor_alias(tensor: "_torch.Tensor", *, base: "_torch.Tensor", owner: bool) -> None:
+    tensor._ishmem_alias = True  # type: ignore[attr-defined]
+    tensor._ishmem_alloc = True  # type: ignore[attr-defined]
+    tensor._ishmem_base = base  # type: ignore[attr-defined]
+    tensor._ishmem_free_owner = owner  # type: ignore[attr-defined]
+
+
 def _pointer_from_tensor(tensor, *, writable: bool, offset: int = 0) -> _PointerInfo | None:
     if _torch is None or not isinstance(tensor, _torch.Tensor):
         return None
@@ -167,9 +192,7 @@ def tensor(
     shape = _normalize_shape(size)
     if dtype is None:
         dtype = _torch.get_default_dtype()
-    scalar_type = _DTYPE_TO_SCALAR_TYPE.get(dtype)
-    if scalar_type is None:
-        raise TypeError(f"unsupported torch dtype for ishmem4py.torch.tensor: {dtype}")
+    scalar_type = _scalar_type_from_dtype(dtype)
 
     normalized_device = _normalize_device(device)
     with _torch.xpu.device(normalized_device):
@@ -180,6 +203,7 @@ def tensor(
     else:
         result._ishmem_alloc = True  # type: ignore[attr-defined]
         result._ishmem_base = result  # type: ignore[attr-defined]
+        result._ishmem_free_owner = True  # type: ignore[attr-defined]
     if requires_grad:
         result.requires_grad_(True)
     return result
@@ -198,6 +222,38 @@ def tensor_base(tensor):
     return allocation.tensor
 
 
+def get_peer_tensor(tensor, pe: int):
+    """Return an XPU tensor alias for ``tensor`` on remote PE ``pe``."""
+    _require_initialized()
+    _require_torch()
+    if not isinstance(tensor, _torch.Tensor):
+        raise TypeError("expected a torch.Tensor")
+    if tensor.device.type != "xpu":
+        raise ValueError(f"get_peer_tensor only supports XPU tensors, got {tensor.device}")
+    if tensor.device.index is None:
+        raise ValueError("get_peer_tensor requires a concrete XPU device index")
+
+    _assert_contiguous_tensor(tensor)
+    base = tensor_base(tensor)
+    if not is_symmetric_tensor(base):
+        raise IshmemStateError("tensor is not backed by ishmem4py symmetric memory")
+
+    byte_offset = int(tensor.data_ptr()) - int(base.data_ptr())
+    if byte_offset < 0:
+        raise IshmemStateError("tensor points before its tracked symmetric base")
+    remote_ptr = _remote_tensor_ptr(base, pe=pe) + byte_offset
+    result = _TORCH_EXT.tensor_from_ptr(
+        remote_ptr,
+        list(tensor.shape),
+        _scalar_type_from_dtype(tensor.dtype),
+        tensor.device.index,
+    )
+    _mark_tensor_alias(result, base=base, owner=False)
+    result._ishmem_peer_pe = int(pe)  # type: ignore[attr-defined]
+    result._ishmem_keepalive = tensor  # type: ignore[attr-defined]
+    return result
+
+
 def is_symmetric_tensor(tensor) -> bool:
     """Return ``True`` when ``tensor`` references live symmetric memory."""
     if _torch is None or not isinstance(tensor, _torch.Tensor):
@@ -213,6 +269,10 @@ def free_tensor(tensor) -> None:
     _require_torch()
     if not isinstance(tensor, _torch.Tensor):
         raise TypeError("expected a torch.Tensor")
+    if bool(getattr(tensor, "_ishmem_alias", False)) and not bool(
+        getattr(tensor, "_ishmem_free_owner", False)
+    ):
+        raise IshmemStateError("tensor is a non-owning symmetric alias and cannot be freed")
     base = tensor_base(tensor)
     if not getattr(base, "_ishmem_alloc", False):
         raise IshmemStateError("tensor is not backed by ishmem4py symmetric memory")
diff --git a/ishmem4py/test/torch_peer_tensor_test.py b/ishmem4py/test/torch_peer_tensor_test.py
new file mode 100644
index 0000000..83131c1
--- /dev/null
+++ b/ishmem4py/test/torch_peer_tensor_test.py
@@ -0,0 +1,61 @@
+# Copyright (C) 2026 Intel Corporation
+# SPDX-License-Identifier: BSD-3-Clause
+
+from __future__ import annotations
+
+import sys
+
+import torch
+
+import ishmem4py as ishmem
+
+from utils import expect_equal, expect_true
+
+
+def main() -> int:
+    ishmem.init()
+    try:
+        my_pe = ishmem.my_pe()
+        npes = ishmem.n_pes()
+        if npes < 2:
+            raise RuntimeError("torch_peer_tensor_test.py requires at least 2 PEs")
+
+        next_pe = (my_pe + 1) % npes
+        prev_pe = (my_pe + npes - 1) % npes
+
+        target = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
+        try:
+            target.fill_(10.0 * my_pe)
+            torch.xpu.synchronize()
+            ishmem.barrier_all()
+
+            peer = ishmem.get_peer_tensor(target, pe=next_pe)
+            expect_true("peer tensor is symmetric", ishmem.is_symmetric_tensor(peer))
+            expect_true("peer tensor base returns original allocation", ishmem.tensor_base(peer) is target)
+
+            peer.add_(torch.full_like(peer, float(my_pe + 1)))
+            torch.xpu.synchronize()
+
+            try:
+                ishmem.free_tensor(peer)
+            except ishmem.IshmemStateError:
+                pass
+            else:
+                raise AssertionError("free_tensor(peer) should fail for non-owning aliases")
+
+            ishmem.barrier_all()
+            expect_equal(
+                "remote peer alias update",
+                target.cpu().tolist(),
+                [10.0 * my_pe + float(prev_pe + 1)] * 4,
+            )
+        finally:
+            ishmem.free_tensor(target)
+    finally:
+        ishmem.finalize()
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/ishmem4py/test/torch_tensor_test.py b/ishmem4py/test/torch_tensor_test.py
index 532ca16..f87f2d8 100644
--- a/ishmem4py/test/torch_tensor_test.py
+++ b/ishmem4py/test/torch_tensor_test.py
@@ -25,10 +25,24 @@ def main() -> int:
             expect_true("tensor base returns self for base allocation", ishmem.tensor_base(tensor) is tensor)
             expect_equal("tensor contents", tensor.cpu().tolist(), [3.0, 3.0, 3.0, 3.0])
 
+            peer = ishmem.get_peer_tensor(tensor, pe=ishmem.my_pe())
+            expect_true("peer tensor is symmetric", ishmem.is_symmetric_tensor(peer))
+            expect_true("peer tensor base returns original allocation", ishmem.tensor_base(peer) is tensor)
+            peer.add_(2.0)
+            torch.xpu.synchronize()
+            expect_equal("peer tensor self alias update", tensor.cpu().tolist(), [5.0, 5.0, 5.0, 5.0])
+
+            try:
+                ishmem.free_tensor(peer)
+            except ishmem.IshmemStateError:
+                pass
+            else:
+                raise AssertionError("free_tensor(peer) should fail for non-owning aliases")
+
             stream = torch.xpu.Stream()
             ishmem.put(mirror, tensor, pe=ishmem.my_pe(), queue=stream)
             ishmem.quiet(queue=stream)
-            expect_equal("queued self put", mirror.cpu().tolist(), [3.0, 3.0, 3.0, 3.0])
+            expect_equal("queued self put", mirror.cpu().tolist(), [5.0, 5.0, 5.0, 5.0])
 
             recv = ishmem.tensor((4,), dtype=torch.float32, device="xpu")
             try:
@@ -36,7 +50,7 @@ def main() -> int:
                 stream = torch.xpu.Stream()
                 ishmem.get(recv, tensor, pe=ishmem.my_pe(), queue=stream)
                 ishmem.quiet(queue=stream)
-                expect_equal("queued self get", recv.cpu().tolist(), [3.0, 3.0, 3.0, 3.0])
+                expect_equal("queued self get", recv.cpu().tolist(), [5.0, 5.0, 5.0, 5.0])
             finally:
                 ishmem.free_tensor(recv)
         finally:

From 6670df3ced3aeef78c0552ac9a9f4ec33c1e5033 Mon Sep 17 00:00:00 2001
From: Benjamin Brock <benjamin.brock@intel.com>
Date: Thu, 19 Mar 2026 18:03:55 +0000
Subject: [PATCH 17/17] ishmem4py: document peer tensor interop

---
 docs/source/ishmem4py.rst  | 14 +++++++++++++-
 ishmem4py/python/README.md |  6 +++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/docs/source/ishmem4py.rst b/docs/source/ishmem4py.rst
index 1484a77..51e3780 100644
--- a/docs/source/ishmem4py.rst
+++ b/docs/source/ishmem4py.rst
@@ -53,6 +53,7 @@ Memory Management
    free
    tensor
    free_tensor
+   get_peer_tensor
    tensor_base
    is_symmetric_tensor
    ptr
@@ -174,8 +175,19 @@ Intel SHMEM heap, which is useful for host-driven one-sided XPU workflows:
        ishmem.free_tensor(src)
        ishmem.finalize()
 
+``get_peer_tensor`` provides the tensor-level analogue of ``ishmem_ptr`` for Torch/XPU interop:
+
+.. code-block:: python
+
+   peer = ishmem.get_peer_tensor(src, pe=neighbor)
+   peer.add_(update)
+
+The returned tensor is a non-owning XPU alias of a remote symmetric allocation. It is valid for
+device-side loads and stores, but must not be passed to ``ishmem.free_tensor(...)``.
+
 When torch interop is built, the top-level module exposes ``tensor``, ``free_tensor``,
-``tensor_base``, and ``is_symmetric_tensor`` lazily so that plain ``import ishmem4py`` does
+``get_peer_tensor``, ``tensor_base``, and ``is_symmetric_tensor`` lazily so that plain
+``import ishmem4py`` does
 not require importing PyTorch up front.
 
 Queue-Based RMA
diff --git a/ishmem4py/python/README.md b/ishmem4py/python/README.md
index 9266225..8a608b7 100644
--- a/ishmem4py/python/README.md
+++ b/ishmem4py/python/README.md
@@ -66,9 +66,13 @@ export PYTHONPATH=/path/to/ishmem-src/ishmem4py/python:/path/to/ishmem-src/ishme
 
 - `import ishmem4py` and `import ishmem4py.core` both expose the full public API.
 - Torch/XPU interop is imported lazily through `ishmem4py.tensor(...)`,
-  `ishmem4py.free_tensor(...)`, `ishmem4py.tensor_base(...)`, and
+  `ishmem4py.free_tensor(...)`, `ishmem4py.get_peer_tensor(...)`,
+  `ishmem4py.tensor_base(...)`, and
   `ishmem4py.is_symmetric_tensor(...)` so that plain package import does not eagerly import
   PyTorch.
+- `ishmem4py.get_peer_tensor(tensor, pe=...)` mirrors `nvshmem4py`'s peer-tensor helper for
+  directly addressable XPU aliases returned by `ishmem_ptr`. The returned tensor is a non-owning
+  alias and must not be passed to `ishmem4py.free_tensor(...)`.
 - The runtime library is loaded lazily, so importing the package for documentation or static
   inspection does not require a live Intel SHMEM runtime.
 - `ishmem.init(device_id=...)` selects an explicit visible XPU ordinal for multi-device launch