From 61e0edebadfcfdec5e8c96fac36de664c6e65e40 Mon Sep 17 00:00:00 2001
From: Ludovic Raess <ludovic.rass@gmail.com>
Date: Wed, 17 Dec 2025 10:21:35 +0100
Subject: [PATCH 1/5] Improve GPU-aware section

---
 docs/src/usage.md | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index c57eae1af..2835a1c38 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -74,33 +74,48 @@ with:
 $ mpiexecjl --project=/path/to/project -n 20 julia script.jl
 ```
 
-## CUDA-aware MPI support
+## GPU-aware MPI support
 
-If your MPI implementation has been compiled with CUDA support, then `CUDA.CuArray`s (from the
-[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package) can be passed directly as
-send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
+If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
+[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
+send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires to in most cases to use a [system provided MPI installation](configuration.md#using_system_mpi).
 
-Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2) 
-should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the 
-[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm 
+### CUDA
+
+Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
+should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the
+[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm
 your CUDA-aware MPI implementation to use multiple Nvidia GPUs (one GPU per rank).
 
 If using OpenMPI, the status of CUDA support can be checked via the
 [`MPI.has_cuda()`](@ref) function.
 
-## ROCm-aware MPI support
-
-If your MPI implementation has been compiled with ROCm support (AMDGPU), then `AMDGPU.ROCArray`s (from the
-[AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package) can be passed directly as send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
+### ROCm
 
-Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c) 
-should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the 
-[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm 
+Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c)
+should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the
+[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm
 your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
+> [!NOTE]
+> On Cray machines, you may need to ensure the following preloads to be set in the preferences:
+> ```
+> preloads = ["libmpi_gtl_hsa.so"]
+> preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+> ```
+
+> [!NOTE]
+> In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
+> If using (1), one can use the node-local rank `rank_loc` to select the GPU device:
+> ```
+> comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+> rank_loc = MPI.Comm_rank(comm_loc)
+> ```
+> If using (2), one can use the default device but make sur to handle device visbility in the scheduler; for SLURM on Cray systems, this can be mostly achieved using `--gpus-per-task=1`.
+
 ## Writing MPI tests
 
 It is recommended to use the `mpiexec()` wrapper when writing your package tests in `runtests.jl`:

From b8312ea6cc751f58be2fa5f9f1c9396cd160ff66 Mon Sep 17 00:00:00 2001
From: Ludovic Raess <ludovic.rass@gmail.com>
Date: Wed, 17 Dec 2025 19:56:05 +0100
Subject: [PATCH 2/5] Add suggestions

---
 docs/src/usage.md | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index 2835a1c38..dca16be84 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -78,7 +78,7 @@ $ mpiexecjl --project=/path/to/project -n 20 julia script.jl
 
 If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
 [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
-send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires to in most cases to use a [system provided MPI installation](configuration.md#using_system_mpi).
+send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires in most cases to use a [system provided MPI installation](@ref using_system_mpi).
 
 ### CUDA
 
@@ -100,21 +100,21 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
-> [!NOTE]
-> On Cray machines, you may need to ensure the following preloads to be set in the preferences:
-> ```
-> preloads = ["libmpi_gtl_hsa.so"]
-> preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
-> ```
-
-> [!NOTE]
-> In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
-> If using (1), one can use the node-local rank `rank_loc` to select the GPU device:
-> ```
-> comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
-> rank_loc = MPI.Comm_rank(comm_loc)
-> ```
-> If using (2), one can use the default device but make sur to handle device visbility in the scheduler; for SLURM on Cray systems, this can be mostly achieved using `--gpus-per-task=1`.
+!!! note "Preloads"
+    On Cray machines, you may need to ensure the following preloads to be set in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
+!!! note "Multiple GPUs per node"
+    In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
+    For (1), using the node-local rank `rank_loc` is a way to select the GPU device:
+    ```
+    comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+    rank_loc = MPI.Comm_rank(comm_loc)
+    ```
+    For (2), one can use the default device but make sur to handle device visibility in the scheduler or by using `CUDA/ROCM_VISIBLE_DEVICES`.
 
 ## Writing MPI tests
 

From a01708036656ecdf767b0c22afa48e9006e3a76f Mon Sep 17 00:00:00 2001
From: Ludovic Raess <ludovic.rass@gmail.com>
Date: Wed, 17 Dec 2025 21:05:50 +0100
Subject: [PATCH 3/5] Update

---
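Note for reviewers: the "Multiple GPUs per node" text says the node-local rank can be used to select the GPU device. Below is a minimal sketch of that mapping, runnable without MPI or a GPU. The helper `device_index` and its `one_based` keyword are hypothetical and not part of MPI.jl, CUDA.jl or AMDGPU.jl. Wrapping around when a node runs more ranks than it has GPUs is an assumption of this sketch; the docs assume one GPU per rank.

```julia
# Hypothetical helper (not part of MPI.jl, CUDA.jl or AMDGPU.jl): map a
# node-local rank to a device index. Wrapping with `mod` is an assumption for
# the case where a node runs more ranks than it has GPUs.
function device_index(rank_loc::Integer, ndevices::Integer; one_based::Bool=false)
    ndevices > 0 || throw(ArgumentError("ndevices must be positive"))
    idx = mod(rank_loc, ndevices)      # 0-based device index
    return one_based ? idx + 1 : idx   # shift for 1-based device numbering
end

# With 4 GPUs per node, node-local ranks 0..3 map one-to-one.
println([device_index(r, 4) for r in 0:3])                  # prints [0, 1, 2, 3]
println([device_index(r, 4; one_based=true) for r in 0:3])  # prints [1, 2, 3, 4]

# In an actual run the index would be handed to the GPU package, e.g.
#   CUDA.device!(device_index(rank_loc, ndevices))                       # 0-based
#   AMDGPU.device_id!(device_index(rank_loc, ndevices; one_based=true))  # 1-based
```

The two index conventions mirror the calls used in the example scripts of this series (`CUDA.device!(rank_l)` and `AMDGPU.device_id!(rank_l+1)`).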
 docs/src/usage.md | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index dca16be84..1150ee9ea 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -80,6 +80,13 @@ If your MPI implementation has been compiled with CUDA or ROCm support, then `CU
 [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
 send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI requires in most cases to use a [system provided MPI installation](@ref using_system_mpi).
 
+!!! note "Preloads"
+    On Cray machines, you may need to ensure the following preloads to be set in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
 ### CUDA
 
 Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
@@ -100,21 +107,15 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
-!!! note "Preloads"
-    On Cray machines, you may need to ensure the following preloads to be set in the preferences:
-    ```
-    preloads = ["libmpi_gtl_hsa.so"]
-    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
-    ```
+### Multiple GPUs per node
 
-!!! note "Multiple GPUs per node"
-    In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
-    For (1), using the node-local rank `rank_loc` is a way to select the GPU device:
-    ```
-    comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
-    rank_loc = MPI.Comm_rank(comm_loc)
-    ```
-    For (2), one can use the default device but make sur to handle device visibility in the scheduler or by using `CUDA/ROCM_VISIBLE_DEVICES`.
+In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
+For (1), using the node-local rank `rank_loc` is a way to select the GPU device:
+```
+comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_loc = MPI.Comm_rank(comm_loc)
+```
+For (2), one can use the default device but make sur to handle device visibility in the scheduler or by using `CUDA/ROCM_VISIBLE_DEVICES`.
 
 ## Writing MPI tests
 

From d64e598ee05ff9ddbf3da6c48d9cefad77e69cf9 Mon Sep 17 00:00:00 2001
From: Ludovic Raess <ludovic.rass@gmail.com>
Date: Wed, 17 Dec 2025 21:25:00 +0100
Subject: [PATCH 4/5] Add examples

---
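Note for reviewers: a minimal sketch of what the example scripts below should print, assuming the ring exchange they implement (each rank sends to `mod(rank+1, size)` and receives from `mod(rank-1, size)`). The helpers `ring_neighbors` and `expected_recv` are hypothetical names, not part of MPI.jl, and the sketch runs on the host without MPI or a GPU.

```julia
# Hypothetical helpers (not part of MPI.jl): reproduce the ring pattern of the
# example scripts on the host so the printed receive buffers can be checked.

# Destination and source ranks in a periodic ring of `nranks` processes.
ring_neighbors(rank::Integer, nranks::Integer) =
    (dst = mod(rank + 1, nranks), src = mod(rank - 1, nranks))

# Each rank fills its send buffer with Float64(rank), so after the exchange a
# rank should hold N copies of its source rank.
expected_recv(rank::Integer, nranks::Integer, N::Integer) =
    fill(Float64(ring_neighbors(rank, nranks).src), N)

# Expected output for a 4-rank run with N = 4, as in the scripts.
for rank in 0:3
    nb = ring_neighbors(rank, 4)
    println("rank=$rank dst=$(nb.dst) src=$(nb.src) expect=$(expected_recv(rank, 4, 4))")
end
```

If a rank prints a buffer that differs from `expected_recv`, the device buffers were most likely not handled correctly by the MPI library.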
 docs/examples/alltoall_test_cuda.jl          | 27 ++++++++++++++
 docs/examples/alltoall_test_cuda_multigpu.jl | 38 ++++++++++++++++++++
 docs/examples/alltoall_test_rocm.jl          | 27 ++++++++++++++
 docs/examples/alltoall_test_rocm_multigpu.jl | 38 ++++++++++++++++++++
 docs/src/usage.md                            |  8 ++---
 5 files changed, 134 insertions(+), 4 deletions(-)
 create mode 100644 docs/examples/alltoall_test_cuda.jl
 create mode 100644 docs/examples/alltoall_test_cuda_multigpu.jl
 create mode 100644 docs/examples/alltoall_test_rocm.jl
 create mode 100644 docs/examples/alltoall_test_rocm_multigpu.jl

diff --git a/docs/examples/alltoall_test_cuda.jl b/docs/examples/alltoall_test_cuda.jl
new file mode 100644
index 000000000..05011985c
--- /dev/null
+++ b/docs/examples/alltoall_test_cuda.jl
@@ -0,0 +1,27 @@
+# This example checks that your MPI implementation has CUDA support enabled.
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst  = mod(rank+1, size)
+src  = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_cuda_multigpu.jl b/docs/examples/alltoall_test_cuda_multigpu.jl
new file mode 100644
index 000000000..cc4838153
--- /dev/null
+++ b/docs/examples/alltoall_test_cuda_multigpu.jl
@@ -0,0 +1,38 @@
+# This example checks that your CUDA-aware MPI implementation can use
+# multiple Nvidia GPUs (one GPU per rank).
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using node-local communicator to retrieve node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+gpu_id = CUDA.device!(rank_l)
+# or use the default device if the scheduler exposes a different GPU to each rank (e.g. SLURM `--gpus-per-task=1`)
+# gpu_id = CUDA.device!(0)
+
+size = MPI.Comm_size(comm)
+dst  = mod(rank+1, size)
+src  = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_rocm.jl b/docs/examples/alltoall_test_rocm.jl
new file mode 100644
index 000000000..e8be85b34
--- /dev/null
+++ b/docs/examples/alltoall_test_rocm.jl
@@ -0,0 +1,27 @@
+# This example checks that your MPI implementation has ROCm support enabled.
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst  = mod(rank+1, size)
+src  = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/examples/alltoall_test_rocm_multigpu.jl b/docs/examples/alltoall_test_rocm_multigpu.jl
new file mode 100644
index 000000000..c26348261
--- /dev/null
+++ b/docs/examples/alltoall_test_rocm_multigpu.jl
@@ -0,0 +1,38 @@
+# This example checks that your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using node-local communicator to retrieve node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+device = AMDGPU.device_id!(rank_l+1)
+# or use the default device if the scheduler exposes a different GPU to each rank (e.g. SLURM `--gpus-per-task=1`)
+# device = AMDGPU.device_id!(1)
+gpu_id = AMDGPU.device_id(AMDGPU.device())
+
+size = MPI.Comm_size(comm)
+dst  = mod(rank+1, size)
+src  = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
diff --git a/docs/src/usage.md b/docs/src/usage.md
index 1150ee9ea..969cb6d3d 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -89,9 +89,9 @@ send and receive buffers for point-to-point and collective operations (they may
 
 ### CUDA
 
-Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
+Successfully running the [alltoall\_test\_cuda.jl](../examples/alltoall_test_cuda.jl)
 should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the
-[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm
+[alltoall\_test\_cuda\_multigpu.jl](../examples/alltoall_test_cuda_multigpu.jl) should confirm
 your CUDA-aware MPI implementation to use multiple Nvidia GPUs (one GPU per rank).
 
 If using OpenMPI, the status of CUDA support can be checked via the
@@ -99,9 +99,9 @@ If using OpenMPI, the status of CUDA support can be checked via the
 
 ### ROCm
 
-Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c)
+Successfully running the [alltoall\_test\_rocm.jl](../examples/alltoall_test_rocm.jl)
 should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the
-[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm
+[alltoall\_test\_rocm\_multigpu.jl](../examples/alltoall_test_rocm_multigpu.jl) should confirm
 your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 
 If using OpenMPI, the status of ROCm support can be checked via the

From 55f2fe9d5342953839d7396bac2e162b9763d52c Mon Sep 17 00:00:00 2001
From: Lucas C Wilcox <lucas@swirlee.com>
Date: Tue, 13 Jan 2026 13:19:35 -0800
Subject: [PATCH 5/5] Fix spelling typo.

---
 docs/src/usage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index 969cb6d3d..705fe0770 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -115,7 +115,7 @@ For (1), using the node-local rank `rank_loc` is a way to select the GPU device:
 comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
 rank_loc = MPI.Comm_rank(comm_loc)
 ```
-For (2), one can use the default device but make sur to handle device visibility in the scheduler or by using `CUDA/ROCM_VISIBLE_DEVICES`.
+For (2), one can use the default device, but make sure to handle device visibility in the scheduler or by setting `CUDA_VISIBLE_DEVICES` (CUDA) or `ROCR_VISIBLE_DEVICES` (ROCm).
 
 ## Writing MPI tests