diff --git a/README.md b/README.md
index 03ef6c1..9bead01 100644
--- a/README.md
+++ b/README.md
@@ -10,18 +10,19 @@
 [![Conda Version](https://anaconda.org/openbiosim/loch/badges/downloads.svg)](https://anaconda.org/openbiosim/loch)
 [![License: GPL v3](https://img.shields.io/badge/License-GPL_v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0.en.html)
 
-CUDA accelerated Grand Canonical Monte Carlo (GCMC) water sampling code. Built
+CUDA/OpenCL accelerated Grand Canonical Monte Carlo (GCMC) water sampling code. Built
 on top of [Sire](https://github.com/OpenBioSim/sire),
 [BioSimSpace](https://github.com/OpenBioSim/biosimspace),
-[OpenMM](https://github.com/openmm/openmm), and
-[PyCUDA](https://documen.tician.de/pycuda/index.html#).
+[OpenMM](https://github.com/openmm/openmm),
+[PyCUDA](https://documen.tician.de/pycuda/index.html#),
+and [PyOpenCL](https://documen.tician.de/pyopencl/).
 
 ## Installation
 
 First, create a conda environment with the required dependencies:
 
 ```
-conda create -f environment.yaml
+conda env create -f environment.yaml
 conda activate loch
 ```
 
@@ -49,7 +50,7 @@ conda install -c conda-forge -c openbiosim/label/dev loch
 
 Instead of computing the energy change for each trial insertion/deletion with
 OpenMM, the calculation is performed at the reaction field (RF) level using
-a custom CUDA kernel, allowing multiple candidates to be evaluated
+a custom CUDA/OpenCL kernel, allowing multiple candidates to be evaluated
 simultaneously. Particle mesh Ewald (PME) is handled via the method for
 sampling from an approximate potential (in this case the RF potential)
 introduced [here](https://doi.org/10.1063/1.1563597). Parallelisation of the
@@ -228,8 +229,9 @@ to enhance sampling.
 Once finished, `mu_ex` will contain the computed excess chemical potential in units
 kcal/mol.
 
-Note that the simulation requires a system with CUDA support. Please set the
-`CUDA_VISIBLE_DEVICES` environment variable accordingly.
+Note that the simulation requires a system with CUDA or OpenCL support. Please
+set the `CUDA_VISIBLE_DEVICES` or `OPENCL_VISIBLE_DEVICES` environment variable
+accordingly.
 
 The standard volume can be computed as follows:
 
@@ -263,13 +265,11 @@ Free Energy Perturbation (FEP) with GCMC using `loch` is supported via the
 
 ## Notes
 
-* Make sure that `nvcc` is in your `PATH`. If you require a different `nvcc` to that
-  provided by conda, you can set the `PYCUDA_NVCC` environment variable to point
-  to the desired `nvcc` binary, or use the `nvcc` kwarg in the `GCMCSampler` constructor.
-  Depending on your setup, you may also need to install the `cuda-nvvm` package from
-  `conda-forge`.
-
-* A future version supporting AMD GPUs via PyOpenCL is planned.
+* When using the CUDA platform, make sure that `nvcc` is in your `PATH`. If you require
+  a different `nvcc` to that provided by conda, you can set the `PYCUDA_NVCC` environment
+  variable to point to the desired `nvcc` binary, or use the `nvcc` kwarg in the
+  `GCMCSampler` constructor. Depending on your setup, you may also need to install the
+  `cuda-nvvm` package from `conda-forge`.
 
 * OpenMM-to-Sire roundtrip example:
 
diff --git a/WHITEPAPER.md b/WHITEPAPER.md
index cf773fa..3d74690 100644
--- a/WHITEPAPER.md
+++ b/WHITEPAPER.md
@@ -1,22 +1,23 @@
-# Loch: CUDA accelerated Grand Canonical Monte Carlo (GCMC) water sampling
+# Loch: GPU accelerated Grand Canonical Monte Carlo (GCMC) water sampling
 
 ## Introduction
 
-We present `loch`, a high-performance CUDA-accelerated Python package designed
+We present `loch`, a high-performance GPU-accelerated Python package designed
 for Grand Canonical Monte Carlo (GCMC) water sampling in molecular simulations
 via [OpenMM](https://openmm.org/). To enable parallelisation of insertion and
-deletion attempts, `loch` leverages GPU capabilities using a custom CUDA kernel
-for nonbonded interactions. This allows thousands of GCMC trials to be attempted
-in parallel, significantly enhancing sampling efficiency compared to traditional
-CPU-based implementations that perform sequential attempts via the OpenMM Python
-API. Additionally, electrostatics for GCMC attempts are computed using the
-reaction field (RF) method, with accepted candidates being re-evaluated with a
-correction step based on the difference between reaction field and Particle Mesh
-Ewald (PME) potential energies. The use of an approximate potential for trial
-moves leads to a substantial speed-up in GCMC move evaluation. `loch` has been
-designed to be modular, allowing standalone GCMC sampling, or integration with
-OpenMM-based molecular dynamics simulation code, e.g. as has been done in the
-[SOMD2](https://github.com/openbiosim/somd2) free-energy perturbation engine.
+deletion attempts, `loch` leverages GPU capabilities using a custom CUDA/OpenCL
+kernel for nonbonded interactions. This allows thousands of GCMC trials to be
+attempted in parallel, significantly enhancing sampling efficiency compared to
+traditional CPU-based implementations that perform sequential attempts via the
+OpenMM Python API. Additionally, electrostatics for GCMC attempts are computed
+using the reaction field (RF) method, with accepted candidates being
+re-evaluated with a correction step based on the difference between reaction
+field and Particle Mesh Ewald (PME) potential energies. The use of an
+approximate potential for trial moves leads to a substantial speed-up in GCMC
+move evaluation. `loch` has been designed to be modular, allowing standalone
+GCMC sampling, or integration with OpenMM-based molecular dynamics simulation
+code, e.g. as has been done in the [SOMD2](https://github.com/openbiosim/somd2)
+free-energy perturbation engine.
 
 ## Parallelisation strategy
 
@@ -52,6 +53,14 @@ each iteration, as more trials need to be evaluated in parallel, and more data
 needs to be transferred to and from the GPU, in which case it might be more
 efficient to simply perform more iterations with a smaller batch size.
 
+To enable reproduciblility across GPU platforms we choose to generate random
+numbers on the host using NumPy's random number generator, then transfer these
+to the GPU kernels where required. This avoids differences in random number
+generation across different GPU architectures and drivers, making testing
+and validation of the implementation significantly easier. In benchmarks we
+have found the NumPy approach to be as performant as using GPU-based random
+numbers for the typical batch sizes employed in `loch`.
+
 ## Sampling from an approximate potential
 
 In order to further accelerate the evaluation of GCMC insertion and deletion
@@ -91,7 +100,7 @@ Other than the cost of evaluating GCMC trials using PME, performance is aslo
 impacted by the cost of updating nonbonded parameters and atomic positions
 in the OpenMM context after each accepted insertion or deletion. (No updates
 are required for trial moves, since these are all evaluated via the custom
-CUDA kernel.) [Recent updates](https://github.com/openmm/openmm/pull/4610)
+CUDA/OpenCL kernel.) [Recent updates](https://github.com/openmm/openmm/pull/4610)
 to OpenMM have helped mitigate the cost of modifying force field parameters,
 allowing updates for only the subset of parameters that have changed within
 a particular force. However, updating atomic positions still requires
diff --git a/environment.yaml b/environment.yaml
index 08acf8f..be4b621 100644
--- a/environment.yaml
+++ b/environment.yaml
@@ -8,3 +8,4 @@ dependencies:
   - biosimspace
   - loguru
   - pycuda
+  - pyopencl
diff --git a/examples/bpti/bpti.py b/examples/bpti/bpti.py
index 83c8f8e..0cffe16 100644
--- a/examples/bpti/bpti.py
+++ b/examples/bpti/bpti.py
@@ -53,6 +53,14 @@
     choices=["info", "debug", "error"],
     required=False,
 )
+parser.add_argument(
+    "--platform",
+    help="The GPU platform to use",
+    type=str,
+    default="auto",
+    choices=["auto", "cuda", "opencl"],
+    required=False,
+)
 
 args = parser.parse_args()
 
@@ -78,6 +86,7 @@
     num_ghost_waters=100,
     bulk_sampling_probability=0,
     log_level=args.log_level,
+    platform=args.platform,
     overwrite=True,
 )
 
@@ -92,6 +101,7 @@
     pressure=None,
     constraint="h_bonds",
     timestep="2 fs",
+    platform=args.platform,
 )
 d.randomise_velocities()
 
diff --git a/examples/scytalone/sd.py b/examples/scytalone/sd.py
index 14d2aa8..52117f3 100644
--- a/examples/scytalone/sd.py
+++ b/examples/scytalone/sd.py
@@ -64,6 +64,14 @@
     choices=["info", "debug", "error"],
     required=False,
 )
+parser.add_argument(
+    "--platform",
+    help="The GPU platform to use",
+    type=str,
+    default="auto",
+    choices=["auto", "cuda", "opencl"],
+    required=False,
+)
 args = parser.parse_args()
 
 # Store the ligand index.
@@ -90,6 +98,7 @@
     ghost_file=f"ghosts_{lig}.txt",
     log_file=f"gcmc_{lig}.txt",
     log_level=args.log_level,
+    platform=args.platform,
     overwrite=True,
 )
 
@@ -104,6 +113,7 @@
     pressure=None,
     constraint="h_bonds",
     timestep="2 fs",
+    platform=args.platform,
 )
 d.randomise_velocities()
 
diff --git a/examples/water/water.py b/examples/water/water.py
index 22e0dbd..11eaa2f 100644
--- a/examples/water/water.py
+++ b/examples/water/water.py
@@ -66,6 +66,14 @@
     choices=["info", "debug", "error"],
     required=False,
 )
+parser.add_argument(
+    "--platform",
+    help="The GPU platform to use",
+    type=str,
+    default="auto",
+    choices=["auto", "cuda", "opencl"],
+    required=False,
+)
 args = parser.parse_args()
 
 # Load the water box.
@@ -91,6 +99,8 @@
     temperature=args.temperature,
     num_ghost_waters=100,
     log_level=args.log_level,
+    platform=args.platform,
+    overwrite=True,
 )
 
 # Create a dynamics object using the modified GCMC system.
@@ -104,6 +114,7 @@
     pressure=None,
     constraint="h_bonds",
     timestep="2 fs",
+    platform=args.platform,
 )
 d.randomise_velocities()
 
diff --git a/recipes/loch/template.yaml b/recipes/loch/template.yaml
index 50a2ac2..0096832 100644
--- a/recipes/loch/template.yaml
+++ b/recipes/loch/template.yaml
@@ -18,6 +18,7 @@ requirements:
     - loguru
     - pip
     - pycuda # [not macos]
+    - pyopencl
     - python
     - setuptools
     - sire
diff --git a/src/loch/__init__.py b/src/loch/__init__.py
index 6aa5abe..45a2c0c 100644
--- a/src/loch/__init__.py
+++ b/src/loch/__init__.py
@@ -1,7 +1,7 @@
 ######################################################################
 # Loch: GPU accelerated GCMC water sampling engine.
 #
-# Copyright: 2025
+# Copyright: 2025-2026
 #
 # Authors: The OpenBioSim Team <team@openbiosim.org>
 #
diff --git a/src/loch/_kernels.py b/src/loch/_kernels.py
index b45a456..185f708 100644
--- a/src/loch/_kernels.py
+++ b/src/loch/_kernels.py
@@ -1,7 +1,7 @@
 ######################################################################
 # Loch: GPU accelerated GCMC water sampling engine.
 #
-# Copyright: 2025
+# Copyright: 2025-2026
 #
 # Authors: The OpenBioSim Team <team@openbiosim.org>
 #
@@ -20,11 +20,38 @@
 #####################################################################
 
 """
-GCMC CUDA kernels.
+GCMC CUDA/OpenCL kernels.
 """
 
 code = """
-    #include <curand_kernel.h>
+    // Platform-specific definitions for CUDA and OpenCL compatibility
+    #ifdef __OPENCL_VERSION__
+        #define KERNEL __kernel
+        #define DEVICE  // OpenCL: file-scope variables visible to all kernels
+        #define GLOBAL __global
+        #define LOCAL __local
+        #define GET_GLOBAL_ID(dim) get_global_id(dim)
+        #define BLOCK_ID_Y get_group_id(1)  // OpenCL: work-group ID in dimension 1
+        // Map CUDA-style function names to OpenCL names
+        #define sqrtf sqrt
+        #define powf pow
+        #define expf exp
+        #define cosf cos
+        #define sinf sin
+        #define rsqrtf rsqrt
+        #define floorf floor
+        #define fmaf fma
+        // OpenCL doesn't have sincosf, so define it
+        #define sincosf(x, sptr, cptr) do { *(sptr) = sinf(x); *(cptr) = cosf(x); } while(0)
+        #pragma OPENCL EXTENSION cl_khr_fp64 : enable
+    #else
+        #define KERNEL extern "C" __global__
+        #define DEVICE __device__  // CUDA: device memory (mutable)
+        #define GLOBAL
+        #define LOCAL __shared__
+        #define GET_GLOBAL_ID(dim) (threadIdx.x + blockIdx.x * blockDim.x)
+        #define BLOCK_ID_Y blockIdx.y  // CUDA: block index in y dimension
+    #endif
 
     // Constants.
     const float pi = 3.14159265359f;
@@ -32,68 +59,51 @@
     const int num_batch = %(NUM_BATCH)s;
     const int num_atoms = %(NUM_ATOMS)s;
     const int num_waters = %(NUM_WATERS)s;
-    const int num_water_atoms = 3*num_points;
+    const int num_water_positions = 3 * num_points;
     const float prefactor = 332.0637090025476f;
 
-    // Random number generator state for each water thread.
-    __device__ curandState_t* states[num_batch];
-
     // Reaction field parameters.
-    __device__ float rf_dielectric;
-    __device__ float rf_kappa;
-    __device__ float rf_cutoff;
-    __device__ float rf_correction;
+    DEVICE float rf_dielectric;
+    DEVICE float rf_kappa;
+    DEVICE float rf_cutoff;
+    DEVICE float rf_correction;
 
     // Soft-core parameters.
-    __device__ float coulomb_power;
-    __device__ float shift_coulomb;
-    __device__ float shift_delta;
+    DEVICE float coulomb_power;
+    DEVICE float shift_coulomb;
+    DEVICE float shift_delta;
 
     // Triclinic cell information.
-    __device__ float cell_matrix[3][3];
-    __device__ float cell_matrix_inverse[3][3];
-    __device__ float M[3][3];
+    DEVICE float cell_matrix[3][3];
+    DEVICE float cell_matrix_inverse[3][3];
+    DEVICE float M[3][3];
 
     // Atom properties.
-    __device__ float sigma[num_atoms];
-    __device__ float epsilon[num_atoms];
-    __device__ float charge[num_atoms];
-    __device__ float alpha[num_atoms];
-    __device__ float position[num_atoms * 3];
-    __device__ int is_ghost_water[num_atoms];
-    __device__ int is_ghost_fep[num_atoms];
+    DEVICE float sigma[num_atoms];
+    DEVICE float epsilon[num_atoms];
+    DEVICE float charge[num_atoms];
+    DEVICE float alpha[num_atoms];
+    DEVICE float position[num_atoms * 3];
+    DEVICE int is_ghost_water[num_atoms];
+    DEVICE int is_ghost_fep[num_atoms];
 
     // Water properties.
-    __device__ float sigma_water[num_points];
-    __device__ float epsilon_water[num_points];
-    __device__ float charge_water[num_points];
-    __device__ int water_idx[num_waters];
-    __device__ int water_state[num_waters];
+    DEVICE float sigma_water[num_points];
+    DEVICE float epsilon_water[num_points];
+    DEVICE float charge_water[num_points];
+    DEVICE int water_idx[num_waters];
+    DEVICE int water_state[num_waters];
 
+    #ifndef __OPENCL_VERSION__
     extern "C"
     {
-        // Initialisation of the random number generator state for each attempt thread.
-        __global__ void initialiseRNG(int* seed)
-        {
-            const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
-
-            if (tidx < num_batch)
-            {
-                curandState_t* s = new curandState_t;
-                if (s != 0)
-                {
-                    curand_init(seed[tidx], tidx, 0, s);
-                }
-
-                states[tidx] = s;
-            }
-        }
+    #endif
 
         // Intialisation of the cell information for periodic triclinic boxes.
-        __global__ void setCellMatrix(
-            float* matrix,
-            float* matrix_inverse,
-            float* m)
+        KERNEL void setCellMatrix(
+            GLOBAL float* matrix,
+            GLOBAL float* matrix_inverse,
+            GLOBAL float* m)
         {
             for (int i = 0; i < 3; i++)
             {
@@ -107,18 +117,18 @@
         }
 
         // Set the reaction field parameters.
-        __global__ void setReactionField(float cutoff, float dielectric)
+        KERNEL void setReactionField(float cutoff, float dielectric)
         {
             rf_dielectric = dielectric;
             rf_cutoff = cutoff;
-            const auto rf_cutoff2 = cutoff * cutoff;
-            const auto rf_cutoff3_inv = 1.0f / (rf_cutoff * rf_cutoff2);
+            const float rf_cutoff2 = cutoff * cutoff;
+            const float rf_cutoff3_inv = 1.0f / (rf_cutoff * rf_cutoff2);
             rf_kappa = rf_cutoff3_inv * (dielectric - 1.0f) / (2.0f * dielectric + 1.0f);
             rf_correction = (1.0 / rf_cutoff) + rf_kappa * rf_cutoff2;
         }
 
         // Set the soft-core parameters.
-        __global__ void setSoftCore(float power, float coulomb, float delta)
+        KERNEL void setSoftCore(float power, float coulomb, float delta)
         {
             coulomb_power = power;
             shift_coulomb = coulomb;
@@ -126,15 +136,15 @@
         }
 
         // Set the properties of each atom.
-        __global__ void setAtomProperties(
-            float* charges,
-            float* sigmas,
-            float* epsilons,
-            float* alphas,
-            int* ghost_water,
-            int* ghost_fep)
+        KERNEL void setAtomProperties(
+            GLOBAL float* charges,
+            GLOBAL float* sigmas,
+            GLOBAL float* epsilons,
+            GLOBAL float* alphas,
+            GLOBAL int* ghost_water,
+            GLOBAL int* ghost_fep)
         {
-            const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
+            const int tidx = GET_GLOBAL_ID(0);
 
             if (tidx < num_atoms)
             {
@@ -148,9 +158,9 @@
         }
 
         // Set the positions of each atom.
-        __global__ void setAtomPositions(float* positions, float scale=1.0)
+        KERNEL void setAtomPositions(GLOBAL float* positions, float scale)
         {
-            const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
+            const int tidx = GET_GLOBAL_ID(0);
 
             if (tidx < num_atoms)
             {
@@ -161,12 +171,12 @@
         }
 
         // Set the properties of each water atom.
-        __global__ void setWaterProperties(
-            float* charges,
-            float* sigmas,
-            float* epsilons,
-            int* idx,
-            int* state)
+        KERNEL void setWaterProperties(
+            GLOBAL float* charges,
+            GLOBAL float* sigmas,
+            GLOBAL float* epsilons,
+            GLOBAL int* idx,
+            GLOBAL int* state)
         {
             for (int i = 0; i < num_points; i++)
             {
@@ -183,7 +193,7 @@
         }
 
         // Update a single water.
-        __global__ void updateWater(int idx, int state, int is_insertion, float* new_position)
+        KERNEL void updateWater(int idx, int state, int is_insertion, GLOBAL float* new_position)
         {
             // Set the new state.
             water_state[idx] = state;
@@ -222,7 +232,7 @@
 
         // Calculate the delta that needs to be subtracted from the interatomic distance
         // so that the atoms are wrapped to the same periodic box.
-        __device__ void wrapDelta(float* v0, float* v1, float* delta_box)
+        DEVICE void wrapDelta(float* v0, float* v1, float* delta_box)
         {
             // Work out the positions of v0 and v1 in "box" space.
             float v0_box[3];
@@ -253,40 +263,10 @@
             float frac_y = delta_box[1] - int_y;
             float frac_z = delta_box[2] - int_z;
 
-            // Shift to the box.
-
-            // x
-
-            if (frac_x > 0.5f)
-            {
-                int_x += 1;
-            }
-            else if (frac_x < -0.5f)
-            {
-                int_x -= 1;
-            }
-
-            // y
-
-            if (frac_y > 0.5f)
-            {
-                int_y += 1;
-            }
-            else if (frac_y < -0.5f)
-            {
-                int_y -= 1;
-            }
-
-            // z
-
-            if (frac_z > 0.5f)
-            {
-                int_z += 1;
-            }
-            else if (frac_z < -0.5f)
-            {
-                int_z -= 1;
-            }
+            // Shift to the box (branchless).
+            int_x += (int)floorf(frac_x + 0.5f);
+            int_y += (int)floorf(frac_y + 0.5f);
+            int_z += (int)floorf(frac_z + 0.5f);
 
             // Calculate the shifts over the box vectors.
             delta_box[0] = 0.0f;
@@ -301,10 +281,10 @@
         }
 
         // Calculate the distance between two atoms within the periodic box.
-        __device__ void distance2(
+        DEVICE void distance2(
             float* v0,
             float* v1,
-            float& dist2)
+            float* dist2)
         {
             // Work out the positions of v0 and v1 in "box" space.
             float v0_box[3];
@@ -336,40 +316,10 @@
             float frac_y = delta_box[1] - int_y;
             float frac_z = delta_box[2] - int_z;
 
-            // Shift to the box.
-
-            // x
-
-            if (frac_x >= 0.5f)
-            {
-                frac_x -= 1.0f;
-            }
-            else if (frac_x <= -0.5f)
-            {
-                frac_x += 1.0f;
-            }
-
-            // y
-
-            if (frac_y >= 0.5f)
-            {
-                frac_y -= 1.0f;
-            }
-            else if (frac_y <= -0.5f)
-            {
-                frac_y += 1.0f;
-            }
-
-            // z
-
-            if (frac_z >= 0.5f)
-            {
-                frac_z -= 1.0f;
-            }
-            else if (frac_z <= -0.5f)
-            {
-                frac_z += 1.0f;
-            }
+            // Shift to the box (branchless).
+            frac_x -= floorf(frac_x + 0.5f);
+            frac_y -= floorf(frac_y + 0.5f);
+            frac_z -= floorf(frac_z + 0.5f);
 
             float frac_dist[3];
             frac_dist[0] = frac_x;
@@ -384,11 +334,11 @@
                     delta_box[i] += M[i][j] * frac_dist[j];
                 }
             }
-            dist2 = frac_x * delta_box[0] + frac_y * delta_box[1] + frac_z * delta_box[2];
+            *dist2 = frac_x * delta_box[0] + frac_y * delta_box[1] + frac_z * delta_box[2];
         }
 
         // Perform a random rotation about a unit sphere.
-        __device__ void uniform_random_rotation(float* v, float r0, float r1, float r2)
+        DEVICE void uniform_random_rotation(float* v, float r0, float r1, float r2)
         {
             /* Adapted from:
                 https://www.blopig.com/blog/2021/08/uniformly-sampled-3d-rotation-matrices/
@@ -401,15 +351,20 @@
             float x2 = 2.0f * pi * r0;
             float x3 = r1;
             float R[3][3];
-            R[0][0] = R[1][1] = cosf(2.0f * pi * r2);
-            R[0][1] = -sinf(2.0f * pi * r2);
-            R[1][0] = sinf(2.0f * pi * r2);
+            float sin_r2, cos_r2;
+            sincosf(2.0f * pi * r2, &sin_r2, &cos_r2);
+            R[0][0] = R[1][1] = cos_r2;
+            R[0][1] = -sin_r2;
+            R[1][0] = sin_r2;
             R[0][2] = R[1][2] = R[2][0] = R[2][1] = 0.0f;
             R[2][2] = 1.0f;
 
             // Now compute the Householder matrix H.
-            float v0 = cosf(x2) * sqrtf(x3);
-            float v1 = sinf(x2) * sqrtf(x3);
+            float sin_x2, cos_x2;
+            sincosf(x2, &sin_x2, &cos_x2);
+            float sqrt_x3 = sqrtf(x3);
+            float v0 = cos_x2 * sqrt_x3;
+            float v1 = sin_x2 * sqrt_x3;
             float v2 = sqrtf(1.0f - x3);
             float H[3][3];
             H[0][0] = 1.0f - 2.0f * v0 * v0;
@@ -440,7 +395,13 @@
             mean_coord[1] = (v[1] + v[4] + v[7]) / 3.0f;
             mean_coord[2] = (v[2] + v[5] + v[8]) / 3.0f;
 
-            // Now compute ((v - mean_coord) @ M) + mean_coord @ M.
+            // Precompute mean_coord @ M (avoids redundant calculations).
+            float mean_M[3];
+            mean_M[0] = fmaf(mean_coord[0], M[0][0], fmaf(mean_coord[1], M[1][0], mean_coord[2] * M[2][0]));
+            mean_M[1] = fmaf(mean_coord[0], M[0][1], fmaf(mean_coord[1], M[1][1], mean_coord[2] * M[2][1]));
+            mean_M[2] = fmaf(mean_coord[0], M[0][2], fmaf(mean_coord[1], M[1][2], mean_coord[2] * M[2][2]));
+
+            // Now compute ((v - mean_coord) @ M) + mean_M.
             float x[3][3];
             x[0][0] = v[0] - mean_coord[0];
             x[0][1] = v[1] - mean_coord[1];
@@ -452,47 +413,38 @@
             x[2][1] = v[7] - mean_coord[1];
             x[2][2] = v[8] - mean_coord[2];
 
-            // Compute the rotated coordinates.
-            v[0] = x[0][0] * M[0][0] + x[0][1] * M[1][0] + x[0][2] * M[2][0]
-                + mean_coord[0] * M[0][0] + mean_coord[1] * M[1][0] + mean_coord[2] * M[2][0];
-            v[1] = x[0][0] * M[0][1] + x[0][1] * M[1][1] + x[0][2] * M[2][1]
-                + mean_coord[0] * M[0][1] + mean_coord[1] * M[1][1] + mean_coord[2] * M[2][1];
-            v[2] = x[0][0] * M[0][2] + x[0][1] * M[1][2] + x[0][2] * M[2][2]
-                + mean_coord[0] * M[0][2] + mean_coord[1] * M[1][2] + mean_coord[2] * M[2][2];
-            v[3] = x[1][0] * M[0][0] + x[1][1] * M[1][0] + x[1][2] * M[2][0]
-                + mean_coord[0] * M[0][0] + mean_coord[1] * M[1][0] + mean_coord[2] * M[2][0];
-            v[4] = x[1][0] * M[0][1] + x[1][1] * M[1][1] + x[1][2] * M[2][1]
-                + mean_coord[0] * M[0][1] + mean_coord[1] * M[1][1] + mean_coord[2] * M[2][1];
-            v[5] = x[1][0] * M[0][2] + x[1][1] * M[1][2] + x[1][2] * M[2][2]
-                + mean_coord[0] * M[0][2] + mean_coord[1] * M[1][2] + mean_coord[2] * M[2][2];
-            v[6] = x[2][0] * M[0][0] + x[2][1] * M[1][0] + x[2][2] * M[2][0]
-                + mean_coord[0] * M[0][0] + mean_coord[1] * M[1][0] + mean_coord[2] * M[2][0];
-            v[7] = x[2][0] * M[0][1] + x[2][1] * M[1][1] + x[2][2] * M[2][1]
-                + mean_coord[0] * M[0][1] + mean_coord[1] * M[1][1] + mean_coord[2] * M[2][1];
-            v[8] = x[2][0] * M[0][2] + x[2][1] * M[1][2] + x[2][2] * M[2][2]
-                + mean_coord[0] * M[0][2] + mean_coord[1] * M[1][2] + mean_coord[2] * M[2][2];
+            // Compute the rotated coordinates using fma.
+            v[0] = fmaf(x[0][0], M[0][0], fmaf(x[0][1], M[1][0], fmaf(x[0][2], M[2][0], mean_M[0])));
+            v[1] = fmaf(x[0][0], M[0][1], fmaf(x[0][1], M[1][1], fmaf(x[0][2], M[2][1], mean_M[1])));
+            v[2] = fmaf(x[0][0], M[0][2], fmaf(x[0][1], M[1][2], fmaf(x[0][2], M[2][2], mean_M[2])));
+            v[3] = fmaf(x[1][0], M[0][0], fmaf(x[1][1], M[1][0], fmaf(x[1][2], M[2][0], mean_M[0])));
+            v[4] = fmaf(x[1][0], M[0][1], fmaf(x[1][1], M[1][1], fmaf(x[1][2], M[2][1], mean_M[1])));
+            v[5] = fmaf(x[1][0], M[0][2], fmaf(x[1][1], M[1][2], fmaf(x[1][2], M[2][2], mean_M[2])));
+            v[6] = fmaf(x[2][0], M[0][0], fmaf(x[2][1], M[1][0], fmaf(x[2][2], M[2][0], mean_M[0])));
+            v[7] = fmaf(x[2][0], M[0][1], fmaf(x[2][1], M[1][1], fmaf(x[2][2], M[2][1], mean_M[1])));
+            v[8] = fmaf(x[2][0], M[0][2], fmaf(x[2][1], M[1][2], fmaf(x[2][2], M[2][2], mean_M[2])));
         }
 
         // Generate a random position and orientation within the GCMC sphere
         // for each trial insertion.
-        __global__ void generateWater(
-            float* water_template,
-            float* target,
+        KERNEL void generateWater(
+            GLOBAL float* water_template,
+            GLOBAL float* target,
             float radius,
-            float* water_position,
-            int is_target)
+            GLOBAL float* water_position,
+            int is_target,
+            GLOBAL float* randoms_rotation,
+            GLOBAL float* randoms_position,
+            GLOBAL float* randoms_radius)
         {
             // Work out the thread index.
-            const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
+            const int tidx = GET_GLOBAL_ID(0);
 
             // Make sure we are within the number of waters.
             if (tidx < num_batch)
             {
-                // Get the RNG state.
-                curandState_t state = *states[tidx];
-
                 // Translate the oxygen atom to the origin.
-                float water[num_water_atoms];
+                float water[num_water_positions];
                 water[0] = 0.0f;
                 water[1] = 0.0f;
                 water[2] = 0.0f;
@@ -505,8 +457,11 @@
                     water[i*3 + 2] = water_template[i*3 + 2] - water_template[2];
                 }
 
-                // Rotate the water randomly.
-                uniform_random_rotation(water, curand_uniform(&state), curand_uniform(&state), curand_uniform(&state));
+                // Rotate the water randomly using pre-generated randoms.
+                uniform_random_rotation(water,
+                    randoms_rotation[tidx * 3],
+                    randoms_rotation[tidx * 3 + 1],
+                    randoms_rotation[tidx * 3 + 2]);
 
                 // Calculate the distance between the oxygen and the hydrogens.
                 float dh[num_points][3];
@@ -522,16 +477,16 @@
                 // Choose a random position within the GCMC sphere.
                 if (is_target == 1)
                 {
-                    // Generate a random position around the target.
-                    xyz[0] = curand_normal(&state);
-                    xyz[1] = curand_normal(&state);
-                    xyz[2] = curand_normal(&state);
+                    // Generate a random position around the target using pre-generated normals.
+                    xyz[0] = randoms_position[tidx * 3];
+                    xyz[1] = randoms_position[tidx * 3 + 1];
+                    xyz[2] = randoms_position[tidx * 3 + 2];
 
                     float norm = sqrtf(xyz[0] * xyz[0] + xyz[1] * xyz[1] + xyz[2] * xyz[2]);
                     xyz[0] /= norm;
                     xyz[1] /= norm;
                     xyz[2] /= norm;
-                    float r = radius * powf(curand_uniform(&state), 1.0f / 3.0f);
+                    float r = radius * powf(randoms_radius[tidx], 1.0f / 3.0f);
                     xyz[0] = target[0] + r * xyz[0];
                     xyz[1] = target[1] + r * xyz[1];
                     xyz[2] = target[2] + r * xyz[2];
@@ -539,10 +494,11 @@
                 // Choose a random position within the triclinic box.
                 else
                 {
+                    // Use pre-generated uniform randoms for bulk sampling.
                     float r[3];
-                    r[0] = curand_uniform(&state);
-                    r[1] = curand_uniform(&state);
-                    r[2] = curand_uniform(&state);
+                    r[0] = randoms_position[tidx * 3];
+                    r[1] = randoms_position[tidx * 3 + 1];
+                    r[2] = randoms_position[tidx * 3 + 2];
 
                     for (int i = 0; i < 3; i++)
                     {
@@ -555,44 +511,41 @@
                 }
 
                 // Place the oxygen (first atom) at the random position.
-                water_position[tidx * num_water_atoms] = xyz[0];
-                water_position[tidx * num_water_atoms + 1] = xyz[1];
-                water_position[tidx * num_water_atoms + 2] = xyz[2];
+                water_position[tidx * num_water_positions] = xyz[0];
+                water_position[tidx * num_water_positions + 1] = xyz[1];
+                water_position[tidx * num_water_positions + 2] = xyz[2];
 
                 // Shift the hydrogens by the appropriate amount.
                 for (int i = 0; i < num_points-1; i++)
                 {
-                    water_position[tidx * num_water_atoms + 3 + i*3] = xyz[0] + dh[i][0];
-                    water_position[tidx * num_water_atoms + 4 + i*3] = xyz[1] + dh[i][1];
-                    water_position[tidx * num_water_atoms + 5 + i*3] = xyz[2] + dh[i][2];
+                    water_position[tidx * num_water_positions + 3 + i*3] = xyz[0] + dh[i][0];
+                    water_position[tidx * num_water_positions + 4 + i*3] = xyz[1] + dh[i][1];
+                    water_position[tidx * num_water_positions + 5 + i*3] = xyz[2] + dh[i][2];
                 }
-
-                // Set the new state.
-                *states[tidx] = state;
             }
         }
 
         // Compute the Lennard-Jones and reaction field Coulomb energy between
         // the water and the atoms.
-        __global__ void computeEnergy(
-            float* water_position,
-            float* energy_coul,
-            float* energy_lj,
-            int* deletion_candidates,
-            int* is_deletion,
+        KERNEL void computeEnergy(
+            GLOBAL float* water_position,
+            GLOBAL float* energy_coul,
+            GLOBAL float* energy_lj,
+            GLOBAL int* deletion_candidates,
+            GLOBAL int* is_deletion,
             int is_fep)
         {
             // Work out the atom index.
-            const int idx_atom = threadIdx.x + blockDim.x * blockIdx.x;
+            const int idx_atom = GET_GLOBAL_ID(0);
 
             // Make sure we're in bounds.
             if (idx_atom < num_atoms)
             {
                 // Store the squared cut-off distance.
-                const auto cutoff2 = rf_cutoff * rf_cutoff;
+                const float cutoff2 = rf_cutoff * rf_cutoff;
 
                 // Work out the water index.
-                const int idx_water = blockIdx.y;
+                const int idx_water = BLOCK_ID_Y;
 
                 // Work out the index for the result.
                 const int idx = (idx_water * num_atoms) + idx_atom;
@@ -607,13 +560,13 @@
                     for (int i = 0; i < num_points; i++)
                     {
                         // Self interaction.
-                        const auto q1 = charge_water[i];
+                        const float q1 = charge_water[i];
                         energy_coul[idx] -= 0.5f * (q1 * q1) * rf_correction;
 
                         // Pair interaction.
                         for (int j = i+1; j < num_points; j++)
                         {
-                            const auto q2 = charge_water[j];
+                            const float q2 = charge_water[j];
                             energy_coul[idx] -= (q1 * q2) * rf_correction;
                         }
                     }
@@ -623,10 +576,10 @@
                 if (is_deletion[idx_water] == 1)
                 {
                     const int idx_water_context = water_idx[deletion_candidates[idx_water]];
-                    const auto delta = idx_atom - idx_water_context;
+                    const float delta = idx_atom - idx_water_context;
 
                     // Don't compute self-interactions.
-                    if (delta >= 0 and delta < num_points)
+                    if (delta >= 0 && delta < num_points)
                     {
                         return;
                     }
@@ -656,7 +609,7 @@
                 v0[2] = position[3 * idx_atom + 2];
 
                 // Store the charge on the atom.
-                auto q0 = charge[idx_atom];
+                float q0 = charge[idx_atom];
 
                 // Store the epsilon and sigma for the atom.
                 float s0 = sigma[idx_atom];
@@ -676,20 +629,20 @@
                     }
                     else
                     {
-                        v1[0] = water_position[3 * num_points * idx_water + 3 * i];
-                        v1[1] = water_position[3 * num_points * idx_water + 3 * i + 1];
-                        v1[2] = water_position[3 * num_points * idx_water + 3 * i + 2];
+                        v1[0] = water_position[num_water_positions * idx_water + 3 * i];
+                        v1[1] = water_position[num_water_positions * idx_water + 3 * i + 1];
+                        v1[2] = water_position[num_water_positions * idx_water + 3 * i + 2];
                     }
 
                     // Calculate the squared distance between the atoms.
                     float r2;
-                    distance2(v0, v1, r2);
+                    distance2(v0, v1, &r2);
 
                     // The distance is within the cut-off.
                     if (r2 < cutoff2)
                     {
                         // Don't divide by zero.
-                        if (not is_fep and r2 < 1e-6)
+                        if (!is_fep && r2 < 1e-6)
                         {
                             energy_coul[idx] = 1e6;
                             energy_lj[idx] = 1e6;
@@ -698,38 +651,38 @@
                         else
                         {
                             // Regular non-bonded forces.
-                            if (not is_ghost_atom)
+                            if (!is_ghost_atom)
                             {
                                 // Compute the LJ interaction.
-                                auto s1 = sigma_water[i];
-                                const auto e1 = epsilon_water[i];
-                                const auto s = 0.5f * (s0 + s1);
-                                const auto e = sqrtf(e0 * e1);
-                                const auto s2 = s * s;
-                                const auto sr2 = s2 / r2;
-                                const auto sr6 = sr2 * sr2 * sr2;
+                                float s1 = sigma_water[i];
+                                const float e1 = epsilon_water[i];
+                                const float s = 0.5f * (s0 + s1);
+                                const float e = sqrtf(e0 * e1);
+                                const float s2 = s * s;
+                                const float sr2 = s2 / r2;
+                                const float sr6 = sr2 * sr2 * sr2;
                                 energy_lj[idx] += 4.0f * e * sr6 * (sr6 - 1.0f);
 
-                                // Compute the distance between the atoms.
-                                const auto r = sqrtf(r2);
+                                // Compute reciprocal distance (faster than sqrtf).
+                                const float r_inv = rsqrtf(r2);
 
                                 // Store the charge on the water atom.
-                                const auto q1 = charge_water[i];
+                                const float q1 = charge_water[i];
 
                                 // Add the reaction field pair energy.
-                                energy_coul[idx] += (q0 * q1) * ((1.0f / r) + (rf_kappa * r2) - rf_correction);
+                                energy_coul[idx] += (q0 * q1) * (r_inv + (rf_kappa * r2) - rf_correction);
                             }
 
                             // Zacharias soft-core potential.
                             else
                             {
                                 // Store required parameters.
-                                const auto q1 = charge_water[i];
-                                const auto s1 = sigma_water[i];
-                                const auto e1 = epsilon_water[i];
-                                const auto s = 0.5f * (s0 + s1);
-                                const auto e = sqrtf(e0 * e1);
-                                const auto a = alpha[idx_atom];
+                                const float q1 = charge_water[i];
+                                const float s1 = sigma_water[i];
+                                const float e1 = epsilon_water[i];
+                                const float s = 0.5f * (s0 + s1);
+                                const float e = sqrtf(e0 * e1);
+                                const float a = alpha[idx_atom];
 
                                 // Compute the distance between the atoms.
                                 float r = sqrtf(r2);
@@ -741,8 +694,8 @@
                                 }
 
                                 // Compute the Lennard-Jones interaction.
-                                const auto delta_lj = shift_delta * a;
-                                const auto s6 = powf(s, 6.0) / powf((s * delta_lj) + (r * r), 3.0);
+                                const float delta_lj = shift_delta * a;
+                                const float s6 = powf(s, 6.0f) / powf((s * delta_lj) + (r * r), 3.0f);
                                 energy_lj[idx] += 4.0f * e * s6 * (s6 - 1.0f);
 
                                 // Compute the Coulomb power expression.
@@ -769,20 +722,21 @@
         }
 
         // Calculate whether each attempt is accepted.
-        __global__ void checkAcceptance(
+        KERNEL void checkAcceptance(
             int N,
             float exp_B,
             float exp_minus_B,
             float beta,
-            int* is_deletion,
-            float* energy_coul,
-            float* energy_lj,
-            float* energy_change,
-            float* probability,
-            int* accepted,
-            float tolerance)
+            GLOBAL int* is_deletion,
+            GLOBAL float* energy_coul,
+            GLOBAL float* energy_lj,
+            GLOBAL float* energy_change,
+            GLOBAL float* probability,
+            GLOBAL int* accepted,
+            float tolerance,
+            GLOBAL float* randoms_acceptance)
         {
-            const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
+            const int tidx = GET_GLOBAL_ID(0);
 
             if (tidx < num_batch)
             {
@@ -825,13 +779,10 @@
                 // Store the probability.
                 probability[tidx] = prob;
 
-                // Get the RNG state.
-                curandState_t state = *states[tidx];
-
-                // Accept or reject based on the Boltzmann weight. A tolerance
-                // can be used to reject low probability states that can cause
+                // Accept or reject based on the Boltzmann weight using pre-generated random.
+                // A tolerance can be used to reject low probability states that can cause
                 // instabilities and/or crashes in the MD engine.
-                if (prob > tolerance and curand_uniform(&state) < prob)
+                if (prob > tolerance && randoms_acceptance[tidx] < prob)
                 {
                     accepted[tidx] = 1;
                 }
@@ -839,19 +790,16 @@
                 {
                     accepted[tidx] = 0;
                 }
-
-                // Set the new state.
-                *states[tidx] = state;
             }
         }
 
         // Find candidate waters for deletion.
-        __global__ void findDeletionCandidates(
-            int* candidates,
-            float* target,
+        KERNEL void findDeletionCandidates(
+            GLOBAL int* candidates,
+            GLOBAL float* target,
             float radius)
         {
-            const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
+            const int tidx = GET_GLOBAL_ID(0);
 
             if (tidx < num_waters)
             {
@@ -872,7 +820,7 @@
 
                     // Calculate the distance between the water and the target.
                     float r2;
-                    distance2(v, target, r2);
+                    distance2(v, target, &r2);
 
                     // The water is within the GCMC sphere. Flag it as a candidate.
                     if (r2 < radius * radius)
@@ -882,5 +830,8 @@
                 }
             }
         }
+
+    #ifndef __OPENCL_VERSION__
     }
+    #endif
 """
diff --git a/src/loch/_platforms/__init__.py b/src/loch/_platforms/__init__.py
new file mode 100644
index 0000000..a720502
--- /dev/null
+++ b/src/loch/_platforms/__init__.py
@@ -0,0 +1,163 @@
+######################################################################
+# Loch: GPU accelerated GCMC water sampling engine.
+#
+# Copyright: 2025-2026
+#
+# Authors: The OpenBioSim Team <team@openbiosim.org>
+#
+# Loch is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Loch is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Loch. If not, see <http://www.gnu.org/licenses/>.
+#####################################################################
+
+"""
+GPU platform abstraction for GCMC sampling.
+
+This module provides a factory interface for creating GPU platform backends
+that abstract the differences between CUDA and OpenCL.
+"""
+
+from typing import Optional as _Optional
+
+from ._base import PlatformBackend as _PlatformBackend
+
+
+def detect_platform() -> str:
+    """
+    Auto-detect available GPU platform.
+
+    Tries CUDA first (preferred), then falls back to OpenCL if CUDA is not available.
+
+    Returns
+    -------
+    str
+        Platform name ('cuda' or 'opencl').
+
+    Raises
+    ------
+    RuntimeError
+        If no CUDA or OpenCL GPU devices are found.
+    """
+    # Try CUDA first
+    try:
+        import pycuda.driver as cuda
+
+        cuda.init()
+        if cuda.Device.count() > 0:
+            return "cuda"
+    except Exception:
+        pass
+
+    # Fall back to OpenCL
+    try:
+        import pyopencl as cl
+
+        platforms = cl.get_platforms()
+        for platform in platforms:
+            devices = platform.get_devices(device_type=cl.device_type.GPU)
+            if devices:
+                return "opencl"
+    except Exception:
+        pass
+
+    # No GPU found
+    raise RuntimeError(
+        "No CUDA or OpenCL GPU devices found. "
+        "Please ensure PyCUDA or PyOpenCL is installed and a GPU is available."
+    )
+
+
+def create_backend(
+    platform: str,
+    device: _Optional[int],
+    num_points: int,
+    num_batch: int,
+    num_waters: int,
+    num_atoms: int,
+    num_threads: int,
+    nvcc: _Optional[str] = None,
+    compiler_optimisations: bool = True,
+) -> _PlatformBackend:
+    """
+    Create a GPU platform backend instance.
+
+    Parameters
+    ----------
+    platform : str
+        Platform to use ('auto', 'cuda', or 'opencl').
+        If 'auto', CUDA will be tried first, falling back to OpenCL.
+
+    device : int
+        GPU device index.
+
+    num_points : int
+        Number of atoms per water molecule (typically 3).
+
+    num_batch : int
+        Batch size for parallel trials.
+
+    num_waters : int
+        Number of ghost water molecules.
+
+    num_atoms : int
+        Total number of atoms in the system.
+
+    num_threads : int
+        Number of threads per block/work-group.
+
+    nvcc : str, optional
+        Path to NVCC compiler (CUDA only).
+
+    compiler_optimisations : bool, optional
+        Enable compiler optimisations for faster math operations.
+        Default: True (matches OpenMM defaults).
+
+    Returns
+    -------
+    PlatformBackend
+        Initialized platform backend (CUDAPlatform or OpenCLPlatform).
+    """
+    # Detect platform if auto
+    if platform == "auto":
+        platform = detect_platform()
+
+    # Create appropriate backend
+    if platform == "cuda":
+        from ._cuda import CUDAPlatform as _CUDAPlatform
+
+        return _CUDAPlatform(
+            device,
+            num_points,
+            num_batch,
+            num_waters,
+            num_atoms,
+            num_threads,
+            nvcc,
+            compiler_optimisations,
+        )
+    elif platform == "opencl":
+        from ._opencl import OpenCLPlatform as _OpenCLPlatform
+
+        return _OpenCLPlatform(
+            device,
+            num_points,
+            num_batch,
+            num_waters,
+            num_atoms,
+            num_threads,
+            nvcc,
+            compiler_optimisations,
+        )
+    else:
+        raise ValueError(
+            f"Unknown platform '{platform}'. Must be 'auto', 'cuda', or 'opencl'."
+        )
diff --git a/src/loch/_platforms/_base.py b/src/loch/_platforms/_base.py
new file mode 100644
index 0000000..f192d76
--- /dev/null
+++ b/src/loch/_platforms/_base.py
@@ -0,0 +1,207 @@
+######################################################################
+# Loch: GPU accelerated GCMC water sampling engine.
+#
+# Copyright: 2025-2026
+#
+# Authors: The OpenBioSim Team <team@openbiosim.org>
+#
+# Loch is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Loch is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Loch. If not, see <http://www.gnu.org/licenses/>.
+#####################################################################
+
+"""
+Abstract base class for GPU platform backends.
+"""
+
+from abc import ABC as _ABC, abstractmethod as _abstractmethod
+from typing import Any as _Any, Callable as _Callable, Dict as _Dict
+
+import numpy as _np
+
+
+class PlatformBackend(_ABC):
+    """
+    Abstract base class for GPU platform backends.
+
+    This class defines the interface that all GPU platform implementations
+    (CUDA, OpenCL, etc.) must implement. It abstracts platform-specific
+    details like kernel compilation, memory management, and context handling.
+    """
+
+    @_abstractmethod
+    def __init__(
+        self,
+        device,
+        num_points,
+        num_batch,
+        num_waters,
+        num_atoms,
+        num_threads,
+        nvcc=None,
+        compiler_optimisations=True,
+    ):
+        """
+        Initialize the platform backend.
+
+        Parameters
+        ----------
+        device : int
+            The GPU device index to use.
+
+        num_points : int
+            Number of atoms per water molecule (typically 3).
+
+        num_batch : int
+            Number of parallel GCMC trials per batch.
+
+        num_waters : int
+            Number of ghost water molecules.
+
+        num_atoms : int
+            Total number of atoms in the system.
+
+        num_threads : int
+            Number of threads per block (CUDA) or work-group size (OpenCL).
+
+        nvcc : str, optional
+            Path to NVCC compiler (CUDA only). If None, uses default.
+
+        compiler_optimisations : bool, optional
+            Enable compiler optimisations for faster math operations.
+            Default: True (matches OpenMM defaults).
+        """
+        pass
+
+    @_abstractmethod
+    def compile_kernels(self) -> _Dict[str, _Callable]:
+        """
+        Compile GPU kernels and return callable functions.
+
+        Returns
+        -------
+        dict
+            Dictionary mapping kernel names to callable kernel functions.
+            Expected keys: 'cell', 'rf', 'softcore', 'atom_properties',
+            'atom_positions', 'water_properties', 'update_water', 'deletion',
+            'water', 'energy', 'acceptance'.
+        """
+        pass
+
+    @_abstractmethod
+    def to_gpu(self, array: _np.ndarray) -> _Any:
+        """
+        Transfer a NumPy array to GPU memory.
+
+        Parameters
+        ----------
+        array : numpy.ndarray
+            Array to transfer to GPU.
+
+        Returns
+        -------
+        GPU buffer
+            Platform-specific GPU buffer object.
+        """
+        pass
+
+    @_abstractmethod
+    def empty(self, shape, dtype) -> _Any:
+        """
+        Allocate an empty GPU buffer.
+
+        Parameters
+        ----------
+        shape : tuple
+            Shape of the array to allocate.
+
+        dtype : numpy.dtype
+            Data type of the array.
+
+        Returns
+        -------
+        GPU buffer
+            Platform-specific GPU buffer object.
+        """
+        pass
+
+    @_abstractmethod
+    def from_gpu(self, buffer: _Any) -> _np.ndarray:
+        """
+        Transfer data from GPU memory to host NumPy array.
+
+        Parameters
+        ----------
+        buffer : GPU buffer
+            Platform-specific GPU buffer to transfer from.
+
+        Returns
+        -------
+        numpy.ndarray
+            Array containing the data from GPU.
+        """
+        pass
+
+    @_abstractmethod
+    def push_context(self):
+        """
+        Push GPU context onto the context stack.
+
+        For CUDA, this pushes the context onto the driver stack.
+        For OpenCL, this is a no-op as OpenCL doesn't use context stacking.
+        """
+        pass
+
+    @_abstractmethod
+    def pop_context(self):
+        """
+        Pop GPU context from the context stack.
+
+        For CUDA, this pops the context from the driver stack.
+        For OpenCL, this is a no-op as OpenCL doesn't use context stacking.
+        """
+        pass
+
+    @_abstractmethod
+    def cleanup(self):
+        """
+        Clean up GPU resources and detach context.
+
+        This method should release all GPU memory and detach the context.
+        It is called during shutdown to ensure proper resource cleanup.
+        """
+        pass
+
+    @property
+    @_abstractmethod
+    def platform_name(self) -> str:
+        """
+        Get the name of the platform backend.
+
+        Returns
+        -------
+        str
+            Platform name ('cuda' or 'opencl').
+        """
+        pass
+
+    @property
+    def compiler_log(self) -> str:
+        """
+        Get the kernel compiler log including any warnings.
+
+        Returns
+        -------
+        str
+            Compiler log output, or empty string if no warnings/messages.
+        """
+        return getattr(self, "_compiler_log", "")
diff --git a/src/loch/_platforms/_cuda.py b/src/loch/_platforms/_cuda.py
new file mode 100644
index 0000000..e7e6ab7
--- /dev/null
+++ b/src/loch/_platforms/_cuda.py
@@ -0,0 +1,277 @@
+######################################################################
+# Loch: GPU accelerated GCMC water sampling engine.
+#
+# Copyright: 2025-2026
+#
+# Authors: The OpenBioSim Team <team@openbiosim.org>
+#
+# Loch is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Loch is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Loch. If not, see <http://www.gnu.org/licenses/>.
+#####################################################################
+
+"""
+CUDA platform backend implementation.
+"""
+
+import atexit as _atexit
+import io as _io
+import sys as _sys
+from typing import Any as _Any, Callable as _Callable, Dict as _Dict
+
+import numpy as _np
+import pycuda.driver as _cuda
+import pycuda.gpuarray as _gpuarray
+from pycuda.compiler import SourceModule as _SourceModule
+
+from .._kernels import code as _kernel_code
+from ._base import PlatformBackend as _PlatformBackend
+
+
+class CUDAPlatform(_PlatformBackend):
+    """
+    CUDA platform backend using PyCUDA.
+
+    This backend wraps PyCUDA functionality to provide GPU-accelerated
+    GCMC sampling on NVIDIA GPUs.
+    """
+
+    def __init__(
+        self,
+        device,
+        num_points,
+        num_batch,
+        num_waters,
+        num_atoms,
+        num_threads,
+        nvcc=None,
+        compiler_optimisations=True,
+    ):
+        """
+        Initialize the CUDA platform backend.
+
+        Parameters
+        ----------
+        device : int
+            The CUDA device index to use.
+
+        num_points : int
+            Number of atoms per water molecule (typically 3).
+
+        num_batch : int
+            Number of parallel GCMC trials per batch.
+
+        num_waters : int
+            Number of ghost water molecules.
+
+        num_atoms : int
+            Total number of atoms in the system.
+
+        num_threads : int
+            Number of threads per block.
+
+        nvcc : str, optional
+            Path to NVCC compiler. If None, uses default from PATH.
+
+        compiler_optimisations : bool, optional
+            Enable compiler optimisations for faster math operations.
+            When True, passes --use_fast_math to nvcc.
+            Default: True (matches OpenMM defaults).
+        """
+        from pycuda.tools import make_default_context
+
+        # Initialize CUDA driver
+        _cuda.init()
+
+        # Validate and set device
+        if device is not None:
+            if not isinstance(device, int):
+                raise ValueError("'device' must be of type 'int'")
+            if device < 0 or device >= _cuda.Device.count():
+                raise ValueError(
+                    f"'device' must be between 0 and {_cuda.Device.count() - 1}"
+                )
+            self._pycuda_context = _cuda.Device(device).make_context()
+        else:
+            self._pycuda_context = make_default_context()
+
+        self._device = self._pycuda_context.get_device()
+
+        # Store parameters
+        self._num_points = num_points
+        self._num_batch = num_batch
+        self._num_waters = num_waters
+        self._num_atoms = num_atoms
+        self._num_threads = num_threads
+        self._nvcc = nvcc
+        self._compiler_optimisations = compiler_optimisations
+
+        # Register cleanup
+        _atexit.register(self._cleanup_wrapper)
+
+    def compile_kernels(self) -> _Dict[str, _Callable]:
+        """
+        Compile CUDA kernels and return callable functions.
+
+        Returns
+        -------
+        dict
+            Dictionary mapping kernel names to callable kernel functions.
+        """
+        # Compile kernel module with template substitution.
+        # Suppress stderr but capture it for error reporting.
+        stderr_capture = _io.StringIO()
+        old_stderr = _sys.stderr
+
+        # Build compiler options
+        options = []
+        if self._compiler_optimisations:
+            options.append("--use_fast_math")
+
+        try:
+            _sys.stderr = stderr_capture
+            mod = _SourceModule(
+                _kernel_code
+                % {
+                    "NUM_POINTS": self._num_points,
+                    "NUM_BATCH": self._num_batch,
+                    "NUM_WATERS": self._num_waters,
+                    "NUM_ATOMS": self._num_atoms,
+                },
+                no_extern_c=True,
+                nvcc=self._nvcc,
+                options=options,
+            )
+        except Exception as e:
+            stderr_output = stderr_capture.getvalue().strip()
+            error_msg = f"CUDA kernel compilation failed: {e}"
+            if stderr_output:
+                error_msg += f"\n{stderr_output}"
+            raise RuntimeError(error_msg)
+        finally:
+            _sys.stderr = old_stderr
+
+        # Store any compiler warnings.
+        self._compiler_log = stderr_capture.getvalue().strip()
+
+        # Extract kernel functions
+        kernels = {
+            "cell": mod.get_function("setCellMatrix"),
+            "rf": mod.get_function("setReactionField"),
+            "softcore": mod.get_function("setSoftCore"),
+            "atom_properties": mod.get_function("setAtomProperties"),
+            "atom_positions": mod.get_function("setAtomPositions"),
+            "water_properties": mod.get_function("setWaterProperties"),
+            "update_water": mod.get_function("updateWater"),
+            "deletion": mod.get_function("findDeletionCandidates"),
+            "water": mod.get_function("generateWater"),
+            "energy": mod.get_function("computeEnergy"),
+            "acceptance": mod.get_function("checkAcceptance"),
+        }
+
+        return kernels
+
+    def to_gpu(self, array: _np.ndarray) -> _Any:
+        """
+        Transfer a NumPy array to GPU memory.
+
+        Parameters
+        ----------
+        array : numpy.ndarray
+            Array to transfer to GPU.
+
+        Returns
+        -------
+        pycuda.gpuarray.GPUArray
+            GPU array containing the data.
+        """
+        return _gpuarray.to_gpu(array)
+
+    def empty(self, shape, dtype) -> _Any:
+        """
+        Allocate an empty GPU buffer.
+
+        Parameters
+        ----------
+        shape : tuple
+            Shape of the array to allocate.
+
+        dtype : numpy.dtype
+            Data type of the array.
+
+        Returns
+        -------
+        pycuda.gpuarray.GPUArray
+            Allocated GPU array.
+        """
+        return _gpuarray.empty(shape, dtype)
+
+    def from_gpu(self, buffer: _Any) -> _np.ndarray:
+        """
+        Transfer data from GPU memory to host NumPy array.
+
+        Parameters
+        ----------
+        buffer : pycuda.gpuarray.GPUArray
+            GPU array to transfer from.
+
+        Returns
+        -------
+        numpy.ndarray
+            Array containing the data from GPU.
+        """
+        return buffer.get()
+
+    def push_context(self):
+        """
+        Push the CUDA context onto the context stack.
+        """
+        self._pycuda_context.push()
+
+    def pop_context(self):
+        """
+        Pop the CUDA context from the context stack.
+        """
+        self._pycuda_context.pop()
+
+    def cleanup(self):
+        """
+        Clean up CUDA resources and detach context.
+        """
+        try:
+            self.pop_context()
+        except Exception:
+            pass
+        self._pycuda_context.detach()
+        self._pycuda_context = None
+
+    def _cleanup_wrapper(self):
+        """
+        Wrapper for cleanup to handle atexit registration.
+        """
+        try:
+            if self._pycuda_context is not None:
+                self.cleanup()
+        except Exception:
+            pass
+
+    @property
+    def platform_name(self) -> str:
+        """
+        Get the name of the platform backend.
+
+        Returns
+        -------
+        str
+            Platform name ('cuda').
+        """
+        return "cuda"
diff --git a/src/loch/_platforms/_opencl.py b/src/loch/_platforms/_opencl.py
new file mode 100644
index 0000000..a4fe550
--- /dev/null
+++ b/src/loch/_platforms/_opencl.py
@@ -0,0 +1,298 @@
+######################################################################
+# Loch: GPU accelerated GCMC water sampling engine.
+#
+# Copyright: 2025-2026
+#
+# Authors: The OpenBioSim Team <team@openbiosim.org>
+#
+# Loch is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Loch is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Loch. If not, see <http://www.gnu.org/licenses/>.
+#####################################################################
+
+"""
+OpenCL platform backend implementation.
+"""
+
+import io as _io
+import sys as _sys
+import warnings as _warnings
+from typing import Any as _Any, Callable as _Callable, Dict as _Dict
+
+import numpy as _np
+import pyopencl as _cl
+import pyopencl.array as _cl_array
+
+from .._kernels import code as _kernel_code
+from ._base import PlatformBackend as _PlatformBackend
+
+
+class OpenCLPlatform(_PlatformBackend):
+    """
+    OpenCL platform backend using PyOpenCL.
+
+    This backend wraps PyOpenCL functionality to provide GPU-accelerated
+    GCMC sampling on various GPU vendors (Intel, AMD, NVIDIA).
+    """
+
+    def __init__(
+        self,
+        device,
+        num_points,
+        num_batch,
+        num_waters,
+        num_atoms,
+        num_threads,
+        nvcc=None,
+        compiler_optimisations=True,
+    ):
+        """
+        Initialize the OpenCL platform backend.
+
+        Parameters
+        ----------
+        device : int
+            The OpenCL device index to use.
+
+        num_points : int
+            Number of atoms per water molecule (typically 3).
+
+        num_batch : int
+            Number of parallel GCMC trials per batch.
+
+        num_waters : int
+            Number of ghost water molecules.
+
+        num_atoms : int
+            Total number of atoms in the system.
+
+        num_threads : int
+            Work-group size (threads per work-group).
+
+        nvcc : str, optional
+            Ignored for OpenCL (included for API compatibility).
+
+        compiler_optimisations : bool, optional
+            Enable compiler optimisations for faster math operations.
+            When True, passes -cl-mad-enable and -cl-no-signed-zeros to the compiler.
+            Default: True (matches OpenMM defaults).
+        """
+        # Get platforms and devices
+        platforms = _cl.get_platforms()
+        devices = []
+        for platform in platforms:
+            devices.extend(platform.get_devices(device_type=_cl.device_type.GPU))
+
+        if not devices:
+            raise RuntimeError("No OpenCL GPU devices found")
+
+        # Validate device index
+        if device is not None:
+            if not isinstance(device, int):
+                raise ValueError("'device' must be of type 'int'")
+            if device < 0 or device >= len(devices):
+                raise ValueError(f"'device' must be between 0 and {len(devices) - 1}")
+            self._device = devices[device]
+        else:
+            self._device = devices[0]
+
+        # Create context and command queue
+        self._context = _cl.Context([self._device])
+        self._queue = _cl.CommandQueue(self._context)
+
+        # Store parameters
+        self._num_points = num_points
+        self._num_batch = num_batch
+        self._num_waters = num_waters
+        self._num_atoms = num_atoms
+        self._num_threads = num_threads
+        self._compiler_optimisations = compiler_optimisations
+
+    def compile_kernels(self) -> _Dict[str, _Callable]:
+        """
+        Compile OpenCL kernels and return callable functions.
+
+        Returns
+        -------
+        dict
+            Dictionary mapping kernel names to callable kernel functions.
+        """
+        # Substitute template parameters
+        kernel_source = _kernel_code % {
+            "NUM_POINTS": self._num_points,
+            "NUM_BATCH": self._num_batch,
+            "NUM_WATERS": self._num_waters,
+            "NUM_ATOMS": self._num_atoms,
+        }
+
+        # Build compiler options
+        build_options = []
+        if self._compiler_optimisations:
+            build_options.extend(["-cl-mad-enable", "-cl-no-signed-zeros"])
+
+        # Compile program, suppressing stderr and warnings but capturing for errors.
+        stderr_capture = _io.StringIO()
+        old_stderr = _sys.stderr
+        try:
+            _sys.stderr = stderr_capture
+            with _warnings.catch_warnings():
+                _warnings.simplefilter("ignore")
+                program = _cl.Program(self._context, kernel_source).build(
+                    options=build_options
+                )
+        except _cl.RuntimeError as e:
+            stderr_output = stderr_capture.getvalue().strip()
+            error_msg = f"OpenCL kernel compilation failed: {e}"
+            if stderr_output:
+                error_msg += f"\n{stderr_output}"
+            raise RuntimeError(error_msg)
+        finally:
+            _sys.stderr = old_stderr
+
+        # Capture the compiler log (including any warnings).
+        self._compiler_log = program.get_build_info(
+            self._device, _cl.program_build_info.LOG
+        ).strip()
+
+        # Create kernel wrappers that match PyCUDA calling convention
+        # OpenCL kernels need (queue, global_size, local_size, *args)
+        # We'll wrap them to match CUDA's (args..., block, grid) signature
+        def make_kernel_wrapper(kernel):
+            def wrapper(*args, **kwargs):
+                # Extract block and grid from kwargs
+                block = kwargs.get("block", (self._num_threads, 1, 1))
+                grid = kwargs.get("grid", (1, 1, 1))
+
+                # Calculate global work size
+                global_size = tuple(b * g for b, g in zip(block, grid))
+                local_size = block
+
+                # Convert PyOpenCL array objects to their .data buffers
+                processed_args = []
+                for arg in args:
+                    if isinstance(arg, _cl_array.Array):
+                        processed_args.append(arg.data)
+                    else:
+                        processed_args.append(arg)
+
+                # Execute kernel
+                kernel(self._queue, global_size, local_size, *processed_args)
+                self._queue.finish()
+
+            return wrapper
+
+        # Extract and wrap kernel functions
+        kernels = {
+            "cell": make_kernel_wrapper(program.setCellMatrix),
+            "rf": make_kernel_wrapper(program.setReactionField),
+            "softcore": make_kernel_wrapper(program.setSoftCore),
+            "atom_properties": make_kernel_wrapper(program.setAtomProperties),
+            "atom_positions": make_kernel_wrapper(program.setAtomPositions),
+            "water_properties": make_kernel_wrapper(program.setWaterProperties),
+            "update_water": make_kernel_wrapper(program.updateWater),
+            "deletion": make_kernel_wrapper(program.findDeletionCandidates),
+            "water": make_kernel_wrapper(program.generateWater),
+            "energy": make_kernel_wrapper(program.computeEnergy),
+            "acceptance": make_kernel_wrapper(program.checkAcceptance),
+        }
+
+        return kernels
+
+    def to_gpu(self, array: _np.ndarray) -> _Any:
+        """
+        Transfer a NumPy array to GPU memory.
+
+        Parameters
+        ----------
+        array : numpy.ndarray
+            Array to transfer to GPU.
+
+        Returns
+        -------
+        pyopencl.array.Array
+            GPU array containing the data.
+        """
+        return _cl_array.to_device(self._queue, array)
+
+    def empty(self, shape, dtype) -> _Any:
+        """
+        Allocate an empty GPU buffer.
+
+        Parameters
+        ----------
+        shape : tuple
+            Shape of the array to allocate.
+
+        dtype : numpy.dtype
+            Data type of the array.
+
+        Returns
+        -------
+        pyopencl.array.Array
+            Allocated GPU array.
+        """
+        return _cl_array.empty(self._queue, shape, dtype)
+
+    def from_gpu(self, buffer: _Any) -> _np.ndarray:
+        """
+        Transfer data from GPU memory to host NumPy array.
+
+        Parameters
+        ----------
+        buffer : pyopencl.array.Array
+            GPU array to transfer from.
+
+        Returns
+        -------
+        numpy.ndarray
+            Array containing the data from GPU.
+        """
+        return buffer.get()
+
+    def push_context(self):
+        """
+        Push context (no-op for OpenCL).
+
+        OpenCL doesn't use context stacking like CUDA, so this method
+        does nothing. It's provided for API compatibility.
+        """
+        pass
+
+    def pop_context(self):
+        """
+        Pop context (no-op for OpenCL).
+
+        OpenCL doesn't use context stacking like CUDA, so this method
+        does nothing. It's provided for API compatibility.
+        """
+        pass
+
+    def cleanup(self):
+        """
+        Clean up OpenCL resources.
+        """
+        # OpenCL resources are automatically released when objects are deleted
+        # No explicit cleanup needed, but we'll clear references
+        self._queue = None
+        self._context = None
+
+    @property
+    def platform_name(self) -> str:
+        """
+        Get the name of the platform backend.
+
+        Returns
+        -------
+        str
+            Platform name ('opencl').
+        """
+        return "opencl"
diff --git a/src/loch/_platforms/_rng.py b/src/loch/_platforms/_rng.py
new file mode 100644
index 0000000..42699f4
--- /dev/null
+++ b/src/loch/_platforms/_rng.py
@@ -0,0 +1,133 @@
+######################################################################
+# Loch: GPU accelerated GCMC water sampling engine.
+#
+# Copyright: 2025-2026
+#
+# Authors: The OpenBioSim Team <team@openbiosim.org>
+#
+# Loch is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Loch is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Loch. If not, see <http://www.gnu.org/licenses/>.
+#####################################################################
+
+"""
+Random number generation utilities for GPU platforms.
+"""
+
+from dataclasses import dataclass as _dataclass
+from queue import Queue as _Queue
+from threading import Thread as _Thread
+from typing import Optional as _Optional
+
+import numpy as _np
+
+
+@_dataclass
+class BatchRandoms:
+    """
+    Container for all random numbers needed for a single batch.
+
+    Attributes
+    ----------
+    rotation : numpy.ndarray
+        Shape (batch_size, 3) array of uniform [0,1) randoms for water rotation.
+    position : numpy.ndarray
+        Shape (batch_size, 3) array of normal randoms for position direction.
+    radius : numpy.ndarray
+        Shape (batch_size,) array of uniform [0,1) randoms for radial distance.
+    acceptance : numpy.ndarray
+        Shape (batch_size,) array of uniform [0,1) randoms for Metropolis acceptance.
+    """
+
+    rotation: _np.ndarray
+    position: _np.ndarray
+    radius: _np.ndarray
+    acceptance: _np.ndarray
+
+
+class RNGManager:
+    """
+    Host-side random number generator for both CUDA and OpenCL backends.
+
+    This class generates random numbers on the host (CPU) that are then
+    transferred to the GPU for use in kernels. This approach eliminates
+    the need for device-side RNG libraries (like CURAND) and simplifies
+    cross-platform support.
+
+    Random number batches are pre-computed in a background thread while GPU
+    kernels are running, so that they are ready when the next batch is requested.
+    """
+
+    _MAX_QUEUE_SIZE = 10
+
+    def __init__(self, batch_size: int, seed: _Optional[int] = None) -> None:
+        """
+        Initialize the RNG manager.
+
+        Parameters
+        ----------
+        batch_size : int
+            Number of parallel trials in a batch.
+
+        seed : int, optional
+            Random seed for reproducibility. If None, a random seed is used.
+        """
+        self._batch_size = batch_size
+        self._rng = _np.random.default_rng(seed)
+        self._queue: _Queue[BatchRandoms] = _Queue(maxsize=self._MAX_QUEUE_SIZE)
+        self._shutdown = False
+
+        # Start the background thread that fills the queue.
+        self._thread = _Thread(target=self._fill_queue, daemon=True)
+        self._thread.start()
+
+    def _generate_batch(self) -> BatchRandoms:
+        """Generate a single batch of random numbers."""
+        return BatchRandoms(
+            rotation=self._rng.uniform(0, 1, size=(self._batch_size, 3)).astype(
+                _np.float32
+            ),
+            position=self._rng.normal(0, 1, size=(self._batch_size, 3)).astype(
+                _np.float32
+            ),
+            radius=self._rng.uniform(0, 1, size=self._batch_size).astype(_np.float32),
+            acceptance=self._rng.uniform(0, 1, size=self._batch_size).astype(
+                _np.float32
+            ),
+        )
+
+    def _fill_queue(self) -> None:
+        """Background thread that continuously fills the queue with batches."""
+        while not self._shutdown:
+            batch = self._generate_batch()
+            self._queue.put(batch)  # Blocks if queue is full.
+
+    def get_batch_randoms(self) -> BatchRandoms:
+        """
+        Get all random numbers needed for a single batch.
+
+        Returns
+        -------
+        BatchRandoms
+            Container with rotation, position, radius, and acceptance arrays.
+        """
+        return self._queue.get()
+
+    def shutdown(self) -> None:
+        """Stop the background thread. Call when done using the RNG manager."""
+        self._shutdown = True
+        # Drain one item to unblock the thread if it's waiting on put().
+        try:
+            self._queue.get_nowait()
+        except Exception:
+            pass
+        self._thread.join(timeout=1.0)
diff --git a/src/loch/_sampler.py b/src/loch/_sampler.py
index b8c9a95..f5e4958 100644
--- a/src/loch/_sampler.py
+++ b/src/loch/_sampler.py
@@ -1,7 +1,7 @@
 ######################################################################
 # Loch: GPU accelerated GCMC water sampling engine.
 #
-# Copyright: 2025
+# Copyright: 2025-2026
 #
 # Authors: The OpenBioSim Team <team@openbiosim.org>
 #
@@ -21,66 +21,77 @@
 
 __all__ = ["GCMCSampler"]
 
+from typing import Any as _Any, Optional as _Optional, Union as _Union
+
 import numpy as _np
 import openmm as _openmm
 import os as _os
 
 try:
     from somd2 import _logger
-except:
+except Exception:
     from loguru import logger as _logger
 
-import pycuda.driver as _cuda
-import pycuda.gpuarray as _gpuarray
-from pycuda.compiler import SourceModule as _SourceModule
-
 import BioSimSpace as _BSS
 import sire as _sr
 
-from ._kernels import code as _code
+from ._platforms import create_backend as _create_backend
+from ._platforms._rng import RNGManager as _RNGManager
+
+
+def _as_float32(arr: _np.ndarray) -> _np.ndarray:
+    """Convert array to float32 only if not already float32."""
+    return arr if arr.dtype == _np.float32 else arr.astype(_np.float32)
+
+
+def _as_int32(arr: _np.ndarray) -> _np.ndarray:
+    """Convert array to int32 only if not already int32."""
+    return arr if arr.dtype == _np.int32 else arr.astype(_np.int32)
 
 
 class GCMCSampler:
     """
-    A class to perform GCMC water sampling on the GPU via PyCUDA.
+    A class to perform GCMC water sampling on the GPU.
     """
 
     def __init__(
         self,
-        system,
-        reference=None,
-        radius="4.0 A",
-        cutoff_type="pme",
-        cutoff="10.0 A",
-        excess_chemical_potential="-6.09 kcal/mol",
-        standard_volume="30.543 A^3",
-        temperature="298 K",
-        adams_shift=0.0,
-        num_ghost_waters=20,
-        batch_size=1000,
-        num_attempts=10000,
-        num_threads=1024,
-        bulk_sampling_probability=0.1,
-        water_template=None,
-        device=None,
-        tolerance=0.0,
-        lambda_schedule=None,
-        lambda_value=0.0,
-        rest2_scale=1.0,
-        rest2_selection=None,
-        coulomb_power=0.0,
-        shift_coulomb="1 A",
-        shift_delta="1.5 A",
-        swap_end_states=False,
-        restart=False,
-        overwrite=False,
-        ghost_file="ghosts.txt",
-        log_file="gcmc.txt",
-        log_level="error",
-        seed=None,
-        nvcc=None,
-        **kwargs,
-    ):
+        system: _Any,
+        reference: _Optional[str] = None,
+        radius: str = "4.0 A",
+        cutoff_type: str = "pme",
+        cutoff: str = "10.0 A",
+        excess_chemical_potential: str = "-6.09 kcal/mol",
+        standard_volume: str = "30.543 A^3",
+        temperature: str = "298 K",
+        adams_shift: _Union[int, float] = 0.0,
+        num_ghost_waters: int = 20,
+        batch_size: int = 1000,
+        num_attempts: int = 10000,
+        num_threads: int = 1024,
+        bulk_sampling_probability: float = 0.1,
+        water_template: _Optional[_Any] = None,
+        device: _Optional[int] = None,
+        platform: str = "auto",
+        tolerance: float = 0.0,
+        lambda_schedule: _Optional[_Any] = None,
+        lambda_value: float = 0.0,
+        rest2_scale: float = 1.0,
+        rest2_selection: _Optional[str] = None,
+        coulomb_power: float = 0.0,
+        shift_coulomb: str = "1 A",
+        shift_delta: str = "1.5 A",
+        swap_end_states: bool = False,
+        restart: bool = False,
+        overwrite: bool = False,
+        ghost_file: _Optional[str] = "ghosts.txt",
+        log_file: _Optional[str] = "gcmc.txt",
+        log_level: str = "error",
+        seed: _Optional[int] = None,
+        nvcc: _Optional[str] = None,
+        compiler_optimisations: bool = True,
+        **kwargs: _Any,
+    ) -> None:
         """
         Initialise the GCMC sampler.
 
@@ -149,9 +160,14 @@ def __init__(
             from the template.
 
         device: int
-            The CUDA device index. (This is the index in the list of visible
+            The GPU device index. (This is the index in the list of visible
             devices.)
 
+        platform: str
+            The GPU platform to use. Options are 'auto' (default), 'cuda',
+            or 'opencl'. When 'auto', CUDA will be preferred if available,
+            falling back to OpenCL.
+
         tolerance: float
             The tolerance for the acceptance probability, i.e. the minimum
             probability of acceptance for a move. This can be used to exclude
@@ -216,6 +232,12 @@ def __init__(
         nvcc: str
             The path to the nvcc compiler. If None, the default nvcc
             in the PATH will be used.
+
+        compiler_optimisations: bool
+            Enable compiler optimisations for faster math operations.
+            When True, passes --use_fast_math to CUDA (nvcc) and
+            -cl-mad-enable -cl-no-signed-zeros to OpenCL.
+            Default: True (matches OpenMM defaults).
         """
 
         # Validate the input.
@@ -231,7 +253,7 @@ def __init__(
                 self._system = _sr.morph.link_to_reference(self._system)
             else:
                 self._is_fep = False
-        except:
+        except Exception:
             self._is_fep = False
 
         if reference is not None:
@@ -242,7 +264,7 @@ def __init__(
         if not isinstance(cutoff_type, str):
             raise ValueError("'cutoff_type' must be of type 'str'")
         cutoff_type = cutoff_type.lower().replace(" ", "")
-        if not cutoff_type in ["rf", "pme"]:
+        if cutoff_type not in ["rf", "pme"]:
             raise ValueError("The cutoff type must be 'rf' or 'pme'.")
         self._cutoff_type = cutoff_type
 
@@ -362,7 +384,7 @@ def __init__(
             raise ValueError("'log_level' must be of type 'str'")
         log_level = log_level.lower().replace(" ", "")
         allowed_levels = [level.lower() for level in _logger._core.levels]
-        if not log_level in allowed_levels:
+        if log_level not in allowed_levels:
             raise ValueError(
                 f"Invalid 'log_level': {log_level}. Choices are: {', '.join(allowed_levels)}"
             )
@@ -385,6 +407,7 @@ def __init__(
         # Create a random number generator.
         self._rng = _np.random.default_rng(self._seed)
 
+        # Validate nvcc path if provided
         if nvcc is not None:
             if not isinstance(nvcc, str):
                 raise ValueError("'nvcc' must be of type 'str'")
@@ -395,22 +418,6 @@ def __init__(
 
             nvcc = _os.environ.get("PYCUDA_NVCC", which("nvcc"))
 
-        from pycuda.tools import make_default_context
-
-        # Set the CUDA device.
-        _cuda.init()
-        if device is not None:
-            if not isinstance(device, int):
-                raise ValueError("'device' must be of type 'int'")
-            if device < 0 or device >= _cuda.Device.count():
-                raise ValueError(
-                    f"'device' must be between 0 and {cuda.Device.count() - 1}"
-                )
-            self._pycuda_context = _cuda.Device(device).make_context()
-        else:
-            self._pycuda_context = make_default_context()
-        self._device = self._pycuda_context.get_device()
-
         # Set the tolerance.
         try:
             self._tolerance = float(tolerance)
@@ -433,7 +440,7 @@ def __init__(
 
         try:
             lambda_value = float(lambda_value)
-        except:
+        except Exception:
             raise ValueError("'lambda_value' must be of type 'float'")
         if not 0.0 <= lambda_value <= 1.0:
             raise ValueError("'lambda_value' must be between 0 and 1")
@@ -441,7 +448,7 @@ def __init__(
 
         try:
             rest2_scale = float(rest2_scale)
-        except:
+        except Exception:
             raise ValueError("'rest2_scale' must be of type 'float'")
         if rest2_scale < 1.0:
             raise ValueError("'rest2_scale' must be greater than or equal to 1.0")
@@ -455,7 +462,7 @@ def __init__(
 
             try:
                 atoms = selection_to_atoms(self._system, rest2_selection)
-            except:
+            except Exception:
                 msg = "Invalid 'rest2_selection' value."
                 _logger.error(msg)
                 raise ValueError(msg)
@@ -469,7 +476,7 @@ def __init__(
 
         try:
             coulomb_power = float(coulomb_power)
-        except:
+        except Exception:
             raise ValueError("'coulomb_power' must be of type 'float'")
         self._coulomb_power = float(coulomb_power)
 
@@ -494,7 +501,7 @@ def __init__(
         # Check for waters and validate the template.
         try:
             self._water_template = system["water and not property is_perturbable"][0]
-        except:
+        except Exception:
             if water_template is None:
                 raise ValueError(
                     "The system does not contain any water molecules. "
@@ -524,31 +531,24 @@ def __init__(
         except Exception as e:
             raise ValueError(f"Could not prepare the system for GCMC sampling: {e}")
 
-        # Create the kernels.
-        self._kernels = {}
-        mod = _SourceModule(
-            _code
-            % {
-                "NUM_POINTS": self._num_points,
-                "NUM_BATCH": self._batch_size,
-                "NUM_WATERS": self._num_waters,
-                "NUM_ATOMS": self._num_atoms,
-            },
-            no_extern_c=True,
+        # Create platform backend
+        self._backend = _create_backend(
+            platform=platform,
+            device=device if device is not None else 0,
+            num_points=self._num_points,
+            num_batch=self._batch_size,
+            num_waters=self._num_waters,
+            num_atoms=self._num_atoms,
+            num_threads=self._num_threads,
             nvcc=nvcc,
+            compiler_optimisations=compiler_optimisations,
         )
-        self._kernels["cell"] = mod.get_function("setCellMatrix")
-        self._kernels["rng"] = mod.get_function("initialiseRNG")
-        self._kernels["rf"] = mod.get_function("setReactionField")
-        self._kernels["softcore"] = mod.get_function("setSoftCore")
-        self._kernels["atom_properties"] = mod.get_function("setAtomProperties")
-        self._kernels["atom_positions"] = mod.get_function("setAtomPositions")
-        self._kernels["water_properties"] = mod.get_function("setWaterProperties")
-        self._kernels["update_water"] = mod.get_function("updateWater")
-        self._kernels["deletion"] = mod.get_function("findDeletionCandidates")
-        self._kernels["water"] = mod.get_function("generateWater")
-        self._kernels["energy"] = mod.get_function("computeEnergy")
-        self._kernels["acceptance"] = mod.get_function("checkAcceptance")
+
+        # Compile kernels
+        self._kernels = self._backend.compile_kernels()
+
+        # Create RNG manager for host-side random number generation
+        self._rng_manager = _RNGManager(self._batch_size, seed=self._seed)
 
         # Work out the number of blocks to process the atoms.
         self._atom_blocks = self._num_atoms // self._num_threads + 1
@@ -625,7 +625,7 @@ def __init__(
         # Create a logger that writes to stderr and the log file.
         # The 'no_logger' keyword argument can be used to disable logging if
         # the sampler is being driven by an external package, e.g. SOMD2.
-        if not "no_logger" in kwargs:
+        if "no_logger" not in kwargs:
             _logger.remove()
             _logger.add(sys.stderr, level=self._log_level.upper())
             if self._log_file is not None:
@@ -643,7 +643,7 @@ def __init__(
 
         # Check for testing mode.
         if "test" in kwargs:
-            if kwargs["test"] == True:
+            if kwargs["test"]:
                 _logger.debug("Testing mode enabled")
                 self._is_test = True
             else:
@@ -651,7 +651,7 @@ def __init__(
         else:
             self._is_test = False
 
-    def __str__(self):
+    def __str__(self) -> str:
         """
         Return a string representation of the class.
         """
@@ -683,37 +683,52 @@ def __str__(self):
             f"seed={self._seed})"
         )
 
-    def __repr__(self):
+    def __repr__(self) -> str:
         """
         Return a string representation of the class.
         """
 
         return str(self)
 
-    def _cleanup(self):
+    def _cleanup(self) -> None:
         """
-        Detach the PyCUDA context.
+        Clean up GPU resources and detach context.
         """
         try:
-            self.pop()
-        except:
+            self._rng_manager.shutdown()
+        except Exception:
+            pass
+        try:
+            self._backend.cleanup()
+        except Exception:
             pass
-        self._pycuda_context.detach()
-        self._pycuda_context = None
 
-    def push(self):
+    def _invalidate_water_caches(self) -> None:
+        """Invalidate cached water indices. Call when _water_state changes."""
+        self._ghost_waters_cache = _np.where(self._water_state == 0)[0]
+        self._non_ghost_waters_cache = _np.where(self._water_state != 0)[0]
+
+    def _get_ghost_waters(self) -> _np.ndarray:
+        """Get indices of ghost waters (cached)."""
+        return self._ghost_waters_cache
+
+    def _get_non_ghost_waters(self) -> _np.ndarray:
+        """Get indices of non-ghost waters (cached)."""
+        return self._non_ghost_waters_cache
+
+    def push(self) -> None:
         """
-        Push the PyCUDA context on top of the stack.
+        Push the GPU context on top of the stack (CUDA only, no-op for OpenCL).
         """
-        self._pycuda_context.push()
+        self._backend.push_context()
 
-    def pop(self):
+    def pop(self) -> None:
         """
-        Pop the PyCUDA context from the stack.
+        Pop the GPU context from the stack (CUDA only, no-op for OpenCL).
         """
-        self._pycuda_context.pop()
+        self._backend.pop_context()
 
-    def system(self):
+    def system(self) -> _Any:
         """
         Return the GCMC system.
 
@@ -725,7 +740,22 @@ def system(self):
         """
         return self._system.clone()
 
-    def set_box(self, system):
+    def compiler_log(self) -> str:
+        """
+        Return the GPU kernel compiler log.
+
+        This includes any warnings generated during kernel compilation.
+        Useful for debugging or investigating compiler messages.
+
+        Returns
+        -------
+
+        log: str
+            The compiler log, or empty string if no warnings/messages.
+        """
+        return self._backend.compiler_log
+
+    def set_box(self, system: _Any) -> None:
         """
         Set the box information.
 
@@ -740,7 +770,7 @@ def set_box(self, system):
         if isinstance(system, _sr.system.System):
             try:
                 self._space = system.property("space")
-            except:
+            except Exception:
                 raise ValueError("'system' must contain a 'space' property")
         # Create a Sire TriclinicBox from the OpenMM box vectors.
         elif isinstance(system, _openmm.Context):
@@ -770,7 +800,7 @@ def set_box(self, system):
             grid=(1, 1, 1),
         )
 
-    def set_bulk_sampling_probability(self, probability):
+    def set_bulk_sampling_probability(self, probability: float) -> None:
         """
         Set the bulk sampling probability.
 
@@ -792,7 +822,7 @@ def set_bulk_sampling_probability(self, probability):
 
         self._bulk_sampling_probability = probability
 
-    def delete_waters(self, context):
+    def delete_waters(self, context: _openmm.Context) -> None:
         """
         Delete any waters within the GCMC sphere. (Convert to ghosts.)
 
@@ -813,13 +843,13 @@ def delete_waters(self, context):
         positions = state.getPositions(asNumpy=True) / _openmm.unit.angstrom
 
         # Get the position of the GCMC sphere centre.
-        target = _gpuarray.to_gpu(
+        target = self._backend.to_gpu(
             self._get_target_position(positions).astype(_np.float32)
         )
 
         # Set the positions on the GPU.
         self._kernels["atom_positions"](
-            _gpuarray.to_gpu(positions.astype(_np.float32).flatten()),
+            self._backend.to_gpu(positions.astype(_np.float32).flatten()),
             _np.float32(1.0),
             block=(self._num_threads, 1, 1),
             grid=(self._atom_blocks, 1, 1),
@@ -828,14 +858,14 @@ def delete_waters(self, context):
         # Find the non-ghost waters within the GCMC region.
         self._kernels["deletion"](
             self._deletion_candidates,
-            _gpuarray.to_gpu(target.astype(_np.float32)),
+            self._backend.to_gpu(target.astype(_np.float32)),
             _np.float32(self._radius.value()),
             block=(self._num_threads, 1, 1),
             grid=(self._water_blocks, 1, 1),
         )
 
         # Get the candidates.
-        candidates = self._deletion_candidates.get().flatten()
+        candidates = self._backend.from_gpu(self._deletion_candidates).flatten()
 
         # Find the waters within the GCMC sphere.
         candidates = _np.where(candidates == 1)[0]
@@ -849,7 +879,7 @@ def delete_waters(self, context):
         # Set the number of waters in the GCMC sphere to zero.
         self._N = 0
 
-    def num_waters(self):
+    def num_waters(self) -> int:
         """
         Return the number of waters in the GCMC region.
 
@@ -875,13 +905,13 @@ def num_waters(self):
             positions = state.getPositions(asNumpy=True) / _openmm.unit.angstrom
 
             # Get the position of the GCMC sphere centre.
-            target = _gpuarray.to_gpu(
+            target = self._backend.to_gpu(
                 self._get_target_position(positions).astype(_np.float32)
             )
 
             # Set the positions on the GPU.
             self._kernels["atom_positions"](
-                _gpuarray.to_gpu(positions.astype(_np.float32).flatten()),
+                self._backend.to_gpu(positions.astype(_np.float32).flatten()),
                 _np.float32(1.0),
                 block=(self._num_threads, 1, 1),
                 grid=(self._atom_blocks, 1, 1),
@@ -890,14 +920,14 @@ def num_waters(self):
             # Find the non-ghost waters within the GCMC region.
             self._kernels["deletion"](
                 self._deletion_candidates,
-                _gpuarray.to_gpu(target.astype(_np.float32)),
+                self._backend.to_gpu(target.astype(_np.float32)),
                 _np.float32(self._radius.value()),
                 block=(self._num_threads, 1, 1),
                 grid=(self._water_blocks, 1, 1),
             )
 
             # Get the candidates.
-            candidates = self._deletion_candidates.get().flatten()
+            candidates = self._backend.from_gpu(self._deletion_candidates).flatten()
 
             # Find the waters within the GCMC sphere.
             candidates = _np.where(candidates == 1)[0]
@@ -910,7 +940,7 @@ def num_waters(self):
 
         return self._N
 
-    def num_accepted_moves(self):
+    def num_accepted_moves(self) -> int:
         """
         Return the number of accepted moves.
 
@@ -922,7 +952,7 @@ def num_accepted_moves(self):
         """
         return self._num_accepted
 
-    def num_accepted_attempts(self):
+    def num_accepted_attempts(self) -> int:
         """
         Return the number accepted attempts. (Note that, when using PME, this
         is the number of accepted attempts for the approximate RF potential.)
@@ -935,13 +965,13 @@ def num_accepted_attempts(self):
         """
         return self._num_accepted_attempts
 
-    def water_state(self):
+    def water_state(self) -> _np.ndarray:
         """
         Return the current water state array: 0 = ghost water, 1 = real water.
         """
         return self._water_state.copy()
 
-    def move_acceptance_probability(self):
+    def move_acceptance_probability(self) -> float:
         """
         Return the acceptance probability.
 
@@ -953,7 +983,7 @@ def move_acceptance_probability(self):
         """
         return self._num_accepted / self._num_moves
 
-    def attempt_acceptance_probability(self):
+    def attempt_acceptance_probability(self) -> float:
         """
         Return the acceptance probability per attempt. (Note that, when using
         PME, this is acceptance probability for the approximate RF potential.)
@@ -966,7 +996,7 @@ def attempt_acceptance_probability(self):
         """
         return self._num_accepted_attempts / (self._num_moves * self._num_attempts)
 
-    def num_insertions(self):
+    def num_insertions(self) -> int:
         """
         Return the number of accepted insertions.
 
@@ -978,7 +1008,7 @@ def num_insertions(self):
         """
         return self._num_insertions
 
-    def num_deletions(self):
+    def num_deletions(self) -> int:
         """
         Return the number of accepted deletions.
 
@@ -990,7 +1020,7 @@ def num_deletions(self):
         """
         return self._num_deletions
 
-    def reset(self):
+    def reset(self) -> None:
         """
         Reset the sampler.
         """
@@ -1010,7 +1040,7 @@ def reset(self):
         # Clear the OpenMM context.
         self._openmm_context = None
 
-    def ghost_residues(self):
+    def ghost_residues(self) -> _np.ndarray:
         """
         Return the current indices of the ghost water residues in the OpenMM
         context.
@@ -1022,13 +1052,10 @@ def ghost_residues(self):
             The indices of the ghost water residues.
         """
 
-        # First get the indices of the ghost waters.
-        ghost_waters = _np.where(self._water_state == 0)[0]
-
         # Now extract and return the residue indices.
-        return self._water_residues[ghost_waters]
+        return self._water_residues[self._get_ghost_waters()]
 
-    def write_ghost_residues(self):
+    def write_ghost_residues(self) -> None:
         """
         Write the current indices of the ghost water residues to a file.
         """
@@ -1043,7 +1070,7 @@ def write_ghost_residues(self):
         with open(self._ghost_file, "a") as f:
             f.write(f"{', '.join([str(x) for x in ghost_residues])}\n")
 
-    def move(self, context):
+    def move(self, context: _openmm.Context) -> list[int]:
         """
         Perform num_attempts trial moves.
 
@@ -1115,9 +1142,7 @@ def move(self, context):
 
                     # Set the positions on the GPU.
                     self._kernels["atom_positions"](
-                        _gpuarray.to_gpu(
-                            positions_angstrom.astype(_np.float32).flatten()
-                        ),
+                        self._backend.to_gpu(_as_float32(positions_angstrom).flatten()),
                         _np.float32(1.0),
                         block=(self._num_threads, 1, 1),
                         grid=(self._atom_blocks, 1, 1),
@@ -1127,14 +1152,16 @@ def move(self, context):
                 if not self._is_bulk:
                     self._kernels["deletion"](
                         self._deletion_candidates,
-                        _gpuarray.to_gpu(target.astype(_np.float32)),
+                        self._backend.to_gpu(_as_float32(target)),
                         _np.float32(self._radius.value()),
                         block=(self._num_threads, 1, 1),
                         grid=(self._water_blocks, 1, 1),
                     )
 
                     # Get the candidates.
-                    deletion_candidates = self._deletion_candidates.get().flatten()
+                    deletion_candidates = self._backend.from_gpu(
+                        self._deletion_candidates
+                    ).flatten()
 
                     # Find the waters within the GCMC sphere.
                     deletion_candidates = _np.where(deletion_candidates == 1)[0]
@@ -1142,15 +1169,15 @@ def move(self, context):
                 # Use all non-ghost waters.
                 else:
                     _logger.debug("Sampling within the entire simulation box")
-                    deletion_candidates = _np.where(self._water_state != 0)[0]
+                    deletion_candidates = self._get_non_ghost_waters()
                     target = None
 
                 # Get the current ghost waters.
-                ghost_waters = _np.where(self._water_state == 0)[0]
+                ghost_waters = self._get_ghost_waters()
 
                 # If there are no ghost waters, then we can't perform any insertions.
                 if len(ghost_waters) == 0:
-                    msg = f"Cannot insert any more waters. Please increase 'num_ghost_waters'."
+                    msg = "Cannot insert any more waters. Please increase 'num_ghost_waters'."
                     _logger.error(msg)
                     raise RuntimeError(msg)
 
@@ -1159,10 +1186,10 @@ def move(self, context):
 
                 # Get the template positions for the water insertion.
                 start_idx = self._water_indices[idx_water]
-                template_positions = _gpuarray.to_gpu(
-                    positions_angstrom[start_idx : start_idx + self._num_points]
-                    .astype(_np.float32)
-                    .flatten()
+                template_positions = self._backend.to_gpu(
+                    _as_float32(
+                        positions_angstrom[start_idx : start_idx + self._num_points]
+                    ).flatten()
                 )
 
                 # Set the number of waters.
@@ -1183,32 +1210,38 @@ def move(self, context):
                 candidates = self._rng.choice(
                     deletion_candidates, size=self._batch_size
                 )
-                candidates_gpu = _gpuarray.to_gpu(candidates.astype(_np.int32))
+                candidates_gpu = self._backend.to_gpu(_as_int32(candidates))
 
                 # Generate the array of moves types. (0 = insertion, 1 = deletion)
                 is_deletion = self._rng.choice(2, size=self._batch_size)
-                is_deletion_gpu = _gpuarray.to_gpu(is_deletion.astype(_np.int32))
+                is_deletion_gpu = self._backend.to_gpu(_as_int32(is_deletion))
             # If there are no deletion candidates, then we can only perform
             # insertion moves.
             else:
                 candidates = _np.zeros(self._batch_size, dtype=_np.int32)
-                candidates_gpu = _gpuarray.to_gpu(candidates.astype(_np.int32))
+                candidates_gpu = self._backend.to_gpu(candidates)
                 is_deletion = _np.zeros(self._batch_size, dtype=_np.int32)
-                is_deletion_gpu = _gpuarray.to_gpu(is_deletion.astype(_np.int32))
+                is_deletion_gpu = self._backend.to_gpu(is_deletion)
 
             _logger.debug("Preparing insertion candidates")
 
             if target is None:
-                target_gpu = _gpuarray.to_gpu(_np.zeros(3, dtype=_np.float32))
+                target_gpu = self._zero_target_gpu
                 is_target = _np.int32(0)
                 exp_B = self._exp_B_bulk
                 exp_minus_B = self._exp_minus_B_bulk
             else:
-                target_gpu = _gpuarray.to_gpu(target.astype(_np.float32))
+                target_gpu = self._backend.to_gpu(_as_float32(target))
                 is_target = _np.int32(1)
                 exp_B = self._exp_B
                 exp_minus_B = self._exp_minus_B
 
+            # Get pre-computed random numbers for this batch.
+            batch_randoms = self._rng_manager.get_batch_randoms()
+            randoms_rotation = self._backend.to_gpu(batch_randoms.rotation)
+            randoms_position = self._backend.to_gpu(batch_randoms.position)
+            randoms_radius = self._backend.to_gpu(batch_randoms.radius)
+
             # Generate the random water positions and orientations.
             self._kernels["water"](
                 template_positions,
@@ -1216,6 +1249,9 @@ def move(self, context):
                 _np.float32(self._radius.value()),
                 self._water_positions,
                 is_target,
+                randoms_rotation,
+                randoms_position,
+                randoms_radius,
                 block=(self._num_threads, 1, 1),
                 grid=(self._batch_blocks, 1, 1),
             )
@@ -1232,6 +1268,9 @@ def move(self, context):
                 grid=(self._atom_blocks, self._batch_size, 1),
             )
 
+            # Transfer pre-computed acceptance randoms to GPU.
+            randoms_acceptance = self._backend.to_gpu(batch_randoms.acceptance)
+
             # Check the acceptance for each trial state.
             self._kernels["acceptance"](
                 _np.int32(self._N),
@@ -1245,12 +1284,15 @@ def move(self, context):
                 self._probability,
                 self._accepted,
                 _np.float32(self._tolerance),
+                randoms_acceptance,
                 block=(self._num_threads, 1, 1),
                 grid=(self._batch_blocks, 1, 1),
             )
 
             # Get the acceptance array.
-            accepted = _np.where(self._accepted.get().flatten() == 1)[0]
+            accepted = _np.where(self._backend.from_gpu(self._accepted).flatten() == 1)[
+                0
+            ]
 
             # Store the number of accepted attempts.
             num_accepted_attempts = len(accepted)
@@ -1272,9 +1314,12 @@ def move(self, context):
             # see whether it is accepted via the Gelb correction.
             if self._is_pme:
                 max_accepted = num_accepted_attempts
+                # Read energy changes once for all accepted trials.
+                energy_changes = self._backend.from_gpu(self._energy_change).flatten()
             # For RF we just use the first accepted trial.
             else:
                 max_accepted = 1
+                energy_changes = None
 
             # Loop over the accepted trials.
             for i in range(max_accepted):
@@ -1312,10 +1357,7 @@ def move(self, context):
                     # Apply the PME correction.
                     if self._is_pme:
                         # Get the energy change in kcal/mol.
-                        dE_RF = (
-                            self._energy_change.get().flatten()[idx]
-                            * _openmm.unit.kilocalories_per_mole
-                        )
+                        dE_RF = energy_changes[idx] * _openmm.unit.kilocalories_per_mole
 
                         # Get the new energy.
                         final_energy = context.getState(
@@ -1396,10 +1438,7 @@ def move(self, context):
                     # Apply the PME correction.
                     if self._is_pme:
                         # Get the energy change in kcal/mol.
-                        dE_RF = (
-                            self._energy_change.get().flatten()[idx]
-                            * _openmm.unit.kilocalories_per_mole
-                        )
+                        dE_RF = energy_changes[idx] * _openmm.unit.kilocalories_per_mole
 
                         # Get the new energy.
                         final_energy = context.getState(
@@ -1489,7 +1528,7 @@ def move(self, context):
 
         return moves
 
-    def bind_dynamics(self, dynamics):
+    def bind_dynamics(self, dynamics: _Any) -> None:
         """
         Bind the GCMC sampler to a Sire Dynamics object.
 
@@ -1505,7 +1544,7 @@ def bind_dynamics(self, dynamics):
         dynamics._d._gcmc_sampler = self
 
     @staticmethod
-    def _validate_sire_unit(parameter, value, unit):
+    def _validate_sire_unit(parameter: str, value: str, unit: _Any) -> _Any:
         """
         Validate a Sire unit.
 
@@ -1541,8 +1580,7 @@ def _validate_sire_unit(parameter, value, unit):
 
         return u
 
-    @staticmethod
-    def _get_box_information(space):
+    def _get_box_information(self, space):
         """
         Get the box information from the system.
 
@@ -1590,11 +1628,11 @@ def _get_box_information(space):
         M = _np.array([row0, row1, row2])
 
         # Convert to GPU memory.
-        cell_matrix = _gpuarray.to_gpu(cell_matrix.flatten().astype(_np.float32))
-        cell_matrix_inverse = _gpuarray.to_gpu(
+        cell_matrix = self._backend.to_gpu(cell_matrix.flatten().astype(_np.float32))
+        cell_matrix_inverse = self._backend.to_gpu(
             cell_matrix_inverse.flatten().astype(_np.float32)
         )
-        M = _gpuarray.to_gpu(M.flatten().astype(_np.float32))
+        M = self._backend.to_gpu(M.flatten().astype(_np.float32))
 
         return cell_matrix, cell_matrix_inverse, M
 
@@ -1666,7 +1704,7 @@ def _prepare_system(system, water_template, rng, num_ghost_waters):
         # Get the space property from the system.
         try:
             space = system.property("space")
-        except:
+        except Exception:
             raise ValueError("'system' must contain a 'space' property")
 
         # Get the box matrix and diagonal.
@@ -1752,7 +1790,7 @@ def _initialise_gpu_memory(self):
                         i += 1
 
                 # Convert to a GPU array.
-                charges = _gpuarray.to_gpu(charges.astype(_np.float32))
+                charges = self._backend.to_gpu(charges.astype(_np.float32))
 
             except Exception as e:
                 raise ValueError(f"Could not get the charges on the atoms: {e}")
@@ -1769,17 +1807,19 @@ def _initialise_gpu_memory(self):
                         i += 1
 
                 # Convert to GPU arrays.
-                sigmas = _gpuarray.to_gpu(sigmas.astype(_np.float32))
-                epsilons = _gpuarray.to_gpu(epsilons.astype(_np.float32))
+                sigmas = self._backend.to_gpu(sigmas.astype(_np.float32))
+                epsilons = self._backend.to_gpu(epsilons.astype(_np.float32))
 
             except Exception as e:
                 raise ValueError(f"Could not get the LJ parameters: {e}")
 
             # Set the alphas to zero.
-            alphas = _gpuarray.to_gpu(_np.zeros(self._num_atoms, dtype=_np.float32))
+            alphas = self._backend.to_gpu(_np.zeros(self._num_atoms, dtype=_np.float32))
 
             # Set the is_ghost_fep array to zero.
-            is_ghost_fep = _gpuarray.to_gpu(_np.zeros(self._num_atoms, dtype=_np.int32))
+            is_ghost_fep = self._backend.to_gpu(
+                _np.zeros(self._num_atoms, dtype=_np.int32)
+            )
 
         # Otherwise, we need to create an OpenMM context using the specified lambda
         # schedule and value, then extract the required properties from the forces
@@ -1841,10 +1881,10 @@ def _initialise_gpu_memory(self):
                 alphas[i] = alpha
 
             # Convert to GPU arrays.
-            charges = _gpuarray.to_gpu(charges.astype(_np.float32))
-            sigmas = _gpuarray.to_gpu(sigmas.astype(_np.float32))
-            epsilons = _gpuarray.to_gpu(epsilons.astype(_np.float32))
-            alphas = _gpuarray.to_gpu(alphas.astype(_np.float32))
+            charges = self._backend.to_gpu(charges.astype(_np.float32))
+            sigmas = self._backend.to_gpu(sigmas.astype(_np.float32))
+            epsilons = self._backend.to_gpu(epsilons.astype(_np.float32))
+            alphas = self._backend.to_gpu(alphas.astype(_np.float32))
 
             # Create the ghost atom array.
             is_ghost_fep = _np.zeros(self._num_atoms, dtype=_np.int32)
@@ -1881,7 +1921,7 @@ def _initialise_gpu_memory(self):
                             is_ghost_fep[idx] = 1
 
             # Convert to GPU array.
-            is_ghost_fep = _gpuarray.to_gpu(is_ghost_fep.astype(_np.int32))
+            is_ghost_fep = self._backend.to_gpu(is_ghost_fep.astype(_np.int32))
 
         # Get the water properties.
         try:
@@ -1905,9 +1945,11 @@ def _initialise_gpu_memory(self):
             self._water_epsilon_custom = 2.0 * _np.sqrt(4.184 * self._water_epsilon)
 
             # Convert to GPU arrays.
-            charge_water = _gpuarray.to_gpu(self._water_charge.astype(_np.float32))
-            sigma_water = _gpuarray.to_gpu(self._water_sigma.astype(_np.float32))
-            epsilon_water = _gpuarray.to_gpu(self._water_epsilon.astype(_np.float32))
+            charge_water = self._backend.to_gpu(self._water_charge.astype(_np.float32))
+            sigma_water = self._backend.to_gpu(self._water_sigma.astype(_np.float32))
+            epsilon_water = self._backend.to_gpu(
+                self._water_epsilon.astype(_np.float32)
+            )
 
         except Exception as e:
             raise ValueError(f"Could not get the atomic properties of the water: {e}")
@@ -1924,16 +1966,13 @@ def _initialise_gpu_memory(self):
                     is_ghost_water[self._water_indices[i] + j] = 1
         self._water_state = _np.array(water_state).astype(_np.int32)
 
-        # Initialise the random number generator.
-        self._kernels["rng"](
-            _gpuarray.to_gpu(
-                _np.random.randint(
-                    _np.iinfo(_np.int32).max, size=(1, self._batch_size)
-                ).astype(_np.int32)
-            ),
-            block=(self._num_threads, 1, 1),
-            grid=(self._batch_blocks, 1, 1),
-        )
+        # Initialize water index caches (invalidated when water_state changes).
+        self._ghost_waters_cache = None
+        self._non_ghost_waters_cache = None
+        self._invalidate_water_caches()
+
+        # Pre-allocate zero target array for bulk sampling.
+        self._zero_target_gpu = self._backend.to_gpu(_np.zeros(3, dtype=_np.float32))
 
         # Initialise the reaction field parameters.
         self._kernels["rf"](
@@ -1959,7 +1998,7 @@ def _initialise_gpu_memory(self):
             sigmas,
             epsilons,
             alphas,
-            _gpuarray.to_gpu(is_ghost_water.astype(_np.int32)),
+            self._backend.to_gpu(is_ghost_water.astype(_np.int32)),
             is_ghost_fep,
             block=(self._num_threads, 1, 1),
             grid=(self._atom_blocks, 1, 1),
@@ -1970,33 +2009,35 @@ def _initialise_gpu_memory(self):
             charge_water,
             sigma_water,
             epsilon_water,
-            _gpuarray.to_gpu(self._water_indices.astype(_np.int32)),
-            _gpuarray.to_gpu(self._water_state.astype(_np.int32)),
+            self._backend.to_gpu(self._water_indices.astype(_np.int32)),
+            self._backend.to_gpu(self._water_state.astype(_np.int32)),
             block=(1, 1, 1),
             grid=(1, 1, 1),
         )
 
         # Initialise the memory to store the water positions.
-        self._water_positions = _gpuarray.empty(
+        self._water_positions = self._backend.empty(
             (1, self._batch_size * 3 * self._num_points), _np.float32
         )
 
         # Initialise memory to store the energy.
-        self._energy_coul = _gpuarray.empty(
+        self._energy_coul = self._backend.empty(
             (1, self._batch_size * self._num_atoms), _np.float32
         )
-        self._energy_lj = _gpuarray.empty(
+        self._energy_lj = self._backend.empty(
             (1, self._batch_size * self._num_atoms), _np.float32
         )
 
         # Initialise memory to store whether each attempt is accepted and
         # the probability of acceptance.
-        self._accepted = _gpuarray.empty((1, self._batch_size), _np.int32)
-        self._energy_change = _gpuarray.empty((1, self._batch_size), _np.float32)
-        self._probability = _gpuarray.empty((1, self._batch_size), _np.float32)
+        self._accepted = self._backend.empty((1, self._batch_size), _np.int32)
+        self._energy_change = self._backend.empty((1, self._batch_size), _np.float32)
+        self._probability = self._backend.empty((1, self._batch_size), _np.float32)
 
         # Initialise memory to store the deletion candidates.
-        self._deletion_candidates = _gpuarray.empty((1, self._num_waters), _np.int32)
+        self._deletion_candidates = self._backend.empty(
+            (1, self._num_waters), _np.int32
+        )
 
     def _accept_insertion(
         self, idx, idx_water, positions_openmm, positions_angstrom, context
@@ -2024,12 +2065,13 @@ def _accept_insertion(
         """
 
         # Get the new water positions.
-        water_positions = self._water_positions.get().reshape(
+        water_positions = self._backend.from_gpu(self._water_positions).reshape(
             (self._batch_size, self._num_points, 3)
         )[idx]
 
         # Update the water state.
         self._water_state[idx_water] = 1
+        self._invalidate_water_caches()
 
         # Get the starting atom index.
         start_idx = self._water_indices[idx_water]
@@ -2076,7 +2118,7 @@ def _accept_insertion(
             _np.int32(idx_water),
             _np.int32(1),
             _np.int32(1),
-            _gpuarray.to_gpu(water_positions.flatten().astype(_np.float32)),
+            self._backend.to_gpu(water_positions.flatten().astype(_np.float32)),
             block=(1, 1, 1),
             grid=(1, 1, 1),
         )
@@ -2100,6 +2142,7 @@ def _accept_deletion(self, idx, context):
 
         # Update the water state.
         self._water_state[idx] = 0
+        self._invalidate_water_caches()
 
         # Get the starting atom index.
         start_idx = self._water_indices[idx]
@@ -2134,7 +2177,7 @@ def _accept_deletion(self, idx, context):
             _np.int32(idx),
             _np.int32(0),
             _np.int32(0),
-            _gpuarray.to_gpu(
+            self._backend.to_gpu(
                 _np.zeros((self._num_points, 3), dtype=_np.float32).flatten()
             ),
             block=(1, 1, 1),
@@ -2160,6 +2203,7 @@ def _reject_deletion(self, idx, context):
 
         # Reset the water state.
         self._water_state[idx] = 1
+        self._invalidate_water_caches()
 
         # Get the starting atom index.
         start_idx = self._water_indices[idx]
@@ -2197,7 +2241,7 @@ def _reject_deletion(self, idx, context):
             _np.int32(idx),
             _np.int32(1),
             _np.int32(0),
-            _gpuarray.to_gpu(
+            self._backend.to_gpu(
                 _np.zeros((self._num_points, 3), dtype=_np.float32).flatten()
             ),
             block=(1, 1, 1),
@@ -2291,7 +2335,7 @@ def _set_water_state(self, context, indices=None, states=None, force=False):
                     _np.int32(idx),
                     _np.int32(0),
                     _np.int32(0),
-                    _gpuarray.to_gpu(
+                    self._backend.to_gpu(
                         _np.zeros((self._num_points, 3), dtype=_np.float32).flatten()
                     ),
                     block=(1, 1, 1),
@@ -2329,7 +2373,7 @@ def _set_water_state(self, context, indices=None, states=None, force=False):
                     _np.int32(idx),
                     _np.int32(1),
                     _np.int32(0),
-                    _gpuarray.to_gpu(
+                    self._backend.to_gpu(
                         _np.zeros((self._num_points, 3), dtype=_np.float32).flatten()
                     ),
                     block=(1, 1, 1),
@@ -2339,6 +2383,9 @@ def _set_water_state(self, context, indices=None, states=None, force=False):
                 # Set the new water state.
                 self._water_state[idx] = 1
 
+        # Invalidate water caches after all state updates.
+        self._invalidate_water_caches()
+
         # Update the NonbondedForce parameters in the context.
         self._nonbonded_force.updateParametersInContext(context)
 
@@ -2433,18 +2480,20 @@ def _log_insertion(self, idx, idx_water, pme_energy=None, pme_probability=None):
             The PME acceptance probability.
         """
         # Get the energies.
-        energy_coul = self._energy_coul.get().reshape(
+        energy_coul = self._backend.from_gpu(self._energy_coul).reshape(
+            (self._batch_size, self._num_atoms)
+        )
+        energy_lj = self._backend.from_gpu(self._energy_lj).reshape(
             (self._batch_size, self._num_atoms)
         )
-        energy_lj = self._energy_lj.get().reshape((self._batch_size, self._num_atoms))
 
         # Get the water positions.
-        water_positions = self._water_positions.get().reshape(
+        water_positions = self._backend.from_gpu(self._water_positions).reshape(
             (self._batch_size, self._num_points, 3)
         )
 
         # Get the RF acceptance probability.
-        probability = self._probability.get().flatten()
+        probability = self._backend.from_gpu(self._probability).flatten()
 
         # Store debugging attributes.
         self._debug = {
@@ -2503,13 +2552,15 @@ def _log_deletion(
             The PME acceptance probability.
         """
         # Get the coulomb and LJ energies.
-        energy_coul = self._energy_coul.get().reshape(
+        energy_coul = self._backend.from_gpu(self._energy_coul).reshape(
+            (self._batch_size, self._num_atoms)
+        )
+        energy_lj = self._backend.from_gpu(self._energy_lj).reshape(
             (self._batch_size, self._num_atoms)
         )
-        energy_lj = self._energy_lj.get().reshape((self._batch_size, self._num_atoms))
 
         # Get the RF acceptance probability.
-        probability = self._probability.get().flatten()
+        probability = self._backend.from_gpu(self._probability).flatten()
 
         # Get the water index.
         idx_water = self._water_indices[candidates[idx]]
@@ -2565,11 +2616,8 @@ def _flag_ghost_waters(self, system):
         if not isinstance(system, _sr.system.System):
             raise ValueError("'system' must be a Sire system")
 
-        # First get the indices of the ghost waters.
-        ghost_waters = _np.where(self._water_state == 0)[0]
-
-        # Now extract the oxygen indices.
-        ghost_oxygens = self._water_indices[ghost_waters]
+        # Now extract the oxygen indices using cached ghost water indices.
+        ghost_oxygens = self._water_indices[self._get_ghost_waters()]
 
         # Loop over the ghost waters and set the is_ghost property.
         for i in ghost_oxygens:
diff --git a/src/loch/_utils.py b/src/loch/_utils.py
index 67c2cec..3defbac 100644
--- a/src/loch/_utils.py
+++ b/src/loch/_utils.py
@@ -1,7 +1,7 @@
 ######################################################################
 # Loch: GPU accelerated GCMC water sampling engine.
 #
-# Copyright: 2025
+# Copyright: 2025-2026
 #
 # Authors: The OpenBioSim Team <team@openbiosim.org>
 #
@@ -23,19 +23,21 @@
 Utility functions for calibrating the GCMC potential.
 """
 
+from typing import Optional as _Optional
+
 __all__ = ["excess_chemical_potential", "standard_volume"]
 
 
 def excess_chemical_potential(
     system,
-    temperature="298 K",
-    pressure="1 bar",
-    cutoff="10 A",
-    runtime="5 ns",
-    num_lambda=24,
-    replica_exchange=False,
-    work_dir=None,
-):
+    temperature: str = "298 K",
+    pressure: str = "1 bar",
+    cutoff: str = "10 A",
+    runtime: str = "5 ns",
+    num_lambda: int = 24,
+    replica_exchange: bool = False,
+    work_dir: _Optional[str] = None,
+) -> float:
     """
     Calculate the excess chemical potential of a water molecule at the given
     temperature and pressure via alchemical decoupling.
@@ -75,7 +77,6 @@ def excess_chemical_potential(
         Excess chemical potential in kcal/mol.
     """
 
-    import os
     import sire as sr
 
     from BioSimSpace.FreeEnergy import Relative
@@ -173,31 +174,31 @@ def excess_chemical_potential(
     system = sr.morph.link_to_reference(system)
 
     # Get the lambda schedule from the molecule.
-    l = mol.property("schedule")
+    s = mol.property("schedule")
 
     # Avoid scaling kappa during decouple stage.
-    l.set_equation(stage="decouple", lever="kappa", force="ghost/ghost", equation=0)
-    l.set_equation(stage="decouple", lever="kappa", force="ghost-14", equation=0)
+    s.set_equation(stage="decouple", lever="kappa", force="ghost/ghost", equation=0)
+    s.set_equation(stage="decouple", lever="kappa", force="ghost-14", equation=0)
 
     # Add new discharging stage.
-    l.set_equation(stage="decouple", lever="charge", equation=l.final())
-    l.prepend_stage("decharge", l.initial())
-    l.set_equation(
+    s.set_equation(stage="decouple", lever="charge", equation=s.final())
+    s.prepend_stage("decharge", s.initial())
+    s.set_equation(
         stage="decharge",
         lever="charge",
-        equation=l.lam() * l.final() + l.initial() * (1 - l.lam()),
+        equation=s.lam() * s.final() + s.initial() * (1 - s.lam()),
     )
-    l.set_equation(stage="decharge", force="ghost/ghost", equation=l.initial())
-    l.set_equation(stage="decharge", force="ghost-14", equation=l.initial())
-    l.set_equation(
-        stage="decharge", lever="kappa", force="ghost/ghost", equation=-l.lam() + 1
+    s.set_equation(stage="decharge", force="ghost/ghost", equation=s.initial())
+    s.set_equation(stage="decharge", force="ghost-14", equation=s.initial())
+    s.set_equation(
+        stage="decharge", lever="kappa", force="ghost/ghost", equation=-s.lam() + 1
     )
-    l.set_equation(
-        stage="decharge", lever="kappa", force="ghost-14", equation=-l.lam() + 1
+    s.set_equation(
+        stage="decharge", lever="kappa", force="ghost-14", equation=-s.lam() + 1
     )
 
     # Update the schedule in the configuration.
-    config.lambda_schedule = l
+    config.lambda_schedule = s
 
     # Set up the runner.
     try:
@@ -223,12 +224,12 @@ def excess_chemical_potential(
 
 def standard_volume(
     system,
-    temperature="298 K",
-    pressure="1 bar",
-    cutoff="10 A",
-    num_samples=5000,
-    sample_interval="1 ps",
-):
+    temperature: str = "298 K",
+    pressure: str = "1 bar",
+    cutoff: str = "10 A",
+    num_samples: int = 5000,
+    sample_interval: str = "1 ps",
+) -> float:
     """
     Calculate the standard volume of water at the given temperature and pressure.
 
@@ -259,7 +260,6 @@ def standard_volume(
     float
         Standard volume in A^3/molecule.
     """
-    import os
     import sire as sr
 
     from openmm.unit import angstrom
diff --git a/tests/test_compiler.py b/tests/test_compiler.py
new file mode 100644
index 0000000..b369da3
--- /dev/null
+++ b/tests/test_compiler.py
@@ -0,0 +1,131 @@
+import os
+import shutil
+
+import pytest
+
+
+def _get_nvcc():
+    """Get nvcc path from environment or PATH."""
+    return os.environ.get("PYCUDA_NVCC") or shutil.which("nvcc")
+
+
+@pytest.mark.skipif(
+    "CUDA_VISIBLE_DEVICES" not in os.environ,
+    reason="Requires CUDA enabled GPU.",
+)
+class TestOpenCLCompiler:
+    """Tests for OpenCL compiler log functionality."""
+
+    def test_compiler_log_returns_string(self):
+        """Test that compiler_log returns a string after successful compilation."""
+        from loch._platforms._opencl import OpenCLPlatform
+
+        backend = OpenCLPlatform(
+            device=0,
+            num_points=3,
+            num_batch=10,
+            num_waters=5,
+            num_atoms=100,
+            num_threads=32,
+        )
+        backend.compile_kernels()
+
+        log = backend.compiler_log
+        assert isinstance(log, str)
+
+    def test_compiler_log_empty_before_compilation(self):
+        """Test that compiler_log is empty before compile_kernels is called."""
+        from loch._platforms._opencl import OpenCLPlatform
+
+        backend = OpenCLPlatform(
+            device=0,
+            num_points=3,
+            num_batch=10,
+            num_waters=5,
+            num_atoms=100,
+            num_threads=32,
+        )
+
+        # Before compilation, should return empty string
+        log = backend.compiler_log
+        assert log == ""
+
+
+@pytest.mark.skipif(
+    "CUDA_VISIBLE_DEVICES" not in os.environ,
+    reason="Requires CUDA enabled GPU.",
+)
+class TestCUDACompiler:
+    """Tests for CUDA compiler log functionality."""
+
+    def test_compiler_log_returns_string(self):
+        """Test that compiler_log returns a string after successful compilation."""
+        from loch._platforms._cuda import CUDAPlatform
+
+        backend = CUDAPlatform(
+            device=0,
+            num_points=3,
+            num_batch=10,
+            num_waters=5,
+            num_atoms=100,
+            num_threads=32,
+            nvcc=_get_nvcc(),
+        )
+        backend.compile_kernels()
+
+        log = backend.compiler_log
+        assert isinstance(log, str)
+
+        backend.cleanup()
+
+    def test_compiler_log_empty_before_compilation(self):
+        """Test that compiler_log is empty before compile_kernels is called."""
+        from loch._platforms._cuda import CUDAPlatform
+
+        backend = CUDAPlatform(
+            device=0,
+            num_points=3,
+            num_batch=10,
+            num_waters=5,
+            num_atoms=100,
+            num_threads=32,
+            nvcc=_get_nvcc(),
+        )
+
+        # Before compilation, should return empty string
+        log = backend.compiler_log
+        assert log == ""
+
+        backend.cleanup()
+
+    def test_compilation_error_raises_exception(self):
+        """Test that compilation errors raise RuntimeError with message."""
+        import loch._platforms._cuda as cuda_module
+        from loch._platforms._cuda import CUDAPlatform
+
+        backend = CUDAPlatform(
+            device=0,
+            num_points=3,
+            num_batch=10,
+            num_waters=5,
+            num_atoms=100,
+            num_threads=32,
+            nvcc=_get_nvcc(),
+        )
+
+        # Patch kernel code directly in the cuda module (not the kernels module,
+        # since it's already imported as _kernel_code at module load time).
+        original_code = cuda_module._kernel_code
+        # Use code with syntax error - will fail on CUDA compiler.
+        cuda_module._kernel_code = (
+            'extern "C" { __global__ void test( { syntax error here } }'
+        )
+
+        try:
+            with pytest.raises(RuntimeError) as exc_info:
+                backend.compile_kernels()
+
+            assert "CUDA kernel compilation failed" in str(exc_info.value)
+        finally:
+            cuda_module._kernel_code = original_code
+            backend.cleanup()
diff --git a/tests/test_energy.py b/tests/test_energy.py
index 08bed27..87cfcdc 100644
--- a/tests/test_energy.py
+++ b/tests/test_energy.py
@@ -12,8 +12,9 @@
     "CUDA_VISIBLE_DEVICES" not in os.environ,
     reason="Requires CUDA enabled GPU.",
 )
+@pytest.mark.parametrize("platform", ["cuda", "opencl"])
 @pytest.mark.parametrize("fixture", ["water_box", "bpti", "sd12"])
-def test_energy(fixture, request):
+def test_energy(fixture, platform, request):
     """
     Test that the RF energy difference agrees with OpenMM.
     """
@@ -39,6 +40,7 @@ def test_energy(fixture, request):
         ghost_file=None,
         log_file=None,
         test=True,
+        platform=platform,
     )
 
     # Create a dynamics object using the modified GCMC system.
@@ -54,11 +56,9 @@ def test_energy(fixture, request):
         coulomb_power=sampler._coulomb_power,
         shift_coulomb=str(sampler._shift_coulomb),
         shift_delta=str(sampler._shift_delta),
+        platform=platform,
     )
 
-    # Get the context.
-    context = d.context()
-
     # Loop until we accept an insertion move.
     is_accepted = False
     while not is_accepted:
@@ -142,3 +142,128 @@ def test_energy(fixture, request):
 
     # Check that the energy difference is close to the calculated energy change.
     assert math.isclose(energy_difference, sampler_energy, abs_tol=1e-2)
+
+
+@pytest.mark.skipif(
+    "CUDA_VISIBLE_DEVICES" not in os.environ,
+    reason="Requires CUDA enabled GPU.",
+)
+@pytest.mark.parametrize("fixture", ["water_box", "bpti", "sd12"])
+def test_platform_consistency(fixture, request):
+    """
+    Test that CUDA and OpenCL platforms produce consistent energy calculations.
+    """
+
+    # Get the fixture.
+    mols, reference = request.getfixturevalue(fixture)
+
+    # Standard lambda schedule.
+    schedule = sr.cas.LambdaSchedule.standard_morph()
+
+    # Set the lambda value.
+    lambda_value = 0.5
+
+    # Use a fixed seed for reproducibility
+    seed = 42
+
+    # Create CUDA sampler.
+    cuda_sampler = GCMCSampler(
+        mols,
+        cutoff_type="rf",
+        cutoff="10 A",
+        reference=reference,
+        lambda_schedule=schedule,
+        lambda_value=lambda_value,
+        log_level="debug",
+        ghost_file=None,
+        log_file=None,
+        test=True,
+        platform="cuda",
+        seed=seed,
+    )
+
+    # Create OpenCL sampler with same configuration.
+    opencl_sampler = GCMCSampler(
+        mols,
+        cutoff_type="rf",
+        cutoff="10 A",
+        reference=reference,
+        lambda_schedule=schedule,
+        lambda_value=lambda_value,
+        log_level="debug",
+        ghost_file=None,
+        log_file=None,
+        test=True,
+        platform="opencl",
+        seed=seed,
+    )
+
+    # Perform insertion moves on both samplers.
+    # With same seed, both platforms should generate identical random numbers
+    # and thus identical water positions, allowing direct energy comparison.
+
+    # Create dynamics objects for both.
+    cuda_d = cuda_sampler.system().dynamics(
+        cutoff_type="rf",
+        cutoff="10 A",
+        temperature="298 K",
+        pressure=None,
+        constraint="h_bonds",
+        timestep="2 fs",
+        schedule=schedule,
+        lambda_value=lambda_value,
+        coulomb_power=cuda_sampler._coulomb_power,
+        shift_coulomb=str(cuda_sampler._shift_coulomb),
+        shift_delta=str(cuda_sampler._shift_delta),
+        platform="cuda",
+    )
+
+    opencl_d = opencl_sampler.system().dynamics(
+        cutoff_type="rf",
+        cutoff="10 A",
+        temperature="298 K",
+        pressure=None,
+        constraint="h_bonds",
+        timestep="2 fs",
+        schedule=schedule,
+        lambda_value=lambda_value,
+        coulomb_power=opencl_sampler._coulomb_power,
+        shift_coulomb=str(opencl_sampler._shift_coulomb),
+        shift_delta=str(opencl_sampler._shift_delta),
+        platform="opencl",
+    )
+
+    # Perform moves until we get an accepted insertion on CUDA.
+    is_accepted = False
+    while not is_accepted:
+        moves = cuda_sampler.move(cuda_d.context())
+        if len(moves) > 0 and moves[0] == 0:
+            is_accepted = True
+
+    # Get CUDA energy calculation.
+    cuda_energy = cuda_sampler._debug["energy_coul"] + cuda_sampler._debug["energy_lj"]
+
+    # Perform moves until we get an accepted insertion on OpenCL.
+    is_accepted = False
+    while not is_accepted:
+        moves = opencl_sampler.move(opencl_d.context())
+        if len(moves) > 0 and moves[0] == 0:
+            is_accepted = True
+
+    # Get OpenCL energy calculation.
+    opencl_energy = (
+        opencl_sampler._debug["energy_coul"] + opencl_sampler._debug["energy_lj"]
+    )
+
+    # With same seed, the water positions should be identical, so energies
+    # should match closely. Allow small tolerance for floating point differences.
+    assert math.isfinite(cuda_energy), "CUDA energy is not finite"
+    assert math.isfinite(opencl_energy), "OpenCL energy is not finite"
+
+    # Energy calculations should be very close (within 0.1%)
+    relative_diff = abs(cuda_energy - opencl_energy) / max(
+        abs(cuda_energy), abs(opencl_energy), 1.0
+    )
+    assert (
+        relative_diff < 0.001
+    ), f"Platform energies differ: CUDA={cuda_energy:.6f}, OpenCL={opencl_energy:.6f}, relative_diff={relative_diff:.6f}"
diff --git a/tests/test_rng.py b/tests/test_rng.py
new file mode 100644
index 0000000..cbb514f
--- /dev/null
+++ b/tests/test_rng.py
@@ -0,0 +1,156 @@
+import numpy as np
+import time
+
+from loch._platforms._rng import RNGManager, BatchRandoms
+
+
+class TestBatchRandoms:
+    """Tests for the BatchRandoms dataclass."""
+
+    def test_dataclass_fields(self):
+        """Test that BatchRandoms has the expected fields."""
+        batch = BatchRandoms(
+            rotation=np.zeros((10, 3)),
+            position=np.zeros((10, 3)),
+            radius=np.zeros(10),
+            acceptance=np.zeros(10),
+        )
+        assert hasattr(batch, "rotation")
+        assert hasattr(batch, "position")
+        assert hasattr(batch, "radius")
+        assert hasattr(batch, "acceptance")
+
+
+class TestRNGManager:
+    """Tests for the RNGManager class."""
+
+    def test_batch_shapes(self):
+        """Test that generated batches have correct shapes."""
+        batch_size = 64
+        rng = RNGManager(batch_size=batch_size, seed=42)
+
+        batch = rng.get_batch_randoms()
+
+        assert batch.rotation.shape == (batch_size, 3)
+        assert batch.position.shape == (batch_size, 3)
+        assert batch.radius.shape == (batch_size,)
+        assert batch.acceptance.shape == (batch_size,)
+
+        rng.shutdown()
+
+    def test_batch_dtypes(self):
+        """Test that generated batches have float32 dtype."""
+        rng = RNGManager(batch_size=32, seed=42)
+
+        batch = rng.get_batch_randoms()
+
+        assert batch.rotation.dtype == np.float32
+        assert batch.position.dtype == np.float32
+        assert batch.radius.dtype == np.float32
+        assert batch.acceptance.dtype == np.float32
+
+        rng.shutdown()
+
+    def test_uniform_range(self):
+        """Test that uniform randoms are in [0, 1) range."""
+        rng = RNGManager(batch_size=1000, seed=42)
+
+        # Get several batches to have enough samples.
+        for _ in range(5):
+            batch = rng.get_batch_randoms()
+
+            assert np.all(batch.rotation >= 0) and np.all(batch.rotation < 1)
+            assert np.all(batch.radius >= 0) and np.all(batch.radius < 1)
+            assert np.all(batch.acceptance >= 0) and np.all(batch.acceptance < 1)
+
+        rng.shutdown()
+
+    def test_normal_distribution(self):
+        """Test that position randoms follow a normal distribution."""
+        rng = RNGManager(batch_size=1000, seed=42)
+
+        # Collect samples from multiple batches.
+        samples = []
+        for _ in range(10):
+            batch = rng.get_batch_randoms()
+            samples.append(batch.position.flatten())
+
+        all_samples = np.concatenate(samples)
+
+        # Check mean and std are approximately 0 and 1.
+        assert np.abs(np.mean(all_samples)) < 0.1
+        assert np.abs(np.std(all_samples) - 1.0) < 0.1
+
+        rng.shutdown()
+
+    def test_reproducibility_with_seed(self):
+        """Test that the same seed produces the same sequence."""
+        rng1 = RNGManager(batch_size=32, seed=12345)
+        rng2 = RNGManager(batch_size=32, seed=12345)
+
+        batch1 = rng1.get_batch_randoms()
+        batch2 = rng2.get_batch_randoms()
+
+        np.testing.assert_array_equal(batch1.rotation, batch2.rotation)
+        np.testing.assert_array_equal(batch1.position, batch2.position)
+        np.testing.assert_array_equal(batch1.radius, batch2.radius)
+        np.testing.assert_array_equal(batch1.acceptance, batch2.acceptance)
+
+        rng1.shutdown()
+        rng2.shutdown()
+
+    def test_different_seeds_produce_different_results(self):
+        """Test that different seeds produce different sequences."""
+        rng1 = RNGManager(batch_size=32, seed=111)
+        rng2 = RNGManager(batch_size=32, seed=222)
+
+        batch1 = rng1.get_batch_randoms()
+        batch2 = rng2.get_batch_randoms()
+
+        # At least one array should differ.
+        assert not np.array_equal(batch1.rotation, batch2.rotation)
+
+        rng1.shutdown()
+        rng2.shutdown()
+
+    def test_multiple_batches_are_different(self):
+        """Test that consecutive batches are different."""
+        rng = RNGManager(batch_size=32, seed=42)
+
+        batch1 = rng.get_batch_randoms()
+        batch2 = rng.get_batch_randoms()
+
+        assert not np.array_equal(batch1.rotation, batch2.rotation)
+        assert not np.array_equal(batch1.acceptance, batch2.acceptance)
+
+        rng.shutdown()
+
+    def test_queue_prefilling(self):
+        """Test that the background thread pre-fills the queue."""
+        rng = RNGManager(batch_size=32, seed=42)
+
+        # Give the background thread time to fill the queue.
+        time.sleep(0.2)
+
+        # Getting multiple batches should be fast since they're pre-computed.
+        start = time.perf_counter()
+        for _ in range(5):
+            rng.get_batch_randoms()
+        elapsed = time.perf_counter() - start
+
+        # Should be very fast since batches are pre-computed.
+        assert elapsed < 0.1
+
+        rng.shutdown()
+
+    def test_shutdown(self):
+        """Test that shutdown stops the background thread."""
+        rng = RNGManager(batch_size=32, seed=42)
+
+        # Verify thread is running.
+        assert rng._thread.is_alive()
+
+        rng.shutdown()
+
+        # Thread should have stopped.
+        assert not rng._thread.is_alive()