localai-org · mudler · May 29, 2026
diff --git a/BENCHMARK.md b/BENCHMARK.md
@@ -646,3 +646,37 @@ See the commit history for the full back-and-forth.
 - Bench images are not committed (`benchmarks/images/` is in `.gitignore`).
   The benchmark JSON and plot artifacts are committed and reproducible from
   any equivalent set of inputs.
+
+## GPU
+
+Built with `-DRFDETR_GGML_CUDA=ON`, rf-detr.cpp offloads weights to VRAM and
+runs the compute graph on the GPU. The deformable-attention bilinear sampler
+(`ggml_custom_4d`) has no GPU kernel, so the ggml scheduler runs those 3 ops
+(one per decoder layer) on CPU and inserts the device↔host copies
+automatically — confirmed via `GGML_SCHED_DEBUG=2`.
+
+| Device | Model | F16 median ms @ batch 1 | Same-box CPU F16 median ms | Speedup |
+|--------|-------|------------------------:|---------------------------:|--------:|
+| NVIDIA GB10 (Grace Blackwell, CUDA 13.0, cc 12.1, 122 GB) | rfdetr-base | **23.6** | 274 (20-core ARM, 8 threads) | **11.6x** |
+
+GPU bench: min 22.6 / median 23.6 / mean 23.5 / max 24.9 ms over 20 iters
+after 5 warmup. Same-box CPU bench: median 274 ms. The GB10's ARM CPU is much
+weaker than a desktop x86 part, so comparing against the Ryzen 9950X3D's
+~137 ms CPU number elsewhere in this doc is not apples-to-apples; the 11.6x
+here is the same-box figure.
+
+Correctness: GPU detections match the CPU baseline (`expected_base-f16.json`)
+8/8 at threshold 0.55 within the standard tolerance (score ≤ 0.05, bbox
+≤ 2 px). The overlapping high-confidence detections agree to ≤ 0.002 score —
+the small delta is GPU matmul rounding; the deformable sampler is CPU in both
+paths.
+
+Reproduce on a CUDA host:
+
+```bash
+export PATH=/usr/local/cuda/bin:$PATH
+cmake -B build-cuda -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_CUDA=ON
+cmake --build build-cuda -j
+./build-cuda/bin/rfdetr-cli bench --model models/rfdetr-base-f16.gguf \
+    --input tests/fixtures/ci/test_image.jpg --iters 20 --warmup 5
+```
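
Reviewer sketch (not part of the patch): the correctness criterion in the BENCHMARK.md hunk above (score ≤ 0.05, bbox ≤ 2 px) can be written down as a small comparison helper. This is a minimal illustration under stated assumptions: the `Detection` struct, its field names, the class check, and the greedy pairing are hypothetical and are not taken from `rfdetr.h` or from the repository's test harness, which this diff does not show.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical detection record; field names are illustrative only.
struct Detection {
    int   class_id;
    float score;
    float x1, y1, x2, y2;  // box corners in pixels
};

// True when both detections have the same class, their scores differ by at
// most score_tol, and every box coordinate differs by at most bbox_tol_px.
inline bool within_tolerance(const Detection& a, const Detection& b,
                             float score_tol = 0.05f,
                             float bbox_tol_px = 2.0f) {
    if (a.class_id != b.class_id) return false;
    if (std::fabs(a.score - b.score) > score_tol) return false;
    const float delta[4] = {a.x1 - b.x1, a.y1 - b.y1, a.x2 - b.x2, a.y2 - b.y2};
    for (float d : delta) {
        if (std::fabs(d) > bbox_tol_px) return false;
    }
    return true;
}

// Counts baseline detections that have a distinct partner in `got`.
// Greedy first-fit pairing (an assumption): adequate when detections are
// well separated, which is the case a tolerance check normally targets.
inline std::size_t count_matches(const std::vector<Detection>& expected,
                                 const std::vector<Detection>& got,
                                 float score_tol = 0.05f,
                                 float bbox_tol_px = 2.0f) {
    std::vector<bool> used(got.size(), false);
    std::size_t matched = 0;
    for (const Detection& e : expected) {
        for (std::size_t i = 0; i < got.size(); ++i) {
            if (!used[i] && within_tolerance(e, got[i], score_tol, bbox_tol_px)) {
                used[i] = true;
                ++matched;
                break;
            }
        }
    }
    return matched;
}
```

Under this sketch, an "8/8" result would correspond to `count_matches(baseline, gpu) == baseline.size()` with both lists already filtered at the 0.55 threshold.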
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -123,6 +123,24 @@ else()
     add_library(rfdetr STATIC ${RFDETR_SOURCES})
 endif()
 
+# Propagate the active GPU backend to src/backend.cpp so it knows which
+# ggml_backend_*_init() to call. ggml itself is configured above via the
+# GGML_* cache vars; these definitions just gate our wiring code.
+if(RFDETR_GGML_CUDA)
+    target_compile_definitions(rfdetr PRIVATE RFDETR_USE_CUDA)
+endif()
+if(RFDETR_GGML_METAL)
+    target_compile_definitions(rfdetr PRIVATE RFDETR_USE_METAL)
+endif()
+if(RFDETR_GGML_VULKAN)
+    target_compile_definitions(rfdetr PRIVATE RFDETR_USE_VULKAN)
+endif()
+if(RFDETR_GGML_HIPBLAS)
+    # HIP uses the CUDA backend code path in ggml (ggml_backend_cuda_init
+    # works for HIP builds). Expose the same define.
+    target_compile_definitions(rfdetr PRIVATE RFDETR_USE_CUDA)
+endif()
+
 target_include_directories(rfdetr
     PUBLIC  ${CMAKE_CURRENT_SOURCE_DIR}/include
     PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)

diff --git a/README.md b/README.md
@@ -267,8 +267,12 @@ rf-detr.cpp provides:
 - Faster than PyTorch CPU on every variant we measured (1.05x to 1.45x across
   Nano-to-Medium).
 - Quantization down to about 30 MB (Q4_K) with measured accuracy tradeoffs.
-- CUDA / Metal / Vulkan support via ggml backends. CPU is the only one we ship and
-  benchmark today; the others compile but are not yet validated.
+- GPU offload via ggml backends (CUDA / Metal / Vulkan / HIP). Build with
+  `-DRFDETR_GGML_CUDA=ON` (or `_METAL` / `_VULKAN` / `_HIPBLAS`) and inference
+  runs on the GPU: weights are realized in VRAM and the compute graph runs on
+  the device, with the deformable-attention sampler automatically falling back
+  to CPU via the ggml scheduler. Validated on an NVIDIA GB10: 23.6 ms/image
+  (F16) vs 274 ms on the same box's CPU — an 11.6x speedup.
 - A flat C ABI ([`include/rfdetr.h`](include/rfdetr.h)) for embedding via dlopen, purego,
   or cgo.
 - End-to-end parity validation against the upstream PyTorch reference, per-module and
@@ -298,7 +302,32 @@ they're in place. Run `scripts/apply_ggml_patches.sh` manually to inspect the pa
 | `RFDETR_SHARED`           | OFF     | Build `librfdetr.so` (shared library for embedding)      |
 | `GGML_NATIVE`             | ON      | Compile ggml with `-march=native`                        |
 | `GGML_LLAMAFILE`          | ON      | Enable ggml's tinyBLAS SGEMM (closes most of the PyTorch gap) |
-| `GGML_CUDA` / `GGML_METAL` | OFF    | Enable GPU backends (untested for rf-detr.cpp, may need work) |
+| `RFDETR_GGML_CUDA` / `_METAL` / `_VULKAN` / `_HIPBLAS` | OFF | Offload inference to GPU. Weights go to VRAM; the deformable-attention sampler falls back to CPU via the ggml scheduler. One backend per build. |
+
+### GPU offload
+
+rf-detr.cpp can offload inference to a GPU via ggml's backends. Build with one
+of:
+
+```bash
+cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_CUDA=ON     # NVIDIA (CUDA)
+cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_HIPBLAS=ON  # AMD (ROCm/HIP)
+cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_METAL=ON    # Apple (Metal)
+cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_VULKAN=ON   # cross-vendor (Vulkan)
+```
+
+When a device is present, model weights are realized in VRAM and the compute
+graph runs on the GPU. The one op without a GPU kernel, the
+deformable-attention bilinear sampler, is automatically run on CPU by the ggml
+scheduler, which inserts the device↔host copies. If no device is found at
+runtime, inference falls back cleanly to CPU.
+
+Validated on an NVIDIA GB10 (Grace Blackwell, CUDA 13, compute capability
+12.1): rfdetr-base F16 runs at **23.6 ms/image** on the GPU vs **274 ms** on
+the same box's 20-core ARM CPU (8 threads) — an **11.6x speedup**. Detections
+match the CPU baseline within the standard tolerance (score ≤ 0.05, bbox
+≤ 2 px); the 3 deformable-attention ops are confirmed running on CPU via the
+scheduler. See [BENCHMARK.md](BENCHMARK.md#gpu) for details.
 
 ## Tests
 

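Reviewer sketch (not part of the patch): the headline numbers quoted in the README and BENCHMARK.md hunks (min / median / mean / max, and the 11.6x same-box speedup) follow from simple arithmetic over per-iteration latencies. The helper below shows one way to compute them. It is an assumption about the method, not the CLI's actual code: in particular, averaging the two middle samples for an even iteration count is a common convention that this diff does not confirm. `BenchStats`, `summarize`, and `speedup` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Summary of per-iteration latencies, in milliseconds.
struct BenchStats {
    double min_ms;
    double median_ms;
    double mean_ms;
    double max_ms;
};

// Summarize timed iterations. Warmup iterations are assumed to have been
// dropped by the caller before this is invoked.
inline BenchStats summarize(std::vector<double> ms) {
    assert(!ms.empty());
    std::sort(ms.begin(), ms.end());
    const std::size_t n = ms.size();
    const double median = (n % 2 == 1)
        ? ms[n / 2]
        : 0.5 * (ms[n / 2 - 1] + ms[n / 2]);  // even count: mean of middle pair
    const double mean =
        std::accumulate(ms.begin(), ms.end(), 0.0) / static_cast<double>(n);
    return BenchStats{ms.front(), median, mean, ms.back()};
}

// Same-box speedup: CPU median latency divided by GPU median latency.
inline double speedup(double cpu_median_ms, double gpu_median_ms) {
    assert(gpu_median_ms > 0.0);
    return cpu_median_ms / gpu_median_ms;
}
```

With the medians reported in the patch, `speedup(274.0, 23.6)` evaluates to roughly 11.61, which rounds to the quoted 11.6x.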
diff --git a/src/backend.cpp b/src/backend.cpp
@@ -6,8 +6,59 @@
 #include "ggml-backend.h"
 #include "ggml-cpu.h"
 
+#if defined(RFDETR_USE_CUDA)
+#include "ggml-cuda.h"
+#endif
+#if defined(RFDETR_USE_METAL)
+#include "ggml-metal.h"
+#endif
+#if defined(RFDETR_USE_VULKAN)
+#include "ggml-vulkan.h"
+#endif
+
+#include <vector>
+
 namespace rfdetr {
 
+/* Try to create a GPU backend if one was compiled in and a device exists.
+ * Returns nullptr (not an error) when no GPU backend is built or no device
+ * is present — the caller falls back to CPU-only. */
+static ggml_backend_t try_init_gpu_backend() {
+#if defined(RFDETR_USE_CUDA)
+    int n = ggml_backend_cuda_get_device_count();
+    if (n > 0) {
+        ggml_backend_t b = ggml_backend_cuda_init(0);  // device 0
+        if (b) {
+            rfdetr_logf(RFDETR_LOG_INFO, "GPU backend: CUDA device 0 (%d available)", n);
+            return b;
+        }
+        rfdetr_logf(RFDETR_LOG_WARN, "ggml_backend_cuda_init(0) failed; using CPU");
+    }
+    return nullptr;
+#elif defined(RFDETR_USE_METAL)
+    ggml_backend_t b = ggml_backend_metal_init();
+    if (b) {
+        rfdetr_logf(RFDETR_LOG_INFO, "GPU backend: Metal");
+        return b;
+    }
+    rfdetr_logf(RFDETR_LOG_WARN, "ggml_backend_metal_init failed; using CPU");
+    return nullptr;
+#elif defined(RFDETR_USE_VULKAN)
+    int n = ggml_backend_vk_get_device_count();
+    if (n > 0) {
+        ggml_backend_t b = ggml_backend_vk_init(0);
+        if (b) {
+            rfdetr_logf(RFDETR_LOG_INFO, "GPU backend: Vulkan device 0 (%d available)", n);
+            return b;
+        }
+        rfdetr_logf(RFDETR_LOG_WARN, "ggml_backend_vk_init(0) failed; using CPU");
+    }
+    return nullptr;
+#else
+    return nullptr;  // CPU-only build
+#endif
+}
+
 ggml_backend_t init_backend(int n_threads, rfdetr_status* out_status) {
     auto set = [&](rfdetr_status s) { if (out_status) *out_status = s; };
 
@@ -67,10 +118,38 @@ BackendCtx init_backend_ctx(int n_threads, rfdetr_status* out_status) {
         }
     }
 
+    /* Try a GPU backend. If present, build a scheduler spanning [gpu, cpu]
+     * so ops the GPU can't run (the deformable ggml_custom_4d) fall back to
+     * CPU automatically. */
+    ctx.gpu = try_init_gpu_backend();
+    if (ctx.gpu) {
+        std::vector<ggml_backend_t> backends = { ctx.gpu, ctx.cpu };
+        std::vector<ggml_backend_buffer_type_t> bufts = {
+            ggml_backend_get_default_buffer_type(ctx.gpu),
+            ggml_backend_get_default_buffer_type(ctx.cpu),
+        };
+        ctx.sched = ggml_backend_sched_new(
+            backends.data(), bufts.data(), (int)backends.size(),
+            /*graph_size*/ 16384, /*parallel*/ false, /*op_offload*/ true);
+        if (!ctx.sched) {
+            rfdetr_logf(RFDETR_LOG_WARN,
+                        "ggml_backend_sched_new failed; falling back to CPU-only");
+            ggml_backend_free(ctx.gpu);
+            ctx.gpu = nullptr;
+        }
+    }
+
     set(RFDETR_OK);
     return ctx;
 }
 
+ggml_backend_buffer_type_t backend_ctx_weight_buft(const BackendCtx& ctx) {
+    if (ctx.gpu) {
+        return ggml_backend_get_default_buffer_type(ctx.gpu);
+    }
+    return ggml_backend_get_default_buffer_type(ctx.cpu);
+}
+
 void free_backend_ctx(BackendCtx& ctx) {
     /* Free the gallocrs BEFORE the backends. The gallocr owns the compute
      * scratch buffers (allocated via the backend's buffer_type); freeing
@@ -85,6 +164,14 @@ void free_backend_ctx(BackendCtx& ctx) {
         ggml_gallocr_free(ctx.galloc_b);
         ctx.galloc_b = nullptr;
     }
+    if (ctx.sched) {
+        ggml_backend_sched_free(ctx.sched);
+        ctx.sched = nullptr;
+    }
+    if (ctx.gpu) {
+        ggml_backend_free(ctx.gpu);
+        ctx.gpu = nullptr;
+    }
     if (ctx.cpu) {
         ggml_backend_free(ctx.cpu);
         ctx.cpu = nullptr;
@@ -99,7 +186,38 @@ void free_backend_ctx(BackendCtx& ctx) {
     }
 }
 
-int backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph) {
+bool backend_ctx_graph_alloc(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph) {
+    if (ctx.sched) {
+        ggml_backend_sched_reset(ctx.sched);
+        if (!ggml_backend_sched_alloc_graph(ctx.sched, graph)) {
+            rfdetr_logf(RFDETR_LOG_ERROR, "backend_ctx_graph_alloc: sched alloc failed");
+            return false;
+        }
+        return true;
+    }
+    /* CPU path: persistent gallocr per graph. */
+    ggml_gallocr_t* slot = (which_graph == 0) ? &ctx.galloc_a : &ctx.galloc_b;
+    if (!*slot) {
+        *slot = ggml_gallocr_new(ggml_backend_get_default_buffer_type(ctx.cpu));
+        if (!*slot) {
+            rfdetr_logf(RFDETR_LOG_ERROR, "backend_ctx_graph_alloc: gallocr_new failed");
+            return false;
+        }
+    }
+    if (!ggml_gallocr_alloc_graph(*slot, graph)) {
+        rfdetr_logf(RFDETR_LOG_ERROR, "backend_ctx_graph_alloc: gallocr_alloc_graph failed");
+        return false;
+    }
+    return true;
+}
+
+int backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph) {
+    (void)which_graph;
+    if (ctx.sched) {
+        ggml_status st = ggml_backend_sched_graph_compute(ctx.sched, graph);
+        ggml_backend_sched_synchronize(ctx.sched);
+        return (int)st;
+    }
     ggml_status st = ggml_backend_graph_compute(ctx.cpu, graph);
     ggml_backend_synchronize(ctx.cpu);
     return (int)st;
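
Reviewer sketch (not part of the patch): the comments above describe the scheduler running GPU-capable ops on the device and falling back to CPU for the deformable sampler, with copies inserted at the boundaries. The toy model below illustrates that idea only. It is not ggml's scheduling algorithm, which is considerably more involved, and every name in it (`Op`, `Placement`, `place_ops`) is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Device { GPU, CPU };

// One node of a linear toy graph.
struct Op {
    std::string name;
    bool gpu_supported;  // false for an op with no GPU kernel
};

struct Placement {
    std::vector<Device> device;         // chosen device, one entry per op
    std::size_t cpu_ops = 0;            // ops that fell back to CPU
    std::size_t device_boundaries = 0;  // points where data must cross devices
};

// Place each op on the GPU when it has a GPU kernel, otherwise on the CPU,
// and count every point where consecutive ops sit on different devices.
inline Placement place_ops(const std::vector<Op>& ops) {
    Placement p;
    p.device.reserve(ops.size());
    for (const Op& op : ops) {
        const Device d = op.gpu_supported ? Device::GPU : Device::CPU;
        if (d == Device::CPU) ++p.cpu_ops;
        if (!p.device.empty() && p.device.back() != d) ++p.device_boundaries;
        p.device.push_back(d);
    }
    return p;
}
```

In this model, three decoder layers that each contain one unsupported sampler between GPU ops yield three CPU ops and six device boundaries, which is why keeping the number of fallback ops small matters for latency.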

diff --git a/src/backend.hpp b/src/backend.hpp
@@ -12,6 +12,10 @@ typedef struct ggml_threadpool* ggml_threadpool_t;
 struct ggml_cgraph;
 struct ggml_gallocr;
 typedef struct ggml_gallocr* ggml_gallocr_t;
+struct ggml_backend_sched;
+typedef struct ggml_backend_sched* ggml_backend_sched_t;
+struct ggml_backend_buffer_type;
+typedef struct ggml_backend_buffer_type* ggml_backend_buffer_type_t;
 
 namespace rfdetr {
 
@@ -57,6 +61,18 @@ struct BackendCtx {
      * Lazily created on first use, freed in free_backend_ctx. */
     ggml_gallocr_t       galloc_a    = nullptr;
     ggml_gallocr_t       galloc_b    = nullptr;
+
+    /* Optional GPU backend (CUDA / Metal / Vulkan), created when the
+     * library was built with one of RFDETR_USE_CUDA / _METAL / _VULKAN
+     * AND a device is actually present at runtime. nullptr on CPU-only
+     * builds or when no device is found. */
+    ggml_backend_t       gpu        = nullptr;
+
+    /* Scheduler spanning [gpu, cpu] when gpu != nullptr. Routes ops to the
+     * GPU and falls back to CPU for ops the GPU backend can't run (notably
+     * the deformable-attention ggml_custom_4d sampler). When gpu == nullptr
+     * this stays null and we use the plain CPU compute path. */
+    ggml_backend_sched_t sched      = nullptr;
 };
 
 /* Initialize the compute backend bundle. Creates a CPU backend and attaches
@@ -69,8 +85,27 @@ BackendCtx init_backend_ctx(int n_threads, rfdetr_status* out_status);
 /* Release a BackendCtx. Safe to call on a zero-initialized struct. */
 void free_backend_ctx(BackendCtx& ctx);
 
-/* Run a graph on the bundle's CPU backend. */
-int /* ggml_status */ backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph);
+/* Buffer type that model weights should be realized on. Returns the GPU
+ * backend's default buffer type when a GPU is active (so weights live in
+ * VRAM), otherwise the CPU host buffer type. Never returns null on a
+ * successfully-initialized BackendCtx. */
+ggml_backend_buffer_type_t backend_ctx_weight_buft(const BackendCtx& ctx);
+
+/* Allocate buffers for a graph. Uses the sched when active (GPU), else the
+ * persistent per-graph gallocr (which_graph: 0 = A, 1 = B). Returns false on
+ * allocation failure. Call this, then ggml_backend_tensor_set() the graph
+ * inputs, then backend_ctx_graph_compute(). */
+bool backend_ctx_graph_alloc(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph);
+
+/* Run a graph that was previously allocated with backend_ctx_graph_alloc().
+ * When a GPU + sched are present the graph is computed via
+ * ggml_backend_sched (which places ops across GPU/CPU and inserts
+ * cross-device copies as needed). On CPU-only bundles it runs on the CPU
+ * backend directly.
+ *
+ * `which_graph` is currently unused here: the allocator slot (0 = graph A,
+ * 1 = graph B) is chosen in backend_ctx_graph_alloc(). */
+int /* ggml_status */ backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph);
 
 }  // namespace rfdetr
 

diff --git a/src/rfdetr.cpp b/src/rfdetr.cpp
@@ -72,10 +72,12 @@ extern "C" rfdetr_context* rfdetr_init(const rfdetr_params* params, rfdetr_statu
         return nullptr;
     }
 
-    /* Weights are realized on the CPU backend's host buffer; both CPU and
-     * BLAS backends use ggml's host buffer type, so the BLAS backend can
-     * read them in-place via the sched. */
-    rfdetr_status rw_st = rfdetr::model_realize_weights(*m, bctx.cpu);
+    /* Realize weights on the GPU backend when one is active (offload to
+     * VRAM); otherwise on the CPU host buffer. The pos_embed bicubic
+     * resample inside model_realize_weights uses ggml_backend_tensor_get/set
+     * which work on any backend, so no other change is needed there. */
+    ggml_backend_t weight_backend = bctx.gpu ? bctx.gpu : bctx.cpu;
+    rfdetr_status rw_st = rfdetr::model_realize_weights(*m, weight_backend);
     if (rw_st != RFDETR_OK) {
         rfdetr::free_backend_ctx(bctx);
         rfdetr::model_free(m);