34 changes: 34 additions & 0 deletions BENCHMARK.md
@@ -646,3 +646,37 @@ See the commit history for the full back-and-forth.
- Bench images are not committed (`benchmarks/images/` is in `.gitignore`).
The benchmark JSON and plot artifacts are committed and reproducible from
any equivalent set of inputs.

## GPU

Built with `-DRFDETR_GGML_CUDA=ON`, rf-detr.cpp offloads weights to VRAM and
runs the compute graph on the GPU. The deformable-attention bilinear sampler
(`ggml_custom_4d`) has no GPU kernel, so the ggml scheduler runs those 3 ops
(one per decoder layer) on CPU and inserts the device↔host copies
automatically. This placement was confirmed via `GGML_SCHED_DEBUG=2`.

| Device | Model | F16 median ms @ batch 1 | Same-box CPU F16 median ms | Speedup |
|--------|-------|------------------------:|---------------------------:|--------:|
| NVIDIA GB10 (Grace Blackwell, CUDA 13.0, cc 12.1, 122 GB) | rfdetr-base | **23.6** | 274 (20-core ARM, 8 threads) | **11.6x** |

GPU bench: min 22.6 / median 23.6 / mean 23.5 / max 24.9 ms over 20 iterations
after 5 warmup runs. Same-box CPU bench: median 274 ms. The GB10's ARM CPU is
much weaker than a desktop x86 part, so comparing the GPU number against the
Ryzen 9950X3D's ~137 ms CPU number elsewhere in this doc is not
apples-to-apples; the 11.6x here is the same-box figure (274 / 23.6).
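
For readers who want to recompute the summary figures from raw per-iteration
timings, here is a minimal sketch. The helper names `median_ms` and `speedup`
are illustrative and are not part of rf-detr.cpp; the bench tool's own median
calculation may differ in detail (for example, in how it handles an even
number of samples).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of per-iteration latencies in milliseconds (hypothetical helper).
// Takes the vector by value because nth_element reorders its input.
double median_ms(std::vector<double> samples) {
    assert(!samples.empty());
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double upper = samples[mid];
    if (samples.size() % 2 == 1) {
        return upper;
    }
    // Even count: average the two middle values. After nth_element, every
    // element before `mid` is <= samples[mid], so the lower middle value is
    // the maximum of that prefix.
    const double lower = *std::max_element(samples.begin(), samples.begin() + mid);
    return 0.5 * (lower + upper);
}

// Same-box speedup: CPU median divided by GPU median (hypothetical helper).
double speedup(double cpu_median_ms, double gpu_median_ms) {
    assert(gpu_median_ms > 0.0);
    return cpu_median_ms / gpu_median_ms;
}
```

With the medians reported above, `speedup(274.0, 23.6)` evaluates to roughly
11.6, which is the figure shown in the table.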

Correctness: GPU detections match the CPU baseline (`expected_base-f16.json`)
for all 8 of 8 detections at threshold 0.55, within the standard tolerance
(score ≤ 0.05, bbox ≤ 2 px). The overlapping high-confidence detections agree
to within 0.002 in score. The small delta comes from GPU matmul rounding; the
deformable sampler runs on CPU in both paths.
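
The tolerance check described above can be sketched as follows. This is an
illustration only: the `Detection` struct, the function names, the greedy
one-to-one matching, and the requirement that class IDs be equal are all
assumptions, not the project's actual test harness. Only the two tolerance
values (0.05 score, 2 px per box coordinate) come from the text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical detection record; the project's real struct may differ.
struct Detection {
    int class_id;
    float score;
    float x1, y1, x2, y2;  // box corners in pixels
};

// True when two detections agree within the quoted tolerances.
// Assumption: class IDs must be identical for a match.
bool detection_matches(const Detection& got, const Detection& want,
                       float score_tol = 0.05f, float bbox_tol_px = 2.0f) {
    assert(score_tol >= 0.0f && bbox_tol_px >= 0.0f);
    return got.class_id == want.class_id &&
           std::fabs(got.score - want.score) <= score_tol &&
           std::fabs(got.x1 - want.x1) <= bbox_tol_px &&
           std::fabs(got.y1 - want.y1) <= bbox_tol_px &&
           std::fabs(got.x2 - want.x2) <= bbox_tol_px &&
           std::fabs(got.y2 - want.y2) <= bbox_tol_px;
}

// Greedy one-to-one matching: each expected detection may claim at most one
// unclaimed candidate. Returns the number of expected detections matched.
std::size_t count_matches(const std::vector<Detection>& got,
                          const std::vector<Detection>& want) {
    std::vector<bool> used(got.size(), false);
    std::size_t matched = 0;
    for (const Detection& w : want) {
        for (std::size_t i = 0; i < got.size(); ++i) {
            if (!used[i] && detection_matches(got[i], w)) {
                used[i] = true;
                ++matched;
                break;
            }
        }
    }
    return matched;
}
```

Under this sketch, an "8/8" result corresponds to `count_matches(gpu, baseline)`
returning 8 for a baseline of 8 detections.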

Reproduce on a CUDA host:

```bash
export PATH=/usr/local/cuda/bin:$PATH
cmake -B build-cuda -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_CUDA=ON
cmake --build build-cuda -j
./build-cuda/bin/rfdetr-cli bench --model models/rfdetr-base-f16.gguf \
--input tests/fixtures/ci/test_image.jpg --iters 20 --warmup 5
```
18 changes: 18 additions & 0 deletions CMakeLists.txt
@@ -123,6 +123,24 @@ else()
add_library(rfdetr STATIC ${RFDETR_SOURCES})
endif()

# Propagate the active GPU backend to src/backend.cpp so it knows which
# ggml_backend_*_init() to call. ggml itself is configured above via the
# GGML_* cache vars; these definitions just gate our wiring code.
if(RFDETR_GGML_CUDA)
target_compile_definitions(rfdetr PRIVATE RFDETR_USE_CUDA)
endif()
if(RFDETR_GGML_METAL)
target_compile_definitions(rfdetr PRIVATE RFDETR_USE_METAL)
endif()
if(RFDETR_GGML_VULKAN)
target_compile_definitions(rfdetr PRIVATE RFDETR_USE_VULKAN)
endif()
if(RFDETR_GGML_HIPBLAS)
# HIP uses the CUDA backend code path in ggml (ggml_backend_cuda_init
# works for HIP builds). Expose the same define.
target_compile_definitions(rfdetr PRIVATE RFDETR_USE_CUDA)
endif()

target_include_directories(rfdetr
PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include
PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
35 changes: 32 additions & 3 deletions README.md
@@ -267,8 +267,12 @@ rf-detr.cpp provides:
- Faster than PyTorch CPU on every variant we measured (1.05x to 1.45x across
Nano-to-Medium).
- Quantization down to about 30 MB (Q4_K) with measured accuracy tradeoffs.
- CUDA / Metal / Vulkan support via ggml backends. CPU is the only one we ship and
benchmark today; the others compile but are not yet validated.
- GPU offload via ggml backends (CUDA / Metal / Vulkan / HIP). Build with
`-DRFDETR_GGML_CUDA=ON` (or `_METAL` / `_VULKAN` / `_HIPBLAS`) and inference
runs on the GPU: weights are realized in VRAM and the compute graph runs on
the device, with the deformable-attention sampler automatically falling back
to CPU via the ggml scheduler. Validated on an NVIDIA GB10: 23.6 ms/image
(F16) vs 274 ms on the same box's CPU — an 11.6x speedup.
- A flat C ABI ([`include/rfdetr.h`](include/rfdetr.h)) for embedding via dlopen, purego,
or cgo.
- End-to-end parity validation against the upstream PyTorch reference, per-module and
@@ -298,7 +302,32 @@ they're in place. Run `scripts/apply_ggml_patches.sh` manually to inspect the pa
| `RFDETR_SHARED` | OFF | Build `librfdetr.so` (shared library for embedding) |
| `GGML_NATIVE` | ON | Compile ggml with `-march=native` |
| `GGML_LLAMAFILE` | ON | Enable ggml's tinyBLAS SGEMM (closes most of the PyTorch gap) |
| `GGML_CUDA` / `GGML_METAL` | OFF | Enable GPU backends (untested for rf-detr.cpp, may need work) |
| `RFDETR_GGML_CUDA` / `_METAL` / `_VULKAN` / `_HIPBLAS` | OFF | Offload inference to GPU. Weights go to VRAM; the deformable-attention sampler falls back to CPU via the ggml scheduler. One backend per build. |

### GPU offload

rf-detr.cpp can offload inference to a GPU via ggml's backends. Build with one
of:

```bash
cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_CUDA=ON # NVIDIA (CUDA)
cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_HIPBLAS=ON # AMD (ROCm/HIP)
cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_METAL=ON # Apple (Metal)
cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_VULKAN=ON # cross-vendor (Vulkan)
```

When a device is present, model weights are realized in VRAM and the compute
graph runs on the GPU. The one op without a GPU kernel, the
deformable-attention bilinear sampler, is run on CPU automatically by the ggml
scheduler, which inserts the device↔host copies. If no device is found at
runtime, or the GPU backend fails to initialize, inference falls back to the
CPU-only path.
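
As a mental model of that fallback, the following self-contained sketch
assigns each op to the GPU when a kernel exists and to the CPU otherwise, then
counts the device boundaries where a copy would be needed. It is a toy model,
not ggml's scheduler: the `Op` and `Device` types, the function names, and the
op names in the usage note are all hypothetical, and the real scheduler works
on graph splits with more elaborate placement rules.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Device { GPU, CPU };

// Hypothetical op descriptor: a name plus whether a GPU kernel exists.
struct Op {
    std::string name;
    bool has_gpu_kernel;
};

// Place each op on the GPU when one is present and the op has a GPU kernel;
// otherwise place it on the CPU. With no GPU, everything lands on the CPU.
std::vector<Device> place_ops(const std::vector<Op>& ops, bool gpu_present) {
    std::vector<Device> placement;
    placement.reserve(ops.size());
    for (const Op& op : ops) {
        placement.push_back(gpu_present && op.has_gpu_kernel ? Device::GPU
                                                             : Device::CPU);
    }
    return placement;
}

// Count adjacent ops placed on different devices. Each such boundary is
// where a device-to-host or host-to-device copy would be inserted.
std::size_t count_device_switches(const std::vector<Device>& placement) {
    std::size_t switches = 0;
    for (std::size_t i = 1; i < placement.size(); ++i) {
        if (placement[i] != placement[i - 1]) {
            ++switches;
        }
    }
    return switches;
}
```

For a three-op sequence such as `matmul`, `deformable_sampler`, `matmul`,
where only the sampler lacks a GPU kernel, this model yields GPU, CPU, GPU and
two device switches; with no GPU present, all three ops stay on the CPU and no
copies are needed.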

Validated on an NVIDIA GB10 (Grace Blackwell, CUDA 13, compute capability
12.1): rfdetr-base F16 runs at **23.6 ms/image** on the GPU vs **274 ms** on
the same box's 20-core ARM CPU (8 threads) — an **11.6x speedup**. Detections
match the CPU baseline within the standard tolerance (score ≤ 0.05, bbox
≤ 2 px); the 3 deformable-attention ops are confirmed running on CPU via the
scheduler. See [BENCHMARK.md](BENCHMARK.md#gpu) for details.

## Tests

120 changes: 119 additions & 1 deletion src/backend.cpp
@@ -6,8 +6,59 @@
#include "ggml-backend.h"
#include "ggml-cpu.h"

#if defined(RFDETR_USE_CUDA)
#include "ggml-cuda.h"
#endif
#if defined(RFDETR_USE_METAL)
#include "ggml-metal.h"
#endif
#if defined(RFDETR_USE_VULKAN)
#include "ggml-vulkan.h"
#endif

#include <vector>

namespace rfdetr {

/* Try to create a GPU backend if one was compiled in and a device exists.
* Returns nullptr (not an error) when no GPU backend is built or no device
* is present — the caller falls back to CPU-only. */
static ggml_backend_t try_init_gpu_backend() {
#if defined(RFDETR_USE_CUDA)
int n = ggml_backend_cuda_get_device_count();
if (n > 0) {
ggml_backend_t b = ggml_backend_cuda_init(0); // device 0
if (b) {
rfdetr_logf(RFDETR_LOG_INFO, "GPU backend: CUDA device 0 (%d available)", n);
return b;
}
rfdetr_logf(RFDETR_LOG_WARN, "ggml_backend_cuda_init(0) failed; using CPU");
}
return nullptr;
#elif defined(RFDETR_USE_METAL)
ggml_backend_t b = ggml_backend_metal_init();
if (b) {
rfdetr_logf(RFDETR_LOG_INFO, "GPU backend: Metal");
return b;
}
rfdetr_logf(RFDETR_LOG_WARN, "ggml_backend_metal_init failed; using CPU");
return nullptr;
#elif defined(RFDETR_USE_VULKAN)
int n = ggml_backend_vk_get_device_count();
if (n > 0) {
ggml_backend_t b = ggml_backend_vk_init(0);
if (b) {
rfdetr_logf(RFDETR_LOG_INFO, "GPU backend: Vulkan device 0 (%d available)", n);
return b;
}
rfdetr_logf(RFDETR_LOG_WARN, "ggml_backend_vk_init(0) failed; using CPU");
}
return nullptr;
#else
return nullptr; // CPU-only build
#endif
}

ggml_backend_t init_backend(int n_threads, rfdetr_status* out_status) {
auto set = [&](rfdetr_status s) { if (out_status) *out_status = s; };

@@ -67,10 +118,38 @@ BackendCtx init_backend_ctx(int n_threads, rfdetr_status* out_status) {
}
}

/* Try a GPU backend. If present, build a scheduler spanning [gpu, cpu]
* so ops the GPU can't run (the deformable ggml_custom_4d) fall back to
* CPU automatically. */
ctx.gpu = try_init_gpu_backend();
if (ctx.gpu) {
std::vector<ggml_backend_t> backends = { ctx.gpu, ctx.cpu };
std::vector<ggml_backend_buffer_type_t> bufts = {
ggml_backend_get_default_buffer_type(ctx.gpu),
ggml_backend_get_default_buffer_type(ctx.cpu),
};
ctx.sched = ggml_backend_sched_new(
backends.data(), bufts.data(), (int)backends.size(),
/*graph_size*/ 16384, /*parallel*/ false, /*op_offload*/ true);
if (!ctx.sched) {
rfdetr_logf(RFDETR_LOG_WARN,
"ggml_backend_sched_new failed; falling back to CPU-only");
ggml_backend_free(ctx.gpu);
ctx.gpu = nullptr;
}
}

set(RFDETR_OK);
return ctx;
}

ggml_backend_buffer_type_t backend_ctx_weight_buft(const BackendCtx& ctx) {
if (ctx.gpu) {
return ggml_backend_get_default_buffer_type(ctx.gpu);
}
return ggml_backend_get_default_buffer_type(ctx.cpu);
}

void free_backend_ctx(BackendCtx& ctx) {
/* Free the gallocrs BEFORE the backends. The gallocr owns the compute
* scratch buffers (allocated via the backend's buffer_type); freeing
@@ -85,6 +164,14 @@ void free_backend_ctx(BackendCtx& ctx) {
ggml_gallocr_free(ctx.galloc_b);
ctx.galloc_b = nullptr;
}
if (ctx.sched) {
ggml_backend_sched_free(ctx.sched);
ctx.sched = nullptr;
}
if (ctx.gpu) {
ggml_backend_free(ctx.gpu);
ctx.gpu = nullptr;
}
if (ctx.cpu) {
ggml_backend_free(ctx.cpu);
ctx.cpu = nullptr;
@@ -99,7 +186,38 @@ void free_backend_ctx(BackendCtx& ctx) {
}
}

int backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph) {
bool backend_ctx_graph_alloc(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph) {
if (ctx.sched) {
ggml_backend_sched_reset(ctx.sched);
if (!ggml_backend_sched_alloc_graph(ctx.sched, graph)) {
rfdetr_logf(RFDETR_LOG_ERROR, "backend_ctx_graph_alloc: sched alloc failed");
return false;
}
return true;
}
/* CPU path: persistent gallocr per graph. */
ggml_gallocr_t* slot = (which_graph == 0) ? &ctx.galloc_a : &ctx.galloc_b;
if (!*slot) {
*slot = ggml_gallocr_new(ggml_backend_get_default_buffer_type(ctx.cpu));
if (!*slot) {
rfdetr_logf(RFDETR_LOG_ERROR, "backend_ctx_graph_alloc: gallocr_new failed");
return false;
}
}
if (!ggml_gallocr_alloc_graph(*slot, graph)) {
rfdetr_logf(RFDETR_LOG_ERROR, "backend_ctx_graph_alloc: gallocr_alloc_graph failed");
return false;
}
return true;
}

int backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph) {
(void)which_graph;
if (ctx.sched) {
ggml_status st = ggml_backend_sched_graph_compute(ctx.sched, graph);
ggml_backend_sched_synchronize(ctx.sched);
return (int)st;
}
ggml_status st = ggml_backend_graph_compute(ctx.cpu, graph);
ggml_backend_synchronize(ctx.cpu);
return (int)st;
39 changes: 37 additions & 2 deletions src/backend.hpp
@@ -12,6 +12,10 @@ typedef struct ggml_threadpool* ggml_threadpool_t;
struct ggml_cgraph;
struct ggml_gallocr;
typedef struct ggml_gallocr* ggml_gallocr_t;
struct ggml_backend_sched;
typedef struct ggml_backend_sched* ggml_backend_sched_t;
struct ggml_backend_buffer_type;
typedef struct ggml_backend_buffer_type* ggml_backend_buffer_type_t;

namespace rfdetr {

@@ -57,6 +61,18 @@ struct BackendCtx {
* Lazily created on first use, freed in free_backend_ctx. */
ggml_gallocr_t galloc_a = nullptr;
ggml_gallocr_t galloc_b = nullptr;

/* Optional GPU backend (CUDA / Metal / Vulkan; HIP builds define
* RFDETR_USE_CUDA and use the CUDA path), created when the library was
* built with one of RFDETR_USE_CUDA / _METAL / _VULKAN AND a device is
* actually present at runtime. nullptr on CPU-only builds or when no
* device is found. */
ggml_backend_t gpu = nullptr;

/* Scheduler spanning [gpu, cpu] when gpu != nullptr. Routes ops to the
* GPU and falls back to CPU for ops the GPU backend can't run (notably
* the deformable-attention ggml_custom_4d sampler). When gpu == nullptr
* this stays null and we use the plain CPU compute path. */
ggml_backend_sched_t sched = nullptr;
};

/* Initialize the compute backend bundle. Creates a CPU backend and attaches
@@ -69,8 +85,27 @@ BackendCtx init_backend_ctx(int n_threads, rfdetr_status* out_status);
/* Release a BackendCtx. Safe to call on a zero-initialized struct. */
void free_backend_ctx(BackendCtx& ctx);

/* Run a graph on the bundle's CPU backend. */
int /* ggml_status */ backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph);
/* Buffer type that model weights should be realized on. Returns the GPU
* backend's default buffer type when a GPU is active (so weights live in
* VRAM), otherwise the CPU host buffer type. Never returns null on a
* successfully-initialized BackendCtx. */
ggml_backend_buffer_type_t backend_ctx_weight_buft(const BackendCtx& ctx);

/* Allocate buffers for a graph. Uses the sched when active (GPU), else the
* persistent per-graph gallocr (which_graph: 0 = A, 1 = B). Returns false on
* allocation failure. Call this, then ggml_backend_tensor_set() the graph
* inputs, then backend_ctx_graph_compute(). */
bool backend_ctx_graph_alloc(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph);

/* Run a graph on the bundle. Call backend_ctx_graph_alloc() on the graph
* first. When a GPU + sched are present the graph is computed via
* ggml_backend_sched (which places ops across GPU/CPU and inserts
* cross-device copies as needed). On CPU-only bundles it runs on the cpu
* backend.
*
* `which_graph` is currently unused by this function; the persistent
* allocator slot (0 = graph A, 1 = graph B) is selected in
* backend_ctx_graph_alloc(). */
int /* ggml_status */ backend_ctx_graph_compute(BackendCtx& ctx, ::ggml_cgraph* graph, int which_graph);

} // namespace rfdetr

10 changes: 6 additions & 4 deletions src/rfdetr.cpp
@@ -72,10 +72,12 @@ extern "C" rfdetr_context* rfdetr_init(const rfdetr_params* params, rfdetr_statu
return nullptr;
}

/* Weights are realized on the CPU backend's host buffer; both CPU and
* BLAS backends use ggml's host buffer type, so the BLAS backend can
* read them in-place via the sched. */
rfdetr_status rw_st = rfdetr::model_realize_weights(*m, bctx.cpu);
/* Realize weights on the GPU backend when one is active (offload to
* VRAM); otherwise on the CPU host buffer. The pos_embed bicubic
* resample inside model_realize_weights uses ggml_backend_tensor_get/set
* which work on any backend, so no other change is needed there. */
ggml_backend_t weight_backend = bctx.gpu ? bctx.gpu : bctx.cpu;
rfdetr_status rw_st = rfdetr::model_realize_weights(*m, weight_backend);
if (rw_st != RFDETR_OK) {
rfdetr::free_backend_ctx(bctx);
rfdetr::model_free(m);