cubin-function-patch is a single-header C99 API for replacing reserved CUDA
device-function bodies in a fully linked cubin/ELF with same-signature function
bodies from nvPTXCompiler --compile-only RDC ELF output.
The public header is intentionally at the repository root:
cubin_function_patch.h
Copy it into a project, include it normally, and define
CUBIN_FUNCTION_PATCH_IMPLEMENTATION in exactly one C or C++ translation unit.
The usual dynamic CUDA compilation path is:
CUDA/PTX input -> ptxas/nvPTXCompiler/nvJitLink -> final cubin
That can be too expensive when one small device function changes but a large global-kernel body stays the same. This library supports a narrower workflow:
cold path:
link a final cubin template that contains oversized reserved function bodies
create a cubin_function_patch handle from that linked cubin
hot path:
compile replacement device functions with nvPTXCompiler --compile-only
copy replacement function text into the reserved function slots
load the patched cubin with the CUDA Driver API
The patch handle scans the template cubin once, records function text ranges, and later patches caller-owned output cubin buffers. The library does not hide allocation: handle memory and output cubin memory are supplied by the caller.
This is useful when all of these are true:
- the global CUDA kernels are large and mostly stable;
- the changed code is isolated behind one or more noinline device functions;
- the replacement functions have a fixed ABI;
- compile latency matters enough that relinking the full cubin is too slow;
- inputs are controlled by your compiler pipeline, not arbitrary user uploads.
The main motivating use case is dynamic CUDA program generation where many small evaluator functions are compiled and patched into a reserved Gram/kernel template.
This library is intentionally narrow. It is not a general CUDA linker.
The replacement function must:
- have the same symbol name as the reserved function;
- have the same calling convention and signature;
- target the same CUDA ELF machine architecture;
- fit inside the reserved linked function body;
- be self-contained, except for CUDA constant-bank relocations that can be matched to reserved side sections;
- not require new sections or metadata that the linked template did not reserve;
- not rely on branch targets in stale tail bytes when the replacement is smaller than the reserved body.
The linked template must:
- contain each patchable function as an STT_FUNC symbol;
- reserve worst-case function-body size;
- reserve conservative .nv.info.* metadata for the worst-case body. The patcher preserves this metadata instead of copying replacement .nv.info.*;
- reserve compatible .nv.constant2.* side sections and cap/merc mirror text sections when the target CUDA version emits them;
- be built for the same architecture and ABI assumptions as replacement RDC objects;
- keep metadata conservative enough for all replacement bodies.
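To make the STT_FUNC requirement concrete, the sketch below looks up a function symbol by name in an ELF64 symbol table that has already been located in memory. It is an illustration only, not the library's implementation: the Sym64 struct and the helper find_func_symbol are hypothetical names, and the layout follows the standard ELF64 symbol entry (24 bytes, with the symbol type in the low four bits of st_info).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard ELF64 symbol entry layout (24 bytes, no padding on LP64). */
typedef struct {
    uint32_t st_name;   /* offset of the name in the string table */
    uint8_t  st_info;   /* binding in the high nibble, type in the low nibble */
    uint8_t  st_other;
    uint16_t st_shndx;  /* index of the section that holds the symbol */
    uint64_t st_value;  /* symbol value (address or section offset) */
    uint64_t st_size;   /* size of the function body in bytes */
} Sym64;

#define SKETCH_STT_FUNC 2u /* ELF symbol type for functions */

/* Hypothetical helper: returns the index of the STT_FUNC symbol called
 * `name`, or -1 if there is none. Names are bounds-checked against strtab. */
static long find_func_symbol(const Sym64* syms, size_t count,
                             const char* strtab, size_t strtab_bytes,
                             const char* name)
{
    assert(syms != NULL && strtab != NULL && name != NULL);
    size_t name_len = strlen(name);
    for (size_t i = 0; i < count; ++i) {
        if ((syms[i].st_info & 0x0fu) != SKETCH_STT_FUNC)
            continue; /* objects, sections, and untyped symbols are skipped */
        size_t off = syms[i].st_name;
        if (off >= strtab_bytes || name_len >= strtab_bytes - off)
            continue; /* name plus terminator would run past the string table */
        if (memcmp(strtab + off, name, name_len) == 0 &&
            strtab[off + name_len] == '\0')
            return (long)i;
    }
    return -1;
}
```

A symbol found this way carries the st_size that bounds the reserved body; a same-named symbol of another type is deliberately ignored.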
If a replacement is smaller than the reserved function body, the default policy
leaves tail bytes unchanged. Use
CUBIN_FUNCTION_PATCH_TAIL_REQUIRE_EXACT_SIZE if exact body size is required.
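The size rule and the two tail behaviours can be pictured as a bounded copy into a reserved slot. The sketch below is not the library's code: copy_into_slot, SlotTailPolicy, and its enumerators are hypothetical stand-ins for whatever cubin_function_patch.h actually defines, and real patching also has to deal with relocations and side sections.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tail policies, named for this sketch only. */
typedef enum {
    SLOT_TAIL_KEEP,               /* leave bytes after the replacement untouched */
    SLOT_TAIL_REQUIRE_EXACT_SIZE  /* reject replacements that do not fill the slot */
} SlotTailPolicy;

/* Copies `repl` into image[slot_off .. slot_off + slot_size).
 * Returns 0 on success, -1 if the slot or the replacement does not fit. */
static int copy_into_slot(uint8_t* image, size_t image_bytes,
                          size_t slot_off, size_t slot_size,
                          const uint8_t* repl, size_t repl_bytes,
                          SlotTailPolicy policy)
{
    assert(image != NULL && repl != NULL);
    if (slot_off > image_bytes || slot_size > image_bytes - slot_off)
        return -1; /* slot lies outside the image */
    if (repl_bytes > slot_size)
        return -1; /* function bodies cannot grow */
    if (policy == SLOT_TAIL_REQUIRE_EXACT_SIZE && repl_bytes != slot_size)
        return -1; /* caller asked for an exact fit */
    memcpy(image + slot_off, repl, repl_bytes);
    return 0;
}
```

With the keep policy, the bytes after the replacement still hold the template's original instructions, which is why a smaller replacement must not branch into them.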
This library does not try to:
- patch arbitrary untrusted cubins;
- validate every CUDA ABI or metadata invariant NVIDIA's linker would validate;
- grow function bodies;
- add new sections;
- resolve general relocations;
- patch host code;
- support non-NVIDIA GPU formats;
- abstract over AMD, Apple, CPU, or generic accelerator backends.
It is a low-level CUDA tool for controlled compiler pipelines.
The implementation parses little-endian 64-bit CUDA ELF/cubin data and is intended for Linux CUDA toolchains on:
- x86_64
- ARM SBSA / aarch64
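As a rough illustration of that input assumption, the check below inspects the ELF identification bytes for the 64-bit class and little-endian encoding, and reads e_machine as a little-endian 16-bit value. The helper name and the SKETCH_ constants are hypothetical, and the library's own validation may differ; 190 is the EM_CUDA machine number listed in the common elf.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EI_CLASS         4u   /* index of the class byte in e_ident */
#define SKETCH_ELFCLASS64       2u
#define SKETCH_EI_DATA          5u   /* index of the data-encoding byte */
#define SKETCH_ELFDATA2LSB      1u   /* little-endian */
#define SKETCH_E_MACHINE_OFFSET 18u  /* byte offset of e_machine in the header */
#define SKETCH_EM_CUDA          190u
#define SKETCH_ELF64_EHDR_BYTES 64u

/* Returns 1 if `data` starts with a little-endian ELF64 header whose
 * machine field is EM_CUDA, 0 otherwise. */
static int looks_like_cuda_elf64_le(const uint8_t* data, size_t bytes)
{
    assert(data != NULL);
    if (bytes < SKETCH_ELF64_EHDR_BYTES)
        return 0; /* too short to hold an ELF64 header */
    if (data[0] != 0x7f || data[1] != 'E' || data[2] != 'L' || data[3] != 'F')
        return 0; /* missing ELF magic */
    if (data[SKETCH_EI_CLASS] != SKETCH_ELFCLASS64)
        return 0;
    if (data[SKETCH_EI_DATA] != SKETCH_ELFDATA2LSB)
        return 0;
    /* Assemble e_machine byte by byte so the check is host-endian neutral. */
    unsigned machine = (unsigned)data[SKETCH_E_MACHINE_OFFSET] |
                       ((unsigned)data[SKETCH_E_MACHINE_OFFSET + 1u] << 8);
    return machine == SKETCH_EM_CUDA;
}
```

A check like this only rejects obviously wrong inputs; it says nothing about whether the cubin satisfies the template requirements.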
The tests require an NVIDIA driver and CUDA development tools.
The test suite is not a parser-only check. It compiles CUDA fixture kernels, produces RDC and linked cubins, patches device-function bodies, loads the patched cubins through the CUDA Driver API, launches the kernels, and compares device outputs against normally linked references.
The current code has been tested on:
| System | CPU architecture | GPU | Target |
|---|---|---|---|
| Local workstation | x86_64 | NVIDIA GeForce RTX 5090 | sm_120 |
| Ada workstation | x86_64 | NVIDIA GeForce RTX 4090 | sm_89 |
| Lambda Labs GH200 | ARM SBSA / aarch64 | NVIDIA GH200 480GB | sm_90 |
#define CUBIN_FUNCTION_PATCH_IMPLEMENTATION
#include "cubin_function_patch.h"
Typical use:
const char* symbols[] = {"patch_site"};
size_t handle_bytes = 0;
void* handle_memory = NULL;
CubinFunctionPatchHandle* handle = NULL;
size_t output_bytes = 0;
void* output_cubin = NULL;
size_t written = 0;
/* Ask how much caller-owned memory the handle needs, then allocate it. */
cubin_function_patch_handle_size(symbols, 1, &handle_bytes);
handle_memory = malloc(handle_bytes);
/* Scan the linked template once. The handle refers to template_cubin, so keep it alive. */
cubin_function_patch_create(
    template_cubin,
    template_cubin_bytes,
    symbols,
    1,
    handle_memory,
    handle_bytes,
    &handle
);
/* Size and allocate the caller-owned output buffer. */
cubin_function_patch_output_size(handle, &output_bytes);
output_cubin = malloc(output_bytes);
/* Patch the registered sites from the replacement RDC ELF into output_cubin. */
cubin_function_patch_apply_all(
    handle,
    replacement_rdc_elf,
    replacement_rdc_elf_bytes,
    output_cubin,
    output_bytes,
    &written,
    NULL,
    0
);
/* output_cubin now contains a linked cubin-sized image with patched text. */
Production code should check every CubinFunctionPatchResult and free/destroy
state according to the ownership rules below.
The API has three phases.
- Measure and create a handle:
cubin_function_patch_handle_size(...);
cubin_function_patch_create(...);
- Query template/output information:
cubin_function_patch_num_sites(...);
cubin_function_patch_site_symbol(...);
cubin_function_patch_site_reserved_size(...);
cubin_function_patch_output_size(...);
- Patch one or more replacement functions:
cubin_function_patch_begin(...);
cubin_function_patch_apply_one_in_place(...);
cubin_function_patch_apply_all_in_place(...);
cubin_function_patch_apply_one(...);
cubin_function_patch_apply_all(...);
The _ex variants accept an explicit CubinFunctionPatchTailPolicy.
- cubin_function_patch_create does not copy the linked template cubin. Keep the template bytes alive and immutable while the handle exists.
- Handle memory is caller-owned.
- Output cubin memory is caller-owned.
- The library performs no internal allocation.
- cubin_function_patch_destroy currently performs no deallocation, but should still be called for API symmetry.
- The handle is immutable after creation. Multiple threads may patch through the same handle when each call writes to a distinct output buffer.
Build:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel
ctest --test-dir build --output-on-failure
Override the CUDA architecture used by build-time tests:
cmake -S . -B build -DCUBIN_FUNCTION_PATCH_TEST_SM=90
The tests use two paths:
- test/runtime_compile: embeds CUDA source with thirdparty/incbin, compiles it at test runtime with NVRTC, nvPTXCompiler, and nvJitLink, patches cubins, launches them through the CUDA Driver API, and compares against normally linked references.
- test/cmake_rdc: uses CMake custom commands to compile CUDA files to RDC cubins and linked reference/template cubins before the test runs. The binary embeds those artifacts with INCBIN and then uses only the CUDA Driver API plus cubin_function_patch.h.
Shared CUDA fixtures live in:
test/common/cuda
The bench/ directory contains a standalone multicore benchmark for the
compile/link paths this library is meant to replace:
cmake -S . -B build-bench \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTING=ON \
-DCUBIN_FUNCTION_PATCH_BUILD_BENCHMARKS=ON
cmake --build build-bench --parallel
./build-bench/bench/cubin_function_patch_compile_scaling_bench \
--modules 120 \
--workers 1,2,4,8,12 \
--repeats 1 \
--warmup 0
The benchmark checks correctness first by compiling representative cubins,
loading them with the CUDA Driver API, launching the kernel, and comparing
against a CPU reference. It also performs a final sanity pass across a few
compiled module indices after the timed rows. Use --skip-correctness and
--skip-final-sanity to omit those checks during pure timing runs.
The timed rows are:
| Mode | Hot-path work |
|---|---|
| full_compile | NVRTC compiles the full CUDA source and nvPTXCompiler compiles the full cubin. |
| full_ptx_inject | A precompiled PTX template containing the global kernel and noinline device function is patched at the PTX text level, then nvPTXCompiler compiles the whole injected PTX directly to a final cubin. |
| partial_nvjitlink | NVRTC/nvPTXCompiler compile a replacement noinline device function, then nvJitLink links it with a prebuilt caller RDC. |
| partial_patch | NVRTC/nvPTXCompiler compile a replacement noinline device function, then cubin_function_patch patches it into a prelinked template cubin. |
| nvjitlink_only | Replacement RDCs are precompiled before timing; the hot path only runs nvJitLink. |
| patch_only | Replacement RDCs are precompiled before timing; the hot path only copies and patches cubin bytes. |
nvJitLink is created with -O0 -no-cache so the link-only rows do not report
cache hits as linker throughput.
On the local RTX 5090 workstation (x86_64, CUDA 13.3, sm_120), this command:
./build-bench/bench/cubin_function_patch_compile_scaling_bench \
--modules 120 \
--workers 1,2,4,8,12 \
--repeats 1 \
--warmup 0 \
--skip-correctness \
--skip-final-sanity
produced:
mode workers best_ms modules/s nvptx_sum_ms inject_ms nvjitlink_ms patch_ms
full_compile 1 2686.246 44.672 1978.775 0.000 0.000 0.000
full_compile 12 797.842 150.406 2124.971 0.000 0.000 0.000
full_ptx_inject 1 1933.417 62.066 1932.856 0.279 0.000 0.000
full_ptx_inject 12 224.036 535.629 2603.814 0.588 0.000 0.000
partial_nvjitlink 1 1024.719 117.105 278.462 0.000 72.846 0.000
partial_nvjitlink 12 732.781 163.760 296.122 0.000 76.823 0.000
partial_patch 1 939.842 127.681 273.373 0.000 0.000 0.370
partial_patch 12 746.391 160.774 296.851 0.000 0.000 0.496
nvjitlink_only 1 60.108 1996.415 0.000 0.000 60.017 0.000
nvjitlink_only 12 37.854 3170.104 0.000 0.000 332.153 0.000
patch_only 1 0.155 772106.355 0.000 0.000 0.000 0.125
patch_only 12 0.152 787489.420 0.000 0.000 0.000 0.136
On the Ada workstation (x86_64, CUDA 13.1, RTX 4090, sm_89), the same
120-module command produced:
mode workers best_ms modules/s nvptx_sum_ms inject_ms nvjitlink_ms patch_ms
full_compile 1 3293.147 36.439 1708.950 0.000 0.000 0.000
full_compile 12 793.777 151.176 1805.477 0.000 0.000 0.000
full_ptx_inject 1 1588.421 75.547 1587.574 0.312 0.000 0.000
full_ptx_inject 12 207.552 578.170 2099.767 0.832 0.000 0.000
partial_nvjitlink 1 985.561 121.758 273.287 0.000 12.228 0.000
partial_nvjitlink 12 774.067 155.025 354.581 0.000 12.696 0.000
partial_patch 1 969.649 123.756 274.591 0.000 0.000 0.309
partial_patch 12 767.767 156.297 359.584 0.000 0.000 0.432
nvjitlink_only 1 9.123 13152.882 0.000 0.000 9.035 0.000
nvjitlink_only 12 12.135 9888.523 0.000 0.000 124.076 0.000
patch_only 1 0.174 688784.298 0.000 0.000 0.000 0.139
patch_only 12 0.240 499679.372 0.000 0.000 0.000 0.190
The *_only rows are the cleanest final-link comparison: nvJitLink improves
only modestly or regresses under many worker threads, while patching stays close
to a memory copy of the prelinked template.
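The modules/s column follows from the module count and best_ms, and the same arithmetic gives the ratio between two rows. The helper names below are hypothetical and exist only to spell out that calculation.

```c
#include <assert.h>

/* Throughput in modules per second for `modules` processed in `best_ms`. */
static double modules_per_second(unsigned modules, double best_ms)
{
    assert(best_ms > 0.0);
    return (double)modules * 1000.0 / best_ms;
}

/* How many times faster the `fast_ms` row is than the `slow_ms` row. */
static double speedup(double slow_ms, double fast_ms)
{
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}
```

For example, modules_per_second(120, 2686.246) is about 44.67, which matches the single-worker full_compile row on the RTX 5090, and speedup(60.108, 0.155) is about 388 for nvjitlink_only versus patch_only at one worker. Recomputing from the rounded best_ms values can differ from the printed modules/s in the last digits.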
The repository is primarily single-header, but CMake also installs the header and license:
cmake --install build --prefix /tmp/cubin-function-patch-install
Installed files:
include/cubin_function_patch.h
share/cubin-function-patch/LICENSE
The tests vendor incbin under thirdparty/incbin. It keeps its own license in
thirdparty/incbin/UNLICENSE.
cubin-function-patch is licensed under the MIT license.
Copyright (c) 2026 Charlie Durham
See LICENSE.