High-performance GPU kernels in pure Rust. Lowered to PTX IR at compile time, validated against the current GPU and JIT-compiled by the driver at launch. No CUDA C++, no Python, no toolkit.
KAIO (from the Greek καιω — to burn, to ignite) is for Rust engineers who need custom GPU kernels today — fused attention variants, quantization ops, novel activations — and would otherwise be writing CUDA C++ because their framework doesn't support them.
- 92.5% of cuBLAS sgemm at 4096² on RTX 4090 (tensor-core matmul, async path, fp16 inputs with fp32 accumulation). Full benchmarks →
- Windows and Linux native. No WSL2, no Triton's Linux-only runtime, no Python. `cargo build` works everywhere.
- No CUDA toolkit required — just the NVIDIA display driver. Build in CI on a standard GitHub runner; host tests pass without a GPU.
- Pure-Rust kernel authorship. The `#[gpu_kernel]` proc macro lowers Rust to a PTX IR module at compile time; at launch the module is validated against the current GPU's SM target, emitted to PTX text, and handed to the CUDA driver for JIT compilation. Type-safe kernel signatures catch dtype mismatches at compile time, not as silent GPU corruption at runtime.
Clone the repo, run one command, and see seven real ML kernels build and execute on your GPU:

```shell
git clone https://github.com/dmriding/kaio.git
cd kaio
cargo xtask showcase
```

You'll see `fused_silu_gate`, `gelu_comparison`, `rms_norm`, `layer_norm`, `softmax`, `int8_dequant`, and `int8_matmul` compile, launch, verify correctness against a CPU reference, and report median latency. The seven examples span activations, normalizations, reductions, and the quantize → matmul pipeline: the canonical transformer-primitive arc plus the v0.3.0 W8A8 headline op.
Want the performance pitch instead? `cargo xtask bench` runs the tensor-core matmul benchmark against cuBLAS sgemm across five sizes. Or `cargo xtask all` for both. `cargo xtask --help` shows the full tooling surface.
Requires an NVIDIA GPU (SM 7.0+, Volta or newer) with an installed display driver — NVIDIA 525 or newer. Any standard Game Ready or Studio driver works; no Tesla/TCC-specific driver is needed. No CUDA toolkit install required.
```shell
cargo add kaio
```

```rust
use kaio::prelude::*;

#[gpu_kernel(block_size = 256)]
fn saxpy(x: &[f32], y: &mut [f32], alpha: f32, n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        y[idx] = alpha * x[idx] + y[idx];
    }
}

fn main() -> Result<()> {
    let device = KaioDevice::new(0)?;
    let n = 1024u32;
    let x = device.alloc_from(&vec![1.0f32; n as usize])?;
    let mut y = device.alloc_from(&vec![2.0f32; n as usize])?;
    saxpy::launch(&device, &x, &mut y, 2.5f32, n)?;
    let result = y.to_host(&device)?;
    println!("result: {:?}", &result[..8]);
    Ok(())
}
```

```shell
$ cargo run
result: [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5]
```
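Those 4.5s follow directly from the SAXPY math. Here is a stdlib-only CPU sketch of the same computation (purely illustrative, not part of KAIO's API) — with x = 1.0, y = 2.0, and alpha = 2.5, every element becomes 2.5 × 1.0 + 2.0 = 4.5:

```rust
// CPU reference for the SAXPY launch above: y[i] = alpha * x[i] + y[i].
fn saxpy_cpu(x: &[f32], y: &mut [f32], alpha: f32) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = alpha * *xi + *yi;
    }
}

fn main() {
    let x = vec![1.0f32; 1024];
    let mut y = vec![2.0f32; 1024];
    saxpy_cpu(&x, &mut y, 2.5);
    println!("result: {:?}", &y[..8]); // matches the GPU output above
}
```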
SAXPY is for learning the DSL. The actual value looks like this:
```rust
use kaio::prelude::*;

// Gated SiLU — the feedforward activation in every LLaMA / Mistral /
// Qwen block. llama.cpp, vLLM, and TensorRT-LLM all ship hand-written
// CUDA for it. With KAIO it's 7 lines of Rust, lowered to a PTX IR
// module at compile time and JIT-loaded at launch.
#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: &[f32], gate: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let xi = x[idx];
        let sig = 1.0f32 / (1.0f32 + exp(-xi));
        out[idx] = xi * sig * gate[idx];
    }
}
```

Run it from the repo root with no directory changes:
```shell
$ cargo xtask showcase silu

=== fused_silu_gate ===
Input size: 1048576 elements
Correctness: PASS (max_abs_err = 1.49e-8)
Median latency: 188.8 μs (of 100 timed runs, 5 warm-ups skipped)
```
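The correctness check compares against a CPU reference computing the same math. A stdlib-only sketch of what that reference looks like (illustrative only — not the repo's actual harness):

```rust
// CPU reference for fused_silu_gate: out[i] = x[i] * sigmoid(x[i]) * gate[i].
fn silu_gate_cpu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter()
        .zip(gate)
        .map(|(&xi, &gi)| {
            let sig = 1.0 / (1.0 + (-xi).exp());
            xi * sig * gi
        })
        .collect()
}

fn main() {
    let out = silu_gate_cpu(&[0.0, 1.0], &[1.0, 2.0]);
    // silu(0) = 0; silu(1) = 1 / (1 + e^-1) ≈ 0.7311, times gate 2.0 ≈ 1.4621
    println!("{:?}", out);
}
```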
Or run all seven showcases in sequence with `cargo xtask showcase`: fused SiLU-gate, exact vs fast GELU, single-block RMSNorm, single-block LayerNorm, single-block softmax, INT8 dequantization, and INT8 matmul. Each is a complete standalone project with correctness + timing (with its own Cargo.toml so you can copy the directory out of the repo as a reference for your own kernel); the GELU comparison's README explains why kernel fusion matters more than arithmetic optimization for ML workloads (the bandwidth-bound teaching moment).
Reach for KAIO when:
- Your framework can't support a custom op (novel attention, fused activation, quantization) and you don't want to drop into CUDA C++ for one kernel.
- You need GPU inference on Windows without WSL2 or Triton's Linux-only runtime.
- Your CI runs on standard GitHub runners without GPU or CUDA toolkit access. KAIO's host tests pass without a GPU; only the `#[ignore]`-gated integration tests need one. Flip the matrix on later.
- You need deterministic VRAM usage, explicit buffer reuse, or zero-copy transfers that high-level frameworks abstract away.
- You're prototyping GPU code in a language you already know (Rust) without learning CUDA C++.
KAIO is not a replacement for Candle or Burn. It is the layer you use when you need more control than they provide.
| Feature | KAIO | cudarc | Candle / Burn | Raw CUDA |
|---|---|---|---|---|
| Write kernels in Rust | Yes | No (load PTX) | No | No |
| Automatic PTX generation | Yes | No | N/A | No |
| Windows support | Yes | Yes | Partial | Yes |
| No CUDA toolkit needed | Yes | Yes | Varies | No |
| Type-safe kernel signatures | Yes | No | N/A | No |
| ML framework integration | Standalone | Standalone | Built-in | Manual |
Measured on RTX 4090 (sm_89), median of 20 timed iterations after 5 warmups:
| Size | TC sync TFLOPS | TC async TFLOPS | cuBLAS sgemm TFLOPS | sync vs cuBLAS | async vs cuBLAS |
|---|---|---|---|---|---|
| 256³ | 0.05 | 0.05 | 1.77 | 2.9% | 2.6% |
| 512³ | 0.37 | 0.34 | 11.09 | 3.3% | 3.1% |
| 1024³ | 2.87 | 2.62 | 37.35 | 7.7% | 7.0% |
| 2048³ | 17.34 | 16.74 | 52.91 | 32.8% | 31.6% |
| 4096³ | 40.93 | 45.96 | 49.72 | 82.3% | 92.5% |
Tensor-core matmul at 4096² reaches 92.5% of cuBLAS sgemm on the async path. Small sizes lag because a 64×64 multi-warp block tile needs ~16 blocks per SM to fill an RTX 4090's 128 SMs; at 256³ there are only 16 blocks in the entire grid, so kernel-launch overhead dominates. For small shapes, prefer the scalar `matmul()` or stay on cuBLAS.
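The grid-size arithmetic behind that occupancy argument is easy to verify. A stdlib-only sketch (the 64×64 tile and the 128-SM figure come from the paragraph above):

```rust
// Number of blocks launched for an n x n matmul with a 64x64 output tile
// per block: ceil(n/64) in each dimension, squared.
fn grid_blocks(n: u32, tile: u32) -> u32 {
    let per_dim = (n + tile - 1) / tile; // ceil(n / tile)
    per_dim * per_dim
}

fn main() {
    let sms = 128; // RTX 4090 SM count
    for n in [256u32, 4096] {
        let blocks = grid_blocks(n, 64);
        println!("{n}^2: {blocks} blocks for {sms} SMs");
    }
    // 256^2  -> 16 blocks total: most SMs sit idle, launch overhead dominates.
    // 4096^2 -> 4096 blocks: plenty of work to keep every SM busy.
}
```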
Apples-to-apples disclaimer: KAIO uses fp16 inputs with fp32 accumulation; cuBLAS sgemm is f32 in / f32 out. The comparison is a project-local performance baseline, not a precision-identity claim. See docs/performance.md for the full analysis, the bank-conflict rationale for why async outpaces sync at large sizes, and the path-to-higher-numbers roadmap.
The Rust ML ecosystem can't keep up with Python. Every time a new model architecture drops with a custom operation — a novel attention variant, a fused activation, a custom quantization kernel — frameworks like candle and burn can't support it until someone writes the GPU function. Today, that means writing CUDA C++, fighting FFI bindings, and giving up on Windows.
Meanwhile, Python developers write a Triton kernel in an afternoon and move on. Triton doesn't support Windows, requires Python, and JIT-compiles at runtime — but it works, and Rust has no equivalent.
KAIO is that equivalent.
Copy these skeletons, fill in your logic.
Bounds-checked element-wise:
```rust
#[gpu_kernel(block_size = 256)]
fn my_kernel(input: &[f32], output: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        output[idx] = input[idx] * 2.0; // your logic here
    }
}
```

Shared memory tiling:
```rust
#[gpu_kernel(block_size = 256)]
fn tiled(data: &[f32], out: &mut [f32], n: u32) {
    let tid = thread_idx_x();
    let idx = tid + block_idx_x() * block_dim_x();
    let tile = shared_mem![f32; 256];
    if idx < n { tile[tid] = data[idx]; }
    bar_sync();
    if idx < n { out[idx] = tile[tid]; } // read from shared
}
```

Block reduction:
```rust
#[gpu_kernel(block_size = 256)]
fn reduce(input: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    let val = if idx < n { input[idx] } else { 0.0f32 };
    let sum = block_reduce_sum(val);
    if thread_idx_x() == 0u32 { out[block_idx_x()] = sum; }
}
```

| Feature | Notes |
|---|---|
| `#[gpu_kernel]` proc macro | Rust → PTX IR at compile time; PTX emitted + JIT-loaded at launch. Type-safe launch wrapper auto-generated. |
| Shared memory + reductions + warp shuffles | `shared_mem![]`, `bar_sync()`, `block_reduce_sum/max`, `shfl_sync_*`. |
| 2D blocks, FMA, math builtins | `block_size = (16,16)`, `fma`, `sqrt`, `exp`, `log`, `tanh`, `abs`, `min`, `max`. |
| Scalar tiled matmul | `kaio_ops::matmul` / `matmul_auto` — 31% of cuBLAS sgemm. Any SM. |
| Fused attention + FlashAttention | `kaio_ops::attention`, `attention_flash` (O(d_k) memory). Any SM. |
| Tensor-core matmul | `kaio_ops::matmul_tc` / `matmul_tc_async` / `matmul_auto_tc` — f16 → f32, SM 8.0+, 82.3% sync / 92.5% async of cuBLAS sgemm at 4096². |
| INT8 dequantize-matmul (W8A8) | `kaio_ops::matmul_int8` — symmetric i8 × i8 → f32 with single-scalar scale, SM 8.0+, K % 32 == 0. 80–94 TOPS at 4096³ on RTX 4090 sm_89 (median ~89 across 6 runs). v0.3.0 reference quant op. |
| Auto-tuner + cache | `tune_matmul`, `matmul_auto`, `matmul_auto_tc` with JSON cache. |
| PTX inspection | `KAIO_DUMP_PTX=1`, `KAIO_PTX_STATS=1`, `KAIO_PTX_ANNOTATE=1`. |
See docs.rs/kaio for the full API surface and the internal IR types.
KAIO is pre-1.0 software. Current engineering constraints:
- NVIDIA only. SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). No AMD, no Intel, no Apple Silicon.
- Matmul performance is size-dependent. Tensor-core matmul reaches 82.3% sync / 92.5% async of cuBLAS sgemm at 4096² on Ampere, but lags heavily at ≤1024² (2-8%) because a 64×64 multi-warp block tile doesn't fill the SM array until the grid is large. Scalar matmul tops out at 31% of cuBLAS. For small shapes prefer cuBLAS or the scalar path. Details →
- Inference only. No autograd / backward pass. Training integration (`kaio-candle` bridge) is planned for Phase 7.
- DSL is a Rust subset. No closures, traits, generics, method calls, or string operations inside `#[gpu_kernel]` function bodies. Arithmetic, comparisons, bitwise operators (`&`, `|`, `^`, `<<`, `>>`, `!`), short-circuit `&&`/`||`, and compound assignment (including bitwise `&=`/`|=`/`<<=`, etc.) are all supported as of v0.2.1.
- FlashAttention d_k limit. `attention_flash()` requires d_k ≤ 256 (one thread per output dimension).
- Single-device. No multi-GPU support.
- First-call PTX module load. The `#[gpu_kernel]` macro generates PTX at Rust compile time, but CUDA-driver module loading still happens at first launch. The first call pays the module-load latency; subsequent launches are dispatch-only. (The in-process PTX cache was removed in v0.2.1 to simplify the load path and run kernels through `PtxModule::validate()` on every launch.)
- API will change before 1.0. Breaking changes are documented in the CHANGELOG per release.
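To make the supported operator subset concrete, here is a host-side Rust function (an illustration, not a kernel) written using only constructs the list above names as supported — a loop, comparisons, bitwise operators, and compound assignment:

```rust
// Counts set bits in (v & mask) using only DSL-friendly constructs:
// a while loop, comparison, bitwise &, and compound assignments &=, +=, >>=.
fn popcount_masked(mut v: u32, mask: u32) -> u32 {
    v &= mask;                 // compound bitwise assignment
    let mut count = 0u32;
    while v != 0 {
        count += v & 1;        // test the low bit
        v >>= 1;               // compound shift assignment
    }
    count
}

fn main() {
    println!("{}", popcount_masked(0b1011, 0b0011)); // 0b1011 & 0b0011 = 0b0011 -> 2
}
```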
Four layers, bottom to top:
```
+-------------------------------------------+
| Layer 4: Block-Level Operations           |  tiled matmul, fused attention, TC matmul
+-------------------------------------------+
| Layer 3: #[gpu_kernel] Proc Macro         |  Rust syntax → PTX automatically
+-------------------------------------------+
| Layer 2: Runtime (kaio-runtime)           |  device memory, kernel launch, SM validation
+-------------------------------------------+
| Layer 1: PTX Codegen (kaio-core)          |  IR types, instruction emitters, fragments
+-------------------------------------------+
```
| Crate | Description |
|---|---|
| `kaio` | Umbrella crate — re-exports everything via `prelude` |
| `kaio-macros` | `#[gpu_kernel]` proc macro |
| `kaio-core` | PTX IR, instruction emitters, fragment containers, zero external deps |
| `kaio-runtime` | CUDA driver wrapper via `cudarc` |
| `kaio-ops` | Pre-built GPU operations (matmul, attention, TC matmul, auto-tuner) |
- GPUs: NVIDIA SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). Tensor-core kernels (`matmul_tc*`) require SM 8.0+ (Ampere or newer).
- Platforms: Windows 10/11 and Linux (Ubuntu 22.04+).
- Driver: NVIDIA 525+ (CUDA 12.0+ compatible). Standard Game Ready or Studio drivers work on consumer cards; Tesla/TCC drivers are not required for KAIO's dynamic-loading path.
- Tested on: RTX 4090 (sm_89) under Windows.
```shell
cargo build --workspace
cargo test --workspace                    # host tests (no GPU needed)
cargo test --workspace -- --ignored       # GPU tests (requires NVIDIA GPU)
KAIO_DUMP_PTX=1 cargo test                # inspect generated PTX
```

Requires Rust 1.94+ (pinned via rust-toolchain.toml). The version floor reflects edition-2024 features and const-evaluation patterns used in kernel tile-layout computation. No CUDA toolkit is needed to build — KAIO resolves the NVIDIA driver at runtime via dynamic loading (`nvcuda.dll` on Windows, `libcuda.so` on Linux).
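The dynamic-loading choice boils down to picking the platform's driver library at runtime. A trivial stdlib-only sketch of that selection (the actual resolution in KAIO goes through the `cudarc` driver wrapper):

```rust
// Platform-dependent NVIDIA driver library name, as described above.
// This only picks the name; actually loading it is the runtime's job.
fn driver_library_name() -> &'static str {
    if cfg!(target_os = "windows") {
        "nvcuda.dll"
    } else {
        "libcuda.so"
    }
}

fn main() {
    println!("would load: {}", driver_library_name());
}
```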
When something goes wrong — launch errors, silent NaN, unexpectedly slow performance — docs/debugging.md is the single entry point. It covers the env vars (`KAIO_DUMP_PTX`, `KAIO_PTX_STATS`, `KAIO_PTX_ANNOTATE`, `KAIO_SM_TARGET`, `KAIO_TUNE_CACHE`, `KAIO_SUPPRESS_DEBUG_WARNING`), the async-launch error model, compute-sanitizer usage for silent-corruption diagnosis, tolerance guidance for numerical verification, and a troubleshooting flowchart for "did it compile → launch → produce the right output?"
93.65% line coverage across the 20,156-line workspace (1,280 lines
uncovered, mostly host-side parser error paths, the xtask repo-tooling
binary, and the unreachable-by-design host stubs for GPU builtins in
kaio/src/gpu_builtins.rs). Shipped kernel crates are well above the
workspace average — kaio-ops/src/matmul_int8_kernel.rs at 97.77%,
matmul_tc_kernel.rs at 97.74%, matmul_tc_async_kernel.rs at 99.40%,
attention_tc_kernel.rs at 98.82%. Measured on RTX 4090 sm_89 via
cargo llvm-cov with the host test suite and the full GPU-only
--ignored test suite merged:
```shell
cargo install cargo-llvm-cov                          # one-time
cargo llvm-cov clean --workspace
cargo llvm-cov --workspace --no-report
cargo llvm-cov --workspace --no-report -- --ignored
cargo llvm-cov report --summary-only
```

The number is static (updated per release, not per CI run) because
the GPU-ignored tests require actual NVIDIA hardware and can't run on
standard GitHub Actions runners. See
docs/testing-strategy.md for the full
testing model (host tests, GPU integration tests, ptxas_verify
structural checks, and the matmul_tc_bench performance harness).
The #[gpu_kernel] macro:
- Parses your Rust function body into a kernel IR.
- Lowers expressions to PTX instruction sequences.
- Generates a `build_ptx()` function that constructs the IR at runtime (first call only, cached via `OnceLock`).
- Emits PTX text through `PtxWriter`.
- Wraps everything in a typed `launch()` function that handles device memory and kernel dispatch.
The generated PTX is validated against nvcc output and passes
ptxas --verify on every tested kernel.
You can also build kernels directly via the Layer 1 IR API for maximum control:
```rust
use kaio_core::emit::{Emit, PtxWriter};
use kaio_core::ir::*;
use kaio_core::types::PtxType;

let mut alloc = RegisterAllocator::new();
let mut kernel = PtxKernel::new("my_kernel");
kernel.add_param(PtxParam::pointer("data", PtxType::F32));
// ... build instructions ...
let mut module = PtxModule::new("sm_89");
module.add_kernel(kernel);
let mut w = PtxWriter::new();
module.emit(&mut w).unwrap();
println!("{}", w.finish()); // valid PTX assembly
```

See kaio-runtime/tests/vector_add_e2e.rs for a complete end-to-end example.
- Phase 1 — PTX codegen + runtime (IR → PTX → GPU execution)
- Phase 2 — `#[gpu_kernel]` proc macro (arithmetic, control flow, memory access, math builtins)
- Phase 3 — Loops, shared memory, reductions, softmax
- Phase 4 — Tiled matmul (31% of cuBLAS), `kaio-ops` crate, 2D blocks, FMA, PTX inspection tools
- Phase 5 — Fused attention + FlashAttention + auto-tuning, crates.io v0.1.0
- Phase 6 — Tensor cores (`mma.sync` fp16/bf16), async copies (`cp.async`), bank-conflict padding. 82.3% sync / 92.5% async of cuBLAS sgemm at 4096² on Ampere. Three standalone showcase examples. crates.io v0.2.0.
- Phase 7 — Quantized kernels (INT8/INT4), training integration (`kaio-candle` bridge), `ldmatrix.sync` for further TC headroom, bf16 TC matmul variant.
- Phase 8 — PyO3 bindings (Python access to `kaio-ops`).
See CHANGELOG.md for per-release detail and docs/phases.md for deeper phase plans.
- Always bounds-check array writes. `if idx < n` before every global memory access — out-of-bounds GPU writes corrupt memory silently.
- Shared memory must fit within block limits. Default is 48 KB per block. `shared_mem![f32; 12288]` = 48 KB = the limit.
- Thread indexing errors are the most common bug. Double-check your row / col math, especially with 2D blocks. Off-by-one in a kernel doesn't panic — it writes to the wrong address.
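The 48 KB figure above is easy to sanity-check — 12,288 f32 elements at 4 bytes each lands exactly on the default per-block limit:

```rust
fn main() {
    let elems = 12_288usize;                         // shared_mem![f32; 12288]
    let bytes = elems * std::mem::size_of::<f32>();  // 4 bytes per f32
    assert_eq!(bytes, 48 * 1024);                    // exactly the 48 KB default
    println!("{elems} f32 elements = {bytes} bytes");
}
```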
If something is confusing, awkward, or broken — open an issue. Even small friction matters. This project is actively developed and feedback directly shapes what gets built next.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.