
KAIO


High-performance GPU kernels in pure Rust. Lowered to PTX IR at compile time, validated against the current GPU and JIT-compiled by the driver at launch. No CUDA C++, no Python, no toolkit.

KAIO (from the Greek καιω — to burn, to ignite) is for Rust engineers who need custom GPU kernels today — fused attention variants, quantization ops, novel activations — and would otherwise be writing CUDA C++ because their framework doesn't support them.

Key highlights

  • 92.5% of cuBLAS sgemm at 4096² on RTX 4090 (tensor-core matmul, async path, fp16 inputs with fp32 accumulation). Full benchmarks →
  • Windows and Linux native. No WSL2, no Linux-only Triton runtime, no Python. cargo build works everywhere.
  • No CUDA toolkit required — just the NVIDIA display driver. Build in CI on a standard GitHub runner; host tests pass without a GPU.
  • Pure-Rust kernel authorship. The #[gpu_kernel] proc macro lowers Rust to a PTX IR module at compile time; at launch the module is validated against the current GPU's SM target, emitted to PTX text, and handed to the CUDA driver for JIT compilation. Type-safe kernel signatures catch dtype mismatches at compile time, not as silent GPU corruption at runtime.

Try KAIO in 30 seconds

Clone the repo, run one command, and watch seven real ML kernels build and execute on your GPU:

git clone https://github.com/dmriding/kaio.git
cd kaio
cargo xtask showcase

You'll see fused_silu_gate, gelu_comparison, rms_norm, layer_norm, softmax, int8_dequant, and int8_matmul compile, launch, verify correctness against a CPU reference, and report median latency. The seven examples span activations, normalizations, reductions, and the quantize → matmul pipeline: the canonical transformer-primitive arc plus the v0.3.0 W8A8 headline op.

Want the performance pitch instead? cargo xtask bench runs the tensor-core matmul benchmark against cuBLAS sgemm across five sizes. Or cargo xtask all for both. cargo xtask --help for the full tooling surface.

Requires an NVIDIA GPU with an installed display driver (NVIDIA 525 or newer — any standard Game Ready or Studio driver works). No CUDA toolkit install needed.

Quick Start

Requires an NVIDIA GPU (SM 7.0+, Volta or newer) with an installed display driver — NVIDIA 525 or newer (any standard Game Ready or Studio driver works; no Tesla/TCC-specific drivers needed).

cargo add kaio
use kaio::prelude::*;

#[gpu_kernel(block_size = 256)]
fn saxpy(x: &[f32], y: &mut [f32], alpha: f32, n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        y[idx] = alpha * x[idx] + y[idx];
    }
}

fn main() -> Result<()> {
    let device = KaioDevice::new(0)?;
    let n = 1024u32;

    let x = device.alloc_from(&vec![1.0f32; n as usize])?;
    let mut y = device.alloc_from(&vec![2.0f32; n as usize])?;

    saxpy::launch(&device, &x, &mut y, 2.5f32, n)?;

    let result = y.to_host(&device)?;
    println!("result: {:?}", &result[..8]);
    Ok(())
}
$ cargo run
result: [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5]
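The printed value checks out on the CPU: with x = 1.0, y = 2.0, and alpha = 2.5, each element is 2.5 × 1.0 + 2.0 = 4.5. A minimal host-side sketch of that check — plain Rust, no KAIO types, function name hypothetical:

```rust
// CPU reference for saxpy: y[i] = alpha * x[i] + y[i].
fn saxpy_ref(x: &[f32], y: &mut [f32], alpha: f32) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = alpha * xi + *yi;
    }
}

fn main() {
    let x = vec![1.0f32; 1024];
    let mut y = vec![2.0f32; 1024];
    saxpy_ref(&x, &mut y, 2.5);
    // Matches the GPU output above: every element is 4.5.
    println!("result: {:?}", &y[..8]);
}
```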

The real pitch — fused ML kernels

SAXPY is for learning the DSL. The actual value looks like this:

use kaio::prelude::*;

// Gated SiLU — the feedforward activation in every LLaMA / Mistral /
// Qwen block. llama.cpp, vLLM, and TensorRT-LLM all ship hand-written
// CUDA for it. With KAIO it's 7 lines of Rust, lowered to a PTX IR
// module at compile time and JIT-loaded at launch.
#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: &[f32], gate: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let xi = x[idx];
        let sig = 1.0f32 / (1.0f32 + exp(-xi));
        out[idx] = xi * sig * gate[idx];
    }
}
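The showcase reports max_abs_err against a CPU reference. A sketch of what such a reference for this kernel might look like — plain Rust on the host, function name hypothetical:

```rust
// CPU reference for the fused SiLU gate: out[i] = x[i] * sigmoid(x[i]) * gate[i].
fn fused_silu_gate_ref(x: &[f32], gate: &[f32], out: &mut [f32]) {
    for i in 0..out.len() {
        let xi = x[i];
        let sig = 1.0 / (1.0 + (-xi).exp());
        out[i] = xi * sig * gate[i];
    }
}

fn main() {
    let x = vec![1.0f32, -2.0, 0.5];
    let gate = vec![2.0f32, 1.0, 3.0];
    let mut out = vec![0.0f32; 3];
    fused_silu_gate_ref(&x, &gate, &mut out);
    println!("{:?}", out);
}
```

Comparing the GPU buffer against this with a small tolerance (the showcase reports ~1.5e-8) is the whole verification story.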

Run it from the repo root with no directory changes:

$ cargo xtask showcase silu

=== fused_silu_gate ===
Input size:        1048576 elements
Correctness:       PASS  (max_abs_err = 1.49e-8)
Median latency:    188.8 μs  (of 100 timed runs, 5 warm-ups skipped)

Or run all seven showcases in sequence with cargo xtask showcase: fused SiLU-gate, exact vs fast GELU, single-block RMSNorm, single-block LayerNorm, single-block softmax, INT8 dequantization, and INT8 matmul. Each is a complete standalone project with its own Cargo.toml, correctness checks, and timing, so you can copy the directory out of the repo as a reference for your own kernel. The GELU comparison's README explains why kernel fusion matters more than arithmetic optimization for ML workloads — the bandwidth-bound teaching moment.

When to use KAIO

Reach for KAIO when:

  • Your framework can't support a custom op (novel attention, fused activation, quantization) and you don't want to drop into CUDA C++ for one kernel.
  • You need GPU inference on Windows without WSL2 or Triton's Linux-only runtime.
  • Your CI runs on standard GitHub runners without GPU or CUDA toolkit access. KAIO's host tests pass without a GPU; only the #[ignore]-gated integration tests need one. Flip the matrix on later.
  • You need deterministic VRAM usage, explicit buffer reuse, or zero-copy transfers that high-level frameworks abstract away.
  • You're prototyping GPU code in a language you already know (Rust) without learning CUDA C++.

KAIO is not a replacement for Candle or Burn. It is the layer you use when you need more control than they provide.

|  | KAIO | cudarc | Candle / Burn | Raw CUDA |
|---|---|---|---|---|
| Write kernels in Rust | Yes | No (load PTX) | No | No |
| Automatic PTX generation | Yes | No | N/A | No |
| Windows support | Yes | Yes | Partial | Yes |
| No CUDA toolkit needed | Yes | Yes | Varies | No |
| Type-safe kernel signatures | Yes | No | N/A | No |
| ML framework integration | Standalone | Standalone | Built-in | Manual |

Performance

Measured on RTX 4090 (sm_89), median of 20 timed iterations after 5 warmups:

| Size | TC sync TFLOPS | TC async TFLOPS | cuBLAS sgemm TFLOPS | Sync vs cuBLAS | Async vs cuBLAS |
|---|---|---|---|---|---|
| 256³ | 0.05 | 0.05 | 1.77 | 2.9% | 2.6% |
| 512³ | 0.37 | 0.34 | 11.09 | 3.3% | 3.1% |
| 1024³ | 2.87 | 2.62 | 37.35 | 7.7% | 7.0% |
| 2048³ | 17.34 | 16.74 | 52.91 | 32.8% | 31.6% |
| 4096³ | 40.93 | 45.96 | 49.72 | 82.3% | 92.5% |

Tensor-core matmul at 4096² reaches 92.5% of cuBLAS sgemm on the async path. Small sizes lag because a 64×64 multi-warp block tile needs ~16 blocks per SM to fill an RTX 4090's 128 SMs; at 256³ there are only 16 blocks in the entire grid, so kernel-launch overhead dominates. For small shapes prefer scalar matmul() or stay on cuBLAS.
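The grid-fill arithmetic is easy to check: with a 64×64 output tile per block, an N×N matmul launches (N/64)² blocks. A quick sketch (tile size taken from the text above):

```rust
// Blocks in the grid for an N×N matmul with tile×tile output tiles per block.
fn grid_blocks(n: u32, tile: u32) -> u32 {
    (n / tile) * (n / tile)
}

fn main() {
    // 256² → 16 blocks total: far too few to occupy 128 SMs.
    // 4096² → 4096 blocks: plenty to keep the SM array busy.
    println!("256: {}, 4096: {}", grid_blocks(256, 64), grid_blocks(4096, 64));
}
```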

Apples-to-apples disclaimer: KAIO uses fp16 inputs with fp32 accumulation; cuBLAS sgemm is f32 in / f32 out. The comparison is a project-local performance baseline, not a precision-identity claim. See docs/performance.md for the full analysis, the bank-conflict rationale for why async outpaces sync at large sizes, and the path-to-higher-numbers roadmap.
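For reference, the TFLOPS figures presumably follow the standard 2·N³ FLOP count for an N×N×N matmul (this convention is an assumption — the benchmark docs are authoritative):

```rust
// Standard FLOP count for an N×N×N matmul: N³ multiply-adds = 2·N³ FLOPs.
fn matmul_flops(n: u64) -> u64 {
    2 * n * n * n
}

fn main() {
    // At 4096³, 2·4096³ ≈ 1.37e11 FLOPs; at 45.96 TFLOPS that is ~3 ms per matmul.
    let flops = matmul_flops(4096);
    let secs = flops as f64 / 45.96e12;
    println!("{} FLOPs, {:.2} ms", flops, secs * 1e3);
}
```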

The problem KAIO solves

The Rust ML ecosystem can't keep up with Python. Every time a new model architecture drops with a custom operation — a novel attention variant, a fused activation, a custom quantization kernel — frameworks like candle and burn can't support it until someone writes the GPU function. Today, that means writing CUDA C++, fighting FFI bindings, and giving up on Windows.

Meanwhile, Python developers write a Triton kernel in an afternoon and move on. Triton doesn't support Windows, requires Python, and JIT-compiles at runtime — but it works, and Rust has no equivalent.

KAIO is that equivalent.

Patterns

Copy these skeletons, fill in your logic.

Bounds-checked element-wise:

#[gpu_kernel(block_size = 256)]
fn my_kernel(input: &[f32], output: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        output[idx] = input[idx] * 2.0; // your logic here
    }
}

Shared memory tiling:

#[gpu_kernel(block_size = 256)]
fn tiled(data: &[f32], out: &mut [f32], n: u32) {
    let tid = thread_idx_x();
    let idx = tid + block_idx_x() * block_dim_x();
    let tile = shared_mem![f32; 256];
    if idx < n { tile[tid] = data[idx]; }
    bar_sync();
    if idx < n { out[idx] = tile[tid]; } // read from shared
}

Block reduction:

#[gpu_kernel(block_size = 256)]
fn reduce(input: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    let val = if idx < n { input[idx] } else { 0.0f32 };
    let sum = block_reduce_sum(val);
    if thread_idx_x() == 0u32 { out[block_idx_x()] = sum; }
}
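The block-reduction skeleton leaves one partial sum per block in out; a second pass — another launch, or a host-side sum as sketched here — finishes the reduction. Plain Rust, names hypothetical:

```rust
// Host-side final pass: sum the per-block partial sums produced by `reduce`.
fn final_sum(partials: &[f32]) -> f32 {
    partials.iter().sum()
}

fn main() {
    // e.g. a 1024-element input with block_size = 256 yields 4 partial sums.
    let partials = vec![10.0f32, 20.0, 30.0, 40.0];
    println!("total = {}", final_sum(&partials));
}
```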

Feature summary

| Feature | Notes |
|---|---|
| #[gpu_kernel] proc macro | Rust → PTX IR at compile time; PTX emitted + JIT-loaded at launch. Type-safe launch wrapper auto-generated. |
| Shared memory + reductions + warp shuffles | shared_mem![], bar_sync(), block_reduce_sum/max, shfl_sync_*. |
| 2D blocks, FMA, math builtins | block_size = (16,16), fma, sqrt, exp, log, tanh, abs, min, max. |
| Scalar tiled matmul | kaio_ops::matmul / matmul_auto — 31% of cuBLAS sgemm. Any SM. |
| Fused attention + FlashAttention | kaio_ops::attention, attention_flash (O(d_k) memory). Any SM. |
| Tensor-core matmul | kaio_ops::matmul_tc / matmul_tc_async / matmul_auto_tc — f16 → f32, SM 8.0+, 82.3% sync / 92.5% async of cuBLAS sgemm at 4096². |
| INT8 dequantize-matmul (W8A8) | kaio_ops::matmul_int8 — symmetric i8 × i8 → f32 with single-scalar scale, SM 8.0+, K%32==0. 80–94 TOPS at 4096³ on RTX 4090 sm_89 (median ~89 across 6 runs). v0.3.0 reference quant op. |
| Auto-tuner + cache | tune_matmul, matmul_auto, matmul_auto_tc with JSON cache. |
| PTX inspection | KAIO_DUMP_PTX=1, KAIO_PTX_STATS=1, KAIO_PTX_ANNOTATE=1. |

See docs.rs/kaio for the full API surface and the internal IR types.

Project status and constraints

KAIO is pre-1.0 software. Current engineering constraints:

  • NVIDIA only. SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). No AMD, no Intel, no Apple Silicon.
  • Matmul performance is size-dependent. Tensor-core matmul reaches 82.3% sync / 92.5% async of cuBLAS sgemm at 4096² on an RTX 4090 (sm_89), but lags heavily at ≤1024² (2–8%) because a 64×64 multi-warp block tile doesn't fill the SM array until the grid is large. Scalar matmul tops out at 31% of cuBLAS. For small shapes prefer cuBLAS or the scalar path. Details →
  • Inference only. No autograd / backward pass. Training integration (kaio-candle bridge) is planned for Phase 7.
  • DSL is a Rust subset. No closures, traits, generics, method calls, or string operations inside #[gpu_kernel] function bodies. Arithmetic, comparisons, bitwise operators (& | ^ << >> !), short-circuit && / ||, and compound assignment (including bitwise &= / |= / <<= / etc.) all supported as of v0.2.1.
  • FlashAttention d_k limit. attention_flash() requires d_k ≤ 256 (one thread per output dimension).
  • Single-device. No multi-GPU support.
  • First-call PTX module load. The #[gpu_kernel] macro generates PTX at Rust compile time, but CUDA-driver module loading still happens at first launch. First call pays the module-load latency; subsequent launches are dispatch-only. (The in-process PTX cache was removed in v0.2.1 to simplify the load path and run kernels through PtxModule::validate() on every launch.)
  • API will change before 1.0. Breaking changes documented in CHANGELOG per release.

Architecture

Four layers, bottom to top:

+-------------------------------------------+
|  Layer 4: Block-Level Operations          |  tiled matmul, fused attention, TC matmul
+-------------------------------------------+
|  Layer 3: #[gpu_kernel] Proc Macro        |  Rust syntax → PTX automatically
+-------------------------------------------+
|  Layer 2: Runtime (kaio-runtime)          |  device memory, kernel launch, SM validation
+-------------------------------------------+
|  Layer 1: PTX Codegen (kaio-core)         |  IR types, instruction emitters, fragments
+-------------------------------------------+

| Crate | Description |
|---|---|
| kaio | Umbrella crate — re-exports everything via prelude |
| kaio-macros | #[gpu_kernel] proc macro |
| kaio-core | PTX IR, instruction emitters, fragment containers; zero external deps |
| kaio-runtime | CUDA driver wrapper via cudarc |
| kaio-ops | Pre-built GPU operations (matmul, attention, TC matmul, auto-tuner) |

Target hardware

  • GPUs: NVIDIA SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). Tensor-core kernels (matmul_tc*) require SM 8.0+ (Ampere or newer).
  • Platforms: Windows 10/11 and Linux (Ubuntu 22.04+).
  • Driver: NVIDIA 525+ (CUDA 12.0+ compatible). Standard Game Ready or Studio drivers work on consumer cards; Tesla/TCC drivers are not required and not needed for KAIO's dynamic-loading path.
  • Tested on: RTX 4090 (sm_89) under Windows.

Building

cargo build --workspace
cargo test --workspace                  # host tests (no GPU needed)
cargo test --workspace -- --ignored     # GPU tests (requires NVIDIA GPU)
KAIO_DUMP_PTX=1 cargo test              # inspect generated PTX

Requires Rust 1.94+ (pinned via rust-toolchain.toml). The version floor reflects edition-2024 features and const-evaluation patterns used in kernel tile-layout computation. No CUDA toolkit is needed to build — KAIO resolves the NVIDIA driver at runtime via dynamic loading (nvcuda.dll on Windows, libcuda.so on Linux).

Debugging

When something goes wrong — launch errors, silent NaN, unexpectedly slow performance — docs/debugging.md is the single entry point. It covers the env vars (KAIO_DUMP_PTX, KAIO_PTX_STATS, KAIO_PTX_ANNOTATE, KAIO_SM_TARGET, KAIO_TUNE_CACHE, KAIO_SUPPRESS_DEBUG_WARNING), the async-launch error model, compute-sanitizer usage for silent-corruption diagnosis, tolerance guidance for numerical verification, and a troubleshooting flowchart for "did it compile → launch → produce right output?"

Test coverage

93.65% line coverage across the 20,156-line workspace (1,280 lines uncovered, mostly host-side parser error paths, the xtask repo-tooling binary, and the unreachable-by-design host stubs for GPU builtins in kaio/src/gpu_builtins.rs). Shipped kernel crates are well above the workspace average — kaio-ops/src/matmul_int8_kernel.rs at 97.77%, matmul_tc_kernel.rs at 97.74%, matmul_tc_async_kernel.rs at 99.40%, attention_tc_kernel.rs at 98.82%. Measured on RTX 4090 sm_89 via cargo llvm-cov with the host test suite and the full GPU-only --ignored test suite merged:

cargo install cargo-llvm-cov           # one-time
cargo llvm-cov clean --workspace
cargo llvm-cov --workspace --no-report
cargo llvm-cov --workspace --no-report -- --ignored
cargo llvm-cov report --summary-only

The number is static (updated per release, not per CI run) because the GPU-ignored tests require actual NVIDIA hardware and can't run on standard GitHub Actions runners. See docs/testing-strategy.md for the full testing model (host tests, GPU integration tests, ptxas_verify structural checks, and the matmul_tc_bench performance harness).

How it works

The #[gpu_kernel] macro:

  1. Parses your Rust function body into a kernel IR.
  2. Lowers expressions to PTX instruction sequences.
  3. Generates a build_ptx() function that constructs the IR at runtime (first call only, cached via OnceLock).
  4. Emits PTX text through PtxWriter.
  5. Wraps everything in a typed launch() function that handles device memory and kernel dispatch.

The generated PTX is validated against nvcc output and passes ptxas --verify on every tested kernel.

IR API (advanced)

You can also build kernels directly via the Layer 1 IR API for maximum control:

use kaio_core::emit::{Emit, PtxWriter};
use kaio_core::ir::*;
use kaio_core::types::PtxType;

let mut alloc = RegisterAllocator::new();
let mut kernel = PtxKernel::new("my_kernel");
kernel.add_param(PtxParam::pointer("data", PtxType::F32));
// ... build instructions ...

let mut module = PtxModule::new("sm_89");
module.add_kernel(kernel);
let mut w = PtxWriter::new();
module.emit(&mut w).unwrap();
println!("{}", w.finish()); // valid PTX assembly

See kaio-runtime/tests/vector_add_e2e.rs for a complete end-to-end example.

Roadmap

  • Phase 1 — PTX codegen + runtime (IR → PTX → GPU execution)
  • Phase 2 — #[gpu_kernel] proc macro (arithmetic, control flow, memory access, math builtins)
  • Phase 3 — Loops, shared memory, reductions, softmax
  • Phase 4 — Tiled matmul (31% of cuBLAS), kaio-ops crate, 2D blocks, FMA, PTX inspection tools
  • Phase 5 — Fused attention + FlashAttention + auto-tuning, crates.io v0.1.0
  • Phase 6 — Tensor cores (mma.sync fp16/bf16), async copies (cp.async), bank-conflict padding. 82.3% sync / 92.5% async of cuBLAS sgemm at 4096² on an RTX 4090 (sm_89). Three standalone showcase examples. crates.io v0.2.0.
  • Phase 7 — Quantized kernels (INT8/INT4), training integration (kaio-candle bridge), ldmatrix.sync for further TC headroom, bf16 TC matmul variant.
  • Phase 8 — PyO3 bindings (Python access to kaio-ops).

See CHANGELOG.md for per-release detail and docs/phases.md for deeper phase plans.

Common pitfalls

  • Always bounds-check array writes. if idx < n before every global memory access — out-of-bounds GPU writes corrupt memory silently.
  • Shared memory must fit within block limits. Default is 48 KB per block. shared_mem![f32; 12288] = 48 KB = the limit.
  • Thread indexing errors are the most common bug. Double-check your row / col math, especially with 2D blocks. Off-by-one in a kernel doesn't panic — it writes to the wrong address.
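The 48 KB figure in the second pitfall is plain byte arithmetic — 12,288 f32 values at 4 bytes each:

```rust
// Shared-memory footprint of shared_mem![f32; 12288] in bytes.
fn shared_bytes(elems: usize) -> usize {
    elems * std::mem::size_of::<f32>()
}

fn main() {
    // 12288 × 4 = 49152 bytes — exactly the 48 KB default per-block limit.
    println!("{} bytes", shared_bytes(12288));
}
```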

Feedback

If something is confusing, awkward, or broken — open an issue. Even small friction matters. This project is actively developed and feedback directly shapes what gets built next.

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
