High-performance GPU kernels in pure Rust. Lowered to PTX IR at compile time, validated against the current GPU and JIT-compiled by the driver at launch. No CUDA C++, no Python, no toolkit.
KAIO (from the Greek καιω — to burn, to ignite) is for Rust engineers who need custom GPU kernels today — fused attention variants, quantization ops, novel activations — and would otherwise be writing CUDA C++ because their framework doesn't support them.
- 92.5% of cuBLAS sgemm at 4096² on RTX 4090 (tensor-core matmul, async path, fp16 inputs with fp32 accumulation). Full benchmarks →
- Windows and Linux native. No WSL2, no Triton's Linux-only runtime, no Python. `cargo build` works everywhere.
- No CUDA toolkit required — just the NVIDIA display driver. Build in CI on a standard GitHub runner; host tests pass without a GPU.
- Pure-Rust kernel authorship. The `#[gpu_kernel]` proc macro lowers Rust to a PTX IR module at compile time; at launch the module is validated against the current GPU's SM target, emitted to PTX text, and handed to the CUDA driver for JIT compilation. Type-safe kernel signatures catch dtype mismatches at compile time, not as silent GPU corruption at runtime.
Clone the repo, run one command, and see seven real ML kernels build and execute on your GPU:

```shell
git clone https://github.com/dmriding/kaio.git
cd kaio
cargo xtask showcase
```

You'll see `fused_silu_gate`, `gelu_comparison`, `rms_norm`, `layer_norm`, `softmax`, `int8_dequant`, and `int8_matmul` compile, launch, verify correctness against a CPU reference, and report median latency. The seven examples span activations, normalizations, reductions, and the quantize → matmul pipeline: the canonical transformer-primitive arc plus the v0.3.0 W8A8 headline op.
Want the performance pitch instead? `cargo xtask bench` runs the tensor-core matmul benchmark against cuBLAS sgemm across five sizes. Or `cargo xtask all` for both. `cargo xtask --help` shows the full tooling surface.
Requires an NVIDIA GPU (SM 7.0+, Volta or newer) with an installed display driver — NVIDIA 525 or newer. Any standard Game Ready or Studio driver works; no Tesla/TCC-specific driver is needed. No CUDA toolkit install required.
```shell
cargo add kaio
```

```rust
use kaio::prelude::*;

#[gpu_kernel(block_size = 256)]
fn saxpy(x: &[f32], y: &mut [f32], alpha: f32, n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        y[idx] = alpha * x[idx] + y[idx];
    }
}

fn main() -> Result<()> {
    let device = KaioDevice::new(0)?;
    let n = 1024u32;
    let x = device.alloc_from(&vec![1.0f32; n as usize])?;
    let mut y = device.alloc_from(&vec![2.0f32; n as usize])?;
    saxpy::launch(&device, &x, &mut y, 2.5f32, n)?;
    let result = y.to_host(&device)?;
    println!("result: {:?}", &result[..8]);
    Ok(())
}
```

```shell
$ cargo run
result: [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5]
```
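Those 4.5s follow directly from the SAXPY math. Here is a stdlib-only CPU sketch of the same computation (purely illustrative, not part of KAIO's API) — with x = 1.0, y = 2.0, and alpha = 2.5, every element becomes 2.5 × 1.0 + 2.0 = 4.5:

```rust
// CPU reference for the SAXPY launch above: y[i] = alpha * x[i] + y[i].
fn saxpy_cpu(x: &[f32], y: &mut [f32], alpha: f32) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = alpha * *xi + *yi;
    }
}

fn main() {
    let x = vec![1.0f32; 1024];
    let mut y = vec![2.0f32; 1024];
    saxpy_cpu(&x, &mut y, 2.5);
    println!("result: {:?}", &y[..8]); // matches the GPU output above
}
```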
SAXPY is for learning the DSL. The actual value looks like this:
```rust
use kaio::prelude::*;

// Gated SiLU — the feedforward activation in every LLaMA / Mistral /
// Qwen block. llama.cpp, vLLM, and TensorRT-LLM all ship hand-written
// CUDA for it. With KAIO it's 7 lines of Rust, lowered to a PTX IR
// module at compile time and JIT-loaded at launch.
#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: &[f32], gate: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let xi = x[idx];
        let sig = 1.0f32 / (1.0f32 + exp(-xi));
        out[idx] = xi * sig * gate[idx];
    }
}
```

Run it from the repo root with no directory changes:
```shell
$ cargo xtask showcase silu

=== fused_silu_gate ===
Input size: 1048576 elements
Correctness: PASS (max_abs_err = 1.49e-8)
Median latency: 188.8 μs (of 100 timed runs, 5 warm-ups skipped)
```
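The correctness check compares against a CPU reference computing the same math. A stdlib-only sketch of what that reference looks like (illustrative only — not the repo's actual harness):

```rust
// CPU reference for fused_silu_gate: out[i] = x[i] * sigmoid(x[i]) * gate[i].
fn silu_gate_cpu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter()
        .zip(gate)
        .map(|(&xi, &gi)| {
            let sig = 1.0 / (1.0 + (-xi).exp());
            xi * sig * gi
        })
        .collect()
}

fn main() {
    let out = silu_gate_cpu(&[0.0, 1.0], &[1.0, 2.0]);
    // silu(0) = 0; silu(1) = 1 / (1 + e^-1) ≈ 0.7311, times gate 2.0 ≈ 1.4621
    println!("{:?}", out);
}
```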
Or run all seven showcases in sequence with `cargo xtask showcase`: fused SiLU-gate, exact vs fast GELU, single-block RMSNorm, single-block LayerNorm, single-block softmax, INT8 dequantization, and INT8 matmul. Each is a complete standalone project with correctness + timing (with its own Cargo.toml so you can copy the directory out of the repo as a reference for your own kernel); the GELU comparison's README explains why kernel fusion matters more than arithmetic optimization for ML workloads (the bandwidth-bound teaching moment).
Reach for KAIO when:
- Your framework can't support a custom op (novel attention, fused activation, quantization) and you don't want to drop into CUDA C++ for one kernel.
- You need GPU inference on Windows without WSL2 or Triton's Linux-only runtime.
- Your CI runs on standard GitHub runners without GPU or CUDA toolkit access. KAIO's host tests pass without a GPU; only the `#[ignore]`-gated integration tests need one. Flip the matrix on later.
- You need deterministic VRAM usage, explicit buffer reuse, or zero-copy transfers that high-level frameworks abstract away.
- You're prototyping GPU code in a language you already know (Rust) without learning CUDA C++.
KAIO is not a replacement for Candle or Burn. It is the layer you use when you need more control than they provide.
| Feature | KAIO | cudarc | Candle / Burn | Raw CUDA |
|---|---|---|---|---|
| Write kernels in Rust | Yes | No (load PTX) | No | No |
| Automatic PTX generation | Yes | No | N/A | No |
| Windows support | Yes | Yes | Partial | Yes |
| No CUDA toolkit needed | Yes | Yes | Varies | No |
| Type-safe kernel signatures | Yes | No | N/A | No |
| ML framework integration | Standalone | Standalone | Built-in | Manual |
Measured on RTX 4090 (sm_89), median of 20 timed iterations after 5 warmups:
| Size | TC sync TFLOPS | TC async TFLOPS | cuBLAS sgemm TFLOPS | sync vs cuBLAS | async vs cuBLAS |
|---|---|---|---|---|---|
| 256³ | 0.05 | 0.05 | 1.77 | 2.9% | 2.6% |
| 512³ | 0.37 | 0.34 | 11.09 | 3.3% | 3.1% |
| 1024³ | 2.87 | 2.62 | 37.35 | 7.7% | 7.0% |
| 2048³ | 17.34 | 16.74 | 52.91 | 32.8% | 31.6% |
| 4096³ | 40.93 | 45.96 | 49.72 | 82.3% | 92.5% |
Tensor-core matmul at 4096² reaches 92.5% of cuBLAS sgemm on the async path. Small sizes lag because a 64×64 multi-warp block tile needs ~16 blocks per SM to fill an RTX 4090's 128 SMs; at 256³ there are only 16 blocks in the entire grid, so kernel-launch overhead dominates. For small shapes, prefer the scalar `matmul()` or stay on cuBLAS.
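The grid-size arithmetic behind that occupancy argument is easy to verify. A stdlib-only sketch (the 64×64 tile and the 128-SM figure come from the paragraph above):

```rust
// Number of blocks launched for an n x n matmul with a 64x64 output tile
// per block: ceil(n/64) in each dimension, squared.
fn grid_blocks(n: u32, tile: u32) -> u32 {
    let per_dim = (n + tile - 1) / tile; // ceil(n / tile)
    per_dim * per_dim
}

fn main() {
    let sms = 128; // RTX 4090 SM count
    for n in [256u32, 4096] {
        let blocks = grid_blocks(n, 64);
        println!("{n}^2: {blocks} blocks for {sms} SMs");
    }
    // 256^2  -> 16 blocks total: most SMs sit idle, launch overhead dominates.
    // 4096^2 -> 4096 blocks: plenty of work to keep every SM busy.
}
```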
Apples-to-apples disclaimer: KAIO uses fp16 inputs with fp32 accumulation; cuBLAS sgemm is f32 in / f32 out. The comparison is a project-local performance baseline, not a precision-identity claim. See docs/performance.md for the full analysis, the bank-conflict rationale for why async outpaces sync at large sizes, and the path-to-higher-numbers roadmap.
The Rust ML ecosystem can't keep up with Python. Every time a new model architecture drops with a custom operation — a novel attention variant, a fused activation, a custom quantization kernel — frameworks like candle and burn can't support it until someone writes the GPU function. Today, that means writing CUDA C++, fighting FFI bindings, and giving up on Windows.
Meanwhile, Python developers write a Triton kernel in an afternoon and move on. Triton doesn't support Windows, requires Python, and JIT-compiles at runtime — but it works, and Rust has no equivalent.
KAIO is that equivalent.
Copy these skeletons, fill in your logic.
Bounds-checked element-wise:
```rust
#[gpu_kernel(block_size = 256)]
fn my_kernel(input: &[f32], output: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        output[idx] = input[idx] * 2.0; // your logic here
    }
}
```

Shared memory tiling:
```rust
#[gpu_kernel(block_size = 256)]
fn tiled(data: &[f32], out: &mut [f32], n: u32) {
    let tid = thread_idx_x();
    let idx = tid + block_idx_x() * block_dim_x();
    let tile = shared_mem![f32; 256];
    if idx < n { tile[tid] = data[idx]; }
    bar_sync();
    if idx < n { out[idx] = tile[tid]; } // read from shared
}
```

Block reduction:
```rust
#[gpu_kernel(block_size = 256)]
fn reduce(input: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    let val = if idx < n { input[idx] } else { 0.0f32 };
    let sum = block_reduce_sum(val);
    if thread_idx_x() == 0u32 { out[block_idx_x()] = sum; }
}
```

| Feature | Notes |
|---|---|
| `#[gpu_kernel]` proc macro | Rust → PTX IR at compile time; PTX emitted + JIT-loaded at launch. Type-safe launch wrapper auto-generated. |
| Shared memory + reductions + warp shuffles | `shared_mem![]`, `bar_sync()`, `block_reduce_sum/max`, `shfl_sync_*`. |
| 2D blocks, FMA, math builtins | `block_size = (16,16)`, `fma`, `sqrt`, `exp`, `log`, `tanh`, `abs`, `min`, `max`. |
| Scalar tiled matmul | `kaio_ops::matmul` / `matmul_auto` — 31% of cuBLAS sgemm. Any SM. |
| Fused attention + FlashAttention | `kaio_ops::attention`, `attention_flash` (O(d_k) memory). Any SM. |
| Tensor-core matmul | `kaio_ops::matmul_tc` / `matmul_tc_async` / `matmul_auto_tc` — f16 → f32, SM 8.0+, 82.3% sync / 92.5% async of cuBLAS sgemm at 4096². |
| INT8 dequantize-matmul (W8A8) | `kaio_ops::matmul_int8` — symmetric i8 × i8 → f32 with single-scalar scale, SM 8.0+, K % 32 == 0. 80–94 TOPS at 4096³ on RTX 4090 sm_89 (median ~89 across 6 runs). v0.3.0 reference quant op. |
| Auto-tuner + cache | `tune_matmul`, `matmul_auto`, `matmul_auto_tc` with JSON cache. |
| PTX inspection | `KAIO_DUMP_PTX=1`, `KAIO_PTX_STATS=1`, `KAIO_PTX_ANNOTATE=1`. |
See docs.rs/kaio for the full API surface and the internal IR types.
KAIO is pre-1.0 software. Current engineering constraints:
- NVIDIA only. SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). No AMD, no Intel, no Apple Silicon.
- Matmul performance is size-dependent. Tensor-core matmul reaches 82.3% sync / 92.5% async of cuBLAS sgemm at 4096² on Ampere, but lags heavily at ≤1024² (2-8%) because a 64×64 multi-warp block tile doesn't fill the SM array until the grid is large. Scalar matmul tops out at 31% of cuBLAS. For small shapes prefer cuBLAS or the scalar path. Details →
- Inference only. No autograd / backward pass. Training integration (`kaio-candle` bridge) is planned for Phase 7.
- DSL is a Rust subset. No closures, traits, generics, method calls, or string operations inside `#[gpu_kernel]` function bodies. Arithmetic, comparisons, bitwise operators (`&`, `|`, `^`, `<<`, `>>`, `!`), short-circuit `&&`/`||`, and compound assignment (including bitwise `&=`/`|=`/`<<=`, etc.) are all supported as of v0.2.1.
- FlashAttention d_k limit. `attention_flash()` requires d_k ≤ 256 (one thread per output dimension).
- Single-device. No multi-GPU support.
- First-call PTX module load. The `#[gpu_kernel]` macro generates PTX at Rust compile time, but CUDA-driver module loading still happens at first launch. The first call pays the module-load latency; subsequent launches are dispatch-only. (The in-process PTX cache was removed in v0.2.1 to simplify the load path and run kernels through `PtxModule::validate()` on every launch.)
- API will change before 1.0. Breaking changes are documented in the CHANGELOG per release.
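To make the supported operator subset concrete, here is a host-side Rust function (an illustration, not a kernel) written using only constructs the list above names as supported — a loop, comparisons, bitwise operators, and compound assignment:

```rust
// Counts set bits in (v & mask) using only DSL-friendly constructs:
// a while loop, comparison, bitwise &, and compound assignments &=, +=, >>=.
fn popcount_masked(mut v: u32, mask: u32) -> u32 {
    v &= mask;                 // compound bitwise assignment
    let mut count = 0u32;
    while v != 0 {
        count += v & 1;        // test the low bit
        v >>= 1;               // compound shift assignment
    }
    count
}

fn main() {
    println!("{}", popcount_masked(0b1011, 0b0011)); // 0b1011 & 0b0011 = 0b0011 -> 2
}
```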
Four layers, bottom to top:
```
+-------------------------------------------+
| Layer 4: Block-Level Operations           |  tiled matmul, fused attention, TC matmul
+-------------------------------------------+
| Layer 3: #[gpu_kernel] Proc Macro         |  Rust syntax → PTX automatically
+-------------------------------------------+
| Layer 2: Runtime (kaio-runtime)           |  device memory, kernel launch, SM validation
+-------------------------------------------+
| Layer 1: PTX Codegen (kaio-core)          |  IR types, instruction emitters, fragments
+-------------------------------------------+
```
| Crate | Description |
|---|---|
| `kaio` | Umbrella crate — re-exports everything via `prelude` |
| `kaio-macros` | `#[gpu_kernel]` proc macro |
| `kaio-core` | PTX IR, instruction emitters, fragment containers, zero external deps |
| `kaio-runtime` | CUDA driver wrapper via `cudarc` |
| `kaio-ops` | Pre-built GPU operations (matmul, attention, TC matmul, auto-tuner) |
- GPUs: NVIDIA SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). Tensor-core kernels (`matmul_tc*`) require SM 8.0+ (Ampere or newer).
- Platforms: Windows 10/11 and Linux (Ubuntu 22.04+).
- Driver: NVIDIA 525+ (CUDA 12.0+ compatible). Standard Game Ready or Studio drivers work on consumer cards; Tesla/TCC drivers are not required for KAIO's dynamic-loading path.
- Tested on: RTX 4090 (sm_89) under Windows.
```shell
cargo build --workspace
cargo test --workspace                    # host tests (no GPU needed)
cargo test --workspace -- --ignored       # GPU tests (requires NVIDIA GPU)
KAIO_DUMP_PTX=1 cargo test                # inspect generated PTX
```

Requires Rust 1.94+ (pinned via rust-toolchain.toml). The version floor reflects edition-2024 features and const-evaluation patterns used in kernel tile-layout computation. No CUDA toolkit is needed to build — KAIO resolves the NVIDIA driver at runtime via dynamic loading (`nvcuda.dll` on Windows, `libcuda.so` on Linux).
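The dynamic-loading choice boils down to picking the platform's driver library at runtime. A trivial stdlib-only sketch of that selection (the actual resolution in KAIO goes through the `cudarc` driver wrapper):

```rust
// Platform-dependent NVIDIA driver library name, as described above.
// This only picks the name; actually loading it is the runtime's job.
fn driver_library_name() -> &'static str {
    if cfg!(target_os = "windows") {
        "nvcuda.dll"
    } else {
        "libcuda.so"
    }
}

fn main() {
    println!("would load: {}", driver_library_name());
}
```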
When something goes wrong — launch errors, silent NaN, unexpectedly slow performance — docs/debugging.md is the single entry point. It covers the env vars (`KAIO_DUMP_PTX`, `KAIO_PTX_STATS`, `KAIO_PTX_ANNOTATE`, `KAIO_SM_TARGET`, `KAIO_TUNE_CACHE`, `KAIO_SUPPRESS_DEBUG_WARNING`), the async-launch error model, compute-sanitizer usage for silent-corruption diagnosis, tolerance guidance for numerical verification, and a troubleshooting flowchart for "did it compile → launch → produce the right output?"
93.65% line coverage across the 20,156-line workspace (1,280 lines
uncovered, mostly host-side parser error paths, the xtask repo-tooling
binary, and the unreachable-by-design host stubs for GPU builtins in
kaio/src/gpu_builtins.rs). Shipped kernel crates are well above the
workspace average — kaio-ops/src/matmul_int8_kernel.rs at 97.77%,
matmul_tc_kernel.rs at 97.74%, matmul_tc_async_kernel.rs at 99.40%,
attention_tc_kernel.rs at 98.82%. Measured on RTX 4090 sm_89 via
cargo llvm-cov with the host test suite and the full GPU-only
--ignored test suite merged:
```shell
cargo install cargo-llvm-cov                          # one-time
cargo llvm-cov clean --workspace
cargo llvm-cov --workspace --no-report
cargo llvm-cov --workspace --no-report -- --ignored
cargo llvm-cov report --summary-only
```

The number is static (updated per release, not per CI run) because
the GPU-ignored tests require actual NVIDIA hardware and can't run on
standard GitHub Actions runners. See
docs/testing-strategy.md for the full
testing model (host tests, GPU integration tests, ptxas_verify
structural checks, and the matmul_tc_bench performance harness).
The #[gpu_kernel] macro:
- Parses your Rust function body into a kernel IR.
- Lowers expressions to PTX instruction sequences.
- Generates a `build_ptx()` function that constructs the IR at runtime (first call only, cached via `OnceLock`).
- Emits PTX text through `PtxWriter`.
- Wraps everything in a typed `launch()` function that handles device memory and kernel dispatch.
The generated PTX is validated against nvcc output and passes
ptxas --verify on every tested kernel.
You can also build kernels directly via the Layer 1 IR API for maximum control:
```rust
use kaio_core::emit::{Emit, PtxWriter};
use kaio_core::ir::*;
use kaio_core::types::PtxType;

let mut alloc = RegisterAllocator::new();
let mut kernel = PtxKernel::new("my_kernel");
kernel.add_param(PtxParam::pointer("data", PtxType::F32));
// ... build instructions ...
let mut module = PtxModule::new("sm_89");
module.add_kernel(kernel);
let mut w = PtxWriter::new();
module.emit(&mut w).unwrap();
println!("{}", w.finish()); // valid PTX assembly
```

See kaio-runtime/tests/vector_add_e2e.rs for a complete end-to-end example.
- Phase 1 — PTX codegen + runtime (IR → PTX → GPU execution)
- Phase 2 — `#[gpu_kernel]` proc macro (arithmetic, control flow, memory access, math builtins)
- Phase 3 — Loops, shared memory, reductions, softmax
- Phase 4 — Tiled matmul (31% of cuBLAS), `kaio-ops` crate, 2D blocks, FMA, PTX inspection tools
- Phase 5 — Fused attention + FlashAttention + auto-tuning, crates.io v0.1.0
- Phase 6 — Tensor cores (`mma.sync` fp16/bf16), async copies (`cp.async`), bank-conflict padding. 82.3% sync / 92.5% async of cuBLAS sgemm at 4096² on Ampere. Three standalone showcase examples. crates.io v0.2.0.
- Phase 7 — Quantized kernels (INT8/INT4), training integration (`kaio-candle` bridge), `ldmatrix.sync` for further TC headroom, bf16 TC matmul variant.
- Phase 8 — PyO3 bindings (Python access to `kaio-ops`).
See CHANGELOG.md for per-release detail and docs/phases.md for deeper phase plans.
- Always bounds-check array writes. `if idx < n` before every global memory access — out-of-bounds GPU writes corrupt memory silently.
- Shared memory must fit within block limits. Default is 48 KB per block. `shared_mem![f32; 12288]` = 48 KB = the limit.
- Thread indexing errors are the most common bug. Double-check your row / col math, especially with 2D blocks. Off-by-one in a kernel doesn't panic — it writes to the wrong address.
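The 48 KB figure above is easy to sanity-check — 12,288 f32 elements at 4 bytes each lands exactly on the default per-block limit:

```rust
fn main() {
    let elems = 12_288usize;                         // shared_mem![f32; 12288]
    let bytes = elems * std::mem::size_of::<f32>();  // 4 bytes per f32
    assert_eq!(bytes, 48 * 1024);                    // exactly the 48 KB default
    println!("{elems} f32 elements = {bytes} bytes");
}
```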
If something is confusing, awkward, or broken — open an issue. Even small friction matters. This project is actively developed and feedback directly shapes what gets built next.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.