gpuemu

Catch silently-wrong GPU kernels before they reach production.


The problem: every benchmark says your kernel is "correct"

The industry-standard correctness oracle for a GPU kernel is one line:

torch.allclose(my_kernel(x), reference(x), atol=1e-5, rtol=1e-2)

One shape. One dtype. One seed. Every modern LLM-kernel benchmark — KernelBench, TritonBench, GEAK, KernelBand, STARK — uses the same oracle. Kernels that pass it ship to production.
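To make the slack in that oracle concrete: `torch.allclose` documents its element-wise rule as |actual - expected| <= atol + rtol * |expected|. The pure-Python sketch below mirrors that rule (the helper name `allclose_scalar` and the sample values are ours, for illustration only) and shows that an output that is uniformly 0.9% too large still passes at `rtol=1e-2`.

```python
def allclose_scalar(actual, expected, atol=1e-5, rtol=1e-2):
    """Element-wise rule documented for torch.allclose."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Illustrative reference outputs (not taken from any benchmark).
reference = [0.5, 2.0, 128.0, 4096.0]

# A hypothetical kernel whose every output is 0.9% too large.
scaled = [v * 1.009 for v in reference]

passes_default = all(allclose_scalar(a, e) for a, e in zip(scaled, reference))
passes_tight = all(allclose_scalar(a, e, rtol=1e-4) for a, e in zip(scaled, reference))

print("passes at rtol=1e-2:", passes_default)
print("passes at rtol=1e-4:", passes_tight)
```

The tolerance is not wrong in itself; the point is that a single fixed `rtol` on a single input says little about whether the kernel computes the right function.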

That oracle is blind to entire bug classes that LLM-generated CUDA/Triton code routinely contains:

| Bug class | Example | Why `allclose` misses it |
|---|---|---|
| Tail-mask leak | softmax forgets to mask the last partial tile | Only fires when `H` isn't a multiple of `BLOCK`; `H=256` looks fine |
| Accumulator scale | matmul writes `acc =` instead of `acc +=` | Result happens to match within `rtol` on the chosen shape |
| Missing normalisation | attention without `1/√D` | Saturates softmax differently; one shape looks correct |
| Online-softmax rescale | flash-attention forgets `acc *= α` after max update | Only wrong when `N > BLOCK_N` |
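The tail-mask row can be reproduced without a GPU. The sketch below is a pure-Python stand-in for a tiled softmax, not gpuemu code: `BLOCK = 64`, the function names, and the choice to model the bug as out-of-bounds lanes loading `0.0` instead of `-inf` are all assumptions made for illustration.

```python
import math

BLOCK = 64  # hypothetical tile width


def softmax_reference(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_tiled(row, mask_tail):
    """Row softmax over BLOCK-wide tiles; the last tile may be partial."""
    h = len(row)
    padded_len = -(-h // BLOCK) * BLOCK  # round up to a multiple of BLOCK
    # Out-of-bounds lanes: a correct kernel loads -inf, the buggy one loads 0.0.
    fill = -math.inf if mask_tail else 0.0
    padded = row + [fill] * (padded_len - h)
    m = max(padded)
    exps = [math.exp(v - m) for v in padded]
    total = sum(exps)  # the buggy padding leaks into the denominator here
    return [e / total for e in exps[:h]]


def max_abs_error(h, mask_tail):
    row = [math.sin(i) for i in range(h)]  # deterministic test input
    got = softmax_tiled(row, mask_tail)
    want = softmax_reference(row)
    return max(abs(g - w) for g, w in zip(got, want))


for h in (256, 250):
    print(f"H={h}: buggy error={max_abs_error(h, False):.2e}, "
          f"masked error={max_abs_error(h, True):.2e}")
```

With `H=256` the buggy variant has no partial tile and matches the reference; with `H=250` the six unmasked lanes shift every output.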

In a measured 26-op corpus (10 LLM-style buggy kernels plus 16 correct controls) across five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL), the standard one-shape oracle accepts all 10 buggy kernels as correct. On the same corpus, the seeded oracle catches all 10, with 0 false positives on the 16 controls (P1).
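The same one-shape blindness is easy to see for the online-softmax rescale bug. The following is an illustrative single-query attention in pure Python, not one of the corpus kernels; the function names, the synthetic scores and values, and `block_n=64` are assumptions. It compares a single "benchmark" shape against a longer sequence.

```python
import math


def attention_reference(scores, values):
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


def attention_online(scores, values, block_n, rescale_acc):
    """Streaming softmax-weighted sum; rescale_acc=False models the bug."""
    m, l, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(scores), block_n):
        s_blk = scores[start:start + block_n]
        v_blk = values[start:start + block_n]
        m_new = max(m, max(s_blk))
        alpha = math.exp(m - m_new)  # 0.0 on the first block, since m is -inf
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * alpha + sum(p)
        if rescale_acc:
            acc *= alpha  # the step the buggy kernel forgets
        acc += sum(pi * v for pi, v in zip(p, v_blk))
        m = m_new
    return acc / l


def error(n, rescale_acc, block_n=64):
    scores = [0.05 * j for j in range(n)]  # running max grows block to block
    values = [j / n for j in range(n)]
    got = attention_online(scores, values, block_n, rescale_acc)
    return abs(got - attention_reference(scores, values))


for n in (64, 256):
    print(f"N={n}: buggy error={error(n, False):.2e}, correct error={error(n, True):.2e}")
```

At `N=64` the whole sequence fits in one block, so the missing rescale never executes and the buggy variant looks correct; at `N=256` it is clearly wrong. Sweeping shapes rather than testing one is what exposes it.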

Why it matters: silent correctness regressions ship at LLM scale

Modern LLM training and inference stacks increasingly ship LLM-generated CUDA/Triton kernels (fused attention, custom matmul variants, normalisation layers), and a silently-wrong kernel runs at scale. A buggy matmul propagates through every forward pass; a broken flash-attention degrades long-context quality without crashing; an unmasked reduction taints metrics no one looks at. The cost is GPU-hours wasted on silently-broken work and slow, untraceable quality regressions that survive months of green CI builds.

This is not hypothetical. Every benchmark in the LLM-kernel literature shares the same oracle gap; the kernels they bless are the kernels that ship.

What gpuemu does

gpuemu replaces "allclose on one shape" with an operator-domain–aware correctness regime. The validation step runs without a GPU (it compares against a high-precision CPU reference); only the optional artifact step needs one.
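As a rough illustration of why a high-precision CPU reference is a useful yardstick, the sketch below compares an emulated fp32 accumulator against an fp64 reference using only the standard library. It is not gpuemu's oracle; the helper names and the workload (summing 100,000 copies of 0.1) are assumptions chosen to make accumulation error visible.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_fp32(values):
    """Sum with rounding to fp32 after every add, like an fp32 accumulator."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


values = [to_f32(0.1)] * 100_000

reference = math.fsum(values)   # correctly rounded fp64 sum
fp64_naive = sum(values)        # plain fp64 accumulation
fp32_like = sum_fp32(values)    # what a low-precision accumulator produces

print(f"fp64 reference : {reference!r}")
print(f"fp64 naive err : {abs(fp64_naive - reference):.3e}")
print(f"fp32 accum err : {abs(fp32_like - reference):.3e}")
```

Measuring a kernel against the fp64 result, rather than against another low-precision implementation, separates the kernel's own error from the error of whatever it is being compared to.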

| Capability | What it does | What you gain |
|---|---|---|
| fp64 reference oracle | Validates kernel output against a high-precision CPU reference, per dtype | A real correctness signal, not "it matched on the one shape we happened to try" |
| Op-schema-aware fuzzing | Per-op generator with boundary + regular + adversarial value distributions | Coverage of the partial-tile and edge cases your kernel actually breaks on |
| Per-op calibrated tolerances | A p95-of-controls × 1.5 envelope, fit per op and dtype | Catches real regressions without flagging normal floating-point noise |
| Static PTX/SASS lint | Register pressure, spills, and instruction counts vs a baseline | A performance-regression gate on the compiled artifact |
| Reproducible RNG | Bit-identical xorshift128+ in Rust and Python; exact input snapshots | Every failure replays byte-for-byte from its seed, on any machine |

Every default above is backed by a measured study — see Research & evidence.
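For intuition, the "p95-of-controls × 1.5" envelope can be sketched in a few lines. This is a minimal illustration, not gpuemu's calibration code: the nearest-rank percentile definition, the function names, and the synthetic control errors are all assumptions; the source only states the p95 and the 1.5 factor.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]


def calibrate_tolerance(control_errors, pct=95, safety=1.5):
    """Envelope = pct-th percentile of known-good errors, widened by `safety`."""
    return percentile_nearest_rank(control_errors, pct) * safety


# Synthetic max-abs errors from 100 runs of a known-correct kernel.
control_errors = [i * 1e-7 for i in range(1, 101)]  # 1e-7 .. 1e-5

tol = calibrate_tolerance(control_errors)
print(f"calibrated tolerance: {tol:.3e}")

for name, err in (("typical noise", 9.0e-6), ("suspected regression", 5.0e-5)):
    verdict = "ok" if err <= tol else "FLAG"
    print(f"{name}: error={err:.1e} -> {verdict}")
```

Fitting the envelope per op and dtype means a reduction-heavy op in fp16 and an element-wise op in fp32 each get a threshold that reflects their own noise floor.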

What teams gain

gpuemu turns "silent wrong-output" from an invisible failure mode into a red CI check with a replayable seed. It serves three profiles — each page leads with a real, cited regression the workflow prevents:

Frontier-lab kernel teams (Anthropic / OpenAI / DeepMind / Meta / xAI) — a pre-merge correctness gate scaled to hundreds of ops, blocking PRs with replay-seed links.
OSS-inference maintainers (vLLM / SGLang / TensorRT-LLM / llama.cpp / MLC-LLM) — a one-line GitHub Action that catches numerical regressions like SGLang #21238 and vLLM #26378 before release.
Inference-as-a-service vendors (Fireworks / Together / Anyscale / Modal / Replicate / Baseten / Modular) — a signed Kernel Correctness Report customers verify offline as SLA evidence.

Want to pilot the enterprise tier (private rule packs, on-prem daemon, signed reports)? See Design Partners.

Quick start

Install

# Rust daemon + CLI
cargo install gpuemu

# Python client
pip install gpuemu

# VS Code extension (optional)
code --install-extension gpuemu.gpuemu

Validate a kernel

from gpuemu import Client

client = Client()

# Fuzz with op-schema-aware inputs and an fp64 reference oracle.
results = client.fuzz_op_client_side(
    "flash_attention",
    run_op=lambda inputs: my_flash_attn(inputs["q"], inputs["k"], inputs["v"]),
    iterations=100,
    value_distribution="adversarial",  # recommended default
)

print(f"Passed: {results.passed}/{results.total}")

A failure reports the seed, dtype, shape, and a base64 snapshot of the failing input, so you can replay it byte-for-byte on any machine.
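Seed-based replay relies on the generator being fully deterministic. The sketch below implements xorshift128+ in plain Python to show the idea. It uses the shift constants (23, 18, 5) from Vigna's published reference implementation; whether gpuemu uses the same constants, seeding scheme, or float conversion is an assumption we have not verified, and the class name is ours.

```python
MASK64 = (1 << 64) - 1


class XorShift128Plus:
    """xorshift128+ with 64-bit wraparound emulated via masking."""

    def __init__(self, s0, s1):
        self.s0, self.s1 = s0 & MASK64, s1 & MASK64
        if self.s0 == 0 and self.s1 == 0:
            raise ValueError("state must not be all zero")

    def next_u64(self):
        s1, s0 = self.s0, self.s1
        result = (s0 + s1) & MASK64
        self.s0 = s0
        s1 ^= (s1 << 23) & MASK64
        self.s1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5)
        return result

    def next_float(self):
        """Uniform float in [0, 1) built from the top 53 bits."""
        return (self.next_u64() >> 11) / (1 << 53)


# Replaying a "failing" input: the same seed yields the same sequence.
first_run = XorShift128Plus(1, 2)
replay = XorShift128Plus(1, 2)
original_input = [first_run.next_float() for _ in range(8)]
replayed_input = [replay.next_float() for _ in range(8)]
print("replay matches:", original_input == replayed_input)
```

Because the state update uses only integer shifts, XORs, and a wrapping add, the same sequence can be reproduced exactly in any language with 64-bit integers.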

Compared to

The first question anyone advocating for gpuemu gets asked is "isn't this just torch.testing.assert_close?" or "isn't this what KernelBench does?". Short version:

| Tool | What it does well | The gap gpuemu fills |
|---|---|---|
| `torch.testing.assert_close` | Standard, simple, in-tree | One shape, one dtype, one seed; catches 0/10 LLM-style bugs in our 26-op corpus across 5 GPU classes |
| KernelBench / TritonBench / GEAK / KernelBand / STARK | Leaderboards for LLM-generated kernels | Use the same one-shape oracle inside; not user-facing |
| NVIDIA Compute Sanitizer | Memcheck / racecheck / synccheck | Memory bugs only; silent numerical wrong-output is invisible to it |
| Triton built-in testing | Same `assert_close` semantics | No op-schema fuzz, no fp64 reference |
| HF Kernel Hub | Distribution + ABI checks | Explicitly assumes a correctness tool upstream, which is gpuemu's slot |
| ncu / cuobjdump / ptxas | Static PTX/SASS introspection | No lint policy, no baseline diffing, no regression gate |
| FreeFuzz / DocTer / NablaFuzz / FuzzGPT | API-level DL framework fuzzers | API layer, not kernel; ACM TOSEM 2025 measured 6.5% real-world bug catch |

The full walk-through, citations, and the five moat signals it surfaces live in the Compared to alternatives page.

Execution modes

gpuemu supports three ways to run your kernels:

import torch

# All snippets reuse the `client = Client()` instance from "Validate a kernel".
# Client-side (recommended): your code runs the GPU op; gpuemu validates.
results = client.fuzz_op_client_side("matmul",
    run_op=lambda i: torch.matmul(i["a"], i["b"]),
    iterations=100)

# Daemon-orchestrated: fetch cases, run yourself, submit outputs.
for case in client.get_test_batch("my_op", count=50):
    out = my_gpu_op(case["inputs"])
    client.submit_output("my_op", case["inputs"], out, case["seed"])

# Script-based: register reference + op scripts in gpuemu.toml; daemon runs everything.

VS Code integration

Validation failures appear as red squiggles with code actions:

Problems panel — seed, dtype, shape per failure
Code actions — "Reproduce failure" or "Minimize test case"
Test Explorer — ops appear in the Testing sidebar
On-save validation — auto-triggers on reference-script save

Architecture

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   gpuemu CLI     │     │   Python Client  │     │  VS Code Ext     │
│   (Rust)         │     │   (gpuemu)       │     │  (TypeScript)    │
└────────┬─────────┘     └────────┬─────────┘     └────────┬─────────┘
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  │ IPC (NNG/Unix sockets)
                                  ▼
                    ┌─────────────────────────────┐
                    │      gpuemu-daemon          │
                    │  ├── Validation engine      │
                    │  ├── Op-schema fuzzer       │
                    │  ├── Artifact analyzer      │
                    │  └── Failure storage (sled) │
                    └─────────────────────────────┘

Framework support

| Framework | Status | Install |
|---|---|---|
| PyTorch | Stable | `pip install gpuemu[torch]` |
| JAX | Stable | `pip install gpuemu[jax]` |
| TensorFlow | Stable | `pip install gpuemu[tensorflow]` |
| Raw CUDA/Triton | Stable | `pip install gpuemu` |

What gpuemu is NOT

Not a cycle-accurate GPU emulator — correctness, not timing simulation.
Not a replacement for real hardware — final benchmarks still belong on the target GPU.
Not a training framework — kernel-level oracle, not a model-level one.

Research & evidence

gpuemu is the engineering product of a research program. The defaults above aren't hand-tuned guesses — each is anchored to a measured study, and all four ship as LaTeX manuscripts with replayable run records and a kernel corpus.

| Paper | Study | Headline finding |
|---|---|---|
| P1 (arXiv:2606.20128) | The correctness illusion in LLM-generated GPU kernels | Seeded oracle catches 9/9 LLM-style bugs on the 24-op single-GPU corpus and 10/10 on the extended 26-op cross-GPU corpus across 5 GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL), with 0 false positives on controls |
| P2 | Operator-aware mixed-precision tolerance calibration | A p95-of-controls × 1.5 envelope raises kernel-bug recall from 65% to 82% over a fixed `atol/rtol`, at zero precision cost |
| P3 | Test-input generation for tensor programs | Seven-strategy ablation: adversarial sampling wins at 93% recall; "regular shapes only" misses 100% of tail-mask bugs |
| P4 | Static PTX metrics track structural regressions but miss semantic ones | Structural Δregs / Δinstrs predicts Δperf% across 5 GPU classes; semantic bugs compile to identical PTX and need the correctness oracle |
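To illustrate the split P4 describes, here is a hypothetical baseline-diff gate over static metrics. The metric names, the 5% threshold, and the function itself are assumptions for illustration; they are not gpuemu's lint policy or its output format.

```python
def lint_against_baseline(baseline, candidate, max_increase_pct=5.0):
    """Return human-readable findings for structural regressions."""
    findings = []
    for metric in ("registers", "instructions"):
        before, after = baseline[metric], candidate[metric]
        delta_pct = 100.0 * (after - before) / before
        if delta_pct > max_increase_pct:
            findings.append(f"{metric}: {before} -> {after} (+{delta_pct:.1f}%)")
    if candidate["spill_bytes"] > baseline["spill_bytes"]:
        findings.append(
            f"spill_bytes: {baseline['spill_bytes']} -> {candidate['spill_bytes']}"
        )
    return findings


# Made-up numbers standing in for metrics parsed from compiler output.
baseline = {"registers": 64, "instructions": 1200, "spill_bytes": 0}
regressed = {"registers": 96, "instructions": 1230, "spill_bytes": 16}
semantic_bug = dict(baseline)  # wrong math, structurally identical artifact

print("structural regression:", lint_against_baseline(baseline, regressed))
print("semantic bug:", lint_against_baseline(baseline, semantic_bug))
```

The second call returns no findings, which is the point of P4: a kernel with a wrong result but an unchanged compiled artifact passes any static gate, so the numerical oracle is still required.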

Full manuscripts, run-id records, and the replayable corpus live in the gpuemu research program. A one-page summary is on The Evidence.

Documentation

Full docs: docs.skelfresearch.com/gpuemu

The Problem — what allclose misses
Industry Impact — what silent bugs cost
The Evidence — P1–P4 in one page
Quick Start — first validation in 5 minutes
Architecture Deep Dive

Platform support

| Platform | Status |
|---|---|
| Linux | Primary target |
| macOS | Fully supported (CPU validation) |
| Windows | Planned |

Contributing

cargo test                    # Rust (58 tests)
cd gpuemu-py && pytest -v     # Python (11 tests, +7 daemon-live tests)

See the Contributing Guide.

License

Dual-licensed under MIT or Apache 2.0 at your option.

Built with care by the Skelf Research team
