Catch silently-wrong GPU kernels before they reach production.
The industry-standard correctness oracle for a GPU kernel is one line:
torch.allclose(my_kernel(x), reference(x), atol=1e-5, rtol=1e-2)
One shape. One dtype. One seed. Every modern LLM-kernel benchmark (KernelBench, TritonBench, GEAK, KernelBand, STARK) uses the same oracle. Kernels that pass it ship to production.
That oracle is blind to entire bug classes that LLM-generated CUDA/Triton code routinely contains:
| Bug class | Example | Why allclose misses it |
|---|---|---|
| Tail-mask leak | softmax forgets to mask the last partial tile | Only fires when H isn't a multiple of BLOCK — H=256 looks fine |
| Accumulator scale | matmul writes acc = instead of acc += | Result happens to match within rtol on the chosen shape |
| Missing normalisation | attention without 1/√D | Saturates softmax differently; one shape looks correct |
| Online-softmax rescale | flash-attention forgets acc *= α after max update | Only wrong when N > BLOCK_N |
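The tail-mask row can be reproduced without a GPU. The sketch below is a pure-Python stand-in for a tiled softmax, not real kernel code: `BLOCK`, `tiled_softmax`, `check`, the input range, and the choice that out-of-bounds lanes load 0.0 are all assumptions made for illustration. It applies the same atol/rtol as the one-line oracle to a shape that is a multiple of the tile width and to one that is not.

```python
import math
import random

BLOCK = 64  # illustrative tile width, not a gpuemu constant


def reference_softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def tiled_softmax(row, mask_tail):
    """Softmax over BLOCK-wide tiles; out-of-bounds lanes load 0.0 (assumed)."""
    h = len(row)
    padded = -(-h // BLOCK) * BLOCK  # round h up to a multiple of BLOCK
    lanes = []
    for i in range(padded):
        if i < h:
            lanes.append(row[i])
        elif mask_tail:
            lanes.append(-math.inf)  # correct: masked lanes contribute exp(-inf) = 0
        else:
            lanes.append(0.0)  # bug: padding leaks into the max and the sum
    m = max(lanes)
    exps = [math.exp(v - m) for v in lanes]
    total = sum(exps)
    return [e / total for e in exps[:h]]


def allclose(got, want, atol=1e-5, rtol=1e-2):
    return all(abs(g - w) <= atol + rtol * abs(w) for g, w in zip(got, want))


def check(h, mask_tail, seed=0):
    rng = random.Random(seed)
    row = [rng.uniform(-4.0, -2.0) for _ in range(h)]
    return allclose(tiled_softmax(row, mask_tail), reference_softmax(row))


print(check(256, mask_tail=False))  # buggy kernel, H is a multiple of BLOCK
print(check(250, mask_tail=False))  # buggy kernel, partial last tile
print(check(250, mask_tail=True))   # masked kernel, partial last tile
```

Under these assumptions the unmasked version passes at H=256 and fails at H=250, while the masked version passes at both. A single-shape test at H=256 therefore cannot tell the two apart.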
In a measured 26-op corpus across five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL), the standard one-shape oracle accepts 10 out of 10 of these LLM-style buggy kernels as correct. The same 26-op corpus measured with the seeded oracle catches 10 out of 10 with 0 false positives on the 16 correct controls (P1).
Every modern LLM training and inference stack now ships LLM-generated CUDA/Triton kernels — fused attention, custom matmul variants, normalisation layers — and a silently-wrong kernel runs at scale. A miscompiled matmul propagates through every forward pass; a broken flash-attention degrades long-context quality without crashing; an unmasked reduction taints metrics no one looks at. The cost is GPU-hours wasted on silently-broken work and slow, untraceable quality regressions that survive months of CI green builds.
This is not hypothetical. Every benchmark in the LLM-kernel literature shares the same oracle gap; the kernels they bless are the kernels that ship.
gpuemu replaces "allclose on one shape" with an operator-domain–aware correctness regime. The validation step runs without a GPU (it compares against a high-precision CPU reference); only the optional artifact step needs one.
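As a rough illustration of comparing against a high-precision CPU reference, the sketch below measures the error of a float32-accumulating dot product against a float64 reference. It is not gpuemu's implementation: `to_f32`, `dot_f32`, and `dot_reference` are hypothetical names, and float32 is emulated with the standard `struct` module so the example needs no GPU and no third-party packages.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(a, b):
    """Stand-in for a kernel: accumulate in float32, rounding after every op."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc


def dot_reference(a, b):
    """CPU reference: float64 products, exactly rounded sum via math.fsum."""
    return math.fsum(x * y for x, y in zip(a, b))


a = [to_f32(0.1)] * 1000
b = [1.0] * 1000
err = abs(dot_f32(a, b) - dot_reference(a, b))
print(f"float32 kernel error vs float64 reference: {err:.3e}")
```

The measured error is the quantity a tolerance has to bound. Because it depends on the op, the dtype, and the reduction length, a single fixed atol/rtol is a blunt instrument.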
| Capability | What it does | What you gain |
|---|---|---|
| fp64 reference oracle | Validates kernel output against a high-precision CPU reference, per dtype | A real correctness signal — not "it matched on the one shape we happened to try" |
| Op-schema-aware fuzzing | Per-op generator with boundary + regular + adversarial value distributions | Coverage of the partial-tile and edge cases your kernel actually breaks on |
| Per-op calibrated tolerances | A p95-of-controls × 1.5 envelope, fit per op and dtype | Catches real regressions without flagging normal floating-point noise |
| Static PTX/SASS lint | Register pressure, spills, and instruction counts vs a baseline | A performance-regression gate on the compiled artifact |
| Reproducible RNG | Bit-identical xorshift128+ in Rust and Python; exact input snapshots | Every failure replays byte-for-byte from its seed, on any machine |
Every default above is backed by a measured study — see Research & evidence.
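The "p95-of-controls × 1.5" envelope in the table can be read as simple arithmetic over the errors of known-good kernels. The sketch below is one plausible interpretation, not gpuemu's code: `calibrated_tolerance`, the nearest-rank percentile definition, and the sample errors (expressed in ULPs) are assumptions for illustration.

```python
def calibrated_tolerance(control_errors, pct=95, safety=1.5):
    """Nearest-rank percentile of control errors, scaled by a safety factor."""
    if not control_errors:
        raise ValueError("need at least one control measurement")
    ordered = sorted(control_errors)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct/100 * n) in integers
    return ordered[rank - 1] * safety


# Hypothetical errors (in ULPs) of known-good kernels for one (op, dtype) pair.
controls = {("softmax", "float16"): [float(i) for i in range(1, 101)]}
envelopes = {key: calibrated_tolerance(errs) for key, errs in controls.items()}
print(envelopes[("softmax", "float16")])  # 95th value is 95.0, so 95.0 * 1.5 = 142.5
```

Fitting one envelope per (op, dtype) key is what lets a reduction-heavy op in float16 get a wider tolerance than an elementwise op in float32, instead of both sharing one global setting.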
gpuemu turns "silent wrong-output" from an invisible failure mode into a red CI check with a replayable seed. It serves three profiles — each page leads with a real, cited regression the workflow prevents:
- Frontier-lab kernel teams (Anthropic / OpenAI / DeepMind / Meta / xAI) — a pre-merge correctness gate scaled to hundreds of ops, blocking PRs with replay-seed links.
- OSS-inference maintainers (vLLM / SGLang / TensorRT-LLM / llama.cpp / MLC-LLM) — a one-line GitHub Action that catches numerical regressions like SGLang #21238 and vLLM #26378 before release.
- Inference-as-a-service vendors (Fireworks / Together / Anyscale / Modal / Replicate / Baseten / Modular) — a signed Kernel Correctness Report customers verify offline as SLA evidence.
Want to pilot the enterprise tier (private rule packs, on-prem daemon, signed reports)? See Design Partners.
# Rust daemon + CLI
cargo install gpuemu
# Python client
pip install gpuemu
# VS Code extension (optional)
code --install-extension gpuemu.gpuemu

from gpuemu import Client
client = Client()
# Fuzz with op-schema-aware inputs and an fp64 reference oracle.
results = client.fuzz_op_client_side(
    "flash_attention",
    run_op=lambda inputs: my_flash_attn(inputs["q"], inputs["k"], inputs["v"]),
    iterations=100,
    value_distribution="adversarial",  # recommended default
)
print(f"Passed: {results.passed}/{results.total}")

A failure reports the seed, dtype, shape, and a base64 snapshot of the failing input. Re-run it byte-for-byte from any machine.
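Seed-based replay works because the generator is a pure function of its state. The sketch below is a minimal xorshift128+ in Python using only integer arithmetic. The shift constants (23, 18, 5) are one published variant and are an assumption here, as are the class name, the `replay` helper, and the seeding scheme; this README does not state which gpuemu uses.

```python
MASK64 = (1 << 64) - 1


class XorShift128Plus:
    """xorshift128+ with shifts (23, 18, 5); state is two 64-bit words."""

    def __init__(self, s0, s1):
        self.s0 = s0 & MASK64
        self.s1 = s1 & MASK64
        if self.s0 == 0 and self.s1 == 0:
            raise ValueError("state must not be all zero")

    def next_u64(self):
        x, y = self.s0, self.s1
        result = (x + y) & MASK64  # output is the sum of the two state words
        self.s0 = y
        x ^= (x << 23) & MASK64
        self.s1 = x ^ y ^ (x >> 18) ^ (y >> 5)
        return result


def replay(seed_state, count):
    """Regenerate the first `count` outputs from a recorded seed state."""
    rng = XorShift128Plus(*seed_state)
    return [rng.next_u64() for _ in range(count)]


first = replay((1, 2), 4)
again = replay((1, 2), 4)
print(first[:2], first == again)  # [3, 8388645] True
```

Because every step is masked 64-bit integer arithmetic, the same sequence can be produced in any language with 64-bit wrapping operations, which is the property a cross-language replay depends on.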
The first question a champion gets asked is "isn't this just torch.testing.assert_close?" or "isn't this what KernelBench does?". Short version:
| Tool | What it does well | The gap gpuemu fills |
|---|---|---|
| torch.testing.assert_close | Standard, simple, in-tree | One shape, one dtype, one seed; catches 0/10 LLM-style bugs in our 26-op corpus across 5 GPU classes |
| KernelBench / TritonBench / GEAK / KernelBand / STARK | Leaderboards for LLM-generated kernels | Use the same one-shape oracle inside; not user-facing |
| NVIDIA Compute Sanitizer | Memcheck / racecheck / synccheck | Memory bugs only — silent numerical wrong-output is invisible to it |
| Triton built-in testing | Same assert_close semantics | No op-schema fuzz, no fp64 reference |
| HF Kernel Hub | Distribution + ABI checks | Explicitly assumes a correctness tool upstream — that's gpuemu's slot |
| ncu / cuobjdump / ptxas | Static PTX/SASS introspection | No lint policy, no baseline diffing, no regression gate |
| FreeFuzz / DocTer / NablaFuzz / FuzzGPT | API-level DL framework fuzzers | API layer, not kernel; ACM TOSEM 2025 measured a 6.5% real-world bug catch rate |
The full walk-through, citations, and the five moat signals it surfaces live in the Compared to alternatives page.
gpuemu supports three ways to run your kernels:
# Client-side (recommended): your code runs the GPU op; gpuemu validates.
results = client.fuzz_op_client_side("matmul",
    run_op=lambda i: torch.matmul(i["a"], i["b"]),
    iterations=100)
# Daemon-orchestrated: fetch cases, run yourself, submit outputs.
for case in client.get_test_batch("my_op", count=50):
    out = my_gpu_op(case["inputs"])
    client.submit_output("my_op", case["inputs"], out, case["seed"])
# Script-based: register reference + op scripts in gpuemu.toml; daemon runs everything.

In the VS Code extension, validation failures appear as red squiggles with code actions:
- Problems panel — seed, dtype, shape per failure
- Code actions — "Reproduce failure" or "Minimize test case"
- Test Explorer — ops appear in the Testing sidebar
- On-save validation — auto-triggers on reference-script save
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│    gpuemu CLI    │     │  Python Client   │     │   VS Code Ext    │
│      (Rust)      │     │     (gpuemu)     │     │   (TypeScript)   │
└────────┬─────────┘     └────────┬─────────┘     └────────┬─────────┘
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  │ IPC (NNG/Unix sockets)
                                  ▼
                   ┌─────────────────────────────┐
                   │        gpuemu-daemon        │
                   │  ├── Validation engine      │
                   │  ├── Op-schema fuzzer       │
                   │  ├── Artifact analyzer      │
                   │  └── Failure storage (sled) │
                   └─────────────────────────────┘
| Framework | Status | Install |
|---|---|---|
| PyTorch | Stable | pip install gpuemu[torch] |
| JAX | Stable | pip install gpuemu[jax] |
| TensorFlow | Stable | pip install gpuemu[tensorflow] |
| Raw CUDA/Triton | Stable | pip install gpuemu |
- Not a cycle-accurate GPU emulator — correctness, not timing simulation.
- Not a replacement for real hardware — final benchmarks still belong on the target GPU.
- Not a training framework — kernel-level oracle, not a model-level one.
gpuemu is the engineering product of a research program. The defaults above aren't hand-tuned guesses — each is anchored to a measured study, and all four ship as LaTeX manuscripts with replayable run records and a kernel corpus.
| # | Study | Headline finding |
|---|---|---|
| P1 (arXiv:2606.20128) | The correctness illusion in LLM-generated GPU kernels | Seeded oracle catches 9/9 LLM-style bugs on the 24-op single-GPU corpus and 10/10 on the extended 26-op cross-GPU corpus across 5 GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL), with 0 false positives on controls |
| P2 | Operator-aware mixed-precision tolerance calibration | A p95-of-controls × 1.5 envelope raises kernel-bug recall from 65% to 82% over a fixed atol/rtol, at zero precision cost |
| P3 | Test-input generation for tensor programs | Seven-strategy ablation: adversarial sampling wins at 93% recall; "regular shapes only" misses 100% of tail-mask bugs |
| P4 | Static PTX metrics track structural regressions but miss semantic ones | Structural Δregs / Δinstrs predicts Δperf% across 5 GPU classes; semantic bugs compile to identical PTX and need the correctness oracle |
Full manuscripts, run-id records, and the replayable corpus live in the gpuemu research program. A one-page summary is on The Evidence.
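The P3 finding that "regular shapes only" misses tail-mask bugs follows from which sizes a generator emits. The sketch below contrasts two hypothetical size strategies; `boundary_sizes`, `regular_sizes`, and `exercises_partial_tile` are illustrative names and do not describe gpuemu's actual generators.

```python
def boundary_sizes(block, max_tiles=4):
    """Sizes at and next to each tile boundary, plus the degenerate size 1."""
    sizes = {1}
    for tiles in range(1, max_tiles + 1):
        edge = tiles * block
        sizes.update({edge - 1, edge, edge + 1})
    return sorted(s for s in sizes if s > 0)


def regular_sizes(block, max_tiles=4):
    """Only exact multiples of the tile width: the 'regular shapes only' strategy."""
    return [tiles * block for tiles in range(1, max_tiles + 1)]


def exercises_partial_tile(sizes, block):
    """True if at least one size leaves a partially filled last tile."""
    return any(s % block != 0 for s in sizes)


print(boundary_sizes(64, max_tiles=2))  # [1, 63, 64, 65, 127, 128, 129]
print(exercises_partial_tile(regular_sizes(64), 64))   # False
print(exercises_partial_tile(boundary_sizes(64), 64))  # True
```

A generator restricted to multiples of the tile width never produces a partial last tile, so a kernel whose only defect is a missing tail mask passes every case it is given.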
Full docs: docs.skelfresearch.com/gpuemu
- The Problem — what allclose misses
- Industry Impact — what silent bugs cost
- The Evidence — P1–P4 in one page
- Quick Start — first validation in 5 minutes
- Architecture Deep Dive
| Platform | Status |
|---|---|
| Linux | Primary target |
| macOS | Fully supported (CPU validation) |
| Windows | Planned |
cargo test # Rust (58 tests)
cd gpuemu-py && pytest -v # Python (11 tests, +7 daemon-live tests)
See the Contributing Guide.
Dual-licensed under MIT or Apache 2.0 at your option.
Built with care by the Skelf Research team