gpuemu

Catch silently-wrong GPU kernels before they reach production.


The problem: every benchmark says your kernel is "correct"

The industry-standard correctness oracle for a GPU kernel is one line:

torch.allclose(my_kernel(x), reference(x), atol=1e-5, rtol=1e-2)

One shape. One dtype. One seed. Every modern LLM-kernel benchmark — KernelBench, TritonBench, GEAK, KernelBand, STARK — uses the same oracle. Kernels that pass it ship to production.

That oracle is blind to entire bug classes that LLM-generated CUDA/Triton code routinely contains:

| Bug class | Example | Why allclose misses it |
|---|---|---|
| Tail-mask leak | softmax forgets to mask the last partial tile | Only fires when H isn't a multiple of BLOCK; H=256 looks fine |
| Accumulator scale | matmul writes acc = instead of acc += | Result happens to match within rtol on the chosen shape |
| Missing normalisation | attention without 1/√D | Saturates softmax differently; one shape looks correct |
| Online-softmax rescale | flash-attention forgets acc *= α after max update | Only wrong when N > BLOCK_N |
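
To make the tail-mask row concrete, here is a minimal pure-Python sketch (it is not gpuemu code and does not run on a GPU). `buggy_tiled_softmax`, `BLOCK`, and the two row lengths are illustrative assumptions; the `allclose` helper follows the `|a - b| <= atol + rtol * |b|` rule that `torch.allclose` documents.

```python
import math
import random

BLOCK = 256  # hypothetical tile width, chosen to match the table above


def reference_softmax(row):
    """Numerically stable softmax over the real elements only."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def buggy_tiled_softmax(row):
    """Emulates a tiled kernel that pads the last tile with 0.0 instead of -inf."""
    n = len(row)
    padded_len = -(-n // BLOCK) * BLOCK          # round up to a whole tile
    padded = row + [0.0] * (padded_len - n)      # bug: padding lanes are not masked
    m = max(padded)
    exps = [math.exp(v - m) for v in padded]
    total = sum(exps)                            # padding leaks into the denominator
    return [e / total for e in exps[:n]]


def allclose(actual, expected, atol=1e-5, rtol=1e-2):
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


def passes_oracle(h, seed=0):
    """Run the one-shape oracle on a row of length h."""
    rng = random.Random(seed)
    row = [rng.gauss(0.0, 1.0) for _ in range(h)]
    return allclose(buggy_tiled_softmax(row), reference_softmax(row))


if __name__ == "__main__":
    print("H=256 accepted:", passes_oracle(256))
    print("H=200 accepted:", passes_oracle(200))
```

With H=256 the row fills exactly one tile, so the padding path never runs and the buggy kernel is accepted. With H=200 the 56 unmasked lanes inflate the softmax denominator and the same check rejects it.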

In a measured 26-op corpus across five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL), the standard one-shape oracle accepts 10 out of 10 of these LLM-style buggy kernels as correct. The same 26-op corpus measured with the seeded oracle catches 10 out of 10 with 0 false positives on the 16 correct controls (P1).

Why it matters: silent correctness regressions ship at LLM scale

LLM training and inference stacks increasingly ship LLM-generated CUDA/Triton kernels (fused attention, custom matmul variants, normalisation layers), and a silently-wrong kernel runs at scale. A wrong matmul propagates through every forward pass; a broken flash-attention degrades long-context quality without crashing; an unmasked reduction taints metrics no one looks at. The cost is GPU-hours wasted on silently-broken work and slow, untraceable quality regressions that survive months of green CI builds.

This is not hypothetical. Every benchmark in the LLM-kernel literature shares the same oracle gap; the kernels they bless are the kernels that ship.

What gpuemu does

gpuemu replaces "allclose on one shape" with an operator-domain–aware correctness regime. The validation step runs without a GPU (it compares against a high-precision CPU reference); only the optional artifact step needs one.

| Capability | What it does | What you gain |
|---|---|---|
| fp64 reference oracle | Validates kernel output against a high-precision CPU reference, per dtype | A real correctness signal, not "it matched on the one shape we happened to try" |
| Op-schema-aware fuzzing | Per-op generator with boundary + regular + adversarial value distributions | Coverage of the partial-tile and edge cases your kernel actually breaks on |
| Per-op calibrated tolerances | A p95-of-controls × 1.5 envelope, fit per op and dtype | Catches real regressions without flagging normal floating-point noise |
| Static PTX/SASS lint | Register pressure, spills, and instruction counts vs a baseline | A performance-regression gate on the compiled artifact |
| Reproducible RNG | Bit-identical xorshift128+ in Rust and Python; exact input snapshots | Every failure replays byte-for-byte from its seed, on any machine |
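
As a rough illustration of what op-schema-aware input generation can look like, the sketch below draws sizes that straddle tile boundaries and values from three distributions. The function names, the contents of each distribution, and the default block size are assumptions made for this example; gpuemu's actual generators may differ.

```python
import random

FP16_MAX = 65504.0        # largest finite float16 value
FP16_TINY = 2.0 ** -24    # smallest positive float16 subnormal


def sample_values(kind, n, rng):
    """Draw n values from one of three illustrative distributions."""
    if kind == "regular":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if kind == "boundary":
        pool = [0.0, -0.0, 1.0, -1.0, FP16_MAX, -FP16_MAX, FP16_TINY]
        return [rng.choice(pool) for _ in range(n)]
    if kind == "adversarial":
        # Mixed signs over eight orders of magnitude stress cancellation
        # in reductions and saturation in exp().
        return [rng.choice((-1.0, 1.0)) * 10.0 ** rng.uniform(-4, 4) for _ in range(n)]
    raise ValueError(f"unknown distribution: {kind}")


def sample_sizes(block, rng, count):
    """Sizes just below, at, and just above tile boundaries, then random ones."""
    edges = [1, block - 1, block, block + 1, 2 * block - 1, 2 * block + 1]
    extra = [rng.randrange(1, 4 * block) for _ in range(max(0, count - len(edges)))]
    return (edges + extra)[:count]


def generate_cases(seed, block=256, count=8, kind="adversarial"):
    """Return a reproducible list of (size, values) test cases for one seed."""
    rng = random.Random(seed)
    return [(n, sample_values(kind, n, rng)) for n in sample_sizes(block, rng, count)]


if __name__ == "__main__":
    for size, values in generate_cases(seed=7):
        print(size, min(values), max(values))
```

Because every draw comes from a single seeded `random.Random`, calling `generate_cases` twice with the same seed returns identical cases, which is the property that makes a failing case replayable.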

Every default above is backed by a measured study — see Research & evidence.

What teams gain

gpuemu turns "silent wrong-output" from an invisible failure mode into a red CI check with a replayable seed. It serves three profiles — each page leads with a real, cited regression the workflow prevents:

  • Frontier-lab kernel teams (Anthropic / OpenAI / DeepMind / Meta / xAI) — a pre-merge correctness gate scaled to hundreds of ops, blocking PRs with replay-seed links.
  • OSS-inference maintainers (vLLM / SGLang / TensorRT-LLM / llama.cpp / MLC-LLM) — a one-line GitHub Action that catches numerical regressions like SGLang #21238 and vLLM #26378 before release.
  • Inference-as-a-service vendors (Fireworks / Together / Anyscale / Modal / Replicate / Baseten / Modular) — a signed Kernel Correctness Report customers verify offline as SLA evidence.

Want to pilot the enterprise tier (private rule packs, on-prem daemon, signed reports)? See Design Partners.

Quick start

Install

# Rust daemon + CLI
cargo install gpuemu

# Python client
pip install gpuemu

# VS Code extension (optional)
code --install-extension gpuemu.gpuemu

Validate a kernel

from gpuemu import Client

client = Client()

# Fuzz with op-schema-aware inputs and an fp64 reference oracle.
results = client.fuzz_op_client_side(
    "flash_attention",
    run_op=lambda inputs: my_flash_attn(inputs["q"], inputs["k"], inputs["v"]),
    iterations=100,
    value_distribution="adversarial",  # recommended default
)

print(f"Passed: {results.passed}/{results.total}")

A failure reports the seed, dtype, shape, and a base64 snapshot of the failing input. Re-run it byte-for-byte from any machine.
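
Seed-based replay depends on a generator that is fully specified in integer arithmetic, so that two languages can produce the same stream. The sketch below is one common formulation of xorshift128+ (shift constants 23, 17, 26) seeded through splitmix64. Both the constants and the seeding scheme are assumptions for illustration; the source does not say which variant gpuemu uses.

```python
MASK64 = (1 << 64) - 1


def splitmix64(state):
    """One splitmix64 step: returns (new_state, output), both 64-bit."""
    state = (state + 0x9E3779B97F4A7C15) & MASK64
    z = state
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return state, z ^ (z >> 31)


class XorShift128Plus:
    """xorshift128+ with explicit 64-bit masking so results are portable."""

    def __init__(self, seed):
        state, self.s0 = splitmix64(seed & MASK64)
        _, self.s1 = splitmix64(state)
        if self.s0 == 0 and self.s1 == 0:
            self.s1 = 1  # the all-zero state is a fixed point and must be avoided

    def next_u64(self):
        s1, s0 = self.s0, self.s1
        self.s0 = s0
        s1 ^= (s1 << 23) & MASK64
        self.s1 = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26)
        return (self.s1 + s0) & MASK64

    def next_float(self):
        """Uniform float in [0, 1) built from the top 53 bits."""
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))


if __name__ == "__main__":
    first = XorShift128Plus(1234)
    replay = XorShift128Plus(1234)
    print([first.next_u64() for _ in range(3)] == [replay.next_u64() for _ in range(3)])
```

Python integers are unbounded, so every shift and multiply is masked to 64 bits here; a Rust port would get the same wrap-around from `u64::wrapping_add` and `wrapping_mul`.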

Compared to

The first question a champion gets asked is "isn't this just torch.testing.assert_close?" or "isn't this what KernelBench does?". Short version:

| Tool | What it does well | The gap gpuemu fills |
|---|---|---|
| torch.testing.assert_close | Standard, simple, in-tree | One shape, one dtype, one seed; catches 0/10 LLM-style bugs in our 26-op corpus across 5 GPU classes |
| KernelBench / TritonBench / GEAK / KernelBand / STARK | Leaderboards for LLM-generated kernels | Use the same one-shape oracle inside; not user-facing |
| NVIDIA Compute Sanitizer | Memcheck / racecheck / synccheck | Memory, race, and synchronization bugs only; silent numerical wrong-output is invisible to it |
| Triton built-in testing | Same assert_close semantics | No op-schema fuzz, no fp64 reference |
| HF Kernel Hub | Distribution + ABI checks | Explicitly assumes a correctness tool upstream; that's gpuemu's slot |
| ncu / cuobjdump / ptxas | Static PTX/SASS introspection | No lint policy, no baseline diffing, no regression gate |
| FreeFuzz / DocTer / NablaFuzz / FuzzGPT | API-level DL framework fuzzers | API layer, not kernel; ACM TOSEM 2025 measured 6.5% real-world bug catch |

The full walk-through, citations, and the five moat signals it surfaces live in the Compared to alternatives page.
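
The "baseline diffing" gap in the ncu / cuobjdump / ptxas row can be illustrated with a small sketch that parses the resource summary `ptxas -v` prints and compares it with a stored baseline. The two log strings are hand-written samples in the usual `ptxas` format, not measurements, and the `max_extra_registers` threshold is a hypothetical policy. The parser reads only the first kernel in a log.

```python
import re

# Hand-written sample logs in the format printed by `ptxas -v`.
BASELINE_LOG = """\
ptxas info    : Function properties for my_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 360 bytes cmem[0]
"""

CURRENT_LOG = """\
ptxas info    : Function properties for my_kernel
    40 bytes stack frame, 16 bytes spill stores, 24 bytes spill loads
ptxas info    : Used 48 registers, 360 bytes cmem[0]
"""


def parse_ptxas_log(text):
    """Extract register count and total spill bytes for the first kernel."""
    regs = re.search(r"Used (\d+) registers", text)
    stores = re.search(r"(\d+) bytes spill stores", text)
    loads = re.search(r"(\d+) bytes spill loads", text)
    spill = sum(int(m.group(1)) for m in (stores, loads) if m)
    return {"registers": int(regs.group(1)) if regs else 0, "spill_bytes": spill}


def lint(current, baseline, max_extra_registers=8):
    """Return a list of human-readable regressions relative to the baseline."""
    findings = []
    extra = current["registers"] - baseline["registers"]
    if extra > max_extra_registers:
        findings.append(f"register count rose by {extra} (limit {max_extra_registers})")
    if current["spill_bytes"] > baseline["spill_bytes"]:
        findings.append(
            f"spill traffic rose from {baseline['spill_bytes']} to {current['spill_bytes']} bytes"
        )
    return findings


if __name__ == "__main__":
    for finding in lint(parse_ptxas_log(CURRENT_LOG), parse_ptxas_log(BASELINE_LOG)):
        print("FAIL:", finding)
```

A CI job could fail the build whenever `lint` returns a non-empty list, which turns the raw compiler summary into a regression gate.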

Execution modes

gpuemu supports three ways to run your kernels:

import torch

# Client-side (recommended): your code runs the GPU op; gpuemu validates.
# Reuses the `client` created in the Quick start.
results = client.fuzz_op_client_side(
    "matmul",
    run_op=lambda i: torch.matmul(i["a"], i["b"]),
    iterations=100,
)

# Daemon-orchestrated: fetch cases, run yourself, submit outputs.
for case in client.get_test_batch("my_op", count=50):
    out = my_gpu_op(case["inputs"])
    client.submit_output("my_op", case["inputs"], out, case["seed"])

# Script-based: register reference + op scripts in gpuemu.toml; daemon runs everything.

VS Code integration

Validation failures appear as red squiggles with code actions:

  • Problems panel — seed, dtype, shape per failure
  • Code actions — "Reproduce failure" or "Minimize test case"
  • Test Explorer — ops appear in the Testing sidebar
  • On-save validation — auto-triggers on reference-script save

Architecture

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   gpuemu CLI     │     │   Python Client  │     │  VS Code Ext     │
│   (Rust)         │     │   (gpuemu)       │     │  (TypeScript)    │
└────────┬─────────┘     └────────┬─────────┘     └────────┬─────────┘
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  │ IPC (NNG/Unix sockets)
                                  ▼
                    ┌─────────────────────────────┐
                    │      gpuemu-daemon          │
                    │  ├── Validation engine      │
                    │  ├── Op-schema fuzzer       │
                    │  ├── Artifact analyzer      │
                    │  └── Failure storage (sled) │
                    └─────────────────────────────┘

Framework support

| Framework | Status | Install |
|---|---|---|
| PyTorch | Stable | pip install gpuemu[torch] |
| JAX | Stable | pip install gpuemu[jax] |
| TensorFlow | Stable | pip install gpuemu[tensorflow] |
| Raw CUDA/Triton | Stable | pip install gpuemu |

What gpuemu is NOT

  • Not a cycle-accurate GPU emulator — correctness, not timing simulation.
  • Not a replacement for real hardware — final benchmarks still belong on the target GPU.
  • Not a training framework — kernel-level oracle, not a model-level one.

Research & evidence

gpuemu is the engineering product of a research program. The defaults above aren't hand-tuned guesses — each is anchored to a measured study, and all four ship as LaTeX manuscripts with replayable run records and a kernel corpus.

| # | Study | Headline finding |
|---|---|---|
| P1 (arXiv:2606.20128) | The correctness illusion in LLM-generated GPU kernels | Seeded oracle catches 9/9 LLM-style bugs on the 24-op single-GPU corpus and 10/10 on the extended 26-op cross-GPU corpus across 5 GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL), with 0 false positives on controls |
| P2 | Operator-aware mixed-precision tolerance calibration | A p95-of-controls × 1.5 envelope raises kernel-bug recall from 65% to 82% over a fixed atol/rtol, at zero precision cost |
| P3 | Test-input generation for tensor programs | Seven-strategy ablation: adversarial sampling wins at 93% recall; "regular shapes only" misses 100% of tail-mask bugs |
| P4 | Static PTX metrics track structural regressions but miss semantic ones | Structural Δregs / Δinstrs predicts Δperf% across 5 GPU classes; semantic bugs compile to identical PTX and need the correctness oracle |
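
The "p95-of-controls × 1.5" envelope from P2 can be sketched as follows: collect the error of known-correct control kernels against the fp64 reference for each (op, dtype) pair, take the 95th percentile, and widen it by a safety margin. The nearest-rank percentile, the function names, and the sample error values are assumptions for illustration, not the calibration procedure from the manuscript.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def calibrate(control_errors, q=95.0, margin=1.5):
    """Map each (op, dtype) key to margin * q-th percentile of its control errors."""
    return {key: margin * percentile(errs, q) for key, errs in control_errors.items()}


def within_envelope(tolerances, op, dtype, observed_error):
    """True if a kernel's observed error stays inside the calibrated envelope."""
    return observed_error <= tolerances[(op, dtype)]


if __name__ == "__main__":
    # Made-up max-abs errors from 20 correct float16 softmax runs.
    controls = {("softmax", "float16"): [i * 1e-4 for i in range(1, 21)]}
    tolerances = calibrate(controls)
    print(tolerances)
    print(within_envelope(tolerances, "softmax", "float16", 2.0e-3))
    print(within_envelope(tolerances, "softmax", "float16", 5.0e-3))
```

Fitting the envelope per op and dtype keeps a tolerance that suits a float16 reduction from also being applied to a float32 elementwise op, where the same slack would hide real bugs.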

Full manuscripts, run-id records, and the replayable corpus live in the gpuemu research program. A one-page summary is on The Evidence.


Documentation

Full docs: docs.skelfresearch.com/gpuemu

Platform support

| Platform | Status |
|---|---|
| Linux | Primary target |
| macOS | Fully supported (CPU validation) |
| Windows | Planned |

Contributing

cargo test                    # Rust (58 tests)
cd gpuemu-py && pytest -v     # Python (11 tests, +7 daemon-live tests)

See the Contributing Guide.

License

Dual-licensed under MIT or Apache 2.0 at your option.


Built with care by the Skelf Research team
