kerndiff

kerndiff compares two CUDA or Triton kernel implementations on the same GPU and reports:

end-to-end latency (adaptive multi-run timing)
Nsight Compute counters
derived metrics and deltas
static PTX instruction diffs
a compact roofline summary

It is designed for practical iteration: change a kernel, run one command, and see if it is faster and why.

Install

Published package:

pip install kerndiff

Development install:

pip install -e .

Runtime prerequisites:

NVIDIA drivers and a supported GPU for real profiling
nvcc for CUDA kernel compilation
ncu (Nsight Compute) for hardware counters

License:

MIT, see LICENSE

Quickstart

kerndiff examples/vec_add_v1.cu examples/vec_add_v2.cu --fn vec_add

Mock mode (no GPU required):

kerndiff --mock examples/vec_add_v1.cu examples/vec_add_v2.cu --fn vec_add --no-color

Common workflows

Compare two files:

kerndiff v1.cu v2.cu --fn my_kernel

Compare HEAD vs working copy for one tracked file:

kerndiff kernel.cu --fn my_kernel

Run all common kernels in two files:

kerndiff a.cu b.cu --all

Write JSON output:

kerndiff a.cu b.cu --fn k --format json > result.json

Write JSON to file while keeping stderr progress:

kerndiff a.cu b.cu --fn k --export-json result.json

Write a Perfetto-compatible host trace:

kerndiff a.cu b.cu --fn k --export-perfetto trace.json

Key options

--fn NAME: kernel name
--all: profile all common kernels
--call "kernel<<<...>>>(...)": override launch expression
--dtype {float,half,int,int4}: harness buffer dtype
--elems N: harness buffer size
--min-runs N, --max-runs N: adaptive timing bounds
--noise-threshold PCT: CV stop threshold
--warmup N: warmup iterations
--format {term,json}: output format
--output FILE: write output to file
--export-json FILE: write JSON file and keep stderr progress
--export-perfetto FILE: write a Perfetto/Chrome trace JSON file for host phases and timing samples
--no-color: disable ANSI colors
--gpu N: GPU index
--arch sm_XX: target SM architecture
--mock: fixture-backed run without GPU

Output interpretation

Primary signal:

latency row and verdict line (v2 is ... faster/slower)

Frequent supporting signals:

global_load_eff: memory coalescing quality
dram_bw: achieved memory bandwidth
l2_hit_rate: cache locality/reuse
register_count and sm_occupancy: register pressure tradeoff
stall_*: dominant stall sources

Roofline row:

reports memory vs compute bound and estimated headroom

PTX diff:

static instruction counts only (not dynamic execution counts)

Repository layout

This repository uses a src layout:

src/kerndiff/: package code
src/kerndiff/runtimes/: runtime-specific compilation and execution adapters (cuda, triton)
examples/: benchmark kernels
tests/: test suite

Development

Run tests:

python -m pytest tests/ -q

Release

Bump src/kerndiff/__init__.py, build locally, and verify the distributions:

python3 -m pip install --upgrade build twine
python3 -m build
python3 -m twine check dist/*

Publishing is automated by .github/workflows/publish.yml when a GitHub Release is published.

Troubleshooting

If Nsight Compute metrics are missing:

ensure ncu is installed
run with elevated permissions if needed on your system

If nvcc is not found:

ensure CUDA toolkit is installed or available in your Python CUDA environment

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
examples		examples
flash-attention		flash-attention
src/kerndiff		src/kerndiff
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
logs.md		logs.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kerndiff

Install

Quickstart

Common workflows

Key options

Output interpretation

Repository layout

Development

Release

Troubleshooting

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

kerndiff

Install

Quickstart

Common workflows

Key options

Output interpretation

Repository layout

Development

Release

Troubleshooting

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages