kerndiff compares two CUDA or Triton kernel implementations on the same GPU and reports:
- end-to-end latency (adaptive multi-run timing)
- Nsight Compute counters
- derived metrics and deltas
- static PTX instruction diffs
- a compact roofline summary
It is designed for practical iteration: change a kernel, run one command, and see if it is faster and why.
Published package:
pip install kerndiff
Development install:
pip install -e .
Runtime prerequisites:
- NVIDIA drivers and a supported GPU for real profiling
- nvcc for CUDA kernel compilation
- ncu (Nsight Compute) for hardware counters
License:
- MIT, see LICENSE
kerndiff examples/vec_add_v1.cu examples/vec_add_v2.cu --fn vec_add
Mock mode (no GPU required):
kerndiff --mock examples/vec_add_v1.cu examples/vec_add_v2.cu --fn vec_add --no-color
Compare two files:
kerndiff v1.cu v2.cu --fn my_kernel
Compare HEAD vs working copy for one tracked file:
kerndiff kernel.cu --fn my_kernel
Run all common kernels in two files:
kerndiff a.cu b.cu --all
Write JSON output:
kerndiff a.cu b.cu --fn k --format json > result.json
Write JSON to file while keeping stderr progress:
kerndiff a.cu b.cu --fn k --export-json result.json
Write a Perfetto-compatible host trace:
kerndiff a.cu b.cu --fn k --export-perfetto trace.json
- --fn NAME: kernel name
- --all: profile all common kernels
- --call "kernel<<<...>>>(...)": override launch expression
- --dtype {float,half,int,int4}: harness buffer dtype
- --elems N: harness buffer size
- --min-runs N, --max-runs N: adaptive timing bounds
- --noise-threshold PCT: CV stop threshold
- --warmup N: warmup iterations
- --format {term,json}: output format
- --output FILE: write output to file
- --export-json FILE: write JSON file and keep stderr progress
- --export-perfetto FILE: write a Perfetto/Chrome trace JSON file for host phases and timing samples
- --no-color: disable ANSI colors
- --gpu N: GPU index
- --arch sm_XX: target SM architecture
- --mock: fixture-backed run without GPU
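To make the timing options more concrete, here is a minimal sketch of how a warmup phase, run bounds, and a CV (coefficient of variation) stop threshold can interact. It is an illustration only and not kerndiff's actual implementation: the function name `adaptive_time`, its parameter names, and its default values are hypothetical, and it times a plain Python callable on the host rather than a GPU kernel.

```python
import statistics
import time
from typing import Callable, List


def adaptive_time(fn: Callable[[], None], warmup: int = 3, min_runs: int = 5,
                  max_runs: int = 50, noise_threshold_pct: float = 2.0) -> List[float]:
    """Collect timing samples until the CV falls below the threshold.

    Hypothetical helper mirroring the roles of --warmup, --min-runs,
    --max-runs and --noise-threshold; not kerndiff's own code.
    """
    for _ in range(warmup):
        fn()  # warmup iterations are executed but not recorded

    samples: List[float] = []
    while len(samples) < max_runs:
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

        # A sample standard deviation needs at least two data points.
        if len(samples) >= max(min_runs, 2):
            mean = statistics.fmean(samples)
            cv_pct = 100.0 * statistics.stdev(samples) / mean if mean > 0 else 0.0
            if cv_pct <= noise_threshold_pct:
                break  # measurements are stable enough, stop early
    return samples
```

Under this scheme a lower noise threshold buys tighter measurements at the cost of more runs, and the maximum run count caps the total time spent on a noisy system.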
Primary signal:
- latency row and verdict line (v2 is ... faster/slower)
Frequent supporting signals:
- global_load_eff: memory coalescing quality
- dram_bw: achieved memory bandwidth
- l2_hit_rate: cache locality/reuse
- register_count and sm_occupancy: register pressure tradeoff
- stall_*: dominant stall sources
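The register pressure tradeoff can be estimated by hand. The sketch below computes the occupancy ceiling imposed by register usage alone. It is not how kerndiff derives sm_occupancy (that value comes from profiling); the function name is hypothetical, and the defaults of 65536 registers and 64 resident warps per SM are assumptions that vary by architecture. Register allocation granularity, shared memory, and per-SM block limits are ignored.

```python
def register_limited_occupancy(regs_per_thread: int, threads_per_block: int,
                               regs_per_sm: int = 65536,
                               max_warps_per_sm: int = 64,
                               warp_size: int = 32) -> float:
    """Upper bound on occupancy when only registers and warp slots limit it.

    The hardware limits are assumed defaults; check them for your GPU.
    """
    if regs_per_thread <= 0 or threads_per_block <= 0:
        raise ValueError("regs_per_thread and threads_per_block must be positive")

    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warp_size * warps_per_block

    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block

    active_warps = min(blocks_by_regs, blocks_by_warps) * warps_per_block
    return active_warps / max_warps_per_sm
```

With these assumed limits and 256 threads per block, going from 32 to 64 registers per thread halves the ceiling from 1.0 to 0.5, which is why a higher register_count can coincide with lower sm_occupancy.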
Roofline row:
- reports memory vs compute bound and estimated headroom
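For readers unfamiliar with the roofline model, the following sketch shows one common way to classify a kernel as memory or compute bound and to estimate headroom. It assumes the textbook roofline formulation; the function name, its inputs, and the headroom definition are hypothetical and may differ from what kerndiff computes internally.

```python
from typing import Tuple


def roofline_summary(flops: float, bytes_moved: float, elapsed_s: float,
                     peak_flops: float, peak_bw: float) -> Tuple[str, float]:
    """Return (bound, headroom_pct) under the classic roofline model.

    peak_flops is in FLOP/s and peak_bw in bytes/s; both are inputs the
    caller must supply for the target GPU.
    """
    intensity = flops / bytes_moved   # FLOP per byte of DRAM traffic
    ridge = peak_flops / peak_bw      # intensity where the two roofs meet
    bound = "memory" if intensity < ridge else "compute"

    attainable = min(peak_flops, intensity * peak_bw)  # roof at this intensity
    achieved = flops / elapsed_s
    headroom_pct = 100.0 * (1.0 - achieved / attainable)
    return bound, headroom_pct
```

A float vector add performs one FLOP per 12 bytes moved (two loads and one store), so it sits far to the left of the ridge point on any current GPU and should be classified as memory bound.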
PTX diff:
- static instruction counts only (not dynamic execution counts)
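As an illustration of what a static count means, the sketch below tallies opcodes in PTX text and diffs two tallies. It is a simplified stand-in and not kerndiff's parser: the function names are hypothetical, and the line filtering is a rough heuristic that skips directives, labels, and braces. An instruction inside a loop is counted once here, regardless of how many times it executes.

```python
import re
from collections import Counter
from typing import Dict


def count_ptx_opcodes(ptx: str) -> Counter:
    """Count opcodes per line of PTX text (static, not dynamic, counts)."""
    counts: Counter = Counter()
    for raw in ptx.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        # Skip blanks, directives (.reg, .visible, ...), braces, parens, labels.
        if not line or line.startswith((".", "{", "}", "(", ")")) or line.endswith(":"):
            continue
        # Remove a predicate guard such as "@%p1" or "@!%p1".
        line = re.sub(r"^@!?%\w+\s+", "", line)
        opcode = line.split()[0].rstrip(";")
        counts[opcode] += 1
    return counts


def diff_counts(v1: Counter, v2: Counter) -> Dict[str, int]:
    """Return opcode -> (v2 count - v1 count) for opcodes whose count changed."""
    return {op: v2[op] - v1[op] for op in sorted(set(v1) | set(v2)) if v2[op] != v1[op]}
```

A diff like this shows, for example, whether scalar loads were replaced by vectorized ones, but it cannot say how often each instruction runs.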
This repository uses a src layout:
- src/kerndiff/: package code
- src/kerndiff/runtimes/: runtime-specific compilation and execution adapters (cuda, triton)
- examples/: benchmark kernels
- tests/: test suite
Run tests:
python -m pytest tests/ -q
Bump the version in src/kerndiff/__init__.py, build locally, and verify the distributions:
python3 -m pip install --upgrade build twine
python3 -m build
python3 -m twine check dist/*
Publishing is automated by .github/workflows/publish.yml when a GitHub Release is published.
If Nsight Compute metrics are missing:
- ensure ncu is installed
- run with elevated permissions if needed on your system
If nvcc is not found:
- ensure CUDA toolkit is installed or available in your Python CUDA environment