Agent-driven GPU microbenchmark framework for academic-simulator calibration.
Auto-MicroBench autonomously writes, runs, validates, and refines CUDA microbenchmarks to extract the hardware parameters that simulators like Accel-Sim and GPGPU-Sim need: instruction throughputs, latencies, cache sizes/latencies/bandwidths, shared-memory characteristics, warp scheduling parameters, tensor-core throughputs, and so on.
The design is a microbench-flavored cousin of AutoKernel: the agent edits a single CUDA file, a fixed harness compiles + runs + validates, and an orchestrator picks the next parameter from a catalog. The win condition is measurement accuracy (tight confidence intervals, SASS verification, NCU validation), not speed.
You launch a coding agent (Claude Code, Codex, Cursor, …) inside this directory and let it run. The agent does all the editing, compiling, benchmarking, and iteration — you just kick it off and check back later.
A handful of pieces you should know exist before you start:
- program.md: the agent's operating manual. The kickoff prompt below tells the agent to read this first.
- parameter_catalog.py: the list of hardware parameters to measure (currently 34 entries: FMA throughputs/latencies, cache sizes/latencies, DRAM bandwidth, shared memory, tensor cores, …). This is the spec.
- benchmark.cu: the only file the agent rewrites, one parameter at a time.
- harness.py: fixed evaluation harness. Compiles, runs N trials, verifies emitted SASS, optionally validates with Nsight Compute, logs to a TSV. The agent never edits this.
- orchestrate.py: schedules the next parameter, tracks coverage.
- knowledge/: reference material the agent reads, namely common_knowledges/ (per-technique playbooks) plus ptx_isa_markdown/ (PTX ISA spec + CUDA API + ncu/nsys guides; fetched on first checkout, not redistributed).
- templates/, validators/, configs/, report.py, analyze.py: starter kernels, NCU validators, simulator-config exporters, report generation. (Detailed in Repository Layout below.)
cd auto-microbench
# 1. Install Python deps (stdlib-only core + optional viz)
python -m pip install -e ".[viz]"
# 2. Fetch the PTX ISA / CUDA API reference. We do not redistribute NVIDIA's
# documentation; this script pulls it from a community markdown mirror.
bash knowledge/download-ptx-isa.sh
# 3. (Recommended) Lock GPU clocks for stable cycle measurements.
sudo nvidia-smi -lgc <base_clock_mhz>
You also need nvcc (CUDA toolkit) and, for SASS verification,
cuobjdump (ships with the toolkit). For NCU validation, install
nsight-compute and ensure ncu is on your PATH.
In this directory, start your coding agent of choice, e.g.:
claude # or `codex`, `cursor-agent`, etc.
Then paste the following prompt once; the agent runs autonomously afterwards:
Read program.md. You are the autonomous microbenchmarking agent it describes. Run python orchestrate.py --arch <sm_XX> --max-priority <P> to learn what parameter to measure next. Read current_parameter.json, read the matching playbook in knowledge/common_knowledges/, and consult knowledge/ptx_isa_markdown/cuda_skill/references/ptx-docs/ whenever you need PTX syntax or modifier details. Write benchmark.cu to measure the target, then run:
python harness.py --param <name> --unit <unit> --arch <sm_XX> \
    --check-sass <expected_sass> \
    --reasonable-min <reasonable_min> --reasonable-max <reasonable_max> \
    [--known-value <V>] [--ncu --ncu-metrics <metrics>]
If the result is PASS, call orchestrate.py --arch <sm_XX> --max-priority <P> again for the next parameter. If SASS_MISMATCH / HIGH_VARIANCE / KNOWN_VALUE_MISMATCH, diagnose and rewrite benchmark.cu (do not just retry). Loop forever. Do not ask me anything. When all catalog parameters within priority <P> are measured, run python report.py --format all --output report/ and tell me you are done.
Replace <sm_XX> with your target arch — see the table just below for the
right value (and the important a suffix on Hopper / Blackwell).
Replace <P> with how far down the priority list you want the agent to
go. The catalog is grouped by category, with a hard priority assigned per
category. Pick the smallest P that covers what you actually need —
later priorities are mostly nice-to-have or derivable from device
properties:
| P | Includes | Categories |
|---|---|---|
| 1 | core arithmetic & tensor cores | INSTRUCTION_THROUGHPUT, INSTRUCTION_LATENCY, TENSOR_CORE_THROUGHPUT, TENSOR_CORE_LATENCY, SPECIAL_FUNCTION |
| 2 | + on-chip fast memory | SHARED_MEMORY |
| 3 | + cache hierarchy, DRAM, register file | CACHE_HIERARCHY, MEMORY_BANDWIDTH, REGISTER_FILE |
| 4 | + scheduling & control flow | WARP_SCHEDULING, CONTROL_FLOW (these are usually derivable from cudaDeviceProp and don't need a microbenchmark) |
You can omit --max-priority to measure everything, or use
--max-priority 4 explicitly to be unambiguous in the agent prompt.
A typical first run is --max-priority 3: it gets you all the numbers
that matter for simulator calibration and skips the
trivial-but-tedious priority-4 parameters that you can fill in by reading
cudaDeviceProp later. The agent loops through the catalog
autonomously; expect a few hours of runtime end-to-end.
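To make the effect of --max-priority concrete, here is a minimal sketch of the category filter implied by the table above. The mapping is transcribed from that table; the names CATEGORY_PRIORITY and categories_within are hypothetical and are not taken from orchestrate.py, whose actual selection logic may differ.

```python
# Category -> priority, transcribed from the priority table above.
# The dict and function names are illustrative, not the framework's own.
CATEGORY_PRIORITY = {
    "INSTRUCTION_THROUGHPUT": 1,
    "INSTRUCTION_LATENCY": 1,
    "TENSOR_CORE_THROUGHPUT": 1,
    "TENSOR_CORE_LATENCY": 1,
    "SPECIAL_FUNCTION": 1,
    "SHARED_MEMORY": 2,
    "CACHE_HIERARCHY": 3,
    "MEMORY_BANDWIDTH": 3,
    "REGISTER_FILE": 3,
    "WARP_SCHEDULING": 4,
    "CONTROL_FLOW": 4,
}


def categories_within(max_priority=None):
    """Return the categories covered by a given --max-priority.

    None models omitting the flag, which measures everything.
    """
    return sorted(
        name
        for name, priority in CATEGORY_PRIORITY.items()
        if max_priority is None or priority <= max_priority
    )


if __name__ == "__main__":
    for p in (1, 2, 3, 4):
        print(p, categories_within(p))
```

Under this reading, --max-priority 3 covers nine of the eleven categories and leaves only WARP_SCHEDULING and CONTROL_FLOW unmeasured.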
The harness compiles benchmark.cu with nvcc -O3 -arch=<value> -lineinfo,
where <value> comes straight from the --arch flag you pass through.
Pick the value that matches your GPU:
| GPU family | Compute capability | Recommended --arch |
|---|---|---|
| Volta (V100) | 7.0 | sm_70 |
| Turing (T4, RTX 20xx) | 7.5 | sm_75 |
| Ampere (A100) | 8.0 | sm_80 |
| Ampere (A40, RTX 30xx) | 8.6 | sm_86 |
| Ada Lovelace (RTX 40xx, L40, L40S) | 8.9 | sm_89 |
| Hopper (H100, H200) | 9.0 | sm_90a (not bare sm_90) |
| Blackwell datacentre (B100, B200, GB200) | 10.0 | sm_100a |
| Blackwell datacentre (B300) | 10.3 | sm_103a |
| Blackwell consumer (RTX 50xx) | 12.0 | sm_120a |
Why the a suffix on Hopper and Blackwell. From sm_90 onwards,
NVIDIA gates generation-specific instructions — wgmma.mma_async,
tcgen05.mma, several TMA forms, certain cp.async.bulk variants — on
the architecture-specific target (the a suffix). A bare sm_90 /
sm_100 / sm_120 build is the cross-generation-portable variant and
will silently refuse to lower those PTX instructions; the most common
symptom is an unexpected SASS_MISMATCH from the harness even though your
inline PTX looks correct.
Rule of thumb:
- Use the a form (sm_90a, sm_100a, sm_103a, sm_120a) for all microbenchmarks targeting Hopper / Blackwell. The whole point of this framework is to exercise architecture-specific instructions, so portable builds are almost never what you want.
- For Volta / Turing / Ampere / Ada (sm_70–sm_89), the a suffix did not exist; the bare value is correct.
If you pass a bare sm_90 / sm_100 / sm_103 / sm_120, the harness
prints a warning at compile time so you don't silently spend hours
measuring fall-back code paths.
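The rule of thumb above can be expressed as a small check. This is a sketch under stated assumptions, not the harness's actual warning code: the function name bare_arch_warning, the regular expression, and the warning text are all hypothetical.

```python
import re

# Per the table above, compute capability 9.0 and later should use the
# architecture-specific "a" target; earlier families have no such suffix.
FIRST_SUFFIXED_MAJOR = 9


def bare_arch_warning(arch):
    """Return a warning for a bare Hopper/Blackwell --arch value, else None."""
    match = re.fullmatch(r"sm_(\d{2,3})([a-z]?)", arch)
    if match is None:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    number, suffix = int(match.group(1)), match.group(2)
    major = number // 10  # sm_86 -> 8, sm_103 -> 10, sm_120 -> 12
    if major >= FIRST_SUFFIXED_MAJOR and suffix == "":
        return f"{arch} is the portable target; consider {arch}a"
    return None


if __name__ == "__main__":
    for arch in ("sm_80", "sm_90", "sm_90a", "sm_120"):
        print(arch, "->", bare_arch_warning(arch))
```

Running a check like this before a long session is cheap insurance: a wrong --arch value invalidates every measurement that follows.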
In another terminal:
cd auto-microbench
watch -n 5 'python orchestrate.py --progress; echo; tail -10 results/measurements.tsv 2>/dev/null'
The progress bar fills up and TSV rows accumulate without further input from you.
To wipe every measurement artifact (TSV log, progress notes, generated reports, archived solved kernels, compiled binary) and start a clean run:
python reset.py # interactive — asks "type 'yes' to proceed"
python reset.py --yes # skip the prompt
python reset.py --dry-run # show what would be removed, exit without
# touching anything
reset.py deliberately leaves your source files (harness.py,
orchestrate.py, …), knowledge/, templates/, and benchmark.cu alone
— it only removes generated state. Pass --include-benchmark-cu if you
also want to drop the agent's last-edited workspace file.
The agent calls report.py, which produces:
- report/measurement_report.md: human-readable summary table
- report/measurements.json: programmatic export
- report/accelsim_hw_config.cfg: Accel-Sim config fragment
- report/gpgpusim_hw_config.cfg: GPGPU-Sim config fragment
Review the Markdown summary, then merge the simulator config into your Accel-Sim / GPGPU-Sim setup.
auto-microbench/
├── program.md agent operating instructions (the "playbook")
├── parameter_catalog.py what to measure: 34-entry catalog of HW parameters
├── benchmark.cu the ONE file the agent modifies (one parameter at a time)
├── harness.py fixed harness: compile, SASS-check, run N trials, NCU-validate, log
├── orchestrate.py pick next parameter, track coverage, write current_parameter.json
├── report.py aggregate results/measurements.tsv into Markdown / JSON / simulator configs
├── analyze.py progress.png + cv_per_param.png (matplotlib optional)
├── reset.py wipe all measurement state for a fresh run (human-only tool)
│
├── knowledge/ agent reference material
│ ├── README.md index pointing at the two layers below
│ ├── common_knowledges/ per-technique microbenchmarking playbooks
│ │ ├── instruction_throughput.md
│ │ ├── instruction_latency.md
│ │ ├── cache_hierarchy.md
│ │ ├── memory_bandwidth.md
│ │ ├── shared_memory.md
│ │ ├── register_file.md
│ │ ├── warp_scheduling.md
│ │ └── common_pitfalls.md
│ ├── download-ptx-isa.sh fetch the PTX/CUDA reference (gitignored target)
│ └── ptx_isa_markdown/ NVIDIA docs in grep-able markdown (NOT redistributed)
│ └── cuda_skill/references/
│ ├── ptx-docs/ PTX ISA 9.1 — 405 files (instruction set, special regs, …)
│ ├── ptx-isa.md navigation guide for the PTX docs
│ ├── cuda-runtime-docs/ CUDA Runtime API 13.1
│ ├── cuda-driver-docs/ CUDA Driver API 13.1
│ ├── ncu-guide.md Nsight Compute metric reference
│ ├── nsys-guide.md Nsight Systems patterns
│ ├── debugging-tools.md compute-sanitizer, cuda-gdb
│ └── performance-traps.md bank conflicts, divergence, …
│
├── templates/ starter benchmark.cu patterns the agent can copy from
│ ├── throughput_template.cu LANES × independent register chains, block-wide window
│ ├── latency_dep_chain.cu single-warp serial dependency chain
│ ├── cache_pointer_chase.cu randomized pointer chase, working-set sweep
│ ├── bandwidth_streaming.cu float4 streaming reads, cudaEvent-timed
│ └── shared_mem_bank.cu SMEM latency / bank-conflict probe
│
├── validators/ thin NCU-based ground-truth checks
│ ├── sass_checker.py wraps cuobjdump --dump-sass, used by harness.py --check-sass
│ ├── instruction_counter.py confirm pipe counters via NCU
│ ├── cache_hit_checker.py confirm intended cache level was hit
│ └── occupancy_checker.py confirm latency vs throughput occupancy targeting
│
├── configs/ simulator config exporters (called by report.py)
│ ├── accelsim_config.py
│ └── gpgpusim_config.py
│
├── solved_kernels/ per-parameter archive: the benchmark.cu that
│ produced each PASS / PASS_WITH_NOTE measurement,
│ saved automatically with a metadata header.
│ `<param_name>.cu`, latest passing version wins.
│
├── results/ runtime output (gitignored)
│ ├── measurements.tsv one row per harness run
│ ├── progress_notes.json deferred / annotated parameters (agent backlog)
│ └── analysis/ progress.png + cv_per_param.png
└── report/ generated reports + simulator configs (gitignored)
| File / dir | Owner | What it does |
|---|---|---|
| program.md | agent reads, never edits | The "playbook." Workflow, rules, decision matrix, hard constraints, knowledge-file index. |
| parameter_catalog.py | human edits to extend; agent reads | The catalog of parameters to measure. Each entry: name, category, unit, technique, priority, known value, NCU metrics, expected SASS. |
| benchmark.cu | agent rewrites every iteration | The single file the agent edits. One parameter per kernel. Must print RESULT: <value> to stdout. |
| harness.py | fixed | Compiles benchmark.cu, optionally runs cuobjdump --dump-sass for SASS verification (--check-sass), runs N trials, optionally runs ncu for counter validation, logs a row to results/measurements.tsv. |
| orchestrate.py | fixed | Picks the next pending parameter from the catalog (priority + dependency order), writes its spec to current_parameter.json, prints a progress summary. Supports --max-priority, --only, --reset, --progress. |
| report.py | fixed | Aggregates results/measurements.tsv into Markdown summary + JSON export + Accel-Sim config + GPGPU-Sim config. |
| analyze.py | fixed | Optional matplotlib charts: progress over time, CV per parameter. |
| knowledge/common_knowledges/*.md | agent reads | Methodology playbooks, one per technique. |
| knowledge/ptx_isa_markdown/ | agent greps | NVIDIA spec mirror; fetched on first checkout by download-ptx-isa.sh (not committed to this repo). |
| templates/*.cu | agent copies + modifies | Canonical starter kernels per technique. |
| validators/*.py | agent invokes ad-hoc | Standalone NCU/cuobjdump wrappers for fail-fast checks during debugging. |
| configs/*.py | called by report.py | Simulator-config exporters. |
| current_parameter.json | written by orchestrate.py, read by agent | The contract between scheduler and agent: the spec for the parameter currently being measured. |
| results/measurements.tsv | written by harness.py | Append-only log; one row per benchmark run. |
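The only contract between benchmark.cu and the harness named in the table above is the RESULT: <value> line on stdout. A minimal sketch of how such a line could be parsed follows; parse_result is a hypothetical name, and taking the last parseable line when several are printed is an assumption rather than documented harness behavior.

```python
def parse_result(stdout):
    """Return the float from the last parseable 'RESULT: <value>' line.

    Returns None when no such line is present. Choosing the last line
    is an assumption made for this sketch.
    """
    value = None
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("RESULT:"):
            continue
        try:
            value = float(line[len("RESULT:"):].strip())
        except ValueError:
            # Malformed payload (e.g. "RESULT: n/a"); ignore this line.
            continue
    return value


if __name__ == "__main__":
    sample = "warming up...\nRESULT: 4.02\n"
    print(parse_result(sample))
```

Keeping the contract to one plain-text line means a benchmark can print any amount of diagnostic output around it.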
orchestrate.py next → agent reads knowledge/common_knowledges/<technique>.md
→ greps knowledge/ptx_isa_markdown/.../ptx-docs/ for PTX syntax
→ agent rewrites benchmark.cu
→ python harness.py --param <name> --unit <u> --arch <a>
--check-sass <pat> [--ncu ...]
→ PASS / HIGH_VARIANCE / SASS_MISMATCH / KNOWN_VALUE_MISMATCH / ERROR
→ iterate or move on
→ python report.py
No MCP servers, no custom tools. Everything goes through the shell.
| Action | Command |
|---|---|
| Get next parameter to measure | python orchestrate.py --arch <sm_XX> |
| See coverage progress | python orchestrate.py --progress |
| Restrict to high-priority parameters only | python orchestrate.py --max-priority 1 |
| Iterate on one parameter | python orchestrate.py --only <name> |
| Reset a parameter for re-measurement | python orchestrate.py --reset <name> |
| Defer a stubborn parameter (with notes) | python orchestrate.py --defer <name> --reason "..." --next-idea "..." |
| Annotate without changing state | python orchestrate.py --annotate <name> --reason "..." |
| Bring a deferred parameter back | python orchestrate.py --retry <name> |
| View the deferred / annotated backlog | python orchestrate.py --backlog |
| Compile + SASS-check + run + NCU-validate | python harness.py --param <name> --unit <u> --arch <a> --check-sass <pat> [--known-value V] [--ncu --ncu-metrics ...] |
| Disable auto-archive of solved kernel | python harness.py … --no-archive |
| Override archive directory | python harness.py … --archive-dir <path> |
| Standalone SASS check | python -m validators.sass_checker --binary ./benchmark --require <pat> |
| Verify a specific NCU counter | python -m validators.instruction_counter --pipe fma --expected 5e9 |
| Verify cache hit rate | python -m validators.cache_hit_checker --target l1 --min-hit-rate 0.9 |
| Verify occupancy | python -m validators.occupancy_checker --max-pct 5 |
| Generate the final report | python report.py --format all --output report/ |
| Visualize progress | python analyze.py |
The agent edits exactly one file: benchmark.cu.
Every harness run appends one row to results/measurements.tsv:
| Column | Meaning |
|---|---|
| parameter | Name from parameter_catalog.py |
| value | Reported measurement (median across trials) |
| unit | e.g. instructions/cycle/SM, cycles, GB/s |
| mean, median, std_dev | Trial statistics |
| CV | Coefficient of variation (std/mean); target < 0.02 |
| CI_low, CI_high | 95 % confidence interval |
| trials | Number of valid trials |
| status | PASS, HIGH_VARIANCE, KNOWN_VALUE_MISMATCH, OUT_OF_RANGE, SASS_MISMATCH, COMPILE_ERROR, RUNTIME_ERROR, NO_DATA |
| ncu_validated | True if an NCU counter pass succeeded |
| sass_validated | True if --check-sass patterns were satisfied |
| notes, timestamp | Free-form |
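As an illustration of how the statistical columns relate to the raw trials, the sketch below computes them with the standard library. It is not the harness's code: the function name summarize is hypothetical, and the choices of sample standard deviation and a normal-approximation interval centered on the mean (z = 1.96) are assumptions, since the table does not say how the 95 % interval is derived.

```python
import math
import statistics


def summarize(trials, z=1.96):
    """Compute the per-run statistics listed in the TSV schema.

    Assumptions for this sketch: sample standard deviation (n - 1) and a
    normal-approximation 95 % interval around the mean.
    """
    n = len(trials)
    if n < 2:
        raise ValueError("need at least two trials to estimate spread")
    mean = statistics.fmean(trials)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    std_dev = statistics.stdev(trials)
    half_width = z * std_dev / math.sqrt(n)
    median = statistics.median(trials)
    return {
        "value": median,  # the reported measurement is the median
        "mean": mean,
        "median": median,
        "std_dev": std_dev,
        "CV": std_dev / mean,
        "CI_low": mean - half_width,
        "CI_high": mean + half_width,
        "trials": n,
    }


if __name__ == "__main__":
    for key, val in summarize([3.9, 4.0, 4.1]).items():
        print(f"{key}\t{val}")
```

With only a handful of trials, a Student-t multiplier would give a wider and more honest interval than 1.96.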
PASS requires CV < 5 % AND (if a known value was supplied) a measurement
within ±10 % of that value AND (if a reasonable_range is set on the
catalog entry) a measured value inside that range.
Each entry in parameter_catalog.py carries a reasonable_range = (low, high) — a wide bracket that any plausible measurement on any modern
NVIDIA GPU should fit inside. The agent passes this through to
harness.py via --reasonable-min / --reasonable-max. When the
measured value lands outside the range, the harness flags the run as
OUT_OF_RANGE and adds a "recheck methodology" note. Unlike
SASS_MISMATCH (hard fail before timing), OUT_OF_RANGE is a soft
warning — the run still gets logged, but the agent should treat it as a
red flag to redesign the kernel rather than accept the number.
Use the ranges as architecture-agnostic plausibility bounds; they are intentionally wide so they don't reject legitimate generation-to-generation variation. If you find yourself frequently tripping a range, the right fix is usually "the methodology is wrong," not "widen the range."
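The PASS rule and the range check described above can be sketched as one decision function. The thresholds (CV < 5 %, ±10 % of the known value) come from the text; the function name classify, its signature, and the order in which the checks are applied are assumptions, and the real harness also reports statuses (SASS_MISMATCH, COMPILE_ERROR, …) that are decided before any timing and are not modeled here.

```python
def classify(cv, value, known_value=None, reasonable_range=None,
             max_cv=0.05, known_tolerance=0.10):
    """Map one timed measurement to a status string.

    The check order (variance, then range, then known value) is an
    assumption of this sketch, not documented harness behavior.
    """
    if cv >= max_cv:
        return "HIGH_VARIANCE"
    if reasonable_range is not None:
        low, high = reasonable_range
        if not (low <= value <= high):
            return "OUT_OF_RANGE"
    if known_value is not None:
        if abs(value - known_value) > known_tolerance * abs(known_value):
            return "KNOWN_VALUE_MISMATCH"
    return "PASS"


if __name__ == "__main__":
    print(classify(0.01, 64.0, known_value=64, reasonable_range=(16.0, 512.0)))
    print(classify(0.01, 1000.0, reasonable_range=(16.0, 512.0)))
```

Checking the wide range before the known value reflects the idea that a wildly implausible number points to a methodology problem rather than a calibration one.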
Edit parameter_catalog.py:
PARAMETERS.append(Parameter(
    name="my_new_param",
    category=Category.INSTRUCTION_THROUGHPUT,
    description="...",
    unit="instructions/cycle/SM",
    measurement_technique=Technique.THROUGHPUT_LOOP,
    priority=2,
    known_value=64,
    ncu_validation_metrics=["smsp__inst_executed_pipe_fma.avg.per_cycle_active"],
    expected_sass=["FFMA"],
    reasonable_range=(16.0, 512.0),  # plausible bracket across all modern NVIDIA GPUs
))
Run python parameter_catalog.py after editing; the self-test will
flag any priority/range typos (e.g. a known_value that falls outside
its own reasonable_range).
Then python orchestrate.py picks it up automatically on the next call.
The agent does not need a new prompt.
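For a sense of what a catalog self-test of this kind might check, here is a minimal sketch. Entry is a hypothetical stand-in for the catalog's Parameter class with only the fields needed here, and self_test is an illustrative name; the real self-test in parameter_catalog.py may check more or different things.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    """Hypothetical, reduced stand-in for a catalog Parameter."""
    name: str
    priority: int
    known_value: Optional[float] = None
    reasonable_range: Optional[Tuple[float, float]] = None


def self_test(entries: List[Entry], valid_priorities=(1, 2, 3, 4)) -> List[str]:
    """Return a list of human-readable problems; empty means the catalog is clean."""
    problems = []
    for e in entries:
        if e.priority not in valid_priorities:
            problems.append(f"{e.name}: priority {e.priority} is not in {valid_priorities}")
        if e.reasonable_range is None:
            continue
        low, high = e.reasonable_range
        if low >= high:
            problems.append(f"{e.name}: reasonable_range {e.reasonable_range} is empty")
        elif e.known_value is not None and not (low <= e.known_value <= high):
            problems.append(
                f"{e.name}: known_value {e.known_value} outside {e.reasonable_range}"
            )
    return problems


if __name__ == "__main__":
    catalog = [
        Entry("my_new_param", priority=2, known_value=64, reasonable_range=(16.0, 512.0)),
        Entry("typo_param", priority=2, known_value=6400, reasonable_range=(16.0, 512.0)),
    ]
    for problem in self_test(catalog):
        print(problem)
```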
Today the catalog's known-value defaults are populated for Ampere (SM80).
For Hopper / Ada / Blackwell, copy parameter_catalog.py to
parameter_catalog_sm90.py (etc.), update the known values + expected
SASS where they differ, and load the new file in orchestrate.py. (A
factory-style refactor for cleaner multi-arch support is planned.)
- Single-file edit. Agent diffs are small and reverts are clean.
- SASS verification is mandatory. ptxas may fuse, lower, or emulate what you wrote in inline PTX. --check-sass runs cuobjdump --dump-sass before timing, so a wrong-instruction kernel never produces a logged measurement.
- Cycle-level timing by default. Throughput / latency / cache / shared-memory benchmarks read the GPU clock directly via inline PTX (mov.u64 %0, %%clock64;), bracketed with __syncthreads(). Wall-clock is only used where the metric is rate-based (DRAM bandwidth).
- NCU as ground truth. A measurement that "looks fine" but doesn't match counter expectations is wrong, period.
- TSV logging. Plain-text, git-friendly, trivially parseable.
- Confidence intervals over point estimates. "4.0 ± 0.1 cycles" is much more useful to a simulator developer than "4.0 cycles."
- Resumable. Re-running orchestrate.py picks up where you left off; parameters already in results/measurements.tsv are skipped.
- Defer + persistent notes. When the agent hits a wall on a parameter, it can --defer <name> to skip it without committing a row, and the reasoning is logged to results/progress_notes.json. The next session (or the next agent, on a smaller-context model) sees the prior notes the moment that parameter is dispensed again, so the trail of "what was tried, what to try next" survives across runs.
- Solved-kernel archive. On every PASS / PASS_WITH_NOTE, the harness auto-copies benchmark.cu to solved_kernels/<param>.cu with a metadata header (status, value, CV, CI, SASS / NCU verdict, arch, notes). After a full run you have a directory of working benchmarks, one per parameter, that you can commit, share, or use as starting templates for the next GPU.
- Provenance-respecting docs. NVIDIA's PTX/CUDA reference is fetched from upstream by an opt-in script, not redistributed in this repo.
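The "Resumable" point can be illustrated with a few lines of standard-library Python. This is a sketch, not orchestrate.py's implementation: it assumes the TSV starts with a header row containing a parameter column, treats any logged row as "already measured" exactly as the bullet states, and uses hypothetical names (measured_parameters, pending) and made-up parameter names in the demo.

```python
import csv
import io


def measured_parameters(tsv_text):
    """Return the set of parameter names that appear in the TSV log.

    Assumes a header row with a 'parameter' column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["parameter"] for row in reader}


def pending(catalog_names, tsv_text):
    """Return catalog entries with no logged row, preserving catalog order."""
    done = measured_parameters(tsv_text)
    return [name for name in catalog_names if name not in done]


if __name__ == "__main__":
    # Illustrative log contents; the parameter names are made up.
    log = "parameter\tvalue\tstatus\nexample_param_a\t64\tPASS\n"
    print(pending(["example_param_a", "example_param_b"], log))
```

Because the log is append-only plain text, resuming needs no extra state file beyond the TSV itself.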
Initial scaffold. Battle-testing on real GPUs (A100 / RTX 4090 / RTX 5090 / H100) and the multi-arch refactor are next. PRs / issues welcome.
MIT for code in this repository. The PTX ISA / CUDA API documentation
fetched by knowledge/download-ptx-isa.sh is © NVIDIA Corporation and
distributed via the upstream community mirror
technillogue/ptx-isa-markdown;
this repository neither contains nor redistributes those documents.