Auto-MicroBench

Agent-driven GPU microbenchmark framework for academic-simulator calibration.

Auto-MicroBench autonomously writes, runs, validates, and refines CUDA microbenchmarks to extract the hardware parameters that simulators like Accel-Sim and GPGPU-Sim need: instruction throughputs, latencies, cache sizes/latencies/bandwidths, shared-memory characteristics, warp scheduling parameters, tensor-core throughputs, and so on.

The design is a microbench-flavored cousin of AutoKernel: the agent edits a single CUDA file, a fixed harness compiles + runs + validates, and an orchestrator picks the next parameter from a catalog. The win condition is measurement accuracy (tight confidence intervals, SASS verification, NCU validation), not speed.

Quick Start

You launch a coding agent (Claude Code, Codex, Cursor, …) inside this directory and let it run. The agent does all the editing, compiling, benchmarking, and iteration — you just kick it off and check back later.

Repository at a glance

A handful of pieces you should know exist before you start:

program.md — the agent's operating manual. The kickoff prompt below tells the agent to read this first.
parameter_catalog.py — the list of hardware parameters to measure (currently 34 entries: FMA throughputs/latencies, cache sizes/latencies, DRAM bandwidth, shared memory, tensor cores, …). This is the spec.
benchmark.cu — the only file the agent rewrites, one parameter at a time.
harness.py — fixed evaluation harness. Compiles, runs N trials, verifies emitted SASS, optionally validates with Nsight Compute, logs to a TSV. The agent never edits this.
orchestrate.py — schedules the next parameter, tracks coverage.
knowledge/ — reference material the agent reads: common_knowledges/ (per-technique playbooks) plus ptx_isa_markdown/ (PTX ISA spec + CUDA API + ncu/nsys guides; fetched on first checkout, not redistributed).
templates/, validators/, configs/, report.py, analyze.py — starter kernels, NCU validators, simulator-config exporters, report generation. (Detailed in Repository Layout below.)

One-time setup (you do this once)

cd auto-microbench

# 1. Install Python deps (stdlib-only core + optional viz)
python -m pip install -e ".[viz]"   # quoted so zsh does not glob the brackets

# 2. Fetch the PTX ISA / CUDA API reference. We do not redistribute NVIDIA's
#    documentation; this script pulls it from a community markdown mirror.
bash knowledge/download-ptx-isa.sh

# 3. (Recommended) Lock GPU clocks for stable cycle measurements.
sudo nvidia-smi -lgc <base_clock_mhz>

You also need nvcc (CUDA toolkit) and, for SASS verification, cuobjdump (ships with the toolkit). For NCU validation, install nsight-compute and ensure ncu is on your PATH.

Launch the agent (you do this each session)

In this directory, start your coding agent of choice, e.g.:

claude        # or `codex`, `cursor-agent`, etc.

Then paste the following prompt once — the agent runs autonomously afterwards:

Read program.md. You are the autonomous microbenchmarking agent it describes. Run python orchestrate.py --arch <sm_XX> --max-priority <P> to learn what parameter to measure next. Read current_parameter.json, read the matching playbook in knowledge/common_knowledges/, and consult knowledge/ptx_isa_markdown/cuda_skill/references/ptx-docs/ whenever you need PTX syntax or modifier details. Write benchmark.cu to measure the target, then run:
python harness.py --param <name> --unit <unit> --arch <sm_XX> \
    --check-sass <expected_sass> \
    --reasonable-min <reasonable_min> --reasonable-max <reasonable_max> \
    [--known-value <V>] [--ncu --ncu-metrics <metrics>]
If the result is PASS, call orchestrate.py --arch <sm_XX> --max-priority <P> again for the next parameter. If SASS_MISMATCH / HIGH_VARIANCE / KNOWN_VALUE_MISMATCH, diagnose and rewrite benchmark.cu (do not just retry). Loop forever. Do not ask me anything. When all catalog parameters within priority <P> are measured, run python report.py --format all --output report/ and tell me you are done.

Replace <sm_XX> with your target arch — see the table just below for the right value (and the important a suffix on Hopper / Blackwell).

Replace <P> with how far down the priority list you want the agent to go. The catalog is grouped by category, with a hard priority assigned per category. Pick the smallest P that covers what you actually need — later priorities are mostly nice-to-have or derivable from device properties:

`P`	Includes	Categories
`1`	core arithmetic & tensor cores	INSTRUCTION_THROUGHPUT, INSTRUCTION_LATENCY, TENSOR_CORE_THROUGHPUT, TENSOR_CORE_LATENCY, SPECIAL_FUNCTION
`2`	+ on-chip fast memory	SHARED_MEMORY
`3`	+ cache hierarchy, DRAM, register file	CACHE_HIERARCHY, MEMORY_BANDWIDTH, REGISTER_FILE
`4`	+ scheduling & control flow	WARP_SCHEDULING, CONTROL_FLOW (these are usually derivable from `cudaDeviceProp` and don't need a microbenchmark)

You can omit --max-priority to measure everything, or use --max-priority 4 explicitly to be unambiguous in the agent prompt.

A typical first run is --max-priority 3: it covers the numbers that matter for simulator calibration and skips the trivial-but-tedious priority-4 parameters, which you can fill in later by reading cudaDeviceProp. The agent loops through the catalog autonomously; expect a few hours of runtime end-to-end.
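
As a rough illustration of how the cutoff works, the sketch below maps each category from the table to its priority and lists what a given P would cover. The dictionary and function names are hypothetical, not taken from parameter_catalog.py or orchestrate.py; only the category names and priorities come from the table above.

```python
# Hypothetical sketch: category -> priority, copied from the table above.
# The real catalog may store this differently.
CATEGORY_PRIORITY = {
    "INSTRUCTION_THROUGHPUT": 1,
    "INSTRUCTION_LATENCY": 1,
    "TENSOR_CORE_THROUGHPUT": 1,
    "TENSOR_CORE_LATENCY": 1,
    "SPECIAL_FUNCTION": 1,
    "SHARED_MEMORY": 2,
    "CACHE_HIERARCHY": 3,
    "MEMORY_BANDWIDTH": 3,
    "REGISTER_FILE": 3,
    "WARP_SCHEDULING": 4,
    "CONTROL_FLOW": 4,
}


def categories_within(max_priority=None):
    """Return the categories measured for a given --max-priority.

    None models omitting the flag, i.e. measure everything.
    """
    return sorted(
        name
        for name, priority in CATEGORY_PRIORITY.items()
        if max_priority is None or priority <= max_priority
    )


if __name__ == "__main__":
    for p in (1, 2, 3, 4):
        print(p, categories_within(p))
```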

GPU architecture target (the `--arch` flag)

The harness compiles benchmark.cu with nvcc -O3 -arch=<value> -lineinfo, where <value> comes straight from the --arch flag you pass through. Pick the value that matches your GPU:

GPU family	Compute capability	Recommended `--arch`
Volta (V100)	7.0	`sm_70`
Turing (T4, RTX 20xx)	7.5	`sm_75`
Ampere (A100)	8.0	`sm_80`
Ampere (A40, RTX 30xx)	8.6	`sm_86`
Ada Lovelace (RTX 40xx, L40, L40S)	8.9	`sm_89`
Hopper (H100, H200)	9.0	`sm_90a` (not bare `sm_90`)
Blackwell datacentre (B100, B200, GB200)	10.0	`sm_100a`
Blackwell datacentre (B300)	10.3	`sm_103a`
Blackwell consumer (RTX 50xx)	12.0	`sm_120a`

Why the a suffix on Hopper and Blackwell. From sm_90 onwards, NVIDIA gates generation-specific instructions (wgmma.mma_async, tcgen05.mma, several TMA forms, certain cp.async.bulk variants) on the architecture-specific target (the a suffix). A bare sm_90 / sm_100 / sm_120 build is the cross-generation-portable variant and cannot emit those instructions: ptxas typically rejects inline PTX that uses them, and feature-guarded code quietly takes a fallback path. The most common symptom is a compile error or an unexpected SASS_MISMATCH from the harness even though your inline PTX looks correct.

Rule of thumb:

Use the a form (sm_90a, sm_100a, sm_103a, sm_120a) for all microbenchmarks targeting Hopper / Blackwell. The whole point of this framework is to exercise architecture-specific instructions, so portable builds are almost never what you want.
For Volta / Turing / Ampere / Ada (sm_70–sm_89), there is no a suffix; the bare value is correct.

If you pass a bare sm_90 / sm_100 / sm_103 / sm_120, the harness prints a warning at compile time so you don't silently spend hours measuring fall-back code paths.
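
A check like the one described could look roughly like the following. This is a sketch, not the harness's actual code: the function name and the warning wording are invented here, and the set of affected targets is taken from the table above.

```python
# Bare targets that the table above says should carry the `a` suffix.
ARCH_SPECIFIC_REQUIRED = {"sm_90", "sm_100", "sm_103", "sm_120"}


def check_arch(arch):
    """Return a warning string for a bare Hopper/Blackwell target, else None.

    Hypothetical helper; harness.py may implement its warning differently.
    """
    if arch in ARCH_SPECIFIC_REQUIRED:
        return (
            f"warning: {arch} is the portable target; "
            f"use {arch}a to enable architecture-specific instructions"
        )
    return None


if __name__ == "__main__":
    for arch in ("sm_80", "sm_90", "sm_90a"):
        print(arch, "->", check_arch(arch))
```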

What you check while it runs

In another terminal:

cd auto-microbench
watch -n 5 'python orchestrate.py --progress; echo; tail -10 results/measurements.tsv 2>/dev/null'

The progress bar fills up and TSV rows accumulate without further input from you.
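
If you prefer a one-off summary to watch, the log can also be tallied in a few lines of Python. This sketch assumes the TSV starts with a header row containing the parameter and status columns described under The Result TSV; the function name is made up for this example.

```python
import csv
from collections import Counter


def summarize(lines):
    """Return (status counts, set of parameters that have a PASS row).

    `lines` is any iterable of TSV lines, e.g. an open file object.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    passed = set()
    for row in reader:
        counts[row["status"]] += 1
        if row["status"] == "PASS":
            passed.add(row["parameter"])
    return counts, passed
```

To use it, open results/measurements.tsv with newline="" and pass the file object to summarize.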

Starting fresh / resetting between runs

To wipe every measurement artifact (TSV log, progress notes, generated reports, archived solved kernels, compiled binary) and start a clean run:

python reset.py                  # interactive — asks "type 'yes' to proceed"
python reset.py --yes            # skip the prompt
python reset.py --dry-run        # show what would be removed, exit without
                                 # touching anything

reset.py deliberately leaves your source files (harness.py, orchestrate.py, …), knowledge/, templates/, and benchmark.cu alone — it only removes generated state. Pass --include-benchmark-cu if you also want to drop the agent's last-edited workspace file.

When it is done

The agent calls report.py, which produces:

report/measurement_report.md — human-readable summary table
report/measurements.json — programmatic export
report/accelsim_hw_config.cfg — Accel-Sim config fragment
report/gpgpusim_hw_config.cfg — GPGPU-Sim config fragment

Review the Markdown summary, then merge the simulator config into your Accel-Sim / GPGPU-Sim setup.

Repository Layout

auto-microbench/
├── program.md              agent operating instructions (the "playbook")
├── parameter_catalog.py    what to measure: 34-entry catalog of HW parameters
├── benchmark.cu            the ONE file the agent modifies (one parameter at a time)
├── harness.py              fixed harness: compile, SASS-check, run N trials, NCU-validate, log
├── orchestrate.py          pick next parameter, track coverage, write current_parameter.json
├── report.py               aggregate results/measurements.tsv into Markdown / JSON / simulator configs
├── analyze.py              progress.png + cv_per_param.png (matplotlib optional)
├── reset.py                wipe all measurement state for a fresh run (human-only tool)
│
├── knowledge/                       agent reference material
│   ├── README.md                    index pointing at the two layers below
│   ├── common_knowledges/           per-technique microbenchmarking playbooks
│   │   ├── instruction_throughput.md
│   │   ├── instruction_latency.md
│   │   ├── cache_hierarchy.md
│   │   ├── memory_bandwidth.md
│   │   ├── shared_memory.md
│   │   ├── register_file.md
│   │   ├── warp_scheduling.md
│   │   └── common_pitfalls.md
│   ├── download-ptx-isa.sh          fetch the PTX/CUDA reference (gitignored target)
│   └── ptx_isa_markdown/            NVIDIA docs in grep-able markdown (NOT redistributed)
│       └── cuda_skill/references/
│           ├── ptx-docs/            PTX ISA 9.1 — 405 files (instruction set, special regs, …)
│           ├── ptx-isa.md           navigation guide for the PTX docs
│           ├── cuda-runtime-docs/   CUDA Runtime API 13.1
│           ├── cuda-driver-docs/    CUDA Driver API 13.1
│           ├── ncu-guide.md         Nsight Compute metric reference
│           ├── nsys-guide.md        Nsight Systems patterns
│           ├── debugging-tools.md   compute-sanitizer, cuda-gdb
│           └── performance-traps.md bank conflicts, divergence, …
│
├── templates/              starter benchmark.cu patterns the agent can copy from
│   ├── throughput_template.cu       LANES × independent register chains, block-wide window
│   ├── latency_dep_chain.cu         single-warp serial dependency chain
│   ├── cache_pointer_chase.cu       randomized pointer chase, working-set sweep
│   ├── bandwidth_streaming.cu       float4 streaming reads, cudaEvent-timed
│   └── shared_mem_bank.cu           SMEM latency / bank-conflict probe
│
├── validators/             thin NCU-based ground-truth checks
│   ├── sass_checker.py     wraps cuobjdump --dump-sass, used by harness.py --check-sass
│   ├── instruction_counter.py       confirm pipe counters via NCU
│   ├── cache_hit_checker.py         confirm intended cache level was hit
│   └── occupancy_checker.py         confirm latency vs throughput occupancy targeting
│
├── configs/                simulator config exporters (called by report.py)
│   ├── accelsim_config.py
│   └── gpgpusim_config.py
│
├── solved_kernels/         per-parameter archive: the benchmark.cu that
│                           produced each PASS / PASS_WITH_NOTE measurement,
│                           saved automatically with a metadata header.
│                           `<param_name>.cu`, latest passing version wins.
│
├── results/                runtime output (gitignored)
│   ├── measurements.tsv             one row per harness run
│   ├── progress_notes.json          deferred / annotated parameters (agent backlog)
│   └── analysis/                    progress.png + cv_per_param.png
└── report/                 generated reports + simulator configs (gitignored)

File-by-file responsibility

File / dir	Owner	What it does
`program.md`	agent reads, never edits	The "playbook." Workflow, rules, decision matrix, hard constraints, knowledge-file index.
`parameter_catalog.py`	human edits to extend; agent reads	The catalog of parameters to measure. Each entry: name, category, unit, technique, priority, known value, NCU metrics, expected SASS.
`benchmark.cu`	agent rewrites every iteration	The single file the agent edits. One parameter per kernel. Must print `RESULT: <value>` to stdout.
`harness.py`	fixed	Compiles `benchmark.cu`, optionally runs `cuobjdump --dump-sass` for SASS verification (`--check-sass`), runs N trials, optionally runs `ncu` for counter validation, logs a row to `results/measurements.tsv`.
`orchestrate.py`	fixed	Picks the next pending parameter from the catalog (priority + dependency order), writes its spec to `current_parameter.json`, prints a progress summary. Supports `--max-priority`, `--only`, `--reset`, `--progress`, `--defer`, `--annotate`, `--retry`, `--backlog`.
`report.py`	fixed	Aggregates `results/measurements.tsv` into Markdown summary + JSON export + Accel-Sim config + GPGPU-Sim config.
`analyze.py`	fixed	Optional matplotlib charts: progress over time, CV per parameter.
`knowledge/common_knowledges/*.md`	agent reads	Methodology playbooks, one per technique.
`knowledge/ptx_isa_markdown/`	agent greps	NVIDIA spec mirror; fetched on first checkout by `download-ptx-isa.sh` (not committed to this repo).
`templates/*.cu`	agent copies + modifies	Canonical starter kernels per technique.
`validators/*.py`	agent invokes ad-hoc	Standalone NCU/cuobjdump wrappers for fail-fast checks during debugging.
`configs/*.py`	called by `report.py`	Simulator-config exporters.
`current_parameter.json`	written by `orchestrate.py`, read by agent	The contract between scheduler and agent: the spec for the parameter currently being measured.
`results/measurements.tsv`	written by `harness.py`	Append-only log; one row per benchmark run.
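
The RESULT: <value> line is the only output contract between benchmark.cu and the harness. One way a harness might extract it is sketched below; the regular expression and the function name are assumptions for illustration, not the code in harness.py.

```python
import re

# Matches a whole line such as "RESULT: 4.02" or "RESULT: 1.5e3".
RESULT_RE = re.compile(
    r"^RESULT:[ \t]*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)[ \t\r]*$",
    re.MULTILINE,
)


def parse_result(stdout):
    """Return the last RESULT value printed by the benchmark, or None."""
    matches = RESULT_RE.findall(stdout)
    return float(matches[-1]) if matches else None
```

Taking the last match is a design choice of this sketch: it lets a kernel print intermediate diagnostics without confusing the parser.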

The agent loop in one diagram

orchestrate.py next  →  agent reads knowledge/common_knowledges/<technique>.md
                    →  greps knowledge/ptx_isa_markdown/.../ptx-docs/ for PTX syntax
                    →  agent rewrites benchmark.cu
                    →  python harness.py --param <name> --unit <u> --arch <a>
                                          --check-sass <pat> [--ncu ...]
                    →  PASS / HIGH_VARIANCE / SASS_MISMATCH / KNOWN_VALUE_MISMATCH / ERROR
                    →  iterate or move on
                    →  python report.py
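
Read as code, the "iterate or move on" step amounts to a small dispatch on the harness status. The sketch below is illustrative only: the action strings are invented here, the handling of the error statuses and of PASS_WITH_NOTE is an assumption, and the real rules live in program.md.

```python
# Hypothetical mapping from harness status to the agent's next move.
# Status names come from this README; the actions paraphrase the kickoff prompt.
DONE = {"PASS", "PASS_WITH_NOTE"}  # assumed to count as success, as the archive rule suggests
REWRITE = {"SASS_MISMATCH", "HIGH_VARIANCE", "KNOWN_VALUE_MISMATCH", "OUT_OF_RANGE"}
FIX_BUILD = {"COMPILE_ERROR", "RUNTIME_ERROR", "NO_DATA"}  # assumption, not from the prompt


def next_action(status):
    """Return what the agent should do after one harness run."""
    if status in DONE:
        return "call orchestrate.py for the next parameter"
    if status in REWRITE:
        return "diagnose and rewrite benchmark.cu"
    if status in FIX_BUILD:
        return "fix benchmark.cu so it compiles, runs, and prints RESULT"
    raise ValueError(f"unknown status: {status!r}")
```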

Tools the agent calls (shell-only contract)

No MCP servers, no custom tools. Everything goes through the shell.

Action	Command
Get next parameter to measure	`python orchestrate.py --arch <sm_XX>`
See coverage progress	`python orchestrate.py --progress`
Restrict to high-priority parameters only	`python orchestrate.py --max-priority 1`
Iterate on one parameter	`python orchestrate.py --only <name>`
Reset a parameter for re-measurement	`python orchestrate.py --reset <name>`
Defer a stubborn parameter (with notes)	`python orchestrate.py --defer <name> --reason "..." --next-idea "..."`
Annotate without changing state	`python orchestrate.py --annotate <name> --reason "..."`
Bring a deferred parameter back	`python orchestrate.py --retry <name>`
View the deferred / annotated backlog	`python orchestrate.py --backlog`
Compile + SASS-check + run + NCU-validate	`python harness.py --param <name> --unit <u> --arch <a> --check-sass <pat> [--known-value V] [--ncu --ncu-metrics ...]`
Disable auto-archive of solved kernel	`python harness.py … --no-archive`
Override archive directory	`python harness.py … --archive-dir <path>`
Standalone SASS check	`python -m validators.sass_checker --binary ./benchmark --require <pat>`
Verify a specific NCU counter	`python -m validators.instruction_counter --pipe fma --expected 5e9`
Verify cache hit rate	`python -m validators.cache_hit_checker --target l1 --min-hit-rate 0.9`
Verify occupancy	`python -m validators.occupancy_checker --max-pct 5`
Generate the final report	`python report.py --format all --output report/`
Visualize progress	`python analyze.py`

The agent edits exactly one file: benchmark.cu.

The Result TSV

Every harness run appends one row to results/measurements.tsv:

Column	Meaning
`parameter`	Name from `parameter_catalog.py`
`value`	Reported measurement (median across trials)
`unit`	e.g. `instructions/cycle/SM`, `cycles`, `GB/s`
`mean`, `median`, `std_dev`	Trial statistics
`CV`	Coefficient of variation (std_dev / mean); target < 0.02, PASS requires < 0.05
`CI_low`, `CI_high`	95 % confidence interval
`trials`	Number of valid trials
`status`	`PASS`, `PASS_WITH_NOTE`, `HIGH_VARIANCE`, `KNOWN_VALUE_MISMATCH`, `OUT_OF_RANGE`, `SASS_MISMATCH`, `COMPILE_ERROR`, `RUNTIME_ERROR`, `NO_DATA`
`ncu_validated`	True if an NCU counter pass succeeded
`sass_validated`	True if `--check-sass` patterns were satisfied
`notes`, `timestamp`	Free-form

PASS requires all of the following: CV < 5 %; if a known value was supplied, the measured value is within ±10 % of it; and, if the catalog entry sets a reasonable_range, the measured value falls inside it.
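
To make the rule concrete, here is a minimal sketch of the statistics and the status decision. It is not the harness's implementation: the order of the checks, the use of the sample standard deviation, and the normal-approximation confidence interval are all assumptions of this example.

```python
import statistics


def summarize_trials(trials):
    """Return median, mean, sample std dev, CV, and a 95% CI for the mean."""
    mean = statistics.mean(trials)
    std = statistics.stdev(trials) if len(trials) > 1 else 0.0
    # Normal approximation (z = 1.96); the harness may use a different interval.
    half = 1.96 * std / len(trials) ** 0.5
    return {
        "median": statistics.median(trials),
        "mean": mean,
        "std_dev": std,
        "cv": std / mean if mean else float("inf"),
        "ci": (mean - half, mean + half),
    }


def classify(trials, known_value=None, reasonable_range=None):
    """Apply the PASS rule stated above to a list of trial values."""
    if not trials:
        return "NO_DATA"
    s = summarize_trials(trials)
    value = s["median"]  # the TSV reports the median as `value`
    if s["cv"] >= 0.05:
        return "HIGH_VARIANCE"
    if known_value is not None and abs(value - known_value) > 0.10 * abs(known_value):
        return "KNOWN_VALUE_MISMATCH"
    if reasonable_range is not None:
        low, high = reasonable_range
        if not (low <= value <= high):
            return "OUT_OF_RANGE"
    return "PASS"
```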

`OUT_OF_RANGE` is a soft warning

Each entry in parameter_catalog.py carries a reasonable_range = (low, high) — a wide bracket that any plausible measurement on any modern NVIDIA GPU should fit inside. The agent passes this through to harness.py via --reasonable-min / --reasonable-max. When the measured value lands outside the range, the harness flags the run as OUT_OF_RANGE and adds a "recheck methodology" note. Unlike SASS_MISMATCH (hard fail before timing), OUT_OF_RANGE is a soft warning — the run still gets logged, but the agent should treat it as a red flag to redesign the kernel rather than accept the number.

Use the ranges as architecture-agnostic plausibility bounds; they are intentionally wide so they don't reject legitimate generation-to-generation variation. If you find yourself frequently tripping a range, the right fix is usually "the methodology is wrong," not "widen the range."

Adding a New Parameter

Edit parameter_catalog.py:

PARAMETERS.append(Parameter(
    name="my_new_param",
    category=Category.INSTRUCTION_THROUGHPUT,
    description="...",
    unit="instructions/cycle/SM",
    measurement_technique=Technique.THROUGHPUT_LOOP,
    priority=1,                       # INSTRUCTION_THROUGHPUT is a priority-1 category
    known_value=64,
    ncu_validation_metrics=["smsp__inst_executed_pipe_fma.avg.per_cycle_active"],
    expected_sass=["FFMA"],
    reasonable_range=(16.0, 512.0),   # plausible bracket across all modern NVIDIA GPUs
))

Run python parameter_catalog.py after editing — the self-test will flag any priority/range typos (e.g. a known_value that falls outside its own reasonable_range).

Then python orchestrate.py picks it up automatically on the next call. The agent does not need a new prompt.
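
The kind of check the self-test performs can be sketched as follows. The Entry class is a cut-down, hypothetical stand-in for the catalog's Parameter class, and the exact rules in parameter_catalog.py may differ; the 1 to 4 priority range is taken from the priority table earlier in this README.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Entry:
    """Cut-down stand-in for the catalog's Parameter class (hypothetical)."""
    name: str
    priority: int
    known_value: Optional[float] = None
    reasonable_range: Optional[Tuple[float, float]] = None


def validate(entry):
    """Return a list of problems; an empty list means the entry looks sane."""
    problems = []
    if entry.priority not in (1, 2, 3, 4):
        problems.append(f"{entry.name}: priority {entry.priority} not in 1..4")
    if entry.reasonable_range is not None:
        low, high = entry.reasonable_range
        if low >= high:
            problems.append(f"{entry.name}: reasonable_range is empty or reversed")
        elif entry.known_value is not None and not (low <= entry.known_value <= high):
            problems.append(f"{entry.name}: known_value outside reasonable_range")
    return problems
```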

Adding a New Architecture

Today the catalog's known-value defaults are populated for Ampere (SM80). For Hopper / Ada / Blackwell, copy parameter_catalog.py to parameter_catalog_sm90.py (etc.), update the known values + expected SASS where they differ, and load the new file in orchestrate.py. (A factory-style refactor for cleaner multi-arch support is planned.)

Design Choices

Single-file edit. Agent diffs are small and reverts are clean.
SASS verification is mandatory. ptxas may fuse, lower, or emulate what you wrote in inline PTX. --check-sass runs cuobjdump --dump-sass before timing, so a wrong-instruction kernel never produces a logged measurement.
Cycle-level timing by default. Throughput / latency / cache / shared-memory benchmarks read the GPU clock directly via inline PTX (mov.u64 %0, %%clock64;), bracketed with __syncthreads(). Wall-clock is only used where the metric is rate-based (DRAM bandwidth).
NCU as ground truth. A measurement that "looks fine" but doesn't match counter expectations is wrong, period.
TSV logging. Plain-text, git-friendly, trivially parseable.
Confidence intervals over point estimates. "4.0 ± 0.1 cycles" is much more useful to a simulator developer than "4.0 cycles."
Resumable. Re-running orchestrate.py picks up where you left off — parameters already in results/measurements.tsv are skipped.
Defer + persistent notes. When the agent hits a wall on a parameter, it can --defer <name> to skip it without committing a row, and the reasoning is logged to results/progress_notes.json. The next session (or the next agent, on a smaller-context model) sees the prior notes the moment that parameter is dispensed again, so the trail of "what was tried, what to try next" survives across runs.
Solved-kernel archive. On every PASS / PASS_WITH_NOTE, the harness auto-copies benchmark.cu to solved_kernels/<param>.cu with a metadata header (status, value, CV, CI, SASS / NCU verdict, arch, notes). After a full run you have a directory of working benchmarks, one per parameter, that you can commit, share, or use as starting templates for the next GPU.
Provenance-respecting docs. NVIDIA's PTX/CUDA reference is fetched from upstream by an opt-in script, not redistributed in this repo.
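
To illustrate the SASS-verification choice above, the sketch below checks that every required mnemonic appears somewhere in a SASS dump. It assumes the input is the text printed by cuobjdump --dump-sass and that patterns are plain mnemonics matched on word boundaries; how --check-sass actually interprets its patterns is not specified here, and the function name is hypothetical.

```python
import re


def missing_sass(sass_text, required):
    """Return the required mnemonics that never appear in the SASS text.

    Word-boundary matching means "FFMA" matches "FFMA R5, ..." and "HMMA"
    matches "HMMA.16816.F32", while "FMA" does not match inside "FFMA".
    """
    return [
        name
        for name in required
        if not re.search(rf"\b{re.escape(name)}\b", sass_text)
    ]
```

An empty return value means the kernel contains every instruction it was supposed to; anything else is the list to report in a SASS_MISMATCH.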

Status

Initial scaffold. Battle-testing on real GPUs (A100 / RTX 4090 / RTX 5090 / H100) and the multi-arch refactor are next. PRs / issues welcome.

License

MIT for code in this repository. The PTX ISA / CUDA API documentation fetched by knowledge/download-ptx-isa.sh is © NVIDIA Corporation and distributed via the upstream community mirror technillogue/ptx-isa-markdown; this repository neither contains nor redistributes those documents.
