abdelfattah-lab/Auto-Microbench

Auto-MicroBench

Agent-driven GPU microbenchmark framework for academic-simulator calibration.

Auto-MicroBench autonomously writes, runs, validates, and refines CUDA microbenchmarks to extract the hardware parameters that simulators like Accel-Sim and GPGPU-Sim need: instruction throughputs, latencies, cache sizes/latencies/bandwidths, shared-memory characteristics, warp scheduling parameters, tensor-core throughputs, and so on.

The design is a microbench-flavored cousin of AutoKernel: the agent edits a single CUDA file, a fixed harness compiles + runs + validates, and an orchestrator picks the next parameter from a catalog. The win condition is measurement accuracy (tight confidence intervals, SASS verification, NCU validation), not speed.


Quick Start

You launch a coding agent (Claude Code, Codex, Cursor, …) inside this directory and let it run. The agent does all the editing, compiling, benchmarking, and iteration — you just kick it off and check back later.

Repository at a glance

A handful of pieces you should know exist before you start:

  • program.md — the agent's operating manual. The kickoff prompt below tells the agent to read this first.
  • parameter_catalog.py — the list of hardware parameters to measure (currently 34 entries: FMA throughputs/latencies, cache sizes/latencies, DRAM bandwidth, shared memory, tensor cores, …). This is the spec.
  • benchmark.cu — the only file the agent rewrites, one parameter at a time.
  • harness.py — fixed evaluation harness. Compiles, runs N trials, verifies emitted SASS, optionally validates with Nsight Compute, logs to a TSV. The agent never edits this.
  • orchestrate.py — schedules the next parameter, tracks coverage.
  • knowledge/ — reference material the agent reads: common_knowledges/ (per-technique playbooks) plus ptx_isa_markdown/ (PTX ISA spec + CUDA API + ncu/nsys guides; fetched on first checkout, not redistributed).
  • templates/, validators/, configs/, report.py, analyze.py — starter kernels, NCU validators, simulator-config exporters, report generation. (Detailed in Repository Layout below.)

One-time setup (you do this once)

cd auto-microbench

# 1. Install Python deps (stdlib-only core + optional viz)
python -m pip install -e ".[viz]"   # quoted so shells such as zsh do not glob the brackets

# 2. Fetch the PTX ISA / CUDA API reference. We do not redistribute NVIDIA's
#    documentation; this script pulls it from a community markdown mirror.
bash knowledge/download-ptx-isa.sh

# 3. (Recommended) Lock GPU clocks for stable cycle measurements.
sudo nvidia-smi -lgc <base_clock_mhz>

You also need nvcc (CUDA toolkit) and, for SASS verification, cuobjdump (ships with the toolkit). For NCU validation, install nsight-compute and ensure ncu is on your PATH.

Launch the agent (you do this each session)

In this directory, start your coding agent of choice, e.g.:

claude        # or `codex`, `cursor-agent`, etc.

Then paste the following prompt once — the agent runs autonomously afterwards:

Read program.md. You are the autonomous microbenchmarking agent it describes. Run python orchestrate.py --arch <sm_XX> --max-priority <P> to learn what parameter to measure next. Read current_parameter.json, read the matching playbook in knowledge/common_knowledges/, and consult knowledge/ptx_isa_markdown/cuda_skill/references/ptx-docs/ whenever you need PTX syntax or modifier details. Write benchmark.cu to measure the target, then run:

python harness.py --param <name> --unit <unit> --arch <sm_XX> \
    --check-sass <expected_sass> \
    --reasonable-min <reasonable_min> --reasonable-max <reasonable_max> \
    [--known-value <V>] [--ncu --ncu-metrics <metrics>]

If the result is PASS, call orchestrate.py --arch <sm_XX> --max-priority <P> again for the next parameter. If SASS_MISMATCH / HIGH_VARIANCE / KNOWN_VALUE_MISMATCH, diagnose and rewrite benchmark.cu (do not just retry). Loop forever. Do not ask me anything. When all catalog parameters within priority <P> are measured, run python report.py --format all --output report/ and tell me you are done.

Replace <sm_XX> with your target arch — see the table just below for the right value (and the important a suffix on Hopper / Blackwell).

Replace <P> with how far down the priority list you want the agent to go. The catalog is grouped by category, with a hard priority assigned per category. Pick the smallest P that covers what you actually need — later priorities are mostly nice-to-have or derivable from device properties:

| P | Includes | Categories |
|---|----------|------------|
| 1 | core arithmetic & tensor cores | INSTRUCTION_THROUGHPUT, INSTRUCTION_LATENCY, TENSOR_CORE_THROUGHPUT, TENSOR_CORE_LATENCY, SPECIAL_FUNCTION |
| 2 | + on-chip fast memory | SHARED_MEMORY |
| 3 | + cache hierarchy, DRAM, register file | CACHE_HIERARCHY, MEMORY_BANDWIDTH, REGISTER_FILE |
| 4 | + scheduling & control flow | WARP_SCHEDULING, CONTROL_FLOW (these are usually derivable from cudaDeviceProp and don't need a microbenchmark) |

You can omit --max-priority to measure everything, or use --max-priority 4 explicitly to be unambiguous in the agent prompt.
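The selection rule in the table can be sketched in a few lines of Python. This is an illustration only: the dictionary is transcribed from the table above, while the function name categories_within is hypothetical and is not claimed to match how orchestrate.py implements --max-priority.

```python
# Category -> priority, transcribed from the priority table above.
CATEGORY_PRIORITY = {
    "INSTRUCTION_THROUGHPUT": 1,
    "INSTRUCTION_LATENCY": 1,
    "TENSOR_CORE_THROUGHPUT": 1,
    "TENSOR_CORE_LATENCY": 1,
    "SPECIAL_FUNCTION": 1,
    "SHARED_MEMORY": 2,
    "CACHE_HIERARCHY": 3,
    "MEMORY_BANDWIDTH": 3,
    "REGISTER_FILE": 3,
    "WARP_SCHEDULING": 4,
    "CONTROL_FLOW": 4,
}


def categories_within(max_priority=None):
    """Return the categories covered by a given --max-priority value.

    Hypothetical helper. None models omitting the flag (measure everything).
    """
    if max_priority is None:
        return sorted(CATEGORY_PRIORITY)
    return sorted(c for c, p in CATEGORY_PRIORITY.items() if p <= max_priority)


if __name__ == "__main__":
    # Priorities are cumulative: P=2 includes everything from P=1.
    print(categories_within(2))
```

Because priorities are cumulative, omitting the flag and passing 4 select the same set in this sketch.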

A typical first run is --max-priority 3: it gets you all the numbers that matter for simulator calibration and skips the trivial-but-tedious priority-4 parameters, which you can fill in later by reading cudaDeviceProp. The agent loops through the catalog autonomously; expect a few hours of runtime end to end.

GPU architecture target (the --arch flag)

The harness compiles benchmark.cu with nvcc -O3 -arch=<value> -lineinfo, where <value> comes straight from the --arch flag you pass through. Pick the value that matches your GPU:

| GPU family | Compute capability | Recommended --arch |
|------------|--------------------|--------------------|
| Volta (V100) | 7.0 | sm_70 |
| Turing (T4, RTX 20xx) | 7.5 | sm_75 |
| Ampere (A100) | 8.0 | sm_80 |
| Ampere (A40, RTX 30xx) | 8.6 | sm_86 |
| Ada Lovelace (RTX 40xx, L40, L40S) | 8.9 | sm_89 |
| Hopper (H100, H200) | 9.0 | sm_90a (not bare sm_90) |
| Blackwell datacentre (B100, B200, GB200) | 10.0 | sm_100a |
| Blackwell datacentre (B300) | 10.3 | sm_103a |
| Blackwell consumer (RTX 50xx) | 12.0 | sm_120a |

Why the a suffix on Hopper and Blackwell. From sm_90 onwards, NVIDIA gates generation-specific instructions — wgmma.mma_async, tcgen05.mma, several TMA forms, certain cp.async.bulk variants — on the architecture-specific target (the a suffix). A bare sm_90 / sm_100 / sm_120 build is the cross-generation-portable variant and will silently refuse to lower those PTX instructions; the most common symptom is an unexpected SASS_MISMATCH from the harness even though your inline PTX looks correct.

Rule of thumb:

  • Use the a form (sm_90a, sm_100a, sm_103a, sm_120a) for all microbenchmarks targeting Hopper / Blackwell. The whole point of this framework is to exercise architecture-specific instructions, so portable builds are almost never what you want.
  • For Volta / Turing / Ampere / Ada (sm_70 through sm_89), the a suffix does not exist; the bare value is correct.

If you pass a bare sm_90 / sm_100 / sm_103 / sm_120, the harness prints a warning at compile time so you don't silently spend hours measuring fall-back code paths.
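The bare-target warning can be pictured as follows. This is a minimal sketch of the idea, not the harness's actual code: the function name arch_warning, the message wording, and the validation regex are all assumptions made for illustration. Only the list of bare targets comes from the text above.

```python
import re

# Bare targets for which the text above recommends the "a" form.
BARE_TARGETS_NEEDING_SUFFIX = {"sm_90", "sm_100", "sm_103", "sm_120"}


def arch_warning(arch):
    """Return a warning string for a bare Hopper/Blackwell target, else None.

    Hypothetical helper; the real harness may word or trigger this differently.
    """
    if not re.fullmatch(r"sm_\d+a?", arch):
        raise ValueError(f"unrecognised --arch value: {arch!r}")
    if arch in BARE_TARGETS_NEEDING_SUFFIX:
        return f"warning: {arch} is the portable target; consider {arch}a"
    return None


if __name__ == "__main__":
    for candidate in ("sm_80", "sm_90", "sm_90a"):
        print(candidate, "->", arch_warning(candidate))
```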

What you check while it runs

In another terminal:

cd auto-microbench
watch -n 5 'python orchestrate.py --progress; echo; tail -10 results/measurements.tsv 2>/dev/null'

The progress bar fills up and TSV rows accumulate without further input from you.
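If you want a per-status tally in addition to the tail of the log, something like the following would do. It is a sketch under one assumption: that results/measurements.tsv begins with a header row containing a status column, as the column list in the Result TSV section suggests. The helper name status_counts and the sample rows are placeholders, not output from a real run.

```python
import csv
import io
from collections import Counter


def status_counts(tsv_text):
    """Count rows per status in the text of a measurements TSV.

    Assumes a header row with a 'status' column (an assumption, see above).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["status"] for row in reader)


# Placeholder rows for illustration only; not real measurements.
SAMPLE = (
    "parameter\tvalue\tstatus\n"
    "param_a\t1.0\tPASS\n"
    "param_b\t1.0\tHIGH_VARIANCE\n"
    "param_b\t1.0\tPASS\n"
)

if __name__ == "__main__":
    for status, count in sorted(status_counts(SAMPLE).items()):
        print(f"{status}\t{count}")
```

To run it against a live log, read the file with open("results/measurements.tsv", newline="") and pass its contents in.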

Starting fresh / resetting between runs

To wipe every measurement artifact (TSV log, progress notes, generated reports, archived solved kernels, compiled binary) and start a clean run:

python reset.py                  # interactive — asks "type 'yes' to proceed"
python reset.py --yes            # skip the prompt
python reset.py --dry-run        # show what would be removed, exit without
                                 # touching anything

reset.py deliberately leaves your source files (harness.py, orchestrate.py, …), knowledge/, templates/, and benchmark.cu alone — it only removes generated state. Pass --include-benchmark-cu if you also want to drop the agent's last-edited workspace file.

When it is done

The agent calls report.py, which produces:

  • report/measurement_report.md — human-readable summary table
  • report/measurements.json — programmatic export
  • report/accelsim_hw_config.cfg — Accel-Sim config fragment
  • report/gpgpusim_hw_config.cfg — GPGPU-Sim config fragment

Review the Markdown summary, then merge the simulator config into your Accel-Sim / GPGPU-Sim setup.


Repository Layout

auto-microbench/
├── program.md              agent operating instructions (the "playbook")
├── parameter_catalog.py    what to measure: 34-entry catalog of HW parameters
├── benchmark.cu            the ONE file the agent modifies (one parameter at a time)
├── harness.py              fixed harness: compile, SASS-check, run N trials, NCU-validate, log
├── orchestrate.py          pick next parameter, track coverage, write current_parameter.json
├── report.py               aggregate results/measurements.tsv into Markdown / JSON / simulator configs
├── analyze.py              progress.png + cv_per_param.png (matplotlib optional)
├── reset.py                wipe all measurement state for a fresh run (human-only tool)
│
├── knowledge/                       agent reference material
│   ├── README.md                    index pointing at the two layers below
│   ├── common_knowledges/           per-technique microbenchmarking playbooks
│   │   ├── instruction_throughput.md
│   │   ├── instruction_latency.md
│   │   ├── cache_hierarchy.md
│   │   ├── memory_bandwidth.md
│   │   ├── shared_memory.md
│   │   ├── register_file.md
│   │   ├── warp_scheduling.md
│   │   └── common_pitfalls.md
│   ├── download-ptx-isa.sh          fetch the PTX/CUDA reference (gitignored target)
│   └── ptx_isa_markdown/            NVIDIA docs in grep-able markdown (NOT redistributed)
│       └── cuda_skill/references/
│           ├── ptx-docs/            PTX ISA 9.1 — 405 files (instruction set, special regs, …)
│           ├── ptx-isa.md           navigation guide for the PTX docs
│           ├── cuda-runtime-docs/   CUDA Runtime API 13.1
│           ├── cuda-driver-docs/    CUDA Driver API 13.1
│           ├── ncu-guide.md         Nsight Compute metric reference
│           ├── nsys-guide.md        Nsight Systems patterns
│           ├── debugging-tools.md   compute-sanitizer, cuda-gdb
│           └── performance-traps.md bank conflicts, divergence, …
│
├── templates/              starter benchmark.cu patterns the agent can copy from
│   ├── throughput_template.cu       LANES × independent register chains, block-wide window
│   ├── latency_dep_chain.cu         single-warp serial dependency chain
│   ├── cache_pointer_chase.cu       randomized pointer chase, working-set sweep
│   ├── bandwidth_streaming.cu       float4 streaming reads, cudaEvent-timed
│   └── shared_mem_bank.cu           SMEM latency / bank-conflict probe
│
├── validators/             thin NCU-based ground-truth checks
│   ├── sass_checker.py     wraps cuobjdump --dump-sass, used by harness.py --check-sass
│   ├── instruction_counter.py       confirm pipe counters via NCU
│   ├── cache_hit_checker.py         confirm intended cache level was hit
│   └── occupancy_checker.py         confirm latency vs throughput occupancy targeting
│
├── configs/                simulator config exporters (called by report.py)
│   ├── accelsim_config.py
│   └── gpgpusim_config.py
│
├── solved_kernels/         per-parameter archive: the benchmark.cu that
│                           produced each PASS / PASS_WITH_NOTE measurement,
│                           saved automatically with a metadata header.
│                           `<param_name>.cu`, latest passing version wins.
│
├── results/                runtime output (gitignored)
│   ├── measurements.tsv             one row per harness run
│   ├── progress_notes.json          deferred / annotated parameters (agent backlog)
│   └── analysis/                    progress.png + cv_per_param.png
└── report/                 generated reports + simulator configs (gitignored)

File-by-file responsibility

| File / dir | Owner | What it does |
|------------|-------|--------------|
| program.md | agent reads, never edits | The "playbook." Workflow, rules, decision matrix, hard constraints, knowledge-file index. |
| parameter_catalog.py | human edits to extend; agent reads | The catalog of parameters to measure. Each entry: name, category, unit, technique, priority, known value, NCU metrics, expected SASS. |
| benchmark.cu | agent rewrites every iteration | The single file the agent edits. One parameter per kernel. Must print RESULT: <value> to stdout. |
| harness.py | fixed | Compiles benchmark.cu, optionally runs cuobjdump --dump-sass for SASS verification (--check-sass), runs N trials, optionally runs ncu for counter validation, logs a row to results/measurements.tsv. |
| orchestrate.py | fixed | Picks the next pending parameter from the catalog (priority + dependency order), writes its spec to current_parameter.json, prints a progress summary. Supports --max-priority, --only, --reset, --progress. |
| report.py | fixed | Aggregates results/measurements.tsv into Markdown summary + JSON export + Accel-Sim config + GPGPU-Sim config. |
| analyze.py | fixed | Optional matplotlib charts: progress over time, CV per parameter. |
| knowledge/common_knowledges/*.md | agent reads | Methodology playbooks, one per technique. |
| knowledge/ptx_isa_markdown/ | agent greps | NVIDIA spec mirror; fetched on first checkout by download-ptx-isa.sh (not committed to this repo). |
| templates/*.cu | agent copies + modifies | Canonical starter kernels per technique. |
| validators/*.py | agent invokes ad-hoc | Standalone NCU/cuobjdump wrappers for fail-fast checks during debugging. |
| configs/*.py | called by report.py | Simulator-config exporters. |
| current_parameter.json | written by orchestrate.py, read by agent | The contract between scheduler and agent: the spec for the parameter currently being measured. |
| results/measurements.tsv | written by harness.py | Append-only log; one row per benchmark run. |
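The RESULT: <value> contract for benchmark.cu means a consumer only has to find one line in the program's stdout. A minimal sketch of such a parser is below. The function name parse_result is hypothetical, and the rule that exactly one RESULT line must be present is an assumption of this sketch, not a documented behaviour of harness.py.

```python
import re

# One 'RESULT: <value>' per line; tolerate spaces or tabs around the value.
RESULT_RE = re.compile(r"^RESULT:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def parse_result(stdout):
    """Extract the numeric value from a benchmark's stdout.

    Assumption of this sketch: exactly one RESULT line is expected per run.
    """
    matches = RESULT_RE.findall(stdout)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one RESULT line, found {len(matches)}")
    return float(matches[0])


if __name__ == "__main__":
    print(parse_result("warming up\nRESULT: 4.0\n"))
```

Rejecting output with zero or several RESULT lines keeps a chatty or half-finished kernel from being logged as a measurement.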

The agent loop in one diagram

orchestrate.py next  →  agent reads knowledge/common_knowledges/<technique>.md
                    →  greps knowledge/ptx_isa_markdown/.../ptx-docs/ for PTX syntax
                    →  agent rewrites benchmark.cu
                    →  python harness.py --param <name> --unit <u> --arch <a>
                                          --check-sass <pat> [--ncu ...]
                    →  PASS / HIGH_VARIANCE / SASS_MISMATCH / KNOWN_VALUE_MISMATCH / ERROR
                    →  iterate or move on
                    →  python report.py

Tools the agent calls (shell-only contract)

No MCP servers, no custom tools. Everything goes through the shell.

| Action | Command |
|--------|---------|
| Get next parameter to measure | python orchestrate.py --arch <sm_XX> |
| See coverage progress | python orchestrate.py --progress |
| Restrict to high-priority parameters only | python orchestrate.py --max-priority 1 |
| Iterate on one parameter | python orchestrate.py --only <name> |
| Reset a parameter for re-measurement | python orchestrate.py --reset <name> |
| Defer a stubborn parameter (with notes) | python orchestrate.py --defer <name> --reason "..." --next-idea "..." |
| Annotate without changing state | python orchestrate.py --annotate <name> --reason "..." |
| Bring a deferred parameter back | python orchestrate.py --retry <name> |
| View the deferred / annotated backlog | python orchestrate.py --backlog |
| Compile + SASS-check + run + NCU-validate | python harness.py --param <name> --unit <u> --arch <a> --check-sass <pat> [--known-value V] [--ncu --ncu-metrics ...] |
| Disable auto-archive of solved kernel | python harness.py … --no-archive |
| Override archive directory | python harness.py … --archive-dir <path> |
| Standalone SASS check | python -m validators.sass_checker --binary ./benchmark --require <pat> |
| Verify a specific NCU counter | python -m validators.instruction_counter --pipe fma --expected 5e9 |
| Verify cache hit rate | python -m validators.cache_hit_checker --target l1 --min-hit-rate 0.9 |
| Verify occupancy | python -m validators.occupancy_checker --max-pct 5 |
| Generate the final report | python report.py --format all --output report/ |
| Visualize progress | python analyze.py |

The agent edits exactly one file: benchmark.cu.


The Result TSV

Every harness run appends one row to results/measurements.tsv:

| Column | Meaning |
|--------|---------|
| parameter | Name from parameter_catalog.py |
| value | Reported measurement (median across trials) |
| unit | e.g. instructions/cycle/SM, cycles, GB/s |
| mean, median, std_dev | Trial statistics |
| CV | Coefficient of variation (std/mean); target < 0.02 |
| CI_low, CI_high | 95 % confidence interval |
| trials | Number of valid trials |
| status | PASS, HIGH_VARIANCE, KNOWN_VALUE_MISMATCH, OUT_OF_RANGE, SASS_MISMATCH, COMPILE_ERROR, RUNTIME_ERROR, NO_DATA |
| ncu_validated | True if an NCU counter pass succeeded |
| sass_validated | True if --check-sass patterns were satisfied |
| notes, timestamp | Free-form |
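The statistical columns can be derived from a list of per-trial values roughly as follows. This is a sketch, not the harness's code: the function name summarize is hypothetical, and the confidence interval uses a normal approximation (1.96 standard errors around the mean), which is an assumption. The source does not say which interval method harness.py applies.

```python
import math
import statistics


def summarize(trials):
    """Compute the per-run statistics named in the TSV column list.

    Assumption: the 95 % CI is mean +/- 1.96 * std_dev / sqrt(n).
    """
    n = len(trials)
    if n < 2:
        raise ValueError("need at least two trials for a standard deviation")
    mean = statistics.fmean(trials)
    std_dev = statistics.stdev(trials)  # sample standard deviation
    # CV = std/mean; undefined for a zero mean, reported here as infinity.
    cv = std_dev / abs(mean) if mean != 0 else math.inf
    half_width = 1.96 * std_dev / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(trials),
        "std_dev": std_dev,
        "CV": cv,
        "CI_low": mean - half_width,
        "CI_high": mean + half_width,
        "trials": n,
    }


if __name__ == "__main__":
    print(summarize([1.0, 2.0, 3.0]))
```

For small trial counts a Student-t interval would be wider than this normal approximation; which one is appropriate depends on how many trials a run collects.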

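PASS requires all of the following: CV < 5 % (the hard threshold; the 0.02 figure in the table above is the tighter value to aim for), agreement within ±10 % of the known value if one was supplied, and a measured value inside reasonable_range if the catalog entry sets one.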
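Expressed as code, the PASS rule looks something like the sketch below. The thresholds (5 % CV, ±10 % of the known value, the reasonable range) come from the sentence above. The function name classify and the order in which failures are reported are assumptions of this sketch; harness.py may check them in a different order.

```python
def classify(cv, value, known_value=None, reasonable_range=None):
    """Map a measurement to a status string using the stated PASS criteria.

    The precedence of the three checks is an assumption of this sketch.
    """
    if cv >= 0.05:
        return "HIGH_VARIANCE"
    if known_value is not None:
        # Within +/-10 % of the supplied known value.
        if abs(value - known_value) > 0.10 * abs(known_value):
            return "KNOWN_VALUE_MISMATCH"
    if reasonable_range is not None:
        low, high = reasonable_range
        if not (low <= value <= high):
            return "OUT_OF_RANGE"
    return "PASS"


if __name__ == "__main__":
    print(classify(0.01, 64.0, known_value=64, reasonable_range=(16.0, 512.0)))
```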

OUT_OF_RANGE is a soft warning

Each entry in parameter_catalog.py carries a reasonable_range = (low, high) — a wide bracket that any plausible measurement on any modern NVIDIA GPU should fit inside. The agent passes this through to harness.py via --reasonable-min / --reasonable-max. When the measured value lands outside the range, the harness flags the run as OUT_OF_RANGE and adds a "recheck methodology" note. Unlike SASS_MISMATCH (hard fail before timing), OUT_OF_RANGE is a soft warning — the run still gets logged, but the agent should treat it as a red flag to redesign the kernel rather than accept the number.

Use the ranges as architecture-agnostic plausibility bounds; they are intentionally wide so they don't reject legitimate generation-to-generation variation. If you find yourself frequently tripping a range, the right fix is usually "the methodology is wrong," not "widen the range."


Adding a New Parameter

Edit parameter_catalog.py:

PARAMETERS.append(Parameter(
    name="my_new_param",
    category=Category.INSTRUCTION_THROUGHPUT,
    description="...",
    unit="instructions/cycle/SM",
    measurement_technique=Technique.THROUGHPUT_LOOP,
    priority=2,
    known_value=64,
    ncu_validation_metrics=["smsp__inst_executed_pipe_fma.avg.per_cycle_active"],
    expected_sass=["FFMA"],
    reasonable_range=(16.0, 512.0),   # plausible bracket across all modern NVIDIA GPUs
))

Run python parameter_catalog.py after editing — the self-test will flag any priority/range typos (e.g. a known_value that falls outside its own reasonable_range).
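To give a feel for what such a self-test checks, here is a small stand-alone sketch. It does not reproduce parameter_catalog.py: the Entry dataclass is a hypothetical, cut-down stand-in for Parameter, and the assumption that valid priorities are 1 through 4 is inferred from the priority table in Quick Start.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    """Hypothetical, reduced stand-in for the catalog's Parameter type."""
    name: str
    priority: int
    known_value: Optional[float] = None
    reasonable_range: Optional[Tuple[float, float]] = None


def find_problems(entries: List[Entry]) -> List[str]:
    """Return human-readable complaints about priority or range typos."""
    problems = []
    for e in entries:
        if e.priority not in (1, 2, 3, 4):  # assumed valid priorities
            problems.append(f"{e.name}: priority {e.priority} not in 1..4")
        if e.reasonable_range is None:
            continue
        low, high = e.reasonable_range
        if low >= high:
            problems.append(f"{e.name}: empty range ({low}, {high})")
        elif e.known_value is not None and not (low <= e.known_value <= high):
            problems.append(
                f"{e.name}: known_value {e.known_value} outside ({low}, {high})"
            )
    return problems


if __name__ == "__main__":
    catalog = [
        Entry("my_new_param", 2, known_value=64, reasonable_range=(16.0, 512.0)),
        Entry("typo_param", 2, known_value=640, reasonable_range=(16.0, 512.0)),
    ]
    for line in find_problems(catalog):
        print(line)
```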

Then python orchestrate.py picks it up automatically on the next call. The agent does not need a new prompt.

Adding a New Architecture

Today the catalog's known-value defaults are populated for Ampere (SM80). For Hopper / Ada / Blackwell, copy parameter_catalog.py to parameter_catalog_sm90.py (etc.), update the known values + expected SASS where they differ, and load the new file in orchestrate.py. (A factory-style refactor for cleaner multi-arch support is planned.)


Design Choices

  • Single-file edit. Agent diffs are small and reverts are clean.
  • SASS verification is mandatory. ptxas may fuse, lower, or emulate what you wrote in inline PTX. --check-sass runs cuobjdump --dump-sass before timing, so a wrong-instruction kernel never produces a logged measurement.
  • Cycle-level timing by default. Throughput / latency / cache / shared-memory benchmarks read the GPU clock directly via inline PTX (mov.u64 %0, %%clock64;), bracketed with __syncthreads(). Wall-clock is only used where the metric is rate-based (DRAM bandwidth).
  • NCU as ground truth. A measurement that "looks fine" but doesn't match counter expectations is wrong, period.
  • TSV logging. Plain-text, git-friendly, trivially parseable.
  • Confidence intervals over point estimates. "4.0 ± 0.1 cycles" is much more useful to a simulator developer than "4.0 cycles."
  • Resumable. Re-running orchestrate.py picks up where you left off — parameters already in results/measurements.tsv are skipped.
  • Defer + persistent notes. When the agent hits a wall on a parameter, it can --defer <name> to skip it without committing a row, and the reasoning is logged to results/progress_notes.json. The next session (or the next agent, on a smaller-context model) sees the prior notes the moment that parameter is dispensed again, so the trail of "what was tried, what to try next" survives across runs.
  • Solved-kernel archive. On every PASS / PASS_WITH_NOTE, the harness auto-copies benchmark.cu to solved_kernels/<param>.cu with a metadata header (status, value, CV, CI, SASS / NCU verdict, arch, notes). After a full run you have a directory of working benchmarks, one per parameter, that you can commit, share, or use as starting templates for the next GPU.
  • Provenance-respecting docs. NVIDIA's PTX/CUDA reference is fetched from upstream by an opt-in script, not redistributed in this repo.
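The SASS-verification bullet above boils down to a text search over a disassembly listing. The sketch below shows that idea in isolation. It is not validators/sass_checker.py: the function name missing_sass, the whole-word matching rule, and the sample listing (hand-written to resemble a disassembly, not captured from cuobjdump) are all assumptions.

```python
import re


def missing_sass(sass_text, required):
    """Return the required opcode names that never appear in the listing.

    Matches on word boundaries, so 'FMA' is not satisfied by 'FFMA'
    (a choice made for this sketch; the real checker may differ).
    """
    return [
        pat
        for pat in required
        if not re.search(rf"\b{re.escape(pat)}\b", sass_text)
    ]


# Hand-written placeholder listing; not real tool output.
SAMPLE_LISTING = """
    FFMA R0, R1, R2, R3 ;
    IADD3 R4, R4, 0x1, RZ ;
"""

if __name__ == "__main__":
    absent = missing_sass(SAMPLE_LISTING, ["FFMA", "DFMA"])
    print("SASS_MISMATCH" if absent else "ok", absent)
```

Running the check before any timing is what keeps a kernel that compiled to the wrong instruction from ever producing a logged number.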

Status

Initial scaffold. Battle-testing on real GPUs (A100 / RTX 4090 / RTX 5090 / H100) and the multi-arch refactor are next. PRs / issues welcome.

License

MIT for code in this repository. The PTX ISA / CUDA API documentation fetched by knowledge/download-ptx-isa.sh is © NVIDIA Corporation and distributed via the upstream community mirror technillogue/ptx-isa-markdown; this repository neither contains nor redistributes those documents.

About

Code Implementation for OSCAR 2026 Workshop: Toward Sustainable GPU Modeling: Autonomous Micro-benchmarking with LLM Agents
