aet — Agentic Eval Tool

Repo-agnostic research evaluation harness for UC Berkeley BAR / SLICE.

What it is

aet is a lightweight, repo-agnostic harness for running, tracking, and comparing research evaluation suites. It targets compiler, runtime, and agentic code-generation experiments: structured benchmarks where you want reproducible runs, artifact validation, and side-by-side method comparisons. TargetGen (MLIR dialect generation from hardware specs) is the first fully supported suite.

Quickstart

pip install aet
# or with optional deps:
pip install "aet[tracking]"

# Initialize a project:
aet init-project --template targetgen --project-root ./my-evals
cd my-evals

# Start a run:
aet init-run --suite targetgen --target gemmini --method v0_naive_claude --seed 1

# Validate:
aet validate runs/targetgen/2026-06-08_v0_naive_claude_seed001/

# Compare (writes summary.md, statistical_comparison.md, trajectory_similarity.md):
aet compare --suite targetgen

# Run a sweep:
aet run-suite --suite targetgen --target gemmini \
    --methods v0_naive_claude,v2_schema_generator --seeds 1,2,3

# List recorded runs:
aet runs --suite targetgen

# Inspect a single run:
aet show runs/targetgen/2026-06-08_v0_naive_claude_seed001/

# Set a performance baseline (picks best run by score, or pass --run-id):
aet baseline set --suite targetgen
aet baseline show --suite targetgen
# Once a baseline is set, each aet compare also writes regression_report.md,
# flagging runs with cost > 1.2× baseline or score < baseline − 0.05
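
The baseline and regression rules in the last step can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not aet's implementation: the `RunSummary` type, its field names, and both function names are hypothetical, and the sketch assumes the 1.2× cost limit is measured against the baseline run's cost.

```python
from dataclasses import dataclass

# Thresholds come from the Quickstart comment; all names below are hypothetical.
COST_RATIO_LIMIT = 1.2   # flag when cost exceeds 1.2x the baseline cost (assumed reading)
SCORE_DROP_LIMIT = 0.05  # flag when score falls below baseline score minus 0.05


@dataclass
class RunSummary:
    run_id: str
    score: float
    cost: float


def pick_baseline(runs):
    """Pick the best run by score, mirroring the default of `aet baseline set`."""
    return max(runs, key=lambda r: r.score)


def regression_flags(run, baseline):
    """Return which of the two regression rules a run trips, if any."""
    flags = []
    if run.cost > COST_RATIO_LIMIT * baseline.cost:
        flags.append("cost")
    if run.score < baseline.score - SCORE_DROP_LIMIT:
        flags.append("score")
    return flags
```

Under these assumptions, a run that costs 13.0 against a baseline cost of 10.0 would be flagged for cost, while a run scoring 0.88 against a baseline score of 0.90 would pass the score rule.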

Suites

Suite	Description	Use case
`default`	Generic pass/fail artifact evaluation	Any project; minimal config required
`targetgen`	MLIR dialect + lowering generation from HW specs	TargetGen / compiler dialect research

Tracking modes

Mode	What it does	When to use
local	JSON run manifests in `runs/`; always on	Every run, no setup needed
mlflow	Logs params, metrics, artifacts to an MLflow server	Experiment dashboards, sweep grids
OTel	Emits spans/metrics via OpenTelemetry SDK	Distributed tracing, CI pipelines
SigNoz	OTel-compatible viewer (receiver, not an SDK dep)	Self-hosted observability UI

Optional dependencies

Extra	Packages included	Use case
`[tracking]`	`mlflow`, `opentelemetry-sdk`	MLflow + OTel experiment tracking
`[ray]`	`ray[default]`	Parallel sweep execution via Ray
`[dev]`	`pytest`, `ruff`, `jsonschema`	Development and CI
`[all]`	All of the above	Full installation

Only pyyaml is required at install time.
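
To see which extras are usable in a given environment, a user-side check along the following lines may help. It is a sketch that relies only on the standard library. The `EXTRA_MODULES` mapping and both function names are hypothetical and not part of aet, and the import names are assumed from the packages in the table above (`opentelemetry.sdk` for `opentelemetry-sdk`).

```python
import importlib.util

# Import names assumed for the packages listed in the extras table.
EXTRA_MODULES = {
    "tracking": ["mlflow", "opentelemetry.sdk"],
    "ray": ["ray"],
}


def has_module(name):
    """Return True if `name` can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A dotted name raises ModuleNotFoundError when its parent is missing.
        return False


def available_extras():
    """Map each extra to True when all of its modules are installed."""
    return {
        extra: all(has_module(mod) for mod in mods)
        for extra, mods in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"[{extra}] {'available' if ok else 'missing'}")
```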

Project structure

After aet init-project --template targetgen --project-root ./my-evals:

my-evals/
  configs/          # Suite and target configs (YAML)
  datasets/         # Input specs and golden reference files
  methods/          # Method definitions (prompt templates, scripts)
  runs/             # Run artifacts and manifests (written by aet)
  reports/          # Comparison summaries (written by aet compare)
  observability/    # OTel collector config, SigNoz compose file
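
As a rough guide to what a sweep leaves under `runs/`, the sketch below expands a method list and a seed list into run directory paths. Two things here are assumptions rather than documented behavior: the naming pattern `<date>_<method>_seed<NNN>` is inferred from the single example path in the Quickstart (which does not include the target name), and the sweep is assumed to be the full cross product of methods and seeds. The function names are hypothetical.

```python
from datetime import date
from itertools import product
from pathlib import PurePosixPath


def run_dir(suite, method, seed, day):
    """Build a run path in the style of the Quickstart example.

    Pattern inferred from runs/targetgen/2026-06-08_v0_naive_claude_seed001/.
    """
    name = f"{day.isoformat()}_{method}_seed{seed:03d}"
    return PurePosixPath("runs") / suite / name


def expand_sweep(suite, methods, seeds, day):
    """List one run directory per (method, seed) pair (assumed cross product)."""
    return [run_dir(suite, m, s, day) for m, s in product(methods, seeds)]


if __name__ == "__main__":
    paths = expand_sweep(
        "targetgen",
        ["v0_naive_claude", "v2_schema_generator"],
        [1, 2, 3],
        date(2026, 6, 8),
    )
    for p in paths:
        print(p)
```

With two methods and three seeds, this yields six paths, one per run.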

Repo-agnostic design

aet ships as an installed package that provides the harness, validators, and CLI. It has no opinions about your compiler or model code. Project-specific data — golden files, method configs, datasets, and target specs — lives entirely in your project directory, initialized via aet init-project. This means the same aet install can drive evaluations across unrelated repositories without any per-repo monkey-patching.

License

Apache-2.0. Copyright UC Berkeley BAR / SLICE.
Future home: https://github.com/ucb-bar/agentic-eval-tool
