pip install parley-bench # core deps only
pip install 'parley-bench[dev]' # + pytest / ruff / mypy
pip install 'parley-bench[whisper]' # + openai-whisper for the optional adapter

Requires Python 3.10+. The default install pulls only numpy, pydantic, pyyaml, typer, and rich; no heavy ML deps.
parley synth --out path.jsonl --episodes 32 # write a synthetic dataset
parley run config.yaml # run a benchmark suite
parley report runs/<name>/ --format markdown # re-render
parley list --kind metric # enumerate plugins
parley validate config.yaml # parse-check a config

parley --help shows the full surface; use parley <cmd> --help for any subcommand.
name: my-suite
seed: 0
dataset:
  source: synth # or 'file' with `path:` set
  episodes: 24
env:
  name: tabletop
pipelines:
  - name: codec+scripted
    speech: { name: codec } # see `parley list --kind speech`
    grounding: { name: rule_based }
    policy: { name: scripted }
perturbations:
  - name: noise5
    steps:
      - { name: additive_noise, params: { snr_db: 5 } }
metrics: [wer, grounding_f1, success_rate, latency]
runner:
  max_steps: 32
  workers: 1
  output_dir: runs/my-suite

The full example set lives under examples/; start with quickstart.yaml.
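To make the snr_db: 5 parameter concrete, here is a minimal sketch of additive noise at a target signal-to-noise ratio. It assumes white Gaussian noise scaled against the mean power of the clean signal; Parley's own additive_noise step may differ in detail, and the function names below are illustrative only.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise at roughly snr_db (illustrative helper)."""
    rng = random.Random(seed)  # local RNG so the result depends only on the seed
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR in dB is 10 * log10(signal_power / noise_power); solve for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR actually achieved, in dB."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(tone, snr_db=5)
print(f"achieved SNR: {measured_snr_db(tone, noisy):.2f} dB")
```

Lower snr_db means louder noise relative to the signal, so a 5 dB condition is considerably harsher than a 20 dB one.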
Given output_dir: runs/my-suite:
runs/my-suite/
├── report.json # versioned JSON, the canonical machine output
├── config.resolved.yaml # snapshot of the parsed config for reproducibility
└── traces/ # one JSON per (pipeline, perturbation, episode)
    ├── codec+scripted--clean--ep-00000.json
    ├── codec+scripted--noise5--ep-00000.json
    └── …
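If you post-process traces yourself, the file names can be split back into their parts. The sketch below assumes the pipeline--perturbation--ep-NNNNN.json pattern inferred from the example names above; it is not a documented contract, and parse_trace_name is a hypothetical helper, not part of Parley.

```python
from typing import NamedTuple


class TraceId(NamedTuple):
    pipeline: str
    perturbation: str
    episode: int


def parse_trace_name(filename: str) -> TraceId:
    """Split 'pipeline--perturbation--ep-NNNNN.json' into its three parts."""
    stem = filename.removesuffix(".json")
    # Split from the right so a pipeline name containing '--' stays intact.
    pipeline, perturbation, episode = stem.rsplit("--", 2)
    if not episode.startswith("ep-"):
        raise ValueError(f"unexpected episode field in {filename!r}")
    return TraceId(pipeline, perturbation, int(episode[len("ep-"):]))


print(parse_trace_name("codec+scripted--noise5--ep-00000.json"))
# TraceId(pipeline='codec+scripted', perturbation='noise5', episode=0)
```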
report.json carries the full set of per-cell summaries plus a
leaderboard. Re-rendering happens without re-running:
parley report runs/my-suite/ # markdown table + leaderboard
parley report runs/my-suite/ --format csv > x.csv
parley report runs/my-suite/ --format json | jq # raw schema

Everything the CLI does is a thin wrapper. See
examples/programmatic/custom_suite.py
for an end-to-end example that builds the config in code, uses the
snr_sweep() / codec_sweep() helpers, runs the engine, and prints the
sensitivity index + worst-group report on top of the standard table.
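For a feel of what building perturbations in code can look like, here is a small stand-in that expands a list of SNR values into entries shaped like the perturbations: block of the YAML config. It is not Parley's snr_sweep(), whose signature is not shown here; snr_perturbations and its naming scheme are assumptions for illustration.

```python
def snr_perturbations(snr_values_db):
    """Build one perturbation entry per SNR value (hypothetical helper)."""
    return [
        {
            "name": f"noise{snr:g}",  # e.g. 5 -> "noise5", 2.5 -> "noise2.5"
            "steps": [{"name": "additive_noise", "params": {"snr_db": snr}}],
        }
        for snr in snr_values_db
    ]


for entry in snr_perturbations([20, 10, 5]):
    print(entry["name"], entry["steps"][0]["params"])
```

The resulting list of plain dicts mirrors the YAML structure, so it can be serialized or merged into a config dict before validation.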
- Implement the relevant Protocol (parley.speech.base.SpeechFrontend, parley.policy.base.VLAPolicy, etc.). It just has to expose the right attributes; Parley uses runtime_checkable Protocols.
- Register it: @registry.speech.register("my_thing") at import time.
- Reference it from the YAML by name. Constructor kwargs go under params:.
That's it — no entry-point boilerplate, no factory classes. See
parley/speech/whisper_adapter.py for the smallest interesting example
(it's an import-lazy wrapper around openai-whisper).
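The pattern behind those three steps can be sketched without importing Parley at all. Everything below is a self-contained stand-in: the Registry class, the build() method, and the transcribe() method name are assumptions for illustration, not Parley's actual interfaces.

```python
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class SpeechFrontend(Protocol):
    """Stand-in protocol; the real one lives in parley.speech.base."""

    def transcribe(self, audio: list[float]) -> str: ...


class Registry:
    """Minimal name -> class registry with a decorator-based register()."""

    def __init__(self) -> None:
        self._classes: dict[str, type] = {}

    def register(self, name: str) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            if name in self._classes:
                raise KeyError(f"{name!r} is already registered")
            self._classes[name] = cls
            return cls  # hand the class back unchanged

        return decorator

    def build(self, name: str, **params) -> object:
        # params plays the role of the YAML `params:` mapping.
        return self._classes[name](**params)


speech = Registry()


@speech.register("my_thing")  # runs at import time, as in step 2
class MyThing:
    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def transcribe(self, audio: list[float]) -> str:
        return f"{self.prefix}{len(audio)} samples"


frontend = speech.build("my_thing", prefix="heard ")
print(isinstance(frontend, SpeechFrontend))  # True: structural check, no inheritance
print(frontend.transcribe([0.0, 0.1, 0.2]))  # heard 3 samples
```

Note that MyThing never subclasses SpeechFrontend. A runtime_checkable Protocol only verifies that the expected attributes exist, which is why no base class or factory is needed.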
The seed field on the config drives every stochastic component
(perturbation noise, env spawn jitter, random policy actions). Two runs
of the same YAML with the same seed produce byte-identical
report.json. CI runs with caching disabled to keep this property
honest; once you have a report.json you trust, the
config.resolved.yaml next to it is enough to reproduce.
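One way to check the byte-identical claim on your own machine is to hash the report.json from two runs of the same config. This is a minimal sketch using only the standard library; the run directory names in the usage comment are placeholders, and reports_match is a hypothetical helper rather than something Parley ships.

```python
import hashlib
from pathlib import Path


def file_digest(path) -> str:
    """SHA-256 of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def reports_match(run_a, run_b) -> bool:
    """True if both run directories hold byte-identical report.json files."""
    return file_digest(Path(run_a) / "report.json") == file_digest(
        Path(run_b) / "report.json"
    )


# Usage, assuming two output directories from the same YAML and seed:
#   reports_match("runs/my-suite", "runs/my-suite-rerun")
```

Comparing hashes rather than parsed JSON is deliberate: it also catches differences in key order or float formatting, which a byte-level guarantee rules out.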