The official repository for EigenBench: A Comparative Behavioral Measure of Value Alignment.
EigenBench is a black-box framework for quantifying value alignment across language models without relying on ground-truth labels. Given a model ensemble, a constitution describing a value system, and a scenario dataset, models judge each other’s responses in pairwise comparisons; these judgments are fit with a Bradley-Terry-Davidson (BTD) model and aggregated with EigenTrust into consensus alignment scores.
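To make the BTD step concrete, here is a minimal sketch of the textbook scalar Bradley-Terry-Davidson probabilities, in which a tie parameter `nu` lets a pairwise comparison end in a draw. The function name and the scalar `score_i`/`score_j` inputs are illustrative assumptions; the models in this repository (for example `VectorBTD`) may parameterize strengths differently.

```python
import math


def btd_probabilities(score_i, score_j, nu=1.0):
    """Return (P(i wins), P(tie), P(j wins)) for log-strengths score_i, score_j.

    Textbook Davidson tie extension of Bradley-Terry; nu >= 0 controls how
    likely ties are, and nu = 0 recovers plain Bradley-Terry.
    """
    p_i, p_j = math.exp(score_i), math.exp(score_j)
    tie_mass = nu * math.sqrt(p_i * p_j)
    total = p_i + p_j + tie_mass
    return p_i / total, tie_mass / total, p_j / total


win, tie, loss = btd_probabilities(0.5, -0.5, nu=0.5)
print(win > loss)  # the higher-scored response is favored
```

The three probabilities always sum to one, so a fit can maximize the log-likelihood of observed win/tie/loss judgments.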
- Install
- Quick Start
- Run Spec
- Spec Modes
- Bootstrap Resampling
- Outputs
- Repo Layout
- Datasets Used in the Paper
- ValueArena
- Citation
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Set API keys in .env:
- OPENROUTER_API_KEY: required for OpenRouter models
- HF_TOKEN: required for gated/private Hugging Face models (also reads from ~/.huggingface/token via huggingface-cli login)
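Before starting a long run it can help to check that the keys above are present. The helper below is a hypothetical sketch and not part of the repository; it assumes the .env file has already been loaded into the process environment and that local models are marked with the hf_local: prefix in the models dict.

```python
import os


def check_api_keys(models, env=None):
    """Return (required_missing, advisory) key names for a models dict.

    models maps display names to model ids; ids starting with "hf_local:"
    are treated as local Hugging Face models, everything else as OpenRouter.
    """
    env = os.environ if env is None else env
    ids = list(models.values())
    uses_local = any(i.startswith("hf_local:") for i in ids)
    uses_api = any(not i.startswith("hf_local:") for i in ids)

    required_missing = []
    if uses_api and not env.get("OPENROUTER_API_KEY"):
        required_missing.append("OPENROUTER_API_KEY")

    # HF_TOKEN only matters for gated/private models and may also come from
    # the huggingface-cli login token file, so it is reported as advisory.
    advisory = []
    if uses_local and not env.get("HF_TOKEN"):
        advisory.append("HF_TOKEN")
    return required_missing, advisory


print(check_api_keys({"Claude 4 Sonnet": "anthropic/claude-sonnet-4"}, env={}))
```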
- Create a run folder and copy the example spec.
mkdir -p runs/my_run
cp runs/example/spec.py runs/my_run/spec.py
- Edit runs/my_run/spec.py (required fields: models, dataset.path, constitution.path, constitution.num_criteria).
- Run:
Option A: Local (collect + train locally)
python scripts/run.py runs/my_run/spec.py
Option B: Cloud (collect locally, train + upload on ValueArena Space)
Add to your spec:
"upload": {
"enabled": True,
"name": "my-run",
"group": "",
"note": "optional note",
},
Then run:
export SPACE_SECRET="your-secret"
python scripts/run.py runs/my_run/spec.py
Collection runs locally, then the evaluations are sent to the Space which handles BTD training, bootstrap, EigenTrust, and upload to ValueArena in the background.
If you already have evaluations.jsonl, set collection.enabled=False to skip collection and just train+upload via the Space.
Mixed-model runs work out of the box: prefix local model paths with hf_local: in your spec. The pipeline auto-detects these entries and batches local models through vLLM while routing API models through OpenRouter.
Top-level keys in RUN_SPEC:
- models: {display_name: openrouter_model_id} or {display_name: hf_local:<hf_model_path>}
- dataset: scenario source and slicing.
- constitution: constitution file path and criterion count.
- collection: evaluation sampling/collection settings.
- training: BT/BTD training settings.
dataset keys:
- path: JSON file of scenarios.
- start: start offset (default 0).
- count: number of scenarios after start (omit for all remaining).
- shuffle: shuffle before slicing.
- shuffle_seed: reproducible shuffle seed.
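The slicing semantics described for the dataset keys can be sketched as follows. This is an illustration of the documented behavior (shuffle first, then take count items from start), not the repository's loader; the function name is hypothetical.

```python
import random


def slice_scenarios(scenarios, start=0, count=None, shuffle=False, shuffle_seed=None):
    """Optionally shuffle, then return `count` scenarios beginning at `start`.

    Omitting `count` returns everything after `start`.
    """
    items = list(scenarios)
    if shuffle:
        # A dedicated Random instance keeps the shuffle reproducible
        # without touching the global random state.
        random.Random(shuffle_seed).shuffle(items)
    end = None if count is None else start + count
    return items[start:end]


print(slice_scenarios(list(range(10)), start=2, count=3))  # [2, 3, 4]
```

Because the shuffle happens before slicing, two runs with the same shuffle_seed and different start offsets draw disjoint windows from the same permutation.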
constitution keys:
- path: constitution JSON file.
- num_criteria (required): hard cap used for collection + extraction.
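Putting the keys together, a minimal spec might look like the sketch below, with a small check for the required fields. The dataset path is a hypothetical placeholder and `missing_required_fields` is an illustrative helper, not a function provided by the pipeline.

```python
RUN_SPEC = {
    "models": {
        "Claude 4 Sonnet": "anthropic/claude-sonnet-4",
        "Qwen": "hf_local:Qwen/Qwen2.5-7B-Instruct",
    },
    # Hypothetical scenario file; point this at your own dataset.
    "dataset": {"path": "data/scenarios/example.json", "start": 0, "count": 50},
    "constitution": {"path": "data/constitutions/kindness.json", "num_criteria": 8},
    "collection": {"enabled": True},
    "training": {"enabled": True},
}

REQUIRED_FIELDS = [
    ("models",),
    ("dataset", "path"),
    ("constitution", "path"),
    ("constitution", "num_criteria"),
]


def missing_required_fields(spec):
    """Return dotted names of required fields that are absent from spec."""
    missing = []
    for field in REQUIRED_FIELDS:
        node = spec
        for key in field:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(field))
                break
            node = node[key]
    return missing


print(missing_required_fields(RUN_SPEC))  # []
```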
"collection": {
"enabled": True,
"cached_responses_path": "data/responses/main_cache.jsonl", # optional
},
"training": {
"enabled": True,
}
Behavior:
- If cached_responses_path is set, the cache stage runs first.
- Then evaluation collection runs.
- Then training/eigentrust runs.
"collection": {
"enabled": False,
"evaluations_path": "runs/my_run/evaluations.jsonl",
},
"constitution": {
"path": "data/constitutions/kindness.json",
"num_criteria": 8,
},
"training": {
"enabled": True,
}
Use this when you already have evaluation transcripts and only want BT/BTD + EigenTrust outputs.
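For readers unfamiliar with the EigenTrust stage, the sketch below shows the generic power iteration on a row-normalized trust matrix. It assumes a hypothetical matrix where `trust[i][j]` is how much model i trusts model j; how this repository derives that matrix from the fitted BT/BTD model is not shown here, and the published algorithm's pre-trust damping is omitted.

```python
def eigentrust(trust, iterations=1000, tol=1e-12):
    """Return consensus scores as the stationary distribution of `trust`."""
    n = len(trust)
    rows = []
    for row in trust:
        total = sum(row)
        # Rows with no outgoing trust fall back to a uniform distribution.
        rows.append([x / total for x in row] if total > 0 else [1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iterations):
        # One step of t <- C^T t: each model's score is the trust it
        # receives, weighted by the current score of whoever gives it.
        updated = [sum(scores[i] * rows[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


example = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
print([round(s, 3) for s in eigentrust(example)])
```

The result sums to one, so the values can be read as each model's share of the ensemble's consensus trust.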
"collection": {
"enabled": True,
},
"training": {
"enabled": False,
}
Use this to build/append evaluations.jsonl without running model fitting.
"collection": {
"enabled": False,
"cached_responses_path": "data/responses/main_cache.jsonl",
},
"training": {
"enabled": False,
}
Use this to precompute model responses for scenarios.
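The four modes above differ only in which flags are set. The sketch below infers the stage order from those descriptions; `planned_stages` and the stage names are hypothetical, a missing enabled flag is assumed to mean disabled, and the upload option is ignored.

```python
def planned_stages(spec):
    """Return the stages a spec would trigger, in execution order."""
    collection = spec.get("collection", {})
    training = spec.get("training", {})
    stages = []
    if collection.get("cached_responses_path"):
        stages.append("cache_responses")       # precompute model responses
    if collection.get("enabled", False):
        stages.append("collect_evaluations")   # build/append evaluations.jsonl
    if training.get("enabled", False):
        stages.append("train_and_eigentrust")  # BT/BTD fit + EigenTrust
    return stages


cache_only = {
    "collection": {"enabled": False, "cached_responses_path": "data/responses/main_cache.jsonl"},
    "training": {"enabled": False},
}
print(planned_stages(cache_only))  # ['cache_responses']
```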
Mix OpenRouter API models and local Hugging Face models in the same run. Local models are automatically batched through vLLM for efficient GPU inference, while API models are called through OpenRouter. Use hf_local: prefixes in your models dict:
"models": {
"Claude 4 Sonnet": "anthropic/claude-sonnet-4", # OpenRouter
"Qwen-sarcasm": "hf_local:maius/qwen-2.5-7b-it-personas/sarcasm", # lora
"Qwen": "hf_local:Qwen/Qwen2.5-7B-Instruct", # local
},
"collection": {
"enabled": True,
"sampler_mode": "random_judge_group", # or "all_to_all"
},
"training": {
"enabled": True,
}
The pipeline auto-detects hf_local: models and routes to the mixed collection path, which runs in 3 batched phases:
- Responses — all evaluee responses (OpenRouter sequential, vLLM batched)
- Reflections — all judge reflections (OpenRouter sequential, vLLM batched)
- Comparisons — all pairwise comparisons (OpenRouter sequential, vLLM batched)
This is significantly faster than one-at-a-time API-style calls for local models.
LoRA adapter syntax: hf_local:org/repo/subfolder — the subfolder is resolved as a LoRA adapter on the base model detected from adapter_config.json.
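The model id conventions above can be sketched as a small parser. `ModelRef` and `parse_model_id` are hypothetical names, and the sketch assumes Hugging Face repos are always of the form org/repo so that any remaining path is the adapter subfolder; resolving the base model from adapter_config.json is left out.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    backend: str                      # "openrouter" or "vllm"
    repo: str                         # OpenRouter id or Hugging Face org/repo
    adapter_subfolder: Optional[str]  # LoRA subfolder, if any


def parse_model_id(model_id):
    """Split a spec model id into backend, repo, and optional LoRA subfolder."""
    prefix = "hf_local:"
    if not model_id.startswith(prefix):
        return ModelRef("openrouter", model_id, None)
    parts = model_id[len(prefix):].split("/")
    repo = "/".join(parts[:2])
    subfolder = "/".join(parts[2:]) or None
    return ModelRef("vllm", repo, subfolder)


print(parse_model_id("hf_local:maius/qwen-2.5-7b-it-personas/sarcasm"))
```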
Use sampler_mode: "all_to_all" for exhaustive evaluation where every model judges every other model's response on every scenario:
"collection": {
"enabled": True,
"sampler_mode": "all_to_all",
},
"training": {
"enabled": True,
}
In all-to-all mode:
- Every model acts as a judge for every scenario
- Every model's response is evaluated by every judge
- Reflections are per-judge (each judge reflects independently on each response)
- All ordered pairs (eval1, eval2) are compared
This produces the most complete evaluation matrix but scales as O(scenarios × models² × models²).
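To get a feel for how quickly all-to-all grows, the sketch below tallies calls directly from the bullets above. It is a rough estimate, not the pipeline's own accounting: it assumes one response per model per scenario, one reflection per judge per response (including a judge's own response), and one comparison per judge per ordered pair of distinct responses.

```python
def all_to_all_counts(num_scenarios, num_models):
    """Estimate call counts for all-to-all mode under the stated assumptions."""
    responses = num_scenarios * num_models
    reflections = num_scenarios * num_models * num_models
    ordered_pairs = num_models * (num_models - 1)
    comparisons = num_scenarios * num_models * ordered_pairs
    return {
        "responses": responses,
        "reflections": reflections,
        "comparisons": comparisons,
    }


print(all_to_all_counts(10, 4))
# {'responses': 40, 'reflections': 160, 'comparisons': 480}
```

Under these assumptions, doubling the ensemble from 4 to 8 models raises the comparison count for 10 scenarios from 480 to 4480, which is why random_judge_group is the cheaper choice for larger ensembles.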
Bootstrap resampling adds error bars to EigenBench Elo scores by resampling comparisons and retraining the BT/BTD models.
"training": {
"bootstrap": {
"enabled": True,
"n_bootstraps": 100,
"random_seed": 42,
"save_models": False,
"save_trust_matrices": True,
},
}
Warning: Bootstrap only retrains the BT/BTD model. Run it locally on CPU to avoid wasting GPU compute time.
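The resampling idea can be illustrated with a much simpler statistic. The sketch below resamples comparisons with replacement and reports percentile intervals, but it substitutes a plain win rate for the BT/BTD refit that the pipeline performs; the (winner, loser) tuple format and both function names are assumptions made for illustration.

```python
import random


def win_rates(comparisons, models):
    """Fraction of comparisons each model won (stand-in for a BT/BTD refit)."""
    wins = {m: 0 for m in models}
    games = {m: 0 for m in models}
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {m: (wins[m] / games[m] if games[m] else 0.0) for m in models}


def bootstrap_intervals(comparisons, models, n_bootstraps=100, random_seed=42, alpha=0.05):
    """Percentile intervals from resampling comparisons with replacement."""
    rng = random.Random(random_seed)
    samples = {m: [] for m in models}
    for _ in range(n_bootstraps):
        resample = rng.choices(comparisons, k=len(comparisons))
        for model, rate in win_rates(resample, models).items():
            samples[model].append(rate)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        low = values[int((alpha / 2) * (len(values) - 1))]
        high = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (low, high)
    return intervals


data = [("A", "B")] * 8 + [("B", "A")] * 2
print(bootstrap_intervals(data, ["A", "B"]))
```

Fixing random_seed, as the spec does, makes the intervals reproducible across runs.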
Per run folder (runs/<run_name>/):
- evaluations.jsonl (if collection ran)
- btd_d<dim>/ folders (if training ran), containing:
  - training_loss.png
  - model.pt
  - eigentrust.txt
  - uv_embeddings_pca.png
  - eigenbench.png
  - log_train.txt
  - bootstrap/ (if bootstrap enabled):
    - samples.json
    - summary.json
    - bootstrap_elo.png
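A quick way to confirm that a training run completed is to check for the files listed above. The helper below is a hypothetical sketch that only uses the file names from this list and assumes training folders match the pattern btd_d*; it does not inspect file contents.

```python
from pathlib import Path

EXPECTED_TRAINING_FILES = [
    "training_loss.png",
    "model.pt",
    "eigentrust.txt",
    "uv_embeddings_pca.png",
    "eigenbench.png",
    "log_train.txt",
]


def missing_outputs(run_dir):
    """Map each btd_d<dim>/ folder in run_dir to the expected files it lacks."""
    report = {}
    for folder in sorted(Path(run_dir).glob("btd_d*")):
        if folder.is_dir():
            report[folder.name] = [
                name for name in EXPECTED_TRAINING_FILES
                if not (folder / name).exists()
            ]
    return report


print(missing_outputs("runs/my_run"))
```

An empty dict means no training folder was found; an empty list for a folder means all expected files are present.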
EigenBench/
├── pipeline/
│   ├── eval/  # collection orchestration + sampling
│   │   ├── collect.py  # OpenRouter-only collection
│   │   ├── mixed_collect.py  # mixed OpenRouter + vLLM collection (+ all-to-all)
│   │   ├── criteria_collectors.py  # prompt builders + single-group collection
│   │   ├── samplers.py  # judge/evaluee sampling strategies
│   │   └── flows.py  # response-only collection
│   ├── train/  # BT/BTD fitting + plots
│   │   ├── bt_models.py  # VectorBT, VectorBTD, CriteriaVectorBTD
│   │   ├── train.py  # training loop + utilities
│   │   └── plots.py  # embedding + Elo visualizations
│   ├── trust/  # trust matrix + EigenTrust
│   ├── utils/  # record IO + comparison extraction
│   ├── config/  # run-spec + dataset/constitution loaders
│   └── providers/  # model API calls (OpenRouter + vLLM)
├── scripts/
│   ├── run.py  # only user entrypoint
│   ├── run_collect.py  # internal: routes to mixed or OpenRouter-only collection
│   ├── run_collect_responses.py  # internal: response cache stage
│   ├── run_train.py  # internal: training stage
│   └── upload_results.py  # manual upload to ValueArena
├── notebooks/
│   ├── mixed_openrouter_local_collection.ipynb  # legacy notebook (now integrated into CLI)
│   └── bootstrap_resampling.ipynb  # bootstrap analysis
├── runs/
│   └── <run_name>/
│       ├── spec.py  # per-run config
│       ├── evaluations.jsonl  # collected judgments
│       └── btd_d<dim>/  # training outputs
└── data/
    ├── constitutions/  # committed constitutions
    ├── scenarios/  # local scenario datasets
    └── responses/  # shared cached responses
- AskReddit: https://www.kaggle.com/datasets/rodmcn/askreddit-questions-and-answers
- OpenAssistant: https://huggingface.co/datasets/OpenAssistant/oasst1
- AIRiskDilemmas (LitmusValues): https://huggingface.co/datasets/kellycyy/AIRiskDilemmas
Upload run results to the ValueArena leaderboard.
Add an upload section to your spec to automatically train and upload results to ValueArena after collection finishes. Training runs on the HF Space (free CPU), so no local GPU is needed.
"upload": {
"enabled": True,
"name": "oct/goodness", # run slug on ValueArena
"group": "oct", # optional grouping
"note": "LoRA-only (12 personas)", # shows in the table
},
Set the SPACE_SECRET env var (or upload.secret in spec) before running:
export SPACE_SECRET="your-secret"
python scripts/run.py runs/my_run/spec.py
When upload.enabled=True, local training is skipped. After collection, the evaluations and spec are sent to the Space which handles BTD training, bootstrap, EigenTrust, and upload to ValueArena in the background.
To upload existing results manually, use scripts/upload_results.py:
# Single run
python3 scripts/upload_results.py --name "my-run" --run-dir runs/my_run/ --note "optional note"
# Batch upload (all sub-runs in a folder)
python3 scripts/upload_results.py --batch-dir runs/matrix/ --name "matrix" --note "12 persona LoRAs"
- --name is the run slug on HF. For batch, it's the prefix (matrix → matrix/goodness, matrix/humor, etc.)
- --note shows in the table on the website
- Re-uploading with the same name overwrites the previous entry
- Git commit hash and scenario range are captured automatically
@misc{chang2025eigenbenchcomparativebehavioralmeasure,
title={EigenBench: A Comparative Behavioral Measure of Value Alignment},
author={Jonathn Chang and Leonhard Piff and Suvadip Sana and Jasmine X. Li and Lionel Levine},
year={2025},
eprint={2509.01938},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2509.01938},
}