jchang153/EigenBench

EigenBench: A Comparative Behavioral Measure of Value Alignment

The official repository for EigenBench: A Comparative Behavioral Measure of Value Alignment.

EigenBench is a black-box framework for quantifying value alignment across language models without relying on ground-truth labels. Given a model ensemble, a constitution describing a value system, and a scenario dataset, models judge each other’s responses in pairwise comparisons; these judgments are fit with a Bradley-Terry-Davidson (BTD) model and aggregated with EigenTrust into consensus alignment scores.
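
As a rough illustration of the two building blocks named above, the sketch below shows the textbook scalar form of the Davidson tie model and a plain EigenTrust power iteration. It is not the repository's implementation: the function names, the scalar log-strength parameterization, and the trust-matrix convention (row i holds judge i's trust in each model) are assumptions made for the example, and the repo's own models (such as VectorBTD) are not reproduced here.

```python
import math


def davidson_probs(strength_i, strength_j, tie_log_weight=0.0):
    """Return (P(i wins), P(tie), P(j wins)) under the Davidson tie model.

    Strengths are on a log scale; tie_log_weight is log(nu), where nu
    scales the tie term nu * sqrt(p_i * p_j).
    """
    win_i = math.exp(strength_i)
    win_j = math.exp(strength_j)
    tie = math.exp(tie_log_weight + 0.5 * (strength_i + strength_j))
    total = win_i + win_j + tie
    return win_i / total, tie / total, win_j / total


def eigentrust(trust, max_iters=1000, tol=1e-10):
    """Power iteration for the stationary vector t = t @ T.

    trust[i][j] is the (non-negative) trust judge i places in model j.
    Rows are normalized to sum to 1; an all-zero row falls back to uniform.
    """
    n = len(trust)
    rows = []
    for row in trust:
        total = sum(row)
        rows.append([x / total for x in row] if total > 0 else [1.0 / n] * n)

    scores = [1.0 / n] * n  # start from uniform trust
    for _ in range(max_iters):
        updated = [sum(scores[i] * rows[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores
```

For example, `eigentrust([[2, 2, 0], [1, 2, 1], [0, 2, 2]])` gives the middle model twice the score of the other two, because both outer judges place half of their trust in it.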

[Figure: EigenBench pipeline]

Install

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Set API keys in .env:

  • OPENROUTER_API_KEY — required for OpenRouter models
  • HF_TOKEN — required for gated/private Hugging Face models (also reads from ~/.huggingface/token via huggingface-cli login)
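
Before starting a long collection run, it can help to confirm that the keys are actually present. The sketch below is a hypothetical pre-flight check, not part of the repository: `parse_env_text` and `missing_keys` are illustrative names, and the parser only handles simple `KEY=VALUE` lines rather than the full `.env` syntax.

```python
REQUIRED_KEYS = ("OPENROUTER_API_KEY",)  # HF_TOKEN is only needed for gated/private HF models


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key)]
```

A typical use would be `missing_keys({**parse_env_text(open(".env").read()), **os.environ})` after `import os`, which reports an empty list when everything is set.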

Quick Start

  1. Create a run folder and copy the example spec.
mkdir -p runs/my_run
cp runs/example/spec.py runs/my_run/spec.py
  2. Edit runs/my_run/spec.py (required fields: models, dataset.path, constitution.path, constitution.num_criteria).

  3. Run:

Option A: Local (collect + train locally)

python scripts/run.py runs/my_run/spec.py

Option B: Cloud (collect locally, train + upload on ValueArena Space)

Add to your spec:

"upload": {
    "enabled": True,
    "name": "my-run",
    "group": "",
    "note": "optional note",
},

Then run:

export SPACE_SECRET="your-secret"
python scripts/run.py runs/my_run/spec.py

Collection runs locally; the evaluations are then sent to the Space, which handles BTD training, bootstrap, EigenTrust, and the upload to ValueArena in the background.

If you already have evaluations.jsonl, set collection.enabled=False to skip collection and just train+upload via the Space.

Mixed-model runs work out of the box — just prefix local model paths with hf_local: in your spec. The pipeline auto-detects and batches local models through vLLM while routing API models through OpenRouter.

Run Spec

Top-level keys in RUN_SPEC:

  • models: {display_name: openrouter_model_id} or {display_name: hf_local:<hf_model_path>}
  • dataset: scenario source and slicing.
  • constitution: constitution file path and criterion count.
  • collection: evaluation sampling/collection settings.
  • training: BT/BTD training settings.
  • upload (optional): ValueArena upload settings (see Auto-upload via Space).

Dataset controls

  • path: JSON file of scenarios.
  • start: start offset (default 0).
  • count: number of scenarios after start (omit for all remaining).
  • shuffle: shuffle before slicing.
  • shuffle_seed: reproducible shuffle seed.
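
The dataset controls above can be read as "optionally shuffle, then slice". The sketch below illustrates that reading only; `select_scenarios` is a hypothetical helper, and the use of `random.Random(shuffle_seed)` is an assumption, so the repository's loader may order a shuffled dataset differently.

```python
import random


def select_scenarios(scenarios, start=0, count=None, shuffle=False, shuffle_seed=None):
    """Shuffle (optionally, with a fixed seed), then take `count` items from `start`."""
    items = list(scenarios)  # copy so the caller's list is not reordered
    if shuffle:
        random.Random(shuffle_seed).shuffle(items)
    end = None if count is None else start + count
    return items[start:end]
```

With `shuffle` left off, `select_scenarios(list(range(10)), start=2, count=3)` returns `[2, 3, 4]`, and omitting `count` returns everything from `start` onward.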

Constitution controls

  • path: constitution JSON file.
  • num_criteria (required): hard cap on the number of constitution criteria used during collection and comparison extraction.
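
Putting the keys above together, a minimal spec might look like the sketch below. The model IDs and constitution path are taken from examples elsewhere in this README; the scenario path, the `count` value, and the `missing_required` helper are hypothetical and only illustrate the required fields, so check runs/example/spec.py for the authoritative layout.

```python
RUN_SPEC = {
    "models": {
        "Claude 4 Sonnet": "anthropic/claude-sonnet-4",   # OpenRouter model ID
        "Qwen": "hf_local:Qwen/Qwen2.5-7B-Instruct",      # local Hugging Face model
    },
    "dataset": {
        "path": "data/scenarios/example.json",  # hypothetical file name
        "start": 0,
        "count": 50,
        "shuffle": True,
        "shuffle_seed": 0,
    },
    "constitution": {
        "path": "data/constitutions/kindness.json",
        "num_criteria": 8,
    },
    "collection": {"enabled": True},
    "training": {"enabled": True},
}

REQUIRED_FIELDS = (
    "models",
    "dataset.path",
    "constitution.path",
    "constitution.num_criteria",
)


def missing_required(spec, required=REQUIRED_FIELDS):
    """Return the dotted field names that are absent or empty in spec."""
    missing = []
    for dotted in required:
        node = spec
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node in (None, "", {}):
            missing.append(dotted)
    return missing
```

Calling `missing_required(RUN_SPEC)` on the spec above returns an empty list.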

Spec Modes

Spec Mode: Full Pipeline

"collection": {
    "enabled": True,
    "cached_responses_path": "data/responses/main_cache.jsonl",  # optional
},
"training": {
    "enabled": True,
}

Behavior:

  • If cached_responses_path is set, the cache stage runs first.
  • Then evaluation collection runs.
  • Then training and EigenTrust run.

Spec Mode: Train Only

"collection": {
    "enabled": False,
    "evaluations_path": "runs/my_run/evaluations.jsonl",
},
"constitution": {
    "path": "data/constitutions/kindness.json",
    "num_criteria": 8,
},
"training": {
    "enabled": True,
}

Use this when you already have evaluation transcripts and only want BT/BTD + EigenTrust outputs.

Spec Mode: Collect Only

"collection": {
    "enabled": True,
},
"training": {
    "enabled": False,
}

Use this to build/append evaluations.jsonl without running model fitting.

Spec Mode: Cache Only

"collection": {
    "enabled": False,
    "cached_responses_path": "data/responses/main_cache.jsonl",
},
"training": {
    "enabled": False,
}

Use this to precompute model responses for scenarios.
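
The four modes above differ only in three settings, which the sketch below maps to the stages that would run. This is an interpretation of the documented behavior, not the logic in scripts/run.py: `planned_stages` is a hypothetical helper, treating missing keys as disabled is an assumption, and the upload option is ignored.

```python
def planned_stages(spec):
    """Return the pipeline stages implied by the collection/training flags."""
    collection = spec.get("collection", {})
    training = spec.get("training", {})
    stages = []
    if collection.get("cached_responses_path"):
        stages.append("cache")      # precompute model responses
    if collection.get("enabled", False):
        stages.append("collect")    # gather pairwise evaluations
    if training.get("enabled", False):
        stages.append("train")      # BT/BTD fitting + EigenTrust
    return stages
```

For the Cache Only fragment above this returns `["cache"]`, and for Train Only it returns `["train"]`.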

Spec Mode: Mixed HF Local + OpenRouter

Mix OpenRouter API models and local Hugging Face models in the same run. Local models are automatically batched through vLLM for efficient GPU inference, while API models are called through OpenRouter. Use hf_local: prefixes in your models dict:

"models": {
    "Claude 4 Sonnet": "anthropic/claude-sonnet-4",                      # OpenRouter
    "Qwen-sarcasm": "hf_local:maius/qwen-2.5-7b-it-personas/sarcasm",     # local LoRA adapter
    "Qwen": "hf_local:Qwen/Qwen2.5-7B-Instruct",                       # local base model
},
"collection": {
    "enabled": True,
    "sampler_mode": "random_judge_group",  # or "all_to_all"
},
"training": {
    "enabled": True,
}

The pipeline auto-detects hf_local: models and routes to the mixed collection path, which runs in 3 batched phases:

  1. Responses — all evaluee responses (OpenRouter sequential, vLLM batched)
  2. Reflections — all judge reflections (OpenRouter sequential, vLLM batched)
  3. Comparisons — all pairwise comparisons (OpenRouter sequential, vLLM batched)

This is significantly faster than one-at-a-time API-style calls for local models.

LoRA adapter syntax: hf_local:org/repo/subfolder — the subfolder is resolved as a LoRA adapter on the base model detected from adapter_config.json.
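
To make the `hf_local:` convention concrete, the sketch below splits a model ID into a backend, a Hub repo, and an optional adapter subfolder. It is an illustration under assumptions: `ModelRef` and `parse_model_id` are hypothetical names, the repo is assumed to be exactly the first two path segments (org/repo), and the base-model lookup from adapter_config.json is not shown.

```python
from typing import NamedTuple, Optional

HF_LOCAL_PREFIX = "hf_local:"


class ModelRef(NamedTuple):
    backend: str                      # "openrouter" or "vllm"
    repo: str                         # OpenRouter model ID or HF org/repo
    adapter_subfolder: Optional[str]  # set only for LoRA adapters


def parse_model_id(model_id):
    """Classify a model ID from the spec's models dict."""
    if not model_id.startswith(HF_LOCAL_PREFIX):
        return ModelRef("openrouter", model_id, None)
    parts = model_id[len(HF_LOCAL_PREFIX):].split("/")
    repo = "/".join(parts[:2])
    subfolder = "/".join(parts[2:]) or None  # anything past org/repo
    return ModelRef("vllm", repo, subfolder)
```

For instance, `parse_model_id("hf_local:maius/qwen-2.5-7b-it-personas/sarcasm")` yields the repo `maius/qwen-2.5-7b-it-personas` with adapter subfolder `sarcasm`.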

Spec Mode: All-to-All Collection

Use sampler_mode: "all_to_all" for exhaustive evaluation where every model judges every other model's response on every scenario:

"collection": {
    "enabled": True,
    "sampler_mode": "all_to_all",
},
"training": {
    "enabled": True,
}

In all-to-all mode:

  • Every model acts as a judge for every scenario
  • Every model's response is evaluated by every judge
  • Reflections are per-judge (each judge reflects independently on each response)
  • All ordered pairs (eval1, eval2) are compared

This produces the most complete evaluation matrix, but the cost scales as O(scenarios × models² × models²).

Bootstrap Resampling

Adds error bars to EigenBench Elo scores by resampling comparisons and retraining BT/BTD models.

"training": {
    "bootstrap": {
        "enabled": True,
        "n_bootstraps": 100,
        "random_seed": 42,
        "save_models": False,
        "save_trust_matrices": True,
    },
}

Warning

Bootstrap only retrains the BT/BTD model. Run it locally on CPU to avoid wasting GPU compute time.
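
The idea behind the error bars can be sketched generically: resample the comparisons with replacement, recompute a statistic on each resample, and read off percentiles. In the sketch below, `bootstrap_interval` is a hypothetical helper, the percentile interval is one common choice rather than necessarily what the repo reports, and a simple win rate stands in for the BT/BTD refit that the actual pipeline performs on each resample.

```python
import random


def bootstrap_interval(comparisons, statistic, n_bootstraps=100, random_seed=42, alpha=0.05):
    """Percentile bootstrap interval for statistic(comparisons)."""
    rng = random.Random(random_seed)
    n = len(comparisons)
    estimates = []
    for _ in range(n_bootstraps):
        # Draw n comparisons with replacement and recompute the statistic.
        sample = [comparisons[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    lo_idx = int((alpha / 2) * (n_bootstraps - 1))
    hi_idx = int(round((1 - alpha / 2) * (n_bootstraps - 1)))
    return estimates[lo_idx], estimates[hi_idx]


def win_rate(sample):
    """Stand-in statistic: fraction of comparisons won (1 = win, 0 = loss)."""
    return sum(sample) / len(sample)
```

Because the generator is seeded, repeated calls such as `bootstrap_interval([1] * 70 + [0] * 30, win_rate)` return the same interval, which mirrors the purpose of the `random_seed` setting.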

Outputs

Per run folder (runs/<run_name>/):

  • evaluations.jsonl (if collection ran)
  • btd_d<dim>/ folders (if training ran), containing:
    • training_loss.png
    • model.pt
    • eigentrust.txt
    • uv_embeddings_pca.png
    • eigenbench.png
    • log_train.txt
    • bootstrap/ (if bootstrap enabled):
      • samples.json
      • summary.json
      • bootstrap_elo.png

Repo Layout

EigenBench/
├── pipeline/
│   ├── eval/          # collection orchestration + sampling
│   │   ├── collect.py             # OpenRouter-only collection
│   │   ├── mixed_collect.py       # mixed OpenRouter + vLLM collection (+ all-to-all)
│   │   ├── criteria_collectors.py # prompt builders + single-group collection
│   │   ├── samplers.py            # judge/evaluee sampling strategies
│   │   └── flows.py               # response-only collection
│   ├── train/         # BT/BTD fitting + plots
│   │   ├── bt_models.py           # VectorBT, VectorBTD, CriteriaVectorBTD
│   │   ├── train.py               # training loop + utilities
│   │   └── plots.py               # embedding + Elo visualizations
│   ├── trust/         # trust matrix + EigenTrust
│   ├── utils/         # record IO + comparison extraction
│   ├── config/        # run-spec + dataset/constitution loaders
│   └── providers/     # model API calls (OpenRouter + vLLM)
├── scripts/
│   ├── run.py                    # only user entrypoint
│   ├── run_collect.py            # internal: routes to mixed or OpenRouter-only collection
│   ├── run_collect_responses.py  # internal: response cache stage
│   ├── run_train.py              # internal: training stage
│   └── upload_results.py         # manual upload to ValueArena
├── notebooks/
│   ├── mixed_openrouter_local_collection.ipynb  # legacy notebook (now integrated into CLI)
│   └── bootstrap_resampling.ipynb               # bootstrap analysis
├── runs/
│   └── <run_name>/
│       ├── spec.py            # per-run config
│       ├── evaluations.jsonl  # collected judgments
│       └── btd_d<dim>/        # training outputs
└── data/
    ├── constitutions/         # committed constitutions
    ├── scenarios/             # local scenario datasets
    └── responses/             # shared cached responses

Datasets Used in the Paper

ValueArena

Upload run results to the ValueArena leaderboard.

Auto-upload via Space

Add an upload section to your spec to automatically train and upload results to ValueArena after collection finishes. Training runs on the HF Space (free CPU), so no local GPU is needed.

"upload": {
    "enabled": True,
    "name": "oct/goodness",       # run slug on ValueArena
    "group": "oct",               # optional grouping
    "note": "LoRA-only (12 personas)",  # shows in the table
},

Set the SPACE_SECRET env var (or upload.secret in spec) before running:

export SPACE_SECRET="your-secret"
python scripts/run.py runs/my_run/spec.py

When upload.enabled=True, local training is skipped. After collection, the evaluations and spec are sent to the Space, which handles BTD training, bootstrap, EigenTrust, and the upload to ValueArena in the background.

Manual upload

# Single run
python3 scripts/upload_results.py --name "my-run" --run-dir runs/my_run/ --note "optional note"

# Batch upload (all sub-runs in a folder)
python3 scripts/upload_results.py --batch-dir runs/matrix/ --name "matrix" --note "12 persona LoRAs"

  • --name is the run slug on HF. For batch uploads, it is the prefix (matrix becomes matrix/goodness, matrix/humor, etc.)
  • --note shows in the table on the website
  • Re-uploading with the same name overwrites the previous entry
  • Git commit hash and scenario range are captured automatically

Citation

@misc{chang2025eigenbenchcomparativebehavioralmeasure,
      title={EigenBench: A Comparative Behavioral Measure of Value Alignment},
      author={Jonathn Chang and Leonhard Piff and Suvadip Sana and Jasmine X. Li and Lionel Levine},
      year={2025},
      eprint={2509.01938},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.01938},
}
