# AI Safety Benchmarks & Certification

This repository contains the pipeline and datasets required for evaluating frontier models on benchmarks developed by researchers at EuroSafeAI. By rigorously evaluating frontier models on a variety of benchmarks, we aim to reduce the systemic risk posed by AI and its applications.


| arXiv | Benchmark | File | Tasks |
| --- | --- | --- | --- |
| 2506.12758 | Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models | `democratic_authoritarian_bias.py` | `@fscale`, `@favscore`, `@rolemodel` |
| 2602.17433 | Preserving Historical Truth: Detecting Historical Revisionism in Large Language Models | `preserving_historical_truth.py` | `@no_push`, `@explicit_push` |
| 2510.04891 | SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests | `socialharmbench.py` | `@social_harm_bench` |
| 2603.04217 | When Do Language Models Endorse Limitations on Universal Human Rights Principles? | `llm_human_rights.py` | `@udhr`, `@udhr_individual`, `@udhr_government`, `@echr`, `@echr_individual`, `@echr_government` |

## Getting Started

This pipeline relies heavily on the AISI Inspect framework for tracking model performance, grading, and logging. You'll need an API key from a supported provider; a full list can be found here. Store this key as an environment variable.
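For example, if you are using OpenAI (the variable name depends on your provider; OpenRouter and Anthropic use `OPENROUTER_API_KEY` and `ANTHROPIC_API_KEY`, respectively):

```bash
# Set the key for your chosen provider before running the pipeline
export OPENAI_API_KEY=your-api-key
```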

The packages are managed by uv; instructions for installing it can be found here. After installing per the latest documentation, create your virtual environment with Python 3.10 and install the required packages:

```bash
uv venv --python 3.10
uv pip install -r requirements.txt
```

To run the certification pipeline using uv, use the following:

```bash
uv run certify.py \
  --model     {api-formatted model name for testing, e.g. openrouter/google/gemini-3-flash-preview} \
  --grader    {OPTIONAL: api-formatted model name for grading, e.g. openai/gpt-4o} \
  --name      {OPTIONAL: the name stored in models/models.json} \
  --provider  {OPTIONAL: the model provider, stored in models/models.json} \
  --region    {OPTIONAL: a description of the model's origin (e.g. US, Asia)} \
  --specialty {OPTIONAL: the model's primary task (e.g. coding, math)} \
  --epochs    {OPTIONAL: the number of epochs to run, default=1} \
  --rerun     {OPTIONAL: rerun results that are already present for the model}
```
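For instance, a hypothetical certification run might look like this (all model names and metadata values below are illustrative):

```bash
uv run certify.py \
  --model openrouter/google/gemini-3-flash-preview \
  --grader openai/gpt-4o \
  --name "Gemini 3 Flash Preview" \
  --provider Google \
  --region US \
  --specialty general \
  --epochs 1
```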

If a grader model is not specified with `--grader`, a group of models is used for LLM-as-a-judge grading, as specified in GRADERS.md.

All results are stored in models/models.json, which is automatically updated with new models and replaces entries for previously run models. By default, the script skips benchmarks that have already been processed; you can override this by passing the --rerun flag to certify.py. All logs are written to logs/{benchmark_name}; these can be used to access unreported metrics or other metadata about the samples.
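One way to pull out those unreported metrics is Inspect's log API. A minimal sketch, assuming the logs are standard Inspect eval logs (the file name below is a placeholder):

```python
# Sketch: load an Inspect eval log and print per-sample scores.
from inspect_ai.log import read_eval_log

log = read_eval_log("logs/socialharmbench/<log-file>.eval")  # placeholder path
print(log.eval.model, log.status)
for sample in log.samples or []:
    print(sample.id, sample.scores)
```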

You can also use any package manager of your choice (e.g. Anaconda); install the requirements without the uv prefix and execute the pipeline using python certify.py with the appropriate arguments.

To evaluate on individual benchmarks, you can use AISI Inspect's CLI: `uv run inspect eval scripts/evals/{file}.py@{task}`. Note that you will have to set certain parameters, such as the model to be evaluated; these can be found here.
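For example, to run the `@udhr` task on its own (the model name is illustrative):

```bash
uv run inspect eval scripts/evals/llm_human_rights.py@udhr --model openai/gpt-4o
```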

## Future Tasks / TODOs

In order of urgency:

1. Write the summarization/metric scripts to calculate overall model performance on benchmarks with multiple tasks (e.g. Democratic vs. Authoritarian Bias).
2. ~~Modify certify.py to allow the specification of individual tasks.~~ Not implementing.
3. Update scripts/README.md and benchmarks/README.md to outline how to incorporate new benchmarks and define the repository's structure.
4. Support locally run models as well as API models.
5. Generate private/held-out datasets.
6. Rename the scripts directory to something more fitting.
7. Connect the repo to ESAI's certificate page to automatically flag for updates when new models are run.
   - Will need a personal access token.
8. Implement majority voting for LLM-as-a-judge grading (a rough sketch appears after this list).
9. Combine the UDHR and ECHR datasets for the human rights limitations benchmark; currently benchmarking on UDHR only.
10. Use the UDHR and ECHR individual and government steering in evaluations.
11. Add a script to process results from logs in case of a crash. Crashes are mitigated with try/except, but logs are always stored and can be used as a fallback.
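As a rough illustration of item 8, majority voting over a pool of judges could look something like the sketch below; `grade_with_model` is a hypothetical stand-in for whatever per-judge grading call the pipeline adopts, not an existing function in this repo:

```python
# Hypothetical sketch of majority-vote LLM-as-a-judge grading.
from collections import Counter

def grade_with_model(judge: str, sample: str) -> str:
    """Stand-in for a per-judge grading call (e.g. an Inspect scorer)."""
    raise NotImplementedError  # query `judge` to grade `sample`

def majority_vote(sample: str, judges: list[str]) -> str:
    """Return the verdict that the most judges agree on."""
    verdicts = [grade_with_model(judge, sample) for judge in judges]
    return Counter(verdicts).most_common(1)[0][0]
```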
