"In clinical AI, ranking models by style preference is insufficient — safety, calibration, and guideline adherence must be measured explicitly."
| Attribute | Value |
|---|---|
| Status | Incubating |
| Maturity | Design Phase |
| License | Apache-2.0 |
| Part of | Evidence Commons |
| Mission Pillar | Pillar 1 (Clinical AI Evaluation & Benchmarking) |
Clinical Arena is a benchmark suite designed to evaluate clinical AI agents across seven explicitly defined dimensions: clinical accuracy, evidence grounding, safety awareness, uncertainty calibration, guideline adherence, reasoning transparency, and actionability. Evaluation uses structured clinical vignettes with expert-validated rubrics where safety-relevant failures carry asymmetric penalties. The ranking system uses Graph-Elo trust scores rather than raw Elo, reflecting the directed nature of clinical evidence hierarchies.
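As a rough illustration of how per-dimension rubric scores could be aggregated with an asymmetric safety penalty, the sketch below uses hypothetical dimension weights and a hypothetical penalty multiplier; none of these values or names come from the project's design documents.

```python
# Illustrative sketch only: weighted aggregation over the seven dimensions
# with an asymmetric penalty for safety-relevant shortfalls. The weights and
# SAFETY_PENALTY value are assumptions, not the project's published rubric.

DIMENSIONS = {
    "clinical_accuracy": 0.20,
    "evidence_grounding": 0.15,
    "safety_awareness": 0.20,
    "uncertainty_calibration": 0.10,
    "guideline_adherence": 0.15,
    "reasoning_transparency": 0.10,
    "actionability": 0.10,
}

SAFETY_PENALTY = 2.0  # assumed multiplier applied to safety shortfalls


def aggregate_score(rubric_scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted score.

    A shortfall on safety_awareness is penalised more heavily than shortfalls
    on other dimensions, mirroring the asymmetric-penalty idea above.
    """
    total = 0.0
    for dim, weight in DIMENSIONS.items():
        shortfall = 1.0 - rubric_scores[dim]
        if dim == "safety_awareness":
            shortfall *= SAFETY_PENALTY
        total += weight * max(0.0, 1.0 - shortfall)
    return total
```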
The parent codebase contains a benchmarking engine in evidenceos-research/evidenceos-bench. This repository is intended to hold the extracted, standalone evaluation framework suitable for independent use by clinical AI researchers and regulatory bodies. No standalone code has been extracted yet.
| Component | Description | Platform Status |
|---|---|---|
| Vignette Library | Structured clinical cases with expert-validated ground truth | Designed |
| 7-Dimension Scorer | Per-dimension rubric evaluation with weighted aggregation | Designed |
| Graph-Elo Ranker | Trust-score ranking adapted for clinical safety asymmetry | Designed |
| Model Adapter | Standardized interface for submitting AI agent responses | Designed |
| Report Generator | Per-model evaluation reports with dimension breakdowns | Designed |
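For orientation, a vignette record might carry the case presentation, expert-validated ground truth, and a per-dimension rubric. The field names below are assumptions for illustration; the actual schema is still being defined (see the roadmap).

```python
from dataclasses import dataclass, field


@dataclass
class Vignette:
    """Illustrative vignette record; field names are assumptions, not the
    project's schema, which has not yet been published."""

    vignette_id: str
    presentation: str        # structured clinical case text
    ground_truth: str        # expert-validated expected answer
    guideline_refs: list[str] = field(default_factory=list)   # cited guidelines
    rubric: dict[str, str] = field(default_factory=dict)      # dimension -> criterion
    safety_critical: bool = False   # flags cases that trigger asymmetric penalties
```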
What exists in the parent codebase:
- Benchmarking engine scaffolded in evidenceos-research/evidenceos-bench
- 7-dimension scoring rubric defined at the design level
What does not exist yet:
- Standalone extracted benchmark suite
- Validated vignette library with expert consensus on ground truth
- Graph-Elo ranking implementation
- Integration with BRIDGE-TBI for model evaluation feeds (INT-05, not built)
- Integration with RAIGH Academy for training case generation from failures (INT-06, not built)
- Test suite or CI pipeline
- Define the vignette schema and scoring API as language-agnostic JSON specifications
- Extract the benchmarking engine from evidenceos-research/evidenceos-bench into this repository
- Implement the Graph-Elo ranking algorithm with clinical safety weighting (an illustrative sketch follows this roadmap)
- Build model adapter interfaces for common inference APIs
- Validate scoring rubrics against expert panel consensus
- Publish v0.1.0 with a reference vignette set and scorer
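The Graph-Elo algorithm itself is not specified in this repository yet. As a rough placeholder for the roadmap item above, the sketch below shows a standard Elo-style pairwise update with a larger step for the losing model when a safety-relevant failure decided the comparison; the constants and the asymmetry rule are assumptions, not the project's algorithm.

```python
import math

BASE_K = 32.0          # assumed base update step
SAFETY_K_SCALE = 2.0   # assumed extra weight when a safety failure decided the outcome


def expected(r_a: float, r_b: float) -> float:
    """Standard logistic expectation used in Elo-style ratings."""
    return 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, outcome_a: float, safety_decisive: bool) -> tuple[float, float]:
    """Update two trust scores after one pairwise comparison.

    outcome_a is 1.0 if model A was preferred, 0.0 if model B, 0.5 for a tie.
    If the comparison turned on a safety-relevant failure, the losing side
    takes a larger step down (the asymmetry); the winner's gain stays at BASE_K.
    """
    e_a = expected(r_a, r_b)
    e_b = 1.0 - e_a
    k_penalty = BASE_K * (SAFETY_K_SCALE if safety_decisive else 1.0)
    delta_a = outcome_a - e_a
    delta_b = (1.0 - outcome_a) - e_b
    new_a = r_a + (k_penalty if delta_a < 0 else BASE_K) * delta_a
    new_b = r_b + (k_penalty if delta_b < 0 else BASE_K) * delta_b
    return new_a, new_b
```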
```mermaid
graph LR
    A[BRIDGE-TBI<br/>model versions] -->|INT-05| B[Clinical Arena]
    B -->|INT-06| C[RAIGH Academy<br/>training cases]
    D[evidenceos-bench<br/>scoring engine] --> B
    style B fill:#2A9D8F,stroke:#1E3A8A,color:#fff
```
Clinical Arena is part of the Lab-in-a-Box product family, providing model comparison infrastructure for researchers evaluating clinical AI systems. Arena evaluation results are designed to feed into BRIDGE-TBI deployment gates (INT-05) and RAIGH Academy case study libraries (INT-06).
Canonical source: evidenceos-research/evidenceos-bench
This project is not yet accepting contributions. The evaluation framework, vignette schema, and scoring API are still under active design. See CONTRIBUTING.md for future plans.
Apache-2.0 — see LICENSE for details.