"In clinical AI, ranking models by style preference is insufficient — safety, calibration, and guideline adherence must be measured explicitly."
| Attribute | Value |
|---|---|
| Status | Incubating |
| Maturity | Design Phase |
| License | Apache-2.0 |
| Part of | Evidence Commons |
| Mission Pillar | Pillar 1 (Clinical AI Evaluation & Benchmarking) |
Clinical Arena is a benchmark suite designed to evaluate clinical AI agents across seven explicitly defined dimensions: clinical accuracy, evidence grounding, safety awareness, uncertainty calibration, guideline adherence, reasoning transparency, and actionability. Evaluation uses structured clinical vignettes with expert-validated rubrics where safety-relevant failures carry asymmetric penalties. The ranking system uses Graph-Elo trust scores rather than raw Elo, reflecting the directed nature of clinical evidence hierarchies.
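As a rough illustration of how per-dimension rubric scores could be aggregated with an asymmetric safety penalty, the sketch below uses hypothetical dimension weights and a hypothetical penalty multiplier; none of these values or names come from the project's design documents.

```python
# Illustrative sketch only: weighted aggregation over the seven dimensions
# with an asymmetric penalty for safety-relevant shortfalls. The weights and
# SAFETY_PENALTY value are assumptions, not the project's published rubric.

DIMENSIONS = {
    "clinical_accuracy": 0.20,
    "evidence_grounding": 0.15,
    "safety_awareness": 0.20,
    "uncertainty_calibration": 0.10,
    "guideline_adherence": 0.15,
    "reasoning_transparency": 0.10,
    "actionability": 0.10,
}

SAFETY_PENALTY = 2.0  # assumed multiplier applied to safety shortfalls


def aggregate_score(rubric_scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted score.

    A shortfall on safety_awareness is penalised more heavily than shortfalls
    on other dimensions, mirroring the asymmetric-penalty idea above.
    """
    total = 0.0
    for dim, weight in DIMENSIONS.items():
        shortfall = 1.0 - rubric_scores[dim]
        if dim == "safety_awareness":
            shortfall *= SAFETY_PENALTY
        total += weight * max(0.0, 1.0 - shortfall)
    return total
```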
The parent codebase contains a benchmarking engine in evidenceos-research/evidenceos-bench. This repository is intended to hold the extracted, standalone evaluation framework suitable for independent use by clinical AI researchers and regulatory bodies. No standalone code has been extracted yet.
| Component | Description | Platform Status |
|---|---|---|
| Vignette Library | Structured clinical cases with expert-validated ground truth | Designed |
| 7-Dimension Scorer | Per-dimension rubric evaluation with weighted aggregation | Designed |
| Graph-Elo Ranker | Trust-score ranking adapted for clinical safety asymmetry | Designed |
| Model Adapter | Standardized interface for submitting AI agent responses | Designed |
| Report Generator | Per-model evaluation reports with dimension breakdowns | Designed |
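For orientation, a vignette record might carry the case presentation, expert-validated ground truth, and a per-dimension rubric. The field names below are assumptions for illustration; the actual schema is still being defined (see the roadmap).

```python
from dataclasses import dataclass, field


@dataclass
class Vignette:
    """Illustrative vignette record; field names are assumptions, not the
    project's schema, which has not yet been published."""

    vignette_id: str
    presentation: str        # structured clinical case text
    ground_truth: str        # expert-validated expected answer
    guideline_refs: list[str] = field(default_factory=list)   # cited guidelines
    rubric: dict[str, str] = field(default_factory=dict)      # dimension -> criterion
    safety_critical: bool = False   # flags cases that trigger asymmetric penalties
```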
What exists in the parent codebase:
- Benchmarking engine scaffolded in evidenceos-research/evidenceos-bench
- 7-dimension scoring rubric defined at the design level
What does not exist yet:
- Standalone extracted benchmark suite
- Validated vignette library with expert consensus on ground truth
- Graph-Elo ranking implementation
- Integration with BRIDGE-TBI for model evaluation feeds (INT-05, not built)
- Integration with RAIGH Academy for training case generation from failures (INT-06, not built)
- Test suite or CI pipeline
- Define the vignette schema and scoring API as language-agnostic JSON specifications
- Extract the benchmarking engine from evidenceos-research/evidenceos-bench into this repository
- Implement the Graph-Elo ranking algorithm with clinical safety weighting (an illustrative sketch follows this roadmap)
- Build model adapter interfaces for common inference APIs
- Validate scoring rubrics against expert panel consensus
- Publish v0.1.0 with a reference vignette set and scorer
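The Graph-Elo algorithm itself is not specified in this repository yet. As a rough placeholder for the roadmap item above, the sketch below shows a standard Elo-style pairwise update with a larger step for the losing model when a safety-relevant failure decided the comparison; the constants and the asymmetry rule are assumptions, not the project's algorithm.

```python
import math

BASE_K = 32.0          # assumed base update step
SAFETY_K_SCALE = 2.0   # assumed extra weight when a safety failure decided the outcome


def expected(r_a: float, r_b: float) -> float:
    """Standard logistic expectation used in Elo-style ratings."""
    return 1.0 / (1.0 + math.pow(10.0, (r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, outcome_a: float, safety_decisive: bool) -> tuple[float, float]:
    """Update two trust scores after one pairwise comparison.

    outcome_a is 1.0 if model A was preferred, 0.0 if model B, 0.5 for a tie.
    If the comparison turned on a safety-relevant failure, the losing side
    takes a larger step down (the asymmetry); the winner's gain stays at BASE_K.
    """
    e_a = expected(r_a, r_b)
    e_b = 1.0 - e_a
    k_penalty = BASE_K * (SAFETY_K_SCALE if safety_decisive else 1.0)
    delta_a = outcome_a - e_a
    delta_b = (1.0 - outcome_a) - e_b
    new_a = r_a + (k_penalty if delta_a < 0 else BASE_K) * delta_a
    new_b = r_b + (k_penalty if delta_b < 0 else BASE_K) * delta_b
    return new_a, new_b
```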
```mermaid
graph LR
    A[BRIDGE-TBI<br/>model versions] -->|INT-05| B[Clinical Arena]
    B -->|INT-06| C[RAIGH Academy<br/>training cases]
    D[evidenceos-bench<br/>scoring engine] --> B
    style B fill:#2A9D8F,stroke:#1E3A8A,color:#fff
```
Clinical Arena is part of the Lab-in-a-Box product family, providing model comparison infrastructure for researchers evaluating clinical AI systems. Arena evaluation results are designed to feed into BRIDGE-TBI deployment gates (INT-05) and RAIGH Academy case study libraries (INT-06).
Canonical source: evidenceos-research/evidenceos-bench
This project is not yet accepting contributions. The evaluation framework, vignette schema, and scoring API are still under active design. See CONTRIBUTING.md for future plans.
Apache-2.0 — see LICENSE for details.