# Clinical Arena — 7-Dimensional Clinical AI Evaluation


"In clinical AI, ranking models by style preference is insufficient — safety, calibration, and guideline adherence must be measured explicitly."

| Attribute | Value |
| --- | --- |
| Status | Incubating |
| Maturity | Design Phase |
| License | Apache-2.0 |
| Part of | Evidence Commons |
| Mission Pillar | Pillar 1 (Clinical AI Evaluation & Benchmarking) |

## Overview

Clinical Arena is a benchmark suite designed to evaluate clinical AI agents across seven explicitly defined dimensions: clinical accuracy, evidence grounding, safety awareness, uncertainty calibration, guideline adherence, reasoning transparency, and actionability. Evaluation uses structured clinical vignettes with expert-validated rubrics where safety-relevant failures carry asymmetric penalties. The ranking system uses Graph-Elo trust scores rather than raw Elo, reflecting the directed nature of clinical evidence hierarchies.
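As a rough illustration of how per-dimension aggregation with an asymmetric safety penalty could work, here is a minimal Python sketch. The seven dimension names come from the design above, but the weighting scheme, the `safety_penalty` factor, and all function names are placeholder assumptions, not the designed rubric:

```python
# Illustrative only: dimensions are from the design; weights and the
# asymmetric-penalty rule below are assumptions for exposition.
DIMENSIONS = [
    "clinical_accuracy",
    "evidence_grounding",
    "safety_awareness",
    "uncertainty_calibration",
    "guideline_adherence",
    "reasoning_transparency",
    "actionability",
]

def aggregate_score(
    rubric_scores: dict[str, float],  # per-dimension rubric scores in [0, 1]
    weights: dict[str, float],        # per-dimension weights summing to 1
    safety_failure: bool,
    safety_penalty: float = 0.5,      # ASSUMPTION: asymmetric penalty factor
) -> float:
    """Weighted aggregation with an extra penalty for safety-relevant failures."""
    base = sum(weights[d] * rubric_scores[d] for d in DIMENSIONS)
    # A safety failure is penalized beyond what its dimension weight alone
    # would subtract, reflecting the asymmetric cost of unsafe answers.
    return base * (1.0 - safety_penalty) if safety_failure else base
```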

The parent codebase contains a benchmarking engine in evidenceos-research/evidenceos-bench. This repository is intended to hold the extracted, standalone evaluation framework suitable for independent use by clinical AI researchers and regulatory bodies. No standalone code has been extracted yet.

## Architecture

| Component | Description | Status |
| --- | --- | --- |
| Vignette Library | Structured clinical cases with expert-validated ground truth | Designed |
| 7-Dimension Scorer | Per-dimension rubric evaluation with weighted aggregation | Designed |
| Graph-Elo Ranker | Trust-score ranking adapted for clinical safety asymmetry | Designed |
| Model Adapter | Standardized interface for submitting AI agent responses | Designed |
| Report Generator | Per-model evaluation reports with dimension breakdowns | Designed |
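To make the Model Adapter row concrete, a minimal sketch of what the adapter interface could look like, assuming a Python implementation; the type names, fields, and method signature are illustrative, not a committed API:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class VignetteResponse:
    """Structured answer an agent returns for one clinical vignette (hypothetical)."""
    vignette_id: str
    answer: str                # free-text clinical recommendation
    stated_confidence: float   # self-reported confidence in [0, 1]
    citations: list[str] = field(default_factory=list)  # evidence sources cited

class ModelAdapter(Protocol):
    """Interface an evaluated model would implement to enter the arena (sketch)."""
    model_id: str

    def respond(self, vignette_prompt: str) -> VignetteResponse:
        """Run the model on one vignette prompt and return its structured response."""
        ...
```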

## Current State

**What exists in the parent codebase:**

- Benchmarking engine scaffolded in evidenceos-research/evidenceos-bench
- 7-dimension scoring rubric defined at the design level

**What does not exist yet:**

- Standalone extracted benchmark suite
- Validated vignette library with expert consensus on ground truth
- Graph-Elo ranking implementation
- Integration with BRIDGE-TBI for model evaluation feeds (INT-05, not built)
- Integration with RAIGH Academy for training case generation from failures (INT-06, not built)
- Test suite or CI pipeline

## Extraction Plan

1. Define the vignette schema and scoring API as language-agnostic JSON specifications
2. Extract the benchmarking engine from evidenceos-research/evidenceos-bench into this repository
3. Implement the Graph-Elo ranking algorithm with clinical safety weighting (see the sketch after this list)
4. Build model adapter interfaces for common inference APIs
5. Validate scoring rubrics against expert panel consensus
6. Publish v0.1.0 with a reference vignette set and scorer
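To make step 3 concrete, below is a minimal sketch of an asymmetric, safety-weighted Elo-style update. Graph-Elo itself is still under design, so the update rule, the `safety_penalty_multiplier`, and all names here are assumptions for illustration only:

```python
def graph_elo_update(
    winner_trust: float,
    loser_trust: float,
    k_base: float = 32.0,
    safety_violation_by_loser: bool = False,
    safety_penalty_multiplier: float = 2.0,  # ASSUMPTION: asymmetric weighting
) -> tuple[float, float]:
    """One pairwise trust-score update (illustrative, not the final algorithm)."""
    # Standard logistic win expectation, as in classic Elo
    expected_winner = 1.0 / (1.0 + 10.0 ** ((loser_trust - winner_trust) / 400.0))

    # Scale the update when the losing response involved a safety failure
    k = k_base * (safety_penalty_multiplier if safety_violation_by_loser else 1.0)

    delta = k * (1.0 - expected_winner)
    return winner_trust + delta, loser_trust - delta
```

In this sketch the asymmetry enters only through the K-factor: a loss caused by an unsafe recommendation moves trust scores further than an ordinary loss, while ordinary comparisons behave like standard Elo.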

## Ecosystem Context

```mermaid
graph LR
    A[BRIDGE-TBI<br/>model versions] -->|INT-05| B[Clinical Arena]
    B -->|INT-06| C[RAIGH Academy<br/>training cases]
    D[evidenceos-bench<br/>scoring engine] --> B
    style B fill:#2A9D8F,stroke:#1E3A8A,color:#fff
```

Clinical Arena is part of the Lab-in-a-Box product family, providing model comparison infrastructure for researchers evaluating clinical AI systems. Arena evaluation results are designed to feed into BRIDGE-TBI deployment gates (INT-05) and RAIGH Academy case study libraries (INT-06).

Canonical source: evidenceos-research/evidenceos-bench

## Contributing

This project is not yet accepting contributions. The evaluation framework, vignette schema, and scoring API are still under active design. See CONTRIBUTING.md for future plans.

## License

Apache-2.0 — see LICENSE for details.
