gero-science/ct-validation

ct-validation

An open framework for benchmarking gene-indication evidence against clinical trial outcomes.

ct-validation tests whether a set of gene-indication pairs is enriched for clinical success. It computes risk ratios and odds ratios with confidence intervals across clinical phase transitions and supports semantic disease matching through ontology-based similarity.

Paper: Kostiuk K, Igumnov D, Fedichev P, Feizi A. ct-validation: an open framework for benchmarking gene-indication evidence against clinical trial outcomes. (2026)

Installation

Requires Python 3.11+.

pip install ct-validation

Optional extras:

# Quote the requirement so shells such as zsh do not expand the brackets
pip install "ct-validation[plot]"  # forest plot visualization
pip install "ct-validation[mcp]"   # MCP server for agent workflows
pip install "ct-validation[parse]" # data source parsers
pip install "ct-validation[fetch]" # ChEMBL fetching script dependencies

Quick start

Python API

import ct_validation as ctv

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/genetic_evidence.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
print(results)
#   phase_label  n_yes   n_no  rr  rr_ci_lower  rr_ci_upper  ...

Batch mode — compare multiple evidence sources at once:

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets=[
        "data/genetic_evidence/gwas_catalog.parquet",
        "data/genetic_evidence/clinvar.parquet",
        "data/genetic_evidence/omim.parquet",
    ],
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# returns a list of DataFrames, one per evidence source

Prioritized mode — test whether a novel source adds value over an established baseline:

results = ctv.validate(
    clinical_trials="data/clinical_trials/gene_indication_max_phase.parquet",
    targets="data/genetic_evidence/novel_score.parquet",
    baseline_evidence="data/genetic_evidence/established_genetics.parquet",
    similarity_lookup="data/mappings/efo_similarity_lookup_0.5.parquet",
)
# pairs supported only by baseline are excluded

Expand a disease set using semantic similarity:

expanded = ctv.get_expanded_disease_set(
    efo_ids={"EFO:0000270", "EFO:0000384"},
    similarity_pairs="data/mappings/efo_similarity_lookup_0.5.parquet",
    similarity_threshold=0.8,
)
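
To make the idea of expansion concrete, here is a minimal pure-Python sketch of one way such a lookup could work. It is not the package's implementation: the function name expand_by_similarity, the symmetric treatment of pairs, the inclusive threshold, and the placeholder IDs EFO:9999901 to EFO:9999903 with their scores are all assumptions made for illustration.

```python
def expand_by_similarity(seed_ids, similarity_pairs, threshold=0.8):
    """Return the seeds plus every ID paired with a seed at similarity >= threshold.

    similarity_pairs is an iterable of (efo_id_1, efo_id_2, similarity) tuples.
    Assumption: similarity is symmetric and the threshold is inclusive.
    """
    seeds = set(seed_ids)
    expanded = set(seeds)
    for efo_1, efo_2, similarity in similarity_pairs:
        if similarity < threshold:
            continue  # pair is too dissimilar to count
        if efo_1 in seeds:
            expanded.add(efo_2)
        if efo_2 in seeds:
            expanded.add(efo_1)
    return expanded


# Hypothetical similarity rows; the 99999xx IDs and the scores are made up.
pairs = [
    ("EFO:0000270", "EFO:9999901", 0.91),
    ("EFO:9999902", "EFO:0000384", 0.85),
    ("EFO:0000270", "EFO:9999903", 0.62),
]

expanded = expand_by_similarity({"EFO:0000270", "EFO:0000384"}, pairs, threshold=0.8)
print(sorted(expanded))
```

Only direct neighbours of the seed IDs are added in this sketch; whether the package also follows chains of similar terms is not stated here.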

CLI

# With config file
ct-validation --config configs/default.yaml

# With explicit arguments
ct-validation \
    --clinical-trials ct.parquet \
    --targets evidence.parquet \
    --similarity-lookup similarity.parquet \
    -o results/

# Batch mode (multiple evidence sources)
ct-validation \
    --clinical-trials ct.parquet \
    --targets gwas.parquet --targets clinvar.parquet --targets omim.parquet \
    -o results/

MCP server

ct-validation-mcp

Exposes two tools for agent-based workflows:

  • ct_validate — compute phase-transition enrichment
  • expand_disease_set — expand EFO IDs via semantic similarity

Input schemas

Input             | Columns                        | Description
clinical_trials   | gene, efo_id, max_phase        | Target-indication pairs with highest phase reached
targets           | gene, efo_id                   | Gene-indication pairs with supporting evidence
similarity_lookup | efo_id_1, efo_id_2, similarity | Pairwise EFO similarity (optional)
baseline_evidence | gene, efo_id                   | Baseline evidence for prioritized mode (optional)
gene_universe     | one gene per line (text file)  | Restrict analysis to these genes (optional)

All inputs accept Parquet files or pandas DataFrames (except gene_universe, which is a text file or a Python set).
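
As a small illustration of the gene_universe format, the sketch below reads a one-gene-per-line text file into a Python set. The helper name load_gene_universe and the example gene symbols are assumptions for this example, not part of the package.

```python
import tempfile
from pathlib import Path


def load_gene_universe(path):
    """Read one gene per line into a set, ignoring blank lines and padding."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


# Write a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    universe_path = Path(tmp) / "gene_universe.txt"
    universe_path.write_text("PCSK9\nIL6R\n\nTYK2\n", encoding="utf-8")
    gene_universe = load_gene_universe(universe_path)

print(sorted(gene_universe))
```

The resulting set has the same shape as the Python-set form of gene_universe described above.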

Output schema

Column                       | Description
phase_from, phase_to         | Phase transition (e.g. 1→2, 1→4)
n_yes, n_no                  | Pairs that reached phase_from (with / without supporting evidence)
x_yes, x_no                  | Of those, pairs that reached phase_to (with / without supporting evidence)
rate_yes, rate_no            | Progression rates (x_yes / n_yes, x_no / n_no)
rr, rr_ci_lower, rr_ci_upper | Risk ratio with 95% CI (Katz log method)
or, or_ci_lower, or_ci_upper | Odds ratio with 95% CI (Woolf logit method)
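
For readers who want to see how the interval columns can be derived from the four counts, the following sketch applies the textbook Katz log and Woolf logit formulas. It is an illustration and not the package's code: the function names are hypothetical, and no zero-cell correction is applied, so a zero count raises an error here (how the package handles that case is not documented in this README).

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def risk_ratio_katz(x_yes, n_yes, x_no, n_no, z=Z_95):
    """Risk ratio with a Katz log-method confidence interval."""
    rr = (x_yes / n_yes) / (x_no / n_no)
    se_log = math.sqrt(1 / x_yes - 1 / n_yes + 1 / x_no - 1 / n_no)
    return rr, math.exp(math.log(rr) - z * se_log), math.exp(math.log(rr) + z * se_log)


def odds_ratio_woolf(x_yes, n_yes, x_no, n_no, z=Z_95):
    """Odds ratio with a Woolf logit-method confidence interval."""
    a, b = x_yes, n_yes - x_yes  # supported: progressed / did not progress
    c, d = x_no, n_no - x_no     # unsupported: progressed / did not progress
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (
        odds_ratio,
        math.exp(math.log(odds_ratio) - z * se_log),
        math.exp(math.log(odds_ratio) + z * se_log),
    )


# Made-up counts: 30 of 100 supported pairs progress, 100 of 1000 unsupported.
rr, rr_lo, rr_hi = risk_ratio_katz(30, 100, 100, 1000)
odds, or_lo, or_hi = odds_ratio_woolf(30, 100, 100, 1000)
print(f"RR = {rr:.2f} ({rr_lo:.2f}, {rr_hi:.2f})")
print(f"OR = {odds:.2f} ({or_lo:.2f}, {or_hi:.2f})")
```

Both intervals are symmetric on the log scale, which is why the bounds are computed by exponentiating the log estimate plus or minus z standard errors.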

Enrichment logic

For each phase transition, target-indication pairs that reached at least the starting phase are divided into supported and unsupported groups. The risk ratio is:

RR = (x_yes / n_yes) / (x_no / n_no)

A risk ratio greater than one indicates that supported pairs are more likely to progress than unsupported ones. When a similarity lookup is provided, a pair (gene, disease) is considered supported if evidence exists for the same gene and a disease' whose similarity to disease is above the threshold (default 0.8).
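
The grouping described above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the package's logic: the function names, the pre-thresholded similar mapping, and the G1..G4 / D1..D4 placeholders are all hypothetical.

```python
def is_supported(gene, efo_id, evidence, similar):
    """True if evidence exists for the gene and this disease or a similar one.

    evidence: set of (gene, efo_id) pairs.
    similar:  dict mapping an efo_id to the set of IDs already above the
              similarity threshold (assumed to be precomputed).
    """
    if (gene, efo_id) in evidence:
        return True
    return any((gene, other) in evidence for other in similar.get(efo_id, ()))


def transition_counts(trials, evidence, similar, phase_from, phase_to):
    """Count pairs entering phase_from (n_*) and reaching phase_to (x_*)."""
    counts = {"n_yes": 0, "x_yes": 0, "n_no": 0, "x_no": 0}
    for gene, efo_id, max_phase in trials:
        if max_phase < phase_from:
            continue  # never reached the starting phase
        group = "yes" if is_supported(gene, efo_id, evidence, similar) else "no"
        counts[f"n_{group}"] += 1
        if max_phase >= phase_to:
            counts[f"x_{group}"] += 1
    return counts


# Placeholder data: (gene, efo_id, max_phase)
trials = [("G1", "D1", 4), ("G2", "D3", 1), ("G3", "D1", 2), ("G4", "D4", 1)]
evidence = {("G1", "D2"), ("G2", "D3")}
similar = {"D1": {"D2"}, "D2": {"D1"}}

counts_1_2 = transition_counts(trials, evidence, similar, phase_from=1, phase_to=2)
counts_2_3 = transition_counts(trials, evidence, similar, phase_from=2, phase_to=3)
print(counts_1_2)
print(counts_2_3)
```

Here (G1, D1) counts as supported only because D1 and D2 are listed as similar; without the lookup it would fall into the unsupported group.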

Prioritized mode

When baseline_evidence is provided, pairs supported only by the baseline are excluded. This tests whether a novel evidence source adds predictive value beyond an established benchmark.
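
One possible reading of this exclusion rule is sketched below with exact pair matching only. The function name prioritized_split and the placeholder pairs are hypothetical, and keeping pairs supported by both sources in the supported group is an interpretation of "supported only by the baseline", not confirmed behaviour.

```python
def prioritized_split(trial_pairs, novel, baseline):
    """Drop baseline-only pairs and flag the rest by novel-source support.

    Returns a list of (pair, supported_by_novel) tuples.
    """
    kept = []
    for pair in trial_pairs:
        in_novel = pair in novel
        in_baseline = pair in baseline
        if in_baseline and not in_novel:
            continue  # supported only by the baseline: excluded from the analysis
        kept.append((pair, in_novel))
    return kept


# Placeholder (gene, efo_id) pairs
trial_pairs = [("G1", "D1"), ("G2", "D2"), ("G3", "D3"), ("G4", "D4")]
novel = {("G1", "D1"), ("G2", "D2")}
baseline = {("G2", "D2"), ("G3", "D3")}

kept = prioritized_split(trial_pairs, novel, baseline)
for pair, supported in kept:
    print(pair, "supported" if supported else "unsupported")
```

In this example (G3, D3) is removed, so the unsupported group contains only pairs that neither source supports.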

Visualization

import ct_validation as ctv

results = ctv.validate(...)
ctv.forest_plot(results, metric="rr", title="Phase I → Approved")

Data source parsers

The scripts/ directory contains reproducible parsers for public databases:

Genetic evidence (scripts/parse/genetic_evidence/):

  • GWAS Catalog — genome-wide significant associations (p < 1e-8)
  • ClinVar — pathogenic/likely pathogenic variants
  • OMIM — established molecular basis (mapping code 3)
  • Open Targets — genetic evidence streams (score ≥ 0.5)
  • Genebass — exome-wide associations (p ≤ 1e-7)

Clinical trials (scripts/parse/clinical_trials/):

  • ChEMBL — gene-drug and drug-indication links (pChEMBL > 7.0)
  • Open Targets — known drug and indication data
  • STITCH — high-confidence activation/inhibition links
  • DGIdb — drug-gene interactions
  • TrialPanorama — interventional studies

Ontology (scripts/r/):

  • EFO semantic similarity matrix (Lin + Resnik information content)

See DATA_SOURCES.md for download links, versions, and fetching instructions.

Configure paths in configs/parsing.yaml and run:

python scripts/parse/run_parsing.py

Configuration

See configs/default.yaml for validation settings and configs/parsing.yaml for data source paths. All config values can be overridden via CLI arguments.

License

MIT