Sponsor: F. Hoffmann-La Roche Ltd / Genentech, Inc.
Protocol OID / ID: GO29436 (IMpower150)
ClinicalTrials.gov Registry: NCT02366143
Indication: First-line treatment of chemotherapy-naïve participants with Stage IV non-squamous Non-Small Cell Lung Cancer (NSCLC)
Submission Package Version: 5.0.0 (Compliance Verified)
Software Validation Standard: FDA GxP (Good Clinical Programming Practice) / 21 CFR Part 11 Compliant
This technical repository implements a computable, metadata-native clinical regulatory data engineering pipeline for the IMpower150 (Study GO29436) clinical trial. IMpower150 is a Phase III, randomized, open-label, multi-center study designed to evaluate the efficacy and safety of Atezolizumab (anti-PD-L1 antibody) in combination with Bevacizumab and platinum-doublet chemotherapy (Carboplatin + Paclitaxel) compared to Bevacizumab + Chemotherapy alone in 1L non-squamous metastatic NSCLC.
This platform bridges the gap between structured clinical design and verifiable submission artifacts. By establishing an automated, single-source-of-truth metadata repository, it builds a fully traceable lineage graph connecting ICH M11 digital protocol objectives, ICH E9(R1) clinical estimands, CDISC COSMoS-aligned Biomedical Concepts, CDISC CORE-aligned derivation rules, and CDISC ARS v1.0 statistical results.
[ICH M11 Protocol Objective]
│
▼ (measures)
[ICH E9(R1) Estimands] ──► [ADICE OCCDS Intercurrent Events]
│
▼ (realized by)
[COSMoS Biomedical Concepts] ──► [parent_bc_id Concept Inheritance]
│
▼ (linked to)
[Endpoint Definitions] ──► [Investigator vs. Parallel BICR Assessor]
│
▼ (compiled by)
[Derivation Rules] ──► [Vectorized DuckDB SQL Engine] ──► [SAS template macros]
│
▼ (serializes)
[CDISC Datasets] ──► [Dataset-JSON v1.1.0 (long names) / XPT (decoupled)]
│
▼ (validates)
[5-Level QC Conformance Engine] ──► [Level 3 Explainable Root-Cause Narratives]
│
▼ (delivers)
[Submission Package] ──► [Define.xml v2.1, JSON-LD SDRG, ARS ard.json, M11 JSON]
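
The chain in the diagram can be sketched as a small directed graph that is walked backward from any artifact to the protocol objective it traces to. This is a minimal illustration, not the repository's `graph_builder.py`: a plain dictionary stands in for the NetworkX graph so the example needs only the standard library, and the node names and the `trace_upstream` helper are assumptions made for this sketch.

```python
from collections import defaultdict

# (source, relation, target) triples following the main chain of the diagram.
# Node labels are illustrative; the real graph would hold one node per object.
EDGES = [
    ("M11 Protocol Objective", "measures", "E9(R1) Estimand"),
    ("E9(R1) Estimand", "realized by", "COSMoS Biomedical Concept"),
    ("COSMoS Biomedical Concept", "linked to", "Endpoint Definition"),
    ("Endpoint Definition", "compiled by", "Derivation Rule"),
    ("Derivation Rule", "serializes", "CDISC Dataset"),
    ("CDISC Dataset", "validates", "QC Conformance Engine"),
    ("QC Conformance Engine", "delivers", "Submission Package"),
]


def trace_upstream(node, edges):
    """Return `node` followed by every ancestor reachable by walking edges backward."""
    parents = defaultdict(list)
    for source, _relation, target in edges:
        parents[target].append(source)

    path, seen, frontier = [node], {node}, [node]
    while frontier:
        current = frontier.pop()
        for source in parents[current]:
            if source not in seen:
                seen.add(source)
                path.append(source)
                frontier.append(source)
    return path


if __name__ == "__main__":
    print(" <- ".join(trace_upstream("CDISC Dataset", EDGES)))
```

Walking backward from a dataset in this way is the same traversal a root-cause narrative would need: each hop names the upstream object that a discrepancy could be attributed to.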
This platform is developed in strict alignment with the latest FDA Data Standards Catalog (DSC) and PMDA Conformance Rules, validating all statistical calculations and structures.
| Standard / Framework | Version | Submission Implementation | Regulatory Authority Alignment |
|---|---|---|---|
| CDISC SDTMIG | v3.4 | Automated domain mapping (DM, AE, EX, LB, RS, TU, TR, DS, SV), including mandatory SV domain visit checks. | Mandatory (FDA/PMDA) |
| CDISC ADaMIG | v1.3 | Derivation of time-to-event parameters (AVAL, CNSR, PARAMCD) and demographic pools (ADSL). | Mandatory (FDA/PMDA) |
| CDISC Dataset-JSON | v1.1.0 | Full envelope-compliant Dataset-JSON and NDJSON streaming, allowing long, descriptive variables decoupled from SAS v5 limitations. | Expected 2026 Adoption |
| CDISC Define.xml | v2.1 | Auto-generated XML metadata dictionary with strict schema conformance, including Value Level Metadata (VLM) and Analysis Results Metadata (ARM). | Mandatory (FDA/PMDA) |
| CDISC ARS | v1.0.0 | Output of structured results-to-endpoint linked metadata (ard.json) tracking KM estimators, hazard ratios, and log-rank statistics. | Released Standard (2025) |
| ICH E9(R1) Estimands | Addendum | Active tracking of 82 intercurrent events in ADICE and stabilized panel weights (SW_IPCW) in ADPANEL for treatment policy estimand sensitivity. | Mandated Guideline |
| ICH M11 | Template | Electronic Exchange-ready digital protocol mapping (m11_protocol.json) linking objectives to endpoints. | Effective June 11, 2026 |
| RECIST v1.1 / iRECIST | 2009/2017 | Standardized tumor scan assessment models tracking progression events and immune-confirmed responses. | FDA Oncology Mandate |
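
To make the Dataset-JSON row of the table concrete, the sketch below builds a small dataset envelope and serializes it as NDJSON (metadata on the first line, one row per following line). It is not the repository's `execution_adapter.py`. The attribute names reflect one reading of the Dataset-JSON v1.1 specification and cover only a subset of it, so they should be checked against the published schema. The helper names, the `IG.`/`IT.` OID prefixes, and the long variable name `PFS_CENSORING_REASON` are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def build_dataset_json(name, label, columns, rows):
    """Assemble a minimal Dataset-JSON style envelope (subset of attributes only)."""
    return {
        "datasetJSONCreationDateTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"),
        "datasetJSONVersion": "1.1.0",
        "itemGroupOID": f"IG.{name}",  # hypothetical OID convention
        "name": name,
        "label": label,
        "records": len(rows),
        "columns": columns,
        "rows": rows,
    }


def to_ndjson(dataset):
    """Serialize as NDJSON: metadata line first, then one JSON array per row."""
    metadata = {key: value for key, value in dataset.items() if key != "rows"}
    lines = [json.dumps(metadata)]
    lines.extend(json.dumps(row) for row in dataset["rows"])
    return "\n".join(lines) + "\n"


def fits_xpt_v5(variable_name):
    """SAS V5 transport files limit variable names to 8 characters."""
    return len(variable_name) <= 8


if __name__ == "__main__":
    columns = [
        {"itemOID": "IT.ADTTE.USUBJID", "name": "USUBJID",
         "label": "Unique Subject Identifier", "dataType": "string"},
        {"itemOID": "IT.ADTTE.PFS_CENSORING_REASON", "name": "PFS_CENSORING_REASON",
         "label": "Reason the PFS record was censored", "dataType": "string"},
    ]
    rows = [["S001", "New anti-cancer therapy"], ["S002", None]]
    dataset = build_dataset_json("ADTTE", "Time-to-Event Analysis Dataset", columns, rows)
    print(to_ndjson(dataset), end="")
    print([column["name"] for column in columns if not fits_xpt_v5(column["name"])])
```

The last line lists the variables that would need a shortened alias in an XPT export, which is the decoupling the table refers to.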
The repository adheres to a strict, audit-grade GxP directory structure, separating source code, database seeds, tests, and submission-ready outputs.
IMpower150/
├── README.md ← This file (Institutional-grade submission guide)
├── CHANGELOG.md ← Software Development Life Cycle (SDLC) audit trail
├── study_config.yaml ← Single-source-of-truth study specifications (M11 & Estimands)
├── alembic.ini ← Database migration configuration
├── Dockerfile ← Multi-stage GxP environment compiler sandbox
├── docker-compose.yml ← Automation wrapper for containerized execution
├── requirements.txt ← Pinned Python dependencies
├── run_app.py ← Launches the Streamlit Regulatory Dashboard
│
├── src/ ← Operational Source Modules
│ ├── models.py ← SQLAlchemy ORM (14-table database schema with inheritance)
│ ├── app.py ← Streamlit regulatory dashboard (Lineage visualizer)
│ ├── orchestrator.py ← Pipeline orchestrator & precision stopwatch telemetry
│ ├── ingest_protocol.py ← Protocol YAML and NCT JSON parser
│ ├── rule_parser.py ← Declarative derivation rule compiler
│ ├── execution_adapter.py ← Vectorized clinical DuckDB adapter (Dataset-JSON & XPT writer)
│ ├── qc_engine.py ← 5-level QC conformance validator (CORE rules, RECIST, EVS CT)
│ ├── graph_builder.py ← DiGraph lineage compiler (W3C RDF Turtle & SHACL shapes exporter)
│ ├── define_xml_generator.py← XML schema generator (Define.xml v2.1 & JSON-LD SDRG)
│ ├── ard_generator.py ← CDISC ARS v1.0 compliant ard.json statistical serializer
│ ├── m11_protocol_exporter.py← Structured digital protocol exporter
│ ├── lineage_report_generator.py← Formatted HTML lineage report generator
│ ├── snapshot_manager.py ← Reproducibility ledger and environment manifest hashing
│ ├── log_parser.py ← SAS execution log anomaly parser
│ └── confidence_scorer.py ← AI curation confidence metric scorer
│
├── seeds/ ← Database Seeding Scripts
│ ├── seed_clinical_rules.py ← Seeding derivation rules and variables
│ └── seed_arm_results.py ← Seeding Analysis Results Metadata (ARM)
│
├── tests/ ← Verification & Validation Suite
│ ├── test_pipeline.py ← End-to-end pipeline verification test
│ ├── test_ai_governance.py ← AI governance extraction test
│ └── audit_probe.py ← Comprehensive database & output gap audit tool
│
└── sas/ ← SAS Code Assets
  ├── templates/ ← Base and oncology-specific macro templates
  └── programs/ ← Generated executable SAS programs
To satisfy FDA software validation standards, this platform enforces absolute environment repeatability. It isolates all database operations, Python execution packages, and DuckDB analytical engines within a locked multi-stage Docker environment.
The Docker Compose run executes the entire end-to-end computable clinical pipeline, runs the 5-level QC checks, builds the semantic ontologies, and writes verified submission packages to the host outputs/ folder:
docker-compose up --build

For local debugging or custom pipeline execution, ensure a Python 3.11+ environment is active:
- Install Pinned Dependencies:
pip install -r requirements.txt
- Execute E2E Integration Suite & Conformance Gates:
python tests/test_pipeline.py
- Launch the Streamlit Conformance Dashboard:
python run_app.py # Opens locally at http://localhost:8501
The clinical cohort engine generates a realistic, simulated population of 100 oncology subjects designed to evaluate the primary treatment policy estimands and parallel blinded reviewer endpoints.
- Progression-Free Survival (PFS - ITT-WT): Evaluated using Investigator RECIST 1.1 criteria. In accordance with FDA oncology mandates, PFS is censored at the last evaluable response assessment if a patient initiates a subsequent non-protocol anti-cancer therapy prior to documented progression.
- EMA Sensitivity Analysis (PFS_EMA - ITT-WT): Under EMA regulatory rules, initiation of a subsequent non-protocol anti-cancer therapy is considered a disease progression event. The pipeline compiles parallel programs to evaluate both regulatory branches.
- Blinded Independent Central Review (PFS_BICR & OS_BICR): To mitigate investigator assessment bias, the pipeline executes parallel derivations using Blinded Independent Central Review (BICR) tumor scan records.
- Immune Progression-Free Survival (iPFS - ITT): Evaluated under iRECIST criteria. Captures unconfirmed progressive disease (iUPD) and requires confirmatory scans $\ge 4$ weeks later to evaluate immunotherapy response patterns.
- Overall Survival (OS - ITT-WT): Time from randomization to death from any cause, evaluated under a Treatment Policy strategy.
- ADSL (Subject-Level): Captures key baseline covariates, demographics, and active estimand population flags (`WTFL` - Wild-Type, `TEFFFL` - Teff-high biomarker, and `PSYFL` - Principal Stratum Flag tracking crossover subjects).
- ADICE (Intercurrent Events): Captured as a CDISC OCCDS structure recording the exact longitudinal events (subsequent therapies, treatment discontinuations, deaths) affecting clinical estimands.
- ADPANEL (Longitudinal Weights): Calculates time-varying stabilized censoring weights (SW_IPCW) based on baseline ECOG performance status and time-varying indicators to correct for crossover biases in survival estimates.
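
The two PFS conventions described in the endpoint list above differ only in how a subsequent non-protocol therapy that starts before documented progression is handled. The sketch below shows that branch under stated assumptions. The `SubjectEvents` fields, the `derive_pfs` helper, and the inclusive day count (+1) are hypothetical choices for this illustration and are not taken from the repository's derivation rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class SubjectEvents:
    """Hypothetical per-subject inputs; field names are illustrative."""
    randomization: date
    progression_or_death: Optional[date]  # first documented PD or death, if any
    new_therapy_start: Optional[date]     # subsequent non-protocol anti-cancer therapy
    last_evaluable_assessment: date       # last evaluable assessment (before new therapy, if one started)


def derive_pfs(subject: SubjectEvents, convention: str) -> Tuple[int, int]:
    """Return (AVAL in days, CNSR), with CNSR = 0 for an event and 1 for censoring."""
    if convention not in ("FDA", "EMA"):
        raise ValueError(f"unknown convention: {convention}")

    pd_date = subject.progression_or_death
    therapy = subject.new_therapy_start
    therapy_first = therapy is not None and (pd_date is None or therapy < pd_date)

    if therapy_first and convention == "EMA":
        end, cnsr = therapy, 0  # new therapy counts as a progression event
    elif therapy_first:
        end, cnsr = subject.last_evaluable_assessment, 1  # censor at last evaluable assessment
    elif pd_date is not None:
        end, cnsr = pd_date, 0  # documented progression or death
    else:
        end, cnsr = subject.last_evaluable_assessment, 1  # no event observed

    # Assumption: the day count is inclusive of the randomization day (+1).
    return (end - subject.randomization).days + 1, cnsr


if __name__ == "__main__":
    subject = SubjectEvents(
        randomization=date(2021, 1, 1),
        progression_or_death=date(2021, 6, 1),
        new_therapy_start=date(2021, 4, 1),
        last_evaluable_assessment=date(2021, 3, 15),
    )
    print(derive_pfs(subject, "FDA"))  # (74, 1)
    print(derive_pfs(subject, "EMA"))  # (91, 0)
```

Keeping both conventions in one function with a single switch makes it easy to confirm that the two branches agree for every subject who never starts a new therapy before progression.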
Every pipeline execution undergoes a strict, multi-dimensional validation suite processed inside our DuckDB analytical store:
- Level 1 (CDISC CORE Standards): Evaluates structural compliance (CDISC CORE rule IDs such as `CORE-000006`, `CORE-000008`, `CORE-000012`).
- Level 2 (Oncology RECIST Semantics): Flags clinical logical contradictions (e.g., `RECIST_003`, which flags a subject who has a documented progression date in raw records but is censored in ADaM).
- Level 3 (Lineage Root-Cause Tracing): Walks backward through the NetworkX lineage graph to compose explainable clinical narratives for any Level 2 discrepancies.
- Level 4 (Cross-Dataset Referential Integrity): Verifies referential integrity keys across domains (`CORE-000042`, verifying `USUBJID` keys against the `ADSL` index).
- Level 5 (Controlled Terminology Validation): Validates all terminology codes against standard NCI EVS Thesaurus maps (`CORE-000080`).
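
Two of the checks above lend themselves to a short SQL illustration: the Level 2 contradiction flagged by `RECIST_003` and the Level 4 `USUBJID` referential check. The sketch below is not the repository's `qc_engine.py`. It uses `sqlite3` from the standard library as a stand-in for DuckDB so that it runs without extra dependencies; the queries use only basic joins and should port directly. The table layouts, the sample rows, and the `run_check` helper are assumptions, and the `RECIST_003` query is deliberately simplified: it does not exclude subjects who were legitimately censored because a new therapy started before progression.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE adsl  (USUBJID TEXT PRIMARY KEY);
CREATE TABLE rs    (USUBJID TEXT, RSSTRESC TEXT, RSDTC TEXT);
CREATE TABLE adtte (USUBJID TEXT, PARAMCD TEXT, CNSR INTEGER);
INSERT INTO adsl  VALUES ('S001'), ('S002');
INSERT INTO rs    VALUES ('S001', 'PD', '2021-03-01'), ('S002', 'SD', '2021-03-01');
INSERT INTO adtte VALUES ('S001', 'PFS', 1), ('S002', 'PFS', 1), ('S999', 'PFS', 0);
""")

# Level 2: progression documented in raw RS records, yet PFS is censored in ADaM.
RECIST_003 = """
SELECT DISTINCT t.USUBJID
FROM adtte AS t
JOIN rs AS r ON r.USUBJID = t.USUBJID
WHERE t.PARAMCD = 'PFS' AND t.CNSR = 1 AND r.RSSTRESC = 'PD'
ORDER BY t.USUBJID
"""

# Level 4: every analysis record must point at a subject present in ADSL.
CORE_000042 = """
SELECT DISTINCT t.USUBJID
FROM adtte AS t
LEFT JOIN adsl AS a ON a.USUBJID = t.USUBJID
WHERE a.USUBJID IS NULL
ORDER BY t.USUBJID
"""


def run_check(connection, rule_id, sql):
    """Run one rule query and return a finding per offending subject."""
    return [{"rule_id": rule_id, "USUBJID": row[0]} for row in connection.execute(sql)]


if __name__ == "__main__":
    print(run_check(con, "RECIST_003", RECIST_003))
    print(run_check(con, "CORE-000042", CORE_000042))
```

With the sample rows, subject S001 is reported by `RECIST_003` and the orphan record S999 by `CORE-000042`.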
Wall-clock timings are recorded for each of the 9 orchestrator stages so that pipeline performance is transparent:
================================================================================
ORCHESTRATED REGULATORY PIPELINE TIMING METRICS (M16)
--------------------------------------------------------------------------------
Stage 0 (DB Init & Seed): 0.6172s
Stage 1 (Compile Rules): 0.0426s
Stage 2 (Seed ARM Results): 0.0236s
Stage 3 (Environment Snapshot): 0.0176s
Stage 4 (Execute Programs): 4.0698s
Stage 5 (Build Lineage Graph): 0.0486s
Stage 6 (Run QC Engine): 0.2136s
Stage 7 (Compile Submissions): 0.1462s
Stage 8 (Generate ARD & M11): 0.0604s
--------------------------------------------------------------------------------
Total Execution Time: 5.2397s
================================================================================
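
Per-stage wall-clock figures like those above can be collected with a small context manager around each stage. This is a minimal sketch rather than the orchestrator's actual telemetry code: the `StageTimer` class and its report layout are assumptions, and only `time.perf_counter` and `contextlib.contextmanager` from the standard library are used.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations for named pipeline stages (illustrative helper)."""

    def __init__(self):
        self.timings = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failed runs still report timing.
            self.timings[name] = time.perf_counter() - start

    def report(self):
        lines = []
        for name, seconds in self.timings.items():
            label = f"{name}:"
            lines.append(f"{label:<34}{seconds:.4f}s")
        total_label = "Total Execution Time:"
        lines.append(f"{total_label:<34}{sum(self.timings.values()):.4f}s")
        return "\n".join(lines)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("Stage 0 (DB Init & Seed)"):
        time.sleep(0.01)  # placeholder for real work
    with timer.stage("Stage 1 (Compile Rules)"):
        time.sleep(0.005)
    print(timer.report())
```

`time.perf_counter` is a monotonic, high-resolution clock, which makes it a better fit for interval timing than `time.time`.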
All outputs successfully generated in outputs/ are fully compliant with FDA eCTD electronic submission requirements:
- define.xml: CDISC Define.xml v2.1 compliant metadata dictionary featuring full VLM and ARM structures (Programmatic validation: 0 errors).
- sdrg.jsonld: Machine-readable Study Data Reviewer's Guide (SDRG) in JSON-LD format, featuring embedded COSMoS concept and endpoint URIs.
- sdrg.html: Interactive, browser-ready HTML reviewer's guide.
- ard.json: CDISC ARS v1.0 compliant statistical results data linking hazard ratios and KM survival rates to protocol endpoints.
- m11_protocol.json: Structured, digital protocol exchange format conforming to ICH M11.
- lineage_ontology.ttl: Full W3C RDF Turtle Lineage Ontology defining semantic clinical concept hierarchies (`subClassOf`) and SHACL shape constraints.
- lineage_report.html: Interactive visual report of variable lineage and traceability.
- Clinical Datasets Directory (outputs/datasets/): Dual formats for submission including classic CDISC SAS XPT and modern CDISC Dataset-JSON v1.1.0 format with NDJSON streaming support.
For regulatory submissions, please contact the Lead Clinical Data Architect or the designated Regulatory Operations Officer.