vtewari2/eol-mistrust

eol-mistrust

Paper Reproduction: Racial Disparities and Mistrust in End-of-Life Care

Paper: Boag et al., "Racial Disparities and Mistrust in End-of-Life Care," MLHC 2018
arXiv: https://arxiv.org/abs/1808.03827
Original code: https://github.com/wboag/eol-mistrust
This reimplementation: reimpl/ (Python 3.9, PyTorch 2.8, PyHealth 1.1.6, pandas 2.3.3)


1. Paper Summary

Hypothesis

Medical mistrust — specifically "iatrophobia," a historically grounded institutional skepticism prevalent in minority communities — mediates racial disparities in aggressive end-of-life (EOL) care. In the ICU, mistrustful patients or families resist transitioning from curative to palliative care, resulting in longer durations of invasive interventions. The paper quantifies this phenomenon by constructing algorithmic mistrust proxies and demonstrating they stratify treatment disparities better than race alone.

Data

MIMIC-III v1.4 — de-identified EHR data from Beth Israel Deaconess Medical Center, covering 58,976 hospital admissions.

Two cohorts are defined:

| Cohort | Size | Purpose |
| --- | --- | --- |
| EOL cohort | ~11,000 admissions | Patients discharged to Hospice, Deceased, or SNF; stay > 6 h |
| ALL cohort | ~50,000 admissions | Full MIMIC population; used to train mistrust models |

Three Mistrust Metrics

All three scores are normalized to zero-mean, unit-variance and computed per hadm_id.

| Metric | Source | Method | Signal |
| --- | --- | --- | --- |
| Noncompliance score | CHARTEVENTS interpersonal features (620 binary features) | L1-regularized logistic regression predicting "noncompliant" in notes | Active refusal of care |
| Autopsy-consent score | Same 620 features | LR predicting autopsy consent (consent = mistrust) | Suspicion of institutional care quality |
| Negative sentiment score | Discharge summary notes | -(polarity − μ)/σ using sentiment analysis | Caregiver-recorded tone of patient interaction |

Key Findings

Treatment disparity — race-based stratification:

  • Black patients: median ventilation ~832 min longer than White (p < 0.05)
  • Vasopressor gap not statistically significant

Treatment disparity — mistrust-based stratification (noncompliance score):

  • Low Trust vs High Trust ventilation gap: 2,580 min — a 3× amplification over the race-based gap
  • Vasopressors: 650 min gap (p < 0.05), versus only 200 min with race split

Outcome prediction (AUC-ROC, 100 random splits):

| Feature Set | AMA | Code Status | Mortality |
| --- | --- | --- | --- |
| Baseline | 0.859 | 0.763 | 0.600 |
| Baseline + Race | 0.861 | 0.766 | 0.614 |
| Baseline + Noncompliant | 0.869 | 0.767 | 0.614 |
| Baseline + Autopsy | 0.861 | 0.773 | 0.603 |
| Baseline + Neg Sentiment | 0.859 | 0.765 | 0.615 |
| Baseline + ALL | 0.873 | 0.782 | 0.635 |

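The combined mistrust feature set (Baseline + ALL) outperforms Baseline + Race on every task. Individually, each mistrust score matches or beats race on at least one task, but not on all of them. The noncompliance score is the strongest individual predictor for leaving AMA (coefficient 0.52 vs race 0.03 for Black patients).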

Key Methodological Choices

  • 10-hour gap merge for treatment spans: administrative re-charting at shift change is absorbed by merging spans separated by ≤ 10 hours
  • 620 interpersonal features from CHARTEVENTS: agitation scales, restraints, education readiness, family meetings, pain assessments, spiritual support, etc.
  • Autopsy consent as trust proxy: Black patients consented to autopsies at 38.5% vs 24.3% for White — a signal of post-mortem suspicion of care quality

2. System Architecture

flowchart TD
    subgraph RAW["MIMIC-III v1.4 Raw CSVs"]
        A1[ADMISSIONS]
        A2[PATIENTS]
        A3[ICUSTAYS]
        A4[CHARTEVENTS]
        A5[NOTEEVENTS]
        A6[PROCEDUREEVENTS_MV]
        A7[INPUTEVENTS_MV/CV]
        A8[OUTPUTEVENTS]
        A9[D_ITEMS]
    end

    subgraph S00["00 — build_mimic_views"]
        B1[icustay_detail.parquet]
        B2[ventdurations.parquet]
        B3[vasopressordurations.parquet]
        B4[oasis.parquet]
    end

    subgraph S01["01 — cohort"]
        C1[eol_cohort.parquet]
    end

    subgraph S02["02 — chartevents_features"]
        D1[chartevents_features.parquet]
    end

    subgraph S03["03 — note_labels"]
        E1[noncompliance_labels.parquet]
        E2[autopsy_labels.parquet]
    end

    subgraph S04["04 — mistrust_models"]
        F1[mistrust_noncompliant.parquet]
        F2[mistrust_autopsy.parquet]
        F3[vectorizer.pkl]
    end

    subgraph S05["05 — sentiment"]
        G1[neg_sentiment.parquet]
    end

    subgraph S06["06 — treatment_durations"]
        H1[treatment_durations.parquet]
    end

    subgraph S07["07 — race_analysis"]
        I1["images/chapter3/ — 8 CDF PNGs"]
    end

    subgraph S08["08 — mistrust_analysis"]
        J1["images/chapter5/ — 12 CDF PNGs"]
    end

    subgraph S09["09 — outcomes_ml"]
        K1[outcomes_results.parquet]
    end

    A1 & A2 & A3 & A6 & A7 & A8 --> S00
    A4 --> S00

    A1 --> S01
    B1 --> S01
    B4 --> S01

    A4 & A9 --> S02

    A5 --> S03
    D1 --> S03

    D1 & E1 & E2 --> S04

    A5 --> S05

    B2 & B3 & B1 & C1 --> S06

    C1 & H1 --> S07

    C1 & H1 & F1 & F2 & G1 --> S08

    B1 & B4 & F1 & F2 & G1 --> S09

3. Pipeline Stages

Stage 00 — Build MIMIC Materialized Views

Script: reimpl/00_build_mimic_views.py
Purpose: Reconstructs, directly from the raw gzipped CSVs, the four PostgreSQL materialized views (from the mimic-code repo) that the original code depended on. This is the foundational step: stages 01, 06, and 09 read its outputs directly, and stages 07 and 08 depend on them indirectly.

Inputs (all from physionet.org/files/mimiciii/1.4/):

| File | Used for |
| --- | --- |
| ADMISSIONS.csv.gz | icustay_detail: admittime, discharge, ethnicity, insurance |
| PATIENTS.csv.gz | Age computation (DOB → age at admission) |
| ICUSTAYS.csv.gz | ICU stay intime/outtime, LOS |
| PROCEDUREEVENTS_MV.csv.gz | MetaVision vent spans (itemids 225792, 225794) |
| CHARTEVENTS.csv.gz | CareVue vent spans (itemid 720) + OASIS vitals |
| INPUTEVENTS_MV.csv.gz | MetaVision vasopressor infusions |
| INPUTEVENTS_CV.csv.gz | CareVue vasopressor infusions |
| OUTPUTEVENTS.csv.gz | Urine output for OASIS score |

Outputs (data/):

| File | Rows | Key columns |
| --- | --- | --- |
| icustay_detail.parquet | 61,532 | icustay_id, hadm_id, subject_id, gender, age, ethnicity, insurance, intime, outtime, los_hospital, los_icu, admission_type |
| ventdurations.parquet | 30,084 | icustay_id, starttime, endtime |
| vasopressordurations.parquet | 23,878 | icustay_id, starttime, endtime |
| oasis.parquet | 61,532 | icustay_id, hadm_id, oasis + 20 component columns |

Vent span detection algorithm:

flowchart TD
    A[PROCEDUREEVENTS_MV] -->|itemids 225792,225794| B[MV spans: starttime+endtime]
    C[CHARTEVENTS] -->|itemid 720, non-null VALUE| D[CV spans: charted intervals]
    B & D --> E[Merge consecutive spans within 8h gap per icustay_id]
    E --> F[ventdurations.parquet]

OASIS score components: Age, pre-ICU LOS, GCS, heart rate, mean arterial pressure, respiratory rate, temperature, urine output, mechanical ventilation flag, elective surgery flag — scored per Johnson et al. 2013 breakpoints.

Key implementation notes:

  • Age overflow: MIMIC shifts DOB by ~300 years for patients >89. Fixed using integer year/month/day arithmetic + clip(upper=90).
  • CHARTEVENTS VALUE column is mixed dtype across chunks; cast to str before stripping.
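The age fix above can be illustrated with plain calendar arithmetic. This is a minimal sketch rather than the script's actual code: the function name `age_at_admission` and the example dates are hypothetical, and it assumes the overflow arises because a nanosecond-resolution timedelta cannot hold a span of roughly 300 years.

```python
from datetime import date


def age_at_admission(dob: date, admit: date, cap: int = 90) -> int:
    """Whole-year age from year/month/day fields, capped at `cap`."""
    # Subtract one year when the birthday has not yet occurred in the admit year.
    birthday_pending = (admit.month, admit.day) < (dob.month, dob.day)
    years = admit.year - dob.year - int(birthday_pending)
    # Shifted DOBs produce ages near 300; the cap plays the role of clip(upper=90).
    return min(years, cap)


print(age_at_admission(date(2100, 5, 17), date(2165, 3, 2)))  # 64
print(age_at_admission(date(1850, 1, 1), date(2150, 6, 1)))   # 90 (shifted DOB)
```

Because only integer fields are compared, no timestamp subtraction takes place and the shifted dates cannot overflow.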

Stage 01 — EOL Cohort Selection

Script: reimpl/01_cohort.py
Purpose: Define the 12,958-patient end-of-life cohort from ADMISSIONS + icustay_detail + oasis. This is the primary population for all disparity analyses.

Inputs: ADMISSIONS.csv.gz, data/icustay_detail.parquet, data/oasis.parquet

Output: data/eol_cohort.parquet (12,958 rows, 12 columns)
Columns: hadm_id, subject_id, icustay_id, race, gender, age, insurance, admittime, dischtime, los_hospital (hours), discharge_location, max_oasis

Cohort selection algorithm:

flowchart TD
    A[ADMISSIONS — 58,976 rows] --> B{discharge_location in HOSPICE-HOME / HOSPICE-MEDICAL FACILITY / DEAD/EXPIRED / SNF}
    B -- Yes → 14,115 --> C{stay duration > 1 day?}
    C -- Yes → 13,106 --> D{has ≥1 ICU stay in icustay_detail?}
    D -- Yes → 12,958 --> E[eol_cohort.parquet]

Multi-ICU-stay admissions (3,260 hadm_ids): representative icustay_id = the stay with the highest OASIS score, consistent with the original MAX(oasis) query. Falls back to earliest intime if OASIS is missing.
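The selection filter and the representative-stay rule can be sketched without pandas as follows. The dictionary field names mirror the columns listed above, but the helper names are hypothetical, and the sketch assumes the fallback to earliest intime applies only when no stay of the admission has an OASIS score.

```python
from datetime import datetime, timedelta

EOL_LOCATIONS = {"HOSPICE-HOME", "HOSPICE-MEDICAL FACILITY", "DEAD/EXPIRED", "SNF"}


def is_eol_admission(adm: dict) -> bool:
    """Discharge location filter plus the > 1 day stay requirement."""
    long_enough = adm["dischtime"] - adm["admittime"] > timedelta(days=1)
    return adm["discharge_location"] in EOL_LOCATIONS and long_enough


def representative_stay(stays: list) -> int:
    """Pick the icustay_id with the highest OASIS; else the earliest intime."""
    scored = [s for s in stays if s["oasis"] is not None]
    if scored:
        return max(scored, key=lambda s: s["oasis"])["icustay_id"]
    # Assumed fallback: no stay has a score, so take the first ICU stay.
    return min(stays, key=lambda s: s["intime"])["icustay_id"]


adm = {"discharge_location": "SNF",
       "admittime": datetime(2150, 1, 1, 8), "dischtime": datetime(2150, 1, 3, 8)}
stays = [{"icustay_id": 1, "oasis": 30, "intime": datetime(2150, 1, 1, 9)},
         {"icustay_id": 2, "oasis": 41, "intime": datetime(2150, 1, 2, 9)}]
print(is_eol_admission(adm), representative_stay(stays))  # True 2
```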

Cohort breakdown:

| Location | Count |
| --- | --- |
| Skilled Nursing Facility | 7,537 |
| Deceased | 4,871 |
| Hospice | 550 |

| Race | Count |
| --- | --- |
| White | 9,551 |
| Black | 1,166 |
| Not Specified | 1,046 |
| Other | 592 |
| Asian | 303 |
| Hispanic | 293 |

Primary comparison cohort (White + Black): 10,717 patients


Stage 02 — Chartevents Interpersonal Features

Script: reimpl/02_chartevents_features.py
Purpose: Extract ~620 binary interpersonal interaction features from CHARTEVENTS. These features are the input vector for both supervised mistrust classifiers (stage 04). They capture the quality of the patient-provider relationship as recorded in structured chart data.

Inputs: CHARTEVENTS.csv.gz (330M+ rows, streamed in 500k-row chunks), D_ITEMS.csv.gz

Output: data/chartevents_features.parquet (2,247,896 rows, 2 columns)
Columns: hadm_id (int), feature_key (str)
Format: long, with one row per unique (hadm_id, feature_key) pair. Reconstruct as dict-of-dicts for DictVectorizer: {hadm_id: {feature_key: 1.0, ...}}.
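Rebuilding the dict-of-dicts from the long format takes a single pass. The sketch below uses plain tuples in place of DataFrame rows; the function name `to_feature_dicts` and the example feature keys are made up for illustration.

```python
from collections import defaultdict


def to_feature_dicts(rows):
    """Turn (hadm_id, feature_key) pairs into {hadm_id: {feature_key: 1.0}}."""
    features = defaultdict(dict)
    for hadm_id, feature_key in rows:
        # Binary indicator; a repeated pair simply overwrites the same entry.
        features[hadm_id][feature_key] = 1.0
    return dict(features)


rows = [
    (100, "restraint location||none"),
    (100, "bath||self"),
    (101, "bath||refused"),
    (100, "bath||self"),  # duplicate, harmless
]
feature_dicts = to_feature_dicts(rows)
print(feature_dicts[100])
```

The list of inner dicts, kept in a fixed hadm_id order, is the shape that scikit-learn's DictVectorizer accepts in fit_transform.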

Feature extraction algorithm:

flowchart TD
    A[D_ITEMS.csv.gz] -->|filter linksto=chartevents| B{label matches ~40 keywords?}
    B -- Yes --> C[matched itemids: ~168]
    D[CHARTEVENTS.csv.gz — streamed 500k chunks] --> E{ITEMID in matched set?}
    E -- Yes --> F{ERROR=1?}
    F -- No --> G[normalise_feature label+value]
    G -->|returns None| H[skip row]
    G -->|returns category+value| I["feature_key = category||value"]
    I --> J[deduplicate per hadm_id via set]
    J --> K[chartevents_features.parquet]

Keywords triggering feature inclusion: family communication, education barrier/learner/method/readiness/topic, pain / pain level / pain assess method, restraint, spiritual support, support systems, state, safety measures, family meeting, health care proxy, bath, bed bath, riker-sas scale, richmond-ras scale, side rails, status and comfort, consults, social work consult, sitter, security, observer, informed, and ~15 others.

Normalisation rules (mirrors trust.ipynb cell 7):

| Label pattern | Coarsened to |
| --- | --- |
| reason for restraint | none / threat of harm / confusion-delirium / presence of violence / treatment interference / risk for falls |
| restraint location | none / 4 point restraint / some restraint |
| restraint device | sitter / limb / (raw) |
| bath | partial / self / refused / shave / hair / none / done |
| behavior, behavioral state | skipped |
| pain management/type/cause/location | skipped |
| pain level*, education topic*, safety measures*, side rails*, status and comfort*, *informed* | kept as-is |
| all others | (label, value) kept as-is |
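As a partial illustration of these rules, the sketch below covers only the "skipped" and "kept as-is" rows and builds the category||value key from the flowchart. It is not the repository's normalise_feature: the lowercasing, the prefix matching, and the omission of the restraint and bath coarsening are all simplifying assumptions.

```python
from typing import Optional, Tuple

# Label prefixes that the table marks as skipped ("behavior" also covers
# "behavioral state").
SKIPPED_PREFIXES = (
    "behavior",
    "pain management",
    "pain type",
    "pain cause",
    "pain location",
)


def normalise_feature(label: str, value: str) -> Optional[Tuple[str, str]]:
    """Return (category, value), or None when the row should be skipped."""
    label, value = label.strip().lower(), value.strip().lower()
    if label.startswith(SKIPPED_PREFIXES):
        return None
    # Coarsening of restraint/bath values is intentionally left out here.
    return label, value


def feature_key(label: str, value: str) -> Optional[str]:
    pair = normalise_feature(label, value)
    return None if pair is None else "||".join(pair)


print(feature_key("Pain Level", "None"))   # pain level||none
print(feature_key("Behavior", "Calm"))     # None
```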

Stats:

  • 54,510 unique hadm_ids (ALL patients with any interpersonal chartevents)
  • 633 unique feature keys
  • 41.2 avg features per patient
  • EOL cohort coverage: 12,813 / 12,958 (98.9%)

Stage 03 — Note-Based Labels

Script: reimpl/03_note_labels.py
Purpose: Generate training targets for the two supervised mistrust classifiers by scanning NOTEEVENTS for rule-based signals. Single streaming pass over 1.85M notes.

Inputs: NOTEEVENTS.csv.gz (streamed 50k-row chunks), data/chartevents_features.parquet

Outputs:

  • data/noncompliance_labels.parquet — 54,510 rows; columns: hadm_id, label (0/1)
  • data/autopsy_labels.parquet — 1,009 rows; columns: hadm_id, label (0/1)

Label generation algorithm:

flowchart TD
    A[NOTEEVENTS — 1.85M notes] --> B{ISERROR=1?}
    B -- No --> C{text contains 'noncompliant'?}
    C -- Yes --> D[noncompliance_set += hadm_id]
    B -- No --> E{text contains 'autopsy'?}
    E -- Yes --> F[scan each line]
    F --> G{decline/refuse/not consent/denied?}
    G -- Yes --> H[autopsy_declined += hadm_id]
    F --> I{consent/agree/request?}
    I -- Yes --> J[autopsy_consented += hadm_id]

    D --> K[Build noncompliance_labels: all chartevents patients, default=0, override=1]
    H & J --> L{both flags for same hadm_id?}
    L -- Yes --> M[exclude as ambiguous]
    L -- No --> N[autopsy_labels: consent=1, decline=0]

Class distributions:

| Label | Noncompliance | Autopsy |
| --- | --- | --- |
| mistrust=1 | 480 (0.88%) | 270 |
| trust=0 | 54,030 | 739 |
| ambiguous/excluded | n/a | 60 |
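The rule-based labelling from the flowchart can be approximated with two regular expressions. The exact patterns in the script are not shown in this README, so the ones below are assumptions, as are two design choices: only lines that mention "autopsy" are inspected, and decline terms take precedence on a line (otherwise "not consent" would also match the consent pattern).

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns approximating the keyword lists in the flowchart.
DECLINE = re.compile(r"declin|refus|not consent|denied", re.IGNORECASE)
CONSENT = re.compile(r"consent|agree|request", re.IGNORECASE)


def scan_note(text: str) -> Tuple[bool, bool, bool]:
    """Return (noncompliant, autopsy_declined, autopsy_consented) for one note."""
    lowered = text.lower()
    noncompliant = "noncompliant" in lowered
    declined = consented = False
    if "autopsy" in lowered:
        for line in text.splitlines():
            if "autopsy" not in line.lower():
                continue
            if DECLINE.search(line):
                declined = True
            elif CONSENT.search(line):
                consented = True
    return noncompliant, declined, consented


def autopsy_label(declined: bool, consented: bool) -> Optional[int]:
    """consent=1, decline=0; None when both or neither flag is set."""
    if declined == consented:
        return None
    return 1 if consented else 0


print(scan_note("Pt noncompliant with meds.\nFamily declined autopsy."))
# (True, True, False)
```

In a streaming pass, the per-note flags would be accumulated into sets keyed by hadm_id before autopsy_label is applied once per admission.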

Race × autopsy consent rate (EOL cohort) — core paper finding:

| Race | Decline | Consent | Rate |
| --- | --- | --- | --- |
| Black | 45 | 29 | 39.2% |
| White | 421 | 144 | 25.5% |
| Hispanic | 9 | 9 | 50.0% |
| Asian | 20 | 2 | 9.1% |

Black patients consent to autopsies at 39.2% vs 25.5% for White patients.


Stage 04 — PyTorch Mistrust Models

Script: reimpl/04_mistrust_models.py
Purpose: Train two L1-regularized logistic regression models on the interpersonal feature vectors and score all 54,510 patients. The output scores are continuous mistrust proxies used downstream for disparity analysis and outcome prediction.

Inputs: data/chartevents_features.parquet, data/noncompliance_labels.parquet, data/autopsy_labels.parquet

Outputs:

  • data/mistrust_noncompliant.parquet — 54,510 rows; columns: hadm_id, score
  • data/mistrust_autopsy.parquet — 54,510 rows; columns: hadm_id, score
  • data/vectorizer.pkl — fitted DictVectorizer (reused in stage 09)

Model: LogisticRegression(nn.Module) — single nn.Linear(633→1). forward() returns raw logit (= sklearn's decision_function). Weights zero-initialized.

Regularization equivalence to sklearn C=0.1, penalty='l1':

sklearn objective: ||w||_1 + C * Σ log_loss_i
PyTorch objective: BCE_mean + λ * ||w||_1   where  λ = 1/(C * n_train) = 10/n_train
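The equivalence can be checked numerically: dividing the sklearn objective by C·n gives the PyTorch objective, so the two differ only by a constant factor and share a minimiser. The sketch below is dependency-free and uses toy data; it leaves out pos_weight and keeps the bias out of the L1 term, both of which are simplifications.

```python
import math


def log_loss_terms(w, b, X, y):
    """Per-sample logistic loss for weights w and bias b."""
    losses = []
    for x, target in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        losses.append(-(target * math.log(p) + (1 - target) * math.log(1 - p)))
    return losses


def sklearn_objective(w, b, X, y, C):
    # ||w||_1 + C * sum of per-sample log losses
    return sum(abs(wi) for wi in w) + C * sum(log_loss_terms(w, b, X, y))


def pytorch_objective(w, b, X, y, lam):
    # mean BCE + lambda * ||w||_1
    losses = log_loss_terms(w, b, X, y)
    return sum(losses) / len(losses) + lam * sum(abs(wi) for wi in w)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]
w, b, C = [0.7, -0.3], 0.1, 0.1
n = len(X)
lam = 1.0 / (C * n)

ratio = sklearn_objective(w, b, X, y, C) / pytorch_objective(w, b, X, y, lam)
print(ratio, C * n)  # the ratio equals C * n for any w and b
```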

Training parameters:

| Parameter | Noncompliance | Autopsy |
| --- | --- | --- |
| Training patients | 38,157 (70%) | 697 (70%) |
| Positives (mistrust=1) | 336 | 186 |
| pos_weight | 112.6× | 2.75× |
| λ (L1) | 0.000262 | 0.01435 |
| Optimizer | Adam lr=0.05 | Adam lr=0.05 |
| Epochs | 1,000 | 500 |
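The λ and pos_weight entries follow from the training counts. The short check below assumes pos_weight is the ratio of negatives to positives in the training split, which reproduces the table values; the helper names are illustrative.

```python
def l1_lambda(C: float, n_train: int) -> float:
    """lambda = 1 / (C * n_train)."""
    return 1.0 / (C * n_train)


def pos_weight(n_train: int, n_pos: int) -> float:
    """Assumed definition: negatives divided by positives."""
    return (n_train - n_pos) / n_pos


print(round(l1_lambda(0.1, 38_157), 6), round(pos_weight(38_157, 336), 1))  # 0.000262 112.6
print(round(l1_lambda(0.1, 697), 5), round(pos_weight(697, 186), 2))        # 0.01435 2.75
```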

Test-set evaluation (30% hold-out):

| Metric | Noncompliance | Autopsy |
| --- | --- | --- |
| AUC-ROC | 0.667 | 0.531 |
| Recall | 0.444 | 0.437 |
| Specificity | 0.763 | 0.618 |
| F1 | 0.032 | 0.352 |
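The very low noncompliance F1 is a consequence of the 0.88% base rate rather than of the recall. The sketch below recovers F1 from recall, specificity, and class counts; the hold-out counts are derived here by subtracting the training figures from the label totals, which is an assumption about how the split was made.

```python
def f1_from_rates(recall: float, specificity: float, n_pos: int, n_neg: int) -> float:
    """F1 implied by recall and specificity at a given class balance."""
    tp = recall * n_pos
    fp = (1.0 - specificity) * n_neg
    precision = tp / (tp + fp)
    return 2 * precision * recall / (precision + recall)


# Noncompliance hold-out: 480 - 336 positives out of 54,510 - 38,157 patients.
nc_pos = 480 - 336
nc_neg = (54_510 - 38_157) - nc_pos
nc_f1 = f1_from_rates(0.444, 0.763, nc_pos, nc_neg)

# Autopsy hold-out: 270 - 186 positives out of 1,009 - 697 patients.
au_pos = 270 - 186
au_neg = (1_009 - 697) - au_pos
au_f1 = f1_from_rates(0.437, 0.618, au_pos, au_neg)

print(round(nc_f1, 3), round(au_f1, 3))
```

Under these assumed counts the results land close to the reported 0.032 and 0.352: with roughly 110 negatives per positive, even a 24% false-positive rate swamps the true positives.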

Score distributions in EOL cohort (White vs Black):

| Score | White median | Black median | MWU p |
| --- | --- | --- | --- |
| Noncompliance | −1.148 | −1.000 | 0.034 |
| Autopsy | −0.174 | −0.260 | 0.002 |

The noncompliance score correctly shows Black patients as more mistrustful (p=0.034). The autopsy direction appears reversed in the full EOL cohort because only 817/12,958 patients (6.3%) have explicit autopsy mentions; the model extrapolates with poor discrimination (AUC=0.531). Within the labeled autopsy subset the direction is correct.

Key implementation note: defaultdict(dict) via itertuples for feature dict construction avoids pandas 2.x groupby.apply instability (0.5s vs 16s failure mode). Full dense tensor (54,510 × 633 = 138 MB) fits in memory — no batching needed.


Stage 05 — Negative Sentiment Score

Script: reimpl/05_sentiment.py
Purpose: Compute a sentiment-based mistrust proxy from discharge summary notes. Higher score = more negative sentiment = more mistrust signal.

Inputs: NOTEEVENTS.csv.gz (discharge summary category only)

Output: data/neg_sentiment.parquet — 52,726 rows; columns: hadm_id, raw_score, neg_score

Method: sentence-level VADER compound score, averaged across sentences per note, averaged across notes per hadm_id, then z-scored and negated:

neg_score[hadm_id] = -(mean_sentence_polarity - μ_all) / σ_all
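The two averaging levels and the negated z-score can be sketched as below. Sentence polarities are passed in as plain floats, so no sentiment library is needed; the use of the population standard deviation is an assumption, and the function names are illustrative.

```python
from statistics import mean, pstdev


def raw_scores(notes_by_hadm: dict) -> dict:
    """Average sentence polarities per note, then note means per hadm_id."""
    return {
        hadm_id: mean(mean(sentences) for sentences in notes)
        for hadm_id, notes in notes_by_hadm.items()
    }


def neg_scores(raw: dict) -> dict:
    """Z-score across all admissions, then negate (higher = more negative tone)."""
    values = list(raw.values())
    mu, sigma = mean(values), pstdev(values)
    return {hadm_id: -(v - mu) / sigma for hadm_id, v in raw.items()}


notes_by_hadm = {
    1: [[-0.4, 0.0], [-0.2]],  # two notes for one admission
    2: [[0.0, 0.0]],
    3: [[0.1, 0.3]],
}
raw = raw_scores(notes_by_hadm)
print(neg_scores(raw))
```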

Why sentence-level, not full-text VADER:

| Approach | std | Useful |
| --- | --- | --- |
| Full-text compound | saturates at −1.0 for 94% of notes | No |
| Sentence-level mean | std ≈ 0.086 | Yes |

Full-text VADER compound saturates because clinical discharge language is lexically negative ("pain," "failure," "death"). Sentence-level averaging avoids saturation and is the closest available analog to the original's word-averaged pattern.en approach.

Stats:

  • 59,652 notes scored (discharge summaries only)
  • 52,726 unique hadm_ids
  • EOL cohort coverage: 12,543 / 12,958 (96.8%)
  • Raw score: mean = −0.069, std = 0.067

Race signal in EOL cohort (median neg_score):

| Race | Median neg_score |
| --- | --- |
| Black | +0.200 |
| White | +0.157 |

White vs Black MWU p=0.106 (direction correct, not significant at α=0.05).

Correlations between all three mistrust scores:

| Pair | Pearson r |
| --- | --- |
| neg_sentiment × noncompliant | +0.100 |
| neg_sentiment × autopsy | −0.082 |
| noncompliant × autopsy | +0.266 |

All weakly but significantly correlated — they capture overlapping but distinct aspects of mistrust.


Stage 06 — Treatment Durations

Script: reimpl/06_treatment_durations.py
Purpose: Aggregate mechanical ventilation and vasopressor durations (in minutes) per hadm_id for the EOL cohort, applying the 10-hour gap merge to remove administrative noise.

Inputs: data/ventdurations.parquet, data/vasopressordurations.parquet, data/icustay_detail.parquet, data/eol_cohort.parquet

Output: data/treatment_durations.parquet — 12,958 rows; columns: hadm_id, vent_minutes, vaso_minutes (NaN if patient received no treatment of that type)

Span merge algorithm:

flowchart TD
    A[ventdurations — icustay_id level] --> B[join → hadm_id via icustay_detail]
    B --> C[filter to EOL cohort hadm_ids]
    C --> D[collect all spans per hadm_id across all ICU stays]
    D --> E[sort spans by starttime]
    E --> F{gap between consecutive spans ≤ 10 h?}
    F -- Yes --> G[extend current span end]
    F -- No --> H[close current span, start new]
    G & H --> I[sum total minutes of merged spans]
    I --> J[treatment_durations.parquet]
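The merge steps in the flowchart can be written as a single sorted sweep. This sketch follows the "extend current span end" wording, so a bridged gap counts toward the total; whether the script counts the gap itself is not stated here and is an assumption. The function name is illustrative, and None stands in for the NaN used in the parquet output.

```python
from datetime import datetime, timedelta


def merged_minutes(spans, max_gap=timedelta(hours=10)):
    """Total minutes after merging spans separated by at most max_gap."""
    if not spans:
        return None  # no treatment of this type
    spans = sorted(spans)
    total = timedelta(0)
    cur_start, cur_end = spans[0]
    for start, end in spans[1:]:
        if start - cur_end <= max_gap:
            # Bridge the gap by extending the current span.
            cur_end = max(cur_end, end)
        else:
            total += cur_end - cur_start
            cur_start, cur_end = start, end
    total += cur_end - cur_start
    return total.total_seconds() / 60


spans = [
    (datetime(2150, 1, 1, 14), datetime(2150, 1, 1, 18)),
    (datetime(2150, 1, 1, 8), datetime(2150, 1, 1, 12)),   # 2 h gap: merged
    (datetime(2150, 1, 2, 6), datetime(2150, 1, 2, 7)),    # 12 h gap: new span
]
print(merged_minutes(spans))  # 660.0
```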

Treatment prevalence in EOL cohort:

| Treatment | Patients | % of EOL |
| --- | --- | --- |
| Mechanical ventilation | 7,173 | 55.4% |
| Vasopressors | 4,972 | 38.4% |

White vs Black — core disparity result:

| Treatment | White median | Black median | MWU p |
| --- | --- | --- | --- |
| Ventilation | 2,741 min (45.7 h) | 3,620 min (60.3 h) | 0.009 |
| Vasopressors | 1,691 min (28.2 h) | 1,819 min (30.3 h) | 0.317 |

Severity-stratified (OASIS tertiles, ventilation): Disparity is concentrated in the medium-severity group (OASIS 33–40, p=0.005) — the most clinically ambiguous zone where treatment decisions are discretionary. High-severity patients receive long ventilation regardless of race (ceiling effect).
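Tertile stratification can be sketched with the standard library. The quantile method used by the script is not stated, so the default of statistics.quantiles is an assumption, as is the convention that a score equal to a cut point falls in the lower group.

```python
from statistics import quantiles


def severity_groups(oasis_by_hadm: dict) -> dict:
    """Split admissions into low / medium / high OASIS tertiles."""
    t1, t2 = quantiles(list(oasis_by_hadm.values()), n=3)
    groups = {"low": [], "medium": [], "high": []}
    for hadm_id, score in oasis_by_hadm.items():
        if score <= t1:
            groups["low"].append(hadm_id)
        elif score <= t2:
            groups["medium"].append(hadm_id)
        else:
            groups["high"].append(hadm_id)
    return groups


toy = {hadm_id: hadm_id for hadm_id in range(1, 10)}  # scores 1..9
print(severity_groups(toy))
```

The White vs Black comparison is then repeated inside each group, which is how the medium-severity effect is isolated.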


Stage 07 — Race-Based Disparity Figures

Script: reimpl/07_race_analysis.py
Purpose: Reproduce the race-based treatment CDF figures from the paper (Chapter 3 / race_mimic_aggressive.ipynb). Produces 8 PNGs.

Inputs: data/eol_cohort.parquet, data/treatment_durations.parquet

Outputs (images/chapter3/):

File Description
race_mimic_vent.png Ventilation CDF — White vs Black, all severities
race_mimic_vaso.png Vasopressor CDF — White vs Black, all severities
race_mimic_vent_low.png Ventilation, low severity (OASIS ≤ tertile 1)
race_mimic_vent_medium.png Ventilation, medium severity
race_mimic_vent_high.png Ventilation, high severity
race_mimic_vaso_low.png Vasopressors, low severity
race_mimic_vaso_medium.png Vasopressors, medium severity
race_mimic_vaso_high.png Vasopressors, high severity

Figure style: Empirical CDF (np.sort(vals) vs np.linspace(0,1,n,endpoint=False)), x-axis clipped at 10,000 min, dashed vertical median lines with values annotated above axes, White = #00A6ED, Black = #FF5400, no top/right spines.
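The CDF coordinates described above reduce to a sort and an evenly spaced grid. The following dependency-free sketch mirrors the np.sort / np.linspace(0, 1, n, endpoint=False) pairing; the helper name `ecdf` is illustrative and plotting is left out.

```python
from statistics import median


def ecdf(values):
    """Return (xs, ys): sorted values and cumulative fractions i/n."""
    xs = sorted(values)
    n = len(xs)
    ys = [i / n for i in range(n)]  # same grid as linspace(0, 1, n, endpoint=False)
    return xs, ys


durations = [30, 10, 20, 40]  # toy durations in minutes
xs, ys = ecdf(durations)
print(xs, ys, median(durations))  # [10, 20, 30, 40] [0.0, 0.25, 0.5, 0.75] 25.0
```

With endpoint=False the curve starts at 0 and tops out at (n − 1)/n rather than 1.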


Stage 08 — Mistrust-Based Disparity Figures

Script: reimpl/08_mistrust_analysis.py
Purpose: Reproduce the mistrust-stratified treatment CDF figures (Chapter 5). 3 metrics × 2 treatments × 2 figures (overall + severity panels) = 12 PNGs.

Inputs: data/eol_cohort.parquet, data/treatment_durations.parquet, data/mistrust_noncompliant.parquet, data/mistrust_autopsy.parquet, data/neg_sentiment.parquet

Outputs (images/chapter5/):

File Description
mistrust_noncompliant_mimic_vent.png Noncompliance score — ventilation CDF
mistrust_noncompliant_mimic_vaso.png Noncompliance score — vasopressor CDF
mistrust_noncompliant_mimic_vent_severity.png 3-panel severity-stratified vent
mistrust_noncompliant_mimic_vaso_severity.png 3-panel severity-stratified vaso
mistrust_autopsy_mimic_vent.png Autopsy score — ventilation CDF
mistrust_autopsy_mimic_vaso.png Autopsy score — vasopressor CDF
mistrust_autopsy_mimic_vent_severity.png 3-panel severity-stratified vent
mistrust_autopsy_mimic_vaso_severity.png 3-panel severity-stratified vaso
neg_sentiment_mimic_vent.png Sentiment score — ventilation CDF
neg_sentiment_mimic_vaso.png Sentiment score — vasopressor CDF
neg_sentiment_mimic_vent_severity.png 3-panel severity-stratified vent
neg_sentiment_mimic_vaso_severity.png 3-panel severity-stratified vaso

Split logic (mirrors original exactly):

  1. white_ids = White EOL patients with treatment data AND this score
  2. black_ids = Black EOL patients with treatment data AND this score
  3. Pool = white_ids ∪ black_ids, sorted ascending by score
  4. Bottom len(white_ids) → High Trust (blue #0000FF)
  5. Top len(black_ids) → Low Trust (red #FF0000)

Split sizes match racial group population counts, making comparisons directly analogous to Stage 07.
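The split logic can be sketched in a few lines. Scores are passed as {hadm_id: score} dicts that are assumed to be already restricted to patients with treatment data; the function name `trust_split` is illustrative.

```python
def trust_split(white_scores: dict, black_scores: dict):
    """Return (high_trust_ids, low_trust_ids) sized like the racial groups."""
    pool = sorted({**white_scores, **black_scores}.items(), key=lambda kv: kv[1])
    n_white, n_black = len(white_scores), len(black_scores)
    # Lowest scores (least mistrust) form the High Trust group.
    high_trust = [hadm_id for hadm_id, _ in pool[:n_white]]
    # Highest scores form the Low Trust group.
    low_trust = [hadm_id for hadm_id, _ in pool[len(pool) - n_black:]]
    return high_trust, low_trust


white = {1: -1.0, 2: 0.5, 3: -0.2}
black = {4: 0.9, 5: -0.5}
print(trust_split(white, black))  # ([1, 5, 3], [2, 4])
```

Because the two id sets are disjoint, the bottom and top slices partition the pool exactly, so every patient lands in one of the two groups.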

Key results (overall CDFs):

| Metric | Treatment | High Trust median | Low Trust median | MWU p |
| --- | --- | --- | --- | --- |
| Noncompliance | Ventilation | 2,681 min | 5,000 min | <0.0001 |
| Noncompliance | Vasopressors | 1,691 min | 1,819 min | 0.006 |
| Autopsy | Ventilation | 2,627 min | 6,011 min | <0.0001 |
| Autopsy | Vasopressors | 1,632 min | 2,480 min | <0.0001 |
| Neg Sentiment | Ventilation | 2,880 min | 1,798 min | <0.0001 (reversed) |

Against the 879-min race-based ventilation gap measured in Stage 06, the noncompliance split widens the gap to 2,319 min (~2.6×) and the autopsy split to 3,384 min (~3.8×). The negative sentiment direction is reversed: high-trust patients receive more ventilation. One plausible explanation is that the tone of discharge summaries tracks the patient's clinical situation rather than mistrust.


Stage 09 — Outcomes ML

Script: reimpl/09_outcomes_ml.py
Purpose: Reproduce the downstream outcome prediction experiment, evaluating how much each feature set (baseline demographics, + race, + each mistrust score, + all) improves AUC-ROC on three clinical tasks. Uses PyHealth's binary_metrics_fn for evaluation.

Inputs: data/icustay_detail.parquet, data/oasis.parquet, data/mistrust_noncompliant.parquet, data/mistrust_autopsy.parquet, data/neg_sentiment.parquet, CHARTEVENTS.csv.gz (code status labels), ADMISSIONS.csv.gz (AMA + mortality labels)

Output: data/outcomes_results.parquet — 18 rows (6 feature sets × 3 tasks); columns: feat_set, task, n_patients, n_positive, auc_mean, auc_std, auc_ci_lo, auc_ci_hi, auprc_mean, f1_mean

Three tasks:

| Task | Label | Base rate | N |
| --- | --- | --- | --- |
| Code Status | DNR/CMO=1 vs Full Code=0 | 10.9% | 39,105 |
| AMA | Left AMA=1 vs compliant=0 | 0.6% | 47,544 |
| Mortality | Deceased=1 vs survived=0 | 10.8% | 47,544 |

Six feature configurations:

| Config | Columns |
| --- | --- |
| BASELINE | age, los_hospital, insurance, gender |
| BASELINE+RACE | + race |
| BASELINE+NONCOMPLIANT | + nc_score |
| BASELINE+AUTOPSY | + au_score |
| BASELINE+SENTIMENT | + neg_score |
| BASELINE+ALL | + race + nc_score + au_score + neg_score |

Protocol: 100 random 60/40 stratified splits; full-batch Adam training (200 epochs, L1 λ=10/n, lr=0.05); pyhealth.metrics.binary_metrics_fn(metrics=["roc_auc","pr_auc","f1"]) per split.
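The resampling part of this protocol can be sketched independently of the model. The helpers below are illustrative only: they produce one stratified 60/40 index split per seed and summarise a list of per-split AUCs as mean ± 1.96×std. The use of the sample standard deviation is an assumption.

```python
import random
from statistics import mean, stdev


def stratified_split(labels, train_frac=0.6, seed=0):
    """Return (train_idx, test_idx) preserving the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def summarize(aucs):
    """Mean AUC with the mean ± 1.96 × std band."""
    m, s = mean(aucs), stdev(aucs)
    return m, m - 1.96 * s, m + 1.96 * s


labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, seed=42)
print(len(train_idx), sum(labels[i] for i in train_idx))  # 60 6
print(summarize([0.85, 0.86, 0.87]))
```

Running the split with seeds 0 to 99 and collecting one AUC per seed gives the 100 values that feed summarize.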

AUC-ROC results (mean ± 1.96×std):

| Feature Set | AMA | Code Status | Mortality |
| --- | --- | --- | --- |
| BASELINE | 0.855 ± 0.025 | 0.759 ± 0.009 | 0.629 ± 0.010 |
| BASELINE+RACE | 0.854 ± 0.026 | 0.759 ± 0.009 | 0.636 ± 0.011 |
| BASELINE+NONCOMPLIANT | 0.864 ± 0.024 | 0.759 ± 0.009 | 0.636 ± 0.010 |
| BASELINE+AUTOPSY | 0.854 ± 0.025 | 0.761 ± 0.009 | 0.641 ± 0.009 |
| BASELINE+SENTIMENT | 0.853 ± 0.026 | 0.759 ± 0.009 | 0.633 ± 0.010 |
| BASELINE+ALL | 0.861 ± 0.026 | 0.761 ± 0.009 | 0.661 ± 0.010 |

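BASELINE+ALL has a higher mean AUC than BASELINE+RACE on all three tasks, although the AMA and Code Status differences fall within the reported ± intervals. The Mortality improvement (0.629 → 0.661 over BASELINE) is the clearest signal.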


4. Complete Output Artifact Catalog

All parquet files are written to data/ relative to /Users/vtewari/Desktop/mimc/. All images are under images/.

Parquet / Pickle Files

File Produced by Rows Key columns
data/icustay_detail.parquet 00 61,532 icustay_id, hadm_id, subject_id, gender, age, ethnicity, insurance, intime, outtime, los_hospital, los_icu, admission_type
data/ventdurations.parquet 00 30,084 icustay_id, starttime, endtime
data/vasopressordurations.parquet 00 23,878 icustay_id, starttime, endtime
data/oasis.parquet 00 61,532 icustay_id, hadm_id, oasis + 20 component cols
data/eol_cohort.parquet 01 12,958 hadm_id, subject_id, icustay_id, race, gender, age, insurance, admittime, dischtime, los_hospital, discharge_location, max_oasis
data/chartevents_features.parquet 02 2,247,896 hadm_id (int), feature_key (str)
data/noncompliance_labels.parquet 03 54,510 hadm_id, label (0/1)
data/autopsy_labels.parquet 03 1,009 hadm_id, label (0/1)
data/mistrust_noncompliant.parquet 04 54,510 hadm_id, score (float, raw logit)
data/mistrust_autopsy.parquet 04 54,510 hadm_id, score (float, raw logit)
data/vectorizer.pkl 04 Fitted DictVectorizer; saved as (hadm_ids, vect) tuple
data/neg_sentiment.parquet 05 52,726 hadm_id, raw_score, neg_score (z-scored+negated)
data/treatment_durations.parquet 06 12,958 hadm_id, vent_minutes, vaso_minutes (NaN if untreated)
data/outcomes_results.parquet 09 18 feat_set, task, n_patients, n_positive, auc_mean, auc_std, auc_ci_lo, auc_ci_hi, auprc_mean, f1_mean

Image Files

images/chapter3/ — Race-based CDF figures (Stage 07, 8 files):

File Content
race_mimic_vent.png Overall ventilation CDF, White vs Black
race_mimic_vaso.png Overall vasopressor CDF, White vs Black
race_mimic_vent_low.png Ventilation, low OASIS severity
race_mimic_vent_medium.png Ventilation, medium OASIS severity
race_mimic_vent_high.png Ventilation, high OASIS severity
race_mimic_vaso_low.png Vasopressors, low OASIS severity
race_mimic_vaso_medium.png Vasopressors, medium OASIS severity
race_mimic_vaso_high.png Vasopressors, high OASIS severity

images/chapter5/ — Mistrust-based CDF figures (Stage 08, 12 files):

File Content
mistrust_noncompliant_mimic_vent.png Noncompliance — ventilation overall
mistrust_noncompliant_mimic_vaso.png Noncompliance — vasopressor overall
mistrust_noncompliant_mimic_vent_severity.png Noncompliance — ventilation 3-panel severity
mistrust_noncompliant_mimic_vaso_severity.png Noncompliance — vasopressor 3-panel severity
mistrust_autopsy_mimic_vent.png Autopsy — ventilation overall
mistrust_autopsy_mimic_vaso.png Autopsy — vasopressor overall
mistrust_autopsy_mimic_vent_severity.png Autopsy — ventilation 3-panel severity
mistrust_autopsy_mimic_vaso_severity.png Autopsy — vasopressor 3-panel severity
neg_sentiment_mimic_vent.png Neg Sentiment — ventilation overall
neg_sentiment_mimic_vaso.png Neg Sentiment — vasopressor overall
neg_sentiment_mimic_vent_severity.png Neg Sentiment — ventilation 3-panel severity
neg_sentiment_mimic_vaso_severity.png Neg Sentiment — vasopressor 3-panel severity

5. Statistical Results

5.1 EOL Cohort Demographics (White vs Black)

| Variable | Black | White | p-value |
| --- | --- | --- | --- |
| N | 1,166 | 9,551 | |
| Mean age | 71.3 [60.2, 80.4] | 77.9 [66.6, 84.9] | <0.001 |
| Public insurance | 87.5% | 83.7% | <0.001 |
| Female gender | 60.4% | 50.2% | <0.001 |
| Discharge: Deceased | 33.0% | 38.7% | <0.001 |
| Discharge: Hospice | 3.3% | 4.2% | <0.001 |
| Discharge: SNF | 63.7% | 57.0% | <0.001 |
| Median LOS | 13.9 days | 14.1 days | 0.222 |

5.2 Treatment Duration Disparities — Race Stratification

| Treatment | White median | Black median | Δ (min) | MWU p |
| --- | --- | --- | --- | --- |
| Ventilation | 2,741 | 3,620 | +879 | 0.009 |
| Vasopressors | 1,691 | 1,819 | +128 | 0.317 |

5.3 Treatment Duration Disparities — Mistrust Stratification (Noncompliance Score)

| Treatment | High Trust median | Low Trust median | Δ (min) | MWU p |
| --- | --- | --- | --- | --- |
| Ventilation | 2,681 | 5,000 | +2,319 | <0.0001 |
| Vasopressors | 1,691 | 1,819 | +128 | 0.006 |

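Amplification over race stratification: ~2.6× for ventilation (2,319 vs 879 min). For vasopressors the median gap is the same under both splits (+128 min), but it reaches significance only under the mistrust split (p = 0.006 vs 0.317).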

5.4 Autopsy Consent Rates by Race (EOL Cohort)

| Race | Decline | Consent | Consent rate |
| --- | --- | --- | --- |
| Black | 45 | 29 | 39.2% |
| White | 421 | 144 | 25.5% |
| Asian | 20 | 2 | 9.1% |

5.5 Mistrust Score Severity-Stratified Ventilation (Noncompliance Score)

| Severity (OASIS) | High Trust median | Low Trust median | p |
| --- | --- | --- | --- |
| Low (≤ tertile 1) | 2,368 | 4,020 | 0.0002 |
| Medium (t1–t2) | 2,486 | 5,000 | <0.0001 |
| High (> t2) | 3,442 | 6,390 | 0.0001 |

Disparity is significant across ALL severity levels for mistrust — unlike race where it is concentrated only at medium severity.

5.6 Outcome Prediction AUC-ROC (100 random 60/40 splits)

| Feature Set | AMA | Code Status | Mortality |
| --- | --- | --- | --- |
| BASELINE | 0.855 | 0.759 | 0.629 |
| BASELINE+RACE | 0.854 | 0.759 | 0.636 |
| BASELINE+NONCOMPLIANT | 0.864 | 0.759 | 0.636 |
| BASELINE+AUTOPSY | 0.854 | 0.761 | 0.641 |
| BASELINE+SENTIMENT | 0.853 | 0.759 | 0.633 |
| BASELINE+ALL | 0.861 | 0.761 | 0.661 |

6. Implementation Divergences from Original

| Aspect | Original (Boag et al.) | This reimplementation |
| --- | --- | --- |
| Language / Python version | Python 2, Jupyter notebooks | Python 3.9, standalone scripts |
| ML framework | scikit-learn LogisticRegression(C=0.1, penalty='l1') | PyTorch nn.Linear with equivalent L1 penalty: λ = 10/n_train |
| Score output | decision_function() (raw logit) | forward() returns raw logit; identical semantics |
| Sentiment analysis | pattern.en.sentiment(text.split()): word-averaged polarity | NLTK VADER sentence-level compound, averaged across sentences and notes; avoids saturation (full-text VADER saturates at −1.0 for 94% of clinical notes) |
| Multi-note aggregation | Last discharge summary (dict overwrite) | Mean across all discharge summaries; more robust |
| Feature matrix construction | psycopg2 SQL query to PostgreSQL | pandas streaming + DictVectorizer from raw CSVs |
| Materialized views | PostgreSQL mimic-code views | Reconstructed from raw CSVs in script 00 |
| Evaluation framework | Manual AUC computation | pyhealth.metrics.binary_metrics_fn |
| Training batching | Full-batch (sklearn) | Full-batch PyTorch (DataLoader has 226× overhead for tabular data at this scale) |
| Cohort LOS threshold | Stay > 6 hours | Stay > 1 day (PROGRESS.md note: original uses 6h for treatment, 12h for notes) |
| Type annotations | Python 2 style | Optional[Tuple[str,str]] from typing (Python 3.9 lacks tuple[...] \| None union syntax) |
