eol-mistrust

Paper Reproduction: Racial Disparities and Mistrust in End-of-Life Care

Paper: Boag et al., "Racial Disparities and Mistrust in End-of-Life Care," MLHC 2018
arXiv: https://arxiv.org/abs/1808.03827
Original code: https://github.com/wboag/eol-mistrust
This reimplementation: reimpl/ (Python 3.9, PyTorch 2.8, PyHealth 1.1.6, pandas 2.3.3)

1. Paper Summary

Hypothesis

Medical mistrust — specifically "iatrophobia," a historically grounded institutional skepticism prevalent in minority communities — mediates racial disparities in aggressive end-of-life (EOL) care. In the ICU, mistrustful patients or families resist transitioning from curative to palliative care, resulting in longer durations of invasive interventions. The paper quantifies this phenomenon by constructing algorithmic mistrust proxies and demonstrating they stratify treatment disparities better than race alone.

Data

MIMIC-III v1.4 — de-identified EHR data from Beth Israel Deaconess Medical Center, covering 58,976 hospital admissions.

Two cohorts are defined:

Cohort	Size	Purpose
EOL cohort	~11,000 admissions	Patients discharged to Hospice, Deceased, or SNF; stay > 6 h
ALL cohort	~50,000 admissions	Full MIMIC population; used to train mistrust models

Three Mistrust Metrics

All three scores are normalized to zero-mean, unit-variance and computed per hadm_id.

Metric	Source	Method	Signal
Noncompliance score	CHARTEVENTS interpersonal features (620 binary features)	L1-regularized logistic regression predicting "noncompliant" in notes	Active refusal of care
Autopsy-consent score	Same 620 features	LR predicting autopsy consent (consent = mistrust)	Suspicion of institutional care quality
Negative sentiment score	Discharge summary notes	-(polarity − μ)/σ using sentiment analysis	Caregiver-recorded tone of patient interaction

Key Findings

Treatment disparity — race-based stratification:

Black patients: median ventilation ~832 min longer than White (p < 0.05)
Vasopressor gap not statistically significant

Treatment disparity — mistrust-based stratification (noncompliance score):

Low Trust vs High Trust ventilation gap: 2,580 min — a 3× amplification over the race-based gap
Vasopressors: 650 min gap (p < 0.05), versus only 200 min with race split

Outcome prediction (AUC-ROC, 100 random splits):

Feature Set	AMA	Code Status	Mortality
Baseline	0.859	0.763	0.600
Baseline + Race	0.861	0.766	0.614
Baseline + Noncompliant	0.869	0.767	0.614
Baseline + Autopsy	0.861	0.773	0.603
Baseline + Neg Sentiment	0.859	0.765	0.615
Baseline + ALL	0.873	0.782	0.635

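Baseline + ALL achieves the highest AUC on every task, and on each task at least one individual mistrust score matches or beats race (noncompliance for AMA, autopsy for Code Status, negative sentiment for Mortality). The noncompliance score is the strongest individual predictor for leaving AMA (coefficient 0.52 vs race 0.03 for Black patients).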

Key Methodological Choices

10-hour gap merge for treatment spans: administrative re-charting at shift change is absorbed by merging spans separated by ≤ 10 hours
620 interpersonal features from CHARTEVENTS: agitation scales, restraints, education readiness, family meetings, pain assessments, spiritual support, etc.
Autopsy consent as trust proxy: Black patients consented to autopsies at 38.5% vs 24.3% for White — a signal of post-mortem suspicion of care quality

2. System Architecture

flowchart TD
    subgraph RAW["MIMIC-III v1.4 Raw CSVs"]
        A1[ADMISSIONS]
        A2[PATIENTS]
        A3[ICUSTAYS]
        A4[CHARTEVENTS]
        A5[NOTEEVENTS]
        A6[PROCEDUREEVENTS_MV]
        A7[INPUTEVENTS_MV/CV]
        A8[OUTPUTEVENTS]
        A9[D_ITEMS]
    end

    subgraph S00["00 — build_mimic_views"]
        B1[icustay_detail.parquet]
        B2[ventdurations.parquet]
        B3[vasopressordurations.parquet]
        B4[oasis.parquet]
    end

    subgraph S01["01 — cohort"]
        C1[eol_cohort.parquet]
    end

    subgraph S02["02 — chartevents_features"]
        D1[chartevents_features.parquet]
    end

    subgraph S03["03 — note_labels"]
        E1[noncompliance_labels.parquet]
        E2[autopsy_labels.parquet]
    end

    subgraph S04["04 — mistrust_models"]
        F1[mistrust_noncompliant.parquet]
        F2[mistrust_autopsy.parquet]
        F3[vectorizer.pkl]
    end

    subgraph S05["05 — sentiment"]
        G1[neg_sentiment.parquet]
    end

    subgraph S06["06 — treatment_durations"]
        H1[treatment_durations.parquet]
    end

    subgraph S07["07 — race_analysis"]
        I1["images/chapter3/ — 8 CDF PNGs"]
    end

    subgraph S08["08 — mistrust_analysis"]
        J1["images/chapter5/ — 12 CDF PNGs"]
    end

    subgraph S09["09 — outcomes_ml"]
        K1[outcomes_results.parquet]
    end

    A1 & A2 & A3 & A6 & A7 & A8 --> S00
    A4 --> S00

    A1 --> S01
    B1 --> S01
    B4 --> S01

    A4 & A9 --> S02

    A5 --> S03
    D1 --> S03

    D1 & E1 & E2 --> S04

    A5 --> S05

    B2 & B3 & B1 & C1 --> S06

    C1 & H1 --> S07

    C1 & H1 & F1 & F2 & G1 --> S08

    B1 & B4 & F1 & F2 & G1 --> S09

3. Pipeline Stages

Stage 00 — Build MIMIC Materialized Views

Script: reimpl/00_build_mimic_views.py
Purpose: Reconstructs, directly from the raw gzipped CSVs, the four PostgreSQL materialized views (from the mimic-code repo) that the original code depended on. This is the foundational step: stages 01, 06, and 09 read its outputs directly, and stages 07 and 08 depend on them through the cohort and treatment-duration files.

Inputs (all from physionet.org/files/mimiciii/1.4/):

File	Used for
`ADMISSIONS.csv.gz`	`icustay_detail` — admittime, discharge, ethnicity, insurance
`PATIENTS.csv.gz`	Age computation (DOB → age at admission)
`ICUSTAYS.csv.gz`	ICU stay intime/outtime, LOS
`PROCEDUREEVENTS_MV.csv.gz`	MetaVision vent spans (itemids 225792, 225794)
`CHARTEVENTS.csv.gz`	CareVue vent spans (itemid 720) + OASIS vitals
`INPUTEVENTS_MV.csv.gz`	MetaVision vasopressor infusions
`INPUTEVENTS_CV.csv.gz`	CareVue vasopressor infusions
`OUTPUTEVENTS.csv.gz`	Urine output for OASIS score

Outputs (data/):

File	Rows	Key columns
`icustay_detail.parquet`	61,532	`icustay_id, hadm_id, subject_id, gender, age, ethnicity, insurance, intime, outtime, los_hospital, los_icu, admission_type`
`ventdurations.parquet`	30,084	`icustay_id, starttime, endtime`
`vasopressordurations.parquet`	23,878	`icustay_id, starttime, endtime`
`oasis.parquet`	61,532	`icustay_id, hadm_id, oasis` + 20 component columns

Vent span detection algorithm:

flowchart TD
    A[PROCEDUREEVENTS_MV] -->|itemids 225792,225794| B[MV spans: starttime+endtime]
    C[CHARTEVENTS] -->|itemid 720, non-null VALUE| D[CV spans: charted intervals]
    B & D --> E[Merge consecutive spans within 8h gap per icustay_id]
    E --> F[ventdurations.parquet]

OASIS score components: Age, pre-ICU LOS, GCS, heart rate, mean arterial pressure, respiratory rate, temperature, urine output, mechanical ventilation flag, elective surgery flag — scored per Johnson et al. 2013 breakpoints.

Key implementation notes:

Age overflow: MIMIC shifts DOB by ~300 years for patients >89. Fixed using integer year/month/day arithmetic + clip(upper=90).
CHARTEVENTS VALUE column is mixed dtype across chunks; cast to str before stripping.
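
The age fix can be sketched as follows. This is a minimal illustration that assumes dates are available as Python `date` objects; the function name `age_at_admission` is hypothetical, and the script's own pandas code may differ. Integer year/month/day arithmetic avoids subtracting timestamps roughly 300 years apart, which exceeds the approximately 292-year range of a nanosecond-resolution timedelta in pandas.

```python
from datetime import date

MAX_AGE = 90  # cap applied to de-identified patients older than 89


def age_at_admission(dob: date, admit: date) -> int:
    """Whole years between dob and admit, clipped at MAX_AGE."""
    years = admit.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the admission year.
    if (admit.month, admit.day) < (dob.month, dob.day):
        years -= 1
    return min(years, MAX_AGE)


# A shifted DOB ~300 years before admission is clipped rather than overflowing.
print(age_at_admission(date(1850, 1, 1), date(2150, 6, 1)))
```

The same comparison can be applied column-wise with the `.dt.year`, `.dt.month`, and `.dt.day` accessors, followed by `clip(upper=90)` as described above.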

Stage 01 — EOL Cohort Selection

Script: reimpl/01_cohort.py
Purpose: Define the 12,958-patient end-of-life cohort from ADMISSIONS + icustay_detail + oasis. This is the primary population for all disparity analyses.

Inputs: ADMISSIONS.csv.gz, data/icustay_detail.parquet, data/oasis.parquet

Output: data/eol_cohort.parquet (12,958 rows, 12 columns)
Columns: hadm_id, subject_id, icustay_id, race, gender, age, insurance, admittime, dischtime, los_hospital (hours), discharge_location, max_oasis

Cohort selection algorithm:

flowchart TD
    A[ADMISSIONS — 58,976 rows] --> B{discharge_location in HOSPICE-HOME / HOSPICE-MEDICAL FACILITY / DEAD/EXPIRED / SNF}
    B -- Yes → 14,115 --> C{stay duration > 1 day?}
    C -- Yes → 13,106 --> D{has ≥1 ICU stay in icustay_detail?}
    D -- Yes → 12,958 --> E[eol_cohort.parquet]

Multi-ICU-stay admissions (3,260 hadm_ids): representative icustay_id = the stay with the highest OASIS score, consistent with the original MAX(oasis) query. Falls back to earliest intime if OASIS is missing.
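
The selection rule for multi-stay admissions can be illustrated with a small pure-Python sketch. The `IcuStay` record and the `representative_stays` function are hypothetical names introduced here; the script itself works on parquet-backed DataFrames. How ties in OASIS are broken is not stated above, so this sketch simply keeps the first stay encountered.

```python
from datetime import datetime
from typing import Dict, List, NamedTuple, Optional


class IcuStay(NamedTuple):
    hadm_id: int
    icustay_id: int
    intime: datetime
    oasis: Optional[float]  # None when no OASIS score is available


def representative_stays(stays: List[IcuStay]) -> Dict[int, int]:
    """Map each hadm_id to one icustay_id: highest OASIS, else earliest intime."""
    by_admission: Dict[int, List[IcuStay]] = {}
    for stay in stays:
        by_admission.setdefault(stay.hadm_id, []).append(stay)

    chosen: Dict[int, int] = {}
    for hadm_id, group in by_admission.items():
        scored = [s for s in group if s.oasis is not None]
        if scored:
            best = max(scored, key=lambda s: s.oasis)   # MAX(oasis) analogue
        else:
            best = min(group, key=lambda s: s.intime)   # fallback: earliest stay
        chosen[hadm_id] = best.icustay_id
    return chosen
```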

Cohort breakdown:

Location	Count
Skilled Nursing Facility	7,537
Deceased	4,871
Hospice	550

Race	Count
White	9,551
Black	1,166
Not Specified	1,046
Other	592
Asian	303
Hispanic	293

Primary comparison cohort (White + Black): 10,717 patients

Stage 02 — Chartevents Interpersonal Features

Script: reimpl/02_chartevents_features.py
Purpose: Extract ~620 binary interpersonal interaction features from CHARTEVENTS. These features are the input vector for both supervised mistrust classifiers (stage 04). They capture the quality of the patient-provider relationship as recorded in structured chart data.

Inputs: CHARTEVENTS.csv.gz (330M+ rows, streamed in 500k-row chunks), D_ITEMS.csv.gz

Output: data/chartevents_features.parquet (2,247,896 rows, 2 columns)
Columns: hadm_id (int), feature_key (str)
Format: long, with one row per unique (hadm_id, feature_key) pair. Reconstruct as dict-of-dicts for DictVectorizer: {hadm_id: {feature_key: 1.0, ...}}.
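
A minimal sketch of the long-to-dict reconstruction is shown below. The function name `to_feature_dicts` and the example feature keys are hypothetical; only the `{hadm_id: {feature_key: 1.0}}` shape comes from the description above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def to_feature_dicts(rows: Iterable[Tuple[int, str]]) -> Dict[int, Dict[str, float]]:
    """Turn long-format (hadm_id, feature_key) pairs into a dict-of-dicts."""
    feats: Dict[int, Dict[str, float]] = defaultdict(dict)
    for hadm_id, feature_key in rows:
        feats[hadm_id][feature_key] = 1.0  # binary indicator
    return dict(feats)


# Illustrative rows only; real keys follow the "category||value" format.
rows = [
    (100, "restraint location||none"),
    (100, "pain level||7"),
    (200, "restraint location||none"),
]
feature_dicts = to_feature_dicts(rows)
```

With a DataFrame, `df.itertuples(index=False)` can supply the pairs, and `list(feature_dicts.values())` can then be passed to `DictVectorizer.fit_transform`.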

Feature extraction algorithm:

flowchart TD
    A[D_ITEMS.csv.gz] -->|filter linksto=chartevents| B{label matches ~40 keywords?}
    B -- Yes --> C[matched itemids: ~168]
    D[CHARTEVENTS.csv.gz — streamed 500k chunks] --> E{ITEMID in matched set?}
    E -- Yes --> F{ERROR=1?}
    F -- No --> G[normalise_feature label+value]
    G -->|returns None| H[skip row]
    G -->|returns category+value| I["feature_key = category||value"]
    I --> J[deduplicate per hadm_id via set]
    J --> K[chartevents_features.parquet]

Keywords triggering feature inclusion: family communication, education barrier/learner/method/readiness/topic, pain / pain level / pain assess method, restraint, spiritual support, support systems, state, safety measures, family meeting, health care proxy, bath, bed bath, riker-sas scale, richmond-ras scale, side rails, status and comfort, consults, social work consult, sitter, security, observer, informed, and ~15 others.

Normalisation rules (mirrors trust.ipynb cell 7):

Label pattern	Coarsened to
`reason for restraint`	none / threat of harm / confusion-delirium / presence of violence / treatment interference / risk for falls
`restraint location`	none / 4 point restraint / some restraint
`restraint device`	sitter / limb / (raw)
`bath`	partial / self / refused / shave / hair / none / done
`behavior`, `behavioral state`	skipped
`pain management/type/cause/location`	skipped
`pain level`, `education topic`, `safety measures`, `side rails`, `status and comfort`, `informed*`	kept as-is
all others	`(label, value)` kept as-is

Stats:

54,510 unique hadm_ids (ALL patients with any interpersonal chartevents)
633 unique feature keys
41.2 avg features per patient
EOL cohort coverage: 12,813 / 12,958 (98.9%)

Stage 03 — Note-Based Labels

Script: reimpl/03_note_labels.py
Purpose: Generate training targets for the two supervised mistrust classifiers by scanning NOTEEVENTS for rule-based signals, in a single streaming pass over 1.85M notes.

Inputs: NOTEEVENTS.csv.gz (streamed 50k-row chunks), data/chartevents_features.parquet

Outputs:

data/noncompliance_labels.parquet — 54,510 rows; columns: hadm_id, label (0/1)
data/autopsy_labels.parquet — 1,009 rows; columns: hadm_id, label (0/1)

Label generation algorithm:

flowchart TD
    A[NOTEEVENTS — 1.85M notes] --> B{ISERROR=1?}
    B -- No --> C{text contains 'noncompliant'?}
    C -- Yes --> D[noncompliance_set += hadm_id]
    B -- No --> E{text contains 'autopsy'?}
    E -- Yes --> F[scan each line]
    F --> G{decline/refuse/not consent/denied?}
    G -- Yes --> H[autopsy_declined += hadm_id]
    F --> I{consent/agree/request?}
    I -- Yes --> J[autopsy_consented += hadm_id]

    D --> K[Build noncompliance_labels: all chartevents patients, default=0, override=1]
    H & J --> L{both flags for same hadm_id?}
    L -- Yes --> M[exclude as ambiguous]
    L -- No --> N[autopsy_labels: consent=1, decline=0]
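
The autopsy branch of the flowchart can be sketched as below. Several details are assumptions rather than confirmed behaviour of 03_note_labels.py: matching is case-insensitive, only lines that mention "autopsy" are scanned, and decline keywords are checked before consent keywords so that "not consent" is not also counted as consent.

```python
from typing import Dict, Iterable, Tuple

DECLINE = ("decline", "refuse", "not consent", "denied")
CONSENT = ("consent", "agree", "request")


def autopsy_flags(text: str) -> Tuple[bool, bool]:
    """Return (declined, consented) flags for one note."""
    declined = consented = False
    for line in text.lower().splitlines():
        if "autopsy" not in line:
            continue
        if any(k in line for k in DECLINE):
            declined = True
        elif any(k in line for k in CONSENT):
            consented = True
    return declined, consented


def autopsy_labels(notes: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Label consent=1, decline=0; drop admissions that carry both flags."""
    declined_ids, consented_ids = set(), set()
    for hadm_id, text in notes:
        declined, consented = autopsy_flags(text)
        if declined:
            declined_ids.add(hadm_id)
        if consented:
            consented_ids.add(hadm_id)
    ambiguous = declined_ids & consented_ids
    labels = {h: 0 for h in declined_ids - ambiguous}
    labels.update({h: 1 for h in consented_ids - ambiguous})
    return labels
```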

Class distributions:

Label	Noncompliance	Autopsy
mistrust=1	480 (0.88%)	270
trust=0	54,030	739
ambiguous/excluded	—	60

Race × autopsy consent rate (EOL cohort) — core paper finding:

Race	Decline	Consent	Rate
Black	45	29	39.2%
White	421	144	25.5%
Hispanic	9	9	50.0%
Asian	20	2	9.1%

Black patients consent to autopsies at ~39% vs White at ~26%.

Stage 04 — PyTorch Mistrust Models

Script: reimpl/04_mistrust_models.py
Purpose: Train two L1-regularized logistic regression models on the interpersonal feature vectors and score all 54,510 patients. The output scores are continuous mistrust proxies used downstream for disparity analysis and outcome prediction.

Inputs: data/chartevents_features.parquet, data/noncompliance_labels.parquet, data/autopsy_labels.parquet

Outputs:

data/mistrust_noncompliant.parquet — 54,510 rows; columns: hadm_id, score
data/mistrust_autopsy.parquet — 54,510 rows; columns: hadm_id, score
data/vectorizer.pkl — fitted DictVectorizer (reused in stage 09)

Model: LogisticRegression(nn.Module) — single nn.Linear(633→1). forward() returns raw logit (= sklearn's decision_function). Weights zero-initialized.

Regularization equivalence to sklearn C=0.1, penalty='l1':

sklearn objective: ||w||_1 + C * Σ log_loss_i
PyTorch objective: BCE_mean + λ * ||w||_1   where  λ = 1/(C * n_train) = 10/n_train
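
The equivalence follows from dividing the sklearn objective by C × n_train, which leaves the minimizer unchanged. The sketch below checks this numerically in plain Python on made-up weights and data; the function names are hypothetical, and class weighting (pos_weight) is deliberately left out of the comparison.

```python
import math
from typing import List, Sequence


def l1_lambda(C: float, n_train: int) -> float:
    """L1 coefficient for a mean-BCE objective matching sklearn's C."""
    return 1.0 / (C * n_train)


def _logit(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def _log_loss(y: int, z: float) -> float:
    # Binary cross-entropy on a raw logit z.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def sklearn_objective(w, b, X, y, C: float) -> float:
    l1 = sum(abs(v) for v in w)
    return l1 + C * sum(_log_loss(yi, _logit(w, b, xi)) for xi, yi in zip(X, y))


def mean_bce_objective(w, b, X, y, lam: float) -> float:
    l1 = sum(abs(v) for v in w)
    mean_loss = sum(_log_loss(yi, _logit(w, b, xi)) for xi, yi in zip(X, y)) / len(y)
    return mean_loss + lam * l1


# Toy values for illustration only.
w, b = [0.5, -1.0, 0.0], 0.1
X: List[List[float]] = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y = [1, 0, 0, 1]
C = 0.1
scaled = sklearn_objective(w, b, X, y, C) / (C * len(y))
assert abs(scaled - mean_bce_objective(w, b, X, y, l1_lambda(C, len(y)))) < 1e-12
```

Because the two objectives differ only by a positive constant factor, they share the same minimizer for any fixed training set.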

Training parameters:

	Noncompliance	Autopsy
Training patients	38,157 (70%)	697 (70%)
Positives (mistrust=1)	336	186
pos_weight	112.6×	2.75×
λ (L1)	0.000262	0.01435
Optimizer	Adam lr=0.05	Adam lr=0.05
Epochs	1,000	500

Test-set evaluation (30% hold-out):

Metric	Noncompliance	Autopsy
AUC-ROC	0.667	0.531
Recall	0.444	0.437
Specificity	0.763	0.618
F1	0.032	0.352

Score distributions in EOL cohort (White vs Black):

Score	White median	Black median	MWU p
Noncompliance	−1.148	−1.000	0.034
Autopsy	−0.174	−0.260	0.002

The noncompliance score correctly shows Black patients as more mistrustful (p=0.034). The autopsy direction appears reversed in the full EOL cohort because only 817/12,958 patients (6.3%) have explicit autopsy mentions; the model extrapolates with poor discrimination (AUC=0.531). Within the labeled autopsy subset the direction is correct.

Key implementation note: building the feature dicts with defaultdict(dict) over itertuples avoids the pandas 2.x groupby.apply instability (0.5 s versus a 16 s failure mode). The full dense tensor (54,510 × 633 float32 values ≈ 138 MB) fits in memory, so no batching is needed.

Stage 05 — Negative Sentiment Score

Script: reimpl/05_sentiment.py
Purpose: Compute a sentiment-based mistrust proxy from discharge summary notes. Higher score = more negative sentiment = more mistrust signal.

Inputs: NOTEEVENTS.csv.gz (discharge summary category only)

Output: data/neg_sentiment.parquet — 52,726 rows; columns: hadm_id, raw_score, neg_score

Method: sentence-level VADER compound score, averaged across sentences per note, averaged across notes per hadm_id, then z-scored and negated:

neg_score[hadm_id] = -(mean_sentence_polarity - μ_all) / σ_all
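
The aggregation and normalisation steps can be sketched as follows, assuming the sentence-level compound scores have already been computed. The function name and input layout are hypothetical, and the use of the population standard deviation is an assumption (the script may use the sample version).

```python
from statistics import fmean, pstdev
from typing import Dict, List


def neg_sentiment_scores(sentence_scores: Dict[int, List[List[float]]]) -> Dict[int, float]:
    """sentence_scores maps hadm_id -> notes -> per-sentence compound scores.

    Each note is assumed to contain at least one sentence.
    """
    # Mean over sentences within a note, then mean over notes per admission.
    raw = {
        hadm_id: fmean(fmean(note) for note in notes)
        for hadm_id, notes in sentence_scores.items()
    }
    values = list(raw.values())
    mu, sigma = fmean(values), pstdev(values)
    # Negate the z-score so that higher means more negative sentiment.
    return {hadm_id: -(r - mu) / sigma for hadm_id, r in raw.items()}
```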

Why sentence-level, not full-text VADER:

Approach	Score distribution	Useful
Full-text compound	saturates at −1.0 for 94% of notes	✗
Sentence-level mean	std ≈ 0.086	✓

Full-text VADER compound saturates because clinical discharge language is lexically negative ("pain," "failure," "death"). Sentence-level averaging avoids saturation and is the closest available analog to the original's word-averaged pattern.en approach.

Stats:

59,652 notes scored (discharge summaries only)
52,726 unique hadm_ids
EOL cohort coverage: 12,543 / 12,958 (96.8%)
Raw score: mean = −0.069, std = 0.067

Race signal in EOL cohort (median neg_score):

Race	Median neg_score
Black	+0.200
White	+0.157

White vs Black MWU p=0.106 (direction correct, not significant at α=0.05).

Correlations between all three mistrust scores:

Pair	Pearson r
neg_sentiment × noncompliant	+0.100
neg_sentiment × autopsy	−0.082
noncompliant × autopsy	+0.266

All three pairs are only weakly correlated (|r| ≤ 0.27), and neg_sentiment × autopsy is slightly negative, so the scores capture largely distinct aspects of mistrust.

Stage 06 — Treatment Durations

Script: reimpl/06_treatment_durations.py
Purpose: Aggregate mechanical ventilation and vasopressor durations (in minutes) per hadm_id for the EOL cohort, applying the 10-hour gap merge to remove administrative noise.

Inputs: data/ventdurations.parquet, data/vasopressordurations.parquet, data/icustay_detail.parquet, data/eol_cohort.parquet

Output: data/treatment_durations.parquet — 12,958 rows; columns: hadm_id, vent_minutes, vaso_minutes (NaN if patient received no treatment of that type)

Span merge algorithm:

flowchart TD
    A[ventdurations — icustay_id level] --> B[join → hadm_id via icustay_detail]
    B --> C[filter to EOL cohort hadm_ids]
    C --> D[collect all spans per hadm_id across all ICU stays]
    D --> E[sort spans by starttime]
    E --> F{gap between consecutive spans ≤ 10 h?}
    F -- Yes --> G[extend current span end]
    F -- No --> H[close current span, start new]
    G & H --> I[sum total minutes of merged spans]
    I --> J[treatment_durations.parquet]
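
The merge step in the flowchart can be sketched for a single hadm_id as below. The function name `merged_minutes` is hypothetical. Following the "extend current span end" step, a merged span runs from its first start to its last end, so short gaps are counted as treatment time in this sketch.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Span = Tuple[datetime, datetime]  # (starttime, endtime)


def merged_minutes(spans: List[Span], max_gap: timedelta = timedelta(hours=10)) -> float:
    """Total minutes after merging spans separated by at most max_gap."""
    if not spans:
        return float("nan")  # no treatment of this type
    ordered = sorted(spans)  # sort by starttime
    cur_start, cur_end = ordered[0]
    total = timedelta(0)
    for start, end in ordered[1:]:
        if start - cur_end <= max_gap:
            cur_end = max(cur_end, end)      # extend the current span
        else:
            total += cur_end - cur_start     # close it and start a new one
            cur_start, cur_end = start, end
    total += cur_end - cur_start
    return total.total_seconds() / 60.0
```

For example, spans at 00:00 to 02:00 and 05:00 to 06:00 merge into one six-hour span, while a span starting 14 hours later is counted separately.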

Treatment prevalence in EOL cohort:

Treatment	Patients	% of EOL
Mechanical ventilation	7,173	55.4%
Vasopressors	4,972	38.4%

White vs Black — core disparity result:

Treatment	White median	Black median	MWU p
Ventilation	2,741 min (45.7 h)	3,620 min (60.3 h)	0.009 ✓
Vasopressors	1,691 min (28.2 h)	1,819 min (30.3 h)	0.317

Severity-stratified (OASIS tertiles, ventilation): Disparity is concentrated in the medium-severity group (OASIS 33–40, p=0.005) — the most clinically ambiguous zone where treatment decisions are discretionary. High-severity patients receive long ventilation regardless of race (ceiling effect).

Stage 07 — Race-Based Disparity Figures

Script: reimpl/07_race_analysis.py
Purpose: Reproduce the race-based treatment CDF figures from the paper (Chapter 3 / race_mimic_aggressive.ipynb). Produces 8 PNGs.

Inputs: data/eol_cohort.parquet, data/treatment_durations.parquet

Outputs (images/chapter3/):

File	Description
`race_mimic_vent.png`	Ventilation CDF — White vs Black, all severities
`race_mimic_vaso.png`	Vasopressor CDF — White vs Black, all severities
`race_mimic_vent_low.png`	Ventilation, low severity (OASIS ≤ tertile 1)
`race_mimic_vent_medium.png`	Ventilation, medium severity
`race_mimic_vent_high.png`	Ventilation, high severity
`race_mimic_vaso_low.png`	Vasopressors, low severity
`race_mimic_vaso_medium.png`	Vasopressors, medium severity
`race_mimic_vaso_high.png`	Vasopressors, high severity

Figure style: Empirical CDF (np.sort(vals) vs np.linspace(0,1,n,endpoint=False)), x-axis clipped at 10,000 min, dashed vertical median lines with values annotated above axes, White = #00A6ED, Black = #FF5400, no top/right spines.
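
The empirical CDF construction named above can be written as a short helper. This is a sketch: the function name `ecdf` is hypothetical, and dropping NaN values (untreated patients) before sorting is an assumption about how the script handles missing durations.

```python
import numpy as np


def ecdf(values):
    """Return (x, y) for an empirical CDF: sorted values vs cumulative fraction."""
    vals = np.asarray(values, dtype=float)
    vals = vals[~np.isnan(vals)]  # assumed: untreated patients are excluded
    x = np.sort(vals)
    y = np.linspace(0, 1, len(x), endpoint=False)
    return x, y


x, y = ecdf([30, 10, float("nan"), 20, 40])
median = float(np.median(x))  # position of the dashed vertical line
```

The returned arrays can be passed straight to a line plot, with the x-axis limited to 10,000 minutes.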

Stage 08 — Mistrust-Based Disparity Figures

Script: reimpl/08_mistrust_analysis.py
Purpose: Reproduce the mistrust-stratified treatment CDF figures (Chapter 5). 3 metrics × 2 treatments × 2 figures (overall + severity panels) = 12 PNGs.

Inputs: data/eol_cohort.parquet, data/treatment_durations.parquet, data/mistrust_noncompliant.parquet, data/mistrust_autopsy.parquet, data/neg_sentiment.parquet

Outputs (images/chapter5/):

File	Description
`mistrust_noncompliant_mimic_vent.png`	Noncompliance score — ventilation CDF
`mistrust_noncompliant_mimic_vaso.png`	Noncompliance score — vasopressor CDF
`mistrust_noncompliant_mimic_vent_severity.png`	3-panel severity-stratified vent
`mistrust_noncompliant_mimic_vaso_severity.png`	3-panel severity-stratified vaso
`mistrust_autopsy_mimic_vent.png`	Autopsy score — ventilation CDF
`mistrust_autopsy_mimic_vaso.png`	Autopsy score — vasopressor CDF
`mistrust_autopsy_mimic_vent_severity.png`	3-panel severity-stratified vent
`mistrust_autopsy_mimic_vaso_severity.png`	3-panel severity-stratified vaso
`neg_sentiment_mimic_vent.png`	Sentiment score — ventilation CDF
`neg_sentiment_mimic_vaso.png`	Sentiment score — vasopressor CDF
`neg_sentiment_mimic_vent_severity.png`	3-panel severity-stratified vent
`neg_sentiment_mimic_vaso_severity.png`	3-panel severity-stratified vaso

Split logic (mirrors original exactly):

white_ids = White EOL patients with treatment data AND this score
black_ids = Black EOL patients with treatment data AND this score
Pool = white_ids ∪ black_ids, sorted ascending by score
Bottom len(white_ids) → High Trust (blue #0000FF)
Top len(black_ids) → Low Trust (red #FF0000)

Split sizes match racial group population counts, making comparisons directly analogous to Stage 07.
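
The split logic above can be sketched in a few lines. The function name `trust_split` is hypothetical; it assumes the ID lists are unique and already restricted to patients with treatment data, and it breaks score ties by hadm_id, which is an arbitrary choice made here for determinism.

```python
from typing import Dict, List, Sequence, Tuple


def trust_split(
    scores: Dict[int, float],
    white_ids: Sequence[int],
    black_ids: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Return (high_trust, low_trust) groups sized like the racial groups."""
    white = [h for h in white_ids if h in scores]
    black = [h for h in black_ids if h in scores]
    # Pool both groups and sort ascending by mistrust score.
    pool = sorted(set(white) | set(black), key=lambda h: (scores[h], h))
    high_trust = pool[: len(white)]              # lowest scores
    low_trust = pool[len(pool) - len(black):]    # highest scores
    return high_trust, low_trust
```

Because the pool has exactly len(white) + len(black) members, the two groups partition it with no overlap.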

Key results (overall CDFs):

Metric	Treatment	High Trust median	Low Trust median	MWU p
Noncompliance	Ventilation	2,681 min	5,000 min	<0.0001
Noncompliance	Vasopressors	1,691 min	1,819 min	0.006
Autopsy	Ventilation	2,627 min	6,011 min	<0.0001
Autopsy	Vasopressors	1,632 min	2,480 min	<0.0001
Neg Sentiment	Ventilation	2,880 min	1,798 min	<0.0001 (reversed)

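The noncompliance and autopsy scores amplify the disparity: the ventilation gap grows to 2,319 min (noncompliance) and 3,384 min (autopsy), about 2.6× and 3.9× the 879-min race-based gap from Stage 06, and the autopsy vasopressor gap (848 min) is about 6.6× the 128-min race-based gap. The negative sentiment direction is reversed (high-trust patients receive more ventilation), which suggests the score tracks how clinicians document clinical circumstances rather than patient mistrust.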

Stage 09 — Outcomes ML

Script: reimpl/09_outcomes_ml.py
Purpose: Reproduce the downstream outcome prediction experiment, which evaluates how much each feature set (baseline demographics, + race, + each mistrust score, + all) improves AUC-ROC on three clinical tasks. Uses PyHealth's binary_metrics_fn for evaluation.

Inputs: data/icustay_detail.parquet, data/oasis.parquet, data/mistrust_noncompliant.parquet, data/mistrust_autopsy.parquet, data/neg_sentiment.parquet, CHARTEVENTS.csv.gz (code status labels), ADMISSIONS.csv.gz (AMA + mortality labels)

Output: data/outcomes_results.parquet — 18 rows (6 feature sets × 3 tasks); columns: feat_set, task, n_patients, n_positive, auc_mean, auc_std, auc_ci_lo, auc_ci_hi, auprc_mean, f1_mean

Three tasks:

Task	Label	Base rate	N
Code Status	DNR/CMO=1 vs Full Code=0	10.9%	39,105
AMA	Left AMA=1 vs compliant=0	0.6%	47,544
Mortality	Deceased=1 vs survived=0	10.8%	47,544

Six feature configurations:

Config	Columns
BASELINE	age, los_hospital, insurance, gender
BASELINE+RACE	+ race
BASELINE+NONCOMPLIANT	+ nc_score
BASELINE+AUTOPSY	+ au_score
BASELINE+SENTIMENT	+ neg_score
BASELINE+ALL	+ race + nc_score + au_score + neg_score

Protocol: 100 random 60/40 stratified splits; full-batch Adam training (200 epochs, L1 λ=10/n, lr=0.05); pyhealth.metrics.binary_metrics_fn(metrics=["roc_auc","pr_auc","f1"]) per split.
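
The split-and-summarise part of the protocol can be sketched without any ML dependencies. Everything here is illustrative: `stratified_split`, `roc_auc`, and `summarise` are hypothetical helpers, the pairwise `roc_auc` is a simple stand-in for the PyHealth metric call, and the use of the sample standard deviation for the ± 1.96×std interval is an assumption.

```python
import random
from statistics import fmean, stdev
from typing import Dict, List, Optional, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_frac: float = 0.4, rng: Optional[random.Random] = None
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx), preserving the class ratio in each part."""
    rng = rng or random.Random(0)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarise(aucs: Sequence[float]) -> Dict[str, float]:
    """Mean, std, and mean ± 1.96×std bounds across repeated splits."""
    mean, sd = fmean(aucs), stdev(aucs)
    return {
        "auc_mean": mean,
        "auc_std": sd,
        "auc_ci_lo": mean - 1.96 * sd,
        "auc_ci_hi": mean + 1.96 * sd,
    }
```

In a full run, the split would be repeated 100 times with different seeds, a model fitted on each training part, and the per-split AUCs passed to `summarise`. The pairwise AUC is quadratic in the number of samples, so it is suitable only for small illustrations.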

AUC-ROC results (mean ± 1.96×std):

Feature Set	AMA	Code Status	Mortality
BASELINE	0.855 ± 0.025	0.759 ± 0.009	0.629 ± 0.010
BASELINE+RACE	0.854 ± 0.026	0.759 ± 0.009	0.636 ± 0.011
BASELINE+NONCOMPLIANT	0.864 ± 0.024	0.759 ± 0.009	0.636 ± 0.010
BASELINE+AUTOPSY	0.854 ± 0.025	0.761 ± 0.009	0.641 ± 0.009
BASELINE+SENTIMENT	0.853 ± 0.026	0.759 ± 0.009	0.633 ± 0.010
BASELINE+ALL	0.861 ± 0.026	0.761 ± 0.009	0.661 ± 0.010

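BASELINE+ALL has a higher mean AUC than BASELINE+RACE on all three tasks, but for AMA (0.861 vs 0.854) and Code Status (0.761 vs 0.759) the difference lies within the reported ± intervals. The mortality improvement (0.629 → 0.661 over BASELINE) is the clearest signal.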

4. Complete Output Artifact Catalog

All parquet files are written to data/ relative to /Users/vtewari/Desktop/mimc/. All images are under images/.

Parquet / Pickle Files

File	Produced by	Rows	Key columns
`data/icustay_detail.parquet`	00	61,532	`icustay_id, hadm_id, subject_id, gender, age, ethnicity, insurance, intime, outtime, los_hospital, los_icu, admission_type`
`data/ventdurations.parquet`	00	30,084	`icustay_id, starttime, endtime`
`data/vasopressordurations.parquet`	00	23,878	`icustay_id, starttime, endtime`
`data/oasis.parquet`	00	61,532	`icustay_id, hadm_id, oasis` + 20 component cols
`data/eol_cohort.parquet`	01	12,958	`hadm_id, subject_id, icustay_id, race, gender, age, insurance, admittime, dischtime, los_hospital, discharge_location, max_oasis`
`data/chartevents_features.parquet`	02	2,247,896	`hadm_id (int), feature_key (str)`
`data/noncompliance_labels.parquet`	03	54,510	`hadm_id, label (0/1)`
`data/autopsy_labels.parquet`	03	1,009	`hadm_id, label (0/1)`
`data/mistrust_noncompliant.parquet`	04	54,510	`hadm_id, score (float, raw logit)`
`data/mistrust_autopsy.parquet`	04	54,510	`hadm_id, score (float, raw logit)`
`data/vectorizer.pkl`	04	—	Fitted `DictVectorizer`; saved as `(hadm_ids, vect)` tuple
`data/neg_sentiment.parquet`	05	52,726	`hadm_id, raw_score, neg_score (z-scored+negated)`
`data/treatment_durations.parquet`	06	12,958	`hadm_id, vent_minutes, vaso_minutes (NaN if untreated)`
`data/outcomes_results.parquet`	09	18	`feat_set, task, n_patients, n_positive, auc_mean, auc_std, auc_ci_lo, auc_ci_hi, auprc_mean, f1_mean`

Image Files

images/chapter3/ — Race-based CDF figures (Stage 07, 8 files):

File	Content
`race_mimic_vent.png`	Overall ventilation CDF, White vs Black
`race_mimic_vaso.png`	Overall vasopressor CDF, White vs Black
`race_mimic_vent_low.png`	Ventilation, low OASIS severity
`race_mimic_vent_medium.png`	Ventilation, medium OASIS severity
`race_mimic_vent_high.png`	Ventilation, high OASIS severity
`race_mimic_vaso_low.png`	Vasopressors, low OASIS severity
`race_mimic_vaso_medium.png`	Vasopressors, medium OASIS severity
`race_mimic_vaso_high.png`	Vasopressors, high OASIS severity

images/chapter5/ — Mistrust-based CDF figures (Stage 08, 12 files):

File	Content
`mistrust_noncompliant_mimic_vent.png`	Noncompliance — ventilation overall
`mistrust_noncompliant_mimic_vaso.png`	Noncompliance — vasopressor overall
`mistrust_noncompliant_mimic_vent_severity.png`	Noncompliance — ventilation 3-panel severity
`mistrust_noncompliant_mimic_vaso_severity.png`	Noncompliance — vasopressor 3-panel severity
`mistrust_autopsy_mimic_vent.png`	Autopsy — ventilation overall
`mistrust_autopsy_mimic_vaso.png`	Autopsy — vasopressor overall
`mistrust_autopsy_mimic_vent_severity.png`	Autopsy — ventilation 3-panel severity
`mistrust_autopsy_mimic_vaso_severity.png`	Autopsy — vasopressor 3-panel severity
`neg_sentiment_mimic_vent.png`	Neg Sentiment — ventilation overall
`neg_sentiment_mimic_vaso.png`	Neg Sentiment — vasopressor overall
`neg_sentiment_mimic_vent_severity.png`	Neg Sentiment — ventilation 3-panel severity
`neg_sentiment_mimic_vaso_severity.png`	Neg Sentiment — vasopressor 3-panel severity

5. Statistical Results

5.1 EOL Cohort Demographics (White vs Black)

Variable	Black	White	p-value
N	1,166	9,551	—
Mean age	71.3 [60.2, 80.4]	77.9 [66.6, 84.9]	<0.001
Public insurance	87.5%	83.7%	<0.001
Female gender	60.4%	50.2%	<0.001
Discharge: Deceased	33.0%	38.7%	<0.001
Discharge: Hospice	3.3%	4.2%	<0.001
Discharge: SNF	63.7%	57.0%	<0.001
Median LOS	13.9 days	14.1 days	0.222

5.2 Treatment Duration Disparities — Race Stratification

Treatment	White median	Black median	Δ (min)	MWU p
Ventilation	2,741	3,620	+879	0.009
Vasopressors	1,691	1,819	+128	0.317

5.3 Treatment Duration Disparities — Mistrust Stratification (Noncompliance Score)

Treatment	High Trust median	Low Trust median	Δ (min)	MWU p
Ventilation	2,681	5,000	+2,319	<0.0001
Vasopressors	1,691	1,819	+128	0.006

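Amplification factor over race stratification: ~2.6× for ventilation (2,319 / 879 min). For vasopressors the tabulated medians give the same 128-min gap as the race split (1×), although the difference reaches significance only under the mistrust split (p = 0.006 vs 0.317).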

5.4 Autopsy Consent Rates by Race (EOL Cohort)

Race	Decline	Consent	Consent rate
Black	45	29	39.2%
White	421	144	25.5%
Asian	20	2	9.1%

5.5 Mistrust Score Severity-Stratified Ventilation (Noncompliance Score)

Severity (OASIS)	High Trust median	Low Trust median	p
Low (≤ tertile 1)	2,368	4,020	0.0002
Medium (t1–t2)	2,486	5,000	<0.0001
High (> t2)	3,442	6,390	0.0001

Disparity is significant across ALL severity levels for mistrust — unlike race where it is concentrated only at medium severity.

5.6 Outcome Prediction AUC-ROC (100 random 60/40 splits)

Feature Set	AMA	Code Status	Mortality
BASELINE	0.855	0.759	0.629
BASELINE+RACE	0.854	0.759	0.636
BASELINE+NONCOMPLIANT	0.864	0.759	0.636
BASELINE+AUTOPSY	0.854	0.761	0.641
BASELINE+SENTIMENT	0.853	0.759	0.633
BASELINE+ALL	0.861	0.761	0.661

6. Implementation Divergences from Original

Aspect	Original (Boag et al.)	This reimplementation
Language / Python version	Python 2, Jupyter notebooks	Python 3.9, standalone scripts
ML framework	scikit-learn `LogisticRegression(C=0.1, penalty='l1')`	PyTorch `nn.Linear` with equivalent L1 penalty: `λ = 10/n_train`
Score output	`decision_function()` (raw logit)	`forward()` returns raw logit — identical semantics
Sentiment analysis	`pattern.en.sentiment(text.split())` — word-averaged polarity	NLTK VADER sentence-level compound, averaged across sentences and notes; avoids saturation (full-text VADER saturates at −1.0 for 94% of clinical notes)
Multi-note aggregation	Last discharge summary (dict overwrite)	Mean across all discharge summaries — more robust
Feature matrix construction	psycopg2 SQL query to PostgreSQL	pandas streaming + DictVectorizer from raw CSVs
Materialized views	PostgreSQL mimic-code views	Reconstructed from raw CSVs in script 00
Evaluation framework	Manual AUC computation	`pyhealth.metrics.binary_metrics_fn`
Training batching	Full-batch (sklearn)	Full-batch PyTorch (DataLoader has 226× overhead for tabular data at this scale)
Cohort LOS threshold	Stay > 6 hours	Stay > 1 day (PROGRESS.md note: original uses 6h for treatment, 12h for notes)
Type annotations	Python 2 style	`Optional[Tuple[str,str]]` from `typing` (Python 3.9 lacks `tuple[...] \| None` union syntax)
