Paper: Boag et al., "Racial Disparities and Mistrust in End-of-Life Care," MLHC 2018
arXiv: https://arxiv.org/abs/1808.03827
Original code: https://github.com/wboag/eol-mistrust
This reimplementation: reimpl/ — Python 3.9, PyTorch 2.8, PyHealth 1.1.6, pandas 2.3.3
Medical mistrust — specifically "iatrophobia," a historically grounded institutional skepticism prevalent in minority communities — mediates racial disparities in aggressive end-of-life (EOL) care. In the ICU, mistrustful patients or families resist transitioning from curative to palliative care, resulting in longer durations of invasive interventions. The paper quantifies this phenomenon by constructing algorithmic mistrust proxies and demonstrating they stratify treatment disparities better than race alone.
MIMIC-III v1.4 — de-identified EHR data from Beth Israel Deaconess Medical Center, covering 58,976 hospital admissions.
Two cohorts are defined:
| Cohort | Size | Purpose |
|---|---|---|
| EOL cohort | ~11,000 admissions | Patients discharged to Hospice, Deceased, or SNF; stay > 6 h |
| ALL cohort | ~50,000 admissions | Full MIMIC population; used to train mistrust models |
All three scores are normalized to zero-mean, unit-variance and computed per hadm_id.
| Metric | Source | Method | Signal |
|---|---|---|---|
| Noncompliance score | CHARTEVENTS interpersonal features (620 binary features) | L1-regularized logistic regression predicting "noncompliant" in notes | Active refusal of care |
| Autopsy-consent score | Same 620 features | LR predicting autopsy consent (consent = mistrust) | Suspicion of institutional care quality |
| Negative sentiment score | Discharge summary notes | -(polarity − μ)/σ using sentiment analysis | Caregiver-recorded tone of patient interaction |
Treatment disparity — race-based stratification:
- Black patients: median ventilation ~832 min longer than White (p < 0.05)
- Vasopressor gap not statistically significant
Treatment disparity — mistrust-based stratification (noncompliance score):
- Low Trust vs High Trust ventilation gap: 2,580 min — a 3× amplification over the race-based gap
- Vasopressors: 650 min gap (p < 0.05), versus only 200 min with race split
Outcome prediction (AUC-ROC, 100 random splits):
| Feature Set | AMA | Code Status | Mortality |
|---|---|---|---|
| Baseline | 0.859 | 0.763 | 0.600 |
| Baseline + Race | 0.861 | 0.766 | 0.614 |
| Baseline + Noncompliant | 0.869 | 0.767 | 0.614 |
| Baseline + Autopsy | 0.861 | 0.773 | 0.603 |
| Baseline + Neg Sentiment | 0.859 | 0.765 | 0.615 |
| Baseline + ALL | 0.873 | 0.782 | 0.635 |
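Baseline + ALL outperforms Baseline + Race on every task. Individual scores are mixed: noncompliance matches or beats race on all three tasks, while autopsy trails on mortality (0.603 vs 0.614) and negative sentiment trails slightly on AMA and Code Status. The noncompliance score is the strongest individual predictor for leaving AMA (coefficient 0.52 vs race 0.03 for Black patients).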
- 10-hour gap merge for treatment spans: administrative re-charting at shift change is absorbed by merging spans separated by ≤ 10 hours
- 620 interpersonal features from CHARTEVENTS: agitation scales, restraints, education readiness, family meetings, pain assessments, spiritual support, etc.
- Autopsy consent as trust proxy: Black patients consented to autopsies at 38.5% vs 24.3% for White — a signal of post-mortem suspicion of care quality
flowchart TD
subgraph RAW["MIMIC-III v1.4 Raw CSVs"]
A1[ADMISSIONS]
A2[PATIENTS]
A3[ICUSTAYS]
A4[CHARTEVENTS]
A5[NOTEEVENTS]
A6[PROCEDUREEVENTS_MV]
A7[INPUTEVENTS_MV/CV]
A8[OUTPUTEVENTS]
A9[D_ITEMS]
end
subgraph S00["00 — build_mimic_views"]
B1[icustay_detail.parquet]
B2[ventdurations.parquet]
B3[vasopressordurations.parquet]
B4[oasis.parquet]
end
subgraph S01["01 — cohort"]
C1[eol_cohort.parquet]
end
subgraph S02["02 — chartevents_features"]
D1[chartevents_features.parquet]
end
subgraph S03["03 — note_labels"]
E1[noncompliance_labels.parquet]
E2[autopsy_labels.parquet]
end
subgraph S04["04 — mistrust_models"]
F1[mistrust_noncompliant.parquet]
F2[mistrust_autopsy.parquet]
F3[vectorizer.pkl]
end
subgraph S05["05 — sentiment"]
G1[neg_sentiment.parquet]
end
subgraph S06["06 — treatment_durations"]
H1[treatment_durations.parquet]
end
subgraph S07["07 — race_analysis"]
I1["images/chapter3/ — 8 CDF PNGs"]
end
subgraph S08["08 — mistrust_analysis"]
J1["images/chapter5/ — 12 CDF PNGs"]
end
subgraph S09["09 — outcomes_ml"]
K1[outcomes_results.parquet]
end
A1 & A2 & A3 & A6 & A7 & A8 --> S00
A4 --> S00
A1 --> S01
B1 --> S01
B4 --> S01
A4 & A9 --> S02
A5 --> S03
D1 --> S03
D1 & E1 & E2 --> S04
A5 --> S05
B2 & B3 & B1 & C1 --> S06
C1 & H1 --> S07
C1 & H1 & F1 & F2 & G1 --> S08
B1 & B4 & F1 & F2 & G1 --> S09
Script: reimpl/00_build_mimic_views.py
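Purpose: Reconstruct, directly from the raw gzipped CSVs, the four PostgreSQL materialized views (from the mimic-code repo) that the original code depended on. This is the foundational step: the cohort (01), treatment-duration (06), and outcome-prediction (09) stages read its outputs directly, and the analysis stages (07, 08) depend on them through the cohort and duration files.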
Inputs (all from physionet.org/files/mimiciii/1.4/):
| File | Used for |
|---|---|
| ADMISSIONS.csv.gz | icustay_detail: admittime, discharge, ethnicity, insurance |
| PATIENTS.csv.gz | Age computation (DOB → age at admission) |
| ICUSTAYS.csv.gz | ICU stay intime/outtime, LOS |
| PROCEDUREEVENTS_MV.csv.gz | MetaVision vent spans (itemids 225792, 225794) |
| CHARTEVENTS.csv.gz | CareVue vent spans (itemid 720) + OASIS vitals |
| INPUTEVENTS_MV.csv.gz | MetaVision vasopressor infusions |
| INPUTEVENTS_CV.csv.gz | CareVue vasopressor infusions |
| OUTPUTEVENTS.csv.gz | Urine output for OASIS score |
Outputs (data/):
| File | Rows | Key columns |
|---|---|---|
| icustay_detail.parquet | 61,532 | icustay_id, hadm_id, subject_id, gender, age, ethnicity, insurance, intime, outtime, los_hospital, los_icu, admission_type |
| ventdurations.parquet | 30,084 | icustay_id, starttime, endtime |
| vasopressordurations.parquet | 23,878 | icustay_id, starttime, endtime |
| oasis.parquet | 61,532 | icustay_id, hadm_id, oasis + 20 component columns |
Vent span detection algorithm:
flowchart TD
A[PROCEDUREEVENTS_MV] -->|itemids 225792,225794| B[MV spans: starttime+endtime]
C[CHARTEVENTS] -->|itemid 720, non-null VALUE| D[CV spans: charted intervals]
B & D --> E[Merge consecutive spans within 8h gap per icustay_id]
E --> F[ventdurations.parquet]
OASIS score components: Age, pre-ICU LOS, GCS, heart rate, mean arterial pressure, respiratory rate, temperature, urine output, mechanical ventilation flag, elective surgery flag — scored per Johnson et al. 2013 breakpoints.
Key implementation notes:
- Age overflow: MIMIC shifts DOB by ~300 years for patients >89. Fixed using integer year/month/day arithmetic + clip(upper=90).
- CHARTEVENTS VALUE column is mixed dtype across chunks; cast to str before stripping.
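The age fix described above can be sketched with plain calendar arithmetic. This is a minimal illustration, not the script's actual code: the function name `age_at_admission` and the `MAX_AGE` constant are hypothetical, and the sketch assumes the goal is whole years capped at 90.

```python
from datetime import date

MAX_AGE = 90  # hypothetical constant mirroring clip(upper=90)


def age_at_admission(dob: date, admit: date, max_age: int = MAX_AGE) -> int:
    """Whole years between dob and admit, computed from year/month/day integers."""
    years = admit.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the admission year.
    if (admit.month, admit.day) < (dob.month, dob.day):
        years -= 1
    # Shifted DOBs for patients older than 89 yield ~300 years; cap them.
    return min(years, max_age)


print(age_at_admission(date(2101, 3, 15), date(2165, 3, 14)))
print(age_at_admission(date(1850, 1, 1), date(2150, 6, 1)))
```

Working on integer date parts avoids subtracting timestamps across a ~300-year span, which is one plausible source of the overflow mentioned above.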
Script: reimpl/01_cohort.py
Purpose: Define the 12,958-patient end-of-life cohort from ADMISSIONS + icustay_detail + oasis. This is the primary population for all disparity analyses.
Inputs: ADMISSIONS.csv.gz, data/icustay_detail.parquet, data/oasis.parquet
Output: data/eol_cohort.parquet — 12,958 rows, 12 columns
Columns: hadm_id, subject_id, icustay_id, race, gender, age, insurance, admittime, dischtime, los_hospital (hours), discharge_location, max_oasis
Cohort selection algorithm:
flowchart TD
A[ADMISSIONS — 58,976 rows] --> B{discharge_location in HOSPICE-HOME / HOSPICE-MEDICAL FACILITY / DEAD/EXPIRED / SNF}
B -- Yes → 14,115 --> C{stay duration > 1 day?}
C -- Yes → 13,106 --> D{has ≥1 ICU stay in icustay_detail?}
D -- Yes → 12,958 --> E[eol_cohort.parquet]
Multi-ICU-stay admissions (3,260 hadm_ids): representative icustay_id = the stay with the highest OASIS score, consistent with the original MAX(oasis) query. Falls back to earliest intime if OASIS is missing.
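A minimal sketch of that selection rule follows. The tuple layout and the function name `representative_stay` are hypothetical. The sketch assumes the fallback applies only when no stay of the admission has an OASIS score, and that ties on OASIS go to the earliest stay.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (icustay_id, oasis or None, intime): hypothetical row layout for one admission
Stay = Tuple[int, Optional[int], datetime]


def representative_stay(stays: List[Stay]) -> int:
    """Pick the icustay_id with the highest OASIS; earliest intime if none is scored."""
    by_intime = sorted(stays, key=lambda s: s[2])
    scored = [s for s in by_intime if s[1] is not None]
    if scored:
        # max() returns the first maximal element, so ties go to the earliest stay.
        return max(scored, key=lambda s: s[1])[0]
    return by_intime[0][0]


stays = [
    (1, 30, datetime(2150, 1, 1)),
    (2, 41, datetime(2150, 1, 5)),
    (3, None, datetime(2149, 12, 30)),
]
print(representative_stay(stays))
```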
Cohort breakdown:
| Location | Count |
|---|---|
| Skilled Nursing Facility | 7,537 |
| Deceased | 4,871 |
| Hospice | 550 |
| Race | Count |
|---|---|
| White | 9,551 |
| Black | 1,166 |
| Not Specified | 1,046 |
| Other | 592 |
| Asian | 303 |
| Hispanic | 293 |
Primary comparison cohort (White + Black): 10,717 patients
Script: reimpl/02_chartevents_features.py
Purpose: Extract ~620 binary interpersonal interaction features from CHARTEVENTS. These features are the input vector for both supervised mistrust classifiers (stage 04). They capture the quality of the patient-provider relationship as recorded in structured chart data.
Inputs: CHARTEVENTS.csv.gz (330M+ rows, streamed in 500k-row chunks), D_ITEMS.csv.gz
Output: data/chartevents_features.parquet — 2,247,896 rows, 2 columns
Columns: hadm_id (int), feature_key (str)
Format: long — one row per unique (hadm_id, feature_key) pair. Reconstruct as dict-of-dicts for DictVectorizer: {hadm_id: {feature_key: 1.0, ...}}.
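The long-to-nested reconstruction can be sketched as below. The helper name `to_feature_dicts` and the sample feature keys are illustrative only; with pandas, the rows could come from `df.itertuples(index=False)`, and the resulting dicts can be passed to scikit-learn's `DictVectorizer`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def to_feature_dicts(rows: Iterable[Tuple[int, str]]) -> Dict[int, Dict[str, float]]:
    """Turn (hadm_id, feature_key) rows into {hadm_id: {feature_key: 1.0}}."""
    features: Dict[int, Dict[str, float]] = defaultdict(dict)
    for hadm_id, feature_key in rows:
        features[hadm_id][feature_key] = 1.0  # binary indicator
    return dict(features)


# Illustrative rows; keys follow the category||value pattern.
rows = [
    (100, "restraint location||none"),
    (100, "bath||refused"),
    (200, "bath||self"),
]
print(to_feature_dicts(rows))
```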
Feature extraction algorithm:
flowchart TD
A[D_ITEMS.csv.gz] -->|filter linksto=chartevents| B{label matches ~40 keywords?}
B -- Yes --> C[matched itemids: ~168]
D[CHARTEVENTS.csv.gz — streamed 500k chunks] --> E{ITEMID in matched set?}
E -- Yes --> F{ERROR=1?}
F -- No --> G[normalise_feature label+value]
G -->|returns None| H[skip row]
G -->|returns category+value| I["feature_key = category||value"]
I --> J[deduplicate per hadm_id via set]
J --> K[chartevents_features.parquet]
Keywords triggering feature inclusion: family communication, education barrier/learner/method/readiness/topic, pain / pain level / pain assess method, restraint, spiritual support, support systems, state, safety measures, family meeting, health care proxy, bath, bed bath, riker-sas scale, richmond-ras scale, side rails, status and comfort, consults, social work consult, sitter, security, observer, informed, and ~15 others.
Normalisation rules (mirrors trust.ipynb cell 7):
| Label pattern | Coarsened to |
|---|---|
reason for restraint |
none / threat of harm / confusion-delirium / presence of violence / treatment interference / risk for falls |
restraint location |
none / 4 point restraint / some restraint |
restraint device |
sitter / limb / (raw) |
bath |
partial / self / refused / shave / hair / none / done |
behavior, behavioral state |
skipped |
pain management/type/cause/location |
skipped |
pain level*, education topic*, safety measures*, side rails*, status and comfort*, *informed* |
kept as-is |
| all others | (label, value) kept as-is |
Stats:
- 54,510 unique hadm_ids (ALL patients with any interpersonal chartevents)
- 633 unique feature keys
- 41.2 avg features per patient
- EOL cohort coverage: 12,813 / 12,958 (98.9%)
Script: reimpl/03_note_labels.py
Purpose: Generate training targets for the two supervised mistrust classifiers by scanning NOTEEVENTS for rule-based signals. Single streaming pass over 1.85M notes.
Inputs: NOTEEVENTS.csv.gz (streamed 50k-row chunks), data/chartevents_features.parquet
Outputs:
- data/noncompliance_labels.parquet: 54,510 rows; columns: hadm_id, label (0/1)
- data/autopsy_labels.parquet: 1,009 rows; columns: hadm_id, label (0/1)
Label generation algorithm:
flowchart TD
A[NOTEEVENTS — 1.85M notes] --> B{ISERROR=1?}
B -- No --> C{text contains 'noncompliant'?}
C -- Yes --> D[noncompliance_set += hadm_id]
B -- No --> E{text contains 'autopsy'?}
E -- Yes --> F[scan each line]
F --> G{decline/refuse/not consent/denied?}
G -- Yes --> H[autopsy_declined += hadm_id]
F --> I{consent/agree/request?}
I -- Yes --> J[autopsy_consented += hadm_id]
D --> K[Build noncompliance_labels: all chartevents patients, default=0, override=1]
H & J --> L{both flags for same hadm_id?}
L -- Yes --> M[exclude as ambiguous]
L -- No --> N[autopsy_labels: consent=1, decline=0]
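The autopsy branch of the flow above might look like the following sketch. The keyword tuples are taken from the flowchart. Three details are assumptions, not confirmed behaviour of 03_note_labels.py: matching is by substring, only lines mentioning "autopsy" are inspected, and decline terms are checked first (because "not consent" contains "consent").

```python
from typing import Dict, Iterable, Set, Tuple

DECLINE_TERMS = ("decline", "refuse", "not consent", "denied")
CONSENT_TERMS = ("consent", "agree", "request")


def autopsy_labels(notes: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Map hadm_id -> 1 (consent) or 0 (decline); ambiguous admissions are dropped."""
    declined: Set[int] = set()
    consented: Set[int] = set()
    for hadm_id, text in notes:
        for line in text.lower().splitlines():
            if "autopsy" not in line:
                continue  # assumption: only autopsy-mentioning lines count
            if any(term in line for term in DECLINE_TERMS):
                declined.add(hadm_id)
            elif any(term in line for term in CONSENT_TERMS):
                consented.add(hadm_id)
    ambiguous = declined & consented
    labels = {h: 1 for h in consented - ambiguous}
    labels.update({h: 0 for h in declined - ambiguous})
    return labels


notes = [
    (1, "Family declined autopsy."),
    (2, "Family consented to autopsy.\nPatient stable."),
    (3, "Autopsy declined.\nLater the family requested autopsy."),
    (4, "No relevant mention."),
]
print(autopsy_labels(notes))
```

Admission 3 carries both flags and is excluded, matching the "exclude as ambiguous" branch.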
Class distributions:
| Label | Noncompliance | Autopsy |
|---|---|---|
| mistrust=1 | 480 (0.88%) | 270 |
| trust=0 | 54,030 | 739 |
| ambiguous/excluded | — | 60 |
Race × autopsy consent rate (EOL cohort) — core paper finding:
| Race | Decline | Consent | Rate |
|---|---|---|---|
| Black | 45 | 29 | 39.2% |
| White | 421 | 144 | 25.5% |
| Hispanic | 9 | 9 | 50.0% |
| Asian | 20 | 2 | 9.1% |
Black patients consent to autopsies at ~39% vs White at ~26%.
Script: reimpl/04_mistrust_models.py
Purpose: Train two L1-regularized logistic regression models on the interpersonal feature vectors and score all 54,510 patients. The output scores are continuous mistrust proxies used downstream for disparity analysis and outcome prediction.
Inputs: data/chartevents_features.parquet, data/noncompliance_labels.parquet, data/autopsy_labels.parquet
Outputs:
- data/mistrust_noncompliant.parquet: 54,510 rows; columns: hadm_id, score
- data/mistrust_autopsy.parquet: 54,510 rows; columns: hadm_id, score
- data/vectorizer.pkl: fitted DictVectorizer (reused in stage 09)
Model: LogisticRegression(nn.Module) — single nn.Linear(633→1). forward() returns raw logit (= sklearn's decision_function). Weights zero-initialized.
Regularization equivalence to sklearn C=0.1, penalty='l1':
sklearn objective: ||w||_1 + C * Σ log_loss_i
PyTorch objective: BCE_mean + λ * ||w||_1 where λ = 1/(C * n_train) = 10/n_train
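The equivalence follows from dividing the sklearn objective by C·n_train, which rescales the loss without moving its minimizer. A small numeric check on toy data (all values below are made up for illustration, and class weighting via pos_weight is deliberately left out) is sketched here:

```python
import math
from typing import Sequence


def log_loss_sum(w: Sequence[float], b: float, X, y) -> float:
    """Summed binary cross-entropy of a linear model with weights w and bias b."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total


def sklearn_objective(w, b, X, y, C: float) -> float:
    return sum(abs(w_j) for w_j in w) + C * log_loss_sum(w, b, X, y)


def pytorch_objective(w, b, X, y, lam: float) -> float:
    return log_loss_sum(w, b, X, y) / len(y) + lam * sum(abs(w_j) for w_j in w)


# Toy data, not MIMIC.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]
w, b, C = [0.7, -0.3], 0.1, 0.1
lam = 1.0 / (C * len(y))

scaled = sklearn_objective(w, b, X, y, C) / (C * len(y))
print(math.isclose(scaled, pytorch_objective(w, b, X, y, lam)))
```

Because the two objectives differ only by the positive factor C·n_train, they share the same optimal weights for any fixed training set.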
Training parameters:
| Parameter | Noncompliance | Autopsy |
|---|---|---|
| Training patients | 38,157 (70%) | 697 (70%) |
| Positives (mistrust=1) | 336 | 186 |
| pos_weight | 112.6× | 2.75× |
| λ (L1) | 0.000262 | 0.01435 |
| Optimizer | Adam lr=0.05 | Adam lr=0.05 |
| Epochs | 1,000 | 500 |
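The pos_weight and λ rows can be reproduced from the training counts. The sketch assumes pos_weight is the negative-to-positive ratio (the usual convention for a weighted BCE loss, and consistent with the table values); the helper names are hypothetical.

```python
def l1_lambda(n_train: int, C: float = 0.1) -> float:
    """L1 strength matching sklearn's C under a mean-reduced loss."""
    return 1.0 / (C * n_train)


def pos_weight(n_train: int, n_pos: int) -> float:
    """Ratio of negatives to positives (assumed definition)."""
    return (n_train - n_pos) / n_pos


for name, n_train, n_pos in [("noncompliance", 38157, 336), ("autopsy", 697, 186)]:
    print(f"{name}: pos_weight={pos_weight(n_train, n_pos):.2f}, "
          f"lambda={l1_lambda(n_train):.6f}")
```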
Test-set evaluation (30% hold-out):
| Metric | Noncompliance | Autopsy |
|---|---|---|
| AUC-ROC | 0.667 | 0.531 |
| Recall | 0.444 | 0.437 |
| Specificity | 0.763 | 0.618 |
| F1 | 0.032 | 0.352 |
Score distributions in EOL cohort (White vs Black):
| Score | White median | Black median | MWU p |
|---|---|---|---|
| Noncompliance | −1.148 | −1.000 | 0.034 |
| Autopsy | −0.174 | −0.260 | 0.002 |
The noncompliance score correctly shows Black patients as more mistrustful (p=0.034). The autopsy direction appears reversed in the full EOL cohort because only 817/12,958 patients (6.3%) have explicit autopsy mentions; the model extrapolates with poor discrimination (AUC=0.531). Within the labeled autopsy subset the direction is correct.
Key implementation note: defaultdict(dict) via itertuples for feature dict construction avoids pandas 2.x groupby.apply instability (0.5s vs 16s failure mode). Full dense tensor (54,510 × 633 = 138 MB) fits in memory — no batching needed.
Script: reimpl/05_sentiment.py
Purpose: Compute a sentiment-based mistrust proxy from discharge summary notes. Higher score = more negative sentiment = more mistrust signal.
Inputs: NOTEEVENTS.csv.gz (discharge summary category only)
Output: data/neg_sentiment.parquet — 52,726 rows; columns: hadm_id, raw_score, neg_score
Method: sentence-level VADER compound score, averaged across sentences per note, averaged across notes per hadm_id, then z-scored and negated:
neg_score[hadm_id] = -(mean_sentence_polarity - μ_all) / σ_all
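The two-level averaging and the negated z-score can be sketched as follows. Sentence polarities are passed in directly, so no sentiment library is called. The function name, the input layout, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def neg_sentiment_scores(polarity: Dict[int, List[List[float]]]) -> Dict[int, float]:
    """polarity[hadm_id] holds one list of sentence-level compound scores per note."""
    # Average sentences within each note, then notes within each admission.
    raw = {
        hadm_id: mean(mean(sentences) for sentences in notes)
        for hadm_id, notes in polarity.items()
    }
    mu = mean(raw.values())
    sigma = pstdev(raw.values())  # assumption: population std
    # Negate so that a higher score means more negative sentiment.
    return {hadm_id: -(score - mu) / sigma for hadm_id, score in raw.items()}


# Made-up polarities for three admissions.
polarity = {1: [[-0.2, 0.0], [-0.4]], 2: [[0.1, 0.3]], 3: [[-0.1]]}
scores = neg_sentiment_scores(polarity)
print(scores)
```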
Why sentence-level, not full-text VADER:
| Approach | Score spread | Useful |
|---|---|---|
| Full-text compound | saturates at −1.0 for 94% of notes | ✗ |
| Sentence-level mean | std ≈ 0.086 | ✓ |
Full-text VADER compound saturates because clinical discharge language is lexically negative ("pain," "failure," "death"). Sentence-level averaging avoids saturation and is the closest available analog to the original's word-averaged pattern.en approach.
Stats:
- 59,652 notes scored (discharge summaries only)
- 52,726 unique hadm_ids
- EOL cohort coverage: 12,543 / 12,958 (96.8%)
- Raw score: mean = −0.069, std = 0.067
Race signal in EOL cohort (median neg_score):
| Race | Median neg_score |
|---|---|
| Black | +0.200 |
| White | +0.157 |
White vs Black MWU p=0.106 (direction correct, not significant at α=0.05).
Correlations between all three mistrust scores:
| Pair | Pearson r |
|---|---|
| neg_sentiment × noncompliant | +0.100 |
| neg_sentiment × autopsy | −0.082 |
| noncompliant × autopsy | +0.266 |
All weakly but significantly correlated — they capture overlapping but distinct aspects of mistrust.
Script: reimpl/06_treatment_durations.py
Purpose: Aggregate mechanical ventilation and vasopressor durations (in minutes) per hadm_id for the EOL cohort, applying the 10-hour gap merge to remove administrative noise.
Inputs: data/ventdurations.parquet, data/vasopressordurations.parquet, data/icustay_detail.parquet, data/eol_cohort.parquet
Output: data/treatment_durations.parquet — 12,958 rows; columns: hadm_id, vent_minutes, vaso_minutes (NaN if patient received no treatment of that type)
Span merge algorithm:
flowchart TD
A[ventdurations — icustay_id level] --> B[join → hadm_id via icustay_detail]
B --> C[filter to EOL cohort hadm_ids]
C --> D[collect all spans per hadm_id across all ICU stays]
D --> E[sort spans by starttime]
E --> F{gap between consecutive spans ≤ 10 h?}
F -- Yes --> G[extend current span end]
F -- No --> H[close current span, start new]
G & H --> I[sum total minutes of merged spans]
I --> J[treatment_durations.parquet]
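The merge loop in the flowchart can be sketched as below. The function name and the `Span` alias are hypothetical. As one reading of "extend current span end", the sketch counts the bridged gap as treatment time, and it returns NaN when an admission has no spans.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Span = Tuple[datetime, datetime]  # (starttime, endtime)


def merged_minutes(spans: List[Span], max_gap: timedelta = timedelta(hours=10)) -> float:
    """Total minutes after merging spans separated by at most max_gap."""
    if not spans:
        return float("nan")  # no treatment of this type
    ordered = sorted(spans)
    total = timedelta()
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start - cur_end <= max_gap:
            cur_end = max(cur_end, end)  # extend the current span
        else:
            total += cur_end - cur_start  # close it and start a new one
            cur_start, cur_end = start, end
    total += cur_end - cur_start
    return total.total_seconds() / 60.0


spans = [
    (datetime(2150, 1, 1, 8), datetime(2150, 1, 1, 12)),
    (datetime(2150, 1, 1, 15), datetime(2150, 1, 1, 18)),  # 3 h gap: merged
    (datetime(2150, 1, 2, 10), datetime(2150, 1, 2, 11)),  # 16 h gap: separate
]
print(merged_minutes(spans))
```

Passing `max_gap=timedelta(hours=8)` gives the same loop with the threshold used for ventilation spans in stage 00.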
Treatment prevalence in EOL cohort:
| Treatment | Patients | % of EOL |
|---|---|---|
| Mechanical ventilation | 7,173 | 55.4% |
| Vasopressors | 4,972 | 38.4% |
White vs Black — core disparity result:
| Treatment | White median | Black median | MWU p |
|---|---|---|---|
| Ventilation | 2,741 min (45.7 h) | 3,620 min (60.3 h) | 0.009 ✓ |
| Vasopressors | 1,691 min (28.2 h) | 1,819 min (30.3 h) | 0.317 |
Severity-stratified (OASIS tertiles, ventilation): Disparity is concentrated in the medium-severity group (OASIS 33–40, p=0.005) — the most clinically ambiguous zone where treatment decisions are discretionary. High-severity patients receive long ventilation regardless of race (ceiling effect).
Script: reimpl/07_race_analysis.py
Purpose: Reproduce the race-based treatment CDF figures from the paper (Chapter 3 / race_mimic_aggressive.ipynb). Produces 8 PNGs.
Inputs: data/eol_cohort.parquet, data/treatment_durations.parquet
Outputs (images/chapter3/):
| File | Description |
|---|---|
| race_mimic_vent.png | Ventilation CDF, White vs Black, all severities |
| race_mimic_vaso.png | Vasopressor CDF, White vs Black, all severities |
| race_mimic_vent_low.png | Ventilation, low severity (OASIS ≤ tertile 1) |
| race_mimic_vent_medium.png | Ventilation, medium severity |
| race_mimic_vent_high.png | Ventilation, high severity |
| race_mimic_vaso_low.png | Vasopressors, low severity |
| race_mimic_vaso_medium.png | Vasopressors, medium severity |
| race_mimic_vaso_high.png | Vasopressors, high severity |
Figure style: Empirical CDF (np.sort(vals) vs np.linspace(0,1,n,endpoint=False)), x-axis clipped at 10,000 min, dashed vertical median lines with values annotated above axes, White = #00A6ED, Black = #FF5400, no top/right spines.
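The empirical CDF coordinates can be sketched without NumPy: sorting the values and pairing them with i/n reproduces np.sort(vals) against np.linspace(0, 1, n, endpoint=False). The function name is hypothetical and plotting is left out.

```python
from statistics import median
from typing import List, Sequence, Tuple


def ecdf(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (x, y) where y[i] = i / n for the i-th smallest value."""
    xs = sorted(values)
    n = len(xs)
    ys = [i / n for i in range(n)]
    return xs, ys


# Made-up durations in minutes.
durations = [300.0, 100.0, 200.0, 400.0]
xs, ys = ecdf(durations)
print(xs, ys, median(durations))
```

The median printed here is the value that would be drawn as the dashed vertical line for each group.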
Script: reimpl/08_mistrust_analysis.py
Purpose: Reproduce the mistrust-stratified treatment CDF figures (Chapter 5). 3 metrics × 2 treatments × 2 figures (overall + severity panels) = 12 PNGs.
Inputs: data/eol_cohort.parquet, data/treatment_durations.parquet, data/mistrust_noncompliant.parquet, data/mistrust_autopsy.parquet, data/neg_sentiment.parquet
Outputs (images/chapter5/):
| File | Description |
|---|---|
| mistrust_noncompliant_mimic_vent.png | Noncompliance score, ventilation CDF |
| mistrust_noncompliant_mimic_vaso.png | Noncompliance score, vasopressor CDF |
| mistrust_noncompliant_mimic_vent_severity.png | 3-panel severity-stratified vent |
| mistrust_noncompliant_mimic_vaso_severity.png | 3-panel severity-stratified vaso |
| mistrust_autopsy_mimic_vent.png | Autopsy score, ventilation CDF |
| mistrust_autopsy_mimic_vaso.png | Autopsy score, vasopressor CDF |
| mistrust_autopsy_mimic_vent_severity.png | 3-panel severity-stratified vent |
| mistrust_autopsy_mimic_vaso_severity.png | 3-panel severity-stratified vaso |
| neg_sentiment_mimic_vent.png | Sentiment score, ventilation CDF |
| neg_sentiment_mimic_vaso.png | Sentiment score, vasopressor CDF |
| neg_sentiment_mimic_vent_severity.png | 3-panel severity-stratified vent |
| neg_sentiment_mimic_vaso_severity.png | 3-panel severity-stratified vaso |
Split logic (mirrors original exactly):
- white_ids = White EOL patients with treatment data AND this score
- black_ids = Black EOL patients with treatment data AND this score
- Pool = white_ids ∪ black_ids, sorted ascending by score
- Bottom len(white_ids) → High Trust (blue #0000FF)
- Top len(black_ids) → Low Trust (red #FF0000)
Split sizes match racial group population counts, making comparisons directly analogous to Stage 07.
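A minimal sketch of this size-matched split is shown below. The function name `trust_split` and the toy scores are illustrative; the sketch assumes every id passed in already has treatment data, and filters only on score availability.

```python
from typing import Dict, List, Set, Tuple


def trust_split(
    scores: Dict[int, float], white_ids: Set[int], black_ids: Set[int]
) -> Tuple[List[int], List[int]]:
    """Return (high_trust, low_trust) hadm_id lists sized like the racial groups."""
    white = [h for h in white_ids if h in scores]
    black = [h for h in black_ids if h in scores]
    pool = sorted(white + black, key=lambda h: scores[h])  # ascending mistrust
    high_trust = pool[: len(white)]              # lowest scores
    low_trust = pool[len(pool) - len(black):]    # highest scores
    return high_trust, low_trust


# Made-up mistrust scores.
scores = {1: -1.2, 2: 0.4, 3: -0.3, 4: 1.5, 5: 0.0}
high, low = trust_split(scores, white_ids={1, 2, 3}, black_ids={4, 5})
print(high, low)
```

Because the two group sizes sum to the pool size, the High Trust and Low Trust groups partition the pool with no overlap.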
Key results (overall CDFs):
| Metric | Treatment | High Trust median | Low Trust median | MWU p |
|---|---|---|---|---|
| Noncompliance | Ventilation | 2,681 min | 5,000 min | <0.0001 |
| Noncompliance | Vasopressors | 1,691 min | 1,819 min | 0.006 |
| Autopsy | Ventilation | 2,627 min | 6,011 min | <0.0001 |
| Autopsy | Vasopressors | 1,632 min | 2,480 min | <0.0001 |
| Neg Sentiment | Ventilation | 2,880 min | 1,798 min | <0.0001 (reversed) |
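The noncompliance and autopsy scores widen the median ventilation gap to 2,319 and 3,384 min, roughly 2.6× and 3.8× the 879-min race-based gap measured in this reimplementation (the paper reports an 832-min race gap). Negative sentiment direction is reversed: high-trust patients receive more ventilation. This is attributed to clinicians documenting more adversarially about patients in worse clinical situations, so the score reflects clinical context rather than mistrust.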
Script: reimpl/09_outcomes_ml.py
Purpose: Reproduce the downstream outcome prediction experiment — evaluating how much each feature set (baseline demographics, + race, + each mistrust score, + all) improves AUC-ROC on three clinical tasks. Uses PyHealth's binary_metrics_fn for evaluation.
Inputs: data/icustay_detail.parquet, data/oasis.parquet, data/mistrust_noncompliant.parquet, data/mistrust_autopsy.parquet, data/neg_sentiment.parquet, CHARTEVENTS.csv.gz (code status labels), ADMISSIONS.csv.gz (AMA + mortality labels)
Output: data/outcomes_results.parquet — 18 rows (6 feature sets × 3 tasks); columns: feat_set, task, n_patients, n_positive, auc_mean, auc_std, auc_ci_lo, auc_ci_hi, auprc_mean, f1_mean
Three tasks:
| Task | Label | Base rate | N |
|---|---|---|---|
| Code Status | DNR/CMO=1 vs Full Code=0 | 10.9% | 39,105 |
| AMA | Left AMA=1 vs compliant=0 | 0.6% | 47,544 |
| Mortality | Deceased=1 vs survived=0 | 10.8% | 47,544 |
Six feature configurations:
| Config | Columns |
|---|---|
| BASELINE | age, los_hospital, insurance, gender |
| BASELINE+RACE | + race |
| BASELINE+NONCOMPLIANT | + nc_score |
| BASELINE+AUTOPSY | + au_score |
| BASELINE+SENTIMENT | + neg_score |
| BASELINE+ALL | + race + nc_score + au_score + neg_score |
Protocol: 100 random 60/40 stratified splits; full-batch Adam training (200 epochs, L1 λ=10/n, lr=0.05); pyhealth.metrics.binary_metrics_fn(metrics=["roc_auc","pr_auc","f1"]) per split.
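Aggregating the per-split AUCs into the reported summary can be sketched as follows. The output keys reuse column names from outcomes_results.parquet; the function name and the use of the sample standard deviation are assumptions, and the three AUC values are made up.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_auc(aucs: Sequence[float]) -> Dict[str, float]:
    """Mean, std, and mean ± 1.96 × std over the per-split AUC values."""
    m = mean(aucs)
    s = stdev(aucs)  # assumption: sample std (n - 1 denominator)
    half_width = 1.96 * s
    return {
        "auc_mean": m,
        "auc_std": s,
        "auc_ci_lo": m - half_width,
        "auc_ci_hi": m + half_width,
    }


summary = summarize_auc([0.60, 0.62, 0.64])
print(summary)
```

Note that 1.96 × std describes split-to-split spread; it is wider than a confidence interval on the mean, which would divide by the square root of the number of splits.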
AUC-ROC results (mean ± 1.96×std):
| Feature Set | AMA | Code Status | Mortality |
|---|---|---|---|
| BASELINE | 0.855 ± 0.025 | 0.759 ± 0.009 | 0.629 ± 0.010 |
| BASELINE+RACE | 0.854 ± 0.026 | 0.759 ± 0.009 | 0.636 ± 0.011 |
| BASELINE+NONCOMPLIANT | 0.864 ± 0.024 | 0.759 ± 0.009 | 0.636 ± 0.010 |
| BASELINE+AUTOPSY | 0.854 ± 0.025 | 0.761 ± 0.009 | 0.641 ± 0.009 |
| BASELINE+SENTIMENT | 0.853 ± 0.026 | 0.759 ± 0.009 | 0.633 ± 0.010 |
| BASELINE+ALL | 0.861 ± 0.026 | 0.761 ± 0.009 | 0.661 ± 0.010 |
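BASELINE+ALL scores higher than BASELINE+RACE on all three tasks, but the AMA (0.861 vs 0.854) and Code Status (0.761 vs 0.759) differences fall within the ± intervals. The mortality improvement over BASELINE (0.629 → 0.661) is the clearest signal.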
All parquet files are written to data/ relative to /Users/vtewari/Desktop/mimc/. All images are under images/.
| File | Produced by | Rows | Key columns |
|---|---|---|---|
| data/icustay_detail.parquet | 00 | 61,532 | icustay_id, hadm_id, subject_id, gender, age, ethnicity, insurance, intime, outtime, los_hospital, los_icu, admission_type |
| data/ventdurations.parquet | 00 | 30,084 | icustay_id, starttime, endtime |
| data/vasopressordurations.parquet | 00 | 23,878 | icustay_id, starttime, endtime |
| data/oasis.parquet | 00 | 61,532 | icustay_id, hadm_id, oasis + 20 component cols |
| data/eol_cohort.parquet | 01 | 12,958 | hadm_id, subject_id, icustay_id, race, gender, age, insurance, admittime, dischtime, los_hospital, discharge_location, max_oasis |
| data/chartevents_features.parquet | 02 | 2,247,896 | hadm_id (int), feature_key (str) |
| data/noncompliance_labels.parquet | 03 | 54,510 | hadm_id, label (0/1) |
| data/autopsy_labels.parquet | 03 | 1,009 | hadm_id, label (0/1) |
| data/mistrust_noncompliant.parquet | 04 | 54,510 | hadm_id, score (float, raw logit) |
| data/mistrust_autopsy.parquet | 04 | 54,510 | hadm_id, score (float, raw logit) |
| data/vectorizer.pkl | 04 | n/a | Fitted DictVectorizer; saved as (hadm_ids, vect) tuple |
| data/neg_sentiment.parquet | 05 | 52,726 | hadm_id, raw_score, neg_score (z-scored+negated) |
| data/treatment_durations.parquet | 06 | 12,958 | hadm_id, vent_minutes, vaso_minutes (NaN if untreated) |
| data/outcomes_results.parquet | 09 | 18 | feat_set, task, n_patients, n_positive, auc_mean, auc_std, auc_ci_lo, auc_ci_hi, auprc_mean, f1_mean |
images/chapter3/ — Race-based CDF figures (Stage 07, 8 files):
| File | Content |
|---|---|
| race_mimic_vent.png | Overall ventilation CDF, White vs Black |
| race_mimic_vaso.png | Overall vasopressor CDF, White vs Black |
| race_mimic_vent_low.png | Ventilation, low OASIS severity |
| race_mimic_vent_medium.png | Ventilation, medium OASIS severity |
| race_mimic_vent_high.png | Ventilation, high OASIS severity |
| race_mimic_vaso_low.png | Vasopressors, low OASIS severity |
| race_mimic_vaso_medium.png | Vasopressors, medium OASIS severity |
| race_mimic_vaso_high.png | Vasopressors, high OASIS severity |
images/chapter5/ — Mistrust-based CDF figures (Stage 08, 12 files):
| File | Content |
|---|---|
| mistrust_noncompliant_mimic_vent.png | Noncompliance: ventilation overall |
| mistrust_noncompliant_mimic_vaso.png | Noncompliance: vasopressor overall |
| mistrust_noncompliant_mimic_vent_severity.png | Noncompliance: ventilation 3-panel severity |
| mistrust_noncompliant_mimic_vaso_severity.png | Noncompliance: vasopressor 3-panel severity |
| mistrust_autopsy_mimic_vent.png | Autopsy: ventilation overall |
| mistrust_autopsy_mimic_vaso.png | Autopsy: vasopressor overall |
| mistrust_autopsy_mimic_vent_severity.png | Autopsy: ventilation 3-panel severity |
| mistrust_autopsy_mimic_vaso_severity.png | Autopsy: vasopressor 3-panel severity |
| neg_sentiment_mimic_vent.png | Neg Sentiment: ventilation overall |
| neg_sentiment_mimic_vaso.png | Neg Sentiment: vasopressor overall |
| neg_sentiment_mimic_vent_severity.png | Neg Sentiment: ventilation 3-panel severity |
| neg_sentiment_mimic_vaso_severity.png | Neg Sentiment: vasopressor 3-panel severity |
| Variable | Black | White | p-value |
|---|---|---|---|
| N | 1,166 | 9,551 | — |
| Mean age | 71.3 [60.2, 80.4] | 77.9 [66.6, 84.9] | <0.001 |
| Public insurance | 87.5% | 83.7% | <0.001 |
| Female gender | 60.4% | 50.2% | <0.001 |
| Discharge: Deceased | 33.0% | 38.7% | <0.001 |
| Discharge: Hospice | 3.3% | 4.2% | <0.001 |
| Discharge: SNF | 63.7% | 57.0% | <0.001 |
| Median LOS | 13.9 days | 14.1 days | 0.222 |
| Treatment | White median | Black median | Δ (min) | MWU p |
|---|---|---|---|---|
| Ventilation | 2,741 | 3,620 | +879 | 0.009 |
| Vasopressors | 1,691 | 1,819 | +128 | 0.317 |
| Treatment | High Trust median | Low Trust median | Δ (min) | MWU p |
|---|---|---|---|---|
| Ventilation | 2,681 | 5,000 | +2,319 | <0.0001 |
| Vasopressors | 1,691 | 1,819 | +128 | 0.006 |
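Amplification factor over race stratification: ~2.6× for ventilation (2,319 vs 879 min); ~1× for vasopressors, where the median gap is unchanged at 128 min but becomes statistically significant (p = 0.006 vs 0.317).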
| Race | Decline | Consent | Consent rate |
|---|---|---|---|
| Black | 45 | 29 | 39.2% |
| White | 421 | 144 | 25.5% |
| Asian | 20 | 2 | 9.1% |
| Severity (OASIS) | High Trust median | Low Trust median | p |
|---|---|---|---|
| Low (≤ tertile 1) | 2,368 | 4,020 | 0.0002 |
| Medium (t1–t2) | 2,486 | 5,000 | <0.0001 |
| High (> t2) | 3,442 | 6,390 | 0.0001 |
Disparity is significant across ALL severity levels for mistrust — unlike race where it is concentrated only at medium severity.
| Feature Set | AMA | Code Status | Mortality |
|---|---|---|---|
| BASELINE | 0.855 | 0.759 | 0.629 |
| BASELINE+RACE | 0.854 | 0.759 | 0.636 |
| BASELINE+NONCOMPLIANT | 0.864 | 0.759 | 0.636 |
| BASELINE+AUTOPSY | 0.854 | 0.761 | 0.641 |
| BASELINE+SENTIMENT | 0.853 | 0.759 | 0.633 |
| BASELINE+ALL | 0.861 | 0.761 | 0.661 |
| Aspect | Original (Boag et al.) | This reimplementation |
|---|---|---|
| Language / Python version | Python 2, Jupyter notebooks | Python 3.9, standalone scripts |
| ML framework | scikit-learn LogisticRegression(C=0.1, penalty='l1') | PyTorch nn.Linear with equivalent L1 penalty: λ = 10/n_train |
| Score output | decision_function() (raw logit) | forward() returns raw logit, identical semantics |
| Sentiment analysis | pattern.en.sentiment(text.split()), word-averaged polarity | NLTK VADER sentence-level compound, averaged across sentences and notes; avoids saturation (full-text VADER saturates at −1.0 for 94% of clinical notes) |
| Multi-note aggregation | Last discharge summary (dict overwrite) | Mean across all discharge summaries — more robust |
| Feature matrix construction | psycopg2 SQL query to PostgreSQL | pandas streaming + DictVectorizer from raw CSVs |
| Materialized views | PostgreSQL mimic-code views | Reconstructed from raw CSVs in script 00 |
| Evaluation framework | Manual AUC computation | pyhealth.metrics.binary_metrics_fn |
| Training batching | Full-batch (sklearn) | Full-batch PyTorch (DataLoader has 226× overhead for tabular data at this scale) |
| Cohort LOS threshold | Stay > 6 hours | Stay > 1 day (PROGRESS.md note: original uses 6h for treatment, 12h for notes) |
| Type annotations | Python 2 style | Optional[Tuple[str,str]] from typing (Python 3.9 lacks tuple[...] \| None union syntax) |