Course: SSIM916 — Problem Set #2: Using Text as Data
Student Number: 750091800
Date: 26 March 2026
Repository: github.com/purvadip/MedAbstractClassifier
Can machine learning models trained on weakly-supervised medical abstracts reliably classify them by research type (Diagnosis, Treatment, Prevention), and which textual features emerge as the strongest interpretable predictors?
This project automates the categorisation of PubMed RCT abstracts into three research-type classes — Diagnosis, Treatment, and Prevention — using the PubMed 20k RCT dataset (Dernoncourt & Lee, 2017). Labels are derived through keyword-frequency heuristics rather than expert annotation, constituting a weak supervision framework. Two classical ML baselines (TF-IDF + Logistic Regression and Multinomial Naive Bayes) are evaluated and compared against a BioBERT external benchmark.
The motivation is practical: automated abstract triage can reduce manual screening workloads in evidence synthesis pipelines (e.g., Cochrane systematic reviews), where initial screening alone routinely exceeds 1,000 person-hours per review.
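Both baselines follow the same TF-IDF-then-classifier pattern. A minimal scikit-learn sketch is shown below; only C = 10.0 and α = 0.05 are taken from the results table, while the vectoriser settings (unigram–bigram features) and the toy corpus are illustrative assumptions, not the repository's exact configuration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# TF-IDF + Logistic Regression baseline (C from the results table;
# the n-gram range is an assumption based on the bigram features reported).
logreg = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(C=10.0, max_iter=1000)),
])

# TF-IDF + Multinomial Naive Bayes baseline (alpha from the results table).
nb = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB(alpha=0.05)),
])

# Toy stand-ins for the real abstracts and heuristic labels.
texts = ["placebo controlled dose trial", "vaccine prevented influenza",
         "magnetic resonance imaging accuracy", "intervention doses compared"]
labels = ["Treatment", "Prevention", "Diagnosis", "Treatment"]

logreg.fit(texts, labels)
nb.fit(texts, labels)
print(logreg.predict(["double blind placebo trial"]))
```

Wrapping vectoriser and classifier in a single `Pipeline` keeps the TF-IDF vocabulary fitted on training folds only, which matters for the leakage-free evaluation described later.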
Source: PubMed 20k RCT — arxiv.org/abs/1710.06071
Local path: PubMed_20k_RCT/ (place in project root before running)
The dataset contains ~20,000 structured RCT abstracts from MEDLINE (1976–2017), segmented into sentence roles (Background, Methods, Results, Conclusions). Sentences are reconstructed per abstract_id into full abstract texts to simulate unstructured real-world input.
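The reconstruction step amounts to a group-and-join over abstract IDs. A sketch with pandas, using assumed column names (`abstract_id`, `role`, `sentence`) and a toy frame in place of the real dataset:

```python
import pandas as pd

# Toy frame mimicking the sentence-level PubMed RCT layout
# (column names are assumptions, not the dataset's exact schema).
sentences = pd.DataFrame({
    "abstract_id": [1, 1, 1, 2, 2],
    "role": ["BACKGROUND", "METHODS", "RESULTS", "METHODS", "CONCLUSIONS"],
    "sentence": ["Sepsis is common.", "We enrolled 200 patients.",
                 "Mortality fell.", "A vaccine was given.", "It worked."],
})

# Flatten each abstract into one text block, discarding sentence roles
# to simulate unstructured real-world input.
abstracts = (sentences.groupby("abstract_id")["sentence"]
             .apply(" ".join)
             .reset_index(name="text"))
print(abstracts.loc[0, "text"])
# → Sepsis is common. We enrolled 200 patients. Mortality fell.
```

Note that the `role` column is deliberately dropped here; the limitations section returns to what that loses.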
Label assignment uses a keyword-frequency heuristic:
- Sensitivity, imaging → Diagnosis
- Vaccine, prophylaxis → Prevention
- Placebo, dose, intervention → Treatment (also the tie-breaking default)
This introduces measurable noise: 11.5% of abstracts had no matching keywords (n = 2,301) and 14.7% had tied scores (n = 2,935). Both groups defaulted to Treatment, producing a heavily skewed initial distribution (Treatment: 17,805; Diagnosis: 1,110; Prevention: 1,085). Class rebalancing then brought all three classes to 6,667 abstracts each.
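The heuristic can be sketched as follows; the keyword lists are the illustrative ones above, not necessarily the repository's full lists:

```python
from collections import Counter

# Illustrative keyword lists; the repository's actual lists may be longer.
KEYWORDS = {
    "Diagnosis":  ["sensitivity", "imaging"],
    "Prevention": ["vaccine", "prophylaxis"],
    "Treatment":  ["placebo", "dose", "intervention"],
}

def assign_label(abstract: str) -> str:
    """Keyword-frequency heuristic: count keyword occurrences per class;
    zero-hit abstracts and tied scores default to Treatment."""
    text = abstract.lower()
    scores = Counter({cls: sum(text.count(kw) for kw in kws)
                      for cls, kws in KEYWORDS.items()})
    (best_cls, best_n), (_, runner_n) = scores.most_common(2)
    if best_n == 0 or best_n == runner_n:
        return "Treatment"  # no-match / tie-breaking default
    return best_cls

print(assign_label("A randomized placebo-controlled dose trial"))  # → Treatment
print(assign_label("Diagnostic imaging sensitivity study"))        # → Diagnosis
```

The default-to-Treatment branch is exactly what produces the skew quantified above.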
Requires Python 3.8+.
```bash
git clone https://github.com/purvadip/MedAbstractClassifier
cd medical_classification
pip install -r requirements.txt
jupyter notebook
```

Open medical_classification.ipynb and select Cell → Run All. The notebook runs the full preprocessing pipeline, trains both baseline models, and saves all figures to outputs/figures/. No manual steps are required once the src/ directory is in place.
Split indices are serialised to split_indices.json (72/8/20 train/validation/test) to ensure deterministic evaluation across runs.
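A 72/8/20 split with serialised indices can be sketched as below; the seed, array sizes, and two-stage split (20% test first, then 10% of the remainder for validation, i.e. 8% overall) are illustrative assumptions:

```python
import json
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42                      # assumed seed
idx = np.arange(1000)          # stand-in for the abstract indices
y = np.repeat([0, 1, 2], [334, 333, 333])  # stand-in class labels

# Hold out 20% for test, then 10% of the remaining 80% for validation,
# yielding 72/8/20 overall; stratify both splits on the labels.
train_val, test = train_test_split(idx, test_size=0.20, stratify=y,
                                   random_state=SEED)
train, val = train_test_split(train_val, test_size=0.10,
                              stratify=y[train_val], random_state=SEED)

with open("split_indices.json", "w") as f:
    json.dump({"train": train.tolist(), "val": val.tolist(),
               "test": test.tolist()}, f)
```

Reloading split_indices.json on every run guarantees all models see an identical held-out test set.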
BioBERT fine-tuning requires a GPU and is isolated in biobert_colab.ipynb.
1. Upload biobert_colab.ipynb to colab.research.google.com.
2. Enable GPU: Runtime → Change Runtime Type → Hardware Accelerator → GPU.
3. Select Run All.
4. Outputs and model checkpoints save automatically to /content/drive/MyDrive/medical_classification/.
5. After training completes (~5–10 minutes), copy the final metrics from Cell 19 into the Section 9 Model Comparison Table of the main notebook.
All metrics are computed on the held-out test set (n = 1,773) under leakage-free conditions. Label keywords were removed from the corpus before vectorisation to prevent the models from reverse-engineering the heuristic.
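The keyword-removal step can be sketched with a single regex pass; the keyword list is the illustrative one from the labelling section, and the `\w*` suffix (to catch simple inflections like "doses") is an assumption about the repository's approach:

```python
import re

# Keywords used by the labelling heuristic (illustrative list).
LABEL_KEYWORDS = ["sensitivity", "imaging", "vaccine", "prophylaxis",
                  "placebo", "dose", "intervention"]

# One compiled pattern; \w* also catches inflections such as "doses".
pattern = re.compile(r"\b(" + "|".join(LABEL_KEYWORDS) + r")\w*\b",
                     flags=re.IGNORECASE)

def strip_label_keywords(text: str) -> str:
    """Remove heuristic keywords so models cannot reverse-engineer labels."""
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()

print(strip_label_keywords("The placebo dose improved imaging sensitivity."))
# → The improved .
```

Applying this before vectorisation means any remaining accuracy must come from surrounding clinical vocabulary rather than the label-defining words themselves.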
| Model | Accuracy | Macro F1 | Precision | Recall | ROC-AUC (OVR) |
|---|---|---|---|---|---|
| Logistic Regression (C = 10.0) | 81.50% | 0.6105 | 0.7833 | 0.5500 | 0.87 |
| Multinomial NB (α = 0.05) | 81.22% | 0.6370 | 0.6823 | 0.5867 | 0.97 |
| BioBERT v1.2 (benchmark) | 93.67% | 0.9362 | 0.9365 | 0.9360 | — |
BioBERT result reported as an external benchmark; it was not trained or evaluated in this repository.
Class-level breakdown (Logistic Regression):
Treatment F1 = 0.89 (precision 0.79, recall 0.97) — Diagnosis F1 = 0.50 (precision 0.79, recall 0.37) — Prevention F1 = 0.44 (precision 0.76, recall 0.31).
The Treatment class benefits directly from the heuristic's default behaviour, while the weak minority-class recall reflects genuine signal sparsity rather than model failure. The one-vs-rest ROC-AUC of 0.87 shows that probabilistic discrimination is substantially better than the default-threshold metrics suggest: clinicians could shift the operating threshold along the ROC curve to prioritise Diagnosis or Prevention review queues.
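The threshold-adjustment idea can be illustrated with class probabilities: instead of taking the argmax, route any abstract whose minority-class probability clears a lowered cutoff into that class's review queue. The probability matrix and the 0.30 cutoff below are toy values, not outputs of the fitted models:

```python
import numpy as np

# Toy class-probability matrix (columns: Diagnosis, Prevention, Treatment).
proba = np.array([[0.35, 0.10, 0.55],
                  [0.20, 0.45, 0.35],
                  [0.10, 0.15, 0.75]])
classes = np.array(["Diagnosis", "Prevention", "Treatment"])

# Default argmax decision rule:
print(classes[proba.argmax(axis=1)])  # → ['Treatment' 'Prevention' 'Treatment']

# Lowered threshold for Diagnosis: flag any abstract with
# P(Diagnosis) >= 0.30 for the Diagnosis review queue.
diagnosis_queue = proba[:, 0] >= 0.30
print(diagnosis_queue)                # → [ True False False]
```

The first abstract is argmax-classified as Treatment but still lands in the Diagnosis queue, which is exactly the recall-for-precision trade the ROC curve makes available.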
The top Logistic Regression coefficients per class align with clinical intuition:
- Diagnosis: resonance, magnetic resonance, accuracy, contrast, colonoscopy, hypersensitivity
- Prevention: vaccine, vaccination, incidences of, infection, influenza, programs
- Treatment: interventions, controlled, trials, doses, chemotherapy, double blind
The presence of the bigrams magnetic resonance and double blind among the top-weighted features confirms that TF-IDF is picking up genuinely discriminative clinical vocabulary, not artefacts of the heuristic.
- Label noise ceiling. With 11.5% no-keyword and 14.7% tied-score abstracts defaulting to Treatment, model performance is bounded by heuristic quality rather than architecture.
- Sentence-role information lost. Reconstructing full abstracts flattens Background, Methods, Results, and Conclusions into one text block. Methods sentences and Results sentences are processed identically.
- No semantic synonymy. TF-IDF treats renal failure and kidney failure as orthogonal vectors, restricting vocabulary coverage for medically equivalent terms.
- Single-label constraint. Multi-domain abstracts (e.g., testing the prevention efficacy of a diagnostic tool) are forced into one class and consistently misclassified.
All code, preprocessing scripts, and trained model artefacts are in the public repository. The notebook is fully self-contained once the dataset folder is placed in the project root.
```bash
# Reproduce baseline results
git clone https://github.com/purvadip/MedAbstractClassifier
cd medical_classification
pip install -r requirements.txt
jupyter notebook   # open medical_classification.ipynb → Run All
```

Split indices in split_indices.json guarantee the same held-out test set across every run.
Dernoncourt, F., & Lee, J. Y. (2017). PubMed 200k RCT: A dataset for sequential sentence classification in medical abstracts. arXiv:1710.06071.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. JMLR, 12, 2825–2830.
Ratner, A. J., De Sa, C., Wu, S., Selsam, D., & Ré, C. (2017). Snorkel: Rapid training data creation with weak supervision. VLDB Endowment, 11(3), 269–282.