Repository: https://github.com/iUtsa/ELMHA-AI/tree/main
Author: Arnab Das Utsa • Stockton University
License: MIT
ELMHA-AI (Early Linguistic Markers of Human Alzheimer’s – AI) is a dual-purpose system that pairs a terminology-driven NLP pipeline for detecting linguistic markers of Alzheimer’s with a caregiver-facing website for interpretability and education.
The repository contains two core modules:
- 🧩 NLP Pipeline – Extracts and analyzes linguistic, semantic, and terminological features to identify Alzheimer’s-related patterns in text or transcripts.
- 🌐 Caregiver Website – A browser-based portal offering explainable AI outputs, terminology definitions, and interactive dashboards for caregivers.
- Tokenization & Sentence Splitting: Divide transcripts into analyzable lexical units.
- Lemmatization / Stemming: Reduce words to canonical forms for consistent term mapping.
- Stopword & Noise Removal: Filter disfluencies (“uh,” “um”), fillers, and transcription noise.
- Part-of-Speech Tagging & Dependency Parsing: Identify syntactic relations to measure complexity and coherence.
- Lexicon Lookup: Map tokens to entries in terminological databases such as UMLS, SNOMED CT, or the custom Alzheimer’s Cognitive Term Dictionary (ACTD).
- Semantic Disambiguation: Contextual embedding similarity (SentenceTransformers MiniLM) selects the most relevant clinical sense.
- Concept Hierarchy: Group terms by cognitive domains (memory, fluency, comprehension).
- Term Frequency & Context Windows: Quantify Alzheimer’s-related word usage patterns and semantic neighborhoods.
- Lexical Diversity: Measure vocabulary richness and type-token ratios.
- Syntactic Complexity: Sentence length, clause depth, dependency density.
- Semantic Shift & Drift: Track how meaning or emotional valence changes over time.
- Sentiment Analysis: Detect affective flattening or emotional variability.
- Contextual Embeddings: Generate high-dimensional representations to compare cognitive coherence.
- Modeling: Logistic Regression and LightGBM trained on linguistic + terminological features.
- Evaluation: Stratified k-fold cross-validation, with F1 ≈ 0.86 and accuracy ≈ 0.89.
- Explainability: SHAP and LIME highlight decisive terms, syntactic shifts, and lexical markers.
- Confidence Calibration: Outputs confidence intervals and human-readable justifications.
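
Two of the features listed above, lexical diversity (type-token ratio) and sentence length, are simple enough to illustrate without any NLP library. The following is a minimal sketch, not the project's `feature_extractor.py`: the filler list, the regex tokenizer, and the function names are all assumptions made for illustration, and a real pipeline would use spaCy or NLTK for tokenization and sentence splitting.

```python
import re

# Hypothetical filler list; the project's actual noise filter is not shown in this README.
FILLERS = {"uh", "um", "er", "hmm"}


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (a stand-in for a real sentence tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return lowercase word tokens with fillers removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in FILLERS]


def lexical_features(text):
    """Compute type-token ratio and mean sentence length (in tokens)."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]  # drop sentences that were only fillers
    tokens = [w for s in sentences for w in s]
    if not tokens:
        return {"type_token_ratio": 0.0, "mean_sentence_length": 0.0}
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }


sample = "Um, I went to the the store. Uh, the store was, um, closed."
features = lexical_features(sample)
print(features)
```

The sample contains 10 tokens after filler removal, 7 of them distinct, so the repeated words pull the type-token ratio below 1. Note that the raw type-token ratio depends on transcript length, so comparisons are only meaningful between texts of similar size.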
/nlp_pipeline
├── preprocess.py
├── term_matcher.py
├── feature_extractor.py
├── classify_model.py
└── explain_layer.py
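
The lexicon lookup and concept hierarchy steps can be pictured as a dictionary lookup followed by grouping by cognitive domain. The sketch below is an illustration under stated assumptions, not the contents of `term_matcher.py`: `TOY_LEXICON`, its entries, and `match_terms` are hypothetical, and the embedding-based disambiguation step is left out entirely.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: lemma -> (cognitive domain, lay explanation).
# The real ACTD format is not documented in this README.
TOY_LEXICON = {
    "forget": ("memory", "difficulty recalling recent events"),
    "remember": ("memory", "recalling past events"),
    "thing": ("fluency", "vague placeholder used instead of a specific noun"),
    "understand": ("comprehension", "following what was said"),
}


def match_terms(lemmas, lexicon=TOY_LEXICON):
    """Count lexicon hits and group them by cognitive domain."""
    hits = defaultdict(dict)
    for lemma in lemmas:
        entry = lexicon.get(lemma)
        if entry is None:
            continue  # lemma is not a tracked term
        domain, _explanation = entry
        hits[domain][lemma] = hits[domain].get(lemma, 0) + 1
    return dict(hits)


# Lemmatized input, as the preprocessing stage would produce it.
lemmas = ["i", "forget", "where", "the", "thing", "be", "i", "forget"]
matches = match_terms(lemmas)
print(matches)
```

Keeping the lay explanation next to each term lets one lexicon file serve both the pipeline and the caregiver-facing term dictionary.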
The website translates complex linguistic AI output into accessible, interpretable insights for caregivers, so they can understand cognitive-linguistic changes without medical jargon.
- Exposes endpoints `/analyze`, `/explain`, and `/dictionary`.
- Runs the NLP pipeline on submitted text or transcripts.
- Returns structured JSON with predictions, top features, and terminological matches.
- Dashboard: Visualizes language trends (syntax, sentiment, vocabulary).
- Term Dictionary: Lookup tool linking clinical terms to lay explanations and resources.
- Explainable Panel: Displays feature importances and contextual sentences.
- Resource Center: Curated caregiver education materials.
/webapp
├── backend/
│   ├── app.py
│   ├── api_routes.py
│   └── utils/
├── frontend/
│   ├── components/
│   ├── pages/
│   └── static/
└── templates/
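
The backend is described as returning structured JSON with predictions, top features, and terminological matches. The README does not specify the schema, so the sketch below shows one plausible shape for an `/analyze` response. Every field name, the `possible_marker` label, and the `build_analyze_response` helper are assumptions for illustration only.

```python
import json


def build_analyze_response(label, probability, top_features, term_matches):
    """Assemble a JSON-serializable payload (hypothetical schema)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # Rank features by absolute contribution, largest first.
    ranked = sorted(top_features.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return {
        "prediction": {"label": label, "probability": round(probability, 3)},
        "top_features": [{"name": n, "contribution": c} for n, c in ranked],
        "term_matches": term_matches,
    }


payload = build_analyze_response(
    label="possible_marker",  # hypothetical label name
    probability=0.7312,
    top_features={"type_token_ratio": -0.21, "filler_rate": 0.34},
    term_matches={"memory": ["forget"]},
)
body = json.dumps(payload, indent=2)
print(body)
```

In a Flask route, a dictionary like `payload` could be returned through `flask.jsonify`. Sorting by absolute contribution keeps the most influential features first regardless of sign, which suits a dashboard panel that lists feature importances.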
| Layer | Tools / Libraries | Description |
|---|---|---|
| NLP | spaCy 3.7, NLTK 3.8.1, SentenceTransformers MiniLM | Linguistic & semantic processing |
| ML / Explainability | scikit-learn 1.3.2, LightGBM 4.1.0, SHAP 0.43.0, LIME 0.2.0.1 | Modeling & interpretability |
| Web | Flask + React | Interactive caregiver dashboard |
| Storage | JSON / CSV Terminology Dictionary | Domain lexicon |
| Visualization | Matplotlib / Recharts | Trend and feature graphs |
| Metric | Value | Description |
|---|---|---|
| Accuracy | 0.89 | Balanced classification |
| F1-Score | 0.86 | Weighted linguistic performance |
| Trust Index | 0.91 | Caregiver-rated understandability |
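
For readers unfamiliar with the first two metrics, the sketch below shows how accuracy and F1 are computed from a set of predictions. It uses made-up labels and does not reproduce the reported numbers. It also computes plain binary F1 for the positive class, whereas the table describes its F1 as weighted; the function name and toy data are assumptions.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, binary F1 for the positive class)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels: 1 = marker present, 0 = marker absent (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

In practice, `sklearn.metrics.accuracy_score` and `sklearn.metrics.f1_score` (with `average="weighted"` for a weighted score) compute the same quantities.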
- Privacy: All datasets de-identified (compliant with 45 CFR 46).
- Transparency: Outputs include full feature explanations.
- Disclaimer: Not a medical diagnostic device; intended for research and caregiver education.
/ELMHA-AI
├── /data/          # Processed text & sample inputs
├── /dictionary/    # Alzheimer’s term lexicon
├── /nlp_pipeline/  # Core NLP scripts
├── /models/        # Saved checkpoints
├── /webapp/        # Website (backend + frontend)
├── /docs/          # Diagrams & papers
└── README.md
@software{DasUtsa2025_ELMHA_AI,
author = {Arnab Das Utsa},
title = {ELMHA-AI: Terminological and Caregiver Assist System for Alzheimer's Detection},
year = {2025},
institution = {Stockton University},
url = {https://github.com/iUtsa/ELMHA-AI}
}