This project applies machine learning to classify disease types based on patient health indicators such as BMI, Blood Pressure, Cholesterol, Heart Rate, and lifestyle factors.
The notebook explores multiple classification models and evaluates their performance. The motivation is to analyze how well ML models can help in early diagnosis and support healthcare decision-making.
- File: disease_classification_dataset.csv
- Rows: ~1800
- Columns: 9
- Quantitative: Age, BMI, Blood Pressure, Cholesterol, Heart Rate
- Categorical: Smoking Habit (Yes/No), Physical Activity Level (Low/Medium/High), Family History (Yes/No)
- Target: Disease Type (Healthy, Disease 1, Disease 2, etc.)
The dataset is fairly balanced (each disease type ≈ 600 samples).
It includes missing values, which are handled by imputation.
- Missing Values:
- BMI & Cholesterol → filled with mean
- Smoking Habit → filled with mode
- Categorical Encoding:
- LabelEncoder used for Smoking Habit, Physical Activity Level, Family History, Disease Type
- Scaling:
- StandardScaler applied to numerical features (Age, BMI, BP, Cholesterol, Heart Rate)
- Dimensionality Reduction & Clustering:
- KMeans clustering (elbow method for k selection)
- PCA for 2D visualization of clusters
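The preprocessing and clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it runs on a synthetic stand-in table rather than the real CSV, the column names are assumed to match the feature names listed earlier, the set of features passed to KMeans is a guess, and `k = 3` is only a placeholder for whatever the elbow plot suggests.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for disease_classification_dataset.csv (column names assumed).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "Age": rng.integers(18, 80, n).astype(float),
    "BMI": rng.normal(26, 4, n),
    "Blood Pressure": rng.normal(120, 15, n),
    "Cholesterol": rng.normal(200, 30, n),
    "Heart Rate": rng.normal(75, 10, n),
    "Smoking Habit": rng.choice(["Yes", "No"], n),
    "Physical Activity Level": rng.choice(["Low", "Medium", "High"], n),
    "Family History": rng.choice(["Yes", "No"], n),
    "Disease Type": rng.choice(["Healthy", "Disease 1", "Disease 2"], n),
})
# Blank out a few cells to mimic the missing values in the real data.
for col in ["BMI", "Cholesterol", "Smoking Habit"]:
    df.loc[rng.choice(n, 15, replace=False), col] = np.nan

# Missing values: mean for BMI and Cholesterol, mode for Smoking Habit.
for col in ["BMI", "Cholesterol"]:
    df[col] = df[col].fillna(df[col].mean())
df["Smoking Habit"] = df["Smoking Habit"].fillna(df["Smoking Habit"].mode()[0])

# Categorical encoding: one LabelEncoder per column.
categorical = ["Smoking Habit", "Physical Activity Level", "Family History", "Disease Type"]
for col in categorical:
    df[col] = LabelEncoder().fit_transform(df[col])

# Scaling: zero mean, unit variance for the numerical features.
numeric = ["Age", "BMI", "Blood Pressure", "Cholesterol", "Heart Rate"]
df[numeric] = StandardScaler().fit_transform(df[numeric])

# Elbow method: record the KMeans inertia for a range of k values.
X = df.drop(columns="Disease Type")
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Cluster with the chosen k (3 is a placeholder) and project to 2D for plotting.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)
```

Plotting `inertias` against `k` gives the elbow curve, and a scatter plot of `coords` colored by `labels` gives the 2D cluster view.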
- Train/Test split: 70% training, 30% testing
- Random State: 42 (for reproducibility)
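As a rough sketch of the split settings above, the snippet below uses scikit-learn's `train_test_split` on placeholder arrays shaped like the dataset (about 1800 rows). The arrays `X` and `y` are made up for illustration, and whether the notebook stratifies the split is not stated, so no stratification is assumed here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's approximate shape: ~1800 rows, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1800, 8))
y = np.repeat([0, 1, 2], 600)

# 70% training, 30% testing, fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(X_train.shape, X_test.shape)
```

With about 1800 rows this leaves roughly 1260 samples for training and 540 for testing.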
Neural Network:
- Hidden layers: (50, 50)
- Max Iterations: 1000
- Learning rate: 0.01
- Accuracy: ~33.5%
- AUC: 0.48
- Performs best on Disease Type 2 and weakly on the other classes.
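The settings above map naturally onto scikit-learn's `MLPClassifier`, although the notebook's exact estimator and arguments are not shown here. The sketch below assumes that mapping (in particular, that "Learning rate" means `learning_rate_init`), trains on synthetic three-class data from `make_classification`, and reports a macro one-vs-rest AUC, which is one common way to get a single AUC number for a multi-class problem.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Assumed mapping of the listed hyperparameters onto MLPClassifier arguments.
mlp = MLPClassifier(
    hidden_layer_sizes=(50, 50),
    max_iter=1000,
    learning_rate_init=0.01,
    random_state=42,
)
mlp.fit(X_train_s, y_train)

proba = mlp.predict_proba(X_test_s)
acc = accuracy_score(y_test, mlp.predict(X_test_s))
auc = roc_auc_score(y_test, proba, multi_class="ovr")  # macro one-vs-rest AUC
print(f"accuracy={acc:.3f}  auc={auc:.3f}")
```

The numbers printed here come from synthetic data and say nothing about the real dataset.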
Decision Tree:
- Max Depth: 5
- Accuracy: ~35% (best among the three models)
- AUC: 0.50
- Serves as a baseline but underfits.
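One way to check the underfitting claim is to compare training and test accuracy for a depth-limited tree: if both are low, the model is underfitting rather than overfitting. The sketch below assumes scikit-learn's `DecisionTreeClassifier` with `max_depth=5` and uses synthetic data, so its numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# Similar, low scores on both splits point to underfitting.
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}")

# Print the top levels of the tree to see which features drive the splits.
print(export_text(tree, max_depth=2))
```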
Logistic Regression:
- Max Iterations: 1000
- Accuracy: ~31% (lowest)
- AUC: ≈0.50
- Struggles with multi-class predictions.
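To see where a multi-class model struggles, a confusion matrix and per-class report are more informative than a single accuracy figure. The following sketch assumes scikit-learn's `LogisticRegression` with `max_iter=1000` and, as in the other examples, uses synthetic data in place of the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(y_test, y_pred))
```

Large off-diagonal counts in `cm` show which classes are being confused with each other.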
| Model | Accuracy | AUC |
|---|---|---|
| Decision Tree | ~35% | 0.50 |
| Neural Network | ~33.5% | 0.48 |
| Logistic Regression | ~31% | 0.50 |
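- Best Model: Decision Tree (but still weak).
- Observation: All models performed at roughly chance level (accuracy near 33% and AUC near 0.5, which is what random guessing yields on three balanced classes).
- Likely reasons: weak correlation between the features and the target, no hyperparameter tuning, and possibly non-linear patterns that these models did not capture.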
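To put the scores in context, it helps to compute the chance-level baseline explicitly. Assuming three balanced classes of about 600 samples each (as the dataset description suggests), a classifier that ignores the features gets about 1/3 accuracy and an AUC of 0.5. The sketch below shows this with scikit-learn's `DummyClassifier`; the label array is constructed for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Three balanced classes, 600 samples each (assumed from the dataset description).
y = np.repeat([0, 1, 2], 600)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the features

# "prior" always predicts the most frequent class and outputs the class priors
# as probabilities, so it carries no information about individual samples.
dummy = DummyClassifier(strategy="prior").fit(X, y)

chance_acc = dummy.score(X, y)
chance_auc = roc_auc_score(y, dummy.predict_proba(X), multi_class="ovr")
print(f"chance accuracy={chance_acc:.3f}  chance AUC={chance_auc:.2f}")
```

Against this baseline, accuracies of 31% to 35% and AUC values of 0.48 to 0.50 are indistinguishable from guessing.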
- Correlation heatmap
- Distribution of disease types
- Feature distributions (BMI, BP, Cholesterol, Heart Rate)
- Boxplots by disease type
- KMeans cluster visualization (2D PCA)
- ROC Curves & Confusion Matrices for all models
- Bar chart comparing model accuracies
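As an example of the last item, the accuracy comparison can be drawn with a few lines of Matplotlib. This is a sketch rather than the notebook's plotting code: the values are the approximate accuracies from the results table, and the dashed chance line assumes three balanced classes.

```python
import matplotlib.pyplot as plt

# Approximate accuracies taken from the results table.
models = ["Decision Tree", "Neural Network", "Logistic Regression"]
accuracy = [35.0, 33.5, 31.0]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(models, accuracy)

# Reference line for random guessing on three balanced classes (assumption).
ax.axhline(100 / 3, linestyle="--", color="gray", label="chance (3 balanced classes)")

ax.set_ylabel("Accuracy (%)")
ax.set_ylim(0, 100)
ax.set_title("Model accuracy comparison")
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders automatically; in a script, call `plt.show()` or `fig.savefig(...)` to display or save it.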
- All models achieved low accuracy (31–35%), showing difficulty in learning disease patterns.
- Decision Tree performed best, but still insufficient.
- Models require:
- Hyperparameter tuning
- Better feature engineering
- More advanced algorithms (Random Forest, Gradient Boosting, XGBoost)
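A possible starting point for the first and third items is a small cross-validated grid search over a Random Forest. The sketch below uses scikit-learn's `GridSearchCV`; the parameter grid is an arbitrary illustrative choice, the data is synthetic, and nothing here implies that tuning would rescue accuracy on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Illustrative grid; the values are placeholders, not recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out test split.
test_acc = search.score(X_test, y_test)
print(search.best_params_, f"test accuracy={test_acc:.3f}")
```

The same pattern works for `GradientBoostingClassifier` by swapping the estimator and adjusting the grid.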
- Clone this repository:
  git clone https://github.com/yourusername/disease-classification-ml.git
  cd disease-classification-ml
- Install dependencies:
  pip install -r requirements.txt
- Place disease_classification_dataset.csv in the project folder.
- Run the notebook:
  jupyter notebook GroupNo-6_21301282_21201240.ipynb
- Python 3
- NumPy
- Pandas
- Matplotlib, Seaborn
- Scikit-learn
Future Work: Explore Random Forest, Gradient Boosting, and Neural Networks with deeper tuning for improved classification accuracy.
Abdullah Al Fahad
LinkedIn