Disease Classification using Machine Learning

Project Overview

This project applies machine learning to classify disease types based on patient health indicators such as BMI, Blood Pressure, Cholesterol, Heart Rate, and lifestyle factors.

The notebook explores multiple classification models and evaluates their performance. The motivation is to analyze how well ML models can help in early diagnosis and support healthcare decision-making.

Dataset Description

File: disease_classification_dataset.csv
Rows: ~1800
Columns: 9

Features

Quantitative: Age, BMI, Blood Pressure, Cholesterol, Heart Rate
Categorical: Smoking Habit (Yes/No), Physical Activity Level (Low/Medium/High), Family History (Yes/No)
Target: Disease Type (Healthy, Disease 1, Disease 2, etc.)

The dataset is fairly balanced: each disease type has about 600 samples, which corresponds to three classes in ~1800 rows.
Some values are missing and are handled with imputation.
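
The balance and missing-value claims above can be checked with a few lines of pandas. This is a minimal sketch on a synthetic stand-in frame: the column names are assumed from the feature list and may not match the CSV headers exactly, and the 50 missing BMI values are made up for illustration. With the real file, `pd.read_csv("disease_classification_dataset.csv")` would replace the synthetic frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real CSV; column names are assumptions.
rng = np.random.default_rng(42)
n = 1800
df = pd.DataFrame({
    "BMI": rng.normal(27, 4, n),
    "Cholesterol": rng.normal(200, 30, n),
    "Smoking Habit": rng.choice(["Yes", "No"], n),
    "Disease Type": np.repeat(["Healthy", "Disease 1", "Disease 2"], n // 3),
})
# Blank out 50 BMI values to mimic gaps (illustrative count, not from the dataset).
df.loc[rng.choice(n, 50, replace=False), "BMI"] = np.nan

class_counts = df["Disease Type"].value_counts()  # samples per class
missing_per_column = df.isna().sum()              # missing cells per column
print(class_counts)
print(missing_per_column)
```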

Data Preprocessing

Missing Values:
- BMI & Cholesterol → filled with mean
- Smoking Habit → filled with mode
Categorical Encoding:
- LabelEncoder used for Smoking Habit, Physical Activity Level, Family History, Disease Type
Scaling:
- StandardScaler applied to numerical features (Age, BMI, BP, Cholesterol, Heart Rate)
Dimensionality Reduction & Clustering:
- KMeans clustering (elbow method for k selection)
- PCA for 2D visualization of clusters
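
The imputation, encoding, and scaling steps listed above could look roughly like the sketch below. It runs on a tiny synthetic frame with assumed column names and only a subset of the features, so it illustrates the sequence of steps and is not the notebook's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny synthetic frame; column names are assumptions based on the feature list.
rng = np.random.default_rng(0)
n = 12
df = pd.DataFrame({
    "Age": rng.integers(20, 80, n).astype(float),
    "BMI": rng.normal(27, 4, n),
    "Cholesterol": rng.normal(200, 30, n),
    "Smoking Habit": rng.choice(["Yes", "No"], n),
    "Disease Type": rng.choice(["Healthy", "Disease 1", "Disease 2"], n),
})
df.loc[[1, 5], "BMI"] = np.nan
df.loc[3, "Cholesterol"] = np.nan
df.loc[7, "Smoking Habit"] = np.nan

# Numeric gaps are filled with the column mean, categorical gaps with the mode.
for col in ["BMI", "Cholesterol"]:
    df[col] = df[col].fillna(df[col].mean())
df["Smoking Habit"] = df["Smoking Habit"].fillna(df["Smoking Habit"].mode()[0])

# LabelEncoder maps each category to an integer code.
for col in ["Smoking Habit", "Disease Type"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# StandardScaler centers each numeric column to mean 0 and unit variance.
numeric_cols = ["Age", "BMI", "Cholesterol"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
print(df.head())
```

When a held-out test set is used, fitting the scaler on the training split only and then transforming the test split keeps test statistics out of training.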

Dataset Splitting

Train/Test split: 70% training, 30% testing
Random State: 42 (for reproducibility)
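
A 70/30 split with a fixed seed can be written with scikit-learn's `train_test_split`. The sketch below uses random arrays of the dataset's approximate size (1800 rows, 8 features) as placeholders. Whether the notebook also stratifies by class is not stated, so no `stratify` argument is used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's approximate shape (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(1800, 8))
y = np.repeat([0, 1, 2], 600)

# 70% training, 30% testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(X_train.shape, X_test.shape)
```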

Models Implemented

1. Neural Network (MLPClassifier)

Hidden layers: (50, 50)
Max Iterations: 1000
Learning rate: 0.01
Accuracy: ~33.5%
AUC: 0.48
Performs better on Disease Type 2, weak on others.
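
A configuration matching the settings above might look like this sketch. It assumes that "Learning rate: 0.01" refers to `learning_rate_init`, adds a `random_state` for repeatable runs, and trains on synthetic three-class data from `make_classification`, so the printed scores say nothing about the real dataset. Per-class recall is one way to see whether a single class is predicted better than the others.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class data stands in for the real dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

mlp = MLPClassifier(
    hidden_layer_sizes=(50, 50),
    max_iter=1000,
    learning_rate_init=0.01,  # assumed mapping of "Learning rate: 0.01"
    random_state=42,          # assumption, added for repeatable runs
)
mlp.fit(X_train, y_train)

mlp_accuracy = mlp.score(X_test, y_test)
# One recall value per class, in label order 0, 1, 2.
per_class_recall = recall_score(y_test, mlp.predict(X_test), average=None)
print(mlp_accuracy, per_class_recall)
```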

2. Decision Tree Classifier

Max Depth: 5
Accuracy: ~35% (best among models)
AUC: 0.50
Baseline but underfits.
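
One way to check the underfitting claim is to compare training and test accuracy for the depth-limited tree. The sketch below uses synthetic data and an assumed `random_state`, so its numbers are illustrative only. Low scores on both splits point to underfitting, while a high training score with a low test score points to overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data stands in for the real dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# max_depth=5 caps tree growth; random_state is an assumption.
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"train={train_accuracy:.3f} test={test_accuracy:.3f}")
```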

3. Logistic Regression

Max Iterations: 1000
Accuracy: ~31% (lowest)
AUC: ≈0.50
Struggles with multi-class predictions.
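
For reference, scikit-learn's `LogisticRegression` handles more than two classes directly and returns one probability per class. The sketch below uses synthetic data, with only `max_iter=1000` taken from the settings above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class data stands in for the real dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# One column of probabilities per class; each row sums to 1.
proba = log_reg.predict_proba(X_test)
log_reg_accuracy = log_reg.score(X_test, y_test)
print(proba[:3], log_reg_accuracy)
```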

Results & Comparison

| Model | Accuracy | AUC |
| --- | --- | --- |
| Decision Tree | ~35% | 0.50 |
| Neural Network | ~33.5% | 0.48 |
| Logistic Regression | ~31% | 0.50 |

Best Model: Decision Tree (but still weak).
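Observation: All models performed at roughly chance level. With three balanced classes, random guessing yields about 33% accuracy and an AUC of 0.50.
Likely reasons: weak correlations between the features and the target, no hyperparameter tuning, and possibly non-linear patterns that these models did not capture.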
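
To make the chance-level comparison concrete, the sketch below computes accuracy and a multi-class AUC on synthetic data whose labels are independent of the features, next to a most-frequent-class baseline. The one-vs-rest macro-averaged AUC is an assumption, since the README does not say how the reported AUC was computed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Random features with labels that carry no signal (synthetic, for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(1800, 8))
y = np.repeat([0, 1, 2], 600)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
# One-vs-rest AUC, macro-averaged over the three classes.
auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class="ovr")

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"accuracy={accuracy:.3f} auc={auc:.3f} baseline={baseline_accuracy:.3f}")
```

A model whose accuracy and AUC sit close to these no-signal values has learned little from the features.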

Visualizations

Correlation heatmap
Distribution of disease types
Feature distributions (BMI, BP, Cholesterol, Heart Rate)
Boxplots by disease type
KMeans cluster visualization (2D PCA)
ROC Curves & Confusion Matrices for all models
Bar chart comparing model accuracies
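
The numbers behind the elbow plot and the 2D cluster view can be computed as in the sketch below, which leaves out the plotting itself. It uses random placeholder features, and `k = 3` is a hypothetical choice made only for illustration. On the real data, `k` would be read off the elbow curve.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the five scaled numeric features (assumption).
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(300, 5)))

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 7):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_

# Hypothetical k, chosen here only to illustrate the remaining steps.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project to two principal components for a 2D scatter plot colored by label.
coords = PCA(n_components=2).fit_transform(X)
print(inertias)
print(coords[:3], labels[:3])
```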

Conclusion

All models achieved low accuracy (31–35%), showing difficulty in learning disease patterns.
Decision Tree performed best, but still insufficient.
Models require:
- Hyperparameter tuning
- Better feature engineering
- More advanced algorithms (Random Forest, Gradient Boosting, XGBoost)
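
As a starting point for the tuning suggested above, a cross-validated grid search over a Random Forest could be set up like this sketch. The parameter grid, the 3-fold setting, and the synthetic data are illustrative assumptions, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data stands in for the real dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```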

How to Run

Clone this repository:

git clone https://github.com/fah-ayon/Disease-Classification-using-Machine-Learning.git
cd Disease-Classification-using-Machine-Learning

Install dependencies:
```
pip install -r requirements.txt
```
Place disease_classification_dataset.csv in the project folder.

Run the notebook:

jupyter notebook code.ipynb

Tech Stack

Python 3
NumPy
Pandas
Matplotlib, Seaborn
Scikit-learn

Future Work: Explore Random Forest, Gradient Boosting, and Neural Networks with deeper tuning for improved classification accuracy.

Author

Abdullah Al Fahad
LinkedIn
