This project focuses on disease classification using various machine learning algorithms. The goal is to build predictive models that classify patient data into relevant disease categories, helping in early diagnosis and healthcare decision-making.

fah-ayon/Disease-Classification-using-Machine-Learning

Disease Classification using Machine Learning

Project Overview

This project applies machine learning to classify disease types based on patient health indicators such as BMI, Blood Pressure, Cholesterol, Heart Rate, and lifestyle factors.

The notebook explores multiple classification models and evaluates their performance. The motivation is to analyze how well ML models can help in early diagnosis and support healthcare decision-making.


Dataset Description

  • File: disease_classification_dataset.csv
  • Rows: ~1800
  • Columns: 9

Features

  • Quantitative: Age, BMI, Blood Pressure, Cholesterol, Heart Rate
  • Categorical: Smoking Habit (Yes/No), Physical Activity Level (Low/Medium/High), Family History (Yes/No)
  • Target: Disease Type (Healthy, Disease 1, Disease 2, etc.)

The dataset is fairly balanced, with roughly 600 samples per disease type.
It contains missing values, which are handled by imputation (see Data Preprocessing).


Data Preprocessing

  • Missing Values:
    • BMI & Cholesterol → filled with mean
    • Smoking Habit → filled with mode
  • Categorical Encoding:
    • LabelEncoder used for Smoking Habit, Physical Activity Level, Family History, Disease Type
  • Scaling:
    • StandardScaler applied to numerical features (Age, BMI, BP, Cholesterol, Heart Rate)
  • Dimensionality Reduction & Clustering:
    • KMeans clustering (elbow method for k selection)
    • PCA for 2D visualization of clusters
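The imputation, encoding, and scaling steps above can be sketched as follows. This is a minimal illustration on a tiny synthetic frame, not the notebook's actual code: the column names are assumed from the feature list, and the sample values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny synthetic stand-in for disease_classification_dataset.csv.
# Column names are assumed from the feature list; the real file may differ.
df = pd.DataFrame({
    "Age": [34, 51, 47, 29, 62, 45],
    "BMI": [22.5, np.nan, 31.0, 27.4, 24.8, np.nan],
    "Blood Pressure": [118, 135, 142, 110, 128, 121],
    "Cholesterol": [180.0, 220.0, np.nan, 170.0, 240.0, 200.0],
    "Heart Rate": [72, 80, 88, 65, 76, 70],
    "Smoking Habit": ["No", "Yes", None, "No", "Yes", "No"],
    "Physical Activity Level": ["High", "Low", "Medium", "High", "Low", "Medium"],
    "Family History": ["No", "Yes", "Yes", "No", "No", "Yes"],
    "Disease Type": ["Healthy", "Disease 1", "Disease 2",
                     "Healthy", "Disease 1", "Disease 2"],
})

# Mean imputation for the numeric columns that contain gaps.
for col in ["BMI", "Cholesterol"]:
    df[col] = df[col].fillna(df[col].mean())

# Mode imputation for the categorical column that contains gaps.
df["Smoking Habit"] = df["Smoking Habit"].fillna(df["Smoking Habit"].mode()[0])

# Label-encode each categorical column, keeping the encoders for inverse lookups.
categorical = ["Smoking Habit", "Physical Activity Level",
               "Family History", "Disease Type"]
encoders = {}
for col in categorical:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Standardize the numeric features to zero mean and unit variance.
numeric = ["Age", "BMI", "Blood Pressure", "Cholesterol", "Heart Rate"]
df[numeric] = StandardScaler().fit_transform(df[numeric].astype(float))
```

Note that LabelEncoder assigns integer codes in sorted order, so Physical Activity Level becomes High=0, Low=1, Medium=2 rather than an ordinal Low < Medium < High scale.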

Dataset Splitting

  • Train/Test split: 70% training, 30% testing
  • Random State: 42 (for reproducibility)
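A minimal sketch of this split, using random placeholder arrays in place of the real features and labels. The array shapes assume 8 feature columns and three classes of 600 samples each, which is inferred from the dataset description rather than confirmed by the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1800 rows, 8 features, three balanced classes (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(1800, 8))
y = np.repeat([0, 1, 2], 600)

# 70% training, 30% testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```

With about 1800 rows this gives 1260 training and 540 test samples. Passing `stratify=y` would additionally keep the class proportions equal in both parts; the README does not say whether the notebook does this.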

Models Implemented

1. Neural Network (MLPClassifier)

  • Hidden layers: (50, 50)
  • Max Iterations: 1000
  • Learning rate: 0.01
  • Accuracy: ~33.5%
  • AUC: 0.48
  • Performs better on Disease Type 2, weak on others.

2. Decision Tree Classifier

  • Max Depth: 5
  • Accuracy: ~35% (best among models)
  • AUC: 0.50
  • Serves as the baseline model but underfits the data.

3. Logistic Regression

  • Max Iterations: 1000
  • Accuracy: ~31% (lowest)
  • AUC: ≈0.50
  • Struggles with multi-class predictions.
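The three configurations listed above could be set up in scikit-learn roughly as follows. This is a sketch under assumptions: the README's "Learning rate: 0.01" is mapped to `learning_rate_init`, the `random_state` values are added here for repeatability, and the training data is random placeholder input rather than the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder training data: 300 rows, 8 features, three balanced classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 8))
y_train = np.repeat([0, 1, 2], 100)

models = {
    # Two hidden layers of 50 units each.
    "Neural Network": MLPClassifier(
        hidden_layer_sizes=(50, 50),
        max_iter=1000,
        learning_rate_init=0.01,  # assumed mapping of "Learning rate: 0.01"
        random_state=42,          # assumption, not stated in the README
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit every model on the same training data.
for name, model in models.items():
    model.fit(X_train, y_train)
```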

Results & Comparison

Model                  Accuracy   AUC
Decision Tree          ~35%       0.50
Neural Network         ~33.5%     0.48
Logistic Regression    ~31%       0.50
  • Best Model: Decision Tree (but still weak).
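  • Observation: All models performed close to random guessing. With three roughly balanced classes, chance-level accuracy is about 33% and chance-level AUC is 0.50.
  • Likely reasons: weak correlations between the features and the target, no hyperparameter tuning, and possibly complex non-linear patterns that these models did not capture.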
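To see what "random guessing" looks like in these metrics, the sketch below trains a depth-5 decision tree on pure-noise features with three balanced classes and scores it. The one-vs-rest macro-averaged AUC is an assumption here; the README does not state how the notebook aggregates AUC across classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noise features that carry no information about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1800, 8))
y = np.repeat([0, 1, 2], 600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# Accuracy on held-out data.
accuracy = accuracy_score(y_test, clf.predict(X_test))

# Multi-class AUC, one-vs-rest with macro averaging (assumed aggregation).
auc = roc_auc_score(
    y_test, clf.predict_proba(X_test), multi_class="ovr", average="macro"
)

# Expected accuracy of a uniform random guess over the classes.
chance_accuracy = 1 / len(np.unique(y))

print(f"accuracy={accuracy:.3f}  auc={auc:.3f}  chance={chance_accuracy:.3f}")
```

On uninformative features, both scores land near the chance values, which is the same range as the results reported in the table.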

Visualizations

  • Correlation heatmap
  • Distribution of disease types
  • Feature distributions (BMI, BP, Cholesterol, Heart Rate)
  • Boxplots by disease type
  • KMeans cluster visualization (2D PCA)
  • ROC Curves & Confusion Matrices for all models
  • Bar chart comparing model accuracies
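The data behind the elbow plot and the 2D cluster visualization can be computed along these lines. This is a sketch on placeholder input: the range of k, the chosen `k_chosen = 3`, and `n_init=10` are illustrative values, not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Placeholder for the five standardized numeric features.
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(300, 5))

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

# Hypothetical choice of k after inspecting the elbow curve.
k_chosen = 3
labels = KMeans(n_clusters=k_chosen, n_init=10, random_state=42).fit_predict(X_scaled)

# Project to two principal components for plotting, colored by cluster label.
coords = PCA(n_components=2).fit_transform(X_scaled)
```

Plotting `inertias` against k gives the elbow curve, and a scatter plot of `coords[:, 0]` against `coords[:, 1]` colored by `labels` gives the cluster view.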

Conclusion

  • All models achieved low accuracy (31–35%), showing difficulty in learning disease patterns.
  • Decision Tree performed best, but still insufficient.
  • Models require:
    • Hyperparameter tuning
    • Better feature engineering
    • More advanced algorithms (Random Forest, Gradient Boosting, XGBoost)
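As one possible starting point for the first and third items, the sketch below runs a small cross-validated grid search over a Random Forest. The grid values, the 3-fold setting, and the placeholder data are all illustrative assumptions; nothing here reflects results from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder training data: 300 rows, 8 features, three balanced classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 8))
y_train = np.repeat([0, 1, 2], 100)

# Hypothetical search space; real ranges should come from experimentation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```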

How to Run

  1. Clone this repository:

    git clone https://github.com/fah-ayon/Disease-Classification-using-Machine-Learning.git
    cd Disease-Classification-using-Machine-Learning
  2. Install dependencies:

    pip install -r requirements.txt
  3. Place disease_classification_dataset.csv in the project folder.

  4. Run the notebook:

    jupyter notebook GroupNo-6_21301282_21201240.ipynb

Tech Stack

  • Python 3
  • NumPy
  • Pandas
  • Matplotlib, Seaborn
  • Scikit-learn

Future Work: Explore Random Forest, Gradient Boosting, and Neural Networks with deeper tuning for improved classification accuracy.

Author

Abdullah Al Fahad
LinkedIn
