#Diabetes Prediction using Binary Classification (Machine Learning)
This project focuses on building a machine learning classification model to predict diabetes based on patient’s medical attributes. The objective is to analyze key factors involved to diagnose diabetes in an individual.
Diabetes is a major healthcare concern worldwide. Early prediction enables preventive care, timely diagnosis, and better treatment outcomes. This project demonstrates how data-driven models can support clinical decision-making.
The dataset contains diagnostic health measurements from a publicly available dataset.
Source: Kaggle – Pima Indians Diabetes Dataset
Input Features:
- Pregnancies
- Glucose
- Blood Pressure
- Skin Thickness
- Insulin
- Body Mass Index (BMI)
- Diabetes Pedigree Function
- Age
Target Variable:
- Outcome (0 = Non - Diabetic, 1 = Diabetic)
[Dataset Features] (images./Dataset_features_diabetes.png)
This project notebook can be executed directly in Google Colab without any local setup. [Open in Google Colab] https://colab.research.google.com/drive/1r_wWsz3Q8Y12vikfxS2FUR4PPOtSfRTj?usp=sharing
- Data cleaning and visualization
- Feature engineering
- Model training and comparison
- Model evaluation using classification metrics
- Usage of Binary Classification as supervised learning model where the target variable has exactly two possible classes.
Python, Pandas, Scikit-learn, Seaborn, Matplotlib
- Logistic Regression
- Decision Tree
- Random Forest Classifier
- K-Nearest Neighbors (KNN)
- Naive Bayes Model
- Ensemble Methods (Gradient Boosting and Ada Boost Model)
Evaluation was performed using the following metrics:
- Precision
- Accuracy
- Recall
- F1-Score
- ROC-AUC Score
-
Achieved accuracy of 76.62% on test data.
-
Logistic Regression achieved a ROC AUC of 0.73.
[ROC Curve] (images./ROC_curve_diabetes_dataset.png)
-
Random Forest Classifier demonstrated superior predictive performance compared to other models.
[Performance Scores]:
{Before Preprocessing} (images./Performance_scores_before_preprocessing.png)
{After Preprocessing} (images./Performance_scores_after_preprocessing.png)
diabetes-prediction/ ├── data/ │ └── diabetes.csv # Dataset ├── notebooks/ │ └── diabetes_prediction.ipynb # Exploratory analysis & modeling ├── src/ ├── preprocessing.py # Data cleaning, scaling, encoding ├── train.py # Model training logic └── evaluate.py # Metrics, ROC AUC, plots ├── requirements.txt # Dependencies └── README.md # Project documentation
- Clone the repository:
git clone https://github.com/your-username/diabetes-prediction.git cd diabetes-prediction - Create and activate a virtual environment to isolate project dependencies. python -m venv venv source venv/bin/activate #macOS/Linux venv\Scripts\activate # Windows
- Install dependencies: pip install -r requirements.txt
- Run the notebook: jupyter notebook
- (Recommended) Run on Google Colab by uploading the notebook/clicking on ‘Open on Colab’ button.
-
Gained insights into the importance of data scaling, as model performance varied significantly before and after preprocessing.
-
Utilized correlation matrices to examine relationships among features.
[Correlation Matrix] (images./Heatmap_diabetes_dataset.png)
[Correlation Matrix] (images./Pairplot_diabetes_dataset.png)
-
Acquired practical experience processing and analyzing real-world healthcare data.
This project is developed for educational purposes using a publicly available dataset and is not intended for clinical or medical diagnosis.