This project performs binary classification using the Pima Indians Diabetes Dataset. The aim is to predict whether a patient is likely to have diabetes based on various medical attributes such as glucose level, blood pressure, insulin level, BMI, etc.
The dataset comes from the YBI Foundation's GitHub dataset collection.
It contains the following features:
- pregnancies
- glucose
- diastolic
- triceps
- insulin
- bmi
- dpf (diabetes pedigree function)
- age
- diabetes (target label: 0 or 1)
Import Libraries
- pandas
- sklearn (for train/test split, logistic regression, metrics)
Data Loading & Exploration
- Loaded the dataset with `pd.read_csv()`
- Used `.head()`, `.info()`, and `.describe()` for initial analysis
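The loading and exploration step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the inline CSV is a hypothetical stand-in with invented values, and only the column names are taken from the feature list above. In practice `csv_source` would be the path or URL of the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file: three invented rows that
# reuse the column names from the feature list. Replace `csv_source`
# with the actual file path when running against the real dataset.
csv_source = io.StringIO(
    "pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes\n"
    "2,120,70,30,80,28.5,0.40,35,0\n"
    "5,165,82,33,150,34.1,0.75,52,1\n"
    "0,98,64,22,60,24.9,0.25,23,0\n"
)

df = pd.read_csv(csv_source)

print(df.head())      # first rows of the table
df.info()             # dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for the numeric columns
```

`df.info()` writes to standard output itself, which is why it is not wrapped in `print`.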
Preprocessing
- Defined features `X` and target `y`
- Split data into training and testing sets (80/20 split)
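A possible version of this preprocessing step is shown below. It runs on synthetic random data so that it is self-contained; the column names come from the feature list, while `random_state=42` and the 100-row size are assumptions made for illustration. The text states an 80/20 split, so `test_size=0.2` is used here. If the file has the usual 768 rows, the 231 test rows in the classification report would correspond to about 30%, so the notebook's actual `test_size` may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (random values), using the documented column names.
rng = np.random.default_rng(0)
feature_names = [
    "pregnancies", "glucose", "diastolic", "triceps",
    "insulin", "bmi", "dpf", "age",
]
df = pd.DataFrame(rng.random((100, len(feature_names))), columns=feature_names)
df["diabetes"] = rng.integers(0, 2, size=100)

# Features are every column except the target label.
X = df.drop(columns="diabetes")
y = df["diabetes"]

# 80/20 split as stated in the text; random_state is an assumed value
# that only makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```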
Model Training
- Used the `LogisticRegression` model from `sklearn`
- Trained the model on the training data
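As a rough sketch of the training step, the snippet below fits `LogisticRegression` on synthetic data generated with `make_classification` (eight features, to mirror the eight predictors). The synthetic data, `max_iter=1000`, and both `random_state` values are assumptions; the notebook's own hyperparameters are not stated in this README.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the diabetes table.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_iter is raised above the default as a precaution against
# convergence warnings on unscaled features (an assumption, not a
# setting taken from the notebook).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(model.coef_.shape)  # one coefficient per feature
```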
Prediction & Evaluation
- Made predictions on the test set
- Evaluated the model with `confusion_matrix`, `accuracy_score`, and `classification_report`
- Accuracy: ~76.6%
- Precision, recall, and F1-score for both classes are given in the report:
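| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.76 | 0.92 | 0.83 | 145 |
| 1 | 0.79 | 0.52 | 0.63 | 86 |
| accuracy | | | 0.77 | 231 |
| macro avg | 0.78 | 0.72 | 0.73 | 231 |
| weighted avg | 0.77 | 0.77 | 0.76 | 231 |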
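The evaluation step that produces a report like the one above might look like the following. It is a sketch on synthetic data, so its numbers will not match the figures reported here; the data generation and `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a fitted model, so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict on the held-out test set and compute the three metrics.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted
acc = accuracy_score(y_test, y_pred)    # fraction of correct predictions

print(cm)
print(f"accuracy: {acc:.3f}")
print(classification_report(y_test, y_pred))
```

Accuracy equals the sum of the confusion matrix diagonal divided by the total number of test samples, which is a quick consistency check between the two outputs.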
Conclusion
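The logistic regression model performs reasonably well overall, with about 77% accuracy on the test set. Its main weakness is recall for class 1 (0.52): it misses roughly half of the diabetic patients in the test set, while recall for class 0 is 0.92. The class imbalance (145 negative vs. 86 positive test samples) likely contributes to this gap.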
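The size of the imbalance can be checked with a little arithmetic on the report's own numbers. The snippet below uses only the support and recall values from the classification report; because the recall is rounded to two decimals, the count of missed cases is an approximation rather than an exact figure.

```python
# Values copied from the classification report above.
support = {0: 145, 1: 86}
recall_class_1 = 0.52  # rounded value, so results below are approximate

total = sum(support.values())
positive_share = support[1] / total

# Recall is TP / (TP + FN), so (1 - recall) * support estimates the misses.
approx_missed = round((1 - recall_class_1) * support[1])

print(f"test rows: {total}")
print(f"share of class 1: {positive_share:.1%}")
print(f"approx. class-1 cases missed: {approx_missed} of {support[1]}")
```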
Author
Bijaya Kumar Rout
bijayakumarrout2005@gmail.com
Note
This notebook is a part of the YBI Foundation Data Science Project on binary classification. You can open the notebook locally or on a Jupyter-supported platform for better rendering of plots and outputs.