This project develops an Artificial Neural Network (ANN) to classify patients into three medical categories: Normal, Prediabetes, and Diabetes.
The goal is to demonstrate how deep learning can be applied to clinical tabular data to assist in early medical diagnosis. The model takes physiological metrics (cholesterol, glucose, BMI, etc.) as input and predicts the patient's diabetic status.
The dataset used in this project is sourced from Kaggle.
- Source: Diabetes Dataset (Kaggle)
- Description: Patient clinical records including glycosylated hemoglobin levels, BMI, and blood pressure.
- Target Variable: `glyhb_cat` (0: Normal, 1: Prediabetes, 2: Diabetes).
The notebook follows a rigorous data science pipeline to ensure model reliability:
- Exploratory Data Analysis (EDA): Detailed visualization of clinical variables using histograms to detect distribution patterns.
- Missing Value Management:
  - Mean Imputation: Applied to normally distributed features like blood pressure and cholesterol.
  - Median Imputation: Applied to skewed features to maintain robust statistical integrity.
- Advanced Preprocessing:
  - Anti-Leakage Design: The `glyhb` column is dropped from the features ($X$) because the target ($y$) is derived from it.
  - Robust Scaling: Utilization of `RobustScaler` to handle outliers effectively.
  - Label Encoding: Converting text labels into integers (0, 1, 2) for PyTorch compatibility.
- ANN Architecture:
  - Input Layer: 14 nodes.
  - Hidden Layers: Two dense layers (12 & 24 neurons) using ReLU activation.
  - Output Layer: 3 nodes representing class probabilities.
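The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, the toy column names `chol` and `bmi`, and the text label strings in `LABEL_MAP` are assumptions; only `glyhb`, `glyhb_cat`, and `RobustScaler` come from this README. An explicit mapping is used for the labels because scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, which would not reproduce the 0: Normal, 1: Prediabetes, 2: Diabetes coding.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Assumed text labels; the integer coding follows the README.
LABEL_MAP = {"Normal": 0, "Prediabetes": 1, "Diabetes": 2}


def preprocess(df, mean_cols, median_cols, target="glyhb_cat", leak_col="glyhb"):
    """Impute missing values, drop the leaking column, scale, and encode labels."""
    df = df.copy()
    # Mean imputation for roughly symmetric features.
    for col in mean_cols:
        df[col] = df[col].fillna(df[col].mean())
    # Median imputation for skewed features.
    for col in median_cols:
        df[col] = df[col].fillna(df[col].median())
    # Anti-leakage: the target is derived from glyhb, so it must not be a feature.
    X = df.drop(columns=[target, leak_col])
    # RobustScaler centers each column on its median and scales by its IQR.
    X_scaled = RobustScaler().fit_transform(X)
    # Map text labels to the integer classes expected by PyTorch.
    y = df[target].map(LABEL_MAP).to_numpy()
    return X_scaled, y


# Toy data for illustration only (not taken from the Kaggle dataset).
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 180.0, 240.0],
    "bmi": [22.0, 31.0, np.nan, 45.0],
    "glyhb": [4.5, 6.0, 7.2, 5.0],
    "glyhb_cat": ["Normal", "Prediabetes", "Diabetes", "Normal"],
})
X, y = preprocess(toy, mean_cols=["chol"], median_cols=["bmi"])
```

For brevity the sketch fits the scaler on the full table; a stricter variant would fit it on the training split only and reuse it to transform the test split.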
The model is trained for 500 epochs using the Adam Optimizer and Cross-Entropy Loss. The evaluation phase includes:
- Loss Curve Plotting: Tracking the convergence of the model.
- Accuracy Calculation: Performance metrics calculated over a 20% dedicated test set.
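A network with the layer sizes described above, together with the training and evaluation loop, might look like the sketch below. The class name `DiabetesANN`, the helper `train_and_evaluate`, the learning rate, full-batch updates, and the random-permutation split are all assumptions made for illustration; the README only specifies the layer sizes, ReLU, Adam, Cross-Entropy Loss, 500 epochs, and a 20% test set. Note that `nn.CrossEntropyLoss` expects raw logits, so the output layer has no softmax; class probabilities can be obtained afterwards with `torch.softmax`.

```python
import torch
from torch import nn


class DiabetesANN(nn.Module):
    """14 inputs -> 12 -> 24 -> 3 outputs, with ReLU between the dense layers."""

    def __init__(self, n_features=14, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 12), nn.ReLU(),
            nn.Linear(12, 24), nn.ReLU(),
            nn.Linear(24, n_classes),  # raw logits, one per class
        )

    def forward(self, x):
        return self.net(x)


def train_and_evaluate(X, y, epochs=500, lr=0.01, test_fraction=0.2, seed=0):
    """Train on 80% of the rows and report accuracy on the held-out 20%."""
    torch.manual_seed(seed)
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.long)
    # Assumed split: a random permutation, first 20% held out for testing.
    perm = torch.randperm(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    model = DiabetesANN(n_features=X.shape[1])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    losses = []  # one training-loss value per epoch, for the loss curve
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X[train_idx]), y[train_idx])
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    model.eval()
    with torch.no_grad():
        preds = model(X[test_idx]).argmax(dim=1)
        accuracy = (preds == y[test_idx]).float().mean().item()
    return model, losses, accuracy


# Synthetic data for illustration only; real inputs come from the preprocessing step.
torch.manual_seed(0)
X_demo = torch.randn(100, 14)
y_demo = torch.randint(0, 3, (100,))
model, losses, acc = train_and_evaluate(X_demo, y_demo, epochs=50)
```

The returned `losses` list can be passed to `matplotlib.pyplot.plot` to draw the loss curve, and `acc` is the fraction of correct predictions on the held-out rows.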
Clone this repository and install the required Python libraries:
# Clone the repository
git clone https://github.com/nicolausprima/NN-Classifcation.git
# Navigate to the project folder
cd NN-Classifcation
# Install dependencies
pip install torch pandas matplotlib scikit-learn notebook
To view the analysis and run the model training, launch the Jupyter Notebook environment:
# Launch Jupyter Notebook
jupyter notebook NNDiabetes.ipynb
Through this project, several key insights were gathered regarding medical data classification:
- Preprocessing Matters: Medical data often contains outliers; the use of `RobustScaler` was pivotal in stabilizing the model's learning process.
- Integrity in Features: With the `glyhb` column removed from the features, the model cannot rely on the direct indicator from which the target is derived, so its predictions must come from the other physiological metrics.
- Efficiency of ANN: A relatively simple 3-layer architecture is sufficient to capture the non-linear relationships in tabular clinical data when predicting patient health status.
Created by [Nicolaus Prima Dharma]