Binary classification of breast tumors (Benign / Malignant) using the UCI Breast Cancer Wisconsin dataset (569 samples, 30 features).
Group Members: Mokham Birdi, Eduardo Federmann Saito, Anton Florendo, Kunal Joshi
Instructor: Mohammad Soltanshah
This project builds and compares multiple machine learning classifiers to diagnose breast cancer from digitized FNA (Fine Needle Aspiration) cell nuclei images. The goal is to correctly identify malignant tumors while minimizing false negatives — a missed malignant diagnosis carries significant clinical risk.
Three model families are implemented and compared:
- Logistic Regression — sklearn baseline + custom implementation from scratch (gradient descent + L2)
- K-Nearest Neighbors — sklearn with hyperparameter tuning
- Decision Tree — sklearn with dummy baseline for reference
- Source: UCI Breast Cancer Wisconsin (Diagnostic) dataset
- Size: 569 instances, 30 numerical features
- Features: Mean, standard error, and worst values of 10 cell nucleus measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension)
- Target: M (Malignant) → 1, B (Benign) → 0
- Class split: ~37% Malignant, ~63% Benign
Requirements: Python 3.14, pip

    # Activate virtual environment
    source .venv/bin/activate
    # Install dependencies
    pip install -r requirements.txt

Dependencies: numpy, pandas, scikit-learn, matplotlib, seaborn, jupyter
All scripts must be run from the project root directory.

    python models/DecisionTrees/decisionTrees.py
    python models/KNN/KNN.py
    python models/LogisticRegression/logisticRegression.py

Each script prints classification metrics (accuracy, precision, recall, F1) and displays a confusion matrix.
Exploratory Data Analysis:

    jupyter notebook models/eda.ipynb

Covers class balance, feature distributions, and correlation analysis.
Logistic Regression Full Comparison:

    jupyter notebook models/lr_comparison.ipynb

Renders the complete comparison table (sklearn vs. scratch model), training loss curves, and confusion matrices at both default and tuned thresholds.
A shared preprocessing module that executes automatically on import. Every model script imports this module, which triggers the full pipeline:
- Loads `models/data.csv` into a pandas DataFrame
- Drops unused columns (`id`, `Unnamed: 32`), removes duplicates and NaN rows
- Encodes target: `M → 1`, `B → 0`
- Performs an 80/20 stratified train/test split (`random_state=42`)
- Applies Z-score normalization via `StandardScaler`, fit on the train set only to prevent data leakage
Exposed variables: X_train_scaled, X_test_scaled, y_train, y_test
Helper functions:
- `getData()`: returns the four split/scaled arrays
- `downloadData()`: exports split datasets as CSV files
- `showPlots()`: displays class balance chart, feature histograms, and scatter plots
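
The split-then-scale order described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it uses synthetic features in place of `models/data.csv`, and the class counts (212 malignant, 357 benign) are an assumption chosen to match the ~37%/63% split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix: 569 rows, 30 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(569, 30))
y = np.array([1] * 212 + [0] * 357)  # assumed counts; 1 = malignant, 0 = benign

# 80/20 split that preserves the class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then reuse its mean and
# standard deviation for the test rows so no test statistics leak in.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence the training features, which is the leakage the pipeline is designed to avoid.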
| Model | File | Highlights |
|---|---|---|
| Decision Tree | `models/DecisionTrees/decisionTrees.py` | `DummyClassifier` worst-case baseline (~63%) + tuned `DecisionTreeClassifier` (~93%) |
| KNN | `models/KNN/KNN.py` | `KNeighborsClassifier` with k=5; ~95.6% accuracy |
| Logistic Regression | `models/LogisticRegression/logisticRegression.py` | sklearn baseline + `LogisticRegressionScratch` with gradient descent, L2 regularization, and threshold tuning |
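
The ~63% dummy baseline matches the benign share of the dataset, which is what a classifier that always predicts the majority class would score. A minimal sketch, assuming `strategy="most_frequent"` (the strategy used by the script is not stated here) and made-up labels with a 63/37 split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with the dataset's approximate class ratio.
y = np.array([0] * 63 + [1] * 37)  # 0 = benign, 1 = malignant
X = np.zeros((len(y), 1))          # DummyClassifier ignores the features

baseline = DummyClassifier(strategy="most_frequent")  # assumed strategy
baseline.fit(X, y)
preds = baseline.predict(X)

baseline_accuracy = baseline.score(X, y)
# Recall on the malignant class: true positives / all actual positives.
baseline_recall = np.sum((preds == 1) & (y == 1)) / np.sum(y == 1)
```

Under these assumptions the baseline reaches 63% accuracy while detecting no malignant cases at all, which is why accuracy alone is a poor yardstick for this task.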
Logistic Regression is the most in-depth model, implemented across three files:
| File | Role |
|---|---|
| `lr_model.py` | `LogisticRegressionScratch` class: sigmoid activation, gradient descent training loop with optional L2 regularization (`lambda`), configurable `predict(threshold)`, tracks `loss_history` per fit |
| `tuning.py` | `tune_hyperparameters()`: 5-fold `StratifiedKFold` grid search over 20 combos (4 learning rates × 5 lambda values), optimizing recall. `find_best_threshold()`: sweeps thresholds 0.05–0.95, selects by F2-score to prioritize recall |
| `logisticRegression.py` | Orchestration script: trains sklearn `LogisticRegression`, then runs hyperparameter tuning and trains the scratch model at both default (0.5) and optimized decision thresholds |
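
To illustrate the ideas behind the scratch model, here is a minimal sketch of logistic regression trained by batch gradient descent with an L2 penalty. The class name `LogisticRegressionSketch`, its parameter names, and the toy data are assumptions for illustration; this is not the code in `lr_model.py`, whose exact loss scaling and defaults may differ.

```python
import numpy as np


class LogisticRegressionSketch:
    """Illustrative stand-in; not the project's LogisticRegressionScratch."""

    def __init__(self, learning_rate=0.1, n_iters=500, lam=0.0):
        self.learning_rate = learning_rate
        self.n_iters = n_iters
        self.lam = lam  # L2 strength; 0 disables regularization

    @staticmethod
    def _sigmoid(z):
        # Clip to avoid overflow in exp for large |z|.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.loss_history = []  # reset on every fit
        eps = 1e-12
        for _ in range(self.n_iters):
            p = self._sigmoid(X @ self.w + self.b)
            # Cross-entropy plus L2 penalty on the weights (bias excluded).
            loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            loss += self.lam / (2 * n_samples) * np.dot(self.w, self.w)
            self.loss_history.append(float(loss))
            # Gradients of the loss above with respect to w and b.
            error = p - y
            grad_w = (X.T @ error + self.lam * self.w) / n_samples
            grad_b = error.mean()
            self.w -= self.learning_rate * grad_w
            self.b -= self.learning_rate * grad_b
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)


# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegressionSketch(learning_rate=0.1, n_iters=500, lam=0.1).fit(X, y)
accuracy = float(np.mean(model.predict(X) == y))
```

Lowering `threshold` in `predict` can only add positive predictions, never remove them, which is the lever threshold tuning uses to trade precision for recall.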
Note: `tuning.py` uses bare imports that only resolve correctly when invoked through `logisticRegression.py`, which adds the package directory to `sys.path`. Do not run `tuning.py` or `lr_model.py` directly.
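
A threshold sweep scored by F2, as `find_best_threshold()` is described above, might look roughly like this. The function names, the 0.05 step size, the tie-breaking rule, and the example probabilities are all assumptions; only the 0.05–0.95 range and the F2 criterion come from the description.

```python
import numpy as np


def f_beta(y_true, y_pred, beta=2.0):
    """F-beta from hard predictions; beta=2 weights recall above precision."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return 0.0 if denom == 0 else float((1 + beta**2) * tp / denom)


def best_threshold_by_f2(y_true, proba, step=0.05):
    """Sweep thresholds 0.05..0.95 and return the one with the highest F2."""
    thresholds = np.round(np.arange(0.05, 0.951, step), 2)
    scores = [f_beta(y_true, (proba >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))  # ties resolve to the lowest threshold
    return float(thresholds[best]), scores[best]


# Made-up labels and predicted probabilities for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.12, 0.22, 0.37, 0.62, 0.31, 0.57, 0.72, 0.91])

threshold, score = best_threshold_by_f2(y_true, proba)
```

On this toy input the selected threshold is 0.25, below the default 0.5: F2 accepts two false positives in exchange for catching the positive scored at 0.31.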
| Metric | Why it matters |
|---|---|
| Recall (malignant) | Primary metric — a missed cancer diagnosis is far more dangerous than a false alarm |
| Precision | Tracks how often positive predictions are correct |
| F1-score | Harmonic mean of precision and recall |
| F2-score | Weighs recall twice as heavily as precision; used for threshold selection |
| Confusion Matrix | Full breakdown of TP, FP, TN, FN per model |
All models are evaluated on the same held-out 20% test set produced by the shared data pipeline.
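
For reference, every metric in the table can be derived from the four confusion-matrix counts. The helper below is a hypothetical sketch (the project scripts may rely on scikit-learn's metric functions instead), and the example counts are invented, not project results.

```python
def metrics_from_counts(tp, fp, tn, fn, beta=2.0):
    """Compute accuracy, precision, recall, F1, and F-beta from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    def f_score(b):
        # Weighted harmonic mean; b > 1 favors recall, b < 1 favors precision.
        denom = b**2 * precision + recall
        return (1 + b**2) * precision * recall / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f_score(1.0),
        "f_beta": f_score(beta),
    }


# Invented counts: 36 true positives, 12 false positives,
# 48 true negatives, 4 false negatives.
example = metrics_from_counts(tp=36, fp=12, tn=48, fn=4)
```

With these counts precision is 0.75 and recall is 0.90; F1 comes out near 0.818 while F2 comes out near 0.865, closer to recall, which is why F2 suits threshold selection when false negatives are the costlier error.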