Breast Cancer Classification — CMPT 310 Group 14

Binary classification of breast tumors (Benign / Malignant) using the UCI Breast Cancer Wisconsin dataset (569 samples, 30 features).

Group Members: Mokham Birdi, Eduardo Federmann Saito, Anton Florendo, Kunal Joshi

Instructor: Mohammad Soltanshah

Project Overview

This project builds and compares multiple machine learning classifiers to diagnose breast cancer from features computed on digitized images of fine needle aspirate (FNA) cell nuclei. The goal is to correctly identify malignant tumors while minimizing false negatives, since a missed malignant diagnosis carries significant clinical risk.

Three model families are implemented and compared:

Logistic Regression — sklearn baseline + custom implementation from scratch (gradient descent + L2)
K-Nearest Neighbors — sklearn with hyperparameter tuning
Decision Tree — sklearn with dummy baseline for reference

Dataset

Source: UCI Breast Cancer Wisconsin (Diagnostic) dataset
Size: 569 instances, 30 numerical features
Features: Mean, standard error, and worst values of 10 cell nucleus measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension)
Target: M (Malignant) → 1, B (Benign) → 0
Class split: ~37% Malignant, ~63% Benign
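
The class split also sets the accuracy floor any model has to beat. The short sketch below uses the published class counts for this dataset (212 malignant, 357 benign) to reproduce the percentages above and to show what a classifier that always predicts the majority class would score. The variable names are illustrative and do not come from the project code.

```python
# Class counts for the Wisconsin Diagnostic dataset (212 malignant, 357 benign).
counts = {"M": 212, "B": 357}
total = sum(counts.values())

# Share of each class, matching the ~37% / ~63% split quoted above.
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class reaches this accuracy
# while detecting no malignant tumors at all (recall for class M is 0).
majority_label = max(counts, key=counts.get)
majority_accuracy = counts[majority_label] / total

print(f"total={total}, M={shares['M']:.1%}, B={shares['B']:.1%}")
print(f"always-{majority_label} accuracy: {majority_accuracy:.1%}")
```

This is why accuracy alone is a weak yardstick here: a model can look reasonable on accuracy and still miss every malignant case.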

Setup

Requirements: Python 3.14, pip

# Create (first time only) and activate the virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Dependencies: numpy, pandas, scikit-learn, matplotlib, seaborn, jupyter

How to Run

All scripts must be run from the project root directory.

Individual Model Scripts

python models/DecisionTrees/decisionTrees.py
python models/KNN/KNN.py
python models/LogisticRegression/logisticRegression.py

Each script prints classification metrics (accuracy, precision, recall, F1) and displays a confusion matrix.
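
As a rough illustration of how the four printed metrics relate to the confusion matrix, the sketch below derives them from raw counts in plain Python. The function names and toy labels are invented for this example; the project scripts presumably rely on scikit-learn's metric functions instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = malignant."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Fall back to 0.0 when a denominator is empty (no positives predicted/present).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only (not project results).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```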

Jupyter Notebooks

Exploratory Data Analysis:

jupyter notebook models/eda.ipynb

Covers class balance, feature distributions, and correlation analysis.

Logistic Regression Full Comparison:

jupyter notebook models/lr_comparison.ipynb

Renders the complete comparison table (sklearn vs. scratch model), training loss curves, and confusion matrices at both default and tuned thresholds.

Architecture

Data Pipeline (`models/feature_engineering.py`)

A shared preprocessing module that runs at import time. Every model script imports it, which triggers the full pipeline:

Loads models/data.csv into a pandas DataFrame
Drops unused columns (id, Unnamed: 32), removes duplicates and NaN rows
Encodes target: M → 1, B → 0
Performs an 80/20 stratified train/test split (random_state=42)
Applies Z-score normalization via StandardScaler — fit on train set only to prevent data leakage

Exposed variables: X_train_scaled, X_test_scaled, y_train, y_test
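
The leakage point in the last pipeline step can be sketched without any dependencies: the mean and standard deviation come from the training rows only, and the same statistics are then applied to the test rows. This is a simplified stand-in for `StandardScaler`, with hypothetical helper names (`fit_scaler`, `transform`) and made-up numbers.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute per-column mean and population std from the training rows only."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Use 1.0 for constant columns so the division below is always defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply z-score scaling with statistics computed elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up feature rows (two features), for illustration only.
X_train = [[10.0, 200.0], [14.0, 400.0], [18.0, 600.0]]
X_test = [[12.0, 500.0]]

means, stds = fit_scaler(X_train)              # statistics from train only
X_train_scaled = transform(X_train, means, stds)
X_test_scaled = transform(X_test, means, stds)  # test reuses train statistics
print(X_test_scaled)
```

Fitting the scaler on the full dataset instead would let test-set statistics influence the training features, which tends to make held-out scores look better than they should.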

Helper functions:

getData() — returns the four split/scaled arrays
downloadData() — exports split datasets as CSV files
showPlots() — displays class balance chart, feature histograms, and scatter plots

Models

| Model | File | Highlights |
| --- | --- | --- |
| Decision Tree | `models/DecisionTrees/decisionTrees.py` | `DummyClassifier` worst-case baseline (~63% accuracy) + tuned `DecisionTreeClassifier` (~93% accuracy) |
| KNN | `models/KNN/KNN.py` | `KNeighborsClassifier` with `k=5`; ~95.6% accuracy |
| Logistic Regression | `models/LogisticRegression/logisticRegression.py` | sklearn baseline + `LogisticRegressionScratch` with gradient descent, L2 regularization, and threshold tuning |

Logistic Regression Package (`models/LogisticRegression/`)

The most in-depth model, implemented across three files:

| File | Role |
| --- | --- |
| `lr_model.py` | `LogisticRegressionScratch` class: sigmoid activation, gradient descent training loop with optional L2 regularization (`lambda`), configurable `predict(threshold)`, tracks `loss_history` per fit |
| `tuning.py` | `tune_hyperparameters()`: 5-fold StratifiedKFold grid search over 20 combos (4 learning rates × 5 lambda values), optimizing recall. `find_best_threshold()`: sweeps thresholds 0.05–0.95, selects by F2-score to prioritize recall |
| `logisticRegression.py` | Orchestration script: trains sklearn `LogisticRegression`, then runs hyperparameter tuning and trains the scratch model at both default (0.5) and optimized decision thresholds |
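
To make the `lr_model.py` description more concrete, here is a minimal dependency-free sketch of batch gradient descent on the L2-regularized log loss. It is not the project's `LogisticRegressionScratch`: the class name `TinyLogisticRegression`, the defaults, and the toy data are assumptions, and the regularization strength is called `lam` because `lambda` is a reserved word in Python.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class TinyLogisticRegression:
    """Illustrative stand-in, not the project's LogisticRegressionScratch."""

    def __init__(self, learning_rate=0.1, lam=0.0, n_iters=500):
        self.learning_rate = learning_rate
        self.lam = lam            # L2 strength (hypothetical default)
        self.n_iters = n_iters
        self.weights, self.bias, self.loss_history = [], 0.0, []

    def predict_proba(self, X):
        return [sigmoid(sum(w * x for w, x in zip(self.weights, row)) + self.bias)
                for row in X]

    def _loss(self, probs, y):
        n, eps = len(y), 1e-12
        data = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                    for p, t in zip(probs, y)) / n
        penalty = self.lam / (2 * n) * sum(w * w for w in self.weights)
        return data + penalty

    def fit(self, X, y):
        n, d = len(X), len(X[0])
        self.weights, self.bias, self.loss_history = [0.0] * d, 0.0, []
        for _ in range(self.n_iters):
            probs = self.predict_proba(X)
            # Record the loss for the current parameters before updating them.
            self.loss_history.append(self._loss(probs, y))
            errors = [p - t for p, t in zip(probs, y)]
            # Gradient of the mean log loss plus the L2 term (bias is not penalized).
            grad_w = [sum(e * row[j] for e, row in zip(errors, X)) / n
                      + self.lam * self.weights[j] / n for j in range(d)]
            grad_b = sum(errors) / n
            self.weights = [w - self.learning_rate * g
                            for w, g in zip(self.weights, grad_w)]
            self.bias -= self.learning_rate * grad_b
        return self

    def predict(self, X, threshold=0.5):
        return [1 if p >= threshold else 0 for p in self.predict_proba(X)]


# Toy one-feature data, invented for illustration.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
model = TinyLogisticRegression(learning_rate=0.5, lam=0.1, n_iters=200).fit(X, y)
print(model.predict(X), model.loss_history[0], model.loss_history[-1])
```

Lowering `threshold` in `predict` trades precision for recall, which is the lever the tuned-threshold run in `logisticRegression.py` is described as using.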

Note: tuning.py uses bare imports that only resolve correctly when invoked through logisticRegression.py, which adds the package directory to sys.path. Do not run tuning.py or lr_model.py directly.
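
The search space that `tune_hyperparameters()` is described as covering can be sketched as follows. The actual learning rates and lambda values are not listed in this README, so the values below are placeholders chosen only to give a 4 × 5 grid; `stratified_folds` is likewise a simplified hypothetical stand-in for scikit-learn's `StratifiedKFold`.

```python
import random
from itertools import product

# Placeholder grid values (assumptions, not the project's actual settings).
learning_rates = [0.001, 0.01, 0.05, 0.1]
lambdas = [0.0, 0.001, 0.01, 0.1, 1.0]
grid = list(product(learning_rates, lambdas))  # 4 x 5 = 20 combinations


def stratified_folds(y, n_splits=5, seed=42):
    """Assign sample indices to folds so each fold keeps a similar class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        indices = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Toy labels with roughly the dataset's class ratio (37 malignant, 63 benign).
y = [1] * 37 + [0] * 63
folds = stratified_folds(y)
positives_per_fold = [sum(y[i] for i in fold) for fold in folds]
print(len(grid), [len(fold) for fold in folds], positives_per_fold)
```

A tuner built on this would, for each of the 20 combinations, train on four folds, measure recall on the held-out fold, average the five scores, and keep the combination with the highest mean recall.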

Evaluation

| Metric | Why it matters |
| --- | --- |
| Recall (malignant) | Primary metric: a missed cancer diagnosis is far more dangerous than a false alarm |
| Precision | Tracks how often positive predictions are correct |
| F1-score | Harmonic mean of precision and recall |
| F2-score | Weighs recall twice as heavily as precision; used for threshold selection |
| Confusion Matrix | Full breakdown of TP, FP, TN, FN per model |
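
The F2-based threshold selection can be illustrated with the general F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 2 gives the F2-score. The sketch below sweeps thresholds from 0.05 to 0.95 and keeps the one with the highest F2. The 0.05 step size, the function names, and the toy probabilities are assumptions for illustration, not details taken from `find_best_threshold()`.

```python
def fbeta(precision, recall, beta=2.0):
    """General F-beta score; beta > 1 favors recall over precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def precision_recall(y_true, probs, threshold):
    """Precision and recall for the positive class at a given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, probs, beta=2.0):
    """Sweep 0.05, 0.10, ..., 0.95 and return the threshold with the top F-beta."""
    candidates = [round(0.05 * k, 2) for k in range(1, 20)]
    # max() keeps the first (lowest) threshold when several tie.
    return max(candidates,
               key=lambda t: fbeta(*precision_recall(y_true, probs, t), beta))


# Toy labels and predicted probabilities, invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.45, 0.3, 0.4, 0.2, 0.1, 0.05]
chosen = best_threshold(y_true, probs)
print(chosen, precision_recall(y_true, probs, chosen))
```

On this toy data the sweep settles below 0.5, accepting one false alarm in exchange for catching every positive case, which is the trade-off the recall-first evaluation is meant to encourage.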

All models are evaluated on the same held-out 20% test set produced by the shared data pipeline.
