kunaljoshi2/Tumor-Classifier

Breast Cancer Classification — CMPT 310 Group 14

Binary classification of breast tumors (Benign / Malignant) using the UCI Breast Cancer Wisconsin dataset (569 samples, 30 features).

Group Members: Mokham Birdi, Eduardo Federmann Saito, Anton Florendo, Kunal Joshi

Instructor: Mohammad Soltanshah


Project Overview

This project builds and compares multiple machine learning classifiers to diagnose breast cancer from digitized FNA (Fine Needle Aspiration) cell nuclei images. The goal is to correctly identify malignant tumors while minimizing false negatives — a missed malignant diagnosis carries significant clinical risk.

Three model families are implemented and compared:

  • Logistic Regression — sklearn baseline + custom implementation from scratch (gradient descent + L2)
  • K-Nearest Neighbors — sklearn with hyperparameter tuning
  • Decision Tree — sklearn with dummy baseline for reference

Dataset

  • Source: UCI Breast Cancer Wisconsin (Diagnostic) dataset
  • Size: 569 instances, 30 numerical features
  • Features: Mean, standard error, and worst values of 10 cell nucleus measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension)
  • Target: M (Malignant) → 1, B (Benign) → 0
  • Class split: ~37% Malignant, ~63% Benign

Setup

Requirements: Python 3.14, pip

# Activate virtual environment
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Dependencies: numpy, pandas, scikit-learn, matplotlib, seaborn, jupyter


How to Run

All scripts must be run from the project root directory.

Individual Model Scripts

python models/DecisionTrees/decisionTrees.py
python models/KNN/KNN.py
python models/LogisticRegression/logisticRegression.py

Each script prints classification metrics (accuracy, precision, recall, F1) and displays a confusion matrix.

Jupyter Notebooks

Exploratory Data Analysis:

jupyter notebook models/eda.ipynb

Covers class balance, feature distributions, and correlation analysis.

Logistic Regression Full Comparison:

jupyter notebook models/lr_comparison.ipynb

Renders the complete comparison table (sklearn vs. scratch model), training loss curves, and confusion matrices at both default and tuned thresholds.


Architecture

Data Pipeline (models/feature_engineering.py)

models/feature_engineering.py is a shared preprocessing module whose pipeline runs at import time. Every model script imports it, which triggers the following steps:

  1. Loads models/data.csv into a pandas DataFrame
  2. Drops unused columns (id, Unnamed: 32), removes duplicates and NaN rows
  3. Encodes target: M → 1, B → 0
  4. Performs an 80/20 stratified train/test split (random_state=42)
  5. Applies Z-score normalization via StandardScaler — fit on train set only to prevent data leakage

Exposed variables: X_train_scaled, X_test_scaled, y_train, y_test
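The steps above can be sketched as follows. This is a minimal illustration, not the contents of feature_engineering.py: it builds a small synthetic DataFrame instead of reading models/data.csv so that it runs on its own, and the column names diagnosis, radius_mean, and texture_mean are assumptions about the CSV layout.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for models/data.csv (assumed column names, fake values).
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame({
    "id": np.arange(n_rows),
    "diagnosis": rng.choice(["M", "B"], size=n_rows, p=[0.37, 0.63]),
    "radius_mean": rng.normal(14.0, 3.5, size=n_rows),
    "texture_mean": rng.normal(19.0, 4.0, size=n_rows),
    "Unnamed: 32": np.nan,  # all-NaN column standing in for the unused one
})

# Step 2: drop unused columns first, then duplicates and rows with NaN.
# Dropping the all-NaN column before dropna() keeps the real rows intact.
df = df.drop(columns=["id", "Unnamed: 32"]).drop_duplicates().dropna()

# Step 3: encode the target, M -> 1 and B -> 0.
y = df["diagnosis"].map({"M": 1, "B": 0})
X = df.drop(columns=["diagnosis"])

# Step 4: 80/20 split that preserves the class ratio in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 5: fit the scaler on the training set only, then reuse its
# mean and standard deviation for the test set to avoid leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Because the scaler is fit on the training rows only, the scaled test set will generally not have exactly zero mean or unit variance. That is expected.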

Helper functions:

  • getData() — returns the four split/scaled arrays
  • downloadData() — exports split datasets as CSV files
  • showPlots() — displays class balance chart, feature histograms, and scatter plots

Models

| Model | File | Highlights |
| --- | --- | --- |
| Decision Tree | models/DecisionTrees/decisionTrees.py | DummyClassifier worst-case baseline (~63%) + tuned DecisionTreeClassifier (~93%) |
| KNN | models/KNN/KNN.py | KNeighborsClassifier with k=5; ~95.6% accuracy |
| Logistic Regression | models/LogisticRegression/logisticRegression.py | sklearn baseline + LogisticRegressionScratch with gradient descent, L2 regularization, and threshold tuning |
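The ~63% baseline figure is consistent with a predictor that always outputs the majority class (Benign). The short check below illustrates this on the full dataset. It assumes the standard UCI class counts of 357 benign and 212 malignant, which this README does not state explicitly, and it does not claim to reproduce the strategy the DummyClassifier in decisionTrees.py actually uses.

```python
# Assumed class counts for the full dataset (they sum to the 569 samples).
n_benign, n_malignant = 357, 212

labels = [0] * n_benign + [1] * n_malignant
predictions = [0] * len(labels)  # always predict Benign

# Accuracy of the always-Benign predictor equals the Benign share.
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall on the malignant class: true positives / actual positives.
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
malignant_recall = true_positives / n_malignant

print(f"accuracy={accuracy:.3f}, malignant recall={malignant_recall:.3f}")
```

Such a predictor scores reasonably on accuracy while missing every malignant case, which is why recall rather than accuracy is treated as the primary metric.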

Logistic Regression Package (models/LogisticRegression/)

The most in-depth model, implemented across three files:

| File | Role |
| --- | --- |
| lr_model.py | LogisticRegressionScratch class: sigmoid activation, gradient descent training loop with optional L2 regularization (lambda), configurable predict(threshold), tracks loss_history per fit |
| tuning.py | tune_hyperparameters(): 5-fold StratifiedKFold grid search over 20 combos (4 learning rates × 5 lambda values), optimizing recall. find_best_threshold(): sweeps thresholds 0.05–0.95, selects by F2-score to prioritize recall |
| logisticRegression.py | Orchestration script: trains sklearn LogisticRegression, then runs hyperparameter tuning and trains the scratch model at both default (0.5) and optimized decision thresholds |
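To make the lr_model.py description concrete, here is a minimal sketch of a from-scratch logistic regression with the same ingredients: sigmoid, batch gradient descent, optional L2 penalty, a threshold argument on predict, and a loss history. The class name LogisticRegressionSketch, its parameter names, and the penalty scaling (lam / (2n), bias left unregularized) are assumptions made for illustration and may differ from the repository's LogisticRegressionScratch.

```python
import numpy as np


class LogisticRegressionSketch:
    """Illustrative stand-in; not the repository's LogisticRegressionScratch."""

    def __init__(self, learning_rate=0.1, n_iterations=1000, lam=0.0):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.lam = lam  # L2 strength; 0.0 disables regularization
        self.weights = None
        self.bias = 0.0
        self.loss_history = []

    @staticmethod
    def _sigmoid(z):
        # Clip logits so np.exp cannot overflow on extreme values.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss_history = []  # reset on every fit
        eps = 1e-12  # guards against log(0)
        for _ in range(self.n_iterations):
            p = self._sigmoid(X @ self.weights + self.bias)
            # Binary cross-entropy plus the L2 penalty on the weights.
            loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            loss += self.lam / (2 * n_samples) * np.sum(self.weights ** 2)
            self.loss_history.append(float(loss))
            # Gradients of the loss above with respect to weights and bias.
            error = p - y
            grad_w = (X.T @ error + self.lam * self.weights) / n_samples
            grad_b = error.mean()
            self.weights -= self.learning_rate * grad_w
            self.bias -= self.learning_rate * grad_b
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.weights + self.bias)

    def predict(self, X, threshold=0.5):
        # Lower thresholds flag more samples as positive (malignant).
        return (self.predict_proba(X) >= threshold).astype(int)


# Toy usage on two synthetic Gaussian blobs (not the tumor data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = LogisticRegressionSketch(learning_rate=0.1, n_iterations=500, lam=0.1).fit(X, y)
```

Plotting model.loss_history against the iteration index gives the kind of training loss curve the comparison notebook is described as rendering.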

Note: tuning.py uses bare imports that only resolve correctly when invoked through logisticRegression.py, which adds the package directory to sys.path. Do not run tuning.py or lr_model.py directly.
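The grid search described for tuning.py (5-fold StratifiedKFold, 4 learning rates × 5 lambda values, recall as the objective) might look roughly like the sketch below. The function names fit_logistic and tune, the grid values, the shuffle and random_state settings, and the toy data are all assumptions; the README does not list the actual grid.

```python
from itertools import product

import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold


def fit_logistic(X, y, learning_rate, lam, n_iterations=300):
    """Batch gradient descent for L2-regularized logistic regression."""
    weights, bias = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iterations):
        p = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))
        error = p - y
        weights -= learning_rate * (X.T @ error + lam * weights) / len(y)
        bias -= learning_rate * error.mean()
    return weights, bias


def tune(X, y, learning_rates, lambdas, n_splits=5):
    """Score every (learning_rate, lam) pair by mean validation recall."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    results = []
    for lr, lam in product(learning_rates, lambdas):
        recalls = []
        for train_idx, val_idx in folds.split(X, y):
            w, b = fit_logistic(X[train_idx], y[train_idx], lr, lam)
            # A logit >= 0 is equivalent to a probability >= 0.5.
            pred = (X[val_idx] @ w + b >= 0).astype(int)
            recalls.append(recall_score(y[val_idx], pred))
        results.append((float(np.mean(recalls)), lr, lam))
    best = max(results, key=lambda r: r[0])
    return best, results


# Toy data and a hypothetical 4 x 5 grid (20 combinations).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (60, 3)), rng.normal(1.5, 1.0, (40, 3))])
y = np.array([0] * 60 + [1] * 40)
best, results = tune(
    X, y,
    learning_rates=[0.001, 0.01, 0.1, 0.5],
    lambdas=[0.0, 0.01, 0.1, 1.0, 10.0],
)
```

Stratified folds keep the malignant share roughly constant in every validation split, so recall is measured on a comparable number of positive cases each time.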


Evaluation

| Metric | Why it matters |
| --- | --- |
| Recall (malignant) | Primary metric: a missed cancer diagnosis is far more dangerous than a false alarm |
| Precision | Tracks how often positive predictions are correct |
| F1-score | Harmonic mean of precision and recall |
| F2-score | Weighs recall twice as heavily as precision; used for threshold selection |
| Confusion Matrix | Full breakdown of TP, FP, TN, FN per model |

All models are evaluated on the same held-out 20% test set produced by the shared data pipeline.
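For reference, the F-beta score can be written in terms of confusion-matrix counts as F_β = (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP), with β = 2 for the F2-score. The sketch below computes it and sweeps decision thresholds from 0.05 to 0.95 to pick the one with the highest F2. The function names, the 0.05 step size, and the toy probabilities are assumptions; the repository's find_best_threshold() may be implemented differently.

```python
import numpy as np


def fbeta(y_true, y_pred, beta=2.0):
    """F-beta from confusion counts; returns 0.0 when undefined."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return 0.0 if denom == 0 else float((1 + beta ** 2) * tp / denom)


def best_threshold(y_true, proba, beta=2.0):
    """Sweep thresholds 0.05..0.95 (assumed step 0.05) and keep the best."""
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [fbeta(y_true, (proba >= t).astype(int), beta) for t in thresholds]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return float(thresholds[best]), scores[best]


# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba = np.array([0.12, 0.22, 0.32, 0.47, 0.42, 0.62, 0.81, 0.57, 0.93, 0.07])

f2_at_default = fbeta(y_true, (proba >= 0.5).astype(int))
threshold, f2_at_best = best_threshold(y_true, proba)
print(f"F2 at 0.5: {f2_at_default:.3f}, best threshold: {threshold:.2f}, F2: {f2_at_best:.3f}")
```

In this toy case the selected threshold falls below 0.5: accepting a couple of extra false positives recovers the one malignant sample that the default threshold misses, and F2 rewards that trade.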
