Medtabka/Business-Data-Analytics-Project

Predicting Customer Churn in Online Retail

Overview

This project builds a machine-learning pipeline to predict customer churn in an online retail setting, using the Online Retail II dataset from the UCI Machine Learning Repository.

The pipeline covers data cleaning, exploratory data analysis, RFM-based feature engineering with a leakage-free churn label, K-Means customer segmentation, model training and hyperparameter tuning (Logistic Regression, Random Forest, XGBoost), and SHAP-based explainability.
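As an illustration of what a leakage-free churn label can look like, here is a minimal sketch, not the notebook's actual code. It assumes the cleaned transactions use the Online Retail II column names (Customer ID, Invoice, InvoiceDate, Quantity, Price). The function name build_rfm_with_label, the cutoff date, and the 90-day horizon are hypothetical choices made for this example. The idea is that RFM features are computed only from purchases before the cutoff, while the label looks only at the window after it, so no information from the label period reaches the features.

```python
import pandas as pd


def build_rfm_with_label(tx: pd.DataFrame, cutoff: pd.Timestamp,
                         horizon_days: int = 90) -> pd.DataFrame:
    """Return one row per customer with RFM features and a churn label.

    Assumes `tx` is already cleaned (no missing customer IDs, no cancellations).
    The 90-day default horizon is an illustrative assumption.
    """
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)

    # Feature window: strictly before the cutoff.
    history = tx[tx["InvoiceDate"] < cutoff]
    # Label window: from the cutoff up to the end of the horizon.
    future = tx[(tx["InvoiceDate"] >= cutoff) & (tx["InvoiceDate"] < horizon_end)]

    history = history.assign(amount=history["Quantity"] * history["Price"])
    rfm = history.groupby("Customer ID").agg(
        last_purchase=("InvoiceDate", "max"),
        frequency=("Invoice", "nunique"),   # number of distinct invoices
        monetary=("amount", "sum"),         # total spend in the feature window
    )
    # Recency: days between the last pre-cutoff purchase and the cutoff.
    rfm["recency_days"] = (cutoff - rfm["last_purchase"]).dt.days
    rfm = rfm.drop(columns="last_purchase")

    # A customer is labelled churned if they made no purchase in the label window.
    active_later = future["Customer ID"].unique()
    rfm["churned"] = (~rfm.index.isin(active_later)).astype(int)
    return rfm
```

Customers whose first purchase falls after the cutoff have no feature history and are therefore left out of the result.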

Repository structure

.
├── churn_analysis.ipynb          # Full analysis notebook
├── data/
│   ├── raw/                      # Raw Excel dataset
│   └── processed/                # Cleaned dataframes (pickle)
├── models/
│   └── artifacts/                # Trained models, scalers (joblib)
├── reports/
│   ├── figures/                  # Generated plots (PNG)
│   └── tables/                   # Summary statistics (CSV)
├── Predicting Customer Churn...  # Final report (PDF + DOCX)
├── requirements.txt              # Python dependencies
└── README.md

How to run

  1. Install dependencies:
    pip install -r requirements.txt
  2. Check that online_retail_II.xlsx is in data/raw/ (it is already included in the repository).
  3. Open and run churn_analysis.ipynb from top to bottom.

All figures, tables, and model artifacts are saved automatically to reports/figures/, reports/tables/, and models/artifacts/.

Key results

Model                 Test AUC
Logistic Regression   0.810
Random Forest         0.818
XGBoost               0.822

XGBoost was selected as the final model. Feature importance via SHAP shows that recency, frequency, and recent purchase momentum are the strongest churn predictors.
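For readers unfamiliar with the metric, the sketch below shows what a test AUC measures: the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner, with ties counted as half. It is a plain-Python illustration with a hypothetical function name (roc_auc) and toy inputs. It is not the notebook's code, which presumably relies on a library routine, and the toy numbers are not project results.

```python
def roc_auc(labels, scores):
    """Pairwise (rank-based) ROC AUC for binary labels (1 = churned, 0 = retained)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # churner ranked above non-churner
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 churner/non-churner pairs are ordered correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the three models in the table differ by about one percentage point of correctly ordered pairs.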

Requirements

  • Python 3.10+
  • See requirements.txt for package versions
