This project builds a machine-learning pipeline to predict customer churn in an online retail setting, using the Online Retail II dataset from the UCI Machine Learning Repository.
The pipeline covers data cleaning, exploratory data analysis, RFM-based feature engineering with a leakage-free churn label, K-Means customer segmentation, model training and hyperparameter tuning (Logistic Regression, Random Forest, XGBoost), and SHAP-based explainability.
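As a rough illustration of what a leakage-free churn label can look like, the sketch below computes RFM features only from transactions before a cutoff date and derives the label only from the window after it. The cutoff date, the 90-day horizon, the function name `build_rfm_with_label`, and the toy data are all assumptions made for this example; the notebook's actual definitions may differ. Column names are assumed to follow the Online Retail II spreadsheet.

```python
import pandas as pd


def build_rfm_with_label(tx: pd.DataFrame, cutoff: pd.Timestamp,
                         horizon_days: int = 90) -> pd.DataFrame:
    """RFM features from pre-cutoff data; churn label from the post-cutoff window."""
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)

    # Features may only see transactions strictly before the cutoff.
    history = tx[tx["InvoiceDate"] < cutoff]
    # The label may only see transactions inside [cutoff, horizon_end).
    future = tx[(tx["InvoiceDate"] >= cutoff) & (tx["InvoiceDate"] < horizon_end)]

    history = history.assign(Revenue=history["Quantity"] * history["Price"])
    rfm = history.groupby("Customer ID").agg(
        recency_days=("InvoiceDate", lambda s: (cutoff - s.max()).days),
        frequency=("Invoice", "nunique"),
        monetary=("Revenue", "sum"),
    )

    # Assumed definition: a customer churned if they made no purchase in the horizon.
    active_later = future["Customer ID"].unique()
    rfm["churned"] = (~rfm.index.isin(active_later)).astype(int)
    return rfm


# Toy transactions (hypothetical values, not taken from the dataset).
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 3],
    "Invoice": ["A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-10", "2011-07-05", "2011-02-01", "2011-03-15", "2011-08-01"]
    ),
    "Quantity": [2, 1, 5, 1, 3],
    "Price": [10.0, 20.0, 4.0, 6.0, 7.0],
})

rfm = build_rfm_with_label(tx, cutoff=pd.Timestamp("2011-06-01"), horizon_days=90)
print(rfm)
```

Customer 3 does not appear in the result because they have no history before the cutoff, so there is nothing to build features from. Keeping the feature window and the label window disjoint is what prevents information from the outcome period leaking into the predictors.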
.
├── churn_analysis.ipynb # Full analysis notebook
├── data/
│ ├── raw/ # Raw Excel dataset
│ └── processed/ # Cleaned dataframes (pickle)
├── models/
│ └── artifacts/ # Trained models, scalers (joblib)
├── reports/
│ ├── figures/ # Generated plots (PNG)
│ └── tables/ # Summary statistics (CSV)
├── Predicting Customer Churn... # Final report (PDF + DOCX)
├── requirements.txt # Python dependencies
└── README.md
- Install dependencies: `pip install -r requirements.txt`
- Place `online_retail_II.xlsx` in `data/raw/` (already included).
- Open and run `churn_analysis.ipynb` from top to bottom.
All figures, tables, and model artifacts are saved automatically.
| Model | Test AUC |
|---|---|
| Logistic Regression | 0.810 |
| Random Forest | 0.818 |
| XGBoost | 0.822 |
XGBoost was selected as the final model. Feature importance via SHAP shows that recency, frequency, and recent purchase momentum are the strongest churn predictors.
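To make the Test AUC column easier to interpret: AUC is the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The sketch below computes it directly from that definition on made-up labels and scores. It is only an illustration of the metric, not the notebook's evaluation code, and the helper name `roc_auc` is hypothetical. In practice scikit-learn's `roc_auc_score` returns the same quantity.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (churner, non-churner) pairs ranked correctly.

    Ties between a churner and a non-churner count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churned) and predicted churn probabilities.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]

auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.3f}")
```

Here 8 of the 9 churner/non-churner pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.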
- Python 3.10+
- See `requirements.txt` for package versions