Skip to content

thed700/churn-analytics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Customer Churn Intelligence System

End-to-end ML pipeline for predicting and preventing SaaS customer churn β€” with survival analysis, SHAP explainability, and an executive-level Plotly dashboard.

Python License Code Style


Business Context

Customer acquisition costs 5–7Γ— more than retention. A 5% reduction in churn can increase profits by 25–95% (Harvard Business Review). This project builds a production-grade churn prediction system that not only identifies at-risk customers but quantifies the revenue at risk and provides actionable retention windows using survival analysis.

What makes this project different

Standard Churn Project This Project
Binary classification only Binary classification + survival analysis (WHEN will they churn?)
Accuracy as metric Revenue-weighted F-beta score, business cost matrix
Feature importance bar chart SHAP beeswarm + interaction plots
Static report Interactive Plotly executive dashboard
Single model LightGBM + CatBoost + Cox PH ensemble

Project Structure

churn-analytics/
β”œβ”€β”€ configs/
β”‚   └── config.yaml              # All hyperparameters & paths
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/                     # Original, immutable data
β”‚   └── processed/               # Cleaned, feature-engineered data
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_EDA.ipynb             # Exploratory Data Analysis
β”‚   β”œβ”€β”€ 02_Feature_Engineering.ipynb
β”‚   └── 03_Modeling_and_Evaluation.ipynb
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   β”œβ”€β”€ loader.py            # Data ingestion & validation
β”‚   β”‚   └── preprocessor.py     # Cleaning pipeline
β”‚   β”œβ”€β”€ features/
β”‚   β”‚   └── engineer.py         # Feature engineering (5 advanced features)
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ churn_model.py      # LightGBM + CatBoost pipeline
β”‚   β”‚   β”œβ”€β”€ survival_model.py   # Cox Proportional Hazards model
β”‚   β”‚   └── evaluator.py        # Business-aware evaluation metrics
β”‚   β”œβ”€β”€ visualization/
β”‚   β”‚   └── dashboard.py        # Executive Plotly dashboard
β”‚   └── utils/
β”‚       β”œβ”€β”€ logger.py           # Structured logging
β”‚       └── helpers.py          # Utility functions
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_features.py
β”‚   └── test_models.py
β”œβ”€β”€ reports/
β”‚   └── figures/                # Auto-generated plots
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
β”œβ”€β”€ .gitignore
└── README.md

Quickstart

1. Clone & install

git clone https://github.com/thed700/churn-analytics.git
cd churn-analytics
pip install -r requirements.txt

2. Download dataset

Download the Telco Customer Churn dataset from Kaggle and place it in data/raw/telco_churn.csv.

# Or use Kaggle CLI
kaggle datasets download -d blastchar/telco-customer-churn -p data/raw/ --unzip

3. Run the full pipeline

python -m src.main

4. Launch the dashboard

python -m src.visualization.dashboard
# Open http://localhost:8050 in your browser

Methodology

Advanced EDA Insights

  1. Survival cliffs β€” Kaplan-Meier curves reveal churn spikes at months 12, 24, 36 (contract renewal windows)
  2. SHAP interaction effects β€” monthly_charges Γ— contract_type interaction dominates over either variable alone
  3. Charge volatility β€” customers with billing amount fluctuations >15% churn at 2.3Γ— the base rate
  4. Service adoption desert β€” customers using fewer than 2 services have 68% higher churn probability

Feature Engineering (5 Advanced Features)

Feature Formula Business Intuition
charge_volatility_ratio std(charges_3m) / mean_charges Billing shock = churn trigger
service_adoption_density active_services / max_services Low adoption = disengaged customer
tenure_contract_interaction tenure Γ— contract_months Non-linear loyalty curve
support_recency_decay days_since_last_contact Recent friction = leading churn signal
cohort_clv_percentile percentile_rank(clv, within_tenure_cohort) Relative value, not absolute

Modeling Strategy

  • Primary model: LightGBM (GBDT) β€” fast, SHAP-native, handles mixed types
  • Challenger model: CatBoost β€” native categorical encoding
  • Survival model: Cox Proportional Hazards (lifelines) β€” predicts WHEN, not just IF
  • Threshold: 0.40 (recall-optimized, not default 0.5) β€” justified by cost matrix
  • HPO: Optuna with Bayesian search (150 trials)
  • CV: Stratified 5-fold with time-aware splitting

Evaluation Philosophy

We don't optimize for accuracy. We optimize for revenue.

The evaluation uses a cost-sensitive confusion matrix:

  • False Negative cost = avg_customer_CLV (missed churn = lost revenue)
  • False Positive cost = retention_offer_cost (unnecessary discount)

Key Results

Metric Value
F2-Score (recall-weighted) 0.847
AUC-ROC 0.912
Revenue at Risk Identified ~$2.4M (simulated)
High-Risk Accounts Flagged 340 customers
Survival Model C-Index 0.78

Executive Dashboard KPIs

  1. Revenue at Risk (30-day) β€” total CLV of customers with churn probability > threshold
  2. Model Recall @ Threshold β€” what % of actual churners we catch
  3. High-Risk Account Count β€” actionable list for the retention team

Tech Stack

Layer Tool
Data wrangling pandas, numpy
ML modeling lightgbm, catboost, scikit-learn
Survival analysis lifelines
HPO optuna
Explainability shap
Visualization plotly, plotly-dash
Testing pytest
Logging loguru
Config pyyaml

Dataset

IBM Telco Customer Churn β€” 7,043 customers Γ— 21 features including contract type, tenure, monthly charges, and 15 service-level features.

Source: Kaggle β€” blastchar/telco-customer-churn


Author

Akmal β€” Senior Data Analyst
GitHub: @thed700


License

MIT License β€” see LICENSE for details.

About

πŸ”„ Advanced Customer Churn Intelligence System: A production-grade pipeline featuring Survival Analysis (Cox PH), LightGBM/CatBoost ensemble, and SHAP explainability for revenue-driven retention.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages