End-to-end ML pipeline for predicting and preventing SaaS customer churn, with survival analysis, SHAP explainability, and an executive-level Plotly dashboard.
Acquiring a customer costs 5 to 7 times more than retaining one, and a 5% reduction in churn can increase profits by 25-95% (Harvard Business Review). This project builds a production-grade churn prediction system that identifies at-risk customers, quantifies the revenue at risk, and uses survival analysis to provide actionable retention windows.
| Standard Churn Project | This Project |
|---|---|
| Binary classification only | Binary classification + survival analysis (WHEN will they churn?) |
| Accuracy as metric | Revenue-weighted F-beta score, business cost matrix |
| Feature importance bar chart | SHAP beeswarm + interaction plots |
| Static report | Interactive Plotly executive dashboard |
| Single model | LightGBM + CatBoost + Cox PH ensemble |
churn-analytics/
├── configs/
│   └── config.yaml                  # All hyperparameters & paths
├── data/
│   ├── raw/                         # Original, immutable data
│   └── processed/                   # Cleaned, feature-engineered data
├── notebooks/
│   ├── 01_EDA.ipynb                 # Exploratory Data Analysis
│   ├── 02_Feature_Engineering.ipynb
│   └── 03_Modeling_and_Evaluation.ipynb
├── src/
│   ├── data/
│   │   ├── loader.py                # Data ingestion & validation
│   │   └── preprocessor.py          # Cleaning pipeline
│   ├── features/
│   │   └── engineer.py              # Feature engineering (5 advanced features)
│   ├── models/
│   │   ├── churn_model.py           # LightGBM + CatBoost pipeline
│   │   ├── survival_model.py        # Cox Proportional Hazards model
│   │   └── evaluator.py             # Business-aware evaluation metrics
│   ├── visualization/
│   │   └── dashboard.py             # Executive Plotly dashboard
│   └── utils/
│       ├── logger.py                # Structured logging
│       └── helpers.py               # Utility functions
├── tests/
│   ├── test_features.py
│   └── test_models.py
├── reports/
│   └── figures/                     # Auto-generated plots
├── requirements.txt
├── setup.py
├── .gitignore
└── README.md
git clone https://github.com/thed700/churn-analytics.git
cd churn-analytics
pip install -r requirements.txt

Download the Telco Customer Churn dataset from Kaggle and place it in data/raw/telco_churn.csv.

# Or use Kaggle CLI
kaggle datasets download -d blastchar/telco-customer-churn -p data/raw/ --unzip

python -m src.main

python -m src.visualization.dashboard
# Open http://localhost:8050 in your browser

- Survival cliffs: Kaplan-Meier curves reveal churn spikes at months 12, 24, 36 (contract renewal windows)
- SHAP interaction effects: the `monthly_charges × contract_type` interaction dominates over either variable alone
- Charge volatility: customers with billing amount fluctuations >15% churn at 2.3× the base rate
- Service adoption desert: customers using fewer than 2 services have 68% higher churn probability
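
To make the "survival cliffs" finding concrete, here is a minimal pure-Python sketch of the Kaplan-Meier estimator. It is not the project's code (the tech stack lists lifelines for survival analysis); the function name `kaplan_meier` and the toy tenure data are made up for illustration.

```python
from collections import Counter


def kaplan_meier(durations, churned):
    """Return [(t, S(t))] for every time t with at least one observed churn.

    durations: tenure in months for each customer
    churned:   True if the customer churned at that tenure,
               False if still active (right-censored)
    """
    events = Counter(t for t, e in zip(durations, churned) if e)
    exits = Counter(durations)  # churned or censored, both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the share that survived time t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data, invented for this example only
tenure = [3, 12, 12, 12, 18, 24, 24, 30, 36, 36]
churned = [True, True, True, False, False, True, True, False, True, False]

for month, s in kaplan_meier(tenure, churned):
    print(f"month {month:>2}: S(t) = {s:.3f}")
```

A sharp drop in S(t) between two consecutive event times is what shows up as a "cliff" on the curve; in this toy data the drops fall at months 12, 24, and 36 by construction.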
| Feature | Formula | Business Intuition |
|---|---|---|
| `charge_volatility_ratio` | `std(charges_3m) / mean_charges` | Billing shock = churn trigger |
| `service_adoption_density` | `active_services / max_services` | Low adoption = disengaged customer |
| `tenure_contract_interaction` | `tenure × contract_months` | Non-linear loyalty curve |
| `support_recency_decay` | `days_since_last_contact` | Recent friction = leading churn signal |
| `cohort_clv_percentile` | `percentile_rank(clv, within_tenure_cohort)` | Relative value, not absolute |
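
The formulas in the table can be sketched in plain Python as below. This is an illustration, not the contents of `src/features/engineer.py`. Several details are assumptions: the population standard deviation is used, `mean_charges` is taken over the same three months, and the percentile rank is defined as the share of the cohort at or below the value. `support_recency_decay` is omitted because the table gives only its input, not a decay formula.

```python
from statistics import mean, pstdev


def charge_volatility_ratio(charges_3m):
    """std(charges_3m) / mean_charges, using the 3-month mean (assumption)."""
    avg = mean(charges_3m)
    return pstdev(charges_3m) / avg if avg else 0.0


def service_adoption_density(active_services, max_services):
    """Fraction of the available services the customer actually uses."""
    return active_services / max_services


def tenure_contract_interaction(tenure_months, contract_months):
    """Product of tenure and contract length, both in months."""
    return tenure_months * contract_months


def percentile_rank(value, cohort_values):
    """Share of the cohort with a value <= this one, in (0, 1]."""
    return sum(v <= value for v in cohort_values) / len(cohort_values)


# Hypothetical customer, values invented for the example
print(charge_volatility_ratio([70.0, 80.0, 90.0]))
print(service_adoption_density(active_services=2, max_services=8))
print(tenure_contract_interaction(tenure_months=14, contract_months=12))
print(percentile_rank(500.0, [200.0, 500.0, 800.0, 1200.0]))
```

Ranking CLV within a tenure cohort, rather than across all customers, keeps new accounts from being ranked low simply because they have had less time to accumulate value.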
- Primary model: LightGBM (GBDT); fast, SHAP-native, handles mixed types
- Challenger model: CatBoost; native categorical encoding
- Survival model: Cox Proportional Hazards (lifelines); predicts WHEN, not just IF
- Threshold: 0.40 (recall-optimized, not the default 0.5), justified by the cost matrix
- HPO: Optuna with Bayesian search (150 trials)
- CV: Stratified 5-fold with time-aware splitting
We don't optimize for accuracy. We optimize for revenue.
The evaluation uses a cost-sensitive confusion matrix:
- False Negative cost = `avg_customer_CLV` (missed churn = lost revenue)
- False Positive cost = `retention_offer_cost` (unnecessary discount)
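
The sketch below shows one way a cost matrix like this can drive the choice of decision threshold. The function names, the threshold grid, the labels, the probabilities, and the two cost figures are all invented for illustration; it is not the logic in `src/models/evaluator.py`.

```python
def total_cost(y_true, y_prob, threshold, fn_cost, fp_cost):
    """Sum the business cost of every misclassification at a given threshold."""
    cost = 0.0
    for actual, p in zip(y_true, y_prob):
        flagged = p >= threshold
        if actual and not flagged:
            cost += fn_cost  # missed churner: lost customer value
        elif flagged and not actual:
            cost += fp_cost  # unnecessary retention offer
    return cost


def best_threshold(y_true, y_prob, fn_cost, fp_cost, grid=None):
    """Pick the grid threshold with the lowest total cost (first one on ties)."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    return min(grid, key=lambda t: total_cost(y_true, y_prob, t, fn_cost, fp_cost))


# Toy labels and scores, invented for the example
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.45, 0.30, 0.60, 0.35, 0.20, 0.10, 0.05]
FN_COST = 1000.0  # stands in for avg_customer_CLV
FP_COST = 100.0   # stands in for retention_offer_cost

for t in (0.5, 0.4):
    print(t, total_cost(y_true, y_prob, t, FN_COST, FP_COST))
print("cheapest threshold on the grid:", best_threshold(y_true, y_prob, FN_COST, FP_COST))
```

Because a missed churner is much more expensive than a wasted offer, the cheapest threshold sits below 0.5. For well-calibrated probabilities, the break-even point is roughly `fp_cost / (fp_cost + fn_cost)`, assuming every retention offer succeeds.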
| Metric | Value |
|---|---|
| F2-Score (recall-weighted) | 0.847 |
| AUC-ROC | 0.912 |
| Revenue at Risk Identified | ~$2.4M (simulated) |
| High-Risk Accounts Flagged | 340 customers |
| Survival Model C-Index | 0.78 |
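
For readers unfamiliar with the F2 score in the table, this small sketch computes F-beta from confusion-matrix counts. With beta = 2, recall is weighted more heavily than precision. The counts are made up and unrelated to the reported 0.847; scikit-learn's `fbeta_score` offers the same metric computed from label arrays.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Invented counts: precision = 80/120, recall = 80/100
tp, fp, fn = 80, 40, 20
print("F1:", round(fbeta_from_counts(tp, fp, fn, beta=1.0), 4))
print("F2:", round(fbeta_from_counts(tp, fp, fn, beta=2.0), 4))
```

Here recall (0.80) is higher than precision (about 0.67), so F2 comes out above F1, reflecting the heavier penalty that F2 places on missed churners.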
- Revenue at Risk (30-day): total CLV of customers with churn probability > threshold
- Model Recall @ Threshold: the percentage of actual churners we catch
- High-Risk Account Count: actionable list for the retention team
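
The three KPIs above can be sketched as a single function over scored customers. The `Customer` dataclass, its field names, and the sample records are hypothetical. The sketch also ignores the 30-day horizon, which in the project would presumably come from the survival model.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    churn_prob: float  # model score in [0, 1]
    clv: float         # customer lifetime value
    churned: bool      # known outcome, only available on labeled data


def dashboard_kpis(customers, threshold=0.40):
    """Compute revenue at risk, recall at threshold, and the high-risk list."""
    flagged = [c for c in customers if c.churn_prob > threshold]
    churners = [c for c in customers if c.churned]
    caught = [c for c in flagged if c.churned]
    ranked = sorted(flagged, key=lambda c: c.churn_prob, reverse=True)
    return {
        "revenue_at_risk": sum(c.clv for c in flagged),
        "recall_at_threshold": len(caught) / len(churners) if churners else 0.0,
        "high_risk_accounts": [c.customer_id for c in ranked],
    }


# Sample records, invented for the example
customers = [
    Customer("A", 0.85, 1200.0, True),
    Customer("B", 0.55, 800.0, False),
    Customer("C", 0.35, 1500.0, True),
    Customer("D", 0.10, 600.0, False),
]
kpis = dashboard_kpis(customers)
print(kpis)
```

Sorting the flagged accounts by churn probability gives the retention team a worklist ordered by urgency; sorting by `churn_prob * clv` would be a reasonable alternative when offers are limited.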
| Layer | Tool |
|---|---|
| Data wrangling | pandas, numpy |
| ML modeling | lightgbm, catboost, scikit-learn |
| Survival analysis | lifelines |
| HPO | optuna |
| Explainability | shap |
| Visualization | plotly, dash |
| Testing | pytest |
| Logging | loguru |
| Config | pyyaml |
IBM Telco Customer Churn: 7,043 customers × 21 features, including contract type, tenure, monthly charges, and 15 service-level features.
Source: Kaggle, blastchar/telco-customer-churn
Akmal, Senior Data Analyst
GitHub: @thed700
MIT License. See LICENSE for details.