Customer Satisfaction (CSAT) is a critical indicator of service quality and customer loyalty in e-commerce. This project analyzes large-scale customer support interaction data to identify the key drivers of customer satisfaction and to build a machine learning model that predicts whether a customer is Satisfied or Unsatisfied after an interaction.
The project follows a complete end‑to‑end Data Science pipeline including:
- Data cleaning & wrangling
- Exploratory Data Analysis (EDA)
- Statistical hypothesis testing
- Feature engineering
- Handling class imbalance (SMOTE)
- Machine learning model training & evaluation
- Hyperparameter tuning
- Business insights & conclusions
Dataset overview:
- Records: 85,907
- Features: 20 (categorical, numerical, temporal)
- Target Variable: CSAT Score (1–5), later binarized into Satisfied vs. Unsatisfied for modeling
Key columns include:
- Channel Type (Inbound, Outcall, Email)
- Issue Category & Sub‑Category
- Agent, Supervisor, Manager
- Tenure Bucket & Agent Shift
- Issue Reported Time & Response Time
After cleaning and feature selection, the final modeling dataset contained 82,779 rows and 15 features.
Tech stack:
- Language: Python 3.12
- Libraries:
  - NumPy, Pandas
  - Matplotlib, Seaborn
  - Scikit‑learn
  - SciPy
  - Imbalanced‑learn (SMOTE)
- Model Persistence: joblib
- Environment: Jupyter Notebook, Anaconda
Data cleaning & feature engineering steps:
- Dropped high‑missing and low‑relevance columns
- Converted datetime columns into proper formats
- Engineered a new feature: response_time_minutes
- Extracted time‑based features (hour, day of week, survey day)
- Median imputation for skewed numeric values
- Outlier treatment using 99th percentile capping
- Label encoding for categorical features
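The preprocessing steps above can be sketched as follows. This is a minimal illustration on toy data; the column names (`issue_reported_time`, `issue_responded_time`, `channel`) are placeholders and should be adjusted to the actual CSV schema.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows standing in for the support-interaction data (hypothetical column names)
df = pd.DataFrame({
    "issue_reported_time": ["2023-08-01 09:15", "2023-08-01 10:00"],
    "issue_responded_time": ["2023-08-01 09:45", "2023-08-01 12:30"],
    "channel": ["Inbound", "Email"],
})

# Convert datetime columns into proper formats
for col in ["issue_reported_time", "issue_responded_time"]:
    df[col] = pd.to_datetime(df[col])

# Engineered feature: response_time_minutes
df["response_time_minutes"] = (
    df["issue_responded_time"] - df["issue_reported_time"]
).dt.total_seconds() / 60

# Time-based features
df["report_hour"] = df["issue_reported_time"].dt.hour
df["report_dayofweek"] = df["issue_reported_time"].dt.dayofweek

# Median imputation for skewed numerics, then 99th-percentile capping
median = df["response_time_minutes"].median()
df["response_time_minutes"] = df["response_time_minutes"].fillna(median)
cap = df["response_time_minutes"].quantile(0.99)
df["response_time_minutes"] = df["response_time_minutes"].clip(upper=cap)

# Label encoding for categorical features
df["channel_encoded"] = LabelEncoder().fit_transform(df["channel"])
```

Capping is applied after imputation here so the median is not distorted by extreme values already removed; the notebook may order these steps differently.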
Key insights from EDA:
- 69% of customers rated CSAT = 5, indicating strong overall satisfaction
- Faster response times strongly correlate with higher CSAT
- Morning and split shifts perform slightly better than night shifts
- Experienced agents consistently achieve higher CSAT
- Refunds & returns show lower satisfaction and higher response times
Multiple univariate, bivariate, and multivariate visualizations were created to validate these patterns.
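A typical bivariate check behind these insights can be sketched with a simple groupby; the values below are illustrative toy data, not figures from the project.

```python
import pandas as pd

# Toy interaction data (hypothetical values, for illustration only)
df = pd.DataFrame({
    "csat_score": [5, 5, 4, 2, 1, 5],
    "response_time_minutes": [12, 20, 35, 90, 150, 15],
    "agent_shift": ["Morning", "Night", "Morning", "Night", "Night", "Split"],
})

# Average response time per CSAT score (faster responses -> higher CSAT)
rt_by_csat = df.groupby("csat_score")["response_time_minutes"].mean()

# Share of top (CSAT = 5) ratings per agent shift
top_share = (df["csat_score"] == 5).groupby(df["agent_shift"]).mean()

print(rt_by_csat)
print(top_share)
```

The same aggregations feed directly into bar plots with Matplotlib or Seaborn.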
Three statistical tests were performed:
- ANOVA: Response Time vs CSAT → Significant relationship (p < 0.001)
- Chi‑Square: Channel Type vs CSAT → CSAT depends on channel (p < 0.001)
- T‑Test: New vs Experienced Agents → Mixed but informative results
These tests ensured that insights from EDA were statistically valid.
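The three tests map onto standard SciPy calls. The sketch below uses synthetic samples (randomly generated, not the project's data) purely to show the API shape.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# ANOVA: response-time samples for each of the five CSAT groups (synthetic)
groups = [rng.normal(loc=mu, scale=10, size=200) for mu in (60, 50, 40, 30, 20)]
f_stat, p_anova = stats.f_oneway(*groups)

# Chi-square: toy contingency table, rows = channel types, cols = Satisfied/Unsatisfied
contingency = np.array([[120, 30], [200, 90], [400, 60]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)

# Welch's t-test: new vs experienced agents' CSAT (synthetic scores)
new_agents = rng.normal(4.0, 0.8, 300)
experienced = rng.normal(4.2, 0.7, 300)
t_stat, p_ttest = stats.ttest_ind(new_agents, experienced, equal_var=False)
```

With clearly separated group means, the ANOVA p-value comes out far below 0.001, mirroring the reported result.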
Class imbalance handling:
- The dataset was highly imbalanced (~82% Satisfied vs. 18% Unsatisfied)
- Applied SMOTE (Synthetic Minority Oversampling Technique) on training data
- Balanced class distribution to 50/50 before model training
Three classification models were trained and evaluated:
| Model | Accuracy | F1‑Score |
|---|---|---|
| Logistic Regression | 58% | 0.71 |
| Random Forest | 67% | 0.78 |
| Gradient Boosting (Best) | 74% | 0.84 |
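A minimal sketch of the three-model comparison, using synthetic data in place of the prepared dataset (default hyperparameters here, so the scores will not match the table):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
```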
Hyperparameter tuning:
- Tuned using GridSearchCV (5‑fold CV)
- Final parameters: learning_rate = 0.2, max_depth = 7, n_estimators = 300
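The search setup can be sketched as follows. A smaller synthetic dataset and a reduced grid keep the example fast; the full run used the grid that produced learning_rate = 0.2, max_depth = 7, n_estimators = 300.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Reduced grid for illustration; the real search covered a wider range
param_grid = {
    "learning_rate": [0.1, 0.2],
    "max_depth": [3, 7],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,            # 5-fold cross-validation, as in the project
    scoring="f1",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```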
Final tuned performance:
- Accuracy: ~74%
- F1‑Score: ~0.84
- Recall (Satisfied class): ~0.84
The most influential features in the final model:
- Response Time
- Agent Name
- Supervisor
- Issue Category & Sub‑Category
- Tenure Bucket
- Agent Shift
These features provide direct operational levers for business improvement.
This project provides actionable insights for business teams:
- Reduce response time to directly improve satisfaction
- Assign complex issues to experienced agents
- Improve refund & return workflows
- Optimize staffing by shift performance
The trained model can be integrated into a live dashboard to flag high‑risk (unsatisfied) interactions in real time.
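Such an integration boils down to loading the pickled model with joblib and thresholding the predicted probability of the Unsatisfied class. The sketch below trains a small stand-in model rather than loading the repo's `best_csat_model_GradientBoosting.pkl`; the 15-feature shape matches the final dataset, but the class encoding (0 = Unsatisfied, 1 = Satisfied) is an assumption.

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model on random features (assumed encoding: 0 = Unsatisfied, 1 = Satisfied)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
joblib.dump(GradientBoostingClassifier(random_state=0).fit(X, y), "csat_model_demo.pkl")

# Dashboard side: reload the model and flag interactions at risk of dissatisfaction
loaded = joblib.load("csat_model_demo.pkl")
unsat_proba = loaded.predict_proba(X[:5])[:, 0]  # P(class 0 = Unsatisfied)
high_risk = unsat_proba > 0.5                    # flag for follow-up
```

The 0.5 threshold is illustrative; in production it would be tuned against the cost of missed unsatisfied customers versus follow-up workload.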
Repository structure:

    ├── CSAT_Prediction.ipynb
    ├── best_csat_model_GradientBoosting.pkl
    ├── Customer_support_data.csv
    └── README.md
To run the project:

1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Open and run the notebook: `jupyter notebook CSAT_Prediction.ipynb`
This project successfully demonstrates how data‑driven analytics and machine learning can be used to understand and predict customer satisfaction at scale. By combining statistical validation with predictive modeling, the study bridges the gap between business insight and AI‑driven decision making.
Developed as part of an individual data science project for portfolio and interview preparation.