This project segments mall customers using KMeans clustering on annual income and spending score. It includes exploratory analysis, model selection visuals, clustering, and evaluation metrics.
Mall_Customers.csv includes the following columns:
CustomerID, Gender, Age, Annual Income (k$), Spending Score (1-100)
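As a minimal sketch of checking those columns before any analysis, the snippet below parses a tiny in-memory CSV with the standard library. The two data rows are made up for illustration, the helper name `load_features` is hypothetical, and it assumes the header text matches the column names above exactly.

```python
import csv
import io

# Column names as listed above.
EXPECTED_COLUMNS = [
    "CustomerID",
    "Gender",
    "Age",
    "Annual Income (k$)",
    "Spending Score (1-100)",
]

# Hypothetical rows, made up for illustration only. For the real file, pass
# open("Mall_Customers.csv", newline="") instead of the StringIO below.
SAMPLE_CSV = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,30,40,60
2,Female,45,85,20
"""


def load_features(handle):
    """Check the header, then return (income, spending) pairs as floats."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [
        (float(row["Annual Income (k$)"]), float(row["Spending Score (1-100)"]))
        for row in reader
    ]


features = load_features(io.StringIO(SAMPLE_CSV))
print(features)
```

Failing early on a missing column keeps a renamed or truncated header from surfacing later as a confusing error inside the clustering step.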

The workflow includes:
- Exploratory analysis of feature distributions
- Feature scaling with StandardScaler
- Elbow method and silhouette score for cluster selection
- KMeans clustering (k=5)
- Cluster profiling (mean income/spending and count)
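The scaling, clustering, and profiling steps could look roughly like the sketch below. It is not the project's script: it runs on synthetic stand-in data rather than Mall_Customers.csv, and the `random_state`, `n_init`, and profile column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

INCOME = "Annual Income (k$)"
SPENDING = "Spending Score (1-100)"

# Synthetic stand-in data (NOT the real dataset): five loose blobs.
rng = np.random.default_rng(0)
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 15], [90, 85]])
points = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centers])
df = pd.DataFrame(points, columns=[INCOME, SPENDING])

# Standardize so both features contribute comparably to Euclidean distance.
X = StandardScaler().fit_transform(df[[INCOME, SPENDING]])

# Fit KMeans with k=5; n_init and random_state are assumed values.
model = KMeans(n_clusters=5, n_init=10, random_state=42)
df["Cluster"] = model.fit_predict(X)

# Profile each cluster in original units: mean income, mean spending, size.
profile = df.groupby("Cluster").agg(
    mean_income=(INCOME, "mean"),
    mean_spending=(SPENDING, "mean"),
    count=(INCOME, "size"),
)
print(profile)
```

Clustering on the scaled matrix while profiling on the unscaled columns keeps the summary readable in k$ and score points.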
All outputs are saved in reports/:
- eda_income_distribution.png
- eda_spending_distribution.png
- eda_income_vs_spending.png
- elbow_method.png
- silhouette_by_k.png
- davies_bouldin_by_k.png
- calinski_harabasz_by_k.png
- customer_segments.png
- cluster_profile.csv
- metrics.csv
- metrics_by_k.csv
- segmented_customers.csv

Run the script:

python kmeans_customer_segmentation.py

- Evaluation metrics are saved to reports/metrics.csv
- Cluster profiles are saved to reports/cluster_profile.csv
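
The evaluation metrics read as follows:
- Silhouette Score: ranges from -1 to 1; higher is better, and values near 0 indicate overlapping clusters.
- Davies-Bouldin Index: non-negative; lower is better, with 0 as the best possible value.
- Calinski-Harabasz Index: unbounded above; higher is better.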
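One way to compute these three metrics, plus WCSS, for a range of k is sketched below with scikit-learn. The data is synthetic, and the k range, `random_state`, and dictionary keys are assumptions rather than the values the project's script uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset): five loose blobs.
rng = np.random.default_rng(0)
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 15], [90, 85]])
raw = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centers])
X = StandardScaler().fit_transform(raw)

rows = []
for k in range(2, 9):  # assumed candidate range
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    labels = model.labels_
    rows.append(
        {
            "k": k,
            "wcss": float(model.inertia_),  # within-cluster sum of squares
            "silhouette": float(silhouette_score(X, labels)),
            "davies_bouldin": float(davies_bouldin_score(X, labels)),
            "calinski_harabasz": float(calinski_harabasz_score(X, labels)),
        }
    )

for row in rows:
    print(row)
```

All three scores need at least two clusters, which is why the loop starts at k=2.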
To justify k=5, compare metrics across k in reports/metrics_by_k.csv alongside the elbow and silhouette plots.
Use these heuristics together:
- Pick a k at the elbow, where the reduction in within-cluster sum of squares (WCSS) slows.
- Prefer higher silhouette scores.
- Prefer lower Davies-Bouldin scores.
- Prefer higher Calinski-Harabasz scores.
If these diagnostics agree around k=5, it is a strong, defensible choice.
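The heuristics above can be combined as a simple vote, sketched here in plain Python. The numbers are hypothetical and made up purely to illustrate the logic; they are not results from this project. The dictionary keys are assumed column names for reports/metrics_by_k.csv, and the elbow rule (largest second difference of WCSS) is one common approximation, not necessarily the one the script applies.

```python
# Hypothetical per-k metrics, made up for illustration only.
metrics_by_k = [
    {"k": 3, "wcss": 160.0, "silhouette": 0.47, "davies_bouldin": 0.72, "calinski_harabasz": 150.0},
    {"k": 4, "wcss": 110.0, "silhouette": 0.49, "davies_bouldin": 0.71, "calinski_harabasz": 175.0},
    {"k": 5, "wcss": 65.0, "silhouette": 0.55, "davies_bouldin": 0.57, "calinski_harabasz": 248.0},
    {"k": 6, "wcss": 55.0, "silhouette": 0.54, "davies_bouldin": 0.66, "calinski_harabasz": 243.0},
    {"k": 7, "wcss": 48.0, "silhouette": 0.53, "davies_bouldin": 0.72, "calinski_harabasz": 230.0},
]


def elbow_k(rows):
    """Return the k where the WCSS improvement drops the most.

    Uses the second difference of WCSS over consecutive k values, so the
    first and last candidates can never be selected.
    """
    best_k, best_bend = None, float("-inf")
    for prev, cur, nxt in zip(rows, rows[1:], rows[2:]):
        bend = (prev["wcss"] - cur["wcss"]) - (cur["wcss"] - nxt["wcss"])
        if bend > best_bend:
            best_k, best_bend = cur["k"], bend
    return best_k


# Each diagnostic nominates its preferred k.
votes = {
    "elbow": elbow_k(metrics_by_k),
    "silhouette": max(metrics_by_k, key=lambda r: r["silhouette"])["k"],
    "davies_bouldin": min(metrics_by_k, key=lambda r: r["davies_bouldin"])["k"],
    "calinski_harabasz": max(metrics_by_k, key=lambda r: r["calinski_harabasz"])["k"],
}
agreed = len(set(votes.values())) == 1
print(votes, "agree" if agreed else "disagree")
```

When the votes disagree, the per-k table and plots are still the better guide than any single rule.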
This project demonstrates:
- Proper preprocessing and scaling
- Model selection with multiple diagnostics
- Clear, reproducible outputs
- Interpretable cluster summaries for stakeholders