End-to-End Data Analytics Portfolio Project β Python Β· SQL Β· Tableau
A retail business operating across the United States had accumulated rich transactional data from 3,900 customers β but had never systematically analysed it. The result: marketing budgets spent without ROI, high-value customers not identified, and churning customers not detected until it was too late.
This project builds a complete, data-driven intelligence system to answer five critical business questions:
| Pain Point | The Problem |
|---|---|
| π΄ No Segmentation | All 3,900 customers receive identical marketing regardless of loyalty or spend |
| π΄ Discount Inefficiency | 32% of customers use promo codes β but does it actually lift revenue? |
| π΄ Silent Churn | Low-frequency, disengaged customers leave silently, never flagged in advance |
| π΄ Missed Seasons | Seasonal demand patterns exist, but inventory and promotions are never aligned |
| π΄ Operational Blindspots | No benchmark for which shipping method or payment channel drives satisfaction |
customer-behaviour-analysis/
β
βββ π Customer_Behaviour_Analysis.ipynb # Python EDA + statistical analysis + churn model
βββ ποΈ SQL_Queries.sql # 30+ SQL queries (Basic β Advanced) + BI View
βββ π Customer_Behaviour_Analysis.twbx # Tableau workbook (2 dashboards)
βββ π README.md # You are here
β
βββ π assets/
βββ dashboard_segmentation.png # Customer Segmentation Dashboard screenshot
βββ dashboard_product.png # Product & Category Performance screenshot
Source: Shopping Trends Dataset
Records: 3,900 customers Β· 19 features Β· US market (50 states)
| Feature Group | Columns |
|---|---|
| Demographics | customer_id, age, age_group (derived), gender, location |
| Product | item_purchased, category, size, color, season |
| Transaction | purchase_amount_usd, review_rating, previous_purchases |
| Behaviour | frequency_of_purchases, subscription_status, payment_method, shipping_type |
| Marketing | discount_applied, promo_code_used, preferred_payment_method |
Key Stats:
Total Revenue β $233,081
Total Customers β 3,900
Avg Purchase β $59.76
Avg Rating β 3.75 / 5.0
File: Customer_Behaviour_Analysis.ipynb
Libraries: pandas Β· numpy Β· seaborn Β· matplotlib Β· plotly Β· scipy
1. Data Cleaning & EDA
- Null value check, data type inspection, shape analysis
- Purchase amount distribution (histogram + KDE)
- Spending by category (boxplot), category popularity (countplot)
- Average spend by season, gender distribution, location-based patterns
2. Outlier Detection
# IQR Method β identify anomalous high-spend customers
Q1 = df["Purchase Amount (USD)"].quantile(0.25)
Q3 = df["Purchase Amount (USD)"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[df["Purchase Amount (USD)"] > Q3 + 1.5 * IQR]3. Hypothesis Testing
# Welch's t-test: Does gender significantly affect spending?
_, p_value = stats.ttest_ind(male_spend, female_spend, equal_var=False)
# Result: p > 0.05 β Gender does NOT significantly affect purchase amount4. Correlation Analysis
# Age vs Purchase Amount β Pearson correlation
df[['Age', 'Purchase Amount (USD)']].corr()5. Confidence Interval Estimation
# 95% CI for true population mean of purchase amount
z = stats.norm.ppf(0.975)
lower = mean - z * (std / np.sqrt(n))
upper = mean + z * (std / np.sqrt(n))6. Churn Risk Scoring Model
def churn_score(freq, subscription):
score = 0
if freq in ['annually', 'every 3 months']: score += 2
elif freq == 'quarterly': score += 1
if subscription == 'No': score += 2
return score
df['churn_risk'] = df['churn_score'].apply(lambda x: 1 if x >= 3 else 0)Customers scoring β₯ 3 are flagged as high churn risk for proactive re-engagement.
File: SQL_Queries.sql
Database: MySQL Β· customer_behaviour schema Β· shopping table
- Total customers, revenue, average purchase amount
- Sales breakdown by gender, age group, category, item, size, color, season
- Payment method preferences, subscription vs non-subscription revenue
- Discount and promo code usage counts
- Average spend per customer, top 10 high-value customers
- Revenue cross-tabbed by: location Γ category, season Γ category, gender Γ category, age group Γ category
- Discount impact: total revenue, avg purchase, order count β with vs. without discount
- Subscription status vs frequency of purchase cross-analysis
- Shipping type vs average review rating (customer satisfaction benchmark)
RFM-Style Customer Segmentation:
SELECT
customer_id,
CASE
WHEN SUM(purchase_amount_usd) >= 300 THEN 'High_Value'
WHEN SUM(purchase_amount_usd) >= 150 THEN 'Medium_Value'
ELSE 'Low_Value'
END AS monetary_segment,
CASE
WHEN MAX(previous_purchases) >= 15 THEN 'Highly_Engaged'
WHEN MAX(previous_purchases) >= 5 THEN 'Moderately_Engaged'
ELSE 'Low_Engagement'
END AS engagement_segment,
CASE
WHEN frequency_of_purchases IN ('Weekly','Fortnightly') THEN 'High_Frequency'
WHEN frequency_of_purchases = 'Monthly' THEN 'Medium_Frequency'
ELSE 'Low_Frequency'
END AS frequency_segment
FROM shopping
GROUP BY customer_id, frequency_of_purchases;High-Value Loyal Customer Identification:
-- Spend β₯ $300 AND previous purchases β₯ 10 AND Weekly/Fortnightly frequency
SELECT customer_id, SUM(purchase_amount_usd) AS total_spent, ...
FROM shopping
GROUP BY customer_id, frequency_of_purchases
HAVING SUM(purchase_amount_usd) >= 300
AND MAX(previous_purchases) >= 10
AND frequency_of_purchases IN ('Weekly', 'Fortnightly');Churn-Risk Customer Isolation:
-- Spend < $100 AND previous purchases < 3 AND Annual/Quarterly frequency
HAVING SUM(purchase_amount_usd) < 100
AND MAX(previous_purchases) < 3
AND frequency_of_purchases IN ('Annually','Quarterly');Business Intelligence View (Tableau Data Source):
CREATE VIEW customer_behavior_intelligence AS
SELECT customer_id, gender, age, location, category, season,
subscription_status, frequency_of_purchases,
discount_applied, promo_code_used,
SUM(purchase_amount_usd) AS total_spent,
AVG(purchase_amount_usd) AS avg_spent,
MAX(previous_purchases) AS previous_purchases
FROM shopping
GROUP BY customer_id, gender, age, location, category, season,
subscription_status, frequency_of_purchases,
discount_applied, promo_code_used;File: Customer_Behaviour_Analysis.twbx
| Visual | Insight |
|---|---|
| Customer Segmentation Treemap | Purchase frequency distribution across 6 tiers |
| Age-wise Purchase Rate Treemap | Adults 37.97% Β· Young Adults 31.82% Β· Seniors 30.21% |
| Loyal Customers Table | All repeat buyers (previous purchases = 50) |
| Top 10 High-Value Customers | Ranked by composite score |
| High-Value Customer Bar Chart | Previous purchases vs purchase amount (26 customers) |
Filters: Location Β· Frequency of Purchases
| Visual | Insight |
|---|---|
| Operations Heatmap | Payment method Γ Shipping type transaction counts |
| Category-Season Matrix | Revenue cross-tab: Clothing leads at $104,264 |
| Size Trend Bar Chart | M (44,410) > L (27,071) > S (16,429) > XL (10,961) |
| Colour Preference Chart | Top colours: Cyan (4,519), Silver (4,550), Gray (4,499) |
| Seasonal Product Demand Table | Item-level demand split across Fall/Spring/Summer/Winter |
| Item-wise Performance Chart | Previous purchases by individual product |
| Marketing Effectiveness | Promo code impact: 32.14% Yes vs 67.86% No |
| High-level Category Bar | Accessories $74,200 Β· Footwear $36,093 Β· Outerwear $18,524 |
Filters: Location Β· Category Β· Item Purchased Β· Season
- Purchase frequency is evenly distributed across all 6 tiers (539β584 customers each) β no dominant buying pattern, signalling an opportunity to shift customers to higher-frequency tiers
- Adults (35β54) drive 37.97% of purchases, the largest segment, followed closely by Young Adults (31.82%) and Seniors (30.21%)
- Clothing dominates at $104,264 β 40% above Accessories ($74,200) and nearly 3Γ Outerwear ($18,524)
- Clothing revenue is stable across all four seasons, making it a reliable year-round inventory anchor
- Medium (M) is the dominant size at 44,410 units β assortment should reflect this distribution
- Discount and promo code usage does not significantly increase average purchase value β tools attract existing buyers rather than driving incremental spend
- Subscribed customers purchase more frequently, making subscription conversion the priority retention strategy
| Test | Finding |
|---|---|
| Welch's t-test (Gender vs Spend) | p > 0.05 β gender has no statistically significant impact on purchase amount |
| 95% Confidence Interval | True population mean of purchase amount falls within a narrow, stable range |
| IQR Outlier Detection | No extreme outliers found β purchase distribution is clean |
| # | Recommendation | Rationale |
|---|---|---|
| 01 | Tiered Loyalty Programme | Use RFM segments to create Gold/Silver/Bronze tiers; prioritise High Value + Highly Engaged customers |
| 02 | Convert Discount Users to Subscribers | Redirect 32% promo-code budget toward subscription conversion β subscribed customers spend more consistently |
| 03 | Proactive Churn Intervention | Run churn score model monthly; flag score β₯ 3 customers for personalised re-engagement within 30 days |
| 04 | Seasonal Stock Alignment | Prioritise Clothing year-round, Outerwear in Fall/Winter; align promotional spend to Spring Accessories peaks |
| 05 | Senior Customer Strategy | Seniors are 30% of purchases but under-targeted β design dedicated product bundles and communication |
| 06 | Operational Shipping Review | Benchmark shipping type vs satisfaction ratings; route customers to highest-rated shipping options |
# 1. Clone the repository
git clone https://github.com/soumimukherjee22/customer-behaviour-analysis.git
cd customer-behaviour-analysis
# 2. Install dependencies
pip install pandas numpy seaborn matplotlib plotly scipy jupyter
# 3. Launch the notebook
jupyter notebook Customer_Behaviour_Analysis.ipynb-- 1. Create the database
CREATE DATABASE customer_behaviour;
USE customer_behaviour;
-- 2. Import the dataset as the 'shopping' table
-- (Use MySQL Workbench import wizard or LOAD DATA INFILE)
-- 3. Run SQL_Queries.sql top-to-bottom
-- It will: clean columns β add age_group β run all 30+ queries β create the BI view
SOURCE SQL_Queries.sql;1. Open Tableau Desktop (2023+ recommended)
2. Open Customer_Behaviour_Analysis.twbx
3. The workbook connects to embedded data β no additional setup needed
4. Use the tab bar to navigate between dashboards
5. Interact with Location, Category, and Season filters to explore the data
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.10+ | EDA, statistical analysis, churn modelling |
| pandas | 2.0+ | Data manipulation and aggregation |
| NumPy | 1.24+ | Numerical computation |
| seaborn / matplotlib | Latest | Static visualisations |
| Plotly | 5.0+ | Interactive charts |
| SciPy | 1.10+ | Hypothesis testing, statistical tests |
| MySQL | 8.0 | Database, querying, BI view creation |
| Tableau Desktop | 2023+ | Interactive dashboards |
| File | Description |
|---|---|
Customer_Behaviour_Analysis.ipynb |
Full EDA, hypothesis testing, confidence interval, churn scoring |
SQL_Queries.sql |
30+ queries across 3 tiers + RFM segmentation + BI VIEW |
Customer_Behaviour_Analysis.twbx |
2 interactive Tableau dashboards |
README.md |
This file |
Soumi Mukherjee
Data Analyst Β· Python Β· SQL Β· Tableau
This project is licensed under the MIT License β see the LICENSE file for details.

