Author: Loc Ha
Date: 2025-04-25
I. π Background & Overview
II. π Dataset Description & Data Structure
III. βοΈ Main Process
IV. π Key Insights & Recommendations
SuperStore is a fictional retail company based in the United States, specializing in all-occasion gifts.
During the Christmas and New Year season, the Marketing team plans to launch loyalty and promotional campaigns. Manual customer segmentation with Excel is no longer feasible at scale.
This project applies RFM (Recency β Frequency β Monetary) analysis to segment customers of SuperStore, aiming to identify valuable customer groups and provide actionable insights for targeted marketing and retention campaigns.
β Business Questions:
- Which customer groups are most valuable to retain?
- What segments show high churn risk?
- How can the company increase purchase frequency and customer lifetime value?
- Source: Online Retail Transaction Dataset (UK-based non-store online retail)
- Period: 01/12/2010 β 09/12/2011
- Size: ~500K transactions
- Format:
.csv
π Table Schema (Simplified)
| Column Name | Description | Data Type |
|---|---|---|
| InvoiceNo | Transaction ID. If starts with 'C' β cancel | String/Int |
| StockCode | Product code, unique | String/Int |
| Description | Product name | String |
| Quantity | Quantity per transaction | Int |
| InvoiceDate | Date and time of transaction | DateTime |
| UnitPrice | Price per unit (GBP) | Float |
| CustomerID | Unique customer ID | Int |
- Data Preparation: Handle nulls, clean invalid records, convert types.
π View Python code for Data Preparation
# EXPLORE DATA
print(main_data.info())
print(main_data.describe())
print(main_data.isnull().sum())
# CORRECT DATA TYPES
main_data['CustomerID'] = main_data['CustomerID'].astype('Int64')
print(main_data['CustomerID'].dtype)
main_data['InvoiceDate'] = pd.to_datetime(main_data['InvoiceDate'], errors='coerce')
# HANDLE NULLS
main_data = main_data[main_data['CustomerID'].notnull()]
# FILTER DATA
main_data = main_data[(main_data['Quantity'] > 0) & (main_data['UnitPrice'] > 0)]
main_data = main_data[
main_data['StockCode'].str.match(r'^[A-Za-z0-9]{4,7}$') &
main_data['Description'].str.contains(
'POSTAGE|Bank Charges|CARRIAGE|Next Day Carriage|Manual',
case=False, na=False
)
]- Feature Engineering: Create
TotalPrice = Quantity Γ UnitPrice.
main_data['TotalPrice'] = main_data['Quantity'] * main_data['UnitPrice']- RFM Calculation: Compute Recency, Frequency, Monetary per customer.
π View Python code for RFM Calculation
# Reference date for recency
snapshot_date = pd.to_datetime('2011-12-31')
# Recency (days since last purchase)
recency = main_data.groupby('CustomerID')['InvoiceDate'].max().reset_index()
recency['Recency'] = (snapshot_date - recency['InvoiceDate']).dt.days
# Frequency (unique invoices)
frequency = main_data.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
frequency.columns = ['CustomerID', 'Frequency']
# Monetary (total spend)
monetary = main_data.groupby('CustomerID')['TotalPrice'].sum().reset_index()
monetary.columns = ['CustomerID', 'Monetary']
# Combine
rfm = recency[['CustomerID', 'Recency']].merge(frequency, on='CustomerID').merge(monetary, on='CustomerID')
print(rfm.head())- Scoring: Assign quintiles (1β5) for R, F, M and build RFM Score.
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5,4,3,2,1]) # lower recency β higher score
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1,2,3,4,5])
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)- Segmentation: Map scores to segments (Champions, Loyal, At Risk, etc.).
π View Python code for RFM Segmentation
segment_map = {
'Champions': ['555','554','544','545','454','455','445'],
'Loyal': ['543','444','435','355','354','345','344','335'],
'Potential Loyalist': ['553','551','552','541','542','533','532','531','452','451','442','441','431','453','433','432','423','353','352','351','342','341','333','323'],
'New Customers': ['512','511','422','421','412','411','311'],
'Promising': ['525','524','523','522','521','515','514','513','425','424','413','414','415','315','314','313'],
'Need Attention': ['535','534','443','434','343','334','325','324'],
'About To Sleep': ['331','321','312','221','213','231','241','251'],
'At Risk': ['255','254','245','244','253','252','243','242','235','234','225','224','153','152','145','143','142','135','134','133','125','124'],
'Cannot Lose Them': ['155','154','144','214','215','115','114','113'],
'Hibernating customers': ['332','322','233','232','223','222','132','123','122','212','211'],
}
def assign_segment(rfm_score):
for segment, scores in segment_map.items():
if rfm_score in scores:
return segment
return 'Lost customers'
rfm['Segment'] = rfm['RFM_Score'].apply(assign_segment)
print(rfm.head())- Visualization: Histograms, treemap, bar charts, boxplots.
π View Python code for Visualization
# Histograms
fig, ax = plt.subplots(figsize=(12, 3)); sns.histplot(rfm['Recency'], kde=True); ax.set_title('Distribution of Recency'); plt.show()
fig, ax = plt.subplots(figsize=(12, 3)); sns.histplot(rfm['Frequency'], kde=True); ax.set_title('Distribution of Frequency'); plt.show()
fig, ax = plt.subplots(figsize=(12, 3)); sns.histplot(rfm['Monetary'], kde=True); ax.set_title('Distribution of Monetary'); plt.show()
# Treemap of segment share
temp_rfm = rfm.groupby('Segment')['CustomerID'].count().reset_index()
temp_rfm.columns = ['Segment','Cust_count']
temp_rfm['Count_share'] = temp_rfm['Cust_count'] / temp_rfm['Cust_count'].sum()
import squarify
colors = ['#FF0000','#00FFFF','#FFFFAA','#A52A2A','#800080','#00FF00','#808000','#FF0CB3','#FFA500','#FF00FF','#736F6E']
fig, ax = plt.subplots(1, figsize=(15,8))
squarify.plot(
sizes=temp_rfm['Cust_count'],
label=temp_rfm['Segment'],
value=[f'{x*100:.2f}%' for x in temp_rfm['Count_share']],
alpha=.8, color=colors,
bar_kwargs=dict(linewidth=1.5, edgecolor="White")
)
plt.title('RFM Segments of Customer Count', fontsize=16); plt.axis('off'); plt.show()
# Countplot by segment
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(data=rfm, x='Segment', order=rfm['Segment'].value_counts().index, palette='viridis')
plt.title('Number of Customers in Each Segment'); plt.xticks(rotation=45); plt.xlabel('Segment'); plt.ylabel('Number of Customers'); plt.show()
# Boxplots R/F/M by segment
fig, axes = plt.subplots(3, 1, figsize=(15, 18))
sns.boxplot(x='Segment', y='Recency', data=rfm.sort_values('Segment'), palette='viridis', ax=axes[0]); axes[0].set_title('Recency Distribution by Customer Segment'); axes[0].set_xlabel('Customer Segment'); axes[0].set_ylabel('Recency (Days)'); axes[0].tick_params(axis='x', rotation=45)
sns.boxplot(x='Segment', y='Frequency', data=rfm.sort_values('Segment'), palette='viridis', ax=axes[1]); axes[1].set_title('Frequency Distribution by Customer Segment'); axes[1].set_xlabel('Customer Segment'); axes[1].set_ylabel('Frequency (Number of Orders)'); axes[1].tick_params(axis='x', rotation=45)
sns.boxplot(x='Segment', y='Monetary', data=rfm.sort_values('Segment'), palette='viridis', ax=axes[2]); axes[2].set_title('Monetary Distribution by Customer Segment'); axes[2].set_xlabel('Customer Segment'); axes[2].set_ylabel('Monetary (Total Price)'); axes[2].tick_params(axis='x', rotation=45)
plt.tight_layout(); plt.show()Distribution of Recency
- Most customers made a purchase within the last 120 days (~4 months), and the number of customers steadily decreases as the recency period increases.
- This is a positive signal indicating that the majority of customers are still active and engaged.
- π‘Recommendation: Focus customer appreciation campaigns on those with low Recency to strengthen loyalty and encourage repeat purchases.
Distribution of Frequency
- During the dataset period, purchase frequency is concentrated at 0β5 times, with very few customers buying more than 5 times and only a handful exceeding 10 times.
- This indicates that most customers are low-frequency buyers, and repeat purchases are relatively rare.
- π‘ Recommendation: Encourage customers to increase purchase frequency through loyalty programs such as points accumulation, free shipping for low-value orders, or vouchers for subsequent purchases.
Distribution of Monetary
-
Most customers have a Customer Lifetime Value (CLV) below $1000, with the majority concentrated under $500.
-
Beyond $1000, the number of customers drops sharply, and very few exceed higher spending thresholds.
-
π‘ Recommendations:
- $1β$500: Build shopping habits through discount vouchers for next orders or welcome combos for new customers.
- $500β$1000: Encourage higher spending with tiered loyalty levels (Silver, Gold, Platinum) and offer bonus points with expiration dates to drive usage.
- >$1000: Promote premium products with exclusive perks and focus on retaining these customers, nurturing them toward VIP level >$4000.
- >$4000 (VIPs): Provide VIP/black cards, exclusive personal benefits (dedicated account manager, personalized gifts on birthdays/holidays).
- One $8000 CLV customer: Requires special care and attention with personalized services and unique rewards.
Customer Segments Analysis (Recency β Frequency β Monetary)
| Segment | Current Status (R/F/M) | π‘ Recommendation |
|---|---|---|
| Champions | Recency <50 days, high Frequency & Monetary (some $1000β3000+) | VIP care: loyalty cards, exclusive offers, or personal services. |
| Loyal | Recency ~100 days, medium Frequency & Monetary | Maintain benefits, nurture to increase CLV. |
| Potential Loyalists | Recency ~50β100 days, medium Frequency & Monetary, some with high intent | Reward consistent purchases with points, aim to convert to Loyal. |
| Promising | Recency ~50β100 days, moderate Frequency & Monetary | Upselling with combos and loyalty points to boost CLV. |
| New Customers | Recency ~50β100 days, first-time buyers, small spend | Onboarding guides, bundles, discount voucher for next purchase. |
| Need Attention | Recency ~100 days, lower engagement than Loyal | Special offers and programs to convert them into Loyal customers. |
| About To Sleep | Avg. Recency ~100 days, very low Frequency & Monetary | Send reminder emails with discounts to reactivate purchases. |
| At Risk | Recency ~150β250 days, few transactions, low spend | Free gifts with repurchase within 30 days; recommend products from past history. |
| Hibernating | Recency ~100β350 days, declining Frequency & Monetary | Reactivation campaigns with strong vouchers and personalized suggestions. |
| Cannot Lose Them | Recency ~200β300 days, low Frequency & low Monetary | Provide vouchers/free gifts for purchases within 30 days to win them back. |
| Lost Customers | Recency ~250β350 days, very low Frequency & Monetary | Final attempt: 25β50% discount or free products to bring them back. |
| Segment Insights | % of Customers | Approx. Count | π‘ Strategic Recommendation |
|---|---|---|---|
| Champions β Strong loyalty, high frequency & spend | 19.7% | ~65 | π’ Priority 1: Maintain & grow with VIP care, exclusive offers, loyalty programs |
| Hibernating β Once active, now disengaged | 11.0% | ~36 | π‘ Priority 2: Reactivation campaigns with 20β30% vouchers, personalized offers |
| Lost Customers β Long inactivity, low frequency & spend | 9.3% | ~31 | π‘ Priority 2: Attempt win-back with strong discounts or free product trials |
| Potential Loyalists β Decent spend, medium frequency, growth opportunity | 11.4% | ~38 | π‘ Priority 3: Encourage repeat purchases with points & loyalty tiers |
| Promising β Recently active, moderate spend | 5.1% | ~17 | π‘ Priority 3: Upselling with bundles, reward points to increase CLV |
| About To Sleep β Avg. recency ~100 days, very low activity | 10.5% | ~35 | π΄ Priority 4: Send reminders & discounts to prevent churn |
| At Risk β High recency (150β250 days), low activity | 10.1% | ~34 | π΄ Priority 4: Free gifts or special offers within 30 days to reactivate |
| New Customers β First-time buyers, low spend | 5.1% | ~17 | π‘ Priority 5: Onboarding guides, discounts to build habits |
| Loyal β Stable customers, medium frequency & spend | 8.1% | ~27 | π’ Ongoing: Maintain benefits, upsell premium products |
| Cannot Lose Them β Very few but critical customers | 3.9% | ~13 | π΄ Immediate care: Exclusive vouchers/gifts to avoid churn |
- π’ Maintain & expand Champions
- π‘ Reactivate Hibernating & Lost Customers
- π‘ Develop Potential Loyalists & Promising
- π΄ Prevent churn in About To Sleep & At Risk
- π‘ Support New Customers to form purchase habits
- Recency: Most customers purchased recently (<120 days), but ~30% fall into At Risk, About To Sleep, Hibernating, Lost.
- Frequency: Majority purchase β€5 times; very few are frequent buyers.
- Monetary: CLV mostly < Β£500, only a handful > Β£1000.
- Segments: Champions (19.7%) and Loyal customers are strong, but churn-prone groups are sizable.
- Retain Champions: Exclusive rewards and personal care.
- Reactivation Campaigns: Discounts for Hibernating & Lost.
- Grow Potential Loyalists: Incentives for repeat purchases.
- Address At Risk: Reminder emails, promotions, gifts.
- Nurture New Customers: Onboarding and vouchers for retention.