This project analyzes Yelp-style coffee shop reviews to understand the influence of top users and their social connections on review counts and ratings.
- Dataset: CoffeeKing (MySQL database created from Yelp dataset subset)
- Tools: SQL, Python (pandas, matplotlib, seaborn), Power BI
- Goal: Determine whether social influence (top reviewers and their friends) affects review counts and ratings more than operational factors such as opening hours.
-
Data Preparation
- Created MySQL database from raw data.
- Indexed key tables for faster joins.
- Cleaned and structured data for analysis.
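As a rough illustration of the indexing step, the sketch below creates an index on a join key and runs a join against it. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server. The table names, column names, index name, and sample rows are all hypothetical and are not taken from the CoffeeKing schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database here (an assumption
# made so the sketch is runnable). All names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (user_id TEXT PRIMARY KEY);
    CREATE TABLE reviews (review_id TEXT PRIMARY KEY,
                          user_id TEXT,
                          business_id TEXT,
                          stars INTEGER);
    -- Index the join key so user-to-review joins can avoid a full scan.
    CREATE INDEX idx_reviews_user_id ON reviews (user_id);
""")

# Made-up rows, only to exercise the join.
conn.executemany("INSERT INTO users VALUES (?)", [("u1",), ("u2",)])
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [("r1", "u1", "b1", 5), ("r2", "u1", "b2", 4), ("r3", "u2", "b1", 3)],
)

# Count reviews per user through the indexed join key.
reviews_per_user = conn.execute("""
    SELECT u.user_id, COUNT(r.review_id) AS n_reviews
    FROM users AS u
    JOIN reviews AS r ON r.user_id = u.user_id
    GROUP BY u.user_id
    ORDER BY u.user_id
""").fetchall()

# List the indexes defined on the reviews table (name is the second column).
index_names = [row[1] for row in conn.execute("PRAGMA index_list('reviews')")]
```

The same `CREATE INDEX ... ON table (column)` statement form is also valid in MySQL, although the actual indexed columns in this project are not listed here.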
-
Exploratory Data Analysis (EDA)
- Review distribution by state.
- Opening hours vs. review counts.
- Word frequency (food, place, service) from text reviews.
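The word-frequency step could look roughly like the sketch below, which tokenizes review text and counts terms such as "food", "place", and "service". The sample reviews, the stopword list, and the function name `word_frequencies` are illustrative assumptions, not the project's actual code or data.

```python
import re
from collections import Counter

# Made-up sample reviews for illustration; not taken from the dataset.
reviews = [
    "Great food and friendly service.",
    "The place is cozy, but service was slow.",
    "Food was average; the place is nice.",
]

# A tiny hand-picked stopword list (an assumption; a real analysis
# would likely use a fuller list).
STOPWORDS = {"the", "and", "is", "was", "but", "a", "an"}

def word_frequencies(texts):
    """Count lowercase word tokens across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

freq = word_frequencies(reviews)

# Pull out the three terms highlighted in the EDA.
keywords = {word: freq[word] for word in ("food", "place", "service")}
```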
-
Deeper Analysis
- Top users & their friends' review contributions (Top 5, 10, 20, 40).
- Correlation between user ratings and friends’ ratings.
- Created a new metric:
  - Influence Ratio (IR): percentage of all reviews written by top users and their friends.
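One possible way to compute the Influence Ratio is sketched below. It assumes that "top users" means the N users with the most reviews and that IR is the share of all reviews written by those users or their friends. Both are assumptions about the definition, and the function name, data layout, and sample data are hypothetical.

```python
from collections import Counter

def influence_ratio(reviews, friends, top_n):
    """Percentage of reviews written by the top_n reviewers or their friends.

    reviews: list of (user_id, business_id) pairs
    friends: dict mapping user_id -> set of friend user_ids
    """
    counts = Counter(user for user, _ in reviews)
    if not counts:
        return 0.0
    # Top users are taken to be those with the most reviews (assumption).
    top_users = {user for user, _ in counts.most_common(top_n)}
    # The influence group is the top users plus all of their friends.
    influencers = set(top_users)
    for user in top_users:
        influencers |= friends.get(user, set())
    influenced = sum(n for user, n in counts.items() if user in influencers)
    return 100.0 * influenced / sum(counts.values())

# Made-up sample: u1 wrote 3 reviews, u2 wrote 2, and u3 to u5 wrote 1 each.
sample_reviews = [
    ("u1", "b1"), ("u1", "b2"), ("u1", "b3"),
    ("u2", "b1"), ("u2", "b2"),
    ("u3", "b1"), ("u4", "b2"), ("u5", "b3"),
]
sample_friends = {"u1": {"u3"}, "u2": {"u4"}}

ir_top1 = influence_ratio(sample_reviews, sample_friends, top_n=1)  # 4 of 8 reviews
ir_top2 = influence_ratio(sample_reviews, sample_friends, top_n=2)  # 7 of 8 reviews
```

Calling the function with `top_n` set to 5, 10, 20, and 40 would mirror the Top 5/10/20/40 comparison described above.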
-
Visualization
- Power BI dashboards for state-level analysis.
- Python scatterplots & distribution charts for correlations.
-
Key Findings
- Opening hours ≠ review count: No significant correlation.
- Top users & friends = strong influence: In some states, the Top 40 users and their friends accounted for 25% or more of reviews.
- Ratings cluster effect: Ratings above 3 stars were more stable across networks; low ratings (1–2 stars) were more scattered.
- Correlation: Pearson (r = 0.34) and Spearman (ρ = 0.32) coefficients show a moderate positive relationship between top users’ stars and their friends’ stars.
-
Next Steps
- Extend the analysis to the top 20% of users (larger sample).
- Explore network graph visualization (social connections).
- Business recommendation: Identify and engage top reviewers to strengthen brand influence.
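For reference, the Pearson and Spearman coefficients reported for top users' stars versus their friends' stars can be computed along the lines of the sketch below. It uses only the standard library and treats Spearman as Pearson applied to average ranks. The function names and sample values are illustrative assumptions and do not reproduce the project's data or its 0.34 / 0.32 results.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values: each top user's star rating and a friends' rating score.
user_stars = [1, 2, 3, 4, 5]
friend_scores = [1, 4, 9, 16, 25]  # monotonic but not linear

r = pearson(user_stars, friend_scores)
rho = spearman(user_stars, friend_scores)
```

In this toy sample the relationship is perfectly monotonic but not linear, so the Spearman value reaches 1 while the Pearson value stays slightly below it. That difference is one reason to report both coefficients.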