This notebook contains the Descriptive Analysis portion of my Data Science internship, focused on understanding the structure and distribution of both numerical and categorical variables in a restaurant dataset.
- Compute statistical summaries for key numerical columns
- Explore and interpret the distribution of categorical features like
Country Code,City, andCuisines - Identify the most frequent cuisines and cities in the dataset
Selected relevant numerical features:
Average Cost for twoPrice rangeAggregate ratingVotes
For these, calculated:
- Mean, Median, Standard Deviation
- Used
.describe()to generate percentiles, min/max, and spread
Key Findings:
- Cost and Votes are highly skewed (long-tail distribution)
- Most ratings lie between 2.5 and 4.0
- Median cost is ₹400; average is skewed due to some high-end entries
- 15 unique country codes
- Code
1(India) appeared most frequently (8652 times) - Created a mapping to actual country names using
.map()
- 141 unique cities
- New Delhi had the highest number of restaurants (5473)
- NCR cities dominate the dataset
- 1825 unique cuisine combinations
- North Indian is the most common standalone cuisine (936 entries)
- Long-tail distribution with many unique/rare combinations
- Python
- Pandas
- Jupyter Notebook