Exploratory data analysis and interactive business dashboard for an e-commerce platform dataset.
data_analysis/
├── dashboard.py # Streamlit interactive dashboard
├── EDA_Refactored.ipynb # Static analysis notebook
├── business_metrics.py # Metric calculation functions
├── data_loader.py # Data loading, cleaning, and date filtering
├── requirements.txt # Python dependencies
└── ecommerce_data/ # Data directory (CSV files)
├── orders_dataset.csv
├── order_items_dataset.csv
├── products_dataset.csv
├── customers_dataset.csv
├── order_reviews_dataset.csv
└── order_payments_dataset.csv
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txtstreamlit run dashboard.pyThe dashboard opens at http://localhost:8501 and includes:
- Header: Title and a global date-range filter (defaults to full year 2023).
- KPI Row: Total Revenue, Monthly Growth, Average Order Value, Total Orders — each with a trend indicator vs the prior period of equal length.
- Revenue Trend: Solid line (current period) vs dashed line (prior period), Y-axis in abbreviated dollar format.
- Top 10 Categories: Horizontal bar chart with blue gradient, values formatted as $300K / $2M.
- Revenue by State: US choropleth map with blue gradient colour scale.
- Satisfaction vs Delivery Time: Average review score per delivery-speed bucket.
- Bottom Row: Average delivery time (with trend) and average review score (with star rating).
Changing the date range filter updates every chart and card simultaneously.
jupyter notebook EDA_Refactored.ipynbOpen EDA_Refactored.ipynb and edit the first configuration cell:
ANALYSIS_YEAR = 2023 # Year to analyse
ANALYSIS_MONTH = None # None = full year; set 1-12 for a single month
COMPARISON_YEAR = 2022 # Year used in the year-over-year comparison chartAll downstream cells read from these variables. No other cell needs to be changed.
| Function | Description |
|---|---|
load_all(clean=True) |
Load all six CSVs; optionally apply standard cleaning |
filter_by_date_range(orders, start_date, end_date) |
Filter orders to an inclusive date range |
filter_by_period(orders, year, month=None) |
Filter orders to a year, and optionally a month |
clean_orders(orders) |
Deduplicate and drop null-timestamp rows |
clean_order_items(order_items) |
Deduplicate and remove negative-price rows |
| Function | Returns | Description |
|---|---|---|
revenue_summary(orders, payments) |
dict | Total revenue, order count, AOV |
monthly_revenue(orders, payments) |
DataFrame | Revenue and order count by month |
yoy_monthly_comparison(...) |
DataFrame | Side-by-side monthly revenue for two years |
payment_type_breakdown(orders, payments) |
DataFrame | Revenue and count by payment method |
order_status_distribution(orders) |
DataFrame | Count and % by order status |
top_categories_by_revenue(items, products, n) |
DataFrame | Top N categories by revenue |
avg_item_value_by_category(items, products, n) |
DataFrame | Average item price per category |
revenue_by_state(orders, customers, payments, n) |
DataFrame | Top N states by revenue |
order_volume_by_state(orders, customers, n) |
DataFrame | Top N states by order count |
delivery_performance(orders) |
DataFrame | Per-order delivery days and on-time flag |
delivery_summary(orders) |
dict | Avg days, median days, on-time rate |
review_score_distribution(reviews) |
DataFrame | Count and % for each score 1-5 |
avg_review_score(reviews) |
float | Mean review score |
Place the six CSV files in ecommerce_data/ before running the notebook.
The expected schema matches the Olist Brazilian E-Commerce dataset
(available on Kaggle: olistbr/brazilian-ecommerce).
CSV files are gitignored to avoid committing large data files.