End-to-end Datathon project for a Vietnamese fashion e-commerce retailer. The solution combines business diagnostics, time-series feature engineering, and a two-stage LightGBM forecasting pipeline to predict daily Revenue and COGS for the 2023-01-01 to 2024-07-01 competition horizon.
The repository is structured so the full submission can be rebuilt with one command, while the report and EDA artifacts explain the commercial story behind the forecast.
| Area | What this project demonstrates |
|---|---|
| Forecasting | Daily Revenue and COGS prediction for an 18-month horizon. |
| Modeling | LightGBM shape models with year-mean normalization and a global scale multiplier. |
| Validation | Expanding-window time-series CV over 2020-2022, with 2022 holdout reporting. |
| Feature engineering | Calendar, Tet windows, mega-sale events, promotions, operational profiles, and safe long-range lags. |
| Explainability | SHAP, gain, and split importances exported per target. |
| Business analysis | Funnel, seasonality, promotion, margin, return, and inventory diagnostics. |
Out-of-fold performance on the 2022 holdout window:
| Target | MAE | RMSE | R² | MAPE |
|---|---|---|---|---|
| Revenue | 569,742 | 777,769 | 0.784 | 21.92% |
| COGS | 524,905 | 710,629 | 0.763 | 22.47% |
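The validation scheme and the four metrics in the table can be sketched as follows. This is a minimal illustration, not the project's code: the one-validation-year-per-fold split and the function names `expanding_year_folds` and `regression_metrics` are assumptions, since the README only states that the CV is expanding-window over 2020-2022.

```python
import numpy as np
import pandas as pd


def expanding_year_folds(dates, val_years=(2020, 2021, 2022)):
    """Yield (year, train_idx, val_idx): train on all earlier years, validate on one year.

    Assumption: one calendar year per validation fold.
    """
    years = pd.DatetimeIndex(dates).year
    for year in val_years:
        train_idx = np.flatnonzero(years < year)
        val_idx = np.flatnonzero(years == year)
        yield year, train_idx, val_idx


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2, and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
        # MAPE is undefined when y_true contains zeros.
        "MAPE": float(np.mean(np.abs(err / y_true)) * 100.0),
    }
```

Each fold's training window ends strictly before its validation year begins, so no validation-period rows leak into the fit.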
Key business findings from the EDA report:
- Sessions grew 63% and customer signups grew 606% from 2018 to 2022, but conversion fell from 0.74% to 0.33%.
- Revenue peaks in Q2 rather than around Tet, suggesting Tet behaves more like a logistics shock than a sales lift.
- Promotional order lines show a sharp gross-margin drop: 20.8% margin without promotion versus 1.6% with promotion.
- Cancellation rate stays flat at 9.2% with or without promotions, so discounting did not measurably improve fulfillment quality.
The forecasting system uses a two-stage design.
1. Shape model. Revenue and COGS are divided by their yearly mean before training, so the LightGBM models focus on within-year shape: weekday effects, seasonality, holidays, campaign windows, promotions, and operational profiles. Recent years receive higher sample weights with `exp(0.10 * (year - max_year))`.
2. Global multiplier. A scalar correction is fitted from post-2019 out-of-fold residuals. It adjusts the shape-only forecast when the post-recovery growth slope is steeper than the trend learned from earlier years.
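The two stages can be sketched in a few lines. The helper names (`to_shape`, `recency_weights`, `fit_global_multiplier`) are hypothetical, and the least-squares fit of the scalar is an assumption: the README says only that a scalar correction is fitted from out-of-fold residuals, not how.

```python
import numpy as np
import pandas as pd


def to_shape(series):
    """Divide a daily series (DatetimeIndex) by the mean of its calendar year."""
    year_mean = series.groupby(series.index.year).transform("mean")
    return series / year_mean


def recency_weights(years, decay=0.10):
    """Sample weights exp(decay * (year - max_year)); the latest year gets weight 1."""
    years = np.asarray(years)
    return np.exp(decay * (years - years.max()))


def fit_global_multiplier(actual, oof_pred):
    """Scalar k minimising sum((actual - k * oof_pred) ** 2).

    Assumption: least squares through the origin; the project may fit k differently.
    """
    actual = np.asarray(actual, dtype=float)
    oof_pred = np.asarray(oof_pred, dtype=float)
    return float(np.dot(actual, oof_pred) / np.dot(oof_pred, oof_pred))
```

With a decay of 0.10, a row from two years before the latest training year is weighted by `exp(-0.2)`, roughly 0.82, so older years still contribute but count for less.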
The feature set is built only from supplied competition CSVs:
- Calendar and event flags: day of week, month, cyclical encodings, Tet windows, mega-sale dates, Black Friday, and Vietnamese public holidays.
- Promotion features derived from `promotions.csv`.
- Operational signals from traffic, orders, order items, returns, reviews, signups, payments, and shipments, collapsed to `(month, day_of_week)` profiles using training rows only.
- Long-range Revenue and COGS lags of 365, 730, 1095, and 1460 days, plus rolling 365-day means at safe lags.
No raw test-period operational features and no test-period Revenue or COGS values are used as model inputs.
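As a rough illustration of two of these feature families, the sketch below adds cyclical calendar encodings and collapses an operational signal to a `(month, day_of_week)` profile computed from training rows only. The column names `date` and `sessions` and both function names are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_cyclical(df, date_col="date"):
    """Add month / day_of_week keys plus sine-cosine encodings of each."""
    out = df.copy()
    dow = out[date_col].dt.dayofweek
    month = out[date_col].dt.month
    out["day_of_week"] = dow
    out["month"] = month
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    out["month_sin"] = np.sin(2 * np.pi * (month - 1) / 12)
    out["month_cos"] = np.cos(2 * np.pi * (month - 1) / 12)
    return out


def attach_profile(train, test, signal_col):
    """Average signal_col per (month, day_of_week) on train, then join onto both frames.

    The test frame never needs the raw signal; it only receives the training profile.
    """
    keys = ["month", "day_of_week"]
    profile = (
        train.groupby(keys, as_index=False)[signal_col]
        .mean()
        .rename(columns={signal_col: f"{signal_col}_profile"})
    )
    return (
        train.merge(profile, on=keys, how="left"),
        test.merge(profile, on=keys, how="left"),
    )
```

Because the profile is keyed on calendar fields alone, it is available for every test date without touching test-period operational data.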
The full analytical write-up is in `report/main.tex`. The report figures and all cited numbers can be regenerated from `eda/eda_notebook.py`.
.
├── main.py # one-command pipeline: build -> train -> predict
├── requirements.txt
├── README.md
├── src/
│ ├── config.py # paths, constants, LGB parameters, calendar tables
│ ├── features.py # calendar, event, and promotion feature builders
│ ├── build_master.py # raw CSVs -> master feature frames
│ ├── model_shape.py # LightGBM Revenue and COGS shape models
│ ├── model_multiplier.py # global scale correction
│ └── predict.py # writes submission CSVs
├── eda/
│ ├── eda_notebook.py # regenerates report figures and cited values
│ └── figures/ # charts, dashboards, and exported report values
├── report/
│ ├── main.tex # competition report
│ ├── references.bib
│ └── README.md
├── dataset/ # raw competition CSVs, not committed
└── output/ # generated model artifacts and submissions
1. Install dependencies: `pip install -r requirements.txt`
2. Place the raw competition CSVs into `./dataset/`:

   dataset/
   ├── customers.csv
   ├── geography.csv
   ├── inventory.csv
   ├── order_items.csv
   ├── orders.csv
   ├── payments.csv
   ├── products.csv
   ├── promotions.csv
   ├── returns.csv
   ├── reviews.csv
   ├── sales.csv
   ├── sample_submission.csv
   ├── shipments.csv
   └── web_traffic.csv

3. Run the full pipeline: `python main.py`
The pipeline writes these artifacts to ./output/:
| File | Purpose |
|---|---|
| `submission.csv` | Final competition submission with multiplier. |
| `submission_with_multiplier.csv` | Traceability copy of the final submission. |
| `submission_B1b.csv` | Shape-only baseline with multiplier disabled. |
| `master_full.parquet`, `master_train.parquet`, `master_test.parquet` | Cached feature frames. |
| `b1b_oof_revenue.csv`, `b1b_oof_cogs.csv` | 2022 holdout out-of-fold predictions. |
| `b1b_shap_revenue.csv`, `b1b_shap_cogs.csv` | SHAP, gain, and split importances. |
| `b1b_lgb_revenue.txt`, `b1b_lgb_cogs.txt` | Saved LightGBM models. |
`seed=42` is pinned in `LGB_PARAMS`, and SHAP sampling uses `random_state=0`.
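A minimal sketch of what this pinning might look like. Only `seed=42` and `random_state=0` come from this README; the `objective` entry, the sample size, and the `shap_sample` helper are assumptions added for illustration.

```python
import pandas as pd

# Only the seed is documented; the objective is an assumed placeholder.
LGB_PARAMS = {
    "objective": "regression",
    "seed": 42,
}


def shap_sample(X, n=1000, random_state=0):
    """Draw a repeatable row sample for SHAP (hypothetical helper, assumed size)."""
    return X.sample(n=min(n, len(X)), random_state=random_state)
```

Calling `shap_sample` twice on the same frame returns the same rows, which keeps the exported importances stable across runs.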
- No external data. All model features are built from supplied CSVs. Tet and fixed-holiday dates are encoded as calendar facts in `src/config.py`.
- No target leakage. Test-period Revenue and COGS are never used as inputs. Raw operational features are retained with a `_raw` suffix for diagnostics and explicitly dropped before training.
- Safe lags. Sales lag features start at 365 days, which keeps them known at prediction time for the 18-month test horizon.
- Reproducible run. The submission pipeline is a single command from raw CSVs to final output.
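The leakage and lag guardrails above can be sketched as follows. The function names are hypothetical, the frame is assumed to carry a unique daily `DatetimeIndex`, and rows whose lagged date has no known value are simply left as `NaN` here; how the project treats such rows is not stated in this README.

```python
import pandas as pd

LAG_DAYS = (365, 730, 1095, 1460)


def add_date_lags(df, target, lags=LAG_DAYS):
    """Add target_lag{k} columns holding the value observed k days earlier."""
    out = df.copy()
    for k in lags:
        shifted = df[target].copy()
        # Move each observation k days forward so it lines up with the date it informs.
        shifted.index = shifted.index + pd.Timedelta(days=k)
        out[f"{target}_lag{k}"] = shifted.reindex(out.index)
    return out


def drop_raw_columns(df, suffix="_raw"):
    """Remove diagnostic-only columns before the frame reaches the model."""
    return df.drop(columns=[c for c in df.columns if c.endswith(suffix)])
```

Shifting by calendar days rather than by row count keeps the lag correct even if a date is missing from the frame.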
The raw competition data is intentionally excluded from version control. To run the project locally, place the CSVs in `dataset/` and execute `python main.py`.


