Datathon 2026 Round 1: Sales Forecasting & Funnel Diagnostics

End-to-end Datathon project for a Vietnamese fashion e-commerce retailer. The solution combines business diagnostics, time-series feature engineering, and a two-stage LightGBM forecasting pipeline to predict daily Revenue and COGS for the 2023-01-01 to 2024-07-01 competition horizon.

The repository is structured so the full submission can be rebuilt with one command, while the report and EDA artifacts explain the commercial story behind the forecast.

Project highlights

| Area | What this project demonstrates |
| --- | --- |
| Forecasting | Daily Revenue and COGS prediction for an 18-month horizon. |
| Modeling | LightGBM shape models with year-mean normalization and a global scale multiplier. |
| Validation | Expanding-window time-series CV over 2020-2022, with 2022 holdout reporting. |
| Feature engineering | Calendar, Tet windows, mega-sale events, promotions, operational profiles, and safe long-range lags. |
| Explainability | SHAP, gain, and split importances exported per target. |
| Business analysis | Funnel, seasonality, promotion, margin, return, and inventory diagnostics. |

Results snapshot

Out-of-fold performance on the 2022 holdout window:

| Target | MAE | RMSE | R² | MAPE |
| --- | ---: | ---: | ---: | ---: |
| Revenue | 569,742 | 777,769 | 0.784 | 21.92% |
| COGS | 524,905 | 710,629 | 0.763 | 22.47% |
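
The validation scheme and the four reported metrics can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `expanding_year_folds` and `regression_metrics` are hypothetical, the folds assume each validation year is predicted from all earlier years, and the toy numbers are synthetic rather than competition data.

```python
import numpy as np
import pandas as pd


def expanding_year_folds(dates, val_years):
    """Yield (year, train_mask, val_mask): train on all earlier years, validate on one."""
    years = pd.DatetimeIndex(dates).year
    for year in val_years:
        yield year, np.asarray(years < year), np.asarray(years == year)


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2, and MAPE (in percent) for one target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": 1.0 - ss_res / ss_tot,
        "MAPE": float(np.mean(np.abs(err) / np.abs(y_true)) * 100.0),
    }


# Synthetic daily calendar and toy predictions, for illustration only.
dates = pd.date_range("2018-01-01", "2022-12-31", freq="D")
folds = list(expanding_year_folds(dates, val_years=[2020, 2021, 2022]))
metrics = regression_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```

Each fold's training mask grows by one year, so the 2022 fold trains on everything before 2022-01-01. MAPE is undefined when the true value is zero, so days with zero sales would need separate handling.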

Key business findings from the EDA report:

Sessions grew 63% and customer signups grew 606% from 2018 to 2022, but conversion fell from 0.74% to 0.33%.
Revenue peaks in Q2 rather than around Tet, suggesting Tet behaves more like a logistics shock than a sales lift.
Promotional order lines show a sharp gross-margin drop: 20.8% margin without promotion versus 1.6% with promotion.
Cancellation rate stays flat at 9.2% with or without promotions, so discounting did not measurably improve fulfillment quality.
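
The promotion finding rests on a gross-margin comparison between order lines with and without a promotion. A minimal sketch of that calculation, assuming gross margin is defined as (revenue - COGS) / revenue over summed order lines. The column names, the helper `gross_margin_by_group`, and the toy rows are invented for illustration and do not come from the competition data:

```python
import pandas as pd

# Invented order lines for illustration only; column names are hypothetical.
lines = pd.DataFrame({
    "promo_flag": ["no_promo", "no_promo", "with_promo", "with_promo"],
    "revenue": [100.0, 200.0, 80.0, 120.0],
    "cogs": [80.0, 160.0, 78.0, 118.0],
})


def gross_margin_by_group(df, group_col):
    """Gross margin per group: (sum of revenue - sum of cogs) / sum of revenue."""
    totals = df.groupby(group_col)[["revenue", "cogs"]].sum()
    return (totals["revenue"] - totals["cogs"]) / totals["revenue"]


margin = gross_margin_by_group(lines, "promo_flag")
```

Summing before dividing weights each line by its revenue, which avoids small orders dominating the margin figure.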

Forecasting approach

The forecasting system uses a two-stage design.

1. Shape model. Revenue and COGS are divided by their yearly mean before training, so the LightGBM models focus on within-year shape: weekday effects, seasonality, holidays, campaign windows, promotions, and operational profiles. Recent years receive higher sample weights, exp(0.10 * (year - max_year)): the latest training year has weight 1, and each earlier year is weighted about 9.5% less than the year after it.

2. Global multiplier. A scalar correction is fitted from post-2019 out-of-fold residuals. It adjusts the shape-only forecast when the post-recovery growth slope is steeper than the trend learned from earlier years.
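
The two stages above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in src/model_shape.py or src/model_multiplier.py: the helper names and the `date` column are hypothetical, and the multiplier is shown as a least-squares scalar, which is only one plausible way to fit a scalar correction from out-of-fold predictions.

```python
import numpy as np
import pandas as pd


def year_mean_normalize(df, target):
    """Divide a daily target by the mean of its calendar year (the shape signal)."""
    year_mean = df.groupby(df["date"].dt.year)[target].transform("mean")
    return df[target] / year_mean


def recency_weights(years, decay=0.10):
    """exp(decay * (year - max_year)): 1.0 for the latest year, smaller for older years."""
    years = np.asarray(years, dtype=float)
    return np.exp(decay * (years - years.max()))


def fit_global_multiplier(y_true, y_pred):
    """Scalar k minimising sum((y_true - k * y_pred) ** 2). Assumed form of the fit."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.dot(y_true, y_pred) / np.dot(y_pred, y_pred))


# Toy rows for illustration only.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2022-01-01", "2022-01-02"]),
    "Revenue": [100.0, 300.0, 400.0, 1200.0],
})
toy["shape"] = year_mean_normalize(toy, "Revenue")
toy["weight"] = recency_weights(toy["date"].dt.year)
k = fit_global_multiplier([110.0, 220.0], [100.0, 200.0])
```

In the toy frame both years have the same shape (0.5, 1.5) despite a fourfold difference in level, which is the point of the normalization. A multiplier above 1 means the shape-only forecast was systematically low on the rows used to fit it.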

The feature set is built only from supplied competition CSVs:

Calendar and event flags: day of week, month, cyclical encodings, Tet windows, mega-sale dates, Black Friday, and Vietnamese public holidays.
Promotion features derived from promotions.csv.
Operational signals from traffic, orders, order items, returns, reviews, signups, payments, and shipments, collapsed to (month, day_of_week) profiles using training rows only.
Long-range Revenue and COGS lags of 365, 730, 1095, and 1460 days, plus rolling 365-day means at safe lags.

No raw test-period operational features and no test-period Revenue or COGS values are used as model inputs.
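
A hedged sketch of three of the feature families listed above: cyclical calendar encodings, (month, day_of_week) profiles fitted on training rows only, and date-based long-range lags. The function names, the `sessions` column, and the training cutoff are assumptions made for this example. The real builders live in src/features.py and src/build_master.py and may differ.

```python
import numpy as np
import pandas as pd


def add_calendar_features(df):
    """Add day-of-week, month, and sin/cos encodings of both."""
    out = df.copy()
    out["day_of_week"] = out["date"].dt.dayofweek
    out["month"] = out["date"].dt.month
    out["dow_sin"] = np.sin(2 * np.pi * out["day_of_week"] / 7)
    out["dow_cos"] = np.cos(2 * np.pi * out["day_of_week"] / 7)
    out["month_sin"] = np.sin(2 * np.pi * (out["month"] - 1) / 12)
    out["month_cos"] = np.cos(2 * np.pi * (out["month"] - 1) / 12)
    return out


def add_train_only_profile(df, signal, train_end):
    """Average `signal` per (month, day_of_week) on training rows, then map to all rows.

    Expects the columns created by add_calendar_features.
    """
    train = df[df["date"] <= train_end]
    profile = (
        train.groupby(["month", "day_of_week"])[signal]
        .mean()
        .rename(f"{signal}_profile")
        .reset_index()
    )
    return df.merge(profile, on=["month", "day_of_week"], how="left")


def add_date_lags(df, target, lags=(365, 730, 1095, 1460)):
    """Attach the target value observed exactly `lag` days earlier (NaN if unknown)."""
    out = df.copy()
    by_date = out.set_index("date")[target]
    for lag in lags:
        source_dates = pd.DatetimeIndex(out["date"] - pd.Timedelta(days=lag))
        out[f"{target}_lag_{lag}"] = by_date.reindex(source_dates).to_numpy()
    return out


# Synthetic frame for illustration; test-period values stay NaN because they are unknown.
train_end = pd.Timestamp("2022-12-31")
frame = pd.DataFrame({"date": pd.date_range("2019-01-01", "2023-12-31", freq="D")})
is_train = frame["date"] <= train_end
frame["sessions"] = np.where(is_train, 1000.0 + frame["date"].dt.dayofweek, np.nan)
frame["Revenue"] = np.where(is_train, np.arange(len(frame), dtype=float), np.nan)

frame = add_calendar_features(frame)
frame = add_train_only_profile(frame, "sessions", train_end)
frame = add_date_lags(frame, "Revenue")
```

Because the profile is computed from training rows and then joined by calendar keys, test rows receive a value without any test-period operational data being read. Looking lags up by date rather than by row offset keeps them correct if the frame has gaps.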

Visual diagnostics

The full analytical write-up is in report/main.tex. The report figures and all cited numbers can be regenerated from eda/eda_notebook.py.

Repository layout

.
├── main.py                       # one-command pipeline: build -> train -> predict
├── requirements.txt
├── README.md
├── src/
│   ├── config.py                 # paths, constants, LGB parameters, calendar tables
│   ├── features.py               # calendar, event, and promotion feature builders
│   ├── build_master.py           # raw CSVs -> master feature frames
│   ├── model_shape.py            # LightGBM Revenue and COGS shape models
│   ├── model_multiplier.py       # global scale correction
│   └── predict.py                # writes submission CSVs
├── eda/
│   ├── eda_notebook.py           # regenerates report figures and cited values
│   └── figures/                  # charts, dashboards, and exported report values
├── report/
│   ├── main.tex                  # competition report
│   ├── references.bib
│   └── README.md
├── dataset/                      # raw competition CSVs, not committed
└── output/                       # generated model artifacts and submissions

Reproduce the submission

Install dependencies:
```
pip install -r requirements.txt
```

Place the raw competition CSVs in ./dataset/:

dataset/
├── customers.csv
├── geography.csv
├── inventory.csv
├── order_items.csv
├── orders.csv
├── payments.csv
├── products.csv
├── promotions.csv
├── returns.csv
├── reviews.csv
├── sales.csv
├── sample_submission.csv
├── shipments.csv
└── web_traffic.csv

Run the full pipeline:
```
python main.py
```

The pipeline writes these artifacts to ./output/:

| File | Purpose |
| --- | --- |
| `submission.csv` | Final competition submission with multiplier. |
| `submission_with_multiplier.csv` | Traceability copy of the final submission. |
| `submission_B1b.csv` | Shape-only baseline with multiplier disabled. |
| `master_full.parquet`, `master_train.parquet`, `master_test.parquet` | Cached feature frames. |
| `b1b_oof_revenue.csv`, `b1b_oof_cogs.csv` | 2022 holdout out-of-fold predictions. |
| `b1b_shap_revenue.csv`, `b1b_shap_cogs.csv` | SHAP, gain, and split importances. |
| `b1b_lgb_revenue.txt`, `b1b_lgb_cogs.txt` | Saved LightGBM models. |

For reproducibility, seed=42 is pinned in LGB_PARAMS, and SHAP sampling uses random_state=0.

Competition compliance

No external data. All model features are built from supplied CSVs. Tet and fixed-holiday dates are encoded as calendar facts in src/config.py.
No target leakage. Test-period Revenue and COGS are never used as inputs. Raw operational features are retained with a _raw suffix for diagnostics and explicitly dropped before training.
Safe lags. Sales lag features start at 365 days, which keeps them known at prediction time for the 18-month test horizon.
Reproducible run. The submission pipeline is a single command from raw CSVs to final output.
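
The safe-lag claim can be checked with simple date arithmetic. The sketch below counts, for each lag, how many forecast days have a source date inside the training period. It assumes the last training day is 2022-12-31 (inferred from the horizon start, not stated explicitly), and `lag_coverage` is a hypothetical helper name.

```python
import pandas as pd

TRAIN_END = pd.Timestamp("2022-12-31")  # assumed last training day
HORIZON = pd.date_range("2023-01-01", "2024-07-01", freq="D")
LAGS = (365, 730, 1095, 1460)


def lag_coverage(horizon, train_end, lags):
    """Count horizon dates whose lagged source date falls on or before train_end."""
    return {
        lag: int(((horizon - pd.Timedelta(days=lag)) <= train_end).sum())
        for lag in lags
    }


coverage = lag_coverage(HORIZON, TRAIN_END, LAGS)
for lag, n_known in coverage.items():
    print(f"lag {lag}: {n_known} of {len(HORIZON)} forecast days have a known source value")
```

Under these assumptions the 730-, 1095-, and 1460-day lags are known for all 548 forecast days, while the 365-day lag reaches into the training period only for the 365 target dates in 2023. For 2024 target dates it would point at 2023, so those values would have to stay missing (LightGBM accepts missing inputs) or be filled from a longer lag. Which option the pipeline takes is not specified here.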

Notes for reviewers

The raw competition data is intentionally excluded from version control. To run the project locally, place the CSVs in dataset/ and execute python main.py.
