
Dangmotm/Datathon

Datathon 2026 Round 1: Sales Forecasting & Funnel Diagnostics


End-to-end Datathon project for a Vietnamese fashion e-commerce retailer. The solution combines business diagnostics, time-series feature engineering, and a two-stage LightGBM forecasting pipeline to predict daily Revenue and COGS for the 2023-01-01 to 2024-07-01 competition horizon.

The repository is structured so the full submission can be rebuilt with one command, while the report and EDA artifacts explain the commercial story behind the forecast.

Figure: Executive dashboard overview.

Project highlights

| Area | What this project demonstrates |
| --- | --- |
| Forecasting | Daily Revenue and COGS prediction for an 18-month horizon. |
| Modeling | LightGBM shape models with year-mean normalization and a global scale multiplier. |
| Validation | Expanding-window time-series CV over 2020-2022, with 2022 holdout reporting. |
| Feature engineering | Calendar, Tet windows, mega-sale events, promotions, operational profiles, and safe long-range lags. |
| Explainability | SHAP, gain, and split importances exported per target. |
| Business analysis | Funnel, seasonality, promotion, margin, return, and inventory diagnostics. |
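The expanding-window validation listed in the table can be sketched as below. This is only an illustration under assumptions: the function name `expanding_year_folds`, the 2018 history start, and the choice of one calendar year per validation fold are hypothetical and not taken from the repository.

```python
from datetime import date, timedelta


def expanding_year_folds(first_train_year, val_years):
    """Yield (train_start, train_end, val_start, val_end) date tuples.

    Each fold validates on one calendar year and trains on every day
    before it, so the training window only ever grows forward in time.
    """
    for year in val_years:
        val_start = date(year, 1, 1)
        val_end = date(year, 12, 31)
        train_start = date(first_train_year, 1, 1)
        train_end = val_start - timedelta(days=1)
        yield train_start, train_end, val_start, val_end


# Assumption: history starts in 2018; one fold per validation year 2020-2022.
folds = list(expanding_year_folds(2018, [2020, 2021, 2022]))
for train_start, train_end, val_start, val_end in folds:
    print(f"train {train_start}..{train_end} -> validate {val_start}..{val_end}")
```

Because every fold trains strictly on dates before its validation year, no future information reaches the model during validation; the last fold is the one that would correspond to the 2022 holdout.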

Results snapshot

Out-of-fold performance on the 2022 holdout window:

| Target | MAE | RMSE | R² | MAPE |
| --- | --- | --- | --- | --- |
| Revenue | 569,742 | 777,769 | 0.784 | 21.92% |
| COGS | 524,905 | 710,629 | 0.763 | 22.47% |
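For readers who want to recompute the error metrics from the exported out-of-fold files, here is a minimal sketch using the conventional definitions of MAE, RMSE, and MAPE. The repository's exact formulas are not shown in this README and may differ, for example in how days with zero actuals are handled; the toy numbers are made up for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no zero actuals."""
    ratios = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * ratios / len(actual)


# Toy values, not competition data.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

MAE and RMSE are in the currency units of the target, while MAPE is scale-free, which is why it is the easier figure to compare between Revenue and COGS.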

Key business findings from the EDA report:

  • Sessions grew 63% and customer signups grew 606% from 2018 to 2022, but conversion fell from 0.74% to 0.33%.
  • Revenue peaks in Q2 rather than around Tet, suggesting Tet behaves more like a logistics shock than a sales lift.
  • Promotional order lines show a sharp gross-margin drop: 20.8% margin without promotion versus 1.6% with promotion.
  • Cancellation rate stays flat at 9.2% with or without promotions, so discounting did not measurably improve fulfillment quality.

Forecasting approach

The forecasting system uses a two-stage design.

1. Shape model. Revenue and COGS are divided by their yearly mean before training, so the LightGBM models focus on within-year shape: weekday effects, seasonality, holidays, campaign windows, promotions, and operational profiles. Sample weights follow exp(0.10 * (year - max_year)), so the latest training year has weight 1 and each earlier year is down-weighted by a further factor of exp(-0.10), roughly 0.905.

2. Global multiplier. A scalar correction is fitted from post-2019 out-of-fold residuals. It adjusts the shape-only forecast when the post-recovery growth slope is steeper than the trend learned from earlier years.

The feature set is built only from supplied competition CSVs:

  • Calendar and event flags: day of week, month, cyclical encodings, Tet windows, mega-sale dates, Black Friday, and Vietnamese public holidays.
  • Promotion features derived from promotions.csv.
  • Operational signals from traffic, orders, order items, returns, reviews, signups, payments, and shipments, collapsed to (month, day_of_week) profiles using training rows only.
  • Long-range Revenue and COGS lags of 365, 730, 1095, and 1460 days, plus rolling 365-day means at safe lags.

No raw test-period operational features and no test-period Revenue or COGS values are used as model inputs.
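As an illustration of the cyclical encodings and the training-only (month, day_of_week) profiles described above, here is a hedged sketch. The helper names `add_cyclical` and `attach_profile`, the `sessions` column, and the use of a plain mean as the profile statistic are hypothetical; the actual builders in src/ may be organised differently.

```python
import numpy as np
import pandas as pd


def add_cyclical(df, date_col="date"):
    """Add month / day-of-week keys and their sine-cosine encodings."""
    out = df.copy()
    dow = out[date_col].dt.dayofweek          # Monday=0 ... Sunday=6
    month = out[date_col].dt.month            # 1 ... 12
    out["day_of_week"] = dow
    out["month"] = month
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    out["month_sin"] = np.sin(2 * np.pi * (month - 1) / 12)
    out["month_cos"] = np.cos(2 * np.pi * (month - 1) / 12)
    return out


def attach_profile(train, test, signal_col):
    """Average a signal per (month, day_of_week) on training rows only,
    then join that profile onto both frames."""
    keys = ["month", "day_of_week"]
    profile = (
        train.groupby(keys, as_index=False)[signal_col]
        .mean()
        .rename(columns={signal_col: f"{signal_col}_profile"})
    )
    return (
        train.merge(profile, on=keys, how="left"),
        test.merge(profile, on=keys, how="left"),
    )


train = add_cyclical(pd.DataFrame({"date": pd.date_range("2021-01-01", "2022-12-31")}))
train["sessions"] = 100.0 + 10.0 * train["day_of_week"]   # toy operational signal
test = add_cyclical(pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-03-31")}))
train, test = attach_profile(train, test, "sessions")
print(test[["date", "sessions_profile"]].head())
```

The test frame never carries the raw `sessions` column, only the profile learned from training rows, which mirrors the stated rule that raw test-period operational features are not model inputs.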

Visual diagnostics

Figures: Revenue and gross profit trajectory; Revenue SHAP feature importance.

The full analytical write-up is in report/main.tex. The report figures and all cited numbers can be regenerated from eda/eda_notebook.py.

Repository layout

.
├── main.py                       # one-command pipeline: build -> train -> predict
├── requirements.txt
├── README.md
├── src/
│   ├── config.py                 # paths, constants, LGB parameters, calendar tables
│   ├── features.py               # calendar, event, and promotion feature builders
│   ├── build_master.py           # raw CSVs -> master feature frames
│   ├── model_shape.py            # LightGBM Revenue and COGS shape models
│   ├── model_multiplier.py       # global scale correction
│   └── predict.py                # writes submission CSVs
├── eda/
│   ├── eda_notebook.py           # regenerates report figures and cited values
│   └── figures/                  # charts, dashboards, and exported report values
├── report/
│   ├── main.tex                  # competition report
│   ├── references.bib
│   └── README.md
├── dataset/                      # raw competition CSVs, not committed
└── output/                       # generated model artifacts and submissions

Reproduce the submission

  1. Install dependencies:

    pip install -r requirements.txt
  2. Place the raw competition CSVs into ./dataset/.

    dataset/
    ├── customers.csv
    ├── geography.csv
    ├── inventory.csv
    ├── order_items.csv
    ├── orders.csv
    ├── payments.csv
    ├── products.csv
    ├── promotions.csv
    ├── returns.csv
    ├── reviews.csv
    ├── sales.csv
    ├── sample_submission.csv
    ├── shipments.csv
    └── web_traffic.csv
    
  3. Run the full pipeline:

    python main.py

The pipeline writes these artifacts to ./output/:

| File | Purpose |
| --- | --- |
| submission.csv | Final competition submission with multiplier. |
| submission_with_multiplier.csv | Traceability copy of the final submission. |
| submission_B1b.csv | Shape-only baseline with multiplier disabled. |
| master_full.parquet, master_train.parquet, master_test.parquet | Cached feature frames. |
| b1b_oof_revenue.csv, b1b_oof_cogs.csv | 2022 holdout out-of-fold predictions. |
| b1b_shap_revenue.csv, b1b_shap_cogs.csv | SHAP, gain, and split importances. |
| b1b_lgb_revenue.txt, b1b_lgb_cogs.txt | Saved LightGBM models. |

seed=42 is pinned in LGB_PARAMS, and SHAP sampling uses random_state=0.

Competition compliance

  • No external data. All model features are built from supplied CSVs. Tet and fixed-holiday dates are encoded as calendar facts in src/config.py.
  • No target leakage. Test-period Revenue and COGS are never used as inputs. Raw operational features are retained with a _raw suffix for diagnostics and explicitly dropped before training.
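  • Safe lags. Sales lag features start at 365 days, so for the first 12 months of the 18-month test horizon the lagged value is an observed training-period value; lags of 730 days and longer point at observed history across the whole horizon.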
  • Reproducible run. The submission pipeline is a single command from raw CSVs to final output.
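A quick way to reason about lag safety is to count, for each lag, how many horizon days have a lagged date that falls inside the observed history. The sketch below assumes the last observed target date is 2022-12-31 (the day before the stated horizon starts); the constant and function names are hypothetical.

```python
from datetime import date, timedelta

TRAIN_END = date(2022, 12, 31)      # assumption: last day with observed targets
HORIZON_START = date(2023, 1, 1)    # from the stated competition horizon
HORIZON_END = date(2024, 7, 1)
LAGS = (365, 730, 1095, 1460)


def lag_coverage(lag_days):
    """Return (covered, total): horizon days whose lagged date is on or
    before TRAIN_END, and the total number of horizon days."""
    n_days = (HORIZON_END - HORIZON_START).days + 1
    covered = 0
    for offset in range(n_days):
        target_day = HORIZON_START + timedelta(days=offset)
        if target_day - timedelta(days=lag_days) <= TRAIN_END:
            covered += 1
    return covered, n_days


for lag in LAGS:
    covered, total = lag_coverage(lag)
    print(f"lag {lag}: {covered}/{total} horizon days point at observed history")
```

Under these assumptions the 365-day lag is backed by observed data for 365 of the 548 horizon days, while the 730-, 1095-, and 1460-day lags are backed for all 548. How the pipeline treats the 365-day lag for 2024 dates is therefore worth confirming in the feature-building code.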

Notes for reviewers

The raw competition data is intentionally excluded from version control. To run the project locally, place the CSVs in dataset/ and execute python main.py.
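Before running the pipeline it can help to confirm that every expected input file is in place. The following is a small optional helper, not part of the repository; the function name `missing_csvs` is hypothetical and the file list is copied from the dataset layout shown earlier in this README.

```python
from pathlib import Path

# File names taken from the dataset/ listing in this README.
EXPECTED_CSVS = [
    "customers.csv", "geography.csv", "inventory.csv", "order_items.csv",
    "orders.csv", "payments.csv", "products.csv", "promotions.csv",
    "returns.csv", "reviews.csv", "sales.csv", "sample_submission.csv",
    "shipments.csv", "web_traffic.csv",
]


def missing_csvs(dataset_dir="dataset"):
    """Return the expected CSV names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_CSVS if not (root / name).is_file()]
```

Calling `missing_csvs()` from the repository root returns an empty list when all fourteen files are present; any names it returns should be added to dataset/ before executing python main.py.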

About

Datathon 2026 VinTelligence - Sales forecasting and EDA for Vietnamese fashion e-commerce. LightGBM, SHAP, NeurIPS report.
