
LaboNapitupulu/DataVers-Competition-Lightgbm-Model

 
 


🚖 Ride-Hailing Trip Classification

Datavers Competition 2026 Submission

Python LightGBM Scikit-Learn Status

High-Performance Classification Model for Large-Scale Geospatial Data (4M+ Records)
Optimized for Speed, Accuracy, and Low Memory Footprint.

⬇️ Jump to Download Section


📋 Project Overview

This repository hosts the official solution by Team Adyatama Fighters for the Datavers Competition. Our mission was to classify ride-hailing trips into specific categories (e.g., normal, fraud, specific services) using a massive dataset of over 4 million records.

We engineered a robust pipeline focusing on Advanced Geospatial Analysis and Memory Optimization, enabling us to train a highly accurate LightGBM model on limited resources while adhering to strict submission size constraints (<150 MB).

👥 Team Adyatama Fighters

| Name | Student ID | Major | Institution |
|---|---|---|---|
| Fabio Banyu Cyto | 123450104 | Data Science | Sumatra Institute of Technology |
| Labo John Noel Napitupulu | 123450037 | Data Science | Sumatra Institute of Technology |

📥 Download Resources (Official Submission)

Due to file size constraints, the Model and Submission files are hosted in the Releases section.

| File Name | Description | Size | Link |
|---|---|---|---|
| 🤖 Model Final V6 | model_final_v6.joblib (Pruned LightGBM + LZMA Compression) | 109 MB | Download from Releases |
| 📓 Source Notebook | notebook-datavers...ipynb (Training & Inference Code) | 16 KB | Download from Releases |
| 📄 Submission CSV | submission_final_v6.csv (Prediction Result) | 110 MB | Download from Releases |

⚠️ Note for Evaluation: The model file is compressed using LZMA. Please load it using Python's joblib:

import joblib
model = joblib.load('model_final_v6.joblib')

🛠️ Methodology & Approach

1. 🧹 Preprocessing & Optimization

Handling 4 million rows required aggressive memory management strategies:

  • Automated Downcasting: Implemented a reduce_mem_usage script to convert data types (e.g., float64 → float32), reducing RAM usage by >60%.
  • Garbage Collection: Strategic manual GC triggers to prevent OOM (Out-Of-Memory) errors.
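A minimal sketch of what a downcasting helper like `reduce_mem_usage` could look like, assuming the data lives in a pandas DataFrame. The body is an illustration, not the notebook's actual code, and the column names in the usage note are hypothetical.

```python
import numpy as np
import pandas as pd


def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to smaller dtypes (illustrative sketch)."""
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            # float64 -> float32 halves the memory of the column.
            # float32 keeps about 7 significant digits, which this
            # sketch assumes is enough for the features involved.
            df[col] = df[col].astype(np.float32)
        elif pd.api.types.is_integer_dtype(df[col]):
            # Pick the smallest integer dtype that still holds every value.
            df[col] = pd.to_numeric(df[col], downcast="integer")
    return df
```

Calling `reduce_mem_usage(df)` on a frame with a float64 `pickup_lat` column and a small-valued integer `passenger_count` column would leave them as float32 and int8. Large intermediates can then be released with `del` followed by `gc.collect()`.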

2. 🌍 Advanced Feature Engineering

We transformed raw GPS data into rich behavioral signals:

A. Geospatial Features

  • Distance Metrics: Calculated Haversine (Air distance), Manhattan (City block distance), and Euclidean distances.
  • Road Tortuosity: Ratio of Manhattan to Haversine distance to detect non-linear routes (anomalies).
  • Spatial Binning (Grid System): Rounding coordinates to 2 decimal places to group nearby zones.
  • Rotated Coordinates: Applied 45° rotation to help Decision Trees split diagonal boundaries effectively.
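The geospatial features above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, the Manhattan distance is approximated as a north-south leg plus an east-west leg, the Euclidean distance is taken in raw degrees, and the rotation treats longitude as x and latitude as y.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
INV_SQRT2 = math.sqrt(0.5)   # cos(45 deg) == sin(45 deg)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("air") distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """City-block distance: north-south leg plus east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat2, lon1, lat2, lon2)
    return north_south + east_west


def euclidean_deg(lat1, lon1, lat2, lon2):
    """Straight-line distance in raw coordinate degrees."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


def tortuosity(lat1, lon1, lat2, lon2):
    """Manhattan / Haversine ratio; 1.0 for a zero-length trip (assumption)."""
    straight = haversine_km(lat1, lon1, lat2, lon2)
    if straight == 0:
        return 1.0
    return manhattan_km(lat1, lon1, lat2, lon2) / straight


def grid_cell(lat, lon, decimals=2):
    """Spatial bin: coordinates rounded to `decimals` places."""
    return (round(lat, decimals), round(lon, decimals))


def rotate_45(lat, lon):
    """Rotate (x=lon, y=lat) by 45 degrees counter-clockwise."""
    return (INV_SQRT2 * (lon - lat), INV_SQRT2 * (lon + lat))
```

Computed from endpoints alone, this ratio depends only on the direction of travel: it is 1 for a trip due north or east and approaches √2 for a diagonal one. At two decimal places a grid cell spans roughly 1.1 km in latitude.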

B. Temporal Features

  • Cyclical Encoding: Sine/Cosine transformation for hours to preserve time continuity.
  • Rush Hour Flag: Boolean markers for peak traffic windows (7-9 AM and 4-7 PM).
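As a rough illustration of the two temporal features, the sketch below maps an hour onto the unit circle and flags the peak windows. The function names are hypothetical, and treating the upper bound of each window as exclusive is an assumption.

```python
import math


def encode_hour(hour):
    """Map an hour (0-23) to (sin, cos) so that 23:00 sits next to 00:00."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


def is_rush_hour(hour):
    """True inside the morning (07-09) or evening (16-19) peak window."""
    return 7 <= hour < 9 or 16 <= hour < 19
```

With this encoding the distance between hours 23 and 0 equals the distance between hours 0 and 1, which a raw integer hour cannot express.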

C. Interaction Features

  • Price Per KM: Detecting price anomalies relative to distance traveled.
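A short sketch of the price-per-kilometre feature. The guard against very short trips and its 0.05 km threshold are assumptions added for illustration; the source does not say how zero-distance trips were handled.

```python
import math


def price_per_km(price, distance_km, min_distance_km=0.05):
    """Fare divided by distance; NaN when the trip is too short to trust."""
    if distance_km < min_distance_km:
        # Avoid division by (near) zero. LightGBM treats NaN as a
        # missing value, so no extra imputation is needed downstream.
        return math.nan
    return price / distance_km
```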

3. 🤖 Modeling Strategy (LightGBM)

We chose LightGBM for its superior training speed and low memory usage.

| Component | Configuration |
|---|---|
| Objective | Multiclass Classification |
| Metric | Macro F1-Score |
| Trees (Training) | 3500 Estimators |
| Leaves | 128 (High capacity for complex patterns) |
| Max Bin | 255 (High Fidelity for Accuracy) |
| Strategy | Stratified Split (90:10) with Early Stopping |
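The configuration above could be written down roughly as follows. The key names follow LightGBM's scikit-learn wrapper (`LGBMClassifier`); values that the table does not state, such as the learning rate, the number of classes, and the early-stopping patience, are deliberately left out. Macro F1 is not among LightGBM's built-in multiclass metrics as far as we know, so the sketch also includes a plain-Python version of the metric for reference.

```python
from collections import Counter

# Values taken from the configuration table; key names assume the
# LGBMClassifier interface. Unstated settings are omitted on purpose.
LGBM_PARAMS = {
    "objective": "multiclass",
    "n_estimators": 3500,
    "num_leaves": 128,
    "max_bin": 255,
}


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the mean, rare classes weigh as much as common ones, which is why a stratified split matters for a stable validation score.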

💾 The Engineering Challenge: Model Compression

One of the unique constraints of this competition was the <150 MB Submission Limit. A standard LightGBM model with 3500 trees exceeds 200 MB. We implemented a custom compression pipeline to meet this requirement without sacrificing accuracy.

🚀 Our Solution:

  1. Tree Pruning: We analyzed tree importance and pruned the model from 3500 to 2500 trees (removing only the least significant boosters).
  2. LZMA Compression: We serialized the model with joblib using LZMA compression at level 9 to minimize the file size on disk.

Result: Reduced model size from ~160 MB to 109 MB (32% reduction) while maintaining validation performance.
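The compression step can be illustrated with the standard library alone. The sketch below is a stand-in, not the notebook's code: it pairs `pickle` with `lzma` at preset 9, whereas the project used joblib (where the corresponding call would be `joblib.dump(model, path, compress=("lzma", 9))`). The helper names are hypothetical, and the pruning step is not shown because the source does not describe how trees were selected.

```python
import lzma
import pickle


def dump_lzma(obj, path, preset=9):
    """Pickle `obj` into an LZMA-compressed file (preset 9 = strongest)."""
    with lzma.open(path, "wb", preset=preset) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_lzma(path):
    """Load an object written by dump_lzma."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)


def size_reduction(before_mb, after_mb):
    """Fractional size reduction, e.g. 0.32 for a 32% smaller file."""
    return (before_mb - after_mb) / before_mb
```

For the figures reported here, `size_reduction(160, 109)` gives 0.31875, which rounds to the stated 32%.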


📈 Results

The model demonstrated consistent performance across validation sets:

| Metric | Score |
|---|---|
| Validation Macro F1-Score | 0.59613 |
| Final Model Size | 109 MB |

📂 Repository Structure

├── input/                  # Raw dataset (Git ignored)
├── notebooks/              # Jupyter Notebooks
│   └── notebook-datavers-adyatama.ipynb
├── output/                 # Models & Submissions (See Releases)
│   ├── model_final_v6.joblib
│   └── submission_final_v6.csv
├── requirements.txt        # Dependencies
└── README.md               # Documentation

