🚖 Ride-Hailing Trip Classification

Datavers Competition 2026 Submission

High-Performance Classification Model for Large-Scale Geospatial Data (4M+ Records)
Optimized for Speed, Accuracy, and Low Memory Footprint.

📋 Project Overview

This repository hosts the official solution by Team Adyatama Fighters for the Datavers Competition. Our mission was to classify ride-hailing trips into specific categories (e.g., normal, fraud, specific services) using a massive dataset of over 4 million records.

We engineered a robust pipeline focusing on Advanced Geospatial Analysis and Memory Optimization, enabling us to train a highly accurate LightGBM model on limited resources while adhering to strict submission size constraints (<150 MB).

👥 Team Adyatama Fighters

Name	Student ID	Major	Institution
Fabio Banyu Cyto	123450104	Data Science	Sumatra Institute of Technology
Labo John Noel Napitupulu	123450037	Data Science	Sumatra Institute of Technology

📥 Download Resources (Official Submission)

Due to file size constraints, the Model and Submission files are hosted in the Releases section.

File Name	Description	Size	Link
🤖 Model Final V6	`model_final_v6.joblib` (Pruned LightGBM + LZMA Compression)	109 MB	Download from Releases
📓 Source Notebook	`notebook-datavers...ipynb` (Training & Inference Code)	16 KB	Download from Releases
📄 Submission CSV	`submission_final_v6.csv` (Prediction Result)	110 MB	Download from Releases

⚠️ Note for Evaluation: The model file is compressed using LZMA. Please load it using Python's joblib:

```python
import joblib

model = joblib.load('model_final_v6.joblib')
```

🛠️ Methodology & Approach

1. 🧹 Preprocessing & Optimization

Handling 4 million rows required aggressive memory management strategies:

Automated Downcasting: Implemented a `reduce_mem_usage` helper that converts columns to smaller data types (e.g., float64 → float32), reducing RAM usage by >60%.
Garbage Collection: Triggered manual garbage collection at strategic points to prevent out-of-memory (OOM) errors.
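
The two steps above can be sketched as follows. This is a minimal illustration rather than the notebook's actual helper: it assumes pandas and NumPy, relies on `pd.to_numeric(..., downcast=...)` to pick the smaller dtype, and uses a made-up demo frame whose column names are hypothetical.

```python
import gc

import numpy as np
import pandas as pd


def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast where pandas allows it."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Picks the smallest signed integer type that holds every value.
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float64 -> float32 when the values survive the cast within tolerance.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical demo data; the real dataset has different columns.
raw = pd.DataFrame(
    {
        "trip_id": np.arange(1000, dtype="int64"),
        "fare_ratio": np.linspace(0.0, 1.0, 1000),
    }
)
bytes_before = int(raw.memory_usage(deep=True).sum())
compact = reduce_mem_usage(raw)
bytes_after = int(compact.memory_usage(deep=True).sum())

# Ask the collector to reclaim any cyclic garbage left by intermediate objects.
gc.collect()
```

The actual saving depends on the value ranges in each column, so the >60% figure quoted above should not be expected from every dataset.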

2. 🌍 Advanced Feature Engineering

We transformed raw GPS data into rich behavioral signals:

A. Geospatial Features

Distance Metrics: Calculated Haversine (great-circle), Manhattan (city-block), and Euclidean distances.
Road Tortuosity: Ratio of Manhattan to Haversine distance to detect non-linear routes (anomalies).
Spatial Binning (Grid System): Rounding coordinates to 2 decimal places to group nearby zones.
Rotated Coordinates: Applied a 45° rotation so that decision trees, which split on axis-aligned thresholds, can capture diagonal boundaries more easily.
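
A rough sketch of how these geospatial features could be computed with NumPy is shown below. The function names, the choice of Earth radius, the way the Manhattan legs are measured, and the guard for zero-length trips are all assumptions made for illustration; the notebook may differ.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """City-block distance: a north-south leg followed by an east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat2, lon1, lat2, lon2)
    return north_south + east_west


def euclidean_deg(lat1, lon1, lat2, lon2):
    """Straight-line distance in raw degrees (no projection applied)."""
    return np.sqrt((np.asarray(lat2) - lat1) ** 2 + (np.asarray(lon2) - lon1) ** 2)


def tortuosity(lat1, lon1, lat2, lon2):
    """Manhattan / Haversine ratio; defined as 1.0 for zero-length trips."""
    direct = np.asarray(haversine_km(lat1, lon1, lat2, lon2), dtype=float)
    block = np.asarray(manhattan_km(lat1, lon1, lat2, lon2), dtype=float)
    safe = np.where(direct > 0, direct, 1.0)
    return np.where(direct > 0, block / safe, 1.0)


def grid_cell(lat, lon, decimals=2):
    """Spatial bin obtained by rounding coordinates (0.01 deg is about 1.1 km of latitude)."""
    return np.round(lat, decimals), np.round(lon, decimals)


def rotate_coords(lat, lon, degrees=45.0):
    """Express (lon, lat) in axes rotated by the given angle."""
    theta = np.radians(degrees)
    rot_x = np.cos(theta) * lon + np.sin(theta) * lat
    rot_y = -np.sin(theta) * lon + np.cos(theta) * lat
    return rot_x, rot_y
```

All functions accept scalars or NumPy arrays, so they can be applied to whole coordinate columns at once.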

B. Temporal Features

Cyclical Encoding: Sine/Cosine transformation for hours to preserve time continuity.
Rush Hour Flag: Boolean markers for peak traffic windows (07:00-09:00 and 16:00-19:00).
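
The temporal features could look roughly like this in pandas. The column name `pickup_time` is hypothetical, and the rush-hour windows are treated as half-open intervals (the hour 09 and the hour 19 are excluded), which is an assumption about how the boundaries were handled.

```python
import numpy as np
import pandas as pd


def add_time_features(df: pd.DataFrame, ts_col: str = "pickup_time") -> pd.DataFrame:
    """Add cyclical hour encoding and a rush-hour flag to a copy of df."""
    out = df.copy()
    hour = pd.to_datetime(out[ts_col]).dt.hour
    angle = 2 * np.pi * hour / 24
    # Sine/cosine pair places each hour on a circle, so 23:00 sits next to 00:00.
    out["hour_sin"] = np.sin(angle)
    out["hour_cos"] = np.cos(angle)
    # Assumed windows: hours 7-8 (07:00-08:59) and hours 16-18 (16:00-18:59).
    out["is_rush_hour"] = hour.between(7, 8) | hour.between(16, 18)
    return out
```

With this encoding the distance between 23:00 and 00:00 equals the distance between 00:00 and 01:00, which a raw integer hour column cannot express.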

C. Interaction Features

Price Per KM: Detecting price anomalies relative to distance traveled.
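
As an illustration, the ratio can be computed with a small floor on the distance so that very short trips do not produce infinite values. The floor of 0.1 km and the function name are assumptions, not values taken from the notebook.

```python
import numpy as np


def price_per_km(price, distance_km, min_km=0.1):
    """Price divided by distance, with distance clipped from below at min_km."""
    price = np.asarray(price, dtype=float)
    distance_km = np.asarray(distance_km, dtype=float)
    return price / np.maximum(distance_km, min_km)
```

Trips whose ratio is far above or below the typical value for their service type are the candidates this feature is meant to surface.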

3. 🤖 Modeling Strategy (LightGBM)

We chose LightGBM for its superior training speed and low memory usage.

Component	Configuration
Objective	Multiclass Classification
Metric	Macro F1-Score
Trees (Training)	3500 Estimators
Leaves	128 (High capacity for complex patterns)
Max Bin	255 (High Fidelity for Accuracy)
Strategy	Stratified Split (90:10) with Early Stopping
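
The configuration above might translate into code along these lines. Only the values listed in the table come from this README; the early-stopping patience and the random seed are placeholders, and early stopping here monitors LightGBM's default multiclass log loss rather than Macro F1, which is an assumption. The function requires `lightgbm` and `scikit-learn` to be installed.

```python
# Settings taken from the configuration table above.
PARAMS = {
    "objective": "multiclass",
    "n_estimators": 3500,
    "num_leaves": 128,
    "max_bin": 255,
}
EARLY_STOPPING_ROUNDS = 100  # assumed, not stated in this README
RANDOM_STATE = 42  # assumed, not stated in this README


def train_model(X, y):
    """Fit an LGBMClassifier on a stratified 90:10 split with early stopping."""
    # Imported lazily so the settings above can be inspected without these packages.
    import lightgbm as lgb
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=RANDOM_STATE
    )
    model = lgb.LGBMClassifier(**PARAMS)
    model.fit(
        X_tr,
        y_tr,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(stopping_rounds=EARLY_STOPPING_ROUNDS)],
    )
    # Score the held-out 10% with the competition metric.
    macro_f1 = f1_score(y_val, model.predict(X_val), average="macro")
    return model, macro_f1
```

Stratifying the split keeps the class proportions of the validation fold close to those of the full dataset, which matters for a macro-averaged metric when some classes are rare.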

💾 The Engineering Challenge: Model Compression

One of the unique constraints of this competition was the <150 MB Submission Limit. A standard LightGBM model with 3500 trees exceeds 200 MB. We implemented a custom compression pipeline to meet this requirement without sacrificing accuracy.

🚀 Our Solution:

Tree Pruning: We analyzed tree importance and pruned the model from 3500 to 2500 trees (removing only the least significant boosters).

LZMA Compression: We used joblib with LZMA level 9 compression to pack the serialized model more densely.

Result: Reduced model size from ~160 MB to 109 MB (32% reduction) while maintaining validation performance.
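
The sketch below shows one way the two steps could be wired together. Note that `truncate_booster` is a simpler stand-in for the importance-based pruning described above: it just keeps the first N boosting rounds through LightGBM's `model_to_string(num_iteration=...)`. The helper names and the stand-in payload are hypothetical; `joblib.dump(..., compress=("lzma", 9))` is the part that matches the description directly.

```python
import os
import tempfile

import joblib
import numpy as np


def truncate_booster(booster, keep_iterations=2500):
    """Keep only the first keep_iterations boosting rounds of a LightGBM Booster."""
    import lightgbm as lgb  # lazy import: the rest of the sketch runs without it

    model_str = booster.model_to_string(num_iteration=keep_iterations)
    return lgb.Booster(model_str=model_str)


def dump_lzma(obj, path):
    """Serialize obj with joblib at LZMA level 9 and return the file size in bytes."""
    joblib.dump(obj, path, compress=("lzma", 9))
    return os.path.getsize(path)


def percent_reduction(before_mb, after_mb):
    """Relative size reduction in percent."""
    return 100.0 * (before_mb - after_mb) / before_mb


# Stand-in payload; a fitted model object would be passed the same way.
payload = {"weights": np.zeros(200_000, dtype=np.float32)}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.joblib")
    size_bytes = dump_lzma(payload, demo_path)
    restored = joblib.load(demo_path)  # compression is detected on load
```

Applying `percent_reduction(160, 109)` reproduces the roughly 32% figure quoted in the result line.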

📈 Results

The model demonstrated consistent performance across validation sets:

Metric	Score
Validation Macro F1-Score	0.59613
Final Model Size	109 MB
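
For readers unfamiliar with the reported metric, Macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. The NumPy sketch below is only an illustration of that definition (treating a class with no true or predicted samples as scoring 0 is an assumption); it is not the evaluation code used for the competition.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either array."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); 0.0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```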

📂 Repository Structure

├── input/                  # Raw dataset (Git ignored)
├── notebooks/              # Jupyter Notebooks
│   └── notebook-datavers-adyatama.ipynb
├── output/                 # Models & Submissions (See Releases)
│   ├── model_final_v6.joblib
│   └── submission_final_v6.csv
├── requirements.txt        # Dependencies
└── README.md               # Documentation
