High-Performance Classification Model for Large-Scale Geospatial Data (4M+ Records)
Optimized for Speed, Accuracy, and Low Memory Footprint.
This repository hosts the official solution by Team Adyatama Fighters for the Datavers Competition. Our mission was to classify ride-hailing trips into specific categories (e.g., normal, fraud, specific services) using a massive dataset of over 4 million records.
We engineered a robust pipeline focusing on Advanced Geospatial Analysis and Memory Optimization, enabling us to train a highly accurate LightGBM model on limited resources while adhering to strict submission size constraints (<150 MB).
| Name | Student ID | Major | Institution |
|---|---|---|---|
| Fabio Banyu Cyto | 123450104 | Data Science | Sumatra Institute of Technology |
| Labo John Noel Napitupulu | 123450037 | Data Science | Sumatra Institute of Technology |
Due to file size constraints, the Model and Submission files are hosted in the Releases section.
| File Name | Description | Size | Link |
|---|---|---|---|
| 🤖 Model Final V6 | `model_final_v6.joblib` (Pruned LightGBM + LZMA Compression) | 109 MB | Download from Releases |
| 📓 Source Notebook | `notebook-datavers...ipynb` (Training & Inference Code) | 16 KB | Download from Releases |
| 📄 Submission CSV | `submission_final_v6.csv` (Prediction Result) | 110 MB | Download from Releases |
⚠️ Note for Evaluation: The model file is compressed using LZMA. Please load it using Python's `joblib`: run `import joblib`, then `model = joblib.load('model_final_v6.joblib')`.
Handling 4 million rows required aggressive memory management strategies:
- Automated Downcasting: Implemented a `reduce_mem_usage` script to convert data types (e.g., `float64` → `float32`), reducing RAM usage by more than 60%.
- Garbage Collection: Strategic manual GC triggers to prevent OOM (Out-Of-Memory) errors.
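A minimal sketch of the downcasting idea, assuming pandas and NumPy are available. This is not the notebook's actual `reduce_mem_usage` implementation; the column names and sample values are hypothetical, and the function relies on `pd.to_numeric(..., downcast=...)`, which only shrinks a column when the values survive the conversion within pandas' tolerance.

```python
import gc

import numpy as np
import pandas as pd


def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast where pandas allows it."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 (pandas never goes below float32 here)
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_integer_dtype(out[col]):
            # int64 -> int8 / int16 / int32, depending on the value range
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Hypothetical sample frame; the real dataset's schema is not shown here.
df = pd.DataFrame({
    "pickup_lat": np.linspace(-6.30, -6.10, 1000),
    "pickup_lon": np.linspace(106.70, 106.90, 1000),
    "hour": np.arange(1000) % 24,
})

before = df.memory_usage(deep=True).sum()
small = reduce_mem_usage(df)
after = small.memory_usage(deep=True).sum()

# Manual GC trigger: drop the wide frame, then ask Python to reclaim memory.
del df
gc.collect()

print(f"{before} bytes -> {after} bytes")
```

One trade-off to keep in mind: `float32` carries about 7 significant digits, so a longitude near 106° is stored with a resolution of roughly a meter. That is usually acceptable for trip-level features, but worth checking before downcasting coordinates.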
We transformed raw GPS data into rich behavioral signals:
- Distance Metrics: Calculated Haversine (Air distance), Manhattan (City block distance), and Euclidean distances.
- Road Tortuosity: Ratio of Manhattan to Haversine distance to detect non-linear routes (anomalies).
- Spatial Binning (Grid System): Rounding coordinates to 2 decimal places to group nearby zones.
- Rotated Coordinates: Applied 45° rotation to help Decision Trees split diagonal boundaries effectively.
- Cyclical Encoding: Sine/Cosine transformation for hours to preserve time continuity.
- Rush Hour Flag: Boolean markers for peak traffic windows (07:00-09:00 and 16:00-19:00).
- Price Per KM: Detecting price anomalies relative to distance traveled.
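The features listed above can be sketched for a single trip using only the standard library. This is an illustration under stated assumptions, not the notebook's code: the function names, the argument names, the choice of which latitude the east-west Manhattan leg follows, the half-open rush-hour windows, and the small epsilon guarding divisions are all hypothetical choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("air") distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """City-block distance: a north-south leg followed by an east-west leg."""
    return (haversine_km(lat1, lon1, lat2, lon1)
            + haversine_km(lat2, lon1, lat2, lon2))


def trip_features(lat1, lon1, lat2, lon2, hour, fare):
    """Build the geospatial, temporal, and price features for one trip."""
    hav = haversine_km(lat1, lon1, lat2, lon2)
    man = manhattan_km(lat1, lon1, lat2, lon2)
    eps = 1e-6  # assumption: avoids division by zero on zero-length trips
    angle = math.radians(45)
    return {
        "haversine_km": hav,
        "manhattan_km": man,
        # Euclidean distance on raw degrees (assumption: no projection applied)
        "euclidean_deg": math.hypot(lat2 - lat1, lon2 - lon1),
        # Close to 1 for axis-aligned trips, larger for diagonal ones
        "tortuosity": man / (hav + eps),
        # Rounding to 2 decimals groups points into cells of about 1.1 km
        "pickup_grid": (round(lat1, 2), round(lon1, 2)),
        # 45-degree rotation of the pickup point
        "rot_x": lon1 * math.cos(angle) + lat1 * math.sin(angle),
        "rot_y": -lon1 * math.sin(angle) + lat1 * math.cos(angle),
        # Cyclical hour encoding: 23:00 and 00:00 end up close together
        "hour_sin": math.sin(2 * math.pi * hour / 24),
        "hour_cos": math.cos(2 * math.pi * hour / 24),
        # Assumption: windows are half-open, [7, 9) and [16, 19)
        "is_rush_hour": (7 <= hour < 9) or (16 <= hour < 19),
        "price_per_km": fare / (hav + eps),
    }


# Hypothetical trip: coordinates, hour, and fare are made up for illustration.
example = trip_features(-6.20, 106.80, -6.25, 106.85, hour=8, fare=50000.0)
```

On a 4M-row table the same formulas would normally be applied column-wise with vectorized array operations rather than row by row; the scalar version is shown only because it is easier to read.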
We chose LightGBM for its superior training speed and low memory usage.
| Component | Configuration |
|---|---|
| Objective | Multiclass Classification |
| Metric | Macro F1-Score |
| Trees (Training) | 3500 Estimators |
| Leaves | 128 (High capacity for complex patterns) |
| Max Bin | 255 (High Fidelity for Accuracy) |
| Strategy | Stratified Split (90:10) with Early Stopping |
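To make the choice of metric concrete, here is a small from-scratch sketch of the macro F1-score. It is not the evaluation code used in the notebook, and the label values are illustrative. Macro averaging gives every class the same weight, so a model that ignores a rare class is penalized even when its accuracy looks high.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# A classifier that always predicts the majority class:
y_true = ["normal"] * 8 + ["fraud"] * 2
y_pred = ["normal"] * 10
score = macro_f1(y_true, y_pred)  # accuracy is 0.8, macro F1 is much lower
```

In this toy case the "normal" class scores 16/18 and the "fraud" class scores 0, so the macro average is 4/9, well below the 80% accuracy.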
One of the unique constraints of this competition was the <150 MB Submission Limit. A standard LightGBM model with 3500 trees exceeds 200 MB. We implemented a custom compression pipeline to meet this requirement without sacrificing accuracy.
🚀 Our Solution:
- Tree Pruning: We analyzed tree importance and pruned the model from 3500 to 2500 trees (removing only the least significant boosters).
- LZMA Compression: We used `joblib` with LZMA level 9 compression to pack the model as densely as possible.

Result: Reduced model size from ~160 MB to 109 MB (32% reduction) while maintaining validation performance.
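The compression step can be illustrated with the standard library alone. The sketch below pickles a stand-in object and compresses it with `lzma` at preset 9; the `fake_model` structure is hypothetical and much smaller than a real LightGBM model, and the pruning step is not reproduced because its exact criteria are not described here. With `joblib`, the equivalent call would be `joblib.dump(model, path, compress=("lzma", 9))`, and `joblib.load` detects the compression on its own.

```python
import lzma
import pickle

# Stand-in for a trained model: a large, repetitive structure (hypothetical).
fake_model = {
    "trees": [
        {"split_feature": i % 20, "threshold": 0.5, "leaf_value": 0.01}
        for i in range(50_000)
    ]
}

# Serialize first, then compress the byte string.
raw = pickle.dumps(fake_model, protocol=pickle.HIGHEST_PROTOCOL)
packed = lzma.compress(raw, preset=9)  # preset 9 = strongest standard preset

# Loading reverses the two steps.
restored = pickle.loads(lzma.decompress(packed))

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

Higher presets trade longer compression time and more memory for a smaller file. Loading is slower than for an uncompressed file because the whole model has to be decompressed first.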
The model demonstrated consistent performance across validation sets:
| Metric | Score |
|---|---|
| Validation Macro F1-Score | 0.59613 |
| Final Model Size | 109 MB |
├── input/ # Raw dataset (Git ignored)
├── notebooks/ # Jupyter Notebooks
│ └── notebook-datavers-adyatama.ipynb
├── output/ # Models & Submissions (See Releases)
│ ├── model_final_v6.joblib
│ └── submission_final_v6.csv
├── requirements.txt # Dependencies
└── README.md # Documentation