AutoML System Built From Scratch (Finance-Grade)

Overview

This project is a full end-to-end AutoML (Automatic Machine Learning) system built from scratch, without relying on any AutoML frameworks such as Auto-sklearn, H2O, or TPOT.

The goal of this project is to demonstrate real ML systems engineering, not just model training.
The system automatically handles:

Dataset profiling
Feature engineering
Feature selection
Model selection
Hyperparameter optimization
Cross-validation
Model persistence
Inference
Explainability (SHAP)

The project was tested on real financial time-series data (Ethereum market data), but it is designed to work on any tabular dataset (classification or regression).

This project is suitable for:

Final-year engineering projects
Resume / portfolio projects
Interview deep dives for ML, data science, and quant roles

Key Capabilities

Automatic detection of classification vs regression
Numeric feature engineering:
- log, sqrt, power transformations
Datetime feature engineering:
- hour, day, month, weekday
- cyclical sin/cos encoding
Feature selection:
- Constant feature removal
- Correlation pruning
- Mutual information ranking
Model families supported:
- Logistic / Linear Regression
- Random Forest
- Extra Trees
- XGBoost
- LightGBM
- CatBoost
- Neural Networks (MLP)
Bayesian hyperparameter optimization using Optuna
Cross-validation with stability penalty
Automatic best-model selection
Model + metadata persistence
Production-ready inference pipeline
SHAP-based global and local explainability

Project Architecture

automl/
├── profiler/            # Dataset analysis and problem detection
├── features/            # Feature engineering modules
├── models/              # Model definitions and trainers
├── optimization/        # Hyperparameter optimization logic
├── evaluation/          # Validation and scoring logic
├── pipeline.py          # End-to-end AutoML orchestration
├── config.py            # Global configuration
├── logger.py            # Centralized logging
automl.py                # CLI entrypoint for training
predict.py               # Inference script
explain.py               # SHAP explainability
reports/
├── best_model.pkl       # Final trained model
├── metadata.pkl         # Feature list, problem type, configs

How the System Works (Detailed)

1. Dataset Profiling

Detects problem type (classification or regression)
Identifies numeric, categorical, and datetime features
Computes dataset statistics

2. Feature Engineering

Numeric features are expanded using mathematical transforms
Datetime features are decomposed into meaningful components
Cyclical time features are encoded using sin/cos transforms

3. Feature Selection

Constant features are removed
Highly correlated features are pruned
Mutual information is used to select informative features

4. Model Search

Multiple model families are trained
All models are evaluated using cross-validation

5. Hyperparameter Optimization

Optuna performs Bayesian optimization
Each model has its own search space
Models are penalized for unstable CV performance

6. Final Training

Best model is retrained on full dataset
Model and metadata are saved to disk

7. Inference

New data is passed through the same feature engineering
Features are aligned using metadata
Predictions are generated safely

8. Explainability

SHAP is used for model interpretability
Global feature importance plots
Local (per-row) explanations

Installation

1. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate    # Windows

2. Install dependencies

pip install -r requirements.txt

Usage

Train AutoML

python automl.py --data dataset.csv --target label

Run Inference

python predict.py --model reports/best_model.pkl --meta reports/metadata.pkl --data new_data.csv --output predictions.csv

Explain Model (SHAP)

python explain.py --model reports/best_model.pkl --meta reports/metadata.pkl --data dataset.csv

Why This Project Is Different

No AutoML black-box libraries
Clear separation of training vs inference
Reproducible and auditable ML pipeline
Explainability-first design
Modular and extensible architecture

Tech Stack

Python
Pandas, NumPy
scikit-learn
XGBoost
LightGBM
CatBoost
Optuna
SHAP
Matplotlib

Author

Built by Aarav Mehta
https://github.com/AaravMehta-07 https://www.linkedin.com/in/aarav-mehta-16a183337/ Computer Engineering | AI / ML Systems

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AutoML System Built From Scratch (Finance-Grade)

Overview

Key Capabilities

Project Architecture

How the System Works (Detailed)

1. Dataset Profiling

2. Feature Engineering

3. Feature Selection

4. Model Search

5. Hyperparameter Optimization

6. Final Training

7. Inference

8. Explainability

Installation

1. Create a virtual environment (recommended)

2. Install dependencies

Usage

Train AutoML

Run Inference

Explain Model (SHAP)

Why This Project Is Different

Tech Stack

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
automl		automl
README.md		README.md
automl.py		automl.py
explain.py		explain.py
predict.py		predict.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AutoML System Built From Scratch (Finance-Grade)

Overview

Key Capabilities

Project Architecture

How the System Works (Detailed)

1. Dataset Profiling

2. Feature Engineering

3. Feature Selection

4. Model Search

5. Hyperparameter Optimization

6. Final Training

7. Inference

8. Explainability

Installation

1. Create a virtual environment (recommended)

2. Install dependencies

Usage

Train AutoML

Run Inference

Explain Model (SHAP)

Why This Project Is Different

Tech Stack

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages