fhitzke/ML-Benchmark-Methodology

Benchmarking Machine Learning Models on Tabular Data: A Statistically Rigorous Evaluation Pipeline

Background

The basis for this evaluation pipeline is the benchmark study "Challenging the Performance-Interpretability Trade-off: An Evaluation of Interpretable Machine Learning Models" by Kruschel et al. The authors provide their code in the associated GitHub repository: https://github.com/NicoHambauer/Model-Performance-vs-Interpretability

Environment Setup

To get started, ensure that Conda is installed on your system and that you have cloned this repository to your local machine. To set up the Conda environment, execute the following command in the root directory of this project:

./setup_environment.sh

Datasets

The code refers to each dataset by a short alias, listed below. Updated versions of the datasets, where available, can be retrieved via the links:

Classification

| Dataset name | Alias | Repository Link |
| --- | --- | --- |
| College | college | https://www.kaggle.com/datasets/saddamazyazy/go-to-college-dataset |
| Water potability | water | https://kaggle.com/adityakadiwal/water-potability |
| Stroke | stroke | https://kaggle.com/fedesoriano/stroke-prediction-dataset |
| Customer churn | telco | https://www.kaggle.com/datasets/blastchar/telco-customer-churn |
| Recidivism | compas | https://www.kaggle.com/datasets/danofer/compass |
| Credit scoring | fico | https://github.com/nyuvis/Fico-Challenge |
| Income adults | adult | https://archive.ics.uci.edu/ml/datasets/adult |
| Bank marketing | bank | https://archive.ics.uci.edu/ml/datasets/Bank+Marketing |
| Airline satisfaction | airline | https://kaggle.com/teejmahal20/airline-passenger-satisfaction |
| Weather forecast | weather | https://www.kaggle.com/datasets/jsphygy/weather-dataset-rattle-package |

Regression

| Dataset name | Alias | Repository Link |
| --- | --- | --- |
| Car price | car | https://archive.ics.uci.edu/ml/datasets/automobile |
| Student grade | student | https://archive.ics.uci.edu/ml/datasets/Student+Performance |
| Productivity | productivity | https://archive.ics.uci.edu/dataset/597/productivity+prediction+of+garment+employees |
| Medical insurance | medical | https://www.kaggle.com/datasets/mirichoi0218/insurance |
| Violent crimes | crimes | https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime |
| Crab farming | crab | https://www.kaggle.com/datasets/sidhus/crab-age-prediction |
| Wine quality | wine | https://archive.ics.uci.edu/ml/datasets/wine+quality |
| Bike rental | bike | https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset |
| House price | housing | https://www.kaggle.com/datasets/neildhere/california-housing |
| Diamond price | diamond | https://www.kaggle.com/datasets/nancyalaswad90/diamonds-prices |
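
The aliases in the tables above can be collected into a small lookup that also records which task each dataset belongs to. This is only a sketch: the dictionary names and the `task_for_alias` helper are hypothetical and are not taken from the repository's code, which may organize the aliases differently.

```python
# Hypothetical registry built from the dataset tables in this README.
# The repository's own data structures may look different.
CLASSIFICATION_ALIASES = {
    "college": "College",
    "water": "Water potability",
    "stroke": "Stroke",
    "telco": "Customer churn",
    "compas": "Recidivism",
    "fico": "Credit scoring",
    "adult": "Income adults",
    "bank": "Bank marketing",
    "airline": "Airline satisfaction",
    "weather": "Weather forecast",
}

REGRESSION_ALIASES = {
    "car": "Car price",
    "student": "Student grade",
    "productivity": "Productivity",
    "medical": "Medical insurance",
    "crimes": "Violent crimes",
    "crab": "Crab farming",
    "wine": "Wine quality",
    "bike": "Bike rental",
    "housing": "House price",
    "diamond": "Diamond price",
}


def task_for_alias(alias: str) -> str:
    """Return 'classification' or 'regression' for a known dataset alias."""
    if alias in CLASSIFICATION_ALIASES:
        return "classification"
    if alias in REGRESSION_ALIASES:
        return "regression"
    raise KeyError(f"Unknown dataset alias: {alias!r}")
```

For example, `task_for_alias("telco")` returns `"classification"`, while an alias that is not listed raises a `KeyError` instead of silently falling back to a default task.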

License

This project is licensed under the MIT license. Every file must contain a REUSE-compliant license and copyright declaration.
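
As an illustration of the declaration requirement, a REUSE-style header consists of an `SPDX-FileCopyrightText` tag and an `SPDX-License-Identifier` tag in a comment at the top of the file. The sketch below checks Python files for both tags. It is a simplification under stated assumptions: the copyright holder in `EXAMPLE_HEADER` is a placeholder, the helper names are hypothetical, and the check ignores other mechanisms that REUSE permits (such as separate `.license` files). The official `reuse lint` command remains the authoritative check.

```python
from pathlib import Path

# Placeholder header; the name, year, and e-mail address are illustrative only.
EXAMPLE_HEADER = (
    "# SPDX-FileCopyrightText: 2024 Jane Doe <jane@example.com>\n"
    "# SPDX-License-Identifier: MIT\n"
)

# Tags this sketch looks for near the top of each file.
REQUIRED_TAGS = ("SPDX-FileCopyrightText:", "SPDX-License-Identifier:")


def missing_reuse_tags(text: str, head_lines: int = 20) -> list[str]:
    """Return the required tags that do not appear in the first lines of text."""
    head = "\n".join(text.splitlines()[:head_lines])
    return [tag for tag in REQUIRED_TAGS if tag not in head]


def files_missing_headers(root: str, pattern: str = "*.py") -> dict[str, list[str]]:
    """Map each matching file under root to the tags it is missing, if any."""
    report = {}
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        missing = missing_reuse_tags(text)
        if missing:
            report[str(path)] = missing
    return report
```

Calling `files_missing_headers(".")` from the project root returns an empty dictionary when every Python file carries both tags.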

About

This repository accompanies my diploma thesis "Benchmarking Machine Learning Models on Tabular Data: A Statistically Rigorous Evaluation Pipeline". In the thesis, I benchmark various ML models and use statistical hypothesis testing to assess the differences in their predictive performance.
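
This README does not say which hypothesis tests the thesis applies. Purely to illustrate the idea of testing a performance difference, the sketch below implements an exact two-sided sign test on paired per-dataset scores of two models. The function name and the choice of the sign test are assumptions made for this example, not the thesis's method.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test on paired scores; tied pairs are dropped."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        # Every pair is tied, so there is no evidence of a difference.
        return 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

If one model scores higher on all ten datasets of a task, the p-value is 2/1024, roughly 0.002. With five wins each, it is 1.0. The sign test uses only the direction of each difference, so it is conservative compared with tests that also use the magnitudes.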
