Imbalanced Classification Benchmark

A benchmark of five class-imbalance strategies (random under-sampling, random over-sampling, SMOTE, ADASYN, and a custom UMCE ensemble) across three classifiers and twelve imbalanced KEEL datasets, with statistical significance testing (two-way ANOVA + Tukey HSD).

Headline result: the custom UMCE method achieves the highest mean balanced accuracy of any strategy tested, both overall and for each of the three classifiers.

Research question

On strongly imbalanced binary datasets, which resampling strategy best helps a standard classifier recover minority-class performance? And does a custom under-sampling ensemble (UMCE) beat the well-known baselines?

Strategies compared

Strategy	What it does
Random under-sampling	Subsample the majority class down to the minority size.
Random over-sampling	Bootstrap the minority class up to the majority size.
SMOTE	Synthesise new minority samples by interpolating neighbours.
ADASYN	Like SMOTE, but generates more samples in hard-to-learn regions.
UMCE (custom)	Under-sampling with a Multiple-Classifier Ensemble: split the majority class into k ≈ (imbalance ratio) subsets, each roughly the size of the minority class; train one classifier on each subset combined with all minority samples; and combine the k predictions by majority vote.
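
The UMCE row above can be illustrated with a minimal sketch. This is not the code in umce.py: the function names `fit_umce` and `predict_umce`, the 0/1 label encoding, the random chunking of the majority class, and the tie-breaking rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


def fit_umce(X, y, base_estimator, minority_label=1, seed=0):
    """Train one classifier per majority-class chunk plus all minority samples."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    # Shuffle the majority indices so each chunk is a random subset (assumption).
    majority_idx = rng.permutation(np.flatnonzero(y != minority_label))
    # k is roughly the imbalance ratio, so each chunk is about minority-sized.
    k = max(1, round(len(majority_idx) / len(minority_idx)))
    members = []
    for chunk in np.array_split(majority_idx, k):
        idx = np.concatenate([chunk, minority_idx])
        members.append(clone(base_estimator).fit(X[idx], y[idx]))
    return members


def predict_umce(members, X):
    """Combine member predictions by majority vote (labels assumed to be 0/1)."""
    votes = np.stack([m.predict(X) for m in members])  # shape (k, n_samples)
    # Assumption: a tie (possible when k is even) goes to the minority class 1.
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic data with a 9:1 imbalance ratio, for illustration only.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

members = fit_umce(X, y, DecisionTreeClassifier(random_state=0))
preds = predict_umce(members, X)
print(len(members))  # 9 ensemble members for a 9:1 ratio
```

Each member sees a balanced training set of 10 majority and 10 minority samples, and no majority sample is discarded across the ensemble as a whole.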

Resampling is applied to the training fold only, never to the test fold.
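
A small sketch of what "training fold only" means in practice, using random under-sampling as the example. The repository uses the pre-split KEEL folds; `StratifiedKFold` is only a stand-in here, and the helper `random_undersample` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def random_undersample(X, y, minority_label=1, seed=0):
    """Subsample the majority class down to the minority-class size."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    keep = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([keep, minority_idx])
    return X[idx], y[idx]


# Toy data: 90 majority (label 0) and 10 minority (label 1) samples.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

fold_shapes = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Resample the training portion only.
    X_train, y_train = random_undersample(X[train_idx], y[train_idx])
    # The test portion keeps its original class skew.
    X_test, y_test = X[test_idx], y[test_idx]
    fold_shapes.append((np.bincount(y_train).tolist(), np.bincount(y_test).tolist()))

print(fold_shapes[0])  # ([8, 8], [18, 2])
```

The training counts become balanced while the test counts stay at 18 to 2, so the reported metrics reflect the real class distribution.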

Classifiers

Random Forest, Decision Tree, and Gaussian Naive Bayes, each fitted on features standardised with StandardScaler.
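
One way to wire the three classifiers to `StandardScaler` is a scikit-learn pipeline, sketched below. Whether models.py uses a pipeline or scales the features separately is not stated here, so treat the `build_models` helper, the default hyperparameters, and the toy data as assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return the three classifiers, each preceded by a StandardScaler step."""
    return {
        "Random Forest": make_pipeline(
            StandardScaler(), RandomForestClassifier(random_state=seed)
        ),
        "Decision Tree": make_pipeline(
            StandardScaler(), DecisionTreeClassifier(random_state=seed)
        ),
        "Gaussian NB": make_pipeline(StandardScaler(), GaussianNB()),
    }


# Toy data for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (10, 3))])
y = np.array([0] * 50 + [1] * 10)

# Fitting the pipeline learns the scaling statistics from the data passed to
# fit(), so a held-out fold is scaled with training-fold statistics.
fitted = {name: model.fit(X, y) for name, model in build_models().items()}
predictions = {name: model.predict(X) for name, model in fitted.items()}
```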

Data

Twelve binary imbalanced datasets from the KEEL repository, each pre-split into 5 stratified cross-validation folds: 5 Ecoli variants, Glass2, and 6 Yeast variants (protein/cell-localisation and glass-type tasks with strong class skew).

Methodology

5-fold cross-validation; seven metrics recorded per fold: accuracy, balanced accuracy, precision, recall, F1, classification error, AUC-ROC.
Two-way ANOVA on balanced accuracy with an interaction term: value ~ C(model) + C(method) + C(model):C(method).
Tukey HSD post-hoc test over the model × method groups (19 of the pairwise differences are significant at α = 0.05).

Results

Mean balanced accuracy across the 12 datasets (5-fold CV):

Strategy	Overall	Random Forest	Decision Tree	Gaussian NB
Random under-sampling	0.782	0.857	0.809	0.680
Random over-sampling	0.723	0.769	0.766	0.633
SMOTE	0.768	0.829	0.802	0.672
ADASYN	0.715	0.766	0.742	0.636
UMCE (ours)	0.819	0.865	0.848	0.742

UMCE is the strongest strategy overall and for every classifier, and Random Forest is the strongest base learner. Full per-dataset numbers are in ranking_results.xlsx; ANOVA/Tukey output in anova_results.xlsx and tukey_results.csv.

Reproduce

uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

python main.py          # all classifiers × strategies over the 12 datasets -> results/*.json
python calc_average.py  # average each metric across the 5 folds        -> results/average_*.json
python results.py       # two-way ANOVA + Tukey HSD                      -> anova_results.xlsx, tukey_results.csv
python ranking.py       # per-dataset, per-method classifier ranking     -> ranking_results.xlsx

Repository layout

.
├── load_data.py      # parse the KEEL ARFF folds from data_raw/
├── sampling.py       # random under/over-sampling, SMOTE, ADASYN wrappers
├── umce.py           # the custom Under-sampling Multiple-Classifier Ensemble
├── models.py         # train + evaluate RF / DT / GaussianNB, 7 metrics each
├── main.py           # experiment runner -> results/*.json
├── calc_average.py   # average folds    -> results/average_*.json
├── results.py        # two-way ANOVA + Tukey HSD
├── ranking.py        # per-dataset method ranking
├── statistic.py      # normality / ANOVA helpers
├── handle_pickle.py  # small (de)serialisation helper
├── data_raw/         # 12 KEEL datasets, 5 folds each (ARFF)
└── results/          # computed metrics (JSON)

Data attribution

Datasets are from the KEEL imbalanced-classification dataset repository (Alcalá-Fdez et al., KEEL Data-Mining Software Tool). See https://sci2s.ugr.es/keel/imbalanced.php.
