Benchmarking five class-imbalance strategies — random under-sampling, random over-sampling, SMOTE, ADASYN, and a custom UMCE ensemble — across three classifiers and twelve imbalanced KEEL datasets, with statistical significance testing (two-way ANOVA + Tukey HSD).
Headline result: the custom UMCE method achieves the highest mean balanced accuracy of any strategy tested, both overall and for each of the three classifiers.
On strongly imbalanced binary datasets, which resampling strategy best helps a standard classifier recover minority-class performance? And does a custom under-sampling ensemble (UMCE) beat the well-known baselines?
| Strategy | What it does |
|---|---|
| Random under-sampling | Subsample the majority class down to the minority size. |
| Random over-sampling | Bootstrap the minority class up to the majority size. |
| SMOTE | Synthesise new minority samples by interpolating neighbours. |
| ADASYN | Like SMOTE, but generates more samples in hard-to-learn regions. |
| UMCE (custom) | Under-sampling with a Multiple-Classifier Ensemble: split the majority class into k ≈ (imbalance ratio) balanced subsets, train one classifier per subset together with all minority samples, and combine predictions by majority vote. |
Resampling is applied to the training fold only, never to the test fold.
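The UMCE row in the table can be sketched roughly as follows. This is an illustrative reading of that one-line description, not the code in umce.py: the class name `UMCESketch`, the decision-tree default, the rounding of k, and the rule that tied votes go to the majority class are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


class UMCESketch:
    """Illustrative under-sampling ensemble (hypothetical, not umce.py)."""

    def __init__(self, base_estimator=None, random_state=0):
        self.base_estimator = base_estimator
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        if len(classes) != 2:
            raise ValueError("this sketch handles binary problems only")
        order = np.argsort(counts)
        self.minority_, self.majority_ = classes[order[0]], classes[order[1]]

        min_idx = np.flatnonzero(y == self.minority_)
        maj_idx = np.flatnonzero(y == self.majority_)

        # k is roughly the imbalance ratio (assumed rounding rule).
        k = max(1, int(round(len(maj_idx) / len(min_idx))))
        rng = np.random.default_rng(self.random_state)
        chunks = np.array_split(rng.permutation(maj_idx), k)

        base = self.base_estimator
        if base is None:
            base = DecisionTreeClassifier(random_state=0)

        # One classifier per majority chunk, each seeing all minority samples.
        self.members_ = []
        for chunk in chunks:
            idx = np.concatenate([chunk, min_idx])
            self.members_.append(clone(base).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        X = np.asarray(X)
        votes = np.stack([m.predict(X) == self.minority_ for m in self.members_])
        # Majority vote; ties fall to the majority class (assumed tie rule).
        minority_wins = votes.mean(axis=0) > 0.5
        return np.where(minority_wins, self.minority_, self.majority_)


# Synthetic 9:1 toy data, purely for demonstration.
demo_rng = np.random.default_rng(42)
X_demo = np.vstack([
    demo_rng.normal(0.0, 1.0, size=(90, 2)),   # majority class 0
    demo_rng.normal(6.0, 1.0, size=(10, 2)),   # minority class 1
])
y_demo = np.array([0] * 90 + [1] * 10)

umce_demo = UMCESketch().fit(X_demo, y_demo)
demo_pred = umce_demo.predict(X_demo)
```

With 90 majority and 10 minority samples the sketch builds nine members, each trained on a balanced set of 20 samples.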
Random Forest, Decision Tree, and Gaussian Naive Bayes — each fitted on
StandardScaler-normalised features.
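A minimal sketch of how a single fold might be processed so that resampling touches only the training data. The helper names `random_undersample` and `evaluate_fold` are hypothetical, fitting the scaler on the training fold and scaling before resampling are assumptions, and `StratifiedKFold` on synthetic data stands in for the pre-split KEEL folds.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler


def random_undersample(X, y, rng):
    """Keep all minority samples and an equally sized random majority subset."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


def evaluate_fold(X_train, y_train, X_test, y_test, seed=0):
    """Scale, resample the training fold only, fit, and score on the test fold."""
    rng = np.random.default_rng(seed)

    # Scaling statistics come from the training fold (assumption).
    scaler = StandardScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)

    # Only the training fold is resampled; the test fold keeps its skew.
    X_res, y_res = random_undersample(X_train_s, y_train, rng)

    clf = GaussianNB().fit(X_res, y_res)
    score = balanced_accuracy_score(y_test, clf.predict(X_test_s))
    return score, len(y_res)


# Synthetic 5:1 data, purely for demonstration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (20, 3))])
y = np.array([0] * 100 + [1] * 20)

scores, resampled_sizes = [], []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    score, n_resampled = evaluate_fold(
        X[train_idx], y[train_idx], X[test_idx], y[test_idx]
    )
    scores.append(score)
    resampled_sizes.append(n_resampled)
```

Each training fold here holds 80 majority and 16 minority samples, so the classifier is fitted on 32 balanced samples while the 24-sample test fold is left as it was.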
Twelve binary imbalanced datasets from the KEEL repository, each pre-split into 5 stratified cross-validation folds: 5 Ecoli variants, Glass2, and 6 Yeast variants (protein/cell-localisation and glass-type tasks with strong class skew).
- 5-fold cross-validation; seven metrics recorded per fold: accuracy, balanced accuracy, precision, recall, F1, classification error, AUC-ROC.
- Two-way ANOVA on balanced accuracy with an interaction term: value ~ C(model) + C(method) + C(model):C(method).
- Tukey HSD post-hoc test over the model × method groups (19 of the pairwise differences are significant at α = 0.05).
Mean balanced accuracy across the 12 datasets (5-fold CV):
| Strategy | Overall | Random Forest | Decision Tree | Gaussian NB |
|---|---|---|---|---|
| Random under-sampling | 0.782 | 0.857 | 0.809 | 0.680 |
| Random over-sampling | 0.723 | 0.769 | 0.766 | 0.633 |
| SMOTE | 0.768 | 0.829 | 0.802 | 0.672 |
| ADASYN | 0.715 | 0.766 | 0.742 | 0.636 |
| UMCE (ours) | 0.819 | 0.865 | 0.848 | 0.742 |
UMCE is the strongest strategy overall and for every classifier, and
Random Forest is the strongest base learner. Full per-dataset numbers are in
ranking_results.xlsx; ANOVA/Tukey output in anova_results.xlsx and
tukey_results.csv.
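To see how large the UMCE lead is for each classifier, the table values can be compared directly. The snippet below is only arithmetic on the numbers printed above; the variable names are illustrative.

```python
# Mean balanced accuracy copied from the table (Random Forest, Decision Tree, Gaussian NB).
table = {
    "Random under-sampling": (0.857, 0.809, 0.680),
    "Random over-sampling": (0.769, 0.766, 0.633),
    "SMOTE": (0.829, 0.802, 0.672),
    "ADASYN": (0.766, 0.742, 0.636),
    "UMCE": (0.865, 0.848, 0.742),
}
classifiers = ("Random Forest", "Decision Tree", "Gaussian NB")

margins = {}
for j, name in enumerate(classifiers):
    # Rank strategies for this classifier, best first.
    ranked = sorted(table, key=lambda s: table[s][j], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = round(table[best][j] - table[runner_up][j], 3)
    margins[name] = (best, runner_up, margin)

for name, (best, runner_up, margin) in margins.items():
    print(f"{name}: {best} leads {runner_up} by {margin:.3f}")
```

For every classifier the runner-up is random under-sampling. The lead is narrow for Random Forest (0.008) and widest for Gaussian NB (0.062); whether any individual gap is statistically significant is a question for the Tukey output, not for this arithmetic.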
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
python main.py # all classifiers × strategies over the 12 datasets -> results/*.json
python calc_average.py # average each metric across the 5 folds -> results/average_*.json
python results.py # two-way ANOVA + Tukey HSD -> anova_results.xlsx, tukey_results.csv
python ranking.py # per-dataset, per-method classifier ranking -> ranking_results.xlsx
├── load_data.py # parse the KEEL ARFF folds from data_raw/
├── sampling.py # random under/over-sampling, SMOTE, ADASYN wrappers
├── umce.py # the custom Under-sampling Multiple-Classifier Ensemble
├── models.py # train + evaluate RF / DT / GaussianNB, 7 metrics each
├── main.py # experiment runner -> results/*.json
├── calc_average.py # average folds -> results/average_*.json
├── results.py # two-way ANOVA + Tukey HSD
├── ranking.py # per-dataset method ranking
├── statistic.py # normality / ANOVA helpers
├── handle_pickle.py # small (de)serialisation helper
├── data_raw/ # 12 KEEL datasets, 5 folds each (ARFF)
└── results/ # computed metrics (JSON)
Datasets are from the KEEL imbalanced-classification dataset repository (Alcalá-Fdez et al., KEEL Data-Mining Software Tool). See https://sci2s.ugr.es/keel/imbalanced.php.
