A scalable, end-to-end pipeline for:
- Sampling continuous search spaces
- Evaluating BBOB benchmark functions
- Extracting ELA (Exploratory Landscape Analysis) features
- Studying compression-ratio effects when sampling on random subspaces
- Building large datasets efficiently (parallel + chunked)
```mermaid
flowchart LR
    A[Sample X] --> B["Evaluate f(X) on BBOB"]
    B --> C[Extract ELA Features]
    C --> D[Aggregate Dataset]
    subgraph Advanced ["Sampling on Subspaces Pipeline"]
        E[Sample in low dimension d]
        F[Project to high dimension D]
        G[Evaluate BBOB]
        H[ELA on full + slices]
    end
    E --> F --> G --> H
```
```
.
├── doe_sampling.py                  # Generate X samples
├── y_sampling.py                    # Evaluate BBOB functions
├── ela_sampling.py                  # Extract ELA features
├── sampler.py                       # Alternative IOH-based sampling
│
├── slicing_sampling_test_parallel.py
├── slicing_all_in_sampling_test_parallel.py
│   └── Low-D → High-D sampling + parallel ELA
│
├── parallel_loader.py               # Build final dataset (chunked)
├── parallel_loader_slices.py
├── parallel_loader_slices_all_in.py
│   └── Parallel loading of many CSV files
│
└── data/                            # Outputs (generated)
```
Just run the following line in bash:
```bash
python3 -m pip install -r requirements.txt
```
The following is an example of using any Quasi-Monte-Carlo sampler. Currently, the code allows using halton, sobol, or lhs to generate points, which are then passed to one of the BBOB functions to obtain function evaluations for later assessment.
```bash
python doe_sampling.py \
    --dim 20 \
    --n 1000 \
    --sampler lhs \
    --seed 42 \
    --out samples.csv
```
As output, a folder structure is generated that encodes the corresponding dimension, seed, and number of samples:
```
x_samples/
└── reduction/
    └── Dimension_20/
        └── seed_42/
            └── Samples_1000/
                └── samples.csv
```
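For orientation, here is a minimal sketch of how such samples can be generated with `scipy.stats.qmc`. The sampler names match the CLI options above, but the helper function and the `[-5, 5]` scaling are illustrative assumptions, not the exact `doe_sampling.py` implementation.

```python
# Minimal sketch: generate QMC samples scaled to a BBOB-style domain [-5, 5].
# The helper and the scaling are assumptions; doe_sampling.py may differ.
import numpy as np
from scipy.stats import qmc

def sample_points(sampler_name: str, dim: int, n: int, seed: int) -> np.ndarray:
    samplers = {
        "lhs": qmc.LatinHypercube,
        "sobol": qmc.Sobol,
        "halton": qmc.Halton,
    }
    sampler = samplers[sampler_name](d=dim, seed=seed)
    unit_samples = sampler.random(n)            # points in [0, 1]^dim
    return qmc.scale(unit_samples, -5.0, 5.0)   # rescale to [-5, 5]^dim

X = sample_points("lhs", dim=20, n=1000, seed=42)
print(X.shape)  # (1000, 20)
```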
Run:
```bash
python y_sampling.py
```
In the script, you need to select the folder containing the previously generated samples. If the script runs correctly, the following directories are generated:
```
bbob_evaluations/
└── reduction/
    └── Dimension_20/
        └── seed_42/
            └── Samples_1000/
                └── f_1/
                    └── id_0/
                        └── evaluations.csv
```
The script y_sampling.py is preconfigured to evaluate all 24 BBOB functions and the first 15 instances of each function.
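As a rough sketch of this evaluation step, assuming a recent version of the `ioh` package (which `sampler.py` also relies on); the function id, instance id, and file names are illustrative only:

```python
# Minimal sketch: evaluate sampled points on a BBOB function via the ioh package.
# Function id, instance id and file names are illustrative; y_sampling.py may differ.
import numpy as np
import pandas as pd
import ioh

X = pd.read_csv("samples.csv").to_numpy()   # shape (n, dim)

# BBOB f1 (Sphere), instance 1, in the sampled dimension (recent ioh API)
problem = ioh.get_problem(1, instance=1, dimension=X.shape[1],
                          problem_class=ioh.ProblemClass.BBOB)
y = np.array([problem(x) for x in X])        # IOH problems are callable

pd.DataFrame({"fX": y}).to_csv("evaluations.csv", index=False)
```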
Run:
```bash
python ela_sampling.py
```
In the script, you have to indicate the directories containing the samples and the corresponding evaluations.
Then the output is the following:
```
ela_features/
└── reduction/
    └── Dimension_20/
        └── seed_42/
            └── Samples_1000/
                └── f_1/
                    └── id_0/
                        └── ela_features.csv
```
Each file contains:
`feature_1, feature_2, ..., feature_n`
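A minimal sketch of this feature-extraction step, assuming the `pflacco` package mentioned under Core Concepts below; only two of the feature groups are shown here, and the file names are illustrative:

```python
# Minimal sketch: compute a couple of ELA feature groups with pflacco
# and store them as a single row. ela_sampling.py covers more feature groups.
import pandas as pd
from pflacco.classical_ela_features import calculate_ela_meta, calculate_ela_distribution

X = pd.read_csv("samples.csv")                     # columns x1..xd
y = pd.read_csv("evaluations.csv")["fX"]

features = {}
features.update(calculate_ela_meta(X, y))          # linear/quadratic model fits
features.update(calculate_ela_distribution(X, y))  # skewness, kurtosis, number of peaks

pd.DataFrame([features]).to_csv("ela_features.csv", index=False)
```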
Run:
```bash
python parallel_loader.py
```
This script generates the files:
```
complete_data_generated.csv
complete_data_generated.parquet
```
Final dataset includes:
- dimension
- seed
- n_samples
- function_idx
- instance_idx
- source_file
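A rough sketch of the parallel, chunked aggregation, assuming a recursive glob over the `ela_features/` tree; the actual `parallel_loader.py` additionally derives the metadata columns listed above from the directory structure:

```python
# Minimal sketch: load many ELA CSV files in parallel and write CSV + Parquet outputs.
# Metadata parsing from the directory structure is omitted; parallel_loader.py does more.
import glob
import pandas as pd
from multiprocessing import Pool

def load_one(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df["source_file"] = path             # keep the file origin as a column
    return df

if __name__ == "__main__":
    files = glob.glob("ela_features/**/ela_features.csv", recursive=True)
    with Pool() as pool:
        # chunksize batches paths per worker task to reduce scheduling overhead
        chunks = list(pool.imap_unordered(load_one, files, chunksize=64))
    data = pd.concat(chunks, ignore_index=True)
    data.to_csv("complete_data_generated.csv", index=False)
    data.to_parquet("complete_data_generated.parquet", index=False)
```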
Run:
```bash
python slicing_sampling_test_parallel.py
```
Or:
```bash
python slicing_all_in_sampling_test_parallel.py
```
The distinction between the two is how point density is allocated: in the second script, all points are evaluated in one defined subspace, whereas the first splits the point density across multiple subspaces, or "slices".
```mermaid
flowchart LR
    A[Low-D samples] --> B[Random embedding]
    B --> C[High-D samples]
    C --> D[Evaluate BBOB]
    D --> E["ELA (full dataset)"]
    C --> F[Split into slices]
    F --> G[ELA per slice]
```
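The projection step can be sketched as a random linear embedding from d to D; the Gaussian embedding matrix, the clipping to the BBOB box, and the number of slices below are assumptions and may differ from the slicing scripts:

```python
# Minimal sketch: embed low-dimensional samples into a high-dimensional BBOB space
# and split them into "slices" for per-slice ELA. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(42)
d, D, n = 10, 20, 1000

X_low = rng.uniform(-5.0, 5.0, size=(n, d))    # samples in the low-D subspace
A = rng.standard_normal((D, d)) / np.sqrt(d)   # random embedding matrix (D x d)
X_high = np.clip(X_low @ A.T, -5.0, 5.0)       # high-D points, kept inside the BBOB box

slices = np.array_split(X_high, 4)             # e.g. 4 slices for per-slice ELA
print(X_high.shape, [s.shape for s in slices])
```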
Running either of the aforementioned scripts generates the following structure, which corresponds to the flowchart shown above:
```
sampling_outputs_20D_10D/
└── f1/
    └── iid_0/
        └── group0/
            ├── full.csv
            ├── slice1.csv
            └── slice2.csv
```
| File | Description |
|---|---|
| `full.csv` | ELA features on all samples |
| `slice*.csv` | ELA features per low-D slice |
- X Samples: `x1, x2, ..., xd`
- Function Evaluations: `fX`
- ELA Features: `feature_1, feature_2, ..., feature_n`
The final aggregated dataset includes:
- Features
- Metadata
- File origin
- Core Concepts:
  - Sampling methods: Latin Hypercube (LHS), Sobol, Halton, Monte Carlo
  - ELA features (via pflacco): meta features, distribution features, level sets, Nearest Better Clustering (NBC), dispersion, information content, PCA, fitness-distance correlation
  - Projection strategy: sample in low dimension (d) → embed into high dimension (D) → evaluate in high-D → analyze the landscape structure via ELA
- Performance Features:
  - Parallel processing (multiprocessing)
  - Chunked data loading
  - Memory-safe streaming
  - Parquet output (fast + compressed)
  - Handles millions of CSV files
- Sobol sampling requires n = 2^k (see the snippet after this list)
- ELA level features require enough samples
- File paths encode metadata → do not change structure
- Use Parquet for large datasets
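If `scipy.stats.qmc` is used as the Sobol backend (an assumption), the power-of-two requirement can be enforced directly:

```python
# Draw exactly 2**m Sobol points, preserving the sequence's balance properties.
from scipy.stats import qmc

sobol = qmc.Sobol(d=20, seed=42)
X_unit = sobol.random_base2(m=10)   # 2**10 = 1024 points in [0, 1]^20
```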
You can modify:
- Sampling method (get_sampler)
- Dimensions (D, d)
- Number of groups / slices
- Enabled ELA features
- Optimization landscape analysis
- Meta-learning dataset generation
- Benchmarking optimization algorithms
- Studying dimensionality reduction effects
This repository provides a complete, scalable pipeline for:
✔ Sampling & benchmarking ✔ ELA feature extraction ✔ High-dimensional analysis ✔ Large-scale dataset construction
📜 License
MIT License