
BBOB Sampling & ELA Feature Extraction Pipeline


A scalable, end-to-end pipeline for:

  • Sampling continuous search spaces
  • Evaluating BBOB benchmark functions
  • Extracting ELA (Exploratory Landscape Analysis) features
  • Studying how the compression ratio affects sampling on random subspaces
  • Building large datasets efficiently (parallel + chunked)

Full Pipeline Overview

flowchart LR
    A[Sample X] --> B["Evaluate f(X) on BBOB"]
    B --> C[Extract ELA Features]
    C --> D[Aggregate Dataset]

    subgraph Advanced ["Sampling on Subspaces Pipeline"]
        E[Sample in low dimension d]
        F[Project to high dimension D]
        G[Evaluate BBOB]
        H[ELA on full + slices]
    end

    E --> F --> G --> H

Project Structure

.
├── doe_sampling.py                      # Generate X samples
├── y_sampling.py                        # Evaluate BBOB functions
├── ela_sampling.py                      # Extract ELA features
├── sampler.py                           # Alternative IOH-based sampling
│
├── slicing_sampling_test_parallel.py
├── slicing_all_in_sampling_test_parallel.py
│   └── Low-D → High-D sampling + parallel ELA
│
├── parallel_loader.py                   # Build final dataset (chunked)
├── parallel_loader_slices.py
├── parallel_loader_slices_all_in.py
│   └── Parallel loading of many CSV files
│
└── data/                                # Outputs (generated)

Installation

Install the dependencies by running:

python3 -m pip install -r requirements.txt

Usage (End-to-End)

Generate Samples for Full-Space Sampling on BBOB

The following example uses Quasi-Monte Carlo (QMC) sampling. The code currently supports the halton, sobol, and lhs samplers to generate points, which are then passed to one of the BBOB functions to obtain function evaluations for later analysis.

Example

python doe_sampling.py \
    --dim 20 \
    --n 1000 \
    --sampler lhs \
    --seed 42 \
    --out samples.csv
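A minimal sketch of what this step amounts to, assuming `scipy.stats.qmc` (which provides all three supported samplers); the `get_sampler` name mirrors the repo's customization hook, but this exact implementation is an illustration, not the repo's code:

```python
# Minimal sketch of QMC sample generation with scipy.stats.qmc.
import numpy as np
from scipy.stats import qmc

def get_sampler(name: str, dim: int, seed: int):
    """Return a QMC engine for 'halton', 'sobol', or 'lhs'."""
    engines = {
        "halton": qmc.Halton,
        "sobol": qmc.Sobol,
        "lhs": qmc.LatinHypercube,
    }
    return engines[name](d=dim, seed=seed)

# 1000 LHS points in [-5, 5]^20, the standard BBOB search domain.
sampler = get_sampler("lhs", dim=20, seed=42)
X = qmc.scale(sampler.random(n=1000), l_bounds=-5, u_bounds=5)
print(X.shape)  # (1000, 20)
```

Note that for the Sobol engine, `n` should be a power of two (see Important Notes below).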

Output

As output, a folder hierarchy is generated that encodes the dimension, the number of samples n, the QMC sampler used, and the random seed:

x_samples/
  reduction/
    Dimension_20/
      seed_42/
        Samples_1000/
          samples.csv

Evaluation of BBOB Functions

Run:

python y_sampling.py

Open the script and point it to the folder containing the previously generated samples. When run successfully, the script generates the following directories:

bbob_evaluations/
  reduction/
    Dimension_20/
      seed_42/
        Samples_1000/
          f_1/
            id_0/
              evaluations.csv

The script y_sampling.py is preconfigured to evaluate all 24 BBOB functions and the first 15 instances of each function.
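The evaluation loop can be sketched as below. Here `bbob_eval` is a stand-in (a shifted sphere), not the real BBOB suite; the repo evaluates the actual BBOB functions (sampler.py suggests an IOH-based route). The sketch only shows the 24-function x 15-instance iteration structure:

```python
# Sketch of the evaluation loop: 24 BBOB functions x first 15 instances.
# bbob_eval is a STAND-IN (shifted sphere), not the real BBOB function fid.
import numpy as np

def bbob_eval(fid: int, iid: int, X: np.ndarray) -> np.ndarray:
    # Deterministic per-(function, instance) shift, purely illustrative.
    rng = np.random.default_rng(1000 * fid + iid)
    shift = rng.uniform(-4, 4, size=X.shape[1])
    return np.sum((X - shift) ** 2, axis=1)

X = np.random.default_rng(42).uniform(-5, 5, size=(100, 20))
results = {}
for fid in range(1, 25):      # BBOB functions f_1 .. f_24
    for iid in range(15):     # instances id_0 .. id_14
        results[(fid, iid)] = bbob_eval(fid, iid, X)

print(len(results))  # 360 = 24 functions x 15 instances
```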

Extract ELA features

Run:

python ela_sampling.py

In the script, indicate the directories containing the samples and the corresponding evaluations.

The output is the following:

ela_features/
  reduction/
    Dimension_20/
      seed_42/
        Samples_1000/
          f_1/
            id_0/
              ela_features.csv

Each file contains:

feature_1, feature_2, ..., feature_n
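The repo computes its ELA features via pflacco; as a self-contained illustration of one listed feature family, here is a hand-rolled fitness-distance correlation (FDC). The function name and implementation are an assumption for illustration, not pflacco's API:

```python
# Hand-rolled fitness-distance correlation (FDC), one of the ELA feature
# families used in this pipeline; the repo itself relies on pflacco.
import numpy as np

def fitness_distance_correlation(X: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation between f(x) and distance to the best sample."""
    best = X[np.argmin(y)]
    d = np.linalg.norm(X - best, axis=1)
    return float(np.corrcoef(y, d)[0, 1])

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(500, 5))
y = np.sum(X ** 2, axis=1)  # sphere: fitness and distance to best align,
fdc = fitness_distance_correlation(X, y)  # so FDC is close to 1 here
```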

Building Final Dataset

Run:

python parallel_loader.py

This script generates the files:

complete_data_generated.csv
complete_data_generated.parquet

Final dataset includes:

  • dimension
  • seed
  • n_samples
  • function_idx
  • instance_idx
  • source_file

Sampling on subspaces

Initial Sampling

Run:

python slicing_sampling_test_parallel.py

Or:

python slicing_all_in_sampling_test_parallel.py

The difference between the two is how the point budget is allocated: the second script evaluates all points in a single defined subspace, whereas the first splits the point budget across multiple subspaces, or "slices".

What Happens?

flowchart LR
    A[Low-D samples] --> B[Random embedding]
    B --> C[High-D samples]
    C --> D[Evaluate BBOB]
    D --> E["ELA (full dataset)"]

    C --> F[Split into slices]
    F --> G[ELA per slice]
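The low-D to high-D step in the flowchart can be sketched as a random embedding. The matrix construction below (scaled Gaussian entries) and the clipping to the BBOB box are assumptions for illustration; the repo's actual embedding may differ:

```python
# Sketch of the low-D -> high-D projection: sample in d dimensions,
# project to D dimensions through a random matrix, clip to the BBOB box.
import numpy as np

def random_embedding(X_low: np.ndarray, D: int, seed: int) -> np.ndarray:
    d = X_low.shape[1]
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(D, d)) / np.sqrt(d)  # random projection matrix
    X_high = X_low @ A.T                      # embed into D dimensions
    return np.clip(X_high, -5, 5)             # keep points in [-5, 5]^D

X_low = np.random.default_rng(42).uniform(-5, 5, size=(1000, 10))
X_high = random_embedding(X_low, D=20, seed=7)
print(X_high.shape)  # (1000, 20)
```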

Output Structure

Running either of the scripts above generates the following structure, which mirrors the flowchart shown earlier.

sampling_outputs_20D_10D/
  f1/
    iid_0/
      group0/
        full.csv
        slice1.csv
        slice2.csv

Meaning

  • full.csv: ELA features computed on all samples
  • slice*.csv: ELA features computed per low-D slice

Data Formats (Unified)

X Samples

x1, x2, ..., xd

Function Evaluations

fX

ELA Features

feature_1, feature_2, ..., feature_n

Final Dataset (Aggregated)

Includes:

  • Features
  • Metadata
  • File origin

Core Concepts

  • Sampling methods: Latin Hypercube (LHS), Sobol, Halton, Monte Carlo
  • ELA features (via pflacco): meta features, distribution features, level sets, Nearest Better Clustering (NBC), dispersion, information content, PCA, fitness-distance correlation
  • Projection strategy: sample in low dimension (d) --> embed into high dimension (D) --> evaluate in high-D

Details

  • Analyze landscape structure via ELA
  • Performance Features
  • Parallel processing (multiprocessing)
  • Chunked data loading
  • Memory-safe streaming
  • Parquet output (fast + compressed)
  • Handles millions of CSV files

Important Notes

  • Sobol sampling requires n = 2^k
  • ELA level features require enough samples
  • File paths encode metadata → do not change structure
  • Use Parquet for large datasets

Customization

You can modify:

  • Sampling method (get_sampler)
  • Dimensions (D, d)
  • Number of groups / slices
  • Enabled ELA features

Use Cases

  • Optimization landscape analysis
  • Meta-learning dataset generation
  • Benchmarking optimization algorithms
  • Studying dimensionality reduction effects

Summary

This repository provides a complete, scalable pipeline for:

✔ Sampling & benchmarking ✔ ELA feature extraction ✔ High-dimensional analysis ✔ Large-scale dataset construction

📜 License

MIT License

About

This repository analyzes how sampling black-box optimization landscapes with random embeddings can be beneficial in sparse cases.
