This repository contains the codebase accompanying the paper Integrating Meta-Features with Knowledge Graph Embeddings for Meta-Learning by Klironomos A., Dasoulas I., Periti F., Gad-Elrab M., Paulheim H., Dimou A., Kharlamov E., accepted at ESWC 2026.
- Repository Structure
- Prerequisites
- Command Line Arguments
- Reproducing Paper Results
- Citation
- License
- Acknowledgments
The repository consists of two main subprojects:
- openml_exekgs_generation/: Crawls OpenML to obtain experiment data and generates executable knowledge graphs (ExeKGs) from runs and tasks. This subproject handles the data collection and initial KG generation phase.
- kge_experiments/: The main subproject that trains Knowledge Graph Embeddings and uses them for two downstream meta-learning tasks.
Core functionality includes:
- Preparing and processing knowledge graph data
- Training RDF2Vec-based Knowledge Graph Embedding (KGE) models
- Training Link Prediction (LP) models on the knowledge graph (optional, for PPE task only)
- Analyzing results and generating publication-ready figures and tables
Downstream Task 1: Pipeline Performance Estimation (PPE)
- Predicting pipeline performance on datasets using RDF2Vec or LP embeddings
- Evaluating performance prediction accuracy on unseen datasets and pipelines
Downstream Task 2: Dataset Performance Similarity Estimation (DPSE)
- Calculating KGE-based dataset similarities using RDF2Vec embeddings
- Computing Graph Edit Distances (GEDs) between datasets
- Evaluating dataset similarity measures for retrieving similar datasets
- Python 3.10
- Poetry (for dependency management)
Install dependencies using Poetry:
poetry install
This section describes the command-line arguments for both subprojects.
These arguments are used when generating ExeKGs from OpenML (Step 0).
- --mp-runs: Enable multiprocessing for runs (default: False)
- --mp-tasks: Enable multiprocessing for tasks (default: False)
- --task-id <List[int]>: Specify particular task IDs to process (optional)
- --use-processed-tasks-and-runs: Use previously processed tasks and runs (default: False)
- --offset-step <int>: Step size for processing tasks in batches (default: 1000)
- --n-tasks <int>: Number of tasks to process (optional, processes all if not specified)
- --n-runs-per-flow <int>: Number of runs to process per flow (default: 10)
These arguments are used for training KGE models and performing meta-learning tasks (Steps 1+).
The following arguments can be used with any kge_experiments command:
- --verbose: Enable verbose output for detailed logging and debugging information
- --rdf2vec-d <int>: Set the distance (depth) of random walks for RDF2Vec embeddings (default: 10)
- --rdf2vec-w <int>: Set the number of random walks per entity for RDF2Vec embeddings (default: 10)
- --rdf2vec-ws <string>: Set the walk strategy for RDF2Vec embeddings (default: "random")
- --use-mlseakg: Include MLSeaKG (Machine Learning Semantic Knowledge Graph) in addition to ExeKGs
- --use-mkga: Use MKGA (Multi-Modal Knowledge Graph Augmentation) preprocessing method

Arguments for prepare-data:
- --filter-by-dataset-ids: Filter the data to include only specific dataset IDs (predefined in the code)
- --excl-flows-per-task: Exclude invalid flows per task based on predefined criteria
- --excl-performance-values: Exclude performance values from the knowledge graph (for PPE downstream task)
- --remove-old-files: Remove existing processed files before generating new ones

Arguments for train-rdf2vec:
- --chunk-size <int>: Set the chunk size for processing data in batches (default: 100)
- --cpu-count <int>: Set the number of CPU cores to use for parallel processing (default: 4)

Arguments for train-link-prediction:
- --model-type <string>: Type of link prediction model to train. Options: DistMultReaLitEGated, TransEReaLitEGated, DistMultReaLitE, TransEReaLitE, ComplExReaLitEGated, ComplExReaLitE (default: "DistMultReaLitEGated")
- --timeout-hours <float>: Timeout in hours for training (optional)
- --inverse-triples/--no-inverse-triples: Create inverse triples in the knowledge graph (default: enabled, should be True for LP models)
Note: Link Prediction models are used exclusively for the PPE downstream task. These models handle literals natively, so MKGA preprocessing (--use-mkga) should be disabled when training LP models.
Arguments for calculate-kge-similarities:
- --replace-similarities: Replace existing similarity calculations if they already exist
- --replace-info: Replace existing dataset information if it already exists
- --add-data-entity-sim: Calculate similarities based on data entity embeddings only
- --add-data-entity-and-pipeline-sim: Calculate similarities using both data entity and pipeline embeddings
- --add-pipeline-sim: Calculate similarities based on pipeline embeddings only
- --add-dataset-info: Add dataset information to the similarity results
Note: At least one of the similarity calculation options (--add-data-entity-sim, --add-data-entity-and-pipeline-sim, --add-pipeline-sim, or --add-dataset-info) must be specified.
Arguments for calculate-and-add-geds-to-similarities:
- --replace-geds: Replace existing Graph Edit Distance (GED) calculations if they already exist
- --add-dataset-info: Add dataset information to the GED results
- --replace-info: Replace existing dataset information in GED results if it already exists
- --num-processes <int>: Set the number of parallel processes for GED calculation (default: 10)
- --chunk-size <int>: Set the chunk size for processing GED calculations in batches (default: 5)
Arguments for predict-pipeline-performance:
- --split-mode <string>: Set the data splitting strategy, either "dataset" or "pipeline" (default: "dataset")
- --target <string>: Specify the target variable to predict (default: "f1_score")
- --emb-aggr-type <string>: Set the embedding aggregation type, either "concat" or "mean" (default: "concat")
- --model <string>: Choose the machine learning model. Options include RF, SVC, LR, RFReg, SVR, LRReg (default: "RF")
- --dataset-emb-source <string>: Specify the dataset embedding source: "rdf2vec", "pykeen-lp", or a metafeature type (metafeatures_all, metafeatures_statistical, metafeatures_landmarkers, metafeatures_information_theory) (default: "rdf2vec")
- --pipeline-emb-source <string>: Specify the pipeline embedding source, either "rdf2vec" or "pykeen-lp" (default: "rdf2vec")
- --min-train-samples-per-run-id <int>: Set the minimum number of training samples per run ID (default: 50)
- --pykeen-lp-model-name <string>: PyKEEN LP model name (required when using pykeen-lp as an embedding source). Options: DistMultReaLitEGated, TransEReaLitEGated, DistMultReaLitE, TransEReaLitE, ComplExReaLitEGated, ComplExReaLitE
- --pykeen-lp-inverse-triples/--no-pykeen-lp-inverse-triples: Whether the PyKEEN model was trained with inverse triples (default: enabled, must match the training configuration)
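As an illustration of the two --emb-aggr-type choices, the sketch below combines two vectors by concatenation or by element-wise averaging. It assumes the option merges one dataset embedding with one pipeline embedding; the function name aggregate_embeddings is hypothetical and does not come from the codebase.

```python
from typing import List


def aggregate_embeddings(
    dataset_emb: List[float],
    pipeline_emb: List[float],
    aggr_type: str = "concat",
) -> List[float]:
    """Combine two embeddings (hypothetical helper, not the repository's code)."""
    if aggr_type == "concat":
        # Keeps both vectors intact; output length is the sum of both lengths.
        return list(dataset_emb) + list(pipeline_emb)
    if aggr_type == "mean":
        # Element-wise average; assumes both vectors share one dimensionality.
        if len(dataset_emb) != len(pipeline_emb):
            raise ValueError("'mean' needs embeddings of equal length")
        return [(d + p) / 2.0 for d, p in zip(dataset_emb, pipeline_emb)]
    raise ValueError(f"unknown aggregation type: {aggr_type!r}")


concat_vec = aggregate_embeddings([1.0, 2.0], [3.0, 4.0], "concat")
mean_vec = aggregate_embeddings([1.0, 2.0], [3.0, 4.0], "mean")
```

Concatenation preserves which half of the feature vector came from which source, while averaging keeps the dimensionality fixed.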
Follow this complete step-by-step process to reproduce the results from the paper. The workflow consists of:
- Steps 0-2: Data preparation and KGE model training (common to both tasks)
- Step 2a: Train Link Prediction models (optional, for PPE only)
- Steps 3-4: Downstream Task 1 - Pipeline Performance Estimation (PPE)
- Steps 5-7: Downstream Task 2 - Dataset Performance Similarity Estimation (DPSE)
- Step 8: Ablation study analyzing the impact of MLSeaKG integration on both tasks
Each step builds upon the previous ones.
This step uses the openml_exekgs_generation/ subproject to crawl OpenML and generate ExeKGs from runs and tasks.
Generate ExeKGs:
# Run the ExeKG generation command (from repository root)
python -m openml_exekgs_generation.main --no-mp-runs --mp-tasks --n-runs-per-flow=10
Move Generated Files to the Required Location:
After generation, move the output ExeKGs and logs to the kge_experiments/ data directory:
# Create the target directories if they don't exist
mkdir -p kge_experiments/data/input/datasets/raw/
mkdir -p kge_experiments/data/input/logs/
# Move the generated ExeKGs files to the expected location
mv openml_exekgs_generation/output/exekgs kge_experiments/data/input/datasets/raw/
# Move the tasks log file (required by kge_experiments)
mv openml_exekgs_generation/output/logs/tasks_log.csv kge_experiments/data/input/logs/
Note: The tasks_log.csv file contains metadata about OpenML tasks and datasets that is used by the kge_experiments codebase for all subsequent steps.
Prepare the knowledge graph data by filtering and processing ExeKGs. Performance values are excluded:
python -m typer kge_experiments.cli.main run prepare-data --filter-by-dataset-ids --excl-flows-per-task --excl-performance-values
Train RDF2Vec model:
python -m typer kge_experiments.cli.main run --use-mlseakg --use-mkga --rdf2vec-d 20 --rdf2vec-w 10 --rdf2vec-ws "random" train-rdf2vec --cpu-count 10 --chunk-size 1000
The following steps (2a, 3-4) focus on predicting pipeline performance on datasets. This task evaluates how well embeddings can predict the performance of machine learning pipelines on both seen and unseen datasets.
Note: Link Prediction (LP) models are used exclusively for the PPE task and provide an alternative to RDF2Vec embeddings for both dataset and pipeline representations.
Link Prediction models can be used as an alternative embedding source for pipeline performance prediction. These models are trained on the knowledge graph to learn entity and relation embeddings through the link prediction task.
Supported Models (all handle literals):
- DistMultReaLitEGated
- TransEReaLitEGated
- DistMultReaLitE
- TransEReaLitE
- ComplExReaLitEGated
- ComplExReaLitE
Training command:
python -m typer kge_experiments.cli.main run \
--use-mlseakg --no-use-mkga \
train-link-prediction \
--model-type <MODEL> \
--inverse-triples
Important notes:
- --no-use-mkga: MKGA preprocessing must be disabled for LP models because these models natively handle literals
- --inverse-triples: Should always be True (default) for LP models to create bidirectional relationships in the knowledge graph
- <MODEL>: Choose from the supported models listed above
- Optional: Use --timeout-hours <HOURS> to set a training timeout
The paper evaluates pipeline performance prediction in two scenarios: predicting performance on unseen datasets and predicting performance for unseen pipelines.
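The difference between the two scenarios can be sketched as a group-wise split: every record belonging to a held-out dataset (or pipeline) goes to the test set, so nothing about that group is seen during training. This is a minimal illustration under stated assumptions, not the repository's splitting code; the function group_split, the record keys dataset_id and pipeline_id, and the test_fraction parameter are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, object]


def group_split(
    rows: List[Row],
    split_mode: str = "dataset",
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Hold out whole datasets or whole pipelines (hypothetical helper)."""
    if split_mode not in ("dataset", "pipeline"):
        raise ValueError(f"unknown split mode: {split_mode!r}")
    key = f"{split_mode}_id"  # assumed record keys: dataset_id / pipeline_id
    groups = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    held_out = set(groups[:n_test])
    train = [row for row in rows if row[key] not in held_out]
    test = [row for row in rows if row[key] in held_out]
    return train, test


# Toy records: 5 datasets x 2 pipelines.
records = [
    {"dataset_id": d, "pipeline_id": p, "f1_score": 0.5}
    for d in range(1, 6)
    for p in ("a", "b")
]
train_rows, test_rows = group_split(records, split_mode="dataset")
```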
# Dataset-based splitting (predicting on unseen datasets)
python -m typer kge_experiments.cli.main run \
--use-mlseakg <--use-mkga|--no-use-mkga> \
--rdf2vec-d 20 --rdf2vec-w 10 \
predict-pipeline-performance \
--split-mode dataset \
--target <TARGET> \
--emb-aggr-type concat \
--model <MODEL> \
--dataset-emb-source <DATASET_EMB_SOURCE> \
--pipeline-emb-source <PIPELINE_EMB_SOURCE> \
--min-train-samples-per-run-id 50 \
[--pykeen-lp-model-name <LP_MODEL>] \
[--pykeen-lp-inverse-triples]

# Pipeline-based splitting (predicting on unseen pipelines)
python -m typer kge_experiments.cli.main run \
--use-mlseakg <--use-mkga|--no-use-mkga> \
--rdf2vec-d 20 --rdf2vec-w 10 \
predict-pipeline-performance \
--split-mode pipeline \
--target <TARGET> \
--emb-aggr-type concat \
--model <MODEL> \
--dataset-emb-source <DATASET_EMB_SOURCE> \
--pipeline-emb-source <PIPELINE_EMB_SOURCE> \
--min-train-samples-per-run-id 1 \
[--pykeen-lp-model-name <LP_MODEL>] \
[--pykeen-lp-inverse-triples]
Notes:
- When using --pipeline-emb-source pykeen-lp or --dataset-emb-source pykeen-lp, you must specify:
  - --pykeen-lp-model-name: The LP model name (e.g., DistMultReaLitEGated)
  - --pykeen-lp-inverse-triples: Must match the training configuration (should be True)
- Use --no-use-mkga with PyKEEN LP models (they handle literals natively)
- Use --use-mkga with RDF2Vec embeddings
- The paper evaluates all combinations of metafeature types with both RDF2Vec and PyKEEN LP embeddings
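The flag rules in these notes can be encoded in a small helper that assembles the argument list before launching a run. The sketch below only builds the command from the flags documented above; the function build_ppe_command is hypothetical, and choosing --no-use-mkga whenever either embedding source is pykeen-lp is an assumption for mixed configurations.

```python
from typing import List, Optional

LP_MODELS = {
    "DistMultReaLitEGated", "TransEReaLitEGated", "DistMultReaLitE",
    "TransEReaLitE", "ComplExReaLitEGated", "ComplExReaLitE",
}


def build_ppe_command(
    split_mode: str,
    target: str,
    model: str,
    dataset_emb_source: str = "rdf2vec",
    pipeline_emb_source: str = "rdf2vec",
    lp_model_name: Optional[str] = None,
) -> List[str]:
    """Assemble a predict-pipeline-performance call (hypothetical helper)."""
    uses_lp = "pykeen-lp" in (dataset_emb_source, pipeline_emb_source)
    if uses_lp and lp_model_name not in LP_MODELS:
        raise ValueError("pykeen-lp needs a supported --pykeen-lp-model-name")
    cmd = [
        "python", "-m", "typer", "kge_experiments.cli.main", "run",
        "--use-mlseakg",
        # Assumption: disable MKGA as soon as any source is pykeen-lp.
        "--no-use-mkga" if uses_lp else "--use-mkga",
        "--rdf2vec-d", "20", "--rdf2vec-w", "10",
        "predict-pipeline-performance",
        "--split-mode", split_mode,
        "--target", target,
        "--emb-aggr-type", "concat",
        "--model", model,
        "--dataset-emb-source", dataset_emb_source,
        "--pipeline-emb-source", pipeline_emb_source,
        # 50 for the dataset split, 1 for the pipeline split (as documented).
        "--min-train-samples-per-run-id", "50" if split_mode == "dataset" else "1",
    ]
    if uses_lp:
        cmd += ["--pykeen-lp-model-name", lp_model_name, "--pykeen-lp-inverse-triples"]
    return cmd


lp_cmd = build_ppe_command(
    "dataset", "f1_score", "RF",
    dataset_emb_source="pykeen-lp", lp_model_name="DistMultReaLitEGated",
)
```

The resulting list can be passed to subprocess.run(lp_cmd, check=True) from the repository root.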
Generate comprehensive LaTeX tables combining meta-classification and meta-regression results for pipeline performance prediction.
Run the script for both split modes:
# Generate table for dataset-based splitting (predicting on unseen datasets)
SPLIT_MODE=dataset MIN_TRAIN_SAMPLES=50 python kge_experiments/scripts/plot_pipeline_performance_prediction_results.py
# Generate table for pipeline-based splitting (predicting on unseen pipelines)
SPLIT_MODE=pipeline MIN_TRAIN_SAMPLES=1 python kge_experiments/scripts/plot_pipeline_performance_prediction_results.py
Environment variables:
- SPLIT_MODE: Set to "dataset" or "pipeline" to specify the evaluation scenario
- MIN_TRAIN_SAMPLES: Minimum number of training samples per run ID (50 for dataset split, 1 for pipeline split)
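For readers wiring these variables into their own tooling, the following sketch shows one way to read and validate them. It is not the script's actual parsing logic: the function read_split_config is hypothetical, and the fallback values used when a variable is unset are assumptions derived from the pairings listed above.

```python
import os
from typing import Mapping, Tuple

# Documented pairings: 50 samples for the dataset split, 1 for the pipeline split.
RECOMMENDED_MIN_SAMPLES = {"dataset": 50, "pipeline": 1}


def read_split_config(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Read SPLIT_MODE and MIN_TRAIN_SAMPLES (hypothetical helper)."""
    # Assumption: fall back to "dataset" when SPLIT_MODE is unset.
    split_mode = env.get("SPLIT_MODE", "dataset")
    if split_mode not in RECOMMENDED_MIN_SAMPLES:
        raise ValueError(f"SPLIT_MODE must be 'dataset' or 'pipeline', got {split_mode!r}")
    # Assumption: fall back to the documented pairing when unset.
    min_samples = int(env.get("MIN_TRAIN_SAMPLES", RECOMMENDED_MIN_SAMPLES[split_mode]))
    return split_mode, min_samples


pipeline_cfg = read_split_config({"SPLIT_MODE": "pipeline", "MIN_TRAIN_SAMPLES": "1"})
dataset_cfg = read_split_config({"SPLIT_MODE": "dataset"})
```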
Generated outputs:
The script generates two combined LaTeX tables (one per split mode), each containing both meta-classification and meta-regression results:
- Dataset split table: combined_results_split_dataset_min_samples_50_table.tex
  - Columns: Dataset Emb. | Pipeline Strategy | MSE | R² | Accuracy | F1
  - Rows grouped by target metric (accuracy, precision)
  - Shows performance across different dataset embedding sources and pipeline strategies
- Pipeline split table: combined_results_split_pipeline_min_samples_1_table.tex
  - Columns: Method | MSE | R² | Accuracy | F1
  - Includes baseline methods (average performance, closest embedding)
  - Rows grouped by target metric (accuracy, precision)
Table features:
- Combined metrics: Each table includes both meta-regression metrics (MSE, R²) and meta-classification metrics (Accuracy, F1) side-by-side
- Grouped by target: Results are organized by target metric rows (e.g., accuracy vs precision)
- Best performance highlighting:
  - Bold: Best performance within each embedding source group (dataset split only)
  - Underlined: Overall best performance across all methods
- Automatic selection: Shows only the best performing configuration for each embedding source and target combination
- Baseline comparisons: Includes metafeature-only baselines alongside RDF2Vec and PyKEEN LP embeddings
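The highlighting rule can be sketched as follows: a value is wrapped in \textbf when it is the best within its embedding source group and in \underline when it is the best overall. This is an illustrative reimplementation under stated assumptions, not the plotting script's code; the function highlight_best, the row keys, and the three-decimal formatting are hypothetical.

```python
from typing import Dict, List


def highlight_best(
    rows: List[Dict[str, object]],
    metric: str,
    higher_is_better: bool = True,
) -> List[str]:
    """Return LaTeX cells with group-best bold and overall-best underlined."""
    pick = max if higher_is_better else min  # use min for error metrics like MSE
    overall_best = pick(row[metric] for row in rows)
    group_best: Dict[object, float] = {}
    for row in rows:
        group = row["group"]
        value = row[metric]
        group_best[group] = value if group not in group_best else pick(group_best[group], value)

    cells = []
    for row in rows:
        cell = f"{row[metric]:.3f}"
        if row[metric] == group_best[row["group"]]:
            cell = f"\\textbf{{{cell}}}"
        if row[metric] == overall_best:
            cell = f"\\underline{{{cell}}}"
        cells.append(cell)
    return cells


# Toy results, not taken from the paper.
toy_rows = [
    {"group": "rdf2vec", "f1": 0.8},
    {"group": "rdf2vec", "f1": 0.7},
    {"group": "pykeen-lp", "f1": 0.9},
]
f1_cells = highlight_best(toy_rows, "f1")
```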
The following steps (5-7) focus on calculating and evaluating dataset similarities using the trained KGE models. This task aims to retrieve similar datasets based on their performance characteristics encoded in the knowledge graphs.
Note: Link Prediction models are NOT used for this task. Only RDF2Vec embeddings are used for dataset similarity calculations.
Calculate similarities using the trained RDF2Vec models. Run these commands for each configuration used in Step 2:
python -m typer kge_experiments.cli.main run --use-mlseakg --use-mkga --rdf2vec-d 20 --rdf2vec-w 10 --rdf2vec-ws "random" calculate-kge-similarities --add-data-entity-sim --add-data-entity-and-pipeline-sim --add-pipeline-sim
The similarities are calculated in a pairwise fashion for every possible pair of datasets. The results are saved in CSV files for further analysis.
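To make the pairwise computation concrete, the sketch below scores every unordered pair of datasets from a dictionary of embedding vectors. Cosine similarity is used here purely as an illustrative measure, and the function names and the input layout (dataset ID mapped to a list of floats) are assumptions rather than the repository's actual data structures.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(
    embeddings: Dict[int, List[float]],
) -> Dict[Tuple[int, int], float]:
    """Score each unordered dataset pair once (hypothetical helper)."""
    return {
        (i, j): cosine_similarity(embeddings[i], embeddings[j])
        for i, j in combinations(sorted(embeddings), 2)
    }


# Toy embeddings keyed by made-up dataset IDs.
toy_embeddings = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [2.0, 0.0]}
similarities = pairwise_similarities(toy_embeddings)
```

For n datasets this yields n(n-1)/2 scores, which is why the number of pairs grows quickly with the number of datasets.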
Calculate and add Graph Edit Distances to the similarities:
python -m typer kge_experiments.cli.main run calculate-and-add-geds-to-similarities
Generate comprehensive plots and LaTeX tables for dataset similarity evaluation results:
python kge_experiments/scripts/plot_dataset_similarity_results.py --excl_dot_product --excl_manhattan --excl_euclidean --excl_metrics_with_k 10 15 20 0.8
Available arguments for customizing the analysis:
- --excl_hit_metrics: Exclude Hit metrics from the analysis
- --excl_metrics_with_k: Exclude specific metrics with k values (e.g., NDCG@5, NDCG@10)
- --excl_dot_product: Exclude dot product similarity measurements
- --excl_manhattan: Exclude Manhattan distance measurements
- --excl_euclidean: Exclude Euclidean distance measurements
- --only_with_mkga: Include only results with MKGA preprocessing
- --only_with_mlsea: Include only results with MLSeaKG integration
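For orientation, the Hit and NDCG metrics named in these arguments are standard ranking measures, and a textbook version can be sketched as below. This is not the repository's evaluation code: how relevance is derived from dataset performance, and how non-integer k values such as 0.8 are interpreted, is not specified here, so the helpers only cover integer k and take relevance as given.

```python
import math
from typing import List, Sequence, Set


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def hit_at_k(ranked_ids: List[int], relevant_ids: Set[int], k: int) -> float:
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


# Relevance of retrieved datasets in ranked order (toy values).
perfect = ndcg_at_k([3.0, 2.0, 1.0], k=3)
late_hit = ndcg_at_k([0.0, 0.0, 1.0], k=3)
found = hit_at_k([7, 4, 9], relevant_ids={9}, k=3)
```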
Generated outputs:
- Heatmaps comparing baseline methods vs. best KGE approaches
- LaTeX tables for paper publication
- Performance fluctuation plots across different configurations
Perform a comprehensive ablation study to analyze the impact of MLSeaKG integration on both dataset similarity and pipeline performance prediction tasks.
For dataset-based splitting (predicting on unseen datasets):
MIN_TRAIN_SAMPLES=50 SPLIT_MODE="dataset" python kge_experiments/scripts/mlseakg_ablation_study.py --excl_metrics_with_k 10 15 20 0.8 --excl_dot_product --excl_manhattan --excl_euclidean --only_without_mkga
For pipeline-based splitting (predicting on unseen pipelines):
MIN_TRAIN_SAMPLES=1 SPLIT_MODE="pipeline" python kge_experiments/scripts/mlseakg_ablation_study.py --excl_metrics_with_k 10 15 20 0.8 --excl_dot_product --excl_manhattan --excl_euclidean --only_without_mkga
Available arguments:
- --excl_metrics_with_k: Exclude specific metrics with k values (e.g., 10 15 20 0.8)
- --excl_dot_product: Exclude dot product similarity measurements
- --excl_manhattan: Exclude Manhattan distance measurements
- --excl_euclidean: Exclude Euclidean distance measurements
- --only_without_mkga: Include only results without MKGA preprocessing
Environment variables:
- SPLIT_MODE: Set to "dataset" or "pipeline" for pipeline performance prediction analysis
- MIN_TRAIN_SAMPLES: Minimum training samples per run ID (50 for dataset split, 1 for pipeline split)
Generated outputs:
- Combined comparison tables: Side-by-side comparison of results with and without MLSeaKG
  - mlseakg_ablation_similarity_comparison.tex: Dataset similarity results
  - mlseakg_ablation_pipeline_{SPLIT_MODE}_min_samples_{MIN_TRAIN_SAMPLES}_classification_comparison.tex: Meta-Classification results
  - mlseakg_ablation_pipeline_{SPLIT_MODE}_min_samples_{MIN_TRAIN_SAMPLES}_regression_comparison.tex: Meta-Regression results
- Analysis features:
  - Identifies and compares best configurations with and without MLSeaKG
  - Generates LaTeX tables
  - Saves results to the kge_experiments/data/ablation_results/ directory
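To check which files a given run should have produced, the filename templates above can be expanded programmatically. The sketch below is a convenience illustration only: the function ablation_table_paths is hypothetical, and it assumes all three tables are written directly into the ablation results directory.

```python
from pathlib import PurePosixPath
from typing import List

# Output directory named in the list above.
ABLATION_DIR = PurePosixPath("kge_experiments/data/ablation_results")


def ablation_table_paths(split_mode: str, min_train_samples: int) -> List[PurePosixPath]:
    """Expand the documented filename templates (hypothetical helper)."""
    prefix = f"mlseakg_ablation_pipeline_{split_mode}_min_samples_{min_train_samples}"
    return [
        # Assumption: every table lives directly under ABLATION_DIR.
        ABLATION_DIR / "mlseakg_ablation_similarity_comparison.tex",
        ABLATION_DIR / f"{prefix}_classification_comparison.tex",
        ABLATION_DIR / f"{prefix}_regression_comparison.tex",
    ]


dataset_tables = ablation_table_paths("dataset", 50)
pipeline_tables = ablation_table_paths("pipeline", 1)
```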
If you use this code or our methods in your research, please cite our paper:
@InProceedings{10.1007/978-3-032-25156-5_18,
author="Klironomos, Antonis
and Dasoulas, Ioannis
and Periti, Francesco
and Gad-Elrab, Mohamed H.
and Paulheim, Heiko
and Dimou, Anastasia
and Kharlamov, Evgeny",
editor="Acosta, Maribel
and van Erp, Marieke
and Rudolph, Sebastian
and Hartig, Olaf
and Spahiu, Blerina
and Rula, Anisa
and Garijo, Daniel
and Osborne, Francesco",
title="Integrating Meta-features with Knowledge Graph Embeddings for Meta-learning",
booktitle="The Semantic Web",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="336--357",
isbn="978-3-032-25156-5"
}
This software is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.
For a list of open source components included in this project, see the file 3rd-party-licenses.txt.
This project includes outsourced code in the following locations:
- MKGA-exekgs-extension/: Code adapted from the MKGA repository.
- kge_experiments/classes/rdf2vec.py: Code adapted from a fork of the pyRDF2Vec repository.