MPBoot_LLM

MPBoot_LLM generates R2RML mappings from relational database dumps and target ontologies with an LLM-driven pipeline. The repo also includes a local evaluation stack for RODI-style benchmarks and shared result webpages.

Requirements

Python 3.9+
Java 11+
git
curl
unzip
Docker
either mvn or Docker access to the maven image
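
Before bootstrapping, it can help to confirm that the command-line tools listed above are on PATH. The following is a minimal sketch and not part of the repository: missing_tools is a hypothetical helper, and the command names python3, java, and docker are assumptions about how the requirements are installed on your system. It does not check versions.

```shell
# Hypothetical helper (not shipped with the repo): print each named command
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, missing_tools python3 java git curl unzip docker prints nothing when every tool is available. mvn is left out because Docker access to the maven image is an accepted alternative.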

All commands below assume you are at the repository root:

cd mpboot_llm

Environment Setup

Bootstrap the local toolchain:

bash scripts/bootstrap.sh

This installs or prepares:

.venv
.tools/robot
.tools/rodi
.tools/ontop/jdbc
.tools/bin/psql_docker.sh
the local PostgreSQL Docker container
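
To confirm that bootstrap left the expected files in place, a small check such as the following may be useful. It is a sketch under stated assumptions: missing_bootstrap_paths is a hypothetical helper, the paths are copied from the list above, and it does not inspect the PostgreSQL Docker container.

```shell
# Hypothetical helper: list bootstrap outputs (paths taken from the list above)
# that are absent under the given repository root (default: current directory).
missing_bootstrap_paths() (
  root=${1:-.}
  for rel in .venv .tools/robot .tools/rodi .tools/ontop/jdbc .tools/bin/psql_docker.sh; do
    [ -e "$root/$rel" ] || printf '%s\n' "$rel"
  done
)
```

Running missing_bootstrap_paths from the repository root prints one line per missing path and nothing when all of them exist.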

Then create .env:

cp .env.example .env

At minimum, set the API key for the provider you want to use and set LLM_PROVIDER to select it.

Useful environment variables:

LLM_PROVIDER — provider to use (claude, gpt4o, gpt4o-mini, groq, gemini, ollama; default: claude)
OLLAMA_BASE_URL — Ollama server URL (default: http://localhost:11434)
OLLAMA_MODEL — model name to use with Ollama (default: deepseek-r1:32b)
ANTHROPIC_API_KEY
ANTHROPIC_PROXY_URL
ANTHROPIC_MOCK_LOG_LEVEL
MPBOOT_DB_PORT
MPBOOT_DB_NAME
MPBOOT_R2RML_FORCE_DOUBLE_FOR_DECIMALS
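
A typo in LLM_PROVIDER is easy to miss, so a quick sanity check of .env can be sketched as follows. Both functions are hypothetical helpers, not repository code. The sketch assumes .env uses plain, unquoted KEY=value lines without an export prefix; the provider names and the claude default are taken from the list above.

```shell
# Hypothetical helper: print the LLM_PROVIDER value from a .env-style file,
# or "claude" (the documented default) when the variable is absent or empty.
# Assumes unquoted KEY=value lines; the last assignment wins.
provider_from_env_file() (
  value=$(sed -n 's/^LLM_PROVIDER=//p' "$1" 2>/dev/null | tail -n 1)
  printf '%s\n' "${value:-claude}"
)

# Hypothetical helper: succeed only for provider names listed in this README.
is_known_provider() {
  case "$1" in
    claude|gpt4o|gpt4o-mini|groq|gemini|ollama) return 0 ;;
    *) return 1 ;;
  esac
}
```

A typical use is is_known_provider "$(provider_from_env_file .env)" || echo "unknown LLM_PROVIDER" >&2.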

Tutorial

A worked example using the Conference NoFKs dataset is in tutorial/. It runs entirely locally, with Ollama for the LLM and Docker for PostgreSQL, so no cloud API key is required. It covers every step from running the pipeline to the optional RODI evaluation.

Workflow 1: Generate Mappings From Your Own Dump And Ontology

If you already have a relational dump and an ontology, you do not need the RODI dataset download path. The mapping pipeline expects a dataset directory containing:

dump.sql or dump_pg_compatible.sql
ontology.ttl or ontology.owl
optionally queries/*.qpair if you also want RODI-style evaluation later

Minimal directory layout

Example:

my_input/
  my_dataset/
    dump.sql
    ontology.ttl
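
The layout requirements above can be checked before conversion with a sketch like the one below. check_dataset_dir is a hypothetical helper, not a repository script; it only tests that the file names listed in this section exist and does not validate their contents.

```shell
# Hypothetical helper: verify that a dataset directory contains a dump and an
# ontology under one of the file names the mapping pipeline expects.
check_dataset_dir() (
  dir=$1
  status=0
  if [ ! -f "$dir/dump.sql" ] && [ ! -f "$dir/dump_pg_compatible.sql" ]; then
    echo "missing dump.sql or dump_pg_compatible.sql in $dir" >&2
    status=1
  fi
  if [ ! -f "$dir/ontology.ttl" ] && [ ! -f "$dir/ontology.owl" ]; then
    echo "missing ontology.ttl or ontology.owl in $dir" >&2
    status=1
  fi
  exit "$status"
)
```

For the example layout, check_dataset_dir my_input/my_dataset exits with status 0 and prints nothing.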

Convert your dataset to the repo's PostgreSQL-compatible format

bash scripts/create_pg_compatible_dataset.sh my_input pg_compatible/outputs/data_pg_compatible

That will create:

pg_compatible/outputs/data_pg_compatible/my_dataset/

with:

dump_pg_compatible.sql
ontology.ttl or copied ontology.owl
copied extra files such as queries/

If you only want to process a single dataset directory directly:

bash scripts/create_pg_compatible_dataset.sh my_input/my_dataset pg_compatible/outputs/data_pg_compatible/my_dataset

Generate `ontology.owl` if your ontology is Turtle

bash scripts/generate_owlxml_ontologies.sh pg_compatible/outputs/data_pg_compatible --dataset my_dataset

If ontology.owl already exists, this step can be skipped unless you want to overwrite it:

bash scripts/generate_owlxml_ontologies.sh pg_compatible/outputs/data_pg_compatible --dataset my_dataset --overwrite

Run mapping generation for that dataset

bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset

The runner stages the dataset into the live workspace and then executes the mapping phases.

Useful variants:

bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset --dry-run
bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset --from phase1
bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset --only phase7
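
The three steps of Workflow 1 (convert, generate OWL/XML, run mapping generation) can be collected into one reviewable command list. The sketch below only prints the commands; workflow1_commands is a hypothetical helper, and it assumes that the input root and dataset name contain no spaces and that the OWL/XML step is harmless when ontology.owl already exists.

```shell
# Hypothetical helper: print the Workflow 1 commands for one dataset, in order.
# $1 = input root containing the dataset directory, $2 = dataset name.
workflow1_commands() (
  input_root=$1
  dataset=$2
  out_root=pg_compatible/outputs/data_pg_compatible
  printf 'bash scripts/create_pg_compatible_dataset.sh %s %s\n' "$input_root" "$out_root"
  printf 'bash scripts/generate_owlxml_ontologies.sh %s --dataset %s\n' "$out_root" "$dataset"
  printf 'bash scripts/create_mapping_single_dataset.sh --dataset-dir %s/%s\n' "$out_root" "$dataset"
)
```

One way to use it is to save the output of workflow1_commands my_input my_dataset to a file, review it, and run that file from the repository root with sh -e so that execution stops at the first failing step.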

Evaluate your own dataset

Evaluation is optional. If you also provide queries/*.qpair, you can evaluate an archived run later with:

bash scripts/evaluation.sh outputs/<model>/<timestamp> --dataset my_dataset --method all

If you have no qpair queries, the mapping-generation workflow still works; only the RODI query evaluation path is unavailable.
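
Whether the RODI query evaluation path is available for a dataset can be decided by looking for qpair files. The following sketch uses a hypothetical helper, has_qpairs, and assumes the queries/*.qpair layout described above.

```shell
# Hypothetical helper: succeed if the dataset directory holds at least one
# queries/*.qpair file, fail otherwise.
has_qpairs() (
  for f in "$1"/queries/*.qpair; do
    # When the glob matches nothing, $f is the literal pattern and -f fails.
    [ -f "$f" ] && exit 0
  done
  exit 1
)
```

For example, has_qpairs my_input/my_dataset || echo "no qpair queries; skipping evaluation" lets a wrapper script skip the evaluation step cleanly.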

Workflow 2: Full RODI Pipeline

This is the workflow for the bundled RODI benchmark datasets under datasets/rodi/.

One command for one dataset

Run everything for one dataset:

bash scripts/run_end_to_end_dataset.sh mondial_rel --method all

Variants:

bash scripts/run_end_to_end_dataset.sh mondial_rel --method rodi
bash scripts/run_end_to_end_dataset.sh mondial_rel --skip-evaluation --skip-summary

This wrapper will:

bootstrap missing tools
download the requested RODI dataset if missing
normalize mondial_rel if needed
build the PostgreSQL-compatible dataset copy
generate ontology.owl if needed
start the local Anthropic cache server
run mapping generation
stop the cache server
run evaluation
regenerate the shared summary webpages

One command for all datasets

Run the full batch:

bash scripts/run_end_to_end_all.sh

Variants:

bash scripts/run_end_to_end_all.sh --method rodi
bash scripts/run_end_to_end_all.sh --skip-evaluation --skip-summary
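
To run a chosen subset of datasets rather than one or all, the single-dataset wrapper can be called in a loop. This is a sketch, not repository code: run_subset and the RUNNER override are hypothetical, and continuing after a failed dataset is a design choice of the sketch rather than documented behavior of the wrappers.

```shell
# RUNNER is a hypothetical override so the loop can be exercised without the
# repo; by default it calls the single-dataset wrapper from this README.
RUNNER=${RUNNER:-bash scripts/run_end_to_end_dataset.sh}

# Hypothetical helper: run the wrapper for each named dataset, keep going on
# failure, and report the datasets that failed at the end.
run_subset() (
  failed=
  for dataset in "$@"; do
    # $RUNNER is intentionally unquoted so "bash scripts/..." splits into words.
    if ! $RUNNER "$dataset" --method all; then
      failed="$failed $dataset"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed" >&2
    exit 1
  fi
)
```

Called as run_subset mondial_rel followed by any further dataset names, it exits non-zero if at least one dataset failed.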

Manual step-by-step RODI workflow

If you want the individual steps instead of the wrapper:

Download the selected benchmark datasets:

bash scripts/bootstrap.sh --download-rodi

Download only one dataset:

bash scripts/bootstrap.sh --download-rodi --dataset mondial_rel

Build PostgreSQL-compatible dataset copies:

bash scripts/create_pg_compatible_dataset.sh datasets/rodi

Generate OWL/XML where needed:

bash scripts/generate_owlxml_ontologies.sh pg_compatible/outputs/data_pg_compatible

Run mapping generation:

bash scripts/create_all_mapping.sh pg_compatible/outputs/data_pg_compatible --keep-going

Or one dataset only:

bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/mondial_rel

Evaluate an archived batch:

bash scripts/evaluation.sh outputs/<model>/<timestamp> --method all --keep-going

Only one dataset:

bash scripts/evaluation.sh outputs/<model>/<timestamp> --dataset mondial_rel --method all

Regenerate the shared webpages:

bash scripts/generate_summary_portal.sh

You can still pass an archived batch path if you want to anchor discovery to one run:

bash scripts/generate_summary_portal.sh outputs/<model>/<timestamp>
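
The evaluation and summary commands take an outputs/<model>/<timestamp> path. Finding the most recent run for a model can be sketched as below. latest_run is a hypothetical helper, and it assumes that timestamp directory names sort chronologically when sorted as text; this README does not specify the timestamp format, so check that assumption against your own outputs/ directory.

```shell
# Hypothetical helper: print the last subdirectory of outputs/<model> in glob
# (text) order, which is the newest run only if names sort chronologically.
latest_run() (
  model_dir=$1
  latest=
  for d in "$model_dir"/*/; do
    [ -d "$d" ] || continue
    latest=${d%/}
  done
  [ -n "$latest" ] || exit 1
  printf '%s\n' "$latest"
)
```

The result can be passed straight to the evaluation script, for example bash scripts/evaluation.sh "$(latest_run outputs/<model>)" --method all --keep-going, with <model> replaced by the directory name of the model you ran.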

Notes about `mondial_rel`

mondial_rel needs one repo-specific normalization step. During bootstrap preparation, the repo:

keeps only the relevant schema from the original dump
renames the schema to mondial_rel
strips obsolete schema prefixes from the Mondial .qpair SQL

That normalization is handled by scripts/bootstrap_prepare_rodi_dumps.sh.
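
To illustrate what stripping a schema prefix from SQL means, here is a generic sed-based sketch. It is not the logic of scripts/bootstrap_prepare_rodi_dumps.sh, which this README does not show: strip_schema_prefix is a hypothetical helper, old_schema in the usage note is a made-up prefix, and the prefix is assumed to be a plain identifier without regex metacharacters.

```shell
# Hypothetical helper: remove "<prefix>." from SQL read on stdin when it is
# used as a schema qualifier, i.e. at line start or after a non-identifier
# character. Identifiers that merely end in the prefix are left untouched.
strip_schema_prefix() {
  sed -e "s/\([^A-Za-z0-9_]\)$1\./\1/g" -e "s/^$1\.//"
}
```

For instance, piping SELECT name FROM old_schema.country through strip_schema_prefix old_schema yields SELECT name FROM country.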

Outputs

Live workspace during a run

The mapping agents work in a live workspace during a run. The generated mapping for the active run ends up at:

src/outputs/mappings/mappings_r2rml.ttl

Archived runs

Each completed run is archived under:

outputs/<model>/<timestamp>/<dataset>/

Typical contents:

mappings_r2rml.ttl
run_metadata.json
run.log
inputs/
workspace/
evaluation/ after evaluation
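
The contents listed above make it possible to tell how far an archived run has progressed. The sketch below uses a hypothetical helper, run_status, and relies only on two of the names from the list: mappings_r2rml.ttl and evaluation/. The three status words are labels chosen for this sketch.

```shell
# Hypothetical helper: classify an archived outputs/<model>/<timestamp>/<dataset>
# directory as "incomplete" (no mapping file), "mapped" (mapping only), or
# "evaluated" (mapping plus an evaluation/ directory).
run_status() (
  run_dir=$1
  if [ ! -f "$run_dir/mappings_r2rml.ttl" ]; then
    echo incomplete
  elif [ -d "$run_dir/evaluation" ]; then
    echo evaluated
  else
    echo mapped
  fi
)
```

Looping over the dataset directories of one archived batch and calling run_status on each gives a quick overview of which datasets still need evaluation.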

Shared summary webpages

The shared result pages are generated by scripts/generate_summary_portal.sh. Regenerate them with:

bash scripts/generate_summary_portal.sh

Script Reference

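Main entrypoints, as used in the workflows above:

scripts/bootstrap.sh: installs the local toolchain; with --download-rodi, downloads the RODI benchmark datasets
scripts/create_pg_compatible_dataset.sh: builds PostgreSQL-compatible dataset copies
scripts/generate_owlxml_ontologies.sh: generates ontology.owl where needed
scripts/create_mapping_single_dataset.sh: runs mapping generation for one dataset
scripts/create_all_mapping.sh: runs mapping generation for every dataset in a directory
scripts/evaluation.sh: evaluates an archived run
scripts/generate_summary_portal.sh: regenerates the shared summary webpages
scripts/run_end_to_end_dataset.sh: runs the full pipeline for one RODI dataset
scripts/run_end_to_end_all.sh: runs the full pipeline for all RODI datasets
scripts/bootstrap_prepare_rodi_dumps.sh: handles the mondial_rel normalization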

Experimental Results

Experimental results are archived on Zenodo at 10.5281/zenodo.20073873.

Reference metadata

DOI: 10.5281/zenodo.20073239

This repository is archived on Zenodo at 10.5281/zenodo.20073239.
