MPBoot_LLM

MPBoot_LLM generates R2RML mappings from relational database dumps and target ontologies with an LLM-driven pipeline. The repo also includes a local evaluation stack for RODI-style benchmarks and shared result webpages.

Requirements

  • Python 3.9+
  • Java 11+
  • git
  • curl
  • unzip
  • Docker
  • either mvn or Docker access to the maven image

All commands below assume you are at the repository root:

cd mpboot_llm

Environment Setup

Bootstrap the local toolchain:

bash scripts/bootstrap.sh

This installs or prepares:

  • .venv
  • .tools/robot
  • .tools/rodi
  • .tools/ontop/jdbc
  • .tools/bin/psql_docker.sh
  • the local PostgreSQL Docker container

Then create .env:

cp .env.example .env

At minimum, set the API key for the provider you want to use and set LLM_PROVIDER to select it.

Useful environment variables:

  • LLM_PROVIDER — provider to use (claude, gpt4o, gpt4o-mini, groq, gemini, ollama; default: claude)
  • OLLAMA_BASE_URL — Ollama server URL (default: http://localhost:11434)
  • OLLAMA_MODEL — model name to use with Ollama (default: deepseek-r1:32b)
  • ANTHROPIC_API_KEY
  • ANTHROPIC_PROXY_URL
  • ANTHROPIC_MOCK_LOG_LEVEL
  • MPBOOT_DB_PORT
  • MPBOOT_DB_NAME
  • MPBOOT_R2RML_FORCE_DOUBLE_FOR_DECIMALS
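
A fully local setup with Ollama needs no cloud API key. The fragment below is a minimal sketch of such a .env, assuming the file uses plain KEY=value lines (the usual dotenv convention; check .env.example for the exact format). The values are the defaults listed above.

```shell
# Hypothetical minimal .env for a local Ollama run.
# Variable names and default values are taken from the list above.
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=deepseek-r1:32b
```

For a hosted provider, set LLM_PROVIDER accordingly and add that provider's API key instead, for example ANTHROPIC_API_KEY for claude.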

Tutorial

A worked example using the Conference NoFKs dataset is in tutorial/. It runs entirely locally — Ollama for the LLM and Docker for PostgreSQL — with no cloud API key required. It covers all steps from running the pipeline to the optional RODI evaluation.

Workflow 1: Generate Mappings From Your Own Dump And Ontology

If you already have a relational dump and an ontology, you do not need the RODI dataset download path. The mapping pipeline expects a dataset directory containing:

  • dump.sql or dump_pg_compatible.sql
  • ontology.ttl or ontology.owl
  • optionally queries/*.qpair if you also want RODI-style evaluation later

Minimal directory layout

Example:

my_input/
  my_dataset/
    dump.sql
    ontology.ttl
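
Before converting, it can help to confirm that a dataset directory has the expected inputs. The helper below is a small sketch, not part of the repository: check_dataset_dir is a hypothetical name, and it checks only for the file names listed above.

```shell
# Check that a dataset directory contains one SQL dump and one ontology file.
# Prints what is missing; returns 0 when both are present, 1 otherwise.
# The function body runs in a subshell, so its variables stay local.
check_dataset_dir() (
  dir=$1
  status=0
  if [ ! -f "$dir/dump.sql" ] && [ ! -f "$dir/dump_pg_compatible.sql" ]; then
    echo "missing: dump.sql or dump_pg_compatible.sql"
    status=1
  fi
  if [ ! -f "$dir/ontology.ttl" ] && [ ! -f "$dir/ontology.owl" ]; then
    echo "missing: ontology.ttl or ontology.owl"
    status=1
  fi
  exit "$status"
)
```

For the layout above, you would call it as check_dataset_dir my_input/my_dataset.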

Convert your dataset to the repo's PostgreSQL-compatible format

bash scripts/create_pg_compatible_dataset.sh my_input pg_compatible/outputs/data_pg_compatible

That will create:

pg_compatible/outputs/data_pg_compatible/my_dataset/

with:

  • dump_pg_compatible.sql
  • ontology.ttl or copied ontology.owl
  • copied extra files such as queries/

If you only want to process a single dataset directory directly:

bash scripts/create_pg_compatible_dataset.sh my_input/my_dataset pg_compatible/outputs/data_pg_compatible/my_dataset

Generate ontology.owl if your ontology is Turtle

bash scripts/generate_owlxml_ontologies.sh pg_compatible/outputs/data_pg_compatible --dataset my_dataset

If ontology.owl already exists, this step can be skipped unless you want to overwrite it:

bash scripts/generate_owlxml_ontologies.sh pg_compatible/outputs/data_pg_compatible --dataset my_dataset --overwrite

Run mapping generation for that dataset

bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset

The runner stages the dataset into the live workspace and then executes the mapping phases.

Useful variants:

bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset --dry-run
bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset --from phase1
bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/my_dataset --only phase7
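
The conversion, OWL generation, and mapping steps above can be chained for one dataset. The sketch below only prints the commands already shown in this section so that you can review them before running anything. workflow1_commands is a hypothetical helper name, and the sketch assumes the multi-dataset form of create_pg_compatible_dataset.sh and paths without spaces.

```shell
# Print the Workflow 1 commands for one dataset, in order.
# Nothing is executed here; the output is meant to be reviewed first.
workflow1_commands() (
  input_root=$1   # e.g. my_input
  dataset=$2      # e.g. my_dataset
  out_root=pg_compatible/outputs/data_pg_compatible
  echo "bash scripts/create_pg_compatible_dataset.sh $input_root $out_root"
  echo "bash scripts/generate_owlxml_ontologies.sh $out_root --dataset $dataset"
  echo "bash scripts/create_mapping_single_dataset.sh --dataset-dir $out_root/$dataset"
)
```

One way to use it is to save the output to a file, for example workflow1_commands my_input my_dataset > run_my_dataset.sh, and then run bash -e run_my_dataset.sh so that execution stops at the first failing step.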

Evaluate your own dataset

Evaluation is optional. If you also provide queries/*.qpair, you can evaluate an archived run later with:

bash scripts/evaluation.sh outputs/<model>/<timestamp> --dataset my_dataset --method all

If you have no qpair queries, the mapping-generation workflow still works; only the RODI query evaluation path is unavailable.
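
To decide up front whether the evaluation step applies, you can test for qpair files in the dataset directory. This is a minimal sketch with a hypothetical helper name, has_qpair_queries; it checks only for queries/*.qpair as described above.

```shell
# Return 0 if the dataset directory contains at least one queries/*.qpair file.
has_qpair_queries() (
  dir=$1
  # An unmatched glob stays literal, so each candidate is tested with -f.
  for f in "$dir"/queries/*.qpair; do
    [ -f "$f" ] && exit 0
  done
  exit 1
)
```

A script could then run the evaluation command only when has_qpair_queries my_input/my_dataset succeeds.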

Workflow 2: Full RODI Pipeline

This is the workflow for the bundled RODI benchmark datasets under datasets/rodi/.

One command for one dataset

Run everything for one dataset:

bash scripts/run_end_to_end_dataset.sh mondial_rel --method all

Variants:

bash scripts/run_end_to_end_dataset.sh mondial_rel --method rodi
bash scripts/run_end_to_end_dataset.sh mondial_rel --skip-evaluation --skip-summary

This wrapper will:

  1. bootstrap missing tools
  2. download the requested RODI dataset if missing
  3. normalize mondial_rel if needed
  4. build the PostgreSQL-compatible dataset copy
  5. generate ontology.owl if needed
  6. start the local Anthropic cache server
  7. run mapping generation
  8. stop the cache server
  9. run evaluation
  10. regenerate the shared summary webpages

One command for all datasets

Run the full batch:

bash scripts/run_end_to_end_all.sh

Variants:

bash scripts/run_end_to_end_all.sh --method rodi
bash scripts/run_end_to_end_all.sh --skip-evaluation --skip-summary

Manual step-by-step RODI workflow

If you want the individual steps instead of the wrapper:

  1. Download the selected benchmark datasets:

bash scripts/bootstrap.sh --download-rodi

Download only one dataset:

bash scripts/bootstrap.sh --download-rodi --dataset mondial_rel

  2. Build PostgreSQL-compatible dataset copies:

bash scripts/create_pg_compatible_dataset.sh datasets/rodi

  3. Generate OWL/XML where needed:

bash scripts/generate_owlxml_ontologies.sh pg_compatible/outputs/data_pg_compatible

  4. Run mapping generation:

bash scripts/create_all_mapping.sh pg_compatible/outputs/data_pg_compatible --keep-going

Or one dataset only:

bash scripts/create_mapping_single_dataset.sh --dataset-dir pg_compatible/outputs/data_pg_compatible/mondial_rel

  5. Evaluate an archived batch:

bash scripts/evaluation.sh outputs/<model>/<timestamp> --method all --keep-going

Only one dataset:

bash scripts/evaluation.sh outputs/<model>/<timestamp> --dataset mondial_rel --method all

  6. Regenerate the shared webpages:

bash scripts/generate_summary_portal.sh

You can still pass an archived batch path if you want to anchor discovery to one run:

bash scripts/generate_summary_portal.sh outputs/<model>/<timestamp>

Notes about mondial_rel

mondial_rel needs one repo-specific normalization step. During bootstrap preparation, the repo:

  • keeps only the relevant schema from the original dump
  • renames the schema to mondial_rel
  • strips obsolete schema prefixes from the Mondial .qpair SQL

That normalization is handled by scripts/bootstrap_prepare_rodi_dumps.sh.

Outputs

Live workspace during a run

The active workspace used by the mapping agents is:

The live generated mapping ends up at:

Archived runs

Each completed run is archived under:

outputs/<model>/<timestamp>/<dataset>/

Typical contents:

  • mappings_r2rml.ttl
  • run_metadata.json
  • run.log
  • inputs/
  • workspace/
  • evaluation/ after evaluation
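
Commands such as scripts/evaluation.sh take an archived run path of the form outputs/<model>/<timestamp>. The sketch below picks the most recent run for one model. latest_run is a hypothetical helper, and it assumes that timestamp directory names sort chronologically when compared as text, which this README does not state.

```shell
# Print the last run directory under a model directory, by name order.
# Assumption: timestamp directory names sort chronologically as text.
latest_run() (
  model_dir=$1   # e.g. outputs/<model>
  latest=
  # Glob results are sorted, so the last directory seen is the newest.
  for d in "$model_dir"/*/; do
    [ -d "$d" ] || continue
    latest=${d%/}
  done
  [ -n "$latest" ] && echo "$latest"
)
```

The result can be passed to the evaluation script in place of a hand-typed path, for example bash scripts/evaluation.sh "$(latest_run outputs/<model>)" --dataset my_dataset --method all.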

Shared summary webpages

Generated result pages live under:

Regenerate them with:

bash scripts/generate_summary_portal.sh

Script Reference

Main entrypoints:

Experimental Results

Experimental results are archived on Zenodo at 10.5281/zenodo.20073873.

Reference metadata

DOI: 10.5281/zenodo.20073239

This repository is archived on Zenodo at 10.5281/zenodo.20073239.
