Add JSON-LD generation script with NRP AI support #4

Merged: lmarini merged 59 commits into master from 3-generate-jsonld-datasets-from-websites on Jun 1, 2026

@ywkim312 ywkim312 commented Dec 16, 2025

JSON-LD Generation Script for Dataset Webpages

Closes #3

Summary

This PR adds an automated script to generate JSON-LD descriptions for scientific datasets by analyzing their webpages using AI. The script reads dataset information from a CSV file (exported from Google Sheets), fetches webpage content, uses AI to extract metadata, and generates Schema.org-compliant JSON-LD files.

Features

  • Multi-AI Service Support: Works with NRP LLM (default), OpenAI, and Anthropic Claude
  • Automatic Metadata Extraction: Uses AI to detect and extract dataset information from webpages
  • JSON-LD Generation: Creates Schema.org-compliant JSON-LD with proper spatial coverage formatting
  • Robust Error Handling:
    • Automatic retry on timeouts (2 attempts)
    • Connection error retry logic
    • Server error detection and reporting
    • Comprehensive error summary at completion
  • Timeout Protection: Threading-based timeout enforcement (3 minutes per API call) to prevent hanging
  • Spatial Coverage Fixing: Automatically corrects spatial coverage box format to Schema.org standard
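
The threading-based timeout mentioned above could look roughly like the following. This is a minimal sketch, not the code in scripts/generate_jsonld.py: the helper name call_with_timeout and its parameters are hypothetical, and the timeout is left as an argument rather than hard-coded.

```python
import threading


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a daemon thread; raise TimeoutError if it does not finish in time.

    Hypothetical helper for illustration only.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the exception back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    if thread.is_alive():
        # The worker is not killed; the caller simply stops waiting for it.
        raise TimeoutError(f"call exceeded {timeout_seconds} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Note the limitation of this approach: Python threads cannot be force-stopped, so a hung API call keeps running in the background until the process exits. Marking the thread as a daemon keeps it from blocking interpreter shutdown.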

Files Added

  • scripts/generate_jsonld.py - Main script for JSON-LD generation
  • scripts/requirements.txt - Python dependencies
  • scripts/README.md - Setup and usage instructions
  • prompts/dataset-detection-prompt.txt - AI prompt for dataset detection
  • prompts/jsonld-generation-prompt.txt - AI prompt for JSON-LD generation
  • .env.example - Template for API key configuration
  • .gitignore - Updated to exclude sensitive files and generated output

Setup

  1. Install dependencies:

    pip install -r scripts/requirements.txt
  2. Configure API key (see .env.example). To get an API key, please contact either David or Yong Wook.

    cp .env.example .env
    # Edit .env and add your NRP_API_KEY
  3. Download datasets.csv from the Google Sheet (see scripts/README.md for instructions)

Testing

Test with a single URL

python scripts/generate_jsonld.py --test-url "http://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_DEM/"

Test with CSV (limit to 1 dataset)

python scripts/generate_jsonld.py --csv datasets.csv --limit 1

Process all datasets

python scripts/generate_jsonld.py --csv datasets.csv

Example Output

The script generates JSON-LD files in data/objects/summoned/generated/ with filenames like:

  • MERIT_DEM_956de6b6.jsonld

Each file contains Schema.org-compliant JSON-LD with:

  • Dataset metadata (name, description, creator, publisher)
  • Spatial coverage (properly formatted bounding boxes)
  • Keywords and categorization
  • Distribution information
  • License information
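
For reviewers unfamiliar with the target shape, here is a minimal sketch of a Schema.org Dataset with a GeoShape box, where the box is a single space-separated string. The function names and the example values are hypothetical, and normalize_box only fixes separators; it assumes the four values already arrive in south, west, north, east order, which may not match what the script's own correction step handles.

```python
import json


def normalize_box(raw):
    """Return the four box coordinates as one space-separated string.

    Accepts a string using commas and/or spaces, or a sequence of four numbers.
    Assumption: values are already ordered south, west, north, east.
    """
    if isinstance(raw, str):
        parts = raw.replace(",", " ").split()
    else:
        parts = [str(value) for value in raw]
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    for part in parts:
        float(part)  # raises ValueError on non-numeric input
    return " ".join(parts)


def build_dataset(name, description, url, box):
    """Assemble a small Schema.org Dataset dictionary (illustrative fields only)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "spatialCoverage": {
            "@type": "Place",
            "geo": {"@type": "GeoShape", "box": normalize_box(box)},
        },
    }


if __name__ == "__main__":
    doc = build_dataset(
        "Example dataset",              # placeholder values, not real metadata
        "Placeholder description.",
        "https://example.org/dataset",
        "-60, -180, 90, 180",
    )
    print(json.dumps(doc, indent=2))
```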

Workflow

  1. The script reads the CSV file and selects the datasets whose hasJSONLD? value is FALSE, #ERROR!, or empty
  2. For each dataset:
    • Fetches the webpage content
    • Uses AI to detect and extract dataset metadata
    • Uses AI to generate JSON-LD from extracted metadata
    • Validates and saves the JSON-LD file
  3. Provides summary of successful and failed datasets
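
Step 1 of the workflow above can be sketched as follows. Only the hasJSONLD? column name and its three trigger values come from this PR; the function name and the other column names used in the usage comment are assumptions, and the real script may compare values differently (for example, case-sensitively).

```python
import csv

# Values of hasJSONLD? that mean "no JSON-LD yet", compared after
# trimming whitespace and upper-casing.
NEEDS_JSONLD = {"", "FALSE", "#ERROR!"}


def rows_needing_jsonld(csv_lines, column="hasJSONLD?"):
    """Return the CSV rows (as dicts) that still need a JSON-LD file.

    csv_lines is an open text file or any iterable of CSV lines.
    """
    reader = csv.DictReader(csv_lines)
    return [
        row
        for row in reader
        if (row.get(column) or "").strip().upper() in NEEDS_JSONLD
    ]


# Usage, assuming datasets.csv sits in the working directory:
# with open("datasets.csv", newline="", encoding="utf-8") as f:
#     pending = rows_needing_jsonld(f)
```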

Error Handling

  • Timeouts: Automatically retries up to 2 times before skipping
  • Connection Errors: Detected and retried (connection refused, reset, etc.)
  • Server Errors: Detected and reported (500, 502, 503, 504)
  • Failed Datasets: Tracked and summarized at the end with reasons
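
One way to read the error-handling rules above is the loop below. It is a sketch under stated assumptions: process_one, ServerError, and process_all are hypothetical names, "retries up to 2 times" is interpreted as one initial attempt plus two retries, and server errors are reported without retrying.

```python
# Timeouts and connection problems are retried. ConnectionError covers
# ConnectionRefusedError and ConnectionResetError, which subclass it.
RETRYABLE = (TimeoutError, ConnectionError)


class ServerError(Exception):
    """Hypothetical exception for HTTP 500, 502, 503, or 504 responses."""

    def __init__(self, status_code):
        super().__init__(f"server returned HTTP {status_code}")
        self.status_code = status_code


def process_all(urls, process_one, max_retries=2):
    """Process each URL; return (succeeded, failed), where failed maps URL to reason."""
    succeeded, failed = [], {}
    for url in urls:
        retries = 0
        while True:
            try:
                process_one(url)
                succeeded.append(url)
                break
            except RETRYABLE as exc:
                retries += 1
                if retries > max_retries:
                    failed[url] = f"{type(exc).__name__} after {max_retries} retries"
                    break
            except ServerError as exc:
                failed[url] = str(exc)  # reported, not retried
                break
    return succeeded, failed
```

The failed dictionary is what a final summary would print: each skipped dataset together with the reason it was skipped.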

Notes

  • The script uses NRP LLM by default (no --ai-service flag needed)
  • API calls have a 6-minute timeout per request
  • Each API operation (detection or generation) makes a single attempt (no retries)
  • If a timeout or error occurs, the script skips to the next dataset
  • Processing all datasets may take several hours due to API call durations
  • Generated files are saved to data/objects/summoned/generated/ (gitignored)

@ywkim312 ywkim312 linked an issue Dec 16, 2025 that may be closed by this pull request
@ywkim312

ywkim312 commented Jan 5, 2026

JSON-LD Generation Workflow

New workflow:

  1. Read CSV → get URL (no AI)
  2. Pass URL to LLM → LLM browses/analyzes webpage (AI: Detection)
  3. LLM extracts metadata → returns JSON (same AI)
  4. Generate JSON-LD → save file (AI: Generation)

What changed:

  • Before: Download HTML → Extract text → Send text to LLM
  • Now: Pass URL → LLM browses/analyzes → LLM extracts metadata

LLM Services: Gemini (default - uses URL Context Tool to fetch webpage content), NRP, OpenAI, Anthropic

Two LLM calls per dataset: Detection + Generation (same service)
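
The two-call flow can be sketched independently of any particular service. In this illustration llm is a hypothetical callable that takes a prompt string and returns the reply text; how the real script talks to Gemini, NRP, OpenAI, or Anthropic, and how it builds its prompts, is not shown here.

```python
import json


def generate_jsonld(url, llm, detection_prompt, generation_prompt):
    """Run detection, then generation, through the same llm callable.

    Hypothetical helper: llm(prompt: str) -> str.
    """
    # Call 1 (Detection): the service is expected to fetch the URL itself
    # and reply with the extracted metadata as JSON.
    metadata_text = llm(f"{detection_prompt}\n\nURL: {url}")
    metadata = json.loads(metadata_text)  # fails loudly if the reply is not JSON

    # Call 2 (Generation): turn the extracted metadata into JSON-LD.
    jsonld_text = llm(f"{generation_prompt}\n\n{json.dumps(metadata)}")
    return json.loads(jsonld_text)
```

Passing the service in as a callable keeps the pipeline identical across providers and makes it easy to test with a stub that returns canned replies.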

@ywkim312

GitHub Actions Workflows - Rewrite Summary

Rewrote both GitHub Actions workflows to match this repository's structure. The original workflows were copied from earthcube/GeoCODES-Metadata and contained incorrect paths, repository references, and syntax errors.

sitemap_resources.yaml

Fixed:

  • Syntax error on line 99
  • Repository references: earthcube/GeoCODES-Metadata → earthcube/communityCollections
  • Updated paths to match actual structure (data/objects/summoned/, collection/)
  • Parameter name: excluded-paths → exclude-paths
  • Updated to checkout@v4 and set fetch-depth: 0
  • Removed references to non-existent files

Generates 4 sitemaps:

  1. Generated datasets: data/objects/summoned/generated/
  2. Summoned datasets: data/objects/summoned/ (excluding generated)
  3. Collection files: collection/
  4. Combined: data/

Triggers on push to main.

validate_with_dataset_schema.yaml

Fixed:

  • Added missing checkout steps
  • Updated paths to match repository structure
  • Replaced walbo/validate-json with existing scripts/validate_jsonld.py
  • Removed references to non-existent schema file
  • Updated to checkout@v4

Three validation jobs:

  1. validate-jsonld-generated: Validates data/objects/summoned/generated/
  2. validate-jsonld-summoned: Validates data/objects/summoned/ (excluding generated)
  3. validate-jsonld-all: Validates all JSON-LD files in data/
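
To illustrate how the three jobs differ only in which files they visit, here is a rough sketch of file discovery with an optional excluded subdirectory, plus a very basic check. It is not the contents of scripts/validate_jsonld.py: the function names are hypothetical, and the check (parses as JSON, has @context and @type) is an assumed minimum rather than the actual validation rules.

```python
import json
from pathlib import Path


def find_jsonld(root, exclude=None):
    """Return all .jsonld files under root, skipping the exclude subdirectory if given."""
    root = Path(root)
    skip = (root / exclude).resolve() if exclude else None
    files = []
    for path in sorted(root.rglob("*.jsonld")):
        if skip is not None and skip in path.resolve().parents:
            continue
        files.append(path)
    return files


def check_file(path):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        doc = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot parse: {exc}"]
    if not isinstance(doc, dict):
        return ["top level is not a JSON object"]
    return [f"missing {key}" for key in ("@context", "@type") if key not in doc]


# Usage mirroring the three jobs (paths taken from the list above):
# find_jsonld("data/objects/summoned/generated")
# find_jsonld("data/objects/summoned", exclude="generated")
# find_jsonld("data")
```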

Triggers on push and pull requests (excluding gh-pages).

Both workflows are production-ready and properly configured for this repository.

@ywkim312 ywkim312 marked this pull request as ready for review February 16, 2026 18:12
@ywkim312 ywkim312 self-assigned this Feb 16, 2026
@lmarini lmarini self-requested a review June 1, 2026 19:10
@lmarini lmarini merged commit ffd6319 into master Jun 1, 2026
7 checks passed
@ywkim312 ywkim312 deleted the 3-generate-jsonld-datasets-from-websites branch June 1, 2026 19:42
