Merged
59 commits
9f7adef
Add JSON-LD generation script with NRP AI support
ywkim312 Dec 16, 2025
62460b5
removed unnecessary comment
ywkim312 Dec 16, 2025
19891e8
Fixed the error caused by the curly braces in the prompt template
ywkim312 Dec 16, 2025
8c18bc4
Update for correct bounding box generation
ywkim312 Dec 16, 2025
fb1d201
Fixed another curly braces errors
ywkim312 Dec 16, 2025
c58123f
added error handling for time out
ywkim312 Dec 16, 2025
a31f722
updated error handling
ywkim312 Dec 16, 2025
10b6ed4
updated error handling
ywkim312 Dec 16, 2025
08b85ee
Updated API for the different situation
ywkim312 Dec 16, 2025
6fa777f
Changed the time out to 6min
ywkim312 Dec 16, 2025
df0f822
added the generated json-ld for MERIT_DEM
ywkim312 Dec 16, 2025
2cf0321
Updated to use gemini
ywkim312 Jan 2, 2026
6a3716a
updated the code for using gemini with 3 retries
ywkim312 Jan 5, 2026
338a3f1
Update the code to directly fetch the url content
ywkim312 Jan 5, 2026
bf21b77
Added first 7 sites
ywkim312 Jan 20, 2026
037dbfe
added sitemaps
ywkim312 Jan 26, 2026
8751bfd
Added github action for generating sitemaps and validation
ywkim312 Jan 26, 2026
2ab87ea
fixed validation github action
ywkim312 Jan 26, 2026
3919f2e
make a single site map that covers all
ywkim312 Feb 9, 2026
84ca70a
changed main to master in the sitemap's url
ywkim312 Feb 9, 2026
39df567
Update sitemap
valentinedwv Feb 9, 2026
75b9551
Update sitemap
valentinedwv Feb 9, 2026
9f5a179
Update sitemap
valentinedwv Feb 9, 2026
88cb579
Update path to generated
valentinedwv Feb 9, 2026
b2bb0ad
Update path to generated
valentinedwv Feb 9, 2026
0f4da62
Update path to generated
valentinedwv Feb 9, 2026
60fef1e
Update path to generated. fix base url
valentinedwv Feb 9, 2026
977e2a1
Update path to generated. fix base url
valentinedwv Feb 9, 2026
1b06956
Update path to generated. fix base url
valentinedwv Feb 9, 2026
540d9a9
Update path to generated. fix base url
valentinedwv Feb 9, 2026
7072ec5
fix validation github action
ywkim312 Feb 10, 2026
a5face0
fix validate summoned JSON-LD
ywkim312 Feb 10, 2026
3825d12
fix validation all file github action
ywkim312 Feb 10, 2026
483bb6c
fix keyword to become json array
ywkim312 Feb 10, 2026
ada05f6
Fixes encodingFormat as string array
ywkim312 Feb 10, 2026
a1be354
Added the entry that saying the document was created by AI
ywkim312 Feb 16, 2026
674cadd
Added the Json-LD's of the rest of the sites
ywkim312 Feb 16, 2026
14a4df9
Updated prompt
ywkim312 Feb 16, 2026
be48884
added the validation script
ywkim312 Feb 16, 2026
7cbafe1
updated encodingFormat to string array
ywkim312 Feb 16, 2026
f921268
updated .gitignore
ywkim312 Feb 16, 2026
d9f688d
removed unnecessary files
ywkim312 Feb 16, 2026
d1c76e5
retrieved deleted file
ywkim312 Feb 16, 2026
7a92a6d
Retrieved mkdocs.yaml
ywkim312 Feb 16, 2026
6e340f5
updated README
ywkim312 Feb 16, 2026
0e0a912
test sitemap generation
ywkim312 Feb 16, 2026
952b293
chore: update JSON-LD sitemaps
github-actions[bot] Feb 16, 2026
0b6faf5
site map test try 2
ywkim312 Feb 16, 2026
71ad6f3
chore: update JSON-LD sitemaps
github-actions[bot] Feb 16, 2026
a1cfd03
Update sitemap workflow
ywkim312 Feb 16, 2026
389962d
chore: update JSON-LD sitemaps
github-actions[bot] Feb 16, 2026
d51bdb1
changed main to master
ywkim312 Feb 16, 2026
3c94014
chore: update JSON-LD sitemaps
github-actions[bot] Feb 16, 2026
c46ea20
Removed test branch from the github action
ywkim312 Feb 16, 2026
8bedbd4
Remove prompt.txt files from generated folders
ywkim312 Feb 16, 2026
4eb7e27
Validate generated JSON-LD metadata
jaywt May 20, 2026
d7a9d37
Add variable-level coverage metadata
jaywt May 20, 2026
3d62951
Update generated JSON-LD metadata
jaywt May 20, 2026
ffbd9c0
Merge pull request #7 from jaywt/jsonld-metadata-validation
valentinedwv May 28, 2026
28 changes: 28 additions & 0 deletions .env.example
@@ -0,0 +1,28 @@
# API Keys for JSON-LD Generation Script
# Copy this file to .env and fill in your actual API keys
# The .env file is gitignored and will not be committed

# Google Gemini API Key (default service - uses URL Context Tool to browse URLs directly)
# Get your key from: https://aistudio.google.com/apikey
# Free tier available with .edu email
# Default model: gemini-2.0-flash
GEMINI_API_KEY=your-gemini-api-key-here

# NRP (National Research Platform) API Key
# Get your key from: https://nrp.ai/documentation/userdocs/ai/llm-managed/
# Available NRP models: qwen3, llama3-sdsc, gpt-oss, gorilla, olmo, gemma3, kimi, etc.
# Default model: qwen3
# Note: NRP fetches HTML and extracts text before sending to AI
NRP_API_KEY=your-nrp-api-key-here

# OpenAI/ChatGPT API Key
# Get your key from: https://platform.openai.com/api-keys
# Default model: gpt-4o
# Note: OpenAI fetches HTML and extracts text before sending to AI
OPENAI_API_KEY=your-openai-api-key-here

# Anthropic (Claude) API Key
# Get your key from: https://console.anthropic.com/
# Default model: claude-3-5-sonnet-20241022
# Note: Anthropic fetches HTML and extracts text before sending to AI
ANTHROPIC_API_KEY=your-anthropic-api-key-here
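
The generation script that consumes these keys is not part of this diff excerpt, so how it actually reads them is unknown. As a rough sketch only, the following standard-library helper shows one way a script could parse a `.env` file of this shape and work out which services have a real key rather than the placeholder. The names `SERVICE_KEYS`, `parse_env`, and `configured_services` are hypothetical, and the "ends with `-here`" placeholder test is an assumption based on the sample values above.

```python
# Hypothetical helper, not the PR's code: it only illustrates reading the
# variables that .env.example defines.
SERVICE_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "nrp": "NRP_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def configured_services(env: dict) -> list:
    """Return service names whose key is set and is not the sample placeholder."""
    return [
        name
        for name, var in SERVICE_KEYS.items()
        if env.get(var) and not env[var].endswith("-here")  # assumed placeholder pattern
    ]
```

Keeping Gemini first in `SERVICE_KEYS` mirrors the comment that it is the default service, so a caller could take the first configured entry as its default.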
91 changes: 91 additions & 0 deletions .github/workflows/sitemap_resources.yaml
@@ -0,0 +1,91 @@
name: Generate XML sitemap for JSON-LD resources

on:
  push:
    branches:
      - main
      - master

permissions:
  contents: write

jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate sitemap for JSON-LD files
    steps:
      - uses: actions/setup-python@v2
        with:
          python-version: 3.x
      - run: pip install mkdocs
      - run: pip install mkdocs-schema-reader
      - name: Checkout the repo
        uses: actions/checkout@v6
        with:
          fetch-depth: 0

      # Generate single sitemap for all JSON-LD files in data and collection directories
      - name: Generate sitemap for all JSON-LD resources
        id: sitemap_all
        uses: cicirello/generate-sitemap@v1
        with:
          base-url-path: https://raw.githubusercontent.com/earthcube/communityCollections/refs/heads/${{ github.ref_name }}/data/objects/summoned
          # https://raw.githubusercontent.com/earthcube/communityCollections/refs/heads/gh-pages
          path-to-root: data/objects/summoned
          include-html: false
          include-pdf: false
          additional-extensions: jsonld json xml
          exclude-paths: .git .github docs scripts crawler prompts .vscode
      - name: Output sitemap stats
        run: |
          echo "sitemap-path = ${{ steps.sitemap_all.outputs.sitemap-path }}"
          echo "url-count = ${{ steps.sitemap_all.outputs.url-count }}"
          echo "excluded-count = ${{ steps.sitemap_all.outputs.excluded-count }}"

      - name: Generate sitemap for just AI Generated JSON-LD resources
        id: sitemap_generated
        uses: cicirello/generate-sitemap@v1
        with:
          base-url-path: https://raw.githubusercontent.com/earthcube/communityCollections/refs/heads/${{ github.ref_name }}/data/objects/summoned/generated
          path-to-root: data/objects/summoned/generated
          include-html: false
          include-pdf: false
          additional-extensions: jsonld json xml
          exclude-paths:
            .git .github docs scripts crawler prompts .vscode

      - name: Commit and push sitemaps to branch
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          ls -la data/objects/summoned/sitemap.xml data/objects/summoned/generated/sitemap.xml 2>/dev/null || true
          git add data/objects/summoned/sitemap.xml data/objects/summoned/generated/sitemap.xml
          git status
          if ! git diff --staged --quiet; then
            git commit -m "chore: update JSON-LD sitemaps"
            git fetch origin ${{ github.ref_name }}
            git rebase origin/${{ github.ref_name }}
            git push origin HEAD:${{ github.ref_name }}
          fi

      ### WE MIGHT WANT TO DO INDIVIDUAL SITEMAPS
      # - name: Generate sitemap for all JSON-LD resources
      #   id: sitemap_glim
      #   uses: cicirello/generate-sitemap@v1
      #   with:
      #     base-url-path: https://raw.githubusercontent.com/earthcube/communityCollections/refs/heads/${{ github.ref_name }}/data/objects/summoned/glim
      #     path-to-root: data/objects/summoned/glim
      #     include-html: false
      #     include-pdf: false
      #     additional-extensions: jsonld json xml
      #     exclude-paths:
      #       .git .github docs scripts crawler prompts .vscode

      ####### MKDOCS
      - run: mkdocs build --config-file mkdocs.yaml
      - name: push to gh pages
        uses: JamesIves/github-pages-deploy-action@v4
        with:
          branch: gh-pages
          folder: .
          clean: false
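
To make the `base-url-path` / `path-to-root` pairing in the workflow above concrete, here is a minimal sketch of the mapping from files under a root directory to sitemap URLs. It is not the implementation of `cicirello/generate-sitemap`; the function names `sitemap_urls` and `build_sitemap` are hypothetical, and details such as exclusions and last-modified dates are deliberately left out.

```python
from pathlib import Path
from xml.sax.saxutils import escape


def sitemap_urls(root, base_url, extensions=("jsonld", "json", "xml")):
    """Map every matching file under `root` to `base_url` + its relative path."""
    root = Path(root)
    urls = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lstrip(".") in extensions:
            relative = path.relative_to(root).as_posix()
            urls.append(f"{base_url.rstrip('/')}/{relative}")
    return urls


def build_sitemap(urls):
    """Render the URLs as a sitemaps.org <urlset> document."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
```

Under this sketch, a file at `data/objects/summoned/generated/x.jsonld` with `root` set to `data/objects/summoned` would appear as `<base-url>/generated/x.jsonld`, which is why the base URL in each step ends in the same directory as its `path-to-root`.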
148 changes: 148 additions & 0 deletions .github/workflows/validate_with_dataset_schema.yaml
@@ -0,0 +1,148 @@
name: Validate JSON-LD files with Schema.org Dataset schema

on:
  push:
    branches-ignore: [ 'gh-pages' ]
  pull_request:
    branches-ignore: [ 'gh-pages' ]

jobs:
  validate-jsonld-generated:
    runs-on: ubuntu-latest
    name: Validate generated JSON-LD files
    steps:
      - name: Checkout the repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Find and validate generated JSON-LD files
        run: |
          python scripts/validate_jsonld_batch.py data/objects/summoned/generated

  validate-jsonld-summoned:
    runs-on: ubuntu-latest
    name: Validate summoned JSON-LD files
    steps:
      - name: Checkout the repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Find and validate summoned JSON-LD files
        run: |
          python - << 'RELAXED_VALIDATE'
          import json, sys
          from pathlib import Path
          dir_ = Path("data/objects/summoned")
          if not dir_.exists():
              print("Directory not found, skipping."); sys.exit(0)
          files = [f for f in dir_.rglob("*.jsonld") if "generated" not in str(f)]
          if not files:
              print("No JSON-LD files found."); sys.exit(0)
          errs = []
          for f in sorted(files):
              try:
                  with open(f) as fp: data = json.load(fp)
              except Exception as e:
                  errs.append(f"{f}: {e}"); continue
              for k in ["@context", "@type", "name"]:
                  if k not in data: errs.append(f"{f}: missing {k}")
              if "spatialCoverage" in data and isinstance(data["spatialCoverage"], dict):
                  geo = data["spatialCoverage"].get("geo", {})
                  if isinstance(geo, dict) and "box" in geo and isinstance(geo["box"], str):
                      parts = geo["box"].strip().split()
                      if len(parts) == 4:
                          try:
                              a,b,c,d = float(parts[0]),float(parts[1]),float(parts[2]),float(parts[3])
                              if (-90<=b<=90 and -90<=d<=90): west,south,east,north = a,b,c,d
                              else: south,west,north,east = a,b,c,d
                              if not (-90<=south<=90 and -90<=north<=90 and -180<=west<=180 and -180<=east<=180):
                                  errs.append(f"{f}: box out of range")
                          except ValueError: errs.append(f"{f}: invalid box numbers")
                      elif len(parts) == 2:
                          try:
                              ws, en = parts[0].split(","), parts[1].split(",")
                              if len(ws)==2 and len(en)==2:
                                  west,south = float(ws[0]),float(ws[1])
                                  east,north = float(en[0]),float(en[1])
                                  if not (-90<=south<=90 and -90<=north<=90 and -180<=west<=180 and -180<=east<=180):
                                      errs.append(f"{f}: box out of range")
                          except ValueError: errs.append(f"{f}: invalid box format")
                      else: errs.append(f"{f}: box expected 2 or 4 numbers")
          if errs:
              for e in errs: print(e)
              sys.exit(1)
          print(f"All {len(files)} summoned JSON-LD file(s) validated.")
          RELAXED_VALIDATE

  validate-jsonld-all:
    runs-on: ubuntu-latest
    name: Validate all JSON-LD files
    steps:
      - name: Checkout the repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Find and validate all JSON-LD files
        run: |
          python - << 'RELAXED_VALIDATE_ALL'
          import json, sys
          from pathlib import Path
          dir_ = Path("data")
          if not dir_.exists():
              print("Directory not found, skipping."); sys.exit(0)
          files = list(dir_.rglob("*.jsonld"))
          if not files:
              print("No JSON-LD files found."); sys.exit(0)
          errs = []
          for f in sorted(files):
              try:
                  with open(f) as fp: data = json.load(fp)
              except Exception as e:
                  errs.append(f"{f}: {e}"); continue
              for k in ["@context", "@type", "name"]:
                  if k not in data: errs.append(f"{f}: missing {k}")
              if "spatialCoverage" in data and isinstance(data["spatialCoverage"], dict):
                  geo = data["spatialCoverage"].get("geo", {})
                  if isinstance(geo, dict) and "box" in geo and isinstance(geo["box"], str):
                      parts = geo["box"].strip().split()
                      if len(parts) == 4:
                          try:
                              a,b,c,d = float(parts[0]),float(parts[1]),float(parts[2]),float(parts[3])
                              if (-90<=b<=90 and -90<=d<=90): west,south,east,north = a,b,c,d
                              else: south,west,north,east = a,b,c,d
                              if not (-90<=south<=90 and -90<=north<=90 and -180<=west<=180 and -180<=east<=180):
                                  errs.append(f"{f}: box out of range")
                          except ValueError: errs.append(f"{f}: invalid box numbers")
                      elif len(parts) == 2:
                          try:
                              ws, en = parts[0].split(","), parts[1].split(",")
                              if len(ws)==2 and len(en)==2:
                                  west,south = float(ws[0]),float(ws[1])
                                  east,north = float(en[0]),float(en[1])
                                  if not (-90<=south<=90 and -90<=north<=90 and -180<=west<=180 and -180<=east<=180):
                                      errs.append(f"{f}: box out of range")
                          except ValueError: errs.append(f"{f}: invalid box format")
                      else: errs.append(f"{f}: box expected 2 or 4 numbers")
          if errs:
              for e in errs: print(e)
              sys.exit(1)
          print(f"All {len(files)} JSON-LD file(s) validated.")
          RELAXED_VALIDATE_ALL
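
The two inline scripts above repeat the same bounding-box check. As a sketch of how that logic could be factored out and unit-tested, the function below mirrors the workflow's rules for a `spatialCoverage.geo.box` string. The name `check_box` and its return convention (an error message, or `None` when the box passes) are assumptions for illustration; the contents of `scripts/validate_jsonld_batch.py` are not shown in this diff and may differ.

```python
from typing import Optional


def _in_range(south, west, north, east):
    """Latitudes must lie in [-90, 90], longitudes in [-180, 180]."""
    return (-90 <= south <= 90 and -90 <= north <= 90
            and -180 <= west <= 180 and -180 <= east <= 180)


def check_box(box: str) -> Optional[str]:
    """Return an error message for a box string, or None if it passes."""
    parts = box.strip().split()
    if len(parts) == 4:
        try:
            a, b, c, d = (float(p) for p in parts)
        except ValueError:
            return "invalid box numbers"
        # Same heuristic as the workflow: if the 2nd and 4th values fit a
        # latitude, read the box as "west south east north"; otherwise read
        # it as "south west north east".
        if -90 <= b <= 90 and -90 <= d <= 90:
            west, south, east, north = a, b, c, d
        else:
            south, west, north, east = a, b, c, d
        return None if _in_range(south, west, north, east) else "box out of range"
    if len(parts) == 2:
        ws, en = parts[0].split(","), parts[1].split(",")
        if len(ws) != 2 or len(en) != 2:
            return None  # the workflow also lets this case through unreported
        try:
            west, south = float(ws[0]), float(ws[1])
            east, north = float(en[0]), float(en[1])
        except ValueError:
            return "invalid box format"
        return None if _in_range(south, west, north, east) else "box out of range"
    return "box expected 2 or 4 numbers"
```

Because the ordering heuristic accepts either axis order, a box such as `10 20 30 40` passes under the first interpretation even if the author meant the second; a stricter validator would need the order fixed by convention rather than inferred.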
20 changes: 20 additions & 0 deletions .gitignore
@@ -1 +1,21 @@
.idea

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/

# Environment variables
.env
.env.local

# CSV data files (downloaded from Google Sheets)
datasets.csv

# Generated JSON-LD files
#data/objects/summoned/generated/
4 changes: 4 additions & 0 deletions README.md
@@ -5,3 +5,7 @@

Documentation, files and code related to the exposure of resource on the web for indexing.

Sitemaps (generated by [sitemap_resources.yaml](.github/workflows/sitemap_resources.yaml) on push to `master` or `main`):

- **All JSON-LD under data/objects/summoned:** [GitHub Pages](https://earthcube.github.io/communityCollections/data/objects/summoned/sitemap.xml) · [Raw (e.g. master)](https://raw.githubusercontent.com/earthcube/communityCollections/master/data/objects/summoned/sitemap.xml)
- **AI-generated JSON-LD only:** [GitHub Pages](https://earthcube.github.io/communityCollections/data/objects/summoned/generated/sitemap.xml) · [Raw (e.g. master)](https://raw.githubusercontent.com/earthcube/communityCollections/master/data/objects/summoned/generated/sitemap.xml)
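
For readers who want to consume one of these sitemaps, the following is a minimal sketch of pulling the JSON-LD URLs out of a sitemap document that has already been downloaded. The function name `jsonld_urls` is hypothetical, and the snippet assumes the file follows the standard sitemaps.org `<urlset>` layout.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org <urlset> documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def jsonld_urls(sitemap_xml: str) -> list:
    """Return the <loc> values that point at .jsonld files."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", NS):
        text = (loc.text or "").strip()
        if text.endswith(".jsonld"):
            urls.append(text)
    return urls
```

Filtering on the `.jsonld` suffix matters here because the workflow also lists `json` and `xml` files in the sitemap.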