Feat/geodata services#18
Open
jonathanhhb wants to merge 34 commits into
Introduces a services/ directory for per-datasource REST microservices.
Each service: FastAPI app, Dockerfile, requirements.txt, test_client.py,
managed by a top-level services/Makefile.
unwpp-service (port 8100):
- Pre-warms on startup: downloads all 4 WPP 2024 CSV.gz files to CACHE_DIR
(mounted volume) if not already cached, then loads into memory
- GET /health — liveness + loaded dataset names
- GET /demographics/{ISO}?start_year=&end_year= — CBR/CDR, age distribution,
cumulative life-table deaths; 404 on unknown ISO, 400 on inverted range
- test_client.py: 27 checks covering happy path, lowercase ISO, single year,
future years (2024-file branch), unknown ISO, and inverted range
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET /health
GET /boundaries/{ISO}/{level} → GeoJSON FeatureCollection
Downloads gadm41_{ISO}_shp.zip on first request for a country (temp-file
write to avoid partial-download cache poisoning), caches under CACHE_DIR.
All admin levels for a country are served from the single cached zip via
pyogrio (bundled GDAL — no system GDAL needed in the Docker image).
Properties per feature: nodeid (sequential int), name, gid.
Returns 404 for unknown ISO or unavailable level, 400 for level outside 0–5.
test_client.py: 16 checks — structure, level 0/1/2, cache speed, error cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
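The temp-file write mentioned above can be sketched roughly as follows. This is a minimal illustration, not the service's code: `atomic_write` is a hypothetical helper, and it uses `tempfile.mkstemp` plus `os.replace`, which is one common way to keep a partial download from ever appearing under the final cache path.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def atomic_write(dest: Path, chunks: Iterable[bytes]) -> Path:
    """Write chunks to a temp file next to dest, then rename it into place.

    Hypothetical helper: a reader checking `dest.exists()` never sees a
    half-written file, because the rename happens only after the last chunk.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as dest, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.replace(tmp_name, dest)
    except BaseException:
        # Download failed or was interrupted: remove the partial temp file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

In a real service the `chunks` iterable would come from a streaming HTTP response; here it can be any iterable of bytes.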
Single-file Leaflet map — enter ISO + admin level, fetches from gadm-service, renders GeoJSON polygons with hover highlight and click-to-inspect properties. Open directly in a browser; no build step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gin clients) Without this, browsers block fetch() calls from file:// pages or any origin that differs from the service host. All services now send Access-Control-Allow-Origin: * for GET requests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET /health
GET /boundaries/{ISO}/{level} → GeoJSON FeatureCollection
Same endpoint contract as gadm-service. Downloads per-(ISO, level) zip from
geoBoundaries v6.0.0 on GitHub; properties: nodeid, name, gid (from shapeID).
CORS enabled. test_client.py: 14 checks — structure, cache speed, error cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET /health (reports gdb_ready: true/false during startup)
GET /boundaries/{ISO}/{level} → GeoJSON FeatureCollection
Downloads the single ~1-2 GB UNOCHA global GDB zip from HDX on startup,
extracts it, then serves filtered per-country/level slices. First request
for an (ISO, level) pair reads the GDB layer and caches the GeoDataFrame
in memory; all subsequent requests are served from the in-memory cache.
Returns 503 if startup download/extraction is still in progress.
Supports admin levels 0-3 (UNOCHA coverage). CORS enabled.
test_client.py: --wait flag polls /health until gdb_ready before testing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
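A client-side wait loop in the spirit of the --wait flag might look like the sketch below. The function name, the timeout and interval defaults, and the injectable `get_health` callable are all assumptions made for illustration; only the `gdb_ready` field comes from the commit message.

```python
import time
from typing import Callable


def wait_until_ready(
    get_health: Callable[[], dict],
    timeout_s: float = 1800.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_health() until it reports gdb_ready, or give up.

    get_health is expected to return the parsed JSON body of GET /health.
    """
    waited = 0.0
    while True:
        try:
            if get_health().get("gdb_ready") is True:
                return True
        except OSError:
            pass  # service not accepting connections yet; keep polling
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays.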
POST /prewarm/{iso} downloads WorldPop UN-adjusted raster on demand;
POST /aggregate/{iso} accepts a GeoJSON FeatureCollection and returns
{nodeid: population} by summing raster pixels per polygon via rasterio.
Requires libexpat1 in Docker (rasterio shared lib dependency).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- client.html: Leaflet choropleth that chains shapes service → worldpop aggregate; mode toggle for "Boundaries only" (no worldpop call), useful for testing shapes independently
- main.py: per-(iso,year) threading.Lock prevents duplicate concurrent raster downloads; updated docstring
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
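The per-(iso, year) locking idea can be sketched as below. This is an illustration under assumptions, not the code in main.py: `lock_for`, `ensure_raster`, the dict-based cache, and the `download` callable are hypothetical names.

```python
import threading
from collections import defaultdict

# Guards the lock table itself, so two threads cannot create
# two different locks for the same key.
_locks_guard = threading.Lock()
_locks = defaultdict(threading.Lock)


def lock_for(iso: str, year: int) -> threading.Lock:
    """Return the single lock shared by all requests for (iso, year)."""
    with _locks_guard:
        return _locks[(iso.upper(), year)]


def ensure_raster(iso: str, year: int, cache: dict, download):
    """Download at most once per (iso, year), even under concurrent calls."""
    key = (iso.upper(), year)
    with lock_for(iso, year):
        # Re-check inside the lock: another thread may have finished
        # the download while this one was waiting.
        if key not in cache:
            cache[key] = download(iso, year)
    return cache[key]
```

Requests for different countries or years take different locks, so they do not block each other.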
- prewarm_countries.py: parallel AKS init-job script; default list of ~50 humanitarian ISOs; --countries FILE, --workers, --year flags
- client.html: clarify status message during first-run raster download
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Calls gadm/geoboundaries/unocha, worldpop, and unwpp services to produce the same data files as laser-init (gpkg, cxr.csv, age_dist.csv, life_exp.csv, config.yaml, provenance.json). Validated for ETH admin2 2010-2020 with GADM shapes: 79 features, 87.5M total population. Model scripts and validation plots still require the laser-init Load phase. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After producing data files via services, --emit-scripts imports AbmLoader and write_plots from laser.init (falling back to ../src if not installed) to emit config.yaml, seir.py, plot.py, PNGs, and report.pdf. Full workflow for ETH admin2 2010-2020 with GADM validated end-to-end.
User runs:
    python generate.py ETH 2 2010 2020 --shape-source gadm --emit-scripts
Then:
    cd ETH/2010 && python seir.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove sys.path fallback — laser.init is now importable directly. install: pip install -e /path/to/laser-init Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WorldPop pixel sums are float; LASER is agent-based and requires exact integer counts. np.repeat silently truncates floats causing agent count mismatch and broadcast errors in the model. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
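One way to produce integer counts from the float sums is to round each value before it reaches the model, as in this minimal sketch. The helper name and the choice of rounding to nearest (rather than floor or ceiling) are assumptions; the commit message does not say which conversion the fix uses.

```python
def to_integer_counts(pop_by_node: dict) -> dict:
    """Convert {nodeid: float population} to {nodeid: int population}.

    Rounds to the nearest integer so downstream code that needs one
    agent per person receives exact whole-number counts.
    """
    return {nodeid: int(round(value)) for nodeid, value in pop_by_node.items()}


if __name__ == "__main__":
    counts = to_integer_counts({"0": 1234.7, "1": 0.2})
    # Every value is now an int, so the total agent count is well defined.
    print(counts, sum(counts.values()))
```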
Adds services/AKS/ with:
- geodata-services.yaml: PVCs (2–100 Gi), Deployments, and LoadBalancer Services
for all 5 services in the existing laser-ai namespace
- worldpop-prewarm-job.yaml: one-shot Job to pre-warm raster cache post-deploy
- DEPLOYMENT.md: push → deploy → prewarm → get IPs → use with generate.py
Makefile: push-{service}/push-all-geodata and k8s-deploy/k8s-status/k8s-ips/
k8s-prewarm/k8s-delete/k8s-delete-all targets with REGISTRY/TAG/KUBECONFIG vars.
worldpop Dockerfile: include prewarm_countries.py so the Job reuses the same image.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Azure LoadBalancer has a 4-min TCP idle timeout. A slow raster download (NGA = 421 MB, ~8 min) holds the HTTP connection open with no data flowing, triggering a reset. The download continues server-side; retrying hits the cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace localhost port-based URLs with AKS LoadBalancer IPs. Source selector now swaps full URLs (each service has its own IP on port 80).
Defaults:
- shapes (GADM): http://48.200.52.126
- shapes (geoBoundaries): http://4.149.210.205
- shapes (UNOCHA): http://40.91.121.206
- worldpop: http://4.155.140.158
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
prewarm_countries.py: retry up to 15× (60s apart) on connection reset — same Azure LB 4-min idle timeout that affects aggregate requests. All print() calls use flush=True for immediate log visibility. Job: python -u for unbuffered stdout in Kubernetes logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
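The retry behaviour described here (up to 15 attempts, 60 s apart, on connection reset) could be sketched as follows. `call_with_retry` is a hypothetical helper, and catching the built-in `ConnectionError` is an assumption: a client built on an HTTP library would more likely catch that library's own transport exceptions.

```python
import time
from typing import Callable


def call_with_retry(
    fn: Callable[[], object],
    attempts: int = 15,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call fn(); on a dropped connection, wait and try again.

    The server keeps downloading after the load balancer resets the
    connection, so a later attempt is expected to hit the cache.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError as exc:  # includes ConnectionResetError
            last_exc = exc
            print(f"attempt {attempt}/{attempts} failed: {exc!r}", flush=True)
            if attempt < attempts:
                sleep(delay_s)
    raise last_exc
```

The `flush=True` on the progress line mirrors the commit's point about immediate log visibility.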
Replace per-polygon rasterio.mask.mask loop with a single rasterio.features.rasterize pass over the full raster array. For COD/26 polygons ~45s → ~3s. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
apiFetch() wrapper shows which step failed (Shapes / WorldPop), the HTTP status code, first 300 chars of response body, and the exact URL called. Network errors show the browser's error message instead of generic "Load Failed". Status bar turns red on error and uses white-space:pre-wrap for multi-line output. Progress messages show checkmarks as steps complete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Vectorized approach (v0.5.3a) read the full raster as float64 + a same-size int32 burned array, peaking at ~6 GB for COD and OOM-killing the pod (4 restarts). New approach: read raster once as float32 (no upcast), then for each polygon extract its bounding-box window from the in-memory array and apply rasterio.features.geometry_mask to that small slice. One disk read, no country-sized secondary array. Memory: ~1.9 GB for COD (float32 raster only) vs ~6 GB previously. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log-normalised scale bunched NGA/MWI into orange/red because most regions cluster in the upper half of the log range. Quantile classification maps rank → colour stop, guaranteeing the full yellow→red range is used regardless of the population distribution. Legend now notes "quantile classification". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
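Rank-based (quantile) classification can be illustrated with the short sketch below. The client itself is a browser page, so this Python version only shows the idea; the function name and the tie handling (ties receive consecutive ranks) are assumptions.

```python
def quantile_colour_index(values: list, n_stops: int) -> list:
    """Map each value to a colour-stop index by rank, not magnitude.

    With n values and n_stops colours, the lowest ranks share stop 0
    and the highest share stop n_stops - 1, so every stop is used
    whenever there are at least n_stops values.
    """
    if n_stops < 1:
        raise ValueError("n_stops must be >= 1")
    n = len(values)
    # Indices of values in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0] * n
    for rank, i in enumerate(order):
        out[i] = rank * n_stops // n
    return out
```

A heavily skewed input such as `[10, 1000, 100, 1_000_000]` with four stops still spreads across all four colours, which is the property the commit relies on.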
Some GADM country ZIPs have no CRS metadata in the .shp file. Calling
.to_crs("EPSG:4326") on a naive GeoDataFrame raises ValueError. Fix: set
EPSG:4326 first if gdf.crs is None (GADM is always WGS-84).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CAN's WorldPop raster is ~107k×50k pixels = ~21 GB uncompressed float32. Reading it all at once OOM-kills the pod. Strategy: if uncompressed size < 2 GB, read whole raster once (fast path, used for NGA/COD/ETH etc). Otherwise use per-polygon src.read(window=win) from disk — memory scales with the largest polygon window, not the full raster. Logs which mode is selected per request. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
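The size-based switch between the two read strategies can be sketched as below. The names are hypothetical, and treating "2 GB" as 2 × 1024³ bytes is an assumption; the commit does not say whether the threshold is binary or decimal.

```python
# Assumption: the "2 GB" threshold is interpreted as 2 GiB.
FAST_PATH_LIMIT_BYTES = 2 * 1024**3


def uncompressed_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """In-memory size of a single-band raster (float32 = 4 bytes per pixel)."""
    return width * height * bytes_per_pixel


def choose_read_mode(width: int, height: int) -> str:
    """Return "full" to read the raster once, or "windowed" to read per polygon."""
    if uncompressed_bytes(width, height) < FAST_PATH_LIMIT_BYTES:
        return "full"
    return "windowed"


if __name__ == "__main__":
    # Approximate dimensions quoted in the commit for CAN.
    print(uncompressed_bytes(107_000, 50_000) / 1e9, "GB")
    print(choose_read_mode(107_000, 50_000))
```

For the CAN dimensions quoted above, 107,000 × 50,000 × 4 bytes is about 21.4 GB, which lands in the windowed branch.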
worldpop/main.py: HEAD request before any download; if Content-Length exceeds 600 MB, return HTTP 422 with a clear message directing the user to POST /prewarm first. NGA (421 MB) passes; CAN/USA/RUS/AUS are blocked immediately rather than hanging for 40 min or OOM-killing the pod. prewarm_countries.py: add GBR, DEU, FRA — small rasters (~50-150 MB) that give evaluators familiar countries for ground-truthing without the large-raster problem. Job tag bumped to v0.5.6a to pick up both changes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
600 MB was too low — COD is 988 MB and works fine. CAN (3.1 GB) and BRA (3.7 GB) hit the Azure LB 4-min timeout even when cached. 2 GB passes all prewarmed LMIC countries (largest: IND 1.6 GB) and blocks only genuinely slow countries (CAN, BRA, USA, RUS). Size check now runs after cache hit in aggregate() so cached-but-oversized rasters get a useful HTTP 422 instead of a silent LB timeout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- worldpop: stream response per polygon so Azure LB never sees idle connection
- worldpop: strip-based windowed reads (8 M px/strip) prevent OOM on large states (BRA Amazonas was 342 M pixels = 1.4 GB, which killed the pod at the 4 Gi limit)
- worldpop: geographic sort of features in windowed mode maximises GDAL tile-cache hit rate for dense level-2 queries (e.g. São Paulo's clustered municipalities)
- worldpop: set GDAL_CACHEMAX=512 MB (was 64 MB default)
- worldpop: coordinate truncation to 3 dp in client before POST (BRA/1 was 38 MB)
- worldpop: bump image to v0.5.12a; update deployment + prewarm job YAML
- gadm: simplify geometries at 0.001° for level ≥ 1; BRA/2 was OOM-killing the GADM pod (800+ MB response); now 46 MB in 18 s with 5 572 features
- client: warn and bail for large-raster countries with > 500 polygons, render boundary map and print generate.py command instead of hanging for hours
- client: loading spinner on the Load button while requests are in flight
- test_client: add NGA multi-polygon test; add --large-country / --gadm-url flags for testing real admin-1 shapes end-to-end (BRA passes at 212 M total pop)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
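Reducing coordinate precision before POSTing a FeatureCollection can be sketched as follows. The real change lives in the browser client; this Python version is only an illustration, the helper names are hypothetical, and it rounds to 3 decimal places where the commit says "truncation".

```python
def round_coords(coords, ndigits: int = 3):
    """Recursively round a GeoJSON coordinate array of any nesting depth."""
    if isinstance(coords, (int, float)):
        return round(coords, ndigits)
    return [round_coords(c, ndigits) for c in coords]


def slim_feature_collection(fc: dict, ndigits: int = 3) -> dict:
    """Return a copy of fc with all geometry coordinates rounded.

    Properties and other keys are carried over unchanged; the input
    FeatureCollection is not modified.
    """
    features = []
    for feature in fc["features"]:
        geom = feature["geometry"]
        slim_geom = {**geom, "coordinates": round_coords(geom["coordinates"], ndigits)}
        features.append({**feature, "geometry": slim_geom})
    return {**fc, "features": features}
```

At 3 decimal places a coordinate is resolved to roughly 100 m at the equator, which matches the 0.001° simplification tolerance mentioned for gadm.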
After a successful choropleth render (or large-country bailout), show a dark code block below the map with the fully-parameterized generate.py command the user can run locally to produce model files. Includes a clipboard copy button with "Copied!" feedback. Also fixes the large-country warning to use correct positional CLI args instead of wrong --iso/--level flags. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move generate.py into the laser.init package so it can be installed as a proper CLI entry point. services/generate.py becomes a thin shim that delegates to the installed package. Adds httpx to declared dependencies (was already used but not declared). Browser command updated to use laser-generate instead of a path-relative python3 invocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
21 offline tests covering: entry point existence and pyproject.toml registration, argument parsing (help, missing args, shape-source routing), build_gdf helper (types, values, CRS, missing-pop default), all file writers, and an end-to-end main() run with mocked HTTP services. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
laser-generate now resolves service URLs in precedence order:
1. --shapes-url / --worldpop-url / --unwpp-url CLI flags
2. laser_config.yaml: gadm_url, geoboundaries_url, unocha_url, worldpop_url, unwpp_url
3. localhost fallbacks (8101/8102/8103/8104/8100)
Users point at AKS once in ~/.laser/laser_config.yaml and never need flags.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
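The precedence order can be sketched as a small resolver. `resolve_url` and `DEFAULT_PORTS` are hypothetical names; the port mapping is inferred from the order of the config keys and fallback ports listed above (8100 for unwpp and 8103 for unocha are stated elsewhere in this PR, the other three are assumptions).

```python
from typing import Optional

# Assumed mapping, inferred from the listed key order and port order.
DEFAULT_PORTS = {
    "gadm": 8101,
    "geoboundaries": 8102,
    "unocha": 8103,
    "worldpop": 8104,
    "unwpp": 8100,
}


def resolve_url(service: str, cli_value: Optional[str], config: dict) -> str:
    """Pick a base URL: CLI flag first, then config file, then localhost."""
    if cli_value:
        return cli_value.rstrip("/")
    cfg_value = config.get(f"{service}_url")
    if cfg_value:
        return str(cfg_value).rstrip("/")
    return f"http://localhost:{DEFAULT_PORTS[service]}"
```

Here `config` stands for the already-parsed contents of laser_config.yaml, and `cli_value` for whichever flag applies to the service being resolved.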
Pull request overview
Adds a new service-driven geodata generation workflow to the laser-init codebase, centered on a laser-generate CLI that calls geodata microservices (shapes, WorldPop aggregation, and UNWPP demographics) to produce the same core data artifacts expected by existing LASER model loaders.
Changes:
- Introduces src/laser/init/generate.py and registers the laser-generate entry point in pyproject.toml.
- Adds offline unit tests covering the new generator CLI and output writers.
- Adds multiple FastAPI-based geodata microservices (GADM, geoBoundaries, UNOCHA, WorldPop, UNWPP), plus local Docker tooling and AKS deployment manifests.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/test_generate.py | Adds offline tests for laser-generate CLI behavior and file outputs. |
| src/laser/init/generate.py | Implements the laser-generate CLI, service calls, data assembly, and output writing. |
| services/worldpop/test_client.py | Provides a manual test client for the WorldPop aggregation service. |
| services/worldpop/requirements.txt | Declares Python dependencies for worldpop-service. |
| services/worldpop/prewarm_countries.py | Adds a parallel prewarm script for seeding the WorldPop raster cache. |
| services/worldpop/main.py | Implements the WorldPop raster download/cache and polygon aggregation API. |
| services/worldpop/Dockerfile | Container build for worldpop-service. |
| services/worldpop/client.html | Browser client for exploring boundaries + WorldPop choropleths. |
| services/unwpp/test_client.py | Manual test client for the UNWPP demographics service. |
| services/unwpp/requirements.txt | Declares Python dependencies for unwpp-service. |
| services/unwpp/main.py | Implements UNWPP prewarm + in-memory query API for demographics. |
| services/unwpp/Dockerfile | Container build for unwpp-service. |
| services/unocha/test_client.py | Manual test client for UNOCHA boundaries service. |
| services/unocha/requirements.txt | Declares Python dependencies for unocha-service. |
| services/unocha/main.py | Implements UNOCHA global GDB download/extract and boundaries API. |
| services/unocha/Dockerfile | Container build for unocha-service. |
| services/Makefile | Adds local Docker build/run/test targets plus AKS push/deploy helpers. |
| services/geoboundaries/test_client.py | Manual test client for geoBoundaries service. |
| services/geoboundaries/requirements.txt | Declares Python dependencies for geoboundaries-service. |
| services/geoboundaries/main.py | Implements geoBoundaries on-demand download/cache and boundaries API. |
| services/geoboundaries/Dockerfile | Container build for geoboundaries-service. |
| services/generate.py | Adds a shim to call the installed laser.init.generate:main. |
| services/gadm/test_client.py | Manual test client for GADM boundaries service. |
| services/gadm/requirements.txt | Declares Python dependencies for gadm-service. |
| services/gadm/main.py | Implements GADM on-demand download/cache and boundaries API. |
| services/gadm/Dockerfile | Container build for gadm-service. |
| services/gadm/client.html | Browser client for exploring gadm-service boundaries. |
| services/AKS/worldpop-prewarm-job.yaml | Adds an AKS Job manifest to prewarm the WorldPop cache PVC. |
| services/AKS/geodata-services.yaml | Adds AKS PVCs, Deployments, and LoadBalancer Services for all geodata services. |
| services/AKS/DEPLOYMENT.md | Documents AKS deployment, operation, and image tagging conventions. |
| pyproject.toml | Adds httpx dependency and registers the laser-generate script entry point. |
Comment on lines +10 to +15
    from unittest.mock import MagicMock, patch

    import geopandas as gpd
    import pandas as pd
    import pytest
    from shapely.geometry import box
Comment on lines +109 to +118
    with patch.object(sys, "argv", ["laser-generate", "NGA", "1", "2020", "2020",
                                    "--output-dir", str(tmp_path)]):
        with patch.object(gen, "fetch_shapes", side_effect=capture_shapes):
            with patch.object(gen, "fetch_population", return_value=SAMPLE_POP):
                with patch.object(gen, "fetch_demographics", return_value=SAMPLE_DEMO):
                    gen.main()

    assert len(called_urls) == 1
    assert "8103" in called_urls[0]  # unocha port
        },
    }
    with dest.open("w") as f:
        yaml.dump(cfg, f, default_flow_style=False, sort_keys=False)
Comment on lines +86 to +95
    def fetch_population(wp_url: str, iso: str, year: int, fc: dict) -> dict:
        url = f"{wp_url}/aggregate/{iso}?year={year}"
        print(f" worldpop← {url} (first run downloads raster — may take minutes)")
        # The Azure LB has a 4-min idle timeout; a slow raster download causes a
        # connection reset mid-request. Retry: the download continues server-side
        # and subsequent calls hit the cache quickly.
        for attempt in range(1, 11):
            try:
                pop = _post(url, fc)
                break
Comment on lines +96 to +99
    f"{iso} raster is {mb} MB — too large for on-demand download "
    f"(limit {_MAX_RASTER_MB} MB). "
    f"Pre-warm it first: POST /prewarm/{iso}?year={year} "
    f"(runs in the background; may take 30+ min for large countries)."
        logger.warning("Pre-flight HEAD failed (%s) — proceeding with download", exc)

    logger.info("Downloading WorldPop raster for %s year=%d ...", iso, year)
    tmp = Path(tempfile.mktemp(dir=CACHE_DIR, suffix=".tmp"))
Comment on lines +27 to +31
    import rasterio.mask
    import rasterio.windows
    from fastapi import Body, FastAPI, HTTPException, Query
    from fastapi.middleware.cors import CORSMiddleware
    from fastapi.responses import JSONResponse, StreamingResponse
Comment on lines +45 to +56
    # Write to a temp file first — avoids a partial file being treated as cached
    tmp = Path(tempfile.mktemp(dir=dest.parent, suffix=".tmp"))
    try:
        with httpx.stream("GET", url, follow_redirects=True, timeout=600) as r:
            if r.status_code == 404:
                raise HTTPException(404, f"GADM has no data for ISO {iso!r}")
            r.raise_for_status()
            with tmp.open("wb") as f:
                for chunk in r.iter_bytes(chunk_size=65_536):
                    f.write(chunk)
        shutil.move(str(tmp), str(dest))
    except Exception:
Comment on lines +41 to +55
    dest.parent.mkdir(parents=True, exist_ok=True)
    url = f"{_GB_BASE}/{iso}/ADM{level}/geoBoundaries-{iso}-ADM{level}-all.zip"
    logger.info("Downloading %s ...", url)

    tmp = Path(tempfile.mktemp(dir=dest.parent, suffix=".tmp"))
    try:
        with httpx.stream("GET", url, follow_redirects=True, timeout=600) as r:
            if r.status_code == 404:
                raise HTTPException(404, f"geoBoundaries has no data for {iso!r} ADM{level}")
            r.raise_for_status()
            with tmp.open("wb") as f:
                for chunk in r.iter_bytes(chunk_size=65_536):
                    f.write(chunk)
        shutil.move(str(tmp), str(dest))
    except Exception:
Comment on lines +59 to +73
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    logger.info("Downloading UNOCHA global GDB (~1-2 GB) — this takes a few minutes ...")
    tmp = Path(tempfile.mktemp(dir=CACHE_DIR, suffix=".tmp"))
    try:
        with httpx.stream("GET", _HDX_URL, follow_redirects=True, timeout=1800) as r:
            r.raise_for_status()
            with tmp.open("wb") as f:
                downloaded = 0
                for chunk in r.iter_bytes(chunk_size=1_048_576):  # 1 MB chunks
                    f.write(chunk)
                    downloaded += len(chunk)
                    if downloaded % (100 * 1_048_576) == 0:
                        logger.info(" ... %.0f MB downloaded", downloaded / 1e6)
        shutil.move(str(tmp), str(dest))
    except Exception:
Documents what the service is, why it exists, the API (endpoints, properties, errors), caching behaviour, local/Docker/AKS run instructions, and known limitations (BRA/2 slowness, single-replica PVC constraint). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each README covers: purpose and why the service exists, full API reference (endpoints, parameters, response shapes, error codes), caching and startup behaviour, local/Docker/AKS run instructions, and known limitations. Includes a comparison table across shape sources in geoboundaries and unocha. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_MAX_WINDOWED_POLYGONS: 500 → 150. IRN/2 has 429 counties; under the old threshold the browser attempted windowed aggregation, took ~30 min, and timed out. 150 catches IRN/2, TZA/2 (186), and similar cases while still allowing level-1 queries for most countries to run in the browser.
Also fixes two wrong ISO codes in _LARGE_RASTER_ISOS:
- ANG → AGO (Angola)
- MAL → MYS (Malaysia)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
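The bail-out rule can be sketched as a simple predicate. The two constant names come from the commit message, but the set contents below are an illustrative subset only, the strict `>` comparison is taken from the earlier "> 500 polygons" wording, and `should_bail` is a hypothetical name.

```python
_MAX_WINDOWED_POLYGONS = 150

# Illustrative subset only; the real list is not shown in this PR page.
_LARGE_RASTER_ISOS = {"IRN", "TZA", "AGO", "MYS", "BRA"}


def should_bail(iso: str, n_polygons: int) -> bool:
    """True when the client should skip aggregation and show the CLI command.

    Applies only to countries whose raster needs windowed reads; small-raster
    countries are allowed any polygon count.
    """
    return iso.upper() in _LARGE_RASTER_ISOS and n_polygons > _MAX_WINDOWED_POLYGONS
```

Under this sketch IRN/2 (429 polygons) and TZA/2 (186) both bail, while a level-1 query with a few dozen polygons still runs.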