
GeoBrix: H3 cell rasterizer + gbx.viz module + example notebooks & diagrams#42

Merged
mjohns-databricks merged 63 commits into main from beta/0.4.0
Jun 24, 2026


@mjohns-databricks mjohns-databricks commented Jun 23, 2026

Collaborator

Summary

The 0.4.0 beta example + visualization layer, branched from beta/0.4.0 (the 0.4.0 base — STAC, register(only=), lightweight gbx_rst_fromfile, CI hardening — merged in #41). On top of that base this PR adds:

  • H3 cell rasterizer (RasterX, both tiers) — gbx_rst_h3_rasterize_agg, gbx_h3_cell_bbox, and the light-tier rst_h3_gridspec helper, with a worked SF-DEM demo notebook + a published light-vs-heavy benchmark.
  • gbx.viz — a tier-agnostic visualization module behind a new [viz] extra (raster plotting, coverage-depth + mask-layer composites, Spark→GeoPandas adapters), plus two Python-only pyrx escape-hatches.
  • eo-series migration onto these package APIs (deleting per-notebook library.py).
  • Hero pipeline diagrams at the top of every example notebook (eo-series refreshed to the lightweight tier; new xView + h3-rasterize diagrams).

Binding-parity: 156 registered functions (two new: gbx_rst_h3_rasterize_agg, gbx_h3_cell_bbox).


H3 cell rasterizer (RasterX — both tiers)

The inverse of the gbx_rst_h3_rastertogrid* family: synthesize a raster from H3-indexed values instead of reducing a raster to per-cell stats.

  • gbx_rst_h3_rasterize_agg — grouped aggregator (heavy Scala TypedImperativeAggregate + light pyrx pandas_udf) that burns a set of H3 cells (one row per cell, optional value; null → presence mask 1.0) into one GTiff tile per group via pixel-centroid assignment. Extent/grid supplied explicitly or auto-derived from the cell set (+ kring_pad). Heavy returns a tile STRUCT; the lightweight SQL form returns BINARY (a PySpark grouped-pandas_udf can't return a struct) — the Python wrapper recomposes the struct.
  • gbx_h3_cell_bbox — scalar STRUCT<xmin,ymin,xmax,ymax> for one H3 cell in a target EPSG, optionally k-ring-padded.
  • rst_h3_gridspec (light/Python DataFrame helper) — derive the canonical shared canvas (extent + pixel grid) from a cell set before aggregating, so per-band tiles are pixel-aligned for stacking.
  • Exact heavy↔light parity — a JAR-gated test asserts byte-identical covered-pixel masks; confirmed on-cluster (0 diverging pixels).
  • Demo notebook notebooks/examples/h3-rasterize/h3_rasterize_isobands.ipynb — San Francisco DEM → 100 m elevation isobands → H3 polyfill → shared rst_h3_gridspec canvas → per-band rst_h3_rasterize_agg (materialized once to a session temp table) → rst_frombands_agg multi-band stack → gbx.viz rendering. Doc page + README + a #h3-grid cross-reference from the function docs.
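The shared-canvas idea behind rst_h3_gridspec can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: `snap_canvas` and its arguments are hypothetical names, and the snapping rule (floor the origin and ceil the far edge to multiples of the pixel size) is just one common way to get lattice alignment.

```python
import math

def snap_canvas(bboxes, pixel_size):
    """Union cell bboxes and snap the result outward to a pixel lattice.

    bboxes: iterable of (xmin, ymin, xmax, ymax); pixel_size: square pixel edge.
    Hypothetical helper for illustration only.
    """
    bboxes = list(bboxes)
    if not bboxes:
        raise ValueError("need at least one bbox")
    xmin = min(b[0] for b in bboxes)
    ymin = min(b[1] for b in bboxes)
    xmax = max(b[2] for b in bboxes)
    ymax = max(b[3] for b in bboxes)
    # Snap outward so every edge is an integer multiple of pixel_size.
    x0 = math.floor(xmin / pixel_size) * pixel_size
    y0 = math.floor(ymin / pixel_size) * pixel_size
    x1 = math.ceil(xmax / pixel_size) * pixel_size
    y1 = math.ceil(ymax / pixel_size) * pixel_size
    width = round((x1 - x0) / pixel_size)
    height = round((y1 - y0) / pixel_size)
    return (x0, y0, x1, y1, width, height)

# Two bands with different footprints share one canvas derived from their union.
band_a = [(0.3, 0.2, 1.1, 0.9)]
band_b = [(1.4, 0.6, 2.2, 1.7)]
canvas = snap_canvas(band_a + band_b, pixel_size=0.5)
```

Because the canvas is computed once over the union, every band rasterized onto it gets the same origin, pixel size, and dimensions, which is what lets the per-band tiles stack without resampling.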

databricks.labs.gbx.viz (new [viz] extra)

Tier-agnostic — works with pyrx or rasterx tiles. Heavy deps (matplotlib + geopandas) are lazy-imported behind a viz/_env.py guard; folium/mapclassify ship in the extra for GeoDataFrame.explore().
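A lazy-import guard of this kind might look like the sketch below. It is not the contents of viz/_env.py: `assert_available` is a hypothetical name, and the check relies only on the standard-library `importlib.util.find_spec`, which returns `None` for a missing top-level module.

```python
import importlib.util

def assert_available(modules, extra):
    """Raise ImportError with an install hint if any top-level module is missing.

    Hypothetical helper; the module list and extra name are supplied by the caller.
    """
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    if missing:
        raise ImportError(
            f"missing optional dependencies {missing}; "
            f"install them with: pip install 'geobrix[{extra}]'"
        )
```

Calling the guard at the top of each public plotting function keeps `import` of the package itself cheap, while still failing early with an actionable message when the extra is not installed.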

  • plot_raster(raster_bytes) / plot_file(path) — decimation to a pixel budget + per-band 2–98% percentile stretch (UInt16 EO) + nodata masking + viridis(1-band)/RGB(multi); headless-safe backend selection. composite="depth" renders a multi-band presence stack as a per-pixel coverage-depth gradient (vs a mostly-black RGB), and single-band constant-value presence masks now draw as a solid footprint over a light background (previously rendered blank — rasterio.plot.show discards the clim, so the single-band path uses imshow).
  • plot_mask_layers(layers) — overlay several single-band mask tiles on one axes, each a solid colour with a legend (the multi-threshold coverage view).
  • as_gdf / cells_as_gdf (optional dissolve_by) / grid_as_gdf — Spark DataFrame / H3 cells / a rst_h3_gridspec grid struct → GeoPandas (EPSG:4326) for .plot() / .explore(); driver-side max_rows guard.
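The 2–98% percentile stretch mentioned for plot_raster can be illustrated with a standard-library sketch. The real helper presumably works on NumPy arrays; the function names here (`percentile`, `stretch_band`) and the choice to return NoData pixels as `None` are assumptions made for the example.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("no valid pixels")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def stretch_band(pixels, nodata=None, low=2, high=98):
    """Map one band to 0..1 using a low-high percentile stretch.

    NoData pixels are excluded from the statistics and returned as None.
    """
    valid = sorted(p for p in pixels if p != nodata)
    p_lo, p_hi = percentile(valid, low), percentile(valid, high)
    span = p_hi - p_lo
    out = []
    for p in pixels:
        if p == nodata:
            out.append(None)       # masked pixel
        elif span == 0:
            out.append(0.0)        # constant band: nothing to stretch
        else:
            out.append(min(1.0, max(0.0, (p - p_lo) / span)))
    return out

# A UInt16-style band with one NoData pixel (65535) appended.
band = list(range(101)) + [65535]
stretched = stretch_band(band, nodata=65535)
```

Clamping to the 2nd and 98th percentiles keeps a handful of very dark or very bright pixels from compressing the rest of the band into a narrow, low-contrast range.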

pyrx escape-hatches (Python-only)

On databricks.labs.gbx.pyrx.functions (not SQL-registered, so binding-parity is unaffected):

  • tile_to_numpy(tile_or_bytes) — read a tile's raster into a NumPy array.
  • rst_apply(tile_col, fn, returnType=DoubleType()) — apply your own function to each tile's open rasterio dataset, one scalar per row.

eo-series migration

  • config_nb.ipynb installs geobrix[light,stac,viz] and imports the helpers from the package; its local as_gdf/cells_as_gdf defs and import library are gone.
  • library.py deleted — every helper is now superseded by gbx.viz / the escape-hatches. Notebooks 01–04 call the package functions directly (nb03's raster→timeseries projection moves to rst_apply("tile", fn)). READMEs updated.

Example notebooks & hero diagrams

A hero pipeline diagram at the top of every example notebook, from a shared SVG→PNG generator (resources/images/*.py):

  • The four eo-series diagrams refreshed to the lightweight tier (shapefile_gbx, gtiff_gbx (reader + writer), StacClient.download / StacClient.repair, rst_apply), matching the migrated notebooks; tier-agnostic built-ins like h3_tessellateaswkb / st_* are kept.
  • New xView (per-object clipping) and h3-rasterize (DEM isobands → band stack) diagrams via example-diagrams.py (reuses the eo-series framework).

Benchmark

gbx_rst_h3_rasterize_agg added to the cluster benchmark (both tiers, fixed 20-worker cluster, 1000-group spark-path): heavy 1.50 ms/tile vs light 2.26 ms/tile (heavy ~1.5× faster), cross-tier consistency exact. Recorded in benchmarking.mdx; classified in performance.mdx (which also gained the gbx_custom_* custom-grid family). The bench surfaced a real light-tier defect, now fixed: a null in a typed-Double value column burned NaN instead of presence 1.0, because the null arrives as np.nan and `np.nan is not None` is True, so it slipped past the guard.
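The null-value defect is easy to reproduce outside Spark. The sketch below uses hypothetical function names and `math.isnan` rather than the `pd.isna` guard named in the fix commit, so that it runs with the standard library alone; the point is only that a NaN is not `None`.

```python
import math

def burn_value_buggy(value):
    # A typed-Double null arrives as NaN, which "is not None", so NaN is burned.
    return float(value) if value is not None else 1.0

def burn_value_fixed(value):
    # Treat both None and NaN as "no value supplied" -> presence 1.0.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 1.0
    return float(value)

nan = float("nan")
buggy = burn_value_buggy(nan)   # NaN leaks into the raster
fixed = burn_value_fixed(nan)   # presence mask value
```

The value-omitted path never sees a NaN, which is consistent with the parity test passing while the benchmark, which used a typed nullable column, diverged.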

Testing

  • H3 rasterizer — TDD build (nearest-value round-trip + partition pixel-count bounds, FCC fixed-wireless fixture, JAR-gated heavy↔light exact-mask parity); BenchDispatchTest + the Scala bench suite green; light test/viz + test/pyrx green.
  • gbx.viz: test/viz wired into the lightweight CI tier (_LIGHT_TEST_DIRS + pyrx_build); [viz] deps hash-pinned in requirements-pyrx-ci.txt. New regression tests: composite="depth", single-band presence render (asserts drawn pixels, not just a figure object), plot_mask_layers overlay, null-value-column presence.
  • Binding-parity 156 (Scala override def name / Python functions.py / function-info.json); QC doc gates green (diagram-coverage 108 rst_*, release-notes-functions, doc-coverage D2–D5).
  • CI build main green on the feature tip (heavy + light).

eo-series notebook outputs were re-executed on cluster/Serverless; the H3 demo + xView likewise carry executed outputs.

This pull request and its description were written by Isaac.

Michael Johns added 12 commits June 23, 2026 08:45
Design for a tier-agnostic gbx.viz module ([viz] extra: matplotlib + geopandas +
folium + mapclassify) promoting the EO-series notebook helpers -- plot_raster /
plot_file (decimation + 2-98% percentile stretch + nodata masking) and as_gdf /
cells_as_gdf (Spark DF -> GeoDataFrame for .explore()) -- plus two Python-only
pyrx escape-hatches (tile_to_numpy, generalized rst_apply). Drops generate_cells;
keeps set_conf_safe + band-table ETL notebook-local. Pending user review.

Co-authored-by: Isaac
Adds gbx.viz package (importable skeleton for later tasks), assert_viz_available()
guard (raises ImportError with install hint if matplotlib/geopandas missing), and
the [viz] optional-dependency extra in pyproject.toml. Pins matplotlib==3.10.9,
geopandas==1.1.3, folium==0.20.0, mapclassify==2.10.0 in requirements-pyrx-ci.in
(latest on corp proxy) and regenerates the hash-pinned lock (83 packages). 2 tests
pass.

Co-authored-by: Isaac
Add _raster.py with _decimated_read, _needs_percentile_stretch, and
_percentile_stretch helpers; test_raster.py with 3 TDD tests (3 passed).
Append _render, plot_raster, plot_file to viz/_raster.py; export
from viz/__init__.py. Matplotlib/rasterio are lazy-imported inside
each plotter; assert_viz_available() guards the public API. Agg
backend forced when headless. Also fixes unused pytest import (F401)
left from Task 2 and adds missing # noqa: E402 annotations so
flake8 is fully clean for all viz files.

Co-authored-by: Isaac
Replace unreliable get_current_fig_manager() probe with a correct
headless guard: select Agg before pyplot import only when pyplot has
not yet been imported (no prior use() lock-in), MPLBACKEND is unset,
and no display is present (DISPLAY/WAYLAND_DISPLAY absent).  Databricks
notebooks pre-import pyplot with their own backend, so they are never
overridden.

Co-authored-by: Isaac
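The three conditions in the headless guard above can be expressed as a small pure function. This is a sketch, not the code in viz/_raster.py: `should_force_agg` is a hypothetical name, and treating an empty MPLBACKEND as unset is an assumption of the example.

```python
import os
import sys

def should_force_agg(modules=None, environ=None):
    """Return True only when it is safe and necessary to select the Agg backend."""
    modules = sys.modules if modules is None else modules
    environ = os.environ if environ is None else environ
    if "matplotlib.pyplot" in modules:
        return False  # pyplot already imported: respect the host's backend
    if environ.get("MPLBACKEND"):
        return False  # the user chose a backend explicitly
    if environ.get("DISPLAY") or environ.get("WAYLAND_DISPLAY"):
        return False  # a display is present
    return True
```

A caller would run `matplotlib.use("Agg")` before importing pyplot only when this returns True; passing `modules` and `environ` explicitly makes the decision testable without touching the real process state.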
Adds viz._vector with as_gdf (WKT column → GeoDataFrame, EPSG:4326,
max_rows guard with truncation warning) and cells_as_gdf (H3 bigint cell
ids → boundary polygons via h3 v4 int_to_str + cell_to_boundary).
Exports both from viz.__init__. 3/3 tests green.

Co-authored-by: Isaac
Add two Python-only escape-hatches in pyrx/core/escape.py:
- tile_to_numpy(tile_or_bytes): drops a collected tile or raw bytes to
  a numpy ndarray (all bands) for host-side exploration.
- rst_apply(tile_col, fn, returnType): applies an arbitrary rasterio
  callable per-row via a dynamic @udf; null tile -> null.

Both are re-exported on pyrx.functions (noqa: F401) so callers use
`from databricks.labs.gbx.pyrx.functions import tile_to_numpy, rst_apply`.
Neither is SQL-registered; binding-parity count is unchanged (154).

Co-authored-by: Isaac
Add "viz" to _LIGHT_TEST_DIRS in test/conftest.py (heavy CI phase
collect_ignore exclusion) and to the pytest dir list in the light CI
action. Clean-venv verification against requirements-pyrx-ci.txt
confirmed 829 passed, 2 skipped, RC=0 — all viz deps already in lock.

Co-authored-by: Isaac
…; drop library.py

The gbx.viz module ([viz] extra) and the pyrx escape-hatches (rst_apply,
tile_to_numpy) now provide, as first-class package APIs, the helpers the
eo-series previously carried in a local library.py + config_nb defs.

- config_nb.ipynb: install geobrix[light,stac,viz] (folium/mapclassify/geopandas
  now come via [viz], not a manual %pip); import plot_raster/plot_file/as_gdf/
  cells_as_gdf from databricks.labs.gbx.viz and rst_apply/tile_to_numpy from
  pyrx.functions; drop the local as_gdf/cells_as_gdf defs and the `import library`;
  keep the one surviving constant (FILENAME_TIMESTAMP_FORMAT) inline.
- library.py: DELETED. Everything in it is now superseded — plot_raster/plot_file
  by gbx.viz, to_numpy_arr/rasterio_lambda by tile_to_numpy/rst_apply,
  generate_cells was dead heavy-only, _set_conf_safe duplicated config_nb's
  set_conf_safe, FILE_SIZE_THRESHOLD was unused.
- 01-04: call the package functions directly (plot_raster/plot_file, the bare
  FILENAME_TIMESTAMP_FORMAT). nb03's raster->timeseries projection moves from
  library.rasterio_lambda("tile.raster", fn) to rst_apply("tile", fn) — rst_apply
  takes the tile struct and opens tile["raster"] itself (not the raw bytes column).
- README: drop the library.py row; note viz helpers come from gbx.viz; [light,stac,viz].

Source cells migrated; notebooks need re-execution on a cluster to refresh outputs
(they read /Volumes and run on Serverless/classic, not locally).

Co-authored-by: Isaac
Re-run on Serverless against the [viz]-enabled wheel. config_nb drops the now-dead
imports (geopandas/matplotlib.pyplot/rasterio.MemoryFile/io.BytesIO — superseded by
gbx.viz) and the library.py autoreload lines; installs geobrix[light,stac,viz].
Notebooks 01-04 carry refreshed outputs from the package-based viz/escape-hatch APIs.

Co-authored-by: Isaac
…ation

Bring docs/docs/notebooks/eo-series.mdx and the series README in line with the
gbx.viz migration: drop all library.py references (file deleted), install
geobrix[light,stac,viz] (+ the viz extras matplotlib/geopandas/folium/mapclassify),
note visualization helpers come from databricks.labs.gbx.viz, and replace the
raster->timeseries `rasterio_lambda` mention with the `rst_apply` escape-hatch.
Option-2 (heavyweight) now flips only config_nb.ipynb.

Co-authored-by: Isaac
Michael Johns added 3 commits June 23, 2026 19:24
…c) design

Design for a DGGS-cell rasterizer: rasterize a set of H3 cell ids (+ optional
value) into a raster tile via pixel-centroid burn (the inverse of
rst_h3_rastertogrid). Grouped aggregator rst_h3_rasterize_agg (heavy UDAF + light
grouped pandas_udf, light SQL returns BINARY per the light-agg convention; default
value = 1/NoData presence mask; default 4326 with optional projected srid + auto
extent/pixel-size, full overrides).

rst_h3_gridspec defines the complete SHARED grid/canvas (snapped origin + pixel
size + dims + srid) once over the union of all thresholds, so every band
rasterizes to a byte-identical transform and stacks cleanly via rst_frombands_agg
(no half-pixel drift). Implemented as a scalar per-cell bbox + native min/max +
snap (both tiers; avoids the grouped-pandas_udf struct-return limit). Quadbin/BNG
variants are follow-ons. Validation: CI round-trip vs rastertogrid + partition
property + a committed FCC fixed-wireless subset; DEM elevation-isoband notebook
for the full polygons->polyfill->rasterize->stack demo. Pending user review.

Co-authored-by: Isaac
Adds cellraster.py with pure-function rasterization primitives:
_h3_str (signed-Long normalization), _resolution (uniform-res guard),
cell_bbox, snap_bounds (lattice-aligned snapping, DRY helper for Task 2),
compute_gridspec (kring-padded, snapped 8-tuple), and cells_to_raster
(pixel-centroid burn to float64 GTiff bytes). 5/5 tests green.

Note: brief's half-pixel centroid expansion pre-snap caused the
single-cell kring_pad=0 case to straddle two lattice slots (→2x2);
removed expansion — snap_bounds correctly gives 1x1 for a single point.
Michael Johns added 2 commits June 24, 2026 13:01
rasterio.plot.show() renders a constant-valued single band (an H3 presence mask,
all 1.0) as a blank plot and ignores the explicit vmin/vmax, so every per-band
inspection looked empty. Render the single-band branch with ax.imshow + an explicit
plotting_extent instead -- it honors the clim and the masked array (NoData ->
transparent over the facecolor). Replace the figure-exists check with a real
regression test asserting the footprint is actually drawn (non-degenerate clim +
coloured pixels in the rasterized buffer); the old check passed while blank.

Verified locally across full/mid/sparse coverage and a continuous DEM-like raster.

Co-authored-by: Isaac
Remove unused imports (math/numpy in _h3_cell_bbox_udf, math in rst_h3_gridspec)
and the dead _mode_val capture (mode is already applied via the bbox UDF), plus an
unused numpy import in test_core_cellraster. Reformat test_vector_raster_bridge with
in-container black (a prior host-black pass diverged from CI's black). Clears the CI
Python-lint gate; no behaviour change.

Co-authored-by: Isaac
Add plot_mask_layers(layers, colors=, ...): overlay several single-band presence-
mask tiles on one axes, each a solid colour with a legend (NoData transparent over a
grey facecolor). Tiles must share a grid; draw order is largest-first so nested
coverage stays visible. Notebook cell 8 now overlays the two mid-coverage bands on a
single plot instead of two separate viridis figures. Regression test asserts two
overlaid AxesImages, a 2-entry legend, and both requested colours present in the
rasterized buffer.

Co-authored-by: Isaac
…summary

Cell 16 described dissolve_by as active while the code renders per-cell (kept
deliberately: per-cell tooltips read better at this size). Reword to the per-cell
render and add a Note pointing at dissolve_by="band_level" for larger sets. Update
the summary table's Visualize rows to the actual calls (plot_mask_layers,
plot_raster(composite="depth")).

Co-authored-by: Isaac
- New docs/docs/notebooks/h3-rasterize.mdx (registered in sidebars.js) documenting
  the SF Bay Area H3 rasterize -> band-stack example.
- viz.mdx: document plot_mask_layers, grid_as_gdf, composite="depth", and the
  cells_as_gdf dissolve_by option; update the import list.
- beta-release-notes: add gbx.viz module bullet + H3 rasterize example bullet.
- raster-functions: add explicit {#h3-grid} anchor so release-notes link resolves.
- README: retarget the example to the San Francisco DEM, the session temp table,
  and the new viz helpers.

Validated with a full Docusaurus build (no broken links).

Co-authored-by: Isaac
Michael Johns added 9 commits June 24, 2026 14:13
…sterize_agg

Add a "Worked example" tip in the rst_h3_rasterize_agg section pointing at the new
H3 Rasterize notebook, cross-linking rst_h3_gridspec, rst_frombands_agg, and gbx.viz.

Co-authored-by: Isaac
New "h3_aggregate" input kind: a fixed 331-cell res-9 H3 set on a hardcoded
explicit grid, mirrored byte-for-byte in spec.py (_H3RAGG_*) and BenchDispatch.scala
(h3Ragg*) so the light and heavy legs burn the identical cells onto the identical
canvas. Python: FnSpec (dggs, spark-path, fingerprint) + _h3_aggregate_df group
builder + runner wiring. Scala: DGGS category, h3Aggregate set/inputKind, heavy
aggregate case (UDAF -> gbx_rst_fromcontent tile struct), HeavyRunner group branch,
BenchDispatchTest count 106->107. Cross-tier masks verified byte-identical (0 px)
via the local JAR; Scala bench suite green (20 tests).

Co-authored-by: Isaac
…in fromcontent

The heavy UDAF's dataType is tileDataType(BinaryType) -- a tile STRUCT, like
rst_rasterize_agg -- so the cluster heavy leg must call gbx_rst_h3_rasterize_agg
directly (the consistency collect reads `raster` off the struct). Wrapping it in
gbx_rst_fromcontent (a BINARY->struct helper, only needed for the lightweight SQL
form) fed a struct where bytes were expected and aborted the distributed agg with
"[INTERNAL_ERROR] Couldn't find method eval". Match the rst_rasterize_agg pattern.

Co-authored-by: Isaac
… not NaN

A null in a TYPED (Double) value column arrives in the pandas_udf as np.nan, and
`np.nan is not None` is True, so the presence guard burned float(np.nan)=NaN instead
of 1.0. Guard with pd.isna. The cluster benchmark caught this as a heavy(1.0)-vs-
light(NaN) cross-tier divergence (the value-omitted path was already correct, which
is why the JAR-gated parity test passed). Regression test added with an explicit
nullable DoubleType value column.

Co-authored-by: Isaac
…5x, exact)

Cluster bench (1000 groups, fixed 20 workers): heavy 1.50 ms/tile vs light 2.26
ms/tile (heavy ~1.5x faster), cross-tier parity exact. Footnoted because the H3
rasterizer's workload (331-cell groups -> 39x24 output) differs from the 1024² rows,
so the cross-tier ratio + exact parity are the comparable takeaways.

Co-authored-by: Isaac
…mily

Bring the lightweight implementation-techniques page current: add
rst_h3_rasterize_agg to the grouped-aggregate UDF table, and a "GridX (pygx) --
custom grid" section in the Arrow scalar tab mirroring the quadbin/BNG families
(scalar cell-ops as pandas_udf, array polyfill/kring as plain @udf). Regular scalar
UDFs (metadata/accessors, h3_cell_bbox) remain intentionally unlisted.

Co-authored-by: Isaac
The landscape SVG already carried rst_h3_rasterize_agg + the 108 count, but its
screenshot PNG was last rendered pre-H3 (107). Re-screenshot from the current SVG so
the slide asset matches; the portrait PNG (used on the docs page) was already current.

Co-authored-by: Isaac
eo-series 01-04: embed the existing eo-series-0N.png banner after the intro cell.
xView + h3-rasterize: new resources/images/example-diagrams.py (reuses the
eo-series.py diagram framework via importlib) renders xview-clipping.png and
h3-rasterize.png, embedded after each notebook's intro cell and in their READMEs.
Includes the executed-notebook + banner edits supplied for the example set.

Co-authored-by: Isaac
Update the four eo-series notebook diagrams' chips + captions to match the migrated
notebooks: shapefile_ogr→shapefile_gbx, pystac_client→StacClient (01); the deleted
download_band/update_assets flow → StacClient.download / StacClient.repair (02);
gdal reader→gtiff_gbx, rasterio_lambda→rst_apply (03); gdal writer→gtiff_gbx (04).
Tier-agnostic built-ins (h3_tessellateaswkb, st_*) and still-used functions kept.
Re-rendered all four PNGs. Also includes the h3 notebook's first-cell edit.

Co-authored-by: Isaac