Support string/categorical columns in rasterize with QGIS-visible labels #3483

Merged
brendancol merged 4 commits into main from issue-3482 on Jun 24, 2026

@brendancol

Contributor

Closes #3482.

What this does

rasterize now accepts a string or categorical GeoDataFrame column as column=. The labels are encoded to integer codes and the value-to-label map rides along in the result attrs, so it can be written to a GeoTIFF and shown in QGIS.

  • A non-numeric column is label-encoded with pandas Categorical: codes 0..N-1 on an int32 band with -1 as nodata. Object/string columns sort lexically; an existing Categorical keeps its declared order. An explicit dtype or fill still takes precedence.
  • The result carries attrs['category_names'] (index == pixel code) and an auto-generated attrs['category_colors'] (one RGBA per class).
  • to_geotiff writes a PAM <file>.tif.aux.xml sidecar (<CategoryNames> plus a thematic RAT with Value/Class and RGBA columns) when those attrs are present. This is the only mechanism GDAL/QGIS read for category labels; an embedded RAT in the GDAL_METADATA tag is ignored (confirmed with gdalinfo 3.10.3).
  • open_geotiff parses the sidecar back into attrs, so the round-trip preserves the labels and colors.

Numeric columns and (geometry, value) pairs are unchanged.
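
For reference, the encoding rules above can be reproduced with pandas alone. This is a minimal sketch of the label-encoding step, not the code in _parse_input; the sample labels and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Plain string labels: categories are inferred and sorted lexically.
# A missing value (None) gets the pandas sentinel code -1.
cat = pd.Categorical(["water", "forest", None, "urban", "forest"])
codes = np.asarray(cat.codes, dtype=np.int32)  # widen the small int codes to int32
category_names = list(cat.categories)          # list index == pixel code

print(category_names)  # ['forest', 'urban', 'water']
print(codes.tolist())  # [2, 0, -1, 1, 0]

# An existing Categorical keeps its declared order, even for unused categories.
declared = pd.Categorical(["low", "high", "low"],
                          categories=["low", "medium", "high"])
declared_codes = np.asarray(declared.codes, dtype=np.int32)

print(list(declared.categories))  # ['low', 'medium', 'high']
print(declared_codes.tolist())    # [0, 2, 0]
```

Because -1 is already what pandas uses for a missing label, it can double as the default nodata value without any extra remapping.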

Backend coverage

Encoding happens in _parse_input before backend dispatch, so all four backends (numpy / cupy / dask+numpy / dask+cupy) burn the same integer codes. The sidecar read and write are backend-independent.

Test plan

  • Encoding: string column -> int32 codes, sorted category_names, codes match category order, default fill -1.
  • Ordered Categorical preserves declared order.
  • Missing category -> nodata; explicit fill remaps the missing code.
  • One color per category; RGBA in range.
  • Numeric column unchanged (float64, no category attrs).
  • GeoTIFF round-trip: sidecar written, open_geotiff reads names and colors back, no sidecar for numeric rasters.
  • gdalinfo shows the Categories: block (guarded by shutil.which).
  • PAM XML build/parse round-trip including special-character escaping.
  • GPU parity: codes and names match CPU (ran locally with CUDA).
  • Existing test_rasterize.py (233 passed) and geotiff writer/reader suites green.
  • User guide notebook executes clean.
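
The PAM build/parse round-trip in the test plan can be illustrated with the standard library. This is a sketch under assumptions: build_pam_xml and parse_category_names are hypothetical helpers, not the functions in _pam.py, and the numeric field type/usage codes follow my understanding of GDAL's GFT_*/GFU_* enums. The stdlib parser is used for brevity and is not hardened against untrusted input.

```python
import xml.etree.ElementTree as ET


def build_pam_xml(names, colors=None):
    """Build a PAM sidecar string with <CategoryNames> and a thematic RAT."""
    root = ET.Element("PAMDataset")
    band = ET.SubElement(root, "PAMRasterBand", band="1")

    cats = ET.SubElement(band, "CategoryNames")
    for name in names:
        ET.SubElement(cats, "Category").text = name

    rat = ET.SubElement(band, "GDALRasterAttributeTable", tableType="thematic")
    # (column name, field type, field usage); type 0 = integer, 2 = string,
    # usage 5 = pixel value, 2 = class name, 6..9 = red/green/blue/alpha.
    fields = [("Value", 0, 5), ("Class", 2, 2)]
    if colors is not None:
        fields += [("Red", 0, 6), ("Green", 0, 7), ("Blue", 0, 8), ("Alpha", 0, 9)]
    for i, (fname, ftype, usage) in enumerate(fields):
        defn = ET.SubElement(rat, "FieldDefn", index=str(i))
        ET.SubElement(defn, "Name").text = fname
        ET.SubElement(defn, "Type").text = str(ftype)
        ET.SubElement(defn, "Usage").text = str(usage)

    for code, name in enumerate(names):
        row = ET.SubElement(rat, "Row", index=str(code))
        cells = [code, name] + (list(colors[code]) if colors is not None else [])
        for cell in cells:
            ET.SubElement(row, "F").text = str(cell)

    # ElementTree escapes &, < and > in text nodes when serializing.
    return ET.tostring(root, encoding="unicode")


def parse_category_names(xml_text):
    """Read the <CategoryNames> list back; list index == pixel code."""
    root = ET.fromstring(xml_text)
    return [c.text or ""
            for c in root.iterfind("./PAMRasterBand/CategoryNames/Category")]


names = ["forest & scrub", "<urban>", "water"]
colors = [(34, 139, 34, 255), (128, 128, 128, 255), (0, 0, 255, 255)]
xml_text = build_pam_xml(names, colors)
round_tripped = parse_category_names(xml_text)
```

Special characters survive because escaping is left to the XML serializer rather than done by string formatting.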

@brendancol brendancol left a comment

Contributor Author


PR Review: Support string/categorical columns in rasterize with QGIS-visible labels

Reviewed the diff in full from a worktree on the PR branch. The rasterize-side encoding and the write path look solid and well tested. The reader, however, has two real problems that affect ordinary files, not just categorical ones.

Blockers (must fix before merge)

  • read_pam_sidecar crashes on common GDAL statistics sidecars (xrspatial/geotiff/_pam.py:135, :170). open_geotiff now reads <source>.aux.xml for every local string source. GDAL routinely writes a .aux.xml holding band statistics with a histogram RAT whose Value column is Real (<F>0.0</F>). _parse_rat does int(fields[value_col]), which raises ValueError: invalid literal for int() with base 10: '0.0'. That exception propagates out through _attach_category_attrs, so opening a perfectly normal GeoTIFF that happens to have a stats sidecar would fail. Reproduced with an athematic histogram sidecar.
  • Non-categorical RAT yields bogus category_names (xrspatial/geotiff/_pam.py:159-188). An athematic RAT with no Name column falls through to name = '' for every row, so a histogram/statistics RAT would attach category_names = ['', '', ...] to a continuous raster. Gate on tableType == 'thematic' and require a Name column; otherwise fall back to the <CategoryNames> element (which GDAL only writes for real categories).
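
The first blocker comes down to how Python parses the cell text. The sketch below uses a simplified, hand-written stand-in for a GDAL statistics sidecar (real ones carry more attributes and fields) and shows the strict parse failing where a tolerant one succeeds.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a stats sidecar: an athematic histogram RAT whose
# Value column is Real, so the cells read "0.0", "1.0", ...
STATS_SIDECAR = """<PAMDataset>
  <PAMRasterBand band="1">
    <GDALRasterAttributeTable tableType="athematic">
      <FieldDefn index="0"><Name>Value</Name><Type>1</Type><Usage>5</Usage></FieldDefn>
      <FieldDefn index="1"><Name>Histogram</Name><Type>0</Type><Usage>1</Usage></FieldDefn>
      <Row index="0"><F>0.0</F><F>12</F></Row>
      <Row index="1"><F>1.0</F><F>7</F></Row>
    </GDALRasterAttributeTable>
  </PAMRasterBand>
</PAMDataset>"""

rat = ET.fromstring(STATS_SIDECAR).find("./PAMRasterBand/GDALRasterAttributeTable")
first_value = rat.find("Row").findall("F")[0].text  # '0.0'

# Strict parse: int() rejects a decimal string.
try:
    int(first_value)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

# Tolerant parse: go through float first.
tolerant_value = int(float(first_value))

print(strict_error)    # invalid literal for int() with base 10: '0.0'
print(tolerant_value)  # 0
```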

Suggestions (should fix, not blocking)

  • Make sidecar reads fully fail-closed (xrspatial/geotiff/_pam.py:122-157). The docstring promises an empty dict when the sidecar "cannot be parsed", but the try only covers the file read and safe_fromstring. Element lookups and _parse_rat run outside it, so a malformed external sidecar raises instead of returning {}. Widen the guard to cover parsing, and parse the value cell as int(float(...)) for tolerance.
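
A fail-closed read could look roughly like the following. This is a sketch, not the code in _pam.py: read_sidecar_fail_closed is a hypothetical name, it only extracts <CategoryNames>, and the exception tuple is an assumption sized for the stdlib parser (ET.ParseError is not a ValueError, and a missing element surfaces as AttributeError).

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_sidecar_fail_closed(path):
    """Return {'category_names': [...]} or {} if anything at all goes wrong."""
    try:
        # File read, XML parse, and element lookups all sit inside the guard.
        with open(path, "r", encoding="utf-8") as fh:
            root = ET.fromstring(fh.read())
        band = root.find("PAMRasterBand")
        cats = band.find("CategoryNames")  # AttributeError if band is None
        names = [c.text or "" for c in cats.findall("Category")]
        return {"category_names": names} if names else {}
    except (OSError, ValueError, TypeError, AttributeError, ET.ParseError):
        return {}


with tempfile.TemporaryDirectory() as tmp:
    truncated = os.path.join(tmp, "truncated.tif.aux.xml")
    with open(truncated, "w", encoding="utf-8") as fh:
        fh.write("<PAMDataset><PAMRasterBand band='1'>")  # not well-formed

    no_band = os.path.join(tmp, "no_band.tif.aux.xml")
    with open(no_band, "w", encoding="utf-8") as fh:
        fh.write("<PAMDataset/>")  # well-formed, but no band element

    missing = os.path.join(tmp, "missing.tif.aux.xml")  # never created

    results = [read_sidecar_fail_closed(p) for p in (truncated, no_band, missing)]

print(results)  # [{}, {}, {}]
```

The point is the scope of the try: a guard around only the file read and the XML parse would let the no-band case raise.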

Nits (optional improvements)

  • Passing a non-numeric column via the plural columns= arg still raises the raw could not convert string to float from .astype(np.float64) (xrspatial/rasterize.py:3801). Categorical encoding only applies to the singular column=. Pre-existing behavior; a clearer message would help, but it is out of scope here.

What looks good

  • Encoding lives in _parse_input before backend dispatch, so all four backends burn identical codes; GPU parity is tested directly.
  • The empirical call on storage is right: an embedded RAT is ignored by GDAL, the PAM sidecar is what QGIS reads. The round-trip test asserts it via gdalinfo.
  • Missing categories map to the fill via the -1 pandas sentinel, with an explicit-fill remap and a test for it.
  • XML special characters are escaped and covered by a test.
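
The missing-category handling noted above amounts to a single remap of the sentinel code. A minimal sketch, assuming an explicit fill of 255 chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# None stands for a feature with no label; pandas assigns it code -1.
cat = pd.Categorical(["road", None, "river"])
codes = np.asarray(cat.codes, dtype=np.int32)

# Default: -1 is already the nodata value, so nothing needs to change.
# Explicit fill: move the sentinel to the caller's value.
fill = 255  # hypothetical user-supplied fill
remapped = np.where(codes == -1, np.int32(fill), codes)

print(codes.tolist())     # [1, -1, 0]
print(remapped.tolist())  # [1, 255, 0]
```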

Checklist

  • Algorithm matches intent (pandas categorical codes, sorted vs declared order)
  • All implemented backends produce consistent results (GPU parity tested)
  • NaN / missing handling correct (-1 sentinel -> fill)
  • Edge cases covered (missing category, explicit fill, numeric unchanged)
  • Reader robust on arbitrary real-world sidecars (see Blockers)
  • No premature materialization
  • Benchmark not needed (no new perf-critical path)
  • README matrix unchanged (no new function, backend support unchanged)
  • Docstrings present and accurate

@brendancol brendancol left a comment

Contributor Author


Follow-up review (after 5c18598)

Re-reviewed the reader after the fixes. Both blockers are resolved and the follow-up has regression tests.

Blockers -- resolved

  • read_pam_sidecar no longer crashes on GDAL statistics sidecars. The whole parse runs inside a try that returns {} on OSError/ValueError/TypeError, and the Value cell is parsed as int(float(...)) so a Real-typed cell like 0.0 is fine (xrspatial/geotiff/_pam.py:118-150, :181). Confirmed: opening a .tif with an athematic histogram sidecar returns {} and does not raise.
  • Non-categorical RATs no longer leak bogus names. The reader only consults the RAT when tableType == 'thematic', and _parse_rat returns (None, None) without a Name column, falling back to <CategoryNames> (_pam.py:134, :169-171).
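
The gating described in the second item can be sketched as below. The helper names are hypothetical, the sketch returns names only (the real _parse_rat returns a (names, colors) pair), and it identifies the name and value columns by GDAL field usage codes, which is an assumption about how the columns are located.

```python
import xml.etree.ElementTree as ET

GFU_NAME = 2    # GDAL field usage for a class-name column
GFU_MINMAX = 5  # GDAL field usage for a pixel-value column


def names_from_rat(rat):
    """Return class names from a thematic RAT, or None if it does not qualify."""
    if rat is None or rat.get("tableType") != "thematic":
        return None
    col = {int(d.findtext("Usage")): int(d.get("index"))
           for d in rat.findall("FieldDefn")}
    if GFU_NAME not in col or GFU_MINMAX not in col:
        return None
    by_code = {}
    for row in rat.findall("Row"):
        cells = [f.text or "" for f in row.findall("F")]
        # Tolerant value parse: accepts "3" as well as "3.0".
        by_code[int(float(cells[col[GFU_MINMAX]]))] = cells[col[GFU_NAME]]
    if not by_code:
        return None
    return [by_code.get(code, "") for code in range(max(by_code) + 1)]


def names_from_band(band):
    """Prefer a thematic RAT; otherwise fall back to <CategoryNames>."""
    names = names_from_rat(band.find("GDALRasterAttributeTable"))
    if names is None:
        names = [c.text or "" for c in band.iterfind("CategoryNames/Category")]
    return names


HISTOGRAM_BAND = ET.fromstring(
    '<PAMRasterBand band="1"><GDALRasterAttributeTable tableType="athematic">'
    '<FieldDefn index="0"><Name>Value</Name><Type>1</Type><Usage>5</Usage></FieldDefn>'
    '<Row index="0"><F>0.0</F></Row>'
    '</GDALRasterAttributeTable></PAMRasterBand>')

THEMATIC_BAND = ET.fromstring(
    '<PAMRasterBand band="1"><GDALRasterAttributeTable tableType="thematic">'
    '<FieldDefn index="0"><Name>Value</Name><Type>0</Type><Usage>5</Usage></FieldDefn>'
    '<FieldDefn index="1"><Name>Class</Name><Type>2</Type><Usage>2</Usage></FieldDefn>'
    '<Row index="0"><F>0</F><F>forest</F></Row>'
    '<Row index="1"><F>1</F><F>water</F></Row>'
    '</GDALRasterAttributeTable></PAMRasterBand>')

print(names_from_band(HISTOGRAM_BAND))  # []
print(names_from_band(THEMATIC_BAND))   # ['forest', 'water']
```

An empty result for the histogram band means no category attrs get attached to a continuous raster.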

New tests

  • test_athematic_stats_sidecar_ignored and test_malformed_sidecar_returns_empty lock in the fail-closed behavior. Full file: 18 passed.

Disposition of the original review

  • Blocker (stats-sidecar crash): fixed.
  • Blocker (bogus names from athematic RAT): fixed.
  • Suggestion (fully fail-closed reads + tolerant value parse): fixed.
  • Nit (cryptic error on non-numeric columns=): left as-is. It is pre-existing behavior on the plural path and outside this PR's scope; the singular column= is the documented entry point for categorical data.

No new issues found.

@brendancol brendancol merged commit 6223953 into main Jun 24, 2026
17 of 18 checks passed
@brendancol brendancol deleted the issue-3482 branch June 25, 2026 14:55
brendancol added a commit that referenced this pull request Jun 25, 2026
…te paths (#3518) (#3519)

The #3483 categorical PAM sidecar feature is round-trip tested only on the
eager numpy write path. The dask streaming and GPU (nvCOMP) writers each emit
the sidecar via their own _write_category_sidecar() call, but no test exercised
those branches, so a refactor dropping one would lose category labels silently.

Add geotiff/tests/write/test_category_sidecar_backends_3483.py: dask and GPU
write round-trips of names + colors, plus a names-only round-trip covering the
category_colors=None build branch. Test-only; all three pass on a CUDA host.

Development

Successfully merging this pull request may close these issues.

Support string/categorical columns in rasterize with QGIS-visible category labels
