Support string/categorical columns in rasterize with QGIS-visible labels#3483
Merged
Conversation
brendancol
commented
Jun 24, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
PR Review: Support string/categorical columns in rasterize with QGIS-visible labels
Reviewed the diff in full from a worktree on the PR branch. The rasterize-side encoding and the write path look solid and well tested. The reader picked up two real problems that bite on ordinary files, not just categorical ones.
Blockers (must fix before merge)
-
read_pam_sidecarcrashes on common GDAL statistics sidecars (xrspatial/geotiff/_pam.py:135,:170).open_geotiffnow reads<source>.aux.xmlfor every local string source. GDAL routinely writes a.aux.xmlholding band statistics with a histogram RAT whose Value column isReal(<F>0.0</F>)._parse_ratdoesint(fields[value_col]), which raisesValueError: invalid literal for int() with base 10: '0.0'. That exception propagates out through_attach_category_attrs, so opening a perfectly normal GeoTIFF that happens to have a stats sidecar would fail. Reproduced with an athematic histogram sidecar. - Non-categorical RAT yields bogus
category_names(xrspatial/geotiff/_pam.py:159-188). An athematic RAT with no Name column falls through toname = ''for every row, so a histogram/statistics RAT would attachcategory_names = ['', '', ...]to a continuous raster. Gate ontableType == 'thematic'and require a Name column; otherwise fall back to the<CategoryNames>element (which GDAL only writes for real categories).
Suggestions (should fix, not blocking)
- Make sidecar reads fully fail-closed (
xrspatial/geotiff/_pam.py:122-157). The docstring promises an empty dict when the sidecar "cannot be parsed", but thetryonly covers the file read andsafe_fromstring. Element lookups and_parse_ratrun outside it, so a malformed external sidecar raises instead of returning{}. Widen the guard to cover parsing, and parse the value cell asint(float(...))for tolerance.
Nits (optional improvements)
- Passing a non-numeric column via the plural
columns=arg still raises the rawcould not convert string to floatfrom.astype(np.float64)(xrspatial/rasterize.py:3801). Categorical encoding only applies to the singularcolumn=. Pre-existing behavior; a clearer message would help, but it is out of scope here.
What looks good
- Encoding lives in
_parse_inputbefore backend dispatch, so all four backends burn identical codes; GPU parity is tested directly. - The empirical call on storage is right: an embedded RAT is ignored by GDAL, the PAM sidecar is what QGIS reads. The round-trip test asserts it via
gdalinfo. - Missing categories map to the fill via the
-1pandas sentinel, with an explicit-fill remap and a test for it. - XML special characters are escaped and covered by a test.
Checklist
- Algorithm matches intent (pandas categorical codes, sorted vs declared order)
- All implemented backends produce consistent results (GPU parity tested)
- NaN / missing handling correct (-1 sentinel -> fill)
- Edge cases covered (missing category, explicit fill, numeric unchanged)
- Reader robust on arbitrary real-world sidecars (see Blockers)
- No premature materialization
- Benchmark not needed (no new perf-critical path)
- README matrix unchanged (no new function, backend support unchanged)
- Docstrings present and accurate
brendancol
commented
Jun 24, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
Follow-up review (after 5c18598)
Re-reviewed the reader after the fixes. Both blockers are resolved and the follow-up has regression tests.
Blockers -- resolved
read_pam_sidecarno longer crashes on GDAL statistics sidecars. The whole parse runs inside atrythat returns{}onOSError/ValueError/TypeError, and the Value cell is parsed asint(float(...))so aReal-typed cell like0.0is fine (xrspatial/geotiff/_pam.py:118-150,:181). Confirmed: opening a.tifwith an athematic histogram sidecar returns{}and does not raise.- Non-categorical RATs no longer leak bogus names. The reader only consults the RAT when
tableType == 'thematic', and_parse_ratreturns(None, None)without a Name column, falling back to<CategoryNames>(_pam.py:134,:169-171).
New tests
test_athematic_stats_sidecar_ignoredandtest_malformed_sidecar_returns_emptylock in the fail-closed behavior. Full file: 18 passed.
Disposition of the original review
- Blocker (stats-sidecar crash): fixed.
- Blocker (bogus names from athematic RAT): fixed.
- Suggestion (fully fail-closed reads + tolerant value parse): fixed.
- Nit (cryptic error on non-numeric
columns=): left as-is. It is pre-existing behavior on the plural path and outside this PR's scope; the singularcolumn=is the documented entry point for categorical data.
No new issues found.
This was referenced Jun 25, 2026
brendancol
added a commit
that referenced
this pull request
Jun 25, 2026
…te paths (#3518) (#3519) The #3483 categorical PAM sidecar feature is round-trip tested only on the eager numpy write path. The dask streaming and GPU (nvCOMP) writers each emit the sidecar via their own _write_category_sidecar() call, but no test exercised those branches, so a refactor dropping one would lose category labels silently. Add geotiff/tests/write/test_category_sidecar_backends_3483.py: dask and GPU write round-trips of names + colors, plus a names-only round-trip covering the category_colors=None build branch. Test-only; all three pass on a CUDA host.
This was referenced Jun 26, 2026
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3482.
What this does
rasterizenow accepts a string or categorical GeoDataFrame column ascolumn=. The labels are encoded to integer codes and the value-to-label map rides along in the result attrs, so it can be written to a GeoTIFF and shown in QGIS.columnis label-encoded with pandas Categorical: codes0..N-1on anint32band with-1nodata. Object/string columns sort lexically; an existingCategoricalkeeps its declared order. Explicitdtype/fillstill win.attrs['category_names'](index == pixel code) and an auto-generatedattrs['category_colors'](one RGBA per class).to_geotiffwrites a PAM<file>.tif.aux.xmlsidecar (<CategoryNames>plus a thematic RAT with Value/Class and RGBA columns) when those attrs are present. This is the only mechanism GDAL/QGIS read for category labels; an embedded RAT in the GDAL_METADATA tag is ignored (confirmed with gdalinfo 3.10.3).open_geotiffparses the sidecar back into attrs, so the round-trip preserves the labels and colors.Numeric columns and
(geometry, value)pairs are unchanged.Backend coverage
Encoding happens in
_parse_inputbefore backend dispatch, so all four backends (numpy / cupy / dask+numpy / dask+cupy) burn the same integer codes. The sidecar read and write are backend-independent.Test plan
category_names, codes match category order, default fill -1.Categoricalpreserves declared order.open_geotiffreads names and colors back, no sidecar for numeric rasters.gdalinfoshows theCategories:block (guarded byshutil.which).test_rasterize.py(233 passed) and geotiff writer/reader suites green.