arrow-util: canonicalize ranges decoded from Parquet #36499
Open
DAlperin wants to merge 1 commit into
Conversation
The reader for range columns called `push_range_with`, which writes the range to the row verbatim. For discrete range types (int4/int8/date), Parquet files authored by external engines may encode ranges with non-canonical bounds — e.g. `[1,10]` for `int4range`, which MZ stores internally as `[1,11)`. Without canonicalization those rows do not compare or hash equal to logically identical values constructed inside MZ, so `COPY FROM PARQUET` rows mismatch their pure-SQL counterparts.

Switch the decode path to `push_range`, which canonicalizes the range before packing. Add a unit test that decodes a hand-built non-canonical arrow `StructArray`, and a testdrive regression test that round-trips a DuckDB-authored Parquet file through `COPY FROM ... (FORMAT PARQUET)`.

Fixes DB-55.

https://claude.ai/code/session_01KXioUsT2r5o2tRwrsHm4iR
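The canonicalization rule for discrete ranges can be sketched as follows. This is a minimal Python illustration of the semantics only — the actual fix lives in the Rust `push_range` path, and this helper is a hypothetical stand-in:

```python
# Illustrative sketch (NOT Materialize's Rust implementation): discrete
# integer ranges are canonicalized to inclusive-lower/exclusive-upper form,
# so that e.g. [1,10] and [1,11) denote — and compare as — the same range.
def canonicalize_int_range(lower, lower_inclusive, upper, upper_inclusive):
    """Return (lower, upper) bounds in canonical [lower, upper) form.

    A bound of None means unbounded and is left untouched.
    """
    if lower is not None and not lower_inclusive:
        lower += 1  # (1, ...  becomes  [2, ...
    if upper is not None and upper_inclusive:
        upper += 1  # ..., 10]  becomes  ..., 11)
    return lower, upper
```

Under this rule a DuckDB-authored `[1,10]` and an MZ-built `[1,11)` both canonicalize to bounds `(1, 11)`, which is why canonicalizing at decode time makes the decoded rows compare and hash equal.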
Contributor def- approved these changes on May 11, 2026 and left a comment:

Thanks for adding a test, lgtm!
Comment on lines +287 to +293:

```python
key = _setup(c)

c.run_testdrive_files(
    f"--var=s3-access-key={key}",
    "--var=aws-endpoint=minio:9000",
    "range-noncanonical.td",
)
```
Contributor
Many workflows in test/iceberg do just this; we could put them all in a single workflow_cdc that iterates over the files instead? But that's more of a test cleanup, not that important for this PR itself.