arrow-util: canonicalize ranges decoded from Parquet #36499
Open
DAlperin wants to merge 1 commit into
Conversation
The reader for range columns called `push_range_with`, which writes the range to the row verbatim. For discrete range types (int4/int8/date), Parquet files authored by external engines may encode ranges with non-canonical bounds — e.g. `[1,10]` for `int4range`, which MZ stores internally as `[1,11)`. Without canonicalization those rows do not compare or hash equal to logically identical values constructed inside MZ, so `COPY FROM PARQUET` rows mismatch their pure-SQL counterparts.

Switch the decode path to `push_range`, which canonicalizes the range before packing. Add a unit test that decodes a hand-built non-canonical arrow `StructArray`, and a testdrive regression test that round-trips a DuckDB-authored Parquet file through `COPY FROM ... (FORMAT PARQUET)`.

Fixes DB-55.

https://claude.ai/code/session_01KXioUsT2r5o2tRwrsHm4iR
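The canonicalization rule for discrete ranges can be sketched as follows. This is a minimal Python illustration of the semantics only — the actual fix lives in the Rust `push_range` path, and this helper is a hypothetical stand-in:

```python
# Illustrative sketch (NOT Materialize's Rust implementation): discrete
# integer ranges are canonicalized to inclusive-lower/exclusive-upper form,
# so that e.g. [1,10] and [1,11) denote — and compare as — the same range.
def canonicalize_int_range(lower, lower_inclusive, upper, upper_inclusive):
    """Return (lower, upper) bounds in canonical [lower, upper) form.

    A bound of None means unbounded and is left untouched.
    """
    if lower is not None and not lower_inclusive:
        lower += 1  # (1, ...  becomes  [2, ...
    if upper is not None and upper_inclusive:
        upper += 1  # ..., 10]  becomes  ..., 11)
    return lower, upper
```

Under this rule a DuckDB-authored `[1,10]` and an MZ-built `[1,11)` both canonicalize to bounds `(1, 11)`, which is why canonicalizing at decode time makes the decoded rows compare and hash equal.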
Contributor def- approved these changes on May 11, 2026 and left a comment:

Thanks for adding a test, lgtm!
Comment on lines +287 to +293:

```python
key = _setup(c)

c.run_testdrive_files(
    f"--var=s3-access-key={key}",
    "--var=aws-endpoint=minio:9000",
    "range-noncanonical.td",
)
```
Contributor
Many workflows in test/iceberg do just this; we could put them all in a single workflow_cdc that iterates over the files instead? But that's more of a test cleanup, not that important for this PR itself.