Add regression tests for Parquet large binary offset overflow#9361
#9361 · vigneshsiva11 wants to merge 5 commits into apache:main
Conversation
Pull request overview
This PR adds regression tests for a Parquet reader issue where reading files with more than 2GB of binary data in a single RecordBatch causes an offset overflow panic. The tests document the failure mode across multiple Parquet encodings and are marked as ignored to avoid excessive memory usage in CI.
Changes:
- Added test file `large_string_overflow.rs` with four regression tests covering different Parquet encodings (PLAIN, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, and RLE_DICTIONARY) for large binary data
- Tests create Parquet files whose binary data exceeds 2GB when decoded into a single BinaryArray
- All tests are marked with `#[ignore]` to document the existing failure without causing CI issues (see the sketch after this list)
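For illustration, the new tests follow roughly this pattern (the function name here is hypothetical; the real tests are named per encoding, e.g. the RLE_DICTIONARY variant quoted in the review below):

```rust
// Sketch only: the body steps are summarized as comments rather than real code.
// `#[ignore]` keeps the ~2 GB allocation out of ordinary CI runs; the tests can
// still be run locally with `cargo test -- --ignored`.
#[test]
#[ignore]
fn large_binary_overflow_regression() {
    // 1. build a binary column whose total byte length exceeds i32::MAX
    // 2. write it to a Parquet file with the encoding under test
    // 3. read it back and assert the RecordBatch decodes successfully
}
```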
```
@@ -0,0 +1,116 @@
use std::sync::Arc;
```
This file is missing the Apache License header that is standard across all files in this codebase. Please add the Apache Software Foundation license header at the beginning of the file, following the same format as other test files in this directory (see parquet/tests/arrow_reader/bad_data.rs or checksum.rs for reference).
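For reference, the standard ASF license header used at the top of Rust files in this repository looks like the following (copying it verbatim from an existing test file such as bad_data.rs is the safest approach):

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.
```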
```rust
fn large_binary_rle_dictionary_encoding_overflow() {
    let array = make_large_binary_array();
    let file =
        write_parquet_with_encoding(array, Encoding::RLE_DICTIONARY);
```
This test will panic when run. The WriterProperties builder at line 40 calls set_encoding(encoding) with Encoding::RLE_DICTIONARY, but according to the parquet properties implementation (parquet/src/file/properties.rs:1087-1089), calling set_encoding with RLE_DICTIONARY or PLAIN_DICTIONARY will panic with "Dictionary encoding can not be used as fallback encoding".
To enable dictionary encoding, you should only use set_dictionary_enabled(true) without calling set_encoding(Encoding::RLE_DICTIONARY). The dictionary encoding will be used automatically when dictionary is enabled.
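A minimal sketch of what that dictionary variant could look like (the helper name `dictionary_writer_props` is illustrative, not from the PR):

```rust
use parquet::file::properties::WriterProperties;

// Dictionary case: enable the dictionary and leave the fallback encoding at its
// default instead of forcing RLE_DICTIONARY through `set_encoding`, which panics.
fn dictionary_writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_dictionary_enabled(true)
        .build()
}
```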
```rust
let props = WriterProperties::builder()
    .set_dictionary_enabled(true)
    .set_encoding(encoding)
    .build();
```
The write_parquet_with_encoding function sets both set_dictionary_enabled(true) and set_encoding(encoding). According to the WriterProperties documentation, when dictionary is enabled and a specific encoding is set, the specified encoding becomes the "fallback" encoding (see parquet/src/file/properties.rs:1079-1081). This means these tests may not actually be testing the specified encodings directly, but rather testing dictionary encoding with these as fallbacks.
For more accurate testing of specific encodings, consider setting set_dictionary_enabled(false) for non-dictionary encodings (PLAIN, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY). The dictionary test should only use set_dictionary_enabled(true) without calling set_encoding.
Current:

```rust
let props = WriterProperties::builder()
    .set_dictionary_enabled(true)
    .set_encoding(encoding)
    .build();
```

Suggested change:

```rust
let builder = WriterProperties::builder();
// For non-dictionary encodings, disable dictionary and set the encoding directly.
// For dictionary encoding, enable dictionary and rely on the default dictionary behavior.
let builder = match encoding {
    Encoding::RLE_DICTIONARY => builder.set_dictionary_enabled(true),
    _ => builder
        .set_dictionary_enabled(false)
        .set_encoding(encoding),
};
let props = builder.build();
```
Which issue does this PR close?
Rationale for this change
This PR adds regression coverage for an offset overflow panic encountered when reading Parquet files with large binary columns using the Arrow Parquet reader.
The issue occurs when a single RecordBatch contains more than 2GB of binary data and is decoded into a `BinaryArray` using 32-bit offsets. Similar failures are observed across multiple Parquet encodings. Adding regression tests helps document the failure mode and provides a foundation for validating a follow-up fix.
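As a rough illustration of where that limit sits (the 16 MiB per-value size is an assumption for the example, not the size used by the tests):

```rust
fn main() {
    // `BinaryArray` is `GenericBinaryArray<i32>`: value offsets are i32, so the
    // cumulative byte length of all values in one array is capped at i32::MAX.
    let max_total_bytes = i32::MAX as u64; // 2_147_483_647 bytes, just under 2 GiB
    let value_len: u64 = 16 * 1024 * 1024; // assume 16 MiB per value
    let values_to_overflow = max_total_bytes / value_len + 1;
    println!("{values_to_overflow} values of 16 MiB overflow the i32 offset range"); // 128
}
```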
What changes are included in this PR?
This PR introduces a new test file under `parquet/tests/arrow_reader` that adds regression tests covering large binary columns for the following Parquet encodings:
- PLAIN
- DELTA_LENGTH_BYTE_ARRAY
- DELTA_BYTE_ARRAY
- RLE_DICTIONARY

The tests construct Parquet files that exceed the 32-bit offset limit when decoded into a single Arrow `BinaryArray` and assert successful RecordBatch decoding.
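For context, a scaled-down sketch of the write/read round trip such a test performs (the column name, value sizes, and use of `tempfile` are illustrative; the real tests write enough data for the decoded column to exceed `i32::MAX` bytes):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, BinaryArray, RecordBatch};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Tiny stand-in column; the regression tests build one whose total byte
    // length exceeds i32::MAX so decoding it into a single batch overflows.
    let values: Vec<&[u8]> = vec![b"a".as_slice(), b"bb".as_slice(), b"ccc".as_slice()];
    let array: ArrayRef = Arc::new(BinaryArray::from_vec(values));
    let batch = RecordBatch::try_from_iter([("col", array)])?;

    // Write the batch to a temporary Parquet file.
    let file = tempfile::tempfile()?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file.try_clone()?, batch.schema(), Some(props))?;
    writer.write(&batch)?;
    writer.close()?;

    // Read everything back in one large batch; with > 2 GB of binary data this
    // is where the reader currently panics with an offset overflow.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(usize::MAX)
        .build()?;
    for batch in reader {
        let _ = batch?;
    }
    Ok(())
}
```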
Are these changes tested?
Yes. This PR consists entirely of regression tests.
The tests are currently marked as ignored to avoid excessive memory usage in CI and to document the existing failure mode. They are intended to be enabled once the underlying reader logic is updated to safely handle large binary data without overflowing offsets.
Are there any user-facing changes?
No. This PR only adds tests and does not introduce any user-facing changes.