
Parquet: prevent binary offset overflow by stopping batch early #9362

Open

vigneshsiva11 wants to merge 1 commit into apache:main from vigneshsiva11:fix-parquet-large-binary-batch-splitting

Conversation

@vigneshsiva11
Contributor

Which issue does this PR close?

Rationale for this change

When reading Parquet files that contain very large binary or string values, the Arrow Parquet reader can attempt to construct a RecordBatch whose total value buffer exceeds the maximum representable 32-bit offset (i32::MAX, just under 2 GiB for Binary/Utf8 arrays). This can lead to an overflow error or a panic during decoding.

Instead of allowing the buffer to overflow and failing late, the reader should detect this condition early and stop decoding before the offset exceeds the representable limit. This behavior is consistent with other Arrow implementations (for example, PyArrow), which emit smaller batches when encountering very large row groups.
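
For context, Arrow's BinaryArray and StringArray store values behind 32-bit offsets, so the cumulative value bytes in one array must stay at or below i32::MAX. A minimal sketch of that invariant, with an illustrative helper name rather than anything from the crate:

```rust
/// Illustrative helper (not part of the parquet crate): returns true if
/// appending `next_value_len` bytes would push the cumulative value length
/// past the largest offset a 32-bit offset array can represent.
fn would_overflow(current_values_len: usize, next_value_len: usize) -> bool {
    current_values_len
        .checked_add(next_value_len)
        .map(|end| end > i32::MAX as usize)
        .unwrap_or(true) // the usize addition itself overflowed
}
```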

What changes are included in this PR?

  • Add an early overflow check when appending binary values to the Arrow offset buffer.
  • Ensure the overflow condition is detected before mutating internal buffers (see the sketch after this list).
  • Return a controlled error instead of panicking when the offset limit would be exceeded.
  • Apply the fix uniformly across all byte array decoding paths (plain, dictionary, and delta encodings) via the shared offset buffer logic.
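
A simplified sketch of that check-before-mutate ordering in the shared offset buffer path follows; the field names, error type, and signature are approximations for illustration, not the exact source (the real OffsetBuffer is generic over the offset width):

```rust
/// Simplified stand-in for the crate's OffsetBuffer, using i32 offsets
/// and a String error for brevity.
struct OffsetBuffer {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl OffsetBuffer {
    fn try_push(&mut self, data: &[u8]) -> Result<(), String> {
        // Compute and validate the next offset BEFORE touching either
        // buffer, so a failed push cannot leave offsets and values
        // inconsistent with each other.
        let next_offset = self
            .values
            .len()
            .checked_add(data.len())
            .filter(|&end| end <= i32::MAX as usize)
            .ok_or_else(|| "binary offset overflow; ending batch early".to_string())?;

        // Only mutate once the new offset is known to be representable.
        self.values.extend_from_slice(data);
        self.offsets.push(next_offset as i32);
        Ok(())
    }
}
```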

Are these changes tested?

Yes.

  • Regression tests covering large binary values were added in a separate PR.
  • Existing Parquet reader and writer tests continue to pass in CI.

Note: Some Parquet and Arrow integration tests require external test data provided via git submodules (parquet-testing and testing). These submodules are not present in a minimal local checkout but are initialized in CI.
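
For anyone running these tests locally, the standard git invocation fetches both submodules by the paths named above:

```
git submodule update --init parquet-testing testing
```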

Are there any user-facing changes?

Yes.

  • Reading Parquet files with very large binary or string columns will no longer panic or fail late due to offset overflow.
  • The reader now stops batch construction early and reports the error safely.

There are no breaking changes to public APIs.
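
To illustrate the user-facing behavior, here is a hedged sketch of a typical read loop using the crate's arrow_reader API (the path and batch size are placeholders); the point is that the overflow now surfaces as an ordinary Err rather than a panic:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;

fn read_all(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;

    for batch in reader {
        // With this fix, an oversized binary/string column shows up here
        // as Err(ArrowError) instead of aborting the process mid-decode.
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```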

Copilot AI review requested due to automatic review settings February 5, 2026 17:19
@github-actions github-actions bot added the parquet label (Changes to the parquet crate) on Feb 5, 2026

Copilot AI left a comment

Pull request overview

This PR fixes a critical bug where reading Parquet files containing very large binary or string values could cause an offset overflow error or panic. The fix moves the overflow check to occur before buffer mutation, ensuring that the internal state remains consistent if an overflow would occur.

Changes:

  • Modified try_push method in OffsetBuffer to calculate and validate the next offset before mutating internal buffers
  • The overflow detection now happens before calling extend_from_slice and push, preventing partial state corruption




Development

Successfully merging this pull request may close these issues.

Error when reading row group larger than 2GB (total string length per 8k row batch exceeds 2GB)
