[C++][Parquet] High memory usage on large Parquet file reading with Project and Filter, with Scanner::Scan() #49976
alexeyroytman asked this question in Q&A · Unanswered · 0 replies
Hello.
I'm trying to build a proof of concept that uses Arrow C++ to read a large Parquet file with a partial projection and a filter. (I've done the same earlier in Java, with the Apache Parquet project.)
I've seen a code example that uses `ScannerBuilder::Project()` and `ScannerBuilder::Filter()`. In my case, the process uses over 300% CPU, accumulates more than 30 GB of virtual memory, starts thrashing, and at some point the OOM killer kills it.

Some background on the Parquet file contents:
gzip -9", it makes it 8.1 GiB).The reading context:
The reading context:

- I call `ScannerBuilder::Project()`, `ScannerBuilder::Filter()`, and then `Scanner::Scan()` (a minimal sketch of this path follows the list).
- Among the batches coming out of `Scanner::Scan()`, some have 0 rows; I'm not sure whether this is relevant.
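A minimal sketch of that reading path follows. The file path, column names, and the filter predicate are placeholders, and I'm showing the batch loop with `Scanner::ScanBatches()`; the projection/filter setup is the relevant part:

```cpp
#include <arrow/api.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

#include <iostream>
#include <memory>

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

arrow::Status ScanLargeParquet() {
  // Build a single-file dataset over the local filesystem.
  auto filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(filesystem, {"/path/to/large.parquet"},
                                         format, ds::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  // Project a subset of columns and push the filter down to the scan.
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->Project({"col_a", "col_b"}));
  ARROW_RETURN_NOT_OK(
      builder->Filter(cp::greater(cp::field_ref("col_a"), cp::literal(42))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  // Stream the result batch by batch; some batches arrive with 0 rows.
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  for (auto maybe_batch : batches) {
    ARROW_ASSIGN_OR_RAISE(auto tagged, std::move(maybe_batch));
    std::cout << tagged.record_batch->num_rows() << " rows in batch\n";
  }
  return arrow::Status::OK();
}
```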
I traced the large memory allocations (500 MiB and above); they all go through `PoolBuffer::Resize()` and/or `PoolBuffer::Reserve()`. In the stack traces I see:
```
parquet::TypedDecoder<parquet::PhysicalType<(parquet::Type::type)6> >::DecodeArrowNonNull(int, parquet::EncodingTraits<parquet::PhysicalType<(parquet::Type::type)6> >::Accumulator*)
TransferColumnData(parquet::internal::RecordReader*, std::unique_ptr<parquet::ColumnChunkMetaData, std::default_delete<parquet::ColumnChunkMetaData> >, std::shared_ptr<arrow::Field> const&, parquet::ColumnDescriptor const*, parquet::arrow::ReaderContext const*, std::shared_ptr<arrow::ChunkedArray>*)
```
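To watch the pool from inside the process, one simple option is polling the default memory pool's counters after each consumed batch; a sketch (where exactly to call it is up to the scan loop):

```cpp
#include <arrow/memory_pool.h>

#include <iostream>

// Print cumulative stats for Arrow's default memory pool.
// Calling this after each consumed batch shows where usage grows.
void ReportDefaultPool() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::cout << "backend: " << pool->backend_name()
            << ", allocated: " << pool->bytes_allocated()
            << " B, peak: " << pool->max_memory() << " B\n";
}
```

If I read the headers right, Arrow also ships `arrow::LoggingMemoryPool`, which wraps another pool and logs each allocation, and could be passed to the scan via `ScannerBuilder::Pool()`.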