[C++][Parquet] High memory usage on large Parquet file reading with Project and Filter, with Scanner::Scan() #49976
alexeyroytman asked this question in Q&A · Unanswered · 0 replies
Hello.
I'm trying to build a proof of concept that uses Arrow C++ to read a large Parquet file with a partial projection and a filter. (I've done the same earlier in Java, with the Apache Parquet project.)
I've seen a code example that uses `ScannerBuilder::Project()` and `ScannerBuilder::Filter()`. In my case, the process uses over 300% CPU, accumulates more than 30 GB of virtual memory, starts thrashing, and at some point the OOM killer kills it.

Some background on the Parquet file contents:
gzip -9", it makes it 8.1 GiB).The reading context:
The reading context:

- I call `ScannerBuilder::Project()`, `ScannerBuilder::Filter()`, and then `Scanner::Scan()` (a minimal sketch of this path follows the list).
- Among the batches coming out of `Scanner::Scan()`, some have 0 rows; I'm not sure whether this is relevant.
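A minimal sketch of that reading path follows. The file path, column names, and the filter predicate are placeholders, and I'm showing the batch loop with `Scanner::ScanBatches()`; the projection/filter setup is the relevant part:

```cpp
#include <arrow/api.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

#include <iostream>
#include <memory>

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

arrow::Status ScanLargeParquet() {
  // Build a single-file dataset over the local filesystem.
  auto filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(filesystem, {"/path/to/large.parquet"},
                                         format, ds::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  // Project a subset of columns and push the filter down to the scan.
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->Project({"col_a", "col_b"}));
  ARROW_RETURN_NOT_OK(
      builder->Filter(cp::greater(cp::field_ref("col_a"), cp::literal(42))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  // Stream the result batch by batch; some batches arrive with 0 rows.
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  for (auto maybe_batch : batches) {
    ARROW_ASSIGN_OR_RAISE(auto tagged, std::move(maybe_batch));
    std::cout << tagged.record_batch->num_rows() << " rows in batch\n";
  }
  return arrow::Status::OK();
}
```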
I traced the large memory allocations (500 MiB and above); they all go through `PoolBuffer::Resize()` and/or `PoolBuffer::Reserve()`. In the stack traces I see:
```
parquet::TypedDecoder<parquet::PhysicalType<(parquet::Type::type)6> >::DecodeArrowNonNull(int, parquet::EncodingTraits<parquet::PhysicalType<(parquet::Type::type)6> >::Accumulator*)
TransferColumnData(parquet::internal::RecordReader*, std::unique_ptr<parquet::ColumnChunkMetaData, std::default_delete<parquet::ColumnChunkMetaData> >, std::shared_ptr<arrow::Field> const&, parquet::ColumnDescriptor const*, parquet::arrow::ReaderContext const*, std::shared_ptr<arrow::ChunkedArray>*)
```
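To watch the pool from inside the process, one simple option is polling the default memory pool's counters after each consumed batch; a sketch (where exactly to call it is up to the scan loop):

```cpp
#include <arrow/memory_pool.h>

#include <iostream>

// Print cumulative stats for Arrow's default memory pool.
// Calling this after each consumed batch shows where usage grows.
void ReportDefaultPool() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::cout << "backend: " << pool->backend_name()
            << ", allocated: " << pool->bytes_allocated()
            << " B, peak: " << pool->max_memory() << " B\n";
}
```

If I read the headers right, Arrow also ships `arrow::LoggingMemoryPool`, which wraps another pool and logs each allocation, and could be passed to the scan via `ScannerBuilder::Pool()`.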