Optimize data page statistics conversion (up to 4x) #9303
Dandandan merged 18 commits into apache:main
Conversation
run benchmark arrow_statistics

🤖
```rust
let mut b = UInt8Builder::with_capacity(capacity);
for (len, index) in chunks {
    match index {
        ColumnIndexMetaData::INT32(index) => {
```
🤖: Benchmark completed

Almost 4x even!
alamb
left a comment
Makes sense to me -- thank you @Dandandan
```rust
            _ => b.append_nulls(len),
        }
    }
    Ok(Arc::new(b.finish()))
```
Seeing all these calls to finish with a builder that is not re-used looks wasteful to me -- I also tried to code up a PR to make this faster too (both finish as well as add a new build):
Cool, in this case it probably does not help that much (most of the time is spent during allocations / capacity checks / copies... while building the values).
Yeah, I agree the win is likely pretty small, but every little allocation helps
FWIW we could probably make it even better by not decoding into the ParquetMetadata structures at all (and directly decode Thrift to arrow arrays 🤔 )
Yes, the latter would be best (and preferably also do validation directly on read / convert them to the right type, so it could just pass them on)
For anyone else following along, this is tracked here
```diff
-    Ok(UInt64Array::from_iter(iter))
+    let chunks: Vec<_> = iterator.collect();
+    let total_capacity: usize = chunks.iter().map(|(len, _)| *len).sum();
+    let mut builder = UInt64Builder::with_capacity(total_capacity);
```
this seems like it saves an allocation too
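To make the saving concrete, here is a standalone sketch (plain std `Vec` stands in for the arrow builders; `count_reallocations` is a name made up for this illustration): summing the chunk lengths up front lets the output be allocated once, so appending never triggers a reallocation, while growing on demand reallocates repeatedly.

```rust
// Standalone sketch, not the PR's actual code: count how many times a
// Vec's backing allocation changes while appending `lens` chunks of zeros.
fn count_reallocations(lens: &[usize], preallocate: bool) -> usize {
    let mut out: Vec<u64> = if preallocate {
        // Sum the chunk lengths first and allocate exactly once.
        Vec::with_capacity(lens.iter().sum())
    } else {
        Vec::new()
    };
    let mut reallocations = 0;
    let mut last_cap = out.capacity();
    for &len in lens {
        for _ in 0..len {
            out.push(0);
            if out.capacity() != last_cap {
                reallocations += 1;
                last_cap = out.capacity();
            }
        }
    }
    reallocations
}

fn main() {
    let lens = [10, 20, 30];
    // Growing on demand reallocates several times as capacity doubles.
    assert!(count_reallocations(&lens, false) > 1);
    // Preallocating from the summed lengths never reallocates.
    assert_eq!(count_reallocations(&lens, true), 0);
}
```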
run benchmark arrow_statistics

Benchmark script failed with exit code 101.

run benchmark arrow_statistics

🤖: Benchmark completed
alamb
left a comment
Very impressive @Dandandan -- thank you
I have one small comment (found by codex and I verified) about a potential change in FixedSizeBinary stats handling, but otherwise this looks great
```rust
}
/// Returns the null pages.
///
/// Values may be `None` when [`ColumnIndex::is_null_page()`] is `true`.
```
I don't understand this comment -- this API returns a vector of which pages are null. What is `None`? Or is it trying to say that calling `Self::min_value` and `Self::max_value` will return `None` when `is_null_page` is true?
Oh I should have removed that one (I think the comment was autogenerated).
I thought I could pass the values slice and null_pages slice to the builder (it is another 30% or so faster) instead of the iterator code, but the null_pages unfortunately has the valid/invalid reversed compared to "validity".
> but the null_pages unfortunately has the valid/invalid reversed compared to "validity".
Maybe it is time (another PR) to actually implement something like unary_mut for boolean buffers (so we could invert the bits without a new allocation)
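The idea can be sketched minimally (hypothetical names; a plain `Vec<u8>` stands in for arrow's boolean buffer, and a real implementation would also have to mask the padding bits past the last page, which this sketch ignores): flip the bits of the owned bitmap in place instead of allocating an inverted copy.

```rust
// Hypothetical sketch of inverting a null_pages bitmap into a validity
// bitmap in place. A plain Vec<u8> stands in for arrow's BooleanBuffer.
fn invert_bits_in_place(bitmap: &mut [u8]) {
    for byte in bitmap.iter_mut() {
        // bit set = "page is null" becomes bit set = "page is valid"
        *byte = !*byte;
    }
}

fn main() {
    // Pages 0 and 2 are null pages...
    let mut bitmap = vec![0b0000_0101u8];
    invert_bits_in_place(&mut bitmap);
    // ...so after inversion every other bit marks a valid page.
    assert_eq!(bitmap[0], 0b1111_1010);
}
```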
Perhaps, although I think here we would ideally leave the original in place and copy it (or avoid the copy altogether by returning a vec/iterator, preferably...)
And now I see you have already filed 👍
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
🚀
Which issue does this PR close?
Rationale for this change
Loading statistics is notably inefficient. This makes the conversion from the structure to arrow arrays a bit faster by avoiding allocations, until we get a more efficient encoding directly (#9296)
Details
What changes are included in this PR?
Converts the inefficient iterator-based code (which doesn't really fit the iterator pattern well) so that it just traverses the values and uses the builders. (I think it's mostly converting one bunch of ugly code into another bunch of ugly code.)
Additionally tries to preallocate where possible.
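The shape of the change can be sketched like this (std types and made-up names, not the PR's actual signatures): the iterator version flattens the chunks into one long iterator and collects it, so the collector cannot know the final length up front, while the traversal version sums the `(len, value)` chunk lengths first and appends whole runs into a preallocated output.

```rust
// Sketch with std types (made-up names; the PR uses arrow builders).
// Before: flatten chunks into one iterator and collect; flat_map's size
// hint has a lower bound of 0, so collect cannot preallocate.
fn via_iterator(chunks: &[(usize, Option<u64>)]) -> Vec<Option<u64>> {
    chunks
        .iter()
        .flat_map(|&(len, v)| std::iter::repeat(v).take(len))
        .collect()
}

// After: sum the chunk lengths, preallocate once, then append whole
// runs of values or nulls (the role played by `append_nulls` in the PR).
fn via_builder(chunks: &[(usize, Option<u64>)]) -> Vec<Option<u64>> {
    let total: usize = chunks.iter().map(|(len, _)| len).sum();
    let mut out = Vec::with_capacity(total);
    for &(len, v) in chunks {
        out.extend(std::iter::repeat(v).take(len));
    }
    out
}

fn main() {
    let chunks = [(2, Some(7u64)), (3, None), (1, Some(9))];
    // Both produce the same values; only the allocation pattern differs.
    assert_eq!(via_iterator(&chunks), via_builder(&chunks));
    assert_eq!(via_builder(&chunks).len(), 6);
}
```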
Are these changes tested?
Existing tests
Are there any user-facing changes?