fix: Fix ORC bloom filter hashing to match official implementation#76
fix: Fix ORC bloom filter hashing to match official implementation#76WenyXu merged 4 commits intodatafusion-contrib:mainfrom
Conversation
|
@codex review |
|
Did you look at how the If both ORC and Parquet's Bloom Filters work the same way, we could share a single implementation. |
I looked at the Parquet-rs code and found that it's difficult to do. |
|
https://parquet.apache.org/docs/file-format/bloomfilter/
|
There was a problem hiding this comment.
Pull request overview
This PR fixes the ORC bloom filter hashing implementation to match the official ORC Java specification. The key change is replacing the previous Murmur3 x64_128 (seed 0) approach with ORC's official Murmur3 hash64 (seed 104729) combined with Thomas Wang's 64-bit mix function for integer values. The PR also refactors the bloom filter API to use add_hash and test_hash methods instead of raw byte operations, and adds comprehensive hit-case tests alongside the existing miss-case tests.
- Implements ORC-compliant Murmur3 hash64 with seed 104729 for byte values
- Adds Thomas Wang's 64-bit mix function for hashing integer/long values
- Refactors double-hash bit selection to match ORC Java (i=1..k, signed 32-bit split, negative-flip)
- Simplifies bloom filter testing through new
add_hashandtest_hashAPIs - Removes the external
murmur3crate dependency in favor of an internal implementation
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/bloom_filter.rs | Implements ORC-compliant hash functions (murmur3_64_orc, hash_long), adds add_hash/test_hash APIs, refactors bit indexing logic to match Java implementation |
| src/row_group_filter.rs | Updates predicate value hashing to use new hash_long/hash_bytes methods for all types, removes Cow-based byte serialization |
| tests/integration/main.rs | Refactors bloom filter integration test with count_rows helper, adds hit-case assertions for all tested types, improves test readability |
| Cargo.toml | Removes murmur3 external dependency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
add_hashandtest_hashAPIs to simplify Bloom filter construction and testing.Testing
cargo test -p orc-rust --features cli test_bloom_might_contain_truecargo test -p orc-rust --test integration bloom_filter_predicate_prunes