Skip to content

fix: Fix ORC bloom filter hashing to match official implementation#76

Merged
WenyXu merged 4 commits intodatafusion-contrib:mainfrom
suxiaogang223:fix_bloom
Dec 26, 2025
Merged

fix: Fix ORC bloom filter hashing to match official implementation#76
WenyXu merged 4 commits intodatafusion-contrib:mainfrom
suxiaogang223:fix_bloom

Conversation

@suxiaogang223
Copy link
Contributor

@suxiaogang223 suxiaogang223 commented Dec 22, 2025

Summary

  • Related Pr: feat: Predicate Filtering via ORC Bloom Filters #72
  • Align Bloom filter hashing with ORC Java (Murmur3 hash64 seed 104729 + Thomas Wang) and use the same double-hash bit selection.
  • Add add_hash and test_hash APIs to simplify Bloom filter construction and testing.
  • Refactor bloom-related tests and integration assertions; add a hit-case and reduce duplication.
  • Update bloom filter test data generator and expected output where needed.

Testing

  • cargo test -p orc-rust --features cli test_bloom_might_contain_true
  • cargo test -p orc-rust --test integration bloom_filter_predicate_prunes

@suxiaogang223 suxiaogang223 changed the title Fix bloom fix: Fix ORC bloom filter hashing to match official implementation Dec 22, 2025
@suxiaogang223
Copy link
Contributor Author

@codex review

@progval
Copy link
Collaborator

progval commented Dec 22, 2025

Did you look at how the parquet crate does it? They implement Bloom Filters and I'm guessing they also follow Java's signedness.

If both ORC and Parquet's Bloom Filters work the same way, we could share a single implementation.

@suxiaogang223
Copy link
Contributor Author

Did you look at how the parquet crate does it? They use Bloom Filter and I'm guessing they also follow Java's signedness.

If both ORC and Parquet's Bloom Filters work the same way, we could share a single implementation.

I looked at the Parquet-rs code and found that it's difficult to do.
https://arrow.apache.org/rust/parquet/bloom_filter/index.html

@suxiaogang223
Copy link
Contributor Author

https://parquet.apache.org/docs/file-format/bloomfilter/
https://orc.apache.org/specification/ORCv1/
Summary from chatgpt:

Parquet and ORC differ mainly in how they compute and use hashes for Bloom filters: Parquet uses a fixed, standardized approach where every value is first hashed with XXH64 (seed = 0), and that single 64-bit hash is then mapped into a Split Block Bloom Filter by selecting one block and deterministically setting 8 bits inside that block using predefined salts, emphasizing cache efficiency and predictable performance; ORC, by contrast, uses a more traditional Bloom filter design where the base hash depends on the data type (Wang 64-bit hash for numeric types and Murmur3-64 for strings/binary data), and multiple hash positions are derived from that base hash using linear combination (hash1 + i·hash2), with the number of hash functions configurable, prioritizing flexibility over a fixed, hardware-friendly layout.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the ORC bloom filter hashing implementation to match the official ORC Java specification. The key change is replacing the previous Murmur3 x64_128 (seed 0) approach with ORC's official Murmur3 hash64 (seed 104729) combined with Thomas Wang's 64-bit mix function for integer values. The PR also refactors the bloom filter API to use add_hash and test_hash methods instead of raw byte operations, and adds comprehensive hit-case tests alongside the existing miss-case tests.

  • Implements ORC-compliant Murmur3 hash64 with seed 104729 for byte values
  • Adds Thomas Wang's 64-bit mix function for hashing integer/long values
  • Refactors double-hash bit selection to match ORC Java (i=1..k, signed 32-bit split, negative-flip)
  • Simplifies bloom filter testing through new add_hash and test_hash APIs
  • Removes the external murmur3 crate dependency in favor of an internal implementation

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/bloom_filter.rs Implements ORC-compliant hash functions (murmur3_64_orc, hash_long), adds add_hash/test_hash APIs, refactors bit indexing logic to match Java implementation
src/row_group_filter.rs Updates predicate value hashing to use new hash_long/hash_bytes methods for all types, removes Cow-based byte serialization
tests/integration/main.rs Refactors bloom filter integration test with count_rows helper, adds hit-case assertions for all tested types, improves test readability
Cargo.toml Removes murmur3 external dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@WenyXu WenyXu merged commit 46de7f0 into datafusion-contrib:main Dec 26, 2025
12 checks passed
@suxiaogang223 suxiaogang223 deleted the fix_bloom branch February 13, 2026 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants