GH-49881 [C++][Parquet] Support writing encrypted bloom filters #49880
ArnavBalyan wants to merge 1 commit
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title to match the project's title format?
Force-pushed: ddb5171 to 65310aa
Force-pushed: 65310aa to 9b02fff
cc @wgtmac thanks
format::BloomFilterHeader header;
if (ARROW_PREDICT_FALSE(algorithm_ != BloomFilter::Algorithm::BLOCK)) {
  throw ParquetException("BloomFilter does not support Algorithm other than BLOCK");
}
header.algorithm.__set_BLOCK(format::SplitBlockAlgorithm());
if (ARROW_PREDICT_FALSE(hash_strategy_ != HashStrategy::XXHASH)) {
  throw ParquetException("BloomFilter does not support Hash other than XXHASH");
}
header.hash.__set_XXHASH(format::XxHash());
if (ARROW_PREDICT_FALSE(compression_strategy_ != CompressionStrategy::UNCOMPRESSED)) {
  throw ParquetException(
      "BloomFilter does not support Compression other than UNCOMPRESSED");
}
header.compression.__set_UNCOMPRESSED(format::Uncompressed());
header.__set_numBytes(num_bytes_);

// Bloom filter header and bitset are separate encrypted modules with different AADs.
encryptor->UpdateAad(
    encryption::CreateModuleAad(encryptor->file_aad(), encryption::kBloomFilterHeader,
                                row_group_ordinal, column_ordinal, -1));
ThriftSerializer serializer;
serializer.Serialize(&header, sink, encryptor);
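For context, the bitset that follows the header is its own encrypted module. A minimal sketch of the counterpart step, assuming the kBloomFilterBitset module constant from the encryption internals (buffer-handling details elided):

// Sketch only: after serializing the header, re-key the AAD for the bitset
// module so the two modules authenticate under distinct AADs.
encryptor->UpdateAad(
    encryption::CreateModuleAad(encryptor->file_aad(), encryption::kBloomFilterBitset,
                                row_group_ordinal, column_ordinal, -1));
// ... then encrypt the raw bitset bytes (num_bytes_ of them) with `encryptor`
// and write the ciphertext to `sink`; the exact Encryptor buffer API is
// elided here.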
Most of this is identical to BlockSplitBloomFilter::WriteTo; can you factor it out into a common subroutine?
Perhaps something like:
struct EncryptionParams {
  Encryptor* encryptor;
  int16_t row_group_ordinal;
  int16_t column_ordinal;
};

void WriteInternal(ArrowOutputStream*, std::optional<EncryptionParams>) const;
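To make that concrete, a sketch of how both entry points could delegate to the shared subroutine (everything beyond EncryptionParams and WriteInternal is illustrative):

void BlockSplitBloomFilter::WriteTo(ArrowOutputStream* sink) const {
  // Unencrypted path: no encryptor or AAD bookkeeping needed.
  WriteInternal(sink, std::nullopt);
}

void BlockSplitBloomFilter::WriteEncrypted(ArrowOutputStream* sink,
                                           Encryptor* encryptor,
                                           int16_t row_group_ordinal,
                                           int16_t column_ordinal) const {
  // Encrypted path: the ordinals let WriteInternal build the per-module
  // AADs for the bloom filter header and bitset.
  WriteInternal(sink, EncryptionParams{encryptor, row_group_ordinal, column_ordinal});
}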
PARQUET_ASSIGN_OR_THROW(auto sink, ::arrow::io::BufferOutputStream::Create());
auto file_writer = ParquetFileWriter::Open(sink, schema, writer_properties);
auto* row_group_writer = file_writer->AppendRowGroup();
Ideally we would write at least two row groups to check that the ordinal is passed properly.
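Something along these lines, with the data-writing step elided and the helper names taken from the diff above:

// Sketch: write two row groups so both row_group_ordinal 0 and 1 flow into
// the bloom filter AADs, then verify each filter reads back correctly.
PARQUET_ASSIGN_OR_THROW(auto sink, ::arrow::io::BufferOutputStream::Create());
auto file_writer = ParquetFileWriter::Open(sink, schema, writer_properties);
for (int rg = 0; rg < 2; ++rg) {
  auto* row_group_writer = file_writer->AppendRowGroup();
  // ... write column values that populate the bloom filter ...
  row_group_writer->Close();
}
file_writer->Close();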
const auto& column_props = properties_->column_encryption_properties(column_path);
if (column_props != nullptr && column_props->is_encrypted() &&
    !column_props->is_encrypted_with_footer_key()) {
  ParquetException::NYI("Bloom filter writing with a dedicated column key");
I don't understand why this isn't implemented. You're using the column encryptor already, no?
The Parquet encryption spec says this:
For encrypted columns, the following modules are always encrypted, with the same column key: pages and page headers (both dictionary and data), column indexes, offset indexes, bloom filter headers and bitsets.
Same question here: why is only the footer key supported?
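If the spec is followed, the fix would presumably be to pick the column's own meta encryptor rather than raising NYI. A rough sketch, assuming the GetColumnMetaEncryptor / GetFooterEncryptor accessors from internal_file_encryptor.h and a file_encryptor in scope (unverified against this branch):

// Rough sketch: use the column key when the column has a dedicated one,
// otherwise fall back to the footer key.
std::shared_ptr<Encryptor> encryptor =
    (column_props != nullptr && column_props->is_encrypted() &&
     !column_props->is_encrypted_with_footer_key())
        ? file_encryptor->GetColumnMetaEncryptor(column_path)
        : file_encryptor->GetFooterEncryptor();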
  throw ParquetException(
      "Encrypted files cannot contain more than 32767 columns");
}
auto* block_filter = dynamic_cast<BlockSplitBloomFilter*>(filter.get());
The dynamic_cast should not be required; just call WriteEncrypted on the base BloomFilter class.
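A minimal sketch of that shape, assuming WriteEncrypted becomes part of the BloomFilter interface (signature taken from the call site below; everything else is hypothetical):

class BloomFilter {
 public:
  virtual ~BloomFilter() = default;
  virtual void WriteTo(ArrowOutputStream* sink) const = 0;
  // Hypothetical virtual added alongside WriteTo; BlockSplitBloomFilter
  // overrides it, so callers never need a downcast.
  virtual void WriteEncrypted(ArrowOutputStream* sink, Encryptor* encryptor,
                              int16_t row_group_ordinal,
                              int16_t column_ordinal) const = 0;
};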
  throw ParquetException(
      "Only BlockSplitBloomFilter is supported for encrypted bloom filters");
}
block_filter->WriteEncrypted(sink, meta_encryptor.get(), static_cast<int16_t>(i),
Column metadata is encrypted during column close (metadata.cc:1765), but bloom filter offsets are set later (metadata.cc:2066). For column-key metadata, readers use the decrypted encrypted_column_metadata, so they would miss the bloom filter offset/length unless the metadata encryption order is fixed.
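To spell out the sequencing hazard (step names hypothetical):

// Today (broken for column keys):
//   1. column writer close   -> ColumnMetaData serialized and encrypted
//   2. bloom filters written -> bloom_filter_offset/length set on metadata
// The encrypted_column_metadata snapshot from step 1 never sees step 2's
// fields, so a reader decrypting it finds no bloom filter offset.
//
// Required order:
//   1. write bloom filters and record their offsets/lengths
//   2. serialize and encrypt the column metadata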
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?