GH-49881 [C++][Parquet] Support writing encrypted bloom filters #49880
ArnavBalyan wants to merge 1 commit
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title to match the project's title format?
Force-pushed: ddb5171 to 65310aa
Force-pushed: 65310aa to 9b02fff
cc @wgtmac thanks
format::BloomFilterHeader header;
if (ARROW_PREDICT_FALSE(algorithm_ != BloomFilter::Algorithm::BLOCK)) {
  throw ParquetException("BloomFilter does not support Algorithm other than BLOCK");
}
header.algorithm.__set_BLOCK(format::SplitBlockAlgorithm());
if (ARROW_PREDICT_FALSE(hash_strategy_ != HashStrategy::XXHASH)) {
  throw ParquetException("BloomFilter does not support Hash other than XXHASH");
}
header.hash.__set_XXHASH(format::XxHash());
if (ARROW_PREDICT_FALSE(compression_strategy_ != CompressionStrategy::UNCOMPRESSED)) {
  throw ParquetException(
      "BloomFilter does not support Compression other than UNCOMPRESSED");
}
header.compression.__set_UNCOMPRESSED(format::Uncompressed());
header.__set_numBytes(num_bytes_);

// Bloom filter header and bitset are separate encrypted modules with different AADs.
encryptor->UpdateAad(
    encryption::CreateModuleAad(encryptor->file_aad(), encryption::kBloomFilterHeader,
                                row_group_ordinal, column_ordinal, -1));
ThriftSerializer serializer;
serializer.Serialize(&header, sink, encryptor);
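For context, the bitset that follows the header is its own encrypted module. A minimal sketch of the counterpart step, assuming the kBloomFilterBitset module constant from the encryption internals (buffer-handling details elided):

// Sketch only: after serializing the header, re-key the AAD for the bitset
// module so the two modules authenticate under distinct AADs.
encryptor->UpdateAad(
    encryption::CreateModuleAad(encryptor->file_aad(), encryption::kBloomFilterBitset,
                                row_group_ordinal, column_ordinal, -1));
// ... then encrypt the raw bitset bytes (num_bytes_ of them) with `encryptor`
// and write the ciphertext to `sink`; the exact Encryptor buffer API is
// elided here.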
Most of this is identical to BlockSplitBloomFilter::WriteTo; can you factor it out into a common subroutine?
Perhaps something like:
struct EncryptionParams {
  Encryptor* encryptor;
  int16_t row_group_ordinal;
  int16_t column_ordinal;
};

void WriteInternal(ArrowOutputStream*, std::optional<EncryptionParams>) const;
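To make that concrete, a sketch of how both entry points could delegate to the shared subroutine (everything beyond EncryptionParams and WriteInternal is illustrative):

void BlockSplitBloomFilter::WriteTo(ArrowOutputStream* sink) const {
  // Unencrypted path: no encryptor or AAD bookkeeping needed.
  WriteInternal(sink, std::nullopt);
}

void BlockSplitBloomFilter::WriteEncrypted(ArrowOutputStream* sink,
                                           Encryptor* encryptor,
                                           int16_t row_group_ordinal,
                                           int16_t column_ordinal) const {
  // Encrypted path: the ordinals let WriteInternal build the per-module
  // AADs for the bloom filter header and bitset.
  WriteInternal(sink, EncryptionParams{encryptor, row_group_ordinal, column_ordinal});
}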
PARQUET_ASSIGN_OR_THROW(auto sink, ::arrow::io::BufferOutputStream::Create());
auto file_writer = ParquetFileWriter::Open(sink, schema, writer_properties);
auto* row_group_writer = file_writer->AppendRowGroup();
Ideally we would write at least two row groups to check that the ordinal is passed properly.
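Something along these lines, with the data-writing step elided and the helper names taken from the diff above:

// Sketch: write two row groups so both row_group_ordinal 0 and 1 flow into
// the bloom filter AADs, then verify each filter reads back correctly.
PARQUET_ASSIGN_OR_THROW(auto sink, ::arrow::io::BufferOutputStream::Create());
auto file_writer = ParquetFileWriter::Open(sink, schema, writer_properties);
for (int rg = 0; rg < 2; ++rg) {
  auto* row_group_writer = file_writer->AppendRowGroup();
  // ... write column values that populate the bloom filter ...
  row_group_writer->Close();
}
file_writer->Close();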
const auto& column_props = properties_->column_encryption_properties(column_path);
if (column_props != nullptr && column_props->is_encrypted() &&
    !column_props->is_encrypted_with_footer_key()) {
  ParquetException::NYI("Bloom filter writing with a dedicated column key");
I don't understand why this isn't implemented. You're using the column encryptor already, no?
The Parquet encryption spec says this:
For encrypted columns, the following modules are always encrypted, with the same column key: pages and page headers (both dictionary and data), column indexes, offset indexes, bloom filter headers and bitsets.
Same question here: why is only the footer key supported?
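If the spec is followed, the fix would presumably be to pick the column's own meta encryptor rather than raising NYI. A rough sketch, assuming the GetColumnMetaEncryptor / GetFooterEncryptor accessors from internal_file_encryptor.h and a file_encryptor in scope (unverified against this branch):

// Rough sketch: use the column key when the column has a dedicated one,
// otherwise fall back to the footer key.
std::shared_ptr<Encryptor> encryptor =
    (column_props != nullptr && column_props->is_encrypted() &&
     !column_props->is_encrypted_with_footer_key())
        ? file_encryptor->GetColumnMetaEncryptor(column_path)
        : file_encryptor->GetFooterEncryptor();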
  throw ParquetException(
      "Encrypted files cannot contain more than 32767 columns");
}
auto* block_filter = dynamic_cast<BlockSplitBloomFilter*>(filter.get());
The dynamic_cast should not be required; just call WriteEncrypted on the base BloomFilter class.
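A minimal sketch of that shape, assuming WriteEncrypted becomes part of the BloomFilter interface (signature taken from the call site below; everything else is hypothetical):

class BloomFilter {
 public:
  virtual ~BloomFilter() = default;
  virtual void WriteTo(ArrowOutputStream* sink) const = 0;
  // Hypothetical virtual added alongside WriteTo; BlockSplitBloomFilter
  // overrides it, so callers never need a downcast.
  virtual void WriteEncrypted(ArrowOutputStream* sink, Encryptor* encryptor,
                              int16_t row_group_ordinal,
                              int16_t column_ordinal) const = 0;
};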
  throw ParquetException(
      "Only BlockSplitBloomFilter is supported for encrypted bloom filters");
}
block_filter->WriteEncrypted(sink, meta_encryptor.get(), static_cast<int16_t>(i),
Column metadata is encrypted during column close (metadata.cc:1765), but bloom filter offsets are set later (metadata.cc:2066). For column-key metadata, readers use the decrypted encrypted_column_metadata, so they would miss the bloom filter offset/length unless the metadata encryption order is fixed.
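To spell out the sequencing hazard (step names hypothetical):

// Today (broken for column keys):
//   1. column writer close   -> ColumnMetaData serialized and encrypted
//   2. bloom filters written -> bloom_filter_offset/length set on metadata
// The encrypted_column_metadata snapshot from step 1 never sees step 2's
// fields, so a reader decrypting it finds no bloom filter offset.
//
// Required order:
//   1. write bloom filters and record their offsets/lengths
//   2. serialize and encrypt the column metadata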
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?