feat: [WIP] Adds FSST encoding (not ready for review) #10153
Draft
devanbenz wants to merge 3 commits into
This commit is a proof of concept that adds FSST encoding to parquet within arrow-rs. It is heavily inspired by, and pulls in a number of methods from, apache#9372.
Apply patterns from the ALP encoder work (apache#9372 follow-up):
- Reject pages with more than i32::MAX values in flush_buffer
- Add pre-sizing for flush_buffer
- Replace the magic length prefix with a constant
devanbenz commented Jun 17, 2026
/// Decoder for the [`FSST`](Encoding::FSST) encoding.
///
/// See [`FsstEncoder`](crate::encodings::encoding::fsst_encoder::FsstEncoder)
/// for the page layout. Only [`Type::BYTE_ARRAY`] is supported.
Author
I need to add FIXED_LEN_BYTE_ARRAY support too per the spec proposal.
devanbenz commented Jun 17, 2026
/// Frequently occurring substrings (up to 8 bytes) are replaced with
/// single-byte codes drawn from a per-page symbol table, enabling random
/// access to individual compressed values. Applies to BYTE_ARRAY data.
FSST = 10;
Author
This will probably need to be FSST = 11 once ALP lands.
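To make the doc comment above concrete, here is a minimal decoding sketch for this kind of scheme. It is not the code in this PR: the `fsst_decode` name, the in-memory `Vec<Vec<u8>>` symbol table, and the use of code 255 as an escape for literal bytes are assumptions for illustration (the escape convention follows the original FSST design, and the Parquet proposal may differ).

```rust
/// Assumed escape code: the next input byte is copied through as a literal.
const ESCAPE: u8 = 255;

/// Decode one FSST-compressed value. Code `i` expands to `symbols[i]`.
/// Returns `None` for an unknown code or an escape with no byte after it.
fn fsst_decode(symbols: &[Vec<u8>], compressed: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(compressed.len() * 2);
    let mut codes = compressed.iter().copied();
    while let Some(code) = codes.next() {
        if code == ESCAPE {
            // Literal byte: no symbol covered this input byte.
            out.push(codes.next()?);
        } else {
            // Regular code: expand to the symbol's bytes (1 to 8 of them).
            out.extend_from_slice(symbols.get(code as usize)?);
        }
    }
    Some(out)
}
```

Because each value is decoded independently against the per-page table, a reader can decompress a single value without touching its neighbours, which is the random-access property the comment mentions.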
devanbenz commented Jun 17, 2026
}

/// Append `value` to `out` as unsigned LEB128.
pub(crate) fn write_uleb128(out: &mut Vec<u8>, mut value: u64) {
Author
LEB128 is likely used elsewhere in arrow. Maybe this already is, or can become, a shared utility.
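For reference, a shared helper could look roughly like the sketch below. The body of `write_uleb128` is not shown in this hunk, so this is a standard unsigned LEB128 writer written from the format definition rather than the PR's code. `read_uleb128` is a hypothetical counterpart added for illustration.

```rust
/// Append `value` to `out` as unsigned LEB128: 7 payload bits per byte,
/// least significant group first, high bit set on every byte but the last.
fn write_uleb128(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Hypothetical counterpart: read one unsigned LEB128 value from the front
/// of `input`. Returns the value and the number of bytes consumed, or
/// `None` if the input is truncated or does not fit in a `u64`.
fn read_uleb128(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None;
        }
        let payload = (byte & 0x7f) as u64;
        // The tenth byte may only contribute the single remaining bit.
        if shift == 63 && payload > 1 {
            return None;
        }
        value |= payload << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```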
devanbenz commented Jun 17, 2026
let mut histogram = [0u8; 8];
for symbol in &self.symbols {
    debug_assert!((1..=FSST_MAX_SYMBOL_LEN).contains(&symbol.len()));
    histogram[symbol.len() - 1] += 1;
Author
The spec indicates that the symbol table starts with 2-byte symbols:
Symbol Data : The symbol bytes, packed by length. Symbols are stored in length order (2 byte symbols first, then 3 bytes etc)
Need to adjust this, along with the other areas of the code where we start with the first byte in the symbol table.
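One possible shape for that adjustment is sketched below. The quoted spec text only says that 2-byte symbols come first, so placing 1-byte symbols last (as the reference FSST serialization does) is an assumption here, and `length_rank` and `pack_symbols` are hypothetical names rather than functions from this PR. `FSST_MAX_SYMBOL_LEN` is taken to be 8, matching the "up to 8 bytes" limit.

```rust
const FSST_MAX_SYMBOL_LEN: usize = 8;

/// Position of a symbol length in the packed order: lengths 2..=8 map to
/// ranks 0..=6, and length 1 maps to rank 7 (assumed to go last).
fn length_rank(len: usize) -> usize {
    debug_assert!((1..=FSST_MAX_SYMBOL_LEN).contains(&len));
    if len == 1 { FSST_MAX_SYMBOL_LEN - 1 } else { len - 2 }
}

/// Build the per-length histogram (indexed by rank) and the symbol bytes
/// packed in rank order. Assumes at most 255 symbols, so `u8` counts fit.
fn pack_symbols(symbols: &[Vec<u8>]) -> ([u8; FSST_MAX_SYMBOL_LEN], Vec<u8>) {
    let mut ordered: Vec<&Vec<u8>> = symbols.iter().collect();
    // Stable sort: symbols of equal length keep their relative order.
    ordered.sort_by_key(|s| length_rank(s.len()));

    let mut histogram = [0u8; FSST_MAX_SYMBOL_LEN];
    let mut data = Vec::new();
    for symbol in ordered {
        histogram[length_rank(symbol.len())] += 1;
        data.extend_from_slice(symbol);
    }
    (histogram, data)
}
```

Note that reordering symbols changes which code maps to which symbol, so the encoder would need to assign codes after this ordering is fixed.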
This commit is a proof of concept that adds FSST encoding to
parquet within arrow-rs. It is heavily inspired by, and pulls in
a number of methods from, #9372.
Parquet FSST proposal: apache/parquet-format#531
Part of: #8749
I've listed a few TO-DOs in the PR as comments. I also need to wire up the arrow-reader and writer, and check performance.
I'll need to generate a test set of random UUIDs and the like to check out the perf 😎