feat: [WIP] Adds FSST encoding (not ready for review) #10153

Draft
devanbenz wants to merge 3 commits into apache:main from devanbenz:db/fsst-encoding-poc

Conversation

@devanbenz

@devanbenz devanbenz commented Jun 17, 2026


This commit is a proof of concept that adds FSST encoding to
parquet within arrow-rs. This PR is heavily inspired by, and pulls in
a number of methods from, #9372

Parquet FSST proposal: apache/parquet-format#531

Part of: #8749

I've listed a few TO-DOs in the PR as comments. I also need to wire up the arrow-reader and writer, and check performance.

FSST works well with pseudo-random textual data such as UUIDs, URLs, transaction IDs, and logs.

I'll need to generate a test set of random UUIDs and the like to check out the perf 😎

This commit is a proof of concept to add FSST encoding to
parquet within arrow-rs. This PR is heavily inspired and pulls in
a bunch of methods used in: apache#9372
Apply patterns from the ALP encoder work (apache#9372 follow-up)
- Reject pages with more than i32::MAX values in flush_buffer
- Add pre-size for flush_buffer
- Replace magic length prefix with a constant
@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 17, 2026
/// Decoder for the [`FSST`](Encoding::FSST) encoding.
///
/// See [`FsstEncoder`](crate::encodings::encoding::fsst_encoder::FsstEncoder)
/// for the page layout. Only [`Type::BYTE_ARRAY`] is supported.

Author


I need to add FIXED_LEN_BYTE_ARRAY support too per the spec proposal.

Comment thread parquet/src/basic.rs
/// Frequently occurring substrings (up to 8 bytes) are replaced with
/// single-byte codes drawn from a per-page symbol table, enabling random
/// access to individual compressed values. Applies to BYTE_ARRAY data.
FSST = 10;

Author


This will probably need to be FSST = 11 once ALP lands.
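For reference, here is a minimal sketch of what the doc comment above describes: each code byte in a compressed value indexes a symbol table and expands to a substring of up to 8 bytes. The function name `fsst_decode`, the `&[Vec<u8>]` table representation, and the escape code value of 255 are assumptions for illustration (255 is what the original FSST design uses; the Parquet proposal may define this differently), not the code in this PR.

```rust
/// Assumed escape code: the byte that follows it is copied through verbatim.
/// The Parquet FSST proposal may use a different convention.
const ESCAPE_CODE: u8 = 255;

/// Decompress a single value (hypothetical helper, not the PR's decoder).
///
/// `symbols[code]` holds the expansion for `code`. Returns `None` if the
/// input references an unknown code or ends right after an escape byte.
fn fsst_decode(symbols: &[Vec<u8>], compressed: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(compressed.len() * 2);
    let mut bytes = compressed.iter().copied();
    while let Some(code) = bytes.next() {
        if code == ESCAPE_CODE {
            // Literal byte: no symbol covered it during compression.
            out.push(bytes.next()?);
        } else {
            // Regular code: append the symbol it stands for.
            out.extend_from_slice(symbols.get(usize::from(code))?);
        }
    }
    Some(out)
}
```

Because every value is decoded only from its own bytes plus the shared table, one value can be decompressed without touching its neighbours, which is the random-access property the doc comment mentions.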

}

/// Append `value` to `out` as unsigned LEB128.
pub(crate) fn write_uleb128(out: &mut Vec<u8>, mut value: u64) {

Author


LEB128 is likely used elsewhere in arrow... maybe this already is, or can become, a shared utility.
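For context, a minimal sketch of what such a shared utility could look like. Unsigned LEB128 stores 7 payload bits per byte, least significant group first, and sets the high bit on every byte except the last. `write_uleb128` mirrors the signature in the diff; `read_uleb128` is a hypothetical counterpart added here only for illustration, and neither body is taken from this PR.

```rust
/// Append `value` to `out` as unsigned LEB128.
fn write_uleb128(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last group: high bit clear marks the end of the value.
            out.push(byte);
            return;
        }
        // More groups follow: set the continuation bit.
        out.push(byte | 0x80);
    }
}

/// Hypothetical counterpart: decode one value from the front of `input`.
///
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or does not fit in a `u64`.
fn read_uleb128(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None;
        }
        let payload = u64::from(byte & 0x7f);
        // At shift 63 only a single payload bit still fits in a u64.
        if shift == 63 && payload > 1 {
            return None;
        }
        value |= payload << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

Returning the consumed byte count lets a caller advance through a buffer that holds several length-prefixed fields back to back.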

let mut histogram = [0u8; 8];
for symbol in &self.symbols {
    debug_assert!((1..=FSST_MAX_SYMBOL_LEN).contains(&symbol.len()));
    histogram[symbol.len() - 1] += 1;

Author


The spec indicates that the symbol table starts with 2-byte symbols:

Symbol Data : The symbol bytes, packed by length. Symbols are stored in length order (2 byte symbols first, then 3 bytes etc)

I need to adjust this, along with the other areas of the code that start with 1-byte symbols in the symbol table.
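A rough sketch of the adjustment, under stated assumptions: the quoted spec text gives the order for 2-byte symbols and up but does not say where 1-byte symbols go, so the length order is passed in as a parameter rather than hard-coded. `length_histogram` and `pack_by_length` are hypothetical names, and the `&[Vec<u8>]` symbol representation is assumed rather than taken from this PR.

```rust
const FSST_MAX_SYMBOL_LEN: usize = 8;

/// Count symbols per length: `histogram[len - 1]` is the number of
/// `len`-byte symbols. Returns `None` for an out-of-range symbol length
/// or if a counter would overflow `u8`.
fn length_histogram(symbols: &[Vec<u8>]) -> Option<[u8; FSST_MAX_SYMBOL_LEN]> {
    let mut histogram = [0u8; FSST_MAX_SYMBOL_LEN];
    for symbol in symbols {
        if !(1..=FSST_MAX_SYMBOL_LEN).contains(&symbol.len()) {
            return None;
        }
        let slot = &mut histogram[symbol.len() - 1];
        *slot = slot.checked_add(1)?;
    }
    Some(histogram)
}

/// Concatenate symbol bytes grouped by length, visiting lengths in `order`.
/// Symbols whose length is not listed in `order` are left out.
fn pack_by_length(symbols: &[Vec<u8>], order: &[usize]) -> Vec<u8> {
    let mut packed = Vec::new();
    for &len in order {
        for symbol in symbols.iter().filter(|s| s.len() == len) {
            packed.extend_from_slice(symbol);
        }
    }
    packed
}
```

Calling `pack_by_length(&symbols, &[2, 3, 4, 5, 6, 7, 8])` follows the order quoted above. Whether 1-byte symbols are appended afterwards, stored separately, or disallowed needs to be confirmed against the proposal before settling on the final order.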

@devanbenz devanbenz changed the title feat: [WIP] Adds FSST encoding feat: [WIP] Adds FSST encoding (not ready for review) Jun 17, 2026
