Skip to content

Add zstd windowLog override to WriterProperties#9367

Open
amirshukayev wants to merge 3 commits intoapache:mainfrom
amirshukayev:zstd-window-size
Open

Add zstd windowLog override to WriterProperties#9367
amirshukayev wants to merge 3 commits intoapache:mainfrom
amirshukayev:zstd-window-size

Conversation

@amirshukayev
Copy link

@amirshukayev amirshukayev commented Feb 6, 2026

Which issue does this PR close?

Rationale for this change

Tuning window size with zstd is important for wider columns such as html strings. In my personal uses I saw a 32% size decrease with small performance penalty with level = 9 and windowLog = 27 (128 MB windows up from 8 MB default for level 9).

What changes are included in this PR?

We add a zstd_window_log_override: Option<u32> param to CodecOptions to just allow the value to be set through to zstd-safe. Just plumbing.

Are these changes tested?

Pass all the cargo tests, one unit test provided, just to show a usage example/guard the behaviour, doesn't necessarily test the behaviour of the underlying zstd library.

Are there any user-facing changes?

No backwards incompatible changes, but this change adds a new param to CodecOptionsBuilder, and makes changes to cargo docs.

notes

I think taking the approach of making this an override makes the most sense, for the following reasons:

  1. The term override make sense, zstd levels typically set the windowLog, and then you can manually set it, so fundamentally they are overrides.
  2. This makes it a natural fit in CodecOptions which also includes an encoder specific flag in backward_compatible_lz4.
  3. We had to put the override flag in WriterProperties, which I'm not sure is entirely natural, but I couldn't think of another way to do this. We could guard this with the zstd flag but I'm not sure if that is consistent with the other zstd code.

Please let me know if you think of a better way to do this.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 6, 2026
@etseidl
Copy link
Contributor

etseidl commented Feb 6, 2026

Thanks @amirshukayev, this looks interesting. Since the compression codec itself is already on ColumnProperties rather than WriterProperties, I can't help wondering if the override should be as well. I can see only wanting to override a single column and leave the others with the default.

I also wonder if we should instead modify basic::Compression enum and add the override to the ZSTD variant. That would be more of a breaking change, but might be more natural 🤷

@amirshukayev
Copy link
Author

amirshukayev commented Feb 6, 2026

Thanks @amirshukayev, this looks interesting. Since the compression codec itself is already on ColumnProperties rather than WriterProperties, I can't help wondering if the override should be as well. I can see only wanting to override a single column and leave the others with the default.

Great point! I've made the update to allow it to be set on ColumnProperties.

I also wonder if we should instead modify basic::Compression enum and add the override to the ZSTD variant. That would be more of a breaking change, but might be more natural 🤷

I think this makes sense, this the approach I started with. Whats your take here? would it be worth breaking things? I'd be happy to implement it like that. This is the approach I used before, which I assume is what you mean?

// Before:                                                                                  
pub struct ZstdLevel(i32);                                                                          
                                                                                                    
// After (vendored):                                                   
pub struct ZstdLevel {                                                                              
    level: i32,                                                                                     
    window_log_override: Option<u32>,
}

@etseidl
Copy link
Contributor

etseidl commented Feb 6, 2026

Thanks @amirshukayev, this looks interesting. Since the compression codec itself is already on ColumnProperties rather than WriterProperties, I can't help wondering if the override should be as well. I can see only wanting to override a single column and leave the others with the default.

Great point! I've made the update to allow it to be set on ColumnProperties.

❤️

I also wonder if we should instead modify basic::Compression enum and add the override to the ZSTD variant. That would be more of a breaking change, but might be more natural 🤷

I think this makes sense, this the approach I started with. Whats your take here? would it be worth breaking things? I'd be happy to implement it like that. This is the approach I used before, which I assume is what you mean?

// Before:                                                                                  
pub struct ZstdLevel(i32);                                                                          
                                                                                                    
// After (vendored):                                                   
pub struct ZstdLevel {                                                                              
    level: i32,                                                                                     
    window_log_override: Option<u32>,
}

Yes, this is what I was thinking. If we did this, then users could simply use set_compression, and then codec specific options don't creep into the options. I personally think that's worth a breaking change (and the window is currently open for those). I'm not sure how big the ripple effect is, though, if we change the enum.

I wonder if others could opine here before starting down that road. CC @alamb @jhorstmann @Dandandan @Jefffrey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants