Skip to content

Conversation

@rambleraptor
Copy link
Contributor

@rambleraptor rambleraptor commented Feb 5, 2026

Which issue does this PR close?

Rationale for this change

We've been doing some internal work with Parquet and found a bunch of places where error messages could be improved or panics could be avoided.

I can split this up into multiple PRs if need be. I know it's an assortment of things.

What changes are included in this PR?

Are these changes tested?

Tests should continue to pass.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 5, 2026

impl<T: DataType> Decoder<T> for DictDecoder<T> {
fn set_data(&mut self, data: Bytes, num_values: usize) -> Result<()> {
if data.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a repetition of the check directly below?

let dict_idx = self.current_value.unwrap() as usize;
let dict_value = dict[dict_idx].clone();

let dict_value = dict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For large rle runs or small buffers it might be more efficient to check this inside reload, but the difference in practice is probably minimal. Checking it here is also consistent with the bitpacked runs.

.iter_mut()
.zip(index_buf[..num_values].iter())
.for_each(|(b, i)| b.clone_from(&dict[*i as usize]));
for i in 0..num_values {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the zipped iterator as before avoids the bounds checks on index_buf and buffer, that might make a measurable difference.


/// Returns the offset and length in bytes of the column chunk within the file
pub fn byte_range(&self) -> (u64, u64) {
pub fn byte_range(&self) -> Result<(u64, u64)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most other accessors for this struct return signed values and validation seems to be left to the caller. I think it would be more consistent if byte_range also returned a (i64, i64). Even an Ok result here still needs to be validated to be actually in bounds for the given file.

));
}
} else if !is_root_node {
return Err(general_err!("Repetition level must be defined for non-root types"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this error relates to the precision of decimal types

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants