fix: deduplicate dictionary values when concatenating or interleaving (#10160) #10165

Open
sandy-sachin7 wants to merge 1 commit into apache:main from sandy-sachin7:fix/concat-dictionary-dedup

@sandy-sachin7

Which issue does this PR close?

Closes #10160.

Rationale for this change

When concatenating (or interleaving) dictionary arrays with different backing arrays, the dictionary values were naively concatenated — potentially producing duplicate entries. Downstream consumers like pandas reject this because they require unique dictionary categories.

The old heuristic in should_merge_dictionary_values only triggered dictionary merging when total_values >= total_entries, which missed cases where small dictionaries with overlapping values were concatenated.
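
To make the gap concrete, here is a minimal sketch of the two decision rules as described above. It is not the code in arrow-select: `DictInfo`, `backing_id`, `should_merge_old`, and `should_merge_new` are hypothetical names, and the sketch models only the two conditions mentioned in this description (distinct backing arrays and the `total_values >= total_entries` check).

```rust
/// Hypothetical stand-in for a dictionary array, carrying only the facts
/// the heuristic looks at. Not an arrow-rs type.
struct DictInfo {
    /// Identifies the backing values array (equal ids = same backing array).
    backing_id: usize,
    /// Number of entries in the dictionary values array.
    values_len: usize,
}

/// True when every input shares one backing values array.
fn single_dictionary(dicts: &[DictInfo]) -> bool {
    dicts.windows(2).all(|w| w[0].backing_id == w[1].backing_id)
}

/// Old rule as described above: merge only when the backing arrays differ
/// AND the summed dictionary sizes reach the total number of entries.
fn should_merge_old(dicts: &[DictInfo], total_entries: usize) -> bool {
    let total_values: usize = dicts.iter().map(|d| d.values_len).sum();
    !single_dictionary(dicts) && total_values >= total_entries
}

/// New rule as described above: merge whenever the backing arrays differ.
fn should_merge_new(dicts: &[DictInfo]) -> bool {
    !single_dictionary(dicts)
}
```

With two small dictionaries of 2 values each and 10 total entries, the old rule declines to merge (4 < 10) even if the dictionaries overlap, while the new rule merges.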

What changes are included in this PR?

  1. arrow-select/src/dictionary.rs: Changed should_merge_dictionary_values to return true whenever the input dictionaries have different backing arrays (!single_dictionary). Removed the values_exceed_length heuristic that previously gated merging, along with the now-unused len parameter.

  2. arrow-select/src/concat.rs: Updated concat_dictionaries to pass the new signature. Updated test_string_dictionary_array to expect 6 merged unique values instead of 7 naive concatenated values. Added concat_dictionary_batches_deduplicates_values test reproducing the exact issue scenario.

  3. arrow-select/src/interleave.rs: Updated interleave_dictionaries to pass the new signature. Updated test_interleave_dictionary to expect 3 merged unique values instead of 5.
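
For readers unfamiliar with what "merging" means here, the following is a simplified sketch of deduplicating dictionary values while remapping keys. It uses plain `Vec`s rather than arrow arrays; `Dict` and `concat_dedup` are hypothetical names, nulls are ignored, and the real merge logic in arrow-select may differ in detail.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a string dictionary array:
/// each key is an index into `values`. Not an arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct Dict {
    keys: Vec<usize>,
    values: Vec<String>,
}

/// Concatenate dictionaries, keeping each distinct value once and
/// rewriting keys so they point into the merged values.
fn concat_dedup(inputs: &[Dict]) -> Dict {
    let mut values: Vec<String> = Vec::new();
    let mut index_of: HashMap<String, usize> = HashMap::new();
    let mut keys: Vec<usize> = Vec::new();

    for dict in inputs {
        // For this input, map each old key to its key in the merged dictionary.
        let remap: Vec<usize> = dict
            .values
            .iter()
            .map(|v| {
                *index_of.entry(v.clone()).or_insert_with(|| {
                    values.push(v.clone());
                    values.len() - 1
                })
            })
            .collect();
        keys.extend(dict.keys.iter().map(|&k| remap[k]));
    }

    Dict { keys, values }
}
```

Concatenating `["a", "b"]` and `["b", "c"]` this way yields the three values `["a", "b", "c"]` instead of four, which is the kind of reduction the updated test expectations reflect.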

Are these changes tested?

Yes — all 379 existing tests pass, plus the new reproducing test.

Are there any user-facing changes?

Dictionary arrays produced by concat, concat_batches, and interleave will now always have deduplicated dictionary values when the input arrays have different backing dictionaries. This may reduce the size of the resulting dictionary values array, but the logical data (key → value mappings) remains identical.
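
The claim that the logical data is unchanged can be checked by decoding keys through the values array. The sketch below uses hypothetical helpers (`decode`, `has_duplicates`) over plain slices, not arrow-rs APIs, and assumes no null keys.

```rust
use std::collections::HashSet;

/// Resolve each key to the value it points at.
fn decode<'a>(keys: &[usize], values: &[&'a str]) -> Vec<&'a str> {
    keys.iter().map(|&k| values[k]).collect()
}

/// True if any value appears more than once in the dictionary.
fn has_duplicates(values: &[&str]) -> bool {
    let mut seen = HashSet::new();
    values.iter().any(|v| !seen.insert(*v))
}
```

For example, a naive concatenation with keys `[0, 1, 0, 3, 2]` over values `["a", "b", "b", "c"]` and a merged result with keys `[0, 1, 0, 2, 1]` over `["a", "b", "c"]` decode to the same sequence; only the first contains a duplicate dictionary entry.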

…apache#10160)

When concatenating or interleaving dictionary arrays with different
backing arrays, dictionary values must be merged (deduplicated) instead
of naively concatenated. The old heuristic only merged when
total_values >= total_entries, which allowed duplicate entries to slip
through — causing issues for downstream consumers like pandas that
enforce unique dictionary categories.

Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

batch concatenation leads to duplicate dictionary entries and not readable by pandas