fix: deduplicate dictionary values when concatenating or interleaving (#10160) #10165
Open
sandy-sachin7 wants to merge 1 commit into
…apache#10160) When concatenating or interleaving dictionary arrays with different backing arrays, dictionary values must be merged (deduplicated) instead of naively concatenated. The old heuristic only merged when total_values >= total_entries, which allowed duplicate entries to slip through — causing issues for downstream consumers like pandas that enforce unique dictionary categories.
Which issue does this PR close?
Closes #10160.
Rationale for this change
When concatenating (or interleaving) dictionary arrays with different backing arrays, the dictionary values were naively concatenated — potentially producing duplicate entries. Downstream consumers like pandas reject this because they require unique dictionary categories.
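To make the problem concrete, here is a minimal sketch of naive dictionary concatenation. It uses a simplified stand-in `Dict` struct (a hypothetical type for illustration, not the arrow-rs `DictionaryArray`) and hypothetical helper names `naive_concat` and `has_duplicates`.

```rust
use std::collections::HashSet;

// Simplified stand-in for a dictionary-encoded string array:
// `keys[i]` indexes into `values`. This is NOT the arrow-rs type.
struct Dict {
    keys: Vec<usize>,
    values: Vec<&'static str>,
}

// Naive concatenation: append the second values array unchanged and
// shift the second array's keys by the length of the first values array.
fn naive_concat(a: &Dict, b: &Dict) -> Dict {
    let offset = a.values.len();
    let mut keys = a.keys.clone();
    keys.extend(b.keys.iter().map(|k| k + offset));
    let mut values = a.values.clone();
    values.extend(b.values.iter().copied());
    Dict { keys, values }
}

// Returns true if any dictionary value appears more than once.
fn has_duplicates(values: &[&str]) -> bool {
    let unique: HashSet<&str> = values.iter().copied().collect();
    unique.len() != values.len()
}

fn main() {
    let a = Dict { keys: vec![0, 1, 0], values: vec!["a", "b"] };
    let b = Dict { keys: vec![0, 1], values: vec!["b", "c"] };
    let out = naive_concat(&a, &b);

    // "b" is present in both inputs, so it ends up in the output twice.
    assert_eq!(out.values, vec!["a", "b", "b", "c"]);
    assert_eq!(out.keys, vec![0, 1, 0, 2, 3]);
    assert!(has_duplicates(&out.values));
}
```

The keys still decode to the right strings, but the values array is no longer a set of unique categories, which is the property consumers such as pandas check for.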
The old heuristic in `should_merge_dictionary_values` only triggered dictionary merging when `total_values >= total_entries`, which missed cases where small dictionaries with overlapping values were concatenated.

What changes are included in this PR?
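The change to the merge gate can be sketched roughly as follows. This is a simplified illustration based only on the description in this PR: the function names `should_merge_old` and `should_merge_new` and their flattened signatures are hypothetical, and any other conditions the real `should_merge_dictionary_values` may check are left out.

```rust
// Old gate, as described in this PR: merge only when the inputs have
// different backing arrays AND the combined number of dictionary values
// is at least the combined number of entries.
fn should_merge_old(single_dictionary: bool, total_values: usize, total_entries: usize) -> bool {
    let values_exceed_length = total_values >= total_entries;
    !single_dictionary && values_exceed_length
}

// New gate, as described in this PR: merge whenever the inputs have
// different backing arrays, regardless of sizes.
fn should_merge_new(single_dictionary: bool) -> bool {
    !single_dictionary
}

fn main() {
    // Two small dictionaries (2 values each) behind 5 entries each:
    // total_values = 4, total_entries = 10.
    assert!(!should_merge_old(false, 4, 10)); // old gate skips merging
    assert!(should_merge_new(false)); // new gate merges

    // Inputs that share one backing array are not merged by either gate.
    assert!(!should_merge_old(true, 4, 10));
    assert!(!should_merge_new(true));
}
```

The first case is the one the rationale calls out: small dictionaries with many entries fell below the old threshold, so overlapping values were concatenated without deduplication.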
- `arrow-select/src/dictionary.rs`: Changed `should_merge_dictionary_values` to always return `true` for merging when dictionaries have different backing arrays (`!single_dictionary`). Removed the `values_exceed_length` heuristic that previously gated merging. Removed the now-unused `len` parameter.
- `arrow-select/src/concat.rs`: Updated `concat_dictionaries` to pass the new signature. Updated `test_string_dictionary_array` to expect 6 merged unique values instead of 7 naively concatenated values. Added `concat_dictionary_batches_deduplicates_values` test reproducing the exact issue scenario.
- `arrow-select/src/interleave.rs`: Updated `interleave_dictionaries` to pass the new signature. Updated `test_interleave_dictionary` to expect 3 merged unique values instead of 5.

Are these changes tested?
Yes — all 379 existing tests pass, plus the new reproducing test.
Are there any user-facing changes?
Dictionary arrays produced by `concat`, `concat_batches`, and `interleave` will now always have deduplicated dictionary values when the input arrays have different backing dictionaries. This may reduce the size of the resulting dictionary values array, but the logical data (key → value mappings) remains identical.
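The claim that merging changes only the physical layout, not the logical data, can be checked with a small sketch. The `Dict` struct and the `dict`, `decode`, and `concat_merged` names below are hypothetical stand-ins for illustration. They are not the arrow-rs API, and the merge strategy shown (a hash map from value to new key) is just one straightforward way to deduplicate, not necessarily how arrow-rs does it.

```rust
use std::collections::HashMap;

// Simplified stand-in for a dictionary-encoded string array (not arrow-rs).
struct Dict {
    keys: Vec<usize>,
    values: Vec<String>,
}

// Convenience constructor for the examples below.
fn dict(keys: &[usize], values: &[&str]) -> Dict {
    Dict {
        keys: keys.to_vec(),
        values: values.iter().map(|v| v.to_string()).collect(),
    }
}

impl Dict {
    // Resolve every key to the string it refers to (the logical data).
    fn decode(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

// Concatenate while merging dictionaries: each distinct value is stored
// once, and every input key is remapped to its key in the merged values.
fn concat_merged(inputs: &[Dict]) -> Dict {
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<usize> = Vec::new();
    for input in inputs {
        // remap[old_key] is the key of the same value in the merged dictionary.
        let remap: Vec<usize> = input
            .values
            .iter()
            .map(|v| {
                *index.entry(v.as_str()).or_insert_with(|| {
                    values.push(v.clone());
                    values.len() - 1
                })
            })
            .collect();
        keys.extend(input.keys.iter().map(|&k| remap[k]));
    }
    Dict { keys, values }
}

fn main() {
    let a = dict(&[0, 1, 0], &["a", "b"]);
    let b = dict(&[0, 1], &["b", "c"]);

    // Logical data before concatenation.
    let mut expected = a.decode();
    expected.extend(b.decode());

    let merged = concat_merged(&[a, b]);

    // Three unique values instead of four, with identical decoded data.
    assert_eq!(merged.values, vec!["a", "b", "c"]);
    assert_eq!(merged.keys, vec![0, 1, 0, 1, 2]);
    assert_eq!(merged.decode(), expected);
}
```

Only the keys and the values array differ from a naive concatenation. Anything that reads the array through its keys sees the same sequence of strings.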