fix: I16/U8 sort with nulls + CSV narrow-int explicit schema#197
Open
ser-vasilich wants to merge 2 commits intomasterfrom
Open
fix: I16/U8 sort with nulls + CSV narrow-int explicit schema#197ser-vasilich wants to merge 2 commits intomasterfrom
ser-vasilich wants to merge 2 commits intomasterfrom
Conversation
radix_encode_fn for the narrow-int single-key path read raw column bytes for null rows and folded them into the sort order — null rows landed wherever their underlying byte data happened to encode, ignoring the requested NULLS FIRST/LAST boundary. I16 and BOOL/U8 now consult the column's null bitmap and use a +1-shifted encoding when has_nulls is true, with the null sentinel parked outside the data range so it cannot tie with INT16_MIN/MAX or 0xFF/0. sort_indices_ex (and the matching window.c dispatch) bumps key_nbytes_max by one to cover the extra byte. The decode-shortcut in ray_sort, exec_sort, and sort_table_by_keys is gated off for the shifted encoding (radix_decode_into doesn't invert the shift) — we fall back to plain gather, which is correct. Tests cover ASC/DESC × nulls-first/last for I16 with negatives, U8 with high-byte data, and BOOL with mixed false/true so the encoded null sentinel cannot be aliased by ordinary data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The resolved_type → parse_type map had no cases for RAY_U8, RAY_I16, RAY_I32 — they fell through the default branch to CSV_TYPE_STR. Then csv_intern_strings wrote 4-byte sym IDs into a column buffer sized for 1/2/4 bytes per row: 4× heap overflow for U8 (eventual SIGSEGV in later allocators), 2× for I16, and (most insidiously) size-matching but value-wrong for I32 — the column held sym IDs instead of the integers from the CSV. Added CSV_TYPE_U8/I16/I32 enum values, mapped them in parse_types, and wired matching cases into csv_parse_fn / csv_parse_serial (both the main switch and the past-row-boundary fill). Each case parses via fast_i64 and narrows at write time, mirroring how `as 'U8` truncates. Inference (promote_csv_type) is unchanged — the new narrow-int parse types are reachable only via explicit schema. Tests cover U8 explicit schema (10k rows → parallel parse path, followed by xasc to exercise the heap), I16 with empty-field nulls, I32 with negative values, and a small U8/U8 case with truncated rows that hits the serial parser's row-boundary fill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related correctness bugs in the narrow-int (U8 / I16) data path. Each fix lives in its own commit so bisect can isolate them.
Bug 1 — radix sort drops null bitmap for I16/BOOL/U8
radix_encode_fnfor the narrow-int single-key path read raw column bytes for null rows, ignoring the column's null bitmap.xasc/xdesc,exec_sort, top-K, and window-function ORDER BY all placed null rows wherever the underlying byte data happened to encode, instead of at the requested NULLS FIRST/LAST boundary.Fix: I16 and BOOL/U8 single-key encode now consults the null bitmap and uses a +1-shifted encoding when
has_nullsis true, with the null sentinel parked outside the data range so it cannot tie with INT16_MIN/MAX or 0xFF/0.sort_indices_exand the matchingwindow.cdispatch bumpkey_nbytes_maxby one to cover the extra byte. The decode-shortcut inray_sort/exec_sort/sort_table_by_keysis gated off for the shifted encoding (radix_decode_intodoesn't invert the shift) — falls back to plain gather, which is correct.Bug 2 — CSV explicit
[U8]/[I16]/[I32]schema corrupts dataThe
resolved_type→parse_typemap had no cases forRAY_U8,RAY_I16,RAY_I32— they all fell throughdefaulttoCSV_TYPE_STR.csv_intern_stringsthen wrote 4-byte sym IDs into a column buffer sized for 1/2/4 bytes per row:[U8]ray_alloc[I16][I32]User-visible repro on master:
Fix: added
CSV_TYPE_U8/CSV_TYPE_I16/CSV_TYPE_I32enum values, mapped them inparse_types, and wired matching cases intocsv_parse_fnandcsv_parse_serial(both the main switch and the past-row-boundary fill). Each case parses viafast_i64and narrows at write time, mirroring howas 'U8truncates. Inference (promote_csv_type) is unchanged — the new narrow-int parse types are reachable only via explicit schema.Test plan
make test— 2349 of 2351 passed (1 skipped, 1 failed)sym/like_fn/sym_vec_null_sym) is a pre-existing regression fromd5c2caca(ClickBench LIKE-on-dict-SYM rewrite) — reproduces on clean master without this branch.eac359adcleanly — no conflicts with PR R6 csv empty sym #194 (the two PRs touch different functions incsv.c; R6 csv empty sym #194 modifiescsv_intern_stringsfor empty SYM cells, this PR adds U8/I16/I32 cases to the parse-type map and parser switches).Out of scope (follow-ups)
query.c:5004,5331is_numeric_promoaccepts onlyI64/I32/F64for column-update operations — narrow-int columns from this CSV fix can't be updated with wider-type expressions without an explicitas. UX gap, not corruption; worth a separate ticket.🤖 Generated with Claude Code