Skip to content

fix: I16/U8 sort with nulls + CSV narrow-int explicit schema#197

Open
ser-vasilich wants to merge 2 commits intomasterfrom
fix/i16-u8-sort-and-csv-narrow-int
Open

fix: I16/U8 sort with nulls + CSV narrow-int explicit schema#197
ser-vasilich wants to merge 2 commits intomasterfrom
fix/i16-u8-sort-and-csv-narrow-int

Conversation

@ser-vasilich
Copy link
Copy Markdown
Collaborator

Summary

Two related correctness bugs in the narrow-int (U8 / I16) data path. Each fix lives in its own commit so bisect can isolate them.

Bug 1 — radix sort drops null bitmap for I16/BOOL/U8

radix_encode_fn for the narrow-int single-key path read raw column bytes for null rows, ignoring the column's null bitmap. xasc/xdesc, exec_sort, top-K, and window-function ORDER BY all placed null rows wherever the underlying byte data happened to encode, instead of at the requested NULLS FIRST/LAST boundary.

Fix: I16 and BOOL/U8 single-key encode now consults the null bitmap and uses a +1-shifted encoding when has_nulls is true, with the null sentinel parked outside the data range so it cannot tie with INT16_MIN/MAX or 0xFF/0. sort_indices_ex and the matching window.c dispatch bump key_nbytes_max by one to cover the extra byte. The decode-shortcut in ray_sort / exec_sort / sort_table_by_keys is gated off for the shifted encoding (radix_decode_into doesn't invert the shift) — falls back to plain gather, which is correct.

Bug 2 — CSV explicit [U8] / [I16] / [I32] schema corrupts data

The resolved_typeparse_type map had no cases for RAY_U8, RAY_I16, RAY_I32 — they all fell through default to CSV_TYPE_STR. csv_intern_strings then wrote 4-byte sym IDs into a column buffer sized for 1/2/4 bytes per row:

Schema Allocated Written Result
[U8] 1 B/row 4 B/row 4× heap overflow → SIGSEGV in next ray_alloc
[I16] 2 B/row 4 B/row 2× overflow → heap corruption
[I32] 4 B/row 4 B/row size matches; column silently holds sym IDs instead of integers from the CSV

User-visible repro on master:

$ printf 'v\n%s\n' $(seq 1 10000) > u8.csv
$ rayforce -e '(set t (.csv.read [U8] "u8.csv")) (count (xasc t (quote v)))'
AddressSanitizer: SEGV in fl_remove src/mem/heap.h:227
  ray_alloc → scratch_alloc → packed_radix_sort_run → sort_indices_ex
  → sort_table_by_keys → ray_xasc_fn

Fix: added CSV_TYPE_U8 / CSV_TYPE_I16 / CSV_TYPE_I32 enum values, mapped them in parse_types, and wired matching cases into csv_parse_fn and csv_parse_serial (both the main switch and the past-row-boundary fill). Each case parses via fast_i64 and narrows at write time, mirroring how as 'U8 truncates. Inference (promote_csv_type) is unchanged — the new narrow-int parse types are reachable only via explicit schema.

Test plan

  • make test — 2349 of 2351 passed (1 skipped, 1 failed)
    • The single failure (sym/like_fn/sym_vec_null_sym) is a pre-existing regression from d5c2caca (ClickBench LIKE-on-dict-SYM rewrite) — reproduces on clean master without this branch.
    • All 11 new tests in this PR pass.
  • Manual repro of both bugs reproduces on master, fixed on branch.
  • Rebased onto eac359ad cleanly — no conflicts with PR R6 csv empty sym #194 (the two PRs touch different functions in csv.c; R6 csv empty sym #194 modifies csv_intern_strings for empty SYM cells, this PR adds U8/I16/I32 cases to the parse-type map and parser switches).

Out of scope (follow-ups)

  • query.c:5004,5331 is_numeric_promo accepts only I64/I32/F64 for column-update operations — narrow-int columns from this CSV fix can't be updated with wider-type expressions without an explicit as. UX gap, not corruption; worth a separate ticket.

🤖 Generated with Claude Code

ser-vasilich and others added 2 commits May 6, 2026 23:25
radix_encode_fn for the narrow-int single-key path read raw column
bytes for null rows and folded them into the sort order — null rows
landed wherever their underlying byte data happened to encode,
ignoring the requested NULLS FIRST/LAST boundary.

I16 and BOOL/U8 now consult the column's null bitmap and use a
+1-shifted encoding when has_nulls is true, with the null sentinel
parked outside the data range so it cannot tie with INT16_MIN/MAX or
0xFF/0.  sort_indices_ex (and the matching window.c dispatch) bumps
key_nbytes_max by one to cover the extra byte.  The decode-shortcut
in ray_sort, exec_sort, and sort_table_by_keys is gated off for the
shifted encoding (radix_decode_into doesn't invert the shift) — we
fall back to plain gather, which is correct.

Tests cover ASC/DESC × nulls-first/last for I16 with negatives, U8
with high-byte data, and BOOL with mixed false/true so the encoded
null sentinel cannot be aliased by ordinary data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The resolved_type → parse_type map had no cases for RAY_U8, RAY_I16,
RAY_I32 — they fell through the default branch to CSV_TYPE_STR.
Then csv_intern_strings wrote 4-byte sym IDs into a column buffer
sized for 1/2/4 bytes per row: 4× heap overflow for U8 (eventual
SIGSEGV in later allocators), 2× for I16, and (most insidiously)
size-matching but value-wrong for I32 — the column held sym IDs
instead of the integers from the CSV.

Added CSV_TYPE_U8/I16/I32 enum values, mapped them in parse_types,
and wired matching cases into csv_parse_fn / csv_parse_serial (both
the main switch and the past-row-boundary fill).  Each case parses
via fast_i64 and narrows at write time, mirroring how `as 'U8`
truncates.  Inference (promote_csv_type) is unchanged — the new
narrow-int parse types are reachable only via explicit schema.

Tests cover U8 explicit schema (10k rows → parallel parse path,
followed by xasc to exercise the heap), I16 with empty-field nulls,
I32 with negative values, and a small U8/U8 case with truncated
rows that hits the serial parser's row-boundary fill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant