fix: I16/U8 sort with nulls + CSV narrow-int explicit schema by ser-vasilich · Pull Request #197 · RayforceDB/rayforce

ser-vasilich · 2026-05-06T20:30:11Z

Summary

Two related correctness bugs in the narrow-int (U8 / I16) data path. Each fix lives in its own commit so bisect can isolate them.

Bug 1 — radix sort drops null bitmap for I16/BOOL/U8

radix_encode_fn for the narrow-int single-key path read raw column bytes for null rows, ignoring the column's null bitmap. xasc/xdesc, exec_sort, top-K, and window-function ORDER BY all placed null rows wherever the underlying byte data happened to encode, instead of at the requested NULLS FIRST/LAST boundary.

Fix: I16 and BOOL/U8 single-key encode now consults the null bitmap and uses a +1-shifted encoding when has_nulls is true, with the null sentinel parked outside the data range so it cannot tie with INT16_MIN/MAX or 0xFF/0. sort_indices_ex and the matching window.c dispatch bump key_nbytes_max by one to cover the extra byte. The decode-shortcut in ray_sort / exec_sort / sort_table_by_keys is gated off for the shifted encoding (radix_decode_into doesn't invert the shift) — falls back to plain gather, which is correct.

Bug 2 — CSV explicit `[U8]` / `[I16]` / `[I32]` schema corrupts data

The resolved_type → parse_type map had no cases for RAY_U8, RAY_I16, RAY_I32 — they all fell through default to CSV_TYPE_STR. csv_intern_strings then wrote 4-byte sym IDs into a column buffer sized for 1/2/4 bytes per row:

Schema	Allocated	Written	Result
`[U8]`	1 B/row	4 B/row	4× heap overflow → SIGSEGV in next `ray_alloc`
`[I16]`	2 B/row	4 B/row	2× overflow → heap corruption
`[I32]`	4 B/row	4 B/row	size matches; column silently holds sym IDs instead of integers from the CSV

User-visible repro on master:

$ printf 'v\n%s\n' $(seq 1 10000) > u8.csv
$ rayforce -e '(set t (.csv.read [U8] "u8.csv")) (count (xasc t (quote v)))'
AddressSanitizer: SEGV in fl_remove src/mem/heap.h:227
  ray_alloc → scratch_alloc → packed_radix_sort_run → sort_indices_ex
  → sort_table_by_keys → ray_xasc_fn

Fix: added CSV_TYPE_U8 / CSV_TYPE_I16 / CSV_TYPE_I32 enum values, mapped them in parse_types, and wired matching cases into csv_parse_fn and csv_parse_serial (both the main switch and the past-row-boundary fill). Each case parses via fast_i64 and narrows at write time, mirroring how as 'U8 truncates. Inference (promote_csv_type) is unchanged — the new narrow-int parse types are reachable only via explicit schema.

Test plan

make test — 2349 of 2351 passed (1 skipped, 1 failed)
- The single failure (sym/like_fn/sym_vec_null_sym) is a pre-existing regression from d5c2caca (ClickBench LIKE-on-dict-SYM rewrite) — reproduces on clean master without this branch.
- All 11 new tests in this PR pass.
Manual repro of both bugs reproduces on master, fixed on branch.
Rebased onto eac359ad cleanly — no conflicts with PR R6 csv empty sym #194 (the two PRs touch different functions in csv.c; R6 csv empty sym #194 modifies csv_intern_strings for empty SYM cells, this PR adds U8/I16/I32 cases to the parse-type map and parser switches).

Out of scope (follow-ups)

query.c:5004,5331 is_numeric_promo accepts only I64/I32/F64 for column-update operations — narrow-int columns from this CSV fix can't be updated with wider-type expressions without an explicit as. UX gap, not corruption; worth a separate ticket.

🤖 Generated with Claude Code

radix_encode_fn for the narrow-int single-key path read raw column bytes for null rows and folded them into the sort order — null rows landed wherever their underlying byte data happened to encode, ignoring the requested NULLS FIRST/LAST boundary. I16 and BOOL/U8 now consult the column's null bitmap and use a +1-shifted encoding when has_nulls is true, with the null sentinel parked outside the data range so it cannot tie with INT16_MIN/MAX or 0xFF/0. sort_indices_ex (and the matching window.c dispatch) bumps key_nbytes_max by one to cover the extra byte. The decode-shortcut in ray_sort, exec_sort, and sort_table_by_keys is gated off for the shifted encoding (radix_decode_into doesn't invert the shift) — we fall back to plain gather, which is correct. Tests cover ASC/DESC × nulls-first/last for I16 with negatives, U8 with high-byte data, and BOOL with mixed false/true so the encoded null sentinel cannot be aliased by ordinary data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The resolved_type → parse_type map had no cases for RAY_U8, RAY_I16, RAY_I32 — they fell through the default branch to CSV_TYPE_STR. Then csv_intern_strings wrote 4-byte sym IDs into a column buffer sized for 1/2/4 bytes per row: 4× heap overflow for U8 (eventual SIGSEGV in later allocators), 2× for I16, and (most insidiously) size-matching but value-wrong for I32 — the column held sym IDs instead of the integers from the CSV. Added CSV_TYPE_U8/I16/I32 enum values, mapped them in parse_types, and wired matching cases into csv_parse_fn / csv_parse_serial (both the main switch and the past-row-boundary fill). Each case parses via fast_i64 and narrows at write time, mirroring how `as 'U8` truncates. Inference (promote_csv_type) is unchanged — the new narrow-int parse types are reachable only via explicit schema. Tests cover U8 explicit schema (10k rows → parallel parse path, followed by xasc to exercise the heap), I16 with empty-field nulls, I32 with negative values, and a small U8/U8 case with truncated rows that hits the serial parser's row-boundary fill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ser-vasilich and others added 2 commits May 6, 2026 23:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: I16/U8 sort with nulls + CSV narrow-int explicit schema#197

fix: I16/U8 sort with nulls + CSV narrow-int explicit schema#197
ser-vasilich wants to merge 2 commits intomasterfrom
fix/i16-u8-sort-and-csv-narrow-int

ser-vasilich commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ser-vasilich commented May 6, 2026

Summary

Bug 1 — radix sort drops null bitmap for I16/BOOL/U8

Bug 2 — CSV explicit [U8] / [I16] / [I32] schema corrupts data

Test plan

Out of scope (follow-ups)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 2 — CSV explicit `[U8]` / `[I16]` / `[I32]` schema corrupts data