Scan pruning and sort-on-write (#20, #21, #22)#23
Conversation
Implements the Iceberg scan-performance set: prune non-matching data on read, and lay out data sorted so those bounds are tight. #20 — file pruning via manifest column bounds: - New src/write/serde.js: extract the Iceberg single-value codec + type comparator out of stats.js (no behavior change) and add deserializeValue, the read-side inverse of serializeValue. - New fileMightMatch in prune.js (sharing a refactored AND/OR/$nor walker with partitionMightMatch), AND'd into icebergDataSource scan planning. It decodes each data file's lower_bounds/upper_bounds and skips files whose bounds prove no row can match. Conservative: never range-prunes truncated string/binary bounds, keeps on any uncertainty. #21 — parquet row-group pruning (verify + test): - hyparquet's parquetPlan/canSkipRowGroup already skips row groups by per-row-group statistics + bloom filters when a filter is passed, which icebird already does. Added scanPruning.test.js asserting reduced bytes read for a selective predicate on a multi-row-group file, plus a missing-stats safe-fallback test. #22 — sort on append + compaction: - New src/write/sort.js buildSortComparator (direction, null-order, transforms, NaN-last, stable). prepareAppend now orders each written file by the table's default sort order and records the real sort_order_id; a sortOrderId override is threaded through icebergAppend/StageAppend/tx.append. Empty sort order is a no-op. - New src/write/rewrite.js + icebergRewrite (exported): reads every live row (deletes applied), sorts globally, regroups under the target spec, writes consolidated sorted files, and commits a replace snapshot. Not retried on conflict (would risk dropping concurrently-appended rows); v2-only for now (v3 row-lineage preservation is a follow-up). Docs: README Supported Features (Sorting, Scan Pruning) + a compaction snippet; data-source doc comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Codex | major — date bound vs Date literal mis-prune (src/prune.js:477, src/write/serde.js:145, src/sql/whereFilter.js:76) |
Yes — directly on the scan-pruning hot path (Targets fileMightMatch/boundsMightMatch; Risks bullet 1 & 2) |
| Claude | major — NaN sort-ordering branch untested (src/write/sort.js:84-91) |
Yes — Targets buildSortComparator/compareKeys; Risks bullet 3 |
Codex review
Fix Validations
#20 file pruning via manifest column bounds
- Status: incomplete
- Evidence:
src/sql/icebergDataSource.js:100,src/sql/icebergDataSource.js:102,src/prune.js:379,src/prune.js:394 - Assessment: Numeric/timestamp pruning is wired into scan planning and covered by tests, but the new bounds comparator can mis-prune
datecolumns when literals areDateobjects. Details below.
#21 parquet row-group pruning verification
- Status: correct
- Evidence:
src/sql/icebergDataSource.js:169,src/read.js:294,test/sql/scanPruning.test.js:262,test/sql/scanPruning.test.js:280 - Assessment: The existing path still forwards converted filters into parquet reads; the new tests cover byte reduction with stats and safe full-decode fallback without stats.
#22 sort on append and rewrite
- Status: correct
- Evidence:
src/write/stage.js:81,src/write/stage.js:86,src/write/stage.js:115,src/write/write.js:274,src/write/write.js:84 - Assessment: Append, transaction append, and one-call rewrite thread the sort order through the write path, and tests cover sorted output, empty sort order, override errors, rewrite conflict behavior, and v3 gating.
Findings
1) Behavioral Correctness
- Severity: major
- Confidence: high
- Evidence:
src/write/serde.js:145,src/write/serde.js:172,src/write/serde.js:197,src/prune.js:477,src/sql/whereFilter.js:76 - Why it matters: A
datecolumn bound is decoded as days-since-epoch, butDatefilter literals are passed through unchanged and compared by the generic fallback, so a matching file can be skipped. - Suggested fix: Add date-aware comparison normalization, e.g. convert
Dateliterals to epoch days fordatebefore callingcompare, and add afileMightMatchtest for adatefield withDateliterals.
No Finding
-
- Contract & Interface Fidelity
-
- Change Impact / Blast Radius
-
- Concurrency, Ordering & State Safety
-
- Error Handling & Resilience
-
- Security Surface
-
- Resource Lifecycle & Cleanup
-
- Release Safety
-
- Test Evidence Quality
-
- Architectural Consistency
-
- Debuggability & Operability
Evidence Bundle
- Changed hot paths:
icebergDataSourcescan planning,fileMightMatchmanifest-bound pruning,prepareAppendsort-on-write,icebergRewritereplace snapshots. - Impacted callers:
src/sql/icebergDataSource.js:100,src/write/write.js:37,src/write/write.js:84,src/write/write.js:274,src/types.d.ts:304 - Impacted tests:
test/sql/icebergDataSource.test.js:408,test/sql/scanPruning.test.js:101,test/sql/scanPruning.test.js:262,test/sql/scanPruning.test.js:280,test/write/sort.test.js:125,test/write/rewrite.test.js:80,test/write/rewrite.test.js:189 - Unresolved uncertainty: Whether squirreling SQL date literals normally arrive as
Dateor strings in all parser paths; the pruning API already accepts literal values directly, so the result-preserving path should still handleDate.
Claude review
Claude review
Sort-key NaN ordering branch is untested
- Severity: major
- Confidence: 82
- Evidence: src/write/sort.js:84-91
- Why it matters:
compareKeyshas a dedicated NaN branch ("NaN is ordered greatest (Iceberg); direction still reverses"), but no test uses a float/double sort column, so a sign error or a failure to reverse ondescin this correctness-critical sort branch would pass unnoticed. - Suggested fix: Add a
buildSortComparatortest over adouble/floatfield whose rows includeNaN, asserting NaN sorts greatest underasc(after the largest finite value) and smallest underdesc, plus theNaN-vs-NaNtie returning input order.
Reports: .git/dual-review/pr-23
Address the two major findings from the dual-agent review: - date mis-prune: compare() had no date case, so a Date query literal (coerced to milliseconds) was compared against bounds decoded as days-since-epoch, skipping matching files. Add a date case backed by a dateToDays() helper that normalizes Date/bigint/number/ISO-string to one domain and returns NaN (undecidable -> keep file) otherwise. Fixed in the canonical comparator so read-prune, write-stats, and sort all benefit; also fixes a latent bug for date columns mixing Date and number values. - Add fileMightMatch date tests (Date, numeric-days, ISO-string, unparseable). - Add buildSortComparator tests over a double field with NaN: NaN orders greatest under asc, reverses to first under desc, and NaN-vs-NaN is a stable tie. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🧭 Decision map — where to spend your attentionCompanion to the dual-review verdict. This casts no verdict — it points at the 5 forks where the author made a real choice, so you can skim the rest. Scanned: 36 hunks across 17 files (8 new). Most is mechanical: the ~315-line 1. Bounds pruning trusts the WHERE literal is already in Iceberg's integer domain · unhappy-path / silent data-lossconst c = compare(a, b, type)
return Number.isNaN(c) ? undefined : c
2.
|
CI runs `npx tsc` over src + test with strict/checkJs, but the new test
files had several type errors:
- prune.bounds: bare-value filters ({ id: 9n }) aren't modelled by
hyparquet's ParquetQueryFilter — cast past the type.
- scanPruning: type columnData/parquetSchema as ColumnSource[]/
SchemaElement[]; add required 'last-partition-id'; coerce ids with
Number() before subtracting in the sort comparator.
- sort: sort_order_id is optional, so widen the return to
(number | undefined)[].
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
platypii
left a comment
There was a problem hiding this comment.
read the code, looks clean
Implements the coordinated Iceberg scan-performance set — prune non-matching data on read (#20/#21) and lay data out sorted so those bounds are tight (#22). Closes #20, #21, #22.
A note on starting point: the partition-tuple half of #20 (
5f98e43) and the WHERE→parquet-filter pushdown that #21 relies on (f5a256f) had already shipped after the issues were filed. This PR builds the genuinely-missing pieces.#20 — file pruning via manifest column bounds
src/write/serde.js: extracts the Iceberg single-value codec + type comparator out ofstats.js(no behavior change there) and addsdeserializeValue, the read-side inverse ofserializeValue.fileMightMatchinsrc/prune.js, sharing a refactoredAND/OR/$norwalker withpartitionMightMatchand AND'd intoicebergDataSourcescan planning. It decodes each data file'slower_bounds/upper_boundsand skips files whose bounds prove no row can match. Conservative by design: never range-prunes truncated string/binary bounds; keeps the file on any uncertainty (a pure result-preserving refinement).price > 0, result unchanged) and icebird-written tables.#21 — parquet row-group pruning (verify + test)
parquetPlan/canSkipRowGroupalready skips row groups by per-row-group statistics + bloom filters when a filter is passed, which icebird already does. No production change.test/sql/scanPruning.test.js: a byte-reduction test on a multi-row-group file plus a missing-stats safe-fallback test (absent stats ⇒ full decode, correct results).#22 — sort on append + compaction
src/write/sort.jsbuildSortComparator(direction, null-order, transform keys, NaN-last, stable).prepareAppendnow orders each written file by the table's default sort order and records the realsort_order_id; asortOrderIdoverride is threaded throughicebergAppend/icebergStageAppend/tx.append. An empty sort order is a no-op, so existing append behavior is unchanged.src/write/rewrite.js+icebergRewrite(exported): reads every live row (deletes applied), sorts globally, regroups under the target partition spec, writes consolidated sorted files, and commits areplacesnapshot viabuildSnapshotUpdate'sskipPriorManifestPaths.Deliberate limitations
main) and should be re-run against fresh metadata. Covered by a test asserting no row loss._row_idlineage rather than letassignFirstRowIdsrenumber rewritten rows. Flagged in code as a follow-up.Tests & verification
npm run lintandnpm run build:typesclean.serde,prune.bounds,sql/scanPruning,write/sort,write/rewrite. Rewrite coverage includes round-trip multiset equality, non-overlapping sort-key bounds (targetFileRows), deletes consumed, partition-spec evolution, conflict safety, v3 gating, and the one-call API.🤖 Generated with Claude Code