Scan pruning and sort-on-write (#20, #21, #22) by philcunliffe · Pull Request #23 · hyparam/icebird

philcunliffe · 2026-06-09T17:48:51Z

Implements the coordinated Iceberg scan-performance set — prune non-matching data on read (#20/#21) and lay data out sorted so those bounds are tight (#22). Closes #20, #21, #22.

A note on starting point: the partition-tuple half of #20 (5f98e43) and the WHERE→parquet-filter pushdown that #21 relies on (f5a256f) had already shipped after the issues were filed. This PR builds the genuinely-missing pieces.

#20 — file pruning via manifest column bounds

New src/write/serde.js: extracts the Iceberg single-value codec + type comparator out of stats.js (no behavior change there) and adds deserializeValue, the read-side inverse of serializeValue.
New fileMightMatch in src/prune.js, sharing a refactored AND/OR/$nor walker with partitionMightMatch and AND'd into icebergDataSource scan planning. It decodes each data file's lower_bounds/upper_bounds and skips files whose bounds prove no row can match. Conservative by design: never range-prunes truncated string/binary bounds; keeps the file on any uncertainty (a pure result-preserving refinement).
Validated against real Spark-written manifests (an existing test now opens 2 files instead of 3 for price > 0, result unchanged) and icebird-written tables.

#21 — parquet row-group pruning (verify + test)

hyparquet's parquetPlan/canSkipRowGroup already skips row groups by per-row-group statistics + bloom filters when a filter is passed, which icebird already does. No production change.
Added test/sql/scanPruning.test.js: a byte-reduction test on a multi-row-group file plus a missing-stats safe-fallback test (absent stats ⇒ full decode, correct results).

#22 — sort on append + compaction

New src/write/sort.js buildSortComparator (direction, null-order, transform keys, NaN-last, stable). prepareAppend now orders each written file by the table's default sort order and records the real sort_order_id; a sortOrderId override is threaded through icebergAppend / icebergStageAppend / tx.append. An empty sort order is a no-op, so existing append behavior is unchanged.
New src/write/rewrite.js + icebergRewrite (exported): reads every live row (deletes applied), sorts globally, regroups under the target partition spec, writes consolidated sorted files, and commits a replace snapshot via buildSnapshotUpdate's skipPriorManifestPaths.

Deliberate limitations

Rewrite is not retried on a concurrent commit — it only rewrote the rows it read, so a blind retry could silently drop rows another writer appended. On conflict it throws (CAS on main) and should be re-run against fresh metadata. Covered by a test asserting no row loss.
Compaction is v2-only for now; it throws a clear error on format-version 3, since correct v3 support must preserve _row_id lineage rather than let assignFirstRowIds renumber rewritten rows. Flagged in code as a follow-up.

Tests & verification

+55 tests (509 → 564), all passing; npm run lint and npm run build:types clean.
New suites: serde, prune.bounds, sql/scanPruning, write/sort, write/rewrite. Rewrite coverage includes round-trip multiset equality, non-overlapping sort-key bounds (targetFileRows), deletes consumed, partition-spec evolution, conflict safety, v3 gating, and the one-call API.

🤖 Generated with Claude Code

Implements the Iceberg scan-performance set: prune non-matching data on read, and lay out data sorted so those bounds are tight. #20 — file pruning via manifest column bounds: - New src/write/serde.js: extract the Iceberg single-value codec + type comparator out of stats.js (no behavior change) and add deserializeValue, the read-side inverse of serializeValue. - New fileMightMatch in prune.js (sharing a refactored AND/OR/$nor walker with partitionMightMatch), AND'd into icebergDataSource scan planning. It decodes each data file's lower_bounds/upper_bounds and skips files whose bounds prove no row can match. Conservative: never range-prunes truncated string/binary bounds, keeps on any uncertainty. #21 — parquet row-group pruning (verify + test): - hyparquet's parquetPlan/canSkipRowGroup already skips row groups by per-row-group statistics + bloom filters when a filter is passed, which icebird already does. Added scanPruning.test.js asserting reduced bytes read for a selective predicate on a multi-row-group file, plus a missing-stats safe-fallback test. #22 — sort on append + compaction: - New src/write/sort.js buildSortComparator (direction, null-order, transforms, NaN-last, stable). prepareAppend now orders each written file by the table's default sort order and records the real sort_order_id; a sortOrderId override is threaded through icebergAppend/StageAppend/tx.append. Empty sort order is a no-op. - New src/write/rewrite.js + icebergRewrite (exported): reads every live row (deletes applied), sorts globally, regroups under the target spec, writes consolidated sorted files, and commits a replace snapshot. Not retried on conflict (would risk dropping concurrently-appended rows); v2-only for now (v3 row-lineage preservation is a follow-up). Docs: README Supported Features (Sorting, Scan Pruning) + a compaction snippet; data-source doc comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

philcunliffe · 2026-06-09T18:27:31Z

Dual-agent review — `request_changes`

Verdict: request_changes
Risk class: medium
Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source	Finding (severity, evidence)	Intersects
Codex	major — `date` bound vs `Date` literal mis-prune (`src/prune.js:477`, `src/write/serde.js:145`, `src/sql/whereFilter.js:76`)	Yes — directly on the scan-pruning hot path (Targets `fileMightMatch`/`boundsMightMatch`; Risks bullet 1 & 2)
Claude	major — NaN sort-ordering branch untested (`src/write/sort.js:84-91`)	Yes — Targets `buildSortComparator`/`compareKeys`; Risks bullet 3

Codex review

Fix Validations

#20 file pruning via manifest column bounds

Status: incomplete
Evidence: src/sql/icebergDataSource.js:100, src/sql/icebergDataSource.js:102, src/prune.js:379, src/prune.js:394
Assessment: Numeric/timestamp pruning is wired into scan planning and covered by tests, but the new bounds comparator can mis-prune date columns when literals are Date objects. Details below.

#21 parquet row-group pruning verification

Status: correct
Evidence: src/sql/icebergDataSource.js:169, src/read.js:294, test/sql/scanPruning.test.js:262, test/sql/scanPruning.test.js:280
Assessment: The existing path still forwards converted filters into parquet reads; the new tests cover byte reduction with stats and safe full-decode fallback without stats.

#22 sort on append and rewrite

Status: correct
Evidence: src/write/stage.js:81, src/write/stage.js:86, src/write/stage.js:115, src/write/write.js:274, src/write/write.js:84
Assessment: Append, transaction append, and one-call rewrite thread the sort order through the write path, and tests cover sorted output, empty sort order, override errors, rewrite conflict behavior, and v3 gating.

Findings

1) Behavioral Correctness

Severity: major
Confidence: high
Evidence: src/write/serde.js:145, src/write/serde.js:172, src/write/serde.js:197, src/prune.js:477, src/sql/whereFilter.js:76
Why it matters: A date column bound is decoded as days-since-epoch, but Date filter literals are passed through unchanged and compared by the generic fallback, so a matching file can be skipped.
Suggested fix: Add date-aware comparison normalization, e.g. convert Date literals to epoch days for date before calling compare, and add a fileMightMatch test for a date field with Date literals.

No Finding

1. Contract & Interface Fidelity
1. Change Impact / Blast Radius
1. Concurrency, Ordering & State Safety
1. Error Handling & Resilience
1. Security Surface
1. Resource Lifecycle & Cleanup
1. Release Safety
1. Test Evidence Quality
1. Architectural Consistency
1. Debuggability & Operability

Evidence Bundle

Changed hot paths: icebergDataSource scan planning, fileMightMatch manifest-bound pruning, prepareAppend sort-on-write, icebergRewrite replace snapshots.
Impacted callers: src/sql/icebergDataSource.js:100, src/write/write.js:37, src/write/write.js:84, src/write/write.js:274, src/types.d.ts:304
Impacted tests: test/sql/icebergDataSource.test.js:408, test/sql/scanPruning.test.js:101, test/sql/scanPruning.test.js:262, test/sql/scanPruning.test.js:280, test/write/sort.test.js:125, test/write/rewrite.test.js:80, test/write/rewrite.test.js:189
Unresolved uncertainty: Whether squirreling SQL date literals normally arrive as Date or strings in all parser paths; the pruning API already accepts literal values directly, so the result-preserving path should still handle Date.

Claude review

Claude review

Sort-key NaN ordering branch is untested

Severity: major
Confidence: 82
Evidence: src/write/sort.js:84-91
Why it matters: compareKeys has a dedicated NaN branch ("NaN is ordered greatest (Iceberg); direction still reverses"), but no test uses a float/double sort column, so a sign error or a failure to reverse on desc in this correctness-critical sort branch would pass unnoticed.
Suggested fix: Add a buildSortComparator test over a double/float field whose rows include NaN, asserting NaN sorts greatest under asc (after the largest finite value) and smallest under desc, plus the NaN-vs-NaN tie returning input order.

Reports: .git/dual-review/pr-23

Address the two major findings from the dual-agent review: - date mis-prune: compare() had no date case, so a Date query literal (coerced to milliseconds) was compared against bounds decoded as days-since-epoch, skipping matching files. Add a date case backed by a dateToDays() helper that normalizes Date/bigint/number/ISO-string to one domain and returns NaN (undecidable -> keep file) otherwise. Fixed in the canonical comparator so read-prune, write-stats, and sort all benefit; also fixes a latent bug for date columns mixing Date and number values. - Add fileMightMatch date tests (Date, numeric-days, ISO-string, unparseable). - Add buildSortComparator tests over a double field with NaN: NaN orders greatest under asc, reverses to first under desc, and NaN-vs-NaN is a stable tie. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

philcunliffe · 2026-06-09T18:47:15Z

🧭 Decision map — where to spend your attention

Companion to the dual-review verdict. This casts no verdict — it points at the 5 forks where the author made a real choice, so you can skim the rest.

Scanned: 36 hunks across 17 files (8 new). Most is mechanical: the ~315-line serde.js extraction moved verbatim out of stats.js (~half the diff, byte-identical), the optional sortOrderId plumbed through the append call chain, the nodeMightMatch(node, ctx)→(node, leafFn) refactor, and ~1100 lines of new tests. The decisions worth your eyes, in order:

1. Bounds pruning trusts the WHERE literal is already in Iceberg's integer domain · unhappy-path / silent data-loss

src/prune.js:533

const c = compare(a, b, type)
return Number.isNaN(c) ? undefined : c

Decision: compare the decoded manifest bound directly against the raw SQL literal, relying on compare() + keep-on-uncertainty (undefined/NaN → keep) as the only guard.
Alternative not taken: normalise the literal to the per-type integer domain first (as compare already does for timestamp via timestampToMicros), or return undefined when the JS types of the two operands disagree.
Check: compare() has no date/time case — both hit the numeric default. A date bound is an epoch-day number; a literal arriving as a Date/epoch-millis still compares cleanly (no NaN, no throw), so the file is pruned — silently dropping rows. The new test covers timestamp (which normalises Date) but not date. Codex major; verify the literal is guaranteed epoch-days/micros by the time it reaches boundsMightMatch.

2. `icebergRewrite` fails fast on a concurrent commit instead of retrying · concurrency / lifecycle

src/write/write.js:94

// Single commit attempt: see the doc comment on why a rewrite must not retry.
return await commitStaged(catalog, { namespace, table }, ctx, staged)

Decision: one commit attempt; conflict throws and the caller re-runs against fresh metadata.
Alternative not taken: reuse icebergAppend's commitWithRetry optimistic-retry loop.
Check: a blind retry would drop rows a concurrent writer appended after the rewrite read them — fail-fast is the safe call, but it's a deliberate divergence from every other writer here. Confirm no caller re-wraps it in a generic retry, and that the CAS conflict is real (safety test asserts /ref main expected snapshot/).

3. Sort comparator orders `NaN` greatest and lets `desc` reverse it · algorithm / contract

src/write/sort.js:84

// NaN is ordered greatest (Iceberg); direction still reverses.
if (aNaN || bNaN) {
  if (aNaN && bNaN) return 0
  const c = aNaN ? 1 : -1
  return desc ? -c : c
}

Decision: NaN sorts greatest; desc flips it to the front; NaN-vs-NaN ties.
Alternative not taken: order NaN least, or keep its placement fixed under direction.
Check: this ordering tightens per-file bounds for pruning; a sign/reversal error degrades pruning silently. No float/double sort column is tested anywhere (Claude major) — add a double+NaN case asserting greatest under asc, first under desc, stable on ties.

4. `icebergRewrite` is gated to format-version 2 only · contract / unhappy-path

src/write/rewrite.js:52

if (formatVersion !== 2) {
  throw new Error(`icebergRewrite supports format-version 2 only (got ${formatVersion}); v3 row lineage is not yet handled`)
}

Decision: refuse v3 rather than rewrite it.
Alternative not taken: preserve _row_id/_last_updated_sequence_number lineage instead of letting assignFirstRowIds renumber rewritten rows.
Check: confirm this is a conscious scope cut (a v3 rewrite that dropped lineage would corrupt it), reachable before any destructive write. Tested: rejects … /format-version 2 only/.

5. Bounds pruning restricted to an explicit orderable-type allowlist · algorithm / correctness firewall

src/prune.js:436

case 'date':
case 'time':
case 'timestamp':
  ...
  return true
default:
  return false   // string/binary/fixed/uuid bounds are truncated → never range-prune

Decision: only fully-ordered, untruncated types are range-pruned; truncated-prefix types are always kept.
Alternative not taken: prefix-aware pruning on truncated string/binary bounds.
Check: this switch is the firewall against pruning on a truncated prefix (which would drop files). Its membership is load-bearing — and note date/time are included here, which is exactly what feeds them into the literal-domain comparison flagged in entry 1.

Honorable mentions (real but lower-stakes): rewrite.js:96 — targetFileRows splits by fixed row count, so the documented "non-overlapping files" invariant holds only when boundary sort keys differ; prune.js:491 — $ne/$nin never prune ("too rare to bother"); prune.js:409 — boundForField accepts both Avro-array and plain-object bound maps; appliedSortOrderId = comparator ? orderId : 0 — an empty sort order records sort_order_id: 0.

_{Generated by /decision-map. Advisory — directs attention, casts no verdict.}

CI runs `npx tsc` over src + test with strict/checkJs, but the new test files had several type errors: - prune.bounds: bare-value filters ({ id: 9n }) aren't modelled by hyparquet's ParquetQueryFilter — cast past the type. - scanPruning: type columnData/parquetSchema as ColumnSource[]/ SchemaElement[]; add required 'last-partition-id'; coerce ids with Number() before subtracting in the sort comparator. - sort: sort_order_id is optional, so widen the return to (number | undefined)[]. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

platypii

read the code, looks clean

platypii approved these changes Jun 9, 2026

View reviewed changes

philcunliffe merged commit 3edb15b into master Jun 9, 2026
6 checks passed

This was referenced Jun 10, 2026

Read pruning: skip parquet row groups via column statistics #21

Closed

Write: honor declared sort orders on append and compaction #22

Closed

Support format-version 3 in icebergRewrite (preserve row lineage) #24

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Scan pruning and sort-on-write (#20, #21, #22)#23

Scan pruning and sort-on-write (#20, #21, #22)#23
philcunliffe merged 3 commits into
masterfrom
scan-performance-pruning-sort

philcunliffe commented Jun 9, 2026

Uh oh!

philcunliffe commented Jun 9, 2026

Fix Validations

#20 file pruning via manifest column bounds

#21 parquet row-group pruning verification

#22 sort on append and rewrite

Findings

1) Behavioral Correctness

No Finding

Evidence Bundle

Claude review

Sort-key NaN ordering branch is untested

Uh oh!

philcunliffe commented Jun 9, 2026

Uh oh!

platypii left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

philcunliffe commented Jun 9, 2026

#20 — file pruning via manifest column bounds

#21 — parquet row-group pruning (verify + test)

#22 — sort on append + compaction

Deliberate limitations

Tests & verification

Uh oh!

philcunliffe commented Jun 9, 2026

Dual-agent review — request_changes

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Fix Validations

#20 file pruning via manifest column bounds

#21 parquet row-group pruning verification

#22 sort on append and rewrite

Findings

1) Behavioral Correctness

No Finding

Evidence Bundle

Claude review

Sort-key NaN ordering branch is untested

Uh oh!

philcunliffe commented Jun 9, 2026

🧭 Decision map — where to spend your attention

1. Bounds pruning trusts the WHERE literal is already in Iceberg's integer domain · unhappy-path / silent data-loss

2. icebergRewrite fails fast on a concurrent commit instead of retrying · concurrency / lifecycle

3. Sort comparator orders NaN greatest and lets desc reverse it · algorithm / contract

4. icebergRewrite is gated to format-version 2 only · contract / unhappy-path

5. Bounds pruning restricted to an explicit orderable-type allowlist · algorithm / correctness firewall

Uh oh!

platypii left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Dual-agent review — `request_changes`

2. `icebergRewrite` fails fast on a concurrent commit instead of retrying · concurrency / lifecycle

3. Sort comparator orders `NaN` greatest and lets `desc` reverse it · algorithm / contract

4. `icebergRewrite` is gated to format-version 2 only · contract / unhappy-path