Skip to content

Scan pruning and sort-on-write (#20, #21, #22)#23

Merged
philcunliffe merged 3 commits into
masterfrom
scan-performance-pruning-sort
Jun 9, 2026
Merged

Scan pruning and sort-on-write (#20, #21, #22)#23
philcunliffe merged 3 commits into
masterfrom
scan-performance-pruning-sort

Conversation

@philcunliffe

Copy link
Copy Markdown
Contributor

Implements the coordinated Iceberg scan-performance set — prune non-matching data on read (#20/#21) and lay data out sorted so those bounds are tight (#22). Closes #20, #21, #22.

A note on starting point: the partition-tuple half of #20 (5f98e43) and the WHERE→parquet-filter pushdown that #21 relies on (f5a256f) had already shipped after the issues were filed. This PR builds the genuinely-missing pieces.

#20 — file pruning via manifest column bounds

  • New src/write/serde.js: extracts the Iceberg single-value codec + type comparator out of stats.js (no behavior change there) and adds deserializeValue, the read-side inverse of serializeValue.
  • New fileMightMatch in src/prune.js, sharing a refactored AND/OR/$nor walker with partitionMightMatch and AND'd into icebergDataSource scan planning. It decodes each data file's lower_bounds/upper_bounds and skips files whose bounds prove no row can match. Conservative by design: never range-prunes truncated string/binary bounds; keeps the file on any uncertainty (a pure result-preserving refinement).
  • Validated against real Spark-written manifests (an existing test now opens 2 files instead of 3 for price > 0, result unchanged) and icebird-written tables.

#21 — parquet row-group pruning (verify + test)

  • hyparquet's parquetPlan/canSkipRowGroup already skips row groups by per-row-group statistics + bloom filters when a filter is passed, which icebird already does. No production change.
  • Added test/sql/scanPruning.test.js: a byte-reduction test on a multi-row-group file plus a missing-stats safe-fallback test (absent stats ⇒ full decode, correct results).

#22 — sort on append + compaction

  • New src/write/sort.js buildSortComparator (direction, null-order, transform keys, NaN-last, stable). prepareAppend now orders each written file by the table's default sort order and records the real sort_order_id; a sortOrderId override is threaded through icebergAppend / icebergStageAppend / tx.append. An empty sort order is a no-op, so existing append behavior is unchanged.
  • New src/write/rewrite.js + icebergRewrite (exported): reads every live row (deletes applied), sorts globally, regroups under the target partition spec, writes consolidated sorted files, and commits a replace snapshot via buildSnapshotUpdate's skipPriorManifestPaths.

Deliberate limitations

  • Rewrite is not retried on a concurrent commit — it only rewrote the rows it read, so a blind retry could silently drop rows another writer appended. On conflict it throws (CAS on main) and should be re-run against fresh metadata. Covered by a test asserting no row loss.
  • Compaction is v2-only for now; it throws a clear error on format-version 3, since correct v3 support must preserve _row_id lineage rather than let assignFirstRowIds renumber rewritten rows. Flagged in code as a follow-up.

Tests & verification

  • +55 tests (509 → 564), all passing; npm run lint and npm run build:types clean.
  • New suites: serde, prune.bounds, sql/scanPruning, write/sort, write/rewrite. Rewrite coverage includes round-trip multiset equality, non-overlapping sort-key bounds (targetFileRows), deletes consumed, partition-spec evolution, conflict safety, v3 gating, and the one-call API.

🤖 Generated with Claude Code

Implements the Iceberg scan-performance set: prune non-matching data on
read, and lay out data sorted so those bounds are tight.

#20 — file pruning via manifest column bounds:
- New src/write/serde.js: extract the Iceberg single-value codec + type
  comparator out of stats.js (no behavior change) and add deserializeValue,
  the read-side inverse of serializeValue.
- New fileMightMatch in prune.js (sharing a refactored AND/OR/$nor walker
  with partitionMightMatch), AND'd into icebergDataSource scan planning. It
  decodes each data file's lower_bounds/upper_bounds and skips files whose
  bounds prove no row can match. Conservative: never range-prunes truncated
  string/binary bounds, keeps on any uncertainty.

#21 — parquet row-group pruning (verify + test):
- hyparquet's parquetPlan/canSkipRowGroup already skips row groups by
  per-row-group statistics + bloom filters when a filter is passed, which
  icebird already does. Added scanPruning.test.js asserting reduced bytes
  read for a selective predicate on a multi-row-group file, plus a
  missing-stats safe-fallback test.

#22 — sort on append + compaction:
- New src/write/sort.js buildSortComparator (direction, null-order,
  transforms, NaN-last, stable). prepareAppend now orders each written file
  by the table's default sort order and records the real sort_order_id; a
  sortOrderId override is threaded through icebergAppend/StageAppend/tx.append.
  Empty sort order is a no-op.
- New src/write/rewrite.js + icebergRewrite (exported): reads every live row
  (deletes applied), sorts globally, regroups under the target spec, writes
  consolidated sorted files, and commits a replace snapshot. Not retried on
  conflict (would risk dropping concurrently-appended rows); v2-only for now
  (v3 row-lineage preservation is a follow-up).

Docs: README Supported Features (Sorting, Scan Pruning) + a compaction
snippet; data-source doc comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Codex major — date bound vs Date literal mis-prune (src/prune.js:477, src/write/serde.js:145, src/sql/whereFilter.js:76) Yes — directly on the scan-pruning hot path (Targets fileMightMatch/boundsMightMatch; Risks bullet 1 & 2)
Claude major — NaN sort-ordering branch untested (src/write/sort.js:84-91) Yes — Targets buildSortComparator/compareKeys; Risks bullet 3
Codex review

Fix Validations

#20 file pruning via manifest column bounds

  • Status: incomplete
  • Evidence: src/sql/icebergDataSource.js:100, src/sql/icebergDataSource.js:102, src/prune.js:379, src/prune.js:394
  • Assessment: Numeric/timestamp pruning is wired into scan planning and covered by tests, but the new bounds comparator can mis-prune date columns when literals are Date objects. Details below.

#21 parquet row-group pruning verification

  • Status: correct
  • Evidence: src/sql/icebergDataSource.js:169, src/read.js:294, test/sql/scanPruning.test.js:262, test/sql/scanPruning.test.js:280
  • Assessment: The existing path still forwards converted filters into parquet reads; the new tests cover byte reduction with stats and safe full-decode fallback without stats.

#22 sort on append and rewrite

  • Status: correct
  • Evidence: src/write/stage.js:81, src/write/stage.js:86, src/write/stage.js:115, src/write/write.js:274, src/write/write.js:84
  • Assessment: Append, transaction append, and one-call rewrite thread the sort order through the write path, and tests cover sorted output, empty sort order, override errors, rewrite conflict behavior, and v3 gating.

Findings

1) Behavioral Correctness

  • Severity: major
  • Confidence: high
  • Evidence: src/write/serde.js:145, src/write/serde.js:172, src/write/serde.js:197, src/prune.js:477, src/sql/whereFilter.js:76
  • Why it matters: A date column bound is decoded as days-since-epoch, but Date filter literals are passed through unchanged and compared by the generic fallback, so a matching file can be skipped.
  • Suggested fix: Add date-aware comparison normalization, e.g. convert Date literals to epoch days for date before calling compare, and add a fileMightMatch test for a date field with Date literals.

No Finding

    1. Contract & Interface Fidelity
    1. Change Impact / Blast Radius
    1. Concurrency, Ordering & State Safety
    1. Error Handling & Resilience
    1. Security Surface
    1. Resource Lifecycle & Cleanup
    1. Release Safety
    1. Test Evidence Quality
    1. Architectural Consistency
    1. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: icebergDataSource scan planning, fileMightMatch manifest-bound pruning, prepareAppend sort-on-write, icebergRewrite replace snapshots.
  • Impacted callers: src/sql/icebergDataSource.js:100, src/write/write.js:37, src/write/write.js:84, src/write/write.js:274, src/types.d.ts:304
  • Impacted tests: test/sql/icebergDataSource.test.js:408, test/sql/scanPruning.test.js:101, test/sql/scanPruning.test.js:262, test/sql/scanPruning.test.js:280, test/write/sort.test.js:125, test/write/rewrite.test.js:80, test/write/rewrite.test.js:189
  • Unresolved uncertainty: Whether squirreling SQL date literals normally arrive as Date or strings in all parser paths; the pruning API already accepts literal values directly, so the result-preserving path should still handle Date.
Claude review

Claude review

Sort-key NaN ordering branch is untested

  • Severity: major
  • Confidence: 82
  • Evidence: src/write/sort.js:84-91
  • Why it matters: compareKeys has a dedicated NaN branch ("NaN is ordered greatest (Iceberg); direction still reverses"), but no test uses a float/double sort column, so a sign error or a failure to reverse on desc in this correctness-critical sort branch would pass unnoticed.
  • Suggested fix: Add a buildSortComparator test over a double/float field whose rows include NaN, asserting NaN sorts greatest under asc (after the largest finite value) and smallest under desc, plus the NaN-vs-NaN tie returning input order.

Reports: .git/dual-review/pr-23

Address the two major findings from the dual-agent review:

- date mis-prune: compare() had no date case, so a Date query literal
  (coerced to milliseconds) was compared against bounds decoded as
  days-since-epoch, skipping matching files. Add a date case backed by a
  dateToDays() helper that normalizes Date/bigint/number/ISO-string to one
  domain and returns NaN (undecidable -> keep file) otherwise. Fixed in the
  canonical comparator so read-prune, write-stats, and sort all benefit;
  also fixes a latent bug for date columns mixing Date and number values.
- Add fileMightMatch date tests (Date, numeric-days, ISO-string, unparseable).
- Add buildSortComparator tests over a double field with NaN: NaN orders
  greatest under asc, reverses to first under desc, and NaN-vs-NaN is a
  stable tie.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

🧭 Decision map — where to spend your attention

Companion to the dual-review verdict. This casts no verdict — it points at the 5 forks where the author made a real choice, so you can skim the rest.

Scanned: 36 hunks across 17 files (8 new). Most is mechanical: the ~315-line serde.js extraction moved verbatim out of stats.js (~half the diff, byte-identical), the optional sortOrderId plumbed through the append call chain, the nodeMightMatch(node, ctx)(node, leafFn) refactor, and ~1100 lines of new tests. The decisions worth your eyes, in order:

1. Bounds pruning trusts the WHERE literal is already in Iceberg's integer domain · unhappy-path / silent data-loss

src/prune.js:533

const c = compare(a, b, type)
return Number.isNaN(c) ? undefined : c
  • Decision: compare the decoded manifest bound directly against the raw SQL literal, relying on compare() + keep-on-uncertainty (undefined/NaN → keep) as the only guard.
  • Alternative not taken: normalise the literal to the per-type integer domain first (as compare already does for timestamp via timestampToMicros), or return undefined when the JS types of the two operands disagree.
  • Check: compare() has no date/time case — both hit the numeric default. A date bound is an epoch-day number; a literal arriving as a Date/epoch-millis still compares cleanly (no NaN, no throw), so the file is pruned — silently dropping rows. The new test covers timestamp (which normalises Date) but not date. Codex major; verify the literal is guaranteed epoch-days/micros by the time it reaches boundsMightMatch.

2. icebergRewrite fails fast on a concurrent commit instead of retrying · concurrency / lifecycle

src/write/write.js:94

// Single commit attempt: see the doc comment on why a rewrite must not retry.
return await commitStaged(catalog, { namespace, table }, ctx, staged)
  • Decision: one commit attempt; conflict throws and the caller re-runs against fresh metadata.
  • Alternative not taken: reuse icebergAppend's commitWithRetry optimistic-retry loop.
  • Check: a blind retry would drop rows a concurrent writer appended after the rewrite read them — fail-fast is the safe call, but it's a deliberate divergence from every other writer here. Confirm no caller re-wraps it in a generic retry, and that the CAS conflict is real (safety test asserts /ref main expected snapshot/).

3. Sort comparator orders NaN greatest and lets desc reverse it · algorithm / contract

src/write/sort.js:84

// NaN is ordered greatest (Iceberg); direction still reverses.
if (aNaN || bNaN) {
  if (aNaN && bNaN) return 0
  const c = aNaN ? 1 : -1
  return desc ? -c : c
}
  • Decision: NaN sorts greatest; desc flips it to the front; NaN-vs-NaN ties.
  • Alternative not taken: order NaN least, or keep its placement fixed under direction.
  • Check: this ordering tightens per-file bounds for pruning; a sign/reversal error degrades pruning silently. No float/double sort column is tested anywhere (Claude major) — add a double+NaN case asserting greatest under asc, first under desc, stable on ties.

4. icebergRewrite is gated to format-version 2 only · contract / unhappy-path

src/write/rewrite.js:52

if (formatVersion !== 2) {
  throw new Error(`icebergRewrite supports format-version 2 only (got ${formatVersion}); v3 row lineage is not yet handled`)
}
  • Decision: refuse v3 rather than rewrite it.
  • Alternative not taken: preserve _row_id/_last_updated_sequence_number lineage instead of letting assignFirstRowIds renumber rewritten rows.
  • Check: confirm this is a conscious scope cut (a v3 rewrite that dropped lineage would corrupt it), reachable before any destructive write. Tested: rejects … /format-version 2 only/.

5. Bounds pruning restricted to an explicit orderable-type allowlist · algorithm / correctness firewall

src/prune.js:436

case 'date':
case 'time':
case 'timestamp':
  ...
  return true
default:
  return false   // string/binary/fixed/uuid bounds are truncated → never range-prune
  • Decision: only fully-ordered, untruncated types are range-pruned; truncated-prefix types are always kept.
  • Alternative not taken: prefix-aware pruning on truncated string/binary bounds.
  • Check: this switch is the firewall against pruning on a truncated prefix (which would drop files). Its membership is load-bearing — and note date/time are included here, which is exactly what feeds them into the literal-domain comparison flagged in entry 1.

Honorable mentions (real but lower-stakes): rewrite.js:96targetFileRows splits by fixed row count, so the documented "non-overlapping files" invariant holds only when boundary sort keys differ; prune.js:491$ne/$nin never prune ("too rare to bother"); prune.js:409boundForField accepts both Avro-array and plain-object bound maps; appliedSortOrderId = comparator ? orderId : 0 — an empty sort order records sort_order_id: 0.

Generated by /decision-map. Advisory — directs attention, casts no verdict.

CI runs `npx tsc` over src + test with strict/checkJs, but the new test
files had several type errors:

- prune.bounds: bare-value filters ({ id: 9n }) aren't modelled by
  hyparquet's ParquetQueryFilter — cast past the type.
- scanPruning: type columnData/parquetSchema as ColumnSource[]/
  SchemaElement[]; add required 'last-partition-id'; coerce ids with
  Number() before subtracting in the sort comparator.
- sort: sort_order_id is optional, so widen the return to
  (number | undefined)[].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@platypii platypii left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read the code, looks clean

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read pruning: skip data files via partition values + manifest column bounds

2 participants