Support format-version 3 in icebergRewrite (preserve row lineage) by philcunliffe · Pull Request #24 · hyparam/icebird

philcunliffe · 2026-06-10T22:20:32Z

Implements the v3 rewrite follow-up deferred in #23: icebergRewrite now works on format-version 3 tables and preserves row lineage exactly, instead of throwing format-version 2 only.

The problem

Iceberg v3 gives every row a stable _row_id and _last_updated_sequence_number, normally derived from the data file's first_row_id + row position. Rewrite reads every live row, sorts globally, and regroups — which destroys the positional ordering derivation relies on. Letting assignFirstRowIds hand the rewritten manifest a fresh id range would renumber every row, making lineage consumers see the whole table as deleted-and-reinserted.

The spec's rules for moved rows (format spec, Row Lineage): a row's existing non-null _row_id must be copied into the new data file, and — since rewrite doesn't modify rows — its existing non-null _last_updated_sequence_number must be copied too. Commit-time first_row_id assignment to a manifest "is optional if the writer knows that existing data files in the manifest have assigned first_row_id values."

What this PR does

All production changes are in src/write/rewrite.js; everything else already did its half (the read path surfaces lineage per row with stored-value-wins semantics, the parquet writer threads field ids and embeds iceberg.schema, and assignFirstRowIds skips manifests that already have a first_row_id).

Materialized lineage columns. For v3, the parquet writer gets an extended schema with _row_id / _last_updated_sequence_number (reserved field ids 2147483540 / 2147483539, per-spec). Values are copied through the sort untouched. The field ids land both in the parquet SchemaElements (external-reader interop) and in the embedded iceberg.schema kv metadata (icebird's own discovery path). Stats, partitioning, and the manifest's embedded schema stay user-only.
No id consumption for a pure rewrite. When every live row carries lineage, the new manifest's first_row_id is pinned to the minimum carried _row_id. Carried ids are distinct and below next-row-id, so min + row count ≤ next-row-id: assignFirstRowIds assigns nothing, the snapshot gets added-rows: 0, and next-row-id does not advance. Per-file first_row_id stays null; inherited values are never consulted because every stored _row_id is non-null.
Upgraded-table fallback. If any row reads with null lineage (a v2→v3 upgrade rewritten before any post-upgrade commit assigned ids), the manifest is left for spec-default commit-time assignment: stored ids still win on read, null rows get derived ids, and next-row-id advances by the manifest row count — matching the spec's worked example.
Conflict safety. buildSnapshotUpdate already emits assert-next-row-id for v3, so a rewrite staged against stale metadata fails on commit if a concurrent v3 append moved the id watermark (on top of the existing CAS on main).
The v1/other guard now mirrors prepareAppend (unsupported format-version).

Tests & verification

+8 tests (568 → 576), all passing; npm run lint and npm run build:types clean. v2 rewrite behavior unchanged.
The v3 gating test ("throws on v3") is replaced with real coverage: lineage preserved across a sorted rewrite (the sort scrambles positions, so values must come from the materialized columns — asserted), targetFileRows splits, deletion vectors consumed with survivors keeping their ids, appends after a rewrite continuing from the unchanged next-row-id with no collisions, rewrite-of-a-rewrite stability, the upgraded-table assignment fallback, and the assert-next-row-id requirement rejecting a moved watermark.
Manually verified the rewritten parquet files carry both lineage columns with the reserved field ids in the physical schema and kv metadata, and round-trip a scrambled v3 table with next-row-id unchanged.

Deliberate limitations

Lineage columns are always materialized on v3 rewrites, even in the fallback mode where they hold nulls — spec-equivalent to omitting them ("readers should treat both columns as if they exist and are set to null"), and required anyway for mixed tables where some rows carry ids.
Retry-on-conflict remains out of scope, as in Scan pruning and sort-on-write (#20, #21, #22) #23: a rewrite only rewrote the rows it read, so on conflict it throws rather than risk dropping concurrently-appended rows.

🤖 Generated with Claude Code

A v3 rewrite must not renumber existing rows: each surviving row's _row_id and _last_updated_sequence_number are now materialized as explicit columns (reserved field ids 2147483540 / 2147483539) in the rewritten parquet files, since the global sort destroys the positional ordering that derivation relies on. When every live row carries lineage, the new manifest's first_row_id is pinned to the minimum carried _row_id so assignFirstRowIds consumes no new ids and next-row-id does not advance. Tables upgraded from v2 whose rows were never assigned ids fall back to spec-default commit-time assignment. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

philcunliffe · 2026-06-11T17:35:26Z

Dual-agent review — `approve`

Verdict: approve
Risk class: low
Auto-merge advisory: 👎 thumbs down — minor findings present; human should confirm before merge

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source	Finding (severity, evidence)	Intersects
Claude	README still documents icebergRewrite as v2-only (minor, README.md:176, README.md:227)	No — doc-only; outside the impact surface

Codex review

Fix Validations

v3 rewrite previously rejected / row lineage not preserved

Status: correct
Evidence: src/write/rewrite.js:59, src/write/rewrite.js:117, src/write/rewrite.js:198, src/write/snapshot.js:72, test/write/rewrite.test.js:233
Assessment: The guard now permits format-version 3, v3 rewrites write reserved lineage columns, and pure rewrites pin the manifest so assignFirstRowIds assigns zero new rows. The tests cover sorted rewrites, split files, deletion vectors, rewrite-of-rewrite, upgraded-table fallback, and stale next-row-id rejection.

upgraded v2 to v3 rows without lineage get commit-time assignment

Status: correct
Evidence: src/write/rewrite.js:97, src/write/rewrite.js:198, src/write/snapshot.js:162, src/read.js:608, test/write/rewrite.test.js:339
Assessment: Mixed or null lineage leaves first_row_id unset on the new manifest, so the shared v3 assignment path assigns a fresh range and the read path derives values only where stored lineage is null.

Findings

No new issues identified.

No Finding

Behavioral Correctness
Contract & Interface Fidelity
Change Impact / Blast Radius
Concurrency, Ordering & State Safety
Error Handling & Resilience
Security Surface
Resource Lifecycle & Cleanup
Release Safety
Test Evidence Quality
Architectural Consistency
Debuggability & Operability

Evidence Bundle

Changed hot paths: icebergStageRewrite v2/v3 gate and lineage write schema at src/write/rewrite.js:59, src/write/rewrite.js:117; manifest row-id pinning at src/write/rewrite.js:198; shared v3 row-id assignment at src/write/snapshot.js:72.
Impacted callers: one-call rewrite API exercises staged rewrite via test/write/rewrite.test.js:386; commit applies staged requirements and snapshot updates at src/write/commit.js:41, src/write/commit.js:222.
Impacted tests: v3 preservation test/write/rewrite.test.js:233; split output test/write/rewrite.test.js:262; deletion vector rewrite test/write/rewrite.test.js:276; append after rewrite test/write/rewrite.test.js:306; fallback assignment test/write/rewrite.test.js:339; next-row-id conflict test/write/rewrite.test.js:365.
Unresolved uncertainty: I did not independently verify the Parquet physical schema writer beyond the diff’s use of writeSchema; the provided tests also do not assert physical field IDs directly.

Claude review

Claude review

README still documents icebergRewrite as v2-only

Severity: minor
Confidence: 85
Evidence: README.md:176, README.md:227
Why it matters: The PR's headline feature is v3 rewrite support, but the user-facing docs still say "via a replace snapshot (v2 tables)" and "icebergRewrite compacts to sorted, non-overlapping files (v2)" — the restriction was added in the same commit as the now-removed guard (3edb15b), and this PR lifts the guard and the test but leaves the README advertising the restriction it removed.
Suggested fix: Update README.md lines 176 and 227 to state v2 and v3 are supported, noting that v3 rewrites preserve row lineage.

Reports: .git/dual-review/pr-24

philcunliffe · 2026-06-11T17:38:14Z

🧭 Decision map — where to spend your attention

Companion to the dual-review verdict. This casts no verdict — it points at the 4 forks where the author made a real choice, so you can skim the rest.

Scanned: 10 hunks across 2 files. Most is mechanical: ~140 added test lines (the lineageById helper, formatVersion plumbing, seven new v3 test bodies) and JSDoc updates. The production change is ~50 lines, and nearly all its decision content is in four forks:

1. Pin the rewritten manifest's `first_row_id` to the minimum carried `_row_id` · magic value / contract

src/write/rewrite.js:204

  if (allLineage) {
    newManifest.first_row_id = minRowId
  }

Decision: a pure rewrite reuses the carried id range — pinning first_row_id to min(_row_id) makes assignFirstRowIds take its skip-if-assigned branch, so the snapshot gets added-rows: 0 and next-row-id does not advance.
Alternative not taken: leave first_row_id null and accept a fresh commit-time range — simpler and spec-default, but it renumbers every row (the exact bug this PR exists to fix).
Check: the safety argument is the comment's invariant — carried ids are distinct and below next-row-id, so min + row count ≤ next-row-id. That holds for any valid table but is trusted, not asserted: duplicate stored ids from a corrupt upstream writer would silently extend the pinned range past the watermark.

2. All-or-nothing preserve mode: one null-lineage row drops the whole rewrite to fallback · unhappy-path policy

src/write/rewrite.js:97

  const allLineage = rowLineage && liveRows.length > 0 &&
    liveRows.every(r => r._row_id != null && r._last_updated_sequence_number != null)

Decision: preserve mode requires every live row to carry both lineage values; otherwise the manifest is left for spec-default commit-time assignment (next-row-id advances by the full row count; carried rows survive only via stored-value-wins on read).
Alternative not taken: handle mixed tables in preserve mode — e.g. split carried vs null rows into separate manifests so only null rows consume fresh ids. Tighter id accounting, materially more complexity.
Check: the mixed case (v2→v3 upgrade + post-upgrade appends, then rewrite) is the one branch with no test — confirm the fallback semantics are what you intend there.

3. Lineage columns are always materialized on v3 — even all-null in fallback mode · contract / shape

src/write/rewrite.js:117

  const writeSchema = rowLineage
    ? { ...schema, fields: [ ...schema.fields, /* _row_id, _last_updated_sequence_number */ ] }
    : schema

Decision: keyed on formatVersion >= 3, not on allLineage — every v3 rewrite writes the two lineage columns, holding nulls in fallback mode.
Alternative not taken: extend the schema only in preserve mode — spec-equivalent (readers treat absent columns as null) and smaller files for upgraded tables.
Check: the PR body owns this as a deliberate limitation (uniformity + required for mixed tables); confirm the file-size/interop cost on upgraded tables is acceptable.

4. The extended schema reaches the parquet writer only · contract / shape

src/write/rewrite.js:136

      await writeParquet({ writer: dataWriter, schema: writeSchema, records: chunk, codec })
      const stats = computeColumnStats(chunk, schema)

Decision: lineage columns get parquet SchemaElement field ids and iceberg.schema kv metadata, but no column stats, no role in partitioning, and no presence in the manifest's embedded schema.
Alternative not taken: include them in stats/manifest schema so engines could prune scans on _row_id ranges — at the cost of leaking writer-internal columns into table metadata.
Check: the one-word schema vs writeSchema difference on adjacent lines is very easy to misread as a bug — it is deliberate; confirm no downstream consumer expects lineage bounds in manifests.

Honorable mentions (real but lower-stakes): src/write/rewrite.js:59 — v1 rejection message harmonized to unsupported format-version: N, dropping the old self-naming error string; src/write/rewrite.js:122 — reserved field ids 2147483540/2147483539 are spec-fixed and match src/read.js:533, but a swap would be silent corruption; src/write/rewrite.js:97 — liveRows.length > 0 routes a fully-deleted table to fallback rather than crashing, untested; test/write/rewrite.test.js:365 — the assert-next-row-id conflict is exercised via hand-mutated metadata rather than a real interleaved commit (the in-test comment owns this).

_{Generated by /decision-map. Advisory — directs attention, casts no verdict.}

philcunliffe and others added 2 commits June 10, 2026 15:17

Fix typecheck: narrow formatVersion to 2 | 3 in rewrite test helper

1553920

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

philcunliffe requested a review from platypii June 11, 2026 17:50

platypii approved these changes Jun 11, 2026

View reviewed changes

philcunliffe merged commit 43e0ece into master Jun 11, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support format-version 3 in icebergRewrite (preserve row lineage)#24

Support format-version 3 in icebergRewrite (preserve row lineage)#24
philcunliffe merged 2 commits into
masterfrom
v3-rewrite

philcunliffe commented Jun 10, 2026

Uh oh!

philcunliffe commented Jun 11, 2026

Fix Validations

v3 rewrite previously rejected / row lineage not preserved

upgraded v2 to v3 rows without lineage get commit-time assignment

Findings

No Finding

Evidence Bundle

Claude review

README still documents icebergRewrite as v2-only

Uh oh!

philcunliffe commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

philcunliffe commented Jun 10, 2026

The problem

What this PR does

Tests & verification

Deliberate limitations

Uh oh!

philcunliffe commented Jun 11, 2026

Dual-agent review — approve

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Fix Validations

v3 rewrite previously rejected / row lineage not preserved

upgraded v2 to v3 rows without lineage get commit-time assignment

Findings

No Finding

Evidence Bundle

Claude review

README still documents icebergRewrite as v2-only

Uh oh!

philcunliffe commented Jun 11, 2026

🧭 Decision map — where to spend your attention

1. Pin the rewritten manifest's first_row_id to the minimum carried _row_id · magic value / contract

2. All-or-nothing preserve mode: one null-lineage row drops the whole rewrite to fallback · unhappy-path policy

3. Lineage columns are always materialized on v3 — even all-null in fallback mode · contract / shape

4. The extended schema reaches the parquet writer only · contract / shape

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Dual-agent review — `approve`

1. Pin the rewritten manifest's `first_row_id` to the minimum carried `_row_id` · magic value / contract