docs(rfc): RFC-013 step 3b — capture-once WriteTxn (unify the write opener)#295
docs(rfc): RFC-013 step 3b — capture-once WriteTxn (unify the write opener)#295aaltshuler wants to merge 1 commit into
Conversation
…pener) Implementation-grade spec for RFC-013 step 3b, the opener-unification half of the WriteTxn (parent §4.1/§4.2/§9-3b). Step 3a (PR #288) made the write open O(1) but left writes on a bespoke bare-URL-latest opener that diverges from the read path (pinned version + shared Session + handle cache). 3b collapses the two into one primitive, so the read/write open-equivalence question is answered by construction. Key design points captured: - Open at the pinned base version (consistent SI read base), with the publish CAS as the staleness guard — strict-op conflict detection moves from open-time fail-fast to commit-time CAS (same ExpectedVersionMismatch error; more correct read base for Update/Delete). Flagged for review ratification. - Validate-once (removes the flat-46 schema-read constant); intra-txn handle reuse (write-local, commit-invalidated — no stale cross-write cache). - Scope fenced: PublishPlan/Phase-7/epoch/internal-compaction are steps 4/5/2, explicitly out. - Test plan closes the two gaps the step-3a edge-case audit found: a differential equivalence guard (test-only namespace vs WriteTxn::open) and a branch-checkout-at-advanced-head strict-write test. Linked from the dev index under Active Implementation Plans. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7
|
|
||
| ## 4. Cost contract (what `write_cost.rs` asserts after 3b) | ||
|
|
||
| Extend the existing local gate (`tests/write_cost.rs`, on the shared | ||
| `helpers::cost` harness): | ||
|
|
||
| - **opens ≤ |touched tables|, and 0 on a re-open within the same txn** — a | ||
| multi-statement mutation that touches table T twice opens T **once**. New |
There was a problem hiding this comment.
Differential equivalence test is mis-specified and won't close audit gap 1
The test-only StagedTableNamespace (retained from step 3a) implements the old open-at-HEAD behavior. WriteTxn::open under 3b opens at the pinned base version. When the branch HEAD has advanced past the base, the two paths diverge by design — the test would fail even for a correct 3b implementation. In the trivial case where HEAD == base (no concurrent writers), the test passes, but it proves nothing specific to 3b's pinning: HEAD == base is true for reads too, and doesn't validate that WriteTxn::open correctly returns the base snapshot when HEAD has moved.
To actually close audit gap 1, the test should: (1) open a graph, (2) advance the branch HEAD externally via a direct Lance write, then (3) assert that WriteTxn::open still returns the pinned base version, not HEAD. Comparing against StagedTableNamespace as a reference only works if it is also updated to open by pinned version — but the spec doesn't say it is, and if it is updated it's no longer a meaningful differential comparator.
This affects the audit-gap closure claim in §0 and the PR description.
| with table count × stage count. | ||
| - **The flat-46 schema-read constant** (§0/§6 of the parent): schema validation | ||
| re-reads/re-checks the catalog on each resolve instead of once per operation. | ||
|
|
||
| Neither scales with history (3a fixed that), but both scale with **work per | ||
| operation** (tables × stages), and both are *gone* once the snapshot and handle | ||
| are captured once and threaded. | ||
|
|
||
| ## 2. Design | ||
|
|
||
| ### 2.1 `WriteTxn` |
There was a problem hiding this comment.
WriteHandleCache::get_or_open with &self requires interior mutability — threading implications unspecified
The WriteTxn::open method signature takes &self but get_or_open must mutate the handle map. In async Rust, across await points in concurrent stage tasks (which all hold &WriteTxn simultaneously), this requires WriteHandleCache to be Send + Sync with interior mutability — concretely Mutex<HashMap<TableKey, Dataset>>. A RefCell-based implementation would compile but panic at runtime as soon as two stage futures poll concurrently. The spec doesn't address this.
Either change the open method to &mut self (simpler, no interior mutability, but concurrent parallel stages can't then share the same WriteTxn) or explicitly require WriteHandleCache: Send + Sync and specify the locking strategy. The deny-list item "In-process-only Dataset impls — Send + Sync" is directly relevant here.
| // identical to SubTableEntry::open (db/manifest.rs:200) — the O(1), | ||
| // session-warm, version-pinned opener reads already use. | ||
| instrumentation::open_table_dataset( | ||
| &e.location, Some(e.version), &self.session, | ||
| ) | ||
| }) | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| The result: **reads and writes share one open primitive.** The "are the two | ||
| openers equivalent?" question (the audit gap, §6) is answered *by construction* — | ||
| there is only one. | ||
|
|
||
| ### 2.3 The load-bearing decision: open at **pinned version**, not HEAD | ||
|
|
||
| This is the one behavioral change, and it must be deliberate. | ||
|
|
||
| Today the strict-write path opens **HEAD** and fails fast via | ||
| `ensure_expected_version(head == pinned)` (`table_store.rs:259` [S]). 3b opens at | ||
| the **pinned base version**, so the open-time check is trivially true and is | ||
| **replaced by the publish CAS** (`expected_table_versions`, already present). | ||
|
|
||
| Why this is correct — and *more* correct for strict ops: |
There was a problem hiding this comment.
Retry contract for CAS-conflicted strict ops is unspecified
§2.3 moves strict-op conflict detection from open-time fail-fast to commit-time CAS. The spec describes that a stale strict write produces ExpectedVersionMismatch at commit, but never specifies what the caller should do next: re-open a fresh WriteTxn at the new HEAD, or refresh in-place? The handles cache is "commit-invalidated," which implies the txn must be discarded after any commit — successful or conflicted — but this is not stated.
Without this guidance, an implementer could retry by re-staging against the same stale WriteTxn (same pinned base, same cached handles), reproducing the lost-update scenario §4.2 explicitly warns about. The existing PUBLISHER_RETRY_BUDGET = 5 loop in the publisher retries the CAS today, but under 3b the retry loop would also need to refresh the base snapshot. This retry shape should be specified alongside the §2.3 timing-change decision.
Summary
Implementation-grade spec for RFC-013 step 3b — the opener-unification half of the
WriteTxn(parent RFC §4.1/§4.2/§9-3b). Docs-only (Proposed).Step 3a (PR #288) made the write open O(1) but left writes on a bespoke bare-URL-latest opener that diverges from the read path (pinned version + shared
Session+ handle cache). 3b collapses the two openers into one primitive — so the read/write open-equivalence question (raised in the #288 post-merge review) is answered by construction: there's only one opener.Key design decisions
ExpectedVersionMismatcherror, more correct read base for Update/Delete — avoids the lost-update §4.2 warns about). Flagged explicitly for review ratification.PublishPlan/Phase-7-lineage/epoch/internal-compaction are steps 4/5/2, explicitly out. 3b is a refactor under the existing publish machinery; no module retired (smaller blast radius than 3a).Closes the two audit gaps
The test plan (§5) bakes in the two holes a step-3a edge-case audit surfaced:
#[cfg(test)]namespace path vsWriteTxn::open, assert identical version/branch/schema (converts "the suite passes" into a pinned contract).Plus an invariants check (2/3/5/15 hold or strengthen) and reversibility notes.
Standalone and shippable — unblocked by steps 2/4/5.
🤖 Generated with Claude Code
Greptile Summary
This docs-only RFC specifies the implementation plan for RFC-013 step 3b: collapsing the bespoke write-path opener onto the same pinned-version + shared-
Sessionprimitive the read path already uses, yieldingWriteTxn— a capture-once struct that threads a pinnedPinnedSnapshot, sharedSession, andWriteHandleCachethrough all write stages.WriteTxndesign (§2): opens each touched table once by pinned version (from_uri(location).with_version(base.version).with_session(session)), eliminates per-stage re-opens and the flat-46 schema-read constant, and makes re-resolution of "latest" unrepresentable by type.ExpectedVersionMismatch). Explicitly flagged for review ratification; §7 lists it as a risk with a reversibility path.WriteTxn::openat the existing snapshot capture seam;open_dataset_head_for_writecan be retired; non-hot-path direct openers stay unchanged.Confidence Score: 4/5
Docs-only change — no code ships yet. The index update is a clean one-liner. The RFC is well-structured and largely sound, with one test-plan design issue that would result in false assurance if implemented as written.
The RFC's design rationale, invariant analysis, and migration path are thorough and consistent with the parent RFC-013. The main concern is §5 test #1: the differential equivalence guard compares against StagedTableNamespace (old open-at-HEAD behavior), which diverges from WriteTxn::open whenever HEAD > base — the exact case 3b handles — providing false assurance rather than closing audit gap 1. Two additional omissions (WriteHandleCache threading and CAS-conflict retry contract) affect implementation fidelity but don't block the spec at Proposed status.
docs/dev/rfc-013-step-3b-writetxn.md — specifically §5 test #1, the WriteHandleCache API surface in §2.1, and the retry guidance in §2.3
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant WE as Write Entry participant WT as WriteTxn participant WHC as WriteHandleCache participant LS as Lance Dataset participant PUB as ManifestBatchPublisher WE->>WT: WriteTxn::open(branch, snapshot) Note over WT: base = PinnedSnapshot, schema validated once loop each touched table WT->>WHC: get_or_open(table_key) WHC->>LS: from_uri(loc).with_version(N).with_session(s) LS-->>WHC: Dataset handle O(1) WHC-->>WT: cached handle end WE->>WT: stage via WriteTxn handles WT-->>WE: staged WE->>PUB: publish CAS alt CAS wins PUB-->>WE: PublishedSnapshot else CAS loses PUB-->>WE: ExpectedVersionMismatch Note over WE: Must open fresh WriteTxn to retry end%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant WE as Write Entry participant WT as WriteTxn participant WHC as WriteHandleCache participant LS as Lance Dataset participant PUB as ManifestBatchPublisher WE->>WT: WriteTxn::open(branch, snapshot) Note over WT: base = PinnedSnapshot, schema validated once loop each touched table WT->>WHC: get_or_open(table_key) WHC->>LS: from_uri(loc).with_version(N).with_session(s) LS-->>WHC: Dataset handle O(1) WHC-->>WT: cached handle end WE->>WT: stage via WriteTxn handles WT-->>WE: staged WE->>PUB: publish CAS alt CAS wins PUB-->>WE: PublishedSnapshot else CAS loses PUB-->>WE: ExpectedVersionMismatch Note over WE: Must open fresh WriteTxn to retry endReviews (1): Last reviewed commit: "docs(rfc): RFC-013 step 3b — capture-onc..." | Re-trigger Greptile