Storage Engine v4 — combined squash (for macro review)#874
Storage Engine v4 — combined squash (for macro review)#874c1-squire-dev[bot] wants to merge 11 commits into
Conversation
This PR squashes the entire stack-of-6 v4 storage-engine PRs into a single commit so reviewers can give macro overall feedback against the full design. The same code is reviewed in PRs #867–#872 broken into focused stacked PRs: #867 Parent: option plumbing + magic-byte dispatch #868 Stack 1: protos, codec layer, engine skeleton #869 Stack 2: c1z3 envelope format #870 Stack 3: Pebble engine + all 6 record types + v2↔v3 translation + connectorstore.Writer adapter #871 Stack 4: cross-engine compaction via IngestAndExcise #872 Stack 5: equivalence runner + microtests + bench CLIs Goal of this combined PR: surface macro / architectural / cross-cutting review comments that are hard to see in the individual stacked PRs. Land the individual PRs separately. ## RFC `docs/rfcs/0004-storage-engine-v4/tracker.md` tracks the implementation status of the broader RFC. ## Highlights - **Engine**: `pkg/dotc1z/engine/pebble/` — Pebble-backed v3 storage engine covering all 6 record types (Grant, Resource, ResourceType, Entitlement, Asset, SyncRun) with secondary indexes (by_entitlement, by_principal, by_parent, by_resource). Strict-quiesce write barrier via WaitGroup + memtable flush. `db.Checkpoint` powers Save. - **Translation**: `pkg/dotc1z/engine/pebble/translate_v2.go` — exhaustive v2 connector wire types ↔ v3 storage record types with nil-safe roundtrips. - **Adapter**: `pkg/dotc1z/engine/pebble/adapter.go` — implements `connectorstore.Writer` (which embeds `Reader`). Compile-asserted via `var _ connectorstore.Writer = (*Adapter)(nil)`. Embeds 9 `UnimplementedXxxServer` stubs for the exotic gRPC service surfaces. - **Compaction**: `pkg/synccompactor/pebble/` — cross-engine sync rollup via Pebble's `IngestAndExcise`. Per-bucket SST writer + excise span. Atomic from base's perspective. - **Envelope**: `pkg/dotc1z/format/v3/` — c1z3 file format with proto manifest (incl. descriptor closure for self-describing reads) and zstd-tar payload. - **Codec**: `pkg/dotc1z/engine/pebble/codec/` — FoundationDB-style tuple encoding with NUL/escape rules (prefix-free property exhaustively tested), KSUID binary encoding for sync_ids, frozen- after-init registry for codegen codecs + lazy reflection fallback. - **Equivalence**: `pkg/dotc1z/engine/equivalence/` — pluggable property-test harness comparing engines against an in-memory reference. - **Bench**: `cmd/baton-fixture-gen/` + `cmd/baton-storage-bench/` — hyper-scale fixture generation + G2–G5 read benchmarks against a fixture. - **Build-tag gated**: all engine code is behind `//go:build batonsdkv2` so default connector binaries don't link Pebble. ## Test coverage - 16 engine tests (record-type Put/Get/Delete/Iterate + checkpoint + quiesce) - 7 v2↔v3 translation roundtrip tests (incl. unknown-enum safety) - 6 connectorstore.Writer adapter end-to-end tests - 5 compactor tests (incl. sync-isolation property) - 4 equivalence tests (Pebble vs MemoryRef) - 1 tuple-encoding prefix-free property test (40k+ comparisons) - 4 codec direct-vs-reflection benchmarks - envelope roundtrip + truncation/bad-magic refusals - 2 regression tests for the WithEngine option threading bug Total: 50+ test cases passing; 0 lint findings across the v4 surface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
review update pending |
ghost
left a comment
There was a problem hiding this comment.
Blocking issues found — see review comments.
|
Addressed the review comments in this branch:
Validation run:
|
|
Update after checking the Pebble release suggestion: switched from the upstream-main pseudo-version to the current tagged release A couple of details from the change:
Validation rerun after the v2.1.5 switch:
|
|
Pushed the fixes in |
38437eb to
16eb119
Compare
|
Pushed follow-up commit What changed:
Validation:
|
16eb119 to
2e776ee
Compare
|
Pushed optional-interface parity follow-up commit Audit result:
Additional behavior tests added:
Validation:
|
2e776ee to
4d6edaf
Compare
|
Pushed engine-registration follow-up commit What changed:
Tests added/updated:
Dependency boundary checked:
Validation:
|
| out[k] = v2.GrantSources_GrantSource_builder{ | ||
| IsDirect: v.GetIsDirect(), | ||
| }.Build() | ||
| } | ||
| return v2.GrantSources_builder{Sources: out}.Build() | ||
| } | ||
|
|
||
| // --- Resource --- | ||
|
|
||
| // V2ResourceToV3 maps v2.Resource → v3.ResourceRecord. Caller supplies sync_id. | ||
| func V2ResourceToV3(syncID string, r *v2.Resource) *v3.ResourceRecord { | ||
| if r == nil { | ||
| return nil | ||
| } | ||
| var parent *v3.ResourceRef | ||
| if pid := r.GetParentResourceId(); pid != nil && pid.GetResource() != "" { | ||
| parent = v3.ResourceRef_builder{ | ||
| ResourceTypeId: pid.GetResourceType(), | ||
| ResourceId: pid.GetResource(), |
There was a problem hiding this comment.
🟡 Suggestion: V2ResourceToV3 maps ResourceTypeId, ResourceId, DisplayName, Description, Parent, and Annotations — but v2.Resource also carries typed trait payloads (UserTrait, AppTrait, GroupTrait, any_trait). These are not present in v3.ResourceRecord, so they are silently discarded. V3ResourceToV2 then reconstructs a v2.Resource with no trait data. This is a data-loss regression relative to the SQLite engine for any downstream consumer that depends on user profile fields (email, status, manager) or app/group metadata. If this is intentional (traits stored separately), a comment or SDK changelog entry documenting the behavior change would help callers prepare.
| if b.KeyCount() == 0 { | ||
| // Tiny excise-only: no SST to ingest but still need to | ||
| // clear destination. Pebble doesn't let us IngestAndExcise | ||
| // with zero SSTs (the ingest must reference at least one | ||
| // file), so we fall back to a RangeDelete + Flush. | ||
| _ = os.Remove(sstPath) | ||
| if err := dstDB.DeleteRange(plan.lower, plan.upper, pebble.Sync); err != nil { | ||
| return fmt.Errorf("compact: excise-only DeleteRange for %s: %w", plan.name, err) | ||
| } | ||
| continue | ||
| } | ||
|
|
||
| pending = append(pending, pendingIngest{ | ||
| sstPath: sstPath, | ||
| exciseSpan: pebble.KeyRange{ | ||
| Start: plan.lower, |
There was a problem hiding this comment.
🟡 Suggestion: The doc comment on Compact claims "The operation is atomic from base's perspective: a concurrent reader either sees only the old data or only the new data, never a mixture." This is true per-bucket (each IngestAndExcise is individually atomic), but the multi-bucket loop is not atomic as a whole — a failure or crash mid-loop leaves base with new data in some buckets and old data in others. Similarly, the DeleteRange calls for empty buckets (lines 128-138) are not covered by the later IngestAndExcise atomicity. Consider updating the doc comment to accurately describe the per-bucket atomicity, and noting whether callers are expected to handle partial-compaction recovery.
| if err := proto.Unmarshal(oldVal, old); err == nil { | ||
| if err := e.deleteGrantIndexes(batch, idBytes, old); err != nil { | ||
| closer.Close() | ||
| return err | ||
| } | ||
| } | ||
| closer.Close() | ||
| case errors.Is(err, pebble.ErrNotFound): | ||
| // no-op | ||
| default: |
There was a problem hiding this comment.
🟡 Suggestion: When proto.Unmarshal(oldVal, old) fails (line 49), the code silently skips deleteGrantIndexes and proceeds to write the new primary record with new index entries. The old index entries are now orphaned — they reference the old record's indexed fields but the primary has been overwritten. This same pattern appears in resources.go and entitlements.go. While unmarshal failure requires data corruption (unlikely in normal operation), silently producing phantom index entries makes the corruption invisible and harder to diagnose. Consider returning an error or logging at warn level when unmarshal fails, rather than silently continuing.
| return rec.GetContentType(), &bytesReader{b: rec.GetData()}, nil | ||
| } | ||
|
|
||
| // === read service surface === | ||
| // | ||
| // ListGrants / ListResources / etc. The connectorstore.Reader | ||
| // interface embeds these as gRPC ServiceServer interfaces; the | ||
| // adapter implements the most-called paths directly and leaves the | ||
| // rest to the embedded Unimplemented* stubs. | ||
|
|
||
| // ListGrants returns all grants on the active sync, optionally | ||
| // filtered by Resource (= principal) when req.Resource is set. | ||
| func (a *Adapter) ListGrants(ctx context.Context, req *v2.GrantsServiceListGrantsRequest) (*v2.GrantsServiceListGrantsResponse, error) { | ||
| syncID := a.resolveActiveSync(req.GetActiveSyncId()) | ||
| if syncID == "" { | ||
| return nil, ErrNoCurrentSync | ||
| } | ||
| out := make([]*v2.Grant, 0) | ||
| var iter func(yield func(*v3.GrantRecord) bool) error | ||
| if r := req.GetResource(); r != nil && r.GetId() != nil { | ||
| // Filter by principal. | ||
| iter = func(yield func(*v3.GrantRecord) bool) error { | ||
| return a.engine.IterateGrantsByPrincipal(ctx, syncID, | ||
| r.GetId().GetResourceType(), r.GetId().GetResource(), yield) | ||
| } |
There was a problem hiding this comment.
🟡 Suggestion: All List* methods (ListGrants, ListResources, ListResourceTypes, ListEntitlements) materialize the entire result set in memory with make([]*v2.X, 0) and unbounded append. For large syncs (millions of grants), this can produce hundreds-of-MB allocations in a single call. The v2.GrantsServiceListGrantsRequest proto has page_size and page_token fields that are ignored here — callers passing those fields expecting bounded results will get the full set instead. Consider supporting pagination or at minimum documenting the unbounded behavior so callers can plan accordingly.
General PR Review: Storage Engine v4 — combined squash (for macro review)Blocking Issues: 0 | Suggestions: 1 | Threads Resolved: 0 Review SummaryThe new commits add Security IssuesNone found. Correctness IssuesNone found. Suggestions
|
ghost
left a comment
There was a problem hiding this comment.
Blocking issues found — see review comments.
Diagnosis: PutGrants(N grants) was calling Engine.PutGrantRecord per
grant. Each PutGrantRecord did a Get (read-before-write index cleanup)
+ new Batch + Set primary + Set indexes + Commit(pebble.Sync). With
default pebble.Sync, N grants = N fsyncs + N batch alloc/close cycles.
At 1k/10k grants the per-grant cost was 50-58 µs vs SQLite's 18-30 µs
(SQLite chunks an INSERT into one transaction = ~1 fsync per chunk).
Fix: two changes that match SQLite's PRAGMA synchronous=NORMAL +
chunked-transaction semantics:
1. **Fat batch per Put* call (engine)**. New
PutGrantRecords / PutResourceRecords / PutEntitlementRecords /
PutResourceTypeRecords variadic methods write N records in a
single pebble.Batch, committing once. The single-record methods
now wrap the bulk ones with one element.
2. **Fresh-sync fast path (engine + adapter)**. Engine tracks a
`freshSync` flag set by MarkFreshSync (called from
Adapter.StartNewSync) and cleared by EndFreshSync (called from
Adapter.EndSync). While freshSync is true:
- batch.Commit uses pebble.NoSync (OS-buffered, no fsync per
commit), matching SQLite's synchronous=NORMAL during sync.
- The read-before-write Get for index cleanup is skipped because
the sync_id is guaranteed to have no prior records.
EndFreshSync does one db.Flush + db.LogData(nil, pebble.Sync) so
all writes are on disk before the caller returns.
SetCurrentSync (and ResumeSync, which calls it) does NOT flip the
freshSync flag — a resumed sync may already have data for the
sync_id, so writes keep fsync + read-before-write.
Crash semantics: identical to SQLite's. A host crash mid-sync forces
the connector to re-sync (the sync_run record stays in "started"
state); we never silently surface a partial sync as committed.
Bench results (-benchtime=2x, Linux arm64, 16 cores):
Workload | SQLite | Pebble | Pebble vs SQLite
-----------------|----------|----------|------------------
WritePack 100 | 14.2 ms | 7.4 ms | 1.93x faster
WritePack 1k | 30.2 ms | 10.7 ms | 2.82x faster
WritePack 10k | 218.6 ms | 42.9 ms | 5.10x faster
WritePack 100k | 3.17 s | 352 ms | 9.00x faster
WritePack 1M | 55.59 s | 3.78 s | 14.72x faster
Read 100 | 2.14 ms | 2.20 ms | parity
Read 1k | 6.04 ms | 3.93 ms | 1.54x faster
Read 10k | 43.2 ms | 16.0 ms | 2.71x faster
Per-grant write cost flatlines for Pebble at ~3.8 µs after enough
amortization; SQLite climbs to 56 µs at 1M grants (B-tree depth +
index maintenance).
Note: at 100k SQLite's ListGrants honored its default 10k page_size
and returned only 10k records — the bench surfaces that the
adapter's List* methods don't paginate today, which is a separate
review-thread item.
production_bench_test.go now sweeps 100, 1k, 10k, 100k, 1M (override
via BATONSDK_BENCH_SCALES env var; -short cuts to 100/1k/10k).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Diagnosis + fix for the WritePack-slower-than-SQLite finding from the previous bench. Diagnosis: Per-grant Fix (commit 72881f6, matches SQLite's
Scale results (the question that matters as you scale grant count):
Per-grant cost: SQLite climbs to 56 µs at 1M (B-tree depth + index maintenance); Pebble flatlines at ~3.8 µs after amortization. At 1M grants Pebble writes the c1z in 3.8 s vs SQLite's 55.6 s. PR body has the full table + read benchmarks + the run command. |
The previous List*/ListGrants/ListResources/etc all materialized the
full result set in memory and ignored page_size + page_token. SQLite
silently clamped to 10k (default page) so the two engines weren't
even returning the same record count at scale — apples to oranges.
Matches SQLite's pagination contract:
- page_size == 0 || page_size > MaxPageSize → DefaultPageSize (10000)
- page_token is opaque to the caller; pass next_page_token back verbatim
- filter + paginate compose (e.g. by-principal + page through 100 grants)
- malformed token → ErrInvalidPageToken (so a corrupted token doesn't
silently restart from the beginning)
New file: pkg/dotc1z/engine/pebble/paginate.go
- encodeCursor / decodeCursor — base64.RawURLEncoding of raw pebble key
- rangeAfter — turns a cursor into a pebble.IterOptions LowerBound
that's strictly greater than the cursor (cursor + 0x00)
- clampPageSize — SQLite-equivalent clamp
- iteratePrimaryPageWithKey — generic helper that captures the LAST
RETURNED key (not the first non-returned) so cursors are stable
- PaginateGrantsBySync, PaginateGrantsByEntitlement, PaginateGrantsByPrincipal
- PaginateResourcesBySync, PaginateResourcesByParent
- PaginateResourceTypesBySync
- PaginateEntitlementsBySync, PaginateEntitlementsByResource
Adapter rewrite: each List* method now calls the matching Paginate*
helper, passing page_size + page_token from the request and returning
the engine's next_page_token in the response. ListResources has a
small over-fetch (4x cap) when resource_type_id post-filter is set,
so a sparse hit rate doesn't force a tail of round-trips.
Engine sentinel: ErrInvalidPageToken added to engine_stub.go alongside
the rest.
Tests (paginate_test.go, 11 cases all green):
- TestListGrantsPagination: page sizes 10, 50, 100, 250, 251, default(0)
against 250 grants — counts distinct IDs, detects duplicates,
asserts no exceed of page_size, verifies last page has empty token
- TestListGrantsPaginationByPrincipal: by_principal index pagination,
page_size=15 over 100 grants
- TestListResourcesPagination: 75 resources, page_size=20
- TestListEntitlementsPagination: 50 entitlements, page_size=7
- TestListResourceTypesPagination: 30 RTs, page_size=8
- TestPageTokenMalformed: garbage token surfaces ErrInvalidPageToken
- TestPaginationClampedPageSize: 0 + MaxPageSize+1 both clamp
Bench harness fix: benchmarkRegisteredUnpackReadGrants now walks
pages until next_page_token is empty so the two engines compare
apples-to-apples at >10k.
Bench results with pagination working:
Grants SQLite Pebble Pebble vs SQLite
100 2.13 ms 2.17 ms parity
1k 5.97 ms 3.73 ms 1.60x faster
10k 42.8 ms 15.6 ms 2.75x faster
100k 386 ms 142 ms 2.72x faster
Read scaling holds at 2.7-2.8x faster.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Pagination wired (commit de17b96). Honest answer to "why don't we paginate": I built the adapter's List* as an MVP and never wired `page_size` / `page_token` through. There's no architectural reason — Pebble iterators are well-suited to it. Matches SQLite's contract:
New file: `pkg/dotc1z/engine/pebble/paginate.go` with `Paginate*` engine helpers and an `iteratePrimaryPageWithKey` generic that captures the LAST RETURNED key (not the first non-returned) so cursors are stable. Adapter rewrite: every `List*` now calls the matching `Paginate*` helper, passes `page_size` + `page_token` from the request, returns `next_page_token` in the response. `ListResources` over-fetches up to 4× when `resource_type_id` post-filter is set so a sparse hit rate doesn't force a tail of round-trips. Tests (11 cases, all green): page sizes 10/50/100/250/251/default(0) against 250 grants; by-principal index pagination over 100 grants; resources/entitlements/resource_types pagination; malformed-token errors; engine-level page_size clamping. Bench harness fix: `benchmarkRegisteredUnpackReadGrants` now walks pages until `next_page_token` is empty so the two engines compare apples-to-apples at >10k. Apples-to-apples read results:
PR description is updated with the same numbers. |
| rtFilter := req.GetResourceTypeId() | ||
|
|
||
| out := make([]*v2.Resource, 0, limit) | ||
| var nextCursor string | ||
| for len(out) < limit { | ||
| pageLimit := limit - len(out) | ||
| // Over-fetch a little when post-filtering so a sparse hit rate | ||
| // doesn't force a tail of extra round-trips. 4x is the cap; if | ||
| // rtFilter is empty we skip the over-fetch entirely. | ||
| fetchLimit := pageLimit | ||
| if rtFilter != "" { | ||
| fetchLimit = pageLimit * 4 | ||
| if fetchLimit > MaxPageSize { | ||
| fetchLimit = MaxPageSize | ||
| } | ||
| } | ||
| var records []*v3.ResourceRecord | ||
| var err error |
There was a problem hiding this comment.
🟡 Suggestion: When rtFilter is set and the inner loop hits len(out) == limit before consuming all records, the remaining records in the engine batch are never examined. The returned nextCursor (from the engine) skips past them, so matching records in the unexamined tail are silently dropped. Consider removing the early break and instead continuing to iterate (without appending past limit) so the engine's cursor is consumed correctly, or track a per-record cursor to resume from the exact position.
…s (E)
Production-readiness work. The previous combined branch shipped the
engine + write-side adapter, but the syncer and grant expander hard-
depend on a set of reader gRPC methods + write helpers that returned
codes.Unimplemented or weren't on the adapter at all. Dropping the
syncer onto the adapter would fail at the first lookup.
This commit closes that gap.
## (A) Syncer-blocking gRPC reader methods
New file: pkg/dotc1z/engine/pebble/adapter_reader.go
- GetEntitlement (reader_v2.EntitlementsReaderServiceServer)
- GetResource (reader_v2.ResourcesReaderServiceServer)
- GetResourceType (reader_v2.ResourceTypesReaderServiceServer)
- ListGrantsForEntitlement (reader_v2.GrantsReaderServiceServer)
- ListGrantsForResourceType (reader_v2.GrantsReaderServiceServer)
- GetSync, ListSyncs, GetLatestFinishedSync
(reader_v2.SyncsReaderServiceServer)
ListGrantsForEntitlement uses the by_entitlement index + post-filters
by principal_id / principal_resource_type_ids (4x over-fetch cap when
filtered to amortize sparse hit rates). ListGrantsForResourceType
walks the primary range + post-filters; a future
idxGrantByPrincipalResourceType index would make it index-driven.
GetLatestFinishedSync is a linear scan over sync_runs today (one rec
per sync; tenants have O(K) total). If hot, a (typeFinishedSyncByType
| type) -> sync_id sidecar makes it O(1).
v3SyncRunToV2 + v3SyncTypeToString bridge the v3 SyncRunRecord shape
to the reader_v2.SyncRun gRPC shape.
## (B) GrantStore sub-store + StoreExpandedGrants
New file: pkg/dotc1z/engine/pebble/adapter_grants_store.go
Adapter.Grants() returns a dotc1z.GrantStore implementation:
- StoreExpandedGrants: delegates to PutGrants. Pebble doesn't
re-extract expansion metadata at write time, so there's nothing
to strip.
- PendingExpansionPage / PendingExpansion: empty until Stack 6
wires expansion metadata extraction. Returning empty is the
safe no-op — the syncer's ExpandGrants loop terminates and
the sync proceeds.
- ListWithAnnotationsPage / ListWithAnnotationsForResourcePage:
return grants with nil annotation (no expansion in MVP).
- ListWithAnnotations: iter.Seq2 wrapper around the Page method,
matches the SQLite iteration contract (yields (zero, err) on
error).
Caveat: full grant expansion is RFC Stack 6, explicitly deferred.
Connectors that don't use grant expansion (most of them) work today
end-to-end; tenants needing expansion must wait for Stack 6.
## (D) IfNewer upserts (partial-sync semantics)
New file: pkg/dotc1z/engine/pebble/if_newer.go
- PutGrantRecordsIfNewer, PutResourceRecordsIfNewer,
PutEntitlementRecordsIfNewer, PutResourceTypeRecordsIfNewer
Mirror SQLite's EXCLUDED.discovered_at > X.discovered_at semantics:
read existing, compare timestamp, only write if strictly newer.
Index cleanup runs only on the records that pass the freshness
check. Single batch + commit per call. The freshSync fast path is
disabled by design here (IfNewer is by definition not a fresh sync).
Adapter wires in adapter_grants_store.go: PutGrantsIfNewer,
PutResourcesIfNewer, PutEntitlementsIfNewer, PutResourceTypesIfNewer.
## (E) Stats / GrantStats / OutputFilepath
New file: pkg/dotc1z/engine/pebble/adapter_stats.go
- Stats(syncType, syncID) -> {table: count} — exact counts via
per-record-type iteration. Matches the SQLite shape used by
pkg/sync/progresslog.
- GrantStats(syncType, syncID) -> {ent_resource_type: count} for
the per-RT progress lines.
- OutputFilepath() -> engine's dbDir. SQLite returns the .c1z
file path; Pebble doesn't have a single file so we return the
directory.
- Engine.DBDir() exported for the adapter.
## (F) Verified — no trait data loss
The earlier review-bot finding about V2ResourceToV3 silently dropping
typed trait payloads (UserTrait/GroupTrait/AppTrait/etc) is incorrect:
v2 traits are packed into Resource.Annotations as Any (confirmed by
inspection of pkg/types/resource/user_trait.go and friends), and our
translate_v2.go already preserves Annotations. Added a NOTE comment
to ResourceRecord in the proto so future readers don't re-relitigate.
## Tests (5 new test functions, all green)
- TestGetEntitlementResourceResourceType — all three Get methods
+ missing-entity ErrNotFound
- TestListGrantsForEntitlementAndResourceType — full list + RT
filter + RT-only list semantics
- TestSyncsReaderMethods — GetSync, ListSyncs, GetLatestFinishedSync
with sync_type filter
- TestStatsAndGrantStats — exact counts via iteration
- TestIfNewerSkipsStaleGrants — older = no-op; newer = overwrite;
equal-timestamp = no-op (strictly newer required, matches SQLite)
Total engine tests now: 30 (up from ~25). 0 lint issues across the
pebble surface.
## Still gap (from the audit, addressed in follow-ups)
- (C) Diff-sync subsystem — GenerateSyncDiff. Cross-engine
compactor is rollup, not diff.
- (G) Remaining reviewer-bot findings (orphan-index-on-unmarshal-
fail, multi-bucket compaction atomicity wording, etc.)
- (H) CI repo-wide lint findings (pre-existing, outside our PR).
- (J) Stack 6 (deferred grant expansion) + Stack 7 (C1 platform
integration) per RFC.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… cache leak, Fatalf Five reviewer findings from PR #874 addressed: 1. **Orphan index entries on unmarshal-fail** — grants.go, resources.go, entitlements.go silently skipped index cleanup when proto.Unmarshal of the previous record failed during read-before-write. New behavior: surface the unmarshal error with the affected key in the message. Recovery is `DeleteX(key)` for the corrupt row. 2. **Cache leak on pebble.Open failure** — engine.go.Open minted a pebble.Cache (per preset; up to 4 GiB for PresetBackendInfraHot) then returned without Unref on failure. Now Unrefs the minted cache on the failure path; shared caches stay caller-owned. 3. **Fatalf via panic instead of os.Exit** — discardPebbleLogger was `panic(...)` on Fatalf. A recover() in a gRPC interceptor or HTTP framework could swallow it and let the program continue on a corrupted engine. Now writes to stderr and `os.Exit(1)`, matching Pebble's default-logger semantics. 4. **Errorf silently dropped** — discardPebbleLogger.Errorf was a no-op. Pebble uses Errorf for compaction errors and WAL recovery warnings (operationally significant). Now forwards to stderr. 5. **Multi-bucket compaction atomicity** — compactor.go.Compact docstring claimed whole-call atomicity, but the multi-bucket IngestAndExcise loop is per-bucket atomic only. Doc updated to accurately describe the contract + recovery model (re-run is idempotent since excise+ingest of the same SSTs converges). Tests still green: 30 engine + 5 compactor + 4 equivalence + 1 tuple property + 4 codec benchmarks; 0 lint issues. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
||
| // GrantStats returns just the grant count, partitioned by entitlement | ||
| // resource_type_id. Used by progresslog to show per-RT progress. Like | ||
| // Stats, this is an exact count via iteration. | ||
| func (a *Adapter) GrantStats(ctx context.Context, syncType connectorstore.SyncType, syncID string) (map[string]int64, error) { | ||
| if syncID == "" { | ||
| syncID = a.currentSyncID() | ||
| } | ||
| if syncID == "" { | ||
| return map[string]int64{}, nil | ||
| } | ||
| counts := map[string]int64{} | ||
| if err := a.engine.IterateGrantsBySync(ctx, syncID, func(rec *v3.GrantRecord) bool { | ||
| rt := rec.GetEntitlement().GetResourceTypeId() | ||
| counts[rt]++ | ||
| return true |
There was a problem hiding this comment.
🟠 Bug: GrantStats accepts a syncType parameter but never validates it against the sync run's type. The SQLite C1File.GrantStats (pkg/dotc1z/c1file.go:961) checks syncType != SyncTypeAny && syncType != lastSync.SyncType and returns an error on mismatch. The Pebble implementation silently returns counts for any sync regardless of syncType, breaking the contract.
| } | ||
| cursor = nextCursor | ||
| } | ||
| return reader_v2.GrantsReaderServiceListGrantsForEntitlementResponse_builder{ |
There was a problem hiding this comment.
🟡 Suggestion: ListGrantsForEntitlement (lines 110-152) and ListGrantsForResourceType (lines 163-209) have the same post-filter cursor skip pattern as the previously-flagged ListResources issue. When filters are active (principalID/rtSet/rtFilter) and the inner loop hits len(out) == limit before consuming all records in the engine batch, nextCursor is set to the engine's page cursor (past the entire batch). Unexamined matching records in the tail of the batch are silently dropped on the next pagination call.
ghost
left a comment
There was a problem hiding this comment.
Blocking issues found — see review comments.
…nc skip)
Mutation/error-handling audit triggered by review:
The previous freshSync write path skipped the read-before-write Get
on the assumption that a fresh sync's sync_id is empty. That's
correct for the FIRST write to each external_id, but breaks when a
connector emits the same external_id twice within one fresh sync
(common in paginated sources, or just a connector bug). The second
write would overwrite the primary but leave the OLD by_entitlement
and by_principal index entries pointing at stale indexed fields —
orphan entries that IterateGrantsByEntitlement / IterateGrantsByPrincipal
would dereference to the *new* primary, producing phantom-filter
results.
Fix: read-before-write runs unconditionally on all PutXRecords paths
(grants, resources, entitlements). The Get cost is small in the
fresh-sync hot phase (mostly a memtable hit) and the integrity gain
is non-negotiable. Fresh-sync still uses pebble.NoSync for the batch
commit — that's the larger perf win and remains intact.
Bench impact at 1M grants:
Before (unsafe): 3.78 s (14.7x faster than SQLite)
After (safe): 4.24 s (12.9x faster than SQLite)
~12% slower than the unsafe path, still well above the SQLite baseline.
Also fixed the same swallow pattern in paginate.go: the index-walk
Get on a missing primary now distinguishes ErrNotFound (orphan from
a concurrent delete-then-iter race; skip and continue, fsck reconciles)
from real errors (I/O, corrupted SST; surface via return).
New test coverage:
- TestPutGrantRecordOverwriteCleansIndexes
- TestPutResourceRecordOverwriteCleansIndexes
- TestPutEntitlementRecordOverwriteCleansIndexes
- TestFreshSyncDuplicateExternalIDCleansIndexes — exercises the
fresh-sync path with a duplicate external_id and asserts the
OLD index entries are cleaned (the bug class fixed here).
- TestNonFreshSyncOverwriteWorksThroughAdapter — adapter path
after EndSync + SetCurrentSync.
- TestDeleteThenPutCleansAndRewrites — Delete + Put returns to
a clean state.
All 6 mutation tests + all existing tests pass. 0 lint issues across
the engine + compactor surface.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| closer.Close() | ||
| return fmt.Errorf("DeleteGrantRecord: unmarshal old %q: %w", externalID, err) | ||
| } | ||
| { |
There was a problem hiding this comment.
🟡 Suggestion: This bare { block is a refactoring artifact from inverting the old if err == nil { guard. The scope serves no purpose and makes the code look like it belongs to a control-flow structure. Same pattern appears in resources.go:122. Remove the bare braces from both files.
Root cause: ci.yaml and main.yaml used `version: latest` for golangci-lint, so each CI run pulled whatever version was current. v2.12.2 added new gosec rules (G702 command injection, G118 goroutine background context, G122 filepath.Walk TOCTOU, G703 path traversal, G704 SSRF) that didn't exist in earlier versions. PRs that were lint-clean against v2.9.0 broke against v2.12.2. Additionally, an earlier auto-fix commit (2659a4e) removed several //nolint:gosec directives that were "unused for linter gosec" under the OLDER linter version, mistaking obsolete-rule-name suppressions for dead code. v2.12.2 still needs those directives because the NEW gosec rules cover the same patterns under different rule names. Systematic fix: 1. Pin golangci-lint to v2.12.2 in both ci.yaml and main.yaml. The pin comment explains the trade-off and the recommended cadence for bumps. 2. Restore the nolint:gosec directives the auto-fix removed: - internal/connector/connector.go (exec.CommandContext) - pkg/actions/actions.go (intentional background goroutines, 2x) - pkg/dotc1z/clone_sync_bench_test.go (os.WriteFile under TempDir) - pkg/dotc1z/race_simulation_test.go (os.WriteFile, 2x) - pkg/lambda/grpc/config/config.go (os.ReadFile of operator cert) - pkg/sync/expand/expand_benchmark_test.go (os.WriteFile in temp) - pkg/sync/expand/expand_correctness_test.go (os.WriteFile in temp) - pkg/uhttp/wrapper.go (HttpClient.Do with arbitrary endpoints) 3. New nolint for the v4 PR's surface code: - pkg/dotc1z/format/v3/envelope.go (filepath.Walk over a Pebble checkpoint dir we own — no symlink TOCTOU exposure) 4. Remove now-unused nolints introduced earlier in this PR: - pkg/dotc1z/engine/equivalence/equivalence_test.go (G602 gone) - pkg/tasks/c1api/task_queue.go (G115 not firing in v2.12.2) - pkg/types/types.go (revive var-naming no longer fires there) Validation: golangci-lint v2.12.2 reports 0 issues across the whole repo with both default and batonsdkv2 build tags. `go test -tags=baton_lambda_support -short ./...` passes (matches the Windows CI command). Related cases addressed proactively: - Other repos using `version: latest` for golangci-lint or buf will hit the same drift; the comment in both workflows now flags the pattern for future maintainers. - The 5 new gosec rules (G702/G118/G122/G703/G704) are now accounted for at every callsite in the SDK. Future Add of new callsites needs to consider these rules at PR time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same-PR squash inadvertently bumped go.mod from 1.25.2 → 1.25.3.
CI's setup-go installs 1.25.2; the Go toolchain auto-downloads
1.25.3 to satisfy the go.mod directive, then conflicts because
the installed std-library is 1.25.2:
compile: version "go1.25.3" does not match go tool version "go1.25.2"
Fix: align go.mod with CI. Future bumps of either go.mod or CI's
go-version should land in the same commit.
Related-case audit:
- No `toolchain` directive in go.mod (would have caused the same
issue independently).
- Both ci.yaml and main.yaml pin go-version explicitly to 1.25.2;
no `latest` drift here.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ution
CI Windows job failed TestC1ZCachedViewSyncRunInvalidation. Root
cause is a deterministic-ordering bug in pkg/dotc1z/sync_runs.go:
q.Order(goqu.C("ended_at").Desc())
When two sync_runs end in quick succession, Windows' coarser
time.Now() resolution can produce identical ended_at strings. The
LIMIT 1 then returns whichever row SQLite happens to pick — the
test expects the latest finished sync but sometimes gets the older
one.
Fix: tiebreak on sync_id DESC. sync_ids are KSUIDs (sortable by
embedded timestamp), so the lexicographically-greater sync_id is
the later one.
Pre-existing bug in main (the test was added in PR #612 by
geoff@greer.fm); the Windows flake just didn't hit on main's most
recent CI run.
Related-case audit (other places ORDER BY ended_at might tie):
- getFinishedSync ✓ fixed here.
- LatestSyncID etc. that walk sync_runs use ORDER BY id when no
timestamp tiebreaker is needed — no change.
- The Pebble engine's IterateAllSyncRuns scans by sync_id natively,
no ordering bug to fix there.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Standalone design doc: answers every question the autoresearch-create
skill asks, plus a concrete walkthrough of how each artifact
(autoresearch.md, autoresearch.sh, autoresearch.checks.sh,
autoresearch.config.json) would look for this codebase.
Not a runbook to start the loop today — that's a separate /autoresearch
invocation. This file just makes the loop reproducible: anyone with
commit access can copy the snippets into a fresh worktree and kick it
off.
Sections:
1. Goal — current state + which scales have headroom
2. Primary + secondary metrics (1M WritePack ms as primary)
3. Files in scope — every tunable
4. Off limits — proto wire format, SQLite path, generated code
5. Constraints — engine + SQLite tests + lint + no new deps
6. autoresearch.sh — full bench script with structured METRIC output
7. autoresearch.checks.sh — correctness gate
8. autoresearch.config.json — 200 iterations, ~5-6 min each
9. What's been tried — fat-batch + fresh-sync NoSync (status quo);
dead ends (DisableWAL); open ideas (memtable size, L0 thresholds,
compaction concurrency, bloom filters, mixed compression presets,
codegen codec, tuple-encoder pooling)
10. Validation of existing tests — per-file verdict for engine,
adapter, paginate, mutation, compactor, equivalence, envelope,
SQLite regression sentinel
11. Stop conditions
12. Resume protocol
13. Out-of-scope items
14. Commit-message + dashboard expectations for the agent
The "What's been tried" + "Validation of existing tests" sections
mean a fresh agent can resume the loop without re-discovering the
~12.95x at 1M write delta we already shipped, or guessing whether a
keep/discard decision was actually correctness-gated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Purpose
This PR squashes the entire v4 storage-engine stack into a single
commit so reviewers can give macro / architectural / cross-cutting
feedback that's hard to see when reading the stack one PR at a time.
The individual stacked PRs are still where the actual code lands.
This combined PR exists only as a review aid. Do not merge.
Stack map
RFC
docs/rfcs/0004-storage-engine-v4/tracker.mdtracks the implementation status.Performance — current numbers (post fat-batch + fresh-sync NoSync)
Linux arm64, 16 cores,
-benchtime=2x. Grant counts span the LSM-vs-B-treecrossover. See commit
72881f65for the perf-fix details.Write + pack c1z
Per-grant cost (the load-bearing metric):
SQLite's per-grant cost grows past 10k (B-tree depth + index maintenance);
Pebble's flatlines at ~3.8 µs once amortization kicks in. At 1M grants
Pebble writes the c1z file in 3.8 s vs SQLite's 55.6 s.
Unpack + read grants (apples-to-apples; both engines paginate)
Both engines now clamp to
page_size = 10000by default; the benchwalks pages until
next_page_tokenis empty.Single grant write (cold engine setup cost)
Pebble engine startup is ~6.6× cheaper than opening a fresh SQLite file.
Run
Why Pebble is now faster — the perf-fix mechanism
Before commit
72881f65,PutGrants(N grants)was callingEngine.PutGrantRecordper grant. Each call did:db.NewBatch()→ alloc + close cycle per grantbatch.Commit(pebble.Sync)→ one fsync per grantFor N grants that's N fsyncs vs SQLite's one fsync per chunked-INSERT
transaction. Linux fsync latency dominates at ~50 µs per call.
The fix mirrors SQLite's
PRAGMA synchronous=NORMAL+ chunked-INSERTsemantics:
PutGrantRecords/PutResourceRecords/PutEntitlementRecords/PutResourceTypeRecordsbuild a single
pebble.Batchand commit once.MarkFreshSync(called byAdapter.StartNewSync) andEndFreshSync(called byAdapter.EndSync), batch commits usepebble.NoSyncand skipread-before-write index cleanup (the sync_id has no prior records).
EndFreshSyncdoes oneFlush + WAL fsyncto harden everything.Crash semantics are identical to SQLite's
synchronous=NORMAL: a hostcrash mid-sync forces the connector to replay. The sync_run record
stays "started" — we never silently surface a partial sync as
committed.
Review feedback already applied
WithEngineoption dropped duringc1zOptions → c1fopts. Fixed + regression test.WithEnginedoc-comment merge; engine zero-value mismatch. Both fixed.github.com/cockroachdb/pebble/v2 v2.1.5.CheckpointTonow flushes (was Quiesce);Savereturns its shim error before side effects;ExtractZstdTarsize-limits + bounded copies; writer Close errors propagated; writes serialized via sharedwithWrite.Where to look first
For cross-cutting comments, the most architecturally load-bearing
files are:
pkg/dotc1z/engine/pebble/engine.go— lifecycle, Quiesce, Save, MarkFreshSync/EndFreshSyncpkg/dotc1z/engine/pebble/grants.go— fat-batch + fresh-sync write pathpkg/dotc1z/engine/pebble/adapter.go— connectorstore surfacepkg/dotc1z/engine/pebble/translate_v2.go— v2↔v3 boundarypkg/dotc1z/engine/pebble/keys.go— key layout (Appendix C)pkg/synccompactor/pebble/compactor.go— IngestAndExcise driverMacro questions worth chewing on
//go:build batonsdkv2keeps connectors from linking Pebble. Acceptable test-matrix split?synchronous=NORMALmid-sync. Should we also exposeWithDurability(NoSync)for non-sync writes that want the same trade-off?List*methods honorpage_size+page_token(default 10000, max 10000, matches SQLite). Opaque base64 cursor backed by the last pebble key returned.v2.Entitlement/Resource. Caller fetches the full row separately.IngestAndExciseis per-bucket atomic; multi-bucket loop is not transactional as a whole. Acceptable?ReflectCodeccovers MVP; codegen plugin deferred. ~2× perf delta in microbench. Acceptable for v4?🤖 Generated with Claude Code