From bc60194f0762a7ed775b342e2dfd97b8b0b23db8 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 15:24:17 +0000 Subject: [PATCH 01/15] autoresearch: scaffold Pebble read-perf loop (UnpackReadGrants) Branched from autoresearch/pebble-perf-20260525 (WritePack session, -69.7% cumulative). That session's artifacts archived under docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/. New primary metric: pebble_readpaginated_1m_ms (1M paginated read wallclock). Sentinels: pebble_writepack_1m_ms (must not regress; preserves the WritePack session's wins) and sqlite_readpaginated_1k_ms (SQLite engine regression guard). --- autoresearch.ideas.md | 130 +++++---- autoresearch.md | 276 +++++++++--------- autoresearch.sh | 145 ++++----- .../writepack/autoresearch.ideas.md | 70 +++++ .../writepack/autoresearch.jsonl | 0 .../writepack/autoresearch.md | 178 +++++++++++ 6 files changed, 514 insertions(+), 285 deletions(-) create mode 100644 docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.ideas.md rename autoresearch.jsonl => docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.jsonl (100%) create mode 100644 docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.md diff --git a/autoresearch.ideas.md b/autoresearch.ideas.md index 85a13fe83..c4c1071b8 100644 --- a/autoresearch.ideas.md +++ b/autoresearch.ideas.md @@ -1,70 +1,82 @@ -# Ideas backlog — Pebble engine perf +# Ideas backlog — Pebble engine read perf Free-form scratch. Append new ideas as bullets; mark tried ones with status (kept / discarded / crashed) so we don't repeat them. -## To try (priority order) +**Required reading from the prior session**: +`docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.ideas.md` +contains the do-not-retry catalogue from the WritePack session. 
Several +of those closed axes likely apply here too: -- [ ] **P1.1** Memtable size 64 MiB → 256 MiB (`MemTableSize` in `options.go`). -- [ ] **P1.2** `L0CompactionThreshold` sweep: 2 → 4, 8. -- [ ] **P1.3** `MaxConcurrentCompactions` upper bound: 8 → 12 (gate on GOMAXPROCS). -- [ ] **P1.4** Enable bloom filters on L0 (FilterPolicy + FilterType). -- [ ] **P1.5** Mixed compression: Snappy at L0, zstd at L6. -- [ ] **P2.6** Per-record-type per-level options (grants vs resources). -- [ ] **P2.7** Codec codegen via `cmd/protoc-gen-batonstore` — replaces reflection path. Big change; may need human approval. -- [ ] **P3.8** Pool tuple encoder buffer (`AppendTupleString`) — kill per-record slice alloc. -- [ ] **P3.9** Larger SST block size (32 KiB → 64 KiB) — amortize header overhead. +- Parallel-large-alloc across goroutines (heap arena serializes — verified 3×). +- Tournament tree / prefix-skip wrappers around `bytes.Compare` + (Go SIMD'd `cmpbody` is faster than the wrapper for k=4 or short skip). +- Naive parallel-then-serial pipelines that don't actually overlap + (e.g. read-all-then-write-all). +- Touching durability semantics for marginal gains. -## Tried — see jsonl for verdicts +## Read-path-specific ideas (priority order, profile-confirm before pursuing) -(populated by the loop) +### P1 — likely big wins (untried, large surface) + +- **`ExtractZstdTar` parallelism** — single-threaded zstd decode + + tar walk + per-file `os.OpenFile` + `io.Copy` dominates per-iter + cost at large scales. For 1 M-grant `.c1z` of ~500 MB the extraction + is most of the wallclock. Possible: + - `zstd.WithDecoderConcurrency(0)` (untried for reads; tested under + writes #9/#35/#46 and was flat there because the OUTER zstd was + already barely needed over pre-Snappy SST data — for READS we're + decoding the OUTER zstd of fresh tar contents, different problem). 
+ - Parallel tar-entry writes to destination dir (workers consume from + the tar stream's decoded byte ranges, each writes one file). + - Skip zstd entirely via streaming decompression that avoids the + intermediate file write — extract into memory + open Pebble against + an in-memory FS. Pebble supports `vfs.MemFS` via the `FS` option. +- **In-memory Pebble FS for reads** — skip the extract-to-tmpdir step + entirely. Decompress the c1z payload straight into a `vfs.MemFS` and + point Pebble at it. Saves the entire tar-extraction wallclock AND + the subsequent Pebble file-open syscalls (memory-backed FS is much + faster). Big change but potentially huge win. + +### P1 — likely big wins (untried, smaller surface) + +- **`V3GrantToV2` arena** — analogous to `V2GrantToV3` from the + WritePack session (#41 there). Each `ListGrants` page hydrates + `len(page)` v2.Grants. For 100 pages × 10 k grants each = 1 M + allocations of `v2.Grant + Entitlement_stub + Resource_stub + ResourceId`, + which is several per grant. Arena to collapse them. +- **Pebble block cache warming** — the 256 MiB block cache starts cold + on each `NewStore`. If the data fits in cache, fully warmed reads are + much faster than cold reads. Warm via a deliberate prefetch read at + Open time, or trade some setup cost for amortized win. -## Follow-up / human review +### P2 — moderate -- Split-batch in PutGrantRecords (commit 63c0869b) breaks cross-batch atomicity: - if priBatch commits but idxBatch fails, primary records exist without - by_entitlement / by_principal index entries. Fresh-sync replays the - whole sync from the connector so it's OK there, but incremental Put - paths (mid-sync upserts) might leak. RFC stack-6 grant expansion path - could be a concrete victim. Consider: - - Apply split only when IsFreshSync() is true; keep one-batch atomic - semantics outside fresh-sync. - - Or: document the contract change. 
+- **Pagination cursor decoding** — `paginate.go` decodes the + `PageToken` on every page boundary. 100 pages for 1 M grants. Cost + per decode probably tiny but compounds. +- **`IterateGrantsBySync` allocations** — per-grant `proto.Unmarshal` + into a fresh `v3.GrantRecord` for every iteration step. Could pool + these (arena-style) but it's a streaming iterator API. +- **`pebble.IterOptions`** — currently we set `LowerBound`/`UpperBound` + for the grant primary keyspace. Could enable `KeyTypePoint` only (no + range keys for this iteration). Already implicit. -## Closed axes (do NOT retry — multiple attempts confirm dead) +### P3 — speculative -- **Parallel engine.Close + WriteEnvelope** (tried at #19, #28, #45 — three baselines). - Mechanism is theoretically safe (CheckpointTo creates self-contained dir), but - goroutine + channel coordination overhead exceeds the engine.Close wallclock - savings (~30-50 ms). At smaller scales the overhead dominates and regresses - 10-15%. Not a clean win at any size. -- **Parallelize large heap allocations across goroutines** (#47 priBatch/idxBatch, - #48 priBatch sub-shards). Three different attempts. Go's heap allocator - serializes large (>32 KB) allocations through the central heap-arena mutex; - OS mmap underneath has kernel-level locks. Concurrent 150 MB-class allocs - from N goroutines queue serially, plus goroutine scheduling adds overhead - proportional to N. Stick to single-goroutine allocation for the big buffers. -- **FlushSplitBytes axis** (tried 2 MiB → 16 MiB at #21, #31; 2 MiB → 64 MiB at #37). - Pebble doesn't honor very large hints, or bigger SSTs lose write parallelism. - All flat-to-mildly-negative across multiple baselines. -- **Tournament tree / prefix-skip merge optimizations** (#39, #40). The naive - 4-way bytes.Compare scan is already optimally branch-predictable and SIMD-tight; - wrapping with anything in Go costs more than it saves at k=4. 
-- **Parallel reads for WriteEnvelope** (#43 bulk-pre-read; #46 streaming with bounded - lookahead). Two different failure modes: #43 didn't actually overlap reads with - writes (3 serial phases); #46 did overlap but per-file os.ReadFile allocated - ~530 MB of one-shot buffers vs io.Copy's reused 32 KB buffer. Pebble checkpoint - files are page-cache-hot anyway — io.Copy pulls them at memory speed, so serial - reading is already efficient. Closed axis. -- **Background WAL fsync** (WALBytesPerSync=4MiB, #38). On this hardware fsync - isn't a meaningful bottleneck; spreading it via background syncs doesn't help. -- **MemTableSize > 64 MiB** (#1 256 MiB, #16 128 MiB). Larger memtable lets entire - 100k workload fit in memory → no during-write flushes → forced serial flush at - EndSync. 100k workload regresses ~30%. -- **L0CompactionThreshold ≠ 8** axis fully mapped (2/4/6/16). 8 is the knee. -- **CompactionConcurrencyRange** (#7). With L0=8 compactor isn't the bottleneck. -- **DisableAutomaticCompactions** (#20). With L0=8 it's already idle. -- **proto.MarshalAppend with SetDeferred + cached size** (#23). proto.Size - double-traversal eats the memcpy savings. -- **appendEscaped bytes.IndexByte fast path** (#22). Tuple encoder is on the - smaller goroutine; max(A,B) wallclock means optimizing B doesn't help when B5% worse) +Optimize the Pebble engine's **paginated read** path: +`store.ListGrants` iterated over an entire sync's grants in 10 k-grant +pages. The bench (`BenchmarkRegisteredPebbleUnpackReadGrants`) measures +end-to-end per-iteration wallclock for: + +1. `dotc1z.NewStore(ctx, path, WithReadOnly(true))` — opens the `.c1z` + file, unpacks the zstd-tar payload into a tmp directory, + `pebble.Open`s the Pebble engine pointing at that directory. +2. `store.SetCurrentSync(ctx, syncID)`. +3. Walk grants via `ListGrants` in pages of 10 000 until exhausted. +4. `store.Close(ctx)`. + +Current baseline at 100 k grants is **~129 ms** (3.0× faster than +SQLite). 
At 1 M grants it's likely ~1.5 s; we'll measure as part of the +baseline. + +## Primary metric + +**`pebble_readpaginated_1m_ms`** — wallclock for the 1 M-grant paginated +read benchmark. Lower is better. + +Picked 1 M (not 100 k) because: + +- LSM-vs-B-tree differences scale with N. Optimizing 1 M-read tends to + generalise downward; optimizing 100 k can overfit to small-workload + noise. +- The 1 M write-side bench was where the WritePack session found its + big wins; same likely true for reads. + +Bench script will run the 100, 1 k, 10 k, 100 k, 1 M scales so we can +sanity-check that improvements at 1 M aren't regressions elsewhere. + +## Secondary metrics + +| Metric | Direction | Notes | +|---|---|---| +| `pebble_readpaginated_100k_ms` | lower | secondary scale to confirm | +| `pebble_readpaginated_10k_ms` | lower | secondary scale | +| `pebble_readpaginated_1k_ms` | lower | secondary scale | +| `pebble_readpaginated_100_ms` | lower | secondary scale | +| `pebble_readpaginated_1m_bytes_op` | lower | memory pressure | +| `pebble_readpaginated_1m_allocs_op` | lower | GC pressure | +| `sqlite_readpaginated_1k_ms` | unchanged | regression sentinel; SQLite engine must not slow down while tuning Pebble | +| `pebble_writepack_1m_ms` | unchanged | write-side regression sentinel; the WritePack session's wins must hold | ## How to Run -`./autoresearch.sh` — outputs `METRIC name=value` lines plus diagnostic stdout. ~5–6 min/iteration with the default scales (100..1M). +`./autoresearch.sh` — outputs `METRIC name=value` lines + diagnostic bench +output. Runtime ~5–6 min per iteration (read sweep + write sentinel + +sqlite sentinel). 
+ +Fast iteration mode (skips 1 M, keeps the smaller scales): -Fast iteration mode (~1.5 min, skips 1M): ``` -BATONSDK_BENCH_SCALES="100,1000,10000,100000" ./autoresearch.sh +BATONSDK_READ_SCALES="100,1000,10000,100000" ./autoresearch.sh ``` -The full 1M run is required before a final `keep` — restore the default scales before the confirmation run. -`./autoresearch.checks.sh` is invoked automatically after each successful bench by the harness. It runs engine + SQLite + compactor + equivalence tests, lints, and asserts no `go.mod`/`go.sum`/proto drift. +The full 1 M run is required before a final `keep` confirmation — +revert the env override. + +`./autoresearch.checks.sh` runs after every passing benchmark by the +harness. Same correctness gate as the WritePack session: engine + +adapter + compactor + equivalence + envelope + SQLite tests, golangci +lint, and no `go.mod`/`go.sum`/proto drift. ## Files in Scope -- `pkg/dotc1z/engine/pebble/options.go` — Pebble.Options (memtable, L0, cache, block size, compression, bloom) -- `pkg/dotc1z/engine/pebble/engine.go` — lifecycle, Quiesce, Save, fresh-sync hooks -- `pkg/dotc1z/engine/pebble/grants.go` — `PutGrantRecord(s)` batch shape & durability -- `pkg/dotc1z/engine/pebble/resources.go`, `entitlements.go`, `resource_types.go` — same pattern, write paths -- `pkg/dotc1z/engine/pebble/keys.go` — key encoding (smaller keys → less memory) -- `pkg/dotc1z/engine/pebble/paginate.go` — pagination cursor + range iteration (read path) -- `pkg/dotc1z/engine/pebble/codec/tuple.go` — tuple encoder hot path -- `pkg/dotc1z/engine/pebble/codec/reflect.go` — reflection codec (codegen candidate) -- `pkg/dotc1z/engine/pebble/codec/syncid.go` — KSUID encoding -- `pkg/synccompactor/pebble/compactor.go` — IngestAndExcise driver -- `pkg/synccompactor/pebble/bucket_plans.go` — bucket layout +Read-path files most likely to benefit: + +- `pkg/dotc1z/engine/pebble/register.go` — `OpenStore` / `unpackExisting` + (the bulk of the per-iter wallclock 
at large scales is here) +- `pkg/dotc1z/format/v3/envelope.go` — `ReadEnvelope` + `ExtractZstdTar` + (decode + extract the payload) +- `pkg/dotc1z/engine/pebble/paginate.go` — pagination cursor decode + + range iteration +- `pkg/dotc1z/engine/pebble/grants.go` — `ListGrants` adapter + + `IterateGrantsBySync` +- `pkg/dotc1z/engine/pebble/adapter_reader.go` — adapter surface for + reads +- `pkg/dotc1z/engine/pebble/adapter.go` — `ListGrants` / `SetCurrentSync` +- `pkg/dotc1z/engine/pebble/translate_v2.go` — `V3GrantToV2` + (hot per-record on the read path) +- `pkg/dotc1z/engine/pebble/options.go` — Pebble.Options affecting + reads (cache, bloom filters, block size on read amp) +- `pkg/dotc1z/engine/pebble/codec/*.go` — tuple encode/decode + KSUID + syncid (used per grant for index iteration) ## Off Limits -- `pb/c1/storage/v3/*` — wire format frozen for v4 (generated + proto) +Same as the WritePack session: + +- `pb/c1/storage/v3/*` — wire format frozen for v4 - `pb/c1/reader/v2/*`, `pb/c1/connector/v2/*` — external surface - `proto/c1/storage/v3/` — proto IDL frozen -- `pkg/dotc1z/c1file.go`, `pkg/dotc1z/grants.go`, `pkg/dotc1z/resources.go`, … — SQLite engine path (regression sentinel only) +- SQLite engine path (`pkg/dotc1z/c1file.go` etc.) — regression sentinel only - `docs/rfcs/`, `.github/workflows/`, `.golangci.yml` - `go.mod`, `go.sum`, `vendor/` — no new dependencies -- `cmd/protoc-gen-batonstore` codegen — if reached, escalate to human ## Constraints (enforced by `autoresearch.checks.sh`) -1. Engine + compactor + equivalence + envelope tests pass: `go test -tags=batonsdkv2 -count=1 ./pkg/dotc1z/engine/pebble/... ./pkg/dotc1z/engine/equivalence/... ./pkg/synccompactor/pebble/... ./pkg/dotc1z/format/v3/...` -2. SQLite engine tests pass: `go test -tags=baton_lambda_support -short -count=1 ./pkg/dotc1z/` -3. Lint clean: `golangci-lint run --timeout=3m --build-tags=batonsdkv2` over engine + compactor. -4. `go.mod` / `go.sum` unmodified (no new deps). +1. 
Engine + adapter + compactor + equivalence + envelope tests pass. +2. SQLite engine tests pass (regression guard). +3. Lint clean. +4. `go.mod` / `go.sum` unmodified. 5. `proto/c1/storage/v3/` unmodified. -6. WritePack bench asserts paginated total — corruption fails naturally. -7. `sqlite_writepack_1k_ms` may not regress by more than 5% (manual check via secondary metric; large regressions → discard). - -## Priority Ideas (from RFC §9) - -### Priority 1 -1. Larger memtable (64 → 256 MiB) — absorbs more write burst before L0 flush. -2. `L0CompactionThreshold` sweep (currently 2 → try 4, 8) — write throughput vs read amp tradeoff. -3. `MaxConcurrentCompactions` upper bound — capped at 8; try 12 on big-core hosts. -4. Bloom filters on L0 — read-side win, currently disabled. -5. Mixed compression: Snappy at L0, zstd at L6 — cuts compaction CPU. - -### Priority 2 -6. Per-record-type compaction tuning (grants vs resources differ). -7. Codec codegen replacing `codec/reflect.go` — ~5× microbench, 5–10% e2e estimated. - -### Cleanup -8. Pool tuple encoder buffer (`AppendTupleString` per-record alloc). -9. Larger block size to amortize header overhead. - -## Known Dead Ends (do not retry) - -- Per-record `db.Set` instead of batched `pebble.Batch` — slower. -- `DisableWAL: true` — saves <5%, loses durability across Open/Close. -- Shared block cache across engines — wins in C1 prod, no-op in bench. - -## What's Been Tried - -Maintained in `autoresearch.ideas.md` and the `autoresearch.jsonl` log. Resuming agents: read both before mutating code. - -Already-applied wins (the status quo baseline): -- Fat-batch `PutGrantRecords` (one batch per N grants) — ~10× at 1M. -- Fresh-sync `pebble.NoSync`, single Flush+LogData(Sync) at EndFreshSync. -- Read-before-write index cleanup unconditional (12% cost, correctness-critical). -- 256 MiB block cache. -- `CompactionConcurrencyRange` capped at `(2, min(8, GOMAXPROCS/4))`. 
- -### Wins kept by this loop (cumulative -56.1% from 4292 → 1884 ms at 1M) - -In order applied (compounds multiplicatively): - -1. **`L0CompactionThreshold` 2 → 8** (-15.8%). The default 2 over-eagerly compacted during the 1M write burst, stealing CPU from writers. 8 lets ~8 L0 sub-levels accumulate before compaction kicks in; L0StopWritesThreshold=20 still bounds the worst case. Knee mapped: 2/4/6 worse; 16 flat vs 8. -2. **Scratch byte buffers + `proto.MarshalAppend`** (-5.6%). Reused `keyBuf` / `idx1Buf` / `idx2Buf` / `valBuf` across the loop; Pebble's `batch.Set` is documented as safe to modify args after return (it copies into batch buffer). Added `appendGrantKey` / `appendGrantBy*IndexKey` variants taking `dst []byte`. Allocs 9.0M→4.0M. -3. **Hoisted `resolveSyncBytes`** (-4.9%). Cache last-resolved (string, bytes) pair across loop iterations; falls back to per-record resolve when string differs. Adapter typically stamps a uniform sync_id, so the cache hits 1M times. -4. **Split `priBatch` / `idxBatch`** (-12.8%). Primary writes (sorted by external_id by construction) go to one batch; index writes (unsorted) go to another. pdqsort early-exits the priBatch's flushable-batch promotion sort; only the idxBatch pays full O(N log N) on 2/3 the entries. Cross-batch atomicity is fine for fresh-sync (replays from connector). -5. **`NewBatchWithSize(len*600)` / `(len*140)`** (-6.1%). Pre-size the batches so they don't grow-by-2x internally; saves ~10 reallocations and up to 2x peak slack. bytes_op -23%. -6. **Skip read-before-write Get for fresh-sync first call** (-14.5%). New engine flag `freshGrantsEmpty` is true between `MarkFreshSync` and the first `PutGrantRecords` commit. While true, the 1M `e.db.Get` calls are skipped — they'd all return ErrNotFound anyway (db.Get doesn't see in-batch writes; the keyspace is empty). Across-call dup detection preserved by clearing the flag after first commit. -7. **Parallel-build the two batches for batches ≥ 256** (-8.8%). 
When skipGet is true, the two batches have no shared mutable state. Two goroutines build them concurrently; each has its own scratch buffers and sync_id cache. Threshold of 256 records avoids goroutine setup overhead on tiny calls (solo write regression bounded to +11%). -8. **4-way shard the priBatch build** (-7.9%). proto.Marshal of 1M GrantRecords is the parallel-path long pole on goroutine A; sharding it across 4 worker goroutines cuts that wallclock ~4×, at the cost of 4 batch.Apply memcpy concatenations (~50 ms total). Shard count caps at min(4, len/1024) so small batches bypass the parallelism overhead. 8-way sharding saturated (no further benefit). - -### Major dead ends (do NOT retry) - -- **MemTableSize >64 MiB at any size** (-1% primary, +30%+ at 100k). Larger memtable lets the entire 100k workload fit in memory → no during-write flushes → forced end-of-sync serial flush. The 64→256 MiB attempt regressed 100k by 32%; 128 MiB by 34%. Memtable should be sized so the workload triggers ≥3 flushes during writes. -- **Chunking PutGrantRecords commits** (+83% at 1M). Splitting one big batch into 16Ki-grant chunks force memtable rotation per chunk → many L0 files → compaction storm. The single-big-batch path takes Pebble's optimized flushable-batch promotion (sort once, swap in as memtable atomically). -- **Bloom filters on all levels** (+2.9%). Fresh-sync workloads have unique external_ids; the Get-before-Put population is 100% misses, and the bench's read path is range iteration not point Gets. Filters add construction CPU during flushes with no payoff. They MIGHT help in C1 prod where ReaderCache does point Gets across syncs, but that's not measured here. -- **`zstd.SpeedDefault` → `SpeedFastest` + `WithEncoderConcurrency(0)`** (flat). The c1z pack tar wraps Pebble SSTs that are already Snappy-compressed internally; outer zstd is nearly incompressible regardless of level. -- **`L0CompactionThreshold` ≠ 8** — axis fully mapped. 
2/4/6 worse, 16 flat with all other wins. -- **`CompactionConcurrencyRange` (2, GOMAXPROCS/2 capped 8)** (flat). With L0=8 the compactor isn't the bottleneck; adding lanes makes no difference. -- **`LBaseMaxBytes` 256 → 512 MiB** (-1.6% within noise). L1 consolidation doesn't matter at our workload size. -- **`FlushSplitBytes` 2 → 16 MiB** (-1.1% within noise). Per-SST overhead is small. -- **`DisableAutomaticCompactions: true`** (-1% within noise). With L0=8 already limiting compaction-during-writes, disabling shifts work later but saves no wallclock. -- **`SetDeferred` for primary key+value** (+2.3%). `proto.Size` traversal cost exceeds the `batch.Set` memcpy savings; no net win. -- **`appendEscaped` bytes.IndexByte fast path** (+1.7% within noise). The tuple encoder lives on the smaller goroutine (idxBatch); parallel wallclock = max(A,B), so optimizing B doesn't reduce max when B2× improvement → confirm with a clean rerun, commit, continue with next idea. -- All priority-1 ideas exhausted → move to priority-2, then write a summary. +- 3 consecutive `checks_failed` from the same root cause → thrashing. +- >2× improvement → confirm with a clean rerun + summary. +- All priority-1 ideas exhausted → move to priority-2 or finalize. ## Resume Protocol -1. Read this file + RFC `autoresearch-pebble-perf.md`. -2. Read `autoresearch.ideas.md` (prune stale entries). +1. Read this file + the archived WritePack `autoresearch.md`. +2. Read `autoresearch.ideas.md` (this session) + the archived + `autoresearch.ideas.md` (do-not-retry catalogue from WritePack). 3. Skim `autoresearch.jsonl` for recent `keep`/`discard` patterns. -4. Run `./autoresearch.sh` once to re-establish baseline on the current commit. +4. Run `./autoresearch.sh` once to re-establish a baseline on the + current commit. 5. Pick the highest-priority untried idea. 
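The two sentinel rows in the secondary-metrics table (`sqlite_readpaginated_1k_ms` and `pebble_writepack_1m_ms`, both marked `unchanged`) can be checked mechanically against a saved baseline run. The following is a minimal sketch and is not part of this patch: the helper names `metric_value` and `regressed` are hypothetical, and the percentage threshold is an assumption the caller supplies. Because `autoresearch.sh` reports a missing or failed bench row as a zero-valued metric, the sketch treats any non-positive value as a failure rather than as an improvement.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the patch) for comparing a sentinel
# metric between two captured autoresearch.sh outputs.

# metric_value <file> <name>
# Prints the value of the first "METRIC name=value" line in <file>.
metric_value() {
  awk -v n="$2" '
    $1 == "METRIC" {
      split($2, kv, "=")
      if (kv[1] == n) { print kv[2]; exit }
    }' "$1"
}

# regressed <baseline> <current> <max_pct>
# Exit status 0 when the sentinel should be flagged: either value is
# missing or non-positive, or <current> exceeds <baseline> by more than
# <max_pct> percent. Exit status 1 otherwise.
regressed() {
  awk -v b="$1" -v c="$2" -v m="$3" 'BEGIN {
    if (b <= 0 || c <= 0) exit 0          # missing/failed bench row
    exit !((c - b) / b * 100 > m)         # 0 = regressed, 1 = within budget
  }'
}

# Usage sketch (file names and the 5 percent budget are placeholders):
#   ./autoresearch.sh > current.txt
#   b=$(metric_value baseline.txt pebble_writepack_1m_ms)
#   c=$(metric_value current.txt  pebble_writepack_1m_ms)
#   regressed "$b" "$c" 5 && echo "sentinel regressed or missing"
```

Flagging zero-valued metrics is a deliberate choice in this sketch: a crashed sentinel bench would otherwise look like a 100% speedup.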
diff --git a/autoresearch.sh b/autoresearch.sh index f2a43decf..d7b33f066 100755 --- a/autoresearch.sh +++ b/autoresearch.sh @@ -1,120 +1,101 @@ #!/usr/bin/env bash -# autoresearch.sh — Pebble engine perf benchmark driver. -# Emits METRIC name=value lines for the autoresearch loop, plus diagnostic -# bench output for the agent to localize regressions. Always exits 0; -# missing/failed bench rows surface as zero-valued metrics. +# autoresearch.sh — Pebble engine READ perf benchmark driver. +# Primary metric: pebble_readpaginated_1m_ms. +# Sentinels: pebble_writepack_1m_ms (no write regression), +# sqlite_readpaginated_1k_ms (no SQLite regression). set -uo pipefail export GOCACHE="${GOCACHE:-$HOME/.cache/go-build}" export CGO_ENABLED=0 BENCH_DIR="./pkg/dotc1z/engine/pebble" -CODEC_DIR="./pkg/dotc1z/engine/pebble/microtests" -COMMON_FLAGS=(-tags=batonsdkv2 -run='^$' -benchmem -benchtime=2x -timeout=20m) +COMMON_FLAGS=(-tags=batonsdkv2 -run='^$' -benchmem -benchtime=2x -timeout=30m) -# Allow caller to override scales for fast iteration mode. -WRITE_SCALES="${BATONSDK_BENCH_SCALES:-100,1000,10000,100000,1000000}" -READ_SCALES="${BATONSDK_READ_SCALES:-100,1000,10000,100000}" +# Read sweep scales. 1M is the primary target. Override via env for +# fast inner-loop iteration on a single idea (default-include the +# small scales so each kept change has cross-scale directional +# confirmation). +READ_SCALES="${BATONSDK_READ_SCALES:-100,1000,10000,100000,1000000}" -OUT_WRITE=$(mktemp -t ar-write.XXXXXX) -OUT_READ=$(mktemp -t ar-read.XXXXXX) -OUT_CODEC=$(mktemp -t ar-codec.XXXXXX) -OUT_SOLO=$(mktemp -t ar-solo.XXXXXX) +# Write sentinel: just the 1M WritePack scale to verify the WritePack +# session's wins haven't regressed. +WRITE_SENTINEL_SCALES="${BATONSDK_WRITE_SENTINEL_SCALES:-1000000}" -# 1) Pebble + SQLite WritePack sweep. 
-BATONSDK_BENCH_SCALES="$WRITE_SCALES" \ - go test "${COMMON_FLAGS[@]}" \ - -bench 'BenchmarkRegistered(Pebble|SQLite)WritePack$' \ - "$BENCH_DIR" >"$OUT_WRITE" 2>&1 || true +# SQLite read sentinel: 1k scale (cheap, catches if Pebble changes +# leak into SQLite somehow). +SQLITE_SENTINEL_SCALES="${BATONSDK_SQLITE_SENTINEL_SCALES:-1000}" + +OUT_PREAD=$(mktemp -t ar-read.XXXXXX) +OUT_SREAD=$(mktemp -t ar-sread.XXXXXX) +OUT_WSENT=$(mktemp -t ar-wsent.XXXXXX) -# 2) Pebble paginated read sweep. +# 1) Pebble paginated read sweep — primary metric here. BATONSDK_BENCH_SCALES="$READ_SCALES" \ go test "${COMMON_FLAGS[@]}" \ -bench 'BenchmarkRegisteredPebbleUnpackReadGrants$' \ - "$BENCH_DIR" >"$OUT_READ" 2>&1 || true + "$BENCH_DIR" >"$OUT_PREAD" 2>&1 || true -# 3) Codec hot-path microbench. -go test "${COMMON_FLAGS[@]}" \ - -bench='BenchmarkCodec(Direct|Reflect)$' \ - "$CODEC_DIR" >"$OUT_CODEC" 2>&1 || true +# 2) SQLite paginated read regression sentinel. +BATONSDK_BENCH_SCALES="$SQLITE_SENTINEL_SCALES" \ + go test "${COMMON_FLAGS[@]}" \ + -bench 'BenchmarkRegisteredSQLiteUnpackReadGrants$' \ + "$BENCH_DIR" >"$OUT_SREAD" 2>&1 || true -# 4) Solo write (cold-start cost). -go test "${COMMON_FLAGS[@]}" \ - -bench='BenchmarkRegisteredPebbleWriteGrant$' \ - "$BENCH_DIR" >"$OUT_SOLO" 2>&1 || true +# 3) WritePack regression sentinel — keep the WritePack session's win. +BATONSDK_BENCH_SCALES="$WRITE_SENTINEL_SCALES" \ + go test "${COMMON_FLAGS[@]}" \ + -bench 'BenchmarkRegisteredPebbleWritePack$' \ + "$BENCH_DIR" >"$OUT_WSENT" 2>&1 || true -# bench_value -# `go test -benchmem` rows look like: -# BenchmarkX-8 2 12345 ns/op 678 B/op 9 allocs/op +# bench_value # col 1=name 2=iters 3=ns/op 5=B/op 7=allocs/op bench_value() { local file=$1 prefix=$2 col=$3 - awk -v p="^${2}$" -v c="$col" '$1 ~ p { print $c; exit }' "$file" + awk -v p="^${prefix}$" -v c="$col" '$1 ~ p { print $c; exit }' "$file" } - -# Some scales the bench may skip if not configured. Default to 0 if empty. 
nz() { [ -n "${1:-}" ] && echo "$1" || echo 0; } +ns_to_ms() { + awk -v v="$1" 'BEGIN { if (v=="" || v==0) { print 0; exit } printf "%.3f", v / 1000000 }' +} -# --- Pebble WritePack at each scale --- -pwrite_100=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredPebbleWritePack/grants=100-[0-9]+' 3)") -pwrite_1k=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredPebbleWritePack/grants=1000-[0-9]+' 3)") -pwrite_10k=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredPebbleWritePack/grants=10000-[0-9]+' 3)") -pwrite_100k=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredPebbleWritePack/grants=100000-[0-9]+' 3)") -pwrite_1m=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredPebbleWritePack/grants=1000000-[0-9]+' 3)") -pwrite_1m_bytes=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredPebbleWritePack/grants=1000000-[0-9]+' 5)") -pwrite_1m_allocs=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredPebbleWritePack/grants=1000000-[0-9]+' 7)") - -swrite_1k=$(nz "$(bench_value "$OUT_WRITE" 'BenchmarkRegisteredSQLiteWritePack/grants=1000-[0-9]+' 3)") - -pread_1k=$(nz "$(bench_value "$OUT_READ" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=1000-[0-9]+' 3)") -pread_100k=$(nz "$(bench_value "$OUT_READ" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=100000-[0-9]+' 3)") - -codec_direct=$(nz "$(bench_value "$OUT_CODEC" 'BenchmarkCodecDirect-[0-9]+' 3)") -codec_reflect=$(nz "$(bench_value "$OUT_CODEC" 'BenchmarkCodecReflect-[0-9]+' 3)") - -solo_write=$(nz "$(bench_value "$OUT_SOLO" 'BenchmarkRegisteredPebbleWriteGrant-[0-9]+' 3)") +pread_100=$(nz "$(bench_value "$OUT_PREAD" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=100-[0-9]+' 3)") +pread_1k=$(nz "$(bench_value "$OUT_PREAD" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=1000-[0-9]+' 3)") +pread_10k=$(nz "$(bench_value "$OUT_PREAD" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=10000-[0-9]+' 3)") +pread_100k=$(nz "$(bench_value "$OUT_PREAD" 
'BenchmarkRegisteredPebbleUnpackReadGrants/grants=100000-[0-9]+' 3)") +pread_1m=$(nz "$(bench_value "$OUT_PREAD" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=1000000-[0-9]+' 3)") +pread_1m_bytes=$(nz "$(bench_value "$OUT_PREAD" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=1000000-[0-9]+' 5)") +pread_1m_allocs=$(nz "$(bench_value "$OUT_PREAD" 'BenchmarkRegisteredPebbleUnpackReadGrants/grants=1000000-[0-9]+' 7)") -ns_to_ms() { awk -v v="$1" 'BEGIN { if (v=="" || v==0) { print 0; exit } printf "%.3f", v / 1000000 }'; } +sread_1k=$(nz "$(bench_value "$OUT_SREAD" 'BenchmarkRegisteredSQLiteUnpackReadGrants/grants=1000-[0-9]+' 3)") +pwrite_1m=$(nz "$(bench_value "$OUT_WSENT" 'BenchmarkRegisteredPebbleWritePack/grants=1000000-[0-9]+' 3)") # --- Emit METRIC lines (consumed by run_experiment) --- -echo "METRIC pebble_writepack_1m_ms=$(ns_to_ms "$pwrite_1m")" -echo "METRIC pebble_writepack_100k_ms=$(ns_to_ms "$pwrite_100k")" -echo "METRIC pebble_writepack_10k_ms=$(ns_to_ms "$pwrite_10k")" -echo "METRIC pebble_writepack_1k_ms=$(ns_to_ms "$pwrite_1k")" -echo "METRIC pebble_writepack_100_ms=$(ns_to_ms "$pwrite_100")" -echo "METRIC pebble_writepack_1m_bytes_op=$pwrite_1m_bytes" -echo "METRIC pebble_writepack_1m_allocs_op=$pwrite_1m_allocs" +echo "METRIC pebble_readpaginated_1m_ms=$(ns_to_ms "$pread_1m")" echo "METRIC pebble_readpaginated_100k_ms=$(ns_to_ms "$pread_100k")" +echo "METRIC pebble_readpaginated_10k_ms=$(ns_to_ms "$pread_10k")" echo "METRIC pebble_readpaginated_1k_ms=$(ns_to_ms "$pread_1k")" -echo "METRIC pebble_writegrant_solo_ns_op=$solo_write" -echo "METRIC codec_direct_ns_op=$codec_direct" -echo "METRIC codec_reflect_ns_op=$codec_reflect" -echo "METRIC sqlite_writepack_1k_ms=$(ns_to_ms "$swrite_1k")" +echo "METRIC pebble_readpaginated_100_ms=$(ns_to_ms "$pread_100")" +echo "METRIC pebble_readpaginated_1m_bytes_op=$pread_1m_bytes" +echo "METRIC pebble_readpaginated_1m_allocs_op=$pread_1m_allocs" +echo "METRIC sqlite_readpaginated_1k_ms=$(ns_to_ms 
"$sread_1k")" +echo "METRIC pebble_writepack_1m_ms=$(ns_to_ms "$pwrite_1m")" # --- Diagnostic output --- echo -echo "=== Write+Pack details ===" -grep -E 'BenchmarkRegistered(Pebble|SQLite)WritePack' "$OUT_WRITE" || echo "(no rows — check $OUT_WRITE)" -if grep -qE '^(FAIL|--- FAIL|panic:|build failed)' "$OUT_WRITE"; then - echo "--- write bench errors (tail) ---" - tail -40 "$OUT_WRITE" -fi - -echo -echo "=== Read details ===" -grep 'BenchmarkRegisteredPebbleUnpackReadGrants' "$OUT_READ" || echo "(no rows — check $OUT_READ)" -if grep -qE '^(FAIL|--- FAIL|panic:|build failed)' "$OUT_READ"; then - echo "--- read bench errors (tail) ---" - tail -40 "$OUT_READ" +echo "=== Pebble paginated reads ===" +grep 'BenchmarkRegisteredPebbleUnpackReadGrants' "$OUT_PREAD" || echo "(no rows)" +if grep -qE '^(FAIL|--- FAIL|panic:|build failed)' "$OUT_PREAD"; then + echo "--- pebble read bench errors (tail) ---" + tail -40 "$OUT_PREAD" fi echo -echo "=== Codec ===" -grep 'BenchmarkCodec' "$OUT_CODEC" || echo "(no rows — check $OUT_CODEC)" +echo "=== SQLite paginated read sentinel ===" +grep 'BenchmarkRegisteredSQLiteUnpackReadGrants' "$OUT_SREAD" || echo "(no rows)" echo -echo "=== Solo write ===" -grep 'BenchmarkRegisteredPebbleWriteGrant' "$OUT_SOLO" || echo "(no rows — check $OUT_SOLO)" +echo "=== WritePack 1M sentinel ===" +grep 'BenchmarkRegisteredPebbleWritePack' "$OUT_WSENT" || echo "(no rows)" -rm -f "$OUT_WRITE" "$OUT_READ" "$OUT_CODEC" "$OUT_SOLO" +rm -f "$OUT_PREAD" "$OUT_SREAD" "$OUT_WSENT" exit 0 diff --git a/docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.ideas.md b/docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.ideas.md new file mode 100644 index 000000000..85a13fe83 --- /dev/null +++ b/docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.ideas.md @@ -0,0 +1,70 @@ +# Ideas backlog — Pebble engine perf + +Free-form scratch. 
Append new ideas as bullets; mark tried ones with +status (kept / discarded / crashed) so we don't repeat them. + +## To try (priority order) + +- [ ] **P1.1** Memtable size 64 MiB → 256 MiB (`MemTableSize` in `options.go`). +- [ ] **P1.2** `L0CompactionThreshold` sweep: 2 → 4, 8. +- [ ] **P1.3** `MaxConcurrentCompactions` upper bound: 8 → 12 (gate on GOMAXPROCS). +- [ ] **P1.4** Enable bloom filters on L0 (FilterPolicy + FilterType). +- [ ] **P1.5** Mixed compression: Snappy at L0, zstd at L6. +- [ ] **P2.6** Per-record-type per-level options (grants vs resources). +- [ ] **P2.7** Codec codegen via `cmd/protoc-gen-batonstore` — replaces reflection path. Big change; may need human approval. +- [ ] **P3.8** Pool tuple encoder buffer (`AppendTupleString`) — kill per-record slice alloc. +- [ ] **P3.9** Larger SST block size (32 KiB → 64 KiB) — amortize header overhead. + +## Tried — see jsonl for verdicts + +(populated by the loop) + +## Follow-up / human review + +- Split-batch in PutGrantRecords (commit 63c0869b) breaks cross-batch atomicity: + if priBatch commits but idxBatch fails, primary records exist without + by_entitlement / by_principal index entries. Fresh-sync replays the + whole sync from the connector so it's OK there, but incremental Put + paths (mid-sync upserts) might leak. RFC stack-6 grant expansion path + could be a concrete victim. Consider: + - Apply split only when IsFreshSync() is true; keep one-batch atomic + semantics outside fresh-sync. + - Or: document the contract change. + +## Closed axes (do NOT retry — multiple attempts confirm dead) + +- **Parallel engine.Close + WriteEnvelope** (tried at #19, #28, #45 — three baselines). + Mechanism is theoretically safe (CheckpointTo creates self-contained dir), but + goroutine + channel coordination overhead exceeds the engine.Close wallclock + savings (~30-50 ms). At smaller scales the overhead dominates and regresses + 10-15%. Not a clean win at any size. 
+- **Parallelize large heap allocations across goroutines** (#47 priBatch/idxBatch, + #48 priBatch sub-shards). Three different attempts. Go's heap allocator + serializes large (>32 KB) allocations through the central heap-arena mutex; + OS mmap underneath has kernel-level locks. Concurrent 150 MB-class allocs + from N goroutines queue serially, plus goroutine scheduling adds overhead + proportional to N. Stick to single-goroutine allocation for the big buffers. +- **FlushSplitBytes axis** (tried 2 MiB → 16 MiB at #21, #31; 2 MiB → 64 MiB at #37). + Pebble doesn't honor very large hints, or bigger SSTs lose write parallelism. + All flat-to-mildly-negative across multiple baselines. +- **Tournament tree / prefix-skip merge optimizations** (#39, #40). The naive + 4-way bytes.Compare scan is already optimally branch-predictable and SIMD-tight; + wrapping with anything in Go costs more than it saves at k=4. +- **Parallel reads for WriteEnvelope** (#43 bulk-pre-read; #46 streaming with bounded + lookahead). Two different failure modes: #43 didn't actually overlap reads with + writes (3 serial phases); #46 did overlap but per-file os.ReadFile allocated + ~530 MB of one-shot buffers vs io.Copy's reused 32 KB buffer. Pebble checkpoint + files are page-cache-hot anyway — io.Copy pulls them at memory speed, so serial + reading is already efficient. Closed axis. +- **Background WAL fsync** (WALBytesPerSync=4MiB, #38). On this hardware fsync + isn't a meaningful bottleneck; spreading it via background syncs doesn't help. +- **MemTableSize > 64 MiB** (#1 256 MiB, #16 128 MiB). Larger memtable lets entire + 100k workload fit in memory → no during-write flushes → forced serial flush at + EndSync. 100k workload regresses ~30%. +- **L0CompactionThreshold ≠ 8** axis fully mapped (2/4/6/16). 8 is the knee. +- **CompactionConcurrencyRange** (#7). With L0=8 compactor isn't the bottleneck. +- **DisableAutomaticCompactions** (#20). With L0=8 it's already idle. 
+- **proto.MarshalAppend with SetDeferred + cached size** (#23). proto.Size + double-traversal eats the memcpy savings. +- **appendEscaped bytes.IndexByte fast path** (#22). Tuple encoder is on the + smaller goroutine; max(A,B) wallclock means optimizing B doesn't help when B5% worse) + +## How to Run + +`./autoresearch.sh` — outputs `METRIC name=value` lines plus diagnostic stdout. ~5–6 min/iteration with the default scales (100..1M). + +Fast iteration mode (~1.5 min, skips 1M): +``` +BATONSDK_BENCH_SCALES="100,1000,10000,100000" ./autoresearch.sh +``` +The full 1M run is required before a final `keep` — restore the default scales before the confirmation run. + +`./autoresearch.checks.sh` is invoked automatically after each successful bench by the harness. It runs engine + SQLite + compactor + equivalence tests, lints, and asserts no `go.mod`/`go.sum`/proto drift. + +## Files in Scope + +- `pkg/dotc1z/engine/pebble/options.go` — Pebble.Options (memtable, L0, cache, block size, compression, bloom) +- `pkg/dotc1z/engine/pebble/engine.go` — lifecycle, Quiesce, Save, fresh-sync hooks +- `pkg/dotc1z/engine/pebble/grants.go` — `PutGrantRecord(s)` batch shape & durability +- `pkg/dotc1z/engine/pebble/resources.go`, `entitlements.go`, `resource_types.go` — same pattern, write paths +- `pkg/dotc1z/engine/pebble/keys.go` — key encoding (smaller keys → less memory) +- `pkg/dotc1z/engine/pebble/paginate.go` — pagination cursor + range iteration (read path) +- `pkg/dotc1z/engine/pebble/codec/tuple.go` — tuple encoder hot path +- `pkg/dotc1z/engine/pebble/codec/reflect.go` — reflection codec (codegen candidate) +- `pkg/dotc1z/engine/pebble/codec/syncid.go` — KSUID encoding +- `pkg/synccompactor/pebble/compactor.go` — IngestAndExcise driver +- `pkg/synccompactor/pebble/bucket_plans.go` — bucket layout + +## Off Limits + +- `pb/c1/storage/v3/*` — wire format frozen for v4 (generated + proto) +- `pb/c1/reader/v2/*`, `pb/c1/connector/v2/*` — external surface +- 
`proto/c1/storage/v3/` — proto IDL frozen +- `pkg/dotc1z/c1file.go`, `pkg/dotc1z/grants.go`, `pkg/dotc1z/resources.go`, … — SQLite engine path (regression sentinel only) +- `docs/rfcs/`, `.github/workflows/`, `.golangci.yml` +- `go.mod`, `go.sum`, `vendor/` — no new dependencies +- `cmd/protoc-gen-batonstore` codegen — if reached, escalate to human + +## Constraints (enforced by `autoresearch.checks.sh`) + +1. Engine + compactor + equivalence + envelope tests pass: `go test -tags=batonsdkv2 -count=1 ./pkg/dotc1z/engine/pebble/... ./pkg/dotc1z/engine/equivalence/... ./pkg/synccompactor/pebble/... ./pkg/dotc1z/format/v3/...` +2. SQLite engine tests pass: `go test -tags=baton_lambda_support -short -count=1 ./pkg/dotc1z/` +3. Lint clean: `golangci-lint run --timeout=3m --build-tags=batonsdkv2` over engine + compactor. +4. `go.mod` / `go.sum` unmodified (no new deps). +5. `proto/c1/storage/v3/` unmodified. +6. WritePack bench asserts paginated total — corruption fails naturally. +7. `sqlite_writepack_1k_ms` may not regress by more than 5% (manual check via secondary metric; large regressions → discard). + +## Priority Ideas (from RFC §9) + +### Priority 1 +1. Larger memtable (64 → 256 MiB) — absorbs more write burst before L0 flush. +2. `L0CompactionThreshold` sweep (currently 2 → try 4, 8) — write throughput vs read amp tradeoff. +3. `MaxConcurrentCompactions` upper bound — capped at 8; try 12 on big-core hosts. +4. Bloom filters on L0 — read-side win, currently disabled. +5. Mixed compression: Snappy at L0, zstd at L6 — cuts compaction CPU. + +### Priority 2 +6. Per-record-type compaction tuning (grants vs resources differ). +7. Codec codegen replacing `codec/reflect.go` — ~5× microbench, 5–10% e2e estimated. + +### Cleanup +8. Pool tuple encoder buffer (`AppendTupleString` per-record alloc). +9. Larger block size to amortize header overhead. + +## Known Dead Ends (do not retry) + +- Per-record `db.Set` instead of batched `pebble.Batch` — slower. 
+- `DisableWAL: true` — saves <5%, loses durability across Open/Close. +- Shared block cache across engines — wins in C1 prod, no-op in bench. + +## What's Been Tried + +Maintained in `autoresearch.ideas.md` and the `autoresearch.jsonl` log. Resuming agents: read both before mutating code. + +Already-applied wins (the status quo baseline): +- Fat-batch `PutGrantRecords` (one batch per N grants) — ~10× at 1M. +- Fresh-sync `pebble.NoSync`, single Flush+LogData(Sync) at EndFreshSync. +- Read-before-write index cleanup unconditional (12% cost, correctness-critical). +- 256 MiB block cache. +- `CompactionConcurrencyRange` capped at `(2, min(8, GOMAXPROCS/4))`. + +### Wins kept by this loop (cumulative -56.1% from 4292 → 1884 ms at 1M) + +In order applied (compounds multiplicatively): + +1. **`L0CompactionThreshold` 2 → 8** (-15.8%). The default 2 over-eagerly compacted during the 1M write burst, stealing CPU from writers. 8 lets ~8 L0 sub-levels accumulate before compaction kicks in; L0StopWritesThreshold=20 still bounds the worst case. Knee mapped: 2/4/6 worse; 16 flat vs 8. +2. **Scratch byte buffers + `proto.MarshalAppend`** (-5.6%). Reused `keyBuf` / `idx1Buf` / `idx2Buf` / `valBuf` across the loop; Pebble's `batch.Set` is documented as safe to modify args after return (it copies into batch buffer). Added `appendGrantKey` / `appendGrantBy*IndexKey` variants taking `dst []byte`. Allocs 9.0M→4.0M. +3. **Hoisted `resolveSyncBytes`** (-4.9%). Cache last-resolved (string, bytes) pair across loop iterations; falls back to per-record resolve when string differs. Adapter typically stamps a uniform sync_id, so the cache hits 1M times. +4. **Split `priBatch` / `idxBatch`** (-12.8%). Primary writes (sorted by external_id by construction) go to one batch; index writes (unsorted) go to another. pdqsort early-exits the priBatch's flushable-batch promotion sort; only the idxBatch pays full O(N log N) on 2/3 the entries. 
Cross-batch atomicity is fine for fresh-sync (replays from connector). +5. **`NewBatchWithSize(len*600)` / `(len*140)`** (-6.1%). Pre-size the batches so they don't grow-by-2x internally; saves ~10 reallocations and up to 2x peak slack. bytes_op -23%. +6. **Skip read-before-write Get for fresh-sync first call** (-14.5%). New engine flag `freshGrantsEmpty` is true between `MarkFreshSync` and the first `PutGrantRecords` commit. While true, the 1M `e.db.Get` calls are skipped — they'd all return ErrNotFound anyway (db.Get doesn't see in-batch writes; the keyspace is empty). Across-call dup detection preserved by clearing the flag after first commit. +7. **Parallel-build the two batches for batches ≥ 256** (-8.8%). When skipGet is true, the two batches have no shared mutable state. Two goroutines build them concurrently; each has its own scratch buffers and sync_id cache. Threshold of 256 records avoids goroutine setup overhead on tiny calls (solo write regression bounded to +11%). +8. **4-way shard the priBatch build** (-7.9%). proto.Marshal of 1M GrantRecords is the parallel-path long pole on goroutine A; sharding it across 4 worker goroutines cuts that wallclock ~4×, at the cost of 4 batch.Apply memcpy concatenations (~50 ms total). Shard count caps at min(4, len/1024) so small batches bypass the parallelism overhead. 8-way sharding saturated (no further benefit). + +### Major dead ends (do NOT retry) + +- **MemTableSize >64 MiB at any size** (-1% primary, +30%+ at 100k). Larger memtable lets the entire 100k workload fit in memory → no during-write flushes → forced end-of-sync serial flush. The 64→256 MiB attempt regressed 100k by 32%; 128 MiB by 34%. Memtable should be sized so the workload triggers ≥3 flushes during writes. +- **Chunking PutGrantRecords commits** (+83% at 1M). Splitting one big batch into 16Ki-grant chunks forces memtable rotation per chunk → many L0 files → compaction storm.
The single-big-batch path takes Pebble's optimized flushable-batch promotion (sort once, swap in as memtable atomically). +- **Bloom filters on all levels** (+2.9%). Fresh-sync workloads have unique external_ids; the Get-before-Put population is 100% misses, and the bench's read path is range iteration not point Gets. Filters add construction CPU during flushes with no payoff. They MIGHT help in C1 prod where ReaderCache does point Gets across syncs, but that's not measured here. +- **`zstd.SpeedDefault` → `SpeedFastest` + `WithEncoderConcurrency(0)`** (flat). The c1z pack tar wraps Pebble SSTs that are already Snappy-compressed internally; outer zstd is nearly incompressible regardless of level. +- **`L0CompactionThreshold` ≠ 8** — axis fully mapped. 2/4/6 worse, 16 flat with all other wins. +- **`CompactionConcurrencyRange` (2, GOMAXPROCS/2 capped 8)** (flat). With L0=8 the compactor isn't the bottleneck; adding lanes makes no difference. +- **`LBaseMaxBytes` 256 → 512 MiB** (-1.6% within noise). L1 consolidation doesn't matter at our workload size. +- **`FlushSplitBytes` 2 → 16 MiB** (-1.1% within noise). Per-SST overhead is small. +- **`DisableAutomaticCompactions: true`** (-1% within noise). With L0=8 already limiting compaction-during-writes, disabling shifts work later but saves no wallclock. +- **`SetDeferred` for primary key+value** (+2.3%). `proto.Size` traversal cost exceeds the `batch.Set` memcpy savings; no net win. +- **`appendEscaped` bytes.IndexByte fast path** (+1.7% within noise). The tuple encoder lives on the smaller goroutine (idxBatch); parallel wallclock = max(A,B), so optimizing B doesn't reduce max when B2× improvement → confirm with a clean rerun, commit, continue with next idea. +- All priority-1 ideas exhausted → move to priority-2, then write a summary. + +## Resume Protocol + +1. Read this file + RFC `autoresearch-pebble-perf.md`. +2. Read `autoresearch.ideas.md` (prune stale entries). +3. 
Skim `autoresearch.jsonl` for recent `keep`/`discard` patterns. +4. Run `./autoresearch.sh` once to re-establish baseline on the current commit. +5. Pick the highest-priority untried idea. From 90b5ac6c3c62f43fbd72690d6ba101b1fa9836a1 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 15:25:53 +0000 Subject: [PATCH 02/15] =?UTF-8?q?Baseline=20for=20the=20read-perf=20sessio?= =?UTF-8?q?n,=20starting=20from=20the=20autoresearch/pebble-perf-20260525?= =?UTF-8?q?=20branch=20(WritePack=20session=20end=20state).=20pebble=5Frea?= =?UTF-8?q?dpaginated=5F1m=5Fms=20=3D=201229.6=20ms.=20Allocs=20at=201M=20?= =?UTF-8?q?scale=20=3D=2017M=20(versus=2022K=20on=20the=20WritePack=20side?= =?UTF-8?q?=20=E2=80=94=20likely=20a=20big=20arena-like=20win=20available)?= =?UTF-8?q?.=20SQLite=20+=20WritePack=20sentinels=20within=20tolerance.?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Result: {"status":"keep","pebble_readpaginated_1m_ms":1229.57,"pebble_readpaginated_100k_ms":129.639,"pebble_readpaginated_10k_ms":15.041,"pebble_readpaginated_1k_ms":2.931,"pebble_readpaginated_100_ms":1.766,"pebble_readpaginated_1m_bytes_op":1447335468,"pebble_readpaginated_1m_allocs_op":17032960,"sqlite_readpaginated_1k_ms":5.869,"pebble_writepack_1m_ms":1262.162} --- autoresearch.jsonl | 1 + 1 file changed, 1 insertion(+) create mode 100644 autoresearch.jsonl diff --git a/autoresearch.jsonl b/autoresearch.jsonl new file mode 100644 index 000000000..fb99c4882 --- /dev/null +++ b/autoresearch.jsonl @@ -0,0 +1 @@ +{"type":"config","name":"Pebble engine read perf — minimize UnpackReadGrants 1M paginated read","metricName":"pebble_readpaginated_1m_ms","metricUnit":"ms","bestDirection":"lower"} From c66be93567ffed9a0c56cae5a74efd8111a694e9 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 15:36:56 +0000 Subject: [PATCH 03/15] Outer-only grantReadArena for PaginateGrantsBySync. 
Collapses per-iter v3.GrantRecord allocations from O(records) to O(pages): for the 1M read bench, 1M outer allocs become 100 slice allocs (one per page). pebble_readpaginated_1m_ms 1229.6→1201.4 ms (-2.3%); pebble_readpaginated_100k_ms 129.6→122.2 (-5.7% same direction confirms). Allocs/op 17.03M→16.03M (-5.9%, matches the eliminated outer GrantRecord allocs). Smaller scales regressed (1k +14%, 100 +20%) due to over-sized arena: pre-allocates 10 k slots but workloads of 100/1k only fill a fraction. WritePack + SQLite sentinels flat. Followup: switch to slab-style growable arena to fix small-scale regression. Result: {"status":"keep","pebble_readpaginated_1m_ms":1201.441,"pebble_readpaginated_100k_ms":122.228,"pebble_readpaginated_10k_ms":14.902,"pebble_readpaginated_1k_ms":3.348,"pebble_readpaginated_100_ms":2.117,"pebble_readpaginated_1m_bytes_op":1447654476,"pebble_readpaginated_1m_allocs_op":16033053,"sqlite_readpaginated_1k_ms":6.068,"pebble_writepack_1m_ms":1242.529} --- autoresearch.jsonl | 2 ++ pkg/dotc1z/engine/pebble/paginate.go | 41 ++++++++++++++++++++++++++-- 2 files changed, 40 insertions(+), 3 deletions(-) diff --git a/autoresearch.jsonl b/autoresearch.jsonl index fb99c4882..423b563c3 100644 --- a/autoresearch.jsonl +++ b/autoresearch.jsonl @@ -1 +1,3 @@ {"type":"config","name":"Pebble engine read perf — minimize UnpackReadGrants 1M paginated read","metricName":"pebble_readpaginated_1m_ms","metricUnit":"ms","bestDirection":"lower"} +{"run":49,"commit":"90b5ac6","metric":1229.57,"metrics":{"pebble_readpaginated_100k_ms":129.639,"pebble_readpaginated_10k_ms":15.041,"pebble_readpaginated_1k_ms":2.931,"pebble_readpaginated_100_ms":1.766,"pebble_readpaginated_1m_bytes_op":1447335468,"pebble_readpaginated_1m_allocs_op":17032960,"sqlite_readpaginated_1k_ms":5.869,"pebble_writepack_1m_ms":1262.162},"status":"keep","description":"Baseline for the read-perf session, starting from the autoresearch/pebble-perf-20260525 branch (WritePack
session end state). pebble_readpaginated_1m_ms = 1229.6 ms. Allocs at 1M scale = 17M (versus 22K on the WritePack side — likely a big arena-like win available). SQLite + WritePack sentinels within tolerance.","timestamp":1779722753425,"segment":1,"confidence":null,"asi":{"hypothesis":"establish a clean baseline for the read-perf session","baseline_pread_1m_ms":1229.57,"baseline_pread_100k_ms":129.64,"host":"16-core arm64 linux (same as WritePack session)","allocs_at_1m_anomaly":"17M allocs at 1M scale. The bench paginates 1M grants in 100 pages of 10k each. Each grant likely triggers a few allocations (V3GrantToV2 + nested stubs); 1M × ~17 ≈ 17M matches. Strong hint that an arena-style fix applies (like WritePack session #41).","next_idea":"Profile-first: pprof the 1M read to localize the actual bottleneck (NewStore/unpack vs paginated iteration vs Close). Most likely the unpack (ExtractZstdTar) dominates at large scales because the c1z is ~500MB to extract.","iter_time_observation":"Full bench sweep runs in ~17s wallclock; much faster than the WritePack session's ~190s. Can afford more iterations per session."}} +{"run":50,"commit":"90b5ac6","metric":1235.807,"metrics":{"pebble_readpaginated_100k_ms":130.436,"pebble_readpaginated_10k_ms":15.289,"pebble_readpaginated_1k_ms":4.088,"pebble_readpaginated_100_ms":2.279,"pebble_readpaginated_1m_bytes_op":1668983796,"pebble_readpaginated_1m_allocs_op":16033557,"sqlite_readpaginated_1k_ms":5.934,"pebble_writepack_1m_ms":1268.401},"status":"discard","description":"grantReadArena for the paginated grants read path with pre-populated nested fields (Entitlement, Principal, DiscoveredAt). Hypothesis: proto.Unmarshal's 'if field == nil, allocate' check would reuse the pre-populated arena slots. Reality: proto.Unmarshal allocates fresh nested messages anyway (allocs only dropped 17.0M\\u219216.0M = -5.9%, which matches just the OUTER GrantRecord arena win, not the nested fields). 
The pre-populated nested arenas are wasted memory \\u2014 bytes_op +15.4%, smaller scales regressed +28-40% at 1k/100. Primary flat (+0.5%).","timestamp":1779723086778,"segment":1,"confidence":null,"asi":{"hypothesis":"proto.Unmarshal's generated code uses 'if nested == nil { new(T) }' pattern, so pre-populating with arena slots should let the runtime reuse them.","result":"Wrong about the proto runtime. Allocs dropped only by the 1M outer GrantRecord count; nested fields still allocated fresh. Smaller scales regressed because arena was sized to DefaultPageSize=10000 regardless of actual records.","rollback_reason":"primary flat, smaller scales clearly regressed, bytes_op up.","mechanism_failure":"The protobuf runtime's behavior on non-nil nested fields is more complex than the simple if-nil-allocate I assumed. Either (1) it CLEARS the existing nested field before unmarshaling, (2) it MERGES into the existing field but my pre-populated field gets discarded for some other reason, or (3) the generated unmarshal code path for v3 GrantRecord uses a different pattern. Need to read the actual unmarshal code path more carefully.","next_action_hint":"Two angles: (a) just arena the OUTER GrantRecord (lose only -5.9% allocs, modest win); (b) figure out HOW to actually share nested allocations across the page — maybe via proto.UnmarshalState directly bypassing the generated Unmarshal, or by writing custom decode for GrantRecord. 
Also: the arena should be sized to limit OR len(actual_records), not always DefaultPageSize, to avoid over-allocation at small scales."}} diff --git a/pkg/dotc1z/engine/pebble/paginate.go b/pkg/dotc1z/engine/pebble/paginate.go index 19b7224aa..25f2982d3 100644 --- a/pkg/dotc1z/engine/pebble/paginate.go +++ b/pkg/dotc1z/engine/pebble/paginate.go @@ -136,9 +136,42 @@ func iteratePrimaryPageWithKey[T proto.Message]( // === Paginated grant variants === +// grantReadArena batches the OUTER v3.GrantRecord allocations done +// when hydrating a page of grants via proto.Unmarshal. Each iter step +// of the page-read loop allocates a fresh GrantRecord; for a 1 M +// paginated read in 10 k chunks that's 1 M outer-struct allocs + +// associated memclr work. Arena collapses these to 100 slice allocs +// (one per page). +// +// We do NOT pre-populate nested fields (Entitlement/Principal/ +// DiscoveredAt). An earlier attempt to do so (paginate.go, run #50) +// only saved the OUTER GrantRecord allocations, not the nested ones, +// while wasting memory on unused pre-populated arena slots at smaller +// scales. proto.Unmarshal's nested-message reuse path didn't trigger +// on our pre-populated pointers — the protobuf runtime's actual +// behavior differed from the consumeMessageInfo source-level read. +// Leaving nested message allocation to the runtime. +type grantReadArena struct { + grants []v3.GrantRecord +} + +func newGrantReadArena(pageLimit int) *grantReadArena { + return &grantReadArena{ + grants: make([]v3.GrantRecord, 0, pageLimit), + } +} + +func (a *grantReadArena) allocGrant() *v3.GrantRecord { + a.grants = append(a.grants, v3.GrantRecord{}) + return &a.grants[len(a.grants)-1] +} + // PaginateGrantsBySync returns up to `limit` grants from the // primary-key range, starting strictly after `cursor`. Returns the // next cursor (empty if no more) plus the materialized records. 
+// +// Uses grantReadArena for the per-iter outer-struct allocations — +// 1 page = 1 arena slice rather than O(page-size) individual mallocs. func (e *Engine) PaginateGrantsBySync( ctx context.Context, syncID, cursor string, limit int, ) ([]*v3.GrantRecord, string, error) { @@ -150,10 +183,12 @@ func (e *Engine) PaginateGrantsBySync( if err != nil { return nil, "", err } + if limit <= 0 { + limit = DefaultPageSize + } prefix := encodeGrantPrefix(idBytes) - return iteratePrimaryPageWithKey(ctx, e.db, prefix, cursorBytes, limit, func() *v3.GrantRecord { - return &v3.GrantRecord{} - }) + arena := newGrantReadArena(limit) + return iteratePrimaryPageWithKey(ctx, e.db, prefix, cursorBytes, limit, arena.allocGrant) } // PaginateGrantsByEntitlement uses the by_entitlement index. The From 2f3619afc4cf3a9739f97ad5eddbd112dc90c507 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 15:42:43 +0000 Subject: [PATCH 04/15] grantV2ReadArena: arena-allocate the 6 v2.Grant nested stubs (Grant + Entitlement + 2 × (Resource + ResourceId)) in adapter.ListGrants. Each translateV3Grant call now allocates O(1) into the per-page arena instead of 6 individual mallocs. pebble_readpaginated_1m_ms 1229.6→974.6 ms (-20.7%); pebble_readpaginated_100k_ms 129.6→104.9 (-19.1% same direction confirms). Allocs/op 17.0M→10.0M (-41%), exactly the 6M eliminated stub allocations matching the mechanism. bytes_op flat. SQLite + WritePack sentinels flat. Smaller scales still regress (1k +16%, 100 +28%) due to arena-allocation overhead at sizes where direct allocation was already cheap.
Result: {"status":"keep","pebble_readpaginated_1m_ms":974.563,"pebble_readpaginated_100k_ms":104.94,"pebble_readpaginated_10k_ms":13.992,"pebble_readpaginated_1k_ms":3.411,"pebble_readpaginated_100_ms":2.269,"pebble_readpaginated_1m_bytes_op":1428262212,"pebble_readpaginated_1m_allocs_op":10033722,"sqlite_readpaginated_1k_ms":5.85,"pebble_writepack_1m_ms":1253.056} --- autoresearch.jsonl | 2 + pkg/dotc1z/engine/pebble/adapter.go | 9 ++- pkg/dotc1z/engine/pebble/translate_v2.go | 98 ++++++++++++++++++++++++ 3 files changed, 108 insertions(+), 1 deletion(-) diff --git a/autoresearch.jsonl b/autoresearch.jsonl index 423b563c3..db4f7523e 100644 --- a/autoresearch.jsonl +++ b/autoresearch.jsonl @@ -1,3 +1,5 @@ {"type":"config","name":"Pebble engine read perf — minimize UnpackReadGrants 1M paginated read","metricName":"pebble_readpaginated_1m_ms","metricUnit":"ms","bestDirection":"lower"} {"run":49,"commit":"90b5ac6","metric":1229.57,"metrics":{"pebble_readpaginated_100k_ms":129.639,"pebble_readpaginated_10k_ms":15.041,"pebble_readpaginated_1k_ms":2.931,"pebble_readpaginated_100_ms":1.766,"pebble_readpaginated_1m_bytes_op":1447335468,"pebble_readpaginated_1m_allocs_op":17032960,"sqlite_readpaginated_1k_ms":5.869,"pebble_writepack_1m_ms":1262.162},"status":"keep","description":"Baseline for the read-perf session, starting from the autoresearch/pebble-perf-20260525 branch (WritePack session end state). pebble_readpaginated_1m_ms = 1229.6 ms. Allocs at 1M scale = 17M (versus 22K on the WritePack side — likely a big arena-like win available). SQLite + WritePack sentinels within tolerance.","timestamp":1779722753425,"segment":1,"confidence":null,"asi":{"hypothesis":"establish a clean baseline for the read-perf session","baseline_pread_1m_ms":1229.57,"baseline_pread_100k_ms":129.64,"host":"16-core arm64 linux (same as WritePack session)","allocs_at_1m_anomaly":"17M allocs at 1M scale. The bench paginates 1M grants in 100 pages of 10k each. 
Each grant likely triggers a few allocations (V3GrantToV2 + nested stubs); 1M × ~17 ≈ 17M matches. Strong hint that an arena-style fix applies (like WritePack session #41).","next_idea":"Profile-first: pprof the 1M read to localize the actual bottleneck (NewStore/unpack vs paginated iteration vs Close). Most likely the unpack (ExtractZstdTar) dominates at large scales because the c1z is ~500MB to extract.","iter_time_observation":"Full bench sweep runs in ~17s wallclock; much faster than the WritePack session's ~190s. Can afford more iterations per session."}} {"run":50,"commit":"90b5ac6","metric":1235.807,"metrics":{"pebble_readpaginated_100k_ms":130.436,"pebble_readpaginated_10k_ms":15.289,"pebble_readpaginated_1k_ms":4.088,"pebble_readpaginated_100_ms":2.279,"pebble_readpaginated_1m_bytes_op":1668983796,"pebble_readpaginated_1m_allocs_op":16033557,"sqlite_readpaginated_1k_ms":5.934,"pebble_writepack_1m_ms":1268.401},"status":"discard","description":"grantReadArena for the paginated grants read path with pre-populated nested fields (Entitlement, Principal, DiscoveredAt). Hypothesis: proto.Unmarshal's 'if field == nil, allocate' check would reuse the pre-populated arena slots. Reality: proto.Unmarshal allocates fresh nested messages anyway (allocs only dropped 17.0M\\u219216.0M = -5.9%, which matches just the OUTER GrantRecord arena win, not the nested fields). The pre-populated nested arenas are wasted memory \\u2014 bytes_op +15.4%, smaller scales regressed +28-40% at 1k/100. Primary flat (+0.5%).","timestamp":1779723086778,"segment":1,"confidence":null,"asi":{"hypothesis":"proto.Unmarshal's generated code uses 'if nested == nil { new(T) }' pattern, so pre-populating with arena slots should let the runtime reuse them.","result":"Wrong about the proto runtime. Allocs dropped only by the 1M outer GrantRecord count; nested fields still allocated fresh. 
Smaller scales regressed because arena was sized to DefaultPageSize=10000 regardless of actual records.","rollback_reason":"primary flat, smaller scales clearly regressed, bytes_op up.","mechanism_failure":"The protobuf runtime's behavior on non-nil nested fields is more complex than the simple if-nil-allocate I assumed. Either (1) it CLEARS the existing nested field before unmarshaling, (2) it MERGES into the existing field but my pre-populated field gets discarded for some other reason, or (3) the generated unmarshal code path for v3 GrantRecord uses a different pattern. Need to read the actual unmarshal code path more carefully.","next_action_hint":"Two angles: (a) just arena the OUTER GrantRecord (lose only -5.9% allocs, modest win); (b) figure out HOW to actually share nested allocations across the page — maybe via proto.UnmarshalState directly bypassing the generated Unmarshal, or by writing custom decode for GrantRecord. Also: the arena should be sized to limit OR len(actual_records), not always DefaultPageSize, to avoid over-allocation at small scales."}} +{"run":51,"commit":"c66be93","metric":1201.441,"metrics":{"pebble_readpaginated_100k_ms":122.228,"pebble_readpaginated_10k_ms":14.902,"pebble_readpaginated_1k_ms":3.348,"pebble_readpaginated_100_ms":2.117,"pebble_readpaginated_1m_bytes_op":1447654476,"pebble_readpaginated_1m_allocs_op":16033053,"sqlite_readpaginated_1k_ms":6.068,"pebble_writepack_1m_ms":1242.529},"status":"keep","description":"Outer-only grantReadArena for PaginateGrantsBySync. Collapses per-iter v3.GrantRecord allocations from O(records) to O(pages) \\u2014 for the 1M read bench, 1M outer allocs become 100 slice allocs (one per page). pebble_readpaginated_1m_ms 1229.6\\u21921201.4 ms (-2.3%); pebble_readpaginated_100k_ms 129.6\\u2192122.2 (-5.7% same direction confirms). Allocs/op 17.03M\\u219216.03M (-5.9%, matches the eliminated outer GrantRecord allocs). 
Smaller scales regressed (1k +14%, 100 +20%) due to over-sized arena: pre-allocates 10 k slots but workloads of 100/1k only fill a fraction. WritePack + SQLite sentinels flat. Followup: switch to slab-style growable arena to fix small-scale regression.","timestamp":1779723416651,"segment":1,"confidence":4.510020843354104,"asi":{"hypothesis":"outer v3.GrantRecord allocations in iteratePrimaryPageWithKey (1M per 1M read) collapse to O(pages) via arena. Pre-populating nested fields didn't work (see #50); just arena the outer struct.","result":"primary -2.3% (1229.6→1201.4 ms), 100k -5.7% directionally confirms. Allocs/op dropped by exactly 1M (the outer GrantRecord count), matching the mechanism.","small_scale_regression":"1k +14%, 100 +20%. Cause: arena pre-allocates DefaultPageSize=10000 slots; small workloads waste 9000+ slots = ~700 KB of memclr per call. For a 1.77 ms baseline at 100 scale, an extra 350 µs of unused-arena alloc work is significant.","fix_idea_for_next_iteration":"Use a slab-style arena: initial slab of 256-512 entries, grow by allocating a NEW slab (preserving pointers in earlier slabs). Avoids the upfront 10000-slot waste at small scales while keeping pointer stability for the returned page.","sentinels":"pebble_writepack_1m_ms 1262→1243 (within noise, no regression). sqlite_readpaginated_1k_ms 5.87→6.07 (within noise).","cumulative_progress":"Read session iteration 2: -2.3% from session baseline."}} +{"run":52,"commit":"c66be93","metric":1199.243,"metrics":{"pebble_readpaginated_100k_ms":127.605,"pebble_readpaginated_10k_ms":15.945,"pebble_readpaginated_1k_ms":3.352,"pebble_readpaginated_100_ms":2.21,"pebble_readpaginated_1m_bytes_op":1482329336,"pebble_readpaginated_1m_allocs_op":16034013,"sqlite_readpaginated_1k_ms":6.132,"pebble_writepack_1m_ms":1265.657},"status":"discard","description":"Slab-style growable arena (initial cap 256, doubling to 4 K cap). 
Intent: fix #51's small-scale regression by not over-allocating 10 K slots for tiny reads. Primary flat (-0.2% vs #51) and small scales slightly worse (100 +4.2%, 100k +4.4% vs #51). The slab management overhead (per-call if-len-equals-cap check + slab append + slice-header sync) outweighs the saved memclr at small scales.","timestamp":1779723569216,"segment":1,"confidence":1.855046658093442,"asi":{"hypothesis":"slab-style arena fixes #51's small-scale regression while keeping the 1M win","result":"primary flat (-0.2%), small scales slightly worse vs #51. Multiple slabs (256+512+1024+...) added per-call branch overhead that isn't compensated by the saved memclr at small scales.","rollback_reason":"primary didn't improve and small scales got slightly worse — the simpler fixed-size arena from #51 is the better tradeoff.","lesson":"When the arena's pre-allocated waste is small in absolute terms (e.g., 800 KB memclr = 80 µs), the management overhead of a growable arena can exceed the saved work. Simpler fixed-size arena with a known upper bound on page size is the right answer."}} diff --git a/pkg/dotc1z/engine/pebble/adapter.go b/pkg/dotc1z/engine/pebble/adapter.go index f6360eff2..890181e2b 100644 --- a/pkg/dotc1z/engine/pebble/adapter.go +++ b/pkg/dotc1z/engine/pebble/adapter.go @@ -500,9 +500,16 @@ func (a *Adapter) ListGrants(ctx context.Context, req *v2.GrantsServiceListGrant if err != nil { return nil, err } + // Arena the v2.Grant + nested stubs so a page of N grants costs 6 + // slice allocs instead of 6 × N individual mallocs. Pre-sized to + // the EXACT record count returned (no over-allocation), so small + // pages don't pay the cost of unused arena slots. The arena's + // backing arrays are held alive via the *v2.Grant pointers in `out`, + // which the caller receives in the response. 
+ arena := newGrantV2ReadArena(len(records)) out := make([]*v2.Grant, 0, len(records)) for _, rec := range records { - out = append(out, V3GrantToV2(rec)) + out = append(out, arena.translateV3Grant(rec)) } return v2.GrantsServiceListGrantsResponse_builder{ List: out, diff --git a/pkg/dotc1z/engine/pebble/translate_v2.go b/pkg/dotc1z/engine/pebble/translate_v2.go index 94db3fc3c..2e3e983b1 100644 --- a/pkg/dotc1z/engine/pebble/translate_v2.go +++ b/pkg/dotc1z/engine/pebble/translate_v2.go @@ -78,6 +78,104 @@ func resourceToPrincipalRef(r *v2.Resource) *v3.PrincipalRef { }.Build() } +// grantV2ReadArena batches the v2.Grant + nested-stub allocations +// done when V3GrantToV2 hydrates a page of read results. Each grant +// becomes 6 individual heap allocations via the default builder path: +// +// *v2.Grant +// *v2.Entitlement +// *v2.Resource (entitlement's Resource) +// *v2.ResourceId (entitlement's Resource.Id) +// *v2.Resource (principal) +// *v2.ResourceId (principal.Id) +// +// For the 1 M paginated read bench that's 6 M allocations on top of +// the proto.Unmarshal allocations from the engine side. The arena +// pre-allocates 6 backing arrays sized to the actual returned page +// count, and translateV3Grant fills them in via the Set* methods. GC +// sees 6 large objects per page instead of 6 × records. +// +// Lifetime: the arena lives for one ListGrants call (or transitively, +// for the lifetime of the Response.List slice the caller holds). +// Pointers into the arena slices stay valid as long as the arena is +// reachable — Go's GC keeps the backing arrays alive while any pointer +// to an element is held. The caller never sees the arena directly; +// they see *v2.Grant pointers into it. +// +// Pre-sized to exact `n` capacity, so append never reallocates and +// returned pointers are stable for the arena's lifetime. 
+type grantV2ReadArena struct { + grants []v2.Grant + entitlements []v2.Entitlement + entitlementResources []v2.Resource + entitlementResourceIDs []v2.ResourceId + principalResources []v2.Resource + principalResourceIDs []v2.ResourceId +} + +// newGrantV2ReadArena pre-allocates backing arrays exactly sized to +// the number of records that will be translated. Returns nil if n +// is 0 — caller can use V3GrantToV2 directly in that case. +func newGrantV2ReadArena(n int) *grantV2ReadArena { + if n <= 0 { + return nil + } + return &grantV2ReadArena{ + grants: make([]v2.Grant, 0, n), + entitlements: make([]v2.Entitlement, 0, n), + entitlementResources: make([]v2.Resource, 0, n), + entitlementResourceIDs: make([]v2.ResourceId, 0, n), + principalResources: make([]v2.Resource, 0, n), + principalResourceIDs: make([]v2.ResourceId, 0, n), + } +} + +// translateV3Grant is the arena-allocating counterpart to V3GrantToV2. +// Returns a pointer into the arena; behavior matches V3GrantToV2 for +// all valid inputs. 
+func (a *grantV2ReadArena) translateV3Grant(r *v3.GrantRecord) *v2.Grant { + if r == nil { + return nil + } + a.grants = append(a.grants, v2.Grant{}) + g := &a.grants[len(a.grants)-1] + g.SetId(r.GetExternalId()) + if ref := r.GetEntitlement(); ref != nil { + a.entitlementResourceIDs = append(a.entitlementResourceIDs, v2.ResourceId{}) + rid := &a.entitlementResourceIDs[len(a.entitlementResourceIDs)-1] + rid.SetResourceType(ref.GetResourceTypeId()) + rid.SetResource(ref.GetResourceId()) + + a.entitlementResources = append(a.entitlementResources, v2.Resource{}) + res := &a.entitlementResources[len(a.entitlementResources)-1] + res.SetId(rid) + + a.entitlements = append(a.entitlements, v2.Entitlement{}) + ent := &a.entitlements[len(a.entitlements)-1] + ent.SetId(ref.GetEntitlementId()) + ent.SetResource(res) + g.SetEntitlement(ent) + } + if ref := r.GetPrincipal(); ref != nil { + a.principalResourceIDs = append(a.principalResourceIDs, v2.ResourceId{}) + rid := &a.principalResourceIDs[len(a.principalResourceIDs)-1] + rid.SetResourceType(ref.GetResourceTypeId()) + rid.SetResource(ref.GetResourceId()) + + a.principalResources = append(a.principalResources, v2.Resource{}) + res := &a.principalResources[len(a.principalResources)-1] + res.SetId(rid) + g.SetPrincipal(res) + } + if ann := r.GetAnnotations(); len(ann) > 0 { + g.SetAnnotations(ann) + } + if src := v3GrantSourcesToV2(r.GetSources()); src != nil { + g.SetSources(src) + } + return g +} + // grantTranslateArena batches v3.GrantRecord / EntitlementRef / // PrincipalRef allocations for one PutGrants call. 
The default // V2GrantToV3 builder pattern heap-allocates each of the three structs From ff177b15be56c04be12f07e490eb64f17be54db1 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 15:44:50 +0000 Subject: [PATCH 05/15] autoresearch.ideas.md: log read-perf session results (-20.7% in 5 iterations) --- autoresearch.ideas.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/autoresearch.ideas.md b/autoresearch.ideas.md index c4c1071b8..886ed0a01 100644 --- a/autoresearch.ideas.md +++ b/autoresearch.ideas.md @@ -79,4 +79,28 @@ of those closed axes likely apply here too: ## Tried — see jsonl for verdicts -(populated by the loop) +### Kept + +- **#51 outer-only grantReadArena** (-2.3% primary). Collapses the 1 M + v3.GrantRecord outer allocations to one slice alloc per page. Small- + scale regression (1k +14 %, 100 +20 %) is the known arena-over-allocation + tradeoff. WritePack + SQLite sentinels flat. +- **#53 grantV2ReadArena** (-20.7 % primary). Arena-allocates the 6 + v2.Grant nested stubs (Grant + Entitlement + 2 Resources + 2 ResourceIds) + in adapter.ListGrants. Pre-sized to len(records) so no waste at any scale + for the arena itself (small-scale regression unchanged from #51, came from + the OUTER GrantRecord arena, not this one). Allocs/op 17M→10M. + +### Discarded + +- **#50 grantReadArena with pre-populated nested fields** — proto.Unmarshal + didn't reuse the pre-populated EntitlementRef/PrincipalRef/Timestamp + pointers despite the consumeMessageInfo source-level read suggesting it + should. Only the outer GrantRecord was reused (-1 alloc/grant), while the + unused pre-populated arenas added bytes_op +15 % and regressed smaller + scales +28-40 %. Probable causes recorded in the jsonl ASI. +- **#52 slab-style growable arena** — attempted to fix #51's small-scale + regression by sizing arenas to actual records via doubling-slab strategy. 
+ Slab management overhead (per-call cap check + slice-header sync) cancelled + the saved memclr at small scales. Fixed-size arena from #51 is the better + tradeoff. From f45b86e73909cbc51f7d7f5a472a9375da05da6f Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 15:53:07 +0000 Subject: [PATCH 06/15] BATCHED parallel proto.Unmarshal for PaginateGrantsBySync, fixing the #54 design that dispatched per-record. Main goroutine batches 64 records into a single concatenated value buffer + offsets array, then sends ONE channel msg per batch to a 4-worker pool. Workers Unmarshal sequentially within their batch (avoiding cross-worker cache contention) but multiple batches process in parallel. pebble_readpaginated_1m_ms 974.6\u2192480.0 ms (-50.7 %). 100k -41.6 %, 10k -33.6 %, 1k -7.3 %, 100 essentially flat. allocs_op flat at 10M, bytes_op +36 % from the per-batch value buffer pre-alloc. SQLite + WritePack sentinels flat. Result: {"status":"keep","pebble_readpaginated_1m_ms":479.991,"pebble_readpaginated_100k_ms":61.269,"pebble_readpaginated_10k_ms":9.288,"pebble_readpaginated_1k_ms":3.159,"pebble_readpaginated_100_ms":2.386,"pebble_readpaginated_1m_bytes_op":1946320572,"pebble_readpaginated_1m_allocs_op":10082026,"sqlite_readpaginated_1k_ms":5.907,"pebble_writepack_1m_ms":1244.66} --- autoresearch.jsonl | 2 + pkg/dotc1z/engine/pebble/paginate.go | 147 ++++++++++++++++++++++++--- 2 files changed, 134 insertions(+), 15 deletions(-) diff --git a/autoresearch.jsonl b/autoresearch.jsonl index db4f7523e..b44a32f9f 100644 --- a/autoresearch.jsonl +++ b/autoresearch.jsonl @@ -3,3 +3,5 @@ 
{"run":50,"commit":"90b5ac6","metric":1235.807,"metrics":{"pebble_readpaginated_100k_ms":130.436,"pebble_readpaginated_10k_ms":15.289,"pebble_readpaginated_1k_ms":4.088,"pebble_readpaginated_100_ms":2.279,"pebble_readpaginated_1m_bytes_op":1668983796,"pebble_readpaginated_1m_allocs_op":16033557,"sqlite_readpaginated_1k_ms":5.934,"pebble_writepack_1m_ms":1268.401},"status":"discard","description":"grantReadArena for the paginated grants read path with pre-populated nested fields (Entitlement, Principal, DiscoveredAt). Hypothesis: proto.Unmarshal's 'if field == nil, allocate' check would reuse the pre-populated arena slots. Reality: proto.Unmarshal allocates fresh nested messages anyway (allocs only dropped 17.0M\\u219216.0M = -5.9%, which matches just the OUTER GrantRecord arena win, not the nested fields). The pre-populated nested arenas are wasted memory \\u2014 bytes_op +15.4%, smaller scales regressed +28-40% at 1k/100. Primary flat (+0.5%).","timestamp":1779723086778,"segment":1,"confidence":null,"asi":{"hypothesis":"proto.Unmarshal's generated code uses 'if nested == nil { new(T) }' pattern, so pre-populating with arena slots should let the runtime reuse them.","result":"Wrong about the proto runtime. Allocs dropped only by the 1M outer GrantRecord count; nested fields still allocated fresh. Smaller scales regressed because arena was sized to DefaultPageSize=10000 regardless of actual records.","rollback_reason":"primary flat, smaller scales clearly regressed, bytes_op up.","mechanism_failure":"The protobuf runtime's behavior on non-nil nested fields is more complex than the simple if-nil-allocate I assumed. Either (1) it CLEARS the existing nested field before unmarshaling, (2) it MERGES into the existing field but my pre-populated field gets discarded for some other reason, or (3) the generated unmarshal code path for v3 GrantRecord uses a different pattern. 
Need to read the actual unmarshal code path more carefully.","next_action_hint":"Two angles: (a) just arena the OUTER GrantRecord (lose only -5.9% allocs, modest win); (b) figure out HOW to actually share nested allocations across the page — maybe via proto.UnmarshalState directly bypassing the generated Unmarshal, or by writing custom decode for GrantRecord. Also: the arena should be sized to limit OR len(actual_records), not always DefaultPageSize, to avoid over-allocation at small scales."}} {"run":51,"commit":"c66be93","metric":1201.441,"metrics":{"pebble_readpaginated_100k_ms":122.228,"pebble_readpaginated_10k_ms":14.902,"pebble_readpaginated_1k_ms":3.348,"pebble_readpaginated_100_ms":2.117,"pebble_readpaginated_1m_bytes_op":1447654476,"pebble_readpaginated_1m_allocs_op":16033053,"sqlite_readpaginated_1k_ms":6.068,"pebble_writepack_1m_ms":1242.529},"status":"keep","description":"Outer-only grantReadArena for PaginateGrantsBySync. Collapses per-iter v3.GrantRecord allocations from O(records) to O(pages) \\u2014 for the 1M read bench, 1M outer allocs become 100 slice allocs (one per page). pebble_readpaginated_1m_ms 1229.6\\u21921201.4 ms (-2.3%); pebble_readpaginated_100k_ms 129.6\\u2192122.2 (-5.7% same direction confirms). Allocs/op 17.03M\\u219216.03M (-5.9%, matches the eliminated outer GrantRecord allocs). Smaller scales regressed (1k +14%, 100 +20%) due to over-sized arena: pre-allocates 10 k slots but workloads of 100/1k only fill a fraction. WritePack + SQLite sentinels flat. Followup: switch to slab-style growable arena to fix small-scale regression.","timestamp":1779723416651,"segment":1,"confidence":4.510020843354104,"asi":{"hypothesis":"outer v3.GrantRecord allocations in iteratePrimaryPageWithKey (1M per 1M read) collapse to O(pages) via arena. Pre-populating nested fields didn't work (see #50); just arena the outer struct.","result":"primary -2.3% (1229.6→1201.4 ms), 100k -5.7% directionally confirms. 
Allocs/op dropped by exactly 1M (the outer GrantRecord count), matching the mechanism.","small_scale_regression":"1k +14%, 100 +20%. Cause: arena pre-allocates DefaultPageSize=10000 slots; small workloads waste 9000+ slots = ~700 KB of memclr per call. For a 1.77 ms baseline at 100 scale, an extra 350 µs of unused-arena alloc work is significant.","fix_idea_for_next_iteration":"Use a slab-style arena: initial slab of 256-512 entries, grow by allocating a NEW slab (preserving pointers in earlier slabs). Avoids the upfront 10000-slot waste at small scales while keeping pointer stability for the returned page.","sentinels":"pebble_writepack_1m_ms 1262→1243 (within noise, no regression). sqlite_readpaginated_1k_ms 5.87→6.07 (within noise).","cumulative_progress":"Read session iteration 2: -2.3% from session baseline."}} {"run":52,"commit":"c66be93","metric":1199.243,"metrics":{"pebble_readpaginated_100k_ms":127.605,"pebble_readpaginated_10k_ms":15.945,"pebble_readpaginated_1k_ms":3.352,"pebble_readpaginated_100_ms":2.21,"pebble_readpaginated_1m_bytes_op":1482329336,"pebble_readpaginated_1m_allocs_op":16034013,"sqlite_readpaginated_1k_ms":6.132,"pebble_writepack_1m_ms":1265.657},"status":"discard","description":"Slab-style growable arena (initial cap 256, doubling to 4 K cap). Intent: fix #51's small-scale regression by not over-allocating 10 K slots for tiny reads. Primary flat (-0.2% vs #51) and small scales slightly worse (100 +4.2%, 100k +4.4% vs #51). The slab management overhead (per-call if-len-equals-cap check + slab append + slice-header sync) outweighs the saved memclr at small scales.","timestamp":1779723569216,"segment":1,"confidence":1.855046658093442,"asi":{"hypothesis":"slab-style arena fixes #51's small-scale regression while keeping the 1M win","result":"primary flat (-0.2%), small scales slightly worse vs #51. Multiple slabs (256+512+1024+...) 
added per-call branch overhead that isn't compensated by the saved memclr at small scales.","rollback_reason":"primary didn't improve and small scales got slightly worse — the simpler fixed-size arena from #51 is the better tradeoff.","lesson":"When the arena's pre-allocated waste is small in absolute terms (e.g., 800 KB memclr = 80 µs), the management overhead of a growable arena can exceed the saved work. Simpler fixed-size arena with a known upper bound on page size is the right answer."}} +{"run":53,"commit":"2f3619a","metric":974.563,"metrics":{"pebble_readpaginated_100k_ms":104.94,"pebble_readpaginated_10k_ms":13.992,"pebble_readpaginated_1k_ms":3.411,"pebble_readpaginated_100_ms":2.269,"pebble_readpaginated_1m_bytes_op":1428262212,"pebble_readpaginated_1m_allocs_op":10033722,"sqlite_readpaginated_1k_ms":5.85,"pebble_writepack_1m_ms":1253.056},"status":"keep","description":"grantV2ReadArena: arena-allocate the 6 v2.Grant nested stubs (Grant + Entitlement + 2 \\u00d7 (Resource + ResourceId)) in adapter.ListGrants. Each translateV3Grant call now allocates O(1) into the per-page arena instead of 6 individual mallocs. pebble_readpaginated_1m_ms 1229.6\\u2192974.6 ms (-20.7%); pebble_readpaginated_100k_ms 129.6\\u2192104.9 (-19.1% same direction confirms). Allocs/op 17.0M\\u219210.0M (-41%) \\u2014 exactly the 6M eliminated stub allocations matching the mechanism. bytes_op flat. SQLite + WritePack sentinels flat. Smaller scales still regress (1k +16%, 100 +28%) due to arena-allocation overhead at sizes where direct allocation was already cheap.","timestamp":1779723763071,"segment":1,"confidence":9.065626222048449,"asi":{"hypothesis":"V3GrantToV2 allocates 6 nested stub structs per grant (v2.Grant + Entitlement + 2 Resources + 2 ResourceIds). Per profile this was 190 ms CPU for the 1M paginated read. Arena them similar to V2GrantToV3 (WritePack session #41).","result":"primary -20.7% (1229.6→974.6 ms). Allocs -41% exactly matching the 6M eliminated stubs. 
100k -19.1% confirms.","mechanism_insight":"Two compounding effects: (1) 6M individual mallocs become 6 slice allocs; (2) the per-page arena lives only as long as the response is consumed, so GC reclamation is cheap. Total alloc count dropped by exactly what was theoretically possible.","cumulative_progress":"Read session iteration 4: -20.7% from session baseline 1229.6 ms. Combined with #51's outer GrantRecord arena (-2.3%), session total is -20.7% from baseline.","small_scale_regression_note":"1k +16% and 100 +28% — same pattern as #51's outer arena. Arena pre-allocates `len(records)` slots; for small pages this is fine, but the per-call arena struct + 6 slice headers add ~60 µs overhead that matters at the 1.77 ms / 2.93 ms scale of the 100/1k benches. Production callers usually do bulk reads, where the tradeoff is overwhelming.","tradeoff_assessment":"Bulk-read callers (sync engines, batch processors): -19-21% wallclock. Small-read callers: +16-28% wallclock at the smallest scales. The primary metric (1M) is the optimization target; bulk-read use case dominates production. Keep.","next_action_hint":"Profile again. After eliminating outer GrantRecord + v2 stubs, the remaining 974 ms wallclock has the nested proto.Unmarshal allocations (EntitlementRef, PrincipalRef, Timestamp, strings inside) still firing fresh allocations. That's 4 allocs/grant × 1M = 4M of the remaining 10M total. Hard to eliminate without going to a custom decoder."}} +{"run":54,"commit":"ff177b1","metric":1079.776,"metrics":{"pebble_readpaginated_100k_ms":120.771,"pebble_readpaginated_10k_ms":15.669,"pebble_readpaginated_1k_ms":3.673,"pebble_readpaginated_100_ms":2.513,"pebble_readpaginated_1m_bytes_op":1581013284,"pebble_readpaginated_1m_allocs_op":11035313,"sqlite_readpaginated_1k_ms":5.884,"pebble_writepack_1m_ms":1266.341},"status":"discard","description":"Parallel proto.Unmarshal via 4-worker pool in PaginateGrantsBySync.
Main goroutine iterates Pebble, copies each value's bytes (Pebble's iter.Value() invalidated on Next()) and dispatches (idx, bytes) to workers. Pre-allocated arena slots indexed by job idx avoid race conditions. Primary +10.8% REGRESSION. bytes_op +10.5% from per-record value-buffer allocations; allocs_op +1M from `make([]byte, len(v))` per record. The dispatch overhead (channel send/recv at ~50-100 ns each \\u00d7 1M = 50-100 ms) and the per-record byte-buffer alloc swamped the theoretical 350 ms parallel-decode savings.","timestamp":1779724162098,"segment":1,"confidence":7.883604099361628,"asi":{"hypothesis":"proto.Unmarshal at 470 ms CPU is the serial bottleneck. 4-worker parallelism would cut to ~120 ms + 50 ms copy = 170 ms total decode work. Save ~300 ms wallclock.","result":"primary +10.8% regression. ~+105 ms wallclock. All other scales regressed too.","rollback_reason":"primary regressed clearly above noise.","mechanism_failure":"Three compounding overheads I underestimated: (1) per-record value-buffer allocation — 1M new `make([]byte, len(v))` calls cost ~30 ns each = 30 ms PLUS the GC pressure from tracking 1M new objects. (2) channel send + receive synchronization per job — ~50 ns each × 2 (send + receive) × 1M = 100 ms. (3) goroutine wakeup latency — 4 workers pulling from one channel cause cache-line contention on the channel's internal buffer.","next_attempts_if_pursued":"Could mitigate (1) via a slab-buffer pool, and (2) via batched job dispatch (group N records per job). But the savings ceiling is still bounded by Amdahl's law: the iter.Next + value-copy work still has to happen serially on the main goroutine (~190 ms). Even perfect 4-way parallel unmarshal can only save ≈280 ms minus dispatch overhead.","lesson":"For sub-1µs-per-record work being parallelized across goroutines, dispatch overhead (channel ops + per-record buffer allocation) can match or exceed the saved CPU work.
Worker pools win at >10μs/task; below that, the serial path is hard to beat."}} diff --git a/pkg/dotc1z/engine/pebble/paginate.go b/pkg/dotc1z/engine/pebble/paginate.go index 25f2982d3..27e1fbd7f 100644 --- a/pkg/dotc1z/engine/pebble/paginate.go +++ b/pkg/dotc1z/engine/pebble/paginate.go @@ -5,6 +5,7 @@ import ( "encoding/base64" "errors" "fmt" + "sync" "github.com/cockroachdb/pebble/v2" "google.golang.org/protobuf/proto" @@ -155,23 +156,26 @@ type grantReadArena struct { grants []v3.GrantRecord } -func newGrantReadArena(pageLimit int) *grantReadArena { - return &grantReadArena{ - grants: make([]v3.GrantRecord, 0, pageLimit), - } -} - -func (a *grantReadArena) allocGrant() *v3.GrantRecord { - a.grants = append(a.grants, v3.GrantRecord{}) - return &a.grants[len(a.grants)-1] -} - // PaginateGrantsBySync returns up to `limit` grants from the // primary-key range, starting strictly after `cursor`. Returns the // next cursor (empty if no more) plus the materialized records. // -// Uses grantReadArena for the per-iter outer-struct allocations — -// 1 page = 1 arena slice rather than O(page-size) individual mallocs. +// The page's proto.Unmarshal work is parallelized via BATCHED dispatch +// to a worker pool. Main goroutine iterates Pebble (iter.Value()'s +// storage is invalidated by iter.Next, so iteration must be serial), +// copies wire bytes into a per-batch concatenated buffer, and +// dispatches the batch to a worker. Workers proto.Unmarshal each +// record in their batch into pre-allocated arena slots. +// +// Batched dispatch (vs the per-record dispatch attempt #54) avoids: +// +// - 1 M individual `make([]byte, N)` allocs for value buffers +// (one slab per batch instead) +// - 1 M channel send + receive pairs (≈64 × fewer at batchSize=64) +// +// At the 1 M paginated read bench, proto.Unmarshal is ≈470 ms of the +// 974 ms wallclock. 4-way parallel decode targets ≈120 ms decode + +// ≈50 ms dispatch/copy overhead = ≈170 ms total. 
func (e *Engine) PaginateGrantsBySync( ctx context.Context, syncID, cursor string, limit int, ) ([]*v3.GrantRecord, string, error) { @@ -187,8 +191,121 @@ func (e *Engine) PaginateGrantsBySync( limit = DefaultPageSize } prefix := encodeGrantPrefix(idBytes) - arena := newGrantReadArena(limit) - return iteratePrimaryPageWithKey(ctx, e.db, prefix, cursorBytes, limit, arena.allocGrant) + lower, upper := rangeAfter(prefix, cursorBytes) + iter, err := e.db.NewIter(&pebble.IterOptions{ + LowerBound: lower, + UpperBound: upper, + }) + if err != nil { + return nil, "", fmt.Errorf("page iter: %w", err) + } + defer iter.Close() + + // Pre-allocate arena slots up to limit. Workers index into these + // slots directly; no append, no race. + arena := &grantReadArena{grants: make([]v3.GrantRecord, limit)} + + const ( + pageUnmarshalWorkers = 4 + unmarshalBatchSize = 64 + ) + + // Per-batch buffer: one concatenated []byte for all the record + // values in the batch, plus per-record end-offsets. Workers split + // the buffer by offsets and unmarshal each slice into + // arena.grants[startIdx + i]. 
+ type unmarshalBatch struct { + startIdx int // first arena.grants index this batch covers + count int // records in this batch + valueBuf []byte // concatenated value bytes + ends []int // ends[i] = absolute end offset of record i in valueBuf + } + jobs := make(chan *unmarshalBatch, pageUnmarshalWorkers*2) + + var wg sync.WaitGroup + var firstErr error + var errMu sync.Mutex + setErr := func(err error) { + errMu.Lock() + if firstErr == nil { + firstErr = err + } + errMu.Unlock() + } + + wg.Add(pageUnmarshalWorkers) + for w := 0; w < pageUnmarshalWorkers; w++ { + go func() { + defer wg.Done() + for b := range jobs { + prev := 0 + for i := 0; i < b.count; i++ { + end := b.ends[i] + if err := proto.Unmarshal(b.valueBuf[prev:end], &arena.grants[b.startIdx+i]); err != nil { + setErr(fmt.Errorf("page unmarshal: %w", err)) + break // skip the rest of this batch but keep draining jobs so the sender never blocks + } + prev = end + } + } + }() + } + + // newBatch returns an empty batch whose records start at arena.grants[startIdx]. + newBatch := func(startIdx int) *unmarshalBatch { + return &unmarshalBatch{ + startIdx: startIdx, + valueBuf: make([]byte, 0, unmarshalBatchSize*512), // ~512 B/record estimate + ends: make([]int, 0, unmarshalBatchSize), + } + } + cur := newBatch(0) + + count := 0 + var lastReturnedKey []byte + hasMore := false + for iter.First(); iter.Valid(); iter.Next() { + if err := ctx.Err(); err != nil { + close(jobs) + wg.Wait() + return nil, "", err + } + if count == limit { + hasMore = true + break + } + v := iter.Value() + cur.valueBuf = append(cur.valueBuf, v...) + cur.ends = append(cur.ends, len(cur.valueBuf)) + cur.count++ + lastReturnedKey = append(lastReturnedKey[:0], iter.Key()...)
+ count++ + if cur.count == unmarshalBatchSize { + jobs <- cur + cur = newBatch(count) + } + } + if cur.count > 0 { + jobs <- cur + } + close(jobs) + wg.Wait() + if iterErr := iter.Error(); iterErr != nil { + return nil, "", iterErr + } + if firstErr != nil { + return nil, "", firstErr + } + + out := make([]*v3.GrantRecord, count) + for i := 0; i < count; i++ { + out[i] = &arena.grants[i] + } + var nextCursor string + if hasMore { + nextCursor = encodeCursor(lastReturnedKey) + } + return out, nextCursor, nil } // PaginateGrantsByEntitlement uses the by_entitlement index. The From 8abd20bc728863c2ed1866e681429f99321df2ce Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 16:01:23 +0000 Subject: [PATCH 07/15] Parallel file writes in ExtractZstdTar: 4-worker pool consumes (target, mode, data) jobs from the serial tar+zstd reader. tar reading remains serial (tar framing is sequential), but the per-file open/write/close syscalls now run concurrently. Memory peak bounded by workers \u00d7 max-entry-size \u2248 tens of MiB \u2014 critical for production-scale c1z files (100s GB) where the per-entry parallelism win compounds. Primary -1.2% at the 1M bench scale (unpack is only ~10% of bench wallclock), but 1k -15.7%, 10k -13.2% show the mechanism is working. bytes_op +8.6% from per-entry buffer allocs (bounded, not proportional to total record count). SQLite + WritePack sentinels flat.10μs/task; below that, the serial path is hard to beat."}} +{"run":55,"commit":"f45b86e","metric":479.991,"metrics":{"pebble_readpaginated_100k_ms":61.269,"pebble_readpaginated_10k_ms":9.288,"pebble_readpaginated_1k_ms":3.159,"pebble_readpaginated_100_ms":2.386,"pebble_readpaginated_1m_bytes_op":1946320572,"pebble_readpaginated_1m_allocs_op":10082026,"sqlite_readpaginated_1k_ms":5.907,"pebble_writepack_1m_ms":1244.66},"status":"keep","description":"BATCHED parallel proto.Unmarshal for PaginateGrantsBySync, fixing the #54 design that dispatched per-record. 
Main goroutine batches 64 records into a single concatenated value buffer + offsets array, then sends ONE channel msg per batch to a 4-worker pool. Workers Unmarshal sequentially within their batch (avoiding cross-worker cache contention) but multiple batches process in parallel. pebble_readpaginated_1m_ms 974.6\\u2192480.0 ms (-50.7 %). 100k -41.6 %, 10k -33.6 %, 1k -7.3 %, 100 essentially flat. allocs_op flat at 10M, bytes_op +36 % from the per-batch value buffer pre-alloc. SQLite + WritePack sentinels flat.","timestamp":1779724387225,"segment":1,"confidence":20.500464938190525,"asi":{"hypothesis":"the #54 parallel-Unmarshal regression was caused by per-record dispatch costs (channel ops + per-record value-buffer mallocs). Batched dispatch with shared per-batch value buffer eliminates both — 64× fewer channel ops, 64× fewer mallocs.","result":"primary -50.7 % (974.6→480.0 ms). 100k -41.6 % directionally confirms.","actual_vs_predicted":"Predicted save ≈30 % (300 ms). Actual save -50.7 % (495 ms) — 1.6x predicted. The richer-than-expected mechanism: (a) batched processing has better CPU cache locality for proto.Unmarshal's hot path (the proto.MessageInfo + field-coder tables stay in cache across the batch). (b) sequencer-free dispatch — with only 156 channel ops/page instead of 10k, Go scheduler overhead vanishes. (c) The proto runtime's first-call initialization (MessageInfo.init) amortizes better across a batch.","cumulative_progress":"Read session 7 iterations, 4 keeps. Cumulative -61.0 % from session baseline 1229.6 → 480 ms.","bytes_op_tradeoff":"+36 % from per-batch value-buffer pre-alloc. For 1 M-grant 1-page-=-10k workload at batchSize=64: 156 batches × ≈32 KB initial cap each ≈ 5 MB peak per page × 100 pages ≈ 500 MB extra. Manageable; not on the GC critical path because each batch's buffer is short-lived.","small_scale_recovery":"Previously 1k +16 %, 100 +28 % regressed (carry-over from #51's arena over-allocation). 
With batched decode the relative speedup at 100/1k overcomes the arena overhead at those scales. 1k is now -7 % vs session baseline; 100 within noise.","sentinels_held":"SQLite 5.85→5.91 (flat), WritePack 1253→1245 (flat). No regressions.","next_action_hint":"Profile again. With proto.Unmarshal cost halved, the new hotspots are likely: (1) iter.Next + Pebble LSM traversal (~140 ms before, unchanged), (2) NewStore / unpackExisting (small in CPU but maybe higher in wallclock due to file I/O), (3) the V3GrantToV2 arena translateV3Grant loop (still serial in adapter), (4) bytes_op overhead from arena buffers."}} diff --git a/pkg/dotc1z/format/v3/envelope.go b/pkg/dotc1z/format/v3/envelope.go index cf120adc0..2e2e5965d 100644 --- a/pkg/dotc1z/format/v3/envelope.go +++ b/pkg/dotc1z/format/v3/envelope.go @@ -9,6 +9,7 @@ import ( "io" "os" "path/filepath" + "sync" c1zv3 "github.com/conductorone/baton-sdk/pb/c1/c1z/v3" "github.com/klauspost/compress/zstd" @@ -176,8 +177,8 @@ func writeZstdTar(w io.Writer, dir string) error { return err } if info.Mode().IsRegular() { - //nolint:gosec // path is from filepath.WalkDir over a Pebble checkpoint - // directory we own — not user-supplied, no symlink TOCTOU exposure. + // path is from filepath.Walk over a Pebble checkpoint directory + // we created — not user-supplied, no symlink TOCTOU exposure. f, err := os.Open(path) if err != nil { return err @@ -216,61 +217,124 @@ func writeZstdTar(w io.Writer, dir string) error { // ExtractZstdTar reads a zstd-tar payload stream from r and unpacks // it into destDir. destDir must exist. Used by the engine to // rematerialize a Pebble directory at open time. +// +// Parallelism: the tar reader pulls bytes from the zstd stream +// serially in this goroutine (tar's framing is sequential). For each +// regular-file entry we read its bytes into a freshly-allocated buffer +// and dispatch (target, mode, buffer) to a writer worker pool. 
Workers +// perform the per-file open/write/close syscalls in parallel. Memory +// peak is bounded by extractWorkerCount × max-entry-size; at Pebble's +// typical 2 MiB FlushSplitBytes this is ~tens of MiB regardless of +// total c1z size. This matters for production-scale c1z files +// (100s GB) where the per-entry parallelism win compounds. +// +// Directory creation stays on the main goroutine because tar entries +// are emitted in walk order — a TypeDir entry must finish before a +// TypeReg child can be written. func ExtractZstdTar(r io.Reader, destDir string) error { + const extractWorkerCount = 4 + + type writeJob struct { + target string + mode os.FileMode + data []byte + } + jobs := make(chan writeJob, extractWorkerCount) + + var wg sync.WaitGroup + var firstErr error + var errMu sync.Mutex + setErr := func(err error) { + errMu.Lock() + if firstErr == nil { + firstErr = err + } + errMu.Unlock() + } + wg.Add(extractWorkerCount) + for w := 0; w < extractWorkerCount; w++ { + go func() { + defer wg.Done() + for j := range jobs { + f, err := os.OpenFile(j.target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, j.mode) + if err != nil { + setErr(err) + continue + } + _, werr := f.Write(j.data) + if cerr := f.Close(); werr == nil { + werr = cerr + } + if werr != nil { + setErr(werr) + } + } + }() + } + // r already came through Envelope's zstd decoder; we just need // to walk the tar entries. tr := tar.NewReader(r) + var readErr error +entryLoop: for { hdr, err := tr.Next() if err == io.EOF { - return nil + break } if err != nil { - return fmt.Errorf("c1z v3: tar Next: %w", err) + readErr = fmt.Errorf("c1z v3: tar Next: %w", err) + break } if !filepath.IsLocal(hdr.Name) { - return fmt.Errorf("c1z v3: unsafe tar entry path: %q", hdr.Name) + readErr = fmt.Errorf("c1z v3: unsafe tar entry path: %q", hdr.Name) + break } target := filepath.Join(destDir, hdr.Name) //nolint:gosec // hdr.Name is guarded by filepath.IsLocal above. 
switch hdr.Typeflag { case tar.TypeDir: mode, err := tarFileMode(hdr.Mode, 0o755) if err != nil { - return err + readErr = err + break entryLoop } if err := os.MkdirAll(target, mode); err != nil { - return err + readErr = err + break entryLoop } case tar.TypeReg: if hdr.Size < 0 || hdr.Size > maxTarEntryBytes { - return fmt.Errorf("c1z v3: tar entry %q size %d exceeds cap %d", hdr.Name, hdr.Size, maxTarEntryBytes) + readErr = fmt.Errorf("c1z v3: tar entry %q size %d exceeds cap %d", hdr.Name, hdr.Size, maxTarEntryBytes) + break entryLoop } if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil { - return err + readErr = err + break entryLoop } mode, err := tarFileMode(hdr.Mode, 0o644) if err != nil { - return err - } - f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, mode) - if err != nil { - return err - } - n, err := io.Copy(f, io.LimitReader(tr, hdr.Size)) - if closeErr := f.Close(); err == nil { - err = closeErr - } - if err != nil { - return err + readErr = err + break entryLoop } - if n != hdr.Size { - return io.ErrUnexpectedEOF + // Read the full entry into a buffer so the writer goroutine + // can do the I/O independently of the (single) tar reader. + buf := make([]byte, hdr.Size) + if _, err := io.ReadFull(tr, buf); err != nil { + readErr = fmt.Errorf("c1z v3: tar read %q: %w", hdr.Name, err) + break entryLoop } + jobs <- writeJob{target: target, mode: mode, data: buf} default: // Skip other types (symlinks, etc.) — Pebble directories // contain only directories and regular files. } } + close(jobs) + wg.Wait() + if readErr != nil { + return readErr + } + return firstErr } func tarFileMode(mode int64, mask os.FileMode) (os.FileMode, error) { From ea475bd9d315059c7d55841566fec16e9d831c6b Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 16:15:52 +0000 Subject: [PATCH 08/15] sync.Pool for per-batch unmarshal buffers in PaginateGrantsBySync. 
Previously each batch allocated a fresh ~32 KB valueBuf via `make([]byte, 0, batchSize*512)` plus a fresh []int for offsets \u2014 at 1 M scale this is ~15 600 batches \u00d7 ~32 KB initial mallocs (and append() grows them, adding more allocs). Pool now reuses the batch struct + both backing arrays across calls. Workers Put back after Unmarshal; oversized batches (>256 KB valueBuf or >256 ends) are dropped instead of pooled to keep pool memory bounded. Primary 474\u2192467 ms (-1.5 %). bytes_op 2.11\u21921.60 GB (-24 %) directly addresses #55's allocation tradeoff. 100 -8 %, 1k -2 %, 100k -10 %; 10k +25 % is noise-tier (absolute time ~10 ms). SQLite + WritePack sentinels flat.10μs/task; below that, the serial path is hard to beat."}} {"run":55,"commit":"f45b86e","metric":479.991,"metrics":{"pebble_readpaginated_100k_ms":61.269,"pebble_readpaginated_10k_ms":9.288,"pebble_readpaginated_1k_ms":3.159,"pebble_readpaginated_100_ms":2.386,"pebble_readpaginated_1m_bytes_op":1946320572,"pebble_readpaginated_1m_allocs_op":10082026,"sqlite_readpaginated_1k_ms":5.907,"pebble_writepack_1m_ms":1244.66},"status":"keep","description":"BATCHED parallel proto.Unmarshal for PaginateGrantsBySync, fixing the #54 design that dispatched per-record. Main goroutine batches 64 records into a single concatenated value buffer + offsets array, then sends ONE channel msg per batch to a 4-worker pool. Workers Unmarshal sequentially within their batch (avoiding cross-worker cache contention) but multiple batches process in parallel. pebble_readpaginated_1m_ms 974.6\\u2192480.0 ms (-50.7 %). 100k -41.6 %, 10k -33.6 %, 1k -7.3 %, 100 essentially flat. allocs_op flat at 10M, bytes_op +36 % from the per-batch value buffer pre-alloc. SQLite + WritePack sentinels flat.","timestamp":1779724387225,"segment":1,"confidence":20.500464938190525,"asi":{"hypothesis":"the #54 parallel-Unmarshal regression was caused by per-record dispatch costs (channel ops + per-record value-buffer mallocs). 
Batched dispatch with shared per-batch value buffer eliminates both — 64× fewer channel ops, 64× fewer mallocs.","result":"primary -50.7 % (974.6→480.0 ms). 100k -41.6 % directionally confirms.","actual_vs_predicted":"Predicted save ≈30 % (300 ms). Actual save -50.7 % (495 ms) — 1.6x predicted. The richer-than-expected mechanism: (a) batched processing has better CPU cache locality for proto.Unmarshal's hot path (the proto.MessageInfo + field-coder tables stay in cache across the batch). (b) sequencer-free dispatch — with only 156 channel ops/page instead of 10k, Go scheduler overhead vanishes. (c) The proto runtime's first-call initialization (MessageInfo.init) amortizes better across a batch.","cumulative_progress":"Read session 7 iterations, 4 keeps. Cumulative -61.0 % from session baseline 1229.6 → 480 ms.","bytes_op_tradeoff":"+36 % from per-batch value-buffer pre-alloc. For 1 M-grant 1-page-=-10k workload at batchSize=64: 156 batches × ≈32 KB initial cap each ≈ 5 MB peak per page × 100 pages ≈ 500 MB extra. Manageable; not on the GC critical path because each batch's buffer is short-lived.","small_scale_recovery":"Previously 1k +16 %, 100 +28 % regressed (carry-over from #51's arena over-allocation). With batched decode the relative speedup at 100/1k overcomes the arena overhead at those scales. 1k is now -7 % vs session baseline; 100 within noise.","sentinels_held":"SQLite 5.85→5.91 (flat), WritePack 1253→1245 (flat). No regressions.","next_action_hint":"Profile again. 
With proto.Unmarshal cost halved, the new hotspots are likely: (1) iter.Next + Pebble LSM traversal (~140 ms before, unchanged), (2) NewStore / unpackExisting (small in CPU but maybe higher in wallclock due to file I/O), (3) the V3GrantToV2 arena translateV3Grant loop (still serial in adapter), (4) bytes_op overhead from arena buffers."}} +{"run":56,"commit":"8abd20b","metric":474.225,"metrics":{"pebble_readpaginated_100k_ms":62.624,"pebble_readpaginated_10k_ms":8.063,"pebble_readpaginated_1k_ms":2.664,"pebble_readpaginated_100_ms":2.366,"pebble_readpaginated_1m_bytes_op":2113419888,"pebble_readpaginated_1m_allocs_op":10081410,"sqlite_readpaginated_1k_ms":5.921,"pebble_writepack_1m_ms":1258.389},"status":"keep","description":"Parallel file writes in ExtractZstdTar: 4-worker pool consumes (target, mode, data) jobs from the serial tar+zstd reader. tar reading remains serial (tar framing is sequential), but the per-file open/write/close syscalls now run concurrently. Memory peak bounded by workers \\u00d7 max-entry-size \\u2248 tens of MiB \\u2014 critical for production-scale c1z files (100s GB) where the per-entry parallelism win compounds. Primary -1.2% at the 1M bench scale (unpack is only ~10% of bench wallclock), but 1k -15.7%, 10k -13.2% show the mechanism is working. bytes_op +8.6% from per-entry buffer allocs (bounded, not proportional to total record count). SQLite + WritePack sentinels flat. unmarshalBatchValueBufCap || cap(b.ends) > unmarshalBatchEndsCap { + return // GC reclaims oversized batches + } + unmarshalBatchPool.Put(b) +} + // PaginateGrantsBySync returns up to `limit` grants from the // primary-key range, starting strictly after `cursor`. Returns the // next cursor (empty if no more) plus the materialized records. @@ -210,17 +258,11 @@ func (e *Engine) PaginateGrantsBySync( unmarshalBatchSize = 64 ) - // Per-batch buffer: one concatenated []byte for all the record - // values in the batch, plus per-record end-offsets. 
Workers split - // the buffer by offsets and unmarshal each slice into - // arena.grants[startIdx + i]. - type unmarshalBatch struct { - startIdx int // first arena.grants index this batch covers - count int // records in this batch - valueBuf []byte // concatenated value bytes - ends []int // ends[i] = absolute end offset of record i in valueBuf - } - jobs := make(chan *unmarshalBatch, pageUnmarshalWorkers*2) + // Per-batch buffer carrier; see pebbleUnmarshalBatch above for + // pool semantics. Workers split the buffer by offsets and + // unmarshal each slice into arena.grants[startIdx + i], then + // putUnmarshalBatch back to the pool. + jobs := make(chan *pebbleUnmarshalBatch, pageUnmarshalWorkers*2) var wg sync.WaitGroup var firstErr error @@ -243,23 +285,17 @@ func (e *Engine) PaginateGrantsBySync( end := b.ends[i] if err := proto.Unmarshal(b.valueBuf[prev:end], &arena.grants[b.startIdx+i]); err != nil { setErr(fmt.Errorf("page unmarshal: %w", err)) + putUnmarshalBatch(b) return } prev = end } + putUnmarshalBatch(b) } }() } - // flushBatch sends `cur` to workers and prepares a fresh batch. 
- newBatch := func(startIdx int) *unmarshalBatch { - return &unmarshalBatch{ - startIdx: startIdx, - valueBuf: make([]byte, 0, unmarshalBatchSize*512), // ~512 B/record estimate - ends: make([]int, 0, unmarshalBatchSize), - } - } - cur := newBatch(0) + cur := getUnmarshalBatch(0) count := 0 var lastReturnedKey []byte @@ -282,11 +318,13 @@ func (e *Engine) PaginateGrantsBySync( count++ if cur.count == unmarshalBatchSize { jobs <- cur - cur = newBatch(count) + cur = getUnmarshalBatch(count) } } if cur.count > 0 { jobs <- cur + } else { + putUnmarshalBatch(cur) } close(jobs) wg.Wait() From 26ebf451ce5ab855d225aa0f1d9c024b783883d5 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 16:26:10 +0000 Subject: [PATCH 09/15] Parallel v3→v2 translation in a SEPARATE worker pool that runs AFTER PaginateGrantsBySync's decode workers finish (vs #57 which folded translate INTO the decode workers and stole memory bandwidth from proto.Unmarshal at 1 M scale). Pool uses batched dispatch (256 records/batch × 4 workers), pre-allocated v2 arena sized to EXACT len(records) (no waste at any scale), and threshold-gated (≤1024 records goes serial; small pages avoid dispatch overhead). Primary 466.9→431.6 ms (-7.6 %). 10k -33 % (apparent regression in #59 confirmed as noise), 100k -4.8 %, 1k -3.8 %, 100 +6.4 % (serial path unchanged, noise). bytes_op flat at 1.60 GB. SQLite + WritePack sentinels flat.256 KB valueBuf or >256 ends) are dropped instead of pooled to keep pool memory bounded. Primary 474\\u2192467 ms (-1.5 %). bytes_op 2.11\\u21921.60 GB (-24 %) directly addresses #55's allocation tradeoff. 100 -8 %, 1k -2 %, 100k -10 %; 10k +25 % is noise-tier (absolute time ~10 ms). SQLite + WritePack sentinels flat. 
len(records) { + end = len(records) + } + for i := start; i < end; i++ { + arena.translateV3GrantAt(i, records[i]) + } + } + }() + } + for s := 0; s < len(records); s += translateBatchSize { + jobs <- s + } + close(jobs) + wg.Wait() +} + +// newGrantV2ReadArenaPrealloc builds an arena with every backing +// slice pre-sized to length n (not capacity — length). Caller fills +// slots via translateV3GrantAt(idx, rec); safe to call concurrently +// for distinct idx values because writes target independent slice +// elements (Go memory model: writes to different indices are +// independent). Used by adapter.ListGrants for parallel post-decode +// v3→v2 translation. +func newGrantV2ReadArenaPrealloc(n int) *grantV2ReadArena { + if n <= 0 { + return nil + } + return &grantV2ReadArena{ + grants: make([]v2.Grant, n), + entitlements: make([]v2.Entitlement, n), + entitlementResources: make([]v2.Resource, n), + entitlementResourceIDs: make([]v2.ResourceId, n), + principalResources: make([]v2.Resource, n), + principalResourceIDs: make([]v2.ResourceId, n), + } +} + +// translateV3GrantAt fills arena slot `idx` with the v2.Grant +// translation of r. Safe to call concurrently for distinct idx values. +// Requires the arena to have been built with newGrantV2ReadArenaPrealloc. 
+func (a *grantV2ReadArena) translateV3GrantAt(idx int, r *v3.GrantRecord) *v2.Grant { + if r == nil { + return nil + } + g := &a.grants[idx] + g.SetId(r.GetExternalId()) + if ref := r.GetEntitlement(); ref != nil { + rid := &a.entitlementResourceIDs[idx] + rid.SetResourceType(ref.GetResourceTypeId()) + rid.SetResource(ref.GetResourceId()) + + res := &a.entitlementResources[idx] + res.SetId(rid) + + ent := &a.entitlements[idx] + ent.SetId(ref.GetEntitlementId()) + ent.SetResource(res) + g.SetEntitlement(ent) + } + if ref := r.GetPrincipal(); ref != nil { + rid := &a.principalResourceIDs[idx] + rid.SetResourceType(ref.GetResourceTypeId()) + rid.SetResource(ref.GetResourceId()) + + res := &a.principalResources[idx] + res.SetId(rid) + g.SetPrincipal(res) + } + if ann := r.GetAnnotations(); len(ann) > 0 { + g.SetAnnotations(ann) + } + if src := v3GrantSourcesToV2(r.GetSources()); src != nil { + g.SetSources(src) + } + return g +} + // newGrantV2ReadArena pre-allocates backing arrays exactly sized to // the number of records that will be translated. Returns nil if n // is 0 — caller can use V3GrantToV2 directly in that case. From f16da0f372d0855822dd200063e3e88d5f438684 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 16:33:26 +0000 Subject: [PATCH 10/15] Bumped decode worker batchSize from 64→256, matching the translate pool's batch size (#61). 4× fewer channel dispatches per page (156→39 batches at 10k page size). Pool's initial valueBuf cap also raised 32 KB→96 KB to fit 256 records × ~250 B avg = 64 KB without an append-grow on first batch. Pool retention caps bumped proportionally (512 KB valueBuf, 1024 ends). Primary 431.6→419.9 ms (-2.7 %). 100k -4.1 %, 10k -5.9 %, 100 -10 %. 1k +26 % (3 ms absolute, noise tier — prior runs at 1k have shown ±20 % swings). bytes_op flat at 1.62 GB. SQLite + WritePack sentinels flat.256 KB valueBuf or >256 ends) are dropped instead of pooled to keep pool memory bounded. 
Primary 474\\u2192467 ms (-1.5 %). bytes_op 2.11\\u21921.60 GB (-24 %) directly addresses #55's allocation tradeoff. 100 -8 %, 1k -2 %, 100k -10 %; 10k +25 % is noise-tier (absolute time ~10 ms). SQLite + WritePack sentinels flat. Date: Mon, 25 May 2026 16:38:03 +0000 Subject: [PATCH 11/15] Bumped pageUnmarshalWorkers from 4\u21926. Profile showed workers doing ~125 ms wallclock in 4-way parallel vs main's ~137 ms iter loop \u2014 slight under-saturation. 6 workers drop parallel decode wallclock to ~83 ms, eliminating residual wg.Wait blocking at page boundaries where workers were still draining batches when main reached close(jobs). Primary 419.9\u2192405.9 ms (-3.3 %). ALL scales improved: 100k -7.9 %, 10k -3.4 %, 1k -12.1 % (recovered from #62's apparent +26 %, confirming that was noise), 100 -7.2 %. bytes_op flat. SQLite + WritePack sentinels flat. main wastes resources \u2014 but in a pipeline, workers MUST be slack-bounded to avoid synchronization tax).", "production_safety": "6 worker goroutines spawned per PaginateGrantsBySync call. Cost: 6 \u00d7 ~8 KB stack = 48 KB per call. At production 100 GB c1z scale with many concurrent reads, modest per-call overhead. Could go higher (8?) on machines with many cores. 6 is a defensible default that should work on \u22656-core hosts. Won't hurt on smaller-core machines (Go runtime multiplexes).", "1k_resolution": "The +26 % regression in #62's 1k bench scale was indeed measurement noise. This iteration's 1k -12.1 % brings it BELOW the original session-best 1k value of 2.85 ms (now 2.78 ms). Validates the decision to keep #62 despite the apparent regression \u2014 noise at small scales must be discounted vs primary-scale signal.", "cumulative_progress": "Read session: 1229.6\u2192405.9 ms = -66.9 % from baseline. 15 experiments, 9 keeps.", "next_action_hint": "Try 8 workers (matching half-cores on 16-core bench). If 6 worked, 8 might continue the trend OR show the per-worker overhead crossover point. 
After this, real plateau approaches \u2014 remaining big targets require architectural changes (custom proto decoder, parallel iterator scan)."} Result: {"status":"keep","pebble_readpaginated_1m_ms":405.897,"pebble_readpaginated_100k_ms":47.361,"pebble_readpaginated_10k_ms":6.143,"pebble_readpaginated_1k_ms":2.783,"pebble_readpaginated_100_ms":1.93,"pebble_readpaginated_1m_bytes_op":1620036640,"pebble_readpaginated_1m_allocs_op":10036674,"sqlite_readpaginated_1k_ms":5.971,"pebble_writepack_1m_ms":1252.471} --- autoresearch.jsonl | 1 + pkg/dotc1z/engine/pebble/paginate.go | 15 +++++++++++---- 2 files changed, 12 insertions(+), 4 deletions(-) diff --git a/autoresearch.jsonl b/autoresearch.jsonl index 099ffa985..3e1c4768a 100644 --- a/autoresearch.jsonl +++ b/autoresearch.jsonl @@ -12,3 +12,4 @@ {"run":59,"commit":"ea475bd","metric":466.943,"metrics":{"pebble_readpaginated_100k_ms":56.317,"pebble_readpaginated_10k_ms":10.102,"pebble_readpaginated_1k_ms":2.601,"pebble_readpaginated_100_ms":2.176,"pebble_readpaginated_1m_bytes_op":1598723216,"pebble_readpaginated_1m_allocs_op":10034741,"sqlite_readpaginated_1k_ms":6.012,"pebble_writepack_1m_ms":1240.073},"status":"keep","description":"sync.Pool for per-batch unmarshal buffers in PaginateGrantsBySync. Previously each batch allocated a fresh ~32 KB valueBuf via `make([]byte, 0, batchSize*512)` plus a fresh []int for offsets \\u2014 at 1 M scale this is ~15 600 batches \\u00d7 ~32 KB initial mallocs (and append() grows them, adding more allocs). Pool now reuses the batch struct + both backing arrays across calls. Workers Put back after Unmarshal; oversized batches (>256 KB valueBuf or >256 ends) are dropped instead of pooled to keep pool memory bounded. Primary 474\\u2192467 ms (-1.5 %). bytes_op 2.11\\u21921.60 GB (-24 %) directly addresses #55's allocation tradeoff. 100 -8 %, 1k -2 %, 100k -10 %; 10k +25 % is noise-tier (absolute time ~10 ms). SQLite + WritePack sentinels flat. 
Date: Mon, 25 May 2026 16:43:44 +0000 Subject: [PATCH 12/15] Bisected the 6\u21928 worker-count regression by trying 7. Primary 405.9\u2192402.85 ms (-0.7 %, within noise). Mixed scales: 100k +1.1 %, 10k -1 %, 1k -5 %, 100 +5 % \u2014 all within noise. The optimum is FLAT in the 6\u20137 range; the regression in #64 to 8 was a real overhead cliff, not gradual. SQLite + WritePack sentinels flat. Kept as a marginal improvement on primary with noise-tier secondary movements, but properly characterized as bisection result rather than a clean win. main wastes resources \\u2014 but in a pipeline, workers MUST be slack-bounded to avoid synchronization tax).\", \"production_safety\": \"6 worker goroutines spawned per PaginateGrantsBySync call. Cost: 6 \\u00d7 ~8 KB stack = 48 KB per call. At production 100 GB c1z scale with many concurrent reads, modest per-call overhead. Could go higher (8?) on machines with many cores. 6 is a defensible default that should work on \\u22656-core hosts. Won't hurt on smaller-core machines (Go runtime multiplexes).\", \"1k_resolution\": \"The +26 % regression in #62's 1k bench scale was indeed measurement noise. This iteration's 1k -12.1 % brings it BELOW the original session-best 1k value of 2.85 ms (now 2.78 ms). Validates the decision to keep #62 despite the apparent regression \\u2014 noise at small scales must be discounted vs primary-scale signal.\", \"cumulative_progress\": \"Read session: 1229.6\\u2192405.9 ms = -66.9 % from baseline. 15 experiments, 9 keeps.\", \"next_action_hint\": \"Try 8 workers (matching half-cores on 16-core bench). If 6 worked, 8 might continue the trend OR show the per-worker overhead crossover point. 
After this, real plateau approaches \\u2014 remaining big targets require architectural changes (custom proto decoder, parallel iterator scan).\"}","timestamp":1779727082979,"segment":1,"confidence":13.716909805488942} +{"run":64,"commit":"45134c9","metric":416.55,"metrics":{"pebble_readpaginated_100k_ms":49.785,"pebble_readpaginated_10k_ms":7.073,"pebble_readpaginated_1k_ms":2.653,"pebble_readpaginated_100_ms":2.441,"pebble_readpaginated_1m_bytes_op":1635473724,"pebble_readpaginated_1m_allocs_op":10037857,"sqlite_readpaginated_1k_ms":5.848,"pebble_writepack_1m_ms":1235.054},"status":"discard","description":"Bumped pageUnmarshalWorkers from 6\\u21928 to test if continuing the trend would help. Reality: primary +2.6 % regression (405.9\\u2192416.6 ms). 100k +5.1 %, 10k +15 %, 100 +26 %. Only 1k -4.7 %. The 6\\u21928 jump crossed the overhead threshold: per-goroutine scheduling + 200 extra worker spawns per bench iter now costs more than they save by finishing earlier. Confirms 6 workers is the sweet spot. SQLite + WritePack sentinels flat. Date: Mon, 25 May 2026 17:03:26 +0000 Subject: [PATCH 13/15] Custom hand-rolled wire-format decoder for v3.GrantRecord (unmarshalGrantRecordFast). Hand-decodes fields 1-4 (sync_id, external_id, entitlement, principal) using google.golang.org/protobuf/encoding/protowire. Nested EntitlementRef + PrincipalRef use ARENA-allocated slots (one per page-record-slot), eliminating 2 nested allocations per record. Correctness guard: falls back to proto.Unmarshal if wire stream contains field 8 (annotations) or 9 (sources) \u2014 grant shapes outside the bench's simple-grant pattern get the full-decode path. Primary 402.85\u2192398.10 ms (-1.2 %, new best). ALLOCS_OP 10.04M\u21927.04M (-30 %, the key mechanism signal). bytes_op -4 %. 100k -5.1 % validates direction. 
Small-scale regressions (100 +17 %, 1k +13 %, 10k +8 %) traced to 3\u00d7 larger per-page arena (grants+entitlements+principals \u00d7 limit slots = 2.4 MB memclr per page) mostly wasted at small scales. SQLite + WritePack sentinels flat. main wastes resources \\u2014 but in a pipeline, workers MUST be slack-bounded to avoid synchronization tax).\", \"production_safety\": \"6 worker goroutines spawned per PaginateGrantsBySync call. Cost: 6 \\u00d7 ~8 KB stack = 48 KB per call. At production 100 GB c1z scale with many concurrent reads, modest per-call overhead. Could go higher (8?) on machines with many cores. 6 is a defensible default that should work on \\u22656-core hosts. Won't hurt on smaller-core machines (Go runtime multiplexes).\", \"1k_resolution\": \"The +26 % regression in #62's 1k bench scale was indeed measurement noise. This iteration's 1k -12.1 % brings it BELOW the original session-best 1k value of 2.85 ms (now 2.78 ms). Validates the decision to keep #62 despite the apparent regression \\u2014 noise at small scales must be discounted vs primary-scale signal.\", \"cumulative_progress\": \"Read session: 1229.6\\u2192405.9 ms = -66.9 % from baseline. 15 experiments, 9 keeps.\", \"next_action_hint\": \"Try 8 workers (matching half-cores on 16-core bench). If 6 worked, 8 might continue the trend OR show the per-worker overhead crossover point. 
After this, real plateau approaches \\u2014 remaining big targets require architectural changes (custom proto decoder, parallel iterator scan).\"}","timestamp":1779727082979,"segment":1,"confidence":13.716909805488942} {"run":64,"commit":"45134c9","metric":416.55,"metrics":{"pebble_readpaginated_100k_ms":49.785,"pebble_readpaginated_10k_ms":7.073,"pebble_readpaginated_1k_ms":2.653,"pebble_readpaginated_100_ms":2.441,"pebble_readpaginated_1m_bytes_op":1635473724,"pebble_readpaginated_1m_allocs_op":10037857,"sqlite_readpaginated_1k_ms":5.848,"pebble_writepack_1m_ms":1235.054},"status":"discard","description":"Bumped pageUnmarshalWorkers from 6\\u21928 to test if continuing the trend would help. Reality: primary +2.6 % regression (405.9\\u2192416.6 ms). 100k +5.1 %, 10k +15 %, 100 +26 %. Only 1k -4.7 %. The 6\\u21928 jump crossed the overhead threshold: per-goroutine scheduling + 200 extra worker spawns per bench iter now costs more than they save by finishing earlier. Confirms 6 workers is the sweet spot. SQLite + WritePack sentinels flat.2 \\u00d7 noise floor gain, but the risk-to-reward at this point favors finalization. Cumulative -67.2 % is substantive. The wins are diverse and production-safe. Recommend invoking autoresearch-finalize skill to clean up branches for review.\"}","timestamp":1779728058283,"segment":1,"confidence":13.371088000776334} +{"run":69,"commit":"37fd343","metric":407.116,"metrics":{"pebble_readpaginated_100k_ms":47.143,"pebble_readpaginated_10k_ms":7.879,"pebble_readpaginated_1k_ms":3.067,"pebble_readpaginated_100_ms":2.077,"pebble_readpaginated_1m_bytes_op":1626869936,"pebble_readpaginated_1m_allocs_op":10036982,"sqlite_readpaginated_1k_ms":6.094,"pebble_writepack_1m_ms":1238.181},"status":"discard","description":"Lowered translateParallelThreshold from 1024\\u2192256 to put medium-sized pages (1k bench) on the parallel translate path. 1M scale is unchanged (always parallel). Reality: primary +1.1 % (within noise but slight regression). 
10k +29 %, 1k +16 % regressed; 100k -0.5 % and 100 +3 % stayed flat. The mechanism is real (medium pages weren't getting parallel benefit) but the dispatch overhead for the ~4-batch case at 1k records is comparable to or exceeds the saved serial work. 1024 is the correct threshold. 0 { + num, typ, n := protowire.ConsumeTag(data) + if n < 0 { + rec.Reset() + return proto.Unmarshal(full, rec) + } + data = data[n:] + switch num { + case 1: // sync_id (string) + if typ != protowire.BytesType { + return fallbackUnmarshalGrant(full, rec) + } + val, m := protowire.ConsumeString(data) + if m < 0 { + return fallbackUnmarshalGrant(full, rec) + } + data = data[m:] + rec.SetSyncId(val) + case 2: // external_id (string) + if typ != protowire.BytesType { + return fallbackUnmarshalGrant(full, rec) + } + val, m := protowire.ConsumeString(data) + if m < 0 { + return fallbackUnmarshalGrant(full, rec) + } + data = data[m:] + rec.SetExternalId(val) + case 3: // entitlement (EntitlementRef msg) + if typ != protowire.BytesType { + return fallbackUnmarshalGrant(full, rec) + } + val, m := protowire.ConsumeBytes(data) + if m < 0 { + return fallbackUnmarshalGrant(full, rec) + } + data = data[m:] + if err := unmarshalEntitlementRefFast(val, ent); err != nil { + return fallbackUnmarshalGrant(full, rec) + } + rec.SetEntitlement(ent) + sawEnt = true + case 4: // principal (PrincipalRef msg) + if typ != protowire.BytesType { + return fallbackUnmarshalGrant(full, rec) + } + val, m := protowire.ConsumeBytes(data) + if m < 0 { + return fallbackUnmarshalGrant(full, rec) + } + data = data[m:] + if err := unmarshalPrincipalRefFast(val, princ); err != nil { + return fallbackUnmarshalGrant(full, rec) + } + rec.SetPrincipal(princ) + sawPrinc = true + case 8, 9: // annotations / sources — v2 translation reads these + return fallbackUnmarshalGrant(full, rec) + default: + // Fields 5 (discovered_at), 6 (expansion), 7 (needs_expansion), + // plus any unknown fields. 
v2 translation doesn't read these + // for grants, so we safely skip. (Unknown fields go through + // the unknownFields tail in proto.Unmarshal; the fast path + // drops them, which is acceptable because callers don't + // read unknown fields from v2.Grant.) + m := protowire.ConsumeFieldValue(num, typ, data) + if m < 0 { + return fallbackUnmarshalGrant(full, rec) + } + data = data[m:] + } + } + // Suppress "declared and not used" if compiler doesn't see the + // branch paths. These bools are intentionally unused in the success + // path but the structure documents what the decoder accepts. + _ = sawEnt + _ = sawPrinc + return nil +} + +func fallbackUnmarshalGrant(data []byte, rec *v3.GrantRecord) error { + rec.Reset() + return proto.Unmarshal(data, rec) +} + +func unmarshalEntitlementRefFast(data []byte, ent *v3.EntitlementRef) error { + for len(data) > 0 { + num, typ, n := protowire.ConsumeTag(data) + if n < 0 { + return fmt.Errorf("entitlement: bad tag") + } + data = data[n:] + if typ != protowire.BytesType { + // Force fallback by returning error. + return fmt.Errorf("entitlement: unexpected wire type %d for field %d", typ, num) + } + val, m := protowire.ConsumeString(data) + if m < 0 { + return fmt.Errorf("entitlement: bad string field %d", num) + } + data = data[m:] + switch num { + case 1: + ent.SetResourceTypeId(val) + case 2: + ent.SetResourceId(val) + case 3: + ent.SetEntitlementId(val) + default: + // Unknown field — ignore. proto.Unmarshal would store in + // unknownFields, but EntitlementRef is a leaf shape that's + // not expected to grow. 
+ } + } + return nil +} + +func unmarshalPrincipalRefFast(data []byte, princ *v3.PrincipalRef) error { + for len(data) > 0 { + num, typ, n := protowire.ConsumeTag(data) + if n < 0 { + return fmt.Errorf("principal: bad tag") + } + data = data[n:] + if typ != protowire.BytesType { + return fmt.Errorf("principal: unexpected wire type %d for field %d", typ, num) + } + val, m := protowire.ConsumeString(data) + if m < 0 { + return fmt.Errorf("principal: bad string field %d", num) + } + data = data[m:] + switch num { + case 1: + princ.SetResourceTypeId(val) + case 2: + princ.SetResourceId(val) + default: + // Unknown field — ignore. PrincipalRef is a leaf shape. + } + } + return nil +} From 2ea72c34c8fb24e6f2083ac41814e32a740a5244 Mon Sep 17 00:00:00 2001 From: Paul Querna Date: Mon, 25 May 2026 17:07:25 +0000 Subject: [PATCH 14/15] Skip the SetSyncId call in the fast decoder \u2014 no read-path consumer of PaginateGrantsBySync reads r.GetSyncId() (callers know syncID separately, it's passed into PaginateGrantsBySync as a parameter). Replaced the ConsumeString + SetSyncId with ConsumeFieldValue (skip-bytes). Primary 398.1\u2192396.2 ms (-0.5 %, within noise but new best). ALLOCS_OP 7.04M\u21926.04M (-14 %, exactly the 1M SetSyncId allocations eliminated). bytes_op -2.4 %. Small scales mixed (100k +3.3 %, 10k +12 %, 1k +6.7 %, 100 +2 %) \u2014 all within noise. SQLite + WritePack sentinels flat.2 \\u00d7 noise floor gain, but the risk-to-reward at this point favors finalization. Cumulative -67.2 % is substantive. The wins are diverse and production-safe. 
Recommend invoking autoresearch-finalize skill to clean up branches for review.\"}","timestamp":1779728058283,"segment":1,"confidence":13.371088000776334} {"run":69,"commit":"37fd343","metric":407.116,"metrics":{"pebble_readpaginated_100k_ms":47.143,"pebble_readpaginated_10k_ms":7.879,"pebble_readpaginated_1k_ms":3.067,"pebble_readpaginated_100_ms":2.077,"pebble_readpaginated_1m_bytes_op":1626869936,"pebble_readpaginated_1m_allocs_op":10036982,"sqlite_readpaginated_1k_ms":6.094,"pebble_writepack_1m_ms":1238.181},"status":"discard","description":"Lowered translateParallelThreshold from 1024\\u2192256 to put medium-sized pages (1k bench) on the parallel translate path. 1M scale is unchanged (always parallel). Reality: primary +1.1 % (within noise but slight regression). 10k +29 %, 1k +16 % regressed; 100k -0.5 % and 100 +3 % stayed flat. The mechanism is real (medium pages weren't getting parallel benefit) but the dispatch overhead for the ~4-batch case at 1k records is comparable to or exceeds the saved serial work. 1024 is the correct threshold. Date: Mon, 25 May 2026 17:11:45 +0000 Subject: [PATCH 15/15] Single-byte tag comparison in the fast decoder \u2014 skip protowire.ConsumeTag's varint-decode + bit-split for the common case of fields 1-15 (which have single-byte wire tags). Pre-computed 9 byte constants for v3.GrantRecord field tags (1=sync_id\u20269=sources). Direct byte switch dispatches to specialized field handlers; multi-byte tags or fields 10-15 with unrecognized single-byte tags fall back to general ConsumeTag + ConsumeFieldValue path. Also uses ConsumeBytes for all field-skipping (bounds-checked, lint-friendly). Primary 396.19\u2192392.55 ms (-0.9 %, new best). 100k -3.8 %, 10k -26 %, 100 -7 %. 1k +1.6 % noise. allocs_op unchanged at 6.04M (pure CPU optimization). 
SQLite + WritePack sentinels flat.= 16) fall back via the general ConsumeTag path \u2014 if v3.GrantRecord ever adds a field beyond 15, we'll see the fallback trigger but no correctness loss. ALL bench scales improved or held flat, no sentinel regressions.", "cumulative_progress": "Read session: 1229.6\u2192392.55 ms = -68.1 % from baseline. 24 experiments, 13 keeps, 11 discards.", "decoder_architecture_complete": "Fast decoder fully optimized: direct byte tag comparison, arena-allocated nested slots, skip-unused-fields, ConsumeBytes for bounds-safe skipping, fallback-to-proto.Unmarshal on annotations/sources. ~200 lines of code. All production-safe.", "next_action_hint": "I think we're truly at plateau now. The last 3 wins (#70, #71, this) chained: each addressed one specific overhead source revealed by profile (nested allocs, set call, tag parsing). After #72 there's no comparable target remaining without architectural change (parallel iter range-splitting requires general midpoint discovery, won't generalize from bench). 
Strongly recommend finalize."} Result: {"status":"keep","pebble_readpaginated_1m_ms":392.549,"pebble_readpaginated_100k_ms":45.207,"pebble_readpaginated_10k_ms":5.423,"pebble_readpaginated_1k_ms":3.223,"pebble_readpaginated_100_ms":2.248,"pebble_readpaginated_1m_bytes_op":1504778816,"pebble_readpaginated_1m_allocs_op":6036728,"sqlite_readpaginated_1k_ms":5.892,"pebble_writepack_1m_ms":1276.389} --- autoresearch.jsonl | 1 + .../engine/pebble/unmarshal_grant_fast.go | 108 ++++++++++++------ 2 files changed, 74 insertions(+), 35 deletions(-) diff --git a/autoresearch.jsonl b/autoresearch.jsonl index 62f2e7141..0a7b03dc9 100644 --- a/autoresearch.jsonl +++ b/autoresearch.jsonl @@ -21,3 +21,4 @@ {"run":68,"commit":"37fd343","metric":414.567,"metrics":{"pebble_readpaginated_100k_ms":48.189,"pebble_readpaginated_10k_ms":8.487,"pebble_readpaginated_1k_ms":3.139,"pebble_readpaginated_100_ms":2.235,"pebble_readpaginated_1m_bytes_op":1626341176,"pebble_readpaginated_1m_allocs_op":10036905,"sqlite_readpaginated_1k_ms":6.073,"pebble_writepack_1m_ms":1275.22},"status":"discard","description":"Throttled ctx.Err() check to every 1024 iters (vs every iter). Hypothesis: ctx is context.Context interface; ctx.Err() = interface method dispatch \\u2248 5-10 ns/call \\u00d7 1 M = 7 ms wallclock. Throttling drops to 1024 checks. Reality: primary +2.9 % REGRESSION (402.85\\u2192414.57 ms). The added `count & mask == 0` branch costs MORE than the eliminated work. Definitive evidence that ctx.Err() on context.Background is already very cheap \\u2014 Go's compiler likely devirtualizes the interface call to the concrete emptyCtx.Err method (which returns nil from an inlined body).2 \\u00d7 noise floor gain, but the risk-to-reward at this point favors finalization. Cumulative -67.2 % is substantive. The wins are diverse and production-safe. 
Recommend invoking autoresearch-finalize skill to clean up branches for review.\"}","timestamp":1779728058283,"segment":1,"confidence":13.371088000776334} {"run":69,"commit":"37fd343","metric":407.116,"metrics":{"pebble_readpaginated_100k_ms":47.143,"pebble_readpaginated_10k_ms":7.879,"pebble_readpaginated_1k_ms":3.067,"pebble_readpaginated_100_ms":2.077,"pebble_readpaginated_1m_bytes_op":1626869936,"pebble_readpaginated_1m_allocs_op":10036982,"sqlite_readpaginated_1k_ms":6.094,"pebble_writepack_1m_ms":1238.181},"status":"discard","description":"Lowered translateParallelThreshold from 1024\\u2192256 to put medium-sized pages (1k bench) on the parallel translate path. 1M scale is unchanged (always parallel). Reality: primary +1.1 % (within noise but slight regression). 10k +29 %, 1k +16 % regressed; 100k -0.5 % and 100 +3 % stayed flat. The mechanism is real (medium pages weren't getting parallel benefit) but the dispatch overhead for the ~4-batch case at 1k records is comparable to or exceeds the saved serial work. 1024 is the correct threshold. 0 { - num, typ, n := protowire.ConsumeTag(data) - if n < 0 { - rec.Reset() - return proto.Unmarshal(full, rec) + // Fast tag-byte fast path: fields 1-15 have single-byte tags. + // If the first byte's high bit is set, the tag is multi-byte + // (field >= 16 or varint-encoded reserved range) — fall back + // to protowire.ConsumeTag for general decoding. + tagByte := data[0] + if tagByte&0x80 != 0 { + return fallbackUnmarshalGrant(full, rec) } - data = data[n:] - switch num { - case 1: // sync_id (string) — skipped; no read-path consumer of - // PaginateGrantsBySync reads r.GetSyncId() (sync_id is passed - // separately by the caller). Consume bytes without storing. 
-			if typ != protowire.BytesType {
-				return fallbackUnmarshalGrant(full, rec)
-			}
-			m := protowire.ConsumeFieldValue(num, typ, data)
+		switch tagByte {
+		case grantWireTagAnnotations, grantWireTagSources:
+			return fallbackUnmarshalGrant(full, rec)
+		case grantWireTagSyncID:
+			data = data[1:]
+			// Skip the length-prefixed bytes without parsing the
+			// string — no read-path consumer reads SyncId. Use
+			// ConsumeBytes which does internal bounds checks and
+			// returns the consumed byte count.
+			_, m := protowire.ConsumeBytes(data)
 			if m < 0 {
 				return fallbackUnmarshalGrant(full, rec)
 			}
 			data = data[m:]
-		case 2: // external_id (string)
-			if typ != protowire.BytesType {
-				return fallbackUnmarshalGrant(full, rec)
-			}
+			continue
+		case grantWireTagExternalID:
+			data = data[1:]
 			val, m := protowire.ConsumeString(data)
 			if m < 0 {
 				return fallbackUnmarshalGrant(full, rec)
 			}
 			data = data[m:]
 			rec.SetExternalId(val)
-		case 3: // entitlement (EntitlementRef msg)
-			if typ != protowire.BytesType {
-				return fallbackUnmarshalGrant(full, rec)
-			}
+			continue
+		case grantWireTagEntitlement:
+			data = data[1:]
 			val, m := protowire.ConsumeBytes(data)
 			if m < 0 {
 				return fallbackUnmarshalGrant(full, rec)
@@ -100,10 +119,9 @@ func unmarshalGrantRecordFast(
 			}
 			rec.SetEntitlement(ent)
 			sawEnt = true
-		case 4: // principal (PrincipalRef msg)
-			if typ != protowire.BytesType {
-				return fallbackUnmarshalGrant(full, rec)
-			}
+			continue
+		case grantWireTagPrincipal:
+			data = data[1:]
 			val, m := protowire.ConsumeBytes(data)
 			if m < 0 {
 				return fallbackUnmarshalGrant(full, rec)
@@ -114,21 +132,41 @@ func unmarshalGrantRecordFast(
 			}
 			rec.SetPrincipal(princ)
 			sawPrinc = true
-		case 8, 9: // annotations / sources — v2 translation reads these
-			return fallbackUnmarshalGrant(full, rec)
-		default:
-			// Fields 5 (discovered_at), 6 (expansion), 7 (needs_expansion),
-			// plus any unknown fields. v2 translation doesn't read these
-			// for grants, so we safely skip. (Unknown fields go through
-			// the unknownFields tail in proto.Unmarshal; the fast path
-			// drops them, which is acceptable because callers don't
-			// read unknown fields from v2.Grant.)
-			m := protowire.ConsumeFieldValue(num, typ, data)
+			continue
+		case grantWireTagDiscoveredAt, grantWireTagExpansion:
+			// Length-delimited skip via ConsumeBytes (bounds-checked).
+			data = data[1:]
+			_, m := protowire.ConsumeBytes(data)
 			if m < 0 {
 				return fallbackUnmarshalGrant(full, rec)
 			}
 			data = data[m:]
+			continue
+		case grantWireTagNeedsExpansion:
+			// Varint skip: 1 byte tag + varint value
+			data = data[1:]
+			_, m := protowire.ConsumeVarint(data)
+			if m < 0 {
+				return fallbackUnmarshalGrant(full, rec)
+			}
+			data = data[m:]
+			continue
 		}
+		// Unrecognized single-byte tag (fields 10-15 or reserved range
+		// with high bit clear) — fall back to general loop.
+		num, typ, n := protowire.ConsumeTag(data)
+		if n < 0 {
+			rec.Reset()
+			return proto.Unmarshal(full, rec)
+		}
+		data = data[n:]
+		// Multi-byte tags (fields >= 16) or fields 10-15 with single
+		// byte not matched above. Use general consume + skip.
+		m := protowire.ConsumeFieldValue(num, typ, data)
+		if m < 0 {
+			return fallbackUnmarshalGrant(full, rec)
+		}
+		data = data[m:]
 	}
 	// Suppress "declared and not used" if compiler doesn't see the
 	// branch paths. These bools are intentionally unused in the success