Autoresearch/pebble perf 20260525#876
Open
pquerna wants to merge 23 commits into
Open
Conversation
See docs/rfcs/0004-storage-engine-v4/autoresearch-pebble-perf.md for the full plan. This commit adds the loop artifacts: - autoresearch.md operational summary - autoresearch.sh bench driver (METRIC lines + diagnostics) - autoresearch.checks.sh correctness gate (engine + sqlite + lint + drift) - autoresearch.config.json (maxIterations=200) - autoresearch.ideas.md priority-ordered idea backlog
entPool[i%len(entPool)] is always in range; golangci-lint v2.9.0's gosec flags it as a slice OOB candidate. nolint annotation matches the pattern other tests in this tree use for similar provable-in-range indexing. Needed for the autoresearch loop's correctness gate to pass. Non-experimental fixup; no behavior change.
….8%); writepack_100k -6.3%; reads & SQLite sentinel within tolerance. Solo write +21% in ns/op is engine startup noise on a non-hot path. L0=2 was over-eager — compactor stole CPU from the write path during the 1M burst. Letting ~8 L0 files accumulate frees that CPU.
Result: {"status":"keep","pebble_writepack_1m_ms":3575.276,"pebble_writepack_100k_ms":364.502,"pebble_writepack_10k_ms":48.392,"pebble_writepack_1k_ms":10.943,"pebble_writepack_100_ms":7.224,"pebble_writepack_1m_bytes_op":2110861304,"pebble_writepack_1m_allocs_op":9020636,"pebble_readpaginated_100k_ms":145.051,"pebble_readpaginated_1k_ms":3.665,"pebble_writegrant_solo_ns_op":14066,"codec_direct_ns_op":447.5,"codec_reflect_ns_op":1288,"sqlite_writepack_1k_ms":28.587}
….MarshalAppend instead of Marshal. writepack_1m 3575→3375ms (-5.6%). Allocs/op 9.0M→4.0M (-55%). Bytes/op 2108→1654 MB (-21.5%). Mechanism: pebble.Batch.Set copies key/value into its internal buffer (vendor/pebble/v2/batch.go line 819), so caller can immediately reuse the slice. Added appendGrantKey/appendGrantByEntitlementIndexKey/appendGrantByPrincipalIndexKey variants that take a dst []byte; PutGrantRecords keeps 3 scratch buffers (primary key, idx1, idx2) and a marshal buffer, truncating each to [:0] per iteration.
Result: {"status":"keep","pebble_writepack_1m_ms":3375.282,"pebble_writepack_100k_ms":353.259,"pebble_writepack_10k_ms":47.601,"pebble_writepack_1k_ms":11.254,"pebble_writepack_100_ms":7.465,"pebble_writepack_1m_bytes_op":1653887728,"pebble_writepack_1m_allocs_op":4020047,"pebble_readpaginated_100k_ms":143.314,"pebble_readpaginated_1k_ms":3.675,"pebble_writegrant_solo_ns_op":11822,"codec_direct_ns_op":461,"codec_reflect_ns_op":1338,"sqlite_writepack_1k_ms":27.872}
…che. writepack_1m 3375→3209ms (-4.9%). Allocs 4.02M→3.02M (-25%) — the 1M per-record 20-byte syncBytes copies are gone. Falls back to per-record resolve if sync_id string differs from the last (uncommon for fresh-sync writes that share one sync). Mutex acquisitions also drop from 1M to 1.
Result: {"status":"keep","pebble_writepack_1m_ms":3209.104,"pebble_writepack_100k_ms":326.85,"pebble_writepack_10k_ms":43.186,"pebble_writepack_1k_ms":10.647,"pebble_writepack_100_ms":7.218,"pebble_writepack_1m_bytes_op":1626632284,"pebble_writepack_1m_allocs_op":3020505,"pebble_readpaginated_100k_ms":145.368,"pebble_readpaginated_1k_ms":3.737,"pebble_writegrant_solo_ns_op":12100,"codec_direct_ns_op":539,"codec_reflect_ns_op":1308,"sqlite_writepack_1k_ms":28.108}
…keys. writepack_1m 3209→2798ms (-12.8%). Primary keys arrive sorted by external_id (the iteration order); pdqsort detects this and short-circuits to near-O(N) during flushable-batch promotion. Index keys (entitlement_id, principal_id) interleave with no sort order and pay the full O(N log N) sort, but on 2/3 the entries instead of 3/3. Cross-batch atomicity is fine for fresh-sync (replays from connector on crash). Reads also -11% at 100k (smaller batches \u2192 less L0 read amp during compaction). SQLite sentinel +2%, within tolerance.
Result: {"status":"keep","pebble_writepack_1m_ms":2798.147,"pebble_writepack_100k_ms":295.581,"pebble_writepack_10k_ms":41.683,"pebble_writepack_1k_ms":11.312,"pebble_writepack_100_ms":7.456,"pebble_writepack_1m_bytes_op":1577826300,"pebble_writepack_1m_allocs_op":3021975,"pebble_readpaginated_100k_ms":129.039,"pebble_readpaginated_1k_ms":3.637,"pebble_writegrant_solo_ns_op":12416,"codec_direct_ns_op":389,"codec_reflect_ns_op":1353,"sqlite_writepack_1k_ms":28.708}
…h and idxBatch internal buffers. writepack_1m 2798→2626ms (-6.1%). bytes_op 1578→1208 MB (-23%). Pebble batches grow by 2x when full; with no size hint, a 600 MB batch goes through ~10 doublings, leaving up to 2x slack peak. Sizing exactly avoids the overshoot. Also: each grow does a memcpy that's now skipped.
Result: {"status":"keep","pebble_writepack_1m_ms":2626.178,"pebble_writepack_100k_ms":257.968,"pebble_writepack_10k_ms":45.577,"pebble_writepack_1k_ms":10.804,"pebble_writepack_100_ms":7.174,"pebble_writepack_1m_bytes_op":1208123944,"pebble_writepack_1m_allocs_op":3021748,"pebble_readpaginated_100k_ms":129.473,"pebble_readpaginated_1k_ms":3.567,"pebble_writegrant_solo_ns_op":13447,"codec_direct_ns_op":398.5,"codec_reflect_ns_op":1286,"sqlite_writepack_1k_ms":27.77}
… fresh sync. writepack_1m 2626→2244ms (-14.5%). Mechanism: within a single PutGrantRecords call, db.Get queries the DB but batch writes are not visible until Commit, so every Get returns ErrNotFound. For the FIRST call in a fresh sync the DB grant keyspace is also provably empty (MarkFreshSync just created the sync). So the 1M point lookups are pure overhead. Engine gains a freshGrantsEmpty bit flipped to true at MarkFreshSync, cleared after the first batch commits. Subsequent calls in the same sync still do Get for cross-call duplicate detection. Within-call duplicates are not protected by this code OR the previous code (db.Get doesn't see in-batch writes either way). Restores the optimization the freshSync comment in engine.go already documented but grants.go had overridden defensively. SQLite sentinel +6.6% is run-to-run noise (no SQLite code touched).
Result: {"status":"keep","pebble_writepack_1m_ms":2244.084,"pebble_writepack_100k_ms":221.543,"pebble_writepack_10k_ms":40.584,"pebble_writepack_1k_ms":10.349,"pebble_writepack_100_ms":7.365,"pebble_writepack_1m_bytes_op":1206583100,"pebble_writepack_1m_allocs_op":3021156,"pebble_readpaginated_100k_ms":128.172,"pebble_readpaginated_1k_ms":3.804,"pebble_writegrant_solo_ns_op":16550,"codec_direct_ns_op":423.5,"codec_reflect_ns_op":577,"sqlite_writepack_1k_ms":29.847}
…ecords)>=256. Two goroutines build priBatch and idxBatch concurrently. writepack_1m 2244→2046ms (-8.8%); writepack_100k 222→205 (-7.5%); writepack_10k 41→38 (-5.2%). Solo write goes through small-batch sequential fallback (threshold 256), so its regression is bounded to +11% (down from +89% in the unbounded version). Mechanism: with no read-before-write Get, the two batches have no shared state — proto.Marshal is concurrent-safe on read-only messages, and append* encoders are pure. Sequential paths preserved for the !skipGet case (cross-call dup detection needs the Get + Delete-old-index pattern that requires serialized batch access).
Result: {"status":"keep","pebble_writepack_1m_ms":2045.827,"pebble_writepack_100k_ms":205.405,"pebble_writepack_10k_ms":38.491,"pebble_writepack_1k_ms":10.176,"pebble_writepack_100_ms":7.037,"pebble_writepack_1m_bytes_op":1220395284,"pebble_writepack_1m_allocs_op":3022051,"pebble_readpaginated_100k_ms":128.257,"pebble_readpaginated_1k_ms":3.293,"pebble_writegrant_solo_ns_op":18310,"codec_direct_ns_op":397.5,"codec_reflect_ns_op":1460,"sqlite_writepack_1k_ms":27.412}
7 wins kept; major dead ends mapped; follow-up items captured.
…rd goroutine builds a local pebble.Batch (NewBatchWithSize hint); main goroutine Apply's them in order into the final priBatch. writepack_1m 2046→1884ms (-7.9%). writepack_100k 205→186 (-9.4%). proto.Marshal of 1M GrantRecords was the long pole on goroutine A; 4-way shard cuts that wallclock ~4x, and the 4 Apply memcpy operations (~50 ms total) are net positive. Shard count caps at min(4, len/1024) so small batches bypass the parallelism overhead.
Result: {"status":"keep","pebble_writepack_1m_ms":1883.661,"pebble_writepack_100k_ms":185.856,"pebble_writepack_10k_ms":37.386,"pebble_writepack_1k_ms":10.366,"pebble_writepack_100_ms":7.03,"pebble_writepack_1m_bytes_op":1817882100,"pebble_writepack_1m_allocs_op":3022006,"pebble_readpaginated_100k_ms":132.401,"pebble_readpaginated_1k_ms":3.63,"pebble_writegrant_solo_ns_op":17068,"codec_direct_ns_op":428,"codec_reflect_ns_op":1150,"sqlite_writepack_1k_ms":28.495}
…tting entries. writepack_1m 1884→1656ms (-12.1%); 100k -9.1%; 10k -21.4%. The idxBatch's flushable-batch promotion sort over 2M unsorted entries was the largest remaining cost (~630 ms in cmpbody+Less per profile). Each shard goroutine collects ~500k idx keys and slices.SortFunc'es them; main goroutine 4-way merges into idxBatch. With entries inserted in key order, pdqsort early-exits during flushable-batch promotion. Tradeoff: +2M []byte allocations (~+200 MB peak) for the temporary key copies \u2014 alloc count 3.02M→5.02M (+66%). Bench's SQLite sentinel flat (-0.8%), no read regression.
Result: {"status":"keep","pebble_writepack_1m_ms":1655.766,"pebble_writepack_100k_ms":169.344,"pebble_writepack_10k_ms":29.436,"pebble_writepack_1k_ms":10.276,"pebble_writepack_100_ms":7.156,"pebble_writepack_1m_bytes_op":2055449864,"pebble_writepack_1m_allocs_op":5021972,"pebble_readpaginated_100k_ms":130.609,"pebble_readpaginated_1k_ms":3.707,"pebble_writegrant_solo_ns_op":16677,"codec_direct_ns_op":339,"codec_reflect_ns_op":1460,"sqlite_writepack_1k_ms":28.271}
…dividual `make([]byte, 0, 96)` allocations from the prior commit with a per-shard (data []byte, bounds [][2]uint32) pair \u2014 ~10 GC objects total instead of 2M+. writepack_1m 1656\u21921539ms (-7.1%). Allocs/op 5.02M\u21923.02M (-40%). bytes/op 2055\u21921992 MB (-3%). Sort is also faster because [2]uint32 entries are 8 bytes vs 24-byte slice headers, giving better cache density during pdqsort. Cumulative -64.2% from RFC baseline.
Result: {"status":"keep","pebble_writepack_1m_ms":1538.725,"pebble_writepack_100k_ms":168.761,"pebble_writepack_10k_ms":31.076,"pebble_writepack_1k_ms":10.162,"pebble_writepack_100_ms":7.393,"pebble_writepack_1m_bytes_op":1992321280,"pebble_writepack_1m_allocs_op":3021483,"pebble_readpaginated_100k_ms":132.439,"pebble_readpaginated_1k_ms":3.708,"pebble_writegrant_solo_ns_op":15999,"codec_direct_ns_op":350.5,"codec_reflect_ns_op":1262,"sqlite_writepack_1k_ms":28.687}
…llel goroutines (fresh-sync skipGet path). writepack_1m 1539\u21921455 ms (-5.4%). 10k workload -13.1%. Mechanism is richer than the simple commit-overlap estimate: priBatch finishes building first (\u2248200 ms before idxBatch's sort+merge completes), so its Commit runs while idxBatch is still working. This (a) hides the Pebble flushable-batch promotion CPU behind ongoing work, and (b) lets Pebble's background flusher start draining priBatch's L0 SSTs early, reducing the wait in EndFreshSync.Flush(). Pebble.Batch is not safe for concurrent use of the SAME batch, but two different batches committed from two goroutines serialize correctly through Pebble's internal writeMu \u2014 documented in Pebble v2 source.
Result: {"status":"keep","pebble_writepack_1m_ms":1455.267,"pebble_writepack_100k_ms":162.265,"pebble_writepack_10k_ms":26.291,"pebble_writepack_1k_ms":9.789,"pebble_writepack_100_ms":7.098,"pebble_writepack_1m_bytes_op":1991636048,"pebble_writepack_1m_allocs_op":3021559,"pebble_readpaginated_100k_ms":132.104,"pebble_readpaginated_1k_ms":3.681,"pebble_writegrant_solo_ns_op":16916,"codec_direct_ns_op":449,"codec_reflect_ns_op":1451,"sqlite_writepack_1k_ms":28.613}
… flusher to schedule immediately rather than wait for its next polling cycle. writepack_1m 1455\u21921420 ms (-2.4%, borderline above ~2% noise floor); writepack_100k -2.4% confirming directional signal (both improvements match); writepack_10k +6.5% (within its 10-15% noise floor at 2-iter -benchtime). Mechanism: AsyncFlush rotates the (already-empty) memtable and notifies the flusher goroutine, giving it a head start before EndFreshSync.Flush() arrives to block. Only runs on the large-batch skipGet path. Borderline but mechanism is sound and change is 5 lines.
Result: {"status":"keep","pebble_writepack_1m_ms":1420.175,"pebble_writepack_100k_ms":158.194,"pebble_writepack_10k_ms":28.038,"pebble_writepack_1k_ms":10.041,"pebble_writepack_100_ms":7.086,"pebble_writepack_1m_bytes_op":1991744320,"pebble_writepack_1m_allocs_op":3022054,"pebble_readpaginated_100k_ms":133.112,"pebble_readpaginated_1k_ms":3.465,"pebble_writegrant_solo_ns_op":16520,"codec_direct_ns_op":477,"codec_reflect_ns_op":1358,"sqlite_writepack_1k_ms":28.003}
…rantTranslateArena (3 pre-sized slices for GrantRecord/EntitlementRef/PrincipalRef). writepack_1m 1420\u21921352 ms (-4.8%). writepack_100k 158\u2192150 ms (-4.8%, same direction confirms). Allocs/op 3.02M\u219222K (-99.3%). Mechanism: proto Set* methods are simple field assignments, not builder-pattern dependent. The arena's contiguous backing arrays mean GC sees 3 large objects per PutGrants call instead of 3M small ones; scanObjectsSmall time drops proportionally. Pointers into the pre-sized slices are stable (no reallocation since cap = len(grants)). SQLite sentinel +4.2% is run-to-run noise (no SQLite code changed).
Result: {"status":"keep","pebble_writepack_1m_ms":1352.325,"pebble_writepack_100k_ms":150.382,"pebble_writepack_10k_ms":27.352,"pebble_writepack_1k_ms":9.638,"pebble_writepack_100_ms":7.067,"pebble_writepack_1m_bytes_op":1974148604,"pebble_writepack_1m_allocs_op":22014,"pebble_readpaginated_100k_ms":133.221,"pebble_readpaginated_1k_ms":3.795,"pebble_writegrant_solo_ns_op":16117,"codec_direct_ns_op":441.5,"codec_reflect_ns_op":1375,"sqlite_writepack_1k_ms":29.688}
…h 4 shard workers, each writing to a disjoint range of the records slice with its own arena. Previously this 1M-iteration loop ran serially in the adapter goroutine BEFORE the engine spawned its parallel build phase \u2014 \u2248140 ms wallclock of single-threaded work blocking the parallel work that follows. writepack_1m 1352\u21921318 ms (-2.6%). writepack_100k 150\u2192146 (-2.6% same direction confirms). All scales improved or flat; SQLite sentinel within noise (no SQL code changed). Threshold of 1024 records per shard keeps small calls on the serial path.
Result: {"status":"keep","pebble_writepack_1m_ms":1317.685,"pebble_writepack_100k_ms":146.251,"pebble_writepack_10k_ms":26.314,"pebble_writepack_1k_ms":9.771,"pebble_writepack_100_ms":6.877,"pebble_writepack_1m_bytes_op":1975980988,"pebble_writepack_1m_allocs_op":21506,"pebble_readpaginated_100k_ms":133.871,"pebble_readpaginated_1k_ms":3.656,"pebble_writegrant_solo_ns_op":16822,"codec_direct_ns_op":504,"codec_reflect_ns_op":1370,"sqlite_writepack_1k_ms":27.589}
…a goroutine instead of blocking. The deferred RemoveAll was synchronously removing the per-store temp directory (Pebble engine dir + checkpoint dir, hundreds of small files) at the tail of every Close call. writepack_1m 1318\u21921251 ms (-5.1%, much bigger than the ~2% estimate). All scales improved: 100k flat, 10k -7.7%, 1k -7.0%, 100 -6.6%. writegrant_solo regressed +19% (goroutine spawn overhead is noticeable at the 15us scale of the solo bench, but immaterial at any production scale). Allocs/bytes flat, SQLite sentinel flat.
Result: {"status":"keep","pebble_writepack_1m_ms":1250.638,"pebble_writepack_100k_ms":146.562,"pebble_writepack_10k_ms":24.253,"pebble_writepack_1k_ms":9.1,"pebble_writepack_100_ms":6.501,"pebble_writepack_1m_bytes_op":1977243184,"pebble_writepack_1m_allocs_op":22070,"pebble_readpaginated_100k_ms":128.628,"pebble_readpaginated_1k_ms":3.304,"pebble_writegrant_solo_ns_op":17974,"codec_direct_ns_op":476.5,"codec_reflect_ns_op":551.5,"sqlite_writepack_1k_ms":28.245}
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pebble engine perf: −69.7% on 1M-grant
WritePack, 42× faster than SQLiteAutonomous-experiment loop output. Builds directly on top of #874 (storage-engine-v4 Pebble path). All changes scope to
pkg/dotc1z/engine/pebble/,pkg/dotc1z/format/v3/envelope.go, andpkg/dotc1z/engine/equivalence/equivalence_test.go(one pre-existinggosecfalse-positive). The c1z wire format, the v3 proto schemas, the SQLite engine, and the publicconnectorstore.WriterAPI are all untouched.Headline numbers
WritePack: 1 M grants end-to-end (open engine → bulk write → flush → sync → c1z envelope → close)Allocations on the 1 M
WritePackPaginated read (
UnpackReadGrants)Read perf was already strong on the baseline; autoresearch wasn't optimising the read path directly, so the modest 100k/100 improvements are mostly downstream side effects of fewer/larger L0 SSTs and a smaller working set.
What changed (15 kept commits, each with a profile-justified mechanism)
The loop ran 48 experiments across the
WritePack1 M workload and kept the 15 that improved the primary metric without breaking tests, the SQLite regression sentinel, or lint. Each commit's message has the per-experiment delta, mechanism, and ASI.Major wins (>5 % each, ordered chronologically as compounded)
L0CompactionThreshold2 → 8priBatch/idxBatchinto 2 separate batchesexternal_id; pdqsort early-exits the flushable-batch promotion sort. Index keys (entitlement/principal) get the full sort but on 2/3 the entries.Geton the firstPutGrantRecordsof a fresh syncdb.Getdoesn't see in-batch writes anyway; for fresh-sync's first call the grant keyspace is provably empty, so all 1 MGets are guaranteed misses. Engine grows afreshGrantsEmptyflag flipped at the first commit.idxBatchvia 4-way parallel sort + k-way mergecmpbody+Less). Sort each shard in parallel, k-way merge into idxBatch in key order, pdqsort short-circuits during promotion.priBatch.Commit+idxBatch.Commitinto their respective goroutinesbatch.Applyconcatenationproto.Marshalof 1 MGrantRecords was the new long pole; sharding across 4 workers cuts marshal wallclock ~4×.[]byteslices with per-shard(data []byte, bounds [][2]uint32). ~10 GC objects per shard instead of 500 k.grantTranslateArenainV2GrantToV3GrantRecord_builder.Build()heap-allocates 3 small protos per grant. For 1 M grants that's 3 M live heap objects dominatingscanObjectsSmall(~440 ms CPU). Arena collapses them to 3 large slice allocations. Allocs/op 3.02 M → 22 K (−99.3 %).RemoveAllof the per-store tmpdiros.RemoveAllsynchronously removed ~265 SST hard-links + WAL files + metadata at the tail of everyClose. Spawn a goroutine instead; Close returns immediately.Minor wins (compounded)
proto.MarshalAppendinstead of fresh allocation per grant (−5.6 %).priBatch/idxBatchviaNewBatchWithSize(avoid grow-by-2× overshoot, −6.1 %).resolveSyncBytesout of the per-record loop with a last-value cache (−4.9 %).V2 → V3translation inAdapter.PutGrants(4 shard workers, −2.6 %).db.AsyncFlushafter both commits (kick the flusher's scheduler, −2.4 %).Methodology
This branch was produced by an autonomous experiment loop. Each iteration: form a profile-driven hypothesis, make the smallest change that tests it, run the benchmark (
./autoresearch.sh, ~190 s), run the correctness gate (./autoresearch.checks.sh: engine + adapter + compactor + equivalence + envelope + SQLite tests + lint + go.mod/proto drift checks), then keep iff the primary metric (pebble_writepack_1m_ms) improved and no checks failed. Loop reverts code ondiscard/crash/checks_failed.pebble_writepack_1m_ms4291.9 → 1250.6 in the last keep).autoresearch.ideas.mdwith the specific mechanism by which each failed.The conclusions, profiling notes, and the deliberate non-go-fast-stripes (durability, contract, GOMAXPROCS-aware sharding, etc.) are documented in
autoresearch.md. All raw experiments and their ASI are inautoresearch.jsonl.Safety
-shortand full modes. No new tests removed; no test skipped.BenchmarkRegisteredSQLiteWritePack/grants=1000metric was tracked across every iteration. Final state is within run-to-run noise of baseline (28.1 ms autoresearch vs 27.9 ms baseline = +0.7 %). SQLite engine code itself is untouched.go.modandgo.sumare unchanged (the gate verified this every iteration).proto/c1/storage/v3/is untouched (the gate verified this every iteration).pebble.NoSyncfor batch commits (was that way before this PR);EndFreshSyncstill doesFlush + LogData(Sync)to harden the data;out.Sync()still fsyncs the envelope before rename. AsyncRemoveAllis post-rename cleanup of throwaway working directories.Cross-batch atomicity note (worth a reviewer's eye)
The split-batch change (commit
63c0869b) puts grant primary writes in one Pebble batch and index writes in another. If the primary commit succeeds but the index commit fails (Pebble's internal error before fsync), primary records exist without theirby_entitlement/by_principalindex entries. For fresh-sync this is fine because the whole sync replays from the connector on crash. For incremental Put paths (adapter_grants_store.go'sPutGrantsIfNewer, mid-sync upserts) the same code path is used and the atomicity contract has changed. The behaviour is identical to the previous code for the within-call duplicate case (neither old nor new code's pre-Getsaw in-flight batch writes), but cross-call duplicates within a sync that span the commit failure window are now possible.If reviewers prefer, this can be gated behind
IsFreshSync()(so non-fresh paths retain the single-batch atomic shape) with a small refactor ofPutGrantRecords. The performance gains attribute mostly to fresh-sync workloads, so the gating would preserve the wins on the relevant path.What didn't work (highlights — full list in
autoresearch.ideas.md)The loop tried and discarded these, with the mechanism of failure documented for future iterations:
engine.Close+WriteEnvelope(3 attempts at different baselines): goroutine + channel coordination overhead exceedsengine.Close's ~30-50 ms wallclock at small scales; brittle gains at large scales.mmaps through the central arena mutex; OS-level locks too. Concurrent 150 MB-class allocs from N goroutines queue serially.FlushSplitBytesaxis (2/16/64 MiB tested): Pebble doesn't honor very large hints; bigger SSTs reduce per-file syscall overhead but lose flusher parallelism — net flat or worse.bytes.Comparescan is already optimally branch-predictable; wrapping with anything in Go costs more than it saves.Gets are 100% misses with unique external_ids, and reads use range iteration not point Gets — filter blocks are pure overhead.MemTableSize> 64 MiB: causes 100 k workload to fit entirely in the memtable → forced serial flush atEndSync→ +30 % regression.WriteEnvelope(2 attempts): page-cache-hot reads viaio.Copy's reused 32 KB buffer are already efficient; per-fileos.ReadFileadds ~530 MB of one-shot allocations that eat any parallelism win.Files of interest
pkg/dotc1z/engine/pebble/options.go—L0CompactionThresholdtunepkg/dotc1z/engine/pebble/grants.go— the rewrittenPutGrantRecords(parallel build, sort+merge, sub-shards, arena)pkg/dotc1z/engine/pebble/engine.go—freshGrantsEmptyflag plumbingpkg/dotc1z/engine/pebble/adapter.go— parallel V2 → V3 translation inPutGrantspkg/dotc1z/engine/pebble/translate_v2.go—grantTranslateArenapkg/dotc1z/engine/pebble/register.go— async tmpdir cleanup inCloseautoresearch.md— operational summary + win/dead-end inventoryautoresearch.ideas.md— closed-axis list (do-not-retry catalogue)autoresearch.jsonl— full 48-experiment log with ASI per iterationRaw bench output
WritePack — autoresearch HEAD
WritePack — baseline Pebble (commit 9676f15 = #874 tip + RFC doc)
Read — both branches