bc60194
autoresearch: scaffold Pebble read-perf loop (UnpackReadGrants)
pquerna May 25, 2026
90b5ac6
Baseline for the read-perf session, starting from the autoresearch/pe…
pquerna May 25, 2026
c66be93
Outer-only grantReadArena for PaginateGrantsBySync. Collapses per-ite…
pquerna May 25, 2026
2f3619a
grantV2ReadArena: arena-allocate the 6 v2.Grant nested stubs (Grant +…
pquerna May 25, 2026
ff177b1
autoresearch.ideas.md: log read-perf session results (-20.7% in 5 ite…
pquerna May 25, 2026
f45b86e
BATCHED parallel proto.Unmarshal for PaginateGrantsBySync, fixing the…
pquerna May 25, 2026
8abd20b
Parallel file writes in ExtractZstdTar: 4-worker pool consumes (targe…
pquerna May 25, 2026
ea475bd
sync.Pool for per-batch unmarshal buffers in PaginateGrantsBySync. Pr…
pquerna May 25, 2026
26ebf45
Parallel v3→v2 translation in a SEPARATE worker pool that runs A…
pquerna May 25, 2026
f16da0f
Bumped decode worker batchSize from 64→256, matching the transla…
pquerna May 25, 2026
45134c9
Bumped pageUnmarshalWorkers from 4→6. Profile showed workers doi…
pquerna May 25, 2026
37fd343
Bisected the 6→8 worker-count regression by trying 7. Primary 40…
pquerna May 25, 2026
ae048e5
Custom hand-rolled wire-format decoder for v3.GrantRecord (unmarshalG…
pquerna May 25, 2026
2ea72c3
Skip the SetSyncId call in the fast decoder - no read-path consu…
pquerna May 25, 2026
9afeea1
Single-byte tag comparison in the fast decoder - skip protowire.…
pquerna May 25, 2026
160 changes: 98 additions & 62 deletions autoresearch.ideas.md
@@ -1,70 +1,106 @@
# Ideas backlog — Pebble engine perf
# Ideas backlog — Pebble engine read perf

Free-form scratch. Append new ideas as bullets; mark tried ones with
status (kept / discarded / crashed) so we don't repeat them.

## To try (priority order)
**Required reading from the prior session**:
`docs/rfcs/0004-storage-engine-v4/autoresearch-archive/writepack/autoresearch.ideas.md`
contains the do-not-retry catalogue from the WritePack session. Several
of those closed axes likely apply here too:

- [ ] **P1.1** Memtable size 64 MiB → 256 MiB (`MemTableSize` in `options.go`).
- [ ] **P1.2** `L0CompactionThreshold` sweep: 2 → 4, 8.
- [ ] **P1.3** `MaxConcurrentCompactions` upper bound: 8 → 12 (gate on GOMAXPROCS).
- [ ] **P1.4** Enable bloom filters on L0 (FilterPolicy + FilterType).
- [ ] **P1.5** Mixed compression: Snappy at L0, zstd at L6.
- [ ] **P2.6** Per-record-type per-level options (grants vs resources).
- [ ] **P2.7** Codec codegen via `cmd/protoc-gen-batonstore` — replaces reflection path. Big change; may need human approval.
- [ ] **P3.8** Pool tuple encoder buffer (`AppendTupleString`) — kill per-record slice alloc.
- [ ] **P3.9** Larger SST block size (32 KiB → 64 KiB) — amortize header overhead.
- Parallel-large-alloc across goroutines (heap arena serializes — verified 3×).
- Tournament tree / prefix-skip wrappers around `bytes.Compare`
(Go SIMD'd `cmpbody` is faster than the wrapper for k=4 or short skip).
- Naive parallel-then-serial pipelines that don't actually overlap
(e.g. read-all-then-write-all).
- Touching durability semantics for marginal gains.

## Read-path-specific ideas (priority order, profile-confirm before pursuing)

### P1 — likely big wins (untried, large surface)

- **`ExtractZstdTar` parallelism** — single-threaded zstd decode +
tar walk + per-file `os.OpenFile` + `io.Copy` dominates per-iter
cost at large scales. For 1 M-grant `.c1z` of ~500 MB the extraction
is most of the wallclock. Possible:
- `zstd.WithDecoderConcurrency(0)` (untried for reads; tested under
writes #9/#35/#46 and was flat there because the OUTER zstd was
already barely needed over pre-Snappy SST data — for READS we're
decoding the OUTER zstd of fresh tar contents, different problem).
- Parallel tar-entry writes to destination dir (workers consume from
the tar stream's decoded byte ranges, each writes one file).
- Skip the on-disk extraction entirely: stream-decompress the payload
into memory and open Pebble against an in-memory FS, avoiding the
intermediate file write. Pebble supports `vfs.MemFS` via the `FS` option.
- **In-memory Pebble FS for reads** — skip the extract-to-tmpdir step
entirely. Decompress the c1z payload straight into a `vfs.MemFS` and
point Pebble at it. Saves the entire tar-extraction wallclock AND
the subsequent Pebble file-open syscalls (memory-backed FS is much
faster). Big change but potentially huge win.
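
The "parallel tar-entry writes" option above could be sketched roughly as follows. This is a stdlib-only illustration, not the code in `ExtractZstdTar`: `extractParallel`, `produce`, `entry`, and `roundTrip` are hypothetical names, the input is assumed to be an already-decompressed tar stream, and the archive is assumed to be flat (no subdirectories). The tar walk stays on one goroutine because the stream is sequential; only the `os.WriteFile` calls fan out.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// entry is one fully read tar member, handed from the reader to a writer.
type entry struct{ path string; data []byte }

// extractParallel walks the (already decompressed) tar stream on the calling
// goroutine and fans file writes out to `workers` goroutines (workers >= 1).
func extractParallel(r io.Reader, dir string, workers int) error {
	jobs := make(chan entry, workers)
	errs := make(chan error, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for e := range jobs { // keep draining after an error so the producer never blocks
				if err := os.WriteFile(e.path, e.data, 0o600); err != nil {
					select {
					case errs <- err:
					default:
					}
				}
			}
		}()
	}
	readErr := produce(r, dir, jobs)
	wg.Wait()
	close(errs)
	if readErr != nil {
		return readErr
	}
	return <-errs // nil when no writer failed
}

// produce reads regular files from the tar stream and queues them for writing.
func produce(r io.Reader, dir string, jobs chan<- entry) error {
	defer close(jobs)
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return err
		}
		// Assumes a flat archive; Base also blocks "../" path traversal.
		jobs <- entry{path: filepath.Join(dir, filepath.Base(hdr.Name)), data: data}
	}
}

// roundTrip builds an n-file tar in memory, extracts it, and counts the files.
func roundTrip(n, workers int) (int, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for i := 0; i < n; i++ {
		body := fmt.Sprintf("sst-%d", i)
		hdr := &tar.Header{Typeflag: tar.TypeReg, Name: fmt.Sprintf("%06d.sst", i), Mode: 0o600, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return 0, err
		}
		if _, err := io.WriteString(tw, body); err != nil {
			return 0, err
		}
	}
	if err := tw.Close(); err != nil {
		return 0, err
	}
	dir, err := os.MkdirTemp("", "extract")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	if err := extractParallel(&buf, dir, workers); err != nil {
		return 0, err
	}
	files, err := os.ReadDir(dir)
	return len(files), err
}

func main() {
	n, err := roundTrip(8, 4)
	fmt.Println(n, err)
}
```

Note the cost this sketch accepts: every entry is buffered whole with `io.ReadAll`, which is the same one-shot-buffer pattern that hurt the WriteEnvelope attempt recorded under closed axes. Whether it pays off for reads needs a profile.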

### P1 — likely big wins (untried, smaller surface)

- **`V3GrantToV2` arena** — analogous to `V2GrantToV3` from the
WritePack session (#41 there). Each `ListGrants` page hydrates
`len(page)` v2.Grants. 100 pages × 10 k grants each = 1 M grants, and
each one allocates `v2.Grant + Entitlement_stub + Resource_stub + ResourceId`,
i.e. several allocations per grant. Arena to collapse them.
- **Pebble block cache warming** — the 256 MiB block cache starts cold
on each `NewStore`. If the data fits in cache, fully warmed reads are
much faster than cold reads. Warm via a deliberate prefetch read at
Open time, or trade some setup cost for amortized win.

### P2 — moderate

- **Pagination cursor decoding** — `paginate.go` decodes the
`PageToken` on every page boundary. 100 pages for 1 M grants. Cost
per decode probably tiny but compounds.
- **`IterateGrantsBySync` allocations** — per-grant `proto.Unmarshal`
into a fresh `v3.GrantRecord` for every iteration step. Could pool
these (arena-style) but it's a streaming iterator API.
- **`pebble.IterOptions`** — currently we set `LowerBound`/`UpperBound`
for the grant primary keyspace. Could restrict `KeyTypes` to
`IterKeyTypePointsOnly` (no range keys for this iteration), but points-only
is already the default, so nothing to gain.
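
For the `IterateGrantsBySync` idea, one way to reuse records despite the streaming API is a callback-style iterator whose record is only valid for the duration of the callback. The sketch below is an assumption-heavy illustration: `grantRecord`, `decode`, and `iterate` are hypothetical stand-ins, and `decode` parses a toy `id|principal` format in place of `proto.Unmarshal`.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"sync"
)

// grantRecord is a hypothetical stand-in for v3.GrantRecord.
type grantRecord struct {
	ID, Principal []byte
}

// reset truncates the fields but keeps their backing arrays for reuse.
func (r *grantRecord) reset() {
	r.ID = r.ID[:0]
	r.Principal = r.Principal[:0]
}

// decode stands in for proto.Unmarshal; the toy wire format is "id|principal".
func decode(raw []byte, r *grantRecord) error {
	id, principal, ok := bytes.Cut(raw, []byte("|"))
	if !ok {
		return errors.New("malformed record")
	}
	r.ID = append(r.ID, id...)
	r.Principal = append(r.Principal, principal...)
	return nil
}

var recordPool = sync.Pool{New: func() any { return new(grantRecord) }}

// iterate decodes each value into one pooled record and passes it to fn.
// The record is only valid inside fn; callers must copy what they keep.
func iterate(values [][]byte, fn func(*grantRecord) error) error {
	r := recordPool.Get().(*grantRecord)
	defer recordPool.Put(r)
	for _, raw := range values {
		r.reset()
		if err := decode(raw, r); err != nil {
			return err
		}
		if err := fn(r); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	values := [][]byte{[]byte("g1|u1"), []byte("g2|u2")}
	err := iterate(values, func(r *grantRecord) error {
		fmt.Printf("%s -> %s\n", r.ID, r.Principal)
		return nil
	})
	fmt.Println("err:", err)
}
```

The tradeoff is the lifetime contract: any consumer that retains the pointer past the callback would observe later records, so this only fits call sites that copy or consume immediately.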

### P3 — speculative

- **Bloom filters on L0/L1 for reads** — discarded in the write
session (#8) because fresh-sync writes have no Get hits. For reads,
if we did point-Gets, blooms could help — but the read path is a
range scan, not point Gets. Probably still useless. Skip.
- **`pebble.Options.MaxOpenFiles`** — currently 1024. At ~265 L0 SSTs
for a 1M-grant sync this is fine. For larger syncs we might hit the
limit. Not relevant at the bench's current scales.
- **Custom `IterOptions.LowerBound`/`UpperBound` for the by-principal
index** — currently the primary key scan walks `v3|G|sync|...`. If
reads used the by-principal index by default, performance might
differ. Probably not — primary scan is the right answer for full
enumeration. Skip.

## Tried — see jsonl for verdicts

(populated by the loop)

## Follow-up / human review

- Split-batch in PutGrantRecords (commit 63c0869b) breaks cross-batch atomicity:
if priBatch commits but idxBatch fails, primary records exist without
by_entitlement / by_principal index entries. Fresh-sync replays the
whole sync from the connector so it's OK there, but incremental Put
paths (mid-sync upserts) might leak. RFC stack-6 grant expansion path
could be a concrete victim. Consider:
- Apply split only when IsFreshSync() is true; keep one-batch atomic
semantics outside fresh-sync.
- Or: document the contract change.
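
The first option (split only under fresh-sync) can be illustrated with a toy in-memory model. Everything here is hypothetical scaffolding (`store`, `batch`, `putGrant`, `failIndex`), not the engine's API; it only demonstrates the failure mode described above and how gating on a fresh-sync flag keeps the one-batch path atomic.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy stand-in for the engine; failIndex simulates an index-write failure.
type store struct {
	primary, index map[string]string
	failIndex      bool
}

func newStore() *store {
	return &store{primary: map[string]string{}, index: map[string]string{}}
}

// batch stages writes; commit applies all of them or none of them.
type batch struct {
	s        *store
	pri, idx map[string]string
}

func (s *store) newBatch() *batch {
	return &batch{s: s, pri: map[string]string{}, idx: map[string]string{}}
}

func (b *batch) commit() error {
	if len(b.idx) > 0 && b.s.failIndex {
		return errors.New("index write failed")
	}
	for k, v := range b.pri {
		b.s.primary[k] = v
	}
	for k, v := range b.idx {
		b.s.index[k] = v
	}
	return nil
}

// putGrant uses split batches only when freshSync is true (where a failed
// sync is replayed from the connector); otherwise it keeps one atomic batch.
func (s *store) putGrant(id, principal string, freshSync bool) error {
	if !freshSync {
		b := s.newBatch()
		b.pri[id] = principal
		b.idx[principal+"/"+id] = ""
		return b.commit()
	}
	priBatch, idxBatch := s.newBatch(), s.newBatch()
	priBatch.pri[id] = principal
	idxBatch.idx[principal+"/"+id] = ""
	if err := priBatch.commit(); err != nil {
		return err
	}
	return idxBatch.commit() // failure here leaves a primary row without its index entry
}

func main() {
	s := newStore()
	s.failIndex = true
	fmt.Println("atomic:", s.putGrant("g1", "u1", false), len(s.primary), len(s.index))
	fmt.Println("split: ", s.putGrant("g1", "u1", true), len(s.primary), len(s.index))
}
```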

## Closed axes (do NOT retry — multiple attempts confirm dead)

- **Parallel engine.Close + WriteEnvelope** (tried at #19, #28, #45 — three baselines).
Mechanism is theoretically safe (CheckpointTo creates self-contained dir), but
goroutine + channel coordination overhead exceeds the engine.Close wallclock
savings (~30-50 ms). At smaller scales the overhead dominates and regresses
10-15%. Not a clean win at any size.
- **Parallelize large heap allocations across goroutines** (#47 priBatch/idxBatch,
#48 priBatch sub-shards). Three different attempts. Go's heap allocator
serializes large (>32 KB) allocations through the central heap-arena mutex;
OS mmap underneath has kernel-level locks. Concurrent 150 MB-class allocs
from N goroutines queue serially, plus goroutine scheduling adds overhead
proportional to N. Stick to single-goroutine allocation for the big buffers.
- **FlushSplitBytes axis** (tried 2 MiB → 16 MiB at #21, #31; 2 MiB → 64 MiB at #37).
Pebble doesn't honor very large hints, or bigger SSTs lose write parallelism.
All flat-to-mildly-negative across multiple baselines.
- **Tournament tree / prefix-skip merge optimizations** (#39, #40). The naive
4-way bytes.Compare scan is already optimally branch-predictable and SIMD-tight;
wrapping with anything in Go costs more than it saves at k=4.
- **Parallel reads for WriteEnvelope** (#43 bulk-pre-read; #46 streaming with bounded
lookahead). Two different failure modes: #43 didn't actually overlap reads with
writes (3 serial phases); #46 did overlap but per-file os.ReadFile allocated
~530 MB of one-shot buffers vs io.Copy's reused 32 KB buffer. Pebble checkpoint
files are page-cache-hot anyway — io.Copy pulls them at memory speed, so serial
reading is already efficient. Closed axis.
- **Background WAL fsync** (WALBytesPerSync=4MiB, #38). On this hardware fsync
isn't a meaningful bottleneck; spreading it via background syncs doesn't help.
- **MemTableSize > 64 MiB** (#1 256 MiB, #16 128 MiB). Larger memtable lets entire
100k workload fit in memory → no during-write flushes → forced serial flush at
EndSync. 100k workload regresses ~30%.
- **L0CompactionThreshold ≠ 8** axis fully mapped (2/4/6/16). 8 is the knee.
- **CompactionConcurrencyRange** (#7). With L0=8 compactor isn't the bottleneck.
- **DisableAutomaticCompactions** (#20). With L0=8 it's already idle.
- **proto.MarshalAppend with SetDeferred + cached size** (#23). proto.Size
double-traversal eats the memcpy savings.
- **appendEscaped bytes.IndexByte fast path** (#22). Tuple encoder is on the
smaller goroutine; max(A,B) wallclock means optimizing B doesn't help when B<A.
### Kept

- **#51 outer-only grantReadArena** (-2.3 % primary). Collapses the 1 M
v3.GrantRecord outer allocations to one slice alloc per page. Small-scale
regression (1k +14 %, 100 +20 %) is the known arena-over-allocation
tradeoff. WritePack + SQLite sentinels flat.
- **#53 grantV2ReadArena** (-20.7 % primary). Arena-allocates the 6
v2.Grant nested stubs (Grant + Entitlement + 2 Resources + 2 ResourceIds)
in adapter.ListGrants. Pre-sized to len(records) so no waste at any scale
for the arena itself (small-scale regression unchanged from #51, came from
the OUTER GrantRecord arena, not this one). Allocs/op 17M→10M.
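
A minimal sketch of the arena pattern behind #53, under stated assumptions: the types below are simplified stand-ins for the generated v2 messages and for `v3.GrantRecord`, and `toV2Page` is a hypothetical name, not the adapter's function. The point is the shape: backing slices pre-sized to `len(records)`, with the six nested objects per grant carved out of them instead of allocated one by one.

```go
package main

import "fmt"

// Hypothetical stand-ins for the generated v2 messages.
type ResourceID struct{ Type, ID string }
type Resource struct{ ID *ResourceID }
type Entitlement struct {
	ID       string
	Resource *Resource
}
type Grant struct {
	ID          string
	Entitlement *Entitlement
	Principal   *Resource
}

// record is a hypothetical stand-in for a decoded v3.GrantRecord.
type record struct{ ID, EntID, ResType, ResID, PrinType, PrinID string }

// toV2Page hydrates one page. Instead of six heap objects per grant it makes
// four slice allocations per page (plus the result slice) and points into them.
func toV2Page(recs []record) []*Grant {
	n := len(recs)
	grants := make([]Grant, n)
	ents := make([]Entitlement, n)
	res := make([]Resource, 2*n)   // [2i] entitlement resource, [2i+1] principal
	ids := make([]ResourceID, 2*n) // same layout as res
	out := make([]*Grant, n)
	for i, r := range recs {
		ids[2*i] = ResourceID{Type: r.ResType, ID: r.ResID}
		ids[2*i+1] = ResourceID{Type: r.PrinType, ID: r.PrinID}
		res[2*i].ID = &ids[2*i]
		res[2*i+1].ID = &ids[2*i+1]
		ents[i] = Entitlement{ID: r.EntID, Resource: &res[2*i]}
		grants[i] = Grant{ID: r.ID, Entitlement: &ents[i], Principal: &res[2*i+1]}
		out[i] = &grants[i]
	}
	return out
}

func main() {
	page := toV2Page([]record{
		{"g1", "e1", "app", "a1", "user", "u1"},
		{"g2", "e1", "app", "a1", "user", "u2"},
	})
	for _, g := range page {
		fmt.Println(g.ID, g.Entitlement.ID, g.Entitlement.Resource.ID.ID, g.Principal.ID.ID)
	}
}
```

One consequence worth keeping in mind: a caller that retains a single `*Grant` from the page keeps all of that page's backing arrays reachable, so the arena trades allocation count for coarser garbage-collection granularity.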

### Discarded

- **#50 grantReadArena with pre-populated nested fields** — proto.Unmarshal
didn't reuse the pre-populated EntitlementRef/PrincipalRef/Timestamp
pointers despite the consumeMessageInfo source-level read suggesting it
should. Only the outer GrantRecord was reused (-1 alloc/grant), while the
unused pre-populated arenas added bytes_op +15 % and regressed smaller
scales +28-40 %. Probable causes recorded in the jsonl ASI.
- **#52 slab-style growable arena** — attempted to fix #51's small-scale
regression by sizing arenas to actual records via doubling-slab strategy.
Slab management overhead (per-call cap check + slice-header sync) cancelled
the saved memclr at small scales. Fixed-size arena from #51 is the better
tradeoff.