Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815)#819
Conversation
3d12c9e to
84ff8c2
Compare
672d01f to
55cabde
Compare
84ff8c2 to
419f7ec
Compare
aac0f4b to
1b47f8e
Compare
04f9931 to
7f8e58f
Compare
1b47f8e to
72368b2
Compare
7f8e58f to
f3431cd
Compare
|
Note (non-blocking): a zero-txhash-key window → infinite restart loop on synthetic/standalone networks Surfaced while writing the daemon catch-up E2E ( A complete chunk whose 10,000 ledgers are all transaction-free produces an empty txhash Not a production concern: mainnet/testnet never produce an all-zero-tx 10k-ledger chunk. But a standalone/quickstart network, or a naive synthetic test backend (e.g. an all-zero-tx fixture), hits it easily. The new E2E sidesteps it by giving the chunk a sparse few one-tx ledgers so the Decision: no fix for now — it can't happen in prod. If ever revisited: either build a degenerate/empty index for a zero-key window, or make the zero-key error a fatal sentinel so |
72368b2 to
531dca9
Compare
fa0083f to
cbc80ab
Compare
…hanges Rebased onto the updated #819 and propagated #817's API changes into the Phase 2 hot-store/lifecycle layer: - window -> tx-hash index rename + key prefix index: -> txhash_index: (TxHashIndexCoverage.Window -> .Index, Catalog.windows -> .txhashIndex); merged the hot-catalog methods (HotChunkKeys/ReadyHotChunkKeys/hotChunkKeysWith) onto #817's renamed Catalog. - Catalog.Get/Has -> get/has; config sections regrouped (cfg.Retention.*, cfg.Layout.*, cfg.Storage.*, cfg.Ingestion.*); pins via PinLayout. - RetentionGate -> RetentionFloor: eligibility.go uses floor.Excludes(c) and floor.Excludes(layout.LastChunk(idx)) for the discard/prune scans; the reader- retention test dropped its Admits/Floor assertions (deferred to #772) and keeps the straddling-window prune-survival check via Excludes. - renamed the lifecycle test's oneTxLCMBytes -> oneTxLCMRand (the #819 daemon-E2E added a same-named helper). Also fixes a latent daemon bug surfaced by the catch-up E2E: startConfig built ProcessConfig without a HotProbe, so any real catch-up aborted with "HotProbe is nil" and the supervisor looped. Wire the production NewRocksHotProbe. build/vet/test -short green (streaming + hotchunk + ingest + stores).
…hanges Rebased onto the updated #820 and propagated #817's API changes into the Phase 2 live-ingestion/daemon layer: - window -> tx-hash index rename + key prefix index: -> txhash_index: (TxHashIndexCoverage.Index, Catalog.txhashIndex), Catalog.Get/Has -> get/has, config sections regrouped (cfg.Retention/Layout/Storage/Ingestion), pins via PinLayout. - daemon.go merge: kept #821's live-ingestion wiring (LifecycleConfig + Core) and deduped the HotProbe line (#821's Phase-2 wiring already set it, so #820's HotProbe fix is redundant here). - removed the #819 cold-only catch-up E2E (TestRunDaemon_CatchUpMaterializes...) + its someTxBackend/oneTxLCMBytes helpers: #821's daemon now requires Boundaries.Core and runs a continuous live loop, so a cold-only "catch up then return" test can't fit — and TestE2E_DaemonLifecycle covers it end to end. Mechanical propagation only; build/vet/test -short green (the heavy lifecycle E2E stays -short-gated).
0433d50 to
2e5403a
Compare
cbc80ab to
ae91d20
Compare
Rebased the orchestration/daemon layer onto the reorganized #818 (geometry + catalog subpackages) and propagated #817's API changes: - qualify moved symbols: catalog.{Catalog,ArtifactSet,NewArtifactSet,AllArtifacts, NewCatalog}; geometry.{Kind*,State*,Layout,NewLayout,TxHashIndexID, TxHashIndexCoverage,TxHashIndexLayout,NewTxHashIndexLayout,MaxChunksPerTxhashIndex, LastCompleteChunkAt} - window -> tx-hash-index rename (Windows->TxHashIndexLayout, FrozenCoverage-> FrozenTxHashIndex, AllIndexKeys->AllTxHashIndexKeys, IndexBuild.Window->.Index) - config regroup: cfg.Backfill.ChunksPerTxhashIndex->cfg.Layout.*, cfg.Streaming.*->cfg.Retention.*/cfg.Storage.*; daemon_test config literals updated - tests reach the index layout via the public cat.TxHashIndexLayout() (txhashIndex is now an unexported field of the catalog package) build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
2e5403a to
03147f0
Compare
Restacked on the split/no-hooks #819 and ported the hot tier across the new package boundary: - hot key schema -> geometry (HotState/HotReady/HotTransient, exported HotChunkKey/ParseHotChunkKey/HotChunkPrefix); hot catalog methods -> catalog (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys) - processChunk hot-source branch + progress hot refinement (lastCommittedLedger(cat, probe), highestReadyChunkSigned, refineWithHotDB) - new files: pkg/stores/hotchunk, streaming/{eligibility,hotsource,ingest,lifecycle} - daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe) - crash-hooks REMOVED to match #817/#818 (the split makes cat.hooks unreachable from streaming); the one beforeHotTransient hook test is dropped, the rest are the structural crash tests #817/#818 established - propagated renames: window->tx-hash-index, RetentionGate->RetentionFloor, cat.Has->public HotState, cat.layout->Layout() build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
ccebeb0 to
d169a0d
Compare
Reverts the resolve-side debris handling (from 33d10cd): drop windowHasDebris, the stored.covers(desired) debris branch in resolveWindow (back to a plain 'covered ⇒ no build'), and the 'freezing' arm of buildThenSweep's eager sweep (the 'pruning' arm stays — that post-build demote+sweep is design-specified). Debris cleanup is already homed in the lifecycle's prune stage (eligiblePruneOps, run on every lifecycle run including the startup seed): index keys in freezing/pruning, demoted pruning .bin inputs, and redundant .bin in a finalized window. Handling it in resolve duplicated that — once the lifecycle lands, one run would sweep the same leftovers twice (resolve in stage 1, eligiblePruneOps in stage 3) — and put reclamation inside resolve, whose contract everywhere else is just 'diff desired-vs-catalog → plan builds'. The codex 'stranded forever' concern only holds for Phase 1 in isolation; the next startup's prune stage reclaims it once that PR lands, and steady state never strands. Drops the two now-dead resolver debris tests. The crash-matrix AfterMarkBeforeCommit test is unaffected (it builds the freezing coverage itself → frozen, predecessor → pruning, swept by the kept pruning arm).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5f04567534
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| return IndexBuild{}, false, err | ||
| } | ||
| stored := coverageRange{Lo: frozen.Lo, Hi: frozen.Hi} | ||
| if hasFrozen && stored.covers(desired) { |
There was a problem hiding this comment.
Don't defer sweeps to a missing lifecycle
When this branch is hit after a crash that already committed the replacement index but died before buildThenSweep swept the demoted predecessor coverage or terminal .bin keys, returning here leaves that pruning debris forever. Fresh evidence for re-raising: in this final tree startup only runs catch-up and ServeReads, and a repo-wide search finds no lifecycle/eligiblePruneOps runner, so the lifecycle prune stage referenced here never runs in Phase 1; schedule a sweep-only build here or wire the lifecycle before deferring to it.
Useful? React with 👍 / 👎.
|
|
||
| // completeThrough maps a signed chunk index to its "complete through" last ledger: | ||
| // c < 0 ⇒ preGenesisLedger (no uint32 wraparound); c >= 0 ⇒ chunk.ID(c).LastLedger(). | ||
| func completeThrough(c int64) uint32 { |
There was a problem hiding this comment.
completeThrough and preGenesisLedger don't appear in the design doc, and what they encode is already spelled out there. completeThrough(c) is just "chunk c's last ledger" with the c < 0 sentinel folded in — and the doc's lastCommittedLedger handles that sentinel with an explicit branch instead of a helper:
base := earliest - 1
cold := highestDurableChunk(cat) // -1 if none
if cold >= 0 {
return max(base, chunk.ID(cold).LastLedger()), nil // == doc's chunkLastLedger(cold)
}
return base, nilwatermarkMidChunk can likewise use chunk.ID(c).LastLedger() directly (the ledger is ≥ genesis there, so c ≥ 0). That drops both helpers and makes the code greppable against the doc's chunkLastLedger / lastCommittedLedger.
There was a problem hiding this comment.
Done in dcde887. Dropped both — lastCommittedLedger now derives the watermark inline (signed max of the floor term earliest-1 and the cold term chunk.ID(cold).LastLedger(), floored at FirstLedgerSeq-1), matching the doc's lastCommittedLedger/chunkLastLedger.
One adjustment: watermarkMidChunk keeps an explicit c < 0 → false guard rather than calling chunk.ID(c).LastLedger() unconditionally. The fresh-start watermark is the sub-genesis sentinel (lastCommitted = FirstLedgerSeq-1, chunkIDOfLedger → -1), and it reaches watermarkMidChunk both in the catch-up loop (line ~114, "Signed so genesis reads as a boundary") and the {"genesis sentinel is a boundary", preGenesisLedger, false} test — so c is not always ≥ 0 there. The guard mirrors the explicit-branch style you proposed for lastCommittedLedger. preGenesisLedger is now a test-only const (the name was only load-bearing in test expectations).
There was a problem hiding this comment.
The adjustment seems weird. Let me see if there is way around it
There was a problem hiding this comment.
Correction — removed the guard in a51f19e; the unconditional chunk.ID(c).LastLedger() you suggested is correct here, and I had it wrong above.
chunkIDOfLedger only ever returns -1 (for any sub-genesis ledger) or a real chunk id ≥ 0, and the only sub-genesis value watermarkMidChunk is ever passed is the fresh-start sentinel preGenesisLedger (= FirstLedgerSeq-1). For c = -1, chunk.ID(-1).LastLedger() wraps — (MaxUint32+1) overflows to 0 — back to exactly preGenesisLedger, so lastCommitted != … already yields false (a boundary) with no special case. (Same uint32 aliasing the old TestCompleteThrough documented.) So the guard was redundant, not load-bearing — my earlier "c is not always ≥ 0" reasoning was beside the point. TestWatermarkMidChunk/genesis_sentinel_is_a_boundary still passes. Left a comment noting the wrap so it isn't re-added.
| // max() guards a lagging bulk tip: the tip alone could regress the floor below | ||
| // pruning or drop a complete watermark chunk. | ||
| anchor := max(tip, lastCommitted) | ||
| rangeStart := chunk.IDFromLedger(effectiveRetentionFloor(anchor, retentionChunks, earliest)) |
There was a problem hiding this comment.
The design doc names this retentionFloorChunk and returns a ChunkID; here it's effectiveRetentionFloor returning a ledger, so every chunk-side caller round-trips through chunk.IDFromLedger(...) to get back to a chunk. Suggest renaming to retentionFloorChunk and returning a ChunkID to match the doc — then this becomes rangeStart := retentionFloorChunk(...), and only the metrics.Watermark call (which wants a ledger) does the chunkFirstLedger conversion. The definition lives in retention.go (#817), so it spans the stack.
| const defaultRestartBackoff = 5 * time.Second | ||
|
|
||
| // Boundaries bundles the external boundaries run and validateConfig inject. | ||
| type Boundaries struct { |
There was a problem hiding this comment.
Boundaries/BuildBoundaries is heavier than the job needs. Three of the four fields are the same object — NetworkTip is backendTip{backend}, Cleanup is the backend's datastore teardown — so this is really "the backfill backend, viewed three ways" plus the unrelated ServeReads. It's also a second injection layer on top of daemonOptions (a factory, inside the options bag, returning another bag of seams), with two dead params (paths, cat are ignored in buildProductionBoundaries).
Suggest collapsing to the one thing that varies: an injectable backfill.Backend (nil ⇒ frontfill) plus ServeReads, with runDaemonWith deriving NetworkTip and Cleanup. That removes the struct, the factory, the dead params, and the vague name; tests already inject a fakeBackend with Tip(), so they lose nothing. Not in the design doc — pure impl-seam cleanup.
| // progress. Each pass samples networkTip, anchors on max(tip, lastCommitted), | ||
| // computes [rangeStart, rangeEnd] with the mid-chunk exclusion, and breaks on an | ||
| // empty or non-advancing range. | ||
| func catchUp(ctx context.Context, cfg StartConfig, lastCommitted, earliest uint32) (uint32, error) { |
There was a problem hiding this comment.
The doc unified its terminology on "backfill" — it's used throughout (42×) and "catch-up" appears nowhere; it calls this very loop "Backfill" that "re-runs if the tip advances mid-pass." catchUp reintroduces "catch-up" as a parallel term, and it spreads into the "catch-up pass" log lines and the CatchupPass/CatchupProgress metrics. Since RunBackfill is already the single pass, suggest renaming the loop to something backfill-rooted that captures "re-pass until the tip" — e.g. backfillToTip — with the logs and metrics following (BackfillPass/BackfillProgress). Keeps the code greppable against the doc, same as the other naming items.
Three doc-alignment cleanups (tamirms), no behavior change:
1. progress.go: drop completeThrough + the preGenesisLedger const. lastCommittedLedger
now derives the watermark inline (max of floor term earliest-1 and cold term
chunk.LastLedger, signed, floored at FirstLedgerSeq-1) matching the doc's
lastCommittedLedger; watermarkMidChunk uses chunk.ID(c).LastLedger() directly with
an explicit sub-genesis guard (the fresh-start watermark is sub-genesis, so the
guard stays — c is NOT always >= 0). preGenesisLedger moves to a test-only const.
2. retention.go: rename effectiveRetentionFloor -> retentionFloorChunk returning a
chunk.ID (the doc's name + type). Chunk-side callers drop their IDFromLedger
round-trip (rangeStart := retentionFloorChunk(...)); only the two metrics.Watermark
calls convert back via .FirstLedger() (lossless — both floor terms are
chunk-first-ledgers). Updated the geometry + retention_test references.
3. daemon.go: collapse Boundaries{NetworkTip,Backend,ServeReads,Cleanup} + the
BuildBoundaries factory into an injectable backfill.Backend (nil => frontfill) +
ServeReads, with runDaemonWith deriving NetworkTip (resolveNetworkTip) and Cleanup.
buildProductionBoundaries -> buildBackfillBackend(ctx,cfg,logger)(Backend,cleanup,err);
removes the struct, its validate, the factory, the two dead params (paths, cat), and
the vague name. Tests inject a fakeBackend (re-added tipErr for the unreachable-tip
case); dropped the now-moot BuildBoundariesError seam test.
…rminology The doc unified on 'backfill' (used throughout, 'catch-up' appears nowhere) and calls this very loop the backfill that re-runs if the tip advances mid-pass. 'catch-up' was a parallel term; rename it away for doc-greppability: - catchUp -> backfillToTip (the re-pass-until-tip loop; RunBackfill is the single pass) - metrics: CatchupProgress/CatchupPass -> BackfillProgress/BackfillPass, and the Prometheus fields/strings (catchup_backfilled_ledger -> backfilled_ledger, catchup_target_ledger -> backfill_target_ledger, catchup_passes_total -> backfill_passes_total, phase label catchup_pass -> backfill_pass). Not yet exposed (#772), so no dashboard impact. - log lines ('catch-up pass ...' -> 'backfill pass ...') and all 'catch-up' prose. No behavior change; whole fullhistory tree green (-short), vet clean. grep for catchup/catch-up across the subsystem now returns nothing.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
The explicit c < 0 guard was unnecessary. chunkIDOfLedger only ever returns -1 (for any sub-genesis ledger) or a real chunk id >= 0, and the sole sub-genesis value watermarkMidChunk is ever called with is the fresh-start sentinel preGenesisLedger (= FirstLedgerSeq-1). For c = -1, chunk.ID(-1).LastLedger() wraps — (MaxUint32+1) overflows to 0 — back to exactly preGenesisLedger, so the comparison already reports a boundary (false) with no special case. (This is the uint32 aliasing the removed TestCompleteThrough documented.) tamirms's suggested chunk.ID(c).LastLedger() form is therefore correct for every reachable input. TestWatermarkMidChunk/genesis_sentinel_is_a_boundary still passes.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
- progress.go: the G115 int64->uint32 narrowing is chunk.ID(cold) (cold is int64), not uint32(through) — move the //nolint:gosec onto that line and drop the unused one on the return (gosec doesn't flag the bounded uint32(through)). - daemon.go: supervise's 'sleepCtx != nil -> return nil' is an intentional clean-shutdown-on-ctx-cancel; annotate //nolint:nilerr.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
| // (backfill, freeze/rebuild, prune) plus derived progress gauges; distinct from | ||
| // per-data-type ingest.MetricSink. All methods must be safe for concurrent use. | ||
| // LastCommitted and ChunkBoundary await the Phase-2 live-ingestion loop; the rest are wired. | ||
| type Metrics interface { |
There was a problem hiding this comment.
A lot of these are redundant or don't carry information you'd actually use. Suggested cuts:
Three gauges hold the same number. watermark_ledger, catchup_backfilled_ledger, and last_committed_ledger are all set from the same lastCommitted value. Keep one. I'd keep last_committed_ledger — the design doc says "last committed" everywhere (21 times) and "watermark" twice, so it's the name people will recognize.
Tip, lag, and target aren't needed. catchup_target_ledger is just the tip. ingestion_lag_ledgers is tip minus last-committed, which anyone can compute. And captive core already publishes the latest ledger it has, so the daemon doesn't need to report the tip at all — to get lag, subtract our last_committed_ledger from captive core's latest-ledger metric in the query. None of the three earns its place.
Counters that don't add anything. catchup_passes_total and freeze_indexes_total are just the _count of the catchup_pass and rebuild phase-duration histograms. rebuilt_chunks_total and rebuild_chunks_per_index both count chunks folded into index rebuilds, which isn't a number you'd act on. freeze_chunks_total counts chunks written, which just tracks last_committed_ledger moving. Drop all of these.
chunk_boundaries_total has no code that sets it — it's reserved for the Phase-2 ingestion loop. Like the #817/#818 reviews: add the metric when the code that fills it lands.
cold_tier_bytes is computed the slow way. It walks every cold directory and calls stat on every file — about 30,000 files at mainnet size, worse as history grows, and much worse if the files are on network storage. Disk usage is something the OS tracks per filesystem: one statfs call returns the used bytes, and it's more accurate (real disk blocks, not apparent file sizes). node_exporter already reports it per mount.
Worth keeping: last_committed_ledger, retention_floor_ledger, pruned_ops_total, and phase_duration_seconds by phase. Four instead of fifteen.
There was a problem hiding this comment.
Done in ac64eac — cut to the four you flagged (last_committed_ledger, retention_floor_ledger, pruned_ops_total, phase_duration_seconds by phase).
| type notConfiguredTip struct{} | ||
|
|
||
| func (notConfiguredTip) NetworkTip(context.Context) (uint32, error) { | ||
| return 0, errors.New("no bulk backend configured ([backfill.bsb].bucket_path empty); " + |
There was a problem hiding this comment.
Small one: this error points operators at [backfill.bsb].bucket_path, but that key doesn't exist — the bucket path lives in [backfill.datastore.params], and the thing that actually triggers frontfill-only mode is an empty [backfill.datastore].type (the check just above at line 236). An operator who wants to configure a backend and reads this gets sent to the wrong place. Suggest pointing it at [backfill.datastore].type.
There was a problem hiding this comment.
Fixed in 5d91d07. The error now points at [backfill.datastore].type (the empty-type check just above that actually selects frontfill-only mode), instead of the non-existent [backfill.bsb].bucket_path.
| // where chunkIDOfLedger yields -1 and chunk.ID(-1).LastLedger() wraps (MaxUint32+1 | ||
| // overflows to 0) back to exactly preGenesisLedger — so the comparison reports a | ||
| // boundary (false) without a special case. | ||
| func watermarkMidChunk(lastCommitted uint32) bool { |
There was a problem hiding this comment.
Same naming point as the metrics one: this function (plus a few comments, the log "watermark derived", and the error "startup derive watermark") call the value "watermark," but the variable is lastCommitted, the function is lastCommittedLedger, and the log field is last_committed. The design doc uses "last committed" throughout and "watermark" only twice in passing. Could we settle on "last committed" everywhere — watermarkMidChunk -> lastCommittedMidChunk, and the stray "watermark" mentions to match? One name for one thing.
There was a problem hiding this comment.
Done in 5d91d07 — settled on "last committed" everywhere.
| // --------------------------------------------------------------------------- | ||
|
|
||
| // A young-network first start does no backfill then serves reads once. | ||
| func TestRun_FirstStartCatchUpThenServe(t *testing.T) { |
There was a problem hiding this comment.
Leftover from the catch-up -> backfill rename: this test (and TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex in daemon_test.go, plus the "catches up" prose around it) still says "CatchUp." Worth renaming to "Backfill" to match.
There was a problem hiding this comment.
Done in 5d91d07. Renamed TestRun_FirstStartCatchUpThenServe → ...BackfillThenServe and TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex → TestRunDaemon_BackfillMaterializesAllColdTypesAndIndex, plus the "catches up" prose in those two tests' comments. (Left the generic English "catches up" in resolve_test.go / backend.go — those aren't the catch-up term-of-art.)
Per review: most of the daemon's metrics are redundant or recomputable. Keep only last_committed_ledger, retention_floor_ledger, pruned_ops_total, and phase_duration_seconds (by phase). Dropped: - watermark_ledger + backfilled_ledger: dupes of last_committed_ledger (all three were set from the same lastCommitted). Kept last_committed_ledger — the doc's term. - ingestion_lag_ledgers + backfill_target_ledger: the target is just the tip, and lag = captive-core's latest-ledger minus our last_committed_ledger (compute in the query); the daemon needn't report the tip. - catchup/backfill_passes_total + freeze_indexes_total: the _count of the backfill_pass / rebuild phase-duration histograms. rebuilt_chunks_total + rebuild_chunks_per_index + freeze_chunks_total: not actionable numbers. All dropped. - chunk_boundaries_total: no code sets it (Phase-2). Add when the filling code lands. - cold_tier_bytes + MeasureColdTierBytes: a 30k-file tree-walk; node_exporter already reports per-mount used bytes via statfs. Interface shrinks accordingly: LastCommitted(lastCommitted, retentionFloor) replaces Watermark + the unused 1-arg LastCommitted; Freeze/Rebuild/BackfillPass lose their now-dead count args (just time the phase); IngestionLag/BackfillProgress/ColdTierBytes/ ChunkBoundary removed. Both test recordingMetrics fakes + their assertions updated. Whole fullhistory tree green (-short), vet clean.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…fill renames
Three review nits:
- daemon.go: the notConfiguredTip error pointed operators at [backfill.bsb].bucket_path
(a key that doesn't exist); frontfill-only is triggered by an empty
[backfill.datastore].type — point the message there.
- 'watermark' -> 'last committed' everywhere: the value is lastCommitted, the deriver
is lastCommittedLedger, the log field is last_committed, and the design doc uses
'last committed' throughout ('watermark' only twice). Renamed watermarkMidChunk ->
lastCommittedMidChunk, the 'derive watermark' error + 'watermark derived' log, the
prose comments, the last_committed_ledger metric help, and the geometry/test mentions.
- leftover catch-up -> backfill test names: TestRun_FirstStartCatchUpThenServe ->
...BackfillThenServe, TestRunDaemon_CatchUpMaterializes... -> ...BackfillMaterializes...,
plus the 'catches up' prose around them.
No behavior change; whole fullhistory tree green (-short), vet + gofmt clean.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
| // The txhash kind: one rule per overlapping window (resolveWindow runs the diff). | ||
| var builds []IndexBuild | ||
| for _, w := range indexesOverlapping(txLayout, rangeStart, rangeEnd) { | ||
| build, ok, err := resolveWindow(cat, txLayout, w, rangeStart, rangeEnd, needs) |
There was a problem hiding this comment.
nit: this should probably be renamed to resolveTxHashIndex
There was a problem hiding this comment.
Done in a20ebf5 — renamed resolveWindow → resolveTxHashIndex (def + call + comment).
tamirms
left a comment
There was a problem hiding this comment.
left one nit about renaming resolveWindow
but everything looks great! feel free to merge 🎉
The helper resolves the txhash-index rule for one window; the type-qualified name reads clearer at the call site (nit).
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Rebased onto the reworked Phase 1 (#819), which flattened the streaming package into the fullhistory root + a backfill subpackage and made the cold pipeline source-blind (ingest.WriteColdChunk over a ledgerbackend.LedgerStream). The hot tier + lifecycle machinery is re-carved onto that layout: - hot key schema -> geometry (HotState/HotReady/HotTransient, HotChunkKey/ ParseHotChunkKey); hot catalog methods -> catalog (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys) - HotProbe/HotChunk + ErrHotVolumeLost live in package backfill (consumed by backfillSource's hot branch, which now yields a ledgerbackend.LedgerStream + a closer); ProcessConfig gains a nil-safe HotProbe field - the freeze source opens the hot DB READ-ONLY (the design's openRocksDBReadOnly): rocksdb.Config gains a ReadOnly mode (read-only open, no create, no flush-on-close) and hotchunk.OpenReadOnly composes the facades over it, so a freeze never mutates the hot DB it reads - the per-chunk hotchunk DB (single shared multi-CF: ledgers + events + tx-hash; transient/ready state machine) + ingest.HotService rebuilt over hotchunk.DB (one atomic synced WriteBatch, decision (a) — no errgroup fan-out) - progress hot refinement: lastCommittedLedger(cat[, probe]) maxes a hot term (positional or one refineWithHotDB read), loss detected lazily on the open - freeze/discard/prune + progress + retention live in a new fullhistory/lifecycle/ subpackage (the scoped part of #824); the freeze stage reuses backfill.RunBackfill (the same path catch-up uses); the live ingestion loop stays in the daemon root - daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe) - folded-in cleanups: deleted the now-dead RunHot/HotStores orchestration AND the per-type hot ingesters it was the last caller of (the hotchunk.DB HotService supersedes them); aggressively trimmed the phase-2 doc comments Verification: go build + go vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/... (cgo RocksDB toolchain). Stack: streaming-phase2-lifecycle -> streaming-phase1-daemon
…mmitted (apply #819 metrics cut) Round-N review on #819 trimmed the control-plane metrics to a minimal set and explicitly dropped cold_tier_bytes (a ~30k-file tree walk; node_exporter statfs reports per-mount used bytes). The #820 rebase had re-introduced it (plus MeasureColdTierBytes and its per-pass/per-tick call sites) — remove it again. Also drop the per-ledger LedgerCommitted heartbeat: it only re-set the existing last_committed_ledger gauge more often (the lifecycle tick / backfill pass already set it) and added interface surface #816 doesn't require. Kept: live_hot_chunks (actionable discard-backlog signal), chunk_boundaries_total and discarded_hot_chunks_total (now driven by live Phase-2 code).
Closes #815 — Phase 1 (Backfill), layer 3 of 3. Stacked on #818 (
streaming-phase1-primitives).Orchestration + the daemon entrypoint, completing the cold-only backfill daemon.
What this layer adds
progress.go) — recomputed from durable catalog keys on every start, never stored; signed-int64sentinel arithmetic so the genesis / "nothing complete" markers can't wrap.resolve(backfill/resolve.go) — a pure catalog diff over[rangeStart, rangeEnd]→ aPlan: per-chunk ledgers/events rule + per-window txhash-index rule (with the trailing-window cap). Recomputed every run, so a restart re-plans cleanly.executePlan(backfill/execute.go) — one bounded worker pool; index builds wait on their in-coverage chunk builds before acquiring a slot (no slot-starvation deadlock atworkers=1);cenkalti/backoffretries.run+ the catch-up loop (startup.go) —networkTip-bounded catch-up (anchor = max(tip, watermark), mid-chunk resume exclusion, first-start-with-no-tip fatal) →serveReadshandoff. No hot tier / live loop / lifecycle in Phase 1.daemon.go) — LoadConfig → lock roots → open catalog → build boundaries →validateConfig(pin/confirmearliest_ledger) → Prometheus registry → supervised run loop. Plus thefull-historyCLI subcommand (folds into the general RPCstartcommand at the Cut over to RPC v2: remove SQLite ingestion/query, serve from v2 stores #772 cutover).buildProductionBoundariesbuilds the BSB backend viabackfill.NewBSBBackendFromConfig, releasing the datastore at shutdown; frontfill-only when no datastore is configured.[backfill.datastore]+[backfill.bsb]reuse the SDK config types verbatim (datastore.DataStoreConfig+ledgerbackend.BufferedStorageBackendConfig), so any SDK datastore — GCS, S3, Filesystem — works as the bulk source. The daemon drives the cold tier throughingest.ColdService+NewPrometheusSink.This branch is a runnable backfill daemon: it boots from one TOML, catches up from the configured lake (or an injected backend), and serves — the read server itself is a #772 no-op for now (reads still come from the v1 daemon until the cutover).
Reviewing
A per-file design map (each file → its #815 deliverable + #787 design section) and a suggested reading order are posted as review comments on this PR.
Verification
go build+go vet+go test -shortgreen on./cmd/stellar-rpc/internal/fullhistory/...(cgo RocksDB toolchain);golangci-lintruns in CI. The fullcmd/stellar-rpcbinary link needs the Rustlibpreflight/libxdr2json(CImake build-libs);go vet ./cmd/stellar-rpc/type-checks the entrypoint locally.Stack:
streaming-phase1-daemon→streaming-phase1-primitives(#818)