Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815) by chowbao · Pull Request #819 · stellar/stellar-rpc

chowbao · 2026-06-24T20:04:27Z

Closes #815 — Phase 1 (Backfill), layer 3 of 3. Stacked on #818 (streaming-phase1-primitives).

Orchestration + the daemon entrypoint, completing the cold-only backfill daemon.

What this layer adds

Derived progress / watermark (progress.go) — recomputed from durable catalog keys on every start, never stored; signed-int64 sentinel arithmetic so the genesis / "nothing complete" markers can't wrap.
resolve (backfill/resolve.go) — a pure catalog diff over [rangeStart, rangeEnd] → a Plan: per-chunk ledgers/events rule + per-window txhash-index rule (with the trailing-window cap). Recomputed every run, so a restart re-plans cleanly.
executePlan (backfill/execute.go) — one bounded worker pool; index builds wait on their in-coverage chunk builds before acquiring a slot (no slot-starvation deadlock at workers=1); cenkalti/backoff retries.
run + the catch-up loop (startup.go) — networkTip-bounded catch-up (anchor = max(tip, watermark), mid-chunk resume exclusion, first-start-with-no-tip fatal) → serveReads handoff. No hot tier / live loop / lifecycle in Phase 1.
Daemon entrypoint (daemon.go) — LoadConfig → lock roots → open catalog → build boundaries → validateConfig (pin/confirm earliest_ledger) → Prometheus registry → supervised run loop. Plus the full-history CLI subcommand (folds into the general RPC start command at the Cut over to RPC v2: remove SQLite ingestion/query, serve from v2 stores #772 cutover).
Production backend wired — buildProductionBoundaries builds the BSB backend via backfill.NewBSBBackendFromConfig, releasing the datastore at shutdown; frontfill-only when no datastore is configured. [backfill.datastore] + [backfill.bsb] reuse the SDK config types verbatim (datastore.DataStoreConfig + ledgerbackend.BufferedStorageBackendConfig), so any SDK datastore — GCS, S3, Filesystem — works as the bulk source. The daemon drives the cold tier through ingest.ColdService + NewPrometheusSink.

This branch is a runnable backfill daemon: it boots from one TOML, catches up from the configured lake (or an injected backend), and serves — the read server itself is a #772 no-op for now (reads still come from the v1 daemon until the cutover).

Reviewing

A per-file design map (each file → its #815 deliverable + #787 design section) and a suggested reading order are posted as review comments on this PR.

Verification

go build + go vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/... (cgo RocksDB toolchain); golangci-lint runs in CI. The full cmd/stellar-rpc binary link needs the Rust libpreflight/libxdr2json (CI make build-libs); go vet ./cmd/stellar-rpc/ type-checks the entrypoint locally.

Stack: streaming-phase1-daemon → streaming-phase1-primitives (#818)

chowbao · 2026-06-25T12:11:43Z

Note (non-blocking): a zero-txhash-key window → infinite restart loop on synthetic/standalone networks

Surfaced while writing the daemon catch-up E2E (TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex, fa0083f):

A complete chunk whose 10,000 ledgers are all transaction-free produces an empty txhash .bin. buildTxhashIndex then fails with txhash: cannot build a cold index with zero keys; that error propagates runBackfill → startStreaming, and superviseStreaming classifies it as restartable (not a fatal sentinel). So the daemon restarts (~5s backoff), re-derives the watermark, re-resolves the same still-unbuilt window, fails again → an infinite restart loop (~7s/cycle). It is not slow ingest — a successful catch-up of the same chunk runs in ~0.3s.

Not a production concern: mainnet/testnet never produce an all-zero-tx 10k-ledger chunk. But a standalone/quickstart network, or a naive synthetic test backend (e.g. an all-zero-tx fixture), hits it easily. The new E2E sidesteps it by giving the chunk a sparse few one-tx ledgers so the .bin has keys to index.

Decision: no fix for now — it can't happen in prod. If ever revisited: either build a degenerate/empty index for a zero-key window, or make the zero-key error a fatal sentinel so superviseStreaming surfaces it instead of retrying forever.

…hanges Rebased onto the updated #819 and propagated #817's API changes into the Phase 2 hot-store/lifecycle layer: - window -> tx-hash index rename + key prefix index: -> txhash_index: (TxHashIndexCoverage.Window -> .Index, Catalog.windows -> .txhashIndex); merged the hot-catalog methods (HotChunkKeys/ReadyHotChunkKeys/hotChunkKeysWith) onto #817's renamed Catalog. - Catalog.Get/Has -> get/has; config sections regrouped (cfg.Retention.*, cfg.Layout.*, cfg.Storage.*, cfg.Ingestion.*); pins via PinLayout. - RetentionGate -> RetentionFloor: eligibility.go uses floor.Excludes(c) and floor.Excludes(layout.LastChunk(idx)) for the discard/prune scans; the reader- retention test dropped its Admits/Floor assertions (deferred to #772) and keeps the straddling-window prune-survival check via Excludes. - renamed the lifecycle test's oneTxLCMBytes -> oneTxLCMRand (the #819 daemon-E2E added a same-named helper). Also fixes a latent daemon bug surfaced by the catch-up E2E: startConfig built ProcessConfig without a HotProbe, so any real catch-up aborted with "HotProbe is nil" and the supervisor looped. Wire the production NewRocksHotProbe. build/vet/test -short green (streaming + hotchunk + ingest + stores).

…hanges Rebased onto the updated #820 and propagated #817's API changes into the Phase 2 live-ingestion/daemon layer: - window -> tx-hash index rename + key prefix index: -> txhash_index: (TxHashIndexCoverage.Index, Catalog.txhashIndex), Catalog.Get/Has -> get/has, config sections regrouped (cfg.Retention/Layout/Storage/Ingestion), pins via PinLayout. - daemon.go merge: kept #821's live-ingestion wiring (LifecycleConfig + Core) and deduped the HotProbe line (#821's Phase-2 wiring already set it, so #820's HotProbe fix is redundant here). - removed the #819 cold-only catch-up E2E (TestRunDaemon_CatchUpMaterializes...) + its someTxBackend/oneTxLCMBytes helpers: #821's daemon now requires Boundaries.Core and runs a continuous live loop, so a cold-only "catch up then return" test can't fit — and TestE2E_DaemonLifecycle covers it end to end. Mechanical propagation only; build/vet/test -short green (the heavy lifecycle E2E stays -short-gated).

Rebased the orchestration/daemon layer onto the reorganized #818 (geometry + catalog subpackages) and propagated #817's API changes: - qualify moved symbols: catalog.{Catalog,ArtifactSet,NewArtifactSet,AllArtifacts, NewCatalog}; geometry.{Kind*,State*,Layout,NewLayout,TxHashIndexID, TxHashIndexCoverage,TxHashIndexLayout,NewTxHashIndexLayout,MaxChunksPerTxhashIndex, LastCompleteChunkAt} - window -> tx-hash-index rename (Windows->TxHashIndexLayout, FrozenCoverage-> FrozenTxHashIndex, AllIndexKeys->AllTxHashIndexKeys, IndexBuild.Window->.Index) - config regroup: cfg.Backfill.ChunksPerTxhashIndex->cfg.Layout.*, cfg.Streaming.*->cfg.Retention.*/cfg.Storage.*; daemon_test config literals updated - tests reach the index layout via the public cat.TxHashIndexLayout() (txhashIndex is now an unexported field of the catalog package) build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...

Restacked on the split/no-hooks #819 and ported the hot tier across the new package boundary: - hot key schema -> geometry (HotState/HotReady/HotTransient, exported HotChunkKey/ParseHotChunkKey/HotChunkPrefix); hot catalog methods -> catalog (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys) - processChunk hot-source branch + progress hot refinement (lastCommittedLedger(cat, probe), highestReadyChunkSigned, refineWithHotDB) - new files: pkg/stores/hotchunk, streaming/{eligibility,hotsource,ingest,lifecycle} - daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe) - crash-hooks REMOVED to match #817/#818 (the split makes cat.hooks unreachable from streaming); the one beforeHotTransient hook test is dropped, the rest are the structural crash tests #817/#818 established - propagated renames: window->tx-hash-index, RetentionGate->RetentionFloor, cat.Has->public HotState, cat.layout->Layout() build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...

Reverts the resolve-side debris handling (from 33d10cd): drop windowHasDebris, the stored.covers(desired) debris branch in resolveWindow (back to a plain 'covered ⇒ no build'), and the 'freezing' arm of buildThenSweep's eager sweep (the 'pruning' arm stays — that post-build demote+sweep is design-specified). Debris cleanup is already homed in the lifecycle's prune stage (eligiblePruneOps, run on every lifecycle run including the startup seed): index keys in freezing/pruning, demoted pruning .bin inputs, and redundant .bin in a finalized window. Handling it in resolve duplicated that — once the lifecycle lands, one run would sweep the same leftovers twice (resolve in stage 1, eligiblePruneOps in stage 3) — and put reclamation inside resolve, whose contract everywhere else is just 'diff desired-vs-catalog → plan builds'. The codex 'stranded forever' concern only holds for Phase 1 in isolation; the next startup's prune stage reclaims it once that PR lands, and steady state never strands. Drops the two now-dead resolver debris tests. The crash-matrix AfterMarkBeforeCommit test is unaffected (it builds the freezing coverage itself → frozen, predecessor → pruning, swept by the kept pruning arm).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f04567534

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-30T14:55:01Z

+		return IndexBuild{}, false, err
+	}
+	stored := coverageRange{Lo: frozen.Lo, Hi: frozen.Hi}
+	if hasFrozen && stored.covers(desired) {


Don't defer sweeps to a missing lifecycle

When this branch is hit after a crash that already committed the replacement index but died before buildThenSweep swept the demoted predecessor coverage or terminal .bin keys, returning here leaves that pruning debris forever. Fresh evidence for re-raising: in this final tree startup only runs catch-up and ServeReads, and a repo-wide search finds no lifecycle/eligiblePruneOps runner, so the lifecycle prune stage referenced here never runs in Phase 1; schedule a sweep-only build here or wire the lifecycle before deferring to it.

Useful? React with 👍 / 👎.

tamirms · 2026-06-30T15:38:05Z

+
+// completeThrough maps a signed chunk index to its "complete through" last ledger:
+// c < 0 ⇒ preGenesisLedger (no uint32 wraparound); c >= 0 ⇒ chunk.ID(c).LastLedger().
+func completeThrough(c int64) uint32 {


completeThrough and preGenesisLedger don't appear in the design doc, and what they encode is already spelled out there. completeThrough(c) is just "chunk c's last ledger" with the c < 0 sentinel folded in — and the doc's lastCommittedLedger handles that sentinel with an explicit branch instead of a helper:

base := earliest - 1 cold := highestDurableChunk(cat) // -1 if none if cold >= 0 { return max(base, chunk.ID(cold).LastLedger()), nil // == doc's chunkLastLedger(cold) } return base, nil

watermarkMidChunk can likewise use chunk.ID(c).LastLedger() directly (the ledger is ≥ genesis there, so c ≥ 0). That drops both helpers and makes the code greppable against the doc's chunkLastLedger / lastCommittedLedger.

Done in dcde887. Dropped both — lastCommittedLedger now derives the watermark inline (signed max of the floor term earliest-1 and the cold term chunk.ID(cold).LastLedger(), floored at FirstLedgerSeq-1), matching the doc's lastCommittedLedger/chunkLastLedger.

One adjustment: watermarkMidChunk keeps an explicit c < 0 → false guard rather than calling chunk.ID(c).LastLedger() unconditionally. The fresh-start watermark is the sub-genesis sentinel (lastCommitted = FirstLedgerSeq-1, chunkIDOfLedger → -1), and it reaches watermarkMidChunk both in the catch-up loop (line ~114, "Signed so genesis reads as a boundary") and the {"genesis sentinel is a boundary", preGenesisLedger, false} test — so c is not always ≥ 0 there. The guard mirrors the explicit-branch style you proposed for lastCommittedLedger. preGenesisLedger is now a test-only const (the name was only load-bearing in test expectations).

The adjustment seems weird. Let me see if there is way around it

Correction — removed the guard in a51f19e; the unconditional chunk.ID(c).LastLedger() you suggested is correct here, and I had it wrong above.

chunkIDOfLedger only ever returns -1 (for any sub-genesis ledger) or a real chunk id ≥ 0, and the only sub-genesis value watermarkMidChunk is ever passed is the fresh-start sentinel preGenesisLedger (= FirstLedgerSeq-1). For c = -1, chunk.ID(-1).LastLedger() wraps — (MaxUint32+1) overflows to 0 — back to exactly preGenesisLedger, so lastCommitted != … already yields false (a boundary) with no special case. (Same uint32 aliasing the old TestCompleteThrough documented.) So the guard was redundant, not load-bearing — my earlier "c is not always ≥ 0" reasoning was beside the point. TestWatermarkMidChunk/genesis_sentinel_is_a_boundary still passes. Left a comment noting the wrap so it isn't re-added.

tamirms · 2026-06-30T15:38:06Z

+		// max() guards a lagging bulk tip: the tip alone could regress the floor below
+		// pruning or drop a complete watermark chunk.
+		anchor := max(tip, lastCommitted)
+		rangeStart := chunk.IDFromLedger(effectiveRetentionFloor(anchor, retentionChunks, earliest))


The design doc names this retentionFloorChunk and returns a ChunkID; here it's effectiveRetentionFloor returning a ledger, so every chunk-side caller round-trips through chunk.IDFromLedger(...) to get back to a chunk. Suggest renaming to retentionFloorChunk and returning a ChunkID to match the doc — then this becomes rangeStart := retentionFloorChunk(...), and only the metrics.Watermark call (which wants a ledger) does the chunkFirstLedger conversion. The definition lives in retention.go (#817), so it spans the stack.

Done in dcde887.

tamirms · 2026-06-30T15:38:07Z

+const defaultRestartBackoff = 5 * time.Second
+
+// Boundaries bundles the external boundaries run and validateConfig inject.
+type Boundaries struct {


Boundaries/BuildBoundaries is heavier than the job needs. Three of the four fields are the same object — NetworkTip is backendTip{backend}, Cleanup is the backend's datastore teardown — so this is really "the backfill backend, viewed three ways" plus the unrelated ServeReads. It's also a second injection layer on top of daemonOptions (a factory, inside the options bag, returning another bag of seams), with two dead params (paths, cat are ignored in buildProductionBoundaries).

Suggest collapsing to the one thing that varies: an injectable backfill.Backend (nil ⇒ frontfill) plus ServeReads, with runDaemonWith deriving NetworkTip and Cleanup. That removes the struct, the factory, the dead params, and the vague name; tests already inject a fakeBackend with Tip(), so they lose nothing. Not in the design doc — pure impl-seam cleanup.

Done in dcde887.

tamirms · 2026-06-30T15:54:44Z

+// progress. Each pass samples networkTip, anchors on max(tip, lastCommitted),
+// computes [rangeStart, rangeEnd] with the mid-chunk exclusion, and breaks on an
+// empty or non-advancing range.
+func catchUp(ctx context.Context, cfg StartConfig, lastCommitted, earliest uint32) (uint32, error) {


The doc unified its terminology on "backfill" — it's used throughout (42×) and "catch-up" appears nowhere; it calls this very loop "Backfill" that "re-runs if the tip advances mid-pass." catchUp reintroduces "catch-up" as a parallel term, and it spreads into the "catch-up pass" log lines and the CatchupPass/CatchupProgress metrics. Since RunBackfill is already the single pass, suggest renaming the loop to something backfill-rooted that captures "re-pass until the tip" — e.g. backfillToTip — with the logs and metrics following (BackfillPass/BackfillProgress). Keeps the code greppable against the doc, same as the other naming items.

Done in 307e14d.

Three doc-alignment cleanups (tamirms), no behavior change: 1. progress.go: drop completeThrough + the preGenesisLedger const. lastCommittedLedger now derives the watermark inline (max of floor term earliest-1 and cold term chunk.LastLedger, signed, floored at FirstLedgerSeq-1) matching the doc's lastCommittedLedger; watermarkMidChunk uses chunk.ID(c).LastLedger() directly with an explicit sub-genesis guard (the fresh-start watermark is sub-genesis, so the guard stays — c is NOT always >= 0). preGenesisLedger moves to a test-only const. 2. retention.go: rename effectiveRetentionFloor -> retentionFloorChunk returning a chunk.ID (the doc's name + type). Chunk-side callers drop their IDFromLedger round-trip (rangeStart := retentionFloorChunk(...)); only the two metrics.Watermark calls convert back via .FirstLedger() (lossless — both floor terms are chunk-first-ledgers). Updated the geometry + retention_test references. 3. daemon.go: collapse Boundaries{NetworkTip,Backend,ServeReads,Cleanup} + the BuildBoundaries factory into an injectable backfill.Backend (nil => frontfill) + ServeReads, with runDaemonWith deriving NetworkTip (resolveNetworkTip) and Cleanup. buildProductionBoundaries -> buildBackfillBackend(ctx,cfg,logger)(Backend,cleanup,err); removes the struct, its validate, the factory, the two dead params (paths, cat), and the vague name. Tests inject a fakeBackend (re-added tipErr for the unreachable-tip case); dropped the now-moot BuildBoundariesError seam test.

…rminology The doc unified on 'backfill' (used throughout, 'catch-up' appears nowhere) and calls this very loop the backfill that re-runs if the tip advances mid-pass. 'catch-up' was a parallel term; rename it away for doc-greppability: - catchUp -> backfillToTip (the re-pass-until-tip loop; RunBackfill is the single pass) - metrics: CatchupProgress/CatchupPass -> BackfillProgress/BackfillPass, and the Prometheus fields/strings (catchup_backfilled_ledger -> backfilled_ledger, catchup_target_ledger -> backfill_target_ledger, catchup_passes_total -> backfill_passes_total, phase label catchup_pass -> backfill_pass). Not yet exposed (#772), so no dashboard impact. - log lines ('catch-up pass ...' -> 'backfill pass ...') and all 'catch-up' prose. No behavior change; whole fullhistory tree green (-short), vet clean. grep for catchup/catch-up across the subsystem now returns nothing.

chatgpt-codex-connector · 2026-06-30T16:23:16Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

The explicit c < 0 guard was unnecessary. chunkIDOfLedger only ever returns -1 (for any sub-genesis ledger) or a real chunk id >= 0, and the sole sub-genesis value watermarkMidChunk is ever called with is the fresh-start sentinel preGenesisLedger (= FirstLedgerSeq-1). For c = -1, chunk.ID(-1).LastLedger() wraps — (MaxUint32+1) overflows to 0 — back to exactly preGenesisLedger, so the comparison already reports a boundary (false) with no special case. (This is the uint32 aliasing the removed TestCompleteThrough documented.) tamirms's suggested chunk.ID(c).LastLedger() form is therefore correct for every reachable input. TestWatermarkMidChunk/genesis_sentinel_is_a_boundary still passes.

chatgpt-codex-connector · 2026-06-30T16:32:29Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

- progress.go: the G115 int64->uint32 narrowing is chunk.ID(cold) (cold is int64), not uint32(through) — move the //nolint:gosec onto that line and drop the unused one on the return (gosec doesn't flag the bounded uint32(through)). - daemon.go: supervise's 'sleepCtx != nil -> return nil' is an intentional clean-shutdown-on-ctx-cancel; annotate //nolint:nilerr.

chatgpt-codex-connector · 2026-06-30T16:54:05Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

tamirms · 2026-06-30T17:01:16Z

+// (backfill, freeze/rebuild, prune) plus derived progress gauges; distinct from
+// per-data-type ingest.MetricSink. All methods must be safe for concurrent use.
+// LastCommitted and ChunkBoundary await the Phase-2 live-ingestion loop; the rest are wired.
+type Metrics interface {


A lot of these are redundant or don't carry information you'd actually use. Suggested cuts:

Three gauges hold the same number. watermark_ledger, catchup_backfilled_ledger, and last_committed_ledger are all set from the same lastCommitted value. Keep one. I'd keep last_committed_ledger — the design doc says "last committed" everywhere (21 times) and "watermark" twice, so it's the name people will recognize.

Tip, lag, and target aren't needed. catchup_target_ledger is just the tip. ingestion_lag_ledgers is tip minus last-committed, which anyone can compute. And captive core already publishes the latest ledger it has, so the daemon doesn't need to report the tip at all — to get lag, subtract our last_committed_ledger from captive core's latest-ledger metric in the query. None of the three earns its place.

Counters that don't add anything. catchup_passes_total and freeze_indexes_total are just the _count of the catchup_pass and rebuild phase-duration histograms. rebuilt_chunks_total and rebuild_chunks_per_index both count chunks folded into index rebuilds, which isn't a number you'd act on. freeze_chunks_total counts chunks written, which just tracks last_committed_ledger moving. Drop all of these.

chunk_boundaries_total has no code that sets it — it's reserved for the Phase-2 ingestion loop. Like the #817/#818 reviews: add the metric when the code that fills it lands.

cold_tier_bytes is computed the slow way. It walks every cold directory and calls stat on every file — about 30,000 files at mainnet size, worse as history grows, and much worse if the files are on network storage. Disk usage is something the OS tracks per filesystem: one statfs call returns the used bytes, and it's more accurate (real disk blocks, not apparent file sizes). node_exporter already reports it per mount.

Worth keeping: last_committed_ledger, retention_floor_ledger, pruned_ops_total, and phase_duration_seconds by phase. Four instead of fifteen.

Done in ac64eac — cut to the four you flagged (last_committed_ledger, retention_floor_ledger, pruned_ops_total, phase_duration_seconds by phase).

tamirms · 2026-06-30T17:25:56Z

+type notConfiguredTip struct{}
+
+func (notConfiguredTip) NetworkTip(context.Context) (uint32, error) {
+	return 0, errors.New("no bulk backend configured ([backfill.bsb].bucket_path empty); " +


Small one: this error points operators at [backfill.bsb].bucket_path, but that key doesn't exist — the bucket path lives in [backfill.datastore.params], and the thing that actually triggers frontfill-only mode is an empty [backfill.datastore].type (the check just above at line 236). An operator who wants to configure a backend and reads this gets sent to the wrong place. Suggest pointing it at [backfill.datastore].type.

Fixed in 5d91d07. The error now points at [backfill.datastore].type (the empty-type check just above that actually selects frontfill-only mode), instead of the non-existent [backfill.bsb].bucket_path.

tamirms · 2026-06-30T17:25:57Z

+// where chunkIDOfLedger yields -1 and chunk.ID(-1).LastLedger() wraps (MaxUint32+1
+// overflows to 0) back to exactly preGenesisLedger — so the comparison reports a
+// boundary (false) without a special case.
+func watermarkMidChunk(lastCommitted uint32) bool {


Same naming point as the metrics one: this function (plus a few comments, the log "watermark derived", and the error "startup derive watermark") call the value "watermark," but the variable is lastCommitted, the function is lastCommittedLedger, and the log field is last_committed. The design doc uses "last committed" throughout and "watermark" only twice in passing. Could we settle on "last committed" everywhere — watermarkMidChunk -> lastCommittedMidChunk, and the stray "watermark" mentions to match? One name for one thing.

Done in 5d91d07 — settled on "last committed" everywhere.

tamirms · 2026-06-30T17:25:58Z

+// ---------------------------------------------------------------------------
+
+// A young-network first start does no backfill then serves reads once.
+func TestRun_FirstStartCatchUpThenServe(t *testing.T) {


Leftover from the catch-up -> backfill rename: this test (and TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex in daemon_test.go, plus the "catches up" prose around it) still says "CatchUp." Worth renaming to "Backfill" to match.

Done in 5d91d07. Renamed TestRun_FirstStartCatchUpThenServe → ...BackfillThenServe and TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex → TestRunDaemon_BackfillMaterializesAllColdTypesAndIndex, plus the "catches up" prose in those two tests' comments. (Left the generic English "catches up" in resolve_test.go / backend.go — those aren't the catch-up term-of-art.)

Per review: most of the daemon's metrics are redundant or recomputable. Keep only last_committed_ledger, retention_floor_ledger, pruned_ops_total, and phase_duration_seconds (by phase). Dropped: - watermark_ledger + backfilled_ledger: dupes of last_committed_ledger (all three were set from the same lastCommitted). Kept last_committed_ledger — the doc's term. - ingestion_lag_ledgers + backfill_target_ledger: the target is just the tip, and lag = captive-core's latest-ledger minus our last_committed_ledger (compute in the query); the daemon needn't report the tip. - catchup/backfill_passes_total + freeze_indexes_total: the _count of the backfill_pass / rebuild phase-duration histograms. rebuilt_chunks_total + rebuild_chunks_per_index + freeze_chunks_total: not actionable numbers. All dropped. - chunk_boundaries_total: no code sets it (Phase-2). Add when the filling code lands. - cold_tier_bytes + MeasureColdTierBytes: a 30k-file tree-walk; node_exporter already reports per-mount used bytes via statfs. Interface shrinks accordingly: LastCommitted(lastCommitted, retentionFloor) replaces Watermark + the unused 1-arg LastCommitted; Freeze/Rebuild/BackfillPass lose their now-dead count args (just time the phase); IngestionLag/BackfillProgress/ColdTierBytes/ ChunkBoundary removed. Both test recordingMetrics fakes + their assertions updated. Whole fullhistory tree green (-short), vet clean.

chatgpt-codex-connector · 2026-06-30T17:32:50Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

…fill renames Three review nits: - daemon.go: the notConfiguredTip error pointed operators at [backfill.bsb].bucket_path (a key that doesn't exist); frontfill-only is triggered by an empty [backfill.datastore].type — point the message there. - 'watermark' -> 'last committed' everywhere: the value is lastCommitted, the deriver is lastCommittedLedger, the log field is last_committed, and the design doc uses 'last committed' throughout ('watermark' only twice). Renamed watermarkMidChunk -> lastCommittedMidChunk, the 'derive watermark' error + 'watermark derived' log, the prose comments, the last_committed_ledger metric help, and the geometry/test mentions. - leftover catch-up -> backfill test names: TestRun_FirstStartCatchUpThenServe -> ...BackfillThenServe, TestRunDaemon_CatchUpMaterializes... -> ...BackfillMaterializes..., plus the 'catches up' prose around them. No behavior change; whole fullhistory tree green (-short), vet + gofmt clean.

chatgpt-codex-connector · 2026-06-30T17:45:40Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

tamirms · 2026-06-30T18:37:30Z

+	// The txhash kind: one rule per overlapping window (resolveWindow runs the diff).
+	var builds []IndexBuild
+	for _, w := range indexesOverlapping(txLayout, rangeStart, rangeEnd) {
+		build, ok, err := resolveWindow(cat, txLayout, w, rangeStart, rangeEnd, needs)


nit: this should probably be renamed to resolveTxHashIndex

Done in a20ebf5 — renamed resolveWindow → resolveTxHashIndex (def + call + comment).

tamirms

left one nit about renaming resolveWindow

but everything looks great! feel free to merge 🎉

The helper resolves the txhash-index rule for one window; the type-qualified name reads clearer at the call site (nit).

chatgpt-codex-connector · 2026-06-30T19:34:20Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Rebased onto the reworked Phase 1 (#819), which flattened the streaming package into the fullhistory root + a backfill subpackage and made the cold pipeline source-blind (ingest.WriteColdChunk over a ledgerbackend.LedgerStream). The hot tier + lifecycle machinery is re-carved onto that layout: - hot key schema -> geometry (HotState/HotReady/HotTransient, HotChunkKey/ ParseHotChunkKey); hot catalog methods -> catalog (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys) - HotProbe/HotChunk + ErrHotVolumeLost live in package backfill (consumed by backfillSource's hot branch, which now yields a ledgerbackend.LedgerStream + a closer); ProcessConfig gains a nil-safe HotProbe field - the freeze source opens the hot DB READ-ONLY (the design's openRocksDBReadOnly): rocksdb.Config gains a ReadOnly mode (read-only open, no create, no flush-on-close) and hotchunk.OpenReadOnly composes the facades over it, so a freeze never mutates the hot DB it reads - the per-chunk hotchunk DB (single shared multi-CF: ledgers + events + tx-hash; transient/ready state machine) + ingest.HotService rebuilt over hotchunk.DB (one atomic synced WriteBatch, decision (a) — no errgroup fan-out) - progress hot refinement: lastCommittedLedger(cat[, probe]) maxes a hot term (positional or one refineWithHotDB read), loss detected lazily on the open - freeze/discard/prune + progress + retention live in a new fullhistory/lifecycle/ subpackage (the scoped part of #824); the freeze stage reuses backfill.RunBackfill (the same path catch-up uses); the live ingestion loop stays in the daemon root - daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe) - folded-in cleanups: deleted the now-dead RunHot/HotStores orchestration AND the per-type hot ingesters it was the last caller of (the hotchunk.DB HotService supersedes them); aggressively trimmed the phase-2 doc comments Verification: go build + go vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/... (cgo RocksDB toolchain). Stack: streaming-phase2-lifecycle -> streaming-phase1-daemon

…mmitted (apply #819 metrics cut) Round-N review on #819 trimmed the control-plane metrics to a minimal set and explicitly dropped cold_tier_bytes (a ~30k-file tree walk; node_exporter statfs reports per-mount used bytes). The #820 rebase had re-introduced it (plus MeasureColdTierBytes and its per-pass/per-tick call sites) — remove it again. Also drop the per-ledger LedgerCommitted heartbeat: it only re-set the existing last_committed_ledger gauge more often (the lifecycle tick / backfill pass already set it) and added interface surface #816 doesn't require. Kept: live_hot_chunks (actionable discard-backlog signal), chunk_boundaries_total and discarded_hot_chunks_total (now driven by live Phase-2 code).

chowbao force-pushed the streaming-phase1-daemon branch from 3d12c9e to 84ff8c2 Compare June 24, 2026 20:39

chowbao force-pushed the streaming-phase1-primitives branch from 672d01f to 55cabde Compare June 24, 2026 22:05

chowbao force-pushed the streaming-phase1-daemon branch from 84ff8c2 to 419f7ec Compare June 24, 2026 22:05

chowbao force-pushed the streaming-phase1-primitives branch from aac0f4b to 1b47f8e Compare June 25, 2026 04:05

chowbao force-pushed the streaming-phase1-daemon branch from 04f9931 to 7f8e58f Compare June 25, 2026 04:09

chowbao force-pushed the streaming-phase1-primitives branch from 1b47f8e to 72368b2 Compare June 25, 2026 05:19

chowbao force-pushed the streaming-phase1-daemon branch from 7f8e58f to f3431cd Compare June 25, 2026 05:19

chowbao force-pushed the streaming-phase1-primitives branch from 72368b2 to 531dca9 Compare June 25, 2026 14:36

chowbao force-pushed the streaming-phase1-daemon branch from fa0083f to cbc80ab Compare June 25, 2026 14:42

chowbao force-pushed the streaming-phase1-primitives branch 2 times, most recently from 0433d50 to 2e5403a Compare June 25, 2026 18:48

chowbao force-pushed the streaming-phase1-daemon branch from cbc80ab to ae91d20 Compare June 25, 2026 19:14

chowbao force-pushed the streaming-phase1-primitives branch from 2e5403a to 03147f0 Compare June 25, 2026 19:51

chowbao force-pushed the streaming-phase1-primitives branch 5 times, most recently from ccebeb0 to d169a0d Compare June 25, 2026 21:05

chatgpt-codex-connector Bot reviewed Jun 30, 2026

View reviewed changes

tamirms reviewed Jun 30, 2026

View reviewed changes

chowbao added 2 commits June 30, 2026 12:01

tamirms reviewed Jun 30, 2026

View reviewed changes

tamirms approved these changes Jun 30, 2026

View reviewed changes

backfill: rename resolveWindow -> resolveTxHashIndex

a20ebf5

The helper resolves the txhash-index rule for one window; the type-qualified name reads clearer at the call site (nit).

chowbao merged commit 954ed78 into feature/full-history Jun 30, 2026
7 of 15 checks passed

github-project-automation Bot moved this from Needs Review to Done in Platform Scrum Jun 30, 2026

chowbao deleted the streaming-phase1-daemon branch June 30, 2026 20:04

Uh oh!

Conversation

chowbao commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this layer adds

Reviewing

Verification

Uh oh!

chowbao commented Jun 25, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chowbao Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chowbao Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chowbao Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented Jun 30, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 30, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 30, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chowbao Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chowbao Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented Jun 30, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 30, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tamirms left a comment

Choose a reason for hiding this comment

Uh oh!

chowbao commented Jun 24, 2026 •

edited

Loading

chowbao Jun 30, 2026 •

edited

Loading

chowbao Jun 30, 2026 •

edited

Loading

chowbao Jun 30, 2026 •

edited

Loading

chowbao Jun 30, 2026 •

edited

Loading

chowbao Jun 30, 2026 •

edited

Loading