Skip to content

Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815)#819

Merged
chowbao merged 36 commits into
feature/full-historyfrom
streaming-phase1-daemon
Jun 30, 2026
Merged

Streaming daemon: Phase 1 layer 3 — orchestration + daemon (closes #815)#819
chowbao merged 36 commits into
feature/full-historyfrom
streaming-phase1-daemon

Conversation

@chowbao

@chowbao chowbao commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Closes #815 — Phase 1 (Backfill), layer 3 of 3. Stacked on #818 (streaming-phase1-primitives).

Orchestration + the daemon entrypoint, completing the cold-only backfill daemon.

What this layer adds

  • Derived progress / watermark (progress.go) — recomputed from durable catalog keys on every start, never stored; signed-int64 sentinel arithmetic so the genesis / "nothing complete" markers can't wrap.
  • resolve (backfill/resolve.go) — a pure catalog diff over [rangeStart, rangeEnd] → a Plan: per-chunk ledgers/events rule + per-window txhash-index rule (with the trailing-window cap). Recomputed every run, so a restart re-plans cleanly.
  • executePlan (backfill/execute.go) — one bounded worker pool; index builds wait on their in-coverage chunk builds before acquiring a slot (no slot-starvation deadlock at workers=1); cenkalti/backoff retries.
  • run + the catch-up loop (startup.go) — networkTip-bounded catch-up (anchor = max(tip, watermark), mid-chunk resume exclusion, first-start-with-no-tip fatal) → serveReads handoff. No hot tier / live loop / lifecycle in Phase 1.
  • Daemon entrypoint (daemon.go) — LoadConfig → lock roots → open catalog → build boundaries → validateConfig (pin/confirm earliest_ledger) → Prometheus registry → supervised run loop. Plus the full-history CLI subcommand (folds into the general RPC start command at the Cut over to RPC v2: remove SQLite ingestion/query, serve from v2 stores #772 cutover).
  • Production backend wiredbuildProductionBoundaries builds the BSB backend via backfill.NewBSBBackendFromConfig, releasing the datastore at shutdown; frontfill-only when no datastore is configured. [backfill.datastore] + [backfill.bsb] reuse the SDK config types verbatim (datastore.DataStoreConfig + ledgerbackend.BufferedStorageBackendConfig), so any SDK datastore — GCS, S3, Filesystem — works as the bulk source. The daemon drives the cold tier through ingest.ColdService + NewPrometheusSink.

This branch is a runnable backfill daemon: it boots from one TOML, catches up from the configured lake (or an injected backend), and serves — the read server itself is a #772 no-op for now (reads still come from the v1 daemon until the cutover).

Reviewing

A per-file design map (each file → its #815 deliverable + #787 design section) and a suggested reading order are posted as review comments on this PR.

Verification

go build + go vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/... (cgo RocksDB toolchain); golangci-lint runs in CI. The full cmd/stellar-rpc binary link needs the Rust libpreflight/libxdr2json (CI make build-libs); go vet ./cmd/stellar-rpc/ type-checks the entrypoint locally.

Stack: streaming-phase1-daemonstreaming-phase1-primitives (#818)

@chowbao

chowbao commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Note (non-blocking): a zero-txhash-key window → infinite restart loop on synthetic/standalone networks

Surfaced while writing the daemon catch-up E2E (TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex, fa0083f):

A complete chunk whose 10,000 ledgers are all transaction-free produces an empty txhash .bin. buildTxhashIndex then fails with txhash: cannot build a cold index with zero keys; that error propagates runBackfill → startStreaming, and superviseStreaming classifies it as restartable (not a fatal sentinel). So the daemon restarts (~5s backoff), re-derives the watermark, re-resolves the same still-unbuilt window, fails again → an infinite restart loop (~7s/cycle). It is not slow ingest — a successful catch-up of the same chunk runs in ~0.3s.

Not a production concern: mainnet/testnet never produce an all-zero-tx 10k-ledger chunk. But a standalone/quickstart network, or a naive synthetic test backend (e.g. an all-zero-tx fixture), hits it easily. The new E2E sidesteps it by giving the chunk a sparse few one-tx ledgers so the .bin has keys to index.

Decision: no fix for now — it can't happen in prod. If ever revisited: either build a degenerate/empty index for a zero-key window, or make the zero-key error a fatal sentinel so superviseStreaming surfaces it instead of retrying forever.

@chowbao chowbao force-pushed the streaming-phase1-primitives branch from 72368b2 to 531dca9 Compare June 25, 2026 14:36
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from fa0083f to cbc80ab Compare June 25, 2026 14:42
chowbao added a commit that referenced this pull request Jun 25, 2026
…hanges

Rebased onto the updated #819 and propagated #817's API changes into the Phase 2
hot-store/lifecycle layer:
- window -> tx-hash index rename + key prefix index: -> txhash_index:
  (TxHashIndexCoverage.Window -> .Index, Catalog.windows -> .txhashIndex); merged
  the hot-catalog methods (HotChunkKeys/ReadyHotChunkKeys/hotChunkKeysWith) onto
  #817's renamed Catalog.
- Catalog.Get/Has -> get/has; config sections regrouped (cfg.Retention.*,
  cfg.Layout.*, cfg.Storage.*, cfg.Ingestion.*); pins via PinLayout.
- RetentionGate -> RetentionFloor: eligibility.go uses floor.Excludes(c) and
  floor.Excludes(layout.LastChunk(idx)) for the discard/prune scans; the reader-
  retention test dropped its Admits/Floor assertions (deferred to #772) and keeps
  the straddling-window prune-survival check via Excludes.
- renamed the lifecycle test's oneTxLCMBytes -> oneTxLCMRand (the #819 daemon-E2E
  added a same-named helper).

Also fixes a latent daemon bug surfaced by the catch-up E2E: startConfig built
ProcessConfig without a HotProbe, so any real catch-up aborted with "HotProbe is
nil" and the supervisor looped. Wire the production NewRocksHotProbe.

build/vet/test -short green (streaming + hotchunk + ingest + stores).
chowbao added a commit that referenced this pull request Jun 25, 2026
…hanges

Rebased onto the updated #820 and propagated #817's API changes into the Phase 2
live-ingestion/daemon layer:
- window -> tx-hash index rename + key prefix index: -> txhash_index:
  (TxHashIndexCoverage.Index, Catalog.txhashIndex), Catalog.Get/Has -> get/has,
  config sections regrouped (cfg.Retention/Layout/Storage/Ingestion), pins via
  PinLayout.
- daemon.go merge: kept #821's live-ingestion wiring (LifecycleConfig + Core) and
  deduped the HotProbe line (#821's Phase-2 wiring already set it, so #820's
  HotProbe fix is redundant here).
- removed the #819 cold-only catch-up E2E (TestRunDaemon_CatchUpMaterializes...)
  + its someTxBackend/oneTxLCMBytes helpers: #821's daemon now requires
  Boundaries.Core and runs a continuous live loop, so a cold-only "catch up then
  return" test can't fit — and TestE2E_DaemonLifecycle covers it end to end.

Mechanical propagation only; build/vet/test -short green (the heavy lifecycle E2E
stays -short-gated).
@chowbao chowbao force-pushed the streaming-phase1-primitives branch 2 times, most recently from 0433d50 to 2e5403a Compare June 25, 2026 18:48
@chowbao chowbao force-pushed the streaming-phase1-daemon branch from cbc80ab to ae91d20 Compare June 25, 2026 19:14
chowbao added a commit that referenced this pull request Jun 25, 2026
Rebased the orchestration/daemon layer onto the reorganized #818 (geometry +
catalog subpackages) and propagated #817's API changes:
- qualify moved symbols: catalog.{Catalog,ArtifactSet,NewArtifactSet,AllArtifacts,
  NewCatalog}; geometry.{Kind*,State*,Layout,NewLayout,TxHashIndexID,
  TxHashIndexCoverage,TxHashIndexLayout,NewTxHashIndexLayout,MaxChunksPerTxhashIndex,
  LastCompleteChunkAt}
- window -> tx-hash-index rename (Windows->TxHashIndexLayout, FrozenCoverage->
  FrozenTxHashIndex, AllIndexKeys->AllTxHashIndexKeys, IndexBuild.Window->.Index)
- config regroup: cfg.Backfill.ChunksPerTxhashIndex->cfg.Layout.*,
  cfg.Streaming.*->cfg.Retention.*/cfg.Storage.*; daemon_test config literals updated
- tests reach the index layout via the public cat.TxHashIndexLayout() (txhashIndex
  is now an unexported field of the catalog package)

build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
@chowbao chowbao force-pushed the streaming-phase1-primitives branch from 2e5403a to 03147f0 Compare June 25, 2026 19:51
chowbao added a commit that referenced this pull request Jun 25, 2026
Restacked on the split/no-hooks #819 and ported the hot tier across the new
package boundary:
- hot key schema -> geometry (HotState/HotReady/HotTransient, exported
  HotChunkKey/ParseHotChunkKey/HotChunkPrefix); hot catalog methods -> catalog
  (HotState, PutHotTransient, FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys)
- processChunk hot-source branch + progress hot refinement
  (lastCommittedLedger(cat, probe), highestReadyChunkSigned, refineWithHotDB)
- new files: pkg/stores/hotchunk, streaming/{eligibility,hotsource,ingest,lifecycle}
- daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe)
- crash-hooks REMOVED to match #817/#818 (the split makes cat.hooks unreachable
  from streaming); the one beforeHotTransient hook test is dropped, the rest are
  the structural crash tests #817/#818 established
- propagated renames: window->tx-hash-index, RetentionGate->RetentionFloor,
  cat.Has->public HotState, cat.layout->Layout()

build + vet + go test -short green on ./cmd/stellar-rpc/internal/fullhistory/...
@chowbao chowbao force-pushed the streaming-phase1-primitives branch 5 times, most recently from ccebeb0 to d169a0d Compare June 25, 2026 21:05
Reverts the resolve-side debris handling (from 33d10cd): drop windowHasDebris,
the stored.covers(desired) debris branch in resolveWindow (back to a plain
'covered ⇒ no build'), and the 'freezing' arm of buildThenSweep's eager sweep
(the 'pruning' arm stays — that post-build demote+sweep is design-specified).

Debris cleanup is already homed in the lifecycle's prune stage (eligiblePruneOps,
run on every lifecycle run including the startup seed): index keys in
freezing/pruning, demoted pruning .bin inputs, and redundant .bin in a finalized
window. Handling it in resolve duplicated that — once the lifecycle lands, one run
would sweep the same leftovers twice (resolve in stage 1, eligiblePruneOps in
stage 3) — and put reclamation inside resolve, whose contract everywhere else is
just 'diff desired-vs-catalog → plan builds'. The codex 'stranded forever'
concern only holds for Phase 1 in isolation; the next startup's prune stage
reclaims it once that PR lands, and steady state never strands.

Drops the two now-dead resolver debris tests. The crash-matrix
AfterMarkBeforeCommit test is unaffected (it builds the freezing coverage itself
→ frozen, predecessor → pruning, swept by the kept pruning arm).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f04567534

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

return IndexBuild{}, false, err
}
stored := coverageRange{Lo: frozen.Lo, Hi: frozen.Hi}
if hasFrozen && stored.covers(desired) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't defer sweeps to a missing lifecycle

When this branch is hit after a crash that already committed the replacement index but died before buildThenSweep swept the demoted predecessor coverage or terminal .bin keys, returning here leaves that pruning debris forever. Fresh evidence for re-raising: in this final tree startup only runs catch-up and ServeReads, and a repo-wide search finds no lifecycle/eligiblePruneOps runner, so the lifecycle prune stage referenced here never runs in Phase 1; schedule a sweep-only build here or wire the lifecycle before deferring to it.

Useful? React with 👍 / 👎.


// completeThrough maps a signed chunk index to its "complete through" last ledger:
// c < 0 ⇒ preGenesisLedger (no uint32 wraparound); c >= 0 ⇒ chunk.ID(c).LastLedger().
func completeThrough(c int64) uint32 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

completeThrough and preGenesisLedger don't appear in the design doc, and what they encode is already spelled out there. completeThrough(c) is just "chunk c's last ledger" with the c < 0 sentinel folded in — and the doc's lastCommittedLedger handles that sentinel with an explicit branch instead of a helper:

base := earliest - 1
cold := highestDurableChunk(cat)   // -1 if none
if cold >= 0 {
    return max(base, chunk.ID(cold).LastLedger()), nil   // == doc's chunkLastLedger(cold)
}
return base, nil

watermarkMidChunk can likewise use chunk.ID(c).LastLedger() directly (the ledger is ≥ genesis there, so c ≥ 0). That drops both helpers and makes the code greppable against the doc's chunkLastLedger / lastCommittedLedger.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in dcde887. Dropped both — lastCommittedLedger now derives the watermark inline (signed max of the floor term earliest-1 and the cold term chunk.ID(cold).LastLedger(), floored at FirstLedgerSeq-1), matching the doc's lastCommittedLedger/chunkLastLedger.

One adjustment: watermarkMidChunk keeps an explicit c < 0 → false guard rather than calling chunk.ID(c).LastLedger() unconditionally. The fresh-start watermark is the sub-genesis sentinel (lastCommitted = FirstLedgerSeq-1, chunkIDOfLedger → -1), and it reaches watermarkMidChunk both in the catch-up loop (line ~114, "Signed so genesis reads as a boundary") and the {"genesis sentinel is a boundary", preGenesisLedger, false} test — so c is not always ≥ 0 there. The guard mirrors the explicit-branch style you proposed for lastCommittedLedger. preGenesisLedger is now a test-only const (the name was only load-bearing in test expectations).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The adjustment seems weird. Let me see if there is way around it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction — removed the guard in a51f19e; the unconditional chunk.ID(c).LastLedger() you suggested is correct here, and I had it wrong above.

chunkIDOfLedger only ever returns -1 (for any sub-genesis ledger) or a real chunk id ≥ 0, and the only sub-genesis value watermarkMidChunk is ever passed is the fresh-start sentinel preGenesisLedger (= FirstLedgerSeq-1). For c = -1, chunk.ID(-1).LastLedger() wraps(MaxUint32+1) overflows to 0 — back to exactly preGenesisLedger, so lastCommitted != … already yields false (a boundary) with no special case. (Same uint32 aliasing the old TestCompleteThrough documented.) So the guard was redundant, not load-bearing — my earlier "c is not always ≥ 0" reasoning was beside the point. TestWatermarkMidChunk/genesis_sentinel_is_a_boundary still passes. Left a comment noting the wrap so it isn't re-added.

// max() guards a lagging bulk tip: the tip alone could regress the floor below
// pruning or drop a complete watermark chunk.
anchor := max(tip, lastCommitted)
rangeStart := chunk.IDFromLedger(effectiveRetentionFloor(anchor, retentionChunks, earliest))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design doc names this retentionFloorChunk and returns a ChunkID; here it's effectiveRetentionFloor returning a ledger, so every chunk-side caller round-trips through chunk.IDFromLedger(...) to get back to a chunk. Suggest renaming to retentionFloorChunk and returning a ChunkID to match the doc — then this becomes rangeStart := retentionFloorChunk(...), and only the metrics.Watermark call (which wants a ledger) does the chunkFirstLedger conversion. The definition lives in retention.go (#817), so it spans the stack.

@chowbao chowbao Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in dcde887.

const defaultRestartBackoff = 5 * time.Second

// Boundaries bundles the external boundaries run and validateConfig inject.
type Boundaries struct {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boundaries/BuildBoundaries is heavier than the job needs. Three of the four fields are the same object — NetworkTip is backendTip{backend}, Cleanup is the backend's datastore teardown — so this is really "the backfill backend, viewed three ways" plus the unrelated ServeReads. It's also a second injection layer on top of daemonOptions (a factory, inside the options bag, returning another bag of seams), with two dead params (paths, cat are ignored in buildProductionBoundaries).

Suggest collapsing to the one thing that varies: an injectable backfill.Backend (nil ⇒ frontfill) plus ServeReads, with runDaemonWith deriving NetworkTip and Cleanup. That removes the struct, the factory, the dead params, and the vague name; tests already inject a fakeBackend with Tip(), so they lose nothing. Not in the design doc — pure impl-seam cleanup.

@chowbao chowbao Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in dcde887.

// progress. Each pass samples networkTip, anchors on max(tip, lastCommitted),
// computes [rangeStart, rangeEnd] with the mid-chunk exclusion, and breaks on an
// empty or non-advancing range.
func catchUp(ctx context.Context, cfg StartConfig, lastCommitted, earliest uint32) (uint32, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc unified its terminology on "backfill" — it's used throughout (42×) and "catch-up" appears nowhere; it calls this very loop "Backfill" that "re-runs if the tip advances mid-pass." catchUp reintroduces "catch-up" as a parallel term, and it spreads into the "catch-up pass" log lines and the CatchupPass/CatchupProgress metrics. Since RunBackfill is already the single pass, suggest renaming the loop to something backfill-rooted that captures "re-pass until the tip" — e.g. backfillToTip — with the logs and metrics following (BackfillPass/BackfillProgress). Keeps the code greppable against the doc, same as the other naming items.

@chowbao chowbao Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 307e14d.

chowbao added 2 commits June 30, 2026 12:01
Three doc-alignment cleanups (tamirms), no behavior change:

1. progress.go: drop completeThrough + the preGenesisLedger const. lastCommittedLedger
   now derives the watermark inline (max of floor term earliest-1 and cold term
   chunk.LastLedger, signed, floored at FirstLedgerSeq-1) matching the doc's
   lastCommittedLedger; watermarkMidChunk uses chunk.ID(c).LastLedger() directly with
   an explicit sub-genesis guard (the fresh-start watermark is sub-genesis, so the
   guard stays — c is NOT always >= 0). preGenesisLedger moves to a test-only const.

2. retention.go: rename effectiveRetentionFloor -> retentionFloorChunk returning a
   chunk.ID (the doc's name + type). Chunk-side callers drop their IDFromLedger
   round-trip (rangeStart := retentionFloorChunk(...)); only the two metrics.Watermark
   calls convert back via .FirstLedger() (lossless — both floor terms are
   chunk-first-ledgers). Updated the geometry + retention_test references.

3. daemon.go: collapse Boundaries{NetworkTip,Backend,ServeReads,Cleanup} + the
   BuildBoundaries factory into an injectable backfill.Backend (nil => frontfill) +
   ServeReads, with runDaemonWith deriving NetworkTip (resolveNetworkTip) and Cleanup.
   buildProductionBoundaries -> buildBackfillBackend(ctx,cfg,logger)(Backend,cleanup,err);
   removes the struct, its validate, the factory, the two dead params (paths, cat), and
   the vague name. Tests inject a fakeBackend (re-added tipErr for the unreachable-tip
   case); dropped the now-moot BuildBoundariesError seam test.
…rminology

The doc unified on 'backfill' (used throughout, 'catch-up' appears nowhere) and
calls this very loop the backfill that re-runs if the tip advances mid-pass.
'catch-up' was a parallel term; rename it away for doc-greppability:

- catchUp -> backfillToTip (the re-pass-until-tip loop; RunBackfill is the single pass)
- metrics: CatchupProgress/CatchupPass -> BackfillProgress/BackfillPass, and the
  Prometheus fields/strings (catchup_backfilled_ledger -> backfilled_ledger,
  catchup_target_ledger -> backfill_target_ledger, catchup_passes_total ->
  backfill_passes_total, phase label catchup_pass -> backfill_pass). Not yet exposed
  (#772), so no dashboard impact.
- log lines ('catch-up pass ...' -> 'backfill pass ...') and all 'catch-up' prose.

No behavior change; whole fullhistory tree green (-short), vet clean. grep for
catchup/catch-up across the subsystem now returns nothing.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

The explicit c < 0 guard was unnecessary. chunkIDOfLedger only ever returns -1
(for any sub-genesis ledger) or a real chunk id >= 0, and the sole sub-genesis
value watermarkMidChunk is ever called with is the fresh-start sentinel
preGenesisLedger (= FirstLedgerSeq-1). For c = -1, chunk.ID(-1).LastLedger()
wraps — (MaxUint32+1) overflows to 0 — back to exactly preGenesisLedger, so the
comparison already reports a boundary (false) with no special case. (This is the
uint32 aliasing the removed TestCompleteThrough documented.) tamirms's suggested
chunk.ID(c).LastLedger() form is therefore correct for every reachable input.

TestWatermarkMidChunk/genesis_sentinel_is_a_boundary still passes.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

- progress.go: the G115 int64->uint32 narrowing is chunk.ID(cold) (cold is
  int64), not uint32(through) — move the //nolint:gosec onto that line and drop
  the unused one on the return (gosec doesn't flag the bounded uint32(through)).
- daemon.go: supervise's 'sleepCtx != nil -> return nil' is an intentional
  clean-shutdown-on-ctx-cancel; annotate //nolint:nilerr.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

// (backfill, freeze/rebuild, prune) plus derived progress gauges; distinct from
// per-data-type ingest.MetricSink. All methods must be safe for concurrent use.
// LastCommitted and ChunkBoundary await the Phase-2 live-ingestion loop; the rest are wired.
type Metrics interface {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these are redundant or don't carry information you'd actually use. Suggested cuts:

Three gauges hold the same number. watermark_ledger, catchup_backfilled_ledger, and last_committed_ledger are all set from the same lastCommitted value. Keep one. I'd keep last_committed_ledger — the design doc says "last committed" everywhere (21 times) and "watermark" twice, so it's the name people will recognize.

Tip, lag, and target aren't needed. catchup_target_ledger is just the tip. ingestion_lag_ledgers is tip minus last-committed, which anyone can compute. And captive core already publishes the latest ledger it has, so the daemon doesn't need to report the tip at all — to get lag, subtract our last_committed_ledger from captive core's latest-ledger metric in the query. None of the three earns its place.

Counters that don't add anything. catchup_passes_total and freeze_indexes_total are just the _count of the catchup_pass and rebuild phase-duration histograms. rebuilt_chunks_total and rebuild_chunks_per_index both count chunks folded into index rebuilds, which isn't a number you'd act on. freeze_chunks_total counts chunks written, which just tracks last_committed_ledger moving. Drop all of these.

chunk_boundaries_total has no code that sets it — it's reserved for the Phase-2 ingestion loop. Like the #817/#818 reviews: add the metric when the code that fills it lands.

cold_tier_bytes is computed the slow way. It walks every cold directory and calls stat on every file — about 30,000 files at mainnet size, worse as history grows, and much worse if the files are on network storage. Disk usage is something the OS tracks per filesystem: one statfs call returns the used bytes, and it's more accurate (real disk blocks, not apparent file sizes). node_exporter already reports it per mount.

Worth keeping: last_committed_ledger, retention_floor_ledger, pruned_ops_total, and phase_duration_seconds by phase. Four instead of fifteen.

@chowbao chowbao Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ac64eac — cut to the four you flagged (last_committed_ledger, retention_floor_ledger, pruned_ops_total, phase_duration_seconds by phase).

type notConfiguredTip struct{}

func (notConfiguredTip) NetworkTip(context.Context) (uint32, error) {
return 0, errors.New("no bulk backend configured ([backfill.bsb].bucket_path empty); " +

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small one: this error points operators at [backfill.bsb].bucket_path, but that key doesn't exist — the bucket path lives in [backfill.datastore.params], and the thing that actually triggers frontfill-only mode is an empty [backfill.datastore].type (the check just above at line 236). An operator who wants to configure a backend and reads this gets sent to the wrong place. Suggest pointing it at [backfill.datastore].type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5d91d07. The error now points at [backfill.datastore].type (the empty-type check just above that actually selects frontfill-only mode), instead of the non-existent [backfill.bsb].bucket_path.

// where chunkIDOfLedger yields -1 and chunk.ID(-1).LastLedger() wraps (MaxUint32+1
// overflows to 0) back to exactly preGenesisLedger — so the comparison reports a
// boundary (false) without a special case.
func watermarkMidChunk(lastCommitted uint32) bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same naming point as the metrics one: this function (plus a few comments, the log "watermark derived", and the error "startup derive watermark") call the value "watermark," but the variable is lastCommitted, the function is lastCommittedLedger, and the log field is last_committed. The design doc uses "last committed" throughout and "watermark" only twice in passing. Could we settle on "last committed" everywhere — watermarkMidChunk -> lastCommittedMidChunk, and the stray "watermark" mentions to match? One name for one thing.

@chowbao chowbao Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5d91d07 — settled on "last committed" everywhere.

// ---------------------------------------------------------------------------

// A young-network first start does no backfill then serves reads once.
func TestRun_FirstStartCatchUpThenServe(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from the catch-up -> backfill rename: this test (and TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndex in daemon_test.go, plus the "catches up" prose around it) still says "CatchUp." Worth renaming to "Backfill" to match.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5d91d07. Renamed TestRun_FirstStartCatchUpThenServe...BackfillThenServe and TestRunDaemon_CatchUpMaterializesAllColdTypesAndIndexTestRunDaemon_BackfillMaterializesAllColdTypesAndIndex, plus the "catches up" prose in those two tests' comments. (Left the generic English "catches up" in resolve_test.go / backend.go — those aren't the catch-up term-of-art.)

Per review: most of the daemon's metrics are redundant or recomputable. Keep
only last_committed_ledger, retention_floor_ledger, pruned_ops_total, and
phase_duration_seconds (by phase). Dropped:

- watermark_ledger + backfilled_ledger: dupes of last_committed_ledger (all three
  were set from the same lastCommitted). Kept last_committed_ledger — the doc's term.
- ingestion_lag_ledgers + backfill_target_ledger: the target is just the tip, and
  lag = captive-core's latest-ledger minus our last_committed_ledger (compute in the
  query); the daemon needn't report the tip.
- catchup/backfill_passes_total + freeze_indexes_total: the _count of the backfill_pass
  / rebuild phase-duration histograms. rebuilt_chunks_total + rebuild_chunks_per_index
  + freeze_chunks_total: not actionable numbers. All dropped.
- chunk_boundaries_total: no code sets it (Phase-2). Add when the filling code lands.
- cold_tier_bytes + MeasureColdTierBytes: a 30k-file tree-walk; node_exporter already
  reports per-mount used bytes via statfs.

Interface shrinks accordingly: LastCommitted(lastCommitted, retentionFloor) replaces
Watermark + the unused 1-arg LastCommitted; Freeze/Rebuild/BackfillPass lose their
now-dead count args (just time the phase); IngestionLag/BackfillProgress/ColdTierBytes/
ChunkBoundary removed. Both test recordingMetrics fakes + their assertions updated.

Whole fullhistory tree green (-short), vet clean.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

…fill renames

Three review nits:

- daemon.go: the notConfiguredTip error pointed operators at [backfill.bsb].bucket_path
  (a key that doesn't exist); frontfill-only is triggered by an empty
  [backfill.datastore].type — point the message there.

- 'watermark' -> 'last committed' everywhere: the value is lastCommitted, the deriver
  is lastCommittedLedger, the log field is last_committed, and the design doc uses
  'last committed' throughout ('watermark' only twice). Renamed watermarkMidChunk ->
  lastCommittedMidChunk, the 'derive watermark' error + 'watermark derived' log, the
  prose comments, the last_committed_ledger metric help, and the geometry/test mentions.

- leftover catch-up -> backfill test names: TestRun_FirstStartCatchUpThenServe ->
  ...BackfillThenServe, TestRunDaemon_CatchUpMaterializes... -> ...BackfillMaterializes...,
  plus the 'catches up' prose around them.

No behavior change; whole fullhistory tree green (-short), vet + gofmt clean.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

// The txhash kind: one rule per overlapping window (resolveWindow runs the diff).
var builds []IndexBuild
for _, w := range indexesOverlapping(txLayout, rangeStart, rangeEnd) {
build, ok, err := resolveWindow(cat, txLayout, w, rangeStart, rangeEnd, needs)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should probably be renamed to resolveTxHashIndex

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a20ebf5 — renamed resolveWindowresolveTxHashIndex (def + call + comment).

@tamirms tamirms left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left one nit about renaming resolveWindow

but everything looks great! feel free to merge 🎉

The helper resolves the txhash-index rule for one window; the type-qualified
name reads clearer at the call site (nit).
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@chowbao chowbao merged commit 954ed78 into feature/full-history Jun 30, 2026
7 of 15 checks passed
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in Platform Scrum Jun 30, 2026
@chowbao chowbao deleted the streaming-phase1-daemon branch June 30, 2026 20:04
chowbao added a commit that referenced this pull request Jun 30, 2026
Rebased onto the reworked Phase 1 (#819), which flattened the streaming
package into the fullhistory root + a backfill subpackage and made the cold
pipeline source-blind (ingest.WriteColdChunk over a ledgerbackend.LedgerStream).
The hot tier + lifecycle machinery is re-carved onto that layout:

- hot key schema -> geometry (HotState/HotReady/HotTransient, HotChunkKey/
  ParseHotChunkKey); hot catalog methods -> catalog (HotState, PutHotTransient,
  FlipHotReady, DeleteHotKey, {Ready,}HotChunkKeys)
- HotProbe/HotChunk + ErrHotVolumeLost live in package backfill (consumed by
  backfillSource's hot branch, which now yields a ledgerbackend.LedgerStream +
  a closer); ProcessConfig gains a nil-safe HotProbe field
- the freeze source opens the hot DB READ-ONLY (the design's openRocksDBReadOnly):
  rocksdb.Config gains a ReadOnly mode (read-only open, no create, no flush-on-close)
  and hotchunk.OpenReadOnly composes the facades over it, so a freeze never mutates
  the hot DB it reads
- the per-chunk hotchunk DB (single shared multi-CF: ledgers + events + tx-hash;
  transient/ready state machine) + ingest.HotService rebuilt over hotchunk.DB
  (one atomic synced WriteBatch, decision (a) — no errgroup fan-out)
- progress hot refinement: lastCommittedLedger(cat[, probe]) maxes a hot term
  (positional or one refineWithHotDB read), loss detected lazily on the open
- freeze/discard/prune + progress + retention live in a new fullhistory/lifecycle/
  subpackage (the scoped part of #824); the freeze stage reuses backfill.RunBackfill
  (the same path catch-up uses); the live ingestion loop stays in the daemon root
- daemon wires the cold-only catch-up's HotProbe (NewRocksHotProbe)
- folded-in cleanups: deleted the now-dead RunHot/HotStores orchestration AND the
  per-type hot ingesters it was the last caller of (the hotchunk.DB HotService
  supersedes them); aggressively trimmed the phase-2 doc comments

Verification: go build + go vet + go test -short green on
./cmd/stellar-rpc/internal/fullhistory/... (cgo RocksDB toolchain).

Stack: streaming-phase2-lifecycle -> streaming-phase1-daemon
chowbao added a commit that referenced this pull request Jun 30, 2026
…mmitted (apply #819 metrics cut)

Round-N review on #819 trimmed the control-plane metrics to a minimal set and
explicitly dropped cold_tier_bytes (a ~30k-file tree walk; node_exporter statfs
reports per-mount used bytes). The #820 rebase had re-introduced it (plus
MeasureColdTierBytes and its per-pass/per-tick call sites) — remove it again.

Also drop the per-ledger LedgerCommitted heartbeat: it only re-set the existing
last_committed_ledger gauge more often (the lifecycle tick / backfill pass
already set it) and added interface surface #816 doesn't require.

Kept: live_hot_chunks (actionable discard-backlog signal), chunk_boundaries_total
and discarded_hot_chunks_total (now driven by live Phase-2 code).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants