Skip to content

fix(cache): retry bucket init on transient NATS disconnect#3049

Merged
migmartri merged 5 commits into
chainloop-dev:mainfrom
migmartri:fix/cache-natskv-init-fail-open
Apr 17, 2026
Merged

fix(cache): retry bucket init on transient NATS disconnect#3049
migmartri merged 5 commits into
chainloop-dev:mainfrom
migmartri:fix/cache-natskv-init-fail-open

Conversation

@migmartri

@migmartri migmartri commented Apr 16, 2026

Copy link
Copy Markdown
Member

Summary

  • When NATS dropped mid-init, initBucket surfaced the error immediately and the control plane panicked on startup. The NATS client already retries connections in the background (MaxReconnects=-1, ReconnectWait=2s), so the cache init now retries alongside it for up to 30s before giving up.
  • Only connection-level errors are retried (context.DeadlineExceeded, nats.ErrConnectionClosed, ErrNoServers, ErrTimeout, ErrDisconnected, ErrConnectionDraining). Configuration errors — e.g. unsupported replica count — still fail fast.
  • UpdateStream on the backing KV stream is now skipped when Discard is already DiscardOld (idempotent on redeploys), and has an explicit 5s context timeout.
  • panic(err) in app/controlplane/cmd/main.go is replaced with a logged exit so a genuinely unavailable NATS fails startup with a readable error message instead of a stack trace. Deferred cleanups (sentry flush, plugin cleanup, context cancel) are invoked explicitly before exit.

The control plane panicked on startup when UpdateStream on the KV
backing stream hit the NATS default request timeout. initBucket now
skips UpdateStream when Discard is already DiscardOld and logs a
warning instead of returning an error when the JetStream metadata
calls fail, matching the fail-open pattern used by the cache's
runtime operations.

Signed-off-by: Miguel Martinez <miguel@chainloop.dev>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/cache/natskv_test.go">

<violation number="1" location="pkg/cache/natskv_test.go:291">
P2: This assertion does not validate idempotent `UpdateStream` behavior: `Created` is creation-time metadata, so it stays equal even when `UpdateStream` runs.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread pkg/cache/natskv_test.go Outdated
NATS auto-reconnects (MaxReconnects=-1, ReconnectWait=2s), but the
initial cache bucket init had no retry path — a drop mid-init would
surface as an immediate error. Wrap initBucket in a bounded retry
(30s total, 2s period) that only retries connection-level errors so
configuration failures still fail fast.

Also skip UpdateStream when the backing stream already has
DiscardOld, avoiding redundant metadata calls on every boot, and
replace the panic(err) in main.go with a logged exit so a genuinely
unavailable NATS fails startup cleanly instead of dumping a stack
trace.

Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
@migmartri migmartri changed the title fix(cache): make NATS KV bucket init idempotent and fail-open fix(cache): retry bucket init on transient NATS disconnect Apr 16, 2026
watchReconnect previously ran a single initBucket attempt after each
reconnect signal. A reconnect that races with NATS leader election
or cluster settle would leave the bucket config stale until the next
disconnect fired another signal. Reuse the bounded retry loop so
reconnect handling survives an unsettled restart window.

Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
Reuse nopLogger from pkg/cache instead of duplicating a test stub,
drop the redundant lastErr local in the retry loop, and trim the
initMaxWait comment to its intent.

Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
@migmartri migmartri requested a review from a team April 16, 2026 21:41
Previous idempotency test checked info.Created across inits, but
Created is creation-time metadata and never changes on UpdateStream,
so the assertion was a no-op.

Exercise ensureDiscardOld directly on both branches: flip the
backing stream to DiscardNew, verify the update branch restores it,
then measure nc.Stats().OutMsgs across a second call to confirm
UpdateStream is not issued when Discard is already correct.

Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
@migmartri

Copy link
Copy Markdown
Member Author

Valid catch — info.Created is creation-time metadata and never changes on UpdateStream, so the assertion was a no-op. Replaced the test in baaf396: now exercises ensureDiscardOld directly on both branches and asserts the skip path via nc.Stats().OutMsgs delta (confirmed failing with delta=2 when skip is disabled, passing with delta=1 when enabled).

@migmartri migmartri merged commit a0921aa into chainloop-dev:main Apr 17, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants