fix(cache): retry bucket init on transient NATS disconnect#3049
Merged
migmartri merged 5 commits intoApr 17, 2026
Conversation
The control plane panicked on startup when UpdateStream on the KV backing stream hit the NATS default request timeout. initBucket now skips UpdateStream when Discard is already DiscardOld and logs a warning instead of returning an error when the JetStream metadata calls fail, matching the fail-open pattern used by the cache's runtime operations. Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
There was a problem hiding this comment.
1 issue found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/cache/natskv_test.go">
<violation number="1" location="pkg/cache/natskv_test.go:291">
P2: This assertion does not validate idempotent `UpdateStream` behavior: `Created` is creation-time metadata, so it stays equal even when `UpdateStream` runs.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
NATS auto-reconnects (MaxReconnects=-1, ReconnectWait=2s), but the initial cache bucket init had no retry path — a drop mid-init would surface as an immediate error. Wrap initBucket in a bounded retry (30s total, 2s period) that only retries connection-level errors so configuration failures still fail fast. Also skip UpdateStream when the backing stream already has DiscardOld, avoiding redundant metadata calls on every boot, and replace the panic(err) in main.go with a logged exit so a genuinely unavailable NATS fails startup cleanly instead of dumping a stack trace. Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
watchReconnect previously ran a single initBucket attempt after each reconnect signal. A reconnect that races with NATS leader election or cluster settle would leave the bucket config stale until the next disconnect fired another signal. Reuse the bounded retry loop so reconnect handling survives an unsettled restart window. Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
Reuse nopLogger from pkg/cache instead of duplicating a test stub, drop the redundant lastErr local in the retry loop, and trim the initMaxWait comment to its intent. Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
Previous idempotency test checked info.Created across inits, but Created is creation-time metadata and never changes on UpdateStream, so the assertion was a no-op. Exercise ensureDiscardOld directly on both branches: flip the backing stream to DiscardNew, verify the update branch restores it, then measure nc.Stats().OutMsgs across a second call to confirm UpdateStream is not issued when Discard is already correct. Signed-off-by: Miguel Martinez <miguel@chainloop.dev>
Member
Author
|
Valid catch — |
jiparis
approved these changes
Apr 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
initBucketsurfaced the error immediately and the control plane panicked on startup. The NATS client already retries connections in the background (MaxReconnects=-1,ReconnectWait=2s), so the cache init now retries alongside it for up to 30s before giving up.context.DeadlineExceeded,nats.ErrConnectionClosed,ErrNoServers,ErrTimeout,ErrDisconnected,ErrConnectionDraining). Configuration errors — e.g. unsupported replica count — still fail fast.UpdateStreamon the backing KV stream is now skipped whenDiscardis alreadyDiscardOld(idempotent on redeploys), and has an explicit 5s context timeout.panic(err)inapp/controlplane/cmd/main.gois replaced with a logged exit so a genuinely unavailable NATS fails startup with a readable error message instead of a stack trace. Deferred cleanups (sentry flush, plugin cleanup, context cancel) are invoked explicitly before exit.