[release/v25.2.x] acceptance: extend cluster-ready timeout to 10m to reduce flakiness #1541
Merged
RafalKorepta
approved these changes
May 21, 2026
5dab71b to
2496ace
Compare
It might be better to merge #1542 into your PR, since you are fixing the flaky test, but the Buildkite configuration is broken and prevents merging.
Several acceptance scenarios on the release/v25.2.x branch fail at exactly the 5-minute cluster-ready boundary while the cluster eventually does come up. The runners on this branch routinely take 5-7 minutes to bring a Redpanda cluster to Ready under load (image pull, PVC binding, operator reconcile under k3d contention). Without this change, a slow start is indistinguishable from a stuck cluster: tests time out mid-startup with "Condition never satisfied".

This PR centralizes the per-step Eventually budget into two constants in acceptance/steps/cluster.go:

    clusterReadyTimeout = 10 * time.Minute (was 5 * time.Minute literal)
    clusterReadyPoll = 5 * time.Second

and replaces the 5*time.Minute, 5*time.Second literal pairs across the acceptance/steps/ files that wait on cluster readiness:

- cluster.go (6 sites)
- k8s.go (3 sites)
- regression.go (2 sites)
- rpk.go (1 site)
- service.go (1 site)

This is intentionally scoped to v25.2.x: v25.3.x and v26.1.x acceptance runs pass on the same agent class within the current 5-minute budget. Centralizing the constant also makes future tuning a one-line change.

No behavior change for tests that already pass within 5 minutes; the budget extension only matters for runs that would otherwise time out between 5 and 10 minutes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning
Two metrics scenarios (Reject_request_without_TLS,
Reject_unauthenticated_token) consistently time out at exactly 10s on
v25.2.x k3d runners. The cause is in `operatorIsRunning`:
    checkStableResource(ctx, t, &dep)                        // waits for RV to stop changing for 5s
    require.Equal(t, dep.Status.AvailableReplicas, int32(1)) // one-shot assert
`checkStableResource` proves the Deployment object isn't churning, NOT
that the pod is Ready. The OnFeature helm install hook returns before
the operator pod is Ready, the Deployment status settles at
AvailableReplicas=0 (nothing else mutates it), the 5s stability gate
passes, and the immediate replica assert fires while replicas are still
0 → fail at ~10s. The same code path is fine on faster branches and
intermittently broken on v25.2.x under image-pull / scheduling
contention.
This is the same flake class the rest of this PR addresses for
cluster-ready Eventually loops, applied to the operator-ready path that
otherwise wasn't covered. The fix polls the Deployment status until all
four replica counters reflect 1/1 Ready, with operator-specific
constants mirroring the clusterReady* convention:
operatorReadyTimeout = 2 * time.Minute
operatorReadyPoll = 2 * time.Second
2 minutes is enough headroom for a fresh image pull + scheduling on
shared k3d without masking genuine operator-stuck failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
257d794 to
7ad6536
Compare
…adline
TestIntegrationChart fails with `context deadline exceeded` on
release/v25.2.x when scheduled late in the integration suite. The
chain is:
- go test -timeout 60m (taskfiles/Taskfile.yml test:integration)
- testutil.Context(t) -> t.Context() picks up t.Deadline() minus 1s
- ApplyAllAndWait -> WaitFor uses ctx.Deadline() as the polling
budget — no internal default if a deadline is already set
- WaitFor iterates objs in series; 3 Redpanda CRs each take 5-7
minutes on v25.2.x to reach Stable under k3d/vcluster contention
By the time the subtest actually runs, only a fraction of the 60m
budget remains, and the three serial waits don't fit. The error
surfaces from pkg/kube/generics.go:40 with no actionable diagnostic.
This commit gives each ApplyAllAndWait / ApplyAndWait call a fresh
10-minute budget via context.WithTimeout(context.WithoutCancel(...))
so the wait isn't squeezed by where in the suite the test was
scheduled. WithoutCancel preserves logging / tracing values from
t.Context() while breaking deadline propagation.
This is the same flake class that the rest of #1541 addresses for the
acceptance step Eventually loops; see clusterReadyTimeout there for the
constant pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Several acceptance scenarios on release/v25.2.x fail at exactly the
5-minute cluster-ready boundary while the cluster eventually does come
up. Runners on this branch routinely take 5–7 minutes to bring a
Redpanda cluster to Ready under load (image pull, PVC binding,
operator reconcile under k3d contention). Without this change a slow
start is indistinguishable from a stuck cluster: tests time out
mid-startup with Condition never satisfied.

This PR centralizes the per-step Eventually budget into two constants
in acceptance/steps/cluster.go, clusterReadyTimeout (10 minutes) and
clusterReadyPoll (5 seconds), and replaces the
5*time.Minute, 5*time.Second pairs across the acceptance/steps/ files
that wait on cluster readiness:

- cluster.go
- k8s.go
- regression.go
- rpk.go
- service.go

Why v25.2.x only
The same diff would benefit other release branches in principle, but
release/v25.3.x and release/v26.1.x acceptance runs pass on the same
k8s-m6id12xlarge agent class within the current 5-minute budget
(see #1538 / #1539: same rolling-restart backport diff, both green).
The flakiness is specific to v25.2.x: slower operator startup, slower
image pull, and shared k3d state that has drifted with
time-in-maintenance.
Centralizing the constant means we can re-tune in one place if we want
to apply this to other branches later.
Failure shape this targets
From recent runs of #1540 against this branch:
All clustered at the 300–350s mark, i.e. the 5-minute Eventually
budget ran out while
Checking cluster resource conditions contains "Ready"? false
was still being repeated. With a 10-minute ceiling we absorb the
slow-but-correct path.
Zero overlap in which test fails per run, and the failures are in
unrelated areas (schema CRDs, shadow links, scale-up, upgrade-matrix)
— this is timing pressure on the shared k3d cluster, not a code
regression. Bumping the ceiling lets the slow-but-correct path complete
instead of being indistinguishable from a stuck cluster.
Scope this does NOT cover
show 7 consecutive tests timing out simultaneously, which looks like
a single k3d cluster died and contaminated the rest of the suite.
Real fix is to isolate each scenario in its own k3d instance, at the
cost of ~25 min per acceptance run. Separate PR if we want it.
reference
--version v25.1.3 etc., which are already version-pinned, but pull
container images that can drift at the registry. Out of
scope; a separate PR would convert chart references to digests.
Test plan
- go build ./acceptance/... (clean)
- go vet ./acceptance/... (clean)
- release/v25.2.x: observe whether the cluster-ready-timeout failures
  disappear; run multiple times to confirm independence of randomized seed.
🤖 Generated with Claude Code