[release/v25.2.x] acceptance: extend cluster-ready timeout to 10m to reduce flakiness by david-yu · Pull Request #1541 · redpanda-data/redpanda-operator

david-yu · 2026-05-21T04:48:26Z

Summary

Several acceptance scenarios on release/v25.2.x fail at exactly the
5-minute cluster-ready boundary while the cluster eventually does come
up. Runners on this branch routinely take 5–7 minutes to bring a
Redpanda cluster to Ready under load (image pull, PVC binding,
operator reconcile under k3d contention). Without this change a slow
start is indistinguishable from a stuck cluster: tests time out
mid-startup with "Condition never satisfied".

This PR centralizes the per-step Eventually budget into two constants
in acceptance/steps/cluster.go:

const clusterReadyTimeout = 10 * time.Minute  // was 5*time.Minute literal
const clusterReadyPoll    = 5  * time.Second

and replaces the 5*time.Minute, 5*time.Second pairs across the
acceptance/steps/ files that wait on cluster readiness:

File	Sites
`cluster.go`	6
`k8s.go`	3
`regression.go`	2
`rpk.go`	1
`service.go`	1

Why v25.2.x only

The same diff would benefit other release branches in principle, but
release/v25.3.x and release/v26.1.x acceptance runs pass on the
same k8s-m6id12xlarge agent class within the current 5-minute
budget (see #1538 / #1539 — same rolling-restart backport diff, both
green). The flakiness is specific to v25.2.x: slower operator
startup, slower image pull, and shared k3d state that's drifted with
time-in-maintenance.

Centralizing the constant means we can re-tune in one place if we want
to apply this to other branches later.

Failure shape this targets

From recent runs of #1540 against this branch:

TestAcceptanceSuite/Skipping_incremental_scale_downs         (307.20s)  FAIL
TestAcceptanceSuite/Manage_ShadowLink                        (356.35s)  FAIL
TestAcceptanceSuite/Operator_upgrade_from_25.1.3             (342.87s)  FAIL
TestAcceptanceSuite/Migrate_from_a_Helm_chart_release_to...  (320.74s)  FAIL

All clustered in the 300–360s range, i.e. the 5-minute Eventually
budget ran out while the log line
Checking cluster resource conditions contains "Ready"? false
was still being repeated. With a 10-minute ceiling we
absorb the slow-but-correct path.

Zero overlap in which test fails per run, and the failures are in
unrelated areas (schema CRDs, shadow links, scale-up, upgrade-matrix):
this is timing pressure on the shared k3d cluster, not a code
regression.

Scope this does NOT cover

- Per-scenario k3d cluster isolation. Some runs (e.g. #1540 run 3)
  show 7 consecutive tests timing out simultaneously, which looks like
  a single k3d cluster died and contaminated the rest of the suite.
  The real fix is to isolate each scenario in its own k3d instance, at
  the cost of ~25 min per acceptance run. Separate PR if we want it.
- Pinning of published chart artifacts. The upgrade-matrix tests
  reference --version v25.1.3 etc., which are already version-pinned,
  but pull container images that can drift at the registry. Out of
  scope; a separate PR would convert chart references to digests.

Test plan

- go build ./acceptance/... : clean
- go vet ./acceptance/... : clean
- Acceptance suite on release/v25.2.x: observe whether the
  cluster-ready-timeout failures disappear; run multiple times to
  confirm independence of randomized seed.

🤖 Generated with Claude Code

RafalKorepta · 2026-05-21T19:53:04Z

Maybe it would be better to merge #1542 into your PR: you fix the flaky tests, but the Buildkite configuration is broken and prevents merging.

Several acceptance scenarios on release/v25.2.x branch fail at exactly the 5-minute cluster-ready boundary while the cluster eventually does come up: the runners on this branch routinely take 5-7 minutes to bring a Redpanda cluster to Ready under load (image pull, PVC binding, operator reconcile under k3d contention). Without this change, a slow start is indistinguishable from a stuck cluster: tests time out mid-startup with "Condition never satisfied".

This PR centralizes the per-step Eventually budget into two constants in acceptance/steps/cluster.go:

    clusterReadyTimeout = 10 * time.Minute  (was 5*time.Minute literal)
    clusterReadyPoll    = 5 * time.Second

and replaces the 5*time.Minute, 5*time.Second literal pairs across the acceptance/steps/ files that wait on cluster readiness:

    cluster.go     (6 sites)
    k8s.go         (3 sites)
    regression.go  (2 sites)
    rpk.go         (1 site)
    service.go     (1 site)

This is intentionally scoped to v25.2.x: v25.3.x and v26.1.x acceptance runs pass on the same agent class within the current 5-minute budget. Centralizing the constant also makes future tuning a one-line change.

No behavior change for tests that already pass within 5 minutes; the budget extension only matters for runs that would otherwise time out between 5 and 10 minutes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ning

Two metrics scenarios (Reject_request_without_TLS, Reject_unauthenticated_token) consistently time out at exactly 10s on v25.2.x k3d runners. The cause is in `operatorIsRunning`:

    checkStableResource(ctx, t, &dep)  // waits for RV to stop
                                       // changing for 5s
    require.Equal(t, dep.Status.AvailableReplicas, int32(1))  // one-shot assert

`checkStableResource` proves the Deployment object isn't churning, NOT that the pod is Ready. The OnFeature helm install hook returns before the operator pod is Ready, the Deployment status settles at AvailableReplicas=0 (nothing else mutates it), the 5s stability gate passes, and the immediate replica assert fires while replicas are still 0 → fail at ~10s.

The same code path is fine on faster branches and intermittently broken on v25.2.x under image-pull / scheduling contention. Same flake class the rest of this PR addresses for cluster-ready Eventually loops, just applied to the operator-ready path that otherwise wasn't covered.

Polls the Deployment status until all four replica counters reflect 1/1 Ready, with operator-specific constants mirroring the clusterReady* convention:

    operatorReadyTimeout = 2 * time.Minute
    operatorReadyPoll    = 2 * time.Second

2 minutes is enough headroom for a fresh image pull + scheduling on shared k3d without masking genuine operator-stuck failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…adline

TestIntegrationChart fails with `context deadline exceeded` on release/v25.2.x when scheduled late in the integration suite. The chain is:

 - go test -timeout 60m (taskfiles/Taskfile.yml test:integration)
 - testutil.Context(t) -> t.Context() picks up t.Deadline() minus 1s
 - ApplyAllAndWait -> WaitFor uses ctx.Deadline() as the polling budget; no internal default if a deadline is already set
 - WaitFor iterates objs in series; 3 Redpanda CRs each take 5-7 minutes on v25.2.x to reach Stable under k3d/vcluster contention

By the time the subtest actually runs, only a fraction of the 60m budget remains, and the three serial waits don't fit. The error surfaces from pkg/kube/generics.go:40 with no actionable diagnostic.

This commit gives each ApplyAllAndWait / ApplyAndWait call a fresh 10-minute budget via context.WithTimeout(context.WithoutCancel(...)) so the wait isn't squeezed by where in the suite the test was scheduled. WithoutCancel preserves logging / tracing values from t.Context() while breaking deadline propagation.

Same flake-class as the rest of #1541 addresses for acceptance step Eventually loops; see clusterReadyTimeout there for the constant pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu marked this pull request as ready for review May 21, 2026 05:08

david-yu requested review from RafalKorepta, andrewstucki, chrisseto, gene-redpanda and hidalgopl as code owners May 21, 2026 05:08

RafalKorepta approved these changes May 21, 2026

RafalKorepta enabled auto-merge (rebase) May 21, 2026 09:40

RafalKorepta force-pushed the dyu/stabilize-acceptance-v25.2.x branch from 5dab71b to 2496ace Compare May 21, 2026 09:47

david-yu disabled auto-merge May 21, 2026 14:53

david-yu enabled auto-merge (squash) May 21, 2026 14:53

david-yu and others added 3 commits May 21, 2026 13:10

chore: re-trigger CI

7ad6536

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu force-pushed the dyu/stabilize-acceptance-v25.2.x branch from 257d794 to 7ad6536 Compare May 21, 2026 20:11

david-yu merged commit a3f4e9e into release/v25.2.x May 21, 2026
10 checks passed

RafalKorepta deleted the dyu/stabilize-acceptance-v25.2.x branch May 22, 2026 13:37

RafalKorepta mentioned this pull request May 22, 2026

[backport release/v25.3.x] operator: defer rolling restart while a recently replaced pod is still coming up #1539

Merged

2 tasks

david-yu merged 4 commits into release/v25.2.x from dyu/stabilize-acceptance-v25.2.x
