[release/v25.2.x] acceptance: extend cluster-ready timeout to 10m to reduce flakiness #1541

Merged
david-yu merged 4 commits into release/v25.2.x from dyu/stabilize-acceptance-v25.2.x on May 21, 2026

@david-yu
Contributor

@david-yu david-yu commented May 21, 2026

Summary

Several acceptance scenarios on release/v25.2.x fail at exactly the
5-minute cluster-ready boundary while the cluster eventually does come
up. Runners on this branch routinely take 5–7 minutes to bring a
Redpanda cluster to Ready under load (image pull, PVC binding,
operator reconcile under k3d contention). Without this change, a slow
start is indistinguishable from a stuck cluster: tests time out
mid-startup with "Condition never satisfied".

This PR centralizes the per-step Eventually budget into two constants
in acceptance/steps/cluster.go:

const clusterReadyTimeout = 10 * time.Minute  // was 5*time.Minute literal
const clusterReadyPoll    = 5  * time.Second

and replaces the 5*time.Minute, 5*time.Second pairs across the
acceptance/steps/ files that wait on cluster readiness:

File            Sites
cluster.go      6
k8s.go          3
regression.go   2
rpk.go          1
service.go      1

Why v25.2.x only

The same diff would benefit other release branches in principle, but
release/v25.3.x and release/v26.1.x acceptance runs pass on the
same k8s-m6id12xlarge agent class within the current 5-minute
budget (see #1538 / #1539: same rolling-restart backport diff, both
green). The flakiness is specific to v25.2.x: slower operator
startup, slower image pull, and shared k3d state that has drifted with
time in maintenance.

Centralizing the constant means we can re-tune in one place if we want
to apply this to other branches later.

Failure shape this targets

From recent runs of #1540 against this branch:

TestAcceptanceSuite/Skipping_incremental_scale_downs         (307.20s)  FAIL
TestAcceptanceSuite/Manage_ShadowLink                        (356.35s)  FAIL
TestAcceptanceSuite/Operator_upgrade_from_25.1.3             (342.87s)  FAIL
TestAcceptanceSuite/Migrate_from_a_Helm_chart_release_to...  (320.74s)  FAIL

All four fall in the 300–360s range, i.e. the 5-minute Eventually
budget ran out while the message 'Checking cluster resource conditions
contains "Ready"? false' was still being repeated. With a 10-minute
ceiling we absorb the slow-but-correct path.

No test fails in more than one run, and the failures are in
unrelated areas (schema CRDs, shadow links, scale-up, upgrade-matrix),
so this points to timing pressure on the shared k3d cluster rather
than a code regression.

Scope this does NOT cover

  • Per-scenario k3d cluster isolation. Some runs (e.g. #1540 run 3)
    show 7 consecutive tests timing out together, which looks like
    a single k3d cluster died and contaminated the rest of the suite.
    The real fix is to isolate each scenario in its own k3d instance, at
    the cost of ~25 min per acceptance run. Separate PR if we want it.
  • Pinning of published chart artifacts. The upgrade-matrix tests
    reference charts by pinned version (--version v25.1.3 etc.), but
    those charts pull container images that can drift at the registry.
    Out of scope; a separate PR would convert chart references to digests.

Test plan

  • go build ./acceptance/... — clean
  • go vet ./acceptance/... — clean
  • Acceptance suite on release/v25.2.x: observe whether the
    cluster-ready-timeout failures disappear; run multiple times to
    confirm the result does not depend on the randomized seed.

🤖 Generated with Claude Code

@david-yu david-yu marked this pull request as ready for review May 21, 2026 05:08
@RafalKorepta RafalKorepta enabled auto-merge (rebase) May 21, 2026 09:40
@RafalKorepta RafalKorepta force-pushed the dyu/stabilize-acceptance-v25.2.x branch from 5dab71b to 2496ace Compare May 21, 2026 09:47
@david-yu david-yu disabled auto-merge May 21, 2026 14:53
@david-yu david-yu enabled auto-merge (squash) May 21, 2026 14:53
@RafalKorepta
Contributor

It may be better to merge #1542 into your PR: your PR fixes the flaky tests, but the Buildkite configuration is broken and prevents it from merging.

david-yu and others added 3 commits May 21, 2026 13:10
Several acceptance scenarios on the release/v25.2.x branch fail at exactly
the 5-minute cluster-ready boundary while the cluster eventually does come
up: the runners on this branch routinely take 5-7 minutes to bring a
Redpanda cluster to Ready under load (image pull, PVC binding,
operator reconcile under k3d contention). Without this change, a slow
start is indistinguishable from a stuck cluster: tests time out
mid-startup with "Condition never satisfied".

This PR centralizes the per-step Eventually budget into two constants
in acceptance/steps/cluster.go:

  clusterReadyTimeout = 10 * time.Minute   (was 5*time.Minute literal)
  clusterReadyPoll    = 5 * time.Second

and replaces the 5*time.Minute, 5*time.Second literal pairs across the
acceptance/steps/ files that wait on cluster readiness:

  cluster.go      (6 sites)
  k8s.go          (3 sites)
  regression.go   (2 sites)
  rpk.go          (1 site)
  service.go      (1 site)

This is intentionally scoped to v25.2.x: v25.3.x and v26.1.x acceptance
runs pass on the same agent class within the current 5-minute budget.
Centralizing the constant also makes future tuning a one-line change.

No behavior change for tests that already pass within 5 minutes; the
budget extension only matters for runs that would otherwise time out
between 5 and 10 minutes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning

Two metrics scenarios (Reject_request_without_TLS,
Reject_unauthenticated_token) consistently time out at exactly 10s on
v25.2.x k3d runners. The cause is in `operatorIsRunning`:

  checkStableResource(ctx, t, &dep)               // waits for RV to stop
                                                  // changing for 5s
  require.Equal(t, dep.Status.AvailableReplicas,
                int32(1))                         // one-shot assert

`checkStableResource` proves the Deployment object isn't churning, NOT
that the pod is Ready. The OnFeature helm install hook returns before
the operator pod is Ready, the Deployment status settles at
AvailableReplicas=0 (nothing else mutates it), the 5s stability gate
passes, and the immediate replica assert fires while replicas are still
0 → fail at ~10s. The same code path is fine on faster branches and
intermittently broken on v25.2.x under image-pull / scheduling
contention.

This is the same flake class that the rest of this PR addresses for
cluster-ready Eventually loops, applied to the operator-ready path that
was not otherwise covered. The fix polls the Deployment status until all
four replica counters reflect 1/1 Ready, with operator-specific constants
mirroring the clusterReady* convention:

  operatorReadyTimeout = 2 * time.Minute
  operatorReadyPoll    = 2 * time.Second

2 minutes is enough headroom for a fresh image pull + scheduling on
shared k3d without masking genuine operator-stuck failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu force-pushed the dyu/stabilize-acceptance-v25.2.x branch from 257d794 to 7ad6536 Compare May 21, 2026 20:11
…adline

TestIntegrationChart fails with `context deadline exceeded` on
release/v25.2.x when scheduled late in the integration suite. The
chain is:

  - go test -timeout 60m (taskfiles/Taskfile.yml test:integration)
  - testutil.Context(t) -> t.Context() picks up t.Deadline() minus 1s
  - ApplyAllAndWait -> WaitFor uses ctx.Deadline() as the polling
    budget — no internal default if a deadline is already set
  - WaitFor iterates objs in series; 3 Redpanda CRs each take 5-7
    minutes on v25.2.x to reach Stable under k3d/vcluster contention

By the time the subtest actually runs, only a fraction of the 60m
budget remains, and the three serial waits don't fit. The error
surfaces from pkg/kube/generics.go:40 with no actionable diagnostic.

This commit gives each ApplyAllAndWait / ApplyAndWait call a fresh
10-minute budget via context.WithTimeout(context.WithoutCancel(...))
so the wait isn't squeezed by where in the suite the test was
scheduled. WithoutCancel preserves logging / tracing values from
t.Context() while breaking deadline propagation.

This is the same flake class that the rest of #1541 addresses for
acceptance-step Eventually loops; see clusterReadyTimeout there for the
constant pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu merged commit a3f4e9e into release/v25.2.x May 21, 2026
10 checks passed
@RafalKorepta RafalKorepta deleted the dyu/stabilize-acceptance-v25.2.x branch May 22, 2026 13:37