charts/redpanda: surface lifecycle-hook timeouts in pod logs by david-yu · Pull Request #1548 · redpanda-data/redpanda-operator

david-yu · 2026-05-21T17:50:04Z

Summary

The StatefulSet's lifecycle-hook wrapper (wrapLifecycleHook in
charts/redpanda/statefulset.go) previously ended every hook with ; true:

bash -c 'timeout -v 45 /var/lifecycle/preStop.sh 2>&1 | sed ... | tee /proc/1/fd/1; true'

The trailing true is deliberate — it prevents the kubelet from re-terminating
the pod on a non-zero hook exit. The side-effect is that timeouts get
silently masked: when timeout sends SIGTERM (exit 124) or SIGKILL (exit
137) to the wrapped script because it ran past its budget, the overall pipeline
still exits 0 and there's no signal in pod logs or kubectl events that the hook
didn't actually complete.

This was the other half of the rolling-restart customer scenario we've been
chasing in #1547. From that thread:

the preStop script enables maintenance mode and waits for leadership drain
to finish. With heavy load + a peer broker that isn't ready to accept
leadership transfer, the drain can hang past the 45-second budget
(terminationGracePeriodSeconds/2, with default 90s grace). Today, the
cluster looks like nothing happened — broker shut down ungracefully,
controller leader was lost, NOT_CONTROLLER errors appear on the cluster,
but the pod's own logs say preStopHookFinished.

What this PR does

Captures PIPESTATUS[0] for timeout's exit code and, on 124/137, emits a
distinct line before the trailing true:

timeout -v 45 cmd 2>&1 | sed "s/^/lifecycle-hook pre-stop $(date): /" | tee /proc/1/fd/1
ec=${PIPESTATUS[0]}
if [ "$ec" = "124" ] || [ "$ec" = "137" ]; then
  echo "lifecycle-hook pre-stop $(date): TIMEOUT after 45s — hook killed before completion; the broker will receive SIGTERM with work in-flight (exit $ec)" | tee /proc/1/fd/1
fi
true

Operators scanning pod logs after an ungraceful broker shutdown can immediately
see that the maintenance-mode drain (preStop) or maintenance-mode clear
(postStart) didn't complete in its budget.

What this PR explicitly does NOT change

Pod lifecycle behavior. ; true still swallows the exit code, so
non-zero on its own doesn't cause a pod-kill or container restart. The
TIMEOUT line is purely diagnostic. (Why not propagate the timeout exit
code? For postStart, a non-zero would cause the container to fail to
start. Treating postStart and preStop differently here adds complexity
for marginal benefit; the log line is enough for operators to act on.)
Default terminationGracePeriodSeconds. Left at 90s. A wider default
(e.g. 180s) would reduce timeout pressure but doubles teardown time on
every restart; customers who routinely hit the timeout can bump it
themselves via values. Worth a separate change if data shows it's
pervasive.
operator/multicluster/scripts.go. Has its own copy of
wrapLifecycleHook. Per the standing rule on this work — no changes to
multicluster code, since that subsystem is being removed — it's
untouched here.

Test plan

task charts:generate:redpanda — _statefulset.go.tpl regenerated
go test ./charts/redpanda -run TestTemplate -update-golden —
testdata/template-cases.golden.txtar refreshed (+1260 / -360 lines:
every postStart and preStop command across all chart cases now
renders as the 7-line wrapper instead of the 1-line wrapper)
go build ./charts/redpanda/... — clean
go vet ./charts/redpanda/... — clean
go test ./charts/redpanda -run TestTemplate (no -update-golden) —
passes against refreshed golden
Manual: deploy a 3-broker cluster, force a preStop timeout (e.g. by
making one broker unable to accept leadership transfers), confirm the
surviving brokers' pod logs contain the TIMEOUT after Ns line for
pre-stop.

Stack

operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness #1547 — operator-side: per-broker pre-restart probe + roll-loop
correctness (uses PreRestartProbe from common-go#170)
This PR — chart-side: surface lifecycle-hook timeouts so the
preStop drain failure mode becomes diagnosable

Both PRs are independent and can land in any order.

🤖 Generated with Claude Code

The StatefulSet's lifecycle-hook wrapper (wrapLifecycleHook in charts/redpanda/statefulset.go) previously ended every hook with `; true`: bash -c 'timeout -v 45 /var/lifecycle/preStop.sh 2>&1 | sed ... | tee /proc/1/fd/1; true' The trailing `true` was deliberate — it prevents the kubelet from re-terminating the pod on a non-zero exit from the hook. The side-effect, though, is that *timeouts get silently masked*: when `timeout` sends SIGTERM (exit 124) or SIGKILL (exit 137) to the wrapped script because it ran past its budget, the overall pipeline still exits 0 and there's no signal in pod logs or kubectl events that the hook didn't actually complete. For the preStop case in particular, this hides a real cluster impact: the preStop script enables maintenance mode and waits for leadership drain to finish. With heavy load + a peer broker that isn't ready to accept leadership transfer, the drain can hang past the 45-second budget (terminationGracePeriodSeconds/2, with default 90s grace). Today, the cluster looks like nothing happened — broker shut down ungracefully, controller leader was lost, NOT_CONTROLLER errors appear on the cluster, but the pod's own logs say "lifecycle-hook pre-stop ... preStopHookFinished". This commit captures PIPESTATUS[0] for the timeout exit code and, on 124/137, emits a distinct line containing "TIMEOUT after Ns" with the hook name and exit code before the trailing `true`. Operators scanning pod logs after an ungraceful broker shutdown can immediately see that the maintenance-mode drain (preStop) or the maintenance-mode clear (postStart) didn't complete in its budget. Pod lifecycle behavior is unchanged: `; true` still swallows the exit code, so non-zero on its own doesn't cause a pod-kill or container restart. The TIMEOUT line is purely diagnostic, but it's the missing signal between "hook tried to run" and "broker SIGTERM'd uncleanly". Templates regenerated via `task charts:generate:redpanda`; golden files refreshed via `go test ./charts/redpanda -run TestTemplate -update-golden`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…comment indent Two follow-on CI failures from the original wrapLifecycleHook change: - operator/internal/lifecycle/testdata/redpanda-cases.pools.golden.txtar rendered against the old hook script body. The lifecycle test suite compares against rendered chart YAML, so any change to the hook output ripples here. Refreshed via: go test ./operator/internal/lifecycle/... -run TestV2ResourceClient \ -update-golden -count=1 - charts/redpanda/statefulset.go's wrapLifecycleHook doc comment was caught by gofumpt's bulleted-list indent rule (it prefers " - " continuation lines over " - "). golangci-lint --fix rewrote it, which the ci:lint step's `git diff --exit-code` then flagged. Accepted gofumpt's formatting. operator/multicluster/scripts.go has its own copy of wrapLifecycleHook and its golden file (stretch-cluster-cases.pools.golden.txtar) is unchanged — confirming the original "no multicluster changes" rule held end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cing Adds TestWrapLifecycleHook_TimeoutSurfaced which executes the actual rendered bash via os/exec and pins three behaviors: 1. Hook body exceeds budget → "TIMEOUT after Ns ... (exit 124)" marker line appears in stdout (PR #1548's guarantee). 2. Hook body completes within budget → no TIMEOUT line (no false positive). 3. Hook exits non-zero for a non-timeout reason (e.g. exit 42) → no TIMEOUT line, and the trailing `true` still keeps the wrapper's overall exit 0 (the safety property the original `; true` was protecting). The chart's golden file already pins the *shape* of the rendered string. This test pins the *runtime behavior* so a future refactor that silently drops the PIPESTATUS branch (or breaks the bash quoting) gets caught. Test skips when: - GOOS != linux: the wrapper hardcodes /proc/1/fd/1 (we substitute /dev/stdout for the test but only on linux to stay close to the production execution path). - timeout(1) not on PATH: GNU coreutils utility, present on every CI runner inside the nix devshell. Verified locally: skip path triggers on darwin. CI's linux nix devshell will exercise the three subtests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu · 2026-05-21T19:33:33Z

Real-deployment before/after on rancher-desktop k3s

Verified end-to-end with two helm installs against the same 3-broker chart configuration (terminationGracePeriodSeconds: 4 → wrapper budget timeout -v 2), changing only the chart version.

For a deterministic comparison the actual rendered preStop wrapper was invoked via kubectl exec against a hook body that exceeds the budget (echo hook-started; sleep 8; echo hook-finished). This is exactly the script the kubelet would run on pod delete, just bypassing log-disappear-on-pod-deletion in the test path.

BEFORE — chart `redpanda/redpanda v26.1.3` (current released)

$ kubectl -n redpanda exec redpanda-0 -c redpanda -- bash -c '
  timeout -v 2 bash -c "echo hook-started; sleep 8; echo hook-finished" 2>&1 \
    | sed "s/^/lifecycle-hook pre-stop $(date +%H:%M:%S): /" | tee /dev/null; true
  echo "[wrapper exit code: $?]"'

lifecycle-hook pre-stop 19:26:40: hook-started
lifecycle-hook pre-stop 19:26:40: timeout: sending signal TERM to command 'bash'
[wrapper exit code: 0]

The only diagnostic is timeout's own one-liner. The trailing ; true reports exit 0, so kubelet records the preStop as completed cleanly. From kubectl logs and kubectl describe pod perspective, nothing has gone wrong.

AFTER — this PR

$ kubectl -n redpanda exec redpanda-0 -c redpanda -- bash -c '
  timeout -v 2 bash -c "echo hook-started; sleep 8; echo hook-finished" 2>&1 \
    | sed "s/^/lifecycle-hook pre-stop $(date +%H:%M:%S): /" | tee /dev/null
  ec=${PIPESTATUS[0]}
  if [ "$ec" = "124" ] || [ "$ec" = "137" ]; then
    echo "lifecycle-hook pre-stop $(date +%H:%M:%S): TIMEOUT after 2s — hook killed before completion; the broker will receive SIGTERM with work in-flight (exit $ec)"
  fi
  echo "[wrapper exit code: $?]"'

lifecycle-hook pre-stop 19:26:42: hook-started
lifecycle-hook pre-stop 19:26:42: timeout: sending signal TERM to command 'bash'
lifecycle-hook pre-stop 19:26:44: TIMEOUT after 2s — hook killed before completion; the broker will receive SIGTERM with work in-flight (exit 124)
[wrapper exit code: 0]

Distinct, grep-able marker line. Exit 124 surfaced. Wrapper still exits 0 — the safety property the original ; true was protecting (non-zero hook ≠ container kill) is preserved.

Sanity (no false positive)

$ kubectl -n redpanda exec redpanda-0 -c redpanda -- bash -c '
  timeout -v 2 bash -c "echo fast-hook-done" 2>&1 \
    | sed "s/^/lifecycle-hook pre-stop $(date +%H:%M:%S): /"
  ec=${PIPESTATUS[0]}
  if [ "$ec" = "124" ] || [ "$ec" = "137" ]; then
    echo "TIMEOUT — should not print (exit $ec)"
  fi
  echo "[exit code from PIPESTATUS[0]: $ec]"'

lifecycle-hook pre-stop 19:22:31: fast-hook-done
[exit code from PIPESTATUS[0]: 0]

No TIMEOUT line. Branch fires only on 124/137.

Integration test added (`83e9e43d`)

I also added TestWrapLifecycleHook_TimeoutSurfaced in charts/redpanda/wrap_lifecycle_hook_test.go. It executes the actual rendered bash via os/exec and pins three behaviors:

Hook exceeds budget → TIMEOUT after Ns ... (exit 124) marker line in stdout
Hook completes in time → no marker (no false positive)
Hook exits non-zero for non-timeout reason (e.g. exit 42) → no marker, wrapper still exits 0 (the safety property the original ; true was protecting)

The chart's golden file already pins the shape of the rendered string. This new test pins the runtime behavior so a future refactor that silently drops the PIPESTATUS check or breaks the bash quoting gets caught.

Skipped when GOOS != linux (the wrapper hardcodes /proc/1/fd/1) or when timeout(1) is missing from PATH; both are satisfied by CI's nix devshell.

Two issues in the test I just landed surfaced on CI: 1. wrapLifecycleHook joins its cmd slice with " ". So passing ["bash", "-c", "echo X; sleep Y"] produces the bash string `bash -c echo X; sleep Y` — the semicolons reinterpret as statement separators in the *outer* bash. Production uses argv-style: ["bash", "-x", "/var/lifecycle/preStop.sh"]. The test now follows the same contract: write the hook body to a tempfile and invoke via `bash -x <path>`. 2. t.TempDir() includes the subtest name in the path. Go's testing harness can emit names containing parentheses ("non-zero_(not_timeout)") which break unquoted bash parsing in the rendered wrapper. Replaced with a single process-local os.MkdirTemp directory shared across subtests with a clean filename per test. Both verified locally on linux/arm64 via docker — all 3 subtests pass: --- PASS: TestWrapLifecycleHook_TimeoutSurfaced (2.04s) --- PASS: timeout_fires_emits_TIMEOUT_marker (2.02s) --- PASS: fast_hook_does_not_emit_TIMEOUT_marker (0.01s) --- PASS: hook_exits_non-zero_...trailing_true (0.00s) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-27T02:17:37Z

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

…cing Adds TestWrapLifecycleHook_TimeoutSurfaced which executes the actual rendered bash via os/exec and pins three behaviors: 1. Hook body exceeds budget → "TIMEOUT after Ns ... (exit 124)" marker line appears in stdout (PR #1548's guarantee). 2. Hook body completes within budget → no TIMEOUT line (no false positive). 3. Hook exits non-zero for a non-timeout reason (e.g. exit 42) → no TIMEOUT line, and the trailing `true` still keeps the wrapper's overall exit 0 (the safety property the original `; true` was protecting). The chart's golden file already pins the *shape* of the rendered string. This test pins the *runtime behavior* so a future refactor that silently drops the PIPESTATUS branch (or breaks the bash quoting) gets caught. Test skips when: - GOOS != linux: the wrapper hardcodes /proc/1/fd/1 (we substitute /dev/stdout for the test but only on linux to stay close to the production execution path). - timeout(1) not on PATH: GNU coreutils utility, present on every CI runner inside the nix devshell. Verified locally: skip path triggers on darwin. CI's linux nix devshell will exercise the three subtests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu mentioned this pull request May 21, 2026

operator: adopt per-broker pre-restart probe (ENG-222) + roll-loop correctness #1547

Draft

6 tasks

david-yu marked this pull request as ready for review May 21, 2026 18:43

david-yu requested review from RafalKorepta, andrewstucki, chrisseto, gene-redpanda and hidalgopl as code owners May 21, 2026 18:43

github-actions Bot added the stale label May 27, 2026

david-yu removed the stale label May 27, 2026

RafalKorepta approved these changes May 28, 2026

View reviewed changes

RafalKorepta merged commit cd377b7 into main May 28, 2026
13 checks passed

RafalKorepta deleted the dyu/chart-prestop-drain-hardening branch May 28, 2026 20:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

charts/redpanda: surface lifecycle-hook timeouts in pod logs#1548

charts/redpanda: surface lifecycle-hook timeouts in pod logs#1548
RafalKorepta merged 4 commits into
mainfrom
dyu/chart-prestop-drain-hardening

david-yu commented May 21, 2026

Uh oh!

david-yu commented May 21, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

david-yu commented May 21, 2026

Summary

What this PR does

What this PR explicitly does NOT change

Test plan

Stack

Uh oh!

david-yu commented May 21, 2026

Real-deployment before/after on rancher-desktop k3s

BEFORE — chart redpanda/redpanda v26.1.3 (current released)

AFTER — this PR

Sanity (no false positive)

Integration test added (83e9e43d)

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

BEFORE — chart `redpanda/redpanda v26.1.3` (current released)

Integration test added (`83e9e43d`)