charts/redpanda: surface lifecycle-hook timeouts in pod logs#1548
Conversation
The StatefulSet's lifecycle-hook wrapper (wrapLifecycleHook in charts/redpanda/statefulset.go) previously ended every hook with `; true`: bash -c 'timeout -v 45 /var/lifecycle/preStop.sh 2>&1 | sed ... | tee /proc/1/fd/1; true' The trailing `true` was deliberate — it prevents the kubelet from re-terminating the pod on a non-zero exit from the hook. The side-effect, though, is that *timeouts get silently masked*: when `timeout` sends SIGTERM (exit 124) or SIGKILL (exit 137) to the wrapped script because it ran past its budget, the overall pipeline still exits 0 and there's no signal in pod logs or kubectl events that the hook didn't actually complete. For the preStop case in particular, this hides a real cluster impact: the preStop script enables maintenance mode and waits for leadership drain to finish. With heavy load + a peer broker that isn't ready to accept leadership transfer, the drain can hang past the 45-second budget (terminationGracePeriodSeconds/2, with default 90s grace). Today, the cluster looks like nothing happened — broker shut down ungracefully, controller leader was lost, NOT_CONTROLLER errors appear on the cluster, but the pod's own logs say "lifecycle-hook pre-stop ... preStopHookFinished". This commit captures PIPESTATUS[0] for the timeout exit code and, on 124/137, emits a distinct line containing "TIMEOUT after Ns" with the hook name and exit code before the trailing `true`. Operators scanning pod logs after an ungraceful broker shutdown can immediately see that the maintenance-mode drain (preStop) or the maintenance-mode clear (postStart) didn't complete in its budget. Pod lifecycle behavior is unchanged: `; true` still swallows the exit code, so non-zero on its own doesn't cause a pod-kill or container restart. The TIMEOUT line is purely diagnostic, but it's the missing signal between "hook tried to run" and "broker SIGTERM'd uncleanly". Templates regenerated via `task charts:generate:redpanda`; golden files refreshed via `go test ./charts/redpanda -run TestTemplate -update-golden`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…comment indent
Two follow-on CI failures from the original wrapLifecycleHook change:
- operator/internal/lifecycle/testdata/redpanda-cases.pools.golden.txtar
rendered against the old hook script body. The lifecycle test
suite compares against rendered chart YAML, so any change to the
hook output ripples here. Refreshed via:
go test ./operator/internal/lifecycle/... -run TestV2ResourceClient \
-update-golden -count=1
- charts/redpanda/statefulset.go's wrapLifecycleHook doc comment
was caught by gofumpt's bulleted-list indent rule (it prefers
" - " continuation lines over " - "). golangci-lint --fix
rewrote it, which the ci:lint step's `git diff --exit-code`
then flagged. Accepted gofumpt's formatting.
operator/multicluster/scripts.go has its own copy of wrapLifecycleHook
and its golden file (stretch-cluster-cases.pools.golden.txtar) is
unchanged — confirming the original "no multicluster changes" rule
held end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cing
Adds TestWrapLifecycleHook_TimeoutSurfaced which executes the
actual rendered bash via os/exec and pins three behaviors:
1. Hook body exceeds budget → "TIMEOUT after Ns ... (exit 124)"
marker line appears in stdout (PR #1548's guarantee).
2. Hook body completes within budget → no TIMEOUT line (no
false positive).
3. Hook exits non-zero for a non-timeout reason (e.g. exit 42)
→ no TIMEOUT line, and the trailing `true` still keeps the
wrapper's overall exit 0 (the safety property the original
`; true` was protecting).
The chart's golden file already pins the *shape* of the rendered
string. This test pins the *runtime behavior* so a future refactor
that silently drops the PIPESTATUS branch (or breaks the bash
quoting) gets caught.
Test skips when:
- GOOS != linux: the wrapper hardcodes /proc/1/fd/1 (we
substitute /dev/stdout for the test but only on linux to stay
close to the production execution path).
- timeout(1) not on PATH: GNU coreutils utility, present on
every CI runner inside the nix devshell.
Verified locally: skip path triggers on darwin. CI's linux nix
devshell will exercise the three subtests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-deployment before/after on rancher-desktop k3sVerified end-to-end with two helm installs against the same 3-broker chart configuration ( For a deterministic comparison the actual rendered preStop wrapper was invoked via BEFORE — chart
|
Two issues in the test I just landed surfaced on CI:
1. wrapLifecycleHook joins its cmd slice with " ". So passing
["bash", "-c", "echo X; sleep Y"] produces the bash string
`bash -c echo X; sleep Y` — the semicolons reinterpret as
statement separators in the *outer* bash. Production uses
argv-style: ["bash", "-x", "/var/lifecycle/preStop.sh"]. The
test now follows the same contract: write the hook body to a
tempfile and invoke via `bash -x <path>`.
2. t.TempDir() includes the subtest name in the path. Go's
testing harness can emit names containing parentheses
("non-zero_(not_timeout)") which break unquoted bash parsing
in the rendered wrapper. Replaced with a single process-local
os.MkdirTemp directory shared across subtests with a clean
filename per test.
Both verified locally on linux/arm64 via docker — all 3 subtests
pass:
--- PASS: TestWrapLifecycleHook_TimeoutSurfaced (2.04s)
--- PASS: timeout_fires_emits_TIMEOUT_marker (2.02s)
--- PASS: fast_hook_does_not_emit_TIMEOUT_marker (0.01s)
--- PASS: hook_exits_non-zero_...trailing_true (0.00s)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
…cing
Adds TestWrapLifecycleHook_TimeoutSurfaced which executes the
actual rendered bash via os/exec and pins three behaviors:
1. Hook body exceeds budget → "TIMEOUT after Ns ... (exit 124)"
marker line appears in stdout (PR #1548's guarantee).
2. Hook body completes within budget → no TIMEOUT line (no
false positive).
3. Hook exits non-zero for a non-timeout reason (e.g. exit 42)
→ no TIMEOUT line, and the trailing `true` still keeps the
wrapper's overall exit 0 (the safety property the original
`; true` was protecting).
The chart's golden file already pins the *shape* of the rendered
string. This test pins the *runtime behavior* so a future refactor
that silently drops the PIPESTATUS branch (or breaks the bash
quoting) gets caught.
Test skips when:
- GOOS != linux: the wrapper hardcodes /proc/1/fd/1 (we
substitute /dev/stdout for the test but only on linux to stay
close to the production execution path).
- timeout(1) not on PATH: GNU coreutils utility, present on
every CI runner inside the nix devshell.
Verified locally: skip path triggers on darwin. CI's linux nix
devshell will exercise the three subtests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The StatefulSet's lifecycle-hook wrapper (
wrapLifecycleHookincharts/redpanda/statefulset.go) previously ended every hook with; true:bash -c 'timeout -v 45 /var/lifecycle/preStop.sh 2>&1 | sed ... | tee /proc/1/fd/1; true'The trailing
trueis deliberate — it prevents the kubelet from re-terminatingthe pod on a non-zero hook exit. The side-effect is that timeouts get
silently masked: when
timeoutsends SIGTERM (exit 124) or SIGKILL (exit137) to the wrapped script because it ran past its budget, the overall pipeline
still exits 0 and there's no signal in pod logs or kubectl events that the hook
didn't actually complete.
This was the other half of the rolling-restart customer scenario we've been
chasing in #1547. From that thread:
What this PR does
Captures
PIPESTATUS[0]fortimeout's exit code and, on 124/137, emits adistinct line before the trailing
true:Operators scanning pod logs after an ungraceful broker shutdown can immediately
see that the maintenance-mode drain (preStop) or maintenance-mode clear
(postStart) didn't complete in its budget.
What this PR explicitly does NOT change
; truestill swallows the exit code, sonon-zero on its own doesn't cause a pod-kill or container restart. The
TIMEOUT line is purely diagnostic. (Why not propagate the timeout exit
code? For
postStart, a non-zero would cause the container to fail tostart. Treating postStart and preStop differently here adds complexity
for marginal benefit; the log line is enough for operators to act on.)
terminationGracePeriodSeconds. Left at 90s. A wider default(e.g. 180s) would reduce timeout pressure but doubles teardown time on
every restart; customers who routinely hit the timeout can bump it
themselves via values. Worth a separate change if data shows it's
pervasive.
operator/multicluster/scripts.go. Has its own copy ofwrapLifecycleHook. Per the standing rule on this work — no changes tomulticluster code, since that subsystem is being removed — it's
untouched here.
Test plan
task charts:generate:redpanda—_statefulset.go.tplregeneratedgo test ./charts/redpanda -run TestTemplate -update-golden—testdata/template-cases.golden.txtarrefreshed (+1260 / -360 lines:every
postStartandpreStopcommand across all chart cases nowrenders as the 7-line wrapper instead of the 1-line wrapper)
go build ./charts/redpanda/...— cleango vet ./charts/redpanda/...— cleango test ./charts/redpanda -run TestTemplate(no-update-golden) —passes against refreshed golden
making one broker unable to accept leadership transfers), confirm the
surviving brokers' pod logs contain the
TIMEOUT after Nsline forpre-stop.Stack
correctness (uses
PreRestartProbefrom common-go#170)preStop drain failure mode becomes diagnosable
Both PRs are independent and can land in any order.
🤖 Generated with Claude Code