Skip to content

charts/redpanda: surface lifecycle-hook timeouts in pod logs#1548

Merged
RafalKorepta merged 4 commits into
mainfrom
dyu/chart-prestop-drain-hardening
May 28, 2026
Merged

charts/redpanda: surface lifecycle-hook timeouts in pod logs#1548
RafalKorepta merged 4 commits into
mainfrom
dyu/chart-prestop-drain-hardening

Conversation

@david-yu
Copy link
Copy Markdown
Contributor

Summary

The StatefulSet's lifecycle-hook wrapper (wrapLifecycleHook in
charts/redpanda/statefulset.go) previously ended every hook with ; true:

bash -c 'timeout -v 45 /var/lifecycle/preStop.sh 2>&1 | sed ... | tee /proc/1/fd/1; true'

The trailing true is deliberate — it prevents the kubelet from re-terminating
the pod on a non-zero hook exit. The side-effect is that timeouts get
silently masked
: when timeout sends SIGTERM (exit 124) or SIGKILL (exit
137) to the wrapped script because it ran past its budget, the overall pipeline
still exits 0 and there's no signal in pod logs or kubectl events that the hook
didn't actually complete.

This was the other half of the rolling-restart customer scenario we've been
chasing in #1547. From that thread:

the preStop script enables maintenance mode and waits for leadership drain
to finish. With heavy load + a peer broker that isn't ready to accept
leadership transfer, the drain can hang past the 45-second budget
(terminationGracePeriodSeconds/2, with default 90s grace). Today, the
cluster looks like nothing happened — broker shut down ungracefully,
controller leader was lost, NOT_CONTROLLER errors appear on the cluster,
but the pod's own logs say preStopHookFinished.

What this PR does

Captures PIPESTATUS[0] for timeout's exit code and, on 124/137, emits a
distinct line before the trailing true:

timeout -v 45 cmd 2>&1 | sed "s/^/lifecycle-hook pre-stop $(date): /" | tee /proc/1/fd/1
ec=${PIPESTATUS[0]}
if [ "$ec" = "124" ] || [ "$ec" = "137" ]; then
  echo "lifecycle-hook pre-stop $(date): TIMEOUT after 45s — hook killed before completion; the broker will receive SIGTERM with work in-flight (exit $ec)" | tee /proc/1/fd/1
fi
true

Operators scanning pod logs after an ungraceful broker shutdown can immediately
see that the maintenance-mode drain (preStop) or maintenance-mode clear
(postStart) didn't complete in its budget.

What this PR explicitly does NOT change

  • Pod lifecycle behavior. ; true still swallows the exit code, so
    non-zero on its own doesn't cause a pod-kill or container restart. The
    TIMEOUT line is purely diagnostic. (Why not propagate the timeout exit
    code? For postStart, a non-zero would cause the container to fail to
    start. Treating postStart and preStop differently here adds complexity
    for marginal benefit; the log line is enough for operators to act on.)
  • Default terminationGracePeriodSeconds. Left at 90s. A wider default
    (e.g. 180s) would reduce timeout pressure but doubles teardown time on
    every restart; customers who routinely hit the timeout can bump it
    themselves via values. Worth a separate change if data shows it's
    pervasive.
  • operator/multicluster/scripts.go. Has its own copy of
    wrapLifecycleHook. Per the standing rule on this work — no changes to
    multicluster code, since that subsystem is being removed — it's
    untouched here.

Test plan

  • task charts:generate:redpanda_statefulset.go.tpl regenerated
  • go test ./charts/redpanda -run TestTemplate -update-golden
    testdata/template-cases.golden.txtar refreshed (+1260 / -360 lines:
    every postStart and preStop command across all chart cases now
    renders as the 7-line wrapper instead of the 1-line wrapper)
  • go build ./charts/redpanda/... — clean
  • go vet ./charts/redpanda/... — clean
  • go test ./charts/redpanda -run TestTemplate (no -update-golden) —
    passes against refreshed golden
  • Manual: deploy a 3-broker cluster, force a preStop timeout (e.g. by
    making one broker unable to accept leadership transfers), confirm the
    surviving brokers' pod logs contain the TIMEOUT after Ns line for
    pre-stop.

Stack

Both PRs are independent and can land in any order.

🤖 Generated with Claude Code

The StatefulSet's lifecycle-hook wrapper (wrapLifecycleHook in
charts/redpanda/statefulset.go) previously ended every hook with `; true`:

  bash -c 'timeout -v 45 /var/lifecycle/preStop.sh 2>&1 | sed ... | tee /proc/1/fd/1; true'

The trailing `true` was deliberate — it prevents the kubelet from
re-terminating the pod on a non-zero exit from the hook. The
side-effect, though, is that *timeouts get silently masked*: when
`timeout` sends SIGTERM (exit 124) or SIGKILL (exit 137) to the wrapped
script because it ran past its budget, the overall pipeline still exits
0 and there's no signal in pod logs or kubectl events that the hook
didn't actually complete.

For the preStop case in particular, this hides a real cluster impact:
the preStop script enables maintenance mode and waits for leadership
drain to finish. With heavy load + a peer broker that isn't ready to
accept leadership transfer, the drain can hang past the 45-second
budget (terminationGracePeriodSeconds/2, with default 90s grace).
Today, the cluster looks like nothing happened — broker shut down
ungracefully, controller leader was lost, NOT_CONTROLLER errors appear
on the cluster, but the pod's own logs say "lifecycle-hook pre-stop ...
preStopHookFinished".

This commit captures PIPESTATUS[0] for the timeout exit code and, on
124/137, emits a distinct line containing "TIMEOUT after Ns" with the
hook name and exit code before the trailing `true`. Operators scanning
pod logs after an ungraceful broker shutdown can immediately see that
the maintenance-mode drain (preStop) or the maintenance-mode clear
(postStart) didn't complete in its budget.

Pod lifecycle behavior is unchanged: `; true` still swallows the exit
code, so non-zero on its own doesn't cause a pod-kill or container
restart. The TIMEOUT line is purely diagnostic, but it's the missing
signal between "hook tried to run" and "broker SIGTERM'd uncleanly".

Templates regenerated via `task charts:generate:redpanda`; golden
files refreshed via `go test ./charts/redpanda -run TestTemplate
-update-golden`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…comment indent

Two follow-on CI failures from the original wrapLifecycleHook change:

  - operator/internal/lifecycle/testdata/redpanda-cases.pools.golden.txtar
    rendered against the old hook script body. The lifecycle test
    suite compares against rendered chart YAML, so any change to the
    hook output ripples here. Refreshed via:

      go test ./operator/internal/lifecycle/... -run TestV2ResourceClient \
        -update-golden -count=1

  - charts/redpanda/statefulset.go's wrapLifecycleHook doc comment
    was caught by gofumpt's bulleted-list indent rule (it prefers
    "  - " continuation lines over " - "). golangci-lint --fix
    rewrote it, which the ci:lint step's `git diff --exit-code`
    then flagged. Accepted gofumpt's formatting.

operator/multicluster/scripts.go has its own copy of wrapLifecycleHook
and its golden file (stretch-cluster-cases.pools.golden.txtar) is
unchanged — confirming the original "no multicluster changes" rule
held end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu marked this pull request as ready for review May 21, 2026 18:43
…cing

Adds TestWrapLifecycleHook_TimeoutSurfaced which executes the
actual rendered bash via os/exec and pins three behaviors:

  1. Hook body exceeds budget → "TIMEOUT after Ns ... (exit 124)"
     marker line appears in stdout (PR #1548's guarantee).
  2. Hook body completes within budget → no TIMEOUT line (no
     false positive).
  3. Hook exits non-zero for a non-timeout reason (e.g. exit 42)
     → no TIMEOUT line, and the trailing `true` still keeps the
     wrapper's overall exit 0 (the safety property the original
     `; true` was protecting).

The chart's golden file already pins the *shape* of the rendered
string. This test pins the *runtime behavior* so a future refactor
that silently drops the PIPESTATUS branch (or breaks the bash
quoting) gets caught.

Test skips when:
  - GOOS != linux: the wrapper hardcodes /proc/1/fd/1 (we
    substitute /dev/stdout for the test but only on linux to stay
    close to the production execution path).
  - timeout(1) not on PATH: GNU coreutils utility, present on
    every CI runner inside the nix devshell.

Verified locally: skip path triggers on darwin. CI's linux nix
devshell will exercise the three subtests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu
Copy link
Copy Markdown
Contributor Author

Real-deployment before/after on rancher-desktop k3s

Verified end-to-end with two helm installs against the same 3-broker chart configuration (terminationGracePeriodSeconds: 4 → wrapper budget timeout -v 2), changing only the chart version.

For a deterministic comparison the actual rendered preStop wrapper was invoked via kubectl exec against a hook body that exceeds the budget (echo hook-started; sleep 8; echo hook-finished). This is exactly the script the kubelet would run on pod delete, just bypassing log-disappear-on-pod-deletion in the test path.

BEFORE — chart redpanda/redpanda v26.1.3 (current released)

$ kubectl -n redpanda exec redpanda-0 -c redpanda -- bash -c '
  timeout -v 2 bash -c "echo hook-started; sleep 8; echo hook-finished" 2>&1 \
    | sed "s/^/lifecycle-hook pre-stop $(date +%H:%M:%S): /" | tee /dev/null; true
  echo "[wrapper exit code: $?]"'

lifecycle-hook pre-stop 19:26:40: hook-started
lifecycle-hook pre-stop 19:26:40: timeout: sending signal TERM to command 'bash'
[wrapper exit code: 0]

The only diagnostic is timeout's own one-liner. The trailing ; true reports exit 0, so kubelet records the preStop as completed cleanly. From kubectl logs and kubectl describe pod perspective, nothing has gone wrong.

AFTER — this PR

$ kubectl -n redpanda exec redpanda-0 -c redpanda -- bash -c '
  timeout -v 2 bash -c "echo hook-started; sleep 8; echo hook-finished" 2>&1 \
    | sed "s/^/lifecycle-hook pre-stop $(date +%H:%M:%S): /" | tee /dev/null
  ec=${PIPESTATUS[0]}
  if [ "$ec" = "124" ] || [ "$ec" = "137" ]; then
    echo "lifecycle-hook pre-stop $(date +%H:%M:%S): TIMEOUT after 2s — hook killed before completion; the broker will receive SIGTERM with work in-flight (exit $ec)"
  fi
  echo "[wrapper exit code: $?]"'

lifecycle-hook pre-stop 19:26:42: hook-started
lifecycle-hook pre-stop 19:26:42: timeout: sending signal TERM to command 'bash'
lifecycle-hook pre-stop 19:26:44: TIMEOUT after 2s — hook killed before completion; the broker will receive SIGTERM with work in-flight (exit 124)
[wrapper exit code: 0]

Distinct, grep-able marker line. Exit 124 surfaced. Wrapper still exits 0 — the safety property the original ; true was protecting (non-zero hook ≠ container kill) is preserved.

Sanity (no false positive)

$ kubectl -n redpanda exec redpanda-0 -c redpanda -- bash -c '
  timeout -v 2 bash -c "echo fast-hook-done" 2>&1 \
    | sed "s/^/lifecycle-hook pre-stop $(date +%H:%M:%S): /"
  ec=${PIPESTATUS[0]}
  if [ "$ec" = "124" ] || [ "$ec" = "137" ]; then
    echo "TIMEOUT — should not print (exit $ec)"
  fi
  echo "[exit code from PIPESTATUS[0]: $ec]"'

lifecycle-hook pre-stop 19:22:31: fast-hook-done
[exit code from PIPESTATUS[0]: 0]

No TIMEOUT line. Branch fires only on 124/137.

Integration test added (83e9e43d)

I also added TestWrapLifecycleHook_TimeoutSurfaced in charts/redpanda/wrap_lifecycle_hook_test.go. It executes the actual rendered bash via os/exec and pins three behaviors:

  1. Hook exceeds budgetTIMEOUT after Ns ... (exit 124) marker line in stdout
  2. Hook completes in time → no marker (no false positive)
  3. Hook exits non-zero for non-timeout reason (e.g. exit 42) → no marker, wrapper still exits 0 (the safety property the original ; true was protecting)

The chart's golden file already pins the shape of the rendered string. This new test pins the runtime behavior so a future refactor that silently drops the PIPESTATUS check or breaks the bash quoting gets caught.

Skipped when GOOS != linux (the wrapper hardcodes /proc/1/fd/1) or when timeout(1) is missing from PATH; both are satisfied by CI's nix devshell.

Two issues in the test I just landed surfaced on CI:

1. wrapLifecycleHook joins its cmd slice with " ". So passing
   ["bash", "-c", "echo X; sleep Y"] produces the bash string
   `bash -c echo X; sleep Y` — the semicolons reinterpret as
   statement separators in the *outer* bash. Production uses
   argv-style: ["bash", "-x", "/var/lifecycle/preStop.sh"]. The
   test now follows the same contract: write the hook body to a
   tempfile and invoke via `bash -x <path>`.

2. t.TempDir() includes the subtest name in the path. Go's
   testing harness can emit names containing parentheses
   ("non-zero_(not_timeout)") which break unquoted bash parsing
   in the rendered wrapper. Replaced with a single process-local
   os.MkdirTemp directory shared across subtests with a clean
   filename per test.

Both verified locally on linux/arm64 via docker — all 3 subtests
pass:

  --- PASS: TestWrapLifecycleHook_TimeoutSurfaced (2.04s)
      --- PASS: timeout_fires_emits_TIMEOUT_marker         (2.02s)
      --- PASS: fast_hook_does_not_emit_TIMEOUT_marker     (0.01s)
      --- PASS: hook_exits_non-zero_...trailing_true       (0.00s)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the stale label May 27, 2026
@david-yu david-yu removed the stale label May 27, 2026
@RafalKorepta RafalKorepta merged commit cd377b7 into main May 28, 2026
13 checks passed
RafalKorepta pushed a commit that referenced this pull request May 28, 2026
…cing

Adds TestWrapLifecycleHook_TimeoutSurfaced which executes the
actual rendered bash via os/exec and pins three behaviors:

  1. Hook body exceeds budget → "TIMEOUT after Ns ... (exit 124)"
     marker line appears in stdout (PR #1548's guarantee).
  2. Hook body completes within budget → no TIMEOUT line (no
     false positive).
  3. Hook exits non-zero for a non-timeout reason (e.g. exit 42)
     → no TIMEOUT line, and the trailing `true` still keeps the
     wrapper's overall exit 0 (the safety property the original
     `; true` was protecting).

The chart's golden file already pins the *shape* of the rendered
string. This test pins the *runtime behavior* so a future refactor
that silently drops the PIPESTATUS branch (or breaks the bash
quoting) gets caught.

Test skips when:
  - GOOS != linux: the wrapper hardcodes /proc/1/fd/1 (we
    substitute /dev/stdout for the test but only on linux to stay
    close to the production execution path).
  - timeout(1) not on PATH: GNU coreutils utility, present on
    every CI runner inside the nix devshell.

Verified locally: skip path triggers on darwin. CI's linux nix
devshell will exercise the three subtests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RafalKorepta RafalKorepta deleted the dyu/chart-prestop-drain-hardening branch May 28, 2026 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants