
Increase default probe initialDelaySeconds from 10 to 60 #34

Open
delthas wants to merge 1 commit into adobe:master from delthas:fix-probe

Conversation


@delthas delthas commented Feb 23, 2026

The default liveness and readiness probe `initialDelaySeconds` (10s) is incompatible with the startup script's DNS retry loop, which can take up to 42 seconds in the worst case. This causes pods to enter CrashLoopBackOff: the liveness probe kills the container at ~30s (first probe at `initialDelaySeconds` 10, then `failureThreshold` 3 consecutive failures at `periodSeconds` 10 intervals), before ZooKeeper has a chance to start.
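The kill-time arithmetic above can be checked directly; this is just the timeline implied by the probe fields, not anything from the chart itself:

```shell
#!/bin/sh
# Probe timeline with the old defaults: the first probe fires at
# initialDelaySeconds, then every periodSeconds; the kubelet restarts the
# container on the failureThreshold-th consecutive failure.
initial=10; period=10; threshold=3
kill_at=$((initial + (threshold - 1) * period))   # failures at t=10, 20, 30
echo "container killed at ~${kill_at}s"
```

Since the DNS loop alone can run ~42s, the container is killed well before startup can finish.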

The DNS retry loop in zookeeperStart.sh retries `getent hosts $DOMAIN` up to 21 times with a 2-second sleep between attempts. For a single-node cluster, or any case where the headless service has no ready endpoints, DNS will never resolve during the loop, and the script must wait the full ~42 seconds before proceeding to start ZooKeeper.
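The loop in question looks roughly like the following. This is a sketch reconstructed from the description above, not the actual zookeeperStart.sh; the `DOMAIN` value is a hypothetical headless-service name:

```shell
#!/bin/sh
# Sketch of the retry loop: up to 21 getent attempts (count 20 down to 0)
# with a 2-second sleep after each failure, ~42s worst case.
DOMAIN="${DOMAIN:-zk-headless.default.svc.cluster.local}"  # hypothetical

resolve_with_retries() {
  count=20
  while [ "$count" -ge 0 ]; do
    if getent hosts "$DOMAIN" >/dev/null 2>&1; then
      return 0                  # resolved: ensemble endpoints exist
    fi
    count=$((count - 1))
    sleep 2                     # 21 failed attempts => ~42s total
  done
  return 1                      # never resolved: proceed as a lone node
}

# resolve_with_retries && : configure ensemble, then start ZooKeeper
```

With no ready endpoints behind the headless service, every `getent` call fails and the function only returns after the full 21 attempts.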

History of the DNS check in zookeeperStart.sh:

1. 97ddb6e - Original: simple `nslookup`, no retry. DNS failure meant no ensemble; the script moved on immediately.

2. ed1f1d1 - "Added polling for checking headless service is active": introduced the retry loop (count=20, sleep 2) because `nslookup` of the headless service can fail transiently even when an active ensemble exists. The loop was guarded by `$MYID -ne 1`, so the first node skipped it entirely.

3. 5c86f53 - "Observers fail to register when zk ensemble service domain is not yet available": added an `elif nslookup $DOMAIN | grep "server can't find"` fast path to skip the retry loop when DNS definitively says "not found". This also removed the `$MYID -ne 1` guard.

4. c693909 - "Use getent instead of nslookup for starting scripts": replaced `nslookup` with `getent`. Dropped the `elif` because `getent` does not produce a parseable "server can't find" message. This restored the retry-always behavior from step 2, but without the `MYID` guard, meaning all nodes now unconditionally wait up to 42 seconds when DNS does not resolve.

The probe defaults were never updated to account for step 4, so pods that hit the full DNS retry path are killed before startup completes. Increasing `initialDelaySeconds` to 60 gives the startup script time to exhaust the DNS loop and start ZooKeeper before probes begin firing.
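Concretely, the change amounts to raising one field in the chart's default probe blocks. A sketch of the resulting values, using standard Kubernetes probe fields (the `exec` command shown is a placeholder, not necessarily this chart's actual health check):

```yaml
livenessProbe:
  exec:
    command: ["zkOk.sh"]     # placeholder; the chart's real check may differ
  initialDelaySeconds: 60    # was 10: must outlast the ~42s DNS retry loop
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  exec:
    command: ["zkOk.sh"]     # placeholder
  initialDelaySeconds: 60    # was 10
  periodSeconds: 10
  failureThreshold: 3
```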
