
Increase default probe initialDelaySeconds from 10 to 60 #34

Open
delthas wants to merge 1 commit into adobe:master from delthas:fix-probe

Conversation


@delthas delthas commented Feb 23, 2026

The default liveness and readiness probe `initialDelaySeconds` (10s) is incompatible with the startup script's DNS retry loop, which can take up to 42 seconds in the worst case. This causes pods to enter CrashLoopBackOff: the liveness probe kills the container at ~30s (first probe at `initialDelaySeconds` 10, then `failureThreshold` 3 consecutive failures at `periodSeconds` 10 intervals), before ZooKeeper has a chance to start.
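The kill-time arithmetic above can be checked directly; this is just the timeline implied by the probe fields, not anything from the chart itself:

```shell
#!/bin/sh
# Probe timeline with the old defaults: the first probe fires at
# initialDelaySeconds, then every periodSeconds; the kubelet restarts the
# container on the failureThreshold-th consecutive failure.
initial=10; period=10; threshold=3
kill_at=$((initial + (threshold - 1) * period))   # failures at t=10, 20, 30
echo "container killed at ~${kill_at}s"
```

Since the DNS loop alone can run ~42s, the container is killed well before startup can finish.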

The DNS retry loop in zookeeperStart.sh retries `getent hosts $DOMAIN` up to 21 times with a 2-second sleep between attempts. For a single-node cluster, or any case where the headless service has no ready endpoints, DNS will never resolve during the loop, and the script must wait the full ~42 seconds before proceeding to start ZooKeeper.
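The loop in question looks roughly like the following. This is a sketch reconstructed from the description above, not the actual zookeeperStart.sh; the `DOMAIN` value is a hypothetical headless-service name:

```shell
#!/bin/sh
# Sketch of the retry loop: up to 21 getent attempts (count 20 down to 0)
# with a 2-second sleep after each failure, ~42s worst case.
DOMAIN="${DOMAIN:-zk-headless.default.svc.cluster.local}"  # hypothetical

resolve_with_retries() {
  count=20
  while [ "$count" -ge 0 ]; do
    if getent hosts "$DOMAIN" >/dev/null 2>&1; then
      return 0                  # resolved: ensemble endpoints exist
    fi
    count=$((count - 1))
    sleep 2                     # 21 failed attempts => ~42s total
  done
  return 1                      # never resolved: proceed as a lone node
}

# resolve_with_retries && : configure ensemble, then start ZooKeeper
```

With no ready endpoints behind the headless service, every `getent` call fails and the function only returns after the full 21 attempts.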

History of the DNS check in zookeeperStart.sh:

1. 97ddb6e - Original: simple `nslookup`, no retry. DNS failure meant no ensemble; the script moved on immediately.

2. ed1f1d1 - "Added polling for checking headless service is active": introduced the retry loop (count=20, sleep 2) because `nslookup` of the headless service can fail transiently even when an active ensemble exists. The loop was guarded by `$MYID -ne 1`, so the first node skipped it entirely.

3. 5c86f53 - "Observers fail to register when zk ensemble service domain is not yet available": added an `elif nslookup $DOMAIN | grep "server can't find"` fast path to skip the retry loop when DNS definitively says "not found". This also removed the `$MYID -ne 1` guard.

4. c693909 - "Use getent instead of nslookup for starting scripts": replaced `nslookup` with `getent`. Dropped the `elif` because `getent` does not produce a parseable "server can't find" message. This restored the retry-always behavior from step 2, but without the `MYID` guard, meaning all nodes now unconditionally wait up to 42 seconds when DNS does not resolve.

The probe defaults were never updated to account for step 4, so pods that hit the full DNS retry path are killed before startup completes. Increasing `initialDelaySeconds` to 60 gives the startup script time to exhaust the DNS loop and start ZooKeeper before probes begin firing.
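Concretely, the change amounts to raising one field in the chart's default probe blocks. A sketch of the resulting values, using standard Kubernetes probe fields (the `exec` command shown is a placeholder, not necessarily this chart's actual health check):

```yaml
livenessProbe:
  exec:
    command: ["zkOk.sh"]     # placeholder; the chart's real check may differ
  initialDelaySeconds: 60    # was 10: must outlast the ~42s DNS retry loop
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  exec:
    command: ["zkOk.sh"]     # placeholder
  initialDelaySeconds: 60    # was 10
  periodSeconds: 10
  failureThreshold: 3
```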
