Increase default probe initialDelaySeconds from 10 to 60 #34
Open
delthas wants to merge 1 commit into adobe:master from
Conversation
The default liveness and readiness probe initialDelaySeconds (10s) is incompatible with the startup script's DNS retry loop, which can take up to 42 seconds in the worst case. This causes pods to enter CrashLoopBackOff: the liveness probe kills the container at ~30s (initialDelaySeconds 10 + failureThreshold 3 × periodSeconds 10), before ZooKeeper has a chance to start.

The DNS retry loop in zookeeperStart.sh retries `getent hosts $DOMAIN` up to 21 times with a 2-second sleep between attempts. For a single-node cluster, or any case where the headless service has no ready endpoints, DNS will never resolve during the loop, and the script must wait the full ~42 seconds before proceeding to start ZooKeeper.

History of the DNS check in zookeeperStart.sh:

1. 97ddb6e - Original: a simple `nslookup`, no retry. DNS failure meant no ensemble, and the script moved on immediately.
2. ed1f1d1 - "Added polling for checking headless service is active": introduced the retry loop (count=20, sleep 2) because an nslookup of the headless service can fail transiently even when an active ensemble exists. The loop was guarded by `$MYID -ne 1`, so the first node skipped it entirely.
3. 5c86f53 - "Observers fail to register when zk ensemble service domain is not yet available": added an `elif nslookup $DOMAIN | grep "server can't find"` fast path to skip the retry loop when DNS definitively says "not found". This also removed the `$MYID -ne 1` guard.
4. c693909 - "Use getent instead of nslookup for starting scripts": replaced nslookup with getent. The elif was dropped because getent does not produce a parseable "server can't find" message. This restored the retry-always behavior from step 2, but without the MYID guard, meaning all nodes now unconditionally wait up to 42 seconds when DNS does not resolve.

The probe defaults were never updated to account for step 4, so pods that hit the full DNS retry path are killed before startup completes.
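The wait described above behaves roughly like the following sketch (the variable names, the localhost default, and the exact loop structure are assumptions for illustration, not the verbatim zookeeperStart.sh source):

```shell
#!/bin/sh
# Sketch of the DNS wait: retry `getent hosts $DOMAIN` up to 21 times,
# sleeping 2s between failed attempts, for a ~42s worst case.
# DOMAIN defaults to localhost here for illustration only; the real
# script uses the ZooKeeper headless service name.
DOMAIN="${DOMAIN:-localhost}"
count=21
while [ "$count" -gt 0 ]; do
  if getent hosts "$DOMAIN" >/dev/null 2>&1; then
    echo "resolved: $DOMAIN"
    break
  fi
  count=$((count - 1))
  sleep 2
done
# If the loop exhausts every attempt, the script proceeds to start
# ZooKeeper anyway, roughly 42 seconds after it began.
```

Note that the loop only exits early on a successful lookup; a definitive NXDOMAIN answer from getent looks the same as a transient failure, which is why the fast path from 5c86f53 could not survive the switch away from nslookup.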
Increasing initialDelaySeconds to 60 gives the startup script time to exhaust the DNS loop and start ZooKeeper before probes begin firing.
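For reference, the timing works out as follows under the Kubernetes Pod spec probe fields (the exec command shown is a placeholder, not necessarily this chart's actual probe):

```yaml
livenessProbe:
  exec:
    command: ["zkOk.sh"]     # placeholder command for illustration
  initialDelaySeconds: 60    # previously 10: first probe fired at 10s
  periodSeconds: 10
  failureThreshold: 3
# With initialDelaySeconds: 10, probes run at ~10s, ~20s, ~30s; the third
# consecutive failure kills the container at ~30s, well inside the ~42s
# worst-case DNS retry window. A 60s delay lets the loop run to completion.
```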