fix: Improve Zookeeper initialization wait logic to support multi url configuration store #671
ganeshkalyank wants to merge 1 commit into apache:master
Pull request overview
Updates the Pulsar Helm chart’s cluster-initialization Job to wait for a multi-URL ZooKeeper configuration store using a ZooKeeper-aware command instead of DNS lookup, addressing init failures when configurationStore contains comma-separated hosts.
Changes:
- Replace nslookup-based waiting for configurationStore with bin/pulsar zookeeper-shell ... ls / polling.
- Set a smaller JVM heap (PULSAR_MEM) for the wait probe to reduce init-container memory usage.
until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
  echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
The readiness probe uses -server {{ .Values.pulsar_metadata.configurationStore }} directly, but the actual --configuration-store used later is built via pulsar.configurationStore.connect (which incorporates configurationStorePort). If configurationStorePort is customized (or differs from ZooKeeper’s default 2181), this loop may keep failing even though the configuration store endpoint used for initialization is reachable. Use the same connection string here as the initialization step (and quote it to avoid shell word-splitting).
- until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
-   echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
+ until timeout 15 bin/pulsar zookeeper-shell -server "{{ template "pulsar.configurationStore.connect" . }}" ls /; do
+   echo "configurationStore {{ template "pulsar.configurationStore.connect" . }} is unreachable... check in 3 seconds ..." && sleep 3;
export PULSAR_MEM="-Xmx128M";
until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
  echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
done;
wait-zk-cs-ready now uses bin/pulsar zookeeper-shell to probe ZooKeeper, but it doesn’t apply the chart’s ZooKeeper TLS client settings. When .Values.tls.enabled and .Values.tls.zookeeper.enabled are true, this probe will fail even if the configuration store is reachable over TLS, blocking initialization. Consider including pulsar.toolset.zookeeper.tls.settings before invoking bin/pulsar (and ensure the initContainer mounts the toolset cert/CA volumes so those settings work).
- export PULSAR_MEM="-Xmx128M";
- until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
-   echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
- done;
+ export PULSAR_MEM="-Xmx128M";
+ {{- include "pulsar.toolset.zookeeper.tls.settings" . | nindent 12 }}
+ until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
+   echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
+ done;
+ volumeMounts:
+ {{- include "pulsar.toolset.certs.volumeMounts" . | nindent 8 }}
Fixes #670
Motivation
When using a multi-URL configuration store (e.g., zk1:2181,zk2:2181), the wait-zk-cs-ready init container fails because nslookup cannot resolve comma-separated hostnames. This causes initialization to time out even when ZooKeeper is already accessible.
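To make the failure mode concrete: a string such as zk1:2181,zk2:2181 is a list of endpoints, not one hostname, so a single DNS lookup on the whole value cannot succeed. The sketch below is only an illustration, not code from the chart; split_zk_hosts is a hypothetical helper that shows what a DNS-based check would have had to extract first. It assumes plain host:port entries with no chroot path suffix.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): print one hostname per line
# from a ZooKeeper connection string such as "zk1:2181,zk2:2181".
# Assumption: entries are host or host:port, with no chroot path suffix.
split_zk_hosts() {
  # Split on commas, then strip an optional trailing :port from each entry.
  printf '%s\n' "$1" | tr ',' '\n' | sed 's/:[0-9]*$//'
}

split_zk_hosts "zk1:2181,zk2:2181"
# prints:
# zk1
# zk2
```

Even with such splitting, resolving a name only proves that DNS answers, not that ZooKeeper accepts connections, which is why a ZooKeeper-aware probe is the more direct check.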
Modifications
Replaced nslookup with bin/pulsar zookeeper-shell -server ls /, which supports the full ZooKeeper connection string including multi-URL formats.
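The polling pattern used by the init container can be sketched as a small reusable function. This is a minimal sketch, not the chart's implementation: wait_until is a hypothetical helper, and the attempt cap is an addition for illustration, since the loop in this PR retries without a limit.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds, sleeping between tries.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed... check in $delay seconds ..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Possible use in the init container (assumption: ZK_CONNECT holds the full
# connection string, e.g. "zk1:2181,zk2:2181"):
# wait_until 100 3 timeout 15 bin/pulsar zookeeper-shell -server "$ZK_CONNECT" ls /
```

Wrapping each probe in timeout, as the PR does, keeps a hung client from blocking the loop; quoting the connection string avoids shell word-splitting.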
Verifying this change