
fix: Improve Zookeeper initialization wait logic to support multi url configuration store #671

Open
ganeshkalyank wants to merge 1 commit into apache:master from ganeshkalyank:fix-zk-cs-init

@ganeshkalyank

Fixes #670

Motivation

When using a multi-URL configuration store (e.g., zk1:2181,zk2:2181), the wait-zk-cs-ready init container fails because nslookup cannot resolve comma-separated hostnames. This causes initialization to time out even when ZooKeeper is already accessible.
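
To illustrate the failure mode: nslookup expects a single host name, so the whole comma-separated string is not a valid lookup target. A minimal sketch of how the string breaks down into individual hosts (the helper name `cs_hosts` is hypothetical and not part of the chart):

```shell
#!/bin/sh
# Hypothetical helper: print one host name per line from a ZooKeeper
# connection string such as "zk1:2181,zk2:2181".
cs_hosts() {
  # Split on commas, then drop a trailing ":port" from each entry.
  printf '%s\n' "$1" | tr ',' '\n' | sed 's/:[0-9]*$//'
}

cs_hosts "zk1:2181,zk2:2181"
```

Each printed name could be resolved on its own, but that only checks DNS and not whether ZooKeeper is actually serving requests.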

Modifications

Replaced nslookup with bin/pulsar zookeeper-shell -server <connection string> ls /, which accepts the full ZooKeeper connection string, including comma-separated multi-URL formats.
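
For reference, a rough sketch of the polling pattern. This is not the chart's actual template: the helper `wait_until` and its attempt limit are assumptions added for illustration (the loop in this PR retries without a limit), and the connection string in the usage comment is the example from the Motivation section.

```shell
#!/bin/sh
# Hypothetical helper: retry a probe command until it succeeds or the
# attempt budget is exhausted.
# usage: wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "probe failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "probe failed (attempt ${attempt}), retrying in ${delay}s ..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Assumed usage inside the init container:
# wait_until 100 3 timeout 15 bin/pulsar zookeeper-shell -server "zk1:2181,zk2:2181" ls /
```

Bounding the attempts makes the init container fail visibly instead of waiting forever; whether that is preferable to an unbounded wait depends on how the Job is expected to be restarted.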

Verifying this change

  • Make sure that the change passes the CI checks.

Copilot AI review requested due to automatic review settings March 28, 2026 05:22

Copilot AI left a comment

Pull request overview

Updates the Pulsar Helm chart’s cluster-initialization Job to wait for a multi-URL ZooKeeper configuration store using a ZooKeeper-aware command instead of DNS lookup, addressing init failures when configurationStore contains comma-separated hosts.

Changes:

  • Replace nslookup-based waiting for configurationStore with bin/pulsar zookeeper-shell ... ls / polling.
  • Set a smaller JVM heap (PULSAR_MEM) for the wait probe to reduce init-container memory usage.

Comment on lines +74 to +75
until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;

Copilot AI Mar 28, 2026

The readiness probe uses -server {{ .Values.pulsar_metadata.configurationStore }} directly, but the actual --configuration-store used later is built via pulsar.configurationStore.connect (which incorporates configurationStorePort). If configurationStorePort is customized (or differs from ZooKeeper’s default 2181), this loop may keep failing even though the configuration store endpoint used for initialization is reachable. Use the same connection string here as the initialization step (and quote it to avoid shell word-splitting).

Suggested change
- until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
- echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
+ until timeout 15 bin/pulsar zookeeper-shell -server "{{ template "pulsar.configurationStore.connect" . }}" ls /; do
+ echo "configurationStore {{ template "pulsar.configurationStore.connect" . }} is unreachable... check in 3 seconds ..." && sleep 3;
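
The quoting point can be seen in isolation with a small sketch. The helper `count_args` and the value of `cs` (note the stray space after the comma) are illustrative assumptions, not values taken from the chart:

```shell
#!/bin/sh
# Hypothetical helper: report how many arguments the shell passed in.
count_args() { echo "$#"; }

# Illustrative connection string with whitespace after the comma.
cs="zk1:2181, zk2:2181"

count_args $cs     # unquoted: the shell splits on the space
count_args "$cs"   # quoted: passed through as a single argument
```

A connection string with no whitespace would survive unquoted, but quoting keeps the -server argument intact regardless of what the values file contains.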

Comment on lines +73 to 76
export PULSAR_MEM="-Xmx128M";
until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
done;

Copilot AI Mar 28, 2026

wait-zk-cs-ready now uses bin/pulsar zookeeper-shell to probe ZooKeeper, but it doesn’t apply the chart’s ZooKeeper TLS client settings. When .Values.tls.enabled and .Values.tls.zookeeper.enabled are true, this probe will fail even if the configuration store is reachable over TLS, blocking initialization. Consider including pulsar.toolset.zookeeper.tls.settings before invoking bin/pulsar (and ensure the initContainer mounts the toolset cert/CA volumes so those settings work).

Suggested change
- export PULSAR_MEM="-Xmx128M";
- until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
- echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
- done;
+ export PULSAR_MEM="-Xmx128M";
+ {{- include "pulsar.toolset.zookeeper.tls.settings" . | nindent 12 }}
+ until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
+ echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
+ done;
+ volumeMounts:
+ {{- include "pulsar.toolset.certs.volumeMounts" . | nindent 8 }}

Development

Successfully merging this pull request may close these issues.

Cluster initialization fails when using multi url config store
