
fix: Add chart control for updateStrategy to brokers and proxies#668

Merged
lhotari merged 1 commit into apache:master from mezmo:broker_proxy_update_strategy
Mar 20, 2026

Conversation

@darinspivey
Contributor

Fixes: #667

Motivation

A user can control the `updateStrategy` for the pods of bookies and zookeeper. However, the values for brokers and proxies are hardcoded. Being able to control this value via the Helm chart is crucial for performing smooth chart upgrades on a fully-running cluster. For example, setting the strategy to `OnDelete` would allow a user to control the order in which the pods are restarted after an upgrade.

Modifications

For Broker and Proxy templates, set a default, but allow overrides from .Values.xxx.updateStrategy
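A minimal sketch of the pattern described above, assuming a StatefulSet template for the broker (the proxy template would be analogous; the exact template paths and value key names in the chart may differ):

```yaml
# templates/broker-statefulset.yaml (sketch, not the PR's exact diff)
spec:
  updateStrategy:
{{- if .Values.broker.updateStrategy }}
{{ toYaml .Values.broker.updateStrategy | indent 4 }}
{{- else }}
    type: RollingUpdate
{{- end }}
```

A user could then set, for example, `broker.updateStrategy.type: OnDelete` in their values and delete pods one at a time to control the restart order; leaving the value unset keeps the previous `RollingUpdate` behavior.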

Verifying this change

  • Make sure that the change passes the CI checks.

@darinspivey
Contributor Author

darinspivey commented Mar 19, 2026

@lhotari you're very active in the pulsar repos. I was wondering what you thought of this, and if you had any thoughts about how to make upgrades more controlled in the future (when using helm)?

@lhotari
Member

lhotari commented Mar 19, 2026

> @lhotari you're very active in the pulsar repos. I was wondering what you thought of this, and if you had any thoughts about how to make upgrades more controlled in the future (when using helm)?

In the issue, you mentioned "Not having control restarts all components at once which renders a fully-operational cluster in a bad error state." I think that wouldn't be expected when rolling restarts are performed, and it's a bug. Please share more details of what type of bad error state you end up in. I know that there are some bugs that could cause this. Sharing the Pulsar version would help to see if there's a fix in newer versions. The client version could also matter in some cases. Sharing that would be helpful too.

In general, it would be useful to perform upgrades "slowly" so that each set of components is handled separately and upgraded before moving on to the next ones. For example, upgrading ZooKeeper, then BookKeeper, and finally Brokers & Proxies. The order doesn't matter that much since newer versions should always be able to talk to older component versions.

Even without handling restarts separately, it shouldn't result in the cluster getting into a bad error state unless the high load causes the system to collapse when there are a lot of component restarts at once.

When the Pulsar version is upgraded, one possible solution is to manage the images for the different components separately in values.yaml and not rely on the default that changes the image for all components at once. In that case, one would perform multiple Helm deployments while upgrading. This would work for cases where only the Pulsar image is upgraded. However, if the chart is upgraded, there could be many changes that impact multiple different components and cause them to restart.
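For example, the per-component `images` section in values.yaml can pin each component's image independently, so each `helm upgrade` only rolls one set of pods (the key names below are illustrative and should be checked against the chart's actual values.yaml):

```yaml
# values.yaml -- per-component images, advanced one deployment at a time
images:
  zookeeper:
    repository: apachepulsar/pulsar-all
    tag: 3.0.5        # upgraded in the first helm deployment
  bookie:
    repository: apachepulsar/pulsar-all
    tag: 3.0.5        # upgraded in the second deployment
  broker:
    repository: apachepulsar/pulsar-all
    tag: 3.0.4        # still on the old version; bumped in a later deployment
  proxy:
    repository: apachepulsar/pulsar-all
    tag: 3.0.4
```

Each deployment changes only one component's tag, so only that component's pods restart, approximating the ZooKeeper-then-BookKeeper-then-Brokers/Proxies order described above.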

This is just one thought on some solutions. It would be great if you could contribute a section to the README.md file about handling upgrades in a controlled way and what problems it resolves.

One known issue with brokers in a full rolling restart is that there's also a lot of shuffling due to load balancing. Bundles get moved across brokers resulting in disruptions in traffic for producers and consumers until the cluster stabilizes itself. This mainly matters at very high throughput / workload when resources aren't heavily over-provisioned.
There has been a plan to address this problem with https://github.com/apache/pulsar/blob/master/pip/pip-192.md and https://github.com/apache/pulsar/blob/master/pip/pip-307.md. The implementation exists, but experience suggests it's not yet stable and would require more contributions to harden it.

@lhotari lhotari merged commit 339d2d5 into apache:master Mar 20, 2026
68 of 70 checks passed
@lhotari
Member

lhotari commented Mar 20, 2026

FYI StreamNative cloud has a feature "Graceful Cluster Rollout", https://docs.streamnative.io/private-cloud/v2/configure-private-cloud/advanced/private-cloud-graceful-cluster-rollout . That relies on PIP-192 and PIP-307 in addition to orchestration by the StreamNative Operator (a commercial product).

@darinspivey
Contributor Author

@lhotari thanks for your quick attention here. I'd be happy to provide more visibility as I learn more about Pulsar--this is my first production implementation of it, and we're still in the tweaking stage, but loving it so far (as compared to Kafka). Here are the versions I'm currently using:

pulsar-helm-chart: 4.5.0
pulsar-client: ^1.16.0 (nodejs publishing tier)
pulsar-rs: 6.7.1 (consumption tier)

I might have generalized too much about the 'bad state' of the system. What I've generally seen is the fire-and-forget upgrade where multiple components restart at the same time. When that happens, it's more of a stampede problem when you have thousands of topics and busy producers. I've seen it blow up ZK with a flood of lookups, brokers crashing because they can't handle the bundle handoffs, and heat on the bookies when the ensembles lose member nodes. All of that together just makes for a bad situation--so far, I've only been able to recover from these situations by turning off Pulsar, as I'm still tuning production on a trial basis.

> In general, it would be useful to perform upgrades "slowly" so that each set of components is handled separately and upgraded before moving on to the next ones.

This PR here is a crude attempt at doing this, but perhaps a more elegant way of rolling the components out is warranted. I agree that the 'slow upgrade' approach feels better in terms of control, and you've offered up a few good suggestions in that area.

> Even without handling restarts separately, it shouldn't result in the cluster getting into a bad error state unless the high load causes the system to collapse when there are a lot of component restarts at once.

Yes, this is it, primarily. ZK might also have a preferred restart order based on the leader, but I think that's a minor thing compared to the traffic stampedes that happen.

> This is just one thought on some solutions. It would be great if you could contribute a section to the README.md file about handling upgrades in a controlled way and what problems it resolves.

I'm happy to contribute anything I can as I learn more about the system's behavior through my own testing. Would you prefer such a contribution before any official upgrade controls exist? It could just cover what's worked for me based on this PR's change.

> One known issue with brokers in a full rolling restart is that there's also a lot of shuffling due to load balancing.

Yes, 100% true. At first, I wasn't able to roll brokers without experiencing super high latency on my publishing tier, which was disastrous. I was able to mitigate this situation by doing a few things:

  • First, setting `terminationGracePeriodSeconds: 300`, which allows enough time for all topics to unload prior to being killed by k8s. I've found that the topics usually unload fine in under a minute, which is only slightly longer than the k8s default of (I think) 30s.

  • Also, a big win was to set the liveness probe to initialDelaySeconds: 30 with the thought being "let the last broker fully come up before the next one rolls--that way the newest broker can be the most appropriate candidate to handle the next broker's offloaded topics."

  • I had to also make sure that my publishers had ample operationTimeoutSeconds of 20-30. Initially, I had this set low in a "fail fast" mentality, only to learn that it caused more of a flood to the proxies and to ZK. Taking the bundle move hit (latency) up front allowed for a quicker recovery when the bundles settle into their new broker. For the record, I've found that MOST of the pulsar defaults have been pretty close to production-ready. The default setting for this (30 seconds?) is another example of that--I shouldn't have changed it, lol.

I'm not sure which (or all) of these helped most, but that fixed my stampede problems with rolling brokers. The last broker to roll is mostly idle when the rolling is done, but then the ThresholdShedder takes care of balancing it out smoothly.
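The mitigations above could be expressed roughly as the following values overrides (key names are illustrative and should be checked against the chart's actual values.yaml; the operation timeout belongs in the client application's configuration, not the chart):

```yaml
# values.yaml -- sketch of the rolling-restart mitigations described above
broker:
  # Give topics time to unload cleanly before k8s force-kills the pod
  terminationGracePeriodSeconds: 300
  probe:
    liveness:
      enabled: true
      # Let the newly started broker settle before the next one rolls,
      # so it can absorb the next broker's offloaded bundles
      initialDelaySeconds: 30

# Client side (e.g. the Node.js pulsar-client): keep operationTimeoutSeconds
# around 20-30 rather than "failing fast", to avoid lookup floods during
# bundle moves.
```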

@darinspivey darinspivey deleted the broker_proxy_update_strategy branch March 23, 2026 13:53


Successfully merging this pull request may close these issues.

Broker and Proxy have no control over updateStrategy