Skip to content

[fix][broker] Avoid blocking the PulsarAdmin callback thread on the post-unload load-report write#26061

Open
merlimat wants to merge 1 commit into
apache:masterfrom
merlimat:mmerli/fix-clusters-migration-write-blocking
Open

[fix][broker] Avoid blocking the PulsarAdmin callback thread on the post-unload load-report write#26061
merlimat wants to merge 1 commit into
apache:masterfrom
merlimat:mmerli/fix-clusters-migration-write-blocking

Conversation

@merlimat

Copy link
Copy Markdown
Contributor

Motivation

When a namespace isolation policy changes, ClustersBase.filterAndUnloadMatchedNamespaceAsync unloads the affected namespaces and then writes the load report so the cluster rebalances quickly. That write ran inside FutureUtil.waitForAll(unloadAsync...).thenAccept(...), which executes on the thread that completes the adminClient.namespaces().unloadAsync(...) futures — a PulsarAdmin async-HTTP-client callback thread.

LoadManager.writeLoadReportOnZookeeper(true) blocks there: ModularLoadManagerImpl.writeBrokerDataOnZooKeeper does lock.lock() plus a metadata-store updateValue(...).join(). So a brief metadata write blocked the admin-client callback thread.

Modifications

Run the load-report write via thenAcceptAsync(..., pulsar().getExecutor()) so it executes on the broker executor (where short blocking metadata writes are routine) instead of the admin-client callback thread. The REST response still waits for the write (the chained future completes after it); the write remains best-effort (failure is logged and swallowed, as before).

Verifying this change

Covered by AdminApi2Test.testIsolationPolicyUnloadsNSWith* (AllScope, ChangedScope1/2, NoneScope, PrimaryChanged, ScopeMissing; persistent + non-persistent) — 12/12 pass.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

…ost-unload load-report write

When a namespace isolation policy changes, ClustersBase
.filterAndUnloadMatchedNamespaceAsync unloads the affected namespaces and then
writes the load report so the cluster rebalances quickly. That write ran inside
FutureUtil.waitForAll(unloadAsync...).thenAccept(...), which executes on the
thread that completes the adminClient.namespaces().unloadAsync(...) futures -- a
PulsarAdmin async-HTTP-client callback thread. LoadManager
.writeLoadReportOnZookeeper(true) blocks there: ModularLoadManagerImpl
.writeBrokerDataOnZooKeeper does lock.lock() plus a metadata-store
updateValue(...).join().

Run the load-report write via thenAcceptAsync(..., pulsar().getExecutor()) so it
executes on the broker executor (where short blocking metadata writes are
routine) instead of the admin-client callback thread. The REST response still
waits for the write (the chained future completes after it); the write remains
best-effort (failure logged and swallowed, as before).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants