[fix][client] Run the failover health probe off the Netty event-loop thread#26064

Open
merlimat wants to merge 2 commits into apache:master from merlimat:mmerli/fix-failover-probe-eventloop-blocking

@merlimat

Motivation

SameAuthParamsLookupAutoClusterFailover periodically probes broker health with a blocking getLookup(url).getBroker(...).get(3, SECONDS). The periodic task was scheduled on a Netty EventLoopGroup (EventLoopUtil.newEventLoopGroup(1, ...)), so the blocking probe ran on a Netty event-loop thread.

Modifications

Use a plain single-thread ScheduledExecutorService (Executors.newSingleThreadScheduledExecutor) for the periodic health check — matching the sibling AutoClusterFailover — so the blocking probe no longer occupies a Netty event-loop thread. scheduleAtFixedRate and shutdownNow are unchanged, and the executor is dedicated solely to this check.
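As a rough illustration of the pattern described above, the sketch below runs a blocking, time-bounded probe on a dedicated single-thread `ScheduledExecutorService`. It is not the Pulsar code: `DedicatedProbeExecutor`, `lookupAsync`, the thread name, and the URL are hypothetical stand-ins, and only the JDK executor calls are real.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class DedicatedProbeExecutor {

    // Hypothetical stand-in for an asynchronous broker lookup (not the Pulsar API).
    static CompletableFuture<String> lookupAsync(String url) {
        return CompletableFuture.supplyAsync(() -> "broker-for-" + url);
    }

    // Blocking probe: waits up to 3 seconds and reports healthy only on success.
    static boolean probe(String url) {
        try {
            lookupAsync(url).get(3, TimeUnit.SECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for shutdownNow
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Plain JDK scheduler dedicated to the health check, so the blocking
        // get(...) never runs on a Netty event-loop thread.
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "failover-health-check"); // hypothetical thread name
            t.setDaemon(true);
            return t;
        });

        AtomicInteger healthy = new AtomicInteger();
        executor.scheduleAtFixedRate(() -> {
            if (probe("pulsar://example:6650")) { // hypothetical service URL
                healthy.incrementAndGet();
            }
        }, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(350);
        executor.shutdownNow();
        System.out.println("healthy probes: " + healthy.get());
    }
}
```

Because the executor has a single thread and is used for nothing else, a probe that blocks for its full timeout delays only later probes, not unrelated I/O.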

Verifying this change

Covered by the existing SameAuthParamsLookupAutoClusterFailoverTest (4 tests), which passes.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

merlimat added 2 commits June 18, 2026 14:53
…thread

SameAuthParamsLookupAutoClusterFailover periodically probes broker health with a
blocking getLookup(url).getBroker(...).get(3, SECONDS). The periodic task was
scheduled on a Netty EventLoopGroup (EventLoopUtil.newEventLoopGroup(1, ...)), so
the blocking probe ran on an event-loop thread.

Use a plain single-thread ScheduledExecutorService
(Executors.newSingleThreadScheduledExecutor) for the periodic health check,
matching the sibling AutoClusterFailover, so the blocking probe no longer
occupies a Netty event-loop thread. scheduleAtFixedRate and shutdownNow are
unchanged; the executor is dedicated solely to this check.
… fix its broker test

Follow-up to the executor change in this PR, fixing the CI failure in
org.apache.pulsar.broker.SameAuthParamsLookupAutoClusterFailoverTest.

1. The broker integration test reads the private 'executor' field via reflection
   and typed it as io.netty.channel.EventLoopGroup; the field is now a
   ScheduledExecutorService, so the reflective cast threw ClassCastException.
   Update the three type references in the test to ScheduledExecutorService
   (the test only uses execute/submit).

2. The test schedules the health check every 100ms while one service (a dead dummy
   proxy) blocks its probe for ~3s. On a plain single-threaded
   ScheduledExecutorService, scheduleAtFixedRate runs such slow checks back-to-back
   (catch-up) and monopolizes the thread, starving the task the test submits to the
   executor (a Netty EventLoopGroup interleaves immediate tasks, which is why it
   passed before). Use scheduleWithFixedDelay so a gap is left after each check;
   this is also better for a blocking health probe, which fixed-rate would
   otherwise issue continuously while a service is down.
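The scheduling difference described in point 2 can be reproduced with the JDK alone. The sketch below is a minimal demonstration under assumed timings (a 50 ms "check" scheduled every 10 ms, rather than the test's ~3 s probe every 100 ms); `FixedRateVsFixedDelay` and `submittedTaskLatencyMillis` are hypothetical names, not part of the PR. It measures how long a task handed to `execute` waits behind the periodic check in each mode.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedRateVsFixedDelay {

    // Returns how long (ms) an immediately submitted task waits behind a slow periodic check.
    static long submittedTaskLatencyMillis(boolean fixedDelay) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            // Simulated slow health check: takes 50 ms, far longer than the 10 ms period.
            Runnable slowCheck = () -> {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            if (fixedDelay) {
                // Next run is scheduled 10 ms after the previous run completes.
                executor.scheduleWithFixedDelay(slowCheck, 0, 10, TimeUnit.MILLISECONDS);
            } else {
                // Next run is scheduled 10 ms after the previous scheduled start,
                // so overdue runs pile up and execute back-to-back.
                executor.scheduleAtFixedRate(slowCheck, 0, 10, TimeUnit.MILLISECONDS);
            }

            Thread.sleep(200); // let the periodic check fall behind schedule

            CountDownLatch ran = new CountDownLatch(1);
            long start = System.nanoTime();
            executor.execute(ran::countDown);
            ran.await(10, TimeUnit.SECONDS); // bounded wait so the demo cannot hang
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("fixed-rate latency (ms):  " + submittedTaskLatencyMillis(false));
        System.out.println("fixed-delay latency (ms): " + submittedTaskLatencyMillis(true));
    }
}
```

With fixed rate, the overdue periodic runs carry earlier trigger times than the newly submitted task, so the submitted task should wait for many checks; with fixed delay it should run right after the check in progress. Exact numbers depend on the machine.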

2 participants