RedisCluster: Per-node health tracking and non-blocking initialize()#4083
Open
ngabhanenetskope wants to merge 2 commits into
Open
RedisCluster: Per-node health tracking and non-blocking initialize()#4083ngabhanenetskope wants to merge 2 commits into
ngabhanenetskope wants to merge 2 commits into
Conversation
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
Collaborator
|
Hi @ngabhanenetskope, thank you for your contribution! I'll take a look at it soon. |
0115eb7 to
20d661c
Compare
…luster Two improvements to RedisCluster's resilience during node failures: 1. Per-node health tracking (NodeHealthManager): After N consecutive TimeoutError/ConnectionError on a node (default: 5), the node is marked unavailable. Subsequent requests to that node fail immediately with NodeUnavailableError instead of blocking for socket_timeout seconds. Recovery uses a proper half-open pattern: after recovery_time (default: 10s), exactly ONE probe request is allowed through. If it succeeds, the node is fully recovered. If it fails, the timer resets for another recovery_time window. Concurrent requests remain blocked during the probe to prevent re-flooding a potentially-still-dead node. NodeUnavailableError is a subclass of ConnectionError but is caught separately in _execute_command to avoid triggering pool disconnect and topology reinitialize (since no network I/O was attempted). 2. Non-blocking initialize(): Changed from blocking lock acquisition to try-lock pattern. If another thread is already refreshing the cluster topology, additional threads return immediately instead of queuing on the lock. They use the stale (but functional for healthy nodes) topology until the refresh completes. This eliminates the thundering herd on initialize(): previously 50 threads timing out simultaneously would all queue on _initialization_lock, serializing 50 redundant CLUSTER SLOTS calls. Now only the first thread refreshes; others continue serving requests to healthy nodes. Configuration (RedisCluster constructor): - node_health_failure_threshold: int = 5 - node_health_recovery_time: float = 10.0 Fixes redis#4074
20d661c to
143ef26
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 143ef26. Configure here.
Contributor
Author
@petyaslavova |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Adds two mechanisms to improve RedisCluster resilience during transient node failures under high concurrency:
Per-node health tracking (
NodeHealthManager) — After N consecutiveTimeoutError/ConnectionErroron a node (default: 5), subsequent requests to that node fail immediately instead of blocking forsocket_timeout. Afterrecovery_time(default: 10s), one probe request is allowed through (half-open pattern). On success, the node is marked available again.Non-blocking
initialize()— Changed from blocking lock acquisition to try-lock. If another thread is already refreshing topology, additional threads return immediately using stale (but functional for healthy nodes) topology instead of queuing on_initialization_lock.Motivation
Under high concurrency (e.g., gevent with 250 greenlets,
max_connections_per_node=50,socket_timeout=5s), a single node failure causes:initialize()(thundering herd)_initialization_lock, serializing 50 redundant CLUSTER SLOTS callsWith this fix:
See #4074 for full analysis.
Configuration
Behavior
Tests
TestNodeHealthManager— 7 tests covering threshold, recovery, per-node isolation, and cluster integrationRelated
Note
Medium Risk
Changes cluster command routing and topology refresh concurrency under failure; mis-tuned thresholds could mark healthy nodes unavailable or delay slot updates, but behavior is configurable and scoped to connection failures.
Overview
Adds per-node health tracking to
RedisClusterso repeatedTimeoutError/ConnectionErroron one node (configurable threshold, default 5) marks it unavailable and later commands raiseNodeUnavailableErrorwithout opening a connection—avoiding pool saturation on a dead peer. Recovery uses a half-open probe afternode_health_recovery_time(default 10s); success clears state, failed probe resets the timer.NodesManager.initialize()now uses a non-blocking_initialization_lock: concurrent refresh callers skip redundantCLUSTER SLOTSwork and keep using the current topology until the in-flight refresh finishes.New public pieces:
NodeUnavailableError,NodeHealthManager, and constructor kwargsnode_health_failure_threshold/node_health_recovery_time. Tests cover the health manager and fail-fast behavior in_execute_command.Reviewed by Cursor Bugbot for commit b998891. Bugbot is set up for automated code reviews on this repo. Configure here.