RedisCluster: Per-node health tracking and non-blocking initialize() by ngabhanenetskope · Pull Request #4083 · redis/redis-py

ngabhanenetskope · 2026-05-27T13:53:01Z

Summary

Adds two mechanisms to improve RedisCluster resilience during transient node failures under high concurrency:

Per-node health tracking (NodeHealthManager) — After N consecutive TimeoutError/ConnectionError on a node (default: 5), subsequent requests to that node fail immediately instead of blocking for socket_timeout. After recovery_time (default: 10s), one probe request is allowed through (half-open pattern). On success, the node is marked available again.
Non-blocking initialize() — Changed from blocking lock acquisition to try-lock. If another thread is already refreshing topology, additional threads return immediately using stale (but functional for healthy nodes) topology instead of queuing on _initialization_lock.

Motivation

Under high concurrency (e.g., gevent with 250 greenlets, max_connections_per_node=50, socket_timeout=5s), a single node failure causes:

50 greenlets block on the dead node for 5s each (pool saturates instantly)
All 50 timeout simultaneously and call initialize() (thundering herd)
All queue on _initialization_lock, serializing 50 redundant CLUSTER SLOTS calls
Health check can't respond → readiness probe fails → pod dies

With this fix:

Health tracking: greenlet RFE: Sharding #6 fails instantly (0ms) instead of blocking 5s → pool never saturates beyond 5 connections
Non-blocking initialize: only the first thread refreshes topology; others continue serving requests to healthy nodes

See #4074 for full analysis.

Configuration

rc = RedisCluster(
    startup_nodes=[...],
    node_health_failure_threshold=5,   # failures before marking unavailable
    node_health_recovery_time=10.0,    # seconds before allowing probe
)

Behavior

Scenario	Before	After
6th request to dead node	Blocks for socket_timeout (5s)	Instant ConnectionError (0ms)
50 threads calling initialize()	All queue, 50 serial CLUSTER SLOTS calls	1 refreshes, 49 return immediately
Node recovers after failover	Must wait for blocked threads to timeout	Probe succeeds after recovery_time → instant recovery

Tests

TestNodeHealthManager — 7 tests covering threshold, recovery, per-node isolation, and cluster integration

Fixes RedisCluster: Transient node failure causes thundering herd on initialize(), leading to pool exhaustion under high concurrency #4074
Complements PR Randomize cluster startup node order during topology refresh #4060 (randomize nodes in initialize)
Follow-up to PR Fix ConnectionPool to raise MaxConnectionsError instead of Connection… #3698 (MaxConnectionsError distinction)

Note

Medium Risk
Changes cluster command routing and topology refresh concurrency under failure; mis-tuned thresholds could mark healthy nodes unavailable or delay slot updates, but behavior is configurable and scoped to connection failures.

Overview
Adds per-node health tracking to RedisCluster so repeated TimeoutError/ConnectionError on one node (configurable threshold, default 5) marks it unavailable and later commands raise NodeUnavailableError without opening a connection—avoiding pool saturation on a dead peer. Recovery uses a half-open probe after node_health_recovery_time (default 10s); success clears state, failed probe resets the timer.

NodesManager.initialize() now uses a non-blocking _initialization_lock: concurrent refresh callers skip redundant CLUSTER SLOTS work and keep using the current topology until the in-flight refresh finishes.

New public pieces: NodeUnavailableError, NodeHealthManager, and constructor kwargs node_health_failure_threshold / node_health_recovery_time. Tests cover the health manager and fail-fast behavior in _execute_command.

^{Reviewed by Cursor Bugbot for commit b998891. Bugbot is set up for automated code reviews on this repo. Configure here.}

jit-ci · 2026-05-27T13:53:11Z

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

petyaslavova · 2026-05-29T07:34:28Z

Hi @ngabhanenetskope, thank you for your contribution! I'll take a look at it soon.

…luster Two improvements to RedisCluster's resilience during node failures: 1. Per-node health tracking (NodeHealthManager): After N consecutive TimeoutError/ConnectionError on a node (default: 5), the node is marked unavailable. Subsequent requests to that node fail immediately with NodeUnavailableError instead of blocking for socket_timeout seconds. Recovery uses a proper half-open pattern: after recovery_time (default: 10s), exactly ONE probe request is allowed through. If it succeeds, the node is fully recovered. If it fails, the timer resets for another recovery_time window. Concurrent requests remain blocked during the probe to prevent re-flooding a potentially-still-dead node. NodeUnavailableError is a subclass of ConnectionError but is caught separately in _execute_command to avoid triggering pool disconnect and topology reinitialize (since no network I/O was attempted). 2. Non-blocking initialize(): Changed from blocking lock acquisition to try-lock pattern. If another thread is already refreshing the cluster topology, additional threads return immediately instead of queuing on the lock. They use the stale (but functional for healthy nodes) topology until the refresh completes. This eliminates the thundering herd on initialize(): previously 50 threads timing out simultaneously would all queue on _initialization_lock, serializing 50 redundant CLUSTER SLOTS calls. Now only the first thread refreshes; others continue serving requests to healthy nodes. Configuration (RedisCluster constructor): - node_health_failure_threshold: int = 5 - node_health_recovery_time: float = 10.0 Fixes redis#4074

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 143ef26. Configure here.}

ngabhanenetskope · 2026-06-10T05:08:37Z

Hi @ngabhanenetskope, thank you for your contribution! I'll take a look at it soon.

@petyaslavova
Did you get a chance to take a look!

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread redis/cluster.py

Comment thread redis/cluster.py

ngabhanenetskope mentioned this pull request May 27, 2026

RedisCluster: Transient node failure causes thundering herd on initialize(), leading to pool exhaustion under high concurrency #4074

Open

ngabhanenetskope force-pushed the fix/pool-acquire-timeout branch 2 times, most recently from 0115eb7 to 20d661c Compare June 1, 2026 12:57

cursor Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread redis/cluster.py

ngabhanenetskope force-pushed the fix/pool-acquire-timeout branch from 20d661c to 143ef26 Compare June 2, 2026 03:59

cursor Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread redis/cluster.py

Merge branch 'master' into fix/pool-acquire-timeout

b998891

rlehman221 mentioned this pull request Jun 8, 2026

Async RedisCluster: ClusterNode.execute_command leaks a pool connection on every write-side error (send_packed_command outside try/finally) -> permanent MaxConnectionsError #4110

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RedisCluster: Per-node health tracking and non-blocking initialize()#4083

RedisCluster: Per-node health tracking and non-blocking initialize()#4083
ngabhanenetskope wants to merge 2 commits into
redis:masterfrom
ngabhanenetskope:fix/pool-acquire-timeout

ngabhanenetskope commented May 27, 2026 •

edited by cursor Bot

Loading

Uh oh!

jit-ci Bot commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

petyaslavova commented May 29, 2026

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

ngabhanenetskope commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ngabhanenetskope commented May 27, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Configuration

Behavior

Tests

Related

Uh oh!

jit-ci Bot commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

petyaslavova commented May 29, 2026

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ngabhanenetskope commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ngabhanenetskope commented May 27, 2026 •

edited by cursor Bot

Loading