Configurable Retry Status Codes and Max Retry Count until Node Failover - RestClient #1212

### Description

**Affected versions:** 8.x
**Related issue:** [#25578 — RestClient: allow to customize retry logic](https://github.com/elastic/elasticsearch/issues/25578)

### Problem Description
The low-level RestClient decides whether a response is retryable, and therefore whether to rotate to another node, in the method RestClient#isRetryStatus(int). That method is hardcoded to treat only 502, 503, and 504 as retryable.
As a result, other HTTP status codes such as 429 (Too Many Requests) are never retried, even though there are scenarios in which failing over to a different node is clearly required.
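
To make the behavior concrete, here is a minimal standalone sketch of the check as described above. It is an illustration written for this issue, not a copy of the client's source; the class name `RetryStatusSketch` is made up.

```java
public class RetryStatusSketch {

    // Illustrative re-creation of the hardcoded check described in this issue.
    // Only 502, 503 and 504 count as retryable; everything else does not.
    static boolean isRetryStatus(int statusCode) {
        switch (statusCode) {
            case 502:
            case 503:
            case 504:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRetryStatus(503)); // prints "true"
        System.out.println(isRetryStatus(429)); // prints "false"
    }
}
```

With a check of this shape, a 429 response is handed straight back to the caller and no other node is tried.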

### Impact of the Current Behavior
A single overloaded or throttling node effectively poisons every request routed to it, even when the cluster as a whole has ample healthy capacity:

- One ES node returning a 4xx or 5xx status other than 502, 503, or 504 can cause a large number of Elasticsearch requests to fail, because the client keeps routing to that node and never tries a different ES node for the same request. End users see errors even though the cluster is nominally healthy.
- The FailureListener and dead-node tracking are only triggered for the three hardcoded statuses, so a node consistently returning 429 is never marked dead and never taken out of rotation.
- The Elasticsearch Sniffer does not help. The issue is not which nodes exist, but which status codes justify rotating between them.
- There is no extension point. Applications must wrap every call in their own retry logic (defeating the purpose of the client's built-in node-rotation machinery), use an HttpResponseInterceptor to fake HTTP response codes, or fork the client.
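
For illustration, the "wrap every call in your own retry logic" workaround roughly amounts to the sketch below. It is deliberately independent of the Elasticsearch client: nodes are plain strings, and `sendRequest` is a hypothetical stand-in for whatever performs the HTTP call and returns the status code. All names here are assumptions made for this example.

```java
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

public class ManualFailoverSketch {

    // Status codes the application (not the client) chooses to treat as retryable.
    static final Set<Integer> RETRYABLE = Set.of(429, 502, 503, 504);

    /**
     * Sends the request to each node in order and returns the first status
     * that is not retryable. If every node answers with a retryable status,
     * the last status seen is returned.
     */
    static int performWithFailover(List<String> nodes, ToIntFunction<String> sendRequest) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        int lastStatus = -1;
        for (String node : nodes) {
            lastStatus = sendRequest.applyAsInt(node);
            if (!RETRYABLE.contains(lastStatus)) {
                return lastStatus;
            }
        }
        return lastStatus;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: the first node throttles, the second is healthy.
        List<String> nodes = List.of("node-1", "node-2");
        int status = performWithFailover(nodes, node -> node.equals("node-1") ? 429 : 200);
        System.out.println(status); // prints "200"
    }
}
```

Every application has to re-implement this loop, and it duplicates node selection that the client already performs internally.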

**Real-world example we experienced**: In a production system with 32 Elasticsearch worker nodes, we encountered an incident where the node the ElasticsearchClient was routing requests to consistently responded with HTTP 429. Because 429 is not treated as retryable, no failover to any of the healthy nodes took place, and data in Elasticsearch could not be retrieved for hours despite 31 other nodes being available to serve the requests.

### Proposed Solution
Make the retry policy configurable.

1. Configurable retry status codes. A builder method to supply a set of retryable HTTP status codes, defaulting to the current {502, 503, 504}. Users who need 429 (or 408, etc.) treated as retryable can add them explicitly.
2. Configurable maximum retry count per node. A builder method to configure how many failing requests (HTTP status codes outside 1xx, 2xx, and 3xx) against a node are allowed before the ElasticsearchClient fails over to a different ES node.
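
A rough sketch of what such a policy object could look like follows. Nothing in it exists in the client today: the class `RetryPolicySketch`, the builder methods `retryStatusCodes` and `maxFailuresPerNode`, the default of 1 failure, and the choice to count consecutive failures are all assumptions made to illustrate the two points above.

```java
import java.util.Set;

public final class RetryPolicySketch {

    private final Set<Integer> retryStatusCodes;
    private final int maxFailuresPerNode;

    private RetryPolicySketch(Builder builder) {
        this.retryStatusCodes = builder.retryStatusCodes;
        this.maxFailuresPerNode = builder.maxFailuresPerNode;
    }

    public static Builder builder() {
        return new Builder();
    }

    /** Point 1: the set of retryable codes is configurable instead of hardcoded. */
    public boolean isRetryStatus(int statusCode) {
        return retryStatusCodes.contains(statusCode);
    }

    /** A response counts as failing when its status is outside 1xx, 2xx and 3xx. */
    public static boolean isFailure(int statusCode) {
        return statusCode >= 400;
    }

    /** Point 2: fail over once a node has produced the configured number of failures. */
    public boolean shouldFailover(int consecutiveFailures) {
        return consecutiveFailures >= maxFailuresPerNode;
    }

    public static final class Builder {
        // Default matches the current behavior described in this issue.
        private Set<Integer> retryStatusCodes = Set.of(502, 503, 504);
        // Assumed default: fail over after the first failing request.
        private int maxFailuresPerNode = 1;

        public Builder retryStatusCodes(Set<Integer> codes) {
            this.retryStatusCodes = Set.copyOf(codes);
            return this;
        }

        public Builder maxFailuresPerNode(int maxFailures) {
            if (maxFailures < 1) {
                throw new IllegalArgumentException("maxFailuresPerNode must be >= 1");
            }
            this.maxFailuresPerNode = maxFailures;
            return this;
        }

        public RetryPolicySketch build() {
            return new RetryPolicySketch(this);
        }
    }

    public static void main(String[] args) {
        RetryPolicySketch policy = RetryPolicySketch.builder()
                .retryStatusCodes(Set.of(429, 502, 503, 504))
                .maxFailuresPerNode(3)
                .build();
        System.out.println(policy.isRetryStatus(429)); // prints "true"
        System.out.println(policy.shouldFailover(2));  // prints "false"
        System.out.println(policy.shouldFailover(3));  // prints "true"
    }
}
```

Keeping {502, 503, 504} as the default would leave existing users unaffected, while users who need 429 or 408 could opt in explicitly.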
