Configurable Retry Status Codes and Max Retry Count until Node Failover - RestClient #1212

@sebastian-mue

Description

Affected versions: 8.x
Related issue: #25578 — RestClient: allow to customize retry logic

Problem Description

The low-level RestClient uses RestClient#isRetryStatus(int) to decide whether a response is retryable, and therefore whether to rotate to another node. That method is hardcoded to treat only 502, 503, and 504 as retryable.
As a consequence, other HTTP status codes, for instance 429 (Too Many Requests), are never treated as retryable, even though there are scenarios in which failing over to a different node is clearly required.
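
To make the current behavior concrete, here is a minimal sketch of the check as described above. It is an illustration and not a copy of the client source; the class name HardcodedRetryCheck is hypothetical, and only the three status codes come from this issue.

```java
// Illustrative sketch only: mirrors the behavior described in this issue,
// not the actual RestClient source. The class name is hypothetical.
final class HardcodedRetryCheck {

    // Only 502, 503, and 504 are considered retryable; everything else,
    // including 429, is returned to the caller without trying another node.
    static boolean isRetryStatus(int statusCode) {
        switch (statusCode) {
            case 502:
            case 503:
            case 504:
                return true;
            default:
                return false;
        }
    }
}
```

With a check like this, a node that answers every request with 429 never triggers rotation, because isRetryStatus(429) evaluates to false.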

Impact of the Current Behavior

A single overloaded or throttling node effectively poisons every request routed to it, even when the cluster as a whole has ample healthy capacity:

  • One ES node returning 4xx or 5xx statuses other than 502, 503, and 504 can cause a large number of Elasticsearch requests to fail, because the client keeps routing to it and never tries a different ES node for the same request. End users see errors despite the cluster being nominally healthy.
  • The FailureListener and dead-node tracking are only triggered for the three hardcoded statuses, so a node consistently returning 429 is never marked dead and never taken out of rotation.
  • The Elasticsearch Sniffer does not help. The issue is not which nodes exist, but which status codes justify rotating between them.
  • There is no extension point. Applications must either wrap every call in their own retry logic (defeating the purpose of the client's built-in node-rotation machinery), use an HttpResponseInterceptor to fake HTTP response codes, or fork the client.

Real-world example we experienced: In a production system with 32 Elasticsearch worker nodes, we encountered an incident where the node the ElasticsearchClient was routing requests to consistently responded with HTTP 429. Because 429 is not treated as retryable, no failover to any of the healthy nodes took place, and data in Elasticsearch could not be retrieved for hours despite 31 other nodes being available to serve the requests.

Proposed Solution

Make the retry policy configurable.

  1. Configurable retry status codes. A builder method to supply a set of retryable HTTP status codes, defaulting to the current {502, 503, 504}. Users who need 429 (or 408, etc.) treated as retryable can add them explicitly.
  2. Configurable maximum retry count per node. A builder method to configure how many failing requests (HTTP status codes other than 1xx, 2xx, and 3xx) against a node are allowed before the ElasticsearchClient fails over to a different ES node.
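
The two options above could be modeled roughly as follows. This is a sketch under assumptions, not a proposed API: the class RetryPolicySketch and its methods are hypothetical names, and the way the two settings interact (a configured retry status fails over immediately, any other 4xx/5xx response counts toward the per-node limit) is only one possible interpretation of the proposal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a configurable retry policy; none of these names
// exist in the RestClient today.
final class RetryPolicySketch {
    private final Set<Integer> retryStatusCodes;
    private final int maxFailuresPerNode;
    // Consecutive failing responses (status >= 400) seen per node.
    private final Map<String, Integer> failuresByNode = new HashMap<>();

    RetryPolicySketch(Set<Integer> retryStatusCodes, int maxFailuresPerNode) {
        if (maxFailuresPerNode < 1) {
            throw new IllegalArgumentException("maxFailuresPerNode must be >= 1");
        }
        this.retryStatusCodes = Set.copyOf(retryStatusCodes);
        this.maxFailuresPerNode = maxFailuresPerNode;
    }

    // Defaults matching the current hardcoded set {502, 503, 504}.
    static RetryPolicySketch defaults() {
        return new RetryPolicySketch(Set.of(502, 503, 504), 1);
    }

    boolean isRetryStatus(int statusCode) {
        return retryStatusCodes.contains(statusCode);
    }

    // Records a response from the given node and reports whether the
    // client should fail over to a different node.
    boolean shouldFailOver(String node, int statusCode) {
        if (statusCode < 400) {
            failuresByNode.remove(node); // a 1xx/2xx/3xx response resets the count
            return false;
        }
        if (isRetryStatus(statusCode)) {
            return true; // configured retry statuses rotate immediately
        }
        int failures = failuresByNode.merge(node, 1, Integer::sum);
        if (failures >= maxFailuresPerNode) {
            failuresByNode.remove(node); // start fresh if the node is used again
            return true;
        }
        return false;
    }
}
```

For the incident described above, a policy created with new RetryPolicySketch(Set.of(502, 503, 504, 429), 3) would report a failover on the first 429 from a node, while the defaults keep today's behavior for the three existing codes.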
