140 changes: 50 additions & 90 deletions source/client-backpressure/client-backpressure.md
@@ -122,7 +122,8 @@ rules:
2. A retry attempt will only be permitted if:
1. The error is a retryable overload error.
2. We have not reached `MAX_RETRIES`.
- The value of `MAX_RETRIES` is 5 and non-configurable.
- The default value of `MAX_RETRIES` is 2. Drivers MUST expose `maxAdaptiveRetries` as a configurable option for
this maximum.
Comment on lines +125 to +126
Member
Given the no knobs mantra, we need a very good reason to add maxAdaptiveRetries.

The specification gives no justification for the maxAdaptiveRetries knob and its default value. I think the justification should be added. The only justifications I found are in Decision: Driver Backpressure Retry Behavior Defaults[1]:

allows users to disable adaptive retries entirely by setting the maximum number of retries to 0
...
Configurable retries affect the amount of load offered to the system for a given operation.

  1. The aforementioned mantra says "if a typical user would have no idea how to choose a correct value, we pick a good default instead of adding a knob". How will a typical user choose a value better than our default? If the answer is that they will have to do the same analysis we did (here and here) to either choose a better value, or decide that the default is fine, then we should not add the knob.
    • If it is the disabling/enabling of adaptive retries that we mostly care about, because we are afraid that such retries will be harmful and applications will have to disable them, then maybe we should not introduce them at all.
  2. Maybe we should apply the same approach that is applied to the token bucket idea: do not do it now, gather actual data, then analyze and decide? (When I was suggesting it in the summer of 2025, I was the weirdo, but here we are.) If we introduce the option now, it may end up being a useless option that pollutes the API and confuses users.

[1] If the document has other justifications and I missed them - I apologize, I haven't read it fully. If there are other justifications and you know of them, could you please state them explicitly?

Contributor Author

We know from our modelling and from the underpinning theory that the n-retries approach (retry up to N times on overload errors, no token bucket) can introduce retry storms as overload increases. However, the specifics of the workload and the cluster serving that workload significantly impact the threshold at which retry volume becomes an additional burden rather than a throughput improvement. Some applications and clusters may be very tolerant of many additional retries, while others may want to break out of the loop much earlier.

The selection of 2 as a default attempts to broadly pick a sensible default for most users that will on average be a benefit rather than a negative during overload. However, savvy users (the users we expect to be most affected by overload and to have the most insight into the specifics of their workload and cluster) will likely find that tweaking this value on a per-workload basis produces better results. Additionally, there are situations where disabling overload retries entirely is optimal. Without a knob, those situations will cause users to either have a strictly worse experience with a new driver, or force them to downgrade to an older driver to avoid the issue. These are two strong motivations to add a knob for maxAdaptiveRetries.

Contributor

Let's not lose these motivations in the PR. Can you add to the Q&A?

Member (@stIncMale, Apr 4, 2026)

savvy users, the users we expect to be most affected by overload

  • Could you please clarify why savvy users are expected to be most affected by overload?
  • What do you mean by "affected by overload"? For example, do you mean their servers are expected to be overloaded more severely / for longer intervals?

savvy users...will likely find that tweaking this value on a per-workload basis produces better results

You explicitly say that the proposed option is for "savvy users", who you separate from "most users". We discussed this proposed API change within the Java driver team, and this seems as clear a case for applying the aforementioned no knobs mantra as there can be: it explicitly instructs not to add a knob if a typical user would have no idea how to choose a correct value. If the mantra is not applicable in this case, then in what case could it possibly be applicable?

Granted, some options may be needed even if their usage requires a user to be advanced, like readConcernLevel, because the functionality they expose is inherently advanced. But the proposed option is not one of those, as it is about performance tuning.

Additionally, there are situations where disabling overload retries entirely is optimal.

Could you please elaborate on what those situations are? Please keep in mind that an application can't change its decision about retries caused by retryable overload errors at runtime. An application will have to either choose to always have them disabled, or always have them enabled. The benefit has to be quite significant to justify us introducing a performance knob. But even then, I would argue that the decision about such a performance knob should be based on real data and/or explicit user needs.

TL;DR

  1. The proposed maxAdaptiveRetries option clearly violates the no knobs mantra.
  2. The proposed option is for performance tuning. Similarly to the idea of token buckets, which previously was claimed to be very much needed but is no longer considered as such, it has to be based on real data and/or explicit user needs, not guesses.

Contributor Author

Because savvy (i.e. power) users are likely running more complex and heavier workloads, and are therefore at a higher risk of causing overload on their servers.

Both of the proposed knobs are easy to choose correct values for once a user observes the behavior with the defaults.

Neither maxAdaptiveRetries nor enableOverloadRetargeting is purely for performance tuning.

maxAdaptiveRetries is really configuring the length of the retry window during periods of overload. This window size changes what a given workload produces during overload. Our simulations show that higher maximum retry counts reduce error rates at the cost of higher latency and additional server load. The specifics of a user's application dictate where the threshold for cutting off retries is. Any default value we choose at this point will produce poor results for some users: without additional data gathered after the release of this feature, we are guessing.

enableOverloadRetargeting is similarly both a behavioral and performance configuration. For example, take a three-node replica set with a primary and two secondaries where all operations are performed with primaryPreferred. When overload occurs on the primary without overload retargeting enabled, all retries will go to the primary. There is no chance of stale data, but error rates will be higher, and the retries will cause additional load on the primary.

When overload occurs on the primary with overload retargeting enabled, retries will go to a secondary. This increases the risk of stale data, but improves the capacity of the system significantly while also allowing the primary to recover from overload more quickly. However, this is meaningfully different from the current behavior of primaryPreferred, so enabling it by default risks breaking user applications.

One situation where users may want to disable overload retries entirely is a non-critical workload that runs against a consistently overloaded cluster also used by critical workloads. Overload retries on the non-critical workload would worsen the performance of the critical workloads and lengthen the periods of overload, so disabling them is likely optimal.

There is a separate project following this release to gather retry attempt telemetry. This will enable us to determine the best follow-ups for improving backpressure based on real data rather than guessing. These knobs also make it possible to see if other defaults would work better in real-world situations.
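The deprioritization behavior in the primaryPreferred example above can be sketched as follows. This is an illustrative sketch only, not a driver API: the server names and the `select_server` helper are assumptions made for the example.

```python
# Illustrative sketch of overload retargeting: with the option enabled, the
# overloaded primary's address is deprioritized for the retry, so a
# primaryPreferred retry can land on a secondary. Not a real driver API.
def select_server(servers: list[str], deprioritized: list[str]) -> str:
    """Prefer servers that are not deprioritized; fall back if none remain."""
    preferred = [s for s in servers if s not in deprioritized]
    return (preferred or servers)[0]

servers = ["primary:27017", "secondary-a:27017", "secondary-b:27017"]
first_try = select_server(servers, [])             # targets the primary
retry = select_server(servers, ["primary:27017"])  # retargets to a secondary
```

Without retargeting, `deprioritized` stays empty and both attempts would target the primary.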

Contributor Author

To clarify, the decision to include these knobs has already been made at the initiative level (me, Matt, Patrick, and the distinguished engineers). Your concerns are completely valid and appreciated: similar questions were raised during the decision-making process, with answers summarized in the added FAQ sections. If those sections (or others associated with the new knobs) are missing important context or aren't as clear as they need to be, I'm happy to work with you to ensure they're fully captured in both the spec and the user documentation.

Member (@stIncMale, Apr 7, 2026)

Q:

Could you please clarify why savvy users are expected to be most affected by overload?

A:

Because savvy (i.e. power) users are likely doing more complex and heavy workloads with more load, and are therefore at a higher risk of causing overload on their servers.

Are you saying that savvy users are more likely to build a system that ends up being overloaded in production? This must mean they are more likely to incorrectly estimate the load / incorrectly choose hardware / incorrectly design for the required load than the typical users. I fail to see how this explanation may work out.

Both of the proposed knobs are easy to choose correct values for once a user observes the behavior with the defaults.

This only confirms that one has to do elaborate testing under overload to choose the value.

Any default value we choose at this point will produce poor results for some users

The same could be said about read/write retries we had before the current project, yet there is no option for specifying the number of retry attempts.

One situation where users may want to disable overload retries entirely is a non-critical workload that runs against a consistently overloaded cluster also used by critical workloads.

If a deployment is consistently overloaded, then nothing may save the situation. The best we may hope for is that the mongod/mongos servers won't become unresponsive like they are doing now (it's impossible to prevent this completely; we may only hope to raise the bar of the request rate at which they become unresponsive). I don't believe that a reasonable user running production such that it is consistently overloaded is a thing, because it is almost equivalent to the system not working[1]. Where did this scenario come from?

P.S. Please, let's not mix the conversation about unrelated things in the same thread. I replied about enableOverloadRetargeting in the thread corresponding to that option.


[1] The response times grow exponentially as throughput (how much work per unit of time the system is doing currently) approaches the raw throughput (the maximum amount of work per unit of time the system is capable of doing), and, therefore, one should always strive to run a production system such that the throughput stays noticeably smaller than the raw throughput, which avoids overloading the system.

Member (@stIncMale, Apr 7, 2026)

To clarify, the decision to include these knobs has already been made

The decision about the specification requirements that this PR changes has also been made. Not only made, but the changes were approved as the specification. Yet here we are, changing them a few months after merging them.

It is always possible for my manager, @katcharov, or someone with a rank above him, to override and instruct me to back off. Until then, I am doing my job (reviewing the PR I was assigned to review) by expressing concerns and objections.

- This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within
the timeout to avoid retry storms.
3. (CSOT-only): There is still time for a retry attempt according to the
@@ -133,20 +134,17 @@ rules:
- To retry `runCommand`, both [retryWrites](../retryable-writes/retryable-writes.md#retrywrites) and
[retryReads](../retryable-reads/retryable-reads.md#retryreads) MUST be enabled. See
[Why must both `retryWrites` and `retryReads` be enabled to retry runCommand?](client-backpressure.md#why-must-both-retrywrites-and-retryreads-be-enabled-to-retry-runcommand)
3. If the request is eligible for retry (as outlined in step 2 above and step 4 in the
[adaptive retry requirements](client-backpressure.md#adaptive-retry-requirements) below), the client MUST apply
exponential backoff according to the following formula:
`backoff = jitter * min(MAX_BACKOFF, BASE_BACKOFF * 2^(attempt - 1))`
3. If the request is eligible for retry (as outlined in step 2 above), the client MUST apply exponential backoff
according to the following formula: `backoff = jitter * min(MAX_BACKOFF, BASE_BACKOFF * 2^(attempt - 1))`
- `jitter` is a random jitter value between 0 and 1.
- `BASE_BACKOFF` is constant 100ms.
- `MAX_BACKOFF` is 10000ms.
- This results in delays of 100ms, 200ms, 400ms, 800ms, and 1600ms before accounting for jitter.
4. If the request is eligible for retry (as outlined in step 2 above and step 4 in the
[adaptive retry requirements](client-backpressure.md#adaptive-retry-requirements) below), the client MUST add the
previously used server's address to the list of deprioritized server addresses for
[server selection](../server-selection/server-selection.md).
5. If the request is eligible for retry (as outlined in step 2 above and step 4 in the
[adaptive retry requirements](client-backpressure.md#adaptive-retry-requirements) below) and is a retryable write:
- This results in delays of 100ms and 200ms before accounting for jitter.
4. If the request is eligible for retry (as outlined in step 2 above) and `enableOverloadRetargeting` is enabled, the
client MUST add the previously used server's address to the list of deprioritized server addresses for
[server selection](../server-selection/server-selection.md). Drivers MUST expose `enableOverloadRetargeting` as a
configurable boolean option that defaults to `false`.
Comment on lines +143 to +146
Member (@stIncMale, Apr 2, 2026)

Given the no knobs mantra, we need a very good reason to add enableOverloadRetargeting.

The specification gives no justification for the enableOverloadRetargeting knob and its default value false. I think the justification should be added, if it is eventually agreed that one exists. The only justification I found is in Decision: Driver Backpressure Retry Behavior Defaults:

a)

A primaryPreferred readPreference currently only reads from a secondary node if the primary is unavailable. Making an overloaded primary count as “unavailable” changes this semantically, but is a net benefit for users unless they truly do not want any secondary reads unless the primary is actually down.

b)

The same is true for secondaryPreferred reads and primary routing. As a result, users must explicitly opt-in to this feature.

The justification does not seem correct to me, and I don't think we should add the enableOverloadRetargeting option.

a)

The alleged interpretation of primaryPreferred as a guarantee that no secondary read happens "unless the primary is actually down" does not seem to make sense.

  1. Formally, there is no such thing as "actually down". A driver may mark the primary as Unknown simply as a result of a temporary network issue, even as a result of a network timeout if it causes authentication to fail (see https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.md#why-mark-a-server-unknown-after-an-auth-error). When this happens, secondary reads may happen, but I am guessing that no one will say that the primary is "actually down".
    • The documentation of this read preference is: "In most situations, operations read from the primary member of the set. However, if the primary is unavailable, as is the case during failover situations, operations read from secondary members that satisfy...". Note how it neither defines "unavailable", nor limits the situations strictly to failover.
  2. Even if "actually down" were formally a thing (for example, one may define "actually down" as a situation when the driver's cluster view does not have RSPrimary), unless there is a way for an application to determine that a secondary read was executed while the primary was not being "actually down", it can't tell whether the alleged semantics of primaryPreferred is violated. And there is no such way, not even via the cluster/server listeners, because there is no specified order between the cluster/server events and the execution of operations. Therefore, this alleged semantics is meaningless.

Any application that uses primaryPreferred must be prepared to deal with secondary reads, and us deprioritizing the primary does not violate any formal guarantees of the primaryPreferred read preference.

b)

The b) part is even weaker:

  • secondaryPreferred, similarly to primaryPreferred, means that the application is ready to deal with both primary and secondary reads.
  • On top of that, there can be no consistency guarantees the application expects from secondary reads that could be violated by primary reads.

Contributor Author

The current contract we've made with users utilizing primaryPreferred is that reads will only go to a secondary if the primary is unavailable. The documentation does not explicitly define unavailable, but in practice that means the primary is unselectable. Overload retargeting makes the primary unselectable for a retry operation if it returned an overload error on a previous attempt. This materially changes how often secondary reads occur. Since secondary reads can result in stale data, enabling overload retargeting increases the chance that users of primaryPreferred will get stale data when they did not previously. This is a semantic change, and so we've made the decision to turn it off by default and add a knob to enable it.

Overload retargeting significantly increases availability during overload, but it does increase the risk of getting stale data when used with primaryPreferred. We may find that users of primaryPreferred prefer that behavior and we turn on retargeting by default in the future, but that's not in the scope of this initial release.

secondaryPreferred does not have this same staleness issue, but it still materially changes what the preference means from "almost always secondary" to "sometimes primary".

The larger discussion here is what exactly primaryPreferred and secondaryPreferred should mean (likely a broader definition of "unavailable"), but that's outside of the scope of this project.

Contributor

Can you add to the Q&A?

Member (@stIncMale, Apr 7, 2026)

The current contract we've made with users utilizing primaryPreferred is that reads will only go to a secondary if the primary is unavailable. The documentation does not explicitly define unavailable, but in practice that means the primary is unselectable.

Where did you find this "only if" guarantee? The documentation does not give such a guarantee, and gives a different one, which I quoted in my previous comment. Furthermore, given that "unavailable" is not defined (nor should it be), we are free to decide and change what "unavailable" means however we want - it won't break the guarantees of the primaryPreferred read preference, because there are no such guarantees. Deciding that a server that explicitly refuses to serve requests is unavailable is very reasonable and useful because, as you pointed out, it may increase availability for primaryPreferred reads when the primary is overloaded.

enabling overload retargeting increases the chance that users of primaryPreferred will get stale data

Yes, that is true. Moreover, network-related issues and the server becoming unresponsive due to being overloaded (which has been happening, and may continue to happen, but hopefully rarely) have the same effect. All of this is part of the primaryPreferred semantics - it guarantees neither that all reads will be executed by the primary, nor any upper limit on the portion of secondary reads.

From a comment in a different thread:

take a three-node replica set with a primary and two secondaries where all operations are performed with primaryPreferred. When overload occurs on the primary without overload retargeting enabled, all retries will go to the primary. There is no chance of stale data

I think that is incorrect, and the documentation seems to be explicit about that (see below). primaryPreferred provides neither guarantees against reading stale data, nor guarantees to read only durable data, even if the driver directs the reads only to the server that it sees as the primary:

  1. A driver seeing a specific node as primary and selecting it for a primaryPreferred read, guarantees neither that this node is indeed a primary, nor that it will still be the primary when executing the read.
  2. A server seeing itself as primary does not guarantee that it is actually a primary. This is common truth, not something MongoDB-specific.
  3. A server seeing itself as primary may not actually be one, so it neither guarantees to read non-stale data, nor does it guarantee to read durable data, which is again explicitly stated in the documentation: "new writes to the former primary will eventually roll back".

If an application needs a guaranteed non-stale and durable read, then it may use the linearizable read concern.
If an application only needs a guarantee that the data read is durable, it may use the majority/snapshot read concern. Neither of these guarantees may be achieved even via the primary read preference alone, let alone the primaryPreferred read preference.

secondaryPreferred does not have this same staleness issue, but it still materially changes what the preference means from "almost always secondary" to "sometimes primary".

  • "almost always secondary" does mean "sometimes primary", doesn't it?
  • Where did you get the "almost always" formulation from? The documentation says "Operations typically read data from secondary members of the replica set". The driver preferring not to read from the secondaries when they are overloaded is completely within reading from secondaries "typically". Similarly to primaryPreferred, implying any limit on the ratio of secondary vs primary reads in secondaryPreferred is impossible.

The larger discussion here is what exactly primaryPreferred and secondaryPreferred should mean (likely a broader definition of "unavailable"), but that's outside of the scope of this project.

If you think that their current documented meaning is wrong, then figuring this out is definitely within the scope, as the justification for proposing the enableOverloadRetargeting option is based on the meaning of these read preferences.


Let's put the above aside, for a moment, and consider what the enableOverloadRetargeting option is conceptually.

This option is equivalent to introducing a new read preference field, controlling how strongly the primary/secondary is to be preferred in primaryPreferred/secondaryPreferred. Importantly, such a field would have to be taken into account by both the driver and the server: similarly to a driver, a mongos should know that it should route a read to the primary even if the primary is overloaded (regardless of how mongos routes reads currently, the aforementioned new read preference information must be accessible to it, so that it can use it properly if the server maintainers decide to change the current behavior in the future).

I don't think the usefulness of such a change has been demonstrated, and I think that user requests are a necessity for considering introducing it. However, at least from the API standpoint, such an approach seems sound.

TL;DR

  1. The proposed enableOverloadRetargeting option does not appear to be based on a correct understanding of the read preferences.
  2. Regardless of item 1, the proposed option should convey information to both the driver and the server, is conceptually part of the read preference, and should be introduced as such, if at all.
    • I don't think its usefulness has been demonstrated, and I think that user requests are a necessity for considering its introduction.

Member

Sorry, I forgot to say that similarly to the other proposed option, I discussed this one with some people from the Java driver team. They also think that enableOverloadRetargeting being true does not change the semantics of the read preferences, and that if nevertheless accepted, it should be done as part of the read preference.

5. If the request is eligible for retry (as outlined in step 2 above) and is a retryable write:
1. If the command is a part of a transaction, the instructions for command modification on retry for commands in
transactions MUST be followed, as outlined in the
[transactions](../transactions/transactions.md#interaction-with-retryable-writes) specification.
@@ -159,26 +157,12 @@
specifications.
- For the purposes of error propagation, `runCommand` is considered a write.

##### Adaptive retry requirements

If adaptive retries are enabled, the following rules MUST also be obeyed:

1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.
- The value is 0.1 and non-configurable.
2. If the command succeeds on a retry attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE`+1 tokens.
3. If a retry attempt fails with an error that is not an overload error, drivers MUST deposit 1 token.
- An error that does not contain the `SystemOverloadedError` error label indicates that the server is healthy enough
to handle requests. For the purposes of retry budget tracking, this counts as a success.
4. A retry attempt will only be permitted if a token can be consumed from the token bucket.
5. A retry attempt consumes 1 token from the token bucket.
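The deposit and consume rules above can be sketched as follows. This is a minimal, illustrative sketch, not the spec's normative pseudocode; the function names are assumptions made for the example.

```python
# Illustrative sketch of adaptive-retry token accounting (rules 1-5 above).
# The bucket starts at its capacity; names here are not part of any driver API.
RETRY_TOKEN_RETURN_RATE = 0.1
CAPACITY = 1000.0
tokens = CAPACITY

def try_consume_retry_token() -> bool:
    """Rules 4 and 5: a retry is permitted only if 1 token can be consumed."""
    global tokens
    if tokens >= 1:
        tokens -= 1
        return True
    return False

def deposit_on_success(was_retry: bool) -> None:
    """Rules 1 and 2: deposit 0.1 on a first-attempt success, 0.1 + 1 on a
    successful retry; the bucket never exceeds its capacity."""
    global tokens
    deposit = RETRY_TOKEN_RETURN_RATE + (1 if was_retry else 0)
    tokens = min(CAPACITY, tokens + deposit)

ok = try_consume_retry_token()      # bucket drops to 999
deposit_on_success(was_retry=True)  # 999 + 1.1, capped at capacity
```

Under steady overload where every retry fails, the bucket drains at 1 token per retry and refills at only 0.1 per first-attempt success, which is what bounds the long-run retry rate.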

#### Interaction with Other Retry Policies

The retry policy in this specification is separate from the other retry policies defined in the
[retryable reads](../retryable-reads/retryable-reads.md) and [retryable writes](../retryable-writes/retryable-writes.md)
specifications. Drivers MUST ensure:

- Only overload errors consume tokens from the token bucket before retrying.
- When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error.
- If an overload error is encountered:
- Regardless of whether CSOT is enabled or not, the maximum number of retries for any retry policy becomes
@@ -198,8 +182,7 @@ included, such as error handling with `NoWritesPerformed` labels.
BASE_BACKOFF = 0.1 # 100ms
MAX_BACKOFF = 10 # 10000ms

RETRY_TOKEN_RETURN_RATE = 0.1
MAX_RETRIES = 5
MAX_RETRIES = 2

def execute_command_retryable(command, ...):
deprioritized_servers = []
@@ -211,23 +194,13 @@ def execute_command_retryable(command, ...):
server = select_server(deprioritized_servers)
connection = server.getConnection()
res = execute_command(connection, command)
if adaptive_retry:
# Deposit tokens into the bucket on success.
tokens = RETRY_TOKEN_RETURN_RATE
if attempt > 0:
tokens += 1
token_bucket.deposit(tokens)
return res
except PyMongoError as exc:
is_retryable = (is_retryable_write(command, exc)
or is_retryable_read(command, exc)
or (exc.contains_error_label("RetryableError") and exc.contains_error_label("SystemOverloadedError")))
is_overload = exc.contains_error_label("SystemOverloadedError")

# if a retry fails with an error which is not an overload error, deposit 1 token
if adaptive_retry and attempt > 0 and not is_overload:
token_bucket.deposit(1)

# Raise if the error is non-retryable.
if not is_retryable:
raise
@@ -238,8 +211,10 @@ def execute_command_retryable(command, ...):

if attempt > allowed_retries:
raise

deprioritized_servers.append(server.address)

# enableOverloadRetargeting is true
if overload_retargeting:
deprioritized_servers.append(server.address)

if is_overload:
jitter = random.random() # Random float between [0.0, 1.0).
@@ -250,59 +225,9 @@
if time.monotonic() + backoff > _csot.get_deadline():
raise

if adaptive_retry and not token_bucket.consume(1):
raise

time.sleep(backoff)
```
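As a quick check of the backoff formula used in the pseudocode above, the following sketch computes the pre-jitter schedule for the spec's constants. The helper name is hypothetical, not part of any driver API.

```python
# Illustrative sketch of: backoff = jitter * min(MAX_BACKOFF, BASE_BACKOFF * 2^(attempt - 1))
BASE_BACKOFF = 0.1  # 100ms
MAX_BACKOFF = 10    # 10000ms
MAX_RETRIES = 2

def backoff_for(attempt: int, jitter: float) -> float:
    """Backoff before retry `attempt` (1-based), scaled by jitter in [0, 1)."""
    return jitter * min(MAX_BACKOFF, BASE_BACKOFF * 2 ** (attempt - 1))

# With jitter pinned to 1.0, the schedule for MAX_RETRIES = 2 is 100ms, 200ms:
schedule = [backoff_for(a, 1.0) for a in range(1, MAX_RETRIES + 1)]
```

The `min` clamp only matters for high attempt counts; with the default of 2 retries the exponential term never approaches `MAX_BACKOFF`.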

### Token Bucket

The overload retry policy introduces an opt-in per-client [token bucket](https://en.wikipedia.org/wiki/Token_bucket) to
limit overload error retry attempts. Although the server rejects excess commands as quickly as possible, doing so costs
CPU and creates extra contention on the connection pool which can eventually negatively affect goodput. To reduce this
risk, the token bucket will limit retry attempts during a prolonged overload.

The token bucket MUST be disabled by default and can be enabled through the
[adaptiveRetries=True](../uri-options/uri-options.md) connection and client options.

The token bucket starts at its maximum capacity of 1000 for consistency with the server.

Each MongoClient instance MUST have its own token bucket. When adaptive retries are enabled, the token bucket MUST be
created when the MongoClient is initialized and exist for the lifetime of the MongoClient. Drivers MUST ensure the token
bucket implementation is thread-safe as it may be accessed concurrently by multiple operations.

#### Pseudocode

The token bucket is implemented via a thread-safe counter. For languages without atomics, this can be implemented via a
lock, for example:

```python
DEFAULT_RETRY_TOKEN_CAPACITY = 1000
class TokenBucket:
"""A token bucket implementation for rate limiting."""
def __init__(
self,
capacity: float = DEFAULT_RETRY_TOKEN_CAPACITY,
):
self.lock = Lock()
self.capacity = capacity
self.tokens = capacity

def consume(self, n: float) -> bool:
"""Consume n tokens from the bucket if available."""
with self.lock:
if self.tokens >= n:
self.tokens -= n
return True
return False

def deposit(self, n: float) -> None:
"""Deposit n tokens back into the bucket."""
with self.lock:
self.tokens = min(self.capacity, self.tokens + n)
```

#### Handshake changes

Drivers conforming to this spec MUST add `"backpressure": True` to the …

…

Additionally, both `retryReads` and `retryWrites` are enabled by default, so for … retried. This approach also
prevents accidentally retrying a read command when only `retryWrites` is enabled, or retrying a write command when
only `retryReads` is enabled.
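The read/write gating described in this paragraph can be sketched as a small predicate; this helper is hypothetical, not part of the spec's API:

```python
def overload_retry_enabled(is_write: bool, retry_reads: bool, retry_writes: bool) -> bool:
    """Gate overload retries on the option matching the command type:
    writes require retryWrites, reads require retryReads."""
    return retry_writes if is_write else retry_reads
```

With both options enabled (the default), every eligible command may be retried; disabling one disables overload retries for that command type only.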

### Why make `maxAdaptiveRetries` configurable?

Modelling and the underlying theory of backpressure show that the n-retries approach (retry up to N times on overload
errors without a token bucket) can introduce retry storms as overload increases. However, the specifics of the
workload, and of the cluster serving that workload, significantly affect the threshold at which retry volume becomes
an additional burden rather than a throughput improvement. Some applications and clusters may tolerate many additional
retries, while others may want to break out of the loop much earlier.

The default of 2 is a broadly sensible value that will, on average, benefit most users during overload. However, savvy
users (those expected to be most affected by overload and to have the most insight into the specifics of their
workload and cluster) will likely find that tuning this value per workload produces better results. Additionally,
there are situations where disabling overload retries entirely is optimal. Without a knob, those situations would
leave users with a strictly worse experience on a new driver, or force them to downgrade to an older driver to avoid
the issue. These are two strong motivations to add a knob for `maxAdaptiveRetries`.

### Why make `enableOverloadRetargeting` configurable?

The current contract with users of `primaryPreferred` is that reads go to a secondary only if the primary is
unavailable. The documentation does not explicitly define "unavailable", but in practice it means the primary is
unselectable. Overload retargeting makes the primary unselectable for a retried operation if it returned an overload
error on a previous attempt, which materially changes how often secondary reads occur. Since secondary reads can
return stale data, enabling overload retargeting increases the chance that users of `primaryPreferred` will read
stale data when they previously would not. This is a semantic change, so retargeting is disabled by default, with a
knob to enable it.
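The retargeting behavior can be sketched as a filter applied during server selection; the helper below is illustrative, not a mandated implementation:

```python
import random

def select_server(candidates: list[str], deprioritized: set[str]) -> str:
    """Prefer servers that have not recently returned an overload error.

    Falls back to the full candidate list when every candidate is
    deprioritized, so selection never fails solely because of retargeting.
    """
    preferred = [s for s in candidates if s not in deprioritized]
    return random.choice(preferred or candidates)
```

Under `primaryPreferred`, a primary that returned an overload error lands in `deprioritized`, so the retry is routed to a secondary even though the primary is otherwise healthy.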

Overload retargeting significantly increases availability during overload, but it also increases the risk of stale
reads when used with `primaryPreferred`. Users of `primaryPreferred` may broadly end up preferring that trade-off; if
so, overload retargeting may be enabled by default in the future.

`secondaryPreferred` does not have this same staleness issue, but it still materially changes what the preference means
from "almost always secondary" to "sometimes primary".

## Changelog

- 2026-03-30: Introduce phase 1 support without token buckets.

- 2026-02-20: Disable token buckets by default.

- 2026-01-09: Initial version.
---

`source/client-backpressure/tests/README.md`:

#### Test 1: Operation Retry Uses Exponential Backoff

Drivers should test that retries do not occur immediately when a SystemOverloadedError is encountered. This test MUST
be executed against a MongoDB 4.4+ server that has enabled the `configureFailPoint` command with the `errorLabels`
option.

1. Let `client` be a `MongoClient`
2. Let `collection` be a collection
Expand Down Expand Up @@ -49,29 +50,19 @@ Drivers should test that retries do not occur immediately when a SystemOverloade

5. Execute step 3 again.

6. Compare the time between the two runs.

```python
assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 0.3 seconds)) < 0.3 seconds)
```

The sum of 2 backoffs is 0.3 seconds. There is a 0.3-second window to account for potential variance between the
two runs.
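Step 6's window comparison can be expressed as a small predicate; the 0.3-second expected backoff and 0.3-second window come from the text above:

```python
def backoff_overhead_ok(with_backoff: float, without_backoff: float,
                        expected: float = 0.3, window: float = 0.3) -> bool:
    """True when the run with backoff took roughly `expected` seconds longer
    than the run without it, within `window` seconds of variance."""
    return abs(with_backoff - (without_backoff + expected)) < window
```

Asserting on a window (rather than only a lower bound) also catches the failure mode where backoff is applied more times, or for longer, than intended.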
> **Review thread (on the timing window above):**
>
> - **Reviewer:** Following the logic used in the previous version of this test, wouldn't this make the comparison
>   value in the assertion 0? (i.e. backoff time minus buffer window)
> - **Author:** Good catch, correct.
> - **Reviewer:** I'm concerned that this test will be flaky because of the small time values. WDYT about restoring
>   the original logic and just setting `maxAdaptiveRetries=5`? This would also give us better confirmation that the
>   proper values for backoff are being calculated.
> - **Author:** We could add another test to explicitly verify proper backoff calculation that configures
>   `maxAdaptiveRetries`. The purpose of this test is to verify that the backoff is used, which is independent of the
>   actual time values. We could add a larger buffer window if flakiness is a concern.
> - **Reviewer:** Discussed over Slack, we decided to update the prose test to assert on a window of values rather
>   than a lower bound.
#### Test 3: Overload Errors are Retried a Maximum of MAX_RETRIES times

Drivers should test that overload errors are retried a maximum of MAX_RETRIES times. This test MUST be executed
against a MongoDB 4.4+ server that has enabled the `configureFailPoint` command with the `errorLabels` option.

1. Let `client` be a `MongoClient` with command event monitoring enabled.


5. Assert that the raised error contains both the `RetryableError` and `SystemOverloadedError` error labels.

6. Assert that the total number of started commands is MAX_RETRIES + 1 (3).

#### Test 4: Overload Errors are Retried a Maximum of maxAdaptiveRetries times when configured

Drivers should test that overload errors are retried a maximum of `maxAdaptiveRetries` times, when configured. This
test MUST be executed against a MongoDB 4.4+ server that has enabled the `configureFailPoint` command with the
`errorLabels` option.

1. Let `client` be a `MongoClient` with `maxAdaptiveRetries=1` and command event monitoring enabled.

2. Let `coll` be a collection.

3. Configure the following failpoint:

   ```javascript
   {
       configureFailPoint: 'failCommand',
       mode: 'alwaysOn',
       data: {
           failCommands: ['find'],
           errorCode: 462, // IngressRequestRateLimitExceeded
           // ...
       }
   }
   ```

4. Perform a find operation with `coll` that fails.

5. Assert that the raised error contains both the `RetryableError` and `SystemOverloadedError` error labels.

6. Assert that the total number of started commands is `maxAdaptiveRetries` + 1 (2).
---

In the accompanying unified test files (JSON and YAML), the expected CMAP event sequences are shortened to match the
lower default retry count: three `connectionCheckedOutEvent`/`connectionCheckedInEvent` pairs are removed from each.