Skip to content

Flexible retry backoff policies#182

Merged
sravotto merged 4 commits intocockroachdb:masterfrom
MattWhelan:backoff
Dec 4, 2025
Merged

Flexible retry backoff policies#182
sravotto merged 4 commits intocockroachdb:masterfrom
MattWhelan:backoff

Conversation

@MattWhelan
Copy link
Copy Markdown
Contributor

Extends the existing retry limit mechanism to allow for delays between retries. Includes implementations for fixed delays and exponential backoff.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@MattWhelan MattWhelan force-pushed the backoff branch 2 times, most recently from 3034fdf to d7039d1 Compare June 10, 2025 15:32
@MattWhelan MattWhelan requested a review from sean- June 10, 2025 16:01
@dikshant
Copy link
Copy Markdown

We haven’t implemented this because cockroachdb-go uses save point based retries and because of savepoint-based retries, a transaction can continue holding locks in between retries. So that means with exponential backoff, you'll be holding locks for longer, thus putting the app at even greater risk of contention.

If you could show us that this is not the case with backoffs, we will consider adding this! Thanks!

@MattWhelan
Copy link
Copy Markdown
Contributor Author

We haven’t implemented this because cockroachdb-go uses save point based retries and because of savepoint-based retries, a transaction can continue holding locks in between retries. So that means with exponential backoff, you'll be holding locks for longer, thus putting the app at even greater risk of contention.

If you could show us that this is not the case with backoffs, we will consider adding this! Thanks!

I just pushed 9dda97b. Could you take a look at that? I'm not sure what all the implications are of separating the database transaction lifecycle from the Tx object, but aside from that, I think it's sound, and avoids the problem with holding locks during backoff.

@MattWhelan MattWhelan requested a review from rafiss June 23, 2025 20:05
Comment thread crdb/retry.go Outdated
}

// Vargo converts a go-retry style Delay provider into a RetryPolicy
func Vargo(fn func() VargoBackoff) RetryPolicy {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could think of a different name to go with here. Users of cockroach-go who don't already know about the sethvargo/go-retry library may be confused to see this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to suggestions. I was trying to build an explicit integration without adding a transitive dependency.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sravotto , how about we rename this function Vargo -> ExternalBackoffPolicy.

I think the name would be a bit more discoverable this way.

Comment thread crdb/common.go
if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
}
if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could have issues. Some libraries may not like it if we use the transaction after rolling it back. For example, I see this in the pgx code where it returns an error if you try to use a transaction that was already rolled back: https://github.com/jackc/pgx/blob/ecc9203ef42fbba50507e773901b5aead75288ef/tx.go#L205-L224

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns an error if you use a transaction object that has the closed flag set. Since we're not doing that, it ought to be ok, I would expect.

Matt Whelan added 2 commits October 21, 2025 14:33
Extends the existing retry limit mechanism to allow for delays between
retries. Includes implementations for fixed delays and exponential
backoff.
Comment thread crdb/tx.go Outdated
// WithMaxRetries configures context so that ExecuteTx retries tx specified
// number of times when encountering retryable errors.
// Setting retries to 0 will retry indefinitely.
// Setting retries to 0 will not retry: the transaction will be tried only once.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to avoid breaking the API here. This changes the behavior of WithMaxRetries(ctx, 0) from "retry indefinitely" to "no retries," which would affect any existing code relying on this.

Maybe we could add a separate WithNoRetries() function, and change something about the internal representation so that the outward facing API remains the same.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Done.

Comment thread crdb/common.go
if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
}
if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still doesn't seem valid to me -- if we reuse the same tx object here, it will not succeed when sending BEGIN right after a ROLLBACK. (see https://github.com/jackc/pgx/blob/ecc9203ef42fbba50507e773901b5aead75288ef/tx.go#L205-L224)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that we not calling tx.Rollback, but rather tx.Exec(ctx, "ROLLBACK"). Please check the comments in the code, and let me know if that makes sense.

Comment thread crdb/common.go
return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)
if delay > 0 && retryErr == nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow this. Why are we doing a retry if retryErr is nil?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added documentation on the RetryFunc function definition to clarify the semantics. Implementation of RetryFunc may return errors (typically MaxRetriesExceededError) to stop the retry process.

Comment thread crdb/common.go Outdated
if delay > 0 && retryErr == nil {
// We don't want to hold locks while waiting for a backoff, so restart the entire transaction
if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong error to return here -- as per this line, when this error is displayed, it will say that there was an issue with ROLLBACK TO SAVEPOINT, which is not what this line is doing:

const msgPattern = "restarting txn failed. ROLLBACK TO SAVEPOINT " +

(This is also the wrong error to return on lines 98 and 101.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Passing the operation to newTxnRestartError.

Comment thread crdb/retry_test.go
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also have tests for edge cases like zero BaseDelay, negative RetryLimit, and MaxDelay < BaseDelay.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread crdb/common.go
if rollbackErr := tx.Exec(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rollbackErr != nil {
return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before we enter the retry logic, we should check for context cancellation:

if err := ctx.Err(); err != nil {
  return err
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread crdb/retry.go
NewRetry() RetryFunc
}

type LimitBackoffRetryPolicy struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is public-facing so should have a comment explaining what it is and how to use it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread crdb/retry.go Outdated
}

// ExpBackoffRetryPolicy implements RetryPolicy using an exponential backoff with optional
// saturation.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow what "saturation" means here. Can we write this comment to explain more how to use this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, reworded comment.

Comment thread crdb/retry_test.go
actualDelays := make([]time.Duration, 0, len(expectedDelays))
rf := policy.NewRetry()
for {
delay, err := rf(nil)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're always passing a nil error here -- can we add tests that check the behavior with a non-nil error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Document LimitBackoffRetryPolicy, ExpBackoffRetryPolicy, and Vargo adapter
with detailed examples.
Preserve backward compatibility by making WithMaxRetries(ctx, 0) mean
unlimited retries (original behavior).
Add WithNoRetries() for disabling retries and introduce sentinel constants.

Enhance RetryFunc documentation to clarify return value semantics and
add additional test cases.
Copy link
Copy Markdown
Contributor

@sravotto sravotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Please take another look (will squash the commits once we are done).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @sean-)

Comment thread crdb/common.go
if rollbackErr := tx.Exec(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rollbackErr != nil {
return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread crdb/common.go
return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)
if delay > 0 && retryErr == nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added documentation on the RetryFunc function definition to clarify the semantics. Implementation of RetryFunc may return errors (typically MaxRetriesExceededError) to stop the retry process.

Comment thread crdb/common.go Outdated
if delay > 0 && retryErr == nil {
// We don't want to hold locks while waiting for a backoff, so restart the entire transaction
if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Passing the operation to newTxnRestartError.

Comment thread crdb/common.go
if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
}
if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that we not calling tx.Rollback, but rather tx.Exec(ctx, "ROLLBACK"). Please check the comments in the code, and let me know if that makes sense.

Comment thread crdb/retry.go
NewRetry() RetryFunc
}

type LimitBackoffRetryPolicy struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread crdb/retry.go Outdated
}

// ExpBackoffRetryPolicy implements RetryPolicy using an exponential backoff with optional
// saturation.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, reworded comment.

Comment thread crdb/retry_test.go
actualDelays := make([]time.Duration, 0, len(expectedDelays))
rf := policy.NewRetry()
for {
delay, err := rf(nil)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread crdb/retry_test.go
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread crdb/tx.go Outdated
// WithMaxRetries configures context so that ExecuteTx retries tx specified
// number of times when encountering retryable errors.
// Setting retries to 0 will retry indefinitely.
// Setting retries to 0 will not retry: the transaction will be tried only once.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Done.

Copy link
Copy Markdown
Contributor

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing those comments.

I gave one final pass, and the only remaining comments I have are renames and simple check for handling infinite retries. I'll mark this as approved, so feel free to merge after addressing those comments.

Comment thread crdb/retry.go Outdated
}

// Vargo converts a go-retry style Delay provider into a RetryPolicy
func Vargo(fn func() VargoBackoff) RetryPolicy {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sravotto , how about we rename this function Vargo -> ExternalBackoffPolicy.

I think the name would be a bit more discoverable this way.

Comment thread crdb/retry.go Outdated
Comment thread crdb/retry.go Outdated
Comment thread crdb/retry.go Outdated
…icies

Update ExpBackoffRetryPolicy to handle RetryLimit consistently with
LimitBackoffRetryPolicy: 0 now means unlimited retries, negative values mean no
retries.
Rename vargoAdapter to ExternalBackoff-related names for clarity and
consistency with the public API. i
Add comprehensive test coverage for NoRetries, UnlimitedRetries, and
negative RetryLimit values with ExpBackoffRetryPolicy.
@sravotto
Copy link
Copy Markdown
Contributor

sravotto commented Dec 4, 2025

Thanks for the review.

@sravotto sravotto merged commit 319b7f9 into cockroachdb:master Dec 4, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants