Skip to content

[retry] Add centralized retry/backoff abstraction#1007

Open
Pranav-d33 wants to merge 12 commits into
meshery:masterfrom
Pranav-d33:feat/retry-backoff-abstraction
Open

[retry] Add centralized retry/backoff abstraction#1007
Pranav-d33 wants to merge 12 commits into
meshery:masterfrom
Pranav-d33:feat/retry-backoff-abstraction

Conversation

@Pranav-d33

@Pranav-d33 Pranav-d33 commented May 13, 2026

Copy link
Copy Markdown

cc @leecalcote Introduces retry/ — a context-aware, exponential-backoff retry package wrapping cenkalti/backoff/v4. Provides a reusable abstraction to replace ad-hoc retry loops across Meshery components.

Changes:

  • Add retry/{retry,options,errors,retry_test}.go
    • Do(ctx, op, ...opts) with production-safe defaults (500 ms initial, x1.5 growth, +/-30% jitter, 2 min cap)
    • Permanent(err) escape hatch for non-retryable errors
    • IsPermanent(err) using errors.As for proper chain unwrapping
    • WithLogNotifier bridges retries to MeshKit logger.Handler
    • 15 table-driven tests, race-detector clean
  • Promote cenkalti/backoff/v4 v4.3.0 from indirect to direct in go.mod
    fixes Feature Request: Centralized Retry/Backoff Helper in MeshKit #1008

Introduces retry/ — a context-aware, exponential-backoff retry package
wrapping cenkalti/backoff/v4. Provides a reusable abstraction to replace
ad-hoc retry loops across Meshery components.

Changes:
- Add retry/{retry,options,errors,retry_test}.go
  - Do(ctx, op, ...opts) with production-safe defaults
    (500 ms initial, x1.5 growth, +/-30% jitter, 2 min cap)
  - Permanent(err) escape hatch for non-retryable errors
  - IsPermanent(err) using errors.As for proper chain unwrapping
  - WithLogNotifier bridges retries to MeshKit logger.Handler
  - 15 table-driven tests, race-detector clean
- Promote cenkalti/backoff/v4 v4.3.0 from indirect to direct in go.mod

Relates to meshery/meshery#19275

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
@welcome

welcome Bot commented May 13, 2026

Copy link
Copy Markdown

Yay, your first pull request! 👍 A contributor will be by to give feedback soon. In the meantime, you can find updates in the #github-notifications channel in the community Slack.
Be sure to double-check that you have signed your commits. Here are instructions for making signing an implicit activity while performing a commit.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a centralized, context-aware exponential backoff retry package wrapping github.com/cenkalti/backoff/v4. It includes a configurable Do function, functional options for tuning backoff parameters, and helpers for handling permanent errors. Feedback suggests upgrading to backoff/v5 for consistency, improving the documentation examples to use context-aware HTTP calls, and refining the Operation signature to include a context. Additionally, suggestions were made to optimize the Do function by checking the context before the first attempt, simplifying the internal retry call, and tightening test assertions for permanent errors.

Comment thread go.mod Outdated
Comment thread retry/retry.go Outdated
Comment thread retry/retry.go Outdated
Comment thread retry/retry.go
Comment thread retry/retry.go Outdated
Comment thread retry/retry_test.go Outdated
@Pranav-d33 Pranav-d33 force-pushed the feat/retry-backoff-abstraction branch from be1d278 to 02affd4 Compare May 13, 2026 20:55
Pranav-d33 and others added 4 commits May 14, 2026 02:32
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
- Simplified RetryNotify call
- Added TODO about missing context.Context in Operation signature
- Updated example to use context-aware http.NewRequestWithContext

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
@Pranav-d33

Copy link
Copy Markdown
Author

cc @leecalcote Wanted to flag this is ready for review when you get a chance — happy to walk through the approach or adjust based on feedback.

- Rename errors.go to error.go
- Pass context.Context to Operation signature
- Use structured logging in WithLogNotifier

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
@Pranav-d33 Pranav-d33 force-pushed the feat/retry-backoff-abstraction branch from 9a6c9c9 to 08b2ca7 Compare May 14, 2026 09:29
Comment thread retry/error.go Outdated
Comment thread retry/retry.go Outdated
Comment thread retry/retry_test.go Outdated
@Pranav-d33 Pranav-d33 force-pushed the feat/retry-backoff-abstraction branch from cd16bc1 to 7f9d6e5 Compare May 15, 2026 10:04
@Pranav-d33 Pranav-d33 requested a review from icoderarely May 15, 2026 10:05

@icoderarely icoderarely left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, might need a maintainers eye on it now for this to be merged if this is a fix they want. Good work @Pranav-d33

@Pranav-d33

Copy link
Copy Markdown
Author

@icoderarely Can you point me to the right person to ping ???

@icoderarely

Copy link
Copy Markdown

hey, did you not create an issue or forgot to link this pr to an issue you are working on?

and the one who assigned you might be the first person you could tag.

Also have some patience, it's a big project, alot of work is happening in parallel and maintainers/members need time too to get back to each. So best bet would be to tag, wait for a day, if there is no response msg this on slack community that you need this pr to be reviewed/merged.

@Pranav-d33

Copy link
Copy Markdown
Author

The issue is in Meshery/meshery and lee was the one who assigned it to me, thats why i pinged him in slack...
meshery/meshery#19275

@icoderarely

Copy link
Copy Markdown

oh we gotta tag someone else then (some exceptions mate)

@icoderarely

Copy link
Copy Markdown

@Omkar-Ugal who could be the right person to tag?

@Omkar-Ugal

Copy link
Copy Markdown
Member

please link the pr with the issue

@Pranav-d33

Copy link
Copy Markdown
Author

please link the pr with the issue

it is linked, meshery/meshery#19275 , the issue is in meshery actually, do i create a new issue in meshkit as well?

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>

Closes meshery#1008

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
@Pranav-d33 Pranav-d33 force-pushed the feat/retry-backoff-abstraction branch from 7f9d6e5 to 628892a Compare May 16, 2026 12:06

@matrixkavi matrixkavi left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change for go.mod is not required.

While majority of the code looks ok to me, I was wondering what would happen if a caller called WithMultiplier or WithJitter with negative values for instance. Isn't it good to validate and set the values to default rather than blindly setting them in options.go?

Comment thread go.mod
Clamp negative/invalid values to defaults instead of silently accepting them.

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
@Pranav-d33

Pranav-d33 commented May 16, 2026

Copy link
Copy Markdown
Author

Added validation guards:

  • WithMultiplier: values <= 0 now fall back to DefaultMultiplier (1.5)
  • WithJitter: values < 0 or > 1 now fall back to DefaultRandomizationFactor (0.3)

Also reverted the unnecessary go.mod change.

Comment thread retry/retry.go Outdated

@lekaf974 lekaf974 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick commen, LGTM otherwise

@lekaf974 lekaf974 requested a review from icoderarely May 17, 2026 03:09
@Pranav-d33

Copy link
Copy Markdown
Author

@matrixkavi , Let me know if any changes are needed before merging.

…pts type with backoff library

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>

@icoderarely icoderarely left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for

retry: drop unnecessary uint64->uint cast by aligning Config.MaxAttempts type with backoff library

lekaf974
lekaf974 previously approved these changes May 20, 2026

@lekaf974 lekaf974 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@leecalcote

Copy link
Copy Markdown
Member

There is the smallest amount of discourse on this topic in the community Slack - https://mesheryio.slack.com/archives/CFGG6U10E/p1779755449286469?thread_ts=1776923231.353449&cid=CFGG6U10E

@leecalcote

Copy link
Copy Markdown
Member

@Pranav-d33 this might be something that we can wire into the remote provider availability tests that occur concurrently onload of the Provider UI.

@Pranav-d33

Copy link
Copy Markdown
Author

MeshKit Retry Abstraction Test Matrix

Category 1: Basic Success Paths

Scenario What it tests Result
TestRetrySucceedsFirstAttempt Operation succeeds on the first call with no retries. PASS (0.00s) — operation called exactly once
TestRetrySucceedsAfterTransientErrors Operation fails 3 times and succeeds on the 4th attempt. PASS (0.01s) — operation called exactly 4 times
TestRetryDefaultsAreApplied Verifies default retry settings are applied when no options are provided. PASS (0.36s) — at least 2 calls made

Category 2: Permanent Error Handling

Scenario What it tests Result
TestRetryPermanentErrorStopsImmediately Permanent errors stop retries immediately. PASS (0.00s) — exactly 1 call
TestIsPermanentReturnsFalseForTransient Transient errors are not classified as permanent. PASS (0.00s)
TestIsPermanentReturnsTrueForPermanentWrapped Wrapped permanent errors are correctly detected. PASS (0.00s)
TestIsPermanentHandlesDoublyWrappedErrors Nested wrapped permanent errors still unwrap correctly. PASS (0.00s)

Category 3: Context Cancellation

Scenario What it tests Result
TestRetryContextCancellationStopsLoop Retry loop stops when context is cancelled mid-backoff. PASS (0.01s)
TestRetryContextAlreadyCancelledBeforeFirstAttempt Retry exits immediately if context is already cancelled. PASS (0.00s)

Category 4: Budget Enforcement

Scenario What it tests Result
TestRetryMaxAttemptsEnforced Retry count respects WithMaxAttempts(4). PASS (0.01s) — exactly 4 calls
TestRetryMaxElapsedTimeEnforced Retry loop respects wall-clock elapsed time budget. PASS (0.08s)
TestRetryZeroMaxAttemptsMeansUnlimited Unlimited retries stop only via elapsed time budget. PASS (0.04s)
TestRetryWithMaxAttemptsOneNoRetry WithMaxAttempts(1) performs exactly one attempt. PASS (0.00s)

Category 5: Notifier Callback Behavior

Scenario What it tests Result
TestRetryNotifierCalledOnEachRetry Notifier invoked once per transient retry failure. PASS (0.01s)
TestRetryNotifierNotCalledOnImmediateSuccess Notifier is not called on immediate success. PASS (0.00s)
TestRetryNotifierNotCalledOnPermanentError Notifier is not called for permanent errors. PASS (0.00s)

Summary

  • Total scenarios executed: 16
  • Passed: 16/16
  • Total runtime: 0.36s
  • Coverage includes:
    • transient retry behavior
    • exponential backoff defaults
    • permanent error semantics
    • context cancellation
    • retry budget enforcement
    • notifier callback behavior
    • wrapped error handling

Comment thread retry/options.go
Comment thread retry/options.go
Comment thread retry/error.go Outdated
Comment thread retry/retry.go

@reaper8055 reaper8055 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the changes, please check the comments.

…rror alias

- Add ErrorClassifier type with DecisionRetry/DecisionStop so callers
  can classify errors as transient or terminal at their bounded context
- Remove exported PermanentError type alias to avoid leaking cenkalti/backoff
  into MeshKit's public API (zero call-sites affected)
- Add validateConfig that fails fast on invalid config:
  * InitialInterval > 0, MaxInterval > 0, MaxInterval >= InitialInterval
  * MaxElapsedTime >= 0, Multiplier finite and >= 1, Jitter in [0,1]
- Remove silent-default fallback from WithMultiplier and WithJitter
  (validation is now centralized instead of hiding bugs)
- Add 16 tests covering classifier behaviour and config validation
- Add ExampleDo showing idiomatic HTTP usage with per-attempt timeout
  vs retry loop budget, using httptest for hermetic output

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
@Pranav-d33

Copy link
Copy Markdown
Author

@reaper8055 ready for re-review.

Comment thread retry/retry.go
return err
}

func validateConfig(cfg Config) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config validation is a good addition, but the returned errors are currently string-only.

Right now callers can only distinguish invalid retry configuration from an operation failure by inspecting the error message. That is brittle. Since retry.Do can now fail before the operation even runs, callers should have a reliable way to detect that class of failure.

I’d suggest exposing a sentinel error, for example:

var ErrInvalidConfig = errors.New("retry: invalid config")

Then wrap it from validateConfig:

return fmt.Errorf("%w: InitialInterval must be > 0, got %v", ErrInvalidConfig, cfg.InitialInterval)

This lets consumers write:

if errors.Is(err, retry.ErrInvalidConfig) {
    // fail startup / reject bad config
}

Comment thread retry/retry_test.go
// MaxElapsedTime limits the retry loop but does NOT interrupt an in-flight HTTP
// request. Always pair it with http.Client.Timeout (or NewRequestWithContext) so
// each attempt has its own deadline.
func ExampleDo() {

@reaper8055 reaper8055 Jun 1, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The classifier API has good tests, but it also needs a compiled documentation example.

WithErrorClassifier is now an important part of the public contract because it lets callers separate retry-loop mechanics from domain-specific error classification. However, the current ExampleDo demonstrates retry.Permanent inside the operation, not the classifier path.

I’d add an example like ExampleWithErrorClassifier that shows the intended usage:

err := retry.Do(ctx, op,
    retry.WithErrorClassifier(func(err error) retry.ErrorDecision {
        var status statusError
        if errors.As(err, &status) {
            if status.Code >= 500 || status.Code == http.StatusTooManyRequests {
                return retry.DecisionRetry
            }
            return retry.DecisionStop
        }
        return retry.DecisionRetry
    }),
)

This matters because otherwise users may keep embedding permanent/transient decisions inside every operation body. The cleaner contract is: the operation returns the domain error, and the classifier decides whether that error is retryable.

A compiled example also protects the public API from drift because go test will fail if the documented usage stops compiling.

IMPORTANT: check with @leecalcote / @lekaf974 on repo policy to where to include code examples for such use case!

Comment thread retry/_demo/main.go Outdated
@@ -0,0 +1,175 @@
package main

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMPORTANT: check with @leecalcote / @lekaf974 on repo policy to where to include code examples / demos for such use case!

@reaper8055 reaper8055 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good here apart from small things, check the comments.

@lekaf974 lekaf974 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that demo is needed

Comment thread retry/_demo/main.go Outdated
@lekaf974 lekaf974 dismissed their stale review June 1, 2026 02:06

files added after my approval, I'll take another look

@lekaf974 lekaf974 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please return meshkit errors from retry package

Comment thread retry/retry.go
…pleWithErrorClassifier

- All errors from Do() are now structured meshkit errors with codes/severity
- ErrInvalidConfig sentinel lets callers distinguish config errors via errors.Is
- retryError wrapper preserves error chains through Unwrap() []error
- Add ExampleWithErrorClassifier compiled example, remove standalone _demo
- Add TestRetryInvalidConfigSentinelIsReachable

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Pranav-d33 added a commit to Pranav-d33/meshery that referenced this pull request Jun 1, 2026
Remove the go.mod replace directive pointing to the Pranav-d33/meshkit
fork (which cannot be pushed upstream) and introduce a local retryutil
package wrapping github.com/cenkalti/backoff/v5 (already an indirect
dep) with the same API surface as meshkit/retry.

This is temporary: once meshery/meshkit#1007 merges, delete
server/models/retryutil/ and flip the import in remote_provider.go
back to github.com/meshery/meshkit/retry.

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Pranav-d33 added a commit to Pranav-d33/meshery that referenced this pull request Jun 1, 2026
Remove the go.mod replace directive pointing to the Pranav-d33/meshkit
fork (which cannot be pushed upstream) and introduce a local retryutil
package wrapping github.com/cenkalti/backoff/v5 (already an indirect
dep) with the same API surface as meshkit/retry.

This is temporary: once meshery/meshkit#1007 merges, delete
server/models/retryutil/ and flip the import in remote_provider.go
back to github.com/meshery/meshkit/retry.

Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>

@banana-three-join banana-three-join left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major really sticks out to me. I think it looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Centralized Retry/Backoff Helper in MeshKit

8 participants