[retry] Add centralized retry/backoff abstraction#1007
Conversation
Introduces retry/ — a context-aware, exponential-backoff retry package
wrapping cenkalti/backoff/v4. Provides a reusable abstraction to replace
ad-hoc retry loops across Meshery components.
Changes:
- Add retry/{retry,options,errors,retry_test}.go
- Do(ctx, op, ...opts) with production-safe defaults
(500 ms initial, x1.5 growth, +/-30% jitter, 2 min cap)
- Permanent(err) escape hatch for non-retryable errors
- IsPermanent(err) using errors.As for proper chain unwrapping
- WithLogNotifier bridges retries to MeshKit logger.Handler
- 15 table-driven tests, race-detector clean
- Promote cenkalti/backoff/v4 v4.3.0 from indirect to direct in go.mod
Relates to meshery/meshery#19275
Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
|
Yay, your first pull request! 👍 A contributor will be by to give feedback soon. In the meantime, you can find updates in the #github-notifications channel in the community Slack. |
There was a problem hiding this comment.
Code Review
This pull request introduces a centralized, context-aware exponential backoff retry package wrapping github.com/cenkalti/backoff/v4. It includes a configurable Do function, functional options for tuning backoff parameters, and helpers for handling permanent errors. Feedback suggests upgrading to backoff/v5 for consistency, improving the documentation examples to use context-aware HTTP calls, and refining the Operation signature to include a context. Additionally, suggestions were made to optimize the Do function by checking the context before the first attempt, simplifying the internal retry call, and tightening test assertions for permanent errors.
be1d278 to
02affd4
Compare
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
- Simplified RetryNotify call - Added TODO about missing context.Context in Operation signature - Updated example to use context-aware http.NewRequestWithContext Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
|
cc @leecalcote Wanted to flag this is ready for review when you get a chance — happy to walk through the approach or adjust based on feedback. |
- Rename errors.go to error.go - Pass context.Context to Operation signature - Use structured logging in WithLogNotifier Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
9a6c9c9 to
08b2ca7
Compare
cd16bc1 to
7f9d6e5
Compare
icoderarely
left a comment
There was a problem hiding this comment.
LGTM, might need a maintainers eye on it now for this to be merged if this is a fix they want. Good work @Pranav-d33
|
@icoderarely Can you point me to the right person to ping ??? |
|
hey, did you not create an issue or forgot to link this pr to an issue you are working on? and the one who assigned you might be the first person you could tag. Also have some patience, it's a big project, alot of work is happening in parallel and maintainers/members need time too to get back to each. So best bet would be to tag, wait for a day, if there is no response msg this on slack community that you need this pr to be reviewed/merged. |
|
The issue is in Meshery/meshery and lee was the one who assigned it to me, thats why i pinged him in slack... |
|
oh we gotta tag someone else then (some exceptions mate) |
|
@Omkar-Ugal who could be the right person to tag? |
|
please link the pr with the issue |
it is linked, meshery/meshery#19275 , the issue is in meshery actually, do i create a new issue in meshkit as well? |
Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com> Closes meshery#1008 Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
7f9d6e5 to
628892a
Compare
There was a problem hiding this comment.
The change for go.mod is not required.
While majority of the code looks ok to me, I was wondering what would happen if a caller called WithMultiplier or WithJitter with negative values for instance. Isn't it good to validate and set the values to default rather than blindly setting them in options.go?
Clamp negative/invalid values to defaults instead of silently accepting them. Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
|
Added validation guards:
Also reverted the unnecessary |
lekaf974
left a comment
There was a problem hiding this comment.
Just a quick commen, LGTM otherwise
|
@matrixkavi , Let me know if any changes are needed before merging. |
…pts type with backoff library Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
icoderarely
left a comment
There was a problem hiding this comment.
LGTM for
retry: drop unnecessary uint64->uint cast by aligning Config.MaxAttempts type with backoff library
|
There is the smallest amount of discourse on this topic in the community Slack - https://mesheryio.slack.com/archives/CFGG6U10E/p1779755449286469?thread_ts=1776923231.353449&cid=CFGG6U10E |
|
@Pranav-d33 this might be something that we can wire into the remote provider availability tests that occur concurrently onload of the Provider UI. |
MeshKit Retry Abstraction Test MatrixCategory 1: Basic Success Paths
Category 2: Permanent Error Handling
Category 3: Context Cancellation
Category 4: Budget Enforcement
Category 5: Notifier Callback Behavior
Summary
|
reaper8055
left a comment
There was a problem hiding this comment.
I have reviewed the changes, please check the comments.
…rror alias - Add ErrorClassifier type with DecisionRetry/DecisionStop so callers can classify errors as transient or terminal at their bounded context - Remove exported PermanentError type alias to avoid leaking cenkalti/backoff into MeshKit's public API (zero call-sites affected) - Add validateConfig that fails fast on invalid config: * InitialInterval > 0, MaxInterval > 0, MaxInterval >= InitialInterval * MaxElapsedTime >= 0, Multiplier finite and >= 1, Jitter in [0,1] - Remove silent-default fallback from WithMultiplier and WithJitter (validation is now centralized instead of hiding bugs) - Add 16 tests covering classifier behaviour and config validation - Add ExampleDo showing idiomatic HTTP usage with per-attempt timeout vs retry loop budget, using httptest for hermetic output Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
|
@reaper8055 ready for re-review. |
| return err | ||
| } | ||
|
|
||
| func validateConfig(cfg Config) error { |
There was a problem hiding this comment.
The config validation is a good addition, but the returned errors are currently string-only.
Right now callers can only distinguish invalid retry configuration from an operation failure by inspecting the error message. That is brittle. Since retry.Do can now fail before the operation even runs, callers should have a reliable way to detect that class of failure.
I’d suggest exposing a sentinel error, for example:
var ErrInvalidConfig = errors.New("retry: invalid config")Then wrap it from validateConfig:
return fmt.Errorf("%w: InitialInterval must be > 0, got %v", ErrInvalidConfig, cfg.InitialInterval)This lets consumers write:
if errors.Is(err, retry.ErrInvalidConfig) {
// fail startup / reject bad config
}| // MaxElapsedTime limits the retry loop but does NOT interrupt an in-flight HTTP | ||
| // request. Always pair it with http.Client.Timeout (or NewRequestWithContext) so | ||
| // each attempt has its own deadline. | ||
| func ExampleDo() { |
There was a problem hiding this comment.
The classifier API has good tests, but it also needs a compiled documentation example.
WithErrorClassifier is now an important part of the public contract because it lets callers separate retry-loop mechanics from domain-specific error classification. However, the current ExampleDo demonstrates retry.Permanent inside the operation, not the classifier path.
I’d add an example like ExampleWithErrorClassifier that shows the intended usage:
err := retry.Do(ctx, op,
retry.WithErrorClassifier(func(err error) retry.ErrorDecision {
var status statusError
if errors.As(err, &status) {
if status.Code >= 500 || status.Code == http.StatusTooManyRequests {
return retry.DecisionRetry
}
return retry.DecisionStop
}
return retry.DecisionRetry
}),
)This matters because otherwise users may keep embedding permanent/transient decisions inside every operation body. The cleaner contract is: the operation returns the domain error, and the classifier decides whether that error is retryable.
A compiled example also protects the public API from drift because go test will fail if the documented usage stops compiling.
IMPORTANT: check with @leecalcote / @lekaf974 on repo policy to where to include code examples for such use case!
| @@ -0,0 +1,175 @@ | |||
| package main | |||
There was a problem hiding this comment.
IMPORTANT: check with @leecalcote / @lekaf974 on repo policy to where to include code examples / demos for such use case!
reaper8055
left a comment
There was a problem hiding this comment.
Everything looks good here apart from small things, check the comments.
lekaf974
left a comment
There was a problem hiding this comment.
Not sure that demo is needed
files added after my approval, I'll take another look
lekaf974
left a comment
There was a problem hiding this comment.
Please return meshkit errors from retry package
…pleWithErrorClassifier - All errors from Do() are now structured meshkit errors with codes/severity - ErrInvalidConfig sentinel lets callers distinguish config errors via errors.Is - retryError wrapper preserves error chains through Unwrap() []error - Add ExampleWithErrorClassifier compiled example, remove standalone _demo - Add TestRetryInvalidConfigSentinelIsReachable Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Remove the go.mod replace directive pointing to the Pranav-d33/meshkit fork (which cannot be pushed upstream) and introduce a local retryutil package wrapping github.com/cenkalti/backoff/v5 (already an indirect dep) with the same API surface as meshkit/retry. This is temporary: once meshery/meshkit#1007 merges, delete server/models/retryutil/ and flip the import in remote_provider.go back to github.com/meshery/meshkit/retry. Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
Remove the go.mod replace directive pointing to the Pranav-d33/meshkit fork (which cannot be pushed upstream) and introduce a local retryutil package wrapping github.com/cenkalti/backoff/v5 (already an indirect dep) with the same API surface as meshkit/retry. This is temporary: once meshery/meshkit#1007 merges, delete server/models/retryutil/ and flip the import in remote_provider.go back to github.com/meshery/meshkit/retry. Signed-off-by: Pranav Dhiran <dhiranpranav72@gmail.com>
banana-three-join
left a comment
There was a problem hiding this comment.
Nothing major really sticks out to me. I think it looks good!
cc @leecalcote Introduces retry/ — a context-aware, exponential-backoff retry package wrapping cenkalti/backoff/v4. Provides a reusable abstraction to replace ad-hoc retry loops across Meshery components.
Changes:
fixes Feature Request: Centralized Retry/Backoff Helper in MeshKit #1008