Update deferred retry backoff to quartic formula and default max retries to 20 #271
anuj-pal27 wants to merge 6 commits into rage-rb:main
Hi @rsamoilov, can you please review this PR when you have time? Thanks!
rsamoilov left a comment
Hi @anuj-pal27
This looks great! Max attempts set to 20 also makes sense.
As a general improvement, do you mind adding the will_retry_in? method to the deferred metadata? I think you'll need to cache the result of __next_retry_in, but otherwise this change would significantly improve observability.
Thanks for the guidance @rsamoilov! I’ll add will_retry_in to deferred metadata and cache the __next_retry_in result so it isn’t recomputed. I’ll push an update shortly.
Hi @rsamoilov, quick update on the will_retry_in implementation approach: I’m planning to add caching only for will_retry_in. Also, instead of writing directly to context[7] from Metadata, I plan to add Context.set_will_retry_in(...) so that Metadata doesn’t depend on array index details and the context structure stays encapsulated. Does this approach look good to you?
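To illustrate why caching matters here: the delay formula includes a random term, so computing it twice for the same attempt can give two different answers. The sketch below memoizes the value so every reader sees the same number. `RetryMetadataSketch` is a hypothetical stand-in written for this illustration; it is not Rage's actual Metadata or Context class, and it assumes the formula from the PR description.

```ruby
# Hypothetical stand-in for the deferred metadata object; not Rage's real class.
class RetryMetadataSketch
  def initialize(attempt)
    @attempt = attempt
  end

  # Memoize the jittered delay so it is computed once per attempt.
  # Without the cache, each call would draw a new rand(15) value.
  def will_retry_in
    @will_retry_in ||= (@attempt**4) + 10 + (rand(15) * @attempt)
  end
end

meta = RetryMetadataSketch.new(3)
first_read = meta.will_retry_in
# Later reads return the cached value rather than a fresh random draw.
puts first_read == meta.will_retry_in
```

With this shape, the value a user publishes to monitoring and the delay actually scheduled could come from the same cached number, provided both read it from the same object.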
Hi @anuj-pal27, I don't think you need to update However, Imagine a user monitors task failures and publishes the time when the task will be retried to their monitoring system. They would do it by calling
Hey @rsamoilov, I’ve updated the changes. Quick summary:
Description
This PR updates the default retry behavior for Rage::Deferred so failed tasks are retried for longer instead of exhausting too quickly.
What changed
Increased default max retries to 20
Updated default retry delay formula to:
(attempt**4) + 10 + (rand(15) * attempt)
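As a rough illustration of how this formula behaves, the sketch below computes the smallest and largest possible delay for a few attempts. `retry_delay` and `retry_delay_bounds` are hypothetical helpers written for this example, not Rage's internals. The sketch assumes the result is in seconds and that `rand(15)` yields an integer from 0 to 14.

```ruby
# Hypothetical helper mirroring the formula in the PR description.
# `jitter` is injectable so the bounds can be checked deterministically;
# by default it is drawn fresh on every call.
def retry_delay(attempt, jitter: rand(15))
  (attempt**4) + 10 + (jitter * attempt)
end

# Smallest (jitter = 0) and largest (jitter = 14) delay for an attempt.
def retry_delay_bounds(attempt)
  [retry_delay(attempt, jitter: 0), retry_delay(attempt, jitter: 14)]
end

[1, 5, 10, 20].each do |attempt|
  min, max = retry_delay_bounds(attempt)
  puts "attempt #{attempt}: #{min}..#{max} seconds"
end
```

The quartic term dominates quickly: by attempt 10 the base delay is already 10,010 seconds, while the jitter adds at most 140.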
Why this change
Previously, retries could finish in just a few minutes, which is often too short for real incidents like temporary outages or bad deploys.
With this new formula, retries are spaced out more over time, so tasks have a better chance to succeed once the issue is fixed.
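To put a number on "spaced out more over time", the sketch below sums the delays across all retries. It assumes attempts are numbered 1 through 20, that each delay is applied exactly once, and that delays are in seconds; `total_wait` is a hypothetical helper for this example, not part of Rage.

```ruby
SECONDS_PER_DAY = 86_400.0

# Total time spent waiting across all retries, for a fixed jitter value.
# jitter = 0 gives the shortest possible window, jitter = 14 the longest.
def total_wait(max_attempts, jitter)
  (1..max_attempts).sum { |attempt| (attempt**4) + 10 + (jitter * attempt) }
end

shortest = total_wait(20, 0)
longest  = total_wait(20, 14)

puts "shortest window: #{shortest} seconds (#{(shortest / SECONDS_PER_DAY).round(1)} days)"
puts "longest window:  #{longest} seconds (#{(longest / SECONDS_PER_DAY).round(1)} days)"
```

Under these assumptions the full retry window is a little over eight days, and the jitter shifts the total by less than an hour.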
Test updates
I also updated tests in spec/rage/deferred/task_spec.rb to match the new behavior:
0..4
MAX_ATTEMPTS = 20
Closes #251.