fix(uffd): add retry with exponential backoff on source.Slice() error #2389
…s in faultPage

Transient Slice errors (network blips, temporary GCS/S3 failures) previously caused immediate sandbox termination. Retry up to 3 times with exponential backoff (50ms-500ms + jitter) before signaling uffd exit, giving transient errors a chance to recover.
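A minimal sketch of the backoff schedule described above, not the code from this PR. It assumes the delay doubles on each retry and is capped at 500ms; `sliceRetryMaxDelay` and `backoffDelay` are hypothetical names, and jitter is left out here.

```go
package main

import "time"

const (
	sliceRetryBaseDelay = 50 * time.Millisecond  // base delay quoted in the description
	sliceRetryMaxDelay  = 500 * time.Millisecond // hypothetical name for the 500ms cap
)

// backoffDelay returns the un-jittered delay before the given retry
// (0-based). The doubling factor is an assumption; the description only
// says "exponential backoff (50ms-500ms + jitter)".
func backoffDelay(retry int) time.Duration {
	delay := sliceRetryBaseDelay
	for i := 0; i < retry; i++ {
		delay *= 2
		if delay >= sliceRetryMaxDelay {
			return sliceRetryMaxDelay
		}
	}
	return delay
}
```

Under these assumptions, three retries would wait roughly 50ms, 100ms and 200ms before jitter, so the cap only matters if the retry count is raised.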
The constant represents the number of retries, not total attempts. Adjusted the loop range and log fields to be consistent with the new naming.
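To illustrate the retries-versus-attempts distinction, here is a hedged sketch with backoff omitted. `sliceMaxRetries`, `errTransient` and `callWithRetries` are hypothetical names, not identifiers confirmed by the diff.

```go
package main

import "errors"

// sliceMaxRetries counts retries AFTER the first attempt, so the
// operation runs at most sliceMaxRetries+1 times. Hypothetical name.
const sliceMaxRetries = 3

// errTransient stands in for a transient Slice failure.
var errTransient = errors.New("transient slice error")

// callWithRetries runs op once, then retries it up to sliceMaxRetries
// times. It returns how many calls were made and the last error.
func callWithRetries(op func() error) (calls int, err error) {
	// "<=" rather than "<": attempt 0 is the initial call, not a retry.
	for attempt := 0; attempt <= sliceMaxRetries; attempt++ {
		calls++
		if err = op(); err == nil {
			return calls, nil
		}
	}
	return calls, err
}
```

With an operation that always fails, this makes 4 calls in total: one initial attempt plus 3 retries.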
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit 5fa2cb0.
Two issues found in the retry loop in userfaultfd.go.

Line 376: time.After creates a timer that is not cancelled when ctx.Done() fires, so it leaks until it expires. Use time.NewTimer and call timer.Stop() on context cancellation.

Line 369: rand.Int63n panics when its argument is 0. This is safe with the current 50ms base delay but fragile if sliceRetryBaseDelay is ever reduced below 2ns. Guard with: if half := int64(delay) / 2; half > 0.
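The suggested guard could look roughly like the sketch below. It assumes the jitter is a random value in [0, delay/2) added to the delay, which the comment implies but does not state; `withJitter` is a hypothetical helper name.

```go
package main

import (
	"math/rand"
	"time"
)

// withJitter adds a random duration in [0, delay/2) to delay.
// rand.Int63n panics when its argument is <= 0, so jitter is skipped
// whenever delay/2 rounds down to zero (delay < 2ns) or is negative.
func withJitter(delay time.Duration) time.Duration {
	if half := int64(delay) / 2; half > 0 {
		return delay + time.Duration(rand.Int63n(half))
	}
	return delay
}
```

With the guard in place, a zero or 1ns delay passes through unchanged instead of panicking.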
packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
When the context is cancelled (e.g. during shutdown), source.Slice() fails immediately with a context error. Without this check, up to 4096 concurrent fault handlers would each log a misleading 'retrying' warning before the select detects ctx.Done(). Break the retry loop early when ctx.Err() is non-nil to avoid the log noise.
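A hedged sketch of the early-exit check follows. Logging and backoff are reduced to a comment, and `retryUnlessCancelled` and `callsAfterCancel` are hypothetical names used only for illustration.

```go
package main

import "context"

// retryUnlessCancelled calls op up to maxRetries+1 times, but stops as
// soon as the context is cancelled so no "retrying" warning is emitted
// for an error that cannot recover.
func retryUnlessCancelled(ctx context.Context, maxRetries int, op func(context.Context) error) (calls int, err error) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		calls++
		if err = op(ctx); err == nil {
			return calls, nil
		}
		if ctx.Err() != nil {
			break // shutting down: skip the warning and the backoff
		}
		// A real loop would log the retry warning and wait here.
	}
	return calls, err
}

// callsAfterCancel shows the behaviour with an already-cancelled context:
// op fails with the context error and the loop exits after one call.
func callsAfterCancel() int {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	calls, _ := retryUnlessCancelled(ctx, 3, func(c context.Context) error {
		return c.Err()
	})
	return calls
}
```

In this sketch a cancelled context yields exactly one call and no retries, which is the behaviour the commit message describes.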
time.After creates a timer that lives until expiry even if the select takes the ctx.Done path. Under high failure load with 4096 concurrent fault handlers each retrying 3 times, this could produce many abandoned timers. Use time.NewTimer and call Stop() on context cancellation to release the timer immediately.
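The timer change can be sketched as a small helper. `sleepCtx` and `cancelledSleepFails` are hypothetical names, and the actual diff may inline this select in the retry loop instead.

```go
package main

import (
	"context"
	"time"
)

// sleepCtx waits for delay or until ctx is done, whichever comes first.
// Unlike time.After, the timer is stopped explicitly on the ctx.Done()
// path, so it is released immediately rather than lingering until expiry.
func sleepCtx(ctx context.Context, delay time.Duration) error {
	timer := time.NewTimer(delay)
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		timer.Stop()
		return ctx.Err()
	}
}

// cancelledSleepFails reports whether sleepCtx returns an error right
// away when the context is already cancelled, even with a long delay.
func cancelledSleepFails() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return sleepCtx(ctx, time.Hour) != nil
}
```

A caller in the retry loop would treat a non-nil return as the signal to stop retrying.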
1ec86ea to 5fa2cb0