fix(uffd): retry failed page faults via UFFDIO_WAKE instead of killing sandbox #2370
Conversation
Previously, a single failed source.Slice in faultPage immediately fired SignalExit, irreversibly killing the entire sandbox. This was the root cause of production UFFD failures when GCS reads timed out.

Add UFFDIO_WAKE support and use it to retry transient storage failures:

1. On source.Slice failure, increment a per-address retry counter.
2. If under maxFaultRetries (3): call UFFDIO_WAKE to wake the faulting guest thread without resolving the fault. The kernel re-delivers the page fault as a fresh message, freeing the current goroutine.
3. If all retries are exhausted: call SignalExit as before (sandbox teardown).
4. On success: clear the retry state for that address.

This is the proper kernel-level retry mechanism: UFFDIO_WAKE tells the kernel "I cannot serve this page right now", and the kernel naturally re-delivers the fault. No goroutine slots are held during the retry window.
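The retry path above can be sketched in Go. This is illustrative only: `retryTracker`, `wake`, and all identifiers are assumptions, not the PR's actual code; the only parts taken from the kernel interface are the `struct uffdio_range` layout and the `UFFDIO_WAKE` ioctl number from `<linux/userfaultfd.h>`.

```go
package main

import (
	"fmt"
	"sync"
	"syscall"
	"unsafe"
)

// maxFaultRetries mirrors the limit described in the PR.
const maxFaultRetries = 3

// uffdioRange mirrors struct uffdio_range from <linux/userfaultfd.h>.
type uffdioRange struct {
	Start uint64
	Len   uint64
}

// uffdioWake is _IOR(0xAA, 0x02, struct uffdio_range) on Linux.
const uffdioWake = 0x8010AA02

// wake asks the kernel to re-deliver the fault for [addr, addr+length)
// without resolving it, freeing the current goroutine. (Hypothetical
// helper; not called in the demo below.)
func wake(uffd int, addr uintptr, length uint64) error {
	rng := uffdioRange{Start: uint64(addr), Len: length}
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL,
		uintptr(uffd), uffdioWake, uintptr(unsafe.Pointer(&rng)))
	if errno != 0 {
		return errno
	}
	return nil
}

// retryTracker counts consecutive fetch failures per faulting address,
// a stand-in for the PR's per-address retry counter.
type retryTracker struct {
	mu     sync.Mutex
	counts map[uintptr]int
}

func newRetryTracker() *retryTracker {
	return &retryTracker{counts: map[uintptr]int{}}
}

// shouldRetry increments the counter for addr and reports whether the
// fault should be retried via wake() or escalated to SignalExit.
func (t *retryTracker) shouldRetry(addr uintptr) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[addr]++
	return t.counts[addr] <= maxFaultRetries
}

// clear resets retry state for addr after a successful page copy.
func (t *retryTracker) clear(addr uintptr) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.counts, addr)
}

func main() {
	t := newRetryTracker()
	addr := uintptr(0x7f00_0000_0000)
	for i := 0; i < 4; i++ {
		fmt.Println(t.shouldRetry(addr)) // true, true, true, then false
	}
	t.clear(addr)
	fmt.Println(t.shouldRetry(addr)) // true: success resets the budget
}
```

The tracker-then-wake split matters: the decision to retry is pure bookkeeping, while the ioctl is the only kernel interaction, so the goroutine can return right after `wake` succeeds.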
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit f1255f2. Bugbot is set up for automated code reviews on this repo.
The concurrent-retry exhaustion bug (shared per-address counter across goroutines) materially undermines the retry mechanism's ability to tolerate transient GCS timeouts; this warrants human review before merge.
### Overview
This PR adds UFFDIO_WAKE support to retry transient failures in the userfaultfd fault handler, replacing immediate sandbox teardown with up to 3 retry cycles.

### Security risks
No authentication or permission-sensitive code is touched. The changes are confined to the uffd fault-handling subsystem.

### Level of scrutiny
High. This is production-critical, low-level kernel interface code. The retry mechanism is the entire value of the PR, and a concurrency flaw in the retry counter could cause premature sandbox termination under load, the exact scenario the PR is designed to handle.

### Other factors
The bug hunter identified a meaningful race: the per-address counter is shared across all concurrently spawned goroutines for the same fault address. With N concurrent vCPU faults, all N goroutines can increment the counter in one failure cycle, exhausting the budget before any actual retry cycle completes. A human reviewer should evaluate whether kernel-level coalescing prevents this in practice and, if not, how to scope the counter to retry cycles rather than individual goroutine attempts.
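One way the cycle-scoped counter could look, sketched under the assumption that the budget should count fetch cycles rather than goroutine attempts: an in-flight flag per address lets the first faulting goroutine own the cycle, while late arrivals return immediately and rely on the kernel re-delivering their fault after the owner's UFFDIO_WAKE or UFFDIO_COPY. All identifiers here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

const maxFaultRetries = 3

// cycleTracker scopes the retry budget to fetch cycles, not goroutines:
// concurrent faults on the same address join the in-flight cycle instead
// of incrementing the counter themselves.
type cycleTracker struct {
	mu       sync.Mutex
	counts   map[uintptr]int
	inFlight map[uintptr]bool
}

func newCycleTracker() *cycleTracker {
	return &cycleTracker{
		counts:   map[uintptr]int{},
		inFlight: map[uintptr]bool{},
	}
}

// beginCycle reports whether the caller owns the fetch cycle for addr.
// Non-owners simply return; the kernel re-delivers their fault once the
// owner resolves or wakes the page.
func (t *cycleTracker) beginCycle(addr uintptr) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inFlight[addr] {
		return false
	}
	t.inFlight[addr] = true
	return true
}

// failCycle records exactly one failed cycle (however many goroutines
// faulted during it) and reports whether the budget allows another retry.
func (t *cycleTracker) failCycle(addr uintptr) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight[addr] = false
	t.counts[addr]++
	return t.counts[addr] <= maxFaultRetries
}

func main() {
	t := newCycleTracker()
	addr := uintptr(0x7f00_0000_0000)
	// Three vCPUs fault the same page concurrently: only one owns the cycle,
	// so one storage failure costs one unit of budget, not three.
	owners := 0
	for i := 0; i < 3; i++ {
		if t.beginCycle(addr) {
			owners++
		}
	}
	fmt.Println(owners)            // 1
	fmt.Println(t.failCycle(addr)) // true: budget not exhausted
}
```

Whether this dedup is even needed depends on the open question above: if the kernel already coalesces faults on the same page into one uffd message, the per-goroutine counter may never over-count in practice.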
jakubno left a comment
lgtm, but let's wait for @ValentaTomas to check
Closing in favor of #2389.