
fix(uffd): retry failed page faults via UFFDIO_WAKE instead of killing sandbox #2370

Closed

arkamar wants to merge 3 commits into main from fix/uffd-retry-on-transient-storage-failures

Conversation

arkamar (Contributor) commented Apr 13, 2026

fix(uffd): retry failed page faults via UFFDIO_WAKE instead of killing sandbox

Previously, a single failed source.Slice in faultPage immediately fired SignalExit, irreversibly killing the entire sandbox. This was the root cause of production UFFD failures when GCS reads timed out.

Add UFFDIO_WAKE support and use it to retry transient storage failures:

  1. On source.Slice failure, increment a per-address retry counter
  2. If under maxFaultRetries (3): call UFFDIO_WAKE to wake the faulting guest thread without resolving the fault - the kernel re-delivers the page fault as a fresh message, freeing the current goroutine
  3. If all retries exhausted: call SignalExit as before (sandbox teardown)
  4. On success: clear retry state for that address

This is the proper kernel-level retry mechanism - UFFDIO_WAKE tells the kernel 'I cannot serve this page right now', and the kernel naturally re-delivers the fault. No goroutine slots are held during the retry window.
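The per-address retry bookkeeping described in steps 1–4 can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `retryTracker` type, its methods, and the `maxFaultRetries` constant are hypothetical names.

```go
// Sketch of per-address retry bookkeeping for a UFFD fault handler.
// All identifiers here are hypothetical, not taken from the PR.
package main

import (
	"fmt"
	"sync"
)

const maxFaultRetries = 3

// retryTracker counts consecutive source read failures per fault address.
type retryTracker struct {
	mu       sync.Mutex
	attempts map[uintptr]int
}

func newRetryTracker() *retryTracker {
	return &retryTracker{attempts: make(map[uintptr]int)}
}

// Fail records a failure for addr and reports whether the fault should be
// retried via UFFDIO_WAKE (true) or escalated to sandbox exit (false).
func (tr *retryTracker) Fail(addr uintptr) bool {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.attempts[addr]++
	return tr.attempts[addr] < maxFaultRetries
}

// Succeed clears retry state for addr after a successful page fill.
func (tr *retryTracker) Succeed(addr uintptr) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	delete(tr.attempts, addr)
}

func main() {
	tr := newRetryTracker()
	addr := uintptr(0x7f0000001000)
	fmt.Println(tr.Fail(addr)) // true:  attempt 1, retry via UFFDIO_WAKE
	fmt.Println(tr.Fail(addr)) // true:  attempt 2, retry again
	fmt.Println(tr.Fail(addr)) // false: budget exhausted, SignalExit
	tr.Succeed(addr)
	fmt.Println(tr.Fail(addr)) // true:  state was cleared, counting restarts
}
```

The map deletion on success matters: without it, a page that faulted and recovered earlier would carry stale attempts into a later, unrelated failure.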

cursor bot commented Apr 13, 2026

PR Summary

Medium Risk
Changes the userfaultfd page-fault error path to retry and re-trigger faults instead of immediately tearing down the sandbox; mistakes here could cause stuck or repeated faults under load. Scope is limited to UFFD handling and adds bounded retries to mitigate infinite loops.

Overview
Improves resilience to transient backing-store read failures during UFFD page faults by adding UFFDIO_WAKE support and using it to retry faults a bounded number of times before triggering sandbox exit. The handler now tracks per-page retry counts, wakes the faulting thread to have the kernel re-deliver the fault, and clears retry state on success or final failure (with errors joined for logging/trace).
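For context on the wake call itself: UFFDIO_WAKE is defined in `<linux/userfaultfd.h>` as `_IOR(UFFDIO, _UFFDIO_WAKE, struct uffdio_range)`, an ioctl on the userfaultfd taking the page-aligned start and length of the region to wake. The sketch below derives the request number from the kernel's `_IOR` encoding rather than assuming any exported Go constant; it only computes the number, it does not issue the ioctl.

```go
// Derive the UFFDIO_WAKE ioctl request number from the kernel's _IOR macro.
// Values are from <linux/userfaultfd.h>: UFFDIO type 0xAA, command nr 0x02.
package main

import "fmt"

const (
	iocRead      = 2    // _IOC_READ direction bit
	uffdioType   = 0xAA // UFFDIO ioctl type
	uffdioWakeNr = 0x02 // _UFFDIO_WAKE command number
)

// uffdioRange mirrors struct uffdio_range: the start address and length of
// the region whose faulting threads should be woken. Sixteen bytes total.
type uffdioRange struct {
	Start uint64
	Len   uint64
}

// ioR reproduces _IOR(type, nr, size): dir<<30 | size<<16 | type<<8 | nr.
func ioR(typ, nr, size uintptr) uintptr {
	return iocRead<<30 | size<<16 | typ<<8 | nr
}

func main() {
	req := ioR(uffdioType, uffdioWakeNr, 16) // sizeof(struct uffdio_range)
	fmt.Printf("UFFDIO_WAKE = %#x\n", req)
}
```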

Reviewed by Cursor Bugbot for commit f1255f2.

@arkamar arkamar marked this pull request as ready for review April 13, 2026 09:14

claude bot left a comment

The concurrent-retry exhaustion bug (shared per-address counter across goroutines) materially undermines the retry mechanism's ability to tolerate transient GCS timeouts; this warrants human review before merge.

Extended reasoning...

Overview
This PR adds UFFDIO_WAKE support to retry transient failures in the userfaultfd fault handler, replacing immediate sandbox teardown with up to 3 retry cycles.

Security risks
No authentication or permission-sensitive code is touched. The changes are confined to the uffd fault-handling subsystem.

Level of scrutiny
High. This is production-critical, low-level kernel interface code. The retry mechanism is the entire value of the PR, and a concurrency flaw in the retry counter could cause premature sandbox termination under load — the exact scenario the PR is designed to handle.

Other factors
The bug hunter identified a meaningful race: the per-address counter is shared across all concurrently spawned goroutines for the same fault address. With N concurrent vCPU faults on the same address, all N goroutines can increment the counter in one failure cycle, exhausting the budget before any actual retry cycle completes. A human reviewer should evaluate whether kernel-level coalescing prevents this in practice, and if not, how to scope the counter to retry cycles rather than individual goroutine attempts.

jakubno (Member) left a comment


lgtm, but let's wait for @ValentaTomas to check

@jakubno jakubno assigned ValentaTomas and unassigned jakubno Apr 13, 2026
@jakubno jakubno removed the request for review from dobrac April 13, 2026 15:23
@ValentaTomas ValentaTomas marked this pull request as draft April 13, 2026 17:09
arkamar (Contributor, Author) commented Apr 14, 2026

Closing in favor of #2389.

@arkamar arkamar closed this Apr 14, 2026

4 participants