fix(uffd): retry failed page faults via UFFDIO_WAKE instead of killing sandbox #2370
Conversation
Previously, a single failed source.Slice in faultPage immediately fired SignalExit, irreversibly killing the entire sandbox. This was the root cause of production UFFD failures when GCS reads timed out.

Add UFFDIO_WAKE support and use it to retry transient storage failures:

1. On source.Slice failure, increment a per-address retry counter.
2. If under maxFaultRetries (3): call UFFDIO_WAKE to wake the faulting guest thread without resolving the fault. The kernel re-delivers the page fault as a fresh message, freeing the current goroutine.
3. If all retries are exhausted: call SignalExit as before (sandbox teardown).
4. On success: clear the retry state for that address.

This is the proper kernel-level retry mechanism: UFFDIO_WAKE tells the kernel "I cannot serve this page right now", and the kernel naturally re-delivers the fault. No goroutine slots are held during the retry window.
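The retry path above can be sketched in Go. This is illustrative only: `retryTracker`, `wake`, and all identifiers are assumptions, not the PR's actual code; the only parts taken from the kernel interface are the `struct uffdio_range` layout and the `UFFDIO_WAKE` ioctl number from `<linux/userfaultfd.h>`.

```go
package main

import (
	"fmt"
	"sync"
	"syscall"
	"unsafe"
)

// maxFaultRetries mirrors the limit described in the PR.
const maxFaultRetries = 3

// uffdioRange mirrors struct uffdio_range from <linux/userfaultfd.h>.
type uffdioRange struct {
	Start uint64
	Len   uint64
}

// uffdioWake is _IOR(0xAA, 0x02, struct uffdio_range) on Linux.
const uffdioWake = 0x8010AA02

// wake asks the kernel to re-deliver the fault for [addr, addr+length)
// without resolving it, freeing the current goroutine. (Hypothetical
// helper; not called in the demo below.)
func wake(uffd int, addr uintptr, length uint64) error {
	rng := uffdioRange{Start: uint64(addr), Len: length}
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL,
		uintptr(uffd), uffdioWake, uintptr(unsafe.Pointer(&rng)))
	if errno != 0 {
		return errno
	}
	return nil
}

// retryTracker counts consecutive fetch failures per faulting address,
// a stand-in for the PR's per-address retry counter.
type retryTracker struct {
	mu     sync.Mutex
	counts map[uintptr]int
}

func newRetryTracker() *retryTracker {
	return &retryTracker{counts: map[uintptr]int{}}
}

// shouldRetry increments the counter for addr and reports whether the
// fault should be retried via wake() or escalated to SignalExit.
func (t *retryTracker) shouldRetry(addr uintptr) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[addr]++
	return t.counts[addr] <= maxFaultRetries
}

// clear resets retry state for addr after a successful page copy.
func (t *retryTracker) clear(addr uintptr) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.counts, addr)
}

func main() {
	t := newRetryTracker()
	addr := uintptr(0x7f00_0000_0000)
	for i := 0; i < 4; i++ {
		fmt.Println(t.shouldRetry(addr)) // true, true, true, then false
	}
	t.clear(addr)
	fmt.Println(t.shouldRetry(addr)) // true: success resets the budget
}
```

The tracker-then-wake split matters: the decision to retry is pure bookkeeping, while the ioctl is the only kernel interaction, so the goroutine can return right after `wake` succeeds.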
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit f1255f2. Bugbot is set up for automated code reviews on this repo.
The concurrent-retry exhaustion bug (shared per-address counter across goroutines) materially undermines the retry mechanism's ability to tolerate transient GCS timeouts; this warrants human review before merge.
### Overview
This PR adds UFFDIO_WAKE support to retry transient failures in the userfaultfd fault handler, replacing immediate sandbox teardown with up to 3 retry cycles.

### Security risks
No authentication or permission-sensitive code is touched. The changes are confined to the uffd fault-handling subsystem.

### Level of scrutiny
High. This is production-critical, low-level kernel interface code. The retry mechanism is the entire value of the PR, and a concurrency flaw in the retry counter could cause premature sandbox termination under load, the exact scenario the PR is designed to handle.

### Other factors
The bug hunter identified a meaningful race: the per-address counter is shared across all concurrently spawned goroutines for the same fault address. With N concurrent vCPU faults, all N goroutines can increment the counter in one failure cycle, exhausting the budget before any actual retry cycle completes. A human reviewer should evaluate whether kernel-level coalescing prevents this in practice and, if not, how to scope the counter to retry cycles rather than individual goroutine attempts.
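One way the cycle-scoped counter could look, sketched under the assumption that the budget should count fetch cycles rather than goroutine attempts: an in-flight flag per address lets the first faulting goroutine own the cycle, while late arrivals return immediately and rely on the kernel re-delivering their fault after the owner's UFFDIO_WAKE or UFFDIO_COPY. All identifiers here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

const maxFaultRetries = 3

// cycleTracker scopes the retry budget to fetch cycles, not goroutines:
// concurrent faults on the same address join the in-flight cycle instead
// of incrementing the counter themselves.
type cycleTracker struct {
	mu       sync.Mutex
	counts   map[uintptr]int
	inFlight map[uintptr]bool
}

func newCycleTracker() *cycleTracker {
	return &cycleTracker{
		counts:   map[uintptr]int{},
		inFlight: map[uintptr]bool{},
	}
}

// beginCycle reports whether the caller owns the fetch cycle for addr.
// Non-owners simply return; the kernel re-delivers their fault once the
// owner resolves or wakes the page.
func (t *cycleTracker) beginCycle(addr uintptr) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inFlight[addr] {
		return false
	}
	t.inFlight[addr] = true
	return true
}

// failCycle records exactly one failed cycle (however many goroutines
// faulted during it) and reports whether the budget allows another retry.
func (t *cycleTracker) failCycle(addr uintptr) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight[addr] = false
	t.counts[addr]++
	return t.counts[addr] <= maxFaultRetries
}

func main() {
	t := newCycleTracker()
	addr := uintptr(0x7f00_0000_0000)
	// Three vCPUs fault the same page concurrently: only one owns the cycle,
	// so one storage failure costs one unit of budget, not three.
	owners := 0
	for i := 0; i < 3; i++ {
		if t.beginCycle(addr) {
			owners++
		}
	}
	fmt.Println(owners)            // 1
	fmt.Println(t.failCycle(addr)) // true: budget not exhausted
}
```

Whether this dedup is even needed depends on the open question above: if the kernel already coalesces faults on the same page into one uffd message, the per-goroutine counter may never over-count in practice.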
jakubno left a comment
lgtm, but let's wait for @ValentaTomas to check
Closing in favor of #2389.