-
Notifications
You must be signed in to change notification settings - Fork 282
fix(orchestrator): prefer running sandbox on host IP lookups #2387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
ac071b1
fix(orchestrator): prefer running sandbox on host IP lookups
dobrac 2601699
fix(orchestrator): use testify/require in tests and prefer starting o…
devin-ai-integration[bot] f4db401
fix(orchestrator): add t.Parallel() to satisfy paralleltest linter
devin-ai-integration[bot] File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| package sandbox | ||
|
|
||
| import ( | ||
| "net" | ||
| "testing" | ||
|
|
||
| "github.com/stretchr/testify/require" | ||
|
|
||
| "github.com/e2b-dev/infra/packages/orchestrator/pkg/sandbox/network" | ||
| ) | ||
|
|
||
| func TestGetByHostPortPrefersRunningSandbox(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| m := NewSandboxesMap() | ||
|
|
||
| stopping := &Sandbox{ | ||
| Resources: &Resources{ | ||
| Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")}, | ||
| }, | ||
| } | ||
| stopping.status.Store(int32(StatusStopping)) | ||
| m.sandboxes.Insert("stopping", stopping) | ||
|
|
||
| running := &Sandbox{ | ||
| Resources: &Resources{ | ||
| Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")}, | ||
| }, | ||
| } | ||
| running.status.Store(int32(StatusRunning)) | ||
| m.sandboxes.Insert("running", running) | ||
|
|
||
| sbx, err := m.GetByHostPort("10.11.0.2:2049") | ||
| require.NoError(t, err) | ||
| require.Same(t, running, sbx) | ||
| } | ||
|
|
||
| func TestGetByHostPortPrefersStartingOverStopping(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| m := NewSandboxesMap() | ||
|
|
||
| stopping := &Sandbox{ | ||
| Resources: &Resources{ | ||
| Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")}, | ||
| }, | ||
| } | ||
| stopping.status.Store(int32(StatusStopping)) | ||
| m.sandboxes.Insert("stopping", stopping) | ||
|
|
||
| starting := &Sandbox{ | ||
| Resources: &Resources{ | ||
| Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")}, | ||
| }, | ||
| } | ||
|
cursor[bot] marked this conversation as resolved.
|
||
| starting.status.Store(int32(StatusStarting)) | ||
| m.sandboxes.Insert("starting", starting) | ||
|
|
||
| sbx, err := m.GetByHostPort("10.11.0.2:2049") | ||
| require.NoError(t, err) | ||
| require.Same(t, starting, sbx) | ||
| } | ||
|
|
||
| func TestGetByHostPortFallsBackToStoppingSandbox(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| m := NewSandboxesMap() | ||
|
|
||
| stopping := &Sandbox{ | ||
| Resources: &Resources{ | ||
| Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.3")}, | ||
| }, | ||
| } | ||
| stopping.status.Store(int32(StatusStopping)) | ||
| m.sandboxes.Insert("stopping", stopping) | ||
|
|
||
| sbx, err := m.GetByHostPort("10.11.0.3:2049") | ||
| require.NoError(t, err) | ||
| require.Same(t, stopping, sbx) | ||
| } | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 Two defects remain in the
GetByHostPortfallback logic: (1) a TOCTOU race where the sandbox status is read twice via separate atomic loads—once insideIsRunning()and once in the fallback condition—allowing a concurrentMarkRunning()CAS to cause the function to return a staleStoppingsandbox instead of a newly-Runningone, directly defeating the PR's core security goal; (2) a non-deterministic overwrite where the conditionSandboxStatus(sbx.status.Load()) == StatusStartingunconditionally replaces any existingStartingfallback with anotherStartingsandbox (acknowledged but unaddressed in the PR checklist). Fix both by reading status once per iteration:status := SandboxStatus(sbx.status.Load()), using it for both theIsRunningcheck and the fallback condition, and guarding the fallback overwrite withstatus == StatusStarting && SandboxStatus(fallback.status.Load()) \!= StatusStarting.Extended reasoning...
Bug 1 — TOCTOU double-read race (normal severity)
The fallback logic in
GetByHostPort(map.go:105-114) makes two separate atomic loads of the same sandbox's status field.IsRunning()(sandbox.go:249) callss.status.Load()once, and the fallback condition on line 112 callssbx.status.Load()again. Because these are independent atomic operations with no lock between them, a concurrentMarkRunning()CAS (0→1) can execute between the two reads, making the two reads observe different values for the same field.Step-by-step proof of the race:
StatusStarting (0).sbx.IsRunning()→status.Load()→ returns0(Starting) →false→ does NOT early-return.MarkRunning()CAS → 0→1; sandbox X is nowStatusRunning (1).SandboxStatus(sbx.status.Load())→ returns1(Running, NOT Starting).fallback == nil || 1 == StatusStarting=fallback == nil || false.StatusStoppingsandbox was already encountered in an earlier iteration, sofallback \!= nil: neither branch is true →fallbackremains theStoppingsandbox.Runningsandbox is found → function returns theStoppingsandbox.This directly defeats the PR's primary security goal: preventing cross-tenant NFS volume access and TCP traffic misattribution by preferring
Runningsandboxes. ARunningsandbox exists on the IP but is silently bypassed.Bug 2 — Non-deterministic StatusStarting overwrite (nit)
The fallback condition
if fallback == nil || SandboxStatus(sbx.status.Load()) == StatusStartingis unconditional on the second branch: anyStatusStartingcandidate always overwrites the current fallback, even when the current fallback is already aStatusStartingsandbox. Because Go map iteration order is intentionally randomised, when twoStatusStartingsandboxes share the same IP, the one encountered last (randomly) wins.Step-by-step proof:
Suppose iteration order is [Starting-A, Starting-B] (or [Starting-B, Starting-A] — both orders possible):
fallback == nilis true →fallback = Starting-A.SandboxStatus(sbx.status.Load()) == StatusStartingis true →fallback = Starting-B(overwrites).In the reversed order, Starting-A would win instead. The PR description explicitly acknowledges this case in the Review Checklist but does not fix it.
While two
StatusStartingsandboxes on the same IP is rare (requires the async slot-return goroutine in sandbox.go:1264 to return the slot before the Remove callback fires, allowing a new sandbox to acquire the same slot), it is a reachable scenario as confirmed by code analysis of the cleanup/LIFO ordering.Impact: Both bugs affect the security-sensitive code paths that route NFS proxy and TCP firewall traffic. The TOCTOU race can cause a
Runningsandbox to be silently bypassed in favor of aStoppingone; the non-deterministic overwrite can route traffic to the wrong tenant when twoStartingsandboxes share an IP.Fix: Read status once per iteration and use the cached value for both checks:
This eliminates the TOCTOU window and ensures a
Startingsandbox is never overwritten by anotherStartingsandbox.