fix(orchestrator): prefer running sandbox on host IP lookups#2387
fix(orchestrator): prefer running sandbox on host IP lookups#2387
Conversation
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit f4db401. Bugbot is set up for automated code reviews on this repo. Configure here. |
…ver stopping in fallback - Switch tests to use require.NoError/require.Same to match repo conventions - In GetByHostPort fallback, prefer StatusStarting over StatusStopping so that when an IP slot is freed and reused, we route to the new sandbox - Add TestGetByHostPortPrefersStartingOverStopping test Co-Authored-By: jakub.dobry <jakub.dobry@e2b.dev>
Co-Authored-By: jakub.dobry <jakub.dobry@e2b.dev>
| if sbx.IsRunning() { | ||
| return sbx, nil | ||
| } | ||
|
|
||
| // Prefer a starting sandbox over a stopping one so that when an IP | ||
| // slot is freed by a stopping sandbox and immediately reused by a | ||
| // new starting sandbox, we route to the new sandbox. | ||
| if fallback == nil || SandboxStatus(sbx.status.Load()) == StatusStarting { | ||
| fallback = sbx | ||
| } |
There was a problem hiding this comment.
🔴 Two defects remain in the GetByHostPort fallback logic: (1) a TOCTOU race where the sandbox status is read twice via separate atomic loads—once inside IsRunning() and once in the fallback condition—allowing a concurrent MarkRunning() CAS to cause the function to return a stale Stopping sandbox instead of a newly-Running one, directly defeating the PR's core security goal; (2) a non-deterministic overwrite where the condition SandboxStatus(sbx.status.Load()) == StatusStarting unconditionally replaces any existing Starting fallback with another Starting sandbox (acknowledged but unaddressed in the PR checklist). Fix both by reading status once per iteration: status := SandboxStatus(sbx.status.Load()), using it for both the IsRunning check and the fallback condition, and guarding the fallback overwrite with status == StatusStarting && SandboxStatus(fallback.status.Load()) \!= StatusStarting.
Extended reasoning...
Bug 1 — TOCTOU double-read race (normal severity)
The fallback logic in GetByHostPort (map.go:105-114) makes two separate atomic loads of the same sandbox's status field. IsRunning() (sandbox.go:249) calls s.status.Load() once, and the fallback condition on line 112 calls sbx.status.Load() again. Because these are independent atomic operations with no lock between them, a concurrent MarkRunning() CAS (0→1) can execute between the two reads, making the two reads observe different values for the same field.
Step-by-step proof of the race:
- Map iteration reaches sandbox X whose status is
StatusStarting (0). - Thread A:
sbx.IsRunning()→status.Load()→ returns0(Starting) →false→ does NOT early-return. - Thread B (concurrent):
MarkRunning()CAS → 0→1; sandbox X is nowStatusRunning (1). - Thread A continues: fallback condition evaluates
SandboxStatus(sbx.status.Load())→ returns1(Running, NOT Starting). - Condition:
fallback == nil || 1 == StatusStarting=fallback == nil || false. - Suppose a
StatusStoppingsandbox was already encountered in an earlier iteration, sofallback \!= nil: neither branch is true →fallbackremains theStoppingsandbox. - No other
Runningsandbox is found → function returns theStoppingsandbox.
This directly defeats the PR's primary security goal: preventing cross-tenant NFS volume access and TCP traffic misattribution by preferring Running sandboxes. A Running sandbox exists on the IP but is silently bypassed.
Bug 2 — Non-deterministic StatusStarting overwrite (nit)
The fallback condition if fallback == nil || SandboxStatus(sbx.status.Load()) == StatusStarting is unconditional on the second branch: any StatusStarting candidate always overwrites the current fallback, even when the current fallback is already a StatusStarting sandbox. Because Go map iteration order is intentionally randomised, when two StatusStarting sandboxes share the same IP, the one encountered last (randomly) wins.
Step-by-step proof:
Suppose iteration order is [Starting-A, Starting-B] (or [Starting-B, Starting-A] — both orders possible):
- Iteration 1: fallback is nil → condition
fallback == nilis true →fallback = Starting-A. - Iteration 2:
SandboxStatus(sbx.status.Load()) == StatusStartingis true →fallback = Starting-B(overwrites).
In the reversed order, Starting-A would win instead. The PR description explicitly acknowledges this case in the Review Checklist but does not fix it.
While two StatusStarting sandboxes on the same IP is rare (requires the async slot-return goroutine in sandbox.go:1264 to return the slot before the Remove callback fires, allowing a new sandbox to acquire the same slot), it is a reachable scenario as confirmed by code analysis of the cleanup/LIFO ordering.
Impact: Both bugs affect the security-sensitive code paths that route NFS proxy and TCP firewall traffic. The TOCTOU race can cause a Running sandbox to be silently bypassed in favor of a Stopping one; the non-deterministic overwrite can route traffic to the wrong tenant when two Starting sandboxes share an IP.
Fix: Read status once per iteration and use the cached value for both checks:
status := SandboxStatus(sbx.status.Load())
if status == StatusRunning {
return sbx, nil
}
if fallback == nil || (status == StatusStarting && SandboxStatus(fallback.status.Load()) \!= StatusStarting) {
fallback = sbx
}This eliminates the TOCTOU window and ensures a Starting sandbox is never overwritten by another Starting sandbox.
Fixes GetByHostPort to handle IP slot reuse during sandbox lifecycle transitions. When a stopping sandbox frees an IP slot that is immediately claimed by a new starting sandbox, both temporarily coexist in the map with the same IP. Previously, GetByHostPort returned the first match (which could be the stale stopping sandbox). Now it applies a priority: