Skip to content

fix(orchestrator): prefer running sandbox on host IP lookups#2387

Closed
dobrac wants to merge 3 commits intomainfrom
codex/address-vulnerability-with-stopping-sandboxes
Closed

fix(orchestrator): prefer running sandbox on host IP lookups#2387
dobrac wants to merge 3 commits intomainfrom
codex/address-vulnerability-with-stopping-sandboxes

Conversation

@dobrac
Copy link
Copy Markdown
Contributor

@dobrac dobrac commented Apr 14, 2026

Fixes GetByHostPort to handle IP slot reuse during sandbox lifecycle transitions. When a stopping sandbox frees an IP slot that is immediately claimed by a new starting sandbox, both temporarily coexist in the map with the same IP. Previously, GetByHostPort returned the first match (which could be the stale stopping sandbox). Now it applies a priority:

  1. Running → return immediately
  2. Starting preferred over Stopping as fallback
  3. Stopping only if nothing better exists

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 14, 2026

PR Summary

Medium Risk
Changes IP-based sandbox routing behavior, which could impact traffic/volume attribution if selection logic is wrong. Added tests reduce regression risk, but this code path is security-sensitive (cross-tenant routing).

Overview
Updates Map.GetByHostPort to prefer StatusRunning sandboxes for a matching host IP, with a fallback to the best non-running match (favoring StatusStarting over StatusStopping) when no running sandbox exists, and adds unit tests covering running-vs-stopping preference, starting-vs-stopping preference, and stopping-only fallback.

Reviewed by Cursor Bugbot for commit f4db401. Bugbot is set up for automated code reviews on this repo. Configure here.

cursor[bot]

This comment was marked as resolved.

claude[bot]

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

devin-ai-integration bot and others added 2 commits April 14, 2026 22:32
…ver stopping in fallback

- Switch tests to use require.NoError/require.Same to match repo conventions
- In GetByHostPort fallback, prefer StatusStarting over StatusStopping so
  that when an IP slot is freed and reused, we route to the new sandbox
- Add TestGetByHostPortPrefersStartingOverStopping test

Co-Authored-By: jakub.dobry <jakub.dobry@e2b.dev>
Co-Authored-By: jakub.dobry <jakub.dobry@e2b.dev>
@dobrac dobrac marked this pull request as ready for review April 15, 2026 06:29
@dobrac dobrac added the bug Something isn't working label Apr 15, 2026
@dobrac dobrac closed this Apr 15, 2026
Comment on lines +105 to +114
if sbx.IsRunning() {
return sbx, nil
}

// Prefer a starting sandbox over a stopping one so that when an IP
// slot is freed by a stopping sandbox and immediately reused by a
// new starting sandbox, we route to the new sandbox.
if fallback == nil || SandboxStatus(sbx.status.Load()) == StatusStarting {
fallback = sbx
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Two defects remain in the GetByHostPort fallback logic: (1) a TOCTOU race where the sandbox status is read twice via separate atomic loads—once inside IsRunning() and once in the fallback condition—allowing a concurrent MarkRunning() CAS to cause the function to return a stale Stopping sandbox instead of a newly-Running one, directly defeating the PR's core security goal; (2) a non-deterministic overwrite where the condition SandboxStatus(sbx.status.Load()) == StatusStarting unconditionally replaces any existing Starting fallback with another Starting sandbox (acknowledged but unaddressed in the PR checklist). Fix both by reading status once per iteration: status := SandboxStatus(sbx.status.Load()), using it for both the IsRunning check and the fallback condition, and guarding the fallback overwrite with status == StatusStarting && SandboxStatus(fallback.status.Load()) \!= StatusStarting.

Extended reasoning...

Bug 1 — TOCTOU double-read race (normal severity)

The fallback logic in GetByHostPort (map.go:105-114) makes two separate atomic loads of the same sandbox's status field. IsRunning() (sandbox.go:249) calls s.status.Load() once, and the fallback condition on line 112 calls sbx.status.Load() again. Because these are independent atomic operations with no lock between them, a concurrent MarkRunning() CAS (0→1) can execute between the two reads, making the two reads observe different values for the same field.

Step-by-step proof of the race:

  1. Map iteration reaches sandbox X whose status is StatusStarting (0).
  2. Thread A: sbx.IsRunning()status.Load() → returns 0 (Starting) → false → does NOT early-return.
  3. Thread B (concurrent): MarkRunning() CAS → 0→1; sandbox X is now StatusRunning (1).
  4. Thread A continues: fallback condition evaluates SandboxStatus(sbx.status.Load()) → returns 1 (Running, NOT Starting).
  5. Condition: fallback == nil || 1 == StatusStarting = fallback == nil || false.
  6. Suppose a StatusStopping sandbox was already encountered in an earlier iteration, so fallback \!= nil: neither branch is true → fallback remains the Stopping sandbox.
  7. No other Running sandbox is found → function returns the Stopping sandbox.

This directly defeats the PR's primary security goal: preventing cross-tenant NFS volume access and TCP traffic misattribution by preferring Running sandboxes. A Running sandbox exists on the IP but is silently bypassed.

Bug 2 — Non-deterministic StatusStarting overwrite (nit)

The fallback condition if fallback == nil || SandboxStatus(sbx.status.Load()) == StatusStarting is unconditional on the second branch: any StatusStarting candidate always overwrites the current fallback, even when the current fallback is already a StatusStarting sandbox. Because Go map iteration order is intentionally randomised, when two StatusStarting sandboxes share the same IP, the one encountered last (randomly) wins.

Step-by-step proof:
Suppose iteration order is [Starting-A, Starting-B] (or [Starting-B, Starting-A] — both orders possible):

  • Iteration 1: fallback is nil → condition fallback == nil is true → fallback = Starting-A.
  • Iteration 2: SandboxStatus(sbx.status.Load()) == StatusStarting is true → fallback = Starting-B (overwrites).
    In the reversed order, Starting-A would win instead. The PR description explicitly acknowledges this case in the Review Checklist but does not fix it.

While two StatusStarting sandboxes on the same IP is rare (requires the async slot-return goroutine in sandbox.go:1264 to return the slot before the Remove callback fires, allowing a new sandbox to acquire the same slot), it is a reachable scenario as confirmed by code analysis of the cleanup/LIFO ordering.

Impact: Both bugs affect the security-sensitive code paths that route NFS proxy and TCP firewall traffic. The TOCTOU race can cause a Running sandbox to be silently bypassed in favor of a Stopping one; the non-deterministic overwrite can route traffic to the wrong tenant when two Starting sandboxes share an IP.

Fix: Read status once per iteration and use the cached value for both checks:

status := SandboxStatus(sbx.status.Load())
if status == StatusRunning {
    return sbx, nil
}
if fallback == nil || (status == StatusStarting && SandboxStatus(fallback.status.Load()) \!= StatusStarting) {
    fallback = sbx
}

This eliminates the TOCTOU window and ensures a Starting sandbox is never overwritten by another Starting sandbox.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

aardvark bug Something isn't working codex

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants