Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 19 additions & 2 deletions packages/orchestrator/pkg/sandbox/map.go
Original file line number Diff line number Diff line change
Expand Up @@ -88,17 +88,34 @@
}

// GetByHostPort looks up a sandbox by its host IP address parsed from hostPort.
// It matches any sandbox in the map (starting, running, or stopping).
// It prefers a running sandbox and only falls back to a non-running one when
// no running sandbox matches.
func (m *Map) GetByHostPort(hostPort string) (*Sandbox, error) {
reqIP, _, err := net.SplitHostPort(hostPort)
if err != nil {
return nil, fmt.Errorf("error parsing remote address %s: %w", hostPort, err)
}

var fallback *Sandbox
for _, sbx := range m.sandboxes.Items() {
if sbx.Slot.HostIPString() == reqIP {
if sbx.Slot.HostIPString() != reqIP {
continue
}

if sbx.IsRunning() {
return sbx, nil
}

// Prefer a starting sandbox over a stopping one so that when an IP
// slot is freed by a stopping sandbox and immediately reused by a
// new starting sandbox, we route to the new sandbox.
if fallback == nil || SandboxStatus(sbx.status.Load()) == StatusStarting {
fallback = sbx
}

Check failure on line 114 in packages/orchestrator/pkg/sandbox/map.go

View check run for this annotation

Claude / Claude Code Review

TOCTOU double-read and non-deterministic Starting overwrite in GetByHostPort fallback

Two defects remain in the `GetByHostPort` fallback logic: (1) a TOCTOU race where the sandbox status is read twice via separate atomic loads—once inside `IsRunning()` and once in the fallback condition—allowing a concurrent `MarkRunning()` CAS to cause the function to return a stale `Stopping` sandbox instead of a newly-`Running` one, directly defeating the PR's core security goal; (2) a non-deterministic overwrite where the condition `SandboxStatus(sbx.status.Load()) == StatusStarting` uncondit
Comment on lines +105 to +114
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Two defects remain in the GetByHostPort fallback logic: (1) a TOCTOU race where the sandbox status is read twice via separate atomic loads—once inside IsRunning() and once in the fallback condition—allowing a concurrent MarkRunning() CAS to cause the function to return a stale Stopping sandbox instead of a newly-Running one, directly defeating the PR's core security goal; (2) a non-deterministic overwrite where the condition SandboxStatus(sbx.status.Load()) == StatusStarting unconditionally replaces any existing Starting fallback with another Starting sandbox (acknowledged but unaddressed in the PR checklist). Fix both by reading status once per iteration: status := SandboxStatus(sbx.status.Load()), using it for both the IsRunning check and the fallback condition, and guarding the fallback overwrite with status == StatusStarting && SandboxStatus(fallback.status.Load()) \!= StatusStarting.

Extended reasoning...

Bug 1 — TOCTOU double-read race (normal severity)

The fallback logic in GetByHostPort (map.go:105-114) makes two separate atomic loads of the same sandbox's status field. IsRunning() (sandbox.go:249) calls s.status.Load() once, and the fallback condition on line 112 calls sbx.status.Load() again. Because these are independent atomic operations with no lock between them, a concurrent MarkRunning() CAS (0→1) can execute between the two reads, making the two reads observe different values for the same field.

Step-by-step proof of the race:

  1. Map iteration reaches sandbox X whose status is StatusStarting (0).
  2. Thread A: sbx.IsRunning()status.Load() → returns 0 (Starting) → false → does NOT early-return.
  3. Thread B (concurrent): MarkRunning() CAS → 0→1; sandbox X is now StatusRunning (1).
  4. Thread A continues: fallback condition evaluates SandboxStatus(sbx.status.Load()) → returns 1 (Running, NOT Starting).
  5. Condition: fallback == nil || 1 == StatusStarting = fallback == nil || false.
  6. Suppose a StatusStopping sandbox was already encountered in an earlier iteration, so fallback \!= nil: neither branch is true → fallback remains the Stopping sandbox.
  7. No other Running sandbox is found → function returns the Stopping sandbox.

This directly defeats the PR's primary security goal: preventing cross-tenant NFS volume access and TCP traffic misattribution by preferring Running sandboxes. A Running sandbox exists on the IP but is silently bypassed.

Bug 2 — Non-deterministic StatusStarting overwrite (nit)

The fallback condition if fallback == nil || SandboxStatus(sbx.status.Load()) == StatusStarting is unconditional on the second branch: any StatusStarting candidate always overwrites the current fallback, even when the current fallback is already a StatusStarting sandbox. Because Go map iteration order is intentionally randomised, when two StatusStarting sandboxes share the same IP, the one encountered last (randomly) wins.

Step-by-step proof:
Suppose iteration order is [Starting-A, Starting-B] (or [Starting-B, Starting-A] — both orders possible):

  • Iteration 1: fallback is nil → condition fallback == nil is true → fallback = Starting-A.
  • Iteration 2: SandboxStatus(sbx.status.Load()) == StatusStarting is true → fallback = Starting-B (overwrites).
    In the reversed order, Starting-A would win instead. The PR description explicitly acknowledges this case in the Review Checklist but does not fix it.

While two StatusStarting sandboxes on the same IP is rare (requires the async slot-return goroutine in sandbox.go:1264 to return the slot before the Remove callback fires, allowing a new sandbox to acquire the same slot), it is a reachable scenario as confirmed by code analysis of the cleanup/LIFO ordering.

Impact: Both bugs affect the security-sensitive code paths that route NFS proxy and TCP firewall traffic. The TOCTOU race can cause a Running sandbox to be silently bypassed in favor of a Stopping one; the non-deterministic overwrite can route traffic to the wrong tenant when two Starting sandboxes share an IP.

Fix: Read status once per iteration and use the cached value for both checks:

status := SandboxStatus(sbx.status.Load())
if status == StatusRunning {
    return sbx, nil
}
if fallback == nil || (status == StatusStarting && SandboxStatus(fallback.status.Load()) \!= StatusStarting) {
    fallback = sbx
}

This eliminates the TOCTOU window and ensures a Starting sandbox is never overwritten by another Starting sandbox.

}

if fallback != nil {
return fallback, nil
}

return nil, fmt.Errorf("sandbox with address %s not found", hostPort)
Expand Down
80 changes: 80 additions & 0 deletions packages/orchestrator/pkg/sandbox/map_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
package sandbox

import (
"net"
"testing"

"github.com/stretchr/testify/require"

"github.com/e2b-dev/infra/packages/orchestrator/pkg/sandbox/network"
)

func TestGetByHostPortPrefersRunningSandbox(t *testing.T) {
t.Parallel()

m := NewSandboxesMap()

stopping := &Sandbox{
Resources: &Resources{
Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")},
},
}
stopping.status.Store(int32(StatusStopping))
m.sandboxes.Insert("stopping", stopping)

running := &Sandbox{
Resources: &Resources{
Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")},
},
}
running.status.Store(int32(StatusRunning))
m.sandboxes.Insert("running", running)

sbx, err := m.GetByHostPort("10.11.0.2:2049")
require.NoError(t, err)
require.Same(t, running, sbx)
}

func TestGetByHostPortPrefersStartingOverStopping(t *testing.T) {
t.Parallel()

m := NewSandboxesMap()

stopping := &Sandbox{
Resources: &Resources{
Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")},
},
}
stopping.status.Store(int32(StatusStopping))
m.sandboxes.Insert("stopping", stopping)

starting := &Sandbox{
Resources: &Resources{
Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.2")},
},
}
Comment thread
cursor[bot] marked this conversation as resolved.
starting.status.Store(int32(StatusStarting))
m.sandboxes.Insert("starting", starting)

sbx, err := m.GetByHostPort("10.11.0.2:2049")
require.NoError(t, err)
require.Same(t, starting, sbx)
}

func TestGetByHostPortFallsBackToStoppingSandbox(t *testing.T) {
t.Parallel()

m := NewSandboxesMap()

stopping := &Sandbox{
Resources: &Resources{
Slot: &network.Slot{HostIP: net.ParseIP("10.11.0.3")},
},
}
stopping.status.Store(int32(StatusStopping))
m.sandboxes.Insert("stopping", stopping)

sbx, err := m.GetByHostPort("10.11.0.3:2049")
require.NoError(t, err)
require.Same(t, stopping, sbx)
}
Loading