Background
We ran cold-start benchmarks against the Turbobox sandbox API (sandbox.trilok.ai) to get accurate intra-datacenter numbers for the /agents page. Here's what we found, what failed, and the best path to a clean measurement.
What we tried
1. External (Mac → sandbox.trilok.ai)
Results from a developer machine over the public internet:
- Create p50: ~865ms
- First exec p50: ~471ms
- Total (create + exec) p50: ~1356ms (the p50 of per-trial totals, which need not equal the sum of the component p50s)
These numbers include transatlantic RTT and are not representative of agent-to-sandbox latency when the agent is co-located.
2. External SSH server (Hetzner FSN1 → sandbox.trilok.ai)
We SSH'd into a Hetzner server at 65.109.61.142 (Falkenstein, Germany) and ran the same benchmark:
- Create p50: ~1318ms
- First exec p50: ~197ms
- Total p50: ~1515ms
Exec latency dropped dramatically vs. the Mac (197ms vs 471ms), but create time stayed high, in fact higher than from the Mac. The Hetzner node is in Falkenstein, not in the same datacenter as the sandbox fleet, so the create path is still external-RTT-inflated.
3. Runner box as orchestrator (attempted, blocked)
We tried spinning up a sandbox box and using it to drive its own benchmark against the API — the cleanest possible intra-datacenter measurement. This was blocked by tool availability inconsistency across host VMs:
- alpine image: no apk, no python3, no bash, no date
- ubuntu image: bash, date, and python3 missing on some hosts, present on others
- No reliable way to know which host VM a box lands on
The inconsistency appears to be a VM pool heterogeneity issue — different hosts have different base images or different provisioning states.
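A quick way to confirm the drift, once a box is reachable at all, is to probe for the tools directly. A minimal sketch in Python, using the /v1/boxes and /v1/boxes/{id}/exec endpoint paths referenced elsewhere in this note; the auth header name, the TURBOBOX_API_KEY variable, and the request/response field shapes are guesses, not the documented API:

```python
# Hypothetical tool-availability probe. Endpoint paths match the ones
# referenced in this note; the auth header, exec payload shape, and
# response fields ("id", "exit_code") are assumptions about the real API.
import os
import requests

API = "https://sandbox.trilok.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['TURBOBOX_API_KEY']}"}

def probe(image, tools=("bash", "python3", "date", "apk", "apt-get")):
    """Create a box and report which tools its host VM actually has."""
    box = requests.post(f"{API}/v1/boxes", json={"image": image},
                        headers=HEADERS, timeout=30).json()
    found = {}
    for tool in tools:
        r = requests.post(f"{API}/v1/boxes/{box['id']}/exec",
                          json={"command": ["sh", "-c", f"command -v {tool}"]},
                          headers=HEADERS, timeout=30)
        found[tool] = r.json().get("exit_code") == 0
    return found

if __name__ == "__main__":
    for image in ("alpine", "ubuntu"):
        print(image, probe(image))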
What we know the real number is
48ms — create + first exec, intra-datacenter, p50. This figure is confirmed by internal tooling. The external benchmarks above reflect internet latency, not the true sandbox cold-start.
What can be done
- Pre-built static benchmark binary in the image — ship a bench binary in the sandbox image so it's available on every box regardless of host state. The binary does: time(POST /v1/boxes) + time(POST /v1/boxes/{id}/exec), writes JSON to /tmp/results.json, done. No apt/apk needed. (See the sketch after this list.)
- Dedicated benchmark endpoint — expose POST /v1/benchmark that returns cold-start timing measured server-side, removing client-side network latency from the measurement entirely.
- Image consistency audit — verify that all hosts in the pool have the same base image. The ubuntu inconsistency (some hosts have full ubuntu, some minimal) suggests pool drift.
- Public benchmark runner — a hosted script (via GitHub Actions or a co-located VM) that runs daily and pushes results to this repo, so the numbers in README.md stay fresh.
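For the first item, the measurement core is tiny. A minimal sketch of the logic in Python; the shipped artifact would be a statically linked binary so that no interpreter or package manager is needed on the box. Endpoint paths are the ones named above; the auth header and payload/response shapes are the same assumptions as in the probe sketch:

```python
# Sketch of the logic the proposed bench binary would embed: time box
# creation, time the first exec, dump JSON to /tmp/results.json.
# The "id" response field and exec payload shape are assumptions.
import json
import os
import time
import requests

API = "https://sandbox.trilok.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['TURBOBOX_API_KEY']}"}

def one_trial():
    t0 = time.perf_counter()
    box = requests.post(f"{API}/v1/boxes", json={"image": "ubuntu"},
                        headers=HEADERS, timeout=30).json()
    t1 = time.perf_counter()
    requests.post(f"{API}/v1/boxes/{box['id']}/exec",
                  json={"command": ["true"]}, headers=HEADERS, timeout=30)
    t2 = time.perf_counter()
    return {"create_ms": (t1 - t0) * 1e3,
            "exec_ms": (t2 - t1) * 1e3,
            "total_ms": (t2 - t0) * 1e3}

if __name__ == "__main__":
    with open("/tmp/results.json", "w") as f:
        json.dump([one_trial() for _ in range(20)], f)
```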
Benchmark scripts
All scripts are in /turbobox/ in the codegraff repo:
- bench.py — Python + subprocess curl, records create/exec/total per trial
- bench.sh — pure bash, requires date +%s%3N (GNU date) and curl
- bench_intra.py — orchestrator that tries to use a runner box (blocked by missing tools)
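For reference, the per-trial records reduce to the quoted p50 figures with a plain median. A minimal sketch, assuming the /tmp/results.json layout from the bench sketch above (bench.py's actual aggregation may differ):

```python
# Reduce per-trial timings to p50s. Assumes the /tmp/results.json
# layout from the bench sketch above.
import json
from statistics import median

with open("/tmp/results.json") as f:
    trials = json.load(f)

for key in ("create_ms", "exec_ms", "total_ms"):
    print(f"{key} p50: {median(t[key] for t in trials):.0f}ms")
```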
Filed from codegraff.com /agents page benchmarking work, Apr 2026.