Background
We ran cold-start benchmarks against the Turbobox sandbox API (sandbox.trilok.ai) to get accurate intra-datacenter numbers for the /agents page. Here's what we found, what failed, and the best path to a clean measurement.
What we tried
1. External (Mac → sandbox.trilok.ai)
Results from a developer machine over the public internet:
- Create p50: ~865ms
- First exec p50: ~471ms
- Total (create + exec) p50: ~1356ms (the p50 of per-trial totals, which need not equal the sum of the component p50s)
These numbers include transatlantic RTT and are not representative of agent-to-sandbox latency when the agent is co-located.
2. External SSH server (Hetzner FSN1 → sandbox.trilok.ai)
We SSH'd into a Hetzner server at 65.109.61.142 (Falkenstein, Germany) and ran the same benchmark:
- Create p50: ~1318ms
- First exec p50: ~197ms
- Total p50: ~1515ms
Exec latency dropped dramatically vs. the Mac (197ms vs 471ms), but create time stayed high, in fact higher than from the Mac. The Hetzner node is in Falkenstein, not in the same datacenter as the sandbox fleet, so the create path is still external-RTT-inflated.
3. Runner box as orchestrator (attempted, blocked)
We tried spinning up a sandbox box and using it to drive its own benchmark against the API — the cleanest possible intra-datacenter measurement. This was blocked by tool availability inconsistency across host VMs:
- alpine image: no apk, no python3, no bash, no date
- ubuntu image: bash, date, and python3 missing on some hosts, present on others
- No reliable way to know which host VM a box lands on
The inconsistency appears to be a VM pool heterogeneity issue — different hosts have different base images or different provisioning states.
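A quick way to confirm the drift, once a box is reachable at all, is to probe for the tools directly. A minimal sketch in Python, using the /v1/boxes and /v1/boxes/{id}/exec endpoint paths referenced elsewhere in this note; the auth header name, the TURBOBOX_API_KEY variable, and the request/response field shapes are guesses, not the documented API:

```python
# Hypothetical tool-availability probe. Endpoint paths match the ones
# referenced in this note; the auth header, exec payload shape, and
# response fields ("id", "exit_code") are assumptions about the real API.
import os
import requests

API = "https://sandbox.trilok.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['TURBOBOX_API_KEY']}"}

def probe(image, tools=("bash", "python3", "date", "apk", "apt-get")):
    """Create a box and report which tools its host VM actually has."""
    box = requests.post(f"{API}/v1/boxes", json={"image": image},
                        headers=HEADERS, timeout=30).json()
    found = {}
    for tool in tools:
        r = requests.post(f"{API}/v1/boxes/{box['id']}/exec",
                          json={"command": ["sh", "-c", f"command -v {tool}"]},
                          headers=HEADERS, timeout=30)
        found[tool] = r.json().get("exit_code") == 0
    return found

if __name__ == "__main__":
    for image in ("alpine", "ubuntu"):
        print(image, probe(image))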
What we know the real number is
48ms — create + first exec, intra-datacenter, p50. This figure is confirmed by internal tooling. The external benchmarks above reflect internet latency, not the true sandbox cold-start.
What can be done
- Pre-built static benchmark binary in the image — ship a bench binary in the sandbox image so it's available on every box regardless of host state. The binary does: time(POST /v1/boxes) + time(POST /v1/boxes/{id}/exec), writes JSON to /tmp/results.json, done. No apt/apk needed. (See the sketch after this list.)
- Dedicated benchmark endpoint — expose POST /v1/benchmark that returns cold-start timing measured server-side, removing client-side network latency from the measurement entirely.
- Image consistency audit — verify that all hosts in the pool have the same base image. The ubuntu inconsistency (some hosts have full ubuntu, some minimal) suggests pool drift.
- Public benchmark runner — a hosted script (via GitHub Actions or a co-located VM) that runs daily and pushes results to this repo, so the numbers in README.md stay fresh.
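For the first item, the measurement core is tiny. A minimal sketch of the logic in Python; the shipped artifact would be a statically linked binary so that no interpreter or package manager is needed on the box. Endpoint paths are the ones named above; the auth header and payload/response shapes are the same assumptions as in the probe sketch:

```python
# Sketch of the logic the proposed bench binary would embed: time box
# creation, time the first exec, dump JSON to /tmp/results.json.
# The "id" response field and exec payload shape are assumptions.
import json
import os
import time
import requests

API = "https://sandbox.trilok.ai"
HEADERS = {"Authorization": f"Bearer {os.environ['TURBOBOX_API_KEY']}"}

def one_trial():
    t0 = time.perf_counter()
    box = requests.post(f"{API}/v1/boxes", json={"image": "ubuntu"},
                        headers=HEADERS, timeout=30).json()
    t1 = time.perf_counter()
    requests.post(f"{API}/v1/boxes/{box['id']}/exec",
                  json={"command": ["true"]}, headers=HEADERS, timeout=30)
    t2 = time.perf_counter()
    return {"create_ms": (t1 - t0) * 1e3,
            "exec_ms": (t2 - t1) * 1e3,
            "total_ms": (t2 - t0) * 1e3}

if __name__ == "__main__":
    with open("/tmp/results.json", "w") as f:
        json.dump([one_trial() for _ in range(20)], f)
```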
Benchmark scripts
All scripts are in /turbobox/ in the codegraff repo:
- bench.py — Python + subprocess curl, records create/exec/total per trial
- bench.sh — pure bash, requires date +%s%3N (GNU date) and curl
- bench_intra.py — orchestrator that tries to use a runner box (blocked by missing tools)
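For reference, the per-trial records reduce to the quoted p50 figures with a plain median. A minimal sketch, assuming the /tmp/results.json layout from the bench sketch above (bench.py's actual aggregation may differ):

```python
# Reduce per-trial timings to p50s. Assumes the /tmp/results.json
# layout from the bench sketch above.
import json
from statistics import median

with open("/tmp/results.json") as f:
    trials = json.load(f)

for key in ("create_ms", "exec_ms", "total_ms"):
    print(f"{key} p50: {median(t[key] for t in trials):.0f}ms")
```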
Filed from codegraff.com /agents page benchmarking work, Apr 2026.