Intra-datacenter cold-start benchmark: methodology + what we found #88

@justrach

Background

We ran cold-start benchmarks against the Turbobox sandbox API (sandbox.trilok.ai) to get accurate intra-datacenter numbers for the /agents page. Here's what we found, what failed, and the best path to a clean measurement.

What we tried

1. External (Mac → sandbox.trilok.ai)

Results from a developer machine over the public internet:

  • Create p50: ~865ms
  • First exec p50: ~471ms
  • Total (create + exec) p50: ~1356ms

These numbers include transatlantic RTT and are not representative of agent-to-sandbox latency when the agent is co-located.
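
For context on methodology: each trial is just two timed HTTP calls, create then exec. A minimal sketch of that loop (the endpoint paths are the ones named later in this issue; the base URL, auth env var, request payload, and response shape are assumptions):

```python
import json, os, statistics, subprocess, time

BASE = "https://sandbox.trilok.ai"  # assumed base URL
AUTH = ["-H", f"Authorization: Bearer {os.environ['SANDBOX_API_KEY']}"]  # assumed auth scheme

def timed_curl(method, path, body=None):
    # Times the whole curl invocation; process spawn adds a millisecond
    # or two of noise, which is negligible at these latencies.
    cmd = ["curl", "-sS", "-X", method, BASE + path, *AUTH]
    if body is not None:
        cmd += ["-H", "Content-Type: application/json", "-d", json.dumps(body)]
    t0 = time.perf_counter()
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return (time.perf_counter() - t0) * 1000, out

creates, execs, totals = [], [], []
for _ in range(20):
    t_create, resp = timed_curl("POST", "/v1/boxes")
    box_id = json.loads(resp)["id"]  # assumed response field
    t_exec, _ = timed_curl("POST", f"/v1/boxes/{box_id}/exec", {"cmd": "true"})  # assumed payload
    creates.append(t_create)
    execs.append(t_exec)
    totals.append(t_create + t_exec)

for name, xs in [("create", creates), ("exec", execs), ("total", totals)]:
    print(f"{name} p50: {statistics.median(xs):.0f}ms")
```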

2. External SSH server (Hetzner FSN1 → sandbox.trilok.ai)

We SSH'd into a Hetzner server at 65.109.61.142 (Falkenstein, Germany) and ran the same benchmark:

  • Create p50: ~1318ms
  • First exec p50: ~197ms
  • Total p50: ~1515ms

Exec latency dropped dramatically vs. the Mac (197ms vs. 471ms), but create time stayed high: the Hetzner node is in Falkenstein, not in the same datacenter as the sandbox fleet, so these numbers are still inflated by external RTT.

3. Runner box as orchestrator (attempted, blocked)

We tried spinning up a sandbox box and using it to drive its own benchmark against the API, which would be the cleanest possible intra-datacenter measurement. This was blocked by inconsistent tool availability across host VMs:

  • alpine image: no apk, no python3, no bash, no date
  • ubuntu image: missing bash, date, python3 on some hosts; present on others
  • No reliable way to know which host VM a box lands on

The inconsistency appears to be VM pool heterogeneity: different hosts have different base images or different provisioning states.
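
Short of fixing the pool, the orchestrator could probe each fresh box before committing to it, so a run fails fast instead of dying mid-benchmark. A sketch, assuming the exec endpoint runs commands through a POSIX shell and that the payload and response fields look roughly like this:

```python
import json, os, subprocess

BASE = "https://sandbox.trilok.ai"
AUTH = ["-H", f"Authorization: Bearer {os.environ['SANDBOX_API_KEY']}"]  # assumed

def exec_in_box(box_id, cmd):
    # POST /v1/boxes/{id}/exec is the endpoint named in this issue;
    # the request payload and response fields are assumptions.
    out = subprocess.run(
        ["curl", "-sS", "-X", "POST", f"{BASE}/v1/boxes/{box_id}/exec",
         *AUTH, "-H", "Content-Type: application/json",
         "-d", json.dumps({"cmd": cmd})],
        capture_output=True, text=True, check=True).stdout
    return json.loads(out)

def probe(box_id, tools=("bash", "date", "python3", "curl", "apk", "apt-get")):
    # `command -v` is POSIX, so this works even on boxes without bash,
    # provided exec runs the command through a POSIX shell.
    return {t: exec_in_box(box_id, f"command -v {t}").get("exit_code") == 0
            for t in tools}
```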

What we know the real number is

48ms: create + first exec, intra-datacenter, p50, confirmed by internal tooling. The external benchmarks above reflect internet latency, not the true sandbox cold-start.

What can be done

  1. Pre-built static benchmark binary in the image — ship a bench binary in the sandbox image so it's available on every box regardless of host state. The binary does: time(POST /v1/boxes) + time(POST /v1/boxes/{id}/exec), writes JSON to /tmp/results.json, done. No apt/apk needed.

  2. Dedicated benchmark endpoint — expose POST /v1/benchmark that returns cold-start timing measured server-side, eliminating network measurement entirely (sketched after this list).

  3. Image consistency audit — verify that all hosts in the pool have the same base image. The ubuntu inconsistency (some have full ubuntu, some have minimal) suggests pool drift.

  4. Public benchmark runner — a hosted script (via GitHub Actions or a co-located VM) that runs daily and pushes results to this repo, so the numbers in README.md stay fresh.
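
Of these, option 2 is the most self-contained. A minimal sketch of what the handler could look like, assuming a FastAPI-style service; create_box and exec_in_box are hypothetical stand-ins for whatever internal functions the sandbox service actually uses:

```python
import time
from fastapi import FastAPI

app = FastAPI()

async def create_box() -> str:
    # Hypothetical stand-in for the service's internal create path.
    raise NotImplementedError

async def exec_in_box(box_id: str, cmd: str) -> None:
    # Hypothetical stand-in for the service's internal exec path.
    raise NotImplementedError

@app.post("/v1/benchmark")
async def benchmark():
    # Time the same create + first-exec sequence the client benchmarks
    # measure, but entirely server-side, so no network RTT is included.
    t0 = time.perf_counter()
    box_id = await create_box()
    t1 = time.perf_counter()
    await exec_in_box(box_id, "true")
    t2 = time.perf_counter()
    return {
        "create_ms": round((t1 - t0) * 1000, 1),
        "first_exec_ms": round((t2 - t1) * 1000, 1),
        "total_ms": round((t2 - t0) * 1000, 1),
    }
```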

Benchmark scripts

All scripts are in /turbobox/ in the codegraff repo:

  • bench.py — Python + subprocess curl, records create/exec/total per trial
  • bench.sh — pure bash, requires date +%s%3N (GNU date) and curl
  • bench_intra.py — orchestrator that tries to use a runner box (blocked by missing tools)

Filed from codegraff.com /agents page benchmarking work, Apr 2026.
