bench harness: sim crashes during teardown + workload too light to validate hot-path perf claims #79

@DFearing

Description

Context

While verifying #71 (Perf WG-1) on PR #77, the bench harness at bench/run-bench.mjs failed to validate the perf claims end-to-end. Two infrastructure gaps surfaced:

Gap 1: Sim crashes mid-rep on subagent file appends

Every rep of node run-bench.mjs --throttle-only --reps=3 --stack <stack> produced this fatal error during teardown:

```
[sim] fatal: Error: ENOENT: no such file or directory, open '/tmp/agent-flow-bench/.sim-sessions//subagents/.jsonl'
at Object.writeFileSync (node:fs:2437:20)
at Object.appendFileSync (node:fs:2519:6)
at appendJsonl (scripts/sim/runner.ts:108:6)
at appendSubEntry (scripts/sim/runner.ts:388:5)
```

Root cause: `appendJsonl` at scripts/sim/runner.ts:108 calls `fs.appendFileSync` without first creating the parent directory (no `mkdir -p`). The bench harness wipes `.sim-sessions/` between reps (run-bench.mjs:141); if the next rep's sim spawns subagents before the `spawnSubagent` path's own `ensureDir` has run, or if a residual write from a prior rep lands after the wipe, the append crashes.

The 90s measurement window completed cleanly before the crash in every rep, so metrics were captured — but the workload was potentially truncated (subagent spawning incomplete), which biases the result.

Gap 2: Bench metrics don't include atlas pages or scene-graph child count

The plan for #71 specified validation against:

  • glyph atlas page count plateau (CR-2)
  • scene-graph child count plateau (CR-3)
  • draw-call count not regressed ≥10% (CR-1 fallback rule)

bench/instrumentation.js doesn't expose any of these. Currently the harness measures FPS, frame p50/p95/p99, long tasks, scripting/task/layout/recalc time, heap, React commits — none of which directly catch the regressions issue #71's three Critical findings target.
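The missing counters could be exposed with a small per-layer hook, roughly like this (a sketch; `publishBenchCounters`, `atlas.pages`, and the `__pixi*` global names are assumptions, not existing code):

```javascript
// Hypothetical per-layer hook: after each update, publish the counts to
// globals the bench harness can poll from the page context.
function publishBenchCounters(atlas, stage) {
  const g = globalThis; // `window` in the browser
  g.__pixiAtlasPageCount = atlas.pages.length; // atlas.pages is assumed
  g.__pixiSceneGraphSize = countChildren(stage); // recursive child count
}

// Count all descendants of a display-object tree.
function countChildren(node) {
  let n = node.children?.length ?? 0;
  for (const child of node.children ?? []) n += countChildren(child);
  return n;
}
```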

Gap 3: Default workload doesn't trigger long-running-session conditions

Bench results from PR #77 (4× CPU, 3 sessions, 90s × 3 reps):

| Metric | A-base | C-pr1 | Delta |
| --- | --- | --- | --- |
| FPS | 58.9 (sd 0.0) | 58.9 (sd 0.0) | 0.0% |
| Frame p95 | 17.7ms (sd 0.0) | 17.7ms (sd 0.0) | 0.0% |
| Long tasks | 0 | 0 | |
| Scripting | 2098ms | 2387ms | +13.8% |
| Heap peak | 8.9 MB | 9.2 MB | +3.4% |

`sd 0.0` on FPS and p95 across 3 reps in both stacks indicates headless Chromium is hitting the 60fps vsync ceiling despite `--disable-frame-rate-limit`. The workload doesn't saturate either stack, so the perf wins (atlas churn cap, scene-graph pruning, allocation reuse) never get tested. The +13.8% scripting cost reflects new caching infrastructure overhead that isn't amortized when there's no churn.

Suggested fixes

  1. Sim ENOENT race (Gap 1): Make `appendJsonl` mkdir-p the parent dir before `appendFileSync`. Trivial 2-line fix in `scripts/sim/runner.ts`. Alternatively, in the bench harness, kill the prior rep's sim cleanly before running `rmrf` on `.sim-sessions/`.
  2. Atlas/scene-graph instrumentation (Gap 2): Add to `bench/instrumentation.js` a periodic snapshot of `window.__pixiAtlasPageCount` and `window.__pixiSceneGraphSize` (need corresponding exposure hooks in the pixi layers — small one-liner per layer that writes the count to a global on each update). Compute end-of-window plateau or growth-rate from the samples.
  3. Long-session scenario (Gap 3): Add a scenario or flag that runs ≥5 min with persistent agents (long `timeAlive`) and high subagent churn so atlas/scene-graph growth actually manifest. The current 90s × 3 sessions × 3 subagents is too short to fill the atlas (~9000 unique entries needed per the issue's CR-2 description) or grow the scene graph past noise.
  4. Vsync uncap on this Chromium version: Confirm `--disable-frame-rate-limit --disable-gpu-vsync` actually removes the cap, or switch to a windowed (non-headless) run for FPS-sensitive measurements.
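The plateau check from fix 2 could be a small pure helper over the periodic samples, e.g. (a sketch; `plateauStats` and its tolerance threshold are assumptions, not existing harness code):

```javascript
// Hypothetical plateau check for bench/instrumentation.js: given periodic
// samples of a counter (e.g. the proposed atlas page count), measure growth
// over the final third of the window. Near-zero growth means plateau.
function plateauStats(samples, tolerance = 0.5) {
  const tailLen = Math.max(2, Math.ceil(samples.length / 3));
  const tail = samples.slice(-tailLen);
  const growthPerSample = (tail[tail.length - 1] - tail[0]) / (tail.length - 1);
  return { growthPerSample, plateaued: Math.abs(growthPerSample) <= tolerance };
}
```

Operating on recorded samples rather than live timers keeps the check deterministic and lets the same helper serve both the atlas (CR-2) and scene-graph (CR-3) series.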

Why this matters

Without these fixes, the bench gate from #46's perf review series (#71 is WG-1 of 5) cannot quantitatively validate any of the WG PRs. The unit tests in #77 pin the perf invariants directly (entry pool plateau, glyph cap, alpha-bucket count, etc.), but those test the mechanism, not the end-user-visible delta. Future perf PRs need an A/B harness that actually measures what they claim to fix.

Filed during verification of #77 (Perf WG-1). Not blocking #77's merge — but the next WG PR will hit the same wall.
