bench harness: sim crashes during teardown + workload too light to validate hot-path perf claims #79

@DFearing

Description

Context

While verifying #71 (Perf WG-1) on PR #77, the bench harness at bench/run-bench.mjs failed to validate the perf claims end-to-end. Two infrastructure gaps surfaced:

Gap 1: Sim crashes mid-rep on subagent file appends

Every rep of node run-bench.mjs --throttle-only --reps=3 --stack <stack> produced this fatal error during teardown:

```
[sim] fatal: Error: ENOENT: no such file or directory, open '/tmp/agent-flow-bench/.sim-sessions//subagents/.jsonl'
at Object.writeFileSync (node:fs:2437:20)
at Object.appendFileSync (node:fs:2519:6)
at appendJsonl (scripts/sim/runner.ts:108:6)
at appendSubEntry (scripts/sim/runner.ts:388:5)
```

Root cause: `appendJsonl` at scripts/sim/runner.ts:108 calls `fs.appendFileSync` without first creating the parent directory (no `mkdir -p`). The bench harness wipes `.sim-sessions/` between reps (run-bench.mjs:141); if the next rep's sim spawns subagents before the `spawnSubagent` path's own `ensureDir` has run, or if a residual write from a prior rep lands after the wipe, the append crashes.

The 90s measurement window completed cleanly before the crash in every rep, so metrics were captured — but the workload was potentially truncated (subagent spawning incomplete), which biases the result.

Gap 2: Bench metrics don't include atlas pages or scene-graph child count

The plan for #71 specified validation against:

  • glyph atlas page count plateau (CR-2)
  • scene-graph child count plateau (CR-3)
  • draw-call count not regressed ≥10% (CR-1 fallback rule)

bench/instrumentation.js doesn't expose any of these. Currently the harness measures FPS, frame p50/p95/p99, long tasks, scripting/task/layout/recalc time, heap, React commits — none of which directly catch the regressions issue #71's three Critical findings target.
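The missing counters could be exposed with a small per-layer hook, roughly like this (a sketch; `publishBenchCounters`, `atlas.pages`, and the `__pixi*` global names are assumptions, not existing code):

```javascript
// Hypothetical per-layer hook: after each update, publish the counts to
// globals the bench harness can poll from the page context.
function publishBenchCounters(atlas, stage) {
  const g = globalThis; // `window` in the browser
  g.__pixiAtlasPageCount = atlas.pages.length; // atlas.pages is assumed
  g.__pixiSceneGraphSize = countChildren(stage); // recursive child count
}

// Count all descendants of a display-object tree.
function countChildren(node) {
  let n = node.children?.length ?? 0;
  for (const child of node.children ?? []) n += countChildren(child);
  return n;
}
```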

Gap 3: Default workload doesn't trigger long-running-session conditions

Bench results from PR #77 (4× CPU, 3 sessions, 90s × 3 reps):

| Metric | A-base | C-pr1 | Delta |
| --- | --- | --- | --- |
| FPS | 58.9 (sd 0.0) | 58.9 (sd 0.0) | 0.0% |
| Frame p95 | 17.7ms (sd 0.0) | 17.7ms (sd 0.0) | 0.0% |
| Long tasks | 0 | 0 | |
| Scripting | 2098ms | 2387ms | +13.8% |
| Heap peak | 8.9 MB | 9.2 MB | +3.4% |

`sd 0.0` on FPS and p95 across 3 reps in both stacks indicates headless Chromium is hitting the 60fps vsync ceiling despite `--disable-frame-rate-limit`. The workload doesn't saturate either stack, so the perf wins (atlas churn cap, scene-graph pruning, allocation reuse) never get tested. The +13.8% scripting cost reflects new caching infrastructure overhead that isn't amortized when there's no churn.

Suggested fixes

  1. Sim ENOENT race (Gap 1): Make `appendJsonl` mkdir-p the parent dir before `appendFileSync`. Trivial 2-line fix in `scripts/sim/runner.ts`. Alternatively, in the bench harness, kill the prior rep's sim cleanly before running `rmrf` on `.sim-sessions/`.
  2. Atlas/scene-graph instrumentation (Gap 2): Add to `bench/instrumentation.js` a periodic snapshot of `window.__pixiAtlasPageCount` and `window.__pixiSceneGraphSize` (need corresponding exposure hooks in the pixi layers — small one-liner per layer that writes the count to a global on each update). Compute end-of-window plateau or growth-rate from the samples.
  3. Long-session scenario (Gap 3): Add a scenario or flag that runs ≥5 min with persistent agents (long `timeAlive`) and high subagent churn so atlas/scene-graph growth actually manifest. The current 90s × 3 sessions × 3 subagents is too short to fill the atlas (~9000 unique entries needed per the issue's CR-2 description) or grow the scene graph past noise.
  4. Vsync uncap on this Chromium version: Confirm `--disable-frame-rate-limit --disable-gpu-vsync` actually removes the cap, or switch to a windowed (non-headless) run for FPS-sensitive measurements.
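The plateau check from fix 2 could be a small pure helper over the periodic samples, e.g. (a sketch; `plateauStats` and its tolerance threshold are assumptions, not existing harness code):

```javascript
// Hypothetical plateau check for bench/instrumentation.js: given periodic
// samples of a counter (e.g. the proposed atlas page count), measure growth
// over the final third of the window. Near-zero growth means plateau.
function plateauStats(samples, tolerance = 0.5) {
  const tailLen = Math.max(2, Math.ceil(samples.length / 3));
  const tail = samples.slice(-tailLen);
  const growthPerSample = (tail[tail.length - 1] - tail[0]) / (tail.length - 1);
  return { growthPerSample, plateaued: Math.abs(growthPerSample) <= tolerance };
}
```

Operating on recorded samples rather than live timers keeps the check deterministic and lets the same helper serve both the atlas (CR-2) and scene-graph (CR-3) series.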

Why this matters

Without these fixes, the bench gate from #46's perf review series (#71 is WG-1 of 5) cannot quantitatively validate any of the WG PRs. The unit tests in #77 pin the perf invariants directly (entry pool plateau, glyph cap, alpha-bucket count, etc.), but those test the mechanism, not the end-user-visible delta. Future perf PRs need an A/B harness that actually measures what they claim to fix.

Filed during verification of #77 (Perf WG-1). Not blocking #77's merge — but the next WG PR will hit the same wall.
