Context
While verifying #71 (Perf WG-1) on PR #77, the bench harness at `bench/run-bench.mjs` failed to validate the perf claims end-to-end. Three infrastructure gaps surfaced:
Gap 1: Sim crashes mid-rep on subagent file appends
Every rep of `node run-bench.mjs --throttle-only --reps=3 --stack <stack>` produced this fatal error during teardown:
```
[sim] fatal: Error: ENOENT: no such file or directory, open '/tmp/agent-flow-bench/.sim-sessions//subagents/.jsonl'
at Object.writeFileSync (node:fs:2437:20)
at Object.appendFileSync (node:fs:2519:6)
at appendJsonl (scripts/sim/runner.ts:108:6)
at appendSubEntry (scripts/sim/runner.ts:388:5)
```
Root cause: `appendJsonl` at `scripts/sim/runner.ts:108` calls `fs.appendFileSync` without a mkdir -p on the parent dir. The bench harness wipes `.sim-sessions/` between reps (`run-bench.mjs:141`); if the next rep's sim spawns subagents faster than the spawnSubagent path's own `ensureDir`, or if there's a residual write from a prior rep, the append crashes.
The 90s measurement window completed cleanly before the crash in every rep, so metrics were captured — but the workload was potentially truncated (subagent spawning incomplete), which biases the result.
Gap 2: Bench metrics don't include atlas pages or scene-graph child count
The plan for #71 specified validation against:
- glyph atlas page count plateau (CR-2)
- scene-graph child count plateau (CR-3)
- draw-call count not regressed ≥10% (CR-1 fallback rule)
`bench/instrumentation.js` doesn't expose any of these. The harness currently measures FPS, frame p50/p95/p99, long tasks, scripting/task/layout/recalc time, heap, and React commits — none of which directly catches the regressions that #71's three Critical findings target.
Gap 3: Default workload doesn't trigger long-running-session conditions
Bench results from PR #77 (4× CPU, 3 sessions, 90s × 3 reps):
| Metric | A-base | C-pr1 | Delta |
| --- | --- | --- | --- |
| FPS | 58.9 (sd 0.0) | 58.9 (sd 0.0) | 0.0% |
| Frame p95 | 17.7ms (sd 0.0) | 17.7ms (sd 0.0) | 0.0% |
| Long tasks | 0 | 0 | — |
| Scripting | 2098ms | 2387ms | +13.8% |
| Heap peak | 8.9 MB | 9.2 MB | +3.4% |
`sd 0.0` on FPS and p95 across 3 reps in both stacks indicates headless Chromium is hitting the 60fps vsync ceiling despite `--disable-frame-rate-limit`. The workload doesn't saturate either stack, so the perf wins (atlas churn cap, scene-graph pruning, allocation reuse) never get tested. The +13.8% scripting cost reflects new caching infrastructure overhead that isn't amortized when there's no churn.
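A cheap guard in the harness could flag this condition instead of reporting capped numbers as a clean A/B result. The sketch below is a hypothetical helper (not part of `run-bench.mjs`) that treats a stack's run as vsync-capped when mean FPS pins near the 60fps ceiling with near-zero spread across reps:

```javascript
// Heuristic: FPS samples look vsync-capped when the mean sits within
// tolerance of the refresh ceiling and rep-to-rep spread is near zero.
function looksVsyncCapped(fpsPerRep, { ceiling = 60, tol = 1.5 } = {}) {
  const n = fpsPerRep.length;
  const mean = fpsPerRep.reduce((a, b) => a + b, 0) / n;
  const sd = Math.sqrt(
    fpsPerRep.reduce((s, f) => s + (f - mean) ** 2, 0) / n
  );
  return Math.abs(mean - ceiling) < tol && sd < 0.1;
}
```

On the table above, `looksVsyncCapped([58.9, 58.9, 58.9])` fires for both stacks, which is exactly the case that should invalidate an FPS comparison.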
Suggested fixes
- Sim ENOENT race (Gap 1): Make `appendJsonl` mkdir-p the parent dir before `appendFileSync`. Trivial 2-line fix in `scripts/sim/runner.ts`. Alternatively, in the bench harness, kill the prior rep's sim cleanly before running `rmrf` on `.sim-sessions/`.
- Atlas/scene-graph instrumentation (Gap 2): Add to `bench/instrumentation.js` a periodic snapshot of `window.__pixiAtlasPageCount` and `window.__pixiSceneGraphSize` (need corresponding exposure hooks in the pixi layers — small one-liner per layer that writes the count to a global on each update). Compute end-of-window plateau or growth-rate from the samples.
- Long-session scenario (Gap 3): Add a scenario or flag that runs ≥5 min with persistent agents (long `timeAlive`) and high subagent churn so atlas/scene-graph growth actually manifests. The current 90s × 3 sessions × 3 subagents is too short to fill the atlas (~9000 unique entries needed per the issue's CR-2 description) or grow the scene graph past noise.
- Vsync uncap on this Chromium version: Confirm `--disable-frame-rate-limit --disable-gpu-vsync` actually removes the cap, or switch to a windowed (non-headless) run for FPS-sensitive measurements.
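For the Gap 2 plateau check, one simple approach (assuming the hypothetical globals suggested above get exposed) is to snapshot the counters on an interval and fit a least-squares slope over the tail of the window; a slope near zero indicates a plateau, while a sustained positive slope indicates unbounded growth:

```javascript
// Fit a least-squares slope (count per second) through periodic samples
// of a counter such as a hypothetical window.__pixiAtlasPageCount.
// A plateau shows up as a slope near zero; churn shows up as growth.
function growthRate(samples) {
  // samples: [{ t: seconds, value: count }, ...]
  const n = samples.length;
  const meanT = samples.reduce((s, x) => s + x.t, 0) / n;
  const meanV = samples.reduce((s, x) => s + x.value, 0) / n;
  let num = 0;
  let den = 0;
  for (const { t, value } of samples) {
    num += (t - meanT) * (value - meanV);
    den += (t - meanT) ** 2;
  }
  return den === 0 ? 0 : num / den;
}
```

A pass rule could then be something like `growthRate(last60s) < 0.05` pages/sec for CR-2; the exact threshold would need calibration against a known-bad stack.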
Why this matters
Without these fixes, the bench gate from #46's perf review series (#71 is WG-1 of 5) cannot quantitatively validate any of the WG PRs. The unit tests in #77 pin the perf invariants directly (entry pool plateau, glyph cap, alpha-bucket count, etc.), but those test the mechanism, not the end-user-visible delta. Future perf PRs need an A/B harness that actually measures what they claim to fix.
Filed during verification of #77 (Perf WG-1). Not blocking #77's merge — but the next WG PR will hit the same wall.