Perf next steps: render path is optimized, JS main thread is now the bottleneck #46

## Summary

Empirical comparison of the major perf checkpoints since the upstream merge-base, captured 2026-05-02.

- **Render-path work is essentially complete.** Baseline → main: **47 → 112 FPS at 1× CPU (+138%)** while simultaneously rendering **3× as many canvases**. Per-canvas effective throughput went up ~7×.
- **PR #36 (full Pixi migration) was the single inflection point.** It accounts for most of the visible win. PR #31 before it raised FPS only modestly (47 → 61 at 1×) while regressing p95 frame time; everything after (#43 visibility-gating + hit-test, #45 multiView) showed no measurable improvement in this scenario.
- **Under high CPU load (4× throttle), the perf work has barely moved the needle** (9.8 → 11.5 FPS). Long-task blocking time stays at ~80s out of every 90s measurement window. The bottleneck has moved from the render path to the JS main thread.
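The headline figures above can be re-derived from the results table. A minimal sketch of the arithmetic, assuming the baseline rendered 1 canvas and main rendered 3 (as the table labels suggest); the function names are illustrative only:

```javascript
// Relative FPS change between two runs, as a percentage.
function percentChange(before, after) {
  return ((after - before) / before) * 100;
}

// Effective throughput gain when the canvas count changes too:
// FPS ratio multiplied by the canvas-count ratio.
function perCanvasGain(before, after) {
  return (after.fps / before.fps) * (after.canvases / before.canvases);
}

const baseline = { fps: 47, canvases: 1 };
const main = { fps: 112, canvases: 3 };

console.log(percentChange(baseline.fps, main.fps).toFixed(0) + '%'); // 138%
console.log(perCanvasGain(baseline, main).toFixed(1) + 'x');         // 7.1x
```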

## Methodology

- Bench: `bench/run-bench.mjs` (Playwright + CDP, headless Chromium with `--disable-frame-rate-limit --enable-precise-memory-info`)
- 5 stacks: baseline (`59ccf4e`), PR #31 (`df3bd94`), PR #36 (`680cb12`), PR #43 (`ed97e12`), main / PR #45 (`c20d844`)
- Workload: `concurrent` sim scenario, 3 sessions, workload-matched (3 canvases visible simultaneously)
- 1 rep × 2 throttles per stack, 30s warmup + 90s measurement
- Raw data: `bench/results/perf-comparison.jsonl` (10 rows)
- Chart regenerator: `bench/perf-chart.mjs`
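
For readers unfamiliar with how a CPU throttle is typically applied from Playwright, here is a minimal sketch. It assumes the bench uses the CDP `Emulation.setCPUThrottlingRate` command; `withCpuThrottle` and the `measure` callback are hypothetical names, not taken from `bench/run-bench.mjs`:

```javascript
// Apply a CPU throttle to a Playwright (Chromium) page over CDP, run a
// measurement callback, then restore the unthrottled rate.
// `measure` is a hypothetical stand-in for the bench's sampling loop.
async function withCpuThrottle(page, rate, measure) {
  // Playwright exposes raw CDP sessions on Chromium browser contexts.
  const client = await page.context().newCDPSession(page);
  await client.send('Emulation.setCPUThrottlingRate', { rate });
  try {
    return await measure(page);
  } finally {
    // A rate of 1 means no throttling.
    await client.send('Emulation.setCPUThrottlingRate', { rate: 1 });
    await client.detach();
  }
}

// Usage (inside an async bench script):
//   const result = await withCpuThrottle(page, 4, (p) => sampleFrames(p));
// where `sampleFrames` is whatever collects FPS and p95 for the window.
```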

## Results

> Drag `bench/results/perf-chart.png` into this issue to attach the chart.

| Stack | FPS @ 1× | p95 @ 1× | FPS @ 4× | p95 @ 4× | Long-tasks @ 4× | Heap peak |
|---|---|---|---|---|---|---|
| Baseline (59ccf4e, 1 canvas) | 47 | 25 ms | 9.8 | 120 ms | 87.2 s | 23 MB |
| PR #31 (df3bd94) | 61 | **34 ms** ⚠️ | 10.4 | **137 ms** ⚠️ | 79.9 s | 44 MB |
| **PR #36 (680cb12)** | **115** | **19 ms** | 11.2 | 121 ms | 80.6 s | 44 MB |
| PR #43 (ed97e12) | 112 | 19 ms | 11.6 | 115 ms | 80.9 s | 42 MB |
| main / PR #45 (c20d844) | 112 | 19 ms | 11.5 | 118 ms | 80.9 s | 43 MB |

⚠️ PR #31's partial Pixi migration regressed p95 frame time at both throttles — half-Pixi/half-DOM was worse than either alone. The win required completing the migration (PR #36).

## Where the wins did NOT come from (and why)

The bench shows zero improvement from #43 and #45 — but that's a measurement-design issue, not a sign the work was wasted:

- **#43 visibility gating** (IntersectionObserver + document.visibilityState) — all 3 bench canvases are on-screen, so the gating never fires. Real value is when canvases scroll out of view.
- **#43 per-pixel hit-test (EventBoundary)** — correctness/UX fix, not a throughput change.
- **#45 multiView** — drops a `readPixels` GPU↔CPU sync and per-viewport RenderTexture. Wins are in GPU memory + GPU time, neither of which the bench captures. The heap delta we do see is dominated by canvas count, not RenderTexture.

These optimizations may still be load-bearing; the bench just isn't designed to surface them. See "Direction" item 4 below.
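
To illustrate why the gating never fires in this bench, here is a minimal sketch of a visibility gate built on IntersectionObserver and `document.visibilityState`. It is an assumption about the general shape of such a gate, not the code from #43; `shouldRender` and `gateRendering` are hypothetical names:

```javascript
// Pure decision: render only when the canvas intersects the viewport
// and the tab itself is visible.
function shouldRender(isIntersecting, visibilityState) {
  return isIntersecting && visibilityState === 'visible';
}

// Browser wiring (hypothetical helper). `onChange(active)` is called
// whenever the render decision flips.
function gateRendering(canvas, onChange, doc = document) {
  let intersecting = false;
  let active = false;
  const update = () => {
    const next = shouldRender(intersecting, doc.visibilityState);
    if (next !== active) {
      active = next;
      onChange(active);
    }
  };
  const observer = new IntersectionObserver((entries) => {
    // Only one target is observed, so the last entry is the latest state.
    intersecting = entries[entries.length - 1].isIntersecting;
    update();
  });
  observer.observe(canvas);
  doc.addEventListener('visibilitychange', update);
  // Return a cleanup function for unmount.
  return () => {
    observer.disconnect();
    doc.removeEventListener('visibilitychange', update);
  };
}
```

With all 3 bench canvases on-screen in a visible tab, `shouldRender` is true for every canvas for the whole window, so a gate like this has nothing to skip.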

## Direction for future perf work

The render path is mostly tapped out for this workload. To meaningfully move 4× throttle FPS or scale beyond 3 concurrent canvases, the next gains have to come from the JS main thread.

In priority order:

### 1. Profile long tasks at 4× throttle (cheap, high-information)

About 80s of long-task blocking time in a 90s window means the main thread is blocked for roughly 89% of the measurement, so almost all of the work runs inside long tasks. Without knowing *what's in* those tasks, further optimization is guessing. Concrete steps:
- Run a representative `concurrent` workload at 4× throttle with the Chromium Performance recorder attached
- Identify the top 3 long-task call stacks
- Likely suspects: sim event ingestion, React reconciliation, scene-graph diff
- File a separate sub-issue per hot path with the call-stack screenshot
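
As a lightweight complement to the Performance recorder, long tasks can also be tallied in-page with the Long Tasks API. The sketch below is an assumption about how that could look, not how the bench computes its "Long-tasks" column. Note that `longtask` entries carry only coarse attribution, not call stacks, so the recorder is still needed to identify hot paths:

```javascript
// Summarize long-task entries: total duration, blocking time (the part of
// each task beyond the 50 ms long-task threshold), and the longest task.
function summarizeLongTasks(entries) {
  let totalMs = 0;
  let blockingMs = 0;
  let longestMs = 0;
  for (const { duration } of entries) {
    totalMs += duration;
    blockingMs += Math.max(0, duration - 50);
    longestMs = Math.max(longestMs, duration);
  }
  return { count: entries.length, totalMs, blockingMs, longestMs };
}

// In-page collection (browser only): buffer entries as they are reported.
function collectLongTasks() {
  const entries = [];
  const observer = new PerformanceObserver((list) => {
    entries.push(...list.getEntries());
  });
  observer.observe({ type: 'longtask', buffered: true });
  return { entries, stop: () => observer.disconnect() };
}
```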

### 2. Reduce React reconciliation cost

The bench captures ~3500-3800 React commits per 90s window with 3 canvases, which works out to roughly 13-14 commits/sec/canvas. Worth investigating:
- Are panel state changes (Messages, Files, $Cost, Timeline toggles) triggering full-tree reconciliation that should be scoped to one panel?
- Is per-event sim ingestion fanning out into multiple React state writes when one batched write would do?
- Move panel state into separate context providers so canvas-state changes don't reconcile the panel subtrees and vice versa
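
If per-event ingestion does turn out to fan out into multiple state writes, one option is to coalesce events and flush once per frame. A minimal sketch, assuming events can be applied as a batch; `createBatcher` and `setEvents` are hypothetical names:

```javascript
// Coalesce many per-event updates into one flush per scheduled tick.
// `schedule` is injected (e.g. requestAnimationFrame in the browser) so the
// batching logic can be exercised without a DOM.
function createBatcher(flush, schedule) {
  let pending = [];
  let scheduled = false;
  return function enqueue(event) {
    pending.push(event);
    if (scheduled) return; // a flush is already queued for this tick
    scheduled = true;
    schedule(() => {
      const batch = pending;
      pending = [];
      scheduled = false;
      flush(batch); // one state write for the whole batch
    });
  };
}

// Usage in a component (hypothetical `setEvents` state setter):
//   const enqueue = createBatcher(
//     (batch) => setEvents((prev) => prev.concat(batch)),
//     requestAnimationFrame,
//   );
```

The trade-off is up to one frame of added latency per event in exchange for at most one commit per frame from this source.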

### 3. Move sim event ingestion off the main thread

SimulationManager's shared rAF processes events for all sessions in lockstep on the main thread. Moving ingestion into a Web Worker (relay events → diff → postMessage minimal updates to React) should free the main-thread budget that's currently spent on JSON parsing + state diffing.
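
The worker-side pipeline could look roughly like the sketch below. It assumes relay events arrive as JSON strings and that session state is a flat object; `diffState`, `installIngestion`, and the `reduce` reducer are hypothetical, and the shallow diff does not report removed keys:

```javascript
// Shallow diff: return only the keys whose values changed, or null if none.
function diffState(prev, next) {
  const patch = {};
  let changed = false;
  for (const key of Object.keys(next)) {
    if (prev[key] !== next[key]) {
      patch[key] = next[key];
      changed = true;
    }
  }
  return changed ? patch : null;
}

// Worker-side handler. `scope` is the worker global (`self` inside a Worker).
// `reduce(state, event)` is a hypothetical pure reducer for sim events.
function installIngestion(scope, reduce, initialState = {}) {
  let state = initialState;
  scope.onmessage = (msg) => {
    const event = JSON.parse(msg.data); // parsing happens off the main thread
    const next = reduce(state, event);
    const patch = diffState(state, next);
    state = next;
    if (patch) scope.postMessage(patch); // main thread receives only the delta
  };
}

// Inside the worker script: installIngestion(self, reduce);
```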

### 4. Add bench scenarios that exercise the optimizations this run missed

The current bench understates #43 and #45 because the scenario is "everything visible, all the time". Suggest adding:
- **Off-screen scenario**: N canvases mounted, only K visible (others scrolled out / behind another panel), to measure #43's visibility gating
- **GPU memory snapshot**: GPU process memory delta per measurement window, to measure #45's multiView GPU memory savings. `Performance.getMetrics` is the proposed source, but it needs confirming that it reports a GPU-process figure rather than only renderer metrics such as JS heap size
- **Apples-to-apples + workload-matched in one run**: currently we only have one or the other; the baseline's 1-canvas datapoint is the only apples-to-apples cell

### 5. Measure Pixi's residual per-frame blit

Per #45's caveat: WebGL backend's `GlRenderTargetAdaptor.postrender` still does `canvasSource.context2D.drawImage(contextCanvas, ...)` per render target, per frame. Cost is unknown but bounded. A focused micro-bench (mock scene, vary canvas count, measure render() wall time) would tell us whether it's worth chasing WebGPU.
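
A minimal timing harness for such a micro-bench might look like this. It is a sketch only: `render` stands in for whatever drives one Pixi frame in the mock scene, and wall time around a WebGL render call mostly captures CPU-side cost, since GPU work may complete later:

```javascript
// Nearest-rank percentile over a list of samples.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Time `frames` calls to `render` and report mean and p95 wall time in ms.
// `now` is injectable so the harness can be checked with a fake clock.
function timeRender(render, frames, now = () => performance.now()) {
  const samples = [];
  for (let i = 0; i < frames; i++) {
    const start = now();
    render();
    samples.push(now() - start);
  }
  const mean = samples.reduce((sum, ms) => sum + ms, 0) / samples.length;
  return { mean, p95: percentile(samples, 95) };
}

// Usage: repeat for each canvas count and compare how p95 scales, e.g.
//   for (const n of [1, 3, 6]) console.log(n, timeRender(() => renderAll(n), 500));
// where `renderAll` is a hypothetical function rendering n mock canvases.
```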

## Tracking

- This bench rerun supersedes #37's first-pass numbers for the recent stacks.
- Files left in tree (uncommitted): `bench/perf-chart.mjs`, `bench/results/perf-comparison.jsonl`, `bench/results/perf-chart.png`, `bench/results/perf-chart.html`. Worth committing the chart script to a follow-up branch if we want it as standing infrastructure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
