
Canvas2D performance push: shipped wins, lessons, and what's next #70

@DFearing


TL;DR

Multi-PR investigation into the Canvas2D long-task blowup at 4× CPU throttle / 3 concurrent sessions. Net result on 5-rep median bench (c978155):

| Setting | FPS | Scripting time | vs PR #56 baseline |
| --- | --- | --- | --- |
| Defaults (bloom on) | 12.46 | 66.8s / 90s | +31% FPS |
| Bloom throttle=2 | 13.81 | 49.7s | +45% FPS |
| Bloom OFF | 17.81 | 32.6s | +87% FPS |

The big wins are user-tunable. Default behavior is a moderate improvement; the dramatic numbers come from settings users opt into via the Perf popover.


Methodology timeline

  1. Investigation (bench: profile long tasks at 4× throttle (#46 direction item 1) #56): CPU profile at 4× throttle attributed 49.5% to BloomRenderer.apply, 3.4% to fillText, 1.7% to closePath (drawHexGrid). Filed sub-issues perf: Canvas2D BloomRenderer is 49% of all CPU at 4× throttle #53, perf: Canvas2D label/text draws ≈ 6% of CPU at 4× throttle (per-frame redraw) #54, perf: message-feed virtualization measureRef thrashes ≈ 2.4% of CPU #55.
  2. Six initial PRs shipped against those sub-issues, plus a React commit-rate fix.
  3. First user-visible test failed expectations: toggling bloom OFF in the dev server appeared to do nothing.
  4. Bench A/B verification (bench: verify bloom is the dominant Canvas2D bottleneck #64) on --no-bloom vindicated the bloom finding cleanly: 81.9s → 40.6s long-tasks, +53% FPS. The discrepancy was test-conditions: 1 canvas / 1 idle session at native speed doesn't surface the contention; 3 sessions at 4× throttle does.
  5. Second user-visible disappointment: perf(canvas2d): cache hex grid to offscreen canvas (closes part of #54) #62 + perf(canvas2d): glyph atlas + per-agent overlay cache (closes #54) #63 didn't move the bench needle (FPS 11.7 → 11.9, long-tasks 81.9s → 82.0s).
  6. Forensic profile comparison by researcher: hex-grid cache and overlay cache both had catastrophic cache hit rates. Net: +5.07s of new cache machinery overhead vs −1.17s of fillText savings. Cache implementations were strictly worse than the original code.
  7. Cache hit-rate fixes (perf(canvas2d): fix hex-grid cache hit rate (closePath 2.2% -> ~0.5%) #67 hex grid, perf(canvas2d): improve overlay cache hit rate (partial #54) #68 overlay): drop camera offset from hex cache key (don't invalidate on sub-pixel pan); quantize overlay dataHash inputs (timeAlive to seconds, cost to $0.01 / $0.001).
  8. Bench stabilization (bench: multi-rep median + variance flagging + CPU governor helper #69): added --reps=N flag with median/stdDev/CoV reporting, variance flagging, system-state snapshot, optional CPU governor helper script. Cut FPS variance from σ/μ ~0.18 to under 0.05.
  9. Final 5-rep verification: numbers above. CoV well under the 15% noisy threshold for all three arms.
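The key-quantization fix from step 7 can be sketched as follows. This is a minimal illustration, not the shipped code; the field names (`timeAliveMs`, `cost`, `agentId`) are assumptions standing in for the real overlay `dataHash` inputs:

```javascript
// Sketch of the overlay cache-key quantization described in step 7.
// Field names are illustrative, not the real overlay schema.
function overlayDataHash(agent) {
  // Quantize continuously-changing inputs so the key is stable across frames:
  const timeAliveSec = Math.floor(agent.timeAliveMs / 1000); // whole seconds
  const costCents = Math.round(agent.cost * 100) / 100;      // nearest $0.01
  return `${agent.agentId}|${timeAliveSec}|${costCents}`;
}

// Two frames 16ms apart now produce the same key, i.e. a cache hit:
const a = overlayDataHash({ agentId: "a1", timeAliveMs: 5012, cost: 0.04321 });
const b = overlayDataHash({ agentId: "a1", timeAliveMs: 5028, cost: 0.04325 });
console.log(a === b); // true
```

Before the fix, raw `timeAlive` and full-precision `cost` changed every tick, so the two keys above would have differed and every frame missed the cache.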

Shipped PRs (with measured impact)

Real, verified wins

Infrastructure


Lessons learned

1. Self-time in V8 builtins under heavy throttle can mislead

PR #56's profile correctly attributed bloom's cost via drawImage self-time. But the same metric at the same magnitude could mean "we're truly burning CPU here" or "we're stalling here waiting for the GPU/compositor to flush." PR #64's empirical A/B is the only way to distinguish — and it confirmed the bloom case was real CPU work that frees up the budget when removed.

Action: any future profile-driven optimization plan should include a "verify by ablation" step before merging multiple PRs against a single profile. The bench is cheap; assuming the profile is correct is expensive when it's wrong.

2. Caches that miss are strictly worse than no cache

#62 and #63 looked correct on inspection (cache key, hash function, eviction policy all reasonable). But the cache keys included continuously-changing values (sub-pixel camera offset; per-tick timeAlive/token counts), so hit rate was near-zero. We paid:

  • Original render cost (cache miss → still computes the value)
  • Cache canvas allocation + getContext
  • drawImage from cache to main canvas
  • Cache key construction overhead

Action: add a cache hit-rate metric to any new caching layer and assert it during development. "Theoretical maximum savings" from the eliminated work is wrong if the cache misses; the actual delta is (savings × hit_rate) − (cache_overhead × miss_rate).
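A cache wrapper that tracks its own hit rate makes this failure mode visible during development instead of in a forensic profile. The sketch below is a hypothetical helper (the names `HitRateCache` and `minHitRate` are not from the codebase):

```javascript
// Minimal cache wrapper that tracks hit rate, so a miss-dominated cache
// (pure overhead, as in #62/#63) fails loudly in dev instead of silently
// costing +5s of machinery in a profile.
class HitRateCache {
  constructor() {
    this.map = new Map();
    this.hits = 0;
    this.misses = 0;
  }
  getOrCompute(key, compute) {
    if (this.map.has(key)) {
      this.hits++;
      return this.map.get(key);
    }
    this.misses++;
    const value = compute();
    this.map.set(key, value);
    return value;
  }
  get hitRate() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
  // Call periodically in dev builds once there's a meaningful sample.
  assertHitRate(minHitRate = 0.5) {
    if (this.hits + this.misses >= 100 && this.hitRate < minHitRate) {
      throw new Error(
        `cache hit rate ${this.hitRate.toFixed(2)} below ${minHitRate}`
      );
    }
  }
}
```

With this in place, the sub-pixel-camera-offset key from #62 would have tripped the assertion on the first dev run rather than surviving two merged PRs.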

3. Test conditions matter as much as the test

The user's manual "bloom toggle does nothing" observation was correct for their test conditions (1 canvas, idle sim, DevTools FPS meter). The bench's "bloom toggle saves 50%" observation was correct for *its* test conditions (3 sessions, concurrent workload, in-canvas rAF FPS). Both were measuring something real; the gap between them is the contention regime, not measurement error.

Action: when a user reports "perf change had no effect," gather the test conditions before reverting. The fix may be correct under load that the manual test didn't reproduce.

4. Multi-rep median + variance flagging makes a noisy bench trustworthy

Pre-#69, FPS swung 6.4–12.8 on identical code on the same machine, primarily from background system activity. Post-#69, all 3 final arms had CoV under 5% and the variance check would have warned us if it crept up.

Action: any perf comparison should use --reps=5 minimum. The CPU governor pinning (bench/scripts/bench-prep.sh --set-performance) is also helpful where available, though not strictly required given the median + variance discipline.
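The statistics #69 reports can be sketched in a few lines (the 15% cutoff matches the "noisy" threshold mentioned above; the function name `summarize` is illustrative):

```javascript
// Sketch of the multi-rep statistics from #69: median, sample stdDev,
// and coefficient of variation (CoV = stdDev / mean), plus a noisy flag.
function summarize(reps, noisyCoV = 0.15) {
  const sorted = [...reps].sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = reps.reduce((s, v) => s + v, 0) / reps.length;
  const variance =
    reps.reduce((s, v) => s + (v - mean) ** 2, 0) / (reps.length - 1);
  const stdDev = Math.sqrt(variance);
  const cov = stdDev / mean;
  return { median, stdDev, cov, noisy: cov > noisyCoV };
}
```

The median resists the occasional background-activity outlier that wrecked single-run numbers, and the CoV flag tells you when even the median shouldn't be trusted.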


What's still on the table

If we want to push further:

  1. Non-overlay text quantization (~10% of remaining scripting time). Per H's analysis:
    • Agent labels (~1.5s of fillText) — biggest single source. Likely text sprite cache LRU-eviction or color-key churn.
    • Cost summary panel — totalCost.toFixed(3) changes constantly, defeating the text sprite cache.
    • Tool/discovery labels and bubble text — sprite cache misses on new label appearance.
  2. The 21.4% V8 (program) bucket — GC, deopts, parse. Needs Tracing.start with v8 categories (heavier than Profiler.start). Filed as direction item in Perf next steps: render path is optimized, JS main thread is now the bottleneck #46 originally; not yet investigated.
  3. Bloom throttle per-canvas heuristic — when N>1 canvas is open, auto-bump the bloom throttle. The marginal cost of the Nth bloom pass is high; users may not realize the trade is theirs to make.
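Item 3 could look something like the sketch below. This is a hypothetical heuristic, not shipped behavior; the linear scaling rule and the cap are assumptions to be tuned against the bench:

```javascript
// Hypothetical per-canvas bloom-throttle heuristic for item 3: full-rate
// bloom for a single canvas, backing off as more canvases contend, since
// the marginal cost of the Nth concurrent bloom pass is high.
// Scaling rule and cap are assumptions, not measured values.
function bloomThrottleFor(openCanvasCount, baseThrottle = 1, maxThrottle = 4) {
  if (openCanvasCount <= 1) return baseThrottle;
  // Run the bloom pass every Nth frame as contention grows.
  return Math.min(baseThrottle * openCanvasCount, maxThrottle);
}
```

Whatever the rule, it should surface in the Perf popover so users can see (and override) the trade being made on their behalf.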

None of these are urgent. The Canvas2D path is now in a meaningfully better place than at PR #56.


Reproducer

```sh
cd source
pnpm run build:app
./bench/scripts/bench-prep.sh --set-performance     # optional, requires sudo
node bench/profile-long-tasks.mjs --reps=5                              # defaults
node bench/profile-long-tasks.mjs --reps=5 --bloom-throttle=2           # throttle 2
node bench/profile-long-tasks.mjs --reps=5 --no-bloom                   # bloom off
./bench/scripts/bench-prep.sh --restore             # if you ran --set-performance
```

Outputs land in bench/results/long-tasks-{summary,profile,report}{,-no-bloom,-throttle2}.{json,cpuprofile,md} plus per-rep JSON files.


Closes / linked

🤖 Generated with Claude Code
