
Canvas2D performance push: shipped wins, lessons, and what's next #70

@DFearing


TL;DR

Multi-PR investigation into the Canvas2D long-task blowup at 4× CPU throttle / 3 concurrent sessions. Net result on 5-rep median bench (c978155):

| Setting | FPS | Scripting time | vs PR #56 baseline |
| --- | --- | --- | --- |
| Defaults (bloom on) | 12.46 | 66.8s / 90s | +31% FPS |
| Bloom throttle=2 | 13.81 | 49.7s | +45% FPS |
| Bloom OFF | 17.81 | 32.6s | +87% FPS |

The big wins are user-tunable. Default behavior is a moderate improvement; the dramatic numbers come from settings users opt into via the Perf popover.


Methodology timeline

  1. Investigation (bench: profile long tasks at 4× throttle (#46 direction item 1) #56): CPU profile at 4× throttle attributed 49.5% to BloomRenderer.apply, 3.4% to fillText, 1.7% to closePath (drawHexGrid). Filed sub-issues perf: Canvas2D BloomRenderer is 49% of all CPU at 4× throttle #53, perf: Canvas2D label/text draws ≈ 6% of CPU at 4× throttle (per-frame redraw) #54, perf: message-feed virtualization measureRef thrashes ≈ 2.4% of CPU #55.
  2. Six initial PRs shipped against those sub-issues, plus a React commit-rate fix.
  3. First user-visible test failed expectations: toggling bloom OFF in the dev server appeared to do nothing.
  4. Bench A/B verification (bench: verify bloom is the dominant Canvas2D bottleneck #64) on --no-bloom vindicated the bloom finding cleanly: 81.9s → 40.6s long-tasks, +53% FPS. The discrepancy was test-conditions: 1 canvas / 1 idle session at native speed doesn't surface the contention; 3 sessions at 4× throttle does.
  5. Second user-visible disappointment: perf(canvas2d): cache hex grid to offscreen canvas (closes part of #54) #62 + perf(canvas2d): glyph atlas + per-agent overlay cache (closes #54) #63 didn't move the bench needle (FPS 11.7 → 11.9, long-tasks 81.9s → 82.0s).
  6. Forensic profile comparison by researcher: hex-grid cache and overlay cache both had catastrophic cache hit rates. Net: +5.07s of new cache machinery overhead vs −1.17s of fillText savings. Cache implementations were strictly worse than the original code.
  7. Cache hit-rate fixes (perf(canvas2d): fix hex-grid cache hit rate (closePath 2.2% -> ~0.5%) #67 hex grid, perf(canvas2d): improve overlay cache hit rate (partial #54) #68 overlay): drop camera offset from hex cache key (don't invalidate on sub-pixel pan); quantize overlay dataHash inputs (timeAlive to seconds, cost to $0.01 / $0.001).
  8. Bench stabilization (bench: multi-rep median + variance flagging + CPU governor helper #69): added --reps=N flag with median/stdDev/CoV reporting, variance flagging, system-state snapshot, optional CPU governor helper script. Cut FPS variance from σ/μ ~0.18 to under 0.05.
  9. Final 5-rep verification: numbers above. CoV well under the 15% noisy threshold for all three arms.
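The key-quantization fix from step 7 can be sketched as follows. This is a minimal illustration, not the shipped code; the field names (`timeAliveMs`, `cost`, `agentId`) are assumptions standing in for the real overlay `dataHash` inputs:

```javascript
// Sketch of the overlay cache-key quantization described in step 7.
// Field names are illustrative, not the real overlay schema.
function overlayDataHash(agent) {
  // Quantize continuously-changing inputs so the key is stable across frames:
  const timeAliveSec = Math.floor(agent.timeAliveMs / 1000); // whole seconds
  const costCents = Math.round(agent.cost * 100) / 100;      // nearest $0.01
  return `${agent.agentId}|${timeAliveSec}|${costCents}`;
}

// Two frames 16ms apart now produce the same key, i.e. a cache hit:
const a = overlayDataHash({ agentId: "a1", timeAliveMs: 5012, cost: 0.04321 });
const b = overlayDataHash({ agentId: "a1", timeAliveMs: 5028, cost: 0.04325 });
console.log(a === b); // true
```

Before the fix, raw `timeAlive` and full-precision `cost` changed every tick, so the two keys above would have differed and every frame missed the cache.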

Shipped PRs (with measured impact)

Real, verified wins

Infrastructure


Lessons learned

1. Self-time in V8 builtins under heavy throttle can mislead

PR #56's profile correctly attributed bloom's cost via drawImage self-time. But the same metric at the same magnitude could mean "we're truly burning CPU here" or "we're stalling here waiting for the GPU/compositor to flush." PR #64's empirical A/B is the only way to distinguish — and it confirmed the bloom case was real CPU work that frees up the budget when removed.

Action: any future profile-driven optimization plan should include a "verify by ablation" step before merging multiple PRs against a single profile. The bench is cheap; assuming the profile is correct is expensive when it's wrong.

2. Caches that miss are strictly worse than no cache

#62 and #63 looked correct on inspection (cache key, hash function, eviction policy all reasonable). But the cache keys included continuously-changing values (sub-pixel camera offset; per-tick timeAlive/token counts), so hit rate was near-zero. We paid:

  • Original render cost (cache miss → still computes the value)
  • Cache canvas allocation + getContext
  • drawImage from cache to main canvas
  • Cache key construction overhead

Action: add a cache hit-rate metric to any new caching layer and assert it during development. "Theoretical maximum savings" from the eliminated work is wrong if the cache misses; the actual delta is (savings × hit_rate) − (cache_overhead × miss_rate).
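A cache wrapper that tracks its own hit rate makes this failure mode visible during development instead of in a forensic profile. The sketch below is a hypothetical helper (the names `HitRateCache` and `minHitRate` are not from the codebase):

```javascript
// Minimal cache wrapper that tracks hit rate, so a miss-dominated cache
// (pure overhead, as in #62/#63) fails loudly in dev instead of silently
// costing +5s of machinery in a profile.
class HitRateCache {
  constructor() {
    this.map = new Map();
    this.hits = 0;
    this.misses = 0;
  }
  getOrCompute(key, compute) {
    if (this.map.has(key)) {
      this.hits++;
      return this.map.get(key);
    }
    this.misses++;
    const value = compute();
    this.map.set(key, value);
    return value;
  }
  get hitRate() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
  // Call periodically in dev builds once there's a meaningful sample.
  assertHitRate(minHitRate = 0.5) {
    if (this.hits + this.misses >= 100 && this.hitRate < minHitRate) {
      throw new Error(
        `cache hit rate ${this.hitRate.toFixed(2)} below ${minHitRate}`
      );
    }
  }
}
```

With this in place, the sub-pixel-camera-offset key from #62 would have tripped the assertion on the first dev run rather than surviving two merged PRs.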

3. Test conditions matter as much as the test

The user's manual "bloom toggle does nothing" observation was correct for their test conditions (1 canvas, idle sim, DevTools FPS meter). The bench's "bloom toggle saves 50%" observation was correct for *its* test conditions (3 sessions, concurrent workload, in-canvas rAF FPS). Both were measuring something real; the gap between them is the contention regime, not measurement error.

Action: when a user reports "perf change had no effect," gather the test conditions before reverting. The fix may be correct under load that the manual test didn't reproduce.

4. Multi-rep median + variance flagging makes a noisy bench trustworthy

Pre-#69, FPS swung 6.4–12.8 on identical code on the same machine, primarily from background system activity. Post-#69, all 3 final arms had CoV under 5% and the variance check would have warned us if it crept up.

Action: any perf comparison should use --reps=5 minimum. The CPU governor pinning (bench/scripts/bench-prep.sh --set-performance) is also helpful where available, though not strictly required given the median + variance discipline.
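The statistics #69 reports can be sketched in a few lines (the 15% cutoff matches the "noisy" threshold mentioned above; the function name `summarize` is illustrative):

```javascript
// Sketch of the multi-rep statistics from #69: median, sample stdDev,
// and coefficient of variation (CoV = stdDev / mean), plus a noisy flag.
function summarize(reps, noisyCoV = 0.15) {
  const sorted = [...reps].sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = reps.reduce((s, v) => s + v, 0) / reps.length;
  const variance =
    reps.reduce((s, v) => s + (v - mean) ** 2, 0) / (reps.length - 1);
  const stdDev = Math.sqrt(variance);
  const cov = stdDev / mean;
  return { median, stdDev, cov, noisy: cov > noisyCoV };
}
```

The median resists the occasional background-activity outlier that wrecked single-run numbers, and the CoV flag tells you when even the median shouldn't be trusted.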


What's still on the table

If we want to push further:

  1. Non-overlay text quantization (~10% of remaining scripting time). Per H's analysis:
    • Agent labels (~1.5s of fillText) — biggest single source. Likely text sprite cache LRU-eviction or color-key churn.
    • Cost summary panel — totalCost.toFixed(3) changes constantly, defeating the text sprite cache.
    • Tool/discovery labels and bubble text — sprite cache misses on new label appearance.
  2. The 21.4% V8 (program) bucket — GC, deopts, parse. Needs Tracing.start with v8 categories (heavier than Profiler.start). Filed as direction item in Perf next steps: render path is optimized, JS main thread is now the bottleneck #46 originally; not yet investigated.
  3. Bloom throttle per-canvas heuristic — when N>1 canvas is open, auto-bump the bloom throttle. The marginal cost of the Nth bloom pass is high; users may not realize the trade is theirs to make.
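Item 3 could look something like the sketch below. This is a hypothetical heuristic, not shipped behavior; the linear scaling rule and the cap are assumptions to be tuned against the bench:

```javascript
// Hypothetical per-canvas bloom-throttle heuristic for item 3: full-rate
// bloom for a single canvas, backing off as more canvases contend, since
// the marginal cost of the Nth concurrent bloom pass is high.
// Scaling rule and cap are assumptions, not measured values.
function bloomThrottleFor(openCanvasCount, baseThrottle = 1, maxThrottle = 4) {
  if (openCanvasCount <= 1) return baseThrottle;
  // Run the bloom pass every Nth frame as contention grows.
  return Math.min(baseThrottle * openCanvasCount, maxThrottle);
}
```

Whatever the rule, it should surface in the Perf popover so users can see (and override) the trade being made on their behalf.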

None of these are urgent. The Canvas2D path is now in a meaningfully better place than at PR #56.


Reproducer

```sh
cd source
pnpm run build:app
./bench/scripts/bench-prep.sh --set-performance     # optional, requires sudo
node bench/profile-long-tasks.mjs --reps=5                              # defaults
node bench/profile-long-tasks.mjs --reps=5 --bloom-throttle=2           # throttle 2
node bench/profile-long-tasks.mjs --reps=5 --no-bloom                   # bloom off
./bench/scripts/bench-prep.sh --restore             # if you ran --set-performance
```

Outputs land in bench/results/long-tasks-{summary,profile,report}{,-no-bloom,-throttle2}.{json,cpuprofile,md} plus per-rep JSON files.


Closes / linked

🤖 Generated with Claude Code
