TL;DR
Multi-PR investigation into the Canvas2D long-task blowup at 4× CPU throttle / 3 concurrent sessions. Net result on the 5-rep median bench (c978155): the big wins are user-tunable. Default behavior is a moderate improvement; the dramatic numbers come from settings users opt into via the Perf popover.
Methodology timeline
- Initial profiling attributed 49% of all CPU to BloomRenderer.apply, 3.4% to fillText, and 1.7% to closePath (drawHexGrid). Filed sub-issues: "perf: Canvas2D BloomRenderer is 49% of all CPU at 4× throttle" #53, "perf: Canvas2D label/text draws ≈ 6% of CPU at 4× throttle (per-frame redraw)" #54, "perf: message-feed virtualization measureRef thrashes ≈ 2.4% of CPU" #55.
- --no-bloom vindicated the bloom finding cleanly: 81.9s → 40.6s long-tasks, +53% FPS. The discrepancy was test conditions: 1 canvas / 1 idle session at native speed doesn't surface the contention; 3 sessions at 4× throttle do.
- Quantized dataHash inputs (timeAlive to seconds, cost to $0.01 / $0.001).
- Added a --reps=N flag with median/stdDev/CoV reporting, variance flagging, a system-state snapshot, and an optional CPU governor helper script. Cut FPS variance from σ/μ ~0.18 to under 0.05.

Shipped PRs (with measured impact)
Real, verified wins
- closePath self-time 2.2% → 0.5–0.6% (≈75% reduction). Cache key no longer includes camera, so sub-pixel pan doesn't invalidate.

Infrastructure
- Long-task bench script (profile-long-tasks.mjs) and parsed report.
- perf: stabilize message-feed measureRef (closes #55) #60 — useVirtualList measureRef stabilized via per-id callback Map + rAF-coalesced forceTick. Predicted 2.4% of CPU; not isolated in the final bench but contributes to lower React commits.
- --bloom-throttle=N flag + first comparison table.

Lessons learned
1. Self-time in V8 builtins under heavy throttle can mislead
PR #56's profile correctly attributed bloom's cost via drawImage self-time. But the same metric at the same magnitude could mean "we're truly burning CPU here" or "we're stalling here waiting for the GPU/compositor to flush." PR #64's empirical A/B is the only way to distinguish — and it confirmed the bloom case was real CPU work that frees up the budget when removed.

Action: any future profile-driven optimization plan should include a "verify by ablation" step before merging multiple PRs against a single profile. The bench is cheap; assuming the profile is correct is expensive when it's wrong.
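The ablation check itself can be mechanical. A minimal sketch of the idea, where runBench is a hypothetical stand-in for one rep of the real bench (not the repo's actual API):

```javascript
// Run the same bench with a feature on and off, then compare medians.
// A profile attribution that doesn't move the off-arm median was misleading.
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// `runBench(opts)` stands in for one rep (e.g. total long-task ms).
function ablate(runBench, reps = 5) {
  const on = [];
  const off = [];
  for (let i = 0; i < reps; i++) {
    on.push(runBench({ bloom: true }));
    off.push(runBench({ bloom: false }));
  }
  return { on: median(on), off: median(off), delta: median(on) - median(off) };
}
```

If delta is near zero under the contended workload, the self-time was a stall rather than real CPU work.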
2. Caches that miss are strictly worse than no cache
#62 and #63 looked correct on inspection (cache key, hash function, eviction policy all reasonable). But the cache keys included continuously-changing values (sub-pixel camera offset; per-tick timeAlive/token counts), so hit rate was near-zero. We paid:

- Original render cost (cache miss → still computes the value)
- Cache canvas allocation + getContext
- drawImage from cache to main canvas
- Cache key construction overhead

A forensic profile comparison by the researcher found both the hex-grid cache and the overlay cache had catastrophic hit rates. Net: +5.07s of new cache machinery overhead vs −1.17s of fillText savings — strictly worse than the original code.
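As a sketch of the pattern that would have caught this (MeteredCache and keyFor are illustrative, not the shipped code): quantize volatile inputs before they enter the key, and count hits so a near-zero hit rate fails loudly in development.

```javascript
// Illustrative cache wrapper that tracks its own hit rate.
class MeteredCache {
  constructor() {
    this.map = new Map();
    this.hits = 0;
    this.misses = 0;
  }
  getOrCompute(key, compute) {
    if (this.map.has(key)) {
      this.hits++;
      return this.map.get(key);
    }
    this.misses++;
    const value = compute();
    this.map.set(key, value);
    return value;
  }
  hitRate() {
    const total = this.hits + this.misses;
    return total ? this.hits / total : 0;
  }
}

// Quantize continuously-changing inputs so the key repeats across ticks:
// whole seconds instead of milliseconds, cents instead of sub-cent cost.
const keyFor = (agent) =>
  `${agent.id}:${Math.floor(agent.timeAlive / 1000)}:${agent.cost.toFixed(2)}`;
```

In dev builds, assert hitRate() stays above a floor after warm-up; a cache sitting near 0% is paying key construction, allocation, and drawImage on top of the original render cost.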
Action: add a cache hit-rate metric to any new caching layer and assert it during development. "Theoretical maximum savings" from the eliminated work is wrong if the cache misses; the actual delta is (savings × hit_rate) − (cache_overhead × miss_rate).

3. Test conditions matter as much as the test
The user's manual "bloom toggle does nothing" observation was correct for their test conditions (1 canvas, idle sim, DevTools FPS meter). The bench's "bloom toggle saves 50%" observation was correct for ITS test conditions (3 sessions, concurrent workload, in-canvas rAF FPS). Both were measuring something real; the gap between them is the contention regime, not measurement error.
Action: when a user reports "perf change had no effect," gather the test conditions before reverting. The fix may be correct under load that the manual test didn't reproduce.
4. Multi-rep median + variance flagging makes a noisy bench trustworthy
Pre-#69, FPS swung 6.4–12.8 on identical code on the same machine, primarily from background system activity. Post-#69, all 3 final arms had CoV under 5% and the variance check would have warned us if it crept up.
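The arithmetic behind that flag is small. A sketch of the per-arm summary, assuming the under-5% CoV target mentioned above (summarize is illustrative, not the bench's actual reporting code):

```javascript
// Summarize N reps of one arm: median as the headline number,
// coefficient of variation (σ/μ) to decide whether the arm is trustworthy.
function summarize(reps, covThreshold = 0.05) {
  const sorted = [...reps].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = reps.reduce((sum, x) => sum + x, 0) / reps.length;
  const variance =
    reps.reduce((sum, x) => sum + (x - mean) ** 2, 0) / reps.length;
  const cov = Math.sqrt(variance) / mean;
  return { median, cov, noisy: cov > covThreshold };
}
```

A noisy arm gets re-run rather than compared; comparing medians of low-CoV arms is what makes a 5-rep bench trustworthy on a busy machine.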
Action: any perf comparison should use --reps=5 minimum. The CPU governor pinning (bench/scripts/bench-prep.sh --set-performance) is also helpful where available, though not strictly required given the median + variance discipline.

What's still on the table
If we want to push further:
- Non-overlay text quantization (~10% of remaining scripting time). Per H's analysis:
  - Agent labels (~1.5s of fillText) — biggest single source. Likely text sprite cache LRU-eviction or color-key churn.
  - Cost summary panel — totalCost.toFixed(3) changes constantly, defeating the text sprite cache.
  - Tool/discovery labels and bubble text — sprite cache misses on new label appearance.
- Bloom throttle per-canvas heuristic — when N>1 canvas is open, auto-bump the bloom throttle. The marginal cost of the Nth bloom pass is high; users may not realize the trade is theirs to make.
- The (program) bucket — GC, deopts, parse. Needs Tracing.start with v8 categories (heavier than Profiler.start). Filed as a direction item in "Perf next steps: render path is optimized, JS main thread is now the bottleneck" #46 originally; not yet investigated.

None of these are urgent. The Canvas2D path is now in a meaningfully better place than at PR #56.
Reproducer
cd source
pnpm run build:app
./bench/scripts/bench-prep.sh --set-performance  # optional, requires sudo
node bench/profile-long-tasks.mjs --reps=5  # defaults
node bench/profile-long-tasks.mjs --reps=5 --bloom-throttle=2  # throttle 2
node bench/profile-long-tasks.mjs --reps=5 --no-bloom  # bloom off
./bench/scripts/bench-prep.sh --restore  # if you ran --set-performance

Outputs land in bench/results/long-tasks-{summary,profile,report}{,-no-bloom,-throttle2}.{json,cpuprofile,md} plus per-rep JSON files.

Closes / linked
🤖 Generated with Claude Code