feat(async): latency causes - capture, classify, and harden #92
Merged
Capture the raw signals needed to split an async operation's wall-clock time into waiting vs on-CPU: per-resource run windows, firstRunAtMs (scheduling-delay precursor), runCount, and a measured clock resolution. Adds the AsyncLatencyCause / AsyncAttributedFrameOrigin data types and threads the new record fields through the probe, async-hooks installer, CDP reader, and capture bundle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
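As a rough illustration of how the captured signals could yield the waiting vs on-CPU split, here is a minimal sketch. The `AsyncRecord` shape, `createdAtMs`, `settledAtMs`, and `decompose` are hypothetical names invented for this example; only `firstRunAtMs`, `runCount`, `runMs`, and `waitMs` appear in the PR. How `scheduleDelayMs` is derived from `firstRunAtMs` is not stated in the PR, so it is left out.

```typescript
// Hypothetical record shape for illustration; not the project's actual types.
interface RunWindow {
  startMs: number;
  endMs: number;
}

interface AsyncRecord {
  createdAtMs: number; // assumed: when the async resource was created
  settledAtMs: number; // assumed: when it finished (or was clamped)
  runWindows: RunWindow[]; // per-resource run windows (callback on CPU)
}

interface Decomposition {
  runMs: number; // time spent on CPU inside run windows
  waitMs: number; // remaining lifetime, i.e. the latency
  runCount: number; // number of run windows observed
  firstRunAtMs?: number; // start of the earliest run window, if any
}

function decompose(rec: AsyncRecord): Decomposition {
  const lifetimeMs = Math.max(0, rec.settledAtMs - rec.createdAtMs);
  // Sum window lengths; assumes the windows of one resource do not overlap.
  const runMs = rec.runWindows.reduce(
    (sum, w) => sum + Math.max(0, w.endMs - w.startMs),
    0,
  );
  const firstRunAtMs =
    rec.runWindows.length > 0
      ? Math.min(...rec.runWindows.map((w) => w.startMs))
      : undefined;
  return {
    runMs,
    waitMs: Math.max(0, lifetimeMs - runMs),
    runCount: rec.runWindows.length,
    firstRunAtMs,
  };
}
```

For a resource alive from 100 ms to 200 ms with run windows 150-155 and 190-200, this yields 15 ms on CPU and 85 ms waiting.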
Classify each operation's latency cause (event-loop-blocked | gc-pause | downstream-async | io-wait | cpu-bound | background | unknown) by overlapping its wait windows with event-loop stalls, GC pauses and downstream-async activity, plus per-family byKindLatency percentiles.

Reliability properties (from an empirical audit against real targets):
- GC overlap uses each pause's actual duration (no +-lookaround padding), so dense sub-ms scavenges no longer blanket the timeline and mislabel waits as gc-pause.
- A blocked event loop takes the documented priority over a coincidental GC/downstream overlap, and only counts when the loop was still stalled as the callback became runnable (firstRunAtMs), eliminating false event-loop-blocked on genuinely slow I/O.
- Persistent/multiplexed handles (keep-alive sockets, intervals: runCount>1, alive ~the whole capture, low CPU) and idle handles are classified background instead of reporting a capture-length aggregate waitMs.
- Orphans are excluded from topOperations and byKindLatency (their capture-clamped duration is fictional) and remain in orphans[].
- The CPU kind attributes a per-stall topFrame so a blocked op can be tied to the specific frame that blocked it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
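The classification rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's classifier: the `OpInput` and `Context` shapes, the `classify` and `overlapMs` names, and every numeric threshold (`minShare`, the 0.9 and 0.05 factors, `toleranceMs`) are invented for the example, and the position of `cpu-bound` in the ordering is a guess. What it does take from the commit message is the use of each pause's actual duration, the `firstRunAtMs` readiness gate, the priority of a blocked loop over GC and downstream overlap, and the `runCount > 1` background rule.

```typescript
interface Interval {
  startMs: number;
  endMs: number;
}

type Cause =
  | "event-loop-blocked" | "gc-pause" | "downstream-async"
  | "io-wait" | "cpu-bound" | "background" | "unknown";

// Hypothetical inputs: one wait window per op keeps the sketch short.
interface OpInput {
  wait: Interval;
  runMs: number;
  runCount: number;
  firstRunAtMs: number;
}

interface Context {
  stalls: Interval[];
  gcPauses: Interval[]; // actual pause durations, no lookaround padding
  downstream: Interval[];
  captureMs: number;
}

// Total overlap of `window` with `others` (assumed mutually non-overlapping).
function overlapMs(window: Interval, others: Interval[]): number {
  let total = 0;
  for (const o of others) {
    const start = Math.max(window.startMs, o.startMs);
    const end = Math.min(window.endMs, o.endMs);
    total += Math.max(0, end - start);
  }
  return total;
}

function classify(op: OpInput, ctx: Context, minShare = 0.5, toleranceMs = 1): Cause {
  const waitMs = op.wait.endMs - op.wait.startMs;
  if (waitMs <= 0) return op.runMs > 0 ? "cpu-bound" : "unknown";
  // Multiplexed handle: many runs, alive for roughly the whole capture, low CPU.
  if (op.runCount > 1 && waitMs >= 0.9 * ctx.captureMs && op.runMs < 0.05 * waitMs) {
    return "background";
  }
  if (op.runMs > waitMs) return "cpu-bound"; // assumed placement in the order
  // Readiness gate: a stall must still be in progress when the callback ran.
  const stalledAtReadiness = ctx.stalls.some(
    (s) => s.startMs <= op.firstRunAtMs && s.endMs >= op.firstRunAtMs - toleranceMs,
  );
  const need = minShare * waitMs;
  if (stalledAtReadiness && overlapMs(op.wait, ctx.stalls) >= need) return "event-loop-blocked";
  if (overlapMs(op.wait, ctx.gcPauses) >= need) return "gc-pause";
  if (overlapMs(op.wait, ctx.downstream) >= need) return "downstream-async";
  return "io-wait"; // residual when nothing else explains the wait
}
```

Because GC pauses contribute only their own sub-millisecond durations, a handful of scavenges inside a 100 ms wait cannot reach the share needed for `gc-pause` in this sketch, which mirrors the intent of the first bullet.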
… latency cause

New cross-kind event-loop-blocked-async detector: ties a slow async op to the synchronous CPU frame blocking the loop, attributed per stall (matched by firstRunAtMs, falling back to the global hotspot). It stands down when no CPU hotspot identifies a culprit instead of emitting a critical finding anchored at a placeholder (event-loop) frame. long-await now carries the wait/CPU decomposition and cause-specific guidance, and skips background (idle/multiplexed) handles. async-evidence gains a minConfidence helper and attributed/ambiguous quality metrics; thresholds live in DETECTOR_THRESHOLDS.eventLoopBlockedAsync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
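The per-stall attribution with fallback and stand-down might look something like the sketch below. `Stall`, `blockingFrame`, `globalHotspot`, and `toleranceMs` are hypothetical names, and frames are modeled as plain strings for brevity; the real `topFrame` value is likely richer.

```typescript
// Hypothetical stall shape; topFrame is optional because a stall may have
// no CPU samples that identify a frame.
interface Stall {
  startMs: number;
  endMs: number;
  topFrame?: string;
}

// Returns the frame to blame for a blocked op, or undefined to stand down.
function blockingFrame(
  firstRunAtMs: number,
  stalls: Stall[],
  globalHotspot?: string,
  toleranceMs = 1, // assumed slack, e.g. the measured clock resolution
): string | undefined {
  // Match the stall that was still in progress when the callback first ran.
  const stall = stalls.find(
    (s) => s.startMs <= firstRunAtMs && s.endMs >= firstRunAtMs - toleranceMs,
  );
  // Prefer the per-stall frame; otherwise fall back to the global hotspot.
  // With neither, return undefined so no placeholder finding is emitted.
  return stall?.topFrame ?? globalHotspot;
}
```

A caller would skip the finding entirely when the result is `undefined`, rather than anchoring it at a synthetic frame.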
Adapt the async quality warning to the renamed quality field: warn when ambiguousRatio exceeds 0.33 instead of on any non-zero cpuAmbiguousSamples, matching the analysis change that grades CPU-to-async attribution by unrelated-overlap ratio.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
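A minimal sketch of the gate described here, assuming a quality object with an `ambiguousRatio` field in the range 0 to 1. The `AsyncQuality` interface, the `asyncQualityWarning` function, and the message text are hypothetical; only the field name and the 0.33 threshold come from the commit message.

```typescript
// Hypothetical quality shape; only ambiguousRatio is named in the PR.
interface AsyncQuality {
  ambiguousRatio: number; // share of CPU samples with unrelated overlap
}

const AMBIGUOUS_RATIO_WARN = 0.33;

// Returns a warning string only when the ratio strictly exceeds the threshold,
// so a few ambiguous samples no longer trigger a warning on their own.
function asyncQualityWarning(quality: AsyncQuality): string | undefined {
  if (quality.ambiguousRatio > AMBIGUOUS_RATIO_WARN) {
    const pct = Math.round(quality.ambiguousRatio * 100);
    return `CPU-to-async attribution is ambiguous for ${pct}% of samples`;
  }
  return undefined;
}
```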
Document the latency decomposition and per-cause reliability (event-loop-blocked readiness, gc-pause rarity, downstream-async/sibling-await limits, io-wait residual, background incl. multiplexed handles, promise fragmentation), byKindLatency/orphan semantics, and the per-stall blocking frame. Adds the async-latency runnable example, updates the agent skill, and bumps the detector catalogue to 19 built-ins (3 cross-kind) for the new event-loop-blocked-async detector.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What & why
Makes the async kind answer which code is slow, what the latency is, and why — then hardens that classification against an empirical audit.
Each `topOperations[]` entry now decomposes its lifetime into `runMs` (on CPU) / `waitMs` (the real latency) / `scheduleDelayMs`, and carries a classified `latencyCause` (`event-loop-blocked` | `gc-pause` | `downstream-async` | `io-wait` | `cpu-bound` | `background` | `unknown`) with `causeConfidence` / `causeEvidence`. A new cross-kind `event-loop-blocked-async` detector ties a slow async op to the synchronous CPU frame blocking the loop: the answer to "the await was slow but the I/O wasn't".

Audit-driven hardening
The classification was tested against real, purpose-built programs (event-loop blocking, CPU-bound, I/O, downstream chains, orphans, multi-blockers, a realistic HTTP server under load). The audit found and this PR fixes:
- ±20 ms padding around every GC event made dense sub-millisecond scavenges tile the timeline, so ~100% of waits were spuriously `gc-pause` (it even stole `event-loop-blocked`). Now uses each pause's actual duration → `gc-pause` is correctly rare.
- `event-loop-blocked` now requires the loop to have still been stalled when the callback became runnable (`firstRunAtMs`); a stall that ended before the op ran is treated as coincidental.
- Orphans are excluded from `topOperations` / `byKindLatency` (their capture-clamped duration is fictional) and remain in `orphans[]`.
- Persistent/multiplexed handles (one `asyncId` serving many activations over ~the whole capture) are classified `background` instead of reporting a capture-length aggregate `waitMs` as a single critical finding. The `runCount > 1` discriminator preserves genuine single long operations.
- The `event-loop-blocked-async` detector attributes the blocking frame per stall (`eventLoop.stallIntervals[].topFrame`, matched by `firstRunAtMs`) instead of stamping one globally-dominant frame on every blocked op; it stands down entirely when no CPU hotspot identifies a culprit (no placeholder `(event-loop)` findings).

A separate investigation kept the CPU kind's `gc.correlatedHotspots` ± lookaround unchanged: it was verified to be intentional and correct there (it finds the allocating frame near a pause; during a stop-the-world GC no JS runs, so exact windows would find nothing).

Validation
On a realistic HTTP server under concurrent load, the tool correctly surfaces `sync-crypto` + `event-loop-stall` + `event-loop-blocked-async`, all pointing at the blocking `pbkdf2Sync` handler (not the innocent await), `json-on-hot-path` on the downstream CPU, zero `gc-pause` blanket, and zero misleading multi-second findings on keep-alive connections.

Commits (atomic, by layer)
- `feat(core)` capture async latency decomposition signals
- `feat(core)` classify async latency cause + per-stall blocking frames
- `feat(detectors)` event-loop-blocked-async + long-await latency guidance
- `refactor(cli)` gate async quality warning on ambiguousRatio
- `docs(async)` docs, skill, example, README
- `chore` changeset

Notes
- New fields (`waitMs`, `latencyCause`, `byKindLatency`, `stallIntervals[].topFrame`, …) are additive and optional within schema v2.
- … (`--kind async`); attach mode stays partial by design.

Test plan
- `npm run build && npm test`: all suites green (core, detectors, cli; incl. new regression tests for the GC fix, readiness gate, priority, orphan exclusion, multiplexed→background, no-culprit stand-down, and per-stall attribution).
- `npm run typecheck` (5/5) and `npm run check` (Biome clean).

🤖 Generated with Claude Code