feat(async): latency causes — capture, classify, and harden by arkerone · Pull Request #92 · arkerone/lanterna

arkerone · 2026-05-29T20:49:21Z

What & why

Makes the async kind answer which code is slow, what the latency is, and why — then hardens that classification against an empirical audit.

Each topOperations[] entry now decomposes its lifetime into runMs (on CPU) / waitMs (the real latency) / scheduleDelayMs, and carries a classified latencyCause (event-loop-blocked | gc-pause | downstream-async | io-wait | cpu-bound | background | unknown) with causeConfidence/causeEvidence. A new cross-kind event-loop-blocked-async detector ties a slow async op to the synchronous CPU frame blocking the loop — the answer to "the await was slow but the I/O wasn't".

Audit-driven hardening

The classification was tested against real, purpose-built programs (event-loop blocking, CPU-bound, I/O, downstream chains, orphans, multi-blockers, a realistic HTTP server under load). The audit found and this PR fixes:

GC blanket (dominant bug): ±20ms padding around every GC event made dense sub-millisecond scavenges tile the timeline, so ~100% of waits were spuriously gc-pause (it even stole event-loop-blocked). Now uses each pause's actual duration → gc-pause is correctly rare.
Priority not implemented: a blocked event loop now actually outranks a coincidental GC/downstream overlap (the documented behaviour).
Coincidental overlap → false positive (R7): event-loop-blocked now requires the loop to have still been stalled when the callback became runnable (firstRunAtMs); a stall that ended before the op ran is treated as coincidental.
Orphans pollute the headline: resources still in flight at capture end are excluded from topOperations/byKindLatency (their capture-clamped duration is fictional) and remain in orphans[].
Multiplexed handles: keep-alive sockets / intervals (one asyncId serving many activations over ~the whole capture) are classified background instead of reporting a capture-length aggregate waitMs as a single critical finding. The runCount > 1 discriminator preserves genuine single long operations.
Per-stall blocking frame (R3): the detector attributes the culprit per stall (new eventLoop.stallIntervals[].topFrame, matched by firstRunAtMs) instead of stamping one globally-dominant frame on every blocked op; it stands down entirely when no CPU hotspot identifies a culprit (no placeholder (event-loop) findings).

A separate investigation kept the CPU kind's gc.correlatedHotspots ±lookaround unchanged — verified it is intentional and correct there (it finds the allocating frame near a pause; during a stop-the-world GC no JS runs, so exact windows would find nothing).

Validation

On a realistic HTTP server under concurrent load, the tool correctly surfaces sync-crypto + event-loop-stall + event-loop-blocked-async all pointing at the blocking pbkdf2Sync handler (not the innocent await), json-on-hot-path on the downstream CPU, zero gc-pause blanket, and zero misleading multi-second findings on keep-alive connections.

Commits (atomic, by layer)

feat(core) capture async latency decomposition signals
feat(core) classify async latency cause + per-stall blocking frames
feat(detectors) event-loop-blocked-async + long-await latency guidance
refactor(cli) gate async quality warning on ambiguousRatio
docs(async) docs, skill, example, README
chore changeset

Notes

Report-shape additions (waitMs, latencyCause, byKindLatency, stallIntervals[].topFrame, …) are additive and optional within schema v2.
The async kind remains experimental / opt-in (--kind async); attach mode stays partial by design.
Detector catalogue: 18 → 19 built-ins (3 cross-kind).

Test plan

npm run build && npm test — all suites green (core, detectors, cli; incl. new regression tests for the GC fix, readiness gate, priority, orphan exclusion, multiplexed→background, no-culprit stand-down, and per-stall attribution).
npm run typecheck (5/5) and npm run check (Biome clean).

🤖 Generated with Claude Code

Capture the raw signals needed to split an async operation's wall-clock time into waiting vs on-CPU: per-resource run windows, firstRunAtMs (scheduling-delay precursor), runCount, and a measured clock resolution. Adds the AsyncLatencyCause / AsyncAttributedFrameOrigin data types and threads the new record fields through the probe, async-hooks installer, CDP reader, and capture bundle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Classify each operation's latency cause (event-loop-blocked | gc-pause | downstream-async | io-wait | cpu-bound | background | unknown) by overlapping its wait windows with event-loop stalls, GC pauses and downstream-async activity, plus per-family byKindLatency percentiles. Reliability properties (from an empirical audit against real targets): - GC overlap uses each pause's actual duration (no +-lookaround padding), so dense sub-ms scavenges no longer blanket the timeline and mislabel waits as gc-pause. - A blocked event loop takes the documented priority over a coincidental GC/downstream overlap, and only counts when the loop was still stalled as the callback became runnable (firstRunAtMs) - eliminating false event-loop-blocked on genuinely slow I/O. - Persistent/multiplexed handles (keep-alive sockets, intervals: runCount>1, alive ~the whole capture, low CPU) and idle handles are classified background instead of reporting a capture-length aggregate waitMs. - Orphans are excluded from topOperations and byKindLatency (their capture-clamped duration is fictional) and remain in orphans[]. - The CPU kind attributes a per-stall topFrame so a blocked op can be tied to the specific frame that blocked it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… latency cause New cross-kind event-loop-blocked-async detector: ties a slow async op to the synchronous CPU frame blocking the loop, attributed per stall (matched by firstRunAtMs, falling back to the global hotspot). It stands down when no CPU hotspot identifies a culprit instead of emitting a critical finding anchored at a placeholder (event-loop) frame. long-await now carries the wait/CPU decomposition and cause-specific guidance, and skips background (idle/multiplexed) handles. async-evidence gains a minConfidence helper and attributed/ambiguous quality metrics; thresholds live in DETECTOR_THRESHOLDS.eventLoopBlockedAsync. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adapt the async quality warning to the renamed quality field: warn when ambiguousRatio exceeds 0.33 instead of on any non-zero cpuAmbiguousSamples, matching the analysis change that grades CPU-to-async attribution by unrelated-overlap ratio. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Document the latency decomposition and per-cause reliability (event-loop-blocked readiness, gc-pause rarity, downstream-async/sibling-await limits, io-wait residual, background incl. multiplexed handles, promise fragmentation), byKindLatency/orphan semantics, and the per-stall blocking frame. Adds the async-latency runnable example, updates the agent skill, and bumps the detector catalogue to 19 built-ins (3 cross-kind) for the new event-loop-blocked-async detector. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

arkerone and others added 6 commits May 29, 2026 22:46

chore: add changeset for async latency causes

703d967

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

arkerone merged commit 9ed6112 into main May 29, 2026
3 checks passed

arkerone deleted the feat/async-latency-causes branch May 29, 2026 21:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(async): latency causes — capture, classify, and harden#92

feat(async): latency causes — capture, classify, and harden#92
arkerone merged 6 commits into
mainfrom
feat/async-latency-causes

arkerone commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arkerone commented May 29, 2026

What & why

Audit-driven hardening

Validation

Commits (atomic, by layer)

Notes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant