Skip to content

feat(async): latency causes — capture, classify, and harden#92

Merged
arkerone merged 6 commits into
mainfrom
feat/async-latency-causes
May 29, 2026
Merged

feat(async): latency causes — capture, classify, and harden#92
arkerone merged 6 commits into
mainfrom
feat/async-latency-causes

Conversation

@arkerone
Copy link
Copy Markdown
Owner

What & why

Makes the async kind answer which code is slow, what the latency is, and why — then hardens that classification against an empirical audit.

Each topOperations[] entry now decomposes its lifetime into runMs (on CPU) / waitMs (the real latency) / scheduleDelayMs, and carries a classified latencyCause (event-loop-blocked | gc-pause | downstream-async | io-wait | cpu-bound | background | unknown) with causeConfidence/causeEvidence. A new cross-kind event-loop-blocked-async detector ties a slow async op to the synchronous CPU frame blocking the loop — the answer to "the await was slow but the I/O wasn't".

Audit-driven hardening

The classification was tested against real, purpose-built programs (event-loop blocking, CPU-bound, I/O, downstream chains, orphans, multi-blockers, a realistic HTTP server under load). The audit found and this PR fixes:

  • GC blanket (dominant bug): ±20ms padding around every GC event made dense sub-millisecond scavenges tile the timeline, so ~100% of waits were spuriously gc-pause (it even stole event-loop-blocked). Now uses each pause's actual durationgc-pause is correctly rare.
  • Priority not implemented: a blocked event loop now actually outranks a coincidental GC/downstream overlap (the documented behaviour).
  • Coincidental overlap → false positive (R7): event-loop-blocked now requires the loop to have still been stalled when the callback became runnable (firstRunAtMs); a stall that ended before the op ran is treated as coincidental.
  • Orphans pollute the headline: resources still in flight at capture end are excluded from topOperations/byKindLatency (their capture-clamped duration is fictional) and remain in orphans[].
  • Multiplexed handles: keep-alive sockets / intervals (one asyncId serving many activations over ~the whole capture) are classified background instead of reporting a capture-length aggregate waitMs as a single critical finding. The runCount > 1 discriminator preserves genuine single long operations.
  • Per-stall blocking frame (R3): the detector attributes the culprit per stall (new eventLoop.stallIntervals[].topFrame, matched by firstRunAtMs) instead of stamping one globally-dominant frame on every blocked op; it stands down entirely when no CPU hotspot identifies a culprit (no placeholder (event-loop) findings).

A separate investigation kept the CPU kind's gc.correlatedHotspots ±lookaround unchanged — verified it is intentional and correct there (it finds the allocating frame near a pause; during a stop-the-world GC no JS runs, so exact windows would find nothing).

Validation

On a realistic HTTP server under concurrent load, the tool correctly surfaces sync-crypto + event-loop-stall + event-loop-blocked-async all pointing at the blocking pbkdf2Sync handler (not the innocent await), json-on-hot-path on the downstream CPU, zero gc-pause blanket, and zero misleading multi-second findings on keep-alive connections.

Commits (atomic, by layer)

  1. feat(core) capture async latency decomposition signals
  2. feat(core) classify async latency cause + per-stall blocking frames
  3. feat(detectors) event-loop-blocked-async + long-await latency guidance
  4. refactor(cli) gate async quality warning on ambiguousRatio
  5. docs(async) docs, skill, example, README
  6. chore changeset

Notes

  • Report-shape additions (waitMs, latencyCause, byKindLatency, stallIntervals[].topFrame, …) are additive and optional within schema v2.
  • The async kind remains experimental / opt-in (--kind async); attach mode stays partial by design.
  • Detector catalogue: 18 → 19 built-ins (3 cross-kind).

Test plan

  • npm run build && npm test — all suites green (core, detectors, cli; incl. new regression tests for the GC fix, readiness gate, priority, orphan exclusion, multiplexed→background, no-culprit stand-down, and per-stall attribution).
  • npm run typecheck (5/5) and npm run check (Biome clean).

🤖 Generated with Claude Code

arkerone and others added 6 commits May 29, 2026 22:46
Capture the raw signals needed to split an async operation's wall-clock time into waiting vs on-CPU: per-resource run windows, firstRunAtMs (scheduling-delay precursor), runCount, and a measured clock resolution. Adds the AsyncLatencyCause / AsyncAttributedFrameOrigin data types and threads the new record fields through the probe, async-hooks installer, CDP reader, and capture bundle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Classify each operation's latency cause (event-loop-blocked | gc-pause | downstream-async | io-wait | cpu-bound | background | unknown) by overlapping its wait windows with event-loop stalls, GC pauses and downstream-async activity, plus per-family byKindLatency percentiles.

Reliability properties (from an empirical audit against real targets):
- GC overlap uses each pause's actual duration (no +-lookaround padding), so dense sub-ms scavenges no longer blanket the timeline and mislabel waits as gc-pause.
- A blocked event loop takes the documented priority over a coincidental GC/downstream overlap, and only counts when the loop was still stalled as the callback became runnable (firstRunAtMs) - eliminating false event-loop-blocked on genuinely slow I/O.
- Persistent/multiplexed handles (keep-alive sockets, intervals: runCount>1, alive ~the whole capture, low CPU) and idle handles are classified background instead of reporting a capture-length aggregate waitMs.
- Orphans are excluded from topOperations and byKindLatency (their capture-clamped duration is fictional) and remain in orphans[].
- The CPU kind attributes a per-stall topFrame so a blocked op can be tied to the specific frame that blocked it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… latency cause

New cross-kind event-loop-blocked-async detector: ties a slow async op to the synchronous CPU frame blocking the loop, attributed per stall (matched by firstRunAtMs, falling back to the global hotspot). It stands down when no CPU hotspot identifies a culprit instead of emitting a critical finding anchored at a placeholder (event-loop) frame.

long-await now carries the wait/CPU decomposition and cause-specific guidance, and skips background (idle/multiplexed) handles. async-evidence gains a minConfidence helper and attributed/ambiguous quality metrics; thresholds live in DETECTOR_THRESHOLDS.eventLoopBlockedAsync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adapt the async quality warning to the renamed quality field: warn when ambiguousRatio exceeds 0.33 instead of on any non-zero cpuAmbiguousSamples, matching the analysis change that grades CPU-to-async attribution by unrelated-overlap ratio.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the latency decomposition and per-cause reliability (event-loop-blocked readiness, gc-pause rarity, downstream-async/sibling-await limits, io-wait residual, background incl. multiplexed handles, promise fragmentation), byKindLatency/orphan semantics, and the per-stall blocking frame. Adds the async-latency runnable example, updates the agent skill, and bumps the detector catalogue to 19 built-ins (3 cross-kind) for the new event-loop-blocked-async detector.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arkerone arkerone merged commit 9ed6112 into main May 29, 2026
3 checks passed
@arkerone arkerone deleted the feat/async-latency-causes branch May 29, 2026 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant