Add browser RUM, Web Vitals, and session replay support #152
Introduce opt-in browser session identity, Web Vitals span reporting, and native Web Vitals histogram metrics for @pydantic/logfire-browser. Wire the browser example and docs to exercise the trace and metrics proxy paths, and keep the PRP planning artifacts/checklists with the implementation for review context.
startupPromise ??= import('web-vitals/attribution')
  .then((webVitals) => registerWebVitals(webVitals, options))
  .catch((error: unknown) => {
    diag.error('logfire-browser: failed to start Web Vitals reporting', error)
    return noopHandle
  })
startupHasMetricRecorder ||= options.metricRecorder !== undefined
🔴 Web Vitals startup captures only the first caller's options, silently discarding later metric recorders
The Web Vitals startup promise captures the first options argument in a closure (startupPromise ??= at packages/logfire-browser/src/webVitals.ts:223) and reuses it for all future callers, so when startBrowserWebVitals is called a second time with a metricRecorder, the recorder is never wired to the actual web-vitals callbacks.
Impact: When browser metrics and Web Vitals are both configured and the browser metrics import resolves after Web Vitals observers are already registered, the metric recorder will never receive data.
Mechanism and trigger condition
In packages/logfire-browser/src/index.ts:400-421, when webVitalsMetricOptions is defined, the code first awaits browserMetricsStartupPromise, then calls startBrowserWebVitals({ ...webVitalsOptions, metricRecorder: ... }). However, startBrowserWebVitals uses startupPromise ??= which means if the promise was already assigned (e.g., from an earlier call without a metric recorder), the second call's options containing the metricRecorder is completely ignored. The startupHasMetricRecorder ||= options.metricRecorder !== undefined on line 229 gets set to true, but the actual recorder is never passed to registerWebVitals.
In the current configure() implementation, this specific race is avoided because Web Vitals are only started once per configure call. However, the assertBrowserWebVitalsMetricsCanStart() function at line 210-215 only guards against the case where Web Vitals were started without metrics and a later call tries to add metrics. The inverse (starting with metrics after a bare start) is silently broken. The module-level singleton state (startupPromise) persists across configure() calls in the same page lifecycle, so a second configure() call that enables metrics after a first configure() that enabled Web Vitals without metrics will silently fail to record metrics.
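The capture behaviour described above can be reproduced in isolation. The following is a minimal sketch, assuming a synchronous stand-in for the dynamic import; `StartOptions`, `Handle`, `register`, and `start` are hypothetical names for illustration, not the SDK's actual API.

```typescript
// Hypothetical, simplified stand-ins; not the SDK's real types.
type MetricRecorder = (name: string, value: number) => void

interface StartOptions {
  metricRecorder?: MetricRecorder
}

interface Handle {
  report(name: string, value: number): void
}

let startupHandle: Handle | undefined

function register(options: StartOptions): Handle {
  // The returned handle closes over the options of this one call.
  return {
    report(name, value) {
      options.metricRecorder?.(name, value)
    },
  }
}

function start(options: StartOptions): Handle {
  // `??=` evaluates its right-hand side only while startupHandle is
  // undefined, so a second call's options never reach register().
  startupHandle ??= register(options)
  return startupHandle
}

const recorded: string[] = []

// First call: bare start without a recorder.
start({})

// Second call: supplies a recorder, but registration already happened.
const handle = start({
  metricRecorder: (name, value) => {
    recorded.push(`${name}=${value}`)
  },
})

handle.report('LCP', 1200)
// `recorded` is still empty here: the recorder was silently dropped.
```

The same short-circuit applies when the right-hand side is a promise chain, since the chain is only built on the first assignment.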
Prompt for agents
The issue is in startBrowserWebVitals in packages/logfire-browser/src/webVitals.ts. The startupPromise ??= pattern captures the first call's options in the closure and ignores all subsequent calls' options. Since web-vitals callbacks cannot be unregistered, the metricRecorder must be stored as a mutable reference that later calls can update. Consider: (1) storing the metricRecorder in a mutable variable outside the closure, (2) having registerWebVitals reference that variable, and (3) allowing startBrowserWebVitals to update the mutable recorder reference when called again with a new metricRecorder. This preserves the once-only web-vitals registration while allowing the metric sink to be attached later.
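One way the three suggested steps could fit together is sketched below, again with a synchronous stand-in for the dynamic import. All names (`currentRecorder`, `register`, `start`) are hypothetical and chosen for illustration; this is not the change made in the PR.

```typescript
// Hypothetical, simplified stand-ins; not the SDK's real types.
type MetricRecorder = (name: string, value: number) => void

interface StartOptions {
  metricRecorder?: MetricRecorder
}

interface Handle {
  report(name: string, value: number): void
}

let startupHandle: Handle | undefined

// (1) The recorder lives in a mutable module-level variable.
let currentRecorder: MetricRecorder | undefined

function register(): Handle {
  return {
    report(name, value) {
      // (2) The callback reads the variable at report time, so it always
      // sees the most recently attached recorder.
      currentRecorder?.(name, value)
    },
  }
}

function start(options: StartOptions): Handle {
  // (3) Later calls may attach or replace the recorder.
  if (options.metricRecorder !== undefined) {
    currentRecorder = options.metricRecorder
  }
  // Registration itself still happens at most once.
  startupHandle ??= register()
  return startupHandle
}

const recorded: string[] = []

start({}) // bare start, observers registered without a recorder
const handle = start({
  metricRecorder: (name, value) => {
    recorded.push(`${name}=${value}`)
  },
})

handle.report('CLS', 0.1)
```

In this sketch a later recorder replaces the earlier one. Whether replacement, fan-out to several recorders, or rejecting a second recorder is the right policy is a design decision the review comment leaves open.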
const body = keepalive ? gzipSync(strToU8(json)) : await gzipAsync(json)
const useKeepalive = keepalive && body.byteLength <= MAX_KEEPALIVE_BODY_BYTES
await this.sendWithRetry(sessionId, seq, body, useKeepalive)
🟡 Replay chunk uses uncompressed byte estimate for keepalive splitting, but the actual send uses compressed size for the keepalive threshold
Replay events are split into keepalive-safe chunks by their uncompressed estimateBytes size: splitKeepaliveEventChunks at packages/logfire-session-replay/src/transport.ts:260-281 uses MAX_KEEPALIVE_CHUNK_BYTES = 48_000. After splitting, each chunk is JSON-serialized as a full envelope with metadata and then gzip-compressed. The deliver method then checks body.byteLength <= MAX_KEEPALIVE_BODY_BYTES (60 KB) to decide whether to actually use keepalive (packages/logfire-session-replay/src/transport.ts:146).
Impact: When a single rrweb event (such as a FullSnapshot) exceeds the uncompressed budget but compresses well, the splitting cannot break it further (it always pushes at least one event per chunk), and the compressed body may still exceed 60 KB, causing keepalive to be disabled for that chunk — which means the browser may abort the request during page unload.
Mechanism details
The splitKeepaliveEventChunks function at transport.ts:260-281 splits based on uncompressed JSON string length (estimateBytes). Each chunk gets a separate envelope with computeChunkMeta, then is gzip-compressed. A FullSnapshot event can easily be 200+ KB uncompressed but compress to 30-70 KB. With MAX_KEEPALIVE_CHUNK_BYTES = 48_000, a single FullSnapshot exceeding 48KB cannot be split further (the loop always pushes at least one event per chunk). After compression, if the body exceeds MAX_KEEPALIVE_BODY_BYTES = 60_000, then useKeepalive is set to false at line 146, and on page unload the request gets keepalive: false, which means the browser can abort it during navigation. The fundamental issue is the mismatch between the splitting metric (uncompressed bytes) and the keepalive feasibility check (compressed bytes).
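The mismatch can be illustrated with a greedy splitter that budgets on uncompressed size and a separate check on compressed size. This is a minimal sketch under stated assumptions: `splitByUncompressedSize`, `canUseKeepalive`, and `SizedEvent` are hypothetical names, the constants mirror the values quoted above, and compressed sizes are passed in as plain numbers rather than produced by a real gzip call.

```typescript
// Constants mirroring the values quoted in the review comment.
const MAX_KEEPALIVE_CHUNK_BYTES = 48_000
const MAX_KEEPALIVE_BODY_BYTES = 60_000

// Hypothetical event shape: only the estimated uncompressed size matters here.
interface SizedEvent {
  kind: string
  uncompressedBytes: number
}

// Greedy split on uncompressed size. A chunk is closed before adding an
// event that would exceed the budget, but every event is always placed
// somewhere, so a single oversized event ends up alone in its own chunk.
function splitByUncompressedSize(events: SizedEvent[], maxBytes: number): SizedEvent[][] {
  const chunks: SizedEvent[][] = []
  let current: SizedEvent[] = []
  let currentBytes = 0
  for (const event of events) {
    if (current.length > 0 && currentBytes + event.uncompressedBytes > maxBytes) {
      chunks.push(current)
      current = []
      currentBytes = 0
    }
    current.push(event)
    currentBytes += event.uncompressedBytes
  }
  if (current.length > 0) chunks.push(current)
  return chunks
}

// The keepalive decision is made on a different quantity: the compressed body.
function canUseKeepalive(compressedBytes: number): boolean {
  return compressedBytes <= MAX_KEEPALIVE_BODY_BYTES
}

const chunks = splitByUncompressedSize(
  [
    { kind: 'incremental', uncompressedBytes: 10_000 },
    { kind: 'full-snapshot', uncompressedBytes: 200_000 },
    { kind: 'incremental', uncompressedBytes: 10_000 },
  ],
  MAX_KEEPALIVE_CHUNK_BYTES,
)

// The snapshot sits alone in the middle chunk. Whether it can use keepalive
// depends only on how well it compresses, which the splitter never sees.
const snapshotFitsAt30k = canUseKeepalive(30_000)
const snapshotFitsAt70k = canUseKeepalive(70_000)
```

With these inputs the splitter produces three chunks, and the same 200 KB snapshot is keepalive-eligible if it compresses to 30 KB but not if it compresses to 70 KB.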
Good catch. I am leaving this as a documented/best-effort limitation rather than changing chunking in this PR. The transport already splits keepalive flushes between rrweb events and only uses keepalive when the compressed envelope is under the browser-safe cap. If a single rrweb event, commonly a FullSnapshot, is itself too large after envelope+gzip, we cannot split it further without breaking rrweb event semantics. In that case the current fallback is to send the chunk as a normal non-keepalive request; on page unload that may still be aborted by the browser, but forcing keepalive: true for an oversized body would be rejected or ignored anyway. Explicit stop() uses a normal non-keepalive flush, so this only affects lifecycle/pagehide flushes. I think deeper mitigation here is a follow-up policy decision around snapshot sizing/checkpoints or an alternate final-upload strategy, not a small transport patch in this PR.
let startupPromise: Promise<BrowserWebVitalsHandle> | undefined
let startupHasMetricRecorder = false
🚩 Web Vitals module singleton state is never reset during browser SDK cleanup
The module-level startupPromise and startupHasMetricRecorder in packages/logfire-browser/src/webVitals.ts:66-67 persist across the entire page lifecycle, and the resetBrowserWebVitalsForTests() function is only for test use. This means once Web Vitals are configured and registered, a subsequent configure() call in the same page (e.g. HMR in development, or a dynamic reconfigure scenario) will silently reuse the first registration's options. This is intentional because web-vitals callbacks are page-lifetime observers that cannot be unregistered, but it means the metric recorder from the first configure call lives forever. This is fine for production (single configure per page load) but could confuse HMR/development scenarios.
Summary
Adds SDK-side browser RUM support across session identity, Core Web Vitals, native Web Vitals metrics, and experimental rrweb session replay.
This PR intentionally keeps replay opt-in and experimental while Logfire Platform replay ingest/playback remain feature-flagged. Platform migration and release work should happen on top of this SDK branch after review.
What changed
- @pydantic/logfire-session-replay package for rrweb chunk capture/upload.
- sessionReplay into @pydantic/logfire-browser through an explicit lazy load callback.
- session.id / browser.session.id plus replay-active span attributes.
- examples/browser-rum-replay workbench plus replay support in the existing browser smoke example.
- session-replay dev URLs and Vite resolving rrweb to CJS when consuming unpublished workspace output.
Validation
- vp check
- vp check --fix