Skip to content

Add browser RUM, Web Vitals, and session replay support#152

Draft
petyosi wants to merge 8 commits into
mainfrom
petyosi/browser-rum-web-vitals
Draft

Add browser RUM, Web Vitals, and session replay support#152
petyosi wants to merge 8 commits into
mainfrom
petyosi/browser-rum-web-vitals

Conversation

@petyosi

@petyosi petyosi commented Jun 30, 2026

Copy link
Copy Markdown
Member

Summary

Adds SDK-side browser RUM support across session identity, Core Web Vitals, native Web Vitals metrics, and experimental rrweb session replay.

This PR intentionally keeps replay opt-in and experimental while Logfire Platform replay ingest/playback remain feature-flagged. Platform migration and release work should happen on top of this SDK branch after review.

What changed

  • Added browser RUM session identity and URL attribute handling for browser spans.
  • Added Core Web Vitals span reporting and optional native histogram metrics.
  • Added a standalone @pydantic/logfire-session-replay package for rrweb chunk capture/upload.
  • Integrated top-level sessionReplay into @pydantic/logfire-browser through an explicit lazy load callback.
  • Added replay/span correlation through shared session.id / browser.session.id plus replay-active span attributes.
  • Added proxy-first replay documentation and direct-token escape hatch docs.
  • Added a complete examples/browser-rum-replay workbench plus replay support in the existing browser smoke example.
  • Documented local replay testing traps: privacy extensions blocking session-replay dev URLs and Vite resolving rrweb to CJS when consuming unpublished workspace output.

Validation

  • vp check
  • Pre-commit hook: vp check --fix
  • Pre-push hook package builds and tests for the affected workspace packages
  • Manual local platform smoke test for traces, Web Vitals metrics, and replay proxy upload path

petyosi added 6 commits June 29, 2026 17:49
Introduce opt-in browser session identity, Web Vitals span reporting, and native Web Vitals histogram metrics for @pydantic/logfire-browser. Wire the browser example and docs to exercise the trace and metrics proxy paths, and keep the PRP planning artifacts/checklists with the implementation for review context.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View 1 additional finding in Devin Review.

Open in Devin Review

Comment on lines +223 to +229
startupPromise ??= import('web-vitals/attribution')
.then((webVitals) => registerWebVitals(webVitals, options))
.catch((error: unknown) => {
diag.error('logfire-browser: failed to start Web Vitals reporting', error)
return noopHandle
})
startupHasMetricRecorder ||= options.metricRecorder !== undefined

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Web Vitals startup captures only the first caller's options, silently discarding later metric recorders

The Web Vitals startup promise captures the first options argument in a closure (startupPromise ??= at packages/logfire-browser/src/webVitals.ts:223) and reuses it for all future callers, so when startBrowserWebVitals is called a second time with a metricRecorder, the recorder is never wired to the actual web-vitals callbacks.

Impact: When browser metrics and Web Vitals are both configured and the browser metrics import resolves after Web Vitals observers are already registered, the metric recorder will never receive data.

Mechanism and trigger condition

In packages/logfire-browser/src/index.ts:400-421, when webVitalsMetricOptions is defined, the code first awaits browserMetricsStartupPromise, then calls startBrowserWebVitals({ ...webVitalsOptions, metricRecorder: ... }). However, startBrowserWebVitals uses startupPromise ??= which means if the promise was already assigned (e.g., from an earlier call without a metric recorder), the second call's options containing the metricRecorder is completely ignored. The startupHasMetricRecorder ||= options.metricRecorder !== undefined on line 229 gets set to true, but the actual recorder is never passed to registerWebVitals.

In the current configure() implementation, this specific race is avoided because Web Vitals are only started once per configure call. However, the assertBrowserWebVitalsMetricsCanStart() function at line 210-215 only guards against the case where Web Vitals were started without metrics and a later call tries to add metrics. The inverse (starting with metrics after a bare start) is silently broken. The module-level singleton state (startupPromise) persists across configure() calls in the same page lifecycle, so a second configure() call that enables metrics after a first configure() that enabled Web Vitals without metrics will silently fail to record metrics.

Prompt for agents
The issue is in startBrowserWebVitals in packages/logfire-browser/src/webVitals.ts. The startupPromise ??= pattern captures the first call's options in the closure and ignores all subsequent calls' options. Since web-vitals callbacks cannot be unregistered, the metricRecorder must be stored as a mutable reference that later calls can update. Consider: (1) storing the metricRecorder in a mutable variable outside the closure, (2) having registerWebVitals reference that variable, and (3) allowing startBrowserWebVitals to update the mutable recorder reference when called again with a new metricRecorder. This preserves the once-only web-vitals registration while allowing the metric sink to be attached later.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +145 to +147
const body = keepalive ? gzipSync(strToU8(json)) : await gzipAsync(json)
const useKeepalive = keepalive && body.byteLength <= MAX_KEEPALIVE_BODY_BYTES
await this.sendWithRetry(sessionId, seq, body, useKeepalive)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Replay chunk uses uncompressed byte estimate for keepalive splitting, but the actual send uses compressed size for the keepalive threshold

Replay events are split into keepalive-safe chunks based on uncompressed estimateBytes (splitKeepaliveEventChunks at packages/logfire-session-replay/src/transport.ts:260-281 uses MAX_KEEPALIVE_CHUNK_BYTES = 48_000), but after splitting, each chunk is JSON-serialized as a full envelope with metadata and then gzip-compressed, then the deliver method checks body.byteLength <= MAX_KEEPALIVE_BODY_BYTES (60 KB) to decide whether to actually use keepalive (packages/logfire-session-replay/src/transport.ts:146).

Impact: When a single rrweb event (such as a FullSnapshot) exceeds the uncompressed budget but compresses well, the splitting cannot break it further (it always pushes at least one event per chunk), and the compressed body may still exceed 60 KB, causing keepalive to be disabled for that chunk — which means the browser may abort the request during page unload.

Mechanism details

The splitKeepaliveEventChunks function at transport.ts:260-281 splits based on uncompressed JSON string length (estimateBytes). Each chunk gets a separate envelope with computeChunkMeta, then is gzip-compressed. A FullSnapshot event can easily be 200+ KB uncompressed but compress to 30-70 KB. With MAX_KEEPALIVE_CHUNK_BYTES = 48_000, a single FullSnapshot exceeding 48KB cannot be split further (the loop always pushes at least one event per chunk). After compression, if the body exceeds MAX_KEEPALIVE_BODY_BYTES = 60_000, then useKeepalive is set to false at line 146, and on page unload the request gets keepalive: false, which means the browser can abort it during navigation. The fundamental issue is the mismatch between the splitting metric (uncompressed bytes) and the keepalive feasibility check (compressed bytes).

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I am leaving this as a documented/best-effort limitation rather than changing chunking in this PR. The transport already splits keepalive flushes between rrweb events and only uses keepalive when the compressed envelope is under the browser-safe cap. If a single rrweb event, commonly a FullSnapshot, is itself too large after envelope+gzip, we cannot split it further without breaking rrweb event semantics. In that case the current fallback is to send the chunk as a normal non-keepalive request; on page unload that may still be aborted by the browser, but forcing keepalive: true for an oversized body would be rejected or ignored anyway. Explicit stop() uses a normal non-keepalive flush, so this only affects lifecycle/pagehide flushes. I think deeper mitigation here is a follow-up policy decision around snapshot sizing/checkpoints or an alternate final-upload strategy, not a small transport patch in this PR.

Comment on lines +66 to +67
let startupPromise: Promise<BrowserWebVitalsHandle> | undefined
let startupHasMetricRecorder = false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Web Vitals module singleton state is never reset during browser SDK cleanup

The module-level startupPromise and startupHasMetricRecorder in packages/logfire-browser/src/webVitals.ts:66-67 persist across the entire page lifecycle, and the resetBrowserWebVitalsForTests() function is only for test use. This means once Web Vitals are configured and registered, a subsequent configure() call in the same page (e.g. HMR in development, or a dynamic reconfigure scenario) will silently reuse the first registration's options. This is intentional because web-vitals callbacks are page-lifetime observers that cannot be unregistered, but it means the metric recorder from the first configure call lives forever. This is fine for production (single configure per page load) but could confuse HMR/development scenarios.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant