feat: add component latency metrics end-to-end#91

Merged
TerrifiedBug merged 10 commits into main from pipeline-latency-metrics-5df
Mar 11, 2026

Conversation

@TerrifiedBug
Owner

Summary

  • Agent: Scrape component_latency_mean_seconds gauge from Vector 0.54.0's Prometheus endpoint, transmit via heartbeat payload
  • Server: Accept latency in heartbeat validation, store per-component values in MetricStore (real-time) and pipeline-level weighted means in PipelineMetric DB table (historical)
  • API: Expose latency through getPipelineMetrics, getNodePipelineRates, and chartMetrics tRPC endpoints
  • UI: Add latency to 6 surfaces — dashboard chart, pipeline metrics page, flow editor overlay, flow editor "show metrics" panel, fleet node pipeline table, and latency_mean SLI metric type

Changes

  • 24 files changed across Go agent, Prisma schema + migration, server services, tRPC routers, and React UI components
  • Latency is treated as a gauge (direct pass-through, no delta computation) with seconds→milliseconds conversion at the heartbeat boundary
  • Pipeline-level latency uses throughput-weighted mean aggregation across components and nodes
  • New formatLatency() helper with tiered display (us/ms/s)
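As a rough illustration of the tiered display, a formatter along these lines would cover the us/ms/s ranges (tier boundaries and rounding here are assumptions, not the PR's exact code):

```typescript
// Hypothetical sketch of a tiered latency formatter. Input is milliseconds
// (the unit used after the heartbeat-boundary conversion); the output unit
// is chosen by magnitude: seconds, milliseconds, or microseconds.
function formatLatency(ms: number): string {
  if (ms >= 1000) return `${(ms / 1000).toFixed(2)}s`; // 1 s and above
  if (ms >= 1) return `${ms.toFixed(1)}ms`;            // 1 ms up to 1 s
  return `${(ms * 1000).toFixed(0)}µs`;                // below 1 ms
}
```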

Test Plan

  • Go agent builds (go build ./...)
  • Go tests pass (go test ./...)
  • TypeScript compiles (pnpm tsc --noEmit)
  • Prisma schema validates (pnpm prisma validate)
  • Next.js production build succeeds (pnpm build)
  • Dashboard — "Component Latency" chart renders (empty until agent reports latency)
  • Pipeline metrics page — 4th chart "Component Latency" renders
  • Flow editor — "show metrics" panel has 3 charts including latency
  • Flow editor — component overlay shows latency when available
  • Fleet/node detail — "Avg Latency" column in pipeline table
  • Pipeline SLI — latency_mean accepted as metric type

@TerrifiedBug TerrifiedBug force-pushed the pipeline-latency-metrics-5df branch from 52f9790 to ef10532 on March 11, 2026 10:24
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR wires component_latency_mean_seconds from Vector's Prometheus endpoint end-to-end: the Go agent scrapes the gauge, the heartbeat route converts it to milliseconds and computes a throughput-weighted pipeline-level mean, the value is persisted in a new nullable PipelineMetric.latencyMeanMs column, and latency is surfaced across six UI surfaces (dashboard chart, pipeline metrics page, flow editor overlay, show-metrics panel, fleet node table, and a new latency_mean SLI type).

Key design decisions to note:

  • Gauge, not counter — latency is passed through as a direct value; no delta computation in MetricStore.
  • Two-level weighted aggregation — per-component throughput weighting at the heartbeat boundary, then per-node throughput weighting in metrics-ingest.ts.
  • Live vs. historical inconsistency — getNodePipelineRates uses a simple (unweighted) count-based mean across components, while stored historical rows use throughput weighting, so the live "Avg Latency" column can diverge from what the chart shows for pipelines with imbalanced component throughputs.
  • Null → 0 in charts — component-chart.tsx maps null latencyMeanMs values to 0 rather than leaving gaps, which may render misleading 0ms data points for time windows where no latency was scraped (e.g., older agents that don't report latency).

The migration is non-destructive (nullable column, backward compatible with agents that don't report latency), the Zod schema extension is additive, and the SLI evaluator's latency_mean case correctly guards for no-data windows.
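The two-level weighted aggregation noted above can be sketched with a single reusable helper (all names and numbers below are hypothetical, not the PR's code): level 1 weights components within a node by their throughput, and level 2 reuses the same helper with node-level means weighted by node throughput.

```typescript
interface Sample {
  latencyMeanMs: number | null; // null when the agent reported no latency
  throughput: number;           // events/s used as the weight
}

// Generic throughput-weighted mean; returns null when no weighted data exists.
function weightedMean(samples: Sample[]): number | null {
  let sum = 0;
  let weight = 0;
  for (const s of samples) {
    if (s.latencyMeanMs == null) continue; // skip components without latency
    sum += s.latencyMeanMs * s.throughput;
    weight += s.throughput;
  }
  return weight > 0 ? sum / weight : null;
}

// Level 1: components within one node.
const nodeA = weightedMean([
  { latencyMeanMs: 2, throughput: 900 },
  { latencyMeanMs: 10, throughput: 100 },
]); // 2.8 ms for node A

// Level 2: node-level means, weighted by each node's total throughput.
const pipeline = weightedMean([
  { latencyMeanMs: nodeA, throughput: 1000 },
  { latencyMeanMs: 5, throughput: 500 }, // node B's precomputed mean
]);
```

Skipping null samples rather than treating them as 0 keeps pre-0.54.0 agents from dragging the pipeline mean toward zero.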

Confidence Score: 4/5

  • Safe to merge — no data loss, security issues, or runtime errors; two minor display-level inconsistencies noted.
  • The core pipeline (agent scraping → heartbeat → DB storage → tRPC → UI) is correctly implemented with proper weighted aggregation and seconds-to-milliseconds conversion. The migration is backward compatible. The two flagged issues are both presentation-layer: unweighted vs. weighted latency averaging in the live-rates endpoint (can cause visual divergence from historical charts) and null values mapped to 0ms in the pipeline metrics chart (can render misleading flat lines before agents upgrade). Neither causes data corruption or incorrect SLI evaluation.
  • src/server/routers/metrics.ts (live latency averaging strategy) and src/components/metrics/component-chart.tsx (null → 0 fallback in chart data mapping)

Important Files Changed

  • src/app/api/agent/heartbeat/route.ts — Adds latencyMeanSeconds to Zod schema, implements correct throughput-weighted computeWeightedLatency(), and wires latency into both the MetricStore and PipelineMetric ingest paths. Logic is sound.
  • src/server/services/metrics-ingest.ts — Correctly performs a second-level throughput-weighted mean aggregation across nodes when writing to PipelineMetric. Conditional spread ensures null latency doesn't overwrite existing rows.
  • src/server/services/sli-evaluator.ts — New latency_mean SLI case follows the same query + no-data guard pattern as existing cases. _count: true in Prisma aggregate() returns a scalar number, so the === 0 guard is valid and consistent with pre-existing SLI checks.
  • src/server/routers/metrics.ts — Live pipeline latency in getNodePipelineRates uses a simple count-based mean across components, while stored historical data uses throughput-weighted mean — this inconsistency may cause the live and chart values to diverge noticeably for pipelines with imbalanced component throughputs.
  • src/components/metrics/component-chart.tsx — Latency chart correctly renders a single Area (suppresses the out line). However, null latency values are mapped to 0 rather than kept as null, which plots misleading 0ms data points for time windows where no latency was scraped.
  • src/lib/format.ts — New formatLatency() correctly tiers display across s / ms / µs ranges. Edge cases (exactly 0, sub-µs values) are handled gracefully.

Sequence Diagram

sequenceDiagram
    participant Agent as Go Agent
    participant VectorProm as Vector Prometheus
    participant HB as /api/agent/heartbeat
    participant MS as MetricStore (in-memory)
    participant DB as PipelineMetric (PostgreSQL)
    participant tRPC as tRPC Routers

    Agent->>VectorProm: scrape component_latency_mean_seconds
    VectorProm-->>Agent: gauge value per component (seconds)
    Agent->>HB: POST heartbeat { componentMetrics[].latencyMeanSeconds }

    HB->>HB: computeWeightedLatency()<br/>(throughput-weighted, seconds→ms)
    HB->>MS: metricStore.recordTotals(..., latencyMeanSeconds)<br/>converts to latencyMeanMs (gauge, no delta)
    HB->>DB: ingestMetrics({ latencyMeanMs })<br/>writes weighted mean to PipelineMetric row

    Note over DB: metrics-ingest.ts also does<br/>second-level node aggregation<br/>(throughput-weighted across nodes)

    tRPC->>DB: getPipelineMetrics / chartMetrics<br/>reads latencyMeanMs from PipelineMetric
    tRPC->>MS: getNodePipelineRates<br/>reads latest latencyMeanMs from MetricStore<br/>(simple mean across components)
    tRPC-->>tRPC: sli-evaluator latency_mean<br/>AVG(latencyMeanMs) over window

Comments Outside Diff (2)

  1. src/server/routers/metrics.ts, line 659-664 (link)

    Unweighted vs throughput-weighted latency mean inconsistency

    The live-rate latency in getNodePipelineRates is computed as a simple count-based mean across components:

    acc.sum += latest.latencyMeanMs;
    acc.count++;

    But the historical PipelineMetric rows stored by the heartbeat route use a throughput-weighted mean (via computeWeightedLatency). This means the live Avg Latency value shown in the fleet table and flow overlay can differ significantly from the historical chart — especially when pipeline components have very different throughputs (e.g., a high-volume source at 1ms and a low-volume transform at 100ms would give ~50ms live but ~2ms weighted historically).

    Consider weighting by latest.receivedEventsRate + latest.sentEventsRate to stay consistent with the stored aggregate:

    if (latest.latencyMeanMs != null) {
      const weight = latest.receivedEventsRate + latest.sentEventsRate;
      const acc = latencyAcc[matchingNode.pipelineId] ?? { sum: 0, count: 0 };
      acc.sum += latest.latencyMeanMs * weight;
      acc.count += weight;
      latencyAcc[matchingNode.pipelineId] = acc;
    }

    (With the final division still acc.sum / acc.count, guarded for acc.count > 0.)
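The divergence described above can be checked with a quick numeric sketch. The 1ms/100ms latencies match the example in the comment; the throughput figures are assumed for illustration:

```typescript
const components = [
  { latencyMs: 1, eventsPerSec: 1000 }, // high-volume source
  { latencyMs: 100, eventsPerSec: 10 }, // low-volume transform
];

// Unweighted (current live-rates behavior): simple mean across components.
const unweighted =
  components.reduce((s, c) => s + c.latencyMs, 0) / components.length;

// Throughput-weighted (stored historical behavior).
const totalEvents = components.reduce((s, c) => s + c.eventsPerSec, 0);
const weighted =
  components.reduce((s, c) => s + c.latencyMs * c.eventsPerSec, 0) /
  totalEvents;

// unweighted ≈ 50.5 ms, weighted ≈ 1.98 ms — a ~25x gap between the live
// column and the historical chart for the same pipeline.
```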

  2. src/components/metrics/component-chart.tsx, line 407-412 (link)

    Null latency mapped to 0 produces misleading flat line

    Both in and out data keys use m.latencyMeanMs ?? 0 for the "latency" case:

    in: dataKey === "latency" ? (m.latencyMeanMs ?? 0)
    out: dataKey === "latency" ? (m.latencyMeanMs ?? 0)

    When latencyMeanMs is null (no latency was scraped from Vector for that minute — e.g., before the agent is upgraded to 0.54.0), the chart will plot 0ms rather than leaving a gap. A 0ms latency reading looks like accurate data rather than missing data, which can confuse users.

    Consider filtering out null values so Recharts renders a gap instead:

    in: dataKey === "latency" ? (m.latencyMeanMs ?? null)
    out: dataKey === "latency" ? (m.latencyMeanMs ?? null)

    Recharts omits null points from the area when connectNulls is not set (the default).

Last reviewed commit: ef10532

@TerrifiedBug TerrifiedBug merged commit eda9e94 into main Mar 11, 2026
9 checks passed
@TerrifiedBug TerrifiedBug deleted the pipeline-latency-metrics-5df branch March 11, 2026 10:28