feat: add component latency metrics end-to-end#91

Merged
TerrifiedBug merged 10 commits into main from pipeline-latency-metrics-5df
Mar 11, 2026

Conversation

@TerrifiedBug
Owner

Summary

  • Agent: Scrape component_latency_mean_seconds gauge from Vector 0.54.0's Prometheus endpoint, transmit via heartbeat payload
  • Server: Accept latency in heartbeat validation, store per-component values in MetricStore (real-time) and pipeline-level weighted means in PipelineMetric DB table (historical)
  • API: Expose latency through getPipelineMetrics, getNodePipelineRates, and chartMetrics tRPC endpoints
  • UI: Add latency to 6 surfaces — dashboard chart, pipeline metrics page, flow editor overlay, flow editor "show metrics" panel, fleet node pipeline table, and latency_mean SLI metric type

Changes

  • 24 files changed across Go agent, Prisma schema + migration, server services, tRPC routers, and React UI components
  • Latency is treated as a gauge (direct pass-through, no delta computation) with seconds→milliseconds conversion at the heartbeat boundary
  • Pipeline-level latency uses throughput-weighted mean aggregation across components and nodes
  • New formatLatency() helper with tiered display (us/ms/s)
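As a rough illustration of the tiered display, a formatter along these lines would cover the us/ms/s ranges (tier boundaries and rounding here are assumptions, not the PR's exact code):

```typescript
// Hypothetical sketch of a tiered latency formatter. Input is milliseconds
// (the unit used after the heartbeat-boundary conversion); the output unit
// is chosen by magnitude: seconds, milliseconds, or microseconds.
function formatLatency(ms: number): string {
  if (ms >= 1000) return `${(ms / 1000).toFixed(2)}s`; // 1 s and above
  if (ms >= 1) return `${ms.toFixed(1)}ms`;            // 1 ms up to 1 s
  return `${(ms * 1000).toFixed(0)}µs`;                // below 1 ms
}
```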

Test Plan

  • Go agent builds (go build ./...)
  • Go tests pass (go test ./...)
  • TypeScript compiles (pnpm tsc --noEmit)
  • Prisma schema validates (pnpm prisma validate)
  • Next.js production build succeeds (pnpm build)
  • Dashboard — "Component Latency" chart renders (empty until agent reports latency)
  • Pipeline metrics page — 4th chart "Component Latency" renders
  • Flow editor — "show metrics" panel has 3 charts including latency
  • Flow editor — component overlay shows latency when available
  • Fleet/node detail — "Avg Latency" column in pipeline table
  • Pipeline SLI — latency_mean accepted as metric type

@TerrifiedBug TerrifiedBug force-pushed the pipeline-latency-metrics-5df branch from 52f9790 to ef10532 on March 11, 2026 10:24
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR wires component_latency_mean_seconds from Vector's Prometheus endpoint end-to-end: the Go agent scrapes the gauge, the heartbeat route converts it to milliseconds and computes a throughput-weighted pipeline-level mean, the value is persisted in a new nullable PipelineMetric.latencyMeanMs column, and latency is surfaced across six UI surfaces (dashboard chart, pipeline metrics page, flow editor overlay, show-metrics panel, fleet node table, and a new latency_mean SLI type).

Key design decisions to note:

  • Gauge, not counter — latency is passed through as a direct value; no delta computation in MetricStore.
  • Two-level weighted aggregation — per-component throughput weighting at the heartbeat boundary, then per-node throughput weighting in metrics-ingest.ts.
  • Live vs. historical inconsistency — getNodePipelineRates uses a simple (unweighted) count-based mean across components, while stored historical rows use throughput weighting, so the live "Avg Latency" column can diverge from what the chart shows for pipelines with imbalanced component throughputs.
  • Null → 0 in charts — component-chart.tsx maps null latencyMeanMs values to 0 rather than leaving gaps, which may render misleading 0ms data points for time windows where no latency was scraped (e.g., older agents that don't report latency).

The migration is non-destructive (nullable column, backward compatible with agents that don't report latency), the Zod schema extension is additive, and the SLI evaluator's latency_mean case correctly guards for no-data windows.
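The two-level weighted aggregation noted above can be sketched with a single reusable helper (all names and numbers below are hypothetical, not the PR's code): level 1 weights components within a node by their throughput, and level 2 reuses the same helper with node-level means weighted by node throughput.

```typescript
interface Sample {
  latencyMeanMs: number | null; // null when the agent reported no latency
  throughput: number;           // events/s used as the weight
}

// Generic throughput-weighted mean; returns null when no weighted data exists.
function weightedMean(samples: Sample[]): number | null {
  let sum = 0;
  let weight = 0;
  for (const s of samples) {
    if (s.latencyMeanMs == null) continue; // skip components without latency
    sum += s.latencyMeanMs * s.throughput;
    weight += s.throughput;
  }
  return weight > 0 ? sum / weight : null;
}

// Level 1: components within one node.
const nodeA = weightedMean([
  { latencyMeanMs: 2, throughput: 900 },
  { latencyMeanMs: 10, throughput: 100 },
]); // 2.8 ms for node A

// Level 2: node-level means, weighted by each node's total throughput.
const pipeline = weightedMean([
  { latencyMeanMs: nodeA, throughput: 1000 },
  { latencyMeanMs: 5, throughput: 500 }, // node B's precomputed mean
]);
```

Skipping null samples rather than treating them as 0 keeps pre-0.54.0 agents from dragging the pipeline mean toward zero.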

Confidence Score: 4/5

  • Safe to merge — no data loss, security issues, or runtime errors; two minor display-level inconsistencies noted.
  • The core pipeline (agent scraping → heartbeat → DB storage → tRPC → UI) is correctly implemented with proper weighted aggregation and seconds-to-milliseconds conversion. The migration is backward compatible. The two flagged issues are both presentation-layer: unweighted vs. weighted latency averaging in the live-rates endpoint (can cause visual divergence from historical charts) and null values mapped to 0ms in the pipeline metrics chart (can render misleading flat lines before agents upgrade). Neither causes data corruption or incorrect SLI evaluation.
  • src/server/routers/metrics.ts (live latency averaging strategy) and src/components/metrics/component-chart.tsx (null → 0 fallback in chart data mapping)

Important Files Changed

  • src/app/api/agent/heartbeat/route.ts — Adds latencyMeanSeconds to Zod schema, implements correct throughput-weighted computeWeightedLatency(), and wires latency into both the MetricStore and PipelineMetric ingest paths. Logic is sound.
  • src/server/services/metrics-ingest.ts — Correctly performs a second-level throughput-weighted mean aggregation across nodes when writing to PipelineMetric. Conditional spread ensures null latency doesn't overwrite existing rows.
  • src/server/services/sli-evaluator.ts — New latency_mean SLI case follows the same query + no-data guard pattern as existing cases. _count: true in Prisma aggregate() returns a scalar number, so the === 0 guard is valid and consistent with pre-existing SLI checks.
  • src/server/routers/metrics.ts — Live pipeline latency in getNodePipelineRates uses a simple count-based mean across components, while stored historical data uses throughput-weighted mean — this inconsistency may cause the live and chart values to diverge noticeably for pipelines with imbalanced component throughputs.
  • src/components/metrics/component-chart.tsx — Latency chart correctly renders a single Area (suppresses the out line). However, null latency values are mapped to 0 rather than kept as null, which plots misleading 0ms data points for time windows where no latency was scraped.
  • src/lib/format.ts — New formatLatency() correctly tiers display across s / ms / µs ranges. Edge cases (exactly 0, sub-µs values) are handled gracefully.

Sequence Diagram

sequenceDiagram
    participant Agent as Go Agent
    participant VectorProm as Vector Prometheus
    participant HB as /api/agent/heartbeat
    participant MS as MetricStore (in-memory)
    participant DB as PipelineMetric (PostgreSQL)
    participant tRPC as tRPC Routers

    Agent->>VectorProm: scrape component_latency_mean_seconds
    VectorProm-->>Agent: gauge value per component (seconds)
    Agent->>HB: POST heartbeat { componentMetrics[].latencyMeanSeconds }

    HB->>HB: computeWeightedLatency()<br/>(throughput-weighted, seconds→ms)
    HB->>MS: metricStore.recordTotals(..., latencyMeanSeconds)<br/>converts to latencyMeanMs (gauge, no delta)
    HB->>DB: ingestMetrics({ latencyMeanMs })<br/>writes weighted mean to PipelineMetric row

    Note over DB: metrics-ingest.ts also does<br/>second-level node aggregation<br/>(throughput-weighted across nodes)

    tRPC->>DB: getPipelineMetrics / chartMetrics<br/>reads latencyMeanMs from PipelineMetric
    tRPC->>MS: getNodePipelineRates<br/>reads latest latencyMeanMs from MetricStore<br/>(simple mean across components)
    tRPC-->>tRPC: sli-evaluator latency_mean<br/>AVG(latencyMeanMs) over window

Comments Outside Diff (2)

  1. src/server/routers/metrics.ts, line 659-664 (link)

    Unweighted vs throughput-weighted latency mean inconsistency

    The live-rate latency in getNodePipelineRates is computed as a simple count-based mean across components:

    acc.sum += latest.latencyMeanMs;
    acc.count++;

    But the historical PipelineMetric rows stored by the heartbeat route use a throughput-weighted mean (via computeWeightedLatency). This means the live Avg Latency value shown in the fleet table and flow overlay can differ significantly from the historical chart — especially when pipeline components have very different throughputs (e.g., a high-volume source at 1ms and a low-volume transform at 100ms would give ~50ms live but ~2ms weighted historically).

    Consider weighting by latest.receivedEventsRate + latest.sentEventsRate to stay consistent with the stored aggregate:

    if (latest.latencyMeanMs != null) {
      const weight = latest.receivedEventsRate + latest.sentEventsRate;
      const acc = latencyAcc[matchingNode.pipelineId] ?? { sum: 0, count: 0 };
      acc.sum += latest.latencyMeanMs * weight;
      acc.count += weight;
      latencyAcc[matchingNode.pipelineId] = acc;
    }

    (With the final division still acc.sum / acc.count, guarded for acc.count > 0.)
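The divergence described above can be checked with a quick numeric sketch. The 1ms/100ms latencies match the example in the comment; the throughput figures are assumed for illustration:

```typescript
const components = [
  { latencyMs: 1, eventsPerSec: 1000 }, // high-volume source
  { latencyMs: 100, eventsPerSec: 10 }, // low-volume transform
];

// Unweighted (current live-rates behavior): simple mean across components.
const unweighted =
  components.reduce((s, c) => s + c.latencyMs, 0) / components.length;

// Throughput-weighted (stored historical behavior).
const totalEvents = components.reduce((s, c) => s + c.eventsPerSec, 0);
const weighted =
  components.reduce((s, c) => s + c.latencyMs * c.eventsPerSec, 0) /
  totalEvents;

// unweighted ≈ 50.5 ms, weighted ≈ 1.98 ms — a ~25x gap between the live
// column and the historical chart for the same pipeline.
```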

  2. src/components/metrics/component-chart.tsx, line 407-412 (link)

    Null latency mapped to 0 produces misleading flat line

    Both in and out data keys use m.latencyMeanMs ?? 0 for the "latency" case:

    in: dataKey === "latency" ? (m.latencyMeanMs ?? 0)
    out: dataKey === "latency" ? (m.latencyMeanMs ?? 0)

    When latencyMeanMs is null (no latency was scraped from Vector for that minute — e.g., before the agent is upgraded to 0.54.0), the chart will plot 0ms rather than leaving a gap. A 0ms latency reading looks like accurate data rather than missing data, which can confuse users.

    Consider filtering out null values so Recharts renders a gap instead:

    in: dataKey === "latency" ? (m.latencyMeanMs ?? null)
    out: dataKey === "latency" ? (m.latencyMeanMs ?? null)

    Recharts omits null points from the area when connectNulls is not set (the default).

Last reviewed commit: ef10532

@TerrifiedBug TerrifiedBug merged commit eda9e94 into main Mar 11, 2026
9 checks passed
@TerrifiedBug TerrifiedBug deleted the pipeline-latency-metrics-5df branch March 11, 2026 10:28