
feat: per-component transform latency on pipeline metrics page#95

Merged
TerrifiedBug merged 12 commits into main from fix-component-latency-metrics-191
Mar 11, 2026

Conversation

@TerrifiedBug
Owner

Summary

  • Add nullable componentId column to PipelineMetric to store per-component latency rows alongside aggregate rows
  • Add componentId: null filter to all 13 existing aggregate queries across 5 files to prevent per-component rows from inflating metrics
  • Write per-component latency rows in the heartbeat handler via createMany (separate from the delta-tracking ingestMetrics pipeline, since latency is a gauge)
  • Add getComponentLatencyHistory tRPC procedure for historical per-component latency data
  • Replace single-line aggregate latency chart on pipeline metrics page with multi-line chart (one line per transform component, deterministic color palette)
  • Rename "Component Latency" → "Transform Latency" everywhere (Vector only emits component_latency_mean_seconds for transforms, not sources/sinks)
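The per-component gauge rows are keyed to a shared minute bucket. A minimal sketch of that truncation, assuming a hypothetical helper like `truncateToMinute` below (the real handler's code may differ):

```typescript
// Hypothetical helper illustrating minute-bucketing of gauge timestamps;
// the name truncateToMinute is not from the PR.
function truncateToMinute(date: Date): Date {
  const bucketed = new Date(date);
  bucketed.setSeconds(0, 0); // zero out seconds and milliseconds
  return bucketed;
}

// Two heartbeats arriving in the same minute map to the same row key.
const a = truncateToMinute(new Date("2026-03-11T14:14:07.100Z"));
const b = truncateToMinute(new Date("2026-03-11T14:14:59.900Z"));
console.log(a.toISOString());            // 2026-03-11T14:14:00.000Z
console.log(a.getTime() === b.getTime()); // true
```

Bucketing this way is what lets the upsert loop treat a (pipelineId, nodeId, componentId, minute) tuple as a stable row identity.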

Test plan

  • Run a pipeline with multiple transform components and verify per-component latency rows appear in PipelineMetric table
  • Verify pipeline metrics page (/pipelines/[id]/metrics) shows multi-line transform latency chart with one line per component
  • Verify main dashboard still shows single aggregate "Transform Latency" chart (not per-component)
  • Verify fleet detail table still shows aggregate latency column
  • Verify no "Component Latency" or "Pipeline Latency" labels remain in the UI
  • Confirm SLI evaluator uses aggregate-only data for latency alerts

@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds per-component transform latency tracking to VectorFlow's pipeline metrics page. It extends PipelineMetric with a nullable componentId column, writes per-component gauge rows from the heartbeat handler, introduces a getComponentLatencyHistory tRPC procedure (correctly gated with withTeamAccess("VIEWER")), and replaces the single aggregate latency chart with a multi-line Recharts LineChart — one line per transform, with a deterministic colour palette. The 13 existing aggregate queries across all 5 files correctly gain componentId: null filters to prevent the new per-component rows from inflating any dashboard, SLI, or fleet metric.

Key points:

  • getComponentLatencyHistory correctly applies withTeamAccess("VIEWER") and averages multi-node rows server-side before returning data to the client, so multi-node deployments produce one line per component rather than one line per node.
  • The heartbeat upsert loop uses sequential findFirst + update/create (intentionally awaited to avoid the TOCTOU fire-and-forget race flagged in prior review). However, the migration does not add a unique constraint on (pipelineId, nodeId, componentId, timestamp), meaning concurrent heartbeats within the same minute window can still race and create duplicate rows — the database-level guard is missing.
  • The tooltip formatter uses Number(value) ?? 0, which is a no-op because Number() never returns null/undefined; NaN falls through and can produce "NaN ms" when a series has no value at the hovered timestamp. Using || 0 instead is the correct fix.
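The `??` vs `||` point can be demonstrated in isolation (illustrative functions, not the PR's actual formatter):

```typescript
// Illustrative only: shows why ?? cannot guard against NaN from Number().
const brokenFormat = (value: any) => Number(value) ?? 0; // ?? never fires: Number() returns a number, never null/undefined
const fixedFormat = (value: any) => Number(value) || 0;  // NaN is falsy, so || 0 replaces it

console.log(brokenFormat(undefined)); // NaN -> would render as "NaN ms"
console.log(fixedFormat(undefined));  // 0
console.log(fixedFormat("12.5"));     // 12.5
```

Note that `|| 0` also coerces a genuine 0 to 0, which is harmless for a latency gauge.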

Confidence Score: 3/5

  • Safe to merge functionally, but the missing unique constraint on the per-component metric tuple will accumulate duplicate rows over time and should be addressed.
  • The core feature is correctly implemented — auth middleware is present, multi-node averaging is correct, and all 13 existing aggregate queries are properly filtered. The confidence reduction comes from the absent database-level unique constraint, which means the soft deduplication in the heartbeat handler can be bypassed by concurrent requests, leading to indefinite table bloat.
  • prisma/migrations/20260311030000_add_component_id_to_pipeline_metric/migration.sql — a unique constraint on (pipelineId, nodeId, componentId, timestamp) should be added to enforce the deduplication invariant at the storage layer.
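A migration-level guard could look roughly like this sketch (the index name is illustrative, not from the PR; note that on PostgreSQL versions before 15, NULLs compare as distinct in unique indexes, so aggregate rows with a NULL componentId or nodeId would not be deduplicated by it):

```sql
-- Sketch only; index name is illustrative, not from the PR.
-- Deduplicates per-component rows; rows where "componentId" or "nodeId"
-- is NULL are unaffected on PostgreSQL < 15, since NULLs compare as distinct.
CREATE UNIQUE INDEX "PipelineMetric_pipelineId_nodeId_componentId_timestamp_key"
  ON "PipelineMetric" ("pipelineId", "nodeId", "componentId", "timestamp");
```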

Important Files Changed

  • src/server/routers/metrics.ts: Adds getComponentLatencyHistory tRPC procedure with correct withTeamAccess("VIEWER") middleware. Server-side averaging across nodes per (componentId, timestamp) is correct. componentId: null filter added to getPipelineMetrics.
  • src/app/api/agent/heartbeat/route.ts: Writes per-component latency rows via a sequential findFirst + conditional create/update upsert loop, keyed on a shared minute-truncated minuteTimestamp. Without a unique constraint in the migration, a TOCTOU race between concurrent heartbeats can still create duplicate rows, though sequential awaiting narrows the window compared to the original fire-and-forget approach.
  • prisma/migrations/20260311030000_add_component_id_to_pipeline_metric/migration.sql: Adds the nullable componentId column and a compound index on (pipelineId, componentId, timestamp). No unique constraint on (pipelineId, nodeId, componentId, timestamp) is added, leaving the TOCTOU duplicate-row race unguarded at the database level.
  • src/app/(dashboard)/pipelines/[id]/metrics/page.tsx: Replaces the aggregate latency chart with TransformLatencyChart. useMemo correctly re-derives chart data from the server response. Default minutes is initialised to 60, making "1h" the highlighted tab on load (consistent and intentional). The Number(value) ?? 0 fallback in the tooltip formatter is a no-op because Number() never returns null/undefined, but this only affects visual edge cases.
  • src/server/routers/dashboard.ts: All 8 aggregate queries correctly gain componentId: null to exclude the new per-component rows from dashboard stats, latency history, and fleet metrics.
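The multi-node averaging described for getComponentLatencyHistory can be sketched as a pure function (row shape and names are assumptions, not the PR's actual types):

```typescript
// Assumed row shape; the PR's Prisma model may differ.
interface MetricRow {
  componentId: string;
  nodeId: string;
  timestamp: string; // ISO minute bucket
  latencyMeanMs: number;
}

// Collapse multi-node rows to one point per (componentId, timestamp)
// by averaging latencyMeanMs across nodes.
function averageAcrossNodes(
  rows: MetricRow[]
): Record<string, { timestamp: string; latencyMeanMs: number }[]> {
  const buckets = new Map<string, { sum: number; count: number }>();
  for (const r of rows) {
    // Assumes componentId contains no "|"; a real implementation
    // would group on a structured key instead.
    const key = `${r.componentId}|${r.timestamp}`;
    const b = buckets.get(key) ?? { sum: 0, count: 0 };
    b.sum += r.latencyMeanMs;
    b.count += 1;
    buckets.set(key, b);
  }
  const out: Record<string, { timestamp: string; latencyMeanMs: number }[]> = {};
  for (const [key, { sum, count }] of buckets) {
    const [componentId, timestamp] = key.split("|");
    (out[componentId] ??= []).push({ timestamp, latencyMeanMs: sum / count });
  }
  return out;
}
```

With this shape, a two-node deployment reporting 10 ms and 30 ms for the same transform in the same minute yields a single 20 ms point, so the chart draws one line per component rather than one per node.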

Sequence Diagram

sequenceDiagram
    participant Agent as Vector Agent
    participant HB as /api/agent/heartbeat
    participant Ingest as ingestMetrics()
    participant DB as PipelineMetric (DB)
    participant tRPC as getComponentLatencyHistory
    participant UI as Pipeline Metrics Page

    Agent->>HB: POST heartbeat {pipelines, componentMetrics}
    HB->>Ingest: fire-and-forget (counter deltas → nodeId rows + nodeId:null aggregate)
    Note over HB: minuteTimestamp = now with seconds zeroed
    loop per pipeline × per component
        HB->>DB: findFirst(pipelineId, nodeId, componentId, timestamp)
        alt row exists
            HB->>DB: update latencyMeanMs
        else
            HB->>DB: create row {nodeId≠null, componentId≠null}
        end
    end
    HB-->>Agent: 200 OK

    UI->>tRPC: getComponentLatencyHistory(pipelineId, minutes)
    tRPC->>DB: findMany(pipelineId, componentId≠null, timestamp≥since)
    DB-->>tRPC: rows [{componentId, timestamp, latencyMeanMs}]
    Note over tRPC: Average rows by (componentId, timestamp)<br/>to collapse multi-node deployments
    tRPC-->>UI: {components: Record<id, [{timestamp, latencyMeanMs}]>}
    UI->>UI: Render multi-line LineChart (one line per transform)

Last reviewed commit: 3691f2b

  • Add withTeamAccess("VIEWER") to getComponentLatencyHistory procedure
  • Replace createMany with findFirst + upsert to deduplicate per-component latency rows within the same minute bucket
  • Average per-component latency across nodes in getComponentLatencyHistory to handle multi-node pipeline deployments correctly
  • Await per-component latency upserts sequentially to eliminate the TOCTOU race between concurrent heartbeat requests
@TerrifiedBug TerrifiedBug merged commit 16a2773 into main Mar 11, 2026
10 checks passed
@TerrifiedBug TerrifiedBug deleted the fix-component-latency-metrics-191 branch March 11, 2026 14:14
