
feat: fleet health visualization — status timeline, event log, uptime tracking#92

Merged
TerrifiedBug merged 8 commits into main from fleet-health-a4c
Mar 11, 2026

Conversation

@TerrifiedBug (Owner)

Summary

  • NodeStatusEvent data model: New Prisma model with migration, recording every node status transition (heartbeat timeout, recovery, enrollment) with timestamp, from/to status, and reason
  • tRPC API endpoints: fleet.getStatusTimeline (events in range), fleet.getUptime (uptime %, healthy seconds, incident count), extended fleet.get with currentStatusSince
  • Health tab UI: Tabbed node detail page (Overview / Health / Metrics / Logs) with uptime KPI cards (24h/7d/30d), horizontal status timeline bar with hover tooltips, and chronological event log
  • Cleanup: Status events included in existing metrics retention policy
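From the details above (fromStatus/toStatus/reason/timestamp fields, nullable fromStatus for enrollment, the composite (nodeId, timestamp) index, cascade delete, and the statusEvents reverse relation), the model plausibly looks something like the following sketch — field types and defaults are assumptions, not the PR's exact migration:

```prisma
model NodeStatusEvent {
  id         String      @id @default(cuid())
  nodeId     String
  node       VectorNode  @relation(fields: [nodeId], references: [id], onDelete: Cascade)
  fromStatus NodeStatus? // null for the initial enrollment event
  toStatus   NodeStatus
  reason     String
  timestamp  DateTime    @default(now())

  @@index([nodeId, timestamp]) // serves the timeline range queries
}
```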

Test plan

  • Run Prisma migration on a fresh database
  • Verify enrollment creates initial status event (fromStatus: null, toStatus: HEALTHY, reason: enrolled)
  • Verify heartbeat recovery creates status event when node transitions from UNREACHABLE to HEALTHY
  • Verify checkNodeHealth() creates events when nodes go UNREACHABLE due to heartbeat timeout
  • Verify getStatusTimeline returns events in ascending order for the selected range
  • Verify getUptime returns correct uptime percentage and incident count
  • Verify currentStatusSince on fleet.get shows when current status began
  • Navigate to a node detail page → verify 4 tabs render (Overview, Health, Metrics, Logs)
  • Health tab: verify uptime cards show color-coded percentages
  • Health tab: verify status timeline bar shows colored segments with tooltips
  • Health tab: verify event log shows transitions most-recent-first
  • Verify cleanup deletes old NodeStatusEvent rows per retention policy

Records every node status transition with fromStatus, toStatus, reason,
and timestamp. Adds statusEvents reverse relation on VectorNode.
- fleet-health: add status to select, createMany events before marking nodes UNREACHABLE (reason: heartbeat timeout)
- heartbeat: fetch prev status before update, fire-and-forget event insert on HEALTHY recovery (reason: heartbeat received)
- enroll: await event insert after node creation with fromStatus null and reason: enrolled
Add `getStatusTimeline` and `getUptime` query endpoints to the fleet
router, and extend `fleet.get` to include `currentStatusSince` (the
timestamp of the latest NodeStatusEvent for the node).
…ime cards)

- StatusTimeline: horizontal bar with colored segments per status, range selector, hover tooltips
- EventLog: reverse-chronological table of status transition events with colored dots
- UptimeCards: 24h/7d/30d KPI cards with color-coded uptime percentage and incident count
Reorganizes the single-scroll node detail page into 4 tabs:
- Overview: Node Details, Labels, Pipeline Metrics table
- Health: UptimeCards, StatusTimeline, EventLog (with shared range state)
- Metrics: NodeMetricsCharts
- Logs: NodeLogs

Also displays currentStatusSince ("for 3d 14h") alongside the status
badge in the Overview tab.
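The "for 3d 14h" label can be produced by a small duration formatter along these lines (illustrative only; the component's actual formatting code may differ):

```typescript
// Format the elapsed time since a status began, e.g. "3d 14h".
// `since` and `now` are epoch milliseconds.
function formatDuration(since: number, now: number): string {
  const totalMinutes = Math.floor((now - since) / 60000);
  const days = Math.floor(totalMinutes / 1440);
  const hours = Math.floor((totalMinutes % 1440) / 60);
  const minutes = totalMinutes % 60;
  if (days > 0) return `${days}d ${hours}h`;
  if (hours > 0) return `${hours}h ${minutes}m`;
  return `${minutes}m`;
}
```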
- Wrap event insert + status update in $transaction to prevent race condition
- Filter currentStatusSince by toStatus matching current node status
- Fix uptime precision to 2 decimal places matching UI display
- Await heartbeat status event insert instead of fire-and-forget
The react-hooks/purity lint rule disallows Date.now() during render.
Use React Query's dataUpdatedAt as the "now" reference — it's a stable
value that updates each time data is fetched, keeping timeline segments
aligned with the fetched event data.
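The approach can be sketched as a pure helper that takes "now" explicitly (pass React Query's dataUpdatedAt at the call site), so nothing impure runs during render. Names and shapes here are illustrative, not the component's actual code:

```typescript
type StatusEvent = { timestamp: number; toStatus: string };
type Segment = { status: string; start: number; end: number };

// Derive timeline segments from fetched events. `events` must be
// ascending and within [rangeStart, now]; `now` is an explicit input
// (e.g. queryResult.dataUpdatedAt) rather than Date.now().
function buildSegments(
  events: StatusEvent[],
  statusAtRangeStart: string,
  rangeStart: number,
  now: number,
): Segment[] {
  const segments: Segment[] = [];
  let start = rangeStart;
  let status = statusAtRangeStart;
  for (const ev of events) {
    // Skip zero-width segments when an event falls exactly on `start`.
    if (ev.timestamp > start) segments.push({ status, start, end: ev.timestamp });
    start = ev.timestamp;
    status = ev.toStatus;
  }
  segments.push({ status, start, end: now });
  return segments;
}
```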
@greptile-apps (Contributor)

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds a fleet health visualization layer to VectorFlow: a NodeStatusEvent Prisma model that records every node status transition, two new tRPC query procedures (getStatusTimeline, getUptime), and a redesigned node detail page with a tabbed layout (Overview / Health / Metrics / Logs) surfacing uptime KPI cards, a horizontal status timeline bar, and a chronological event log.

Key observations:

  • Schema & migration are correct. Composite index on (nodeId, timestamp) suits the timeline queries; cascade delete on the FK is appropriate; retention cleanup is wired into the existing cleanupOldMetrics parallel batch.
  • Authorization is properly handled: both new tRPC procedures use withTeamAccess("VIEWER") and the nodeId field is resolved through the existing withTeamAccess middleware path (lines 263–271 of init.ts).
  • Uptime math is correct: time is walked event-by-event from rangeStart to now, accumulating only HEALTHY seconds; the node enrollment toStatus: HEALTHY event provides accurate initial currentStatusSince data on the Overview tab.
  • Race condition in heartbeat recovery: the findUnique → update → create sequence in heartbeat/route.ts is not atomic. Two concurrent heartbeats from the same UNREACHABLE node can both observe the pre-recovery status and both emit a recovery NodeStatusEvent, producing duplicate entries in the event log.
  • Dead else branch in fleet-health.ts: when goingUnreachable is empty, the else path runs updateMany with the identical WHERE predicate that already returned zero rows — it can never update anything and should be removed.
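The event-by-event uptime walk described above can be sketched as a pure function. Identifiers and the incident definition (a transition out of HEALTHY) are assumptions; the router's actual implementation may differ:

```typescript
type StatusEvent = { timestamp: number; toStatus: string };

// Walk events from rangeStart to now, accumulating HEALTHY time.
// `statusAtRangeStart` comes from the "prior event" findFirst lookup;
// `events` must be ascending and within [rangeStart, now] (epoch ms).
function computeUptime(
  events: StatusEvent[],
  statusAtRangeStart: string,
  rangeStart: number,
  now: number,
) {
  let healthyMs = 0;
  let incidents = 0;
  let cursor = rangeStart;
  let status = statusAtRangeStart;

  for (const ev of events) {
    if (status === "HEALTHY") {
      healthyMs += ev.timestamp - cursor;
      if (ev.toStatus !== "HEALTHY") incidents += 1; // leaving HEALTHY = incident
    }
    cursor = ev.timestamp;
    status = ev.toStatus;
  }
  if (status === "HEALTHY") healthyMs += now - cursor; // tail segment

  return {
    healthySeconds: Math.floor(healthyMs / 1000),
    uptimePercent: Number(((healthyMs / (now - rangeStart)) * 100).toFixed(2)),
    incidents,
  };
}
```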

Confidence Score: 4/5

  • Safe to merge; one correctness issue (duplicate recovery events under concurrent heartbeats) is unlikely in normal operation and has no security impact.
  • All authorization paths are correctly wired, the schema migration is clean, uptime calculation logic is sound, and the UI components follow established patterns. The only correctness concern is the non-atomic heartbeat recovery path which can produce duplicate NodeStatusEvents under concurrent heartbeats — rare in practice but worth fixing before the feature ships.
  • src/app/api/agent/heartbeat/route.ts — recovery event creation should be wrapped in a transaction alongside the node status update.

Important Files Changed

Filename Overview
src/server/routers/fleet.ts Adds getStatusTimeline, getUptime, and extends fleet.get with currentStatusSince. Both new procedures correctly use withTeamAccess("VIEWER") and nodeId resolves through withTeamAccess as expected. Uptime math is correct. currentStatusSince logic (latest event where toStatus = current status) is sound.
src/app/api/agent/heartbeat/route.ts Adds recovery event creation on status transition to HEALTHY. The findUnique → update → conditional create sequence is not wrapped in a transaction, allowing duplicate recovery events under concurrent heartbeats from the same node.
src/server/services/fleet-health.ts Health-check now wraps event creation and status update in a transaction when nodes go unreachable. The else branch (dead code when goingUnreachable is empty) is harmless but can be removed. Core logic is correct.
src/server/services/metrics-cleanup.ts Adds NodeStatusEvent deletion to the existing parallel cleanup using the metrics retention cutoff. Consistent with other retention policy entries.
src/app/(dashboard)/fleet/[nodeId]/page.tsx Refactors node detail into four tabs (Overview / Health / Metrics / Logs). Shared timelineRange state correctly drives both StatusTimeline and EventLog. currentStatusSince shown alongside the status badge. Clean restructuring with no regressions to existing functionality.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Heartbeat as /api/agent/heartbeat
    participant Enroll as /api/agent/enroll
    participant DB as PostgreSQL
    participant HealthCheck as checkNodeHealth()
    participant Cleanup as cleanupOldMetrics()
    participant UI as Fleet UI (Health Tab)
    participant tRPC as fleet tRPC router

    Enroll->>DB: vectorNode.create (status=HEALTHY)
    Enroll->>DB: nodeStatusEvent.create (from=null, to=HEALTHY, reason=enrolled)

    Agent->>Heartbeat: POST heartbeat
    Heartbeat->>DB: vectorNode.findUnique (read prevStatus)
    Heartbeat->>DB: vectorNode.update (status=HEALTHY, lastHeartbeat=now)
    alt prevStatus != HEALTHY
        Heartbeat->>DB: nodeStatusEvent.create (from=prevStatus, to=HEALTHY, reason=heartbeat received)
    end

    HealthCheck->>DB: vectorNode.findMany (stale, non-UNREACHABLE)
    alt goingUnreachable.length > 0
        HealthCheck->>DB: $transaction: nodeStatusEvent.createMany + vectorNode.updateMany
    end

    UI->>tRPC: fleet.getStatusTimeline(nodeId, range)
    tRPC->>DB: nodeStatusEvent.findMany (nodeId, timestamp >= since)
    tRPC-->>UI: events[]

    UI->>tRPC: fleet.getUptime(nodeId, range)
    tRPC->>DB: nodeStatusEvent.findMany (in range) + findFirst (prior event)
    tRPC-->>UI: { uptimePercent, incidents, healthySeconds }

    Cleanup->>DB: nodeStatusEvent.deleteMany (timestamp < metricsCutoff)

Comments Outside Diff (1)

  1. src/server/services/fleet-health.ts, lines 1134-1162

    Redundant else branch is dead code

    When goingUnreachable.length === 0, the else branch runs updateMany with the exact same WHERE condition that the preceding findMany already evaluated and returned zero rows for. The updateMany will therefore always match zero rows in this branch and can never do useful work.

    The original intent was likely to preserve the bare updateMany as a safety net, but since both queries share the same filter, the else branch is unreachable in practice and can be removed entirely.


Last reviewed commit: e3e602b

Comment on lines 171 to 210
```diff
@@ -189,6 +195,18 @@ export async function POST(request: Request) {
       },
     });

+    // Record a status transition event when the node recovers from a non-HEALTHY state
+    if (prevNode && prevNode.status !== "HEALTHY") {
+      await prisma.nodeStatusEvent.create({
+        data: {
+          nodeId: agent.nodeId,
+          fromStatus: prevNode.status,
+          toStatus: "HEALTHY",
+          reason: "heartbeat received",
+        },
+      });
+    }
+
     // Merge agent-reported labels with existing UI-set labels.
```

Non-atomic recovery event: duplicate events possible under concurrent heartbeats

prevNode is fetched outside any transaction, so two concurrent heartbeats from the same UNREACHABLE node can both read status !== "HEALTHY", both complete the vectorNode.update, and both call nodeStatusEvent.create. This produces duplicate recovery events in the timeline and event log.

While unlikely in normal operation (agents stagger heartbeats), it becomes more plausible on agent restart or reconnection bursts. Wrapping all three operations in a transaction prevents the race:

Suggested change

```suggestion
    const prevNode = await prisma.vectorNode.findUnique({
      where: { id: agent.nodeId },
      select: { status: true },
    });

    // Update node heartbeat and metadata
    const node = await prisma.$transaction(async (tx) => {
      const updated = await tx.vectorNode.update({
```
Then close the transaction after the conditional nodeStatusEvent.create, ensuring the read-check-write is atomic.

Rule Used: Security & Cryptography Review Rules

When revi... (source)


TerrifiedBug merged commit e466270 into main on Mar 11, 2026 (9 checks passed)
TerrifiedBug deleted the fleet-health-a4c branch on March 11, 2026 at 12:22
