
feat: fleet health visualization — status timeline, event log, uptime tracking#92

Merged
TerrifiedBug merged 8 commits into main from fleet-health-a4c
Mar 11, 2026

Conversation

@TerrifiedBug (Owner)

Summary

  • NodeStatusEvent data model: New Prisma model with migration, recording every node status transition (heartbeat timeout, recovery, enrollment) with timestamp, from/to status, and reason
  • tRPC API endpoints: fleet.getStatusTimeline (events in range), fleet.getUptime (uptime %, healthy seconds, incident count), extended fleet.get with currentStatusSince
  • Health tab UI: Tabbed node detail page (Overview / Health / Metrics / Logs) with uptime KPI cards (24h/7d/30d), horizontal status timeline bar with hover tooltips, and chronological event log
  • Cleanup: Status events included in existing metrics retention policy
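From the details above (fromStatus/toStatus/reason/timestamp fields, nullable fromStatus for enrollment, the composite (nodeId, timestamp) index, cascade delete, and the statusEvents reverse relation), the model plausibly looks something like the following sketch — field types and defaults are assumptions, not the PR's exact migration:

```prisma
model NodeStatusEvent {
  id         String      @id @default(cuid())
  nodeId     String
  node       VectorNode  @relation(fields: [nodeId], references: [id], onDelete: Cascade)
  fromStatus NodeStatus? // null for the initial enrollment event
  toStatus   NodeStatus
  reason     String
  timestamp  DateTime    @default(now())

  @@index([nodeId, timestamp]) // serves the timeline range queries
}
```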

Test plan

  • Run Prisma migration on a fresh database
  • Verify enrollment creates initial status event (fromStatus: null, toStatus: HEALTHY, reason: enrolled)
  • Verify heartbeat recovery creates status event when node transitions from UNREACHABLE to HEALTHY
  • Verify checkNodeHealth() creates events when nodes go UNREACHABLE due to heartbeat timeout
  • Verify getStatusTimeline returns events in ascending order for the selected range
  • Verify getUptime returns correct uptime percentage and incident count
  • Verify currentStatusSince on fleet.get shows when current status began
  • Navigate to a node detail page → verify 4 tabs render (Overview, Health, Metrics, Logs)
  • Health tab: verify uptime cards show color-coded percentages
  • Health tab: verify status timeline bar shows colored segments with tooltips
  • Health tab: verify event log shows transitions most-recent-first
  • Verify cleanup deletes old NodeStatusEvent rows per retention policy

Records every node status transition with fromStatus, toStatus, reason,
and timestamp. Adds statusEvents reverse relation on VectorNode.
- fleet-health: add status to select, createMany events before marking nodes UNREACHABLE (reason: heartbeat timeout)
- heartbeat: fetch prev status before update, fire-and-forget event insert on HEALTHY recovery (reason: heartbeat received)
- enroll: await event insert after node creation with fromStatus null and reason: enrolled
Add `getStatusTimeline` and `getUptime` query endpoints to the fleet
router, and extend `fleet.get` to include `currentStatusSince` (the
timestamp of the latest NodeStatusEvent for the node).
…ime cards)

- StatusTimeline: horizontal bar with colored segments per status, range selector, hover tooltips
- EventLog: reverse-chronological table of status transition events with colored dots
- UptimeCards: 24h/7d/30d KPI cards with color-coded uptime percentage and incident count
Reorganizes the single-scroll node detail page into 4 tabs:
- Overview: Node Details, Labels, Pipeline Metrics table
- Health: UptimeCards, StatusTimeline, EventLog (with shared range state)
- Metrics: NodeMetricsCharts
- Logs: NodeLogs

Also displays currentStatusSince ("for 3d 14h") alongside the status
badge in the Overview tab.
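The "for 3d 14h" label can be produced by a small duration formatter along these lines (illustrative only; the component's actual formatting code may differ):

```typescript
// Format the elapsed time since a status began, e.g. "3d 14h".
// `since` and `now` are epoch milliseconds.
function formatDuration(since: number, now: number): string {
  const totalMinutes = Math.floor((now - since) / 60000);
  const days = Math.floor(totalMinutes / 1440);
  const hours = Math.floor((totalMinutes % 1440) / 60);
  const minutes = totalMinutes % 60;
  if (days > 0) return `${days}d ${hours}h`;
  if (hours > 0) return `${hours}h ${minutes}m`;
  return `${minutes}m`;
}
```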
- Wrap event insert + status update in $transaction to prevent race condition
- Filter currentStatusSince by toStatus matching current node status
- Fix uptime precision to 2 decimal places matching UI display
- Await heartbeat status event insert instead of fire-and-forget
The react-hooks/purity lint rule disallows Date.now() during render.
Use React Query's dataUpdatedAt as the "now" reference — it's a stable
value that updates each time data is fetched, keeping timeline segments
aligned with the fetched event data.
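The approach can be sketched as a pure helper that takes "now" explicitly (pass React Query's dataUpdatedAt at the call site), so nothing impure runs during render. Names and shapes here are illustrative, not the component's actual code:

```typescript
type StatusEvent = { timestamp: number; toStatus: string };
type Segment = { status: string; start: number; end: number };

// Derive timeline segments from fetched events. `events` must be
// ascending and within [rangeStart, now]; `now` is an explicit input
// (e.g. queryResult.dataUpdatedAt) rather than Date.now().
function buildSegments(
  events: StatusEvent[],
  statusAtRangeStart: string,
  rangeStart: number,
  now: number,
): Segment[] {
  const segments: Segment[] = [];
  let start = rangeStart;
  let status = statusAtRangeStart;
  for (const ev of events) {
    // Skip zero-width segments when an event falls exactly on `start`.
    if (ev.timestamp > start) segments.push({ status, start, end: ev.timestamp });
    start = ev.timestamp;
    status = ev.toStatus;
  }
  segments.push({ status, start, end: now });
  return segments;
}
```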
@greptile-apps (Contributor)

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds a fleet health visualization layer to VectorFlow: a NodeStatusEvent Prisma model that records every node status transition, two new tRPC query procedures (getStatusTimeline, getUptime), and a redesigned node detail page with a tabbed layout (Overview / Health / Metrics / Logs) surfacing uptime KPI cards, a horizontal status timeline bar, and a chronological event log.

Key observations:

  • Schema & migration are correct. Composite index on (nodeId, timestamp) suits the timeline queries; cascade delete on the FK is appropriate; retention cleanup is wired into the existing cleanupOldMetrics parallel batch.
  • Authorization is properly handled: both new tRPC procedures use withTeamAccess("VIEWER") and the nodeId field is resolved through the existing withTeamAccess middleware path (lines 263–271 of init.ts).
  • Uptime math is correct: time is walked event-by-event from rangeStart to now, accumulating only HEALTHY seconds; the node enrollment toStatus: HEALTHY event provides accurate initial currentStatusSince data on the Overview tab.
  • Race condition in heartbeat recovery: the findUnique → update → create sequence in heartbeat/route.ts is not atomic. Two concurrent heartbeats from the same UNREACHABLE node can both observe the pre-recovery status and both emit a recovery NodeStatusEvent, producing duplicate entries in the event log.
  • Dead else branch in fleet-health.ts: when goingUnreachable is empty, the else path runs updateMany with the identical WHERE predicate that already returned zero rows — it can never update anything and should be removed.
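The event-by-event uptime walk described above can be sketched as a pure function. Identifiers and the incident definition (a transition out of HEALTHY) are assumptions; the router's actual implementation may differ:

```typescript
type StatusEvent = { timestamp: number; toStatus: string };

// Walk events from rangeStart to now, accumulating HEALTHY time.
// `statusAtRangeStart` comes from the "prior event" findFirst lookup;
// `events` must be ascending and within [rangeStart, now] (epoch ms).
function computeUptime(
  events: StatusEvent[],
  statusAtRangeStart: string,
  rangeStart: number,
  now: number,
) {
  let healthyMs = 0;
  let incidents = 0;
  let cursor = rangeStart;
  let status = statusAtRangeStart;

  for (const ev of events) {
    if (status === "HEALTHY") {
      healthyMs += ev.timestamp - cursor;
      if (ev.toStatus !== "HEALTHY") incidents += 1; // leaving HEALTHY = incident
    }
    cursor = ev.timestamp;
    status = ev.toStatus;
  }
  if (status === "HEALTHY") healthyMs += now - cursor; // tail segment

  return {
    healthySeconds: Math.floor(healthyMs / 1000),
    uptimePercent: Number(((healthyMs / (now - rangeStart)) * 100).toFixed(2)),
    incidents,
  };
}
```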

Confidence Score: 4/5

  • Safe to merge; one correctness issue (duplicate recovery events under concurrent heartbeats) is unlikely in normal operation and has no security impact.
  • All authorization paths are correctly wired, the schema migration is clean, uptime calculation logic is sound, and the UI components follow established patterns. The only correctness concern is the non-atomic heartbeat recovery path which can produce duplicate NodeStatusEvents under concurrent heartbeats — rare in practice but worth fixing before the feature ships.
  • src/app/api/agent/heartbeat/route.ts — recovery event creation should be wrapped in a transaction alongside the node status update.

Important Files Changed

Filename Overview
src/server/routers/fleet.ts Adds getStatusTimeline, getUptime, and extends fleet.get with currentStatusSince. Both new procedures correctly use withTeamAccess("VIEWER") and nodeId resolves through withTeamAccess as expected. Uptime math is correct. currentStatusSince logic (latest event where toStatus = current status) is sound.
src/app/api/agent/heartbeat/route.ts Adds recovery event creation on status transition to HEALTHY. The findUnique → update → conditional create sequence is not wrapped in a transaction, allowing duplicate recovery events under concurrent heartbeats from the same node.
src/server/services/fleet-health.ts Health-check now wraps event creation and status update in a transaction when nodes go unreachable. The else branch (dead code when goingUnreachable is empty) is harmless but can be removed. Core logic is correct.
src/server/services/metrics-cleanup.ts Adds NodeStatusEvent deletion to the existing parallel cleanup using the metrics retention cutoff. Consistent with other retention policy entries.
src/app/(dashboard)/fleet/[nodeId]/page.tsx Refactors node detail into four tabs (Overview / Health / Metrics / Logs). Shared timelineRange state correctly drives both StatusTimeline and EventLog. currentStatusSince shown alongside the status badge. Clean restructuring with no regressions to existing functionality.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Heartbeat as /api/agent/heartbeat
    participant Enroll as /api/agent/enroll
    participant DB as PostgreSQL
    participant HealthCheck as checkNodeHealth()
    participant Cleanup as cleanupOldMetrics()
    participant UI as Fleet UI (Health Tab)
    participant tRPC as fleet tRPC router

    Enroll->>DB: vectorNode.create (status=HEALTHY)
    Enroll->>DB: nodeStatusEvent.create (from=null, to=HEALTHY, reason=enrolled)

    Agent->>Heartbeat: POST heartbeat
    Heartbeat->>DB: vectorNode.findUnique (read prevStatus)
    Heartbeat->>DB: vectorNode.update (status=HEALTHY, lastHeartbeat=now)
    alt prevStatus != HEALTHY
        Heartbeat->>DB: nodeStatusEvent.create (from=prevStatus, to=HEALTHY, reason=heartbeat received)
    end

    HealthCheck->>DB: vectorNode.findMany (stale, non-UNREACHABLE)
    alt goingUnreachable.length > 0
        HealthCheck->>DB: $transaction: nodeStatusEvent.createMany + vectorNode.updateMany
    end

    UI->>tRPC: fleet.getStatusTimeline(nodeId, range)
    tRPC->>DB: nodeStatusEvent.findMany (nodeId, timestamp >= since)
    tRPC-->>UI: events[]

    UI->>tRPC: fleet.getUptime(nodeId, range)
    tRPC->>DB: nodeStatusEvent.findMany (in range) + findFirst (prior event)
    tRPC-->>UI: { uptimePercent, incidents, healthySeconds }

    Cleanup->>DB: nodeStatusEvent.deleteMany (timestamp < metricsCutoff)

Comments Outside Diff (1)

  1. src/server/services/fleet-health.ts, lines 1134-1162

    Redundant else branch is dead code

    When goingUnreachable.length === 0, the else branch runs updateMany with the exact same WHERE condition that the preceding findMany already evaluated and returned zero rows for. The updateMany will therefore always match zero rows in this branch and can never do useful work.

    The original intent was likely to preserve the bare updateMany as a safety net, but since both queries share the same filter, the else branch is unreachable in practice and can be removed entirely.


Last reviewed commit: e3e602b

Comment on lines 171 to 210
```diff
@@ -189,6 +195,18 @@ export async function POST(request: Request) {
       },
     });

+    // Record a status transition event when the node recovers from a non-HEALTHY state
+    if (prevNode && prevNode.status !== "HEALTHY") {
+      await prisma.nodeStatusEvent.create({
+        data: {
+          nodeId: agent.nodeId,
+          fromStatus: prevNode.status,
+          toStatus: "HEALTHY",
+          reason: "heartbeat received",
+        },
+      });
+    }
+
     // Merge agent-reported labels with existing UI-set labels.
```

Non-atomic recovery event: duplicate events possible under concurrent heartbeats

prevNode is fetched outside any transaction, so two concurrent heartbeats from the same UNREACHABLE node can both read status !== "HEALTHY", both complete the vectorNode.update, and both call nodeStatusEvent.create. This produces duplicate recovery events in the timeline and event log.

While unlikely in normal operation (agents stagger heartbeats), it becomes more plausible on agent restart or reconnection bursts. Wrapping all three operations in a transaction prevents the race:

Suggested change

```suggestion
    const prevNode = await prisma.vectorNode.findUnique({
      where: { id: agent.nodeId },
      select: { status: true },
    });

    // Update node heartbeat and metadata
    const node = await prisma.$transaction(async (tx) => {
      const updated = await tx.vectorNode.update({
```
Then close the transaction after the conditional nodeStatusEvent.create, ensuring the read-check-write is atomic.

Rule Used: Security & Cryptography Review Rules

When revi... (source)


TerrifiedBug merged commit e466270 into main on Mar 11, 2026 (9 checks passed)
TerrifiedBug deleted the fleet-health-a4c branch on March 11, 2026 at 12:22
