feat: fleet health visualization — status timeline, event log, uptime tracking#92
TerrifiedBug merged 8 commits into main
Records every node status transition with fromStatus, toStatus, reason, and timestamp. Adds statusEvents reverse relation on VectorNode.
- fleet-health: add status to select, createMany events before marking nodes UNREACHABLE (reason: heartbeat timeout)
- heartbeat: fetch prev status before update, fire-and-forget event insert on HEALTHY recovery (reason: heartbeat received)
- enroll: await event insert after node creation with fromStatus null and reason: enrolled
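The heartbeat-timeout step above can be sketched as a pure function that turns stale nodes into transition events before they are marked UNREACHABLE. This is a minimal sketch, not the PR's code: the `NodeRow` and `StatusEventInput` shapes, the `buildTimeoutEvents` name, the `DEGRADED` status, and the `timeoutMs` parameter are all assumptions made for illustration.

```typescript
// Assumed status set; only HEALTHY and UNREACHABLE appear in the PR text.
type NodeStatus = "HEALTHY" | "DEGRADED" | "UNREACHABLE";

// Hypothetical row shape, limited to the fields this sketch needs.
interface NodeRow {
  id: string;
  status: NodeStatus;
  lastHeartbeat: Date;
}

// Hypothetical event payload mirroring fromStatus, toStatus, reason, timestamp.
interface StatusEventInput {
  nodeId: string;
  fromStatus: NodeStatus | null;
  toStatus: NodeStatus;
  reason: string;
  timestamp: Date;
}

// Build one event per node whose last heartbeat is older than the timeout.
// Nodes that are already UNREACHABLE are skipped, so no repeat event is emitted.
function buildTimeoutEvents(
  nodes: NodeRow[],
  now: Date,
  timeoutMs: number,
): StatusEventInput[] {
  return nodes
    .filter(
      (n) =>
        n.status !== "UNREACHABLE" &&
        now.getTime() - n.lastHeartbeat.getTime() > timeoutMs,
    )
    .map(
      (n): StatusEventInput => ({
        nodeId: n.id,
        fromStatus: n.status,
        toStatus: "UNREACHABLE",
        reason: "heartbeat timeout",
        timestamp: now,
      }),
    );
}
```

Capturing `fromStatus` before the bulk update matters because the previous status is no longer readable once the nodes have been marked UNREACHABLE.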
Add `getStatusTimeline` and `getUptime` query endpoints to the fleet router, and extend `fleet.get` to include `currentStatusSince` (the timestamp of the latest NodeStatusEvent for the node).
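One way an uptime figure could be derived from status events is sketched below. It is an illustration under stated assumptions, not the router's actual logic: `computeUptime` and its types are hypothetical, an "incident" is assumed to mean a transition out of HEALTHY, and time before the first known status is assumed to count as not healthy.

```typescript
// Hypothetical minimal event shape for this sketch.
interface UptimeEvent {
  toStatus: string;
  timestamp: Date;
}

interface UptimeResult {
  uptimePercent: number; // rounded to 2 decimal places
  incidents: number;
  healthySeconds: number;
}

// events: transitions inside [since, now], in ascending timestamp order.
// priorStatus: toStatus of the last event before `since`, or null if none exists.
function computeUptime(
  events: UptimeEvent[],
  priorStatus: string | null,
  since: Date,
  now: Date,
): UptimeResult {
  const totalMs = now.getTime() - since.getTime();
  let status = priorStatus; // status in effect at the start of the window
  let cursor = since.getTime();
  let healthyMs = 0;
  let incidents = 0;

  for (const ev of events) {
    const t = ev.timestamp.getTime();
    if (status === "HEALTHY") {
      healthyMs += t - cursor; // credit the healthy stretch that just ended
      if (ev.toStatus !== "HEALTHY") incidents += 1; // assumed incident definition
    }
    status = ev.toStatus;
    cursor = t;
  }
  // Credit the open-ended stretch from the last event up to `now`.
  if (status === "HEALTHY") healthyMs += now.getTime() - cursor;

  const uptimePercent =
    totalMs > 0 ? Math.round((healthyMs / totalMs) * 10000) / 100 : 0;
  return {
    uptimePercent,
    incidents,
    healthySeconds: Math.round(healthyMs / 1000),
  };
}
```

The prior event is needed because the window usually opens in the middle of a status period; without it, the stretch before the first in-range event could not be attributed to any status.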
…ime cards)
- StatusTimeline: horizontal bar with colored segments per status, range selector, hover tooltips
- EventLog: reverse-chronological table of status transition events with colored dots
- UptimeCards: 24h/7d/30d KPI cards with color-coded uptime percentage and incident count
Reorganizes the single-scroll node detail page into 4 tabs:
- Overview: Node Details, Labels, Pipeline Metrics table
- Health: UptimeCards, StatusTimeline, EventLog (with shared range state)
- Metrics: NodeMetricsCharts
- Logs: NodeLogs
Also displays currentStatusSince ("for 3d 14h") alongside the status
badge in the Overview tab.
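A label such as "for 3d 14h" could be produced from `currentStatusSince` with a small helper like the one below. This is a sketch only: `formatStatusDuration` is a hypothetical name, and the formats used for durations shorter than a day are assumptions, since the PR text only shows the days-and-hours form.

```typescript
// Format the elapsed time between `since` and `now` as a compact label.
// Shows the two largest units: "3d 14h", "5h 12m", or "7m".
function formatStatusDuration(since: Date, now: Date): string {
  const totalMinutes = Math.max(
    0,
    Math.floor((now.getTime() - since.getTime()) / 60000),
  );
  const days = Math.floor(totalMinutes / 1440);
  const hours = Math.floor((totalMinutes % 1440) / 60);
  const minutes = totalMinutes % 60;

  if (days > 0) return `${days}d ${hours}h`;
  if (hours > 0) return `${hours}h ${minutes}m`;
  return `${minutes}m`;
}
```

Taking `now` as a parameter rather than reading the clock inside the helper keeps it pure, which makes it safe to call during render and easy to test.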
- Wrap event insert + status update in $transaction to prevent race condition
- Filter currentStatusSince by toStatus matching current node status
- Fix uptime precision to 2 decimal places matching UI display
- Await heartbeat status event insert instead of fire-and-forget
The react-hooks/purity lint rule disallows Date.now() during render. Use React Query's dataUpdatedAt as the "now" reference — it's a stable value that updates each time data is fetched, keeping timeline segments aligned with the fetched event data.
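The idea of passing "now" in from outside can be sketched as a pure segment builder. The names `buildSegments`, `TimelineEvent`, and `Segment` are hypothetical, and leaving the stretch before the first event unrendered is an assumption. In a component, the `now` argument would be the `dataUpdatedAt` value from the query result instead of a `Date.now()` call made during render.

```typescript
// Hypothetical event shape; timestamps are epoch milliseconds.
interface TimelineEvent {
  toStatus: string;
  timestamp: number;
}

interface Segment {
  status: string;
  start: number;
  end: number;
  widthPercent: number; // share of the visible range, 0 to 100
}

// Turn ascending events into bar segments covering [rangeStart, now].
// Each segment runs from its event to the next event, or to `now` for the last one.
function buildSegments(
  events: TimelineEvent[],
  rangeStart: number,
  now: number,
): Segment[] {
  const total = now - rangeStart;
  if (total <= 0) return [];

  return events.map((ev, i) => {
    const next = events[i + 1];
    const start = Math.max(ev.timestamp, rangeStart); // clamp to the visible range
    const end = next ? next.timestamp : now;
    return {
      status: ev.toStatus,
      start,
      end,
      widthPercent: ((end - start) / total) * 100,
    };
  });
}
```

Because the function reads no clock of its own, rendering twice with the same fetched data yields the same segments.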
Greptile Summary
This PR adds a fleet health visualization layer to VectorFlow.
Confidence Score: 4/5
Sequence Diagram
sequenceDiagram
participant Agent
participant Heartbeat as /api/agent/heartbeat
participant Enroll as /api/agent/enroll
participant DB as PostgreSQL
participant HealthCheck as checkNodeHealth()
participant Cleanup as cleanupOldMetrics()
participant UI as Fleet UI (Health Tab)
participant tRPC as fleet tRPC router
Enroll->>DB: vectorNode.create (status=HEALTHY)
Enroll->>DB: nodeStatusEvent.create (from=null, to=HEALTHY, reason=enrolled)
Agent->>Heartbeat: POST heartbeat
Heartbeat->>DB: vectorNode.findUnique (read prevStatus)
Heartbeat->>DB: vectorNode.update (status=HEALTHY, lastHeartbeat=now)
alt prevStatus != HEALTHY
Heartbeat->>DB: nodeStatusEvent.create (from=prevStatus, to=HEALTHY, reason=heartbeat received)
end
HealthCheck->>DB: vectorNode.findMany (stale, non-UNREACHABLE)
alt goingUnreachable.length > 0
HealthCheck->>DB: $transaction: nodeStatusEvent.createMany + vectorNode.updateMany
end
UI->>tRPC: fleet.getStatusTimeline(nodeId, range)
tRPC->>DB: nodeStatusEvent.findMany (nodeId, timestamp >= since)
tRPC-->>UI: events[]
UI->>tRPC: fleet.getUptime(nodeId, range)
tRPC->>DB: nodeStatusEvent.findMany (in range) + findFirst (prior event)
tRPC-->>UI: { uptimePercent, incidents, healthySeconds }
Cleanup->>DB: nodeStatusEvent.deleteMany (timestamp < metricsCutoff)
@@ -189,6 +195,18 @@ export async function POST(request: Request) {
    },
  });

  // Record a status transition event when the node recovers from a non-HEALTHY state
  if (prevNode && prevNode.status !== "HEALTHY") {
    await prisma.nodeStatusEvent.create({
      data: {
        nodeId: agent.nodeId,
        fromStatus: prevNode.status,
        toStatus: "HEALTHY",
        reason: "heartbeat received",
      },
    });
  }

  // Merge agent-reported labels with existing UI-set labels.
Non-atomic recovery event: duplicate events possible under concurrent heartbeats
prevNode is fetched outside any transaction, so two concurrent heartbeats from the same UNREACHABLE node can both read status !== "HEALTHY", both complete the vectorNode.update, and both call nodeStatusEvent.create. This produces duplicate recovery events in the timeline and event log.
While unlikely in normal operation (agents stagger heartbeats), it becomes more plausible on agent restart or reconnection bursts. Wrapping all three operations in a transaction prevents the race:
const prevNode = await prisma.vectorNode.findUnique({
  where: { id: agent.nodeId },
  select: { status: true },
});
// Update node heartbeat and metadata
const node = await prisma.$transaction(async (tx) => {
  const updated = await tx.vectorNode.update({
Then close the transaction after the conditional nodeStatusEvent.create, ensuring the read-check-write is atomic.
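A related way to keep the read-check-write atomic is to make the status check part of the write itself, so only one caller can win the transition. The sketch below illustrates that compare-and-set idea against an in-memory stand-in. `InMemoryNodeStore` and `handleHeartbeat` are hypothetical and are not part of the PR. With a real database, the guard would have to be a single conditional write, for example a Prisma `updateMany` whose `where` clause includes the status, followed by a check of the returned `count`. Whether a transaction alone serializes the earlier read depends on the isolation level in use.

```typescript
// Hypothetical event shape mirroring the fields named in the PR.
interface RecoveryEvent {
  nodeId: string;
  fromStatus: string | null;
  toStatus: string;
  reason: string;
}

// In-memory stand-in for the node table and the event table (illustration only).
class InMemoryNodeStore {
  private statuses = new Map<string, string>();
  readonly events: RecoveryEvent[] = [];

  setStatus(nodeId: string, status: string): void {
    this.statuses.set(nodeId, status);
  }

  // Flip the node to HEALTHY only if it is currently something else.
  // Returns the previous status on a real transition, or null if nothing changed.
  markHealthyIfNot(nodeId: string): string | null {
    const prev = this.statuses.get(nodeId);
    if (prev === undefined || prev === "HEALTHY") return null;
    this.statuses.set(nodeId, "HEALTHY");
    return prev;
  }
}

// Record a recovery event only for the caller that actually performed the transition.
function handleHeartbeat(store: InMemoryNodeStore, nodeId: string): boolean {
  const prev = store.markHealthyIfNot(nodeId);
  if (prev === null) return false; // already HEALTHY or unknown node: no event
  store.events.push({
    nodeId,
    fromStatus: prev,
    toStatus: "HEALTHY",
    reason: "heartbeat received",
  });
  return true;
}
```

With this shape, a second heartbeat arriving right after the first finds the node already HEALTHY and records nothing, so the timeline gets exactly one recovery event.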
Rule Used: Security & Cryptography Review Rules ([source](https://app.greptile.com/review/custom-context?memory=7cb20c56-ca6a-40aa-8660-7fa75e6e3db2))
Path: src/app/api/agent/heartbeat/route.ts
Line: 171-210
Summary
- fleet.getStatusTimeline (events in range), fleet.getUptime (uptime %, healthy seconds, incident count), extended fleet.get with currentStatusSince
Test plan
- … (fromStatus: null, toStatus: HEALTHY, reason: enrolled)
- checkNodeHealth() creates events when nodes go UNREACHABLE due to heartbeat timeout
- getStatusTimeline returns events in ascending order for the selected range
- getUptime returns correct uptime percentage and incident count
- currentStatusSince on fleet.get shows when current status began