Skip to content

status: distinguish API down from unreachable#170

Open
jarugupj wants to merge 2 commits into
mainfrom
hypeship/status-distinguish-api-down
Open

status: distinguish API down from unreachable#170
jarugupj wants to merge 2 commits into
mainfrom
hypeship/status-distinguish-api-down

Conversation

@jarugupj
Copy link
Copy Markdown
Contributor

@jarugupj jarugupj commented May 28, 2026

Summary

Splits the failure handling for kernel status so the message accurately reflects what happened, instead of collapsing both transport errors and non-2xx responses into one ambiguous "Could not reach Kernel API" line.

Three failure cases now:

  • Transport error on /status (network down, DNS, timeout) → Could not reach Kernel API. Check https://status.kernel.sh for updates.
  • Non-2xx from /status, /health also unhealthyKernel API is down. Check https://status.kernel.sh for updates.
  • Non-2xx from /status, /health returns 200Kernel API is responding but /status is unavailable. Check https://status.kernel.sh for updates.

The /health probe matters because /status has an upstream dependency on incident.io, so a 5xx from /status doesn't necessarily mean the API itself is down. /health is a dependency-free liveness check on the same server, so a 200 there confirms the API is up and only the status endpoint is broken.

No new helpers, no synthetic UI rendering — reserves the dotted status UI for real data from the API (same convention the dashboard's status indicator follows).

Test plan

  • Healthy run against prod still renders the status UI as before
  • Transport error (unreachable host) prints "Could not reach Kernel API…"
  • /status 5xx + /health 5xx prints "Kernel API is down…"
  • /status 5xx + /health 200 prints "Kernel API is responding but /status is unavailable…"

🤖 Generated with Claude Code


Note

Low Risk
CLI-only error messaging and an extra HTTP probe on failure paths; no auth, data, or API contract changes.

Overview
kernel status now prints different errors depending on whether the CLI cannot talk to the API at all, the API is down, or only /status is failing.

When /status returns a non-2xx, the command probes /health on the same base URL. A healthy /health yields Kernel API is responding but /status is unavailable; otherwise Kernel API is down. Transport failures on the initial /status request still use Could not reach Kernel API.

This separates incident.io-related /status outages from true API unavailability without changing the successful status UI or JSON output path.

Reviewed by Cursor Bugbot for commit ad44904. Bugbot is set up for automated code reviews on this repo. Configure here.

When /status returns a non-2xx response the API is reachable but
unhealthy, so report it as down rather than reusing the generic
'could not reach' message used for transport errors.
@jarugupj jarugupj force-pushed the hypeship/status-distinguish-api-down branch from 5c1365d to cfc2591 Compare May 29, 2026 14:10
…broken

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jarugupj jarugupj force-pushed the hypeship/status-distinguish-api-down branch from cfc2591 to ad44904 Compare May 29, 2026 15:54
@jarugupj jarugupj marked this pull request as ready for review May 29, 2026 15:57
@jarugupj jarugupj requested a review from masnwilliams May 29, 2026 16:01
@firetiger-agent
Copy link
Copy Markdown

Monitoring Plan: Improve kernel status error message differentiation

What this PR does: Gives the kernel status command a more informative error message when the API's status page is unreachable — distinguishing a /status endpoint outage from a full API outage.

Intended effect:

  • CLI output correctness: No telemetry baseline exists (CLI has no OTel instrumentation). Confirmed by running kernel status post-deploy under normal conditions — output should be identical to pre-deploy (status table displayed). The new messages only appear during failure scenarios.

Risks:

  • Extended timeout during full outage — when the API is completely unreachable, the CLI now makes two sequential HTTP calls (each up to 10s timeout), potentially doubling the wait time before showing an error. No signal to alert on; first user invocation during an outage surfaces this.
  • /health endpoint stability — if https://api.onkernel.com/health is not a stable always-200 endpoint, the fallback message could mislead users. Verify /health reliability before release; alert if any non-2xx is observed from this endpoint during normal operation.
  • API 5xx regression (unrelated path) — overall API error rate baseline is 0.08–0.79% 5xx; alert if sustained above 2% for 2+ hours post-deploy (pre-existing infra noise pattern, not expected to be PR-related).

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant