Skip to content

feat(telemetry): alert on gateway controller reconcile-error ratio#218

Open
scotwells wants to merge 1 commit into
mainfrom
alert/gateway-reconcile-errors
Open

feat(telemetry): alert on gateway controller reconcile-error ratio#218
scotwells wants to merge 1 commit into
mainfrom
alert/gateway-reconcile-errors

Conversation

@scotwells

@scotwells scotwells commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Problem

When the gateway controller can't write a gateway's downstream config, customer changes silently stop reaching the edge — stale, often broken, configuration keeps serving and operators get no signal that anything is wrong. This happened in production right after v0.23.4: the controller ran at an ~83% reconcile failure rate for hours (≈1963 errors vs ≈395 successes) with nothing firing. Cert-listener withholding (#217) can leave a downstream Gateway with zero listeners, which the Gateway-API CRD rejects (spec.listeners: Required value), so the controller retries in a hot loop and the bad config stays programmed at the edge.

The control plane had no alert on reconcile health at all. The failure only surfaced indirectly, days later, through an edge-level Envoy listener-rejection alert — far from the actual cause.

Issue: #212 · PR: #217

What this adds

Two alerts on the gateway controller's reconcile error ratio, so a controller that can't apply configuration pages operators directly instead of failing silently:

  • GatewayControllerReconcileErrorRatioHigh (warning) — error ratio above 20% for 15m
  • GatewayControllerReconcileErrorRatioCritical (critical) — error ratio above 50% for 10m

Each alert links to a new runbook (docs/runbooks/gateway-controller-health.md) covering impact, diagnosis, and remediation. promtool unit tests cover firing at each tier, not firing below threshold, and not firing before the for: window elapses.

These would have fired the moment v0.23.4 shipped, instead of surfacing days later from an edge symptom.

Alert expressions

# Warning (>20% for 15m)
(
  sum without(result) (rate(controller_runtime_reconcile_total{controller="gateway",result="error"}[10m]))
  /
  sum without(result) (rate(controller_runtime_reconcile_total{controller="gateway"}[10m]))
) > 0.2

# Critical (>50% for 10m)
(
  sum without(result) (rate(controller_runtime_reconcile_total{controller="gateway",result="error"}[10m]))
  /
  sum without(result) (rate(controller_runtime_reconcile_total{controller="gateway"}[10m]))
) > 0.5

Adds two new PrometheusRule alerts and a matching runbook for the case
where the gateway controller is stuck retrying rejected API server writes.

Prod context: since v0.23.4 shipped PR #217 (cert-listener withholding),
controller_runtime_reconcile_total{controller="gateway",result="error"}
has been ~1963 vs result="success" ~395 — an 83% failure rate with no
alert firing. The regression: witholding all listeners on a gateway
produces a downstream Gateway with zero listeners, which the Gateway-API
CRD rejects as "Required value". The controller hot-loops silently.

Alerts added to config/telemetry/alerts/gateways.yaml:
- GatewayControllerReconcileErrorRatioHigh (warning, >20% for 15m)
- GatewayControllerReconcileErrorRatioCritical (critical, >50% for 10m)

Expression uses sum without(result) to aggregate across result label
dimensions so the ratio is computed correctly and the alert carries only
the controller label (not result="error").

Also adds promtool unit tests (test/prometheus-rules/gateways/) and a
runbook (docs/runbooks/gateway-controller-health.md) covering meaning,
impact, diagnosis, and remediation for both tiers.

Closes #212 (alerting gap identified in that issue).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant