Don't propagate gateway listeners with invalid TLS certificates by scotwells · Pull Request #217 · datum-cloud/network-services-operator

scotwells · 2026-06-24T01:07:05Z

Fixes #212

Problem

Every HTTPS hostname on a gateway shares a single Envoy :443 listener. When one customer's certificate becomes unusable — expired, or no longer issuable because the domain moved away from Datum — Envoy rejects the entire listener update, so HTTPS goes down for every tenant on that edge, not just the affected hostname. The failure was also silent: nothing told the customer their certificate was the problem, and nothing alerted operators.

What this changes

Isolate bad certificates and tell the customer

A listener whose certificate is expired, not-yet-valid, missing, mismatched, or not yet issued is withheld from the downstream gateway. A bad certificate now only affects its own hostname; every other hostname keeps serving.
The listener reports standard Programmed / ResolvedRefs conditions with concise, non-technical messages (e.g. "The TLS certificate for example.com has expired, so HTTPS for this hostname is paused…"). It stays Accepted — the config is valid, the certificate just isn't usable yet.
Data-plane backstop: if a bad certificate ever reaches the edge anyway, the extension server drops just that one part of the listener instead of letting it reject the whole listener.
Self-healing: once the certificate can be issued, the listener is propagated and serves automatically.

Make it observable

Previously this gating was invisible to operators. This PR adds the full chain from signal to action:

Metrics: per-listener gauges for withheld listeners, certificate expiry time, and managed listeners, a gating counter, and extension-server gauges for the backstop's live state. Labelled by gateway/listener/hostname so an operator can query the exact dark hostname during an incident; series are cleared on recovery, listener removal, and gateway deletion so a stale value never reports a phantom dark hostname.
Alerts (config/telemetry/alerts): a customer listener withheld for a bad certificate, a certificate expiring soon (act before it gates), the backstop actively dropping certificates, and — critical — a listener that can't be protected at all and will have its edge update rejected. They complement the infra EnvoyListenerUpdateRejected alert.
Runbooks (docs/runbooks/gateway-tls-certificates.md): each alert links via runbook_url to a section with meaning, impact, diagnosis, and remediation.
Scrape wiring: ServiceMonitors so the operator and extension-server metrics actually reach Prometheus (otherwise the alerts could never fire).
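
The stale-series concern in the metrics item can be illustrated with a small stdlib-only sketch. It is an assumption-laden model rather than the PR's implementation: `withheldGauge`, `seriesKey`, and `Sync` are hypothetical names, and a plain map stands in for a labelled Prometheus gauge (where the equivalent step would be deleting label values from a gauge vector).

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey mirrors the gateway/listener/hostname labels described above.
type seriesKey struct{ Gateway, Listener, Hostname string }

// withheldGauge is a hypothetical in-memory stand-in for a labelled gauge.
type withheldGauge struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newWithheldGauge() *withheldGauge {
	return &withheldGauge{series: make(map[seriesKey]float64)}
}

// Sync replaces every series for one gateway with the currently withheld
// listeners. Anything not in the new set is deleted, so a recovered or
// removed listener never leaves a stale "dark hostname" behind. Passing an
// empty slice models gateway deletion.
func (g *withheldGauge) Sync(gateway string, withheld []seriesKey) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for k := range g.series {
		if k.Gateway == gateway {
			delete(g.series, k)
		}
	}
	for _, k := range withheld {
		k.Gateway = gateway
		g.series[k] = 1
	}
}

// Count returns the number of live series across all gateways.
func (g *withheldGauge) Count() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

func main() {
	g := newWithheldGauge()
	g.Sync("edge-gw", []seriesKey{{Listener: "https-bad", Hostname: "bad.example.com"}})
	fmt.Println(g.Count()) // while the hostname is withheld
	g.Sync("edge-gw", nil)
	fmt.Println(g.Count()) // after recovery, removal, or gateway deletion
}
```

Rebuilding the per-gateway set on each reconcile, instead of toggling individual series, keeps the three clearing cases (recovery, listener removal, gateway deletion) on a single code path.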

How it's verified

Unit tests for the certificate-health gate and the data-plane prune.
End-to-end (chainsaw) test test/e2e/gateway-invalid-certificate on a real two-cluster + Envoy Gateway setup: the bad listener is withheld while the good sibling is unaffected; the user-facing status message is shown; a real HTTPS request to the good hostname returns 200 while the bad hostname is not served; and after the certificate issues, the listener reappears and serves 200.
Negative control: with the fix reverted, the e2e test fails at the omission assertion — confirming it actually guards the behavior.
promtool unit tests for all four alert rules.
CI green (unit, e2e, lint, kustomize).

Notes

goconst is disabled in a separate commit: for log fields and map keys like "reason"/"message" the inline literal is clearer than a shared constant, and forcing one couples unrelated call sites.
Runbooks were added for the certificate alerts this PR introduces; the pre-existing SLO / EnvoyPatchPolicy alerts in the same file are left as a separate follow-up.

A single customer's expired or unissuable TLS certificate could freeze HTTPS for an entire edge. Because every HTTPS hostname shares one Envoy :443 listener, one bad certificate makes Envoy reject the whole listener update, taking down HTTPS for every tenant on that edge (issue #212).

Make certificate failures isolated and visible:

- Gate propagation on certificate health: a listener whose certificate is expired, not-yet-valid, missing, mismatched, or not yet issued is left out of the downstream gateway, so a bad certificate only affects its own hostname and every other hostname keeps serving.
- Report it clearly to consumers via standard listener conditions (Programmed / ResolvedRefs) with concise, non-technical messages explaining that HTTPS is paused for the hostname and why.
- Add a data-plane backstop in the extension server that drops an invalid TLS filter chain instead of letting it reject the whole listener.

Tests: unit coverage for the certificate-health gate and the prune, plus an end-to-end (chainsaw) test that verifies the bad listener is withheld, the status message is shown, the listener recovers once the certificate issues, and a real HTTPS request to the good hostname returns 200 while the bad hostname is not served.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
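
As a rough illustration of the status reporting described in this commit, the sketch below builds the three listener conditions for a withheld hostname. Everything here is assumed for illustration: the `condition` type is a trimmed-down stand-in for a Kubernetes status condition, `withheldConditions` is a hypothetical helper, and the reason strings and message wording are guesses rather than the values the operator actually emits.

```go
package main

import "fmt"

// condition is a hypothetical, trimmed-down version of a status condition.
type condition struct {
	Type, Status, Reason, Message string
}

// withheldConditions builds the conditions for a listener whose certificate is
// unusable. The listener stays Accepted because its configuration is valid;
// only ResolvedRefs and Programmed turn False. The problem argument is a short
// phrase such as "has expired". Reasons and wording are illustrative assumptions.
func withheldConditions(hostname, problem string) []condition {
	msg := fmt.Sprintf("The TLS certificate for %s %s, so HTTPS for this hostname is paused.", hostname, problem)
	return []condition{
		{Type: "Accepted", Status: "True", Reason: "Accepted", Message: "The listener configuration is valid."},
		{Type: "ResolvedRefs", Status: "False", Reason: "InvalidCertificateRef", Message: msg},
		{Type: "Programmed", Status: "False", Reason: "Invalid", Message: msg},
	}
}

func main() {
	for _, c := range withheldConditions("example.com", "has expired") {
		fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```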

For log fields and map keys like "reason"/"message", the inline literal is clearer than a shared constant, and forcing a constant couples unrelated call sites together. Disable the linter rather than work around it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8

The certificate gating this change introduces was invisible in metrics: operators could not see or alert on how many hostnames were dark. Surface it.

- Gateway controller: gauges for withheld listeners, certificate expiry time, and managed listeners, plus a gating counter. Labelled by gateway, listener, and hostname so an operator can find the exact affected hostname during an incident; series are cleared when a listener recovers, is removed, or the gateway is deleted, so a stale value never reports a phantom dark hostname.
- Extension server: gauges for the data-plane backstop's current state (chains dropped, listeners left untouched) so an active backstop is visible.
- Logs and traces: certificate expiry on the unhealthy-listener log, the affected listener names on the prune log and trace span, and a new log when a withheld listener recovers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8

… backstop

Make the certificate gating actionable: alert when a customer listener is withheld because its certificate is unusable, when a managed certificate is expiring soon (so it can be fixed before it starts gating), when the extension server is actively dropping broken certificates, and, critically, when a listener cannot be protected at all and the edge will reject its update. Includes promtool unit tests for all four rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8

Each cert-health alert now links to a runbook (runbook_url) with what it means, the customer impact, how to diagnose it, and how to remediate, so an on-call responder has a clear path from alert to action.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8

The certificate-health alerts depend on the operator's metrics being scraped, which was not wired up. Enable the controller ServiceMonitor (the existing, previously disabled one) and add a ServiceMonitor for the extension server's metrics endpoint, so the metrics reach Prometheus and the alerts can fire.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8

scotwells · 2026-06-24T14:41:37Z

Confirmed connectors came online correctly after this was deployed to staging.

scotwells force-pushed the fix/issue-212-invalid-cert-listener-isolation branch from 509e896 to 15367eb June 24, 2026 01:24

scotwells and others added 4 commits June 23, 2026 20:53

scotwells marked this pull request as ready for review June 24, 2026 02:21

mattdjenkinson approved these changes Jun 24, 2026

scotwells merged commit b1eded2 into main Jun 24, 2026
11 checks passed

scotwells deleted the fix/issue-212-invalid-cert-listener-isolation branch June 24, 2026 12:36

scotwells mentioned this pull request Jun 24, 2026

feat(telemetry): alert on gateway controller reconcile-error ratio #218

Open

scotwells merged 6 commits into main from fix/issue-212-invalid-cert-listener-isolation

2 participants
