Don't propagate gateway listeners with invalid TLS certificates #217
Merged
A single customer's expired or unissuable TLS certificate could freeze HTTPS for an entire edge. Because every HTTPS hostname shares one Envoy :443 listener, one bad certificate makes Envoy reject the whole listener update, taking down HTTPS for every tenant on that edge (issue #212).

Make certificate failures isolated and visible:

- Gate propagation on certificate health: a listener whose certificate is expired, not-yet-valid, missing, mismatched, or not yet issued is left out of the downstream gateway, so a bad certificate only affects its own hostname and every other hostname keeps serving.
- Report it clearly to consumers via standard listener conditions (Programmed / ResolvedRefs) with concise, non-technical messages explaining that HTTPS is paused for the hostname and why.
- Add a data-plane backstop in the extension server that drops an invalid TLS filter chain instead of letting it reject the whole listener.

Tests: unit coverage for the certificate-health gate and the prune, plus an end-to-end (chainsaw) test that verifies the bad listener is withheld, the status message is shown, the listener recovers once the certificate issues, and a real HTTPS request to the good hostname returns 200 while the bad hostname is not served.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
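The data-plane backstop in the last bullet can be pictured as a filter over the shared listener's filter chains. The following is a minimal sketch, not the extension server's code: the `FilterChain` type, its fields, and `pruneInvalidChains` are hypothetical names, and the `CertUsable` flag stands in for whatever validation the real backstop performs on the chain's TLS material.

```go
package main

import "fmt"

// FilterChain is a hypothetical stand-in for one TLS filter chain on the
// shared :443 listener. It is not the Envoy API type.
type FilterChain struct {
	ServerName string // hostname (SNI) the chain serves
	CertUsable bool   // assumed result of validating the chain's certificate
}

// pruneInvalidChains keeps the chains whose certificate is usable and returns
// the server names of the chains it dropped, so callers can log or count them.
func pruneInvalidChains(chains []FilterChain) (kept []FilterChain, dropped []string) {
	for _, c := range chains {
		if c.CertUsable {
			kept = append(kept, c)
		} else {
			dropped = append(dropped, c.ServerName)
		}
	}
	return kept, dropped
}

func main() {
	chains := []FilterChain{
		{ServerName: "good.example.com", CertUsable: true},
		{ServerName: "bad.example.com", CertUsable: false},
	}
	kept, dropped := pruneInvalidChains(chains)
	fmt.Println(len(kept), dropped) // prints: 1 [bad.example.com]
}
```

Returning the dropped names alongside the kept chains is a design choice of this sketch; it makes the prune easy to surface in logs and metrics.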
For log fields and map keys like "reason"/"message", the inline literal is clearer than a shared constant, and forcing a constant couples unrelated call sites together. Disable the linter rather than work around it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
The certificate gating this change introduces was invisible in metrics: operators could not see or alert on how many hostnames were dark. Surface it.

- Gateway controller: gauges for withheld listeners, certificate expiry time, and managed listeners, plus a gating counter. Labelled by gateway, listener, and hostname so an operator can find the exact affected hostname during an incident; series are cleared when a listener recovers, is removed, or the gateway is deleted, so a stale value never reports a phantom dark hostname.
- Extension server: gauges for the data-plane backstop's current state (chains dropped, listeners left untouched) so an active backstop is visible.
- Logs and traces: certificate expiry on the unhealthy-listener log, the affected listener names on the prune log and trace span, and a new log when a withheld listener recovers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
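The series-clearing behaviour described for the gateway controller gauges can be illustrated with a small stand-in. This is a sketch under assumptions: it uses a plain map instead of a real metrics library, and `seriesKey`, `withheldGauge`, and `sync` are hypothetical names chosen for illustration. Only the label set (gateway, listener, hostname) comes from the commit message.

```go
package main

import "fmt"

// seriesKey mirrors the label set named in the commit message.
type seriesKey struct{ Gateway, Listener, Hostname string }

// withheldGauge is a stdlib stand-in for a labelled gauge; a real controller
// would use its metrics library's gauge vector instead.
type withheldGauge struct{ series map[seriesKey]float64 }

func newWithheldGauge() *withheldGauge {
	return &withheldGauge{series: map[seriesKey]float64{}}
}

// sync replaces every series belonging to one gateway with the listeners that
// are withheld right now. Listeners that recovered or were removed lose their
// series, and passing nil clears the gateway entirely (gateway deleted).
func (g *withheldGauge) sync(gateway string, withheld []seriesKey) {
	for k := range g.series {
		if k.Gateway == gateway {
			delete(g.series, k) // deleting during range is safe in Go
		}
	}
	for _, k := range withheld {
		k.Gateway = gateway
		g.series[k] = 1
	}
}

func main() {
	g := newWithheldGauge()
	g.sync("edge-a", []seriesKey{{Listener: "https-bad", Hostname: "bad.example.com"}})
	fmt.Println(len(g.series)) // one dark hostname
	g.sync("edge-a", nil)      // the listener recovered
	fmt.Println(len(g.series)) // no stale series left behind
}
```

Rebuilding a gateway's series on every reconcile, rather than tracking individual transitions, keeps the gauge from ever reporting a hostname that is no longer withheld.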
… backstop

Make the certificate gating actionable: alert when a customer listener is withheld because its certificate is unusable, when a managed certificate is expiring soon (so it can be fixed before it starts gating), when the extension server is actively dropping broken certificates, and, critically, when a listener cannot be protected at all and the edge will reject its update. Includes promtool unit tests for all four rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
Each cert-health alert now links to a runbook (runbook_url) with what it means, the customer impact, how to diagnose it, and how to remediate, so an on-call responder has a clear path from alert to action.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
The certificate-health alerts depend on the operator's metrics being scraped, which was not wired up. Enable the controller ServiceMonitor (the existing, previously disabled one) and add a ServiceMonitor for the extension server's metrics endpoint, so the metrics reach Prometheus and the alerts can fire.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
mattdjenkinson approved these changes on Jun 24, 2026
Contributor
Author
Confirmed connectors came online correctly after this was deployed to staging.
Fixes #212
Problem
Every HTTPS hostname on a gateway shares a single Envoy :443 listener. When one customer's certificate becomes unusable (expired, or no longer issuable because the domain moved away from Datum), Envoy rejects the entire listener update, so HTTPS goes down for every tenant on that edge, not just the affected hostname. The failure was also silent: nothing told the customer their certificate was the problem, and nothing alerted operators.
What this changes
Isolate bad certificates and tell the customer
The withheld listener is reported through the standard Programmed/ResolvedRefs conditions with concise, non-technical messages (e.g. "The TLS certificate for example.com has expired, so HTTPS for this hostname is paused…"). It stays Accepted: the config is valid, the certificate just isn't usable yet.
Make it observable
Previously this gating was invisible to operators. This PR adds the full chain from signal to action:
Alerts (config/telemetry/alerts): a customer listener withheld for a bad certificate, a certificate expiring soon (act before it gates), the backstop actively dropping certificates, and (critical) a listener that can't be protected at all and will have its edge update rejected. They complement the infra EnvoyListenerUpdateRejected alert.
Runbook (docs/runbooks/gateway-tls-certificates.md): each alert links via runbook_url to a section with meaning, impact, diagnosis, and remediation.
How it's verified
test/e2e/gateway-invalid-certificate on a real two-cluster + Envoy Gateway setup: the bad listener is withheld while the good sibling is unaffected; the user-facing status message is shown; a real HTTPS request to the good hostname returns 200 while the bad hostname is not served; and after the certificate issues, the listener reappears and serves 200.
Notes
goconst is disabled in a separate commit: for log fields and map keys like "reason"/"message" the inline literal is clearer than a shared constant, and forcing one couples unrelated call sites.