Don't propagate gateway listeners with invalid TLS certificates #217

Merged: scotwells merged 6 commits into main from fix/issue-212-invalid-cert-listener-isolation on Jun 24, 2026
@scotwells scotwells commented Jun 24, 2026

Fixes #212

Problem

Every HTTPS hostname on a gateway shares a single Envoy :443 listener. When one customer's certificate becomes unusable — expired, or no longer issuable because the domain moved away from Datum — Envoy rejects the entire listener update, so HTTPS goes down for every tenant on that edge, not just the affected hostname. The failure was also silent: nothing told the customer their certificate was the problem, and nothing alerted operators.

What this changes

Isolate bad certificates and tell the customer

  • A listener whose certificate is expired, not-yet-valid, missing, mismatched, or not yet issued is withheld from the downstream gateway. A bad certificate now only affects its own hostname; every other hostname keeps serving.
  • The listener reports standard Programmed / ResolvedRefs conditions with concise, non-technical messages (e.g. "The TLS certificate for example.com has expired, so HTTPS for this hostname is paused…"). It stays Accepted — the config is valid, the certificate just isn't usable yet.
  • Data-plane backstop: if a bad certificate ever reaches the edge anyway, the extension server drops just that one part of the listener instead of letting it reject the whole listener.
  • Self-healing: once the certificate can be issued, the listener is propagated and serves automatically.

Make it observable

Previously this gating was invisible to operators. This PR adds the full chain from signal to action:

  • Metrics: per-listener gauges for withheld listeners, certificate expiry time, and managed listeners, a gating counter, and extension-server gauges for the backstop's live state. Labelled by gateway/listener/hostname so an operator can query the exact dark hostname during an incident; series are cleared on recovery, listener removal, and gateway deletion so a stale value never reports a phantom dark hostname.
  • Alerts (config/telemetry/alerts): a customer listener withheld for a bad certificate, a certificate expiring soon (act before it gates), the backstop actively dropping certificates, and — critical — a listener that can't be protected at all and will have its edge update rejected. They complement the infra EnvoyListenerUpdateRejected alert.
  • Runbooks (docs/runbooks/gateway-tls-certificates.md): each alert links via runbook_url to a section with meaning, impact, diagnosis, and remediation.
  • Scrape wiring: ServiceMonitors so the operator and extension-server metrics actually reach Prometheus (otherwise the alerts could never fire).
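
The series-clearing behavior described for the metrics can be illustrated with a small stdlib-only model. This is a sketch under assumptions, not the PR's code: `withheldGauge` and its methods are hypothetical, and the map stands in for a labelled Prometheus gauge. With the Prometheus Go client, the corresponding operations would be deleting series via calls such as `GaugeVec.DeleteLabelValues` or `DeletePartialMatch`.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey mirrors the gateway/listener/hostname labels named in the PR.
type seriesKey struct {
	Gateway, Listener, Hostname string
}

// withheldGauge is a hypothetical in-memory model of a labelled gauge.
type withheldGauge struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newWithheldGauge() *withheldGauge {
	return &withheldGauge{series: make(map[seriesKey]float64)}
}

// Observe sets the series to 1 while the listener is withheld and deletes
// it on recovery, so no stale value is left behind reporting a dark hostname.
func (g *withheldGauge) Observe(k seriesKey, withheld bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if withheld {
		g.series[k] = 1
		return
	}
	delete(g.series, k)
}

// DeleteGateway drops every series belonging to a deleted gateway.
func (g *withheldGauge) DeleteGateway(gateway string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for k := range g.series {
		if k.Gateway == gateway {
			delete(g.series, k)
		}
	}
}

// Len reports how many series are currently exported.
func (g *withheldGauge) Len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

func main() {
	g := newWithheldGauge()
	k := seriesKey{Gateway: "gw", Listener: "https-bad", Hostname: "bad.example.com"}
	g.Observe(k, true)
	fmt.Println("series while withheld:", g.Len())
	g.Observe(k, false)
	fmt.Println("series after recovery:", g.Len())
}
```

Deleting the series rather than setting it to 0 keeps per-hostname cardinality bounded to the listeners that are withheld right now.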

How it's verified

  • Unit tests for the certificate-health gate and the data-plane prune.
  • End-to-end (chainsaw) test test/e2e/gateway-invalid-certificate on a real two-cluster + Envoy Gateway setup: the bad listener is withheld while the good sibling is unaffected; the user-facing status message is shown; a real HTTPS request to the good hostname returns 200 while the bad hostname is not served; and after the certificate issues, the listener reappears and serves 200.
  • Negative control: with the fix reverted, the e2e test fails at the omission assertion — confirming it actually guards the behavior.
  • promtool unit tests for all four alert rules.
  • CI green (unit, e2e, lint, kustomize).

Notes

  • goconst is disabled in a separate commit: for log fields and map keys like "reason"/"message" the inline literal is clearer than a shared constant, and forcing one couples unrelated call sites.
  • Runbooks were added for the certificate alerts this PR introduces; the pre-existing SLO / EnvoyPatchPolicy alerts in the same file are left as a separate follow-up.

A single customer's expired or unissuable TLS certificate could freeze
HTTPS for an entire edge. Because every HTTPS hostname shares one Envoy
:443 listener, one bad certificate makes Envoy reject the whole listener
update, taking down HTTPS for every tenant on that edge (issue #212).

Make certificate failures isolated and visible:

- Gate propagation on certificate health: a listener whose certificate is
  expired, not-yet-valid, missing, mismatched, or not yet issued is left
  out of the downstream gateway, so a bad certificate only affects its own
  hostname and every other hostname keeps serving.
- Report it clearly to consumers via standard listener conditions
  (Programmed / ResolvedRefs) with concise, non-technical messages
  explaining that HTTPS is paused for the hostname and why.
- Add a data-plane backstop in the extension server that drops an invalid
  TLS filter chain instead of letting it reject the whole listener.
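
A minimal sketch of the backstop idea, assuming heavily simplified types: `FilterChain`, `Listener`, and `prune` are hypothetical stand-ins rather than the Envoy API or the extension server's code, and the `CertValid` flag replaces whatever certificate check the real implementation performs. The point is only that one invalid chain is removed while its siblings on the same listener are kept.

```go
package main

import "fmt"

// FilterChain is a simplified stand-in for a per-hostname TLS filter chain.
type FilterChain struct {
	ServerName string
	CertValid  bool // assumed to be computed elsewhere from the chain's certificate
}

// Listener is a simplified stand-in for the shared :443 listener.
type Listener struct {
	Name   string
	Chains []FilterChain
}

// prune returns a copy of the listener without invalid chains, plus the
// server names that were dropped so they can be logged or counted.
func prune(l Listener) (Listener, []string) {
	kept := make([]FilterChain, 0, len(l.Chains))
	var dropped []string
	for _, c := range l.Chains {
		if c.CertValid {
			kept = append(kept, c)
			continue
		}
		dropped = append(dropped, c.ServerName)
	}
	return Listener{Name: l.Name, Chains: kept}, dropped
}

func main() {
	in := Listener{Name: "https-443", Chains: []FilterChain{
		{ServerName: "good.example.com", CertValid: true},
		{ServerName: "bad.example.com", CertValid: false},
	}}
	out, dropped := prune(in)
	fmt.Println("kept chains:", len(out.Chains), "dropped:", dropped)
}
```

Returning a copy instead of mutating the input keeps the original listener available for comparison, which is convenient when logging exactly what the backstop removed.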

Tests: unit coverage for the certificate-health gate and the prune, plus
an end-to-end (chainsaw) test that verifies the bad listener is withheld,
the status message is shown, the listener recovers once the certificate
issues, and a real HTTPS request to the good hostname returns 200 while
the bad hostname is not served.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
@scotwells scotwells force-pushed the fix/issue-212-invalid-cert-listener-isolation branch from 509e896 to 15367eb Compare June 24, 2026 01:24
scotwells and others added 4 commits June 23, 2026 20:53
For log fields and map keys like "reason"/"message", the inline literal is
clearer than a shared constant, and forcing a constant couples unrelated call
sites together. Disable the linter rather than work around it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
The certificate gating this change introduces was invisible in metrics —
operators could not see or alert on how many hostnames were dark. Surface it.

- Gateway controller: gauges for withheld listeners, certificate expiry time,
  and managed listeners, plus a gating counter. Labelled by gateway, listener,
  and hostname so an operator can find the exact affected hostname during an
  incident; series are cleared when a listener recovers, is removed, or the
  gateway is deleted, so a stale value never reports a phantom dark hostname.
- Extension server: gauges for the data-plane backstop's current state (chains
  dropped, listeners left untouched) so an active backstop is visible.
- Logs and traces: certificate expiry on the unhealthy-listener log, the
  affected listener names on the prune log and trace span, and a new log when
  a withheld listener recovers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
… backstop

Make the certificate gating actionable: alert when a customer listener is
withheld because its certificate is unusable, when a managed certificate is
expiring soon (so it can be fixed before it starts gating), when the extension
server is actively dropping broken certificates, and — critically — when a
listener cannot be protected at all and the edge will reject its update.

Includes promtool unit tests for all four rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
Each cert-health alert now links to a runbook (runbook_url) with what it means,
the customer impact, how to diagnose it, and how to remediate, so an on-call
responder has a clear path from alert to action.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
@scotwells scotwells marked this pull request as ready for review June 24, 2026 02:21
The certificate-health alerts depend on the operator's metrics being scraped,
which was not wired up. Enable the controller ServiceMonitor (the existing,
previously disabled one) and add a ServiceMonitor for the extension server's
metrics endpoint, so the metrics reach Prometheus and the alerts can fire.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LsQQoiXVgC1VaiWye4eEz8
@scotwells scotwells merged commit b1eded2 into main Jun 24, 2026
11 checks passed
@scotwells scotwells deleted the fix/issue-212-invalid-cert-listener-isolation branch June 24, 2026 12:36
@scotwells


Confirmed connectors came online correctly after this was deployed to staging.

Development

Successfully merging this pull request may close these issues.

Expired/abandoned customer certs can freeze an entire edge — health-gate cert propagation + isolate per-cert failures