Releases: datum-cloud/network-services-operator
v0.23.5: Reject Hostnameless Listeners
What's Changed
- fix: show newly created networking resources in the activity feed by @scotwells in #223
- test(e2e): reproduce #219 hostname-less listener collision by @ecv in #221
- fix: clear error for hostname-less tenant listeners by @ecv in #220
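As a rough illustration of the kind of check this release implies, the sketch below rejects any listener that omits a hostname and returns a readable error. The `Listener` type, `validateListeners`, and the error text are hypothetical stand-ins and not the operator's actual types or messages.

```go
package main

import (
	"errors"
	"fmt"
)

// Listener is a hypothetical, trimmed-down stand-in for a Gateway listener.
type Listener struct {
	Name     string
	Hostname string // empty means the listener would match any hostname
}

// errMissingHostname is an illustrative error, not the operator's real message.
var errMissingHostname = errors.New("listener must set a hostname")

// validateListeners returns an error naming the first listener that has no
// hostname, or nil when every listener sets one.
func validateListeners(listeners []Listener) error {
	for _, l := range listeners {
		if l.Hostname == "" {
			return fmt.Errorf("listener %q: %w", l.Name, errMissingHostname)
		}
	}
	return nil
}

func main() {
	err := validateListeners([]Listener{
		{Name: "https", Hostname: "app.example.com"},
		{Name: "catch-all"},
	})
	fmt.Println(err)
}
```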
Full Changelog: v0.23.4...v0.23.5
v0.23.4 — Isolate invalid TLS certificates
Fixes
- Invalid TLS certificates no longer take down a whole edge listener (#212, #217). Every HTTPS hostname on a gateway shares one Envoy :443 listener; previously one customer's unusable certificate (expired, withdrawn, mismatched) made Envoy reject the entire listener, dropping HTTPS for every tenant on that edge. A listener whose certificate is expired, not-yet-valid, missing, mismatched, or not yet issued is now withheld from the downstream gateway, so a bad certificate only affects its own hostname. The listener reports clear, non-technical Programmed/ResolvedRefs status, and self-heals automatically once the certificate issues. A data-plane backstop in the extension server drops only the offending filter chain if a bad certificate ever reaches the edge.
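The gating rule described above can be sketched as a small classification function. This is a minimal illustration under stated assumptions: `CertState`, `withholdReason`, and the reason strings are hypothetical, hostname matching is exact (wildcards are ignored), and the operator's real checks may differ.

```go
package main

import (
	"fmt"
	"time"
)

// CertState is a hypothetical summary of a listener's certificate.
type CertState struct {
	Present   bool     // false when the certificate is missing or not yet issued
	DNSNames  []string // hostnames the certificate covers (exact match only here)
	NotBefore time.Time
	NotAfter  time.Time
}

// withholdReason returns a non-empty reason when the listener should be
// withheld from the downstream gateway, or "" when the certificate is usable.
func withholdReason(c CertState, hostname string, now time.Time) string {
	switch {
	case !c.Present:
		return "certificate missing or not yet issued"
	case now.Before(c.NotBefore):
		return "certificate not yet valid"
	case now.After(c.NotAfter):
		return "certificate expired"
	}
	for _, name := range c.DNSNames {
		if name == hostname {
			return ""
		}
	}
	return "certificate does not match hostname"
}

// Sample data for the demonstration below.
var (
	exampleNow = time.Date(2025, 1, 15, 0, 0, 0, 0, time.UTC)
	goodCert   = CertState{
		Present:   true,
		DNSNames:  []string{"app.example.com"},
		NotBefore: exampleNow.AddDate(0, -1, 0),
		NotAfter:  exampleNow.AddDate(0, 2, 0),
	}
	expiredCert = CertState{
		Present:   true,
		DNSNames:  []string{"app.example.com"},
		NotBefore: exampleNow.AddDate(0, -4, 0),
		NotAfter:  exampleNow.AddDate(0, -1, 0),
	}
)

func main() {
	for _, c := range []CertState{goodCert, expiredCert, {}} {
		fmt.Printf("%q\n", withholdReason(c, "app.example.com", exampleNow))
	}
}
```

Evaluating each listener independently is what keeps one bad certificate from affecting the other hostnames on the shared listener.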
Observability
- Per-listener metrics (withheld listeners, certificate expiry time, gating counter) and extension-server backstop gauges, labelled by gateway/listener/hostname.
- Four alerts: customer listener withheld for a bad certificate, certificate expiring soon, the backstop actively dropping certificates, and — critical — a listener that can't be protected and will have its edge update rejected.
- Runbooks for each alert and ServiceMonitors so operator and extension-server metrics reach Prometheus.
Validated end-to-end on staging: a connector tunnel's :443 listener programs and serves cleanly (200 round-trip) with zero listener rejections.
Full changes: #217
v0.23.3
What's Changed
- Fix tunnels staying offline at the edge after a connector connects by @scotwells in #211
- fix(extension-server): unique connector domain to stop Envoy config NACK by @scotwells in #214
- docs(extension-server): document edge re-translation for connector liveness by @scotwells in #213
- chore: remove unknown field by @zachsmith1 in #133
- test(e2e): fix flaky CA-secret race in gateway-accepted by @scotwells in #215
Full Changelog: v0.23.2...v0.23.3
v0.23.2
v0.23.1
v0.23.0 — Connector liveness, offline UX & branded error pages
Highlights
- Connector liveness at the edge (#208): a connector's authoritative Ready + connectionDetails now reach the data-plane member cluster via a propagated networking.datumapis.com/upstream-status annotation (Karmada propagates metadata, not the status subresource). Online connectors are now correctly classified and promoted to live tunnels instead of being treated as offline. Generic, opt-in per resource type.
- Branded data-plane error pages (#205): edge-generated 5xx responses on the downstream/Connector data plane render a branded page via the extension server's local_reply_config, replacing the (inert) EnvoyPatchPolicy.
- Deterministic offline-connector response (#207): an offline connector's user path returns a deliberate tunnel-offline 503 instead of routing to an endpoint-less cluster (no_healthy_upstream).
- Offline connector backendRef in extension-server mode (#202): NSO keeps the connector backendRef so the extension server has a cluster to key on (no more bare 500).
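The annotation-based propagation in #208 can be pictured as serialising status into metadata on one side and reading it back on the other. The sketch below assumes a JSON payload; the annotation key comes from the notes above, while the `upstreamStatus` shape, `stampStatus`, and `isOnline` are hypothetical and the real schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation key named in the release notes.
const upstreamStatusAnnotation = "networking.datumapis.com/upstream-status"

// upstreamStatus is a hypothetical payload; the real schema may differ.
type upstreamStatus struct {
	Ready             bool   `json:"ready"`
	ConnectionDetails string `json:"connectionDetails,omitempty"`
}

// stampStatus copies status into annotations, which (unlike the status
// subresource) are propagated to member clusters.
func stampStatus(annotations map[string]string, s upstreamStatus) error {
	raw, err := json.Marshal(s)
	if err != nil {
		return err
	}
	annotations[upstreamStatusAnnotation] = string(raw)
	return nil
}

// isOnline reports whether the annotation describes a ready connector with
// connection details. A missing or malformed annotation counts as offline.
func isOnline(annotations map[string]string) bool {
	raw, ok := annotations[upstreamStatusAnnotation]
	if !ok {
		return false
	}
	var s upstreamStatus
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return false
	}
	return s.Ready && s.ConnectionDetails != ""
}

func main() {
	ann := map[string]string{}
	if err := stampStatus(ann, upstreamStatus{Ready: true, ConnectionDetails: "node-abc"}); err != nil {
		panic(err)
	}
	fmt.Println(isOnline(ann), isOnline(map[string]string{}))
}
```

Treating a missing or unparsable annotation as offline is a deliberate choice in this sketch, in keeping with the deliberate tunnel-offline response described in #207.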
Other
- chore: explicit nodeTaintsPolicy
- docs: branded data-plane error pages enhancement proposal
Operational note: the branded page also requires the infra-side errorPage enablement (already merged) and a data-plane re-translation (extension-server + envoy-gateway restart) per cluster to fully activate.
v0.22.1 — AI Edge provisioning & cluster-name regression fixes
This is a patch release that fixes a set of regressions surfaced by the v0.22.0 milo library upgrade — most importantly AI Edge hostname/TLS provisioning, which broke for every project — along with downstream watch handling for graduated Gateway API types and a couple of CI/test fixes.
What's fixed
- AI Edge cert/hostname provisioning restored: The v0.22.0 milo bump (v0.7.4 → v0.28.1) changed how project clusters are keyed internally, from slash-prefixed (/my-project) to bare names (my-project), but a leftover webhook workaround still prepended a slash. The result: the gateway webhook rejected every HTTPProxy create/update with "cluster /<project> not found", blocking all hostname and TLS certificate provisioning. The workaround is removed; existing resources self-heal on re-reconcile. (#196)
- Legacy slash-encoded cluster names tolerated everywhere: Downstream resources created before #196 still carry the old encoding in their meta.datumapis.com/upstream-cluster-name label, which decoded back to /my-project and no longer matched the engaged cluster key. This flooded every replica's logs with "cluster /<project> not found" and silently broke downstream→upstream event propagation (a cert going Ready or a listener becoming Programmed no longer re-reconciled the upstream Gateway). All decode sites (the gateway/certificate/HTTPRoute watches, the replicator enqueue, the HTTPProxy and TrafficProtectionPolicy controllers, and the iroh-dns decoder) now route through a shared UpstreamClusterNameFromLabel helper that strips the legacy slash. No data migration: resources re-stamp to the new format on their next reconcile. (#199, #200)
- Replicator tolerates graduated Gateway API versions: When a downstream type graduates and the older version stops being served (e.g. BackendTLSPolicy v1alpha3 → v1 after the Gateway API 1.5 CRD bump), registering a watch on the old version blocked WaitForCacheSync and stalled the gateway controller entirely, while CRUD against the old version returned 404. The replicator now resolves the served GVK once at setup, falls back to the configured storage version when needed, and uses it consistently across the watch, create, status-fetch, and delete paths. (#197)
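The slash-stripping step from #199/#200 is small enough to show directly. The sketch below covers only the final normalisation of an already-decoded name; it does not reproduce the label encoding itself, which these notes do not describe, and `normalizeClusterName` is a hypothetical name rather than the signature of UpstreamClusterNameFromLabel.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeClusterName strips the legacy leading slash so that names written
// before and after the key-format change resolve to the same cluster key.
func normalizeClusterName(decoded string) string {
	return strings.TrimPrefix(decoded, "/")
}

func main() {
	for _, name := range []string{"/my-project", "my-project"} {
		fmt.Println(normalizeClusterName(name))
	}
}
```

Routing every decode site through one function like this is what lets old and new resources coexist without a data migration.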
CI & testing
- Kustomize bundle published after the image build: publish-kustomize-bundles no longer runs in parallel with publish-container-image; it now depends on the image job, so the bundle can never pin its references to an image that hasn't been built and pushed yet. (#201)
- e2e CRDs and a dev bootstrap Taskfile: Installs the missing NSO CRDs on the downstream cluster so the e2e suite passes, and adds a dev Taskfile for spinning up the two-cluster environment (task dev:bootstrap) and running the suite (task dev:test, optionally scoped to a folder). (#198)
Note
No schema or migration changes. This release is fixes only — no CRD changes and no data migration. Resources created before v0.22.0 automatically re-stamp their upstream-cluster-name label to the new format as they reconcile.
Full Changelog: v0.22.0...v0.22.1
v0.22.0 — Envoy Gateway extension server & location discovery
This release ships the Envoy Gateway extension server — replacing per-gateway EnvoyPatchPolicy objects with a single gRPC hook that applies WAF and Connector configuration directly during xDS translation — along with new location discovery APIs that let consumers see which locations are available to their project.
What's new
- Envoy Gateway extension server: Replaces N per-gateway EnvoyPatchPolicy objects with a single PostTranslateModify gRPC server co-located with the data plane; eliminates the TLS certificate race at scale (>140 gateways) and makes WAF/Connector injection robust against upstream connectivity loss. Deployed as an HA Deployment with mTLS, least-privilege RBAC, and a dedicated EG control plane. (#192, #187)
- Envoy Gateway 1.8 & Gateway API 1.5: Upgrades the gateway stack from EG 1.5/Gateway API 1.3; includes multi-tenancy hardening that rejects operator-owned Gateway.spec.tls and constrains Backend/SecurityPolicy egress targets to block SSRF against internal and cloud-metadata addresses. (#189)
- Location discovery for consumers: New LocationBinding resource lets consumers query which locations are available to their project, with city/region metadata and a healthy/deployed gate enforced by the service catalog. (#171)
- LocationBinding IAM coverage: New networking.datumapis.com-locationbinding-viewer role grants list/get/watch on LocationBindings; the location-viewer role is now included in networking-viewer so org owners are no longer Forbidden when listing locations. (#172, #162)
- Quota display metadata: All 13 networking quota registrations now carry services.miloapis.com/owner, kubernetes.io/display-name, and kubernetes.io/description for richer quota UI display. (#176)
- Connector cert-readiness gate: The connector EnvoyPatchPolicy is now generated per-listener only once that listener is Programmed, preventing permanently-failing policies when a hostname's TLS certificate is pending. (#185)
- Faster tunnel reconnect: Iroh DNS TXT TTL reduced from 30 s to 5 s; tunnels now reconverge in ~1 second after a connector restart instead of 30+ seconds. (#179)
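To make the SSRF hardening in #189 more concrete, the sketch below shows one way an egress target could be screened for internal and cloud-metadata addresses using Go's standard net/netip package. It is an illustration only: `isBlockedEgressTarget` is a hypothetical function, it handles literal IPs and fails closed on anything else, and the operator's actual validation rules are not described in these notes.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isBlockedEgressTarget reports whether a literal IP address points at an
// internal range. Link-local covers 169.254.169.254, the address commonly
// used by cloud metadata services.
func isBlockedEgressTarget(target string) bool {
	addr, err := netip.ParseAddr(target)
	if err != nil {
		// Not a literal IP: this sketch fails closed rather than guessing.
		return true
	}
	addr = addr.Unmap() // treat IPv4-mapped IPv6 like the underlying IPv4 address
	return addr.IsLoopback() ||
		addr.IsPrivate() ||
		addr.IsLinkLocalUnicast() ||
		addr.IsUnspecified()
}

func main() {
	for _, t := range []string{"169.254.169.254", "10.0.0.5", "::ffff:127.0.0.1", "203.0.113.10"} {
		fmt.Printf("%-18s blocked=%v\n", t, isBlockedEgressTarget(t))
	}
}
```

A real check would also need to account for hostnames that resolve to internal addresses, which this sketch sidesteps by rejecting non-IP input.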
Note
Schema changes: LocationBinding is a new CRD added in this release. The HTTPProxy, Location, and TrafficProtectionPolicy CRDs have minor spec-comment and validation updates from the Gateway API 1.5 upgrade — no field removals or migration required. Existing resources keep working without any conversion.
v0.21.9
Bug Fixes
- WAF and OIDC now work together on the same gateway. Previously, enabling WAF (via TrafficProtectionPolicy) would silently disable OIDC authentication on the same gateway. Unauthenticated requests would pass through instead of being redirected to the login provider. This is now fixed — both policies can be active at the same time without interfering with each other. (#150)
Full Changelog: v0.21.8...v0.21.9
v0.21.8
What's Changed
- fix: remove ActivityPolicies for uninstalled route CRDs by @scotwells in #148
- Publish multi-arch (amd64 + arm64) container images by @scotwells in #135
- chore: bump datum-cloud/actions to v1.14.0 by @scotwells in #157
- chore: embed image tag in kustomize bundle at publish time by @scotwells in #158
- fix: respect user Host header override on HTTPProxy backends by @mattdjenkinson in #159
Full Changelog: v0.21.7...v0.21.8