Skip to content

Releases: datum-cloud/network-services-operator

v0.23.5: Reject Hostnameless Listeners

26 Jun 19:32
@ecv ecv
40ca7d3

Choose a tag to compare

What's Changed

  • fix: show newly created networking resources in the activity feed by @scotwells in #223
  • test(e2e): reproduce #219 hostname-less listener collision by @ecv in #221
  • fix: clear error for hostname-less tenant listeners by @ecv in #220

New Contributors

  • @ecv made their first contribution in #221

Full Changelog: v0.23.4...v0.23.5

v0.23.4 — Isolate invalid TLS certificates

24 Jun 14:35
b1eded2

Choose a tag to compare

Fixes

  • Invalid TLS certificates no longer take down a whole edge listener (#212, #217). Every HTTPS hostname on a gateway shares one Envoy :443 listener; previously one customer's unusable certificate (expired, withdrawn, mismatched) made Envoy reject the entire listener, dropping HTTPS for every tenant on that edge. A listener whose certificate is expired, not-yet-valid, missing, mismatched, or not yet issued is now withheld from the downstream gateway, so a bad certificate only affects its own hostname. The listener reports clear, non-technical Programmed / ResolvedRefs status, and self-heals automatically once the certificate issues. A data-plane backstop in the extension server drops only the offending filter chain if a bad certificate ever reaches the edge.

Observability

  • Per-listener metrics (withheld listeners, certificate expiry time, gating counter) and extension-server backstop gauges, labelled by gateway/listener/hostname.
  • Four alerts: customer listener withheld for a bad certificate, certificate expiring soon, the backstop actively dropping certificates, and — critical — a listener that can't be protected and will have its edge update rejected.
  • Runbooks for each alert and ServiceMonitors so operator and extension-server metrics reach Prometheus.

Validated end-to-end on staging: a connector tunnel's :443 listener programs and serves cleanly (200 round-trip) with zero listener rejections.

Full changes: #217

v0.23.3

23 Jun 21:39
a474ece

Choose a tag to compare

What's Changed

  • Fix tunnels staying offline at the edge after a connector connects by @scotwells in #211
  • fix(extension-server): unique connector domain to stop Envoy config NACK by @scotwells in #214
  • docs(extension-server): document edge re-translation for connector liveness by @scotwells in #213
  • chore: remove unknown field by @zachsmith1 in #133
  • test(e2e): fix flaky CA-secret race in gateway-accepted by @scotwells in #215

Full Changelog: v0.23.2...v0.23.3

v0.23.2

22 Jun 16:23
e5e0e4a

Choose a tag to compare

What's Changed

  • chore: add last transition resource metrics by @savme in #210

Full Changelog: v0.23.1...v0.23.2

v0.23.1

22 Jun 14:03
5013a86

Choose a tag to compare

What's Changed

  • chore: add last transition time metrics by @savme in #203

Full Changelog: v0.23.0...v0.23.1

v0.23.0 — Connector liveness, offline UX & branded error pages

19 Jun 19:39
a4214b2

Choose a tag to compare

Highlights

  • Connector liveness at the edge (#208) — a connector's authoritative Ready + connectionDetails now reach the data-plane member cluster via a propagated networking.datumapis.com/upstream-status annotation (Karmada propagates metadata, not the status subresource). Online connectors are now correctly classified and promoted to live tunnels instead of being treated as offline. Generic, opt-in per resource type.
  • Branded data-plane error pages (#205) — edge-generated 5xx responses on the downstream/Connector data plane render a branded page via the extension server's local_reply_config, replacing the (inert) EnvoyPatchPolicy.
  • Deterministic offline-connector response (#207) — an offline connector's user path returns a deliberate tunnel-offline 503 instead of routing to an endpoint-less cluster (no_healthy_upstream).
  • Offline connector backendRef in extension-server mode (#202) — NSO keeps the connector backendRef so the extension server has a cluster to key on (no more bare 500).

Other

  • chore: explicit nodeTaintsPolicy
  • docs: branded data-plane error pages enhancement proposal

Operational note: the branded page also requires the infra-side errorPage enablement (already merged) and a data-plane re-translation (extension-server + envoy-gateway restart) per cluster to fully activate.

v0.22.1 — AI Edge provisioning & cluster-name regression fixes

18 Jun 16:20
c51b0bc

Choose a tag to compare

This is a patch release that fixes a set of regressions surfaced by the v0.22.0 milo library upgrade — most importantly AI Edge hostname/TLS provisioning, which broke for every project — along with downstream watch handling for graduated Gateway API types and a couple of CI/test fixes.

What's fixed

  • AI Edge cert/hostname provisioning restored — The v0.22.0 milo bump (v0.7.4v0.28.1) changed how project clusters are keyed internally, from slash-prefixed (/my-project) to bare names (my-project), but a leftover webhook workaround still prepended a slash. The result: the gateway webhook rejected every HTTPProxy create/update with cluster /<project> not found, blocking all hostname and TLS certificate provisioning. The workaround is removed; existing resources self-heal on re-reconcile. (#196)

  • Legacy slash-encoded cluster names tolerated everywhere — Downstream resources created before #196 still carry the old encoding in their meta.datumapis.com/upstream-cluster-name label, which decoded back to /my-project and no longer matched the engaged cluster key — flooding every replica's logs with cluster /<project> not found and silently breaking downstream→upstream event propagation (a cert going Ready or a listener becoming Programmed no longer re-reconciled the upstream Gateway). All decode sites — the gateway/certificate/HTTPRoute watches, the replicator enqueue, the HTTPProxy and TrafficProtectionPolicy controllers, and the iroh-dns decoder — now route through a shared UpstreamClusterNameFromLabel helper that strips the legacy slash. No data migration: resources re-stamp to the new format on their next reconcile. (#199, #200)

  • Replicator tolerates graduated Gateway API versions — When a downstream type graduates and the older version stops being served (e.g. BackendTLSPolicy v1alpha3v1 after the Gateway API 1.5 CRD bump), registering a watch on the old version blocked WaitForCacheSync and stalled the gateway controller entirely, while CRUD against the old version returned 404. The replicator now resolves the served GVK once at setup, falls back to the configured storage version when needed, and uses it consistently across the watch, create, status-fetch, and delete paths. (#197)

CI & testing

  • Kustomize bundle published after the image buildpublish-kustomize-bundles no longer runs in parallel with publish-container-image; it now depends on the image job, so the bundle can never pin its references to an image that hasn't been built and pushed yet. (#201)

  • e2e CRDs and a dev bootstrap Taskfile — Installs the missing NSO CRDs on the downstream cluster so the e2e suite passes, and adds a dev Taskfile for spinning up the two-cluster environment (task dev:bootstrap) and running the suite (task dev:test, optionally scoped to a folder). (#198)

Note

No schema or migration changes. This release is fixes only — no CRD changes and no data migration. Resources created before v0.22.0 automatically re-stamp their upstream-cluster-name label to the new format as they reconcile.

Full Changelog: v0.22.0...v0.22.1

v0.22.0 — Envoy Gateway extension server & location discovery

18 Jun 01:07
a238174

Choose a tag to compare

This release ships the Envoy Gateway extension server — replacing per-gateway EnvoyPatchPolicy objects with a single gRPC hook that applies WAF and Connector configuration directly during xDS translation — along with new location discovery APIs that let consumers see which locations are available to their project.

What's new

  • Envoy Gateway extension server — Replaces N per-gateway EnvoyPatchPolicy objects with a single PostTranslateModify gRPC server co-located with the data plane; eliminates the TLS certificate race at scale (>140 gateways) and makes WAF/Connector injection robust against upstream connectivity loss. Deployed as an HA Deployment with mTLS, least-privilege RBAC, and a dedicated EG control plane. (#192, #187)

  • Envoy Gateway 1.8 & Gateway API 1.5 — Upgrades the gateway stack from EG 1.5/Gateway API 1.3; includes multi-tenancy hardening that rejects operator-owned Gateway.spec.tls and constrains Backend/SecurityPolicy egress targets to block SSRF against internal and cloud-metadata addresses. (#189)

  • Location discovery for consumers — New LocationBinding resource lets consumers query which locations are available to their project, with city/region metadata and a healthy/deployed gate enforced by the service catalog. (#171)

  • LocationBinding IAM coverage — New networking.datumapis.com-locationbinding-viewer role grants list/get/watch on LocationBindings; the location-viewer role is now included in networking-viewer so org owners are no longer Forbidden when listing locations. (#172, #162)

  • Quota display metadata — All 13 networking quota registrations now carry services.miloapis.com/owner, kubernetes.io/display-name, and kubernetes.io/description for richer quota UI display. (#176)

  • Connector cert-readiness gate — The connector EnvoyPatchPolicy is now generated per-listener only once that listener is Programmed, preventing permanently-failing policies when a hostname's TLS certificate is pending. (#185)

  • Faster tunnel reconnect — Iroh DNS TXT TTL reduced from 30 s to 5 s; tunnels now reconverge in ~1 second after a connector restart instead of 30+ seconds. (#179)

Note

Schema changes: LocationBinding is a new CRD added in this release. The HTTPProxy, Location, and TrafficProtectionPolicy CRDs have minor spec-comment and validation updates from the Gateway API 1.5 upgrade — no field removals or migration required. Existing resources keep working without any conversion.

v0.21.9

21 May 20:01
a4ae50b

Choose a tag to compare

Bug Fixes

  • WAF and OIDC now work together on the same gateway. Previously, enabling WAF (via TrafficProtectionPolicy) would silently disable OIDC authentication on the same gateway. Unauthenticated requests would pass through instead of being redirected to the login provider. This is now fixed — both policies can be active at the same time without interfering with each other. (#150)

Full Changelog: v0.21.8...v0.21.9

v0.21.8

15 May 14:21
0c9baa1

Choose a tag to compare

What's Changed

  • fix: remove ActivityPolicies for uninstalled route CRDs by @scotwells in #148
  • Publish multi-arch (amd64 + arm64) container images by @scotwells in #135
  • chore: bump datum-cloud/actions to v1.14.0 by @scotwells in #157
  • chore: embed image tag in kustomize bundle at publish time by @scotwells in #158
  • fix: respect user Host header override on HTTPProxy backends by @mattdjenkinson in #159

Full Changelog: v0.21.7...v0.21.8