diff --git a/docs/enhancements/envoy-gateway-extension-server/README.md b/docs/enhancements/envoy-gateway-extension-server/README.md index 47f7a464..3957e68e 100644 --- a/docs/enhancements/envoy-gateway-extension-server/README.md +++ b/docs/enhancements/envoy-gateway-extension-server/README.md @@ -251,6 +251,22 @@ computed from the snapshot Envoy Gateway passes in plus policy read from Kubernetes (cached; see [Sourcing Policy](#sourcing-policy-how-the-extension-server-knows-what-to-inject)). +#### Invariant: mutations to shared resources must be globally unique + +Because Envoy Gateway merges all customer gateways into one shared +configuration, several resources the Extension Server mutates are themselves +shared — most notably the HTTP listener's route configuration, whose +virtual-host domains form a single global namespace. Envoy enforces that these +domains are unique and rejects the entire xDS snapshot if any two collide. Any +identifier the Extension Server adds to a shared resource — a virtual-host +domain above all — must therefore be globally unique across the whole fleet, not +merely unique within one connector or gateway. The Connector satisfies this by +deriving a synthetic per-connector domain rather than reusing the backend host +(which is frequently a non-unique value such as `localhost`). This invariant is +load-bearing under the fail-closed delivery posture: a single duplicate value +does not degrade one gateway, it NACKs the snapshot and freezes configuration +delivery for every gateway (see [High Availability](#high-availability)). + ### Extension Mechanism: Which Hook and Why Envoy Gateway's @@ -443,15 +459,18 @@ following controls apply: only. mTLS already authenticates the caller; the `NetworkPolicy` shrinks the attack surface so unauthorized peers cannot even open a connection to attempt the handshake. -- **Least-privilege Kubernetes access.** The Extension Server reads policy and - status from Kubernetes but never writes; its ServiceAccount holds - **read-only** - [RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) on - exactly the resource types it consumes (Traffic Protection and Connector - policy and their status), and nothing more. This bounds the damage if the - process is compromised. The contrast with NSO's reconcilers, which hold write - access, is the reason the two run as separate processes — see [Deployment - Topology](#deployment-topology). +- **Least-privilege Kubernetes access.** The Extension Server reads the policy + and status it consumes (Traffic Protection and Connector policy and their + status) and holds no write access to any of it; its + [RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) grants + read-only verbs on exactly those types. Its one write is the re-translation + trigger: `patch` on `Gateway` (see [Triggering Re-translation on Policy + Change](#triggering-re-translation-on-policy-change)), scoped to that single + resource and verb and used only to set a trigger annotation — it cannot mutate + Gateway spec semantics, and it touches nothing else. This bounds the damage if + the process is compromised. The contrast with NSO's reconcilers, which hold + broad write access, is the reason the two run as separate processes — see + [Deployment Topology](#deployment-topology). - **Hardened pod.** Run as non-root with a read-only root filesystem, all capabilities dropped, and `allowPrivilegeEscalation: false` — the standard hardened posture for a control-plane workload. @@ -544,14 +563,58 @@ carried in `.status.conditions[Ready]`, updated on Lease expiry or renewal; stat updates do not increment the generation. The `resources` trigger therefore does not cover Connector online/offline transitions. -For this case, the Connector controller touches an annotation on each downstream -`Gateway` the Connector serves when the `Ready` condition flips. EG's -`metadataPredicate` for `Gateway` includes `AnnotationChangedPredicate`, so the -annotation change fires a full re-translation and the extension server re-applies -the correct routing config from its cache. This is the deliberate choice for -status-driven transitions: it uses the predicate already in place for Gateway -events, requires no new EG configuration, and places trigger logic in the -controller that detects the Lease-driven state change. +On the edge, the Connector's liveness arrives in the +`networking.datumapis.com/upstream-status` annotation — Karmada propagates a +member object's metadata, but not its `status` subresource, so the connector's +`Ready` condition and `connectionDetails` are mirrored into an annotation by the +replicator and the extension server reads its routing decision from there. EG +does not watch this annotation (the `Connector` is registered with a +generation-only predicate via `resources`), so a freshly online connector's +liveness lands in the extension server's cache while the data plane keeps serving +the stale — usually offline — program until some unrelated rebuild happens to +fire. + +A **dedicated re-translation controller**, co-located with the extension server +and sharing its informer cache, closes this gap. It watches Connectors and, when +a connector's liveness changes, patches a trigger annotation onto every Gateway +backed by an `HTTPProxy` that references the connector. EG re-translates on +Gateway annotation changes, so the patch forces a fresh `PostTranslateModify` +call, and the extension server re-applies the correct routing config from cache. + +Three properties make this safe and cheap: + +- **It runs at the edge, against the same cache EG translates against.** This is + the reason the controller is co-located with the extension server rather than + placed in the project-side Connector controller. The annotation is touched only + *after* the new liveness is already in the shared cache, so the re-translation + it provokes is guaranteed to read fresh data — eliminating the cross-cluster + ordering race a project-side trigger would have, where the Gateway touch could + reach the edge before the connector's own status annotation does and re-translate + against a stale cache. +- **It only reconciles on liveness changes.** The watch predicate admits creates + (so connectors already online at startup get stamped) and only those updates + that change the `(online, nodeID)` the extension server keys on; routine + heartbeat churn that does not affect routing is ignored. The annotation value + encodes that same `(online, nodeID)`, so a `connectionDetails` change (e.g. the + tunnel endpoint moves) re-translates too, not only `Ready` flips. +- **The Gateway patch is idempotent.** It is a merge patch with no preceding Get; + an unchanged value is a no-op at the API server (no `resourceVersion` bump, no + EG event), so the controller never provokes a spurious re-translation. A missing + Gateway is ignored — EG translates a Gateway when it is created, reading the + already-fresh cache, so there is nothing to nudge yet. + +The connector→Gateway mapping stays local: the Connector, its `HTTPProxy`, and the +Gateway share a namespace, and the Gateway is named after the `HTTPProxy`. The +controller needs `get;patch` on `gateways.gateway.networking.k8s.io` in addition +to the extension server's read-only policy access. + +One coverage edge remains: the controller watches Connectors, but the +connector→Gateway association is resolved through the `HTTPProxy`. A change that +newly links an existing Gateway to an already-online Connector — an `HTTPProxy` +created or repointed after the Connector's liveness has settled — produces no +Connector event, so the Gateway is not stamped until an unrelated rebuild fires. +Closing this fully requires the controller to additionally watch `HTTPProxy` and +map back to the affected Gateway. The alternative considered was having the replicator write a monotonic nonce into the downstream Connector's `spec` when the `Ready` condition flips, which would @@ -582,17 +645,27 @@ The Envoy AI Gateway project on all replicas because EG's calls are the bottleneck under load. NSO takes the split one step further: the Extension Server ships as its **own Deployment**, built from the same Go module and image as NSO but running as a distinct -workload with its own ServiceAccount. This is what gives it the read-only RBAC, -the dedicated `NetworkPolicy`, and the independent horizontal scaling the rest -of this design depends on — none of which it could have as a serving path inside -the leader-elected reconciler process. It is **horizontally scalable and +workload with its own ServiceAccount. This is what gives it the tightly scoped +RBAC, the dedicated `NetworkPolicy`, and the independent horizontal scaling the +rest of this design depends on — none of which it could have as a serving path +inside the leader-elected reconciler process. It is **horizontally scalable and stateless**, sized by EG's call rate rather than by reconcile load. The Extension Server runs in NSO's namespace, owned by NSO, and is reached by Envoy Gateway at a stable in-cluster FQDN (the -`extensionManager.service.fqdn`). NSO ownership is what makes the read-only RBAC, -the `NetworkPolicy`, and TLS SAN scoping coherent: one team owns the workload, -its identity, and the policy that fronts it. +`extensionManager.service.fqdn`). NSO ownership is what makes the RBAC, the +`NetworkPolicy`, and TLS SAN scoping coherent: one team owns the workload, its +identity, and the policy that fronts it. + +The re-translation controller (see [Triggering Re-translation on Policy +Change](#triggering-re-translation-on-policy-change)) runs **inside** this +process. It is the one writer in an otherwise read-only workload, and it runs on +every replica without leader election: its only write is an idempotent merge patch +of a trigger annotation, so concurrent replicas converge on the same value and the +redundant patches are no-ops at the API server. Co-locating it here is deliberate +— it must observe the same informer cache the extension server translates against, +which is what removes the cross-cluster ordering race a separate, project-side +trigger would have. ### High Availability @@ -622,10 +695,30 @@ Envoy Gateway is configured with: Server recovers, which is exactly why the two-replica, probe-gated, retry-backed posture above is mandatory rather than optional. +Fail-closed protects against the hook *erroring*, but not against the hook +returning a response Envoy Gateway accepts and pushes yet Envoy itself then +rejects (for example, a malformed or colliding resource — see the uniqueness +invariant above). Envoy applies each xDS snapshot atomically, so one rejected +resource discards the whole update and freezes configuration for every gateway +at once. This failure is invisible to Kubernetes state: Gateway and Route status +stay `Programmed=True` because translation succeeded. The only signals are +Envoy's `*.update_rejected` counters (LDS/RDS/CDS) and Envoy Gateway's +translation-error logs. Alerting on these xDS rejection metrics — RDS and CDS +rejections as well as LDS — is required; without it a fleet-wide config freeze is +silent. + Latency must be monitored: per-build hook latency and error rate are platform-health metrics. Policy reads stay off the synchronous build path via the informer cache described above. +Operability note: because the Extension Server embeds controller-runtime (the +policy cache and the re-translation controller), it must install a logger at +process startup. controller-runtime suppresses all of its own and its +controllers' log output until a logger is set, so an Extension Server that skips +this step runs without controller or hook logs — removing the second of the only +two diagnostic signals (logs and xDS metrics) precisely when an incident needs +them. + ### Reference Implementation A working prototype of the Extension Server has been built and validated @@ -848,6 +941,13 @@ configuration updates simultaneously; an EPP misconfiguration affects only one gateway's EPP object. This is the primary operational tradeoff (see [High Availability](#high-availability) for mitigations). +**Fleet-wide config freeze is invisible to Kubernetes status.** A single +malformed or colliding resource in a shared config makes Envoy reject the whole +atomic snapshot, freezing updates for every gateway while Gateway/Route status +still reads `Programmed=True`. Detection depends on Envoy xDS-rejection metrics +and EG translation logs rather than Kubernetes conditions (see +[High Availability](#high-availability)). + **Coupling to EG's extension hook API.** A supported, upstream-recommended mechanism, but an additional API surface to track across EG version upgrades.