fix(envoy-gateway): roll back EG control plane to v1.7.4 #193
Open
scotwells wants to merge 1 commit into
EG v1.8.x crashes the shared downstream gateway when any tenant's OIDC SecurityPolicy references a missing clientSecret. v1.8.0 (PR #8703) moved OIDC to a listener-level oauth2 filter; on a missing secret EG ships a config-less filter that Envoy rejects, taking down the entire listener snapshot for ALL tenants. v1.7.4 fails safe (per-route 500, no filter).

Pins both the control-plane and downstream gateway-helm charts to v1.7.4. All extensionManager features the ext-server depends on (policyResources, resources, translation.includeAll, retry, failOpen) are present in v1.7.4; kustomize build --enable-helm verified for both overlays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Pins the Envoy Gateway control plane (both the management cluster and the downstream `datum-downstream-gateway` charts) from v1.8.1 → v1.7.4.

Why
On v1.8.x, a single tenant's misconfigured OIDC `SecurityPolicy` takes down end-user traffic for every tenant on the shared downstream gateway.

1. A tenant's `SecurityPolicy` references a `clientSecret` that doesn't exist on the edge.
2. v1.8.0 moved OIDC to a listener-level `envoy.filters.http.oauth2` filter (envoyproxy/gateway#8703). With the secret missing, EG ships that filter config-less.
3. Envoy rejects the config-less filter: `config must be present for global config`. Because listener updates are atomic, the entire listener snapshot (tcp-80/tcp-443/udp-443) is rejected, so every freshly rolled proxy pod comes up serving nothing.

Observed live on `us-central-1-alice`: proxy pods passing readiness but carrying 0 active filter chains; WAF/Connector injection from the extension server was healthy and unrelated.

v1.7.4 is the last release before the redesign. It fails safe: a missing OIDC secret omits the oauth2 filter for that one route and returns a per-route 500, leaving every other tenant unaffected (verified in `internal/xds/translator/oidc.go@v1.7.4`: the filter is only attached when `routeContainsOIDC` is true, with a fully-populated config).

Safety
- All `extensionManager` features the NSO extension server depends on (`policyResources`, `resources`, `translation.includeAll`, `retry`, `failOpen`, `XDSNameSchemeV2`) are present in v1.7.4.
- `kustomize build --enable-helm` verified for both `config/tools/envoy-gateway` and `config/tools/envoy-gateway-downstream`; `extensionManager` config renders unchanged.
- The ext-server's Go dependency (`go.mod` EG v1.8.1) is intentionally left as-is: the ext-server gRPC extension protocol is stable across v1.5–v1.8, so a v1.8-built ext-server interoperates with a v1.7.4 control plane. Reverting the whole Go dependency set is out of scope.

Note for reviewers / rollout
The control-plane chart uses `includeCRDs: true`, so this re-applies Gateway API v1.4.1 CRDs (v1.8.1 shipped v1.5). Apply server-side; existing CRDs are updated in place, not deleted. NSO's generated CRDs target gateway-api v1.5, so confirm no v1.5-only fields are in use before rollout.

Follow-up: this is an upstream EG robustness gap (an Invalid OIDC policy still poisons the shared snapshot). Track the re-upgrade once it is fixed upstream, and consider gating Invalid SecurityPolicies out of the shared gateway.
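
To make the gating idea from the follow-up concrete, here is a minimal sketch in Go. It is not code from this PR, from the NSO extension server, or from Envoy Gateway: the `SecurityPolicy` struct, the `gatePolicies` function, and the `tenant/secret` key format are all hypothetical stand-ins chosen for illustration. It assumes the gate can see which client secrets exist on the edge, and it keeps any policy with a missing secret out of the set handed to the shared gateway.

```go
package main

import "fmt"

// SecurityPolicy is a hypothetical, trimmed-down stand-in for a tenant's OIDC
// policy. It is NOT the Envoy Gateway API type.
type SecurityPolicy struct {
	Tenant           string
	Name             string
	ClientSecretName string // name of the Secret holding the OIDC client secret
}

// gatePolicies splits policies into those whose client secret exists on the
// edge (admitted) and those that should be kept out of the shared gateway
// (gated). edgeSecrets is keyed by "tenant/secretName"; that key format is an
// assumption made for this sketch.
func gatePolicies(policies []SecurityPolicy, edgeSecrets map[string]bool) (admitted, gated []SecurityPolicy) {
	for _, p := range policies {
		key := p.Tenant + "/" + p.ClientSecretName
		if p.ClientSecretName != "" && edgeSecrets[key] {
			admitted = append(admitted, p)
		} else {
			// A missing secret affects only this tenant's policy instead of
			// reaching the shared listener configuration.
			gated = append(gated, p)
		}
	}
	return admitted, gated
}

func main() {
	edgeSecrets := map[string]bool{"tenant-a/oidc-client": true}
	policies := []SecurityPolicy{
		{Tenant: "tenant-a", Name: "login", ClientSecretName: "oidc-client"},
		{Tenant: "tenant-b", Name: "login", ClientSecretName: "missing-secret"},
	}

	admitted, gated := gatePolicies(policies, edgeSecrets)
	fmt.Println("admitted:", len(admitted), "gated:", len(gated))
}
```

Where such a gate would live (the extension server, an admission check, or upstream in EG) is an open design question that this sketch does not settle.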