fix(envoy-gateway): roll back EG control plane to v1.7.4 #193

Open
scotwells wants to merge 1 commit into main from fix/rollback-eg-1.7.4

@scotwells

Contributor

What

Rolls the Envoy Gateway control plane back from v1.8.1 to v1.7.4 and pins it there, in both the management-cluster chart and the downstream datum-downstream-gateway chart.

Why

On v1.8.x, a single tenant's misconfigured OIDC SecurityPolicy takes down end-user traffic for every tenant on the shared downstream gateway.

  • A tenant's OIDC SecurityPolicy references a clientSecret that doesn't exist on the edge.
  • EG v1.8.0 reworked OIDC into a listener-level envoy.filters.http.oauth2 filter (envoyproxy/gateway#8703). With the secret missing, EG ships that filter config-less.
  • Envoy rejects it with the error "config must be present for global config". Because listener updates are atomic, the entire listener snapshot (tcp-80/tcp-443/udp-443) is rejected, so every freshly-rolled proxy pod comes up serving nothing.
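
The failure chain above can be illustrated with a small toy model. This is only a sketch of the all-or-nothing behaviour described in the bullets, not Envoy or EG code: the `HTTPFilter` and `Listener` types and the `validate` and `applySnapshot` functions are hypothetical, and placing the broken filter on tcp-443 is an assumption made for the example.

```go
package main

import "fmt"

// HTTPFilter is a hypothetical stand-in for one HTTP filter on a listener.
type HTTPFilter struct {
	Name      string
	HasConfig bool // false models the config-less oauth2 filter
}

// Listener is a hypothetical stand-in for one listener in the snapshot.
type Listener struct {
	Name    string
	Filters []HTTPFilter
}

// validate returns an error if any filter on the listener has no config.
func validate(l Listener) error {
	for _, f := range l.Filters {
		if !f.HasConfig {
			return fmt.Errorf("listener %s: filter %s: config must be present for global config", l.Name, f.Name)
		}
	}
	return nil
}

// applySnapshot models an all-or-nothing update: the names of the active
// listeners are returned only if every listener in the snapshot validates.
func applySnapshot(snapshot []Listener) ([]string, error) {
	for _, l := range snapshot {
		if err := validate(l); err != nil {
			return nil, err
		}
	}
	active := make([]string, 0, len(snapshot))
	for _, l := range snapshot {
		active = append(active, l.Name)
	}
	return active, nil
}

func main() {
	router := HTTPFilter{Name: "envoy.filters.http.router", HasConfig: true}
	brokenOAuth2 := HTTPFilter{Name: "envoy.filters.http.oauth2", HasConfig: false}

	// Assumption for illustration: the config-less filter sits on tcp-443.
	snapshot := []Listener{
		{Name: "tcp-80", Filters: []HTTPFilter{router}},
		{Name: "tcp-443", Filters: []HTTPFilter{brokenOAuth2, router}},
		{Name: "udp-443", Filters: []HTTPFilter{router}},
	}

	active, err := applySnapshot(snapshot)
	fmt.Println("active listeners:", len(active))
	fmt.Println("error:", err)
}
```

In this model a single config-less filter leaves zero listeners active, which mirrors the "serving nothing" symptom: the healthy listeners are never applied because they travel in the same snapshot as the invalid one.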

Observed live on us-central-1-alice: proxy pods passing readiness but carrying 0 active filter chains; WAF/Connector injection from the extension server was healthy and unrelated.

v1.7.4 is the last release before the redesign. It fails safe: a missing OIDC secret omits the oauth2 filter for that one route and returns a per-route 500, leaving every other tenant unaffected (verified in internal/xds/translator/oidc.go@v1.7.4 — the filter is only attached when routeContainsOIDC is true, with a fully-populated config).
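
For contrast, the per-route fail-safe described above might be modelled as follows. This is a simplified sketch and not the contents of internal/xds/translator/oidc.go: the `Route` and `RoutePlan` types and the `planRoute` function are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Route is a hypothetical stand-in for one tenant route.
type Route struct {
	Name        string
	WantsOIDC   bool // the route is covered by an OIDC SecurityPolicy
	SecretFound bool // the referenced clientSecret exists
}

// RoutePlan is a hypothetical description of what gets programmed for a route.
type RoutePlan struct {
	AttachOAuth2 bool // attach a fully-populated oauth2 filter
	DirectStatus int  // non-zero: answer with this status instead of forwarding
}

// planRoute models the fail-safe: a missing secret affects only its own route.
func planRoute(r Route) RoutePlan {
	switch {
	case !r.WantsOIDC:
		return RoutePlan{} // plain route, no filter, forwarded normally
	case !r.SecretFound:
		return RoutePlan{DirectStatus: 500} // no filter, per-route 500
	default:
		return RoutePlan{AttachOAuth2: true}
	}
}

func main() {
	routes := []Route{
		{Name: "tenant-a/app", WantsOIDC: true, SecretFound: false},
		{Name: "tenant-b/app", WantsOIDC: true, SecretFound: true},
		{Name: "tenant-c/site", WantsOIDC: false},
	}
	for _, r := range routes {
		fmt.Printf("%s -> %+v\n", r.Name, planRoute(r))
	}
}
```

The point of the model is that each route is decided independently, so the route with the missing secret gets a 500 while the other two are programmed as usual.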

Safety

  • All extensionManager features the NSO extension server depends on (policyResources, resources, translation.includeAll, retry, failOpen, XDSNameSchemeV2) are present in v1.7.4.
  • kustomize build --enable-helm verified for both config/tools/envoy-gateway and config/tools/envoy-gateway-downstream; extensionManager config renders unchanged.
  • Build dependency (go.mod EG v1.8.1) is intentionally left as-is — the ext-server gRPC extension protocol is stable across v1.5–v1.8, so a v1.8-built ext-server interoperates with a v1.7.4 control plane. Reverting the whole Go dependency set is out of scope.

Note for reviewers / rollout

The control-plane chart uses includeCRDs: true, so this re-applies Gateway API v1.4.1 CRDs (v1.8.1 shipped v1.5). Apply server-side; existing CRDs are updated in place, not deleted. NSO's generated CRDs target gateway-api v1.5 — confirm no v1.5-only fields are in use before rollout.

Follow-up: this is an upstream EG robustness gap (Invalid OIDC policy still poisons the shared snapshot); track re-upgrade once fixed upstream, and consider gating Invalid SecurityPolicies out of the shared gateway.
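
One possible shape for the gating idea in the follow-up is sketched below. It assumes a hypothetical pre-translation step that already knows whether each policy was accepted; the `SecurityPolicy` struct and the `gate` function here are illustrative only and are not EG API types.

```go
package main

import "fmt"

// SecurityPolicy is a hypothetical, minimal view of a tenant policy.
type SecurityPolicy struct {
	Tenant   string
	Name     string
	Accepted bool // false models a policy reported as Invalid
}

// gate splits policies into those admitted to the shared gateway and those
// held back, so an Invalid policy never reaches the shared snapshot.
func gate(policies []SecurityPolicy) (admitted, held []SecurityPolicy) {
	for _, p := range policies {
		if p.Accepted {
			admitted = append(admitted, p)
		} else {
			held = append(held, p)
		}
	}
	return admitted, held
}

func main() {
	admitted, held := gate([]SecurityPolicy{
		{Tenant: "tenant-a", Name: "oidc", Accepted: false},
		{Tenant: "tenant-b", Name: "oidc", Accepted: true},
	})
	fmt.Println("admitted:", len(admitted), "held:", len(held))
}
```

Where such a gate would live (the extension server, an admission step, or upstream EG itself) is left open here.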

EG v1.8.x takes down traffic on the shared downstream gateway when any tenant's OIDC
SecurityPolicy references a missing clientSecret. v1.8.0 (PR #8703) moved
OIDC to a listener-level oauth2 filter; on a missing secret EG ships a
config-less filter that Envoy rejects, taking down the entire listener
snapshot for ALL tenants. v1.7.4 fails safe (per-route 500, no filter).

Pins both the control-plane and downstream gateway-helm charts to v1.7.4.
All extensionManager features the ext-server depends on (policyResources,
resources, translation.includeAll, retry, failOpen) are present in v1.7.4;
kustomize build --enable-helm verified for both overlays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>