On the shared datum-downstream-gateway Envoy Gateway, a single tenant's OIDC SecurityPolicy that references a missing clientSecret takes down end-user traffic for every tenant on the gateway. Freshly-rolled Envoy proxy pods come up with zero active listener filter chains and serve nothing (HTTP + HTTPS). This was a live production incident on edge cluster us-central-1-alice and was resolved by rolling Envoy Gateway back from v1.8.1 → v1.7.4 (infra PR datum-cloud/infra#2746, NSO PR #193).
We need an e2e test that reproduces this failure mode so we can (a) prevent regressions when we re-upgrade EG, and (b) cover the broader class of "one bad tenant policy poisons the shared snapshot."
Impact
Severity: high. All-tenant data-plane outage on a shared gateway, triggered by one tenant's misconfiguration (or any not-yet-propagated OIDC secret).
The bug is latent: it only manifests when a proxy pod (re)starts and pulls the poisoned xDS snapshot. Existing pods keep serving their last-good config, so it can lurk until the next rollout / node drain / scale event and then cause a wide outage.
Root cause
EG v1.8.0 (envoyproxy/gateway#8703, "oidc: native oauth2 per-route config") reworked OIDC from a per-route filter into a listener-level envoy.filters.http.oauth2 filter, with per-route config in typed_per_filter_config.
When the OIDC clientSecret is missing:
EG correctly marks the SecurityPolicy Accepted=False, reason=Invalid ("OIDC: secret ... does not exist").
But it still emits the listener-level oauth2 filter with an empty/config-less typed_config (just the @type, no config).
Envoy rejects that filter with Error adding/updating listener(s) ...: config must be present for global config.
Because xDS listener updates are atomic per snapshot, that one bad filter chain rejects the whole listener set (tcp-80/tcp-443/udp-443) — so every fresh proxy pod has 0 active chains.
v1.7.x and earlier fail safe: a missing OIDC secret omits the oauth2 filter for just that route and returns a per-route 500 direct response (internal/gatewayapi/securitypolicy.go sets an errorResponse DirectResponse; internal/xds/translator/oidc.go only attaches the oauth2 filter when routeContainsOIDC is true, with a fully-populated config). One tenant's broken policy cannot poison the shared listener.
Reproduction (observed)
On the shared datum-downstream-gateway EG v1.8.x, create/propagate an OIDC SecurityPolicy (targeting a Gateway/route on the shared gateway) whose spec.oidc.clientSecret points at a Secret that does not exist on the edge.
Live trigger: ns-2594b296-.../google-oidc → clientSecret: google-oidc-client-secret (never propagated to edge).
Trigger a proxy pod restart (rollout, drain, or delete the pod).
Observe:
EG controller logs: Envoy rejected the last update ... config must be present for global config (repeating for tcp-80/tcp-443/udp-443).
Envoy proxy config_dump: tcp-443 has error_state set and 0 active filter chains; /ready still returns 200 (only the readiness listener is up — misleading).
The offending listener-level filter in the dump: {"name":"envoy.filters.http.oauth2","typed_config":{"@type":"...oauth2.v3.OAuth2"}} — no config.
End-user HTTPS to any hostname on the gateway fails (no listener to accept it).
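The config-less filter described above can be detected mechanically in a saved dump. The following is a minimal sketch, assuming the dump has already been fetched and parsed as JSON; the function name `find_configless_oauth2` and the path format it returns are illustrative, not part of any existing tooling.

```python
from typing import Any, Iterator

OAUTH2_FILTER = "envoy.filters.http.oauth2"


def find_configless_oauth2(node: Any, path: str = "$") -> Iterator[str]:
    """Yield JSON paths of oauth2 filter entries whose typed_config has only "@type".

    Hypothetical helper: walks any parsed JSON value (a whole config_dump or
    just the failed listener configuration) and flags the shape seen in the
    incident, i.e. a filter named envoy.filters.http.oauth2 with no fields
    besides "@type" in its typed_config.
    """
    if isinstance(node, dict):
        if node.get("name") == OAUTH2_FILTER:
            typed = node.get("typed_config") or {}
            # A usable filter carries fields beyond "@type" (notably "config").
            if not (set(typed) - {"@type"}):
                yield path
        for key, value in node.items():
            yield from find_configless_oauth2(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_configless_oauth2(item, f"{path}[{index}]")


if __name__ == "__main__":
    # The offending entry, copied from the dump excerpt in this issue.
    offending = {
        "name": "envoy.filters.http.oauth2",
        "typed_config": {"@type": "...oauth2.v3.OAuth2"},
    }
    print(list(find_configless_oauth2({"http_filters": [offending]})))
```

Because the walk is structure-agnostic, it does not depend on where in the dump the rejected listener configuration is stored.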
Expected behavior
A single Invalid SecurityPolicy (missing OIDC secret) must not reject the shared listener snapshot. The broken route should fail closed on its own (e.g. per-route 500), while all other tenants/routes on the gateway continue serving — including the ext-server-injected WAF/Connector config.
e2e test we want (the ask)
Add an e2e case (two-cluster topology, matching the real edge projection path) that:
Stands up the shared downstream gateway with EG at the version under test, the NSO extension server, and at least one WAF-protected end-user HTTPProxy/route with a valid TrafficProtectionPolicy.
Introduces an OIDC SecurityPolicy on the shared gateway whose clientSecret is missing (Secret not created / not propagated).
Forces a proxy pod restart (this is essential — the bug only surfaces on a fresh xDS pull; a test that only checks steady state will pass even on a buggy build).
Asserts:
The freshly-started proxy has the tcp-443 listener accepted with active filter chains (e.g. via admin config_dump: active_state present, error_state absent, len(filter_chains) > 0).
The unrelated WAF-protected route still works (benign request 200, attack request 403) — i.e. the bad OIDC policy did not poison the shared listener.
The route targeted by the broken OIDC policy fails on its own (per-route 5xx), not the whole gateway.
EG controller emits no "config must be present for global config" rejection.
Generalize the assertion helper so we can reuse it for the broader invariant: "one Invalid/partial tenant policy must never reject the shared listener snapshot for other tenants." (Same class also seen with missing BasicAuth secrets, which EG already handles gracefully — good positive control.)
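One possible shape for the reusable assertion helper is sketched below. It assumes the test has already fetched the proxy's admin config_dump and parsed it as JSON, and that dynamic listeners appear under `configs[].dynamic_listeners[]` with `active_state` / `error_state` fields as in the dump described above. The names `listener_problems` and `assert_listener_serving` are hypothetical.

```python
from typing import Any, Dict, List


def listener_problems(config_dump: Dict[str, Any], listener_name: str) -> List[str]:
    """Return the reasons the named dynamic listener is not serving.

    An empty list means the listener has an active_state with at least one
    filter chain and no error_state.
    """
    for section in config_dump.get("configs", []):
        for entry in section.get("dynamic_listeners", []):
            if entry.get("name") != listener_name:
                continue
            problems: List[str] = []
            if "error_state" in entry:
                details = entry["error_state"].get("details", "no details")
                problems.append(f"error_state present: {details}")
            active = entry.get("active_state")
            if active is None:
                problems.append("active_state missing")
            elif not active.get("listener", {}).get("filter_chains"):
                problems.append("no active filter chains")
            return problems
    return [f"listener {listener_name!r} not found in config_dump"]


def assert_listener_serving(config_dump: Dict[str, Any], listener_name: str) -> None:
    """Fail with every detected problem, so a red test explains itself."""
    problems = listener_problems(config_dump, listener_name)
    assert not problems, f"{listener_name}: " + "; ".join(problems)
```

Keeping the check as a pure function over the parsed dump lets the same helper serve the OIDC case, the BasicAuth positive control, and any future "one Invalid policy" scenario; only the step that produces the bad policy changes.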
Notes / links
Mitigation in prod: EG pinned to v1.7.4 in infra (apps/network-services-operator/downstream/base/envoy-gateway-oci-repository.yaml; PR datum-cloud/infra#2746) and in the NSO config/tools overlays (PR #193, "fix(envoy-gateway): roll back EG control plane to v1.7.4"). Verified resolved on us-central-1-alice: post-rollback, both proxies serve tcp-443 with 203 active chains and WAF on every chain, with zero "config must be present" rejections.
This is fundamentally an upstream EG robustness gap (an Invalid OIDC policy still poisons the shared snapshot). Closely related upstream issue: "Envoy OIDC SecurityPolicy fails on startup and doesn't recover" (envoyproxy/gateway#6123). The e2e test here is our guardrail for when we attempt to re-upgrade EG to v1.8.x+: block the upgrade until this case passes, or land a platform-side guard that filters Invalid SecurityPolicies out of the shared gateway translation.
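As a rough illustration of the platform-side guard option, the sketch below partitions SecurityPolicies by whether their OIDC clientSecret exists on the edge before they reach the shared gateway. It is an assumption-heavy sketch, not a design: it presumes the guard sees policies as plain dicts, has a set of (namespace, name) pairs for Secrets present on the edge, and that a clientSecret without an explicit namespace resolves to the policy's own namespace. `oidc_secret_key` and `partition_policies` are hypothetical names.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple

SecretKey = Tuple[str, str]  # (namespace, name)


def oidc_secret_key(policy: Dict[str, Any]) -> Optional[SecretKey]:
    """Return the (namespace, name) of the policy's OIDC clientSecret, if any."""
    oidc = policy.get("spec", {}).get("oidc")
    if not oidc:
        return None  # not an OIDC policy; nothing to check here
    ref = oidc.get("clientSecret", {})
    # Assumption: an omitted namespace means the policy's own namespace.
    namespace = ref.get("namespace") or policy["metadata"]["namespace"]
    return (namespace, ref.get("name", ""))


def partition_policies(
    policies: Iterable[Dict[str, Any]],
    existing_secrets: Set[SecretKey],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split policies into (projectable, held_back).

    A policy is held back when it configures OIDC and its clientSecret is
    not among the Secrets known to exist on the edge.
    """
    projectable: List[Dict[str, Any]] = []
    held_back: List[Dict[str, Any]] = []
    for policy in policies:
        key = oidc_secret_key(policy)
        if key is None or key in existing_secrets:
            projectable.append(policy)
        else:
            held_back.append(policy)
    return projectable, held_back
```

A guard of this kind would only cover the missing-secret case; the e2e test remains the check on the broader invariant.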