From bc04aa99393d9adc09dd331885ea6da32aaf79a5 Mon Sep 17 00:00:00 2001 From: Scot Wells Date: Thu, 25 Jun 2026 14:35:06 -0500 Subject: [PATCH] test(e2e): prove the edge's core guarantees with real traffic Adds four end-to-end scenarios that send real traffic through the edge and confirm the promises customers depend on: the firewall enforces, an offline origin fails cleanly, the branded error page shows, and one bad certificate can't break its neighbors. Each asserts on a real response, not just that the control plane wrote the config. Includes a plain-language guide to how we test the edge and what we can't yet guarantee. These scenarios need the two-cluster production-fidelity environment, so they live under test/e2e-edge (separate from the single-stack test/e2e suite the default CI runs) and execute via `task test-infra:e2e` against that environment. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01JbCy8vy66RdNYzGSgqH6P6 --- docs/testing/README.md | 87 +++ test/e2e-edge/README.md | 87 +++ .../_fixtures/certs/expired-cert-secret.yaml | 14 + test/e2e-edge/_fixtures/certs/gen-certs.sh | 84 +++ .../_fixtures/connector-tunnel/Dockerfile | 13 + .../_fixtures/connector-tunnel/go.mod | 3 + .../_fixtures/connector-tunnel/main.go | 94 +++ .../_fixtures/connector-tunnel/manifest.yaml | 66 ++ test/e2e-edge/_fixtures/echo-backend.yaml | 46 ++ test/e2e-edge/_fixtures/waf-corpus.txt | 30 + test/e2e-edge/_steps/assert-config-dump.yaml | 156 ++++ .../e2e-edge/_steps/assert-http-response.yaml | 152 ++++ test/e2e-edge/_steps/capture-build-id.yaml | 115 +++ .../_steps/flip-connector-liveness.yaml | 78 ++ test/e2e-edge/_steps/wait-config-settle.yaml | 85 +++ .../chainsaw-test.yaml | 672 ++++++++++++++++++ .../branded-error-page/chainsaw-test.yaml | 480 +++++++++++++ .../connector-offline-503/chainsaw-test.yaml | 546 ++++++++++++++ .../extension-server-smoke/chainsaw-test.yaml | 337 +++++++++ .../waf-enforcement/chainsaw-test.yaml | 440 
++++++++++++ 20 files changed, 3585 insertions(+) create mode 100644 docs/testing/README.md create mode 100644 test/e2e-edge/README.md create mode 100644 test/e2e-edge/_fixtures/certs/expired-cert-secret.yaml create mode 100755 test/e2e-edge/_fixtures/certs/gen-certs.sh create mode 100644 test/e2e-edge/_fixtures/connector-tunnel/Dockerfile create mode 100644 test/e2e-edge/_fixtures/connector-tunnel/go.mod create mode 100644 test/e2e-edge/_fixtures/connector-tunnel/main.go create mode 100644 test/e2e-edge/_fixtures/connector-tunnel/manifest.yaml create mode 100644 test/e2e-edge/_fixtures/echo-backend.yaml create mode 100644 test/e2e-edge/_fixtures/waf-corpus.txt create mode 100644 test/e2e-edge/_steps/assert-config-dump.yaml create mode 100644 test/e2e-edge/_steps/assert-http-response.yaml create mode 100644 test/e2e-edge/_steps/capture-build-id.yaml create mode 100644 test/e2e-edge/_steps/flip-connector-liveness.yaml create mode 100644 test/e2e-edge/_steps/wait-config-settle.yaml create mode 100644 test/e2e-edge/atomic-reject-isolation/chainsaw-test.yaml create mode 100644 test/e2e-edge/branded-error-page/chainsaw-test.yaml create mode 100644 test/e2e-edge/connector-offline-503/chainsaw-test.yaml create mode 100644 test/e2e-edge/extension-server-smoke/chainsaw-test.yaml create mode 100644 test/e2e-edge/waf-enforcement/chainsaw-test.yaml diff --git a/docs/testing/README.md b/docs/testing/README.md new file mode 100644 index 00000000..8de1abba --- /dev/null +++ b/docs/testing/README.md @@ -0,0 +1,87 @@ +# Testing the Datum edge + +This is the map for how we test the network-services-operator (NSO) — what we +prove, and why it's built the way it is. It's written for anyone who wants to +understand the safety net, not just the people who maintain it. 
+ +## Why this exists + +NSO programs the edge: when a customer creates a Gateway, a route, a web-app +firewall policy, or a connector, NSO turns that intent into live configuration +on the Envoy proxies that actually serve customer traffic. The hard part isn't +producing the configuration — it's making sure the configuration that lands on +the running proxy *does what the customer asked*. + +Almost every production incident in this system has shared one shape: + +> **The configuration was logically correct, the platform reported success, but +> the running proxy behaved differently than intended** — a firewall rule that +> protected nothing, an offline backend that still returned a blank error, a +> branded error page that never appeared, a single bad certificate that froze a +> whole shared listener. + +These failures are invisible to ordinary tests. The Kubernetes resources report +"Programmed = True," unit tests pass, and the gap only shows up when real +traffic — often a real *attack* — arrives at the edge. So our testing is built +around one principle: + +**Prove behavior at the edge with real traffic, against an environment that +looks like production — not against the platform's own report of success.** + +## The two ideas everything rests on + +**1. Production fidelity.** The recent incidents lived precisely on the axes +where our old test environment differed from production: the proxy version, the +"fail closed" availability coupling, the firewall data plane, and multi-cluster +replication. So the test environment stands up the *real* shape of the edge — +the same proxy version, the same extension server that rewrites configuration, +the same firewall image, and the same federation mechanism that fans +configuration out to edge clusters. If a test passes here, it passes against +something production actually runs. 
This environment is brought online with a +single command set (`task test-infra:up`); see +[`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml). + +**2. Traffic-first, with a tie-breaker.** Every test's verdict is the +*observed behavior* of real traffic through the real proxy — a blocked attack, +a 503 for an offline backend, the branded page in the response body. We never +let "the platform says it worked" stand in for "the edge did the right thing." + +Real traffic alone has one blind spot, though: a firewall that protects nothing +and a firewall that's simply not being attacked produce the *same* successful +response. So traffic is backed by a **parity check** — a comparison of what the +edge was told to do against what the running proxy is actually doing — which +turns a surprising result into a diagnosis instead of a guess. Traffic is always +the verdict; parity is the tie-breaker that catches the silently-inert case. +See [`test/parity/README.md`](../../test/parity/README.md). + +## What we guarantee + +The end-to-end suite ([`test/e2e-edge/README.md`](../../test/e2e-edge/README.md)) turns +each past incident into a standing guarantee, checked against real traffic: + +- **A web-app firewall actually blocks attacks** — a malicious request is + refused while a legitimate one still succeeds. +- **An offline backend fails cleanly** — the customer path returns a real 503, + not a hang or a blank. +- **Branded error pages reach the customer** — the styled page appears in the + actual response, not just in configuration. +- **One bad certificate can't take down its neighbors** — an invalid listener + is isolated while the healthy listeners on the same proxy keep serving. + +And the federation layer ([`config/federation/README.md`](../../config/federation/README.md)) +proves that configuration created in the control plane genuinely arrives at the +edge clusters that serve traffic. 
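The "traffic is the verdict, parity is the tie-breaker" rule described above can be sketched as a small decision function. This is a minimal illustration only: the `Observation` type, its field names, and the diagnosis strings are hypothetical and are not taken from the suite or from `test/parity`.

```go
package main

import "fmt"

// Observation is a hypothetical record of one scenario run. The field names
// are illustrative assumptions, not the suite's actual data model.
type Observation struct {
	TrafficOK bool // the real response matched the expected behavior
	ParityOK  bool // the running proxy carries the configuration it was sent
}

// Diagnose treats the traffic result as the verdict and uses parity to
// explain it, including the silently-inert case where the response looks
// fine but the configuration is not actually on the proxy.
func Diagnose(o Observation) string {
	switch {
	case o.TrafficOK && o.ParityOK:
		return "pass"
	case o.TrafficOK && !o.ParityOK:
		return "suspect: response looked right but the configuration is inert or missing"
	case !o.TrafficOK && o.ParityOK:
		return "fail: configuration is present but the proxy behaved differently"
	default:
		return "fail: configuration never reached the proxy"
	}
}

func main() {
	// A successful-looking response with no matching configuration is the
	// blind spot that traffic alone cannot see.
	fmt.Println(Diagnose(Observation{TrafficOK: true, ParityOK: false}))
}
```

The point of the split is that a traffic failure always fails the run, while the parity result decides which diagnosis is attached to it.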
+ +## Where things live + +| Area | What it covers | +|---|---| +| [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) | Brings the production-fidelity edge online and runs the suites | +| [`test/e2e-edge/README.md`](../../test/e2e-edge/README.md) | The real-traffic guarantees, scenario by scenario | +| [`test/parity/README.md`](../../test/parity/README.md) | The parity check that catches silently-inert configuration | +| [`config/federation/README.md`](../../config/federation/README.md) | Fanning configuration out to edge clusters | + +> The design rationale that led here — the original audit of test-vs-production +> gaps and the proposals for closing them — lives in the pull requests that +> introduced this work, not in the repository, because it describes a plan +> rather than the system as it stands today. diff --git a/test/e2e-edge/README.md b/test/e2e-edge/README.md new file mode 100644 index 00000000..952e28ea --- /dev/null +++ b/test/e2e-edge/README.md @@ -0,0 +1,87 @@ +# End-to-end edge guarantees + +These tests prove that the Datum edge *behaves* the way customers expect, by +sending real traffic through the real proxy and checking what actually comes +back. They are the standing guarantees described in +[`docs/testing/README.md`](../../docs/testing/README.md). + +## How a guarantee is checked + +Every scenario follows the same three-part check, in priority order: + +1. **The traffic verdict (always decisive).** The test makes a real request and + asserts on the real response — a blocked attack, a 503, a branded page, a + 200 from a healthy listener. If the edge behaves wrongly, this fails. A test + is never satisfied by the platform merely *reporting* success. +2. **The configuration is genuinely present.** A + [parity check](../parity/README.md) confirms the running proxy actually + carries the configuration it was told to — closing the blind spot where a + successful-looking response hides a rule that protects nothing. +3. 
**It's the right configuration serving the request.** A build marker + confirms the response came from the configuration under test, not from stale + config left over from a previous state — so a pass can't be a timing fluke. + +The first is the point; the second and third exist so a surprising result +becomes a diagnosis rather than a mystery. + +## The scenarios + +Each one corresponds to a past production incident, now held in place. + +### Web-app firewall enforcement +A malicious request (matching the firewall's attack rules) must be refused, +while a legitimate request to the same endpoint still succeeds. The test also +flips the policy into observe-only mode and confirms the same attack is then +*allowed* — proving the block is genuinely driven by the customer's firewall +policy and not by some unrelated default. This guards the customer's actual +protection, and the risk that a single bad firewall rule wedges the whole +listener. + +### Offline backend returns a clean 503 +When a backend connector is offline, the customer-facing path must return a +real 503 — not hang, and not serve a blank. This is checked as an observed +response on the user path, because that's what a customer experiences. + +### Branded error page reaches the customer +When the edge serves an error, the customer must receive Datum's styled page, +confirmed by finding the page's content in the actual response body — not by +trusting that the configuration was applied. Production once needed manual +restarts to make this take effect, exactly the kind of inert-configuration gap +this scenario now catches. + +### One bad certificate can't break its neighbors +A single invalid certificate must not take down the other, healthy listeners +sharing the same proxy. The test introduces a genuinely bad certificate and +confirms its listener is isolated while sibling listeners keep serving real +traffic — the "one bad resource freezes everything" failure mode, contained. 
+ +## Running them + +The scenarios run against the production-fidelity environment: + +``` +task test-infra:up # bring the edge online (proxy + extension server + firewall) +task test-infra:smoke # quick confidence check: real traffic serves +task test-infra:e2e # the full guarantee suite above +``` + +See [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) for the +environment these assume. + +## Layout + +- Scenario folders (`waf-enforcement/`, `connector-offline-503/`, + `branded-error-page/`, `atomic-reject-isolation/`) — one guarantee each. +- `_steps/` — shared, reusable checks (send-a-request, confirm-configuration, + capture-the-build-marker) so every scenario asserts behavior the same way. +- `_fixtures/` — the supporting pieces a scenario needs (a sample backend, an + attack corpus, a pre-made bad certificate, the offline-backend stand-in). + +## A note on honesty + +Where the edge genuinely cannot yet do something, the suite says so rather than +papering over it. The "offline backend recovers" path is deliberately held back +because the edge today does not reliably re-apply configuration when a backend +comes *back* online — and the test proves that gap exists instead of pretending +it's closed. A guarantee we can't keep is documented as a gap, not asserted as +a pass. diff --git a/test/e2e-edge/_fixtures/certs/expired-cert-secret.yaml b/test/e2e-edge/_fixtures/certs/expired-cert-secret.yaml new file mode 100644 index 00000000..305c8583 --- /dev/null +++ b/test/e2e-edge/_fixtures/certs/expired-cert-secret.yaml @@ -0,0 +1,14 @@ +# GENERATED by gen-certs.sh — do not edit by hand; re-run the script to refresh. +# +# A pre-minted EXPIRED certificate for bad.e2e.env.datum.net. Its validity window +# (20260621000000Z .. 20260622000000Z) is in the past, so the extension server +# drops the listener that references it while sibling listeners keep serving. 
+apiVersion: v1 +kind: Secret +metadata: + name: expired-leaf-tls + namespace: default +type: kubernetes.io/tls +data: + tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJ0ekNDQVY2Z0F3SUJBZ0lVTW5lMFBkKzczaDFPcFh0YWo1eDlIbzRxR2Fjd0NnWUlLb1pJemowRUF3SXcKSURFZU1Cd0dBMVVFQXd3VlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQjRYRFRJMk1EWXlNVEF3TURBdwpNRm9YRFRJMk1EWXlNakF3TURBd01Gb3dJREVlTUJ3R0ExVUVBd3dWWW1Ga0xtVXlaUzVsYm5ZdVpHRjBkVzB1CmJtVjBNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVKWEYyV0U1UmZzN2lJbkpmVWRWZHNRa0IKWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbHA5L2tMNUYweEkvSWpYeXpFWlF1YU9vMVNwWCtYd0pWS3FOMgpNSFF3SUFZRFZSMFJCQmt3RjRJVlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQXdHQTFVZEV3RUIvd1FDCk1BQXdEZ1lEVlIwUEFRSC9CQVFEQWdXZ01CTUdBMVVkSlFRTU1Bb0dDQ3NHQVFVRkJ3TUJNQjBHQTFVZERnUVcKQkJTaTRpOTB2VkEwdmlQU0k2aW9WdFFEdXFhOE5qQUtCZ2dxaGtqT1BRUURBZ05IQURCRUFpQTJmalMvbGVmNQpTRHFEMEt5UWJRV1hENlltMHg4NDJnU2VPMS9XZ041MHpnSWdYVHd4MWZwSTNVc0U1UXI4bkNXWk0wd044b0cyCkU4a0ttZUo1a3BDL3lFST0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo= + tls.key: LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0tCk1IY0NBUUVFSUM3Ri9qbkVFSTBYQk12emg4K0lCaVVycG5MMnVGeEpFSUlMQWkyd1Q4eXlvQW9HQ0NxR1NNNDkKQXdFSG9VUURRZ0FFSlhGMldFNVJmczdpSW5KZlVkVmRzUWtCWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbApwOS9rTDVGMHhJL0lqWHl6RVpRdWFPbzFTcFgrWHdKVktnPT0KLS0tLS1FTkQgRUMgUFJJVkFURSBLRVktLS0tLQo= diff --git a/test/e2e-edge/_fixtures/certs/gen-certs.sh b/test/e2e-edge/_fixtures/certs/gen-certs.sh new file mode 100755 index 00000000..8d5a5d9b --- /dev/null +++ b/test/e2e-edge/_fixtures/certs/gen-certs.sh @@ -0,0 +1,84 @@ +#!/usr/bin/env bash +# Regenerate the pre-minted EXPIRED TLS certificate the invalid-certificate test +# relies on. +# +# The test needs a certificate that is already expired at test time so the +# extension server drops the listener that references it while sibling listeners +# keep serving. cert-manager cannot help here: it renews before expiry, so it +# cannot hold a certificate in the expired state. 
We instead mint a self-signed +# certificate whose validity window is in the past. +# +# This is a listener certificate for the customer hostname under test, entirely +# separate from the certificate authority securing the extension server's own +# connection to the gateway; do not wire it to that authority. +# +# Output: expired-cert-secret.yaml — a Secret with the expired certificate and +# key inline. The committed Secret is what the test consumes, so a live run needs +# no openssl; re-run this script only to refresh it. +# +# Usage: ./gen-certs.sh [HOSTNAME] [SECRET_NAME] [SECRET_NAMESPACE] +set -euo pipefail + +HOST="${1:-bad.e2e.env.datum.net}" +SECRET_NAME="${2:-expired-leaf-tls}" +SECRET_NS="${3:-default}" + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +WORK="$(mktemp -d)" +trap 'rm -rf "${WORK}"' EXIT + +# ECDSA P-256 key. +openssl ecparam -name prime256v1 -genkey -noout -out "${WORK}/tls.key" + +# A self-signed certificate whose validity window is entirely in the past. We set +# the exact start and end times with OpenSSL 3.x's -not_before/-not_after so the +# certificate is already expired by the time the test runs. +NOT_BEFORE="20260621000000Z" +NOT_AFTER="20260622000000Z" + +cat > "${WORK}/leaf.cnf" < "${SCRIPT_DIR}/expired-cert-secret.yaml" <200 liveness swap, but not real tunnel +// establishment or NAT traversal. +// +// When a connector is online, the extension server points its backend at a path +// that issues an HTTP CONNECT toward the connector's target, which in the test +// is this pod's Service. On a CONNECT, this proxy dials the configured upstream +// (the echo backend) and blindly splices bytes both ways — a minimal forward +// proxy. +// +// Liveness is driven entirely by the connector's annotation on the control plane +// (see _steps/flip-connector-liveness.yaml); this proxy is always up. 
It exists +// so an online request can only succeed via the tunnel, never via a direct +// fallback route — which is what makes the 503->200 transition meaningful. +package main + +import ( + "io" + "log" + "net" + "net/http" + "os" + "time" +) + +func main() { + listen := envOr("LISTEN_ADDR", ":8080") + // Where CONNECT requests are forwarded. Default to the echo backend Service. + // The proxy ignores the CONNECT target host and always dials this upstream, + // so the test controls the destination via UPSTREAM_ADDR rather than the + // host the proxy sends. + upstream := envOr("UPSTREAM_ADDR", "echo-backend.default.svc.cluster.local:8080") + + srv := &http.Server{ + Addr: listen, + ReadTimeout: 0, // tunnels are long-lived; no read deadline on the hijacked conn + Handler: &proxy{upstream: upstream}, + } + log.Printf("connect-proxy listening on %s, forwarding CONNECT -> %s", listen, upstream) + if err := srv.ListenAndServe(); err != nil { + log.Fatalf("server exited: %v", err) + } +} + +type proxy struct { + upstream string +} + +func (p *proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) { + if r.Method != http.MethodConnect { + // A plain GET is handy as a liveness/readiness probe. + w.WriteHeader(http.StatusOK) + _, _ = io.WriteString(w, "connect-proxy ready\n") + return + } + + dst, err := net.DialTimeout("tcp", p.upstream, 10*time.Second) + if err != nil { + log.Printf("CONNECT %s: dial upstream %s failed: %v", r.Host, p.upstream, err) + http.Error(w, "upstream unavailable", http.StatusBadGateway) + return + } + defer dst.Close() + + hj, ok := w.(http.Hijacker) + if !ok { + http.Error(w, "hijacking unsupported", http.StatusInternalServerError) + return + } + client, _, err := hj.Hijack() + if err != nil { + log.Printf("CONNECT %s: hijack failed: %v", r.Host, err) + return + } + defer client.Close() + + // Tell the client the tunnel is established, then splice bytes both ways. 
+ if _, err := client.Write([]byte("HTTP/1.1 200 Connection Established\r\n\r\n")); err != nil { + log.Printf("CONNECT %s: write 200 failed: %v", r.Host, err) + return + } + + done := make(chan struct{}, 2) + go func() { _, _ = io.Copy(dst, client); done <- struct{}{} }() + go func() { _, _ = io.Copy(client, dst); done <- struct{}{} }() + <-done +} + +func envOr(key, def string) string { + if v := os.Getenv(key); v != "" { + return v + } + return def +} diff --git a/test/e2e-edge/_fixtures/connector-tunnel/manifest.yaml b/test/e2e-edge/_fixtures/connector-tunnel/manifest.yaml new file mode 100644 index 00000000..d6595904 --- /dev/null +++ b/test/e2e-edge/_fixtures/connector-tunnel/manifest.yaml @@ -0,0 +1,66 @@ +# Connector tunnel stand-in deployment. +# +# The image is built from this directory's Dockerfile and loaded into the +# downstream cluster, e.g.: +# docker build -t connect-proxy:e2e test/e2e-edge/_fixtures/connector-tunnel +# kind load docker-image connect-proxy:e2e --name +# +# When a connector is online, the extension server points it at the connector's +# target. Point that target at this Service so an online request can only succeed +# through the tunnel — there is no direct fallback route to the echo backend. +apiVersion: apps/v1 +kind: Deployment +metadata: + name: connect-proxy + namespace: default + labels: + purpose: connect-proxy +spec: + replicas: 1 + selector: + matchLabels: + purpose: connect-proxy + template: + metadata: + labels: + purpose: connect-proxy + spec: + containers: + - name: connect-proxy + image: connect-proxy:e2e + imagePullPolicy: IfNotPresent + env: + - name: LISTEN_ADDR + value: ":8080" + # Forward CONNECT to the echo backend Service. 
+ - name: UPSTREAM_ADDR + value: "echo-backend.default.svc.cluster.local:8080" + ports: + - containerPort: 8080 + readinessProbe: + httpGet: + path: / + port: 8080 + initialDelaySeconds: 1 + periodSeconds: 2 + resources: + requests: + cpu: 50m + memory: 32Mi + limits: + cpu: 100m + memory: 64Mi + terminationGracePeriodSeconds: 0 +--- +apiVersion: v1 +kind: Service +metadata: + name: connect-proxy + namespace: default +spec: + ports: + - name: connect + port: 8080 + targetPort: 8080 + selector: + purpose: connect-proxy diff --git a/test/e2e-edge/_fixtures/echo-backend.yaml b/test/e2e-edge/_fixtures/echo-backend.yaml new file mode 100644 index 00000000..8806317d --- /dev/null +++ b/test/e2e-edge/_fixtures/echo-backend.yaml @@ -0,0 +1,46 @@ +# Echo / HTTP backend fixture. +# +# go-httpbin gives deterministic endpoints the suite leans on: +# /status/ -> returns that status (drive 200 and 5xx) +# /get -> echoes the request (drive WAF benign/malicious probes) +# Plain HTTP only: the edge proxy terminates TLS per listener and forwards +# plaintext to this backend. +# +# Applied rather than freshly created so scenarios that share the downstream +# cluster coexist regardless of run order. 
+apiVersion: v1 +kind: Pod +metadata: + name: echo-backend + namespace: default + labels: + purpose: echo-backend +spec: + containers: + - name: backend + image: ghcr.io/mccutchen/go-httpbin:2.18.1 + command: ["/bin/go-httpbin"] + args: ["-host", "0.0.0.0", "-port", "8080"] + ports: + - containerPort: 8080 + resources: + requests: + cpu: 50m + memory: 64Mi + limits: + cpu: 100m + memory: 128Mi + terminationGracePeriodSeconds: 0 +--- +apiVersion: v1 +kind: Service +metadata: + name: echo-backend + namespace: default +spec: + ports: + - name: http + port: 8080 + targetPort: 8080 + selector: + purpose: echo-backend diff --git a/test/e2e-edge/_fixtures/waf-corpus.txt b/test/e2e-edge/_fixtures/waf-corpus.txt new file mode 100644 index 00000000..c9621630 --- /dev/null +++ b/test/e2e-edge/_fixtures/waf-corpus.txt @@ -0,0 +1,30 @@ +# WAF request corpus. +# +# Format (tab- or space-separated; lines starting with # are comments): +#