datum-cloud · scotwells · Jun 25, 2026
diff --git a/docs/testing/README.md b/docs/testing/README.md
@@ -0,0 +1,87 @@
+# Testing the Datum edge
+
+This is the map for how we test the network-services-operator (NSO) — what we
+prove, and why it's built the way it is. It's written for anyone who wants to
+understand the safety net, not just the people who maintain it.
+
+## Why this exists
+
+NSO programs the edge: when a customer creates a Gateway, a route, a web-app
+firewall policy, or a connector, NSO turns that intent into live configuration
+on the Envoy proxies that actually serve customer traffic. The hard part isn't
+producing the configuration — it's making sure the configuration that lands on
+the running proxy *does what the customer asked*.
+
+Almost every production incident in this system has shared one shape:
+
+> **The configuration was logically correct, the platform reported success, but
+> the running proxy behaved differently than intended** — a firewall rule that
+> protected nothing, an offline backend that still returned a blank error, a
+> branded error page that never appeared, a single bad certificate that froze a
+> whole shared listener.
+
+These failures are invisible to ordinary tests. The Kubernetes resources report
+"Programmed = True," unit tests pass, and the gap only shows up when real
+traffic — often a real *attack* — arrives at the edge. So our testing is built
+around one principle:
+
+**Prove behavior at the edge with real traffic, against an environment that
+looks like production — not against the platform's own report of success.**
+
+## The two ideas everything rests on
+
+**1. Production fidelity.** The recent incidents lived precisely on the axes
+where our old test environment differed from production: the proxy version, the
+"fail closed" availability coupling, the firewall data plane, and multi-cluster
+replication. So the test environment stands up the *real* shape of the edge —
+the same proxy version, the same extension server that rewrites configuration,
+the same firewall image, and the same federation mechanism that fans
+configuration out to edge clusters. If a test passes here, it passes against
+something production actually runs. This environment is brought online with a
+single command (`task test-infra:up`); see
+[`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml).
+
+**2. Traffic-first, with a tie-breaker.** Every test's verdict is the
+*observed behavior* of real traffic through the real proxy — a blocked attack,
+a 503 for an offline backend, the branded page in the response body. We never
+let "the platform says it worked" stand in for "the edge did the right thing."
+
+Real traffic alone has one blind spot, though: a firewall that protects nothing
+and a firewall that's simply not being attacked produce the *same* successful
+response. So traffic is backed by a **parity check** — a comparison of what the
+edge was told to do against what the running proxy is actually doing — which
+turns a surprising result into a diagnosis instead of a guess. Traffic is always
+the verdict; parity is the tie-breaker that catches the silently-inert case.
+See [`test/parity/README.md`](../../test/parity/README.md).
+
+## What we guarantee
+
+The end-to-end suite ([`test/e2e-edge/README.md`](../../test/e2e-edge/README.md)) turns
+each past incident into a standing guarantee, checked against real traffic:
+
+- **A web-app firewall actually blocks attacks** — a malicious request is
+  refused while a legitimate one still succeeds.
+- **An offline backend fails cleanly** — the customer path returns a real 503,
+  not a hang or a blank response.
+- **Branded error pages reach the customer** — the styled page appears in the
+  actual response, not just in configuration.
+- **One bad certificate can't take down its neighbors** — an invalid listener
+  is isolated while the healthy listeners on the same proxy keep serving.
+
+And the federation layer ([`config/federation/README.md`](../../config/federation/README.md))
+proves that configuration created in the control plane genuinely arrives at the
+edge clusters that serve traffic.
+
+## Where things live
+
+| Area | What it covers |
+|---|---|
+| [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) | Brings the production-fidelity edge online and runs the suites |
+| [`test/e2e-edge/README.md`](../../test/e2e-edge/README.md) | The real-traffic guarantees, scenario by scenario |
+| [`test/parity/README.md`](../../test/parity/README.md) | The parity check that catches silently-inert configuration |
+| [`config/federation/README.md`](../../config/federation/README.md) | Fanning configuration out to edge clusters |
+
+> The design rationale that led here — the original audit of test-vs-production
+> gaps and the proposals for closing them — lives in the pull requests that
+> introduced this work, not in the repository, because it describes a plan
+> rather than the system as it stands today.
diff --git a/test/e2e-edge/README.md b/test/e2e-edge/README.md
@@ -0,0 +1,87 @@
+# End-to-end edge guarantees
+
+These tests prove that the Datum edge *behaves* the way customers expect, by
+sending real traffic through the real proxy and checking what actually comes
+back. They are the standing guarantees described in
+[`docs/testing/README.md`](../../docs/testing/README.md).
+
+## How a guarantee is checked
+
+Every scenario follows the same three-part check, in priority order:
+
+1. **The traffic verdict (always decisive).** The test makes a real request and
+   asserts on the real response — a blocked attack, a 503, a branded page, a
+   200 from a healthy listener. If the edge behaves wrongly, this fails. A test
+   is never satisfied by the platform merely *reporting* success.
+2. **The configuration is genuinely present.** A
+   [parity check](../parity/README.md) confirms the running proxy actually
+   carries the configuration it was told to — closing the blind spot where a
+   successful-looking response hides a rule that protects nothing.
+3. **It's the right configuration serving the request.** A build marker
+   confirms the response came from the configuration under test, not from stale
+   config left over from a previous state — so a pass can't be a timing fluke.
+
+The first is the point; the second and third exist so a surprising result
+becomes a diagnosis rather than a mystery.
+
+## The scenarios
+
+Each one corresponds to a past production incident, now held in place.
+
+### Web-app firewall enforcement
+A malicious request (matching the firewall's attack rules) must be refused,
+while a legitimate request to the same endpoint still succeeds. The test also
+flips the policy into observe-only mode and confirms the same attack is then
+*allowed* — proving the block is genuinely driven by the customer's firewall
+policy and not by some unrelated default. This guards the customer's actual
+protection and guards against the risk that a single bad firewall rule wedges
+the whole listener.
+
+### Offline backend returns a clean 503
+When a backend connector is offline, the customer-facing path must return a
+real 503; it must not hang or serve a blank response. This is checked as an
+observed response on the user path, because that's what a customer experiences.
+
+### Branded error page reaches the customer
+When the edge serves an error, the customer must receive Datum's styled page,
+confirmed by finding the page's content in the actual response body — not by
+trusting that the configuration was applied. Production once needed manual
+restarts to make this take effect, exactly the kind of inert-configuration gap
+this scenario now catches.
+
+### One bad certificate can't break its neighbors
+A single invalid certificate must not take down the other, healthy listeners
+sharing the same proxy. The test introduces a genuinely bad certificate and
+confirms its listener is isolated while sibling listeners keep serving real
+traffic — the "one bad resource freezes everything" failure mode, contained.
+
+## Running them
+
+The scenarios run against the production-fidelity environment:
+
+```
+task test-infra:up        # bring the edge online (proxy + extension server + firewall)
+task test-infra:smoke     # quick confidence check: real traffic serves
+task test-infra:e2e       # the full guarantee suite above
+```
+
+See [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) for the
+environment these assume.
+
+## Layout
+
+- Scenario folders (`waf-enforcement/`, `connector-offline-503/`,
+  `branded-error-page/`, `atomic-reject-isolation/`) — one guarantee each.
+- `_steps/` — shared, reusable checks (send-a-request, confirm-configuration,
+  capture-the-build-marker) so every scenario asserts behavior the same way.
+- `_fixtures/` — the supporting pieces a scenario needs (a sample backend, an
+  attack corpus, a pre-made bad certificate, the offline-backend stand-in).
+
+## A note on honesty
+
+Where the edge genuinely cannot yet do something, the suite says so rather than
+papering over it. The "offline backend recovers" path is deliberately held back
+because the edge today does not reliably re-apply configuration when a backend
+comes *back* online — and the test proves that gap exists instead of pretending
+it's closed. A guarantee we can't keep is documented as a gap, not asserted as
+a pass.
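The firewall scenario's traffic verdict can be sketched in Go as three observed responses: the attack is refused, the legitimate request succeeds, and the same attack is allowed once the policy is observe-only. This is a minimal illustration and not the suite's actual step definitions: `fakeEdge`, the `attack` query marker, and the choice of 403 as the "refused" status are all assumptions made so the example runs on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// fakeEdge is a hypothetical stand-in for the proxy under test. It refuses any
// request whose query string contains "attack" unless observe-only mode is on.
type fakeEdge struct {
	observeOnly atomic.Bool
}

func (e *fakeEdge) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if strings.Contains(r.URL.RawQuery, "attack") && !e.observeOnly.Load() {
		http.Error(w, "blocked", http.StatusForbidden) // 403 is an assumption
		return
	}
	fmt.Fprintln(w, "ok")
}

// expect sends one real HTTP request and fails unless the observed status
// equals want. The verdict is the response itself, never a reported condition.
func expect(url string, want int) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != want {
		return fmt.Errorf("GET %s: got status %d, want %d", url, resp.StatusCode, want)
	}
	return nil
}

// runWAFScenario walks the three traffic assertions of the firewall scenario.
func runWAFScenario() error {
	edge := &fakeEdge{}
	srv := httptest.NewServer(edge)
	defer srv.Close()

	attack, legit := srv.URL+"/?q=attack", srv.URL+"/?q=hello"
	if err := expect(attack, http.StatusForbidden); err != nil {
		return err // enforcing: the attack must be refused
	}
	if err := expect(legit, http.StatusOK); err != nil {
		return err // enforcing: legitimate traffic must still succeed
	}
	edge.observeOnly.Store(true)
	// Observe-only: the same attack must now pass, which shows the earlier
	// block came from the policy and not from an unrelated default.
	return expect(attack, http.StatusOK)
}

func main() {
	if err := runWAFScenario(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("PASS")
}
```

The observe-only step is what separates "the policy blocks attacks" from "something blocks attacks"; without it, the first assertion would also pass against an edge that rejects the request for unrelated reasons.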
diff --git a/test/e2e-edge/_fixtures/certs/expired-cert-secret.yaml b/test/e2e-edge/_fixtures/certs/expired-cert-secret.yaml
@@ -0,0 +1,14 @@
+# GENERATED by gen-certs.sh — do not edit by hand; re-run the script to refresh.
+#
+# A pre-minted EXPIRED certificate for bad.e2e.env.datum.net. Its validity window
+# (20260621000000Z .. 20260622000000Z) is in the past, so the extension server
+# drops the listener that references it while sibling listeners keep serving.
+apiVersion: v1
+kind: Secret
+metadata:
+  name: expired-leaf-tls
+  namespace: default
+type: kubernetes.io/tls
+data:
+  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJ0ekNDQVY2Z0F3SUJBZ0lVTW5lMFBkKzczaDFPcFh0YWo1eDlIbzRxR2Fjd0NnWUlLb1pJemowRUF3SXcKSURFZU1Cd0dBMVVFQXd3VlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQjRYRFRJMk1EWXlNVEF3TURBdwpNRm9YRFRJMk1EWXlNakF3TURBd01Gb3dJREVlTUJ3R0ExVUVBd3dWWW1Ga0xtVXlaUzVsYm5ZdVpHRjBkVzB1CmJtVjBNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVKWEYyV0U1UmZzN2lJbkpmVWRWZHNRa0IKWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbHA5L2tMNUYweEkvSWpYeXpFWlF1YU9vMVNwWCtYd0pWS3FOMgpNSFF3SUFZRFZSMFJCQmt3RjRJVlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQXdHQTFVZEV3RUIvd1FDCk1BQXdEZ1lEVlIwUEFRSC9CQVFEQWdXZ01CTUdBMVVkSlFRTU1Bb0dDQ3NHQVFVRkJ3TUJNQjBHQTFVZERnUVcKQkJTaTRpOTB2VkEwdmlQU0k2aW9WdFFEdXFhOE5qQUtCZ2dxaGtqT1BRUURBZ05IQURCRUFpQTJmalMvbGVmNQpTRHFEMEt5UWJRV1hENlltMHg4NDJnU2VPMS9XZ041MHpnSWdYVHd4MWZwSTNVc0U1UXI4bkNXWk0wd044b0cyCkU4a0ttZUo1a3BDL3lFST0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
+  tls.key: LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0tCk1IY0NBUUVFSUM3Ri9qbkVFSTBYQk12emg4K0lCaVVycG5MMnVGeEpFSUlMQWkyd1Q4eXlvQW9HQ0NxR1NNNDkKQXdFSG9VUURRZ0FFSlhGMldFNVJmczdpSW5KZlVkVmRzUWtCWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbApwOS9rTDVGMHhJL0lqWHl6RVpRdWFPbzFTcFgrWHdKVktnPT0KLS0tLS1FTkQgRUMgUFJJVkFURSBLRVktLS0tLQo=
diff --git a/test/e2e-edge/_fixtures/certs/gen-certs.sh b/test/e2e-edge/_fixtures/certs/gen-certs.sh
@@ -0,0 +1,84 @@
+#!/usr/bin/env bash
+# Regenerate the pre-minted EXPIRED TLS certificate the invalid-certificate test
+# relies on.
+#
+# The test needs a certificate that is already expired at test time so the
+# extension server drops the listener that references it while sibling listeners
+# keep serving. cert-manager cannot help here: it renews before expiry, so it
+# cannot hold a certificate in the expired state. We instead mint a self-signed
+# certificate whose validity window is in the past.
+#
+# This is a listener certificate for the customer hostname under test, entirely
+# separate from the certificate authority securing the extension server's own
+# connection to the gateway; do not wire it to that authority.
+#
+# Output: expired-cert-secret.yaml — a Secret with the expired certificate and
+# key inline. The committed Secret is what the test consumes, so a live run needs
+# no openssl; re-run this script only to refresh it.
+#
+# Usage: ./gen-certs.sh [HOSTNAME] [SECRET_NAME] [SECRET_NAMESPACE]
+set -euo pipefail
+
+HOST="${1:-bad.e2e.env.datum.net}"
+SECRET_NAME="${2:-expired-leaf-tls}"
+SECRET_NS="${3:-default}"
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+WORK="$(mktemp -d)"
+trap 'rm -rf "${WORK}"' EXIT
+
+# ECDSA P-256 key.
+openssl ecparam -name prime256v1 -genkey -noout -out "${WORK}/tls.key"
+
+# A self-signed certificate whose validity window is entirely in the past. We set
+# the exact start and end times with -not_before/-not_after (OpenSSL 3.4 or
+# newer) so the certificate is already expired by the time the test runs.
+NOT_BEFORE="20260621000000Z"
+NOT_AFTER="20260622000000Z"
+
+cat > "${WORK}/leaf.cnf" <<EOF
+[req]
+distinguished_name = dn
+prompt = no
+x509_extensions = v3
+[dn]
+CN = ${HOST}
+[v3]
+subjectAltName = DNS:${HOST}
+basicConstraints = critical, CA:FALSE
+keyUsage = critical, digitalSignature, keyEncipherment
+extendedKeyUsage = serverAuth
+EOF
+
+openssl req -new -x509 \
+  -key "${WORK}/tls.key" \
+  -out "${WORK}/tls.crt" \
+  -config "${WORK}/leaf.cnf" \
+  -not_before "${NOT_BEFORE}" \
+  -not_after "${NOT_AFTER}" \
+  -sha256
+
+echo "minted leaf for ${HOST}:"
+openssl x509 -in "${WORK}/tls.crt" -noout -subject -dates
+
+CRT_B64="$(base64 < "${WORK}/tls.crt" | tr -d '\n')"
+KEY_B64="$(base64 < "${WORK}/tls.key" | tr -d '\n')"
+
+cat > "${SCRIPT_DIR}/expired-cert-secret.yaml" <<EOF
+# GENERATED by gen-certs.sh — do not edit by hand; re-run the script to refresh.
+#
+# A pre-minted EXPIRED certificate for ${HOST}. Its validity window
+# (${NOT_BEFORE} .. ${NOT_AFTER}) is in the past, so the extension server drops
+# the listener that references it while sibling listeners keep serving.
+apiVersion: v1
+kind: Secret
+metadata:
+  name: ${SECRET_NAME}
+  namespace: ${SECRET_NS}
+type: kubernetes.io/tls
+data:
+  tls.crt: ${CRT_B64}
+  tls.key: ${KEY_B64}
+EOF
+
+echo "wrote ${SCRIPT_DIR}/expired-cert-secret.yaml"
diff --git a/test/e2e-edge/_fixtures/connector-tunnel/Dockerfile b/test/e2e-edge/_fixtures/connector-tunnel/Dockerfile
@@ -0,0 +1,13 @@
+# CONNECT-proxy stand-in for the connector tunnel.
+# Single static binary, no dependencies beyond the standard library.
+FROM golang:1.23-alpine AS build
+WORKDIR /src
+COPY go.mod .
+COPY main.go .
+# Standard library only; go.mod just gives the build a module context.
+RUN CGO_ENABLED=0 go build -o /connect-proxy .
+
+FROM gcr.io/distroless/static:nonroot
+COPY --from=build /connect-proxy /connect-proxy
+USER nonroot:nonroot
+ENTRYPOINT ["/connect-proxy"]
diff --git a/test/e2e-edge/_fixtures/connector-tunnel/go.mod b/test/e2e-edge/_fixtures/connector-tunnel/go.mod
@@ -0,0 +1,3 @@
+module connect-proxy
+
+go 1.23
diff --git a/test/e2e-edge/_fixtures/connector-tunnel/main.go b/test/e2e-edge/_fixtures/connector-tunnel/main.go
@@ -0,0 +1,94 @@
+// connect-proxy is a stand-in for the real connector tunnel. It exercises the
+// proxy-side CONNECT wiring and the 503<->200 liveness swap, but not real tunnel
+// establishment or NAT traversal.
+//
+// When a connector is online, the extension server points its backend at a path
+// that issues an HTTP CONNECT toward the connector's target, which in the test
+// is this pod's Service. On a CONNECT, this proxy dials the configured upstream
+// (the echo backend) and blindly splices bytes both ways — a minimal forward
+// proxy.
+//
+// Liveness is driven entirely by the connector's annotation on the control plane
+// (see _steps/flip-connector-liveness.yaml); this proxy is always up. It exists
+// so an online request can only succeed via the tunnel, never via a direct
+// fallback route — which is what makes the 503->200 transition meaningful.
+package main
+
+import (
+	"io"
+	"log"
+	"net"
+	"net/http"
+	"os"
+	"time"
+)
+
+func main() {
+	listen := envOr("LISTEN_ADDR", ":8080")
+	// Where CONNECT requests are forwarded. Defaults to the echo backend Service.
+	// This stand-in ignores the CONNECT target host and always dials this
+	// upstream, so the test controls the destination via UPSTREAM_ADDR rather
+	// than via the host named in the CONNECT request.
+	upstream := envOr("UPSTREAM_ADDR", "echo-backend.default.svc.cluster.local:8080")
+
+	srv := &http.Server{
+		Addr:        listen,
+		ReadTimeout: 0, // tunnels are long-lived; no read deadline on the hijacked conn
+		Handler:     &proxy{upstream: upstream},
+	}
+	log.Printf("connect-proxy listening on %s, forwarding CONNECT -> %s", listen, upstream)
+	if err := srv.ListenAndServe(); err != nil {
+		log.Fatalf("server exited: %v", err)
+	}
+}
+
+type proxy struct {
+	upstream string
+}
+
+func (p *proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
+	if r.Method != http.MethodConnect {
+		// A plain GET is handy as a liveness/readiness probe.
+		w.WriteHeader(http.StatusOK)
+		_, _ = io.WriteString(w, "connect-proxy ready\n")
+		return
+	}
+
+	dst, err := net.DialTimeout("tcp", p.upstream, 10*time.Second)
+	if err != nil {
+		log.Printf("CONNECT %s: dial upstream %s failed: %v", r.Host, p.upstream, err)
+		http.Error(w, "upstream unavailable", http.StatusBadGateway)
+		return
+	}
+	defer dst.Close()
+
+	hj, ok := w.(http.Hijacker)
+	if !ok {
+		http.Error(w, "hijacking unsupported", http.StatusInternalServerError)
+		return
+	}
+	client, buf, err := hj.Hijack()
+	if err != nil {
+		log.Printf("CONNECT %s: hijack failed: %v", r.Host, err)
+		return
+	}
+	defer client.Close()
+	// Tell the client the tunnel is established, then splice bytes both ways. Read
+	// the client side through buf so bytes the server already buffered are kept.
+	if _, err := client.Write([]byte("HTTP/1.1 200 Connection Established\r\n\r\n")); err != nil {
+		log.Printf("CONNECT %s: write 200 failed: %v", r.Host, err)
+		return
+	}
+
+	done := make(chan struct{}, 2)
+	go func() { _, _ = io.Copy(dst, buf.Reader); done <- struct{}{} }()
+	go func() { _, _ = io.Copy(client, dst); done <- struct{}{} }()
+	<-done
+}
+
+func envOr(key, def string) string {
+	if v := os.Getenv(key); v != "" {
+		return v
+	}
+	return def
+}
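To exercise a CONNECT stand-in like the one above by hand, a client can speak the handshake directly: send `CONNECT`, read the `200`, then treat the connection as a raw byte pipe. The sketch below is illustrative and not part of this change. `connectVia` and `pingThroughTunnel` are hypothetical helpers, and `startEchoTunnel` is a hypothetical in-process peer that answers one CONNECT and echoes bytes back, included only so the example runs without the real fixture.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// connectVia dials proxyAddr, asks it to CONNECT to target, and returns the raw
// connection plus the buffered reader that must be used for all further reads.
func connectVia(proxyAddr, target string) (net.Conn, *bufio.Reader, error) {
	conn, err := net.DialTimeout("tcp", proxyAddr, 5*time.Second)
	if err != nil {
		return nil, nil, err
	}
	if _, err := fmt.Fprintf(conn, "CONNECT %s HTTP/1.1\r\nHost: %s\r\n\r\n", target, target); err != nil {
		conn.Close()
		return nil, nil, err
	}
	br := bufio.NewReader(conn)
	// ReadResponse consumes only the status line and headers here.
	resp, err := http.ReadResponse(br, &http.Request{Method: http.MethodConnect})
	if err != nil {
		conn.Close()
		return nil, nil, err
	}
	if resp.StatusCode != http.StatusOK {
		conn.Close()
		return nil, nil, fmt.Errorf("CONNECT %s: unexpected status %s", target, resp.Status)
	}
	return conn, br, nil
}

// startEchoTunnel is a hypothetical peer used only to make this sketch
// runnable: it accepts one CONNECT, answers 200, then echoes tunnel bytes.
func startEchoTunnel() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		br := bufio.NewReader(c)
		req, err := http.ReadRequest(br)
		if err != nil || req.Method != http.MethodConnect {
			return
		}
		io.WriteString(c, "HTTP/1.1 200 Connection Established\r\n\r\n")
		io.Copy(c, br) // echo everything sent through the tunnel
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

// pingThroughTunnel sends one line through the tunnel and returns the reply.
func pingThroughTunnel(proxyAddr, target string) (string, error) {
	conn, br, err := connectVia(proxyAddr, target)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := io.WriteString(conn, "ping\n"); err != nil {
		return "", err
	}
	conn.SetReadDeadline(time.Now().Add(5 * time.Second))
	return br.ReadString('\n')
}

func main() {
	addr, stop, err := startEchoTunnel()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer stop()
	reply, err := pingThroughTunnel(addr, "echo-backend.default.svc.cluster.local:8080")
	if err != nil {
		fmt.Println("tunnel failed:", err)
		return
	}
	fmt.Printf("reply: %q\n", reply)
}
```

Reads after the handshake go through the same `bufio.Reader` that parsed the response, because it may already hold tunnel bytes that arrived together with the `200`.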