87 changes: 87 additions & 0 deletions docs/testing/README.md
@@ -0,0 +1,87 @@
# Testing the Datum edge

This is the map for how we test the network-services-operator (NSO) — what we
prove, and why it's built the way it is. It's written for anyone who wants to
understand the safety net, not just the people who maintain it.

## Why this exists

NSO programs the edge: when a customer creates a Gateway, a route, a web-app
firewall policy, or a connector, NSO turns that intent into live configuration
on the Envoy proxies that actually serve customer traffic. The hard part isn't
producing the configuration — it's making sure the configuration that lands on
the running proxy *does what the customer asked*.

Almost every production incident in this system has shared one shape:

> **The configuration was logically correct, the platform reported success, but
> the running proxy behaved differently than intended** — a firewall rule that
> protected nothing, an offline backend that still returned a blank error, a
> branded error page that never appeared, a single bad certificate that froze a
> whole shared listener.

These failures are invisible to ordinary tests. The Kubernetes resources report
"Programmed = True," unit tests pass, and the gap only shows up when real
traffic — often a real *attack* — arrives at the edge. So our testing is built
around one principle:

**Prove behavior at the edge with real traffic, against an environment that
looks like production — not against the platform's own report of success.**

## The two ideas everything rests on

**1. Production fidelity.** The recent incidents lived precisely on the axes
where our old test environment differed from production: the proxy version, the
"fail closed" availability coupling, the firewall data plane, and multi-cluster
replication. So the test environment stands up the *real* shape of the edge —
the same proxy version, the same extension server that rewrites configuration,
the same firewall image, and the same federation mechanism that fans
configuration out to edge clusters. If a test passes here, it passes against
something production actually runs. This environment is brought online with a
single command (`task test-infra:up`); see
[`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml).

**2. Traffic-first, with a tie-breaker.** Every test's verdict is the
*observed behavior* of real traffic through the real proxy — a blocked attack,
a 503 for an offline backend, the branded page in the response body. We never
let "the platform says it worked" stand in for "the edge did the right thing."

Real traffic alone has one blind spot, though: a firewall that protects nothing
and a firewall that's simply not being attacked produce the *same* successful
response. So traffic is backed by a **parity check** — a comparison of what the
edge was told to do against what the running proxy is actually doing — which
turns a surprising result into a diagnosis instead of a guess. Traffic is always
the verdict; parity is the tie-breaker that catches the silently-inert case.
See [`test/parity/README.md`](../../test/parity/README.md).
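
A minimal sketch of the parity idea, assuming the intended and observed configuration can each be reduced to a list of item names. The function `parityGaps` and the item names are hypothetical and are not taken from `test/parity`; a real check would read the running proxy's own view of its configuration.

```go
package main

import (
	"fmt"
	"sort"
)

// parityGaps returns the intended items that the running proxy does not
// report. Hypothetical helper: the names are plain strings chosen for
// illustration only.
func parityGaps(intended, observed []string) []string {
	seen := make(map[string]bool, len(observed))
	for _, name := range observed {
		seen[name] = true
	}
	var gaps []string
	for _, name := range intended {
		if !seen[name] {
			gaps = append(gaps, name)
		}
	}
	sort.Strings(gaps)
	return gaps
}

func main() {
	intended := []string{"listener/https", "waf/filter", "route/app"}
	observed := []string{"listener/https", "route/app"}
	// The firewall filter was intended but never landed on the proxy.
	fmt.Println(parityGaps(intended, observed)) // prints [waf/filter]
}
```

A non-empty result is what distinguishes "the firewall was not attacked" from "the firewall is not there".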

## What we guarantee

The end-to-end suite ([`test/e2e-edge/README.md`](../../test/e2e-edge/README.md)) turns
each past incident into a standing guarantee, checked against real traffic:

- **A web-app firewall actually blocks attacks** — a malicious request is
refused while a legitimate one still succeeds.
- **An offline backend fails cleanly** — the customer path returns a real 503,
not a hang or a blank.
- **Branded error pages reach the customer** — the styled page appears in the
actual response, not just in configuration.
- **One bad certificate can't take down its neighbors** — an invalid listener
is isolated while the healthy listeners on the same proxy keep serving.

And the federation layer ([`config/federation/README.md`](../../config/federation/README.md))
proves that configuration created in the control plane genuinely arrives at the
edge clusters that serve traffic.

## Where things live

| Area | What it covers |
|---|---|
| [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) | Brings the production-fidelity edge online and runs the suites |
| [`test/e2e-edge/README.md`](../../test/e2e-edge/README.md) | The real-traffic guarantees, scenario by scenario |
| [`test/parity/README.md`](../../test/parity/README.md) | The parity check that catches silently-inert configuration |
| [`config/federation/README.md`](../../config/federation/README.md) | Fanning configuration out to edge clusters |

> The design rationale that led here — the original audit of test-vs-production
> gaps and the proposals for closing them — lives in the pull requests that
> introduced this work, not in the repository, because it describes a plan
> rather than the system as it stands today.
87 changes: 87 additions & 0 deletions test/e2e-edge/README.md
@@ -0,0 +1,87 @@
# End-to-end edge guarantees

These tests prove that the Datum edge *behaves* the way customers expect, by
sending real traffic through the real proxy and checking what actually comes
back. They are the standing guarantees described in
[`docs/testing/README.md`](../../docs/testing/README.md).

## How a guarantee is checked

Every scenario follows the same three-part check, in priority order:

1. **The traffic verdict (always decisive).** The test makes a real request and
asserts on the real response — a blocked attack, a 503, a branded page, a
200 from a healthy listener. If the edge behaves wrongly, this fails. A test
is never satisfied by the platform merely *reporting* success.
2. **The configuration is genuinely present.** A
[parity check](../parity/README.md) confirms the running proxy actually
carries the configuration it was told to — closing the blind spot where a
successful-looking response hides a rule that protects nothing.
3. **It's the right configuration serving the request.** A build marker
confirms the response came from the configuration under test, not from stale
config left over from a previous state — so a pass can't be a timing fluke.

The first is the point; the second and third exist so a surprising result
becomes a diagnosis rather than a mystery.
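
One way to picture how the three checks combine is sketched below. The type `checkResult`, the function `verdict`, and the diagnosis strings are illustrative assumptions, not the suite's real step definitions; in particular, treating a traffic pass with failed parity as a failure is this sketch's reading of the "silently inert" case.

```go
package main

import "fmt"

// checkResult holds the three observations a scenario collects.
// Hypothetical type for illustration.
type checkResult struct {
	trafficOK bool // the real response matched the expectation
	parityOK  bool // the running proxy carries the intended configuration
	markerOK  bool // the response came from the configuration under test
}

// verdict lets traffic decide first, then uses parity and the build
// marker to explain a failure or to catch a pass that proves nothing.
func verdict(r checkResult) (pass bool, diagnosis string) {
	switch {
	case !r.trafficOK && !r.parityOK:
		return false, "traffic failed: configuration never reached the proxy"
	case !r.trafficOK && !r.markerOK:
		return false, "traffic failed: response served by stale configuration"
	case !r.trafficOK:
		return false, "traffic failed: configuration present but behaving wrongly"
	case !r.parityOK:
		return false, "traffic passed but configuration is missing: silently inert"
	case !r.markerOK:
		return false, "traffic passed against stale configuration: possible timing fluke"
	default:
		return true, "behavior proven at the edge"
	}
}

func main() {
	_, why := verdict(checkResult{trafficOK: true, parityOK: false, markerOK: true})
	fmt.Println(why)
}
```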

## The scenarios

Each one corresponds to a past production incident, now held in place.

### Web-app firewall enforcement
A malicious request (matching the firewall's attack rules) must be refused,
while a legitimate request to the same endpoint still succeeds. The test also
flips the policy into observe-only mode and confirms the same attack is then
*allowed* — proving the block is genuinely driven by the customer's firewall
policy and not by some unrelated default. This guards the customer's actual
protection, and guards against the risk that a single bad firewall rule wedges
the whole listener.
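
The enforce-then-observe logic can be sketched as follows. This is not the scenario's real implementation: `fakeEdge` is a hypothetical in-process stand-in for the proxy and firewall, and the attack signature and query strings are invented for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// fakeEdge stands in for the proxy plus firewall (hypothetical). In enforce
// mode it refuses requests carrying a crude attack signature; in observe
// mode it lets the same requests through.
func fakeEdge(enforce bool) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		q := strings.ToUpper(r.URL.Query().Get("q"))
		if enforce && strings.Contains(q, "UNION SELECT") {
			http.Error(w, "blocked", http.StatusForbidden)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
}

// status returns the HTTP status for a GET, or 0 on a transport error.
func status(url string) int {
	resp, err := http.Get(url)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

// wafPolicyDrivesBlock reports whether the block follows the policy mode:
// attack refused and benign allowed when enforcing, attack allowed when
// only observing.
func wafPolicyDrivesBlock() bool {
	const attack = "/?q=1+UNION+SELECT+password"
	const benign = "/?q=hello"

	enforcing := fakeEdge(true)
	defer enforcing.Close()
	observing := fakeEdge(false)
	defer observing.Close()

	return status(enforcing.URL+attack) == http.StatusForbidden &&
		status(enforcing.URL+benign) == http.StatusOK &&
		status(observing.URL+attack) == http.StatusOK
}

func main() {
	fmt.Println(wafPolicyDrivesBlock()) // prints true
}
```

The third assertion is the one that rules out an unrelated default doing the blocking.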

### Offline backend returns a clean 503
When a backend connector is offline, the customer-facing path must return a
real 503 rather than hanging or serving a blank response. This is checked as an
observed response on the user path, because that's what a customer experiences.

### Branded error page reaches the customer
When the edge serves an error, the customer must receive Datum's styled page,
confirmed by finding the page's content in the actual response body — not by
trusting that the configuration was applied. Production once needed manual
restarts to make this take effect, exactly the kind of inert-configuration gap
this scenario now catches.
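
As a small illustration of asserting on the response body rather than on applied configuration, the sketch below looks for a marker string in an error response. The marker `brandMarker` and the function `brandedPageServed` are invented for this example; the real page's content is not shown in this document.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// brandMarker is a hypothetical string assumed to appear only in the
// styled error page.
const brandMarker = `data-page="datum-error"`

// brandedPageServed reports whether an error response carried the styled
// page, judged by status and body content alone.
func brandedPageServed(status int, body string) bool {
	return status >= 400 && strings.Contains(body, brandMarker)
}

func main() {
	// Stand-in handler that serves a branded 503, recorded in memory.
	rec := httptest.NewRecorder()
	http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintf(w, "<html><body %s>Service unavailable</body></html>", brandMarker)
	}).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))

	fmt.Println(brandedPageServed(rec.Code, rec.Body.String())) // prints true
}
```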

### One bad certificate can't break its neighbors
A single invalid certificate must not take down the other, healthy listeners
sharing the same proxy. The test introduces a genuinely bad certificate and
confirms its listener is isolated while sibling listeners keep serving real
traffic — the "one bad resource freezes everything" failure mode, contained.

## Running them

The scenarios run against the production-fidelity environment:

```
task test-infra:up # bring the edge online (proxy + extension server + firewall)
task test-infra:smoke # quick confidence check: real traffic serves
task test-infra:e2e # the full guarantee suite above
```

See [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) for the
environment these assume.

## Layout

- Scenario folders (`waf-enforcement/`, `connector-offline-503/`,
`branded-error-page/`, `atomic-reject-isolation/`) — one guarantee each.
- `_steps/` — shared, reusable checks (send-a-request, confirm-configuration,
capture-the-build-marker) so every scenario asserts behavior the same way.
- `_fixtures/` — the supporting pieces a scenario needs (a sample backend, an
attack corpus, a pre-made bad certificate, the offline-backend stand-in).

## A note on honesty

Where the edge genuinely cannot yet do something, the suite says so rather than
papering over it. The "offline backend recovers" path is deliberately held back
because the edge today does not reliably re-apply configuration when a backend
comes *back* online — and the test proves that gap exists instead of pretending
it's closed. A guarantee we can't keep is documented as a gap, not asserted as
a pass.
14 changes: 14 additions & 0 deletions test/e2e-edge/_fixtures/certs/expired-cert-secret.yaml
@@ -0,0 +1,14 @@
# GENERATED by gen-certs.sh — do not edit by hand; re-run the script to refresh.
#
# A pre-minted EXPIRED certificate for bad.e2e.env.datum.net. Its validity window
# (20260621000000Z .. 20260622000000Z) is in the past, so the extension server
# drops the listener that references it while sibling listeners keep serving.
apiVersion: v1
kind: Secret
metadata:
  name: expired-leaf-tls
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJ0ekNDQVY2Z0F3SUJBZ0lVTW5lMFBkKzczaDFPcFh0YWo1eDlIbzRxR2Fjd0NnWUlLb1pJemowRUF3SXcKSURFZU1Cd0dBMVVFQXd3VlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQjRYRFRJMk1EWXlNVEF3TURBdwpNRm9YRFRJMk1EWXlNakF3TURBd01Gb3dJREVlTUJ3R0ExVUVBd3dWWW1Ga0xtVXlaUzVsYm5ZdVpHRjBkVzB1CmJtVjBNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVKWEYyV0U1UmZzN2lJbkpmVWRWZHNRa0IKWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbHA5L2tMNUYweEkvSWpYeXpFWlF1YU9vMVNwWCtYd0pWS3FOMgpNSFF3SUFZRFZSMFJCQmt3RjRJVlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQXdHQTFVZEV3RUIvd1FDCk1BQXdEZ1lEVlIwUEFRSC9CQVFEQWdXZ01CTUdBMVVkSlFRTU1Bb0dDQ3NHQVFVRkJ3TUJNQjBHQTFVZERnUVcKQkJTaTRpOTB2VkEwdmlQU0k2aW9WdFFEdXFhOE5qQUtCZ2dxaGtqT1BRUURBZ05IQURCRUFpQTJmalMvbGVmNQpTRHFEMEt5UWJRV1hENlltMHg4NDJnU2VPMS9XZ041MHpnSWdYVHd4MWZwSTNVc0U1UXI4bkNXWk0wd044b0cyCkU4a0ttZUo1a3BDL3lFST0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
  tls.key: LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0tCk1IY0NBUUVFSUM3Ri9qbkVFSTBYQk12emg4K0lCaVVycG5MMnVGeEpFSUlMQWkyd1Q4eXlvQW9HQ0NxR1NNNDkKQXdFSG9VUURRZ0FFSlhGMldFNVJmczdpSW5KZlVkVmRzUWtCWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbApwOS9rTDVGMHhJL0lqWHl6RVpRdWFPbzFTcFgrWHdKVktnPT0KLS0tLS1FTkQgRUMgUFJJVkFURSBLRVktLS0tLQo=
84 changes: 84 additions & 0 deletions test/e2e-edge/_fixtures/certs/gen-certs.sh
@@ -0,0 +1,84 @@
#!/usr/bin/env bash
# Regenerate the pre-minted EXPIRED TLS certificate the invalid-certificate test
# relies on.
#
# The test needs a certificate that is already expired at test time so the
# extension server drops the listener that references it while sibling listeners
# keep serving. cert-manager cannot help here: it renews before expiry, so it
# cannot hold a certificate in the expired state. We instead mint a self-signed
# certificate whose validity window is in the past.
#
# This is a listener certificate for the customer hostname under test, entirely
# separate from the certificate authority securing the extension server's own
# connection to the gateway; do not wire it to that authority.
#
# Output: expired-cert-secret.yaml — a Secret with the expired certificate and
# key inline. The committed Secret is what the test consumes, so a live run needs
# no openssl; re-run this script only to refresh it.
#
# Usage: ./gen-certs.sh [HOSTNAME] [SECRET_NAME] [SECRET_NAMESPACE]
set -euo pipefail

HOST="${1:-bad.e2e.env.datum.net}"
SECRET_NAME="${2:-expired-leaf-tls}"
SECRET_NS="${3:-default}"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WORK="$(mktemp -d)"
trap 'rm -rf "${WORK}"' EXIT

# ECDSA P-256 key.
openssl ecparam -name prime256v1 -genkey -noout -out "${WORK}/tls.key"

# A self-signed certificate whose validity window is entirely in the past. We set
# the exact start and end times with -not_before/-not_after (available in
# OpenSSL 3.4 and later) so the certificate is already expired by the time the
# test runs.
NOT_BEFORE="20260621000000Z"
NOT_AFTER="20260622000000Z"

cat > "${WORK}/leaf.cnf" <<EOF
[req]
distinguished_name = dn
prompt = no
x509_extensions = v3
[dn]
CN = ${HOST}
[v3]
subjectAltName = DNS:${HOST}
basicConstraints = critical, CA:FALSE
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
EOF

openssl req -new -x509 \
  -key "${WORK}/tls.key" \
  -out "${WORK}/tls.crt" \
  -config "${WORK}/leaf.cnf" \
  -not_before "${NOT_BEFORE}" \
  -not_after "${NOT_AFTER}" \
  -sha256

echo "minted leaf for ${HOST}:"
openssl x509 -in "${WORK}/tls.crt" -noout -subject -dates

CRT_B64="$(base64 < "${WORK}/tls.crt" | tr -d '\n')"
KEY_B64="$(base64 < "${WORK}/tls.key" | tr -d '\n')"

cat > "${SCRIPT_DIR}/expired-cert-secret.yaml" <<EOF
# GENERATED by gen-certs.sh — do not edit by hand; re-run the script to refresh.
#
# A pre-minted EXPIRED certificate for ${HOST}. Its validity window
# (${NOT_BEFORE} .. ${NOT_AFTER}) is in the past, so the extension server drops
# the listener that references it while sibling listeners keep serving.
apiVersion: v1
kind: Secret
metadata:
  name: ${SECRET_NAME}
  namespace: ${SECRET_NS}
type: kubernetes.io/tls
data:
  tls.crt: ${CRT_B64}
  tls.key: ${KEY_B64}
EOF

echo "wrote ${SCRIPT_DIR}/expired-cert-secret.yaml"
13 changes: 13 additions & 0 deletions test/e2e-edge/_fixtures/connector-tunnel/Dockerfile
@@ -0,0 +1,13 @@
# CONNECT-proxy stand-in for the connector tunnel.
# Single static binary, no dependencies beyond the standard library.
FROM golang:1.23-alpine AS build
WORKDIR /src
COPY go.mod .
COPY main.go .
# Standard library only; go.mod just gives the build a module context.
RUN CGO_ENABLED=0 go build -o /connect-proxy .

FROM gcr.io/distroless/static:nonroot
COPY --from=build /connect-proxy /connect-proxy
USER nonroot:nonroot
ENTRYPOINT ["/connect-proxy"]
3 changes: 3 additions & 0 deletions test/e2e-edge/_fixtures/connector-tunnel/go.mod
@@ -0,0 +1,3 @@
module connect-proxy

go 1.23
94 changes: 94 additions & 0 deletions test/e2e-edge/_fixtures/connector-tunnel/main.go
@@ -0,0 +1,94 @@
// connect-proxy is a stand-in for the real connector tunnel. It exercises the
// proxy-side CONNECT wiring and the 503<->200 liveness swap, but not real tunnel
// establishment or NAT traversal.
//
// When a connector is online, the extension server points its backend at a path
// that issues an HTTP CONNECT toward the connector's target, which in the test
// is this pod's Service. On a CONNECT, this proxy dials the configured upstream
// (the echo backend) and blindly splices bytes both ways — a minimal forward
// proxy.
//
// Liveness is driven entirely by the connector's annotation on the control plane
// (see _steps/flip-connector-liveness.yaml); this proxy is always up. It exists
// so an online request can only succeed via the tunnel, never via a direct
// fallback route — which is what makes the 503->200 transition meaningful.
package main

import (
	"io"
	"log"
	"net"
	"net/http"
	"os"
	"time"
)

func main() {
	listen := envOr("LISTEN_ADDR", ":8080")
	// Where CONNECT requests are forwarded. Default to the echo backend Service.
	// The proxy ignores the CONNECT target host and always dials this upstream,
	// so the test controls the destination via UPSTREAM_ADDR rather than via the
	// host named in the CONNECT request.
	upstream := envOr("UPSTREAM_ADDR", "echo-backend.default.svc.cluster.local:8080")

	srv := &http.Server{
		Addr:        listen,
		ReadTimeout: 0, // tunnels are long-lived; no read deadline on the hijacked conn
		Handler:     &proxy{upstream: upstream},
	}
	log.Printf("connect-proxy listening on %s, forwarding CONNECT -> %s", listen, upstream)
	if err := srv.ListenAndServe(); err != nil {
		log.Fatalf("server exited: %v", err)
	}
}

type proxy struct {
	upstream string
}

func (p *proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodConnect {
		// A plain GET is handy as a liveness/readiness probe.
		w.WriteHeader(http.StatusOK)
		_, _ = io.WriteString(w, "connect-proxy ready\n")
		return
	}

	dst, err := net.DialTimeout("tcp", p.upstream, 10*time.Second)
	if err != nil {
		log.Printf("CONNECT %s: dial upstream %s failed: %v", r.Host, p.upstream, err)
		http.Error(w, "upstream unavailable", http.StatusBadGateway)
		return
	}
	defer dst.Close()

	hj, ok := w.(http.Hijacker)
	if !ok {
		http.Error(w, "hijacking unsupported", http.StatusInternalServerError)
		return
	}
	client, _, err := hj.Hijack()
	if err != nil {
		log.Printf("CONNECT %s: hijack failed: %v", r.Host, err)
		return
	}
	defer client.Close()

	// Tell the client the tunnel is established, then splice bytes both ways.
	if _, err := client.Write([]byte("HTTP/1.1 200 Connection Established\r\n\r\n")); err != nil {
		log.Printf("CONNECT %s: write 200 failed: %v", r.Host, err)
		return
	}

	// The channel is buffered for both copiers, so the one that finishes
	// second never blocks after we return and the deferred Closes run.
	done := make(chan struct{}, 2)
	go func() { _, _ = io.Copy(dst, client); done <- struct{}{} }()
	go func() { _, _ = io.Copy(client, dst); done <- struct{}{} }()
	<-done
}

func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}