From 29a7529aac1da7c7e50fde24891da6eb6f84cdc2 Mon Sep 17 00:00:00 2001 From: Jose Szychowski Date: Tue, 9 Jun 2026 12:52:14 -0300 Subject: [PATCH 1/3] docs: add WASM filter emitter as the recommended approach for telemetry and billing attribution --- .../service-catalog-registration.md | 179 ++++++++++++++++-- 1 file changed, 158 insertions(+), 21 deletions(-) diff --git a/docs/enhancements/service-catalog-registration.md b/docs/enhancements/service-catalog-registration.md index 29521e44..381d74aa 100644 --- a/docs/enhancements/service-catalog-registration.md +++ b/docs/enhancements/service-catalog-registration.md @@ -387,7 +387,7 @@ confirmed architectural choice before implementation begins. ### How Envoy Gateway exposes telemetry -Envoy Gateway exposes traffic telemetry through two primary surfaces: +Envoy Gateway exposes traffic telemetry through three primary surfaces: 1. **Prometheus scrape endpoint** — the Envoy data plane exposes a `/stats` endpoint with counters including `envoy_http_downstream_rq_total` (request @@ -395,7 +395,12 @@ Envoy Gateway exposes traffic telemetry through two primary surfaces: `envoy_http_downstream_cx_tx_bytes_total` (egress bytes), and `envoy_http_downstream_cx_active` (active connections, from which connection-seconds can be derived). These are gauge/counter deltas — they are - not pre-attributed to a project or billing account. + not pre-attributed to a project or billing account. Importantly, the + cluster-level variants (`envoy_cluster_upstream_*`) carry `httproute_name`, + `httproute_namespace`, and `httproute_rule_ordinal` labels already, providing + per-route attribution with no configuration changes. Counter values are held + in-memory and lost when a pod restarts; the data loss window equals the scrape + interval (typically 15–30 s). 2. **Access logs** — Envoy can be configured via the `EnvoyProxy` CR to emit structured JSON access logs (one line per completed request). 
Each line carries @@ -403,6 +408,15 @@ Envoy Gateway exposes traffic telemetry through two primary surfaces: metadata. Access logs are the most natural source for per-request billing because each log line maps directly to one billable unit. +3. **WASM filter hooks** — Envoy embeds a WASM runtime (V8 or Wasmtime) and + executes a `.wasm` binary inside the Envoy process. The proxy-wasm ABI + exposes lifecycle hooks (`on_request_headers`, `on_response_headers`, + `on_log`, `on_done`) that fire at well-defined points in the request + lifecycle. The filter reads request metadata via `get_property()` host + function calls and dispatches async HTTP calls via `dispatch_http_call()`; + the dispatch returns immediately and the response arrives in a later callback, so neither call blocks Envoy. + Because the filter runs in-process, no sidecar is needed; state held in filter memory is still lost on pod restart. + ### Candidate approaches #### Option A: Access log scraping via Vector Agent (recommended) @@ -467,29 +481,129 @@ forwards to the Ingestion Gateway. - Attribution of OTLP metric streams to billing accounts is not defined in the current pipeline design. +#### Option D: WASM filter emitter + +Build a small WASM binary (Rust, using `proxy-wasm-rust-sdk`) that runs inside +the Envoy process and hooks into the request lifecycle via the proxy-wasm ABI. +On `on_log()` — which fires after each request completes and on connection close +for WebSocket / long-lived connections — the filter extracts billing signals and +accumulates them in shared memory. Rather than emitting one event per request, +the filter flushes a single batched payload every N seconds (configurable, +default 60 s) via `dispatch_http_call()`. At 10 k req/s this reduces outbound +billing calls from 10,000/s down to 1/min per proxy pod — a reduction of ~600,000× +in emission volume, with no loss of per-route accuracy since signals are +aggregated by `httproute_name` + `httproute_namespace` before flushing. 
+ +The filter reads signal data through `get_property()` host calls: + +``` +request.size → ingress bytes for this request +response.size → egress bytes for this request +request.duration → connection duration (ms) on close +upstream.cluster_name → encodes httproute_name + httproute_namespace +response.code → HTTP status +``` + +Envoy's upstream cluster name for a route follows the pattern: +`httproute///rule/` — the filter parses this string +to extract route identity with no Kubernetes API calls. + +The binary is packaged as an OCI image and wired in via `EnvoyExtensionPolicy`: + +```yaml +apiVersion: gateway.envoyproxy.io/v1alpha1 +kind: EnvoyExtensionPolicy +metadata: + name: billing-wasm +spec: + targetRef: + group: gateway.networking.k8s.io + kind: Gateway + name: + wasm: + - name: billing-emitter + rootID: billing + code: + type: Image + image: + url: oci://ghcr.io/datum-cloud/billing-wasm:latest + config: | + { "endpoint": "http://billing-usage-collector-vector.billing-system.svc.cluster.local:9880/cloudevents", "flush_interval_seconds": 60 } +``` + +**Pros:** +- Bounded data loss on pod restarts: only signals accumulated since the last + flush (at most one flush interval, 60 s by default) are lost with the pod. +- Connection-seconds for WebSocket / long-lived connections is captured natively + — `on_log()` fires on connection close with the full duration available. +- Runs inside the Envoy process — no sidecar required. +- No changes to the network-services-operator Go binary. +- Route identity (`httproute_name`, `httproute_namespace`) is available directly + from the upstream cluster name without Kubernetes API calls. + +**Cons:** +- Requires a separate Rust build pipeline and OCI image; adds a new artifact to + maintain. +- The filter cannot make synchronous network calls — billing dispatch goes + through Envoy's async `dispatch_http_call()` to a local agent. If that agent + is unavailable, events must be dropped or buffered in WASM shared memory + (limited). 
+- WASM sandbox has no filesystem access; durable buffering must be delegated to + Vector or another on-node agent. +- Adds a small per-request overhead (<1 ms) for the `on_log()` hook. + +**Comparison across options:** + +| | Option A (access logs) | Option B (Prometheus) | Option D (WASM) | +|---|---|---|---| +| Data loss on pod restart | near-zero | ~1 scrape interval | up to 1 flush interval (default 60 s) | +| Per-request granularity | ✓ | ✗ | ✓ | +| Connection-seconds | requires separate mechanism | via `cx_length_ms_sum` | ✓ native on close | +| NSO Go changes | none | none | none | +| New build artifact | none | none | Rust WASM binary | +| Infrastructure changes | `EnvoyProxy` CR + Vector config | none | `EnvoyExtensionPolicy` | + ### Connection-seconds handling -For all options, the connection-seconds signal for WebSocket and other -long-lived connections is not captured naturally by per-request telemetry. The -recommended approach is to emit a connection-open and connection-close event from -the network-services-operator's gateway controller (which already watches -`Gateway` objects and manages their lifecycle). The duration between open and -close events, reported as a sum of connection-seconds, gives the meter its -signal. This is a small, localized Go change to the gateway controller. +For Option A and Option B, the connection-seconds signal for WebSocket and other +long-lived connections requires a separate mechanism — per-request access logs +do not emit while a connection is held open, and Prometheus counters +(`envoy_cluster_upstream_cx_length_ms_sum`) only update when a connection +closes. + +For **Option D (WASM)**, connection-seconds is handled natively: `on_log()` +fires when a connection closes, at which point `request.duration` contains the +full connection lifetime in milliseconds. No additional Go controller changes +are needed. 
### Recommendation -**Option A** (access log scraping via Vector) is the most consistent with the -billing pipeline architecture and is recommended as the primary collection -mechanism for request count, egress bytes, and ingress bytes. Connection-seconds -for persistent connections is handled by a lightweight controller-side emitter -(see above). This approach requires: +**Option D** (WASM filter emitter) is the recommended approach. It eliminates +the primary reliability concern of Option B (counter loss on pod restart), covers +all four billing signals including connection-seconds natively, and requires no +changes to the NSO Go binary or access log configuration. The main investment is +a small Rust build pipeline and an `EnvoyExtensionPolicy` manifest. + +The filter should batch-emit usage events to a local Vector agent (rather than +calling the Ingestion Gateway directly) to preserve Vector's tier-1 durability +guarantee and decouple the filter from billing availability. Flush interval is +configurable; 60 seconds is a reasonable default. + +**Option A** (access logs via Vector) remains a valid fallback if WASM support +in the deployed Envoy Gateway version is insufficient or if the Rust build +pipeline cannot be prioritized. In that case connection-seconds requires an +additional controller-side emitter. + +This approach requires: -1. An `EnvoyProxy` CR patch configuring structured JSON access logs. -2. A Vector configuration to parse the log format and construct `UsageEvent`s. -3. A small addition to the gateway controller for connection-lifecycle events. +1. A Rust WASM binary (`proxy-wasm-rust-sdk`) implementing `on_log()` signal + extraction and batched `dispatch_http_call()` to the local Vector agent. +2. An OCI image packaging the `.wasm` binary, published to + `ghcr.io/datum-cloud/billing-wasm`. +3. An `EnvoyExtensionPolicy` manifest in `config/` referencing the image. -None of these changes require modifications to the billing pipeline contract. 
+None of these changes require modifications to the billing pipeline contract or +the NSO Go binary. ### Pipeline context enrichment @@ -531,12 +645,15 @@ The following decisions are required before work can begin on each phase. | OD-3 | ~~`producerProjectRef.name`~~ — resolved: `datum-cloud`. | — | — | | OD-4 | ~~Bundle layout~~ — resolved: per-service-domain directory under `config/services/networking.datumapis.com/`, matching `datum-cloud/datum/config/services//`. | — | — | | OD-5 | Is the Vector Agent DaemonSet planned to run on the edge cluster nodes that host Envoy Gateway pods? | Platform / infra | Phase 2 | -| OD-6 | Can the network-services-operator patch the `EnvoyProxy` CR to inject access log configuration? | Kevin | Phase 2 | +| OD-6 | Can the network-services-operator patch the `EnvoyProxy` CR to inject access log configuration? | Kevin | Phase 2 (Option A fallback only) | | OD-7 | Is the billing SDK published as a consumable Go module? | Billing team | Phase 2 | | OD-8 | Enrichment-sidecar placement: per-node alongside Vector, or central in front of the Ingestion Gateway? | Billing team / platform | Phase 2 | +| OD-9 | Does the deployed Envoy Gateway version support `EnvoyExtensionPolicy` with WASM? Minimum version is Envoy Gateway v1.1. | Platform / infra | Phase 2 (Option D) | +| OD-10 | WASM build pipeline: standalone repo or subdirectory of `network-services-operator`? Affects release cadence and artifact ownership. | NSO team | Phase 2 (Option D) | +| OD-11 | WASM dispatch target: emit directly to local Vector agent, or to a dedicated on-node billing relay? Affects durability contract and Vector coupling. | Billing team / platform | Phase 2 (Option D) | All Phase 1 decisions are resolved; implementation can begin. Phase 2 is -blocked on OD-5 through OD-8. +blocked on OD-5, OD-7, OD-8, and (for Option D) OD-9 through OD-11. 
--- @@ -574,7 +691,27 @@ From issue [#155](https://github.com/datum-cloud/network-services-operator/issue ### Phase 2 — Emission integration (~1–2 weeks) -1. Resolve OD-5 through OD-8. +#### Path D (WASM — recommended) + +1. Resolve OD-5, OD-7, OD-9, OD-10, OD-11. +2. Create the WASM filter binary (Rust, `proxy-wasm-rust-sdk`): + - Implement `on_log()` to extract `request.size`, `response.size`, + `request.duration`, and `upstream.cluster_name`. + - Parse `upstream.cluster_name` to derive `httproute_name` and + `httproute_namespace`. + - Batch signals and flush via `dispatch_http_call()` to the local Vector + agent at a configurable interval (default 60 s). +3. Package the `.wasm` binary as an OCI image and publish to + `ghcr.io/datum-cloud/billing-wasm`. +4. Author an `EnvoyExtensionPolicy` manifest referencing the image. +5. Configure Vector to receive batched events from the WASM filter and forward + them as `UsageEvent` CloudEvents to the Ingestion Gateway. +6. Write unit tests for the Rust filter and integration tests validating + end-to-end event delivery. + +#### Path A (access logs — fallback if OD-9 blocks Option D) + +1. Resolve OD-5, OD-6, OD-7, OD-8. 2. Add billing SDK to `go.mod`. 3. Patch the `EnvoyProxy` CR to enable structured JSON access logs. 4. 
Configure Vector to parse access log entries and emit `UsageEvent`s for From 872624065804a177642595622484b4b34caad925 Mon Sep 17 00:00:00 2001 From: Jose Szychowski Date: Tue, 9 Jun 2026 13:31:49 -0300 Subject: [PATCH 2/3] docs: update service catalog registration enhancement to recommend access log scraping over WASM filters --- .../service-catalog-registration.md | 62 ++++--------------- 1 file changed, 12 insertions(+), 50 deletions(-) diff --git a/docs/enhancements/service-catalog-registration.md b/docs/enhancements/service-catalog-registration.md index 381d74aa..90a97184 100644 --- a/docs/enhancements/service-catalog-registration.md +++ b/docs/enhancements/service-catalog-registration.md @@ -578,32 +578,17 @@ are needed. ### Recommendation -**Option D** (WASM filter emitter) is the recommended approach. It eliminates -the primary reliability concern of Option B (counter loss on pod restart), covers -all four billing signals including connection-seconds natively, and requires no -changes to the NSO Go binary or access log configuration. The main investment is -a small Rust build pipeline and an `EnvoyExtensionPolicy` manifest. +**Option A** (access log scraping via Vector) is the most consistent with the +billing pipeline architecture and is recommended as the primary collection +mechanism for request count, egress bytes, and ingress bytes. Connection-seconds +for persistent connections is handled by a lightweight controller-side emitter +(see above). This approach requires: -The filter should batch-emit usage events to a local Vector agent (rather than -calling the Ingestion Gateway directly) to preserve Vector's tier-1 durability -guarantee and decouple the filter from billing availability. Flush interval is -configurable; 60 seconds is a reasonable default. +1. An `EnvoyProxy` CR patch configuring structured JSON access logs. +2. A Vector configuration to parse the log format and construct `UsageEvent`s. +3. 
A small addition to the gateway controller for connection-lifecycle events. -**Option A** (access logs via Vector) remains a valid fallback if WASM support -in the deployed Envoy Gateway version is insufficient or if the Rust build -pipeline cannot be prioritized. In that case connection-seconds requires an -additional controller-side emitter. - -This approach requires: - -1. A Rust WASM binary (`proxy-wasm-rust-sdk`) implementing `on_log()` signal - extraction and batched `dispatch_http_call()` to the local Vector agent. -2. An OCI image packaging the `.wasm` binary, published to - `ghcr.io/datum-cloud/billing-wasm`. -3. An `EnvoyExtensionPolicy` manifest in `config/` referencing the image. - -None of these changes require modifications to the billing pipeline contract or -the NSO Go binary. +None of these changes require modifications to the billing pipeline contract. ### Pipeline context enrichment @@ -645,15 +630,12 @@ The following decisions are required before work can begin on each phase. | OD-3 | ~~`producerProjectRef.name`~~ — resolved: `datum-cloud`. | — | — | | OD-4 | ~~Bundle layout~~ — resolved: per-service-domain directory under `config/services/networking.datumapis.com/`, matching `datum-cloud/datum/config/services//`. | — | — | | OD-5 | Is the Vector Agent DaemonSet planned to run on the edge cluster nodes that host Envoy Gateway pods? | Platform / infra | Phase 2 | -| OD-6 | Can the network-services-operator patch the `EnvoyProxy` CR to inject access log configuration? | Kevin | Phase 2 (Option A fallback only) | +| OD-6 | Can the network-services-operator patch the `EnvoyProxy` CR to inject access log configuration? | Kevin | Phase 2 | | OD-7 | Is the billing SDK published as a consumable Go module? | Billing team | Phase 2 | | OD-8 | Enrichment-sidecar placement: per-node alongside Vector, or central in front of the Ingestion Gateway? 
| Billing team / platform | Phase 2 | -| OD-9 | Does the deployed Envoy Gateway version support `EnvoyExtensionPolicy` with WASM? Minimum version is Envoy Gateway v1.1. | Platform / infra | Phase 2 (Option D) | -| OD-10 | WASM build pipeline: standalone repo or subdirectory of `network-services-operator`? Affects release cadence and artifact ownership. | NSO team | Phase 2 (Option D) | -| OD-11 | WASM dispatch target: emit directly to local Vector agent, or to a dedicated on-node billing relay? Affects durability contract and Vector coupling. | Billing team / platform | Phase 2 (Option D) | All Phase 1 decisions are resolved; implementation can begin. Phase 2 is -blocked on OD-5, OD-7, OD-8, and (for Option D) OD-9 through OD-11. +blocked on OD-5 through OD-8. --- @@ -691,27 +673,7 @@ From issue [#155](https://github.com/datum-cloud/network-services-operator/issue ### Phase 2 — Emission integration (~1–2 weeks) -#### Path D (WASM — recommended) - -1. Resolve OD-5, OD-7, OD-9, OD-10, OD-11. -2. Create the WASM filter binary (Rust, `proxy-wasm-rust-sdk`): - - Implement `on_log()` to extract `request.size`, `response.size`, - `request.duration`, and `upstream.cluster_name`. - - Parse `upstream.cluster_name` to derive `httproute_name` and - `httproute_namespace`. - - Batch signals and flush via `dispatch_http_call()` to the local Vector - agent at a configurable interval (default 60 s). -3. Package the `.wasm` binary as an OCI image and publish to - `ghcr.io/datum-cloud/billing-wasm`. -4. Author an `EnvoyExtensionPolicy` manifest referencing the image. -5. Configure Vector to receive batched events from the WASM filter and forward - them as `UsageEvent` CloudEvents to the Ingestion Gateway. -6. Write unit tests for the Rust filter and integration tests validating - end-to-end event delivery. - -#### Path A (access logs — fallback if OD-9 blocks Option D) - -1. Resolve OD-5, OD-6, OD-7, OD-8. +1. Resolve OD-5 through OD-8. 2. Add billing SDK to `go.mod`. 3. 
Patch the `EnvoyProxy` CR to enable structured JSON access logs. 4. Configure Vector to parse access log entries and emit `UsageEvent`s for From 48b76df74b5acde0eea24f9a6942a1df0682cc87 Mon Sep 17 00:00:00 2001 From: Jose Szychowski Date: Tue, 9 Jun 2026 13:37:57 -0300 Subject: [PATCH 3/3] docs: refine service catalog registration recommendation to favor per-request usage event forwarding https://github.com/datum-cloud/network-services-operator/pull/177#pullrequestreview-4460379681 --- docs/enhancements/service-catalog-registration.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/docs/enhancements/service-catalog-registration.md b/docs/enhancements/service-catalog-registration.md index 90a97184..a88bf298 100644 --- a/docs/enhancements/service-catalog-registration.md +++ b/docs/enhancements/service-catalog-registration.md @@ -578,11 +578,15 @@ are needed. ### Recommendation -**Option A** (access log scraping via Vector) is the most consistent with the -billing pipeline architecture and is recommended as the primary collection -mechanism for request count, egress bytes, and ingress bytes. Connection-seconds -for persistent connections is handled by a lightweight controller-side emitter -(see above). This approach requires: +**Option A** (access log scraping via Vector) is the recommended approach for +the initial implementation. At current scale, per-request events without +aggregation are sufficient — one `UsageEvent` per completed request, forwarded +directly by Vector to the Ingestion Gateway. Aggregation is a pipeline-level +concern; if it becomes necessary at scale it will be added as a platform +capability in the billing pipeline, not as Envoy-specific logic. + +Connection-seconds for persistent connections is handled by a lightweight +controller-side emitter (see above). This approach requires: 1. An `EnvoyProxy` CR patch configuring structured JSON access logs. 2. 
A Vector configuration to parse the log format and construct `UsageEvent`s.