Add event governance layer — centralized DLQ strategy, schema validation, event catalog #34

@haasonsaas

Problem

NATS JetStream carries CloudEvents between services, but there is no mesh-wide event governance:

No centralized DLQ strategy

  • Cerebro has its own file-backed outbox with DLQ and replay
  • Ensemble-Tap has its own DLQ
  • Other services: unknown. What happens when Pipeline's consumer fails, or when Ensemble can't process an event? Each service handles failure independently, or not at all.

No runtime schema validation

Events are self-describing CloudEvents with a dataschema field (e.g., urn:cerebro:events:v1:<type>), but nothing validates that the payload matches the declared schema at publish or consume time. A publisher can emit malformed events and the consumer discovers the problem at deserialize time — or silently mishandles the data.

No event catalog

  • Cerebro defines ~60 event types in internal/webhooks/webhooks.go
  • Pipeline defines its own event types
  • Ensemble defines its own
  • There is no central registry of "these are the events in the mesh, these are their schemas, these are their producers and consumers"
  • The proto repo covers API contracts but not event contracts

No dead letter alerting

If events accumulate in a DLQ (where DLQs exist) or silently drop (where they don't), there is no mesh-wide signal.

Proposed approach

1. natsbus DLQ convention

Add a standard DLQ subject pattern to natsbus: when a consumer fails to process an event after N retries, publish to dlq.<original-subject> with the original event + error metadata. This gives every service using natsbus a consistent failure path without custom implementation.
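
A minimal sketch of what this convention could look like, using only the standard library. The `DeadLetter` envelope, the `Publisher` and `Handler` types, `HandleWithDLQ`, and the subject `tap.events.created` are all hypothetical names for illustration; none of them are existing natsbus APIs. The only thing taken from this issue is the `dlq.<original-subject>` pattern.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// DLQPrefix implements the dlq.<original-subject> pattern proposed above.
const DLQPrefix = "dlq."

// DeadLetter is a hypothetical envelope: the original event plus error metadata.
// Event is []byte (base64 in JSON) so that malformed payloads can still be captured.
type DeadLetter struct {
	OriginalSubject string    `json:"original_subject"`
	Event           []byte    `json:"event"`
	Error           string    `json:"error"`
	Attempts        int       `json:"attempts"`
	FailedAt        time.Time `json:"failed_at"`
}

// Publisher and Handler stand in for whatever natsbus exposes (assumption).
type Publisher func(subject string, payload []byte) error
type Handler func(event []byte) error

// errDemo is only used by the demo in main.
var errDemo = errors.New("demo: handler failed")

// DLQSubject maps an original subject to its dead-letter subject.
func DLQSubject(subject string) string { return DLQPrefix + subject }

// HandleWithDLQ calls handle up to maxAttempts times. If every attempt fails,
// it publishes a DeadLetter to the DLQ subject and reports deadLettered=true.
func HandleWithDLQ(subject string, event []byte, maxAttempts int, handle Handler, publish Publisher) (deadLettered bool, err error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = handle(event); lastErr == nil {
			return false, nil
		}
	}
	payload, err := json.Marshal(DeadLetter{
		OriginalSubject: subject,
		Event:           event,
		Error:           lastErr.Error(),
		Attempts:        maxAttempts,
		FailedAt:        time.Now().UTC(),
	})
	if err != nil {
		return false, fmt.Errorf("encode dead letter: %w", err)
	}
	if err := publish(DLQSubject(subject), payload); err != nil {
		return false, fmt.Errorf("publish to %s: %w", DLQSubject(subject), err)
	}
	return true, nil
}

func main() {
	publish := func(subject string, payload []byte) error {
		fmt.Printf("publish %s (%d bytes)\n", subject, len(payload))
		return nil
	}
	alwaysFails := func([]byte) error { return errDemo }
	dead, err := HandleWithDLQ("tap.events.created", []byte(`{"id":"1"}`), 3, alwaysFails, publish)
	fmt.Println(dead, err)
}
```

The sketch retries in-process with no backoff. A real implementation would more likely rely on JetStream redelivery and only publish to the DLQ once the delivery limit is reached; that choice is left open here.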

2. Schema validation middleware for natsbus

Add optional validation that checks published CloudEvents against their declared dataschema. For proto-backed events, validate against the generated proto descriptor. On validation failure, log and increment a metric; optionally reject the event.
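
One possible shape for this middleware, sketched with the standard library only. `Registry`, `Validator`, `Mode`, `StrictJSON`, and the schema `urn:cerebro:events:v1:example.created` are hypothetical and chosen for illustration. The strict JSON decode is a stand-in: for proto-backed events the validator would check against the generated descriptor instead.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"log"
)

// Event holds only the CloudEvents attributes this sketch needs.
type Event struct {
	Type       string          `json:"type"`
	DataSchema string          `json:"dataschema"`
	Data       json.RawMessage `json:"data"`
}

// Validator checks a payload against one schema.
type Validator func(data []byte) error

// Mode selects between "log/metric only" and "reject".
type Mode int

const (
	LogOnly Mode = iota
	Reject
)

var ErrUnknownSchema = errors.New("no validator registered for dataschema")

// Registry maps dataschema values to validators. Not safe for concurrent use.
type Registry struct {
	mode       Mode
	validators map[string]Validator
	failures   int // stand-in for a real metric
}

func NewRegistry(mode Mode) *Registry {
	return &Registry{mode: mode, validators: make(map[string]Validator)}
}

func (r *Registry) Register(schema string, v Validator) { r.validators[schema] = v }

// Check validates e.Data against e.DataSchema. Failures are always logged and
// counted; an error is returned only in Reject mode.
func (r *Registry) Check(e Event) error {
	err := ErrUnknownSchema
	if v, ok := r.validators[e.DataSchema]; ok {
		err = v(e.Data)
	}
	if err == nil {
		return nil
	}
	r.failures++
	log.Printf("schema validation failed: type=%s dataschema=%s err=%v", e.Type, e.DataSchema, err)
	if r.mode == Reject {
		return fmt.Errorf("schema validation failed for %s: %w", e.DataSchema, err)
	}
	return nil
}

// StrictJSON returns a Validator that decodes into T and rejects unknown fields.
func StrictJSON[T any]() Validator {
	return func(data []byte) error {
		dec := json.NewDecoder(bytes.NewReader(data))
		dec.DisallowUnknownFields()
		var v T
		return dec.Decode(&v)
	}
}

// Hypothetical schema and payload type used by the demo.
const demoSchema = "urn:cerebro:events:v1:example.created"

type exampleCreated struct {
	ID string `json:"id"`
}

func main() {
	r := NewRegistry(Reject)
	r.Register(demoSchema, StrictJSON[exampleCreated]())
	good := Event{Type: "example.created", DataSchema: demoSchema, Data: json.RawMessage(`{"id":"1"}`)}
	bad := Event{Type: "example.created", DataSchema: demoSchema, Data: json.RawMessage(`{"id":"1","extra":true}`)}
	fmt.Println(r.Check(good))
	fmt.Println(r.Check(bad))
}
```

In this sketch an event with an unregistered dataschema counts as a validation failure. Whether unknown schemas should pass or fail is a policy decision for the real middleware.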

3. Event catalog in evalops/proto

Extend the proto repo with an events/ directory that declares:

  • Event type constants
  • Subject namespace conventions
  • Producer/consumer mapping
  • Proto message type for each event's data field

This becomes the single source of truth for "what events exist in the mesh" — complementing the API contract definitions that already live there.
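
To make the catalog contents concrete, here is a rough sketch of a catalog entry and the consistency checks a registry could enforce. The `Entry` and `Catalog` types, the event types, subjects, and proto message names are all invented for illustration; the actual layout of the events/ directory is still to be decided.

```go
package main

import (
	"fmt"
	"sort"
)

// Entry describes one event in the mesh (hypothetical shape).
type Entry struct {
	Type         string   // CloudEvents type
	Subject      string   // NATS subject the event is published on
	ProtoMessage string   // fully qualified proto message for the data field
	Producers    []string // services that publish the event
	Consumers    []string // services that subscribe to it
}

// Catalog indexes entries by event type.
type Catalog struct {
	byType map[string]Entry
}

// NewCatalog rejects incomplete entries and duplicate event types.
func NewCatalog(entries ...Entry) (*Catalog, error) {
	c := &Catalog{byType: make(map[string]Entry, len(entries))}
	for _, e := range entries {
		if e.Type == "" || e.ProtoMessage == "" {
			return nil, fmt.Errorf("entry %q: type and proto message are required", e.Type)
		}
		if len(e.Producers) == 0 {
			return nil, fmt.Errorf("entry %q: at least one producer is required", e.Type)
		}
		if _, dup := c.byType[e.Type]; dup {
			return nil, fmt.Errorf("duplicate event type %q", e.Type)
		}
		c.byType[e.Type] = e
	}
	return c, nil
}

// ConsumedBy returns the sorted event types a service consumes.
func (c *Catalog) ConsumedBy(service string) []string {
	var types []string
	for t, e := range c.byType {
		for _, s := range e.Consumers {
			if s == service {
				types = append(types, t)
				break
			}
		}
	}
	sort.Strings(types)
	return types
}

// demoEntries are made-up examples, not real events from any service.
var demoEntries = []Entry{
	{
		Type:         "example.lead.created",
		Subject:      "events.example.lead.created",
		ProtoMessage: "evalops.events.v1.ExampleLeadCreated",
		Producers:    []string{"tap"},
		Consumers:    []string{"pipeline"},
	},
	{
		Type:         "example.deal.scored",
		Subject:      "events.example.deal.scored",
		ProtoMessage: "evalops.events.v1.ExampleDealScored",
		Producers:    []string{"pipeline"},
		Consumers:    []string{"ensemble"},
	},
}

func main() {
	c, err := NewCatalog(demoEntries...)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.ConsumedBy("pipeline"))
}
```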

4. DLQ monitoring

Add a Prometheus metric in natsbus for DLQ publishes (evalops_natsbus_dlq_total{subject, error_class}). Wire to Grafana alerting alongside the existing Tempo dashboard.
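
The `error_class` label needs a small, bounded set of values to keep series cardinality under control. Below is a rough sketch of one way to derive it, with an in-memory counter standing in for a Prometheus counter vector so the example stays standard-library only. The class names (`timeout`, `canceled`, `decode`, `unknown`), `ErrorClass`, and `DLQCounter` are assumptions for illustration, not part of natsbus.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// MetricName is the metric proposed in this issue.
const MetricName = "evalops_natsbus_dlq_total"

// ErrorClass maps an error to a small fixed set of label values.
func ErrorClass(err error) string {
	var syn *json.SyntaxError
	var typ *json.UnmarshalTypeError
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, context.Canceled):
		return "canceled"
	case errors.As(err, &syn), errors.As(err, &typ):
		return "decode"
	default:
		return "unknown"
	}
}

type labels struct{ subject, errorClass string }

// DLQCounter is an in-memory stand-in for a labeled Prometheus counter.
type DLQCounter struct {
	mu     sync.Mutex
	counts map[labels]uint64
}

// Inc records one DLQ publish for the subject, classified by err.
func (c *DLQCounter) Inc(subject string, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts == nil {
		c.counts = make(map[labels]uint64)
	}
	c.counts[labels{subject, ErrorClass(err)}]++
}

// Value returns the current count for one label combination.
func (c *DLQCounter) Value(subject, errorClass string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labels{subject, errorClass}]
}

// errDemoTimeout is a wrapped deadline error used by the demo.
var errDemoTimeout = fmt.Errorf("demo handler: %w", context.DeadlineExceeded)

func main() {
	var c DLQCounter
	c.Inc("tap.events.created", errDemoTimeout)
	fmt.Printf("%s{subject=%q,error_class=%q} %d\n",
		MetricName, "tap.events.created", "timeout", c.Value("tap.events.created", "timeout"))
}
```

With a real Prometheus client, `Inc` would become an increment on a counter vector labeled by `subject` and `error_class`; the classification function is the part that carries over unchanged.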

Why this matters

The event bus is the nervous system of the three loops. Revenue operations (Loop 3) depend on events flowing correctly from Tap → Pipeline → Ensemble. Agent delegation (Loop 2) will depend on Registry lifecycle events once wired. Without governance, the event bus is a trust-based system in a platform that sells verified trust.

Context

Identified during org-wide architecture review (2026-04-12). Related: evalops/proto#5 (cross-service contract tests), evalops/deploy#4 (NATS clustering).
