Problem
NATS JetStream carries CloudEvents between services, but there is no mesh-wide event governance:
No centralized DLQ strategy
- Cerebro has its own file-backed outbox with DLQ and replay
- Ensemble-Tap has its own DLQ
- Other services: unknown. What happens when Pipeline's consumer fails, or when Ensemble can't process an event? Each service either handles this independently or doesn't handle it at all.
No runtime schema validation
Events are self-describing CloudEvents with a dataschema field (e.g., urn:cerebro:events:v1:<type>), but nothing validates that the payload matches the declared schema at publish or consume time. A publisher can emit malformed events and the consumer discovers the problem at deserialize time — or silently mishandles the data.
No event catalog
- Cerebro defines ~60 event types in internal/webhooks/webhooks.go
- Pipeline defines its own event types
- Ensemble defines its own
- There is no central registry of "these are the events in the mesh, these are their schemas, these are their producers and consumers"
- The proto repo covers API contracts but not event contracts
No dead letter alerting
If events accumulate in a DLQ (where DLQs exist) or silently drop (where they don't), there is no mesh-wide signal.
Proposed approach
1. natsbus DLQ convention
Add a standard DLQ subject pattern to natsbus: when a consumer fails to process an event after N retries, publish to dlq.<original-subject> with the original event + error metadata. This gives every service using natsbus a consistent failure path without custom implementation.
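A minimal sketch of how the convention could behave, assuming a hypothetical Publisher interface in place of a real JetStream publish call. The Event and DeadLetter shapes, the ProcessWithDLQ helper, and the example subject are illustrative assumptions, not existing natsbus API.

```go
package main

import (
	"errors"
	"fmt"
)

// Event stands in for a CloudEvent received from the bus (hypothetical shape).
type Event struct {
	ID      string
	Subject string // subject the event was originally published on
	Data    []byte
}

// DeadLetter carries the original event plus error metadata.
type DeadLetter struct {
	Original Event
	Error    string
	Attempts int
}

// Publisher is a hypothetical seam; a real one would wrap a JetStream publish.
type Publisher interface {
	Publish(subject string, dl DeadLetter) error
}

// DLQSubject applies the dlq.<original-subject> convention.
func DLQSubject(subject string) string { return "dlq." + subject }

// ProcessWithDLQ runs handle up to maxAttempts times. If every attempt fails,
// it publishes the event and the last error to the DLQ subject.
func ProcessWithDLQ(pub Publisher, ev Event, maxAttempts int, handle func(Event) error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = handle(ev); lastErr == nil {
			return nil // processed successfully, nothing to dead-letter
		}
	}
	dl := DeadLetter{Original: ev, Error: lastErr.Error(), Attempts: maxAttempts}
	if err := pub.Publish(DLQSubject(ev.Subject), dl); err != nil {
		return fmt.Errorf("dead-letter publish failed: %w", err)
	}
	return nil
}

// memPublisher records dead letters in memory for the demo.
type memPublisher struct{ sent map[string][]DeadLetter }

func (m *memPublisher) Publish(subject string, dl DeadLetter) error {
	m.sent[subject] = append(m.sent[subject], dl)
	return nil
}

// errDemo simulates a handler that always fails.
var errDemo = errors.New("handler failed")

func main() {
	pub := &memPublisher{sent: map[string][]DeadLetter{}}
	ev := Event{ID: "1", Subject: "events.example.created"}
	_ = ProcessWithDLQ(pub, ev, 3, func(Event) error { return errDemo })
	for subject, letters := range pub.sent {
		fmt.Println(subject, len(letters), letters[0].Error)
	}
}
```

In this sketch the helper returns nil once the event has been dead-lettered, so the caller can acknowledge the original message. A failure to publish to the DLQ is returned as an error so that the message is not silently dropped.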
2. Schema validation middleware for natsbus
Add optional middleware that validates published CloudEvents against their declared dataschema. For proto-backed events, validate against the generated proto descriptor. On validation failure, log and emit a metric; optionally reject the event.
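The observe-or-reject behaviour could look roughly like the sketch below. It assumes a registry of validators keyed by dataschema URN. RequireFields is a JSON stand-in for proto descriptor validation, and the Middleware type, the Failures counter, and the urn:cerebro:events:v1:example schema are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// CloudEvent holds only the attributes this sketch needs.
type CloudEvent struct {
	Type       string          `json:"type"`
	DataSchema string          `json:"dataschema"`
	Data       json.RawMessage `json:"data"`
}

// Validator checks a payload against one schema.
type Validator func(data []byte) error

// RequireFields is a stand-in validator: the payload must be a JSON object
// containing every named field. A proto-backed validator would go here instead.
func RequireFields(fields ...string) Validator {
	return func(data []byte) error {
		var obj map[string]json.RawMessage
		if err := json.Unmarshal(data, &obj); err != nil {
			return fmt.Errorf("payload is not a JSON object: %w", err)
		}
		for _, f := range fields {
			if _, ok := obj[f]; !ok {
				return fmt.Errorf("missing required field %q", f)
			}
		}
		return nil
	}
}

// Middleware validates events by their declared dataschema.
type Middleware struct {
	Schemas  map[string]Validator // keyed by dataschema URN
	Reject   bool                 // false: observe only; true: return the error
	Failures int                  // stand-in for a validation-failure metric
}

// Check validates ev. Failures are always logged and counted; they are
// returned to the caller only when Reject is set.
func (m *Middleware) Check(ev CloudEvent) error {
	var err error
	if v, ok := m.Schemas[ev.DataSchema]; ok {
		err = v(ev.Data)
	} else {
		err = fmt.Errorf("unknown dataschema %q", ev.DataSchema)
	}
	if err == nil {
		return nil
	}
	m.Failures++
	log.Printf("schema validation failed: type=%s dataschema=%s err=%v", ev.Type, ev.DataSchema, err)
	if m.Reject {
		return err
	}
	return nil
}

func main() {
	const schema = "urn:cerebro:events:v1:example" // hypothetical event type
	m := &Middleware{Schemas: map[string]Validator{schema: RequireFields("id")}, Reject: true}
	bad := CloudEvent{Type: "example", DataSchema: schema, Data: json.RawMessage(`{"name":"x"}`)}
	fmt.Println(m.Check(bad))
}
```

Treating an unregistered dataschema as a failure is a choice made for this sketch. Running with Reject set to false first would let a service measure how many events fail before turning on rejection.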
3. Event catalog in evalops/proto
Extend the proto repo with an events/ directory that declares:
- Event type constants
- Subject namespace conventions
- Producer/consumer mapping
- Proto message type for each event's data field
This becomes the single source of truth for "what events exist in the mesh" — complementing the API contract definitions that already live there.
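To make the shape of a catalog entry concrete, here is a sketch of what the generated or hand-written Go side of events/ might hold, together with a small lint pass. Every event type, subject, proto message name, and the evalops. subject prefix below is a hypothetical placeholder; only the service names come from this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// Entry describes one event in the mesh.
type Entry struct {
	Type         string   // event type constant
	Subject      string   // NATS subject the event is published on
	ProtoMessage string   // fully qualified proto message for the data field
	Producers    []string // services that publish the event
	Consumers    []string // services that subscribe to it
}

// exampleCatalog uses placeholder names throughout.
var exampleCatalog = []Entry{
	{
		Type:         "tap.example.received",
		Subject:      "evalops.tap.example.received",
		ProtoMessage: "evalops.events.v1.ExampleReceived",
		Producers:    []string{"ensemble-tap"},
		Consumers:    []string{"pipeline"},
	},
	{
		Type:         "pipeline.example.updated",
		Subject:      "evalops.pipeline.example.updated",
		ProtoMessage: "evalops.events.v1.ExampleUpdated",
		Producers:    []string{"pipeline"},
		Consumers:    []string{"ensemble"},
	},
}

// Lint returns one error per problem found: duplicate types, subjects outside
// the namespace, missing proto messages, and events with no declared producer.
func Lint(catalog []Entry, subjectPrefix string) []error {
	var errs []error
	seen := map[string]bool{}
	for _, e := range catalog {
		if seen[e.Type] {
			errs = append(errs, fmt.Errorf("%s: duplicate event type", e.Type))
		}
		seen[e.Type] = true
		if !strings.HasPrefix(e.Subject, subjectPrefix) {
			errs = append(errs, fmt.Errorf("%s: subject %q is outside namespace %q", e.Type, e.Subject, subjectPrefix))
		}
		if e.ProtoMessage == "" {
			errs = append(errs, fmt.Errorf("%s: no proto message declared", e.Type))
		}
		if len(e.Producers) == 0 {
			errs = append(errs, fmt.Errorf("%s: no producer declared", e.Type))
		}
	}
	return errs
}

func main() {
	for _, err := range Lint(exampleCatalog, "evalops.") {
		fmt.Println(err)
	}
	fmt.Println("entries checked:", len(exampleCatalog))
}
```

A lint step like this could run in CI for the proto repo, so that an event cannot be added without a subject, a message type, and at least one producer.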
4. DLQ monitoring
Add a Prometheus metric in natsbus for DLQ publishes (evalops_natsbus_dlq_total{subject, error_class}). Wire to Grafana alerting alongside the existing Tempo dashboard.
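The counter semantics can be sketched with the standard library alone. DLQCounter below is a hypothetical stand-in for a Prometheus counter vector with subject and error_class labels; a real natsbus implementation would more likely use the Prometheus Go client. The subject and error class values are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

const metricName = "evalops_natsbus_dlq_total"

// series identifies one label combination.
type series struct{ Subject, ErrorClass string }

// DLQCounter is a stdlib-only stand-in for a Prometheus counter vector.
type DLQCounter struct {
	mu     sync.Mutex
	counts map[series]uint64
}

func NewDLQCounter() *DLQCounter {
	return &DLQCounter{counts: map[series]uint64{}}
}

// Inc records one DLQ publish for the given labels.
func (c *DLQCounter) Inc(subject, errorClass string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[series{subject, errorClass}]++
}

// Value returns the current count for the given labels.
func (c *DLQCounter) Value(subject, errorClass string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[series{subject, errorClass}]
}

// Expose renders the counter in the Prometheus text exposition format,
// sorted by labels so the output is stable.
func (c *DLQCounter) Expose() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]series, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Subject != keys[j].Subject {
			return keys[i].Subject < keys[j].Subject
		}
		return keys[i].ErrorClass < keys[j].ErrorClass
	})
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", metricName)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{subject=%q,error_class=%q} %d\n", metricName, k.Subject, k.ErrorClass, c.counts[k])
	}
	return b.String()
}

func main() {
	c := NewDLQCounter()
	c.Inc("events.example.created", "decode")
	c.Inc("events.example.created", "decode")
	c.Inc("events.example.created", "timeout")
	fmt.Print(c.Expose())
}
```

One design point worth checking before adopting the subject label: if subjects embed per-entity identifiers, the label cardinality could grow without bound, so the label may need to be the subject pattern instead of the concrete subject.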
Why this matters
The event bus is the nervous system of the three loops. Revenue operations (Loop 3) depend on events flowing correctly from Tap → Pipeline → Ensemble. Agent delegation (Loop 2) will depend on Registry lifecycle events once wired. Without governance, the event bus is a trust-based system in a platform that sells verified trust.
Context
Identified during org-wide architecture review (2026-04-12). Related: evalops/proto#5 (cross-service contract tests), evalops/deploy#4 (NATS clustering).