Add event governance layer: centralized DLQ strategy, schema validation, event catalog (#34)

## Problem

NATS JetStream carries CloudEvents between services, but there is no mesh-wide event governance:

### No centralized DLQ strategy
- Cerebro has its own file-backed outbox with DLQ and replay
- Ensemble-Tap has its own DLQ
- Other services: unknown. What happens when Pipeline's consumer fails, or when Ensemble can't process an event? Each service either handles failure independently or does not handle it at all.

### No runtime schema validation
Events are self-describing CloudEvents with a `dataschema` field (e.g., `urn:cerebro:events:v1:<type>`), but nothing validates that the payload matches the declared schema at publish or consume time. A publisher can emit malformed events, and the consumer either discovers the problem only at deserialization time or silently mishandles the data.

### No event catalog
- Cerebro defines ~60 event types in `internal/webhooks/webhooks.go`
- Pipeline defines its own event types
- Ensemble defines its own
- There is no central registry of "these are the events in the mesh, these are their schemas, these are their producers and consumers"
- The proto repo covers API contracts but not event contracts

### No dead-letter alerting
If events accumulate in a DLQ (where DLQs exist) or are silently dropped (where they don't), nothing raises a mesh-wide signal.

## Proposed approach

### 1. `natsbus` DLQ convention
Add a standard DLQ subject pattern to `natsbus`: when a consumer fails to process an event after N retries, `natsbus` publishes to `dlq.<original-subject>` a message containing the original event plus error metadata. This gives every service using `natsbus` a consistent failure path without a custom implementation.
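
A minimal sketch of how this convention could look, not the actual `natsbus` API. `DLQEnvelope`, `Publisher`, `HandleWithDLQ`, and the `orders.created` subject are hypothetical names introduced for illustration. The in-process retry loop is a simplification: a JetStream-based implementation would more likely rely on redelivery limits than on looping inside the handler.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// DLQSubject maps an original subject to its dead-letter subject,
// following the proposed `dlq.<original-subject>` pattern.
func DLQSubject(subject string) string {
	return "dlq." + subject
}

// DLQEnvelope is a hypothetical wire format: the original event plus error metadata.
type DLQEnvelope struct {
	OriginalSubject string    `json:"original_subject"`
	Event           []byte    `json:"event"` // raw original event; encoding/json base64-encodes []byte
	Error           string    `json:"error"`
	Attempts        int       `json:"attempts"`
	FailedAt        time.Time `json:"failed_at"`
}

// Publisher stands in for the bus publish call so the sketch needs no NATS client.
type Publisher func(subject string, data []byte) error

// HandleWithDLQ runs handler up to maxAttempts times. If every attempt fails,
// it publishes a DLQEnvelope to dlq.<subject> and returns true.
// The returned error is non-nil only when dead-lettering itself fails.
func HandleWithDLQ(subject string, event []byte, maxAttempts int,
	handler func([]byte) error, publish Publisher) (bool, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = handler(event); lastErr == nil {
			return false, nil // processed successfully, nothing to dead-letter
		}
	}
	payload, err := json.Marshal(DLQEnvelope{
		OriginalSubject: subject,
		Event:           event,
		Error:           lastErr.Error(),
		Attempts:        maxAttempts,
		FailedAt:        time.Now().UTC(),
	})
	if err != nil {
		return false, fmt.Errorf("encode DLQ envelope: %w", err)
	}
	if err := publish(DLQSubject(subject), payload); err != nil {
		return false, fmt.Errorf("publish to %s: %w", DLQSubject(subject), err)
	}
	return true, nil
}

// errAlwaysFails simulates a handler that can never process the event.
var errAlwaysFails = errors.New("handler failed")

func main() {
	publish := func(subject string, data []byte) error {
		fmt.Printf("published %d bytes to %s\n", len(data), subject)
		return nil
	}
	failing := func([]byte) error { return errAlwaysFails }

	deadLettered, err := HandleWithDLQ("orders.created", []byte(`{"id":"1"}`), 3, failing, publish)
	fmt.Println(deadLettered, err)
}
```

Keeping the original event as opaque bytes means a malformed payload can still be dead-lettered and replayed later, which matters because malformed payloads are one of the likelier reasons a consumer fails.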

### 2. Schema validation middleware for `natsbus`
Add optional middleware that validates published CloudEvents against their declared `dataschema`. For proto-backed events, validate against the generated proto descriptor. On validation failure, log and emit a metric; optionally reject the event.
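
One way the middleware could be shaped, sketched under assumptions: `Registry`, `Validator`, `WithValidation`, and the `example.created` event type are hypothetical and not part of `natsbus`. The validator here only checks for required JSON fields; a proto-backed validator would instead decode the payload against the generated descriptor.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Event holds only the CloudEvents attributes this sketch needs.
type Event struct {
	Type       string          `json:"type"`
	DataSchema string          `json:"dataschema"`
	Data       json.RawMessage `json:"data"`
}

// Validator checks an event's data payload against one schema.
type Validator func(data []byte) error

// Registry maps a dataschema URI to its validator (hypothetical).
type Registry map[string]Validator

// ErrUnknownSchema is returned when no validator is registered for a dataschema.
var ErrUnknownSchema = errors.New("no validator registered for dataschema")

// Validate looks up the validator for the event's dataschema and runs it.
func (r Registry) Validate(ev Event) error {
	v, ok := r[ev.DataSchema]
	if !ok {
		return fmt.Errorf("%w: %q", ErrUnknownSchema, ev.DataSchema)
	}
	return v(ev.Data)
}

// PublishFunc stands in for the bus publish call.
type PublishFunc func(subject string, ev Event) error

// WithValidation wraps next. On a validation failure it calls onInvalid
// (the log/metric hook); when reject is true the event is not published.
func WithValidation(r Registry, reject bool, onInvalid func(Event, error), next PublishFunc) PublishFunc {
	return func(subject string, ev Event) error {
		if err := r.Validate(ev); err != nil {
			onInvalid(ev, err)
			if reject {
				return fmt.Errorf("event %q rejected: %w", ev.Type, err)
			}
		}
		return next(subject, ev)
	}
}

// requireFields is a stand-in validator: data must be a JSON object
// that contains every listed field.
func requireFields(fields ...string) Validator {
	return func(data []byte) error {
		var obj map[string]json.RawMessage
		if err := json.Unmarshal(data, &obj); err != nil {
			return fmt.Errorf("data is not a JSON object: %w", err)
		}
		for _, f := range fields {
			if _, ok := obj[f]; !ok {
				return fmt.Errorf("missing required field %q", f)
			}
		}
		return nil
	}
}

func main() {
	const schema = "urn:cerebro:events:v1:example.created" // hypothetical event type
	reg := Registry{schema: requireFields("id")}
	logInvalid := func(ev Event, err error) { fmt.Println("invalid event:", ev.Type, err) }
	publish := func(subject string, ev Event) error { fmt.Println("published to", subject); return nil }

	strict := WithValidation(reg, true, logInvalid, publish)
	bad := Event{Type: "example.created", DataSchema: schema, Data: []byte(`{}`)}
	fmt.Println(strict("example.subject", bad))
}
```

The `reject` flag mirrors the "log and metric first, optionally reject" rollout: services can run in observe-only mode until their publishers are clean, then switch to strict mode.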

### 3. Event catalog in `evalops/proto`
Extend the proto repo with an `events/` directory that declares:
- Event type constants
- Subject namespace conventions
- Producer/consumer mapping
- Proto message type for each event's `data` field

This becomes the single source of truth for "what events exist in the mesh" — complementing the API contract definitions that already live there.
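
To make the four declared items concrete, here is a rough Go sketch of what a catalog entry and a consistency check might carry. The real catalog would live as proto definitions in `evalops/proto`; the `Entry` type, the `evalops.events.` subject prefix, and the event and message names below are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// SubjectPrefix is a hypothetical subject namespace convention for mesh events.
const SubjectPrefix = "evalops.events."

// Entry declares one event: its type constant, subject, payload message,
// and which services produce and consume it.
type Entry struct {
	Type        string
	Subject     string
	DataMessage string // proto message used for the CloudEvent `data` field
	Producers   []string
	Consumers   []string
}

// Check returns one error per catalog rule violation.
func Check(entries []Entry) []error {
	var errs []error
	seen := make(map[string]bool)
	for _, e := range entries {
		if seen[e.Type] {
			errs = append(errs, fmt.Errorf("duplicate event type %q", e.Type))
		}
		seen[e.Type] = true
		if !strings.HasPrefix(e.Subject, SubjectPrefix) {
			errs = append(errs, fmt.Errorf("%s: subject %q is outside %q", e.Type, e.Subject, SubjectPrefix))
		}
		if e.DataMessage == "" {
			errs = append(errs, fmt.Errorf("%s: no proto message declared for data", e.Type))
		}
		if len(e.Producers) == 0 {
			errs = append(errs, fmt.Errorf("%s: no producer declared", e.Type))
		}
	}
	return errs
}

// ConsumersOf lists the services declared as consumers of an event type.
func ConsumersOf(entries []Entry, eventType string) []string {
	for _, e := range entries {
		if e.Type == eventType {
			return e.Consumers
		}
	}
	return nil
}

func main() {
	// Hypothetical entry; event and message names are placeholders.
	catalog := []Entry{{
		Type:        "example.lead.captured",
		Subject:     SubjectPrefix + "example.lead.captured",
		DataMessage: "evalops.events.v1.ExampleLeadCaptured",
		Producers:   []string{"ensemble-tap"},
		Consumers:   []string{"pipeline"},
	}}
	fmt.Println("violations:", len(Check(catalog)))
	fmt.Println("consumers:", ConsumersOf(catalog, "example.lead.captured"))
}
```

A check like this could run in CI for the proto repo, so that an event without a declared producer or payload type fails review rather than surfacing at runtime.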

### 4. DLQ monitoring
Add a Prometheus counter in `natsbus` for DLQ publishes (`evalops_natsbus_dlq_total{subject, error_class}`). Wire it to Grafana alerting alongside the existing Tempo dashboard.
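
The `error_class` label only works as a metric label if it stays low-cardinality, so raw error strings need to be mapped onto a small fixed set. The sketch below shows one possible classifier plus a standard-library stand-in for the counter. The class names and the `DLQCounter` type are assumptions; a real implementation would presumably use a Prometheus client counter vector with the same two labels.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// ErrorClass maps an error onto a small, fixed set of label values so the
// error_class label cannot grow without bound. The classes are hypothetical.
func ErrorClass(err error) string {
	var syntaxErr *json.SyntaxError
	var typeErr *json.UnmarshalTypeError
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.As(err, &syntaxErr), errors.As(err, &typeErr):
		return "decode"
	default:
		return "unknown"
	}
}

// labels is the label pair of evalops_natsbus_dlq_total.
type labels struct{ subject, errorClass string }

// DLQCounter is an in-memory stand-in for a labelled Prometheus counter.
type DLQCounter struct {
	mu     sync.Mutex
	counts map[labels]uint64
}

// NewDLQCounter returns an empty counter.
func NewDLQCounter() *DLQCounter {
	return &DLQCounter{counts: make(map[labels]uint64)}
}

// Inc records one DLQ publish for the subject, classified by err.
func (c *DLQCounter) Inc(subject string, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labels{subject, ErrorClass(err)}]++
}

// Value returns the current count for one label combination.
func (c *DLQCounter) Value(subject, errorClass string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labels{subject, errorClass}]
}

// errDemoTimeout wraps a deadline error the way a handler might.
var errDemoTimeout = fmt.Errorf("fetch upstream: %w", context.DeadlineExceeded)

func main() {
	counter := NewDLQCounter()
	counter.Inc("orders.created", errDemoTimeout) // hypothetical subject

	var v map[string]any
	decodeErr := json.Unmarshal([]byte("{"), &v) // truncated JSON yields a syntax error
	counter.Inc("orders.created", decodeErr)

	fmt.Println("timeout:", counter.Value("orders.created", "timeout"))
	fmt.Println("decode:", counter.Value("orders.created", "decode"))
}
```

The `subject` label carries the same cardinality risk if subjects embed identifiers; in that case the label would need to be the subject pattern rather than the concrete subject.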

## Why this matters

The event bus is the nervous system of the three loops. Revenue operations (Loop 3) depend on events flowing correctly from Tap → Pipeline → Ensemble. Agent delegation (Loop 2) will depend on Registry lifecycle events once wired. Without governance, the event bus is a trust-based system in a platform that sells verified trust.

## Context

Identified during org-wide architecture review (2026-04-12). Related: evalops/proto#5 (cross-service contract tests), evalops/deploy#4 (NATS clustering).
