feat(outbox): harden relay with HA, dead-lettering, SLO alerts #4

Merged
AlpNuhoglu merged 2 commits into main from feat/outbox-relay-hardening on Jun 20, 2026

@AlpNuhoglu

Owner

Make the transactional outbox relay production-resilient and remove it as a single point of failure between persistence and event transport. Three resilience features plus a coupling cleanup, all behaviour-preserving by default (new behaviour is opt-in via OUTBOX_MAX_ATTEMPTS).

Multi-replica HA:

  • Add deployments/k8s/15-outbox-relay.yaml: 3 replicas + PodDisruptionBudget (minAvailable: 2). Replicas are safe by construction — RunBatch already claims rows with FOR UPDATE SKIP LOCKED — so node drains and rollouts never stall publishing.

Dead-letter path (poison-row backstop):

  • A row that fails to publish OUTBOX_MAX_ATTEMPTS times is moved to FAILED (dead-lettered) in the same SQL statement that bumps attempt_count, so it drops out of the PENDING poll and cannot wedge a worker. Setting OUTBOX_MAX_ATTEMPTS=0 retries forever (the original behaviour). Also adds a last_error column and the gamemesh_outbox_events_dead_lettered_total counter.
  • migrations/0004_outbox_dead_letter.{up,down}.sql: relax the status CHECK to allow FAILED, add last_error, add a partial index over FAILED rows.
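
The dead-letter decision described above can be sketched as a single UPDATE with a CASE expression, mirrored by a small pure function. This is an illustration under assumptions: the table name `outbox_events`, the parameter order, and the `nextStatus` helper are hypothetical and not taken from the PR's code.

```go
package main

import "fmt"

// Status names used by this sketch; PENDING and FAILED come from the PR text.
const (
	statusPending = "PENDING"
	statusFailed  = "FAILED"
)

// recordFailureSQL is a hypothetical statement: it bumps attempt_count,
// stores the error, and flips the row to FAILED in the same UPDATE once
// the limit is reached. $3 is the max-attempts setting; 0 disables
// dead-lettering. In an UPDATE, attempt_count on the right-hand side
// still refers to the value before the increment.
const recordFailureSQL = `
UPDATE outbox_events
SET attempt_count = attempt_count + 1,
    last_error    = $2,
    status = CASE
        WHEN $3 > 0 AND attempt_count + 1 >= $3 THEN 'FAILED'
        ELSE status
    END
WHERE id = $1`

// nextStatus mirrors the CASE expression for a PENDING row:
// attemptsSoFar is the stored count before this failure is recorded.
func nextStatus(attemptsSoFar, maxAttempts int) string {
	if maxAttempts > 0 && attemptsSoFar+1 >= maxAttempts {
		return statusFailed
	}
	return statusPending
}

func main() {
	// Walk one poison row through repeated failures with a limit of 3.
	for attempts := 0; attempts < 3; attempts++ {
		fmt.Printf("failure #%d -> %s\n", attempts+1, nextStatus(attempts, 3))
	}
}
```

Doing the transition inside the same statement avoids a window in which the counter has reached the limit but the row is still PENDING and could be polled again.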

Lag-based SLO alerting:

  • config/prometheus/alerts.yml + k8s prometheus-config: OutboxEventsDeadLettered (critical), OutboxBacklogGrowing (warning), OutboxPublishStalled (critical). Wired via rule_files in both compose and k8s. Verified with promtool and a live Prometheus load.

Interface cleanup:

  • Narrow the relay's bus dependency to a one-method BusPublisher (just Publish), shrinking its coupling to the event bus.
  • Centralize async trace-context propagation in tracing.InjectCarrier / ResumeFromCarrier; collapse the five duplicated otel Inject/Extract sites in publisher.go, relay.go, nats.go and redis.go.

Tests: unit + Postgres-backed integration tests for the dead-letter path (real SQL CASE, FAILED status, last_error, poison row never re-polled). Full suite and integration suite pass; docs/outbox.md updated to the three-state model.

AlpNuhoglu and others added 2 commits June 20, 2026 01:12

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@AlpNuhoglu AlpNuhoglu merged commit c22c298 into main Jun 20, 2026
11 checks passed