feat(outbox): harden relay with HA, dead-lettering, SLO alerts #4
Merged
Make the transactional outbox relay production-resilient and remove it as
a single point of failure between persistence and event transport. Three
resilience features plus a coupling cleanup, all behaviour-preserving by
default (new behaviour is opt-in via OUTBOX_MAX_ATTEMPTS).
Multi-replica HA:
- Add deployments/k8s/15-outbox-relay.yaml: 3 replicas + PodDisruptionBudget
(minAvailable: 2). Replicas are safe by construction — RunBatch already
claims rows with FOR UPDATE SKIP LOCKED — so node drains and rollouts never
stall publishing.
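For reviewers unfamiliar with the claim pattern, here is a minimal sketch of how a batch claim with FOR UPDATE SKIP LOCKED might look using database/sql. It is not RunBatch itself: the table name outbox_events, the id column, the ordering, and the claimBatch helper are assumptions made for illustration.

```go
package main

import (
	"context"
	"database/sql"
)

// claimBatchSQL locks up to $1 PENDING rows for the current transaction.
// Rows already locked by another replica are skipped instead of waited on,
// which is why concurrent replicas do not block each other on the same rows.
// The table name and the id column are hypothetical.
const claimBatchSQL = `
SELECT id
FROM outbox_events
WHERE status = 'PENDING'
ORDER BY id
LIMIT $1
FOR UPDATE SKIP LOCKED`

// claimBatch returns the ids locked inside tx. The row locks last until tx
// commits or rolls back.
func claimBatch(ctx context.Context, tx *sql.Tx, limit int) ([]int64, error) {
	rows, err := tx.QueryContext(ctx, claimBatchSQL, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```

Two replicas running this in separate transactions would each receive a disjoint set of ids, since a row locked by one transaction is invisible to the other's claim.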
Dead-letter path (poison-row backstop):
- A row that fails to publish OUTBOX_MAX_ATTEMPTS times is moved to FAILED
(dead-lettered) in the same SQL statement that bumps attempt_count, so it
drops out of the PENDING poll and can never wedge a worker. Setting
OUTBOX_MAX_ATTEMPTS=0 retries forever (the original behaviour). Adds a
last_error column and the gamemesh_outbox_events_dead_lettered_total counter.
- migrations/0004_outbox_dead_letter.{up,down}.sql: relax the status CHECK to
allow FAILED, add last_error, add a partial index over FAILED rows.
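The single-statement transition described above could be sketched roughly as follows. This is an illustration, not the statement shipped in this PR: the table name outbox_events, the parameter order, and the nextStatus helper are assumptions. Only status, attempt_count, last_error and the FAILED/PENDING values come from the description.

```go
package main

// markFailedAttemptSQL is a hypothetical UPDATE that bumps attempt_count and
// dead-letters the row in one statement.
//   $1 = row id
//   $2 = error text stored in last_error
//   $3 = OUTBOX_MAX_ATTEMPTS (0 disables dead-lettering)
// In Postgres, expressions in SET see the pre-update row, so
// attempt_count + 1 is the count after this failure.
const markFailedAttemptSQL = `
UPDATE outbox_events
SET attempt_count = attempt_count + 1,
    last_error    = $2,
    status        = CASE
        WHEN $3::int > 0 AND attempt_count + 1 >= $3::int THEN 'FAILED'
        ELSE status
    END
WHERE id = $1`

// nextStatus mirrors the CASE expression in plain Go so the rule can be
// unit-tested without a database. priorAttempts is attempt_count before the
// current failure is recorded.
func nextStatus(current string, priorAttempts, maxAttempts int) string {
	if maxAttempts > 0 && priorAttempts+1 >= maxAttempts {
		return "FAILED"
	}
	return current
}
```

With maxAttempts set to 3, a row that has already failed twice moves to FAILED on its third failure. With maxAttempts set to 0 the status never changes, matching the retry-forever behaviour.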
Lag-based SLO alerting:
- config/prometheus/alerts.yml + k8s prometheus-config: OutboxEventsDeadLettered
(critical), OutboxBacklogGrowing (warning), OutboxPublishStalled (critical).
Wired via rule_files in both compose and k8s. Verified with promtool and a
live Prometheus load.
Interface cleanup:
- Narrow the relay's bus dependency to a one-method BusPublisher (just Publish),
shrinking its coupling to the event bus.
- Centralize async trace-context propagation in tracing.InjectCarrier /
ResumeFromCarrier; collapse the five duplicated otel Inject/Extract sites in
publisher.go, relay.go, nats.go and redis.go.
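To show what the narrowed dependency buys, here is a small sketch of a relay that depends only on a one-method BusPublisher, together with a fake used in place of a real bus. The Publish signature, the Relay struct, the publishRow method, the recordingBus fake, and the subject name are all assumptions for illustration. The PR only states that the interface has a single Publish method.

```go
package main

import (
	"context"
	"errors"
)

// BusPublisher is the one-method dependency. The parameter list here is an
// assumed shape, not the signature from the PR.
type BusPublisher interface {
	Publish(ctx context.Context, subject string, payload []byte) error
}

// Relay holds only what it needs from the event bus: the ability to publish.
type Relay struct {
	bus BusPublisher
}

// publishRow forwards one outbox row to the bus.
func (r *Relay) publishRow(ctx context.Context, subject string, payload []byte) error {
	if r.bus == nil {
		return errors.New("relay: no publisher configured")
	}
	return r.bus.Publish(ctx, subject, payload)
}

// recordingBus is a test fake that records every subject it is asked to publish.
type recordingBus struct {
	subjects []string
}

func (b *recordingBus) Publish(_ context.Context, subject string, _ []byte) error {
	b.subjects = append(b.subjects, subject)
	return nil
}

// publishedSubjects runs one publish through a Relay backed by the fake and
// returns what the fake observed.
func publishedSubjects() ([]string, error) {
	fake := &recordingBus{}
	relay := &Relay{bus: fake}
	err := relay.publishRow(context.Background(), "match.finished", []byte(`{"id":1}`))
	return fake.subjects, err
}
```

Because the interface has one method, a fake like recordingBus takes a few lines, and any transport that exposes a matching Publish satisfies it without an adapter.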
Tests: unit + Postgres-backed integration tests for the dead-letter path (real
SQL CASE, FAILED status, last_error, poison row never re-polled). Full suite
and integration suite pass; docs/outbox.md updated to the three-state model.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>