retrymq limit & DLQ

When a retry task's executor fails (e.g., the event is not found in logstore, or a transient error occurs), the message stays in the queue and becomes visible again after a fixed 30s visibility timeout. This cycle repeats indefinitely; there is no limit on how many times a message can be received.

## Problems

- A permanently failing retry message cycles forever with no cap
- No dead-letter path to detect or surface stuck messages
- Fixed visibility timeout on re-fetch failures, with no backoff between attempts

The underlying queue already tracks receive count and supports per-message visibility changes, so the primitives are there.

## Open questions

### Max receive count
What should the default be?

> **Suggestion:** 5 internal re-fetch attempts before giving up. This is separate from the delivery retry max limit, which controls how many times we re-deliver to the destination.

### Backoff on re-fetch
Should we apply exponential backoff on internal failures (e.g., 30s → 60s → 120s), or is a fixed interval fine since these are typically short-lived transient issues?

### What happens when max is exceeded
> **Suggestion:** Route to a DLQ. Gives observability into stuck messages and the ability to replay them.

### Configuration
> **Suggestion:** Expose these as `retrymq` config, similar to how `deliverymq` is configured, e.g., `RETRYMQ_MAX_RECEIVE_COUNT` and `RETRYMQ_VISIBILITY_TIMEOUT_SECONDS`.

retrymq limit & DLQ #663