retrymq limit & DLQ #663

@alexluong

When a retry task's executor fails (e.g., event not found in logstore, transient errors), the message stays in the queue and becomes visible again after a fixed 30s visibility timeout. This cycle repeats indefinitely.

Problems

  • A permanently failing retry message cycles forever with no cap
  • No dead-letter path to detect or surface stuck messages
  • Fixed visibility timeout on re-fetch failures, with no backoff between attempts

The underlying queue already tracks receive count and supports per-message visibility changes, so the primitives are there.

Open questions

Max receive count

What should the default be?

Suggestion: 5 internal re-fetch attempts before giving up. This is separate from the delivery retry max limit, which controls how many times we re-deliver to the destination.

Backoff on re-fetch

Should we apply exponential backoff on internal failures (e.g., 30s → 60s → 120s), or is a fixed interval fine since these are typically short-lived transient issues?

What happens when max is exceeded

Suggestion: Route to a DLQ. Gives observability into stuck messages and the ability to replay them.
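A rough sketch of how the consumer might decide between acking, retrying, and dead-lettering. The `Message` type, the `Action` values, and `decide` are all hypothetical, introduced here only to make the proposal concrete. It assumes the receive count includes the current attempt.

```go
package main

import "errors"

// Message is a hypothetical view of a retrymq message; only the receive
// count tracked by the underlying queue matters for this sketch.
type Message struct {
	ID           string
	ReceiveCount int
}

// Action is what the consumer does with a message after the executor runs.
type Action int

const (
	ActionAck        Action = iota // executor succeeded, delete the message
	ActionRetry                    // leave it to become visible again
	ActionDeadLetter               // give up and route to the DLQ
)

// errExec is a placeholder executor error used in the usage example.
var errExec = errors.New("executor failed")

// decide picks the action for a message given the executor result and the
// configured max receive count. Assumption: ReceiveCount includes the
// current attempt, so max=5 means the fifth failure goes to the DLQ.
func decide(msg Message, execErr error, maxReceiveCount int) Action {
	switch {
	case execErr == nil:
		return ActionAck
	case msg.ReceiveCount >= maxReceiveCount:
		return ActionDeadLetter
	default:
		return ActionRetry
	}
}
```

For example, `decide(Message{ID: "m1", ReceiveCount: 5}, errExec, 5)` returns `ActionDeadLetter`, while the same message at receive count 4 returns `ActionRetry`.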

Configuration

Suggestion: Expose as retrymq config, similar to how deliverymq is configured, for example RETRYMQ_MAX_RECEIVE_COUNT and RETRYMQ_VISIBILITY_TIMEOUT_SECONDS.
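A minimal sketch of loading those two settings, assuming plain environment variables. `RetryMQConfig` and `loadRetryMQConfig` are hypothetical and not tied to the existing config loader. The defaults are assumptions taken from this issue: 5 receives as suggested above, and the current fixed 30s timeout.

```go
package main

import (
	"fmt"
	"strconv"
)

// RetryMQConfig holds the two proposed settings (hypothetical struct).
type RetryMQConfig struct {
	MaxReceiveCount          int
	VisibilityTimeoutSeconds int
}

// positiveInt reads key via getenv, falling back to def when unset and
// rejecting values that are not integers >= 1.
func positiveInt(getenv func(string) string, key string, def int) (int, error) {
	raw := getenv(key)
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	if n < 1 {
		return 0, fmt.Errorf("%s must be >= 1, got %d", key, n)
	}
	return n, nil
}

// loadRetryMQConfig builds the config from an environment lookup function.
func loadRetryMQConfig(getenv func(string) string) (RetryMQConfig, error) {
	var cfg RetryMQConfig
	var err error
	// Default of 5 is the value suggested in this issue (assumption).
	if cfg.MaxReceiveCount, err = positiveInt(getenv, "RETRYMQ_MAX_RECEIVE_COUNT", 5); err != nil {
		return RetryMQConfig{}, err
	}
	// Default of 30 mirrors the current fixed visibility timeout.
	if cfg.VisibilityTimeoutSeconds, err = positiveInt(getenv, "RETRYMQ_VISIBILITY_TIMEOUT_SECONDS", 30); err != nil {
		return RetryMQConfig{}, err
	}
	return cfg, nil
}
```

In a real process this would be called as `loadRetryMQConfig(os.Getenv)`. Taking the lookup as a function keeps the sketch easy to test without touching the process environment.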
