26.4.1: monitor_consumer wedges on idle-in-transaction after UPDATE sentry_monitorcheckin, check-in ingestion stops org-wide #4301

@stumbaumr

Self-Hosted Version

26.4.1

CPU Architecture

x86_64

Docker Version

28.0.4

Docker Compose Version

v5.1.3

Machine Specification

  • My system meets the minimum system requirements of Sentry

Installation Type

Upgraded from 26.3.1

Steps to Reproduce

  1. Run 26.3.1 with ~50 active Crons monitors on mixed schedules (* * * * *, */5, */30, hourly).
  2. Upgrade to 26.4.1 via the standard self-hosted install.
  3. Let ingest-monitors run: the monitor consumer processes a few dozen messages, then wedges.

Expected Result

Monitor check-ins continue to ingest normally after the upgrade.

Actual Result

Roughly 15–60 seconds after ingest-monitors starts processing post-upgrade, it silently wedges:

  • Kafka consumer group ingest-monitors stays joined (valid consumer-id) but CURRENT-OFFSET stops advancing.
  • Relay continues returning HTTP 202 at the edge; Kafka lag grows; check-ins never land in the monitor store.
  • The healthcheck file /tmp/health.txt stops being touched, so Docker eventually marks the container unhealthy.
  • No MAXPOLL (max poll interval exceeded) error is logged until the configured --max-poll-interval-ms expires.
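
To distinguish "joined but stalled" from "slow but progressing", two snapshots of the consumer group taken some time apart can be compared. The following is a minimal sketch, not part of the original report. It assumes the CURRENT-OFFSET and LOG-END-OFFSET values from kafka-consumer-groups --describe have already been collected into dictionaries. The Snapshot type and the stalled_partitions helper are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Tuple

# (topic, partition) -> (CURRENT-OFFSET, LOG-END-OFFSET), as read by hand or
# by a script from `kafka-consumer-groups --describe`. Hypothetical structure.
Snapshot = Dict[Tuple[str, int], Tuple[int, int]]


def stalled_partitions(before: Snapshot, after: Snapshot) -> List[Tuple[str, int]]:
    """Return partitions whose committed offset did not move between the two
    snapshots although messages were waiting (lag > 0 in the later one)."""
    stalled = []
    for key, (committed_after, end_after) in sorted(after.items()):
        if key not in before:
            continue  # newly assigned partition: no baseline to compare with
        committed_before, _ = before[key]
        lag = end_after - committed_after
        if committed_after == committed_before and lag > 0:
            stalled.append(key)
    return stalled
```

A partition that shows up in the result while Relay is still accepting check-ins matches the symptom described above: the group membership is valid, but the committed offset is frozen and lag keeps growing.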

Two recurring failure signatures (we've seen both in the same incident, not sure yet if they're two distinct bugs or two manifestations of one):

  1. PG wedge. pg_stat_activity shows an open transaction in state idle in transaction with wait_event=ClientRead. Its last statement is:
    UPDATE "sentry_monitorcheckin" SET "date_in_progress" = '...' WHERE "sentry_monitorcheckin"."id" = <id>
    The row id differs each time. Immediately before going silent, the consumer logs:
    [INFO] sentry.utils.exceptions: No task state found in exception_grouping_context
  2. Non-PG wedge. No activity on PG at all (no active/idle-in-tx consumer connections). The Python process is blocked somewhere other than on the database. The log stops mid-run after a routine check_in_closed message.
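
For anyone trying to confirm which of the two signatures they are hitting, the PG wedge can be checked roughly as follows. This is a sketch under stated assumptions: the query uses standard pg_stat_activity columns, but the query text, the stuck_sessions helper, and the idle threshold are illustrative and not taken from the report. Rows are assumed to be fetched as dictionaries by whatever PostgreSQL client is at hand.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Illustrative query: lists sessions sitting idle inside an open transaction.
IDLE_IN_TX_SQL = """
SELECT pid, state, wait_event, xact_start, state_change, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change
"""


def stuck_sessions(rows: List[Dict], now: datetime, min_idle: timedelta) -> List[Dict]:
    """Keep sessions that have been idle in transaction for at least min_idle
    and whose last statement touched sentry_monitorcheckin."""
    out = []
    for row in rows:
        if row["state"] != "idle in transaction":
            continue
        if now - row["state_change"] < min_idle:
            continue  # too recent to count as wedged
        if "sentry_monitorcheckin" not in row["query"]:
            continue
        out.append(row)
    return out
```

An empty result while the consumer is silent would point to the non-PG signature instead.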

sentry-self-hosted-taskbroker-1 floods logs with:
WARN set_task_status: taskbroker::grpc::server: No pending activations
but this appears to be routine "nothing to do" noise, not an error.

Net effect: every active monitor flips to error with missed backfill; the system-wide Crons feature is effectively dead on the prod host.

Dev vs Prod comparison (this is the key data point)

Same 26.4.1 deployment on a second host (coreos-dev1-sentry1) handles check-ins fine — but that instance has 0 rows in sentry_monitorcheckin and near-zero organic traffic. Same images, same config, same override, same start.sh — the only difference is accumulated state on the broken host:

  • sentry_monitorcheckin: 1,272,808 rows
  • sentry_monitorincident: 2,013 rows
  • sentry_monitorenvironment: 149 rows post-upgrade (117 in status=4, 30 in status=5, 2 in status=0)

What we tried (none cleared the wedge)

  1. docker compose restart ingest-monitors ingest-occurrences — wedges again within minutes.
  2. kafka-consumer-groups --reset-offsets --to-latest for ingest-monitors and ingest-occurrences — gives temporary progress (seconds/minutes) before re-wedging at a different sentry_monitorcheckin.id.
  3. --max-poll-interval-ms 300000 via docker-compose.override.yml — still wedges within the window.
  4. --max-poll-interval-ms 1800000 — same, just longer wedge duration.
  5. State surgery, in order of aggressiveness:
    - TRUNCATE sentry_monitorcheckin, sentry_monitorincident RESTART IDENTITY CASCADE (also cascades to sentry_monitorenvbrokendetection) → wedge reproduces.
    - Same truncate plus UPDATE sentry_monitorenvironment SET status=0, last_checkin=NULL, next_checkin=NULL, next_checkin_latest=NULL WHERE status != 1 to match dev's "fresh install" status distribution → still wedges.

Surgical cleanup of the tables directly referenced in the wedging UPDATE is not sufficient to restore function. State that triggers the bug evidently lives somewhere else (Redis cache? ClickHouse monitors_* tables? workflow-engine or issue-platform tables?), or the bug is independent of data and low-traffic dev simply doesn't hit it.

Event ID

N/A: this is a self-hosted, consumer-side failure, so there are no user-facing event IDs.
