26.4.1: monitor_consumer wedges on idle-in-transaction after UPDATE sentry_monitorcheckin, check-in ingestion stops org-wide

### Self-Hosted Version

26.4.1

### CPU Architecture

x86_64

### Docker Version

28.0.4

### Docker Compose Version

v5.1.3

### Machine Specification

- [x] My system meets the minimum system requirements of Sentry

### Installation Type

Upgraded from 26.3.1

### Steps to Reproduce

1. Run 26.3.1 with ~50 active Crons monitors (mixed schedules: `* * * * *`, `*/5`, `*/30`, hourly).
2. Upgrade to 26.4.1 via the standard self-hosted install.
3. Monitor consumer processes a few dozen messages, then wedges.

### Expected Result

Monitor check-ins continue to ingest normally after the upgrade.

### Actual Result

  Roughly 15–60 seconds after ingest-monitors starts processing post-upgrade, it silently wedges:
                                                                                                                                                                                                                                                                                                                                                                                    
  - Kafka consumer group ingest-monitors stays joined (valid consumer-id) but CURRENT-OFFSET stops advancing.
  - Relay continues returning HTTP 202 at the edge; Kafka lag grows; check-ins never land in the monitor store.                                                                                                                                                                                                                                                                     
  - The healthcheck file /tmp/health.txt stops being touched, so Docker eventually marks the container unhealthy.
  - No MAXPOLL (max poll interval exceeded) error appears until the configured max poll interval expires.
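
For anyone trying to confirm the same symptom, the stall above (group still joined, CURRENT-OFFSET frozen, lag growing) can be checked by comparing two `kafka-consumer-groups --describe` snapshots taken some seconds apart. This is only a minimal sketch: it assumes the snapshots have already been reduced by hand to `{partition: (current_offset, log_end_offset)}` dicts, and the function name and data shape are illustrative, not part of Sentry or Kafka tooling.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot shape: partition -> (CURRENT-OFFSET, LOG-END-OFFSET),
# e.g. copied from two runs of `kafka-consumer-groups --describe`.
Snapshot = Dict[int, Tuple[int, int]]


def stalled_partitions(before: Snapshot, after: Snapshot) -> List[int]:
    """Return partitions whose committed offset did not move while lag grew."""
    stalled = []
    for partition, (cur_before, end_before) in before.items():
        if partition not in after:
            continue  # partition absent from the second snapshot; skip it
        cur_after, end_after = after[partition]
        lag_before = end_before - cur_before
        lag_after = end_after - cur_after
        # Frozen consumer offset plus growing lag is the wedge signature.
        if cur_after == cur_before and lag_after > lag_before:
            stalled.append(partition)
    return sorted(stalled)
```

A partition with no new messages is deliberately not reported, so an idle but healthy consumer does not count as wedged.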
                                                            
  Two recurring failure signatures (we've seen both in the same incident, not sure yet if they're two distinct bugs or two manifestations of one):                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                    
  1. PG wedge. `pg_stat_activity` shows an open transaction in state `idle in transaction` with `wait_event=ClientRead`, last statement:
     `UPDATE "sentry_monitorcheckin" SET "date_in_progress" = '...' WHERE "sentry_monitorcheckin"."id" = <id>`
     The row id differs each time. Immediately before silence the consumer logs:
     `[INFO] sentry.utils.exceptions: No task state found in exception_grouping_context`
  2. Non-PG wedge. No activity on PG at all (no active/idle-in-tx consumer connections). The Python process is blocked somewhere not waiting on the database. Log stops mid-run after a routine check_in_closed message.
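
To tell the two signatures apart while a wedge is in progress, the sessions can be pulled from `pg_stat_activity` and filtered. A minimal sketch, assuming the rows are fetched with any PostgreSQL client as dicts keyed by the standard `pg_stat_activity` column names; the helper name, the 30-second threshold, and the table-name match are illustrative assumptions, not anything Sentry ships.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

# Query to fetch candidate sessions; run it with any PostgreSQL client.
IDLE_IN_TX_SQL = """
SELECT pid, state, wait_event, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start
"""


def wedged_sessions(
    rows: List[Dict[str, Any]],
    now: datetime,
    min_age: timedelta = timedelta(seconds=30),  # assumed threshold
    needle: str = "sentry_monitorcheckin",
) -> List[int]:
    """Return pids of long-lived idle-in-transaction sessions touching `needle`.

    `now` must match the timezone-awareness of the `xact_start` values.
    """
    pids = []
    for row in rows:
        if row.get("state") != "idle in transaction":
            continue
        started = row.get("xact_start")
        if started is None or now - started < min_age:
            continue  # no open transaction, or too young to call wedged
        if needle in (row.get("query") or ""):
            pids.append(row["pid"])
    return pids
```

An empty result while the consumer offset is frozen would point at the second (non-PG) signature rather than the first.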
                                                                                                                                                                                                                                                                                                                                                                                    
  `sentry-self-hosted-taskbroker-1` floods logs with:
  `WARN set_task_status: taskbroker::grpc::server: No pending activations`
  but this appears to be routine "nothing to do" noise, not an error.
                                                                                                                                                                                                                                                                                                                                                                                    
  Net effect: every active monitor flips to error with missed backfill; the system-wide Crons feature is effectively dead on the prod host.                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                    
  Dev vs Prod comparison (this is the key data point)                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                    
  Same 26.4.1 deployment on a second host (coreos-dev1-sentry1) handles check-ins fine — but that instance has 0 rows in sentry_monitorcheckin and near-zero organic traffic. Same images, same config, same override, same start.sh — the only difference is accumulated state on the broken host:                                                                                 
  - sentry_monitorcheckin: 1,272,808 rows                                                                                                                                                                                                                                                                                                                                           
  - sentry_monitorincident: 2,013 rows                                                                                                                                                                                                                                                                                                                                              
  - sentry_monitorenvironment: 149 rows, 117 in status=4, 30 in status=5, 2 in status=0 post-upgrade                                                                                                                                                                                                                                                                                
                                                                                                    
  What we tried (none cleared the wedge)                                                                                                                                                                                                                                                                                                                                            
                                                            
  1. `docker compose restart ingest-monitors ingest-occurrences`: wedges again within minutes.
  2. `kafka-consumer-groups --reset-offsets --to-latest` for `ingest-monitors` and `ingest-occurrences`: gives temporary progress (seconds/minutes) before re-wedging at a different `sentry_monitorcheckin.id`.
  3. `--max-poll-interval-ms 300000` via `docker-compose.override.yml`: still wedges within the window.
  4. `--max-poll-interval-ms 1800000`: same, just longer wedge duration.
  5. State surgery, in order of aggressiveness:
    - `TRUNCATE sentry_monitorcheckin, sentry_monitorincident RESTART IDENTITY CASCADE` (also cascades to `sentry_monitorenvbrokendetection`) → wedge reproduces.
    - Same truncate plus `UPDATE sentry_monitorenvironment SET status=0, last_checkin=NULL, next_checkin=NULL, next_checkin_latest=NULL WHERE status != 1` to match dev's "fresh install" status distribution → still wedges.
                                                                                                                                                                                                                                                                                                                                                                                    
  Surgical cleanup of the tables directly referenced in the wedging UPDATE is not sufficient to restore function. State that triggers the bug evidently lives somewhere else (Redis cache? ClickHouse monitors_* tables? workflow-engine or issue-platform tables?), or the bug is independent of data and low-traffic dev simply doesn't hit it.

### Event ID

N/A. This is a self-hosted, consumer-side failure, so there are no user-facing event IDs.
