feat(google_chat): group-based alert threading #116

Open
jvlxz wants to merge 8 commits into mr-karan:main from jvlxz:jvlxz/google-chat-thread-alert

jvlxz commented Jun 16, 2026

Summary

Thread Google Chat messages by Alertmanager group instead of per-alert fingerprint, so an incident's whole lifecycle (firing → partially resolved → resolved) lives in one thread. Opt-in per room via threading_mode = "group"; legacy per-alert mode stays the default and is unchanged.

Setup guide: docs/group-threading.md.

Key features

  • Deterministic threading — thread key is hash(GroupKey + 12h bucket), so clustered calert instances converge on one thread with no shared state. Still-firing alerts roll into a new thread at bucket boundaries.
  • Aggregated message — one CardsV2 card per webhook payload: exact N firing / M resolved counters, firing-first sections capped at max_alerts_per_message (default 10) with an "and X more" overflow. Resolved instances shown once per thread.
  • Cluster-race dedup — state-hash dedup (fingerprint→status) suppresses duplicates within dedup_window (default 2m). Optional Redis-backed shared state (atomic Lua CAS) for active-active deployments; fails open (posts anyway, increments error counter) so a Redis blip never drops an alert.
  • New metrics: alerts_deduplicated_total, group_dedup_store_errors_total.

Implementation

  • threading.go — thread-key engine + groupStateStore interface (in-memory default, redis_store.go alternative)
  • group_message.go + static/message-group.tmpl — aggregated card builder/template
  • providers.WebhookPayload widens Provider.Push() to carry GroupKey (absent from template.Data)
  • New config knobs: threading_mode, dedup_window, max_alerts_per_message, [providers.<room>.redis] (thread_ttl reused as bucket size)

Testing

Dispatch suite runs against both in-memory and Redis backends. Legacy-mode tests unchanged (regression guard).

Rollout

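Requires send_resolved: true on the Alertmanager webhook receiver; use a group_by that omits the instance label, so that all instances of an incident fall into one group. Active-active deployments need a shared Redis (see docs).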

jvlxz added 8 commits June 12, 2026 15:57
Thread Google Chat messages by Alertmanager group instead of per-alert
fingerprint, opt-in via threading_mode = "group":

- Deterministic thread key hash(GroupKey + 12h wall-clock bucket): both
  clustered calert instances converge into one thread, no shared state
- One aggregated CardsV2 message per webhook payload with exact
  N firing / M resolved counters and per-instance collapsible sections
- State-hash dedup window (default 5m) drops cluster-race duplicates,
  counted by a new alerts_deduplicated_total metric
- Sections capped at max_alerts_per_message (default 10) with an exact
  "and X more instances" overflow summary
- Provider.Push() widened to the full webhook payload (new
  providers.WebhookPayload carrying GroupKey); legacy alert mode is the
  default and behaviourally unchanged

Implements docs/prd-group-threading.md

Record the last posted fingerprint→status map in per-group state and
omit resolved instances already shown as resolved from later messages'
sections. Header counters keep covering the full payload, so hidden
instances still count in 'M resolved'. Unknown previous status (first
post, restart, post-resolve state deletion) always renders — fail
toward showing.
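A minimal sketch of the show-resolved-once rule, assuming the previous and current state are plain fingerprint→status maps (the function name `visibleFingerprints` is made up). A fingerprint missing from the previous map reads as an empty status, so it always renders, which is the "fail toward showing" behaviour.

```go
package main

import (
	"fmt"
	"sort"
)

// visibleFingerprints returns the fingerprints that should get a section:
// everything except instances that were already posted as resolved.
func visibleFingerprints(prev, cur map[string]string) []string {
	var out []string
	for fp, status := range cur {
		if status == "resolved" && prev[fp] == "resolved" {
			continue // already shown as resolved in an earlier message
		}
		out = append(out, fp)
	}
	sort.Strings(out) // stable order for rendering
	return out
}

func main() {
	prev := map[string]string{"fp1": "firing", "fp2": "resolved"}
	cur := map[string]string{"fp1": "resolved", "fp2": "resolved", "fp3": "firing"}
	fmt.Println(visibleFingerprints(prev, cur)) // fp2 is omitted
	fmt.Println(visibleFingerprints(nil, cur))  // unknown previous state: show all
}
```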
… threading

Add a groupStateStore interface with two implementations: the existing
in-memory store (default) and a Redis store that lets active-active calert
instances share group dedup state. The read-modify-write decision runs as a
single atomic Lua script so racing instances cannot both post the same group
card.

A dedup-store outage fails open: the alert is posted anyway and a dedicated
group_dedup_store_errors_total counter increments, so a Redis blip degrades
dedup to a possible duplicate, never a dropped alert.

Group state is no longer deleted on full resolve; the store TTL (thread_ttl)
is the sole reaper, so lingering resolved members re-sent when a new member
fires stay suppressed. An empty-render guard avoids posting blank cards.

The whole dispatch suite runs against both backends to prove equivalence.

Header counters covered the full payload, so a resolved instance hidden
by show-resolved-once still counted, producing confusing headers like
'2 firing / 1 resolved' with no resolved card shown. Count over the
post-filter rendered set instead so the header matches what's visible.

Operator-facing guide for threading_mode = "group": when to use it,
active-active flow diagram, copy-paste config (incl. redis block), and
metrics. README points to it from the threading section.

Drop prd-group-threading.md and redis-shared-state-prd.md (design
history, not user-facing); strip the dangling links from the setup guide.