feat(google_chat): group-based alert threading by jvlxz · Pull Request #116 · mr-karan/calert

jvlxz · 2026-06-16T14:58:48Z

Summary

Thread Google Chat messages by Alertmanager group instead of per-alert fingerprint, so an incident's whole lifecycle (firing → partially resolved → resolved) lives in one thread. Opt-in per room via threading_mode = "group"; legacy per-alert mode stays the default and is unchanged.

Setup guide: docs/group-threading.md.

Key features

Deterministic threading — thread key is hash(GroupKey + 12h bucket), so clustered calert instances converge on one thread with no shared state. Still-firing alerts roll into a new thread at bucket boundaries.
Aggregated message — one CardsV2 card per webhook payload: exact N firing / M resolved counters, firing-first sections capped at max_alerts_per_message (default 10) with an "and X more" overflow. Resolved instances shown once per thread.
Cluster-race dedup — state-hash dedup (fingerprint→status) suppresses duplicates within dedup_window (default 2m). Optional Redis-backed shared state (atomic Lua CAS) for active-active deployments; fails open (posts anyway, increments error counter) so a Redis blip never drops an alert.
New metrics — alerts_deduplicated_total, group_dedup_store_errors_total.
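
A minimal sketch of how a deterministic thread key of the form hash(GroupKey + 12h bucket) could be derived. The SHA-256 hash, the `|` separator, the Unix-epoch bucket alignment, and the `threadKey` name are assumptions made for illustration; the PR does not spell them out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// threadKey is a hypothetical helper: it truncates unixSecs to the start of
// its bucket and hashes that together with the Alertmanager group key.
// Every instance that sees the same group in the same bucket computes the
// same key, so no shared state is needed. bucketSecs must be positive.
func threadKey(groupKey string, unixSecs, bucketSecs int64) string {
	bucket := unixSecs / bucketSecs * bucketSecs // epoch-aligned (assumption)
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%d", groupKey, bucket)))
	return hex.EncodeToString(sum[:8])
}

func main() {
	const bucket = int64(12 * time.Hour / time.Second)
	gk := `{}:{alertname="HighCPU"}`
	early := time.Date(2026, 6, 16, 1, 0, 0, 0, time.UTC).Unix()
	late := time.Date(2026, 6, 16, 11, 59, 0, 0, time.UTC).Unix()
	next := time.Date(2026, 6, 16, 12, 0, 0, 0, time.UTC).Unix()

	fmt.Println(threadKey(gk, early, bucket) == threadKey(gk, late, bucket)) // true: same bucket
	fmt.Println(threadKey(gk, early, bucket) == threadKey(gk, next, bucket)) // false: rolled over
}
```

The second comparison shows the rollover behaviour described above: an alert still firing at the bucket boundary lands in a new thread.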

Implementation

threading.go — thread-key engine + groupStateStore interface (in-memory default, redis_store.go alternative)
group_message.go + static/message-group.tmpl — aggregated card builder/template
providers.WebhookPayload widens Provider.Push() to carry GroupKey (absent from template.Data)
New config knobs: threading_mode, dedup_window, max_alerts_per_message, [providers.<room>.redis] (thread_ttl reused as bucket size)
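
To make the groupStateStore idea concrete, here is a rough sketch of an in-memory store that suppresses a repeated fingerprint→status state inside dedup_window. The `ShouldPost` method, the `memStore` type, the `at` helper, and the hashing scheme are all hypothetical; the real interface in threading.go may look quite different.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"sync"
	"time"
)

// dedupWindow mirrors the dedup_window default quoted in the summary (2m).
const dedupWindow = 2 * time.Minute

// at returns a fixed instant sec seconds after the Unix epoch (demo helper).
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// stateHash hashes a fingerprint→status map in sorted key order, so equal
// maps always yield the same digest regardless of map iteration order.
func stateHash(states map[string]string) string {
	fps := make([]string, 0, len(states))
	for fp := range states {
		fps = append(fps, fp)
	}
	sort.Strings(fps)
	h := sha256.New()
	for _, fp := range fps {
		fmt.Fprintf(h, "%s=%s;", fp, states[fp])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// groupStateStore is an assumed shape for the interface named in the PR.
type groupStateStore interface {
	// ShouldPost reports whether hash differs from the one recorded for key
	// within the last window, and records it when it does.
	ShouldPost(key, hash string, now time.Time, window time.Duration) (bool, error)
}

type entry struct {
	hash string
	at   time.Time
}

type memStore struct {
	mu sync.Mutex
	m  map[string]entry
}

func newMemStore() *memStore { return &memStore{m: make(map[string]entry)} }

func (s *memStore) ShouldPost(key, hash string, now time.Time, window time.Duration) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.m[key]; ok && e.hash == hash && now.Sub(e.at) < window {
		return false, nil // same state seen recently: treat as a duplicate
	}
	s.m[key] = entry{hash: hash, at: now}
	return true, nil
}

func main() {
	var store groupStateStore = newMemStore()
	h := stateHash(map[string]string{"fp1": "firing", "fp2": "firing"})
	first, _ := store.ShouldPost("group-a", h, at(0), dedupWindow)
	second, _ := store.ShouldPost("group-a", h, at(30), dedupWindow)
	fmt.Println(first, second) // true false
}
```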

Testing

Dispatch suite runs against both in-memory and Redis backends. Legacy-mode tests unchanged (regression guard).

Rollout

Requires send_resolved: true; use a group_by without instance. Active-active deployments need a shared Redis (see docs).

Thread Google Chat messages by Alertmanager group instead of per-alert fingerprint, opt-in via threading_mode = "group":

- Deterministic thread key hash(GroupKey + 12h wall-clock bucket): both clustered calert instances converge into one thread, no shared state
- One aggregated CardsV2 message per webhook payload with exact N firing / M resolved counters and per-instance collapsible sections
- State-hash dedup window (default 5m) drops cluster-race duplicates, counted by a new alerts_deduplicated_total metric
- Sections capped at max_alerts_per_message (default 10) with an exact "and X more instances" overflow summary
- Provider.Push() widened to the full webhook payload (new providers.WebhookPayload carrying GroupKey); legacy alert mode is the default and behaviourally unchanged

Implements docs/prd-group-threading.md
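
The section cap with its overflow summary might look roughly like this. The `alert` type and `capSections` function are made up for the sketch; only the firing-first ordering, the cap, and the "and X more instances" wording come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// alert is a hypothetical, stripped-down stand-in for one alert instance.
type alert struct {
	Name   string
	Status string // "firing" or "resolved"
}

// capSections orders alerts firing-first (stable within each status), keeps
// at most max of them, and returns an overflow summary for the rest. The
// summary is empty when nothing was cut.
func capSections(alerts []alert, max int) ([]alert, string) {
	sorted := append([]alert(nil), alerts...) // copy: leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Status == "firing" && sorted[j].Status != "firing"
	})
	if max < 0 {
		max = 0
	}
	if len(sorted) <= max {
		return sorted, ""
	}
	return sorted[:max], fmt.Sprintf("and %d more instances", len(sorted)-max)
}

func main() {
	in := []alert{
		{"node-1", "resolved"},
		{"node-2", "firing"},
		{"node-3", "firing"},
		{"node-4", "resolved"},
	}
	shown, more := capSections(in, 2)
	for _, a := range shown {
		fmt.Println(a.Name, a.Status)
	}
	fmt.Println(more) // and 2 more instances
}
```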

Record the last posted fingerprint→status map in per-group state and omit resolved instances already shown as resolved from later messages' sections. Header counters keep covering the full payload, so hidden instances still count in 'M resolved'. Unknown previous status (first post, restart, post-resolve state deletion) always renders — fail toward showing.
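
A small sketch of the show-resolved-once rule described above, assuming the per-group state is a plain fingerprint→status map. The `visible` function name is hypothetical. A missing previous entry reads as the empty string, so unknown state always renders.

```go
package main

import (
	"fmt"
	"sort"
)

// visible returns the fingerprints that should get a section in the next
// message: everything except instances that are resolved now and were
// already posted as resolved. A nil or incomplete prev map (first post,
// restart) hides nothing, which is the "fail toward showing" behaviour.
func visible(current, prev map[string]string) []string {
	out := make([]string, 0, len(current))
	for fp, status := range current {
		if status == "resolved" && prev[fp] == "resolved" {
			continue // already shown as resolved in this thread
		}
		out = append(out, fp)
	}
	sort.Strings(out) // deterministic order for rendering and tests
	return out
}

func main() {
	prev := map[string]string{"fp1": "resolved", "fp2": "firing"}
	current := map[string]string{"fp1": "resolved", "fp2": "resolved", "fp3": "firing"}
	fmt.Println(visible(current, prev)) // [fp2 fp3]
	fmt.Println(visible(current, nil))  // [fp1 fp2 fp3]
}
```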

… threading

Add a groupStateStore interface with two implementations: the existing in-memory store (default) and a Redis store that lets active-active calert instances share group dedup state. The read-modify-write decision runs as a single atomic Lua script so racing instances cannot both post the same group card.

A dedup-store outage fails open: the alert is posted anyway and a dedicated group_dedup_store_errors_total counter increments, so a Redis blip degrades dedup to a possible duplicate, never a dropped alert.

Group state is no longer deleted on full resolve; the store TTL (thread_ttl) is the sole reaper, so lingering resolved members re-sent when a new member fires stay suppressed. An empty-render guard avoids posting blank cards. The whole dispatch suite runs against both backends to prove equivalence.
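
The fail-open behaviour can be illustrated without Redis. In this sketch `CheckAndSet`, `dispatcher`, `memStore`, and `downStore` are hypothetical names, the mutex-guarded map stands in for the atomic Lua script, and plain integer fields stand in for the two metrics.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// groupStateStore is an assumed shape for the interface named in the PR.
type groupStateStore interface {
	// CheckAndSet atomically stores hash under key and reports whether it
	// differed from the previously stored value.
	CheckAndSet(key, hash string) (changed bool, err error)
}

// memStore does the compare-and-set under one lock, the in-process
// analogue of running the decision as a single atomic script.
type memStore struct {
	mu sync.Mutex
	m  map[string]string
}

func (s *memStore) CheckAndSet(key, hash string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m[key] == hash {
		return false, nil
	}
	s.m[key] = hash
	return true, nil
}

// downStore simulates a dedup-store outage.
type downStore struct{}

func (downStore) CheckAndSet(string, string) (bool, error) {
	return false, errors.New("store unavailable")
}

type dispatcher struct {
	store       groupStateStore
	storeErrors int      // stands in for group_dedup_store_errors_total
	deduped     int      // stands in for alerts_deduplicated_total
	posted      []string // cards that would have been sent to Google Chat
}

func (d *dispatcher) dispatch(key, hash, card string) {
	if card == "" {
		return // empty-render guard: never post a blank card
	}
	changed, err := d.store.CheckAndSet(key, hash)
	switch {
	case err != nil:
		d.storeErrors++ // fail open: count the error and post anyway
	case !changed:
		d.deduped++
		return
	}
	d.posted = append(d.posted, card)
}

func main() {
	healthy := &dispatcher{store: &memStore{m: map[string]string{}}}
	healthy.dispatch("group-a", "h1", "card")
	healthy.dispatch("group-a", "h1", "card")
	fmt.Println(len(healthy.posted), healthy.deduped) // 1 1

	degraded := &dispatcher{store: downStore{}}
	degraded.dispatch("group-a", "h1", "card")
	degraded.dispatch("group-a", "h1", "card")
	fmt.Println(len(degraded.posted), degraded.storeErrors) // 2 2
}
```

With the store down, the second call produces a duplicate card rather than a dropped alert, which is the trade-off the commit message describes.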

Header counters covered the full payload, so a resolved instance hidden by show-resolved-once still counted, producing confusing headers like '2 firing / 1 resolved' with no resolved card shown. Count over the post-filter rendered set instead so the header matches what's visible.
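
As an illustration of the fix, the header can be computed from the rendered set only. The `alert` type and `header` function are invented for this sketch; the "N firing / M resolved" wording follows the example in the commit message.

```go
package main

import "fmt"

// alert is a hypothetical, stripped-down stand-in for one alert instance.
type alert struct {
	Fingerprint string
	Status      string // "firing" or "resolved"
}

// header builds the counter line from the alerts that will actually be
// rendered, so it never reports an instance the card does not show.
func header(rendered []alert) string {
	firing, resolved := 0, 0
	for _, a := range rendered {
		switch a.Status {
		case "firing":
			firing++
		case "resolved":
			resolved++
		}
	}
	return fmt.Sprintf("%d firing / %d resolved", firing, resolved)
}

func main() {
	payload := []alert{{"fp1", "firing"}, {"fp2", "firing"}, {"fp3", "resolved"}}
	rendered := payload[:2] // fp3 was already shown as resolved, so it is hidden

	fmt.Println(header(payload))  // 2 firing / 1 resolved (the old, confusing header)
	fmt.Println(header(rendered)) // 2 firing / 0 resolved
}
```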

Operator-facing guide for threading_mode = "group": when to use it, active-active flow diagram, copy-paste config (incl. redis block), and metrics. README points to it from the threading section.

Drop prd-group-threading.md and redis-shared-state-prd.md (design history, not user-facing); strip the dangling links from the setup guide.

jvlxz added 8 commits June 12, 2026 15:57

fix: tune dedup window and enforce group threading

f9d47ee

docs: add group threading & Redis setup guide

174a53a

docs: remove PRD files, keep operator guide

c1bad4e

docs: remove google-chat-duplicate-issue analysis

fbc448d

jvlxz wants to merge 8 commits into mr-karan:main from jvlxz:jvlxz/google-chat-thread-alert
