feat(google_chat): group-based alert threading #116
Open
jvlxz wants to merge 8 commits into
Thread Google Chat messages by Alertmanager group instead of per-alert fingerprint, opt-in via threading_mode = "group":
- Deterministic thread key hash(GroupKey + 12h wall-clock bucket): both clustered calert instances converge into one thread, no shared state
- One aggregated CardsV2 message per webhook payload with exact N firing / M resolved counters and per-instance collapsible sections
- State-hash dedup window (default 5m) drops cluster-race duplicates, counted by a new alerts_deduplicated_total metric
- Sections capped at max_alerts_per_message (default 10) with an exact "and X more instances" overflow summary
- Provider.Push() widened to the full webhook payload (new providers.WebhookPayload carrying GroupKey); legacy alert mode is the default and behaviourally unchanged
Implements docs/prd-group-threading.md
Record the last posted fingerprint→status map in per-group state and omit resolved instances already shown as resolved from later messages' sections. Header counters keep covering the full payload, so hidden instances still count in 'M resolved'. Unknown previous status (first post, restart, post-resolve state deletion) always renders — fail toward showing.
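The show-resolved-once rule described in this commit can be sketched as a filter over the payload. This is an illustration under assumptions, not the PR's code: the alert type, its field names, and filterResolvedOnce are hypothetical.

```go
package main

import "fmt"

// alert is a hypothetical, trimmed-down view of one alert in a webhook payload.
type alert struct {
	Fingerprint string
	Status      string // "firing" or "resolved"
}

// filterResolvedOnce drops an alert only when it is resolved now AND the last
// posted status for the same fingerprint was also resolved. An unknown
// previous status (missing key or nil map) reads as "", so the alert renders.
func filterResolvedOnce(alerts []alert, lastPosted map[string]string) []alert {
	out := make([]alert, 0, len(alerts))
	for _, a := range alerts {
		if a.Status == "resolved" && lastPosted[a.Fingerprint] == "resolved" {
			continue // already shown as resolved in this thread
		}
		out = append(out, a)
	}
	return out
}

func main() {
	last := map[string]string{"a1": "resolved", "a2": "firing"}
	payload := []alert{
		{"a1", "resolved"}, // hidden: already shown as resolved
		{"a2", "resolved"}, // shown: status changed since the last post
		{"a3", "resolved"}, // shown: previous status unknown
		{"a4", "firing"},   // shown: firing alerts always render
	}
	for _, a := range filterResolvedOnce(payload, last) {
		fmt.Println(a.Fingerprint, a.Status)
	}
}
```

Reading a missing map key as the empty string is what gives the "fail toward showing" behaviour for first posts and restarts.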
… threading Add a groupStateStore interface with two implementations: the existing in-memory store (default) and a Redis store that lets active-active calert instances share group dedup state. The read-modify-write decision runs as a single atomic Lua script so racing instances cannot both post the same group card. A dedup-store outage fails open: the alert is posted anyway and a dedicated group_dedup_store_errors_total counter increments, so a Redis blip degrades dedup to a possible duplicate, never a dropped alert. Group state is no longer deleted on full resolve; the store TTL (thread_ttl) is the sole reaper, so lingering resolved members re-sent when a new member fires stay suppressed. An empty-render guard avoids posting blank cards. The whole dispatch suite runs against both backends to prove equivalence.
Header counters covered the full payload, so a resolved instance hidden by show-resolved-once still counted, producing confusing headers like '2 firing / 1 resolved' with no resolved card shown. Count over the post-filter rendered set instead so the header matches what's visible.
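A minimal sketch of the fix, assuming the header is built from whatever slice survives the show-resolved-once filter. The alert type and the headerCounts and header helpers are hypothetical names, not the PR's.

```go
package main

import "fmt"

// alert is a hypothetical, trimmed-down view of one rendered alert.
type alert struct {
	Fingerprint string
	Status      string // "firing" or "resolved"
}

// headerCounts tallies only the alerts that will actually be rendered, so the
// header cannot mention a resolved instance that the card does not show.
func headerCounts(rendered []alert) (firing, resolved int) {
	for _, a := range rendered {
		switch a.Status {
		case "firing":
			firing++
		case "resolved":
			resolved++
		}
	}
	return firing, resolved
}

// header formats the counters in the "N firing / M resolved" style.
func header(rendered []alert) string {
	f, r := headerCounts(rendered)
	return fmt.Sprintf("%d firing / %d resolved", f, r)
}

func main() {
	// The payload had a third, resolved alert that the filter already hid,
	// so it is absent here and absent from the header.
	rendered := []alert{{"a1", "firing"}, {"a2", "firing"}}
	fmt.Println(header(rendered))
}
```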
Operator-facing guide for threading_mode = "group": when to use it, active-active flow diagram, copy-paste config (incl. redis block), and metrics. README points to it from the threading section.
Drop prd-group-threading.md and redis-shared-state-prd.md (design history, not user-facing); strip the dangling links from the setup guide.
Summary
Thread Google Chat messages by Alertmanager group instead of per-alert fingerprint, so an incident's whole lifecycle (firing → partially resolved → resolved) lives in one thread. Opt-in per room via threading_mode = "group"; legacy per-alert mode stays the default and is unchanged.
Setup guide: docs/group-threading.md.
Key features
- Deterministic thread key: hash(GroupKey + 12h bucket), so clustered calert instances converge on one thread with no shared state. Still-firing alerts roll into a new thread at bucket boundaries.
- Aggregated card: N firing / M resolved counters, firing-first sections capped at max_alerts_per_message (default 10) with an "and X more" overflow. Resolved instances are shown once per thread.
- Dedup: state-hash dedup within dedup_window (default 2m). Optional Redis-backed shared state (atomic Lua CAS) for active-active deployments; it fails open (posts anyway, increments an error counter), so a Redis blip never drops an alert.
- Metrics: alerts_deduplicated_total, group_dedup_store_errors_total.
Implementation
- threading.go: thread-key engine + groupStateStore interface (in-memory default, redis_store.go alternative)
- group_message.go + static/message-group.tmpl: aggregated card builder/template
- providers.WebhookPayload widens Provider.Push() to carry GroupKey (absent from template.Data)
- Config: threading_mode, dedup_window, max_alerts_per_message, [providers.<room>.redis] (thread_ttl reused as bucket size)
Testing
Dispatch suite runs against both in-memory and Redis backends. Legacy-mode tests unchanged (regression guard).
Rollout
Requires send_resolved: true; use a group_by without instance. Active-active deployments need a shared Redis (see docs).