Summary
getPreviousAlertHistories() only needs the single most recent AlertHistory record per group, but it finds that record by scanning a hard-coded 7-day window and $group-ing the result down. Because MongoDB cannot skip-scan to the latest key per group here, every alert tick reads the entire 7-day range of history for each alert. With per-minute alert intervals this is one ~12k-key index scan per alert per tick, and the cost grows as history accumulates.
It's correct and indexed (not an outage risk), but it's an avoidable, steadily-growing cost and it floods MongoDB's slow-query log.
Where
packages/api/src/tasks/checkAlerts/index.ts — getPreviousAlertHistories():
    const lookbackDate = new Date(now.getTime() - ms('7d'));
    // ...
    { $match: { alert: id, createdAt: { $lte: now, $gte: lookbackDate } } },
    { $sort: { alert: 1, group: 1, createdAt: -1 } },
    { $group: {
        _id: { alert: '$alert', group: '$group' },
        createdAt: { $first: '$createdAt' },
        state: { $first: '$state' },
    } },
The schema itself is well-tuned (packages/api/src/models/alertHistory.ts): a 30-day TTL on createdAt plus the compound index { alert: 1, group: 1, createdAt: -1 } this query relies on. The inefficiency is purely the oversized lookback window, not the indexing.
The code comment expects $group + $first to "short-circuit per group" off the index. In practice that short-circuit does not happen — see below.
Observed in production
A document is written per alert per check interval, so for 1-minute alerts the "previous state" is never more than one interval old — yet the query examines the full 7-day window every tick. MongoDB's slow-query log (Slow query, id 51803, slowms 100) shows, per evaluation:
planSummary: IXSCAN { alert: 1, group: 1, createdAt: -1 }
keysExamined ≈ docsExamined ≈ 12,000, nreturned ≈ 30 (~400:1 examined:returned)
durationMillis 110–200 ms, one such aggregate per alert per minute
The plan is a plain IXSCAN of the whole window — not a DISTINCT_SCAN — so the per-group short-circuit the comment hopes for is not occurring; cost scales with window size, not with the number of groups returned.
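To make the examined:returned ratio easy to track over time, the figures above can be pulled out of the slow-query log mechanically. The following is a minimal sketch, assuming MongoDB's structured JSON log format, where a "Slow query" entry carries its metrics under `attr`. The helper name `summarizeSlowQuery` and the sample line are hypothetical; the sample is built from the approximate numbers quoted above and is not a real log entry.

```typescript
// Fields read from the `attr` object of a structured "Slow query" log entry.
interface SlowQueryAttr {
  planSummary?: string;
  keysExamined?: number;
  nreturned?: number;
  durationMillis?: number;
}

interface ScanSummary {
  plan: string; // first token of planSummary, e.g. "IXSCAN" or "DISTINCT_SCAN"
  examinedPerReturned: number; // keysExamined / nreturned
  durationMillis: number;
}

// Hypothetical helper: summarise one JSON log line.
// Returns null when the line is not valid JSON or is not a slow-query entry (id 51803).
function summarizeSlowQuery(line: string): ScanSummary | null {
  let entry: { id?: number; attr?: SlowQueryAttr };
  try {
    entry = JSON.parse(line);
  } catch {
    return null;
  }
  if (entry === null || typeof entry !== 'object') return null;
  if (entry.id !== 51803 || !entry.attr) return null;

  const { planSummary = '', keysExamined = 0, nreturned = 0, durationMillis = 0 } = entry.attr;
  return {
    plan: planSummary.split(' ')[0] ?? '',
    // Avoid dividing by zero when the query returned nothing.
    examinedPerReturned: nreturned > 0 ? keysExamined / nreturned : Infinity,
    durationMillis,
  };
}

// Illustrative line only, assembled from the approximate figures in this report.
const sampleLine =
  '{"id":51803,"msg":"Slow query","attr":{"planSummary":"IXSCAN { alert: 1, group: 1, createdAt: -1 }",' +
  '"keysExamined":12000,"docsExamined":12000,"nreturned":30,"durationMillis":150}}';

console.log(summarizeSlowQuery(sampleLine));
```

A ratio near 1 together with a DISTINCT_SCAN plan would indicate the per-group short-circuit is happening; a ratio in the hundreds with a plain IXSCAN, as here, indicates it is not.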
Over a 2.5-day window we watched it trend upward as alerthistories filled in (a handful of alerts, 1-minute interval):
| | start | +2.5 days |
| --- | --- | --- |
| avg docsExamined | ~10,500 | ~12,400 |
| avg durationMillis | ~127 ms | ~165 ms |
It will plateau once the 7-day window saturates (well under the 30-day TTL), but at a cost set by 7d × write-rate rather than by what the query actually needs.
Impact
- Mild but steady latency growth (until the 7-day window saturates) on a hot path that runs every tick per alert.
- Heavy MongoDB slow-query log noise (one slow op per alert per minute) that drowns out genuinely slow operations.
- Scales with alert count × history density, so it gets worse for heavier alerting setups.
Proposed fix
getPreviousAlertHistories() only needs the latest record per group, so bound the lookback to a small multiple of the alert's check interval (e.g. max(N × interval, someFloor)) instead of a fixed ms('7d'). That cuts keysExamined from ~all-rows-in-7-days to a handful while returning the identical result. If robustness against gaps is a concern, fall back to a wider window only when the narrow window returns nothing for an alert.
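The bounded lookback and the optional wide fallback described above can be sketched as follows. This is a minimal illustration, not the actual HyperDX code: the constants `LOOKBACK_INTERVALS`, `LOOKBACK_FLOOR_MS`, and `WIDE_LOOKBACK_MS`, the helper names, and the injected `query` callback (standing in for the existing aggregation) are all assumptions made for the example.

```typescript
const MINUTE_MS = 60 * 1000;

// Hypothetical tuning knobs; the values are placeholders, not taken from the codebase.
const LOOKBACK_INTERVALS = 5; // N in max(N × interval, floor)
const LOOKBACK_FLOOR_MS = 15 * MINUTE_MS; // never look back less than this
const WIDE_LOOKBACK_MS = 7 * 24 * 60 * MINUTE_MS; // the current fixed 7-day window

// Start of the narrow window: max(N × interval, floor) before `now`.
function narrowLookbackDate(now: Date, intervalMs: number): Date {
  const windowMs = Math.max(LOOKBACK_INTERVALS * intervalMs, LOOKBACK_FLOOR_MS);
  return new Date(now.getTime() - windowMs);
}

// Stand-in for the existing aggregation: returns the latest history per group
// for one alert, restricted to createdAt within [from, to].
type HistoryQuery<T> = (from: Date, to: Date) => Promise<T[]>;

// Try the narrow window first; only when it returns nothing for the alert,
// retry once with the wide window.
async function latestWithFallback<T>(
  query: HistoryQuery<T>,
  now: Date,
  intervalMs: number,
): Promise<T[]> {
  const narrow = await query(narrowLookbackDate(now, intervalMs), now);
  if (narrow.length > 0) return narrow;
  return query(new Date(now.getTime() - WIDE_LOOKBACK_MS), now);
}
```

With a 1-minute interval and these placeholder values the narrow window is 15 minutes (the floor wins over 5 × 1 minute); with a 1-hour interval it is 5 hours. One caveat of this shape: the fallback triggers per alert, so a group whose last record is older than the narrow window is missed whenever other groups of the same alert do have recent records. Whether that matters depends on how sparse per-group history can be.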
(A larger refactor — maintaining current per-group state separately rather than deriving it from history each tick — would remove the scan entirely, but the lookback bound is the minimal, low-risk change.)
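Related
- … checkAlerts task but addresses a different concern (one alert blocking others), not the history-read cost.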
Environment
- Observed on a ClickStack deployment (HyperDX API alert checker + bundled MongoDB), per-minute alert intervals.