checkAlerts: getPreviousAlertHistories scans a fixed 7-day window per alert per tick (only needs the latest record per group) #2434

@ZeynelKoca

Summary

getPreviousAlertHistories() only needs the single most-recent AlertHistory record per group, but it gets it by scanning a hard-coded 7-day window and reducing that window with $group. Because MongoDB cannot skip-scan to the latest key per group here, every alert tick reads the entire 7-day range of history for each alert. With per-minute alert intervals this is one ~12k-key index scan per alert per tick, and the cost grows as history accumulates until the window saturates.

It's correct and indexed (not an outage risk), but it's an avoidable, steadily-growing cost and it floods MongoDB's slow-query log.

Where

packages/api/src/tasks/checkAlerts/index.ts → getPreviousAlertHistories():

const lookbackDate = new Date(now.getTime() - ms('7d'));
// ...
$match: { alert: id, createdAt: { $lte: now, $gte: lookbackDate } },
$sort:  { alert: 1, group: 1, createdAt: -1 },
$group: { _id: { alert: '$alert', group: '$group' },
          createdAt: { $first: '$createdAt' }, state: { $first: '$state' } },

The schema itself is well-tuned (packages/api/src/models/alertHistory.ts): a 30-day TTL on createdAt plus the compound index { alert: 1, group: 1, createdAt: -1 } this query relies on. The inefficiency is purely the oversized lookback window, not the indexing.

The code comment expects $group + $first to "short-circuit per group" off the index. In practice that short-circuit does not happen — see below.

Observed in production

A document is written per alert per check interval, so for 1-minute alerts the "previous state" is never more than one interval old — yet the query examines the full 7-day window every tick. MongoDB's slow-query log (Slow query, id 51803, slowms 100) shows, per evaluation:

  • planSummary: IXSCAN { alert: 1, group: 1, createdAt: -1 }
  • keysExamined ≈ docsExamined ≈ 12,000, nreturned ≈ 30 (~400:1 examined:returned)
  • durationMillis 110–200 ms, one such aggregate per alert per minute

The plan is a plain IXSCAN of the whole window — not a DISTINCT_SCAN — so the per-group short-circuit the comment hopes for is not occurring; cost scales with window size, not with the number of groups returned.
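One way to check this without reading plans by eye is to walk the explain output and list the stage names it contains. The sketch below is an assumption-heavy illustration: `collectStages` and `usesDistinctScan` are hypothetical helpers, and the sample object is hand-written to resemble the shape of an aggregate explain, not captured from a real server.

```typescript
// Hypothetical helpers, not part of the HyperDX codebase.
// Recursively collect every string-valued `stage` field found anywhere in an
// explain() result, regardless of how deeply the plan tree is nested.
function collectStages(node: unknown, found: string[] = []): string[] {
  if (Array.isArray(node)) {
    for (const item of node) collectStages(item, found);
  } else if (node !== null && typeof node === 'object') {
    const record = node as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      const value = record[key];
      if (key === 'stage' && typeof value === 'string') found.push(value);
      else collectStages(value, found);
    }
  }
  return found;
}

// True only if some stage in the plan is a DISTINCT_SCAN.
function usesDistinctScan(explainOutput: unknown): boolean {
  return collectStages(explainOutput).indexOf('DISTINCT_SCAN') !== -1;
}

// Hand-written, heavily trimmed stand-in for an explain result (assumption).
const handWrittenExplain = {
  stages: [
    {
      $cursor: {
        queryPlanner: {
          winningPlan: { stage: 'FETCH', inputStage: { stage: 'IXSCAN' } },
        },
      },
    },
    { $group: {} },
  ],
};

console.log(collectStages(handWrittenExplain)); // [ 'FETCH', 'IXSCAN' ]
console.log(usesDistinctScan(handWrittenExplain)); // false
```

Because the walker does not depend on where the winning plan sits in the document, it should tolerate differences in explain layout between server versions, though that has not been verified here.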

Over a 2.5-day window we watched it trend upward as alerthistories filled in (a handful of alerts, 1-minute interval):

                     start      +2.5 days
avg docsExamined     ~10,500    ~12,400
avg durationMillis   ~127 ms    ~165 ms

It will plateau once the 7-day window saturates (well under the 30-day TTL), but at a cost set by 7d × write-rate rather than by what the query actually needs.
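The "7d × write-rate" relationship can be made concrete with a back-of-the-envelope estimate. This is a rough model only: `estimateKeysExamined` is a hypothetical helper, and `docsPerTick` (how many AlertHistory documents one alert writes per check) is an assumed input, set to 1 here for illustration.

```typescript
const MS_PER_MINUTE = 60 * 1000;
const MS_PER_DAY = 24 * 60 * MS_PER_MINUTE;

// Rough estimate of how many index keys fall inside the $match range for one
// alert once the lookback window is fully populated.
// Assumption: the alert writes `docsPerTick` documents on every check.
function estimateKeysExamined(
  windowMs: number,
  intervalMs: number,
  docsPerTick: number,
): number {
  const ticksInWindow = Math.floor(windowMs / intervalMs);
  return ticksInWindow * docsPerTick;
}

// Fixed 7-day window, 1-minute interval, one document per tick.
const fixedWindowKeys = estimateKeysExamined(7 * MS_PER_DAY, MS_PER_MINUTE, 1);
// Window bounded to 5 intervals (the multiplier 5 is an arbitrary example).
const boundedWindowKeys = estimateKeysExamined(5 * MS_PER_MINUTE, MS_PER_MINUTE, 1);

console.log(fixedWindowKeys);   // 10080
console.log(boundedWindowKeys); // 5
```

At one document per tick the fixed window comes to 10,080 keys, the same order of magnitude as the ~12,000 reported in the slow-query log; the estimate is linear in the window size, which is why shrinking the window shrinks the scan.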

Impact

  • Mild but steadily growing latency on a hot path that runs every tick per alert; it only levels off once the 7-day window saturates.
  • Heavy MongoDB slow-query log noise (one slow op per alert per minute) that drowns out genuinely slow operations.
  • Scales with alert count × history density, so it gets worse for heavier alerting setups.

Proposed fix

getPreviousAlertHistories() only needs the latest record per group, so bound the lookback to a small multiple of the alert's check interval (e.g. max(N × interval, someFloor)) instead of a fixed ms('7d'). That cuts keysExamined from ~all-rows-in-7-days to a handful while returning the identical result. If robustness against gaps is a concern, fall back to a wider window only when the narrow window returns nothing for an alert.
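A minimal sketch of the bounded lookback with a wide-window fallback follows. Everything named here is hypothetical: `narrowLookbackStart`, `latestWithFallback`, the `LookbackOptions` fields, and the default values (N = 5, a 15-minute floor) are illustrative choices, not existing HyperDX code. The actual history query is passed in as a function, so the sketch makes no assumptions about the aggregation itself.

```typescript
const MINUTE_MS = 60 * 1000;

interface LookbackOptions {
  multiplier: number; // N in max(N × interval, floor)
  floorMs: number;    // smallest narrow window ever used
  wideMs: number;     // fallback window, e.g. the current fixed 7 days
}

// Illustrative defaults only (assumption).
const DEFAULT_LOOKBACK: LookbackOptions = {
  multiplier: 5,
  floorMs: 15 * MINUTE_MS,
  wideMs: 7 * 24 * 60 * MINUTE_MS,
};

// Start of the narrow window: now - max(N × interval, floor).
function narrowLookbackStart(
  now: Date,
  intervalMs: number,
  opts: LookbackOptions = DEFAULT_LOOKBACK,
): Date {
  const windowMs = Math.max(opts.multiplier * intervalMs, opts.floorMs);
  return new Date(now.getTime() - windowMs);
}

// Stand-in for the existing aggregation, parameterised by its time range.
type HistoryQuery<T> = (from: Date, to: Date) => Promise<T[]>;

// Try the narrow window first; widen only if it returned nothing.
async function latestWithFallback<T>(
  query: HistoryQuery<T>,
  now: Date,
  intervalMs: number,
  opts: LookbackOptions = DEFAULT_LOOKBACK,
): Promise<T[]> {
  const narrow = await query(narrowLookbackStart(now, intervalMs, opts), now);
  if (narrow.length > 0) return narrow;
  return query(new Date(now.getTime() - opts.wideMs), now);
}
```

One design point to weigh: the fallback fires only when the narrow window is completely empty for the alert, so a group whose last record is older than the narrow window would be missed whenever other groups of the same alert are present. Whether that matters depends on how often groups go quiet, which may argue for a more generous multiplier or floor.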

(A larger refactor — maintaining current per-group state separately rather than deriving it from history each tick — would remove the scan entirely, but the lookback bound is the minimal, low-risk change.)

Related

Environment

  • Observed on a ClickStack deployment (HyperDX API alert checker + bundled MongoDB), per-minute alert intervals.
