checkAlerts: getPreviousAlertHistories scans a fixed 7-day window per alert per tick (only needs the latest record per group)

## Summary

`getPreviousAlertHistories()` only needs the single most-recent `AlertHistory` record per group, but it obtains that record by scanning a hard-coded **7-day** window and `$group`-ing it down. Because MongoDB does not skip-scan to the latest key per group here, every alert tick reads the entire 7-day range of history for each alert. With per-minute alert intervals this is one ~12k-key index scan per alert per tick, and the cost grows as history accumulates until the window saturates.

It's correct and indexed (not an outage risk), but it's an avoidable, steadily-growing cost and it floods MongoDB's slow-query log.

## Where

`packages/api/src/tasks/checkAlerts/index.ts` — `getPreviousAlertHistories()`:

```js
const lookbackDate = new Date(now.getTime() - ms('7d'));
// ...
$match: { alert: id, createdAt: { $lte: now, $gte: lookbackDate } },
$sort:  { alert: 1, group: 1, createdAt: -1 },
$group: { _id: { alert: '$alert', group: '$group' },
          createdAt: { $first: '$createdAt' }, state: { $first: '$state' } },
```

The schema itself is well-tuned (`packages/api/src/models/alertHistory.ts`): a 30-day TTL on `createdAt` plus the compound index `{ alert: 1, group: 1, createdAt: -1 }` this query relies on. The inefficiency is purely the **oversized lookback window**, not the indexing.

The code comment expects `$group + $first` to "short-circuit per group" off the index. In practice that short-circuit does not happen — see below.

## Observed in production

A document is written per alert per check interval, so for 1-minute alerts the "previous state" is never more than one interval old — yet the query examines the full 7-day window every tick. MongoDB's slow-query log (`Slow query`, id `51803`, `slowms` 100) shows, per evaluation:

- `planSummary: IXSCAN { alert: 1, group: 1, createdAt: -1 }`
- `keysExamined` ≈ `docsExamined` ≈ **12,000**, `nreturned` ≈ **30** (~400:1 examined:returned)
- `durationMillis` **110–200 ms**, one such aggregate per alert per minute

The plan is a plain `IXSCAN` of the whole window — not a `DISTINCT_SCAN` — so the per-group short-circuit the comment hopes for is not occurring; cost scales with window size, not with the number of groups returned.
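To track the examined:returned ratio over time, the slow-query entries can be summarised with a small script. The sketch below assumes MongoDB's structured JSON log format (4.4+), where the counters sit under `attr`; `parseSlowQuery` and `SlowQueryStats` are hypothetical names, and the sample entry is made up from the approximate figures quoted above rather than copied from a real log.

```typescript
// Fields of interest from one structured "Slow query" log entry (id 51803).
interface SlowQueryStats {
  planSummary: string;
  keysExamined: number;
  docsExamined: number;
  nreturned: number;
  durationMillis: number;
  examinedPerReturned: number; // keysExamined / nreturned
}

// Hypothetical helper: returns null for lines that are not slow-query entries.
function parseSlowQuery(line: string): SlowQueryStats | null {
  let entry: any;
  try {
    entry = JSON.parse(line);
  } catch {
    return null; // not a JSON log line
  }
  if (!entry || entry.id !== 51803 || !entry.attr) return null;
  const attr = entry.attr;
  const keysExamined = Number(attr.keysExamined ?? 0);
  const nreturned = Number(attr.nreturned ?? 0);
  return {
    planSummary: String(attr.planSummary ?? ''),
    keysExamined,
    docsExamined: Number(attr.docsExamined ?? 0),
    nreturned,
    durationMillis: Number(attr.durationMillis ?? 0),
    // Avoid dividing by zero when the query returned nothing.
    examinedPerReturned: nreturned > 0 ? keysExamined / nreturned : Infinity,
  };
}

// Made-up sample entry using the approximate figures quoted above.
const sample = JSON.stringify({
  id: 51803,
  msg: 'Slow query',
  attr: {
    planSummary: 'IXSCAN { alert: 1, group: 1, createdAt: -1 }',
    keysExamined: 12000,
    docsExamined: 12000,
    nreturned: 30,
    durationMillis: 150,
  },
});

const stats = parseSlowQuery(sample);
console.log(stats?.planSummary, stats?.examinedPerReturned); // ratio is 400
```

A ratio that stays near 1 after the change would suggest the scan is bounded by the number of groups returned rather than by the window size.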

Over a 2.5-day window we watched it trend upward as `alerthistories` filled in (a handful of alerts, 1-minute interval):

| | start | +2.5 days |
|---|---|---|
| avg `docsExamined` | ~10,500 | ~12,400 |
| avg `durationMillis` | ~127 ms | ~165 ms |

It will plateau once the 7-day window saturates (well under the 30-day TTL), but at a cost set by `7d × write-rate` rather than by what the query actually needs.

## Impact

- Mild latency growth on a hot path that runs every tick per alert; it keeps climbing until the 7-day window saturates.
- Heavy MongoDB slow-query log noise (one slow op per alert per minute) that drowns out genuinely slow operations.
- Scales with alert count × history density, so it gets worse for heavier alerting setups.

## Proposed fix

Because `getPreviousAlertHistories()` only needs the latest record per group, bound the lookback to a small multiple of the alert's check interval (e.g. `max(N × interval, someFloor)`) instead of a fixed `ms('7d')`. That cuts `keysExamined` from every key in the 7-day window to a handful, and returns the same result as long as each group has a record inside the narrower window. If robustness against gaps is a concern, fall back to a wider window only when the narrow window returns nothing for an alert.
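A minimal sketch of the lookback bound and the optional fallback, under stated assumptions: `computeLookbackMs`, `latestWithFallback`, and the `query` callback are hypothetical names, and the defaults (`N = 3`, a 15-minute floor, a 7-day cap) are illustrative values, not ones taken from the codebase. The `query` callback stands in for the existing aggregation, parameterised by `lookbackDate`.

```typescript
const MINUTE_MS = 60_000;
const DAY_MS = 24 * 60 * MINUTE_MS;

// All defaults below are assumptions chosen for illustration.
interface LookbackOptions {
  multiplier?: number; // "N" in max(N × interval, floor)
  floorMs?: number;    // lower bound, tolerates a few missed ticks
  capMs?: number;      // never wider than today's fixed window
}

// Returns max(N × interval, floor), clamped to the cap.
function computeLookbackMs(intervalMs: number, opts: LookbackOptions = {}): number {
  const { multiplier = 3, floorMs = 15 * MINUTE_MS, capMs = 7 * DAY_MS } = opts;
  if (!Number.isFinite(intervalMs) || intervalMs <= 0) {
    throw new RangeError('intervalMs must be a positive number');
  }
  return Math.min(Math.max(multiplier * intervalMs, floorMs), capMs);
}

// Stand-in for the existing aggregation, given the start of the window.
type HistoryQuery<T> = (lookbackDate: Date) => Promise<T[]>;

// Try the narrow window first; widen only if it returned nothing.
async function latestWithFallback<T>(
  now: Date,
  intervalMs: number,
  query: HistoryQuery<T>,
  wideMs: number = 7 * DAY_MS,
): Promise<T[]> {
  const narrowMs = computeLookbackMs(intervalMs);
  const narrow = await query(new Date(now.getTime() - narrowMs));
  if (narrow.length > 0 || narrowMs >= wideMs) return narrow;
  return query(new Date(now.getTime() - wideMs));
}

// A 1-minute alert gets the 15-minute floor; a 1-hour alert gets 3 hours.
console.log(computeLookbackMs(MINUTE_MS) / MINUTE_MS);      // 15
console.log(computeLookbackMs(60 * MINUTE_MS) / MINUTE_MS); // 180
```

Note that this fallback only fires when the narrow window is empty for the whole alert. A group whose latest record is older than the narrow window, while other groups are current, would be missing from the result, so whether that matters depends on how the caller treats absent groups.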

(A larger refactor — maintaining current per-group state separately rather than deriving it from history each tick — would remove the scan entirely, but the lookback bound is the minimal, low-risk change.)

## Related

- #1411 (alert execution stability/concurrency) touches the same `checkAlerts` task but addresses a different concern (one alert blocking others), not the history-read cost.

## Environment

- Observed on a ClickStack deployment (HyperDX API alert checker + bundled MongoDB), per-minute alert intervals.

