Commit 3549563

d-cs and claude committed
docs(mollifier): ops manual with alertable signals and recovery flows
Captures what the metrics mean, which signal is alertable (the `stale_entries.current` gauge, not the counter), each named failure mode and its recovery flow, and Redis debug commands for poking at the buffer by hand. Mirrors the `batch-queue-metrics.md` internal-doc style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent af85cda commit 3549563

1 file changed

Lines changed: 211 additions & 0 deletions

docs/mollifier-ops.md

@@ -0,0 +1,211 @@
# Mollifier Ops Manual

The mollifier is a Redis-backed buffer that sits in front of the Postgres
trigger-task path. When the per-env trigger rate exceeds the configured
threshold, the gate diverts the trigger into a Redis ZSET; a drainer
later materialises the buffered entry as a real PG `TaskRun` via
`engine.trigger`. This document covers what to watch, how to recognise
each failure mode, and how to recover.

## Architecture at a glance

```
  client.trigger()
        |
        v
  triggerTask.server.ts ── traceEventConcern.traceRun (writes run span to ClickHouse)
        |                       |
        |            gate evaluates per-env rate
        |                       |
        |                ┌──────┴───────┐
        |                |              |
        |              PASS          MOLLIFY
        |                |              |
        |         engine.trigger   mollifier:queue:<envId> (ZSET, score = createdAtMicros)
        |         → PG TaskRun     mollifier:entries:<runId> (hash, snapshot payload)
        v
  PG TaskRun + Electric stream + dashboard
        ^
        |
  mollifier drainer (when buffered)
    - pops oldest entry from ZSET
    - calls engine.trigger with snapshot
    - writes PG TaskRun
```

Key flag: `TRIGGER_MOLLIFIER_ENABLED=1` turns the whole system on. With it
off the gate short-circuits and every trigger goes straight to PG.

## Key Redis keys

| Key pattern | Type | Purpose |
|---|---|---|
| `mollifier:queue:<envId>` | ZSET | Per-env queue. Score is `createdAtMicros`. Member is the runId. |
| `mollifier:entries:<runId>` | HASH | Snapshot payload + metadata for one buffered run. |
| `mollifier:orgs` | SET | Tracks orgs with non-empty buffers (for drainer fairness). |
| `mollifier:envs:<orgId>` | SET | Tracks envs with non-empty buffers under each org. |
| `mollifier:idempotency:<envId>:<taskId>:<key>` | STRING | SETNX for buffered-window idempotency dedup. |

The drainer pops `(orgId, envId)` pairs fairly, pulls the oldest member from
the env queue, reads the snapshot hash, and replays it. On success it
deletes the hash and the ZSET member; on a retryable error it requeues.

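The drain step described above can be modelled in memory to make the ordering and cleanup rules concrete. This is a minimal sketch, not the real drainer: `InMemoryBuffer`, `drainOne`, and the `replay` callback are hypothetical names, org/env fairness is left out, and "requeue" is modelled simply as leaving the entry in place, which is an assumption about behaviour the manual does not spell out.

```typescript
// Hypothetical in-memory stand-in for the Redis ZSET + HASH pair.
type Snapshot = Record<string, string>;
type ReplayResult = "ok" | "retryable_error";

class InMemoryBuffer {
  // envId -> members sorted by score (createdAtMicros), oldest first
  private queues = new Map<string, { runId: string; score: number }[]>();
  // runId -> snapshot payload (stands in for mollifier:entries:<runId>)
  private entries = new Map<string, Snapshot>();

  buffer(envId: string, runId: string, createdAtMicros: number, snapshot: Snapshot): void {
    const queue = this.queues.get(envId) ?? [];
    queue.push({ runId, score: createdAtMicros });
    queue.sort((a, b) => a.score - b.score);
    this.queues.set(envId, queue);
    this.entries.set(runId, snapshot);
  }

  depth(envId: string): number {
    return this.queues.get(envId)?.length ?? 0;
  }

  // Attempts the oldest entry of one env; returns the runId it tried, if any.
  drainOne(envId: string, replay: (runId: string, snapshot: Snapshot) => ReplayResult): string | undefined {
    const queue = this.queues.get(envId);
    const oldest = queue?.[0];
    if (!queue || !oldest) return undefined;

    const snapshot = this.entries.get(oldest.runId);
    if (!snapshot) return undefined; // queue member without a hash: a phantom

    if (replay(oldest.runId, snapshot) === "ok") {
      // Success: remove both the queue member and the snapshot hash.
      queue.shift();
      this.entries.delete(oldest.runId);
    }
    // Retryable error (assumption): the entry stays queued for a later pass.
    return oldest.runId;
  }
}
```

In this model a retryable failure leaves the queue depth unchanged, so a persistently failing oldest entry blocks its env until it succeeds or is failed terminally.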
## Metrics

### Alertable signals

| Metric | Type | Labels | Alert pattern |
|---|---|---|---|
| `mollifier.stale_entries.current` | Gauge | `envId` | `> 0 for 5m`: the drainer is offline or falling behind |
| `mollifier.realtime_subscriptions.buffered` | Counter | `envId` | Rate climbing: many customers are hitting the buffered window |

### Diagnostic signals

| Metric | Type | Labels | Meaning |
|---|---|---|---|
| `mollifier.decisions` | Counter | `outcome` (`pass_through`, `mollify`, `shadow_log`), `reason` (e.g. `per_env_rate`) | Gate decisions over time |
| `mollifier.stale_entries` | Counter | `envId` | Per-sweep stale-entry events. **Not directly alertable**; see the `…current` gauge instead |

The gate-decisions counter is the primary throughput view: when the
mollifier is doing its job, the `mollify` slice climbs in lockstep with
the trigger burst.

### Structured logs

| Message | Level | Fields |
|---|---|---|
| `mollifier.buffered` | info | `runId`, `envId`, `orgId`, `taskId`, `reason` |
| `mollifier.stale_entry` | warn | `runId`, `envId`, `orgId`, `dwellMs`, `staleThresholdMs` |
| `mollifier.realtime.buffered_subscription` | info | `runId`, `envId`, `bufferDwellMs` |

The stale-entry log emits **one line per stale entry per sweep tick**.
A single stuck entry emits roughly one line every `TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS`
(default 5 min) until it drains. For alert routing, prefer the gauge.

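To illustrate why the gauge is the alertable signal and the counter is not, the sketch below models a sweep tick. It is an assumption-laden illustration rather than the sweep's actual code: `sweepStaleCounts` and `StaleTelemetry` are hypothetical names, and treating an entry as stale when its dwell is strictly greater than the threshold is a guess.

```typescript
interface BufferedEntry {
  runId: string;
  envId: string;
  createdAtMs: number;
}

// One sweep tick: count entries per env whose dwell exceeds the threshold.
// Envs whose buffered entries are all fresh report 0.
function sweepStaleCounts(
  entries: BufferedEntry[],
  nowMs: number,
  staleThresholdMs: number
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const entry of entries) {
    const dwellMs = nowMs - entry.createdAtMs;
    const stale = dwellMs > staleThresholdMs ? 1 : 0; // assumption: strict ">"
    counts.set(entry.envId, (counts.get(entry.envId) ?? 0) + stale);
  }
  return counts;
}

// Hypothetical telemetry state: the gauge is overwritten each sweep,
// the counter only ever accumulates.
class StaleTelemetry {
  gauge = new Map<string, number>();
  counter = new Map<string, number>();

  recordSweep(counts: Map<string, number>): void {
    counts.forEach((stale, envId) => {
      this.gauge.set(envId, stale);
      this.counter.set(envId, (this.counter.get(envId) ?? 0) + stale);
    });
  }
}
```

With one stuck entry observed over three sweeps, the gauge reads 1 while the counter reads 3, and only the gauge returns to 0 once the entry drains. A real implementation would also need to reset the gauge for envs that no longer have any buffered entries, which this sketch does not cover.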
## Configuration

The mollifier-related env vars live in `apps/webapp/app/env.server.ts`.
Defaults are tuned for production; adjust the values below during incident response.

| Var | Default | Purpose |
|---|---|---|
| `TRIGGER_MOLLIFIER_ENABLED` | `0` | Master switch |
| `TRIGGER_MOLLIFIER_DRAINER_ENABLED` | inherits | Which replicas run the drainer loop. In multi-replica deployments, set to `1` on dedicated drainer replicas only |
| `TRIGGER_MOLLIFIER_TRIP_WINDOW_MS` | `200` | Sliding window for per-env trigger rate |
| `TRIGGER_MOLLIFIER_TRIP_THRESHOLD` | `100` | Trigger count that trips the gate within the window |
| `TRIGGER_MOLLIFIER_HOLD_MS` | `500` | How long the gate stays tripped after it trips |
| `TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY` | `50` | Parallel drains per replica |
| `TRIGGER_MOLLIFIER_DRAIN_MAX_ATTEMPTS` | `3` | Attempts before terminal failure, which writes a `SYSTEM_FAILURE` PG row |
| `TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED` | inherits | Run the alerting sweep |
| `TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS` | `300_000` | Sweep cadence |
| `TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS` | (unset) | Dwell threshold. Defaults to half of `entryTtlSeconds` when unset |

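The interaction between the trip window, trip threshold, and hold time is easier to reason about with a small model. The sketch below is a single-process approximation under stated assumptions: `EnvGate` and `GateConfig` are hypothetical names, the real gate presumably keeps its counts somewhere shared rather than in process memory, and whether the gate trips at or above the threshold is a guess (this version trips when the count exceeds it).

```typescript
interface GateConfig {
  tripWindowMs: number;  // TRIGGER_MOLLIFIER_TRIP_WINDOW_MS
  tripThreshold: number; // TRIGGER_MOLLIFIER_TRIP_THRESHOLD
  holdMs: number;        // TRIGGER_MOLLIFIER_HOLD_MS
}

type Decision = "pass_through" | "mollify";

// Hypothetical per-env gate: one instance per envId.
class EnvGate {
  private readonly config: GateConfig;
  private timestamps: number[] = [];
  private trippedUntilMs = 0;

  constructor(config: GateConfig) {
    this.config = config;
  }

  decide(nowMs: number): Decision {
    // Keep only triggers that fall inside the sliding window, then add this one.
    const windowStart = nowMs - this.config.tripWindowMs;
    this.timestamps = this.timestamps.filter((t) => t > windowStart);
    this.timestamps.push(nowMs);

    // Assumption: the gate trips when the windowed count exceeds the threshold,
    // and each trip extends the hold period from the current time.
    if (this.timestamps.length > this.config.tripThreshold) {
      this.trippedUntilMs = nowMs + this.config.holdMs;
    }
    return nowMs < this.trippedUntilMs ? "mollify" : "pass_through";
  }
}
```

With the defaults from the table, more than 100 triggers inside any 200 ms span would divert that env's triggers for the next 500 ms, even if the burst ends immediately. Raising the threshold or shortening the hold reduces how much traffic is buffered during an incident.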
## Failure modes & recovery

### Drainer is stopped / falling behind

**Signal**: `mollifier_stale_entries_current{envId=...} > 0 for 5m`
plus `mollifier.stale_entry` warn logs.

**Triage**:
1. Check drainer health on each replica: is the polling loop running?
   Run `grep "Initializing mollifier drainer"` against the boot logs, and look
   for recent `recordRunDebugLog` entries from `mollifier.drained` spans in
   Axiom.
2. Check Redis reachability from the drainer replica.
3. Check `TRIGGER_MOLLIFIER_DRAINER_ENABLED`: was it accidentally turned off?

**Recovery**: bring the drainer back up. It will drain the backlog with up to
`TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY` parallel drains per replica. The gauge
clears as each env's stale count drops to 0.

### Buffer growing in Redis

**Signal**: Redis memory pressure alerts (separate from mollifier).

**Triage**:
```sh
redis-cli ZCARD "mollifier:queue:<envId>"   # depth for one env
redis-cli SCARD "mollifier:orgs"            # orgs with non-empty buffers
```

**Recovery**: drainer pickup is the only mechanism that removes entries.
If Redis is about to OOM, the safest option is to scale up drain capacity
temporarily (raise `TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY` or add drainer
replicas).

### Terminal drainer failure on a non-retryable error

**Signal**: `SYSTEM_FAILURE` PG rows with `error.raw` matching
`Mollifier drainer terminal failure: …`. The existing alerts pipeline picks
these up via `runFailed`.

**Triage**: the snapshot was structurally valid enough to reach
`engine.trigger`, but `engine.trigger` threw a non-retryable error
(schema drift, version-locked-task race, etc.). The drainer writes the
`SYSTEM_FAILURE` row via `engine.createFailedTaskRun` so the customer
sees the run in their dashboard rather than nothing.

**Recovery**: case-by-case. Read the error message in the `SYSTEM_FAILURE`
row and fix the underlying issue.

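The relationship between retryable errors, `TRIGGER_MOLLIFIER_DRAIN_MAX_ATTEMPTS`, and the terminal `SYSTEM_FAILURE` row can be sketched as a small decision function. Everything here is illustrative: `drainWithAttempts`, `DrainOutcome`, and the `isRetryable` callback are hypothetical, the real handler's error classification is not documented in this manual, and appending the thrown error's message after the documented prefix is an assumption.

```typescript
type DrainOutcome =
  | { kind: "materialised" }
  | { kind: "requeued"; attempts: number }
  | { kind: "system_failure"; error: string };

// Decides what happens to one buffered entry after one drain attempt.
// `trigger` stands in for the engine.trigger call; it throws on failure.
function drainWithAttempts(
  trigger: () => void,
  isRetryable: (err: unknown) => boolean,
  attemptsSoFar: number,
  maxAttempts: number
): DrainOutcome {
  try {
    trigger();
    return { kind: "materialised" };
  } catch (err) {
    const attempts = attemptsSoFar + 1;
    if (isRetryable(err) && attempts < maxAttempts) {
      // Retryable and attempts remain: put the entry back for another pass.
      return { kind: "requeued", attempts };
    }
    // Non-retryable, or attempts exhausted: surface a failed run to the customer.
    const message = err instanceof Error ? err.message : String(err);
    return { kind: "system_failure", error: `Mollifier drainer terminal failure: ${message}` };
  }
}
```

Under this model a non-retryable error fails on the first attempt, while a retryable one only reaches `SYSTEM_FAILURE` once the attempt budget is spent.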
### Cancel-before-PG (Q4 bifurcation)

When a customer cancels a buffered run, the snapshot is patched with
`cancelledAt` and `cancelReason`. When the drainer next picks it up, it
takes the cancel-bifurcation path: it writes a `CANCELED` PG row via
`engine.createCancelledRun` instead of triggering. Electric streams the
INSERT to `useRealtimeRun` subscribers.

If the drainer is offline, the snapshot just sits in Redis with
`cancelledAt` set. The customer's API cancel call has already returned
success (synthesised from the snapshot), but the realtime hook stays
unpopulated until the drainer materialises the row.

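The bifurcation comes down to one branch on the snapshot. The sketch below shows that branch in isolation, assuming a simplified snapshot shape: `BufferedSnapshot`, `DrainAction`, and `chooseDrainAction` are hypothetical names, and the function returns a description of which engine call would be made rather than calling the engine.

```typescript
// Hypothetical, simplified view of the snapshot hash for one buffered run.
interface BufferedSnapshot {
  runId: string;
  cancelledAt?: string;
  cancelReason?: string;
}

type DrainAction =
  | { call: "engine.createCancelledRun"; runId: string; cancelledAt: string; cancelReason: string | undefined }
  | { call: "engine.trigger"; runId: string };

// A snapshot patched with cancelledAt is materialised as a cancelled run;
// anything else is triggered normally.
function chooseDrainAction(snapshot: BufferedSnapshot): DrainAction {
  if (snapshot.cancelledAt !== undefined) {
    return {
      call: "engine.createCancelledRun",
      runId: snapshot.runId,
      cancelledAt: snapshot.cancelledAt,
      cancelReason: snapshot.cancelReason,
    };
  }
  return { call: "engine.trigger", runId: snapshot.runId };
}
```

Because the decision is made at drain time, a cancel that lands while the drainer is offline has no visible PG effect until the drainer resumes.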
### Realtime subscription opened during the buffered window

`useRealtimeRun(bufferedRunId)` keeps the Electric subscription open
against `WHERE id=<id>` even though no PG row exists yet. Each initial
subscription increments `mollifier.realtime_subscriptions.buffered` and
logs `mollifier.realtime.buffered_subscription`. When the drainer
INSERTs the PG row, Electric streams it to the client.

This is normal behaviour. It is only worth investigating if the counter
climbs disproportionately to the gate's `mollify` outcomes, which suggests
that customers are subscribing inside the buffered window faster than the
drainer can materialise.

## Manual buffer inspection

```sh
# Newest member of an env's queue (highest score)
redis-cli -p 6379 ZRANGE "mollifier:queue:<envId>" -1 -1 WITHSCORES

# Full payload for one buffered run
redis-cli -p 6379 HGETALL "mollifier:entries:<runId>"

# Depth per env
for k in $(redis-cli -p 6379 --scan --pattern 'mollifier:queue:*'); do
  echo "$k $(redis-cli -p 6379 ZCARD "$k")"
done

# Orgs with non-empty buffers
redis-cli -p 6379 SMEMBERS "mollifier:orgs"
```

A phantom ZSET member (`ZSCORE` returns a value but the entry hash is
empty) used to be possible when entry-hash TTLs expired ahead of the
queue ZSET. The entry TTL has since been removed; entries persist
until the drainer ACKs them. If you see a phantom in prod, that
indicates a real bug: investigate before manually `ZREM`-ing.

## Related code

- Drainer loop: `internal-packages/redis-worker/src/mollifier/drainer.ts`
- Drainer handler: `apps/webapp/app/v3/mollifier/mollifierDrainerHandler.server.ts`
- Gate: `apps/webapp/app/v3/mollifier/mollifierGate.server.ts`
- Mollify (write to buffer): `apps/webapp/app/v3/mollifier/mollifierMollify.server.ts`
- Sweep: `apps/webapp/app/v3/mollifier/mollifierStaleSweep.server.ts`
- Telemetry: `apps/webapp/app/v3/mollifier/mollifierTelemetry.server.ts`
- Realtime buffered-fallback: `apps/webapp/app/routes/realtime.v1.runs.$runId.ts`
- Test helpers: `apps/webapp/test/mollifier*.test.ts`
