|
| 1 | +# Mollifier Ops Manual |
| 2 | + |
| 3 | +The mollifier is a Redis-backed buffer that sits in front of the Postgres |
| 4 | +trigger-task path. When the per-env trigger rate exceeds the configured |
| 5 | +threshold, the gate diverts the trigger into a Redis ZSET; a drainer |
| 6 | +later materialises the buffered entry as a real PG `TaskRun` via |
| 7 | +`engine.trigger`. This document covers what to watch, how to recognise |
| 8 | +each failure mode, and how to recover. |
| 9 | + |
| 10 | +## Architecture at a glance |
| 11 | + |
| 12 | +``` |
| 13 | +client.trigger() |
| 14 | + | |
| 15 | + v |
| 16 | +triggerTask.server.ts ── traceEventConcern.traceRun (writes run span to ClickHouse) |
| 17 | + | | |
| 18 | + | gate evaluates per-env rate |
| 19 | + | | |
| 20 | + | ┌──────┴───────┐ |
| 21 | + | | | |
| 22 | + | PASS MOLLIFY |
| 23 | + | | | |
| 24 | + | engine.trigger mollifier:queue:<envId> (ZSET, score = createdAtMicros) |
| 25 | + | → PG TaskRun mollifier:entries:<runId> (hash, snapshot payload) |
| 26 | + v |
| 27 | + PG TaskRun + Electric stream + dashboard |
| 28 | + ^ |
| 29 | + | |
| 30 | + mollifier drainer (when buffered) |
| 31 | + - pops oldest entry from ZSET |
| 32 | + - calls engine.trigger with snapshot |
| 33 | + - writes PG TaskRun |
| 34 | +``` |
| 35 | + |
| 36 | +Key flag: `TRIGGER_MOLLIFIER_ENABLED=1` turns the whole system on. With it |
| 37 | +off the gate short-circuits and every trigger goes straight to PG. |
| 38 | + |
| 39 | +## Key Redis keys |
| 40 | + |
| 41 | +| Key pattern | Type | Purpose | |
| 42 | +|---|---|---| |
| 43 | +| `mollifier:queue:<envId>` | ZSET | Per-env queue. Score is `createdAtMicros`. Member is the runId. | |
| 44 | +| `mollifier:entries:<runId>` | HASH | Snapshot payload + metadata for one buffered run. | |
| 45 | +| `mollifier:orgs` | SET | Tracks orgs with non-empty buffers (for drainer fairness). | |
| 46 | +| `mollifier:envs:<orgId>` | SET | Tracks envs with non-empty buffers under each org. | |
| 47 | +| `mollifier:idempotency:<envId>:<taskId>:<key>` | STRING | SETNX for buffered-window idempotency dedup. | |
| 48 | + |
| 49 | +The drainer pops `(orgId, envId)` pairs fairly, pulls oldest member from |
| 50 | +the env queue, reads the snapshot hash, and replays it. On success it |
| 51 | +deletes the hash and the ZSET member; on retryable error it requeues. |
| 52 | + |
| 53 | +## Metrics |
| 54 | + |
| 55 | +### Alertable signals |
| 56 | + |
| 57 | +| Metric | Type | Labels | Alert pattern | |
| 58 | +|---|---|---|---| |
| 59 | +| `mollifier.stale_entries.current` | Gauge | `envId` | `> 0 for 5m` — drainer is offline or falling behind | |
| 60 | +| `mollifier.realtime_subscriptions.buffered` | Counter | `envId` | rate climbing — many customers hitting the buffered-window | |
| 61 | + |
| 62 | +### Diagnostic signals |
| 63 | + |
| 64 | +| Metric | Type | Labels | Meaning | |
| 65 | +|---|---|---|---| |
| 66 | +| `mollifier.decisions` | Counter | `outcome` (`pass_through`, `mollify`, `shadow_log`), `reason` (e.g. `per_env_rate`) | Gate decisions over time | |
| 67 | +| `mollifier.stale_entries` | Counter | `envId` | Per-sweep stale-entry events. **Not directly alertable** — see `…current` gauge instead | |
| 68 | + |
| 69 | +The gate-decisions counter is the primary throughput view: when the |
| 70 | +mollifier is doing its job the `mollify` slice climbs in lockstep with |
| 71 | +the trigger burst. |
| 72 | + |
| 73 | +### Structured logs |
| 74 | + |
| 75 | +| Message | Level | Fields | |
| 76 | +|---|---|---| |
| 77 | +| `mollifier.buffered` | info | `runId`, `envId`, `orgId`, `taskId`, `reason` | |
| 78 | +| `mollifier.stale_entry` | warn | `runId`, `envId`, `orgId`, `dwellMs`, `staleThresholdMs` | |
| 79 | +| `mollifier.realtime.buffered_subscription` | info | `runId`, `envId`, `bufferDwellMs` | |
| 80 | + |
| 81 | +The stale-entry log emits **one line per stale entry per sweep tick**. |
| 82 | +A single stuck entry will emit ~once every `TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS` |
| 83 | +(default 5min) until it drains. For alert routing, prefer the gauge. |
| 84 | + |
| 85 | +## Configuration |
| 86 | + |
| 87 | +The mollifier-related env vars live in `apps/webapp/app/env.server.ts`. |
| 88 | +Defaults are tuned for production; tune below for incident response. |
| 89 | + |
| 90 | +| Var | Default | Purpose | |
| 91 | +|---|---|---| |
| 92 | +| `TRIGGER_MOLLIFIER_ENABLED` | `0` | Master switch | |
| 93 | +| `TRIGGER_MOLLIFIER_DRAINER_ENABLED` | inherits | Which replicas run the drainer loop. Set to `1` on dedicated drainer replicas only in multi-replica deployments | |
| 94 | +| `TRIGGER_MOLLIFIER_TRIP_WINDOW_MS` | `200` | Sliding window for per-env trigger rate | |
| 95 | +| `TRIGGER_MOLLIFIER_TRIP_THRESHOLD` | `100` | Trigger count that trips the gate within the window | |
| 96 | +| `TRIGGER_MOLLIFIER_HOLD_MS` | `500` | How long the gate stays tripped once it's tripped | |
| 97 | +| `TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY` | `50` | Parallel drains per replica | |
| 98 | +| `TRIGGER_MOLLIFIER_DRAIN_MAX_ATTEMPTS` | `3` | Retries before terminal failure → `SYSTEM_FAILURE` PG row | |
| 99 | +| `TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED` | inherits | Run the alerting sweep | |
| 100 | +| `TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS` | `300_000` | Sweep cadence | |
| 101 | +| `TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS` | (unset) | Dwell threshold. Defaults to half of `entryTtlSeconds` when unset | |
| 102 | + |
| 103 | +## Failure modes & recovery |
| 104 | + |
| 105 | +### Drainer is stopped / falling behind |
| 106 | + |
| 107 | +**Signal**: `mollifier_stale_entries_current{envId=...} > 0 for 5m` |
| 108 | +plus `mollifier.stale_entry` warn logs. |
| 109 | + |
| 110 | +**Triage**: |
| 111 | +1. Check drainer health on each replica — is the polling loop running? |
| 112 | + `grep "Initializing mollifier drainer"` near boot logs; recent |
| 113 | + `recordRunDebugLog` entries from `mollifier.drained` spans in |
| 114 | + Axiom. |
| 115 | +2. Check Redis reachability from the drainer replica. |
| 116 | +3. Check `TRIGGER_MOLLIFIER_DRAINER_ENABLED` — accidentally turned off? |
| 117 | + |
| 118 | +**Recovery**: bring the drainer back up. It will drain the backlog at |
| 119 | +`TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY` per replica. The gauge clears as |
| 120 | +each env's stale count drops to 0. |
| 121 | + |
| 122 | +### Buffer growing in Redis |
| 123 | + |
| 124 | +**Signal**: Redis memory pressure alerts (separate from mollifier). |
| 125 | + |
| 126 | +**Triage**: |
| 127 | +```sh |
| 128 | +redis-cli ZCARD "mollifier:queue:<envId>" # depth for one env |
| 129 | +redis-cli SCARD "mollifier:orgs" # orgs with non-empty buffers |
| 130 | +``` |
| 131 | + |
| 132 | +**Recovery**: drainer pickup is the only mechanism that removes entries. |
| 133 | +If Redis is about to OOM, the safest option is to scale up the drainer |
| 134 | +replica count temporarily (raise `TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY` |
| 135 | +or add replicas). |
| 136 | + |
| 137 | +### Terminal drainer failure on a non-retryable error |
| 138 | + |
| 139 | +**Signal**: `SYSTEM_FAILURE` PG rows with `error.raw` matching |
| 140 | +`Mollifier drainer terminal failure: …`. Existing alerts pipeline picks |
| 141 | +these up via `runFailed`. |
| 142 | + |
| 143 | +**Triage**: the snapshot was structurally valid enough to reach |
| 144 | +`engine.trigger`, but engine.trigger threw a non-retryable error |
| 145 | +(schema drift, version-locked-task race, etc.). The drainer writes the |
| 146 | +SYSTEM_FAILURE row via `engine.createFailedTaskRun` so the customer |
| 147 | +sees the run in their dashboard rather than nothing. |
| 148 | + |
| 149 | +**Recovery**: case-by-case. Read the error message in the SYSTEM_FAILURE |
| 150 | +row; fix the underlying issue. |
| 151 | + |
| 152 | +### Cancel-before-PG (Q4 bifurcation) |
| 153 | + |
| 154 | +A customer cancelling a buffered run patches the snapshot with |
| 155 | +`cancelledAt` + `cancelReason`. When the drainer next picks it up, it |
| 156 | +takes the cancel-bifurcation path: writes a `CANCELED` PG row via |
| 157 | +`engine.createCancelledRun` instead of triggering. Electric streams the |
| 158 | +INSERT to `useRealtimeRun` subscribers. |
| 159 | + |
| 160 | +If the drainer is offline, the snapshot just sits in Redis with |
| 161 | +`cancelledAt` set. The customer's API cancel call already returned |
| 162 | +success (synthesised from the snapshot), but the realtime hook stays |
| 163 | +unpopulated until the drainer materialises the row. |
| 164 | + |
| 165 | +### Realtime subscription opened during the buffered window |
| 166 | + |
| 167 | +`useRealtimeRun(bufferedRunId)` keeps the Electric subscription open |
| 168 | +against `WHERE id=<id>` even though no PG row exists yet. Each initial |
| 169 | +subscription increments `mollifier.realtime_subscriptions.buffered` and |
| 170 | +logs `mollifier.realtime.buffered_subscription`. When the drainer |
| 171 | +INSERTs the PG row, Electric streams it to the client. |
| 172 | + |
| 173 | +This is normal behaviour — only worth investigating if the counter |
| 174 | +climbs disproportionately to the gate's `mollify` outcomes (suggests |
| 175 | +customers are subscribing inside the buffered window faster than the |
| 176 | +drainer can materialise). |
| 177 | + |
| 178 | +## Manual buffer inspection |
| 179 | + |
| 180 | +```sh |
| 181 | +# Latest member of an env's queue (newest first by score) |
| 182 | +redis-cli -p 6379 ZRANGE "mollifier:queue:<envId>" -1 -1 WITHSCORES |
| 183 | + |
| 184 | +# Full payload for one buffered run |
| 185 | +redis-cli -p 6379 HGETALL "mollifier:entries:<runId>" |
| 186 | + |
| 187 | +# Depth per env |
| 188 | +for k in $(redis-cli -p 6379 --scan --pattern 'mollifier:queue:*'); do |
| 189 | + echo "$k $(redis-cli -p 6379 ZCARD $k)" |
| 190 | +done |
| 191 | + |
| 192 | +# Orgs with non-empty buffers |
| 193 | +redis-cli -p 6379 SMEMBERS "mollifier:orgs" |
| 194 | +``` |
| 195 | + |
| 196 | +A phantom ZSET member (`ZSCORE` returns a value but the entry hash is |
| 197 | +empty) used to be possible when entry-hash TTLs expired ahead of the |
| 198 | +queue ZSET. The entry TTL has since been removed; entries persist |
| 199 | +until the drainer ACKs them. If you see a phantom in prod, that |
| 200 | +indicates a real bug — investigate before manually `ZREM`-ing. |
| 201 | + |
| 202 | +## Related code |
| 203 | + |
| 204 | +- Drainer loop: `internal-packages/redis-worker/src/mollifier/drainer.ts` |
| 205 | +- Drainer handler: `apps/webapp/app/v3/mollifier/mollifierDrainerHandler.server.ts` |
| 206 | +- Gate: `apps/webapp/app/v3/mollifier/mollifierGate.server.ts` |
| 207 | +- Mollify (write to buffer): `apps/webapp/app/v3/mollifier/mollifierMollify.server.ts` |
| 208 | +- Sweep: `apps/webapp/app/v3/mollifier/mollifierStaleSweep.server.ts` |
| 209 | +- Telemetry: `apps/webapp/app/v3/mollifier/mollifierTelemetry.server.ts` |
| 210 | +- Realtime buffered-fallback: `apps/webapp/app/routes/realtime.v1.runs.$runId.ts` |
| 211 | +- Test helpers: `apps/webapp/test/mollifier*.test.ts` |
0 commit comments