feat(redis-worker,webapp): drop mollifier entry TTL — drainer is the recovery mechanism
Buffer entries used to EXPIRE after entryTtlSeconds (600s dev / 1h
prod). Once that window elapsed without the drainer ack'ing, the
entry just vanished — no PG row, no log, no customer signal. The
stale-entry sweep was added in the previous commit so ops gets paged
on dwell-too-long; with that signal in place, the TTL itself is now
the cause of the failure mode it was meant to mitigate.

Remove it. Buffer entries persist until the drainer ACKs (with the
existing 30s post-materialise grace TTL) or FAILs them. Idempotency
lookup keys also lose their TTL — keeping them paired to the entry
hash prevents the dedup-drift bug where a TTL'd lookup would let the
same idempotency key spawn a second buffered run while the first
still existed. `failMollifierEntry` now DELs the entry hash + lookup
because the SYSTEM_FAILURE PG row written by the drainer is the
canonical record; the buffer entry is no longer load-bearing.
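
The lifecycle described above can be sketched with a small in-memory model. This is not the `MollifierBuffer` implementation: the class name, method names, and the injected clock are hypothetical, and tying the idempotency lookup's removal to the entry's removal after the ack grace period is an assumption made for illustration. The return values of `ttlSeconds` follow the Redis `TTL` convention (-1 for no expiry, -2 for a missing key).

```typescript
// Hypothetical in-memory model of the buffer lifecycle; the real buffer lives in Redis.
const ACK_GRACE_MS = 30_000; // the 30s post-materialise grace TTL from the commit message

interface BufferEntry {
  idempotencyKey: string | undefined;
  expiresAtMs: number | null; // null means no TTL (steady state after this change)
}

class InMemoryMollifierBuffer {
  private readonly entries = new Map<string, BufferEntry>();
  private readonly lookups = new Map<string, string>(); // idempotency key -> runId

  // Returns false when the runId or the idempotency key is already buffered.
  accept(runId: string, nowMs: number, idempotencyKey?: string): boolean {
    this.evictExpired(nowMs);
    if (this.entries.has(runId)) return false;
    if (idempotencyKey !== undefined && this.lookups.has(idempotencyKey)) return false;
    this.entries.set(runId, { idempotencyKey, expiresAtMs: null });
    if (idempotencyKey !== undefined) this.lookups.set(idempotencyKey, runId);
    return true;
  }

  // ACK: the entry lingers for the grace period, then disappears.
  ack(runId: string, nowMs: number): void {
    const entry = this.entries.get(runId);
    if (entry) entry.expiresAtMs = nowMs + ACK_GRACE_MS;
  }

  // FAIL: tear down the entry and its lookup immediately.
  fail(runId: string): void {
    this.remove(runId);
  }

  // Mirrors Redis TTL semantics: -2 = missing, -1 = no expiry, else seconds left.
  ttlSeconds(runId: string, nowMs: number): number {
    this.evictExpired(nowMs);
    const entry = this.entries.get(runId);
    if (!entry) return -2;
    if (entry.expiresAtMs === null) return -1;
    return Math.ceil((entry.expiresAtMs - nowMs) / 1000);
  }

  private remove(runId: string): void {
    const entry = this.entries.get(runId);
    if (!entry) return;
    if (entry.idempotencyKey !== undefined) this.lookups.delete(entry.idempotencyKey);
    this.entries.delete(runId);
  }

  private evictExpired(nowMs: number): void {
    this.entries.forEach((entry, runId) => {
      if (entry.expiresAtMs !== null && entry.expiresAtMs <= nowMs) this.remove(runId);
    });
  }
}
```

In this model an un-acked entry never expires however long it dwells, a second `accept` with the same idempotency key is rejected while the first entry exists, and a failed runId can be buffered again straight away.
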

Knock-on changes:

- `MollifierBufferOptions`: `entryTtlSeconds` removed (no consumers
  outside this repo).
- `TRIGGER_MOLLIFIER_ENTRY_TTL_S`: removed from env.server.ts and the
  example .env. The stale-sweep threshold now has its own explicit
  default (5min) instead of "half of TTL".
- `MollifierBuffer.getEntryTtlSeconds`: retained. It returns the
  Redis-side TTL, which is now -1 in steady state and ~30s after ack.
  Used by the ack-grace-TTL test.
- Existing tests updated: TTL-related cases inverted to assert no TTL;
  FAILED-state cases inverted to assert teardown;
  runId-reuse-after-fail now succeeds (slot is reclaimable).

Operational alert: Redis memory pressure if the drainer is offline.
That is the same failure mode as Redis OOM in any other context, and
existing infra-level alerts already cover it. The
`mollifier.stale_entries.current` gauge fires first, so ops should be
on it long before memory becomes a problem. See _ops/mollifier-ops.md.
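
As a rough illustration of what a stale-entry gauge could measure, the sketch below counts entries whose dwell time exceeds the threshold. The snapshot shape, the function name, and the idea that dwell is measured from a creation timestamp are assumptions; the actual sweep implementation is not shown in this commit message.

```typescript
// Hypothetical snapshot of a buffered entry; field names are illustrative only.
interface BufferedEntrySnapshot {
  runId: string;
  createdAtMs: number;
}

// 5 minutes, the explicit default named in the commit message.
const DEFAULT_STALE_THRESHOLD_MS = 5 * 60 * 1000;

// Counts entries that have dwelt strictly longer than the threshold.
function countStaleEntries(
  entries: readonly BufferedEntrySnapshot[],
  nowMs: number,
  thresholdMs: number = DEFAULT_STALE_THRESHOLD_MS
): number {
  return entries.filter((entry) => nowMs - entry.createdAtMs > thresholdMs).length;
}
```

A sweep would publish this count as the gauge value on every pass, so a non-zero reading persists for as long as the drainer leaves old entries behind.
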
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`MollifierBuffer`: remove the `entryTtlSeconds` constructor option and stop applying any TTL to buffer entry hashes or idempotency-lookup keys. Buffer entries now persist until the drainer ACKs (with a 30s post-materialise grace TTL) or FAILs them. The previous design auto-evicted entries after the TTL, which silently lost runs when the drainer was offline or falling behind — no PG row, no log, no customer signal. With the TTL gone, the drainer is the only mechanism that removes entries; operators alert on Redis memory pressure (separate, existing concern) and on the `mollifier.stale_entries.current` gauge (5min default threshold) instead. `fail` now also DELs the entry hash plus its idempotency lookup, because the SYSTEM_FAILURE PG row written by the drainer is the canonical record of the failure and the buffer entry is no longer load-bearing.
Drop `TRIGGER_MOLLIFIER_ENTRY_TTL_S` and the `entryTtlSeconds` option on `MollifierBuffer`. Buffer entries no longer auto-expire — the drainer is the only mechanism that removes them, which prevents silent run loss when the drainer is offline or falling behind. Default for `TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS` is now an explicit 5 minutes (used to be half of the old entry TTL); set it directly if you want a different alerting horizon. See `_ops/mollifier-ops.md` for the new recovery flow.
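
The new default can be pictured as follows. This is a minimal sketch, not the code in `env.server.ts`: the function name and the validation rules (positive integer, error on anything else) are assumptions, and the environment is passed in as a plain record so the example stays self-contained.

```typescript
// Explicit default from the changeset: 5 minutes, no longer derived from an entry TTL.
const STALE_SWEEP_THRESHOLD_DEFAULT_MS = 300_000;

// Hypothetical resolver; the real schema in env.server.ts may validate differently.
function resolveStaleSweepThresholdMs(env: Record<string, string | undefined>): number {
  const raw = env["TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS"];
  if (raw === undefined || raw.trim() === "") {
    return STALE_SWEEP_THRESHOLD_DEFAULT_MS;
  }
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(
      `TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS must be a positive integer, got "${raw}"`
    );
  }
  return parsed;
}
```

Setting the variable to `600000`, for example, would move the alerting horizon to ten minutes, while leaving it unset keeps the five-minute default.
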
|`TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS`|(unset)| Dwell threshold. Defaults to half of `entryTtlSeconds` when unset|
|`TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS`|`300_000`| Dwell threshold above which an entry is flagged stale (matches the sweep interval — "anything still here when we check")|