feat(redis-worker,webapp): drop mollifier entry TTL — drainer is the recovery mechanism

d-cs · claude · d-cs · commit 8dc878e96e1c · 2026-05-22T15:37:01.000+01:00
Buffer entries used to EXPIRE after entryTtlSeconds (600s dev / 1h
prod). Once that window elapsed without the drainer ack'ing, the
entry just vanished — no PG row, no log, no customer signal. The
stale-entry sweep was added in the previous commit so ops gets paged
on dwell-too-long; with that signal in place, the TTL itself is now
the cause of the failure mode it was meant to mitigate.

Remove it. Buffer entries persist until the drainer ACKs (with the
existing 30s post-materialise grace TTL) or FAILs them. Idempotency
lookup keys also lose their TTL — keeping them paired to the entry
hash prevents the dedup-drift bug where a TTL'd lookup would let the
same idempotency key spawn a second buffered run while the first
still existed. `failMollifierEntry` now DELs the entry hash + lookup
because the SYSTEM_FAILURE PG row written by the drainer is the
canonical record; the buffer entry is no longer load-bearing.
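
The lifecycle described above can be sketched with a small in-memory model. This is illustrative only and is not the real `MollifierBuffer`: the class name `BufferModel`, its method signatures, and the return shapes are assumptions made for the example. The `-1` / `-2` values mirror what Redis `TTL` returns for a key with no expiry and for a missing key. What the real code does to the idempotency lookup on ack is not shown in this commit, so the model leaves the lookup untouched there.

```typescript
// Hypothetical in-memory model of the buffer lifecycle; names are illustrative.
type Entry = {
  runId: string;
  idempotencyKey: string | undefined;
  expiresAtMs: number | null; // null = no TTL (steady state after this commit)
};

class BufferModel {
  static readonly ACK_GRACE_MS = 30_000; // 30s post-materialise grace TTL
  private entries = new Map<string, Entry>();
  private lookups = new Map<string, string>(); // idempotencyKey -> runId

  accept(runId: string, idempotencyKey?: string): { accepted: boolean; existingRunId?: string } {
    if (idempotencyKey !== undefined) {
      const existing = this.lookups.get(idempotencyKey);
      // The lookup lives exactly as long as the entry, so dedup cannot drift.
      if (existing !== undefined) return { accepted: false, existingRunId: existing };
    }
    if (this.entries.has(runId)) return { accepted: false, existingRunId: runId };
    this.entries.set(runId, { runId, idempotencyKey, expiresAtMs: null });
    if (idempotencyKey !== undefined) this.lookups.set(idempotencyKey, runId);
    return { accepted: true };
  }

  // ACK: the entry is kept for a short grace window, then allowed to expire.
  ack(runId: string, nowMs: number): void {
    const entry = this.entries.get(runId);
    if (entry) entry.expiresAtMs = nowMs + BufferModel.ACK_GRACE_MS;
  }

  // FAIL: tear down the entry and its lookup; the PG row is the record.
  fail(runId: string): void {
    const entry = this.entries.get(runId);
    if (!entry) return;
    this.entries.delete(runId);
    if (entry.idempotencyKey !== undefined) this.lookups.delete(entry.idempotencyKey);
  }

  // Mirrors Redis TTL semantics: -2 missing, -1 no expiry, else seconds left.
  ttlSeconds(runId: string, nowMs: number): number {
    const entry = this.entries.get(runId);
    if (!entry) return -2;
    if (entry.expiresAtMs === null) return -1;
    if (entry.expiresAtMs <= nowMs) return -2;
    return Math.ceil((entry.expiresAtMs - nowMs) / 1000);
  }
}
```

In this model a second `accept` with the same idempotency key is rejected for as long as the first entry exists, and a failed runId can be accepted again because `fail` frees the slot.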

Knock-on changes:
- `MollifierBufferOptions`: `entryTtlSeconds` removed (no consumers
  outside this repo).
- `TRIGGER_MOLLIFIER_ENTRY_TTL_S`: removed from env.server.ts and the
  example .env. The stale-sweep threshold now has its own explicit
  default (5min) instead of "half of TTL".
- `MollifierBuffer.getEntryTtlSeconds`: retained — it returns the
  Redis-side TTL, which is now -1 in steady state and ~30s after ack.
  Used by the ack-grace-TTL test.
- Existing tests updated: TTL-related cases inverted to assert no TTL;
  FAILED-state cases inverted to assert teardown; runId-reuse-after-
  fail now succeeds (slot is reclaimable).
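
To make the stale-sweep threshold change concrete, here is a rough before/after comparison. The function names are hypothetical. `oldStaleThresholdMs` follows the expression removed from `mollifierStaleSweepWorker.server.ts`; `newStaleThresholdMs` is a hand-rolled approximation of the Zod chain in `env.server.ts` (coerce to number, integer, positive, default of 5 minutes), not the Zod implementation itself.

```typescript
const DEFAULT_STALE_THRESHOLD_MS = 5 * 60_000;

// Before: an unset threshold fell back to half of the entry TTL.
function oldStaleThresholdMs(explicitMs: number | undefined, entryTtlSeconds: number): number {
  return explicitMs ?? Math.floor(entryTtlSeconds * 1000 * 0.5);
}

// After: the threshold has its own default and no longer depends on any TTL.
// Approximates z.coerce.number().int().positive().default(5 * 60_000).
function newStaleThresholdMs(raw: string | undefined): number {
  if (raw === undefined) return DEFAULT_STALE_THRESHOLD_MS;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new RangeError(`stale threshold must be a positive integer, got "${raw}"`);
  }
  return n;
}
```

With the 600s TTL default the old fallback works out to 300000 ms, the same as the new default. With a 1h TTL, as the message describes for prod, the old fallback would have been 1800000 ms (30 minutes), so the new default alerts earlier there unless the variable is set explicitly.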

Operational risk: Redis memory pressure if the drainer is offline.
That's the same failure mode as Redis OOM in any other context, and it
is covered by existing infra-level alerts. The `mollifier.stale_entries.current`
gauge fires first; ops should be on it long before memory becomes a
problem. See _ops/mollifier-ops.md.
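
As a rough illustration of what feeds that gauge, the sketch below counts stale entries per environment. The `BufferedEntry` shape and the `countStaleByEnv` name are assumptions for the example, not the sweep's real code. Two details are taken from the test comments in this diff: the comparison is strict (`dwellMs > threshold`), and an environment with only fresh entries is reported as 0 so the gauge can drop instead of staying latched.

```typescript
// Hypothetical entry shape for illustration.
type BufferedEntry = { envId: string; createdAtMs: number };

function countStaleByEnv(
  entries: BufferedEntry[],
  nowMs: number,
  staleThresholdMs: number
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const entry of entries) {
    const dwellMs = nowMs - entry.createdAtMs;
    const previous = counts.get(entry.envId) ?? 0;
    // Strictly greater than: an entry sitting exactly at the threshold is not stale.
    counts.set(entry.envId, previous + (dwellMs > staleThresholdMs ? 1 : 0));
  }
  return counts;
}
```

Every environment that has at least one buffered entry gets a value, including 0, which is what lets a downstream gauge fall back once the drainer catches up.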

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.changeset/mollifier-drop-entry-ttl.md b/.changeset/mollifier-drop-entry-ttl.md
@@ -0,0 +1,5 @@
+---
+"@trigger.dev/redis-worker": minor
+---
+
+`MollifierBuffer`: remove the `entryTtlSeconds` constructor option and stop applying any TTL to buffer entry hashes or idempotency-lookup keys. Buffer entries now persist until the drainer ACKs (with a 30s post-materialise grace TTL) or FAILs them. The previous design auto-evicted entries after the TTL, which silently lost runs when the drainer was offline or falling behind — no PG row, no log, no customer signal. With the TTL gone, the drainer is the only mechanism that removes entries; operators alert on Redis memory pressure (separate, existing concern) and on the `mollifier.stale_entries.current` gauge (5min default threshold) instead. `fail` now also DELs the entry hash plus its idempotency lookup, because the SYSTEM_FAILURE PG row written by the drainer is the canonical record of the failure and the buffer entry is no longer load-bearing.
diff --git a/.server-changes/mollifier-drop-entry-ttl.md b/.server-changes/mollifier-drop-entry-ttl.md
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: improvement
+---
+
+Drop `TRIGGER_MOLLIFIER_ENTRY_TTL_S` and the `entryTtlSeconds` option on `MollifierBuffer`. Buffer entries no longer auto-expire — the drainer is the only mechanism that removes them, which prevents silent run loss when the drainer is offline or falling behind. Default for `TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS` is now an explicit 5 minutes (used to be half of the old entry TTL); set it directly if you want a different alerting horizon. See `_ops/mollifier-ops.md` for the new recovery flow.
diff --git a/_ops/mollifier-ops.md b/_ops/mollifier-ops.md
@@ -98,7 +98,7 @@ Defaults are tuned for production; tune below for incident response.
 | `TRIGGER_MOLLIFIER_DRAIN_MAX_ATTEMPTS` | `3` | Retries before terminal failure → `SYSTEM_FAILURE` PG row |
 | `TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED` | inherits | Run the alerting sweep |
 | `TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS` | `300_000` | Sweep cadence |
-| `TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS` | (unset) | Dwell threshold. Defaults to half of `entryTtlSeconds` when unset |
+| `TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS` | `300_000` | Dwell threshold above which an entry is flagged stale (matches the sweep interval — "anything still here when we check") |
 
 ## Failure modes & recovery
 
diff --git a/apps/webapp/app/env.server.ts b/apps/webapp/app/env.server.ts
@@ -1093,7 +1093,6 @@ const EnvironmentSchema = z
     TRIGGER_MOLLIFIER_TRIP_THRESHOLD: z.coerce.number().int().nonnegative().default(100),
     TRIGGER_MOLLIFIER_HOLD_MS: z.coerce.number().int().positive().default(500),
     TRIGGER_MOLLIFIER_DRAIN_CONCURRENCY: z.coerce.number().int().positive().default(50),
-    TRIGGER_MOLLIFIER_ENTRY_TTL_S: z.coerce.number().int().positive().default(600),
     TRIGGER_MOLLIFIER_DRAIN_MAX_ATTEMPTS: z.coerce.number().int().positive().default(3),
     TRIGGER_MOLLIFIER_DRAIN_SHUTDOWN_TIMEOUT_MS: z.coerce.number().int().positive().default(30_000),
     TRIGGER_MOLLIFIER_DRAIN_MAX_ORGS_PER_TICK: z.coerce.number().int().positive().default(500),
@@ -1102,7 +1101,9 @@ const EnvironmentSchema = z
     // dwell exceeds the stale threshold. Independent of the drainer —
     // its job is exactly to make a stuck/offline drainer visible to
     // ops. Defaults: enabled when the mollifier is enabled, run every
-    // 5 minutes, flag entries with dwell > half of entryTtlSeconds.
+    // 5 minutes, alert on anything that's been dwelling for 5+ minutes
+    // (matches the sweep interval — "anything still here when we
+    // check" is the simplest threshold that converges).
     TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED: z
       .string()
       .default(process.env.TRIGGER_MOLLIFIER_ENABLED ?? "0"),
@@ -1115,7 +1116,7 @@ const EnvironmentSchema = z
       .number()
       .int()
       .positive()
-      .optional(),
+      .default(5 * 60_000),
 
     BATCH_TRIGGER_PROCESS_JOB_VISIBILITY_TIMEOUT_MS: z.coerce
       .number()
diff --git a/apps/webapp/app/v3/mollifier/mollifierBuffer.server.ts b/apps/webapp/app/v3/mollifier/mollifierBuffer.server.ts
@@ -22,7 +22,6 @@ function initializeMollifierBuffer(): MollifierBuffer {
       enableAutoPipelining: true,
       ...(env.TRIGGER_MOLLIFIER_REDIS_TLS_DISABLED === "true" ? {} : { tls: {} }),
     },
-    entryTtlSeconds: env.TRIGGER_MOLLIFIER_ENTRY_TTL_S,
   });
 }
 
diff --git a/apps/webapp/app/v3/mollifierStaleSweepWorker.server.ts b/apps/webapp/app/v3/mollifierStaleSweepWorker.server.ts
@@ -30,21 +30,14 @@ export function initMollifierStaleSweepWorker(): void {
   if (env.TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED !== "1") return;
   if (global.__mollifierStaleSweepRegistered__) return;
 
-  // Default the threshold to half of `entryTtlSeconds`, mirroring the
-  // plan doc's cadence. Operators wanting an earlier or later signal
-  // can set it explicitly.
-  const staleThresholdMs =
-    env.TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS ??
-    Math.floor(env.TRIGGER_MOLLIFIER_ENTRY_TTL_S * 1000 * 0.5);
-
   logger.debug("Initializing mollifier stale-entry sweep", {
     intervalMs: env.TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS,
-    staleThresholdMs,
+    staleThresholdMs: env.TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS,
   });
 
   const handle = startStaleSweepInterval({
     intervalMs: env.TRIGGER_MOLLIFIER_STALE_SWEEP_INTERVAL_MS,
-    staleThresholdMs,
+    staleThresholdMs: env.TRIGGER_MOLLIFIER_STALE_SWEEP_THRESHOLD_MS,
   });
 
   signalsEmitter.on("SIGTERM", handle.stop);
diff --git a/apps/webapp/test/mollifierRealtimeRunResourceBuffer.test.ts b/apps/webapp/test/mollifierRealtimeRunResourceBuffer.test.ts
@@ -34,7 +34,7 @@ describe("realtime buffered-subscription resource resolution (testcontainers)",
   redisTest(
     "synthesises a resource whose `id` matches RunId.fromFriendlyId",
     async ({ redisOptions }) => {
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         await buffer.accept({
           runId: SNAPSHOT_BASE.friendlyId,
@@ -78,7 +78,7 @@ describe("realtime buffered-subscription resource resolution (testcontainers)",
   redisTest(
     "returns null when neither PG nor the buffer have the entry",
     async ({ redisOptions }) => {
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         const bufferedSynthetic = await findRunByIdWithMollifierFallback(
           {
@@ -109,7 +109,7 @@ describe("realtime buffered-subscription resource resolution (testcontainers)",
   redisTest(
     "does not fall back to buffer when PG has the row",
     async ({ redisOptions }) => {
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         await buffer.accept({
           runId: SNAPSHOT_BASE.friendlyId,
diff --git a/apps/webapp/test/mollifierStaleSweep.test.ts b/apps/webapp/test/mollifierStaleSweep.test.ts
@@ -69,7 +69,7 @@ describe("runStaleSweepOnce — testcontainers", () => {
   redisTest(
     "flags entries whose dwell exceeds the stale threshold and skips fresh ones",
     async ({ redisOptions }) => {
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         // Two stale entries (one in each env) + one fresh entry. Sweep
         // should flag the two stale, leave the fresh one alone, record
@@ -143,7 +143,7 @@ describe("runStaleSweepOnce — testcontainers", () => {
       // stale, alert fired, drainer caught up. The next sweep must
       // report `env_a -> 0` so the gauge drops below the alert
       // threshold instead of staying latched at the last stale value.
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         await buffer.accept({
           runId: "run_just_arrived",
@@ -171,7 +171,7 @@ describe("runStaleSweepOnce — testcontainers", () => {
       // `dwellMs > threshold` to `dwellMs >= threshold` would flag every
       // entry the first time the sweep runs after a perfectly synchronised
       // accept call — the dashboard would page on every burst.
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         await buffer.accept({
           runId: "run_fresh_only",
@@ -200,7 +200,7 @@ describe("runStaleSweepOnce — testcontainers", () => {
       // must walk every org/env, not just the first one it finds. If a
       // future refactor collapsed listOrgs/listEnvsForOrg into a single
       // env-flat list this test catches a regression there.
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         await buffer.accept({
           runId: "run_x",
diff --git a/apps/webapp/test/mollifierSyntheticRedirectInfo.test.ts b/apps/webapp/test/mollifierSyntheticRedirectInfo.test.ts
@@ -23,7 +23,7 @@ function fakePrisma(member: { id: string } | null) {
 
 describe("findBufferedRunRedirectInfo (testcontainers)", () => {
   redisTest("returns slugs + spanId for a real buffer entry when user is a member", async ({ redisOptions }) => {
-    const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+    const buffer = new MollifierBuffer({ redisOptions });
     try {
       await buffer.accept({
         runId: "run_real_1",
@@ -47,7 +47,7 @@ describe("findBufferedRunRedirectInfo (testcontainers)", () => {
   });
 
   redisTest("returns null when no buffer entry exists for the runId", async ({ redisOptions }) => {
-    const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+    const buffer = new MollifierBuffer({ redisOptions });
     try {
       const info = await findBufferedRunRedirectInfo(
         { runFriendlyId: "run_missing", userId: "user_1" },
@@ -60,7 +60,7 @@ describe("findBufferedRunRedirectInfo (testcontainers)", () => {
   });
 
   redisTest("returns null when the user is not an org member (default check enforced)", async ({ redisOptions }) => {
-    const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+    const buffer = new MollifierBuffer({ redisOptions });
     try {
       await buffer.accept({
         runId: "run_real_2",
@@ -79,7 +79,7 @@ describe("findBufferedRunRedirectInfo (testcontainers)", () => {
   });
 
   redisTest("skips the org-membership check when skipOrgMembershipCheck is set (admin path)", async ({ redisOptions }) => {
-    const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+    const buffer = new MollifierBuffer({ redisOptions });
     try {
       await buffer.accept({
         runId: "run_real_3",
@@ -103,7 +103,7 @@ describe("findBufferedRunRedirectInfo (testcontainers)", () => {
   });
 
   redisTest("returns null when snapshot is malformed JSON", async ({ redisOptions }) => {
-    const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+    const buffer = new MollifierBuffer({ redisOptions });
     try {
       await buffer.accept({
         runId: "run_real_4",
@@ -122,7 +122,7 @@ describe("findBufferedRunRedirectInfo (testcontainers)", () => {
   });
 
   redisTest("returns null when snapshot lacks org/project slugs", async ({ redisOptions }) => {
-    const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+    const buffer = new MollifierBuffer({ redisOptions });
     try {
       await buffer.accept({
         runId: "run_real_5",
@@ -141,7 +141,7 @@ describe("findBufferedRunRedirectInfo (testcontainers)", () => {
   });
 
   redisTest("returns info with undefined spanId when snapshot has no spanId", async ({ redisOptions }) => {
-    const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 60 });
+    const buffer = new MollifierBuffer({ redisOptions });
     try {
       await buffer.accept({
         runId: "run_real_6",
diff --git a/apps/webapp/test/mollifierTripEvaluator.test.ts b/apps/webapp/test/mollifierTripEvaluator.test.ts
@@ -14,7 +14,7 @@ describe("createRealTripEvaluator", () => {
   redisTest(
     "returns divert=false when the sliding window stays under threshold",
     async ({ redisOptions }) => {
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 600 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         const evaluator = createRealTripEvaluator({
           getBuffer: () => buffer,
@@ -32,7 +32,7 @@ describe("createRealTripEvaluator", () => {
   redisTest(
     "returns divert=true with reason per_env_rate once the window trips",
     async ({ redisOptions }) => {
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 600 });
+      const buffer = new MollifierBuffer({ redisOptions });
       try {
         // threshold=2 → the 3rd call within windowMs is the first that trips.
         const options = { windowMs: 5000, threshold: 2, holdMs: 5000 } as const;
@@ -73,7 +73,7 @@ describe("createRealTripEvaluator", () => {
   redisTest(
     "returns divert=false when buffer throws (fail-open)",
     async ({ redisOptions }) => {
-      const buffer = new MollifierBuffer({ redisOptions, entryTtlSeconds: 600 });
+      const buffer = new MollifierBuffer({ redisOptions });
       // Closing the client up front means evaluateTrip will throw on the first
       // Redis command — a real failure mode, not a stub.
       await buffer.close();
diff --git a/packages/redis-worker/src/mollifier/buffer.test.ts b/packages/redis-worker/src/mollifier/buffer.test.ts
diff --git a/packages/redis-worker/src/mollifier/buffer.ts b/packages/redis-worker/src/mollifier/buffer.ts
diff --git a/packages/redis-worker/src/mollifier/drainer.test.ts b/packages/redis-worker/src/mollifier/drainer.test.ts
