Commit 4e7d5d8

fix(webapp): bump applyMetadataMutation retry count + add jittered backoff
Default maxRetries was 3, matching the PG-side UpdateMetadataService. That's fine when the only writer is the executing task itself, but under high external-API concurrency on a single buffered run it exhausts fast: the Phase F challenge suite saw 50-way concurrent metadata.increment landing only 21/50 deltas with the default.

Bumps the default to 12 (covers ~50-way concurrency with sub-percent failure) and adds small jittered backoff between retries so a thundering herd of N retriers doesn't all re-read + re-CAS in lockstep. Each retry is one Redis Lua call (~1ms), so the worst-case budget is bounded.

Verified via challenge script 09: 50 concurrent increments now land all 50 deltas, counter ends at exactly 50.
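The "bounded" claim can be checked with a little arithmetic. The sketch below assumes the backoff expression from the diff (`Math.floor(Math.random() * (5 + attempt * 5))`) and a loop that runs `attempt = 0..maxRetries` with a sleep after every conflict. The helper names `maxBackoffMs` and `worstCaseSleepMs` are hypothetical and not part of the commit.

```typescript
// Largest value Math.floor(Math.random() * (5 + attempt * 5)) can return.
// Math.random() is in [0, 1), so the floored result is at most ceiling - 1.
function maxBackoffMs(attempt: number): number {
  const ceiling = 5 + attempt * 5;
  return ceiling - 1;
}

// Upper bound on cumulative sleep if every attempt (0..maxRetries inclusive)
// hits a version conflict and draws the largest possible backoff.
function worstCaseSleepMs(maxRetries: number): number {
  let total = 0;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    total += maxBackoffMs(attempt);
  }
  return total;
}

console.log(worstCaseSleepMs(3));  // old default
console.log(worstCaseSleepMs(12)); // new default
```

Under these assumptions the new default caps cumulative backoff at 442 ms, and about half of that on average if every attempt conflicts, in addition to the per-retry Redis call.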
1 parent b490afe commit 4e7d5d8

1 file changed

Lines changed: 12 additions & 2 deletions

apps/webapp/app/v3/mollifier/applyMetadataMutation.server.ts

@@ -26,7 +26,14 @@ export async function applyMetadataMutationToBufferedRun(input: {
   const buffer = input.buffer ?? getMollifierBuffer();
   if (!buffer) return { kind: "not_found" };
 
-  const maxRetries = input.maxRetries ?? 3;
+  // Default retry budget tuned for buffered-window concurrency. The
+  // PG-side `UpdateMetadataService` uses 3, which is fine when the only
+  // writer is the executing task itself. For a buffered run the writers
+  // are external API callers, and N parallel writers exhaust 3 retries
+  // quickly under contention. Bumping to 12 covers ~50-way concurrency
+  // with sub-percent failure probability; the cost is bounded (each
+  // retry is one Redis Lua call ~1ms).
+  const maxRetries = input.maxRetries ?? 12;
   for (let attempt = 0; attempt <= maxRetries; attempt++) {
     const entry = await buffer.getEntry(input.runId);
     if (!entry) return { kind: "not_found" };
@@ -73,13 +80,16 @@ export async function applyMetadataMutationToBufferedRun(input: {
     if (cas.kind === "not_found") return { kind: "not_found" };
     if (cas.kind === "busy") return { kind: "busy" };
     // version_conflict — another caller wrote between our read + CAS.
-    // Loop to re-read and retry.
+    // Small jittered backoff so a thundering herd of N retriers doesn't
+    // all re-read + re-CAS at exactly the same moment.
     logger.debug("applyMetadataMutationToBufferedRun: version_conflict, retrying", {
       runId: input.runId,
       attempt,
       observedVersion: entry.metadataVersion,
       currentVersion: cas.currentVersion,
     });
+    const backoffMs = Math.floor(Math.random() * (5 + attempt * 5));
+    await new Promise((resolve) => setTimeout(resolve, backoffMs));
   }
 
   logger.warn("applyMetadataMutationToBufferedRun: retries exhausted", {
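To illustrate why lockstep retries waste the retry budget, here is a minimal, idealized model of the read + CAS loop. It is a sketch, not the code in this commit: `VersionedCounter`, `compareAndIncrement`, and `landedInLockstep` are hypothetical names, the store is an in-memory object rather than Redis, and the model assumes the worst case where every pending writer reads the same version before any of them attempts the CAS.

```typescript
// Hypothetical in-memory stand-in for a versioned buffer entry.
interface VersionedCounter {
  value: number;
  version: number;
}

// Compare-and-set: apply the delta only if the caller's observed version
// still matches the stored version; otherwise report a conflict.
function compareAndIncrement(
  cell: VersionedCounter,
  observedVersion: number,
  delta: number
): boolean {
  if (cell.version !== observedVersion) return false;
  cell.value += delta;
  cell.version += 1;
  return true;
}

// Simulates `writers` callers retrying in perfect lockstep: in each round
// they all read the same version, then all attempt the CAS. Returns how many
// increments landed after attempts 0..maxRetries.
function landedInLockstep(writers: number, maxRetries: number): number {
  const cell: VersionedCounter = { value: 0, version: 0 };
  let pending = writers;
  for (let attempt = 0; attempt <= maxRetries && pending > 0; attempt++) {
    const observed = cell.version; // every pending writer sees this version
    let winners = 0;
    for (let w = 0; w < pending; w++) {
      if (compareAndIncrement(cell, observed, 1)) winners++;
    }
    pending -= winners; // only the first CAS of the round can succeed
  }
  return cell.value;
}

console.log(landedInLockstep(50, 3));
console.log(landedInLockstep(50, 12));
```

In this model only one writer wins per round, so a larger retry budget alone lands at most `maxRetries + 1` increments. Real callers interleave more loosely than this (the commit message reports 21/50 with the old default), but the model shows why the jitter, which spreads retriers across time, is added alongside the higher budget.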
