[core] Optimistic concurrency control for event writes against stale logs#2113
Draft
VaguelySerious wants to merge 4 commits into
Draft
[core] Optimistic concurrency control for event writes against stale logs#2113VaguelySerious wants to merge 4 commits into
VaguelySerious wants to merge 4 commits into
Conversation
The elapsed-wait scan now snapshots the loaded events' tail eventId and passes it as `lastKnownEventId` on each `wait_completed` write, so a concurrent `resumeHook` that has already advanced the canonical log is detected — the server's CAS rejects the write, we surface it as the existing `EntityConflictError`, and the next iteration re-replays against the fresh event list (mirroring the duplicate-wait fall-through that was already there). `resumeHook` sends `asOfTimestamp` (Date.now() at call time) so the server resolves the fence to the highest eventId strictly before resume time — no client-side event pre-read needed. Plumbed through `CreateEventParams` on `@workflow/world` so future worlds can forward as-is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: 1fca755 The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Contributor
Contributor
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (9 failed)hono (6 failed):
vite (3 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
Contributor
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
workflow with 1 step💻 Local Development
▲ Production (Vercel)
workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
❌ Some benchmark jobs failed:
Check the workflow run for details. |
The earlier revision filtered duplicate-wait conflicts by a
workflow-server-specific error message ("Workflow wait ..."), which
meant world-local's "Wait \"...\" already completed" (and any other
world's duplicate-wait error shape) fell through and bubbled the
EntityConflictError out of the elapsed-wait scan. abortHookOrdering
e2e suites started failing as a result.
Invert the filter: only the fence-conflict message (a workflow-server-
only error) drives the retry path. Anything else is the pre-OCC
"skip and continue" shape — matches the original behavior across all
world implementations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds optimistic-concurrency fencing to the event writes that go through workflow-server, closing the hook/sleep race that produces
CORRUPTED_EVENT_LOGon production runs.eventIdand passes it aslastKnownEventIdon eachwait_completedwrite. If a concurrentresumeHookhas already advanced the canonical log, the server's CAS rejects the write.EntityConflictError, the runtime now retries in-place rather than throwing the whole tick away: it reloads events from the cursor, refreshes the fence, and tries again (up to 5x with backoff). Falling back to queue redelivery turned out to thunder-herd — every redelivery spawns another concurrent tick, which fences-conflicts again, and workflows stall inrunning. If the wait was completed by a concurrent writer between attempts, we observe it in the reloaded log and skip the write entirely.resumeHookappendshook_receivedunconditionally. ULID ordering already places this write after anything committed before us, and applying CAS would only ever reject the hook in favor of an unrelated concurrent write (which would lose the user's signal). Stale-snapshot protection lives on the tick writes that consume hooks, not on the write that delivers them.CreateEventParamson@workflow/worldgrowslastKnownEventIdandasOfTimestamp(both optional). Worlds that don't implement OCC can pass them through or ignore them.Pairs with the workflow-server PR which materializes
run.lastKnownEventIdand gates event writes on it. The server's CAS is explicit opt-in — unfenced writers (most paths) still atomically advance the materialized value so fenced writers can chain off it, but they don't reject on contention.Test plan
CORRUPTED_EVENT_LOG, zerorun_failed, zero fence conflicts surfaced past the retry loop🤖 Generated with Claude Code