Skip to content

RTC: fix data corruption on concurrent list move#78274

Open
danluu wants to merge 1 commit into
WordPress:trunkfrom
danluu:try/pr77924-step1-plan-20260513
Open

RTC: fix data corruption on concurrent list move#78274
danluu wants to merge 1 commit into
WordPress:trunkfrom
danluu:try/pr77924-step1-plan-20260513

Conversation

@danluu
Copy link
Copy Markdown
Contributor

@danluu danluu commented May 14, 2026

This is part of an AI fuzzing project, where an AI wrote a fuzzer and then triages bugs from the fuzzer and creates fixes. See #77716 for the tracking issue. As of this writing, there have been no known false positives from this project, but there have been some issues, which are documented in #77716. I expect we’ll see false positives at some point (and may even have one that’s been filed in a PR that hasn’t been inspected by a code owner yet).

What?

Per @alecgeatches's request, this is part 1 of N of splitting #77924 into multiple PRs. This fixes the issue shown in the 3rd video there. I'm opening this single PR rather than doing all N parts to get feedback one to see if other PRs from this series should be modified before submitting them.

AI TEXT

The failure is not from one isolated bad commit. It comes from a few RTC design
choices interacting:

  • #68483 introduced the
    reliable sync protocol foundation. The block merge strategy remained largely
    snapshot-oriented: local block trees are converted into Yjs structures, and
    later snapshots are merged into the current Yjs document.
  • #75699 removed plugin
    checks around collaborative editing and made this path more broadly active.
  • #75029 deferred
    SyncManager.update() through yieldToEventLoop(). That improved scheduling,
    but it also opened a race: a local snapshot can be captured before a remote
    update arrives, then applied after that remote update has already modified the
    Yjs document.
  • #75975 fixed one save
    manifestation of this by waiting for deferred Y.Doc writes before serializing
    _crdt_document, confirming that deferred local writes are observable at save
    boundaries.
  • #77966 moved sync
    observers after persisted CRDT hydration, fixing a redundant hydration edit.
    That did not address stale local snapshots that are already queued before
    remote reconciliation finishes.
  • #76913 added
    schema-aware table/query merging. That is relevant to the full PR 77924
    branch, but is not part of this first split.

Bug Shape

The list-item stress failure is:

npm run test:e2e:rtc-websocket -- \
  --grep "two users concurrently move list items"

Two users move different list items at about the same time. The old merge path
could only see the incoming full block snapshot, not the record the local user
edited from. When a queued local snapshot was stale, the merge could not tell:

  • "this field is unchanged relative to my local base, preserve the remote
    update", from
  • "this field is a deliberate local write, apply it over the remote update".

For structural block changes, the positional merge also treated moves as
delete/insert-like rewrites. That can replace Yjs block objects instead of moving
the existing objects. Once that happens, rich text and nested block state may
attach to the wrong list item, duplicate an item, or smear text across siblings.

The visible e2e symptom was a persisted/reloaded list such as:

Item Alpha
Item Gamma
Item Gamma
Item EpsiEpsilonon
Item DeDeltata
Item Zeta

The live editors could converge before save, while a reload immediately after
save exposed stale provider/store reconciliation. The test now waits for the
semantic post-save state on both editors before reloading; without that wait,
the test can race its own save boundary instead of testing the merge semantics.

Fix Plan Used In The Split

This split keeps the fix to the smallest code surface that made the list-move
case reliable.

  1. Pass a pre-edit baseRecord from editEntityRecord() into
    SyncManager.update().

    This gives CRDT block merging a three-way view: base record, current Yjs
    document, and incoming local edits.

  2. Thread baseRecord.blocks into mergeCrdtBlocks().

    The block merger can now skip fields that are unchanged from the local base
    instead of overwriting remote changes with stale local values.

  3. Rebase block order by identity before falling back to positional merging.

    For same-size block arrays with stable clientIds, the merger moves the
    current Yjs block objects into the incoming order. This preserves nested
    Yjs state and avoids rewriting adjacent blocks. Where clientIds are not
    enough, the code can use unique semantic block keys.

  4. Track remote key versions in SyncManager.

    Remote Yjs updates mark top-level record keys as reconciling. A deferred
    local update that was scheduled before a remote change can then filter stale
    non-block keys instead of replaying them. Block updates are still passed
    through to the base-aware block merger.

  5. Strengthen the list-move e2e assertion.

    The old test only checked relative ordering:

    • Beta after Gamma.
    • Epsilon before Delta.

    The split asserts the exact semantic order:

    Item Alpha
    Item Gamma
    Item Beta
    Item Epsilon
    Item Delta
    Item Zeta
    

    It also checks that exact order after save and before reload, so the reload
    phase does not race a pending sync/save boundary.

What Stayed Out

The first split does not attempt to fix every RTC stress failure that remains
when PR 77924's behavioral changes are layered onto current trunk. In local
validation, the other existing failure,
three users concurrently edit a large post with diverse blocks, still failed
after this reduced branch. Its failures were broader: same-paragraph convergence,
block toolbar movement, and final paragraph propagation. Those require more of
the full PR 77924 surface and should be reviewed in later splits.

The table/query-array merge work and WebSocket/provider reload behavior fixes
should also remain separate review slices. The WebSocket test harness itself is
not a later slice; it comes from PR 78179 and must be preserved by every split
unless a future split explicitly replaces that harness.

END AI TEXT

@danluu danluu requested a review from nerrad as a code owner May 14, 2026 07:51
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

Warning: Type of PR label mismatch

To merge this PR, it requires exactly 1 label indicating the type of PR. Other labels are optional and not being checked here.

  • Required label: Any label starting with [Type].
  • Labels found: [Package] Core data, [Feature] Real-time Collaboration, [Package] Sync.

Read more about Type labels in Gutenberg. Don't worry if you don't have the required permissions to add labels; the PR reviewer should be able to help with the task.

@github-actions
Copy link
Copy Markdown

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: danluu <danluu@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@danluu danluu changed the title Split RTC list move convergence fix RTC: fix data corruption on concurrent list move May 14, 2026
@alecgeatches alecgeatches added the [Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration label May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

[Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration [Package] Core data /packages/core-data [Package] Sync

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants