Skip to content

crl-release-25.2: vfs: fix MemFS cloneMu recursive RLock deadlock#5999

Merged
xxmplus merged 1 commit intocockroachdb:crl-release-25.2from
xxmplus:backport-5993-crl-release-25.2
Apr 24, 2026
Merged

crl-release-25.2: vfs: fix MemFS cloneMu recursive RLock deadlock#5999
xxmplus merged 1 commit intocockroachdb:crl-release-25.2from
xxmplus:backport-5993-crl-release-25.2

Conversation

@xxmplus
Copy link
Copy Markdown
Contributor

@xxmplus xxmplus commented Apr 24, 2026

Backport of #5993 to crl-release-25.2.

Summary

Cherry-pick of b47118f (master). Three MemFS methods deadlock against concurrent CrashClone due to recursive cloneMu.RLock(). See #5993 for full details.

Note: earlier investigation incorrectly concluded that 25.2 did not contain cloneMu. The commit a70d5b3c was cherry-picked to 25.2 under a different SHA, so git branch --contains missed it. The code is identical and the same deadlock applies.

Fixes #5913.

Three MemFS methods deadlock against concurrent CrashClone due to
recursive cloneMu.RLock() acquisition. Go's sync.RWMutex is
writer-preferring: once CrashClone is waiting for the write lock, any
new RLock call blocks behind it. When a method already holding RLock
calls another method that also tries RLock, neither can proceed.

The three affected methods and their recursive locking chains:

  - ReuseForWrite: takes RLock, then calls Rename which takes RLock
  - Link: takes RLock at entry, then takes RLock again inside inner
    walk callback (copy-paste of the locking boilerplate)
  - Lock: takes RLock, then calls Create which takes RLock

The bug was introduced in a70d5b3 ("vfs: redesign MemFS strict mode",
2024-08-26), which replaced the old SetIgnoreSyncs/ResetToSyncedState
approach with the CrashClone model. That commit systematically added
cloneMu.RLock() to every mutation method, but did not trace call graphs
to catch methods that compose other MemFS methods.

The deadlock has existed for ~20 months as a sporadic CI flake (cockroachdb#4677,
cockroachdb#5199, cockroachdb#5327). CrashClone fires only once per test iteration (when the
error injector's write counter hits zero), so the contention window is
extremely narrow on fast hardware. It surfaces reliably only on slow
platforms: s390x QEMU (~10x slower), Windows CI, linux-race. Each
occurrence was closed without a root cause until cockroachdb#5987 provided a full
goroutine dump showing the RWMutex.Lock / RWMutex.RLock deadlock.

Fix: extract lock-free internal methods (rename, create) that assume the
caller already holds cloneMu. The public Rename/Create methods delegate
to these after acquiring the lock. ReuseForWrite calls rename() and Lock
calls create() to avoid the recursive acquisition. For Link, the inner
RLock/RUnlock is simply removed since the outer lock at method entry
already covers the entire body. Add a comment on the cloneMu field
documenting the recursive-locking constraint to prevent recurrence.

Add TestMemFSCrashCloneConcurrency which exercises concurrent CrashClone
against ReuseForWrite, Link, and Lock. Without this fix, the test
deadlocks within seconds.

Fixes cockroachdb#5987.
Informs cockroachdb#5959, cockroachdb#5972.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@xxmplus xxmplus changed the title vfs: fix MemFS cloneMu recursive RLock deadlock crl-release-25.2: vfs: fix MemFS cloneMu recursive RLock deadlock Apr 24, 2026
@xxmplus xxmplus marked this pull request as ready for review April 24, 2026 16:50
@xxmplus xxmplus requested a review from a team as a code owner April 24, 2026 16:50
@xxmplus xxmplus requested a review from RaduBerinde April 24, 2026 16:50
@xxmplus xxmplus merged commit 98f59c2 into cockroachdb:crl-release-25.2 Apr 24, 2026
5 checks passed
@xxmplus xxmplus deleted the backport-5993-crl-release-25.2 branch April 24, 2026 17:18
@xxmplus
Copy link
Copy Markdown
Contributor Author

xxmplus commented Apr 24, 2026

TFTR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants