crl-release-25.2: vfs: fix MemFS cloneMu recursive RLock deadlock#5999
Merged
xxmplus merged 1 commit intocockroachdb:crl-release-25.2from Apr 24, 2026
Merged
Conversation
Three MemFS methods deadlock against concurrent CrashClone due to
recursive cloneMu.RLock() acquisition. Go's sync.RWMutex is
writer-preferring: once CrashClone is waiting for the write lock, any
new RLock call blocks behind it. When a method already holding RLock
calls another method that also tries RLock, neither can proceed.
The three affected methods and their recursive locking chains:
- ReuseForWrite: takes RLock, then calls Rename which takes RLock
- Link: takes RLock at entry, then takes RLock again inside inner
walk callback (copy-paste of the locking boilerplate)
- Lock: takes RLock, then calls Create which takes RLock
The bug was introduced in a70d5b3 ("vfs: redesign MemFS strict mode",
2024-08-26), which replaced the old SetIgnoreSyncs/ResetToSyncedState
approach with the CrashClone model. That commit systematically added
cloneMu.RLock() to every mutation method, but did not trace call graphs
to catch methods that compose other MemFS methods.
The deadlock has existed for ~20 months as a sporadic CI flake (cockroachdb#4677,
cockroachdb#5199, cockroachdb#5327). CrashClone fires only once per test iteration (when the
error injector's write counter hits zero), so the contention window is
extremely narrow on fast hardware. It surfaces reliably only on slow
platforms: s390x QEMU (~10x slower), Windows CI, linux-race. Each
occurrence was closed without a root cause until cockroachdb#5987 provided a full
goroutine dump showing the RWMutex.Lock / RWMutex.RLock deadlock.
Fix: extract lock-free internal methods (rename, create) that assume the
caller already holds cloneMu. The public Rename/Create methods delegate
to these after acquiring the lock. ReuseForWrite calls rename() and Lock
calls create() to avoid the recursive acquisition. For Link, the inner
RLock/RUnlock is simply removed since the outer lock at method entry
already covers the entire body. Add a comment on the cloneMu field
documenting the recursive-locking constraint to prevent recurrence.
Add TestMemFSCrashCloneConcurrency which exercises concurrent CrashClone
against ReuseForWrite, Link, and Lock. Without this fix, the test
deadlocks within seconds.
Fixes cockroachdb#5987.
Informs cockroachdb#5959, cockroachdb#5972.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Member
RaduBerinde
approved these changes
Apr 24, 2026
Contributor
Author
|
TFTR! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Backport of #5993 to crl-release-25.2.
Summary
Cherry-pick of b47118f (master). Three MemFS methods deadlock against concurrent CrashClone due to recursive
cloneMu.RLock(). See #5993 for full details.Note: earlier investigation incorrectly concluded that 25.2 did not contain
cloneMu. The commita70d5b3cwas cherry-picked to 25.2 under a different SHA, sogit branch --containsmissed it. The code is identical and the same deadlock applies.Fixes #5913.