wal: fix flaky TestConcurrentWritersWithManyRecords#6000
Open
xxmplus wants to merge 1 commit intocockroachdb:masterfrom
Open
wal: fix flaky TestConcurrentWritersWithManyRecords#6000xxmplus wants to merge 1 commit intocockroachdb:masterfrom
xxmplus wants to merge 1 commit intocockroachdb:masterfrom
Conversation
newFailoverWriter internally calls switchToNewDir, which async-creates writer 0 and sends a signal to the logWriterCreated channel. The test never drained this initial signal, so each <-logWriterCreated in the loop drained the wrong writer's signal (off-by-one). This caused writer creation goroutines to run after nextWriterIndex had advanced past them, triggering the closeWriter path (writerIndex+1 != nextWriterIndex) and skipping snapshotAndSwitchWriter entirely. When combined with a writer whose flusher could make progress (advancing the queue tail via doneSyncCallback), a later writer's snapshot would start from a non-zero tail, producing the observed interval.first > 0 failures. Fix by draining the initial writer creation signal before the record pushing loop. Also add stopper.stop() before file verification to ensure all async LogWriter closes complete, and add diagnostic logging for each file's interval. Fixes cockroachdb#5995. Informs cockroachdb#4754. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Member
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
logWriterCreatedchannel synchronization that causedTestConcurrentWritersWithManyRecordsto flake on Windows nightlies.stopper.stop()before file verification to ensure all asyncLogWritercloses complete.t.Logffor each file's interval so future flakes produce actionable data.Fixes #5995. Informs #4754.
Root cause
newFailoverWriterinternally callsswitchToNewDir, which async-creates writer 0 and sends a signal tologWriterCreated(buffered channel, cap 100). The test never drained this initial signal. Each<-logWriterCreatedin the loop therefore drained the wrong writer's signal (off-by-one):This caused writer creation goroutines to run after
nextWriterIndexhad advanced past them, triggering thecloseWriterpath (writerIndex+1 != nextWriterIndex) and skippingsnapshotAndSwitchWriter. When combined with a writer whose flusher could advance the queue tail viadoneSyncCallback, a later writer's snapshot started from a non-zero tail, producinginterval.first > 0.Deterministic reproduction
The following test (not included in this PR) deterministically reproduces the bug by:
blockingWrite) so its flusher pops entriesblockingCreate) to force it to be skippedlogWriterCreatedsignal (the bug)Reproduction test code
Output (without fix, demonstrating the bug):
Output (with fix):
Test plan
go test ./wal/ -run TestConcurrentWritersWithManyRecords -count=50 -tags invariantspasses (50/50)