Regression test for range sync CGC race condition#8039
Conversation
dapplion
left a comment
There was a problem hiding this comment.
Nice catch and good test! Some minor comments
|
I had a look at the failing checkpoint sync tests (optionally triggered jobs), and I don't believe they are related to changes from this PR, i think what's happening is:
|
|
@dapplion thanks for the review! I've addressed most of your comments. |
pawanjay176
left a comment
There was a problem hiding this comment.
FYI, running this on devnet 3 i'm getting a stuck chain with 40 peers.
Need to investigate further
|
Thanks for raising this Pawan. Sorry i haven't managed to get to this today, will look into this tomorrow. |
@pawanjay176 im going to remove the clean ups from this PR so we can merge this, because the flaky sync test is really quite annoying 😅 |
|
This pull request has merge conflicts. Could you please resolve them @jimmygchen? 🙏 |
|
This pull request has merge conflicts. Could you please resolve them @jimmygchen? 🙏 |
|
@dapplion please hold off from merging this. I'm seeing a trend of increasing |
|
^ unrelated to this PR i think. I've made a PR to clean it up: |
|
@pawanjay176 @dapplion |
|
@pawanjay176 I've looked at this PR again with @michaelsproul - this PR contains two changes that can be separated:
|
|
Rebased on
|
|
Some required checks have failed. Could you please take a look @jimmygchen? 🙏 |
|
Hi @jimmygchen, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time. |
1 similar comment
|
Hi @jimmygchen, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time. |
Merge Queue Status
This pull request spent 21 minutes 49 seconds in the queue, including 19 minutes 37 seconds running CI. Waiting for
All conditions
ReasonThe merge conditions cannot be satisfied due to failing checks HintYou may have to fix your CI before adding the pull request to the queue again. |
Merge Queue Status
This pull request spent 21 minutes 49 seconds in the queue, including 19 minutes 37 seconds running CI. Waiting for
All conditions
ReasonThe merge conditions cannot be satisfied due to failing checks HintYou may have to fix your CI before adding the pull request to the queue again. |
Merge Queue Status
This pull request spent 30 minutes 7 seconds in the queue, including 28 minutes 42 seconds running CI. Required conditions to merge
ReasonPull request #8039 has been dequeued The pull request could not be merged. This could be related to an activated branch protection or ruleset rule that prevents us from merging. 1 review requesting changes and 1 approving review by reviewers with write access. HintYou should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it. |
Merge Queue Status
This pull request spent 30 minutes 17 seconds in the queue, including 28 minutes 7 seconds running CI. Required conditions to merge
|
Description
Regression test for a sync race condition where range sync gets stuck due to a CGC (custody group count) metadata timing issue. The original bug was discovered in CI kurtosis sync tests: https://github.com/sigp/lighthouse/actions/runs/17632496975/job/50102884940
The core fix (calling
attempt_send_awaiting_download_batchesonSyncingChain::resume) was merged in #8230. This PR adds a regression test that reproduces the exact scenario, plus related fixes.Changes:
Regression test (
finalized_sync_not_enough_custody_peers_resume_after_peer_cgc_update): reproduces the failing scenario and verifies that sync resumes correctly when CGC updates arrive.Moved to separate PR:
CountAwaitingDownloadbatches in buffer size check (range sync + backfill sync): thein_buffercheck previously only countedDownloadingandAwaitingProcessingbatches. This meantAwaitingDownloadbatches didn't count towardBATCH_BUFFER_SIZE, so new batches could be created whileAwaitingDownloadones accumulated. NowAwaitingDownloadis included so the buffer limit applies properly.Moved to separate PR
:
HandleNoPeererror without fake-failing the batch: previously, whensend_batchgot aNoPeererror, it would fake-start the download (start_downloading(1)), then calldownload_failedto increment the failure count. This incorrectly penalized the batch for something that wasn't a real download failure. NowNoPeerjust logs a debug message and leaves the batch inAwaitingDownload(a valid in-between state since Allow AwaitingDownload to be a valid in-between state #7984), letting it be retried when peers become available.