Skip to content

Regression test for range sync CGC race condition#8039

Merged
mergify[bot] merged 7 commits into
sigp:unstablefrom
jimmygchen:chain-removed-due-to-race
Jun 4, 2026
Merged

Regression test for range sync CGC race condition#8039
mergify[bot] merged 7 commits into
sigp:unstablefrom
jimmygchen:chain-removed-due-to-race

Conversation

@jimmygchen
Copy link
Copy Markdown
Member

@jimmygchen jimmygchen commented Sep 12, 2025

Description

Regression test for a sync race condition where range sync gets stuck due to a CGC (custody group count) metadata timing issue. The original bug was discovered in CI kurtosis sync tests: https://github.com/sigp/lighthouse/actions/runs/17632496975/job/50102884940

The core fix (calling attempt_send_awaiting_download_batches on SyncingChain::resume) was merged in #8230. This PR adds a regression test that reproduces the exact scenario, plus related fixes.

Changes:

  1. Regression test (finalized_sync_not_enough_custody_peers_resume_after_peer_cgc_update): reproduces the failing scenario and verifies that sync resumes correctly when CGC updates arrive.

  2. Moved to separate PR: Count AwaitingDownload batches in buffer size check (range sync + backfill sync): the in_buffer check previously only counted Downloading and AwaitingProcessing batches. This meant AwaitingDownload batches didn't count toward BATCH_BUFFER_SIZE, so new batches could be created while AwaitingDownload ones accumulated. Now AwaitingDownload is included so the buffer limit applies properly.

  3. Moved to separate PR
    : Handle NoPeer error without fake-failing the batch: previously, when send_batch got a NoPeer error, it would fake-start the download (start_downloading(1)), then call download_failed to increment the failure count. This incorrectly penalized the batch for something that wasn't a real download failure. Now NoPeer just logs a debug message and leaves the batch in AwaitingDownload (a valid in-between state since Allow AwaitingDownload to be a valid in-between state #7984), letting it be retried when peers become available.

@jimmygchen jimmygchen added bug Something isn't working work-in-progress PR is a work-in-progress syncing labels Sep 12, 2025
@jimmygchen jimmygchen marked this pull request as ready for review September 12, 2025 15:32
@jimmygchen jimmygchen requested a review from jxs as a code owner September 12, 2025 15:32
Copy link
Copy Markdown
Collaborator

@dapplion dapplion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch and good test! Some minor comments

Comment thread beacon_node/network/src/sync/tests/range.rs Outdated
Comment thread beacon_node/network/src/sync/tests/range.rs
Comment thread beacon_node/network/src/sync/tests/range.rs
Comment thread beacon_node/network/src/sync/tests/range.rs Outdated
@jimmygchen
Copy link
Copy Markdown
Member Author

jimmygchen commented Sep 15, 2025

I had a look at the failing checkpoint sync tests (optionally triggered jobs), and I don't believe they are related to changes from this PR, i think what's happening is:

  • The sepolia checkpoint server used is currently 4 days behind, so it takes longer to sync
  • The devnet was under non finality tests, and it was just taking longer to sync.
  • Github-hosted runner struggles to sync supernode on peerdas devnet due to capacity, to be resovled in Use unique enclave names for CI Kurtosis tests on self hosted runners #7804 once we have dedicated kurtosis runners.

@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Sep 15, 2025
@jimmygchen
Copy link
Copy Markdown
Member Author

@dapplion thanks for the review! I've addressed most of your comments.

@jimmygchen jimmygchen added the v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky label Sep 15, 2025
Copy link
Copy Markdown
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, running this on devnet 3 i'm getting a stuck chain with 40 peers.
Need to investigate further

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Sep 16, 2025
@jimmygchen
Copy link
Copy Markdown
Member Author

Thanks for raising this Pawan. Sorry i haven't managed to get to this today, will look into this tomorrow.

@jimmygchen jimmygchen added v8.0.0 Q4 2025 Fusaka Mainnet Release and removed v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky labels Sep 17, 2025
@jimmygchen
Copy link
Copy Markdown
Member Author

FYI, running this on devnet 3 i'm getting a stuck chain with 40 peers. Need to investigate further

@pawanjay176 im going to remove the clean ups from this PR so we can merge this, because the flaky sync test is really quite annoying 😅

@mergify mergify Bot added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Sep 25, 2025
@mergify
Copy link
Copy Markdown

mergify Bot commented Sep 25, 2025

This pull request has merge conflicts. Could you please resolve them @jimmygchen? 🙏

@mergify mergify Bot added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Sep 25, 2025
@jimmygchen jimmygchen added v8.1.0 Post-Fulu release and removed v8.0.0 Q4 2025 Fusaka Mainnet Release labels Sep 30, 2025
@mergify mergify Bot added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Oct 16, 2025
@mergify
Copy link
Copy Markdown

mergify Bot commented Oct 20, 2025

This pull request has merge conflicts. Could you please resolve them @jimmygchen? 🙏

@jimmygchen
Copy link
Copy Markdown
Member Author

@dapplion please hold off from merging this. I'm seeing a trend of increasing sync_active_network_requests for components_by_range running this branch. Will investigate further tomorrow.

@jimmygchen
Copy link
Copy Markdown
Member Author

^ unrelated to this PR i think. I've made a PR to clean it up:

@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Feb 24, 2026
@jimmygchen
Copy link
Copy Markdown
Member Author

jimmygchen commented Feb 25, 2026

@pawanjay176 @dapplion
I believe I've addressed all the prior comments and this change is still required. Could you please re-review?
This PR doesn't fix the potential leaks, but prevents sending an unbounded number of batches upon resuming sync.
The leak is addressed in the link PR above.

@jimmygchen jimmygchen removed the v8.1.1 Hotfix for v8.1.0 label Feb 26, 2026
@jimmygchen
Copy link
Copy Markdown
Member Author

@pawanjay176 I've looked at this PR again with @michaelsproul - this PR contains two changes that can be separated:

  1. The regression test for a CGC race condition that has already been fixed a while ago - this is a low risk and low hanging fruit, but not urgent for v8.1.1.
  2. The NoPeer clean up, which makes it possible to create a batch in AwaitingDownload state, and therefore needs to be bundled with the in_buffer check AND the stall detection. This is mainly a clean up. On current unstable, the batch never stays in AwaitingDownload after send batch returns, so these batches can't accumulate. Therefore this is not a required fix for v8.1.1 either.

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Feb 26, 2026
@jimmygchen
Copy link
Copy Markdown
Member Author

jimmygchen commented Mar 2, 2026

Rebased on unstable and split this PR into two:

@mergify
Copy link
Copy Markdown

mergify Bot commented Mar 2, 2026

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

@mergify
Copy link
Copy Markdown

mergify Bot commented Apr 30, 2026

Hi @jimmygchen, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time.

1 similar comment
@mergify
Copy link
Copy Markdown

mergify Bot commented May 30, 2026

Hi @jimmygchen, this pull request has been closed automatically due to 30 days of inactivity. If you’d like to continue working on it, feel free to reopen at any time.

Copy link
Copy Markdown
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good test, lets merge!

@mergify
Copy link
Copy Markdown

mergify Bot commented Jun 3, 2026

Merge Queue Status

This pull request spent 21 minutes 49 seconds in the queue, including 19 minutes 37 seconds running CI.

Waiting for
  • check-success=local-testnet-success
  • check-success=test-suite-success
All conditions

Reason

The merge conditions cannot be satisfied due to failing checks

Hint

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio queue comment.

@mergify
Copy link
Copy Markdown

mergify Bot commented Jun 4, 2026

Merge Queue Status

This pull request spent 21 minutes 49 seconds in the queue, including 19 minutes 37 seconds running CI.

Waiting for
  • check-success=local-testnet-success
  • check-success=test-suite-success
All conditions

Reason

The merge conditions cannot be satisfied due to failing checks

Hint

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio queue comment.

@mergify
Copy link
Copy Markdown

mergify Bot commented Jun 4, 2026

Merge Queue Status

This pull request spent 30 minutes 7 seconds in the queue, including 28 minutes 42 seconds running CI.

Required conditions to merge

Reason

Pull request #8039 has been dequeued

The pull request could not be merged. This could be related to an activated branch protection or ruleset rule that prevents us from merging. 1 review requesting changes and 1 approving review by reviewers with write access.

Hint

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.
If you do update this pull request, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio queue comment.

@mergify
Copy link
Copy Markdown

mergify Bot commented Jun 4, 2026

Merge Queue Status

This pull request spent 30 minutes 17 seconds in the queue, including 28 minutes 7 seconds running CI.

Required conditions to merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-merge This PR is ready to merge. syncing test improvement Improve tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants