Skip to content

Range sync custody by root#67

Open
dapplion wants to merge 10 commits into
unstablefrom
range-sync-custody-by-root
Open

Range sync custody by root#67
dapplion wants to merge 10 commits into
unstablefrom
range-sync-custody-by-root

Conversation

@dapplion
Copy link
Copy Markdown
Owner

Issue Addressed

Which issue # does this PR address?

Proposed Changes

Please list or describe the changes introduced by this PR.

Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.

{
Ok((None, _)) => {}
_ => {
debug!(%chain_id, "trigger_batch_downloads for removed chain");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This discards the chain removal result. Every other caller of call_by_id in this file (blocks_by_range_response, handle_block_process_result, inject_error) extracts the removed chain and passes it through self.on_chain_removed(...) for cleanup (awaiting_head_peers, sync state, metrics). Here the Ok(Some((chain, remove_reason))) case is swallowed under _ => with just a debug log.

If request_batches returns Err(RemoveChain::ChainFailed { blacklist, .. }), the chain gets removed from the collection internally by call_by_id but the higher-level cleanup is skipped.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Fixed in 5ed01d1. trigger_batch_downloads now properly handles Ok(Some((removed_chain, remove_reason))) by calling self.on_chain_removed(...), matching all other call_by_id callers.

///
/// Only does work when blocks have arrived and we're in DataColumns mode. For each block
/// with data, inserts a `Requesting` entry and fires a custody request via the network context.
pub fn continue_requests<T: BeaconChainTypes<EthSpec = E>>(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 If block N's custody request succeeds (inserting an ActiveCustodyRequest into custody_by_root_requests) but block N+1 fails, this returns Err immediately. Back in range_block_component_response, that error causes an early return without re-inserting the RangeBlockComponentsRequest into components_by_range_requests. The custody requests already in-flight for blocks 0..N become orphaned — when they complete, on_custody_by_root_result routes the result to on_range_components_response, but the parent entry is gone, so the result is silently dropped.

Not catastrophic (the batch retries from scratch), but the orphaned ActiveCustodyRequest entries in custody_by_root_requests are a resource leak. Fix options: (a) clean up already-sent custody requests on failure, or (b) continue sending for all blocks and only propagate the error after all requests are queued.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Fixed in 5ed01d1 — went with option (b): continue_requests now collects errors from all blocks instead of returning early. All blocks that can be dispatched are dispatched; errors are joined and returned only after the loop completes. This avoids orphaning already-sent custody requests for blocks 0..N when block N+1 fails.

custody_columns_by_root.insert(block_root, DataColumnsState::Requesting);
debug!(?block_root, %id, "Initiated custody-by-root for range block");
}
Ok(LookupRequestResult::NoRequestNeeded(reason)) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 NoRequestNeeded means all custody columns are already available — e.g., they arrived via gossip between block download and this call. This is a valid state that should mark the block's data as Complete rather than failing the whole batch. Similarly, Pending seems overly aggressive as a hard error.

This could cause spurious batch failures when columns happen to already be available.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Fixed in 5ed01d1NoRequestNeeded now marks the block as Complete with empty columns instead of erroring. Added a debug log for visibility. Pending is kept as an error since it indicates an unexpected state for range sync (historical blocks should not have pending lookups).

Comment thread beacon_node/network/src/sync/manager.rs Outdated
.ok()
.map(|(_, peer_group, _)| peer_group.clone())
.unwrap_or_else(|| PeerGroup::from_set(Default::default()));
let peer_id = peer_group.all().next().copied().unwrap_or(PeerId::random());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 On error paths, peer_group is empty (constructed from Default::default() above), so this generates a random PeerId. This peer ID gets passed through on_range_components_responseinject_error, which may attempt to penalize a non-existent peer. Probably a no-op in practice but conceptually wrong and could confuse debugging. Consider restructuring so the error path doesn't need a peer_id, or use a documented sentinel.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Fixed in 5ed01d1 — restructured to thread Option<PeerId> through on_range_components_response and inject_error down to batch.download_failed(peer_id.copied()). On custody error paths where no real peer is available, None is passed — this still counts toward the retry limit but does not blame any peer. Blocks/blobs callers wrap their real peer in Some().

Some(DataColumnsState::Complete(..)) => Err(format!(
"duplicate custody columns for block root {block_root:?}"
)),
Some(DataColumnsState::Requesting) | None => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The None case here means we'll accept custody columns for a block root that was never registered via continue_requests. Probably doesn't happen in practice, but tightening this to only accept Some(DataColumnsState::Requesting) and returning an error for None would catch unexpected routing bugs early.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Fixed in 5ed01d1add_custody_columns now returns an error for None (unregistered block root). Added register_custody_block() test helper and 5 new unit tests in ff206e4 covering: unregistered root rejection, duplicate rejection, pending state blocking, missing columns error, and mixed blocks (with/without data).

@jimmygchen
Copy link
Copy Markdown

🤖 Test coverage gap: Three tests were removed (missing_custody_columns_from_faulty_peers, retry_logic_after_peer_failures, max_retries_exceeded_behavior) that covered important column retry error paths. No new integration tests were added for the custody-by-root flow in range sync — the only new test (rpc_block_with_custody_columns_no_data_blocks) covers the trivial no-blobs case, and the existing range sync tests skip custody-by-root entirely since test blocks have no blobs.

The ActiveCustodyRequest system presumably has its own tests, but the integration between continue_requests, custody-by-root failures/retries, and the coupling layer is untested.

@dapplion
Copy link
Copy Markdown
Owner Author

dapplion commented Mar 9, 2026

🤖 Re: test coverage gap

The 4 removed tests (rpc_block_with_custody_columns_batched, missing_custody_columns_from_faulty_peers, retry_logic_after_peer_failures, max_retries_exceeded_behavior) all tested retry/error machinery that lived in the coupling struct under the old DataColumnsByRange flow — specifically reinsert_failed_column_requests, MAX_COLUMN_RETRIES, per-column peer tracking, and DataColumnPeerFailure.exceeded_retries. All of that is now owned by ActiveCustodyRequest, which has its own test suite.

Added 5 new coupling layer unit tests in ff206e4:

  • add_custody_columns_rejects_unregistered_root
  • add_custody_columns_rejects_duplicate
  • responses_returns_none_while_custody_requesting
  • responses_error_on_missing_custody_columns
  • mixed_blocks_with_and_without_data

The full integration path (continue_requestsActiveCustodyRequest → failure → batch retry) would require a SyncNetworkContext mock to test. Worth a TODO but not blocking since each layer is tested independently.

dapplion added 10 commits March 9, 2026 06:48
…ange

For Fulu+ epochs, range sync now fetches blocks first via BlocksByRange,
then initiates custody-by-root requests per block that has data. This
reuses the existing ActiveCustodyRequest machinery (peer selection,
retries, batching) from single block lookups.

Removes retry_partial_batch and DataColumnsByRange usage from forward
range sync and backfill sync. Custody backfill sync path is unchanged.
continue_requests() now calls custody_lookup_request directly instead
of returning roots for network_context to dispatch. Replace
custody_initiated bool + Option hashmap with CustodyByRootState enum
tracking per-block Requesting/Complete state.
- fix wrong req_id returned by custody_lookup_request (was returning
  incremented self.request_id instead of caller-allocated id)
- move column index computation to callers so range sync uses batch
  epoch columns instead of current epoch
- use real peer from custody response instead of PeerId::random()
- fix responses() to check blocks need custody entries before completing
- use full match in add_custody_columns_by_root
…up_request

- inline initiate_custody into continue_requests, merge two loops
- pass block_epoch to custody_lookup_request so it computes and filters
  columns internally (restoring cached_data_column_indexes filtering)
- include PeerGroup in RangeBlockComponent::CustodyResult variant
- remove unused is_finished test helper
… cases

- Fix trigger_batch_downloads to call on_chain_removed on chain failure
- Handle NoRequestNeeded as success in continue_requests (mark Complete)
- Collect errors in continue_requests instead of early return (avoid orphaning)
- Thread Option<PeerId> through inject_error to avoid fabricating random peers
- Tighten add_custody_columns to reject unregistered block roots
- add_custody_columns rejects unregistered block roots
- add_custody_columns rejects duplicate completions
- responses returns None while custody still Requesting
- responses errors when block with data has no custody columns
- mixed blocks (with/without data) resolve correctly
@dapplion dapplion force-pushed the range-sync-custody-by-root branch from ff206e4 to 3762953 Compare March 9, 2026 05:52
}
ByRangeDataRequestIds::PostPeerDAS => {
// Post-PeerDAS: no DataColumnsByRange requests to complete.
// With empty blocks, continue_requests() won't trigger custody-by-root.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All PostPeerDAS tests use NumBlobs::None, so continue_requests() never triggers custody-by-root — the core new code path goes untested at the integration level. Worth adding a test with NumBlobs::Number(1) that exercises blocks arriving → continue_requests fires custody-by-root → custody completes → coupling assembles RpcBlock with columns.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about if I add coverage on unstable for range sync and then build this on top?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants