Range sync custody by root#67
Conversation
9198ceb to
296ec61
Compare
| { | ||
| Ok((None, _)) => {} | ||
| _ => { | ||
| debug!(%chain_id, "trigger_batch_downloads for removed chain"); |
There was a problem hiding this comment.
🤖 This discards the chain removal result. Every other caller of call_by_id in this file (blocks_by_range_response, handle_block_process_result, inject_error) extracts the removed chain and passes it through self.on_chain_removed(...) for cleanup (awaiting_head_peers, sync state, metrics). Here the Ok(Some((chain, remove_reason))) case is swallowed under _ => with just a debug log.
If request_batches returns Err(RemoveChain::ChainFailed { blacklist, .. }), the chain gets removed from the collection internally by call_by_id but the higher-level cleanup is skipped.
There was a problem hiding this comment.
🤖 Fixed in 5ed01d1. trigger_batch_downloads now properly handles Ok(Some((removed_chain, remove_reason))) by calling self.on_chain_removed(...), matching all other call_by_id callers.
| /// | ||
| /// Only does work when blocks have arrived and we're in DataColumns mode. For each block | ||
| /// with data, inserts a `Requesting` entry and fires a custody request via the network context. | ||
| pub fn continue_requests<T: BeaconChainTypes<EthSpec = E>>( |
There was a problem hiding this comment.
🤖 If block N's custody request succeeds (inserting an ActiveCustodyRequest into custody_by_root_requests) but block N+1 fails, this returns Err immediately. Back in range_block_component_response, that error causes an early return without re-inserting the RangeBlockComponentsRequest into components_by_range_requests. The custody requests already in-flight for blocks 0..N become orphaned — when they complete, on_custody_by_root_result routes the result to on_range_components_response, but the parent entry is gone, so the result is silently dropped.
Not catastrophic (the batch retries from scratch), but the orphaned ActiveCustodyRequest entries in custody_by_root_requests are a resource leak. Fix options: (a) clean up already-sent custody requests on failure, or (b) continue sending for all blocks and only propagate the error after all requests are queued.
There was a problem hiding this comment.
🤖 Fixed in 5ed01d1 — went with option (b): continue_requests now collects errors from all blocks instead of returning early. All blocks that can be dispatched are dispatched; errors are joined and returned only after the loop completes. This avoids orphaning already-sent custody requests for blocks 0..N when block N+1 fails.
| custody_columns_by_root.insert(block_root, DataColumnsState::Requesting); | ||
| debug!(?block_root, %id, "Initiated custody-by-root for range block"); | ||
| } | ||
| Ok(LookupRequestResult::NoRequestNeeded(reason)) => { |
There was a problem hiding this comment.
🤖 NoRequestNeeded means all custody columns are already available — e.g., they arrived via gossip between block download and this call. This is a valid state that should mark the block's data as Complete rather than failing the whole batch. Similarly, Pending seems overly aggressive as a hard error.
This could cause spurious batch failures when columns happen to already be available.
There was a problem hiding this comment.
🤖 Fixed in 5ed01d1 — NoRequestNeeded now marks the block as Complete with empty columns instead of erroring. Added a debug log for visibility. Pending is kept as an error since it indicates an unexpected state for range sync (historical blocks should not have pending lookups).
| .ok() | ||
| .map(|(_, peer_group, _)| peer_group.clone()) | ||
| .unwrap_or_else(|| PeerGroup::from_set(Default::default())); | ||
| let peer_id = peer_group.all().next().copied().unwrap_or(PeerId::random()); |
There was a problem hiding this comment.
🤖 On error paths, peer_group is empty (constructed from Default::default() above), so this generates a random PeerId. This peer ID gets passed through on_range_components_response → inject_error, which may attempt to penalize a non-existent peer. Probably a no-op in practice but conceptually wrong and could confuse debugging. Consider restructuring so the error path doesn't need a peer_id, or use a documented sentinel.
There was a problem hiding this comment.
🤖 Fixed in 5ed01d1 — restructured to thread Option<PeerId> through on_range_components_response and inject_error down to batch.download_failed(peer_id.copied()). On custody error paths where no real peer is available, None is passed — this still counts toward the retry limit but does not blame any peer. Blocks/blobs callers wrap their real peer in Some().
| Some(DataColumnsState::Complete(..)) => Err(format!( | ||
| "duplicate custody columns for block root {block_root:?}" | ||
| )), | ||
| Some(DataColumnsState::Requesting) | None => { |
There was a problem hiding this comment.
🤖 The None case here means we'll accept custody columns for a block root that was never registered via continue_requests. Probably doesn't happen in practice, but tightening this to only accept Some(DataColumnsState::Requesting) and returning an error for None would catch unexpected routing bugs early.
There was a problem hiding this comment.
🤖 Fixed in 5ed01d1 — add_custody_columns now returns an error for None (unregistered block root). Added register_custody_block() test helper and 5 new unit tests in ff206e4 covering: unregistered root rejection, duplicate rejection, pending state blocking, missing columns error, and mixed blocks (with/without data).
|
🤖 Test coverage gap: Three tests were removed ( The |
|
🤖 Re: test coverage gap The 4 removed tests ( Added 5 new coupling layer unit tests in ff206e4:
The full integration path ( |
…ange For Fulu+ epochs, range sync now fetches blocks first via BlocksByRange, then initiates custody-by-root requests per block that has data. This reuses the existing ActiveCustodyRequest machinery (peer selection, retries, batching) from single block lookups. Removes retry_partial_batch and DataColumnsByRange usage from forward range sync and backfill sync. Custody backfill sync path is unchanged.
continue_requests() now calls custody_lookup_request directly instead of returning roots for network_context to dispatch. Replace custody_initiated bool + Option hashmap with CustodyByRootState enum tracking per-block Requesting/Complete state.
- fix wrong req_id returned by custody_lookup_request (was returning incremented self.request_id instead of caller-allocated id) - move column index computation to callers so range sync uses batch epoch columns instead of current epoch - use real peer from custody response instead of PeerId::random() - fix responses() to check blocks need custody entries before completing - use full match in add_custody_columns_by_root
…up_request - inline initiate_custody into continue_requests, merge two loops - pass block_epoch to custody_lookup_request so it computes and filters columns internally (restoring cached_data_column_indexes filtering) - include PeerGroup in RangeBlockComponent::CustodyResult variant - remove unused is_finished test helper
… cases - Fix trigger_batch_downloads to call on_chain_removed on chain failure - Handle NoRequestNeeded as success in continue_requests (mark Complete) - Collect errors in continue_requests instead of early return (avoid orphaning) - Thread Option<PeerId> through inject_error to avoid fabricating random peers - Tighten add_custody_columns to reject unregistered block roots
- add_custody_columns rejects unregistered block roots - add_custody_columns rejects duplicate completions - responses returns None while custody still Requesting - responses errors when block with data has no custody columns - mixed blocks (with/without data) resolve correctly
ff206e4 to
3762953
Compare
| } | ||
| ByRangeDataRequestIds::PostPeerDAS => { | ||
| // Post-PeerDAS: no DataColumnsByRange requests to complete. | ||
| // With empty blocks, continue_requests() won't trigger custody-by-root. |
There was a problem hiding this comment.
All PostPeerDAS tests use NumBlobs::None, so continue_requests() never triggers custody-by-root — the core new code path goes untested at the integration level. Worth adding a test with NumBlobs::Number(1) that exercises blocks arriving → continue_requests fires custody-by-root → custody completes → coupling assembles RpcBlock with columns.
There was a problem hiding this comment.
What about if I add coverage on unstable for range sync and then build this on top?
Issue Addressed
Which issue # does this PR address?
Proposed Changes
Please list or describe the changes introduced by this PR.
Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.