chore: Port downstream fork commits upstream#1553
Merged
romac merged 5 commits intoMay 11, 2026
Merged
Conversation
`on_value_processing_error` did not penalize the peer or exclude it from the immediate retry, so a peer serving undecodable `value_bytes` kept its weighted-selection score and could win back the same request. Mirror `on_invalid_value`: decrement the peer's score with `SyncResult::Failure` and pass `Some(peer_id)` to `re_request_values_from_peer_except` so the peer is added to `excluded_peers`. `re_request_values_from_peer_except` already removes the stale `pending_requests` entry before picking a new peer. Added `handle::tests::test_value_processing_error_penalizes_peer_and_retries_via_other_peer`; full unit + `arc-malachitebft-test` integration suites pass locally. --- - [x] I am a core contributor, OR I have been explicitly assigned to the linked issue - [x] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies with all requirements - [x] I understand that PRs not meeting these requirements will be closed without review - [x] Reference a GitHub issue - [x] Ensure the PR title follows the [conventional commits][conv-commits] spec - [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if the change warrants it - [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if the change warrants it [conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
## Problem Synced values whose id does not match the accompanying commit certificate were unconditionally forwarded to consensus, letting uncertified entries reach `state.store_value`. ## Solution Gate the `Msg::ReceivedProposedValue` cast in `Effect::ValidSyncValue` on an id match, and `warn!` on rejection. Values that match the id but are locally `Invalid` still flow through so `maybe_sync_decision`'s version-skew error log keeps firing. ## Tests Covered by the existing `value_sync` integration suite (40 tests, all green locally and on CI). --- ### PR author checklist #### Contribution eligibility - [x] I am a core contributor, OR I have been explicitly assigned to the linked issue - [x] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies with all requirements - [x] I understand that PRs not meeting these requirements will be closed without review #### For all contributors - [x] Reference a GitHub issue - [x] Ensure the PR title follows the [conventional commits][conv-commits] spec - [x] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if the change warrants it - [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if the change warrants it [conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
## Summary In order to avoid race conditions in the app, where a value arrives after a decision has already been reached, this PR makes consensus skip forwarding the synced value bytes to the application when processing the commit certificate has already driven the state machine to a decision. ## Background When a sync `ValueResponse` arrives, `core-consensus::handle::sync::on_value_response` does two things: 1. **Certificate validation** (`process_commit_certificate`) — verifies the commit certificate, stores it in the driver, and — if a matching `ProposedValue` is already known locally (typically WAL replay on restart, or a race between sync and consensus) — calls `maybe_sync_decision` which feeds `DriverInput::SyncDecision` and can drive the state machine straight to `Decide` / `Commit`. 2. **Value forwarding** (`Effect::ValidSyncValue`) — asks the host (app) to decode the value bytes via `HostMsg::ProcessSyncedValue` and feed the resulting `ProposedValue` back as `Msg::ReceivedProposedValue(_, ValueOrigin::Sync)`. Until now step 2 ran unconditionally on the success path. When step 1 had already led to a decision, step 2 was at best wasted work as the app had already accepted the value through the local proposed-value path. At worst triggered a race in the app: without proper guard, the synced value bytes would be re-decoded and stored *after* the decision had been delivered, leaving the app in a state where it processed a "new" value for an already-finalized height. The resulting `ReceivedProposedValue` was eventually dropped during `on_proposed_value`, but only after the app had done duplicate work and exposed the race window. ## Change In `code/crates/core-consensus/src/handle/sync.rs::on_value_response`, after `process_commit_certificate` succeeds, check whether the driver has transitioned to (or already is in) the `Commit` step: - If yes, skip emitting `Effect::ValidSyncValue` and log a debug line — the host already has the value, no need to re-deliver the bytes. - If no, emit `Effect::ValidSyncValue` exactly as before so the host can decode the bytes and produce a `ProposedValue` (the typical sync-catch-up case). The error path is unchanged: a failed certificate verification still emits `Effect::InvalidSyncValue` so sync re-requests from another peer. ## Tests Extended `code/crates/core-consensus/tests/sync_cert_verification.rs`: - **`sync_value_response_skips_valid_sync_value_when_already_decided`** — feeds a `ProposedValue` first, then a matching sync `ValueResponse`, and asserts `Effect::ValidSyncValue` is **not** emitted (because certificate processing decided via the sync decision path). - **`sync_value_response_emits_valid_sync_value_when_not_decided`** — sanity check of the inverse: with no locally stored `ProposedValue`, the sync `ValueResponse` still emits `Effect::ValidSyncValue` so the host can decode the bytes. The pre-existing `sync_decision_path_verifies_commit_certificate_once` continues to pass, confirming the cert-verification accounting did not regress. --- ### PR author checklist #### Contribution eligibility - [ ] I am a core contributor, OR I have been explicitly assigned to the linked issue - [ ] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies with all requirements - [ ] I understand that PRs not meeting these requirements will be closed without review #### For all contributors - [ ] Reference a GitHub issue - [ ] Ensure the PR title follows the [conventional commits][conv-commits] spec - [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if the change warrants it - [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if the change warrants it #### For external contributors - [ ] Maintainers of Malachite are [allowed to push changes to the branch][gh-edit-branch] [conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary [gh-edit-branch]: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests --------- Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
## Problem
`handle_sync_event` and `handle_validator_proof_event` log `tx_event`
send failures and then return `ControlFlow::Continue(())`, so the
network task keeps running after the engine drops the receiver while
consensus silently sees nothing.
## Solution
Align both handlers with the `if let Err(e) = tx_event.send(...).await {
error!(...); return ControlFlow::Break(()); }` pattern already used at
five other send sites in the same file. Three send sites change: sync
`Request`, sync `Response`, validator-proof `ProofReceived`.
## Tests
Added two unit tests in `code/crates/network/src/lib.rs`:
- `handle_validator_proof_event_breaks_when_event_receiver_dropped`
- `handle_validator_proof_event_forwards_proof_and_continues`
`handle_sync_event`'s affected branches are not unit-testable —
`libp2p::request_response::{InboundRequestId, OutboundRequestId,
ResponseChannel}` have private constructors. The fix is structurally
identical to the tested handler.
---
### PR author checklist
#### Contribution eligibility
- [x] I am a core contributor, OR I have been explicitly assigned to the
linked issue
- [x] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies
with all requirements
- [x] I understand that PRs not meeting these requirements will be
closed without review
#### For all contributors
- [x] Reference a GitHub issue
- [x] Ensure the PR title follows the [conventional
commits][conv-commits] spec
- [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if
the change warrants it
- [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if
the change warrants it
[conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
## Summary
`malachitebft_network_discovered_peers` gauge grows unboundedly on
long-lived nodes, eventually exhausting scrape budgets.
Two related metrics, `peer_mesh_membership` and `explicit_peers`, used
the same tombstone pattern
## Root cause
Two bugs compounded across all three metrics:
1. **`discovered_peers` only:** `PeerInfoLabels` included the peer's
full multiaddr as the `address` label. The remote address embeds an
*ephemeral* TCP port that changes on every reconnection, so each
reconnect from the same peer minted a new Prometheus time series.
2. **All three metrics:** on disconnect / topic-leave / explicit-stale,
the code set the gauge to a sentinel value (`i64::MIN` for
`discovered_peers` and `explicit_peers`, `0` for `peer_mesh_membership`)
to "signal" the state change but never removed the entry from the
`Family`. Stale entries accumulated forever, growing with peer churn
rather than the live peer set.
## Fix
All in `code/crates/network/src/metrics.rs`:
**`discovered_peers`**
- Drop `address` from `PeerInfoLabels` (and from `labels_match`). Peers
are identified by the stable `peer_id` plus the bounded `slot` (capped
at `MAX_PEER_SLOTS = 100`). The address remains available in logs and on
`PeerInfo` for debugging.
- Replace `set(i64::MIN)` tombstones with `Family::remove(&labels)` in
`free_slot` and the `labels_changed` branch of `update_peer_labels`.
**`peer_mesh_membership`**
- In `update_peer_metrics`, topics leaving the mesh are now pruned with
`Family::remove(&labels)` instead of `set(0)`. Without this, every
`(peer_id, peer_moniker, topic)` combination the peer ever joined and
left lingered as a stale series — an unbounded growth path under moniker
churn.
- `free_slot` already prunes the peer's joined-topic entries on
disconnect.
**`explicit_peers`**
- `mark_explicit_peer_stale` now calls `Family::remove(&labels)` instead
of `set(i64::MIN)`. Otherwise every peer ever marked explicit
accumulated an entry over the node's lifetime.
Cardinality across all three metrics is now bounded by the *current*
peer / mesh / explicit-peer set rather than by node uptime.
## Tests
Added a `#[cfg(test)] mod tests` in `metrics.rs` that scrapes the
registry to text and counts `<metric>{...}` series. A shared
`metric_series_count(registry, name)` helper drives all the assertions:
`discovered_peers`
- `discovered_peers_bounded_under_ephemeral_port_churn` — 500
connect/disconnect cycles for one peer with varying ports → final series
count = 0.
- `discovered_peers_single_series_per_connected_peer` — 100
reconnections under different addresses → series count = 1.
- `discovered_peers_cardinality_tracks_connected_peers` — 100 distinct
peers connect (count = 100), then all disconnect (count = 0).
`peer_mesh_membership`
- `peer_mesh_membership_pruned_on_leave` — 200 join/leave cycles for one
topic → final series count = 0.
- `peer_mesh_membership_pruned_on_disconnect` — peer joins 3 topics
(count = 3), then disconnects (count = 0).
`explicit_peers`
- `explicit_peers_pruned_on_stale` — 200 record/mark-stale cycles for
distinct peers → series count = 0.
---
### PR author checklist
#### Contribution eligibility
- [ ] I am a core contributor, OR I have been explicitly assigned to the
linked issue
- [ ] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies
with all requirements
- [ ] I understand that PRs not meeting these requirements will be
closed without review
#### For all contributors
- [ ] Reference a GitHub issue
- [ ] Ensure the PR title follows the [conventional
commits][conv-commits] spec
- [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if
the change warrants it
- [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if
the change warrants it
#### For external contributors
- [ ] Maintainers of Malachite are [allowed to push changes to the
branch][gh-edit-branch]
[conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
[gh-edit-branch]:
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests
---------
Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
Co-authored-by: Sergio Mena <sergio.mena@circle.com>
ancazamfir
approved these changes
May 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Brings 5 commits forward from a downstream fork of Malachite, on top of #1552.
What's in here
Sync
Network
Test plan
cargo check --workspace --tests --all-featurespasses locally🤖 Generated with Claude Code