
chore: Port downstream fork commits upstream #1553

Merged
romac merged 5 commits into circlefin:main from romac:chore/port-downstream-fork-commits-2 on May 11, 2026
Conversation

@romac (Contributor) commented on May 11, 2026

Summary

Brings 5 commits forward from a downstream fork of Malachite, on top of #1552.

What's in here

Sync

  • fix: penalize peer on value processing error
  • fix: drop synced values whose id mismatches the commit certificate
  • fix: Do not forward sync value to host if we have a decision

Network

  • fix: break network loop on event channel send failure
  • fix: Remove stale metric on peer disconnect

Test plan

  • cargo check --workspace --tests --all-features passes locally
  • CI green

🤖 Generated with Claude Code

Bastien Faivre and others added 5 commits on May 11, 2026

**fix: penalize peer on value processing error**

`on_value_processing_error` did not penalize the peer or exclude it from
the immediate retry, so a peer serving undecodable `value_bytes` kept
its weighted-selection score and could win back the same request.

Mirror `on_invalid_value`: decrement the peer's score with
`SyncResult::Failure` and pass `Some(peer_id)` to
`re_request_values_from_peer_except` so the peer is added to
`excluded_peers`. `re_request_values_from_peer_except` already removes
the stale `pending_requests` entry before picking a new peer.
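
The shape of the change, as a self-contained sketch: `PeerId`, the score
model, and both method bodies below are simplified stand-ins for the real
sync-crate internals, kept only to show how penalizing and excluding fit
together.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct PeerId(u64);

enum SyncResult {
    Success,
    Failure,
}

struct SyncState {
    /// Weighted-selection scores; higher means more likely to be picked.
    scores: HashMap<PeerId, u64>,
    /// Height -> peer currently serving the request for that height.
    pending_requests: HashMap<u64, PeerId>,
}

impl SyncState {
    fn update_score(&mut self, peer: PeerId, result: SyncResult) {
        let score = self.scores.entry(peer).or_insert(1);
        match result {
            SyncResult::Success => *score += 1,
            SyncResult::Failure => *score = score.saturating_sub(1),
        }
    }

    fn re_request_values_from_peer_except(&mut self, height: u64, except: Option<PeerId>) {
        // Drop the stale pending entry before picking a new peer.
        self.pending_requests.remove(&height);
        let candidate = self
            .scores
            .iter()
            .filter(|(peer, _)| Some(**peer) != except)
            .max_by_key(|(_, score)| **score)
            .map(|(peer, _)| *peer);
        if let Some(peer) = candidate {
            self.pending_requests.insert(height, peer);
        }
    }

    /// The fix: mirror `on_invalid_value` by penalizing the peer and
    /// excluding it from the immediate retry.
    fn on_value_processing_error(&mut self, peer: PeerId, height: u64) {
        self.update_score(peer, SyncResult::Failure);
        self.re_request_values_from_peer_except(height, Some(peer));
    }
}
```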

Added
`handle::tests::test_value_processing_error_penalizes_peer_and_retries_via_other_peer`;
full unit + `arc-malachitebft-test` integration suites pass locally.

---

### PR author checklist

#### Contribution eligibility

- [x] I am a core contributor, OR I have been explicitly assigned to the
linked issue
- [x] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies
with all requirements
- [x] I understand that PRs not meeting these requirements will be
closed without review

#### For all contributors

- [x] Reference a GitHub issue
- [x] Ensure the PR title follows the [conventional
commits][conv-commits] spec
- [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if
the change warrants it
- [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if
the change warrants it

[conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
---

**fix: drop synced values whose id mismatches the commit certificate**

## Problem

Synced values whose id does not match the accompanying commit
certificate were unconditionally forwarded to consensus, letting
uncertified entries reach `state.store_value`.

## Solution

Gate the `Msg::ReceivedProposedValue` cast in `Effect::ValidSyncValue`
on an id match, and `warn!` on rejection. Values that match the id but
are locally `Invalid` still flow through so `maybe_sync_decision`'s
version-skew error log keeps firing.
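
A minimal sketch of the gate, with illustrative stand-ins for the real
core-consensus types (`ProposedValue`, `CommitCertificate`, `ValueId` are
reduced to the fields needed here; the real handler casts
`Msg::ReceivedProposedValue` on a match and `warn!`s otherwise):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ValueId(u64);

struct ProposedValue {
    id: ValueId,
}

struct CommitCertificate {
    value_id: ValueId,
}

/// Returns `Some(value)` only when the synced value's id matches the
/// commit certificate; mismatching values are dropped with a warning.
fn gate_synced_value(value: ProposedValue, cert: &CommitCertificate) -> Option<ProposedValue> {
    if value.id == cert.value_id {
        Some(value)
    } else {
        eprintln!(
            "warn: dropping synced value {:?}: does not match certificate value {:?}",
            value.id, cert.value_id
        );
        None
    }
}
```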

## Tests

Covered by the existing `value_sync` integration suite (40 tests, all
green locally and on CI).

---

### PR author checklist

#### Contribution eligibility

- [x] I am a core contributor, OR I have been explicitly assigned to the
linked issue
- [x] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies
with all requirements
- [x] I understand that PRs not meeting these requirements will be
closed without review

#### For all contributors

- [x] Reference a GitHub issue
- [x] Ensure the PR title follows the [conventional
commits][conv-commits] spec
- [x] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if
the change warrants it
- [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if
the change warrants it

[conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
---

**fix: Do not forward sync value to host if we have a decision**

## Summary

To avoid a race in the app, where a synced value arrives after a decision
has already been reached, this PR makes consensus skip forwarding the
synced value bytes to the application when processing the commit
certificate has already driven the state machine to a decision.

## Background

When a sync `ValueResponse` arrives,
`core-consensus::handle::sync::on_value_response` does two things:

1. **Certificate validation** (`process_commit_certificate`) — verifies
the commit certificate, stores it in the driver, and — if a matching
`ProposedValue` is already known locally (typically WAL replay on
restart, or a race between sync and consensus) — calls
`maybe_sync_decision` which feeds `DriverInput::SyncDecision` and can
drive the state machine straight to `Decide` / `Commit`.
2. **Value forwarding** (`Effect::ValidSyncValue`) — asks the host (app)
to decode the value bytes via `HostMsg::ProcessSyncedValue` and feed the
resulting `ProposedValue` back as `Msg::ReceivedProposedValue(_,
ValueOrigin::Sync)`.

Until now, step 2 ran unconditionally on the success path. When step 1
had already led to a decision, step 2 was at best wasted work, since the
app had already accepted the value through the local proposed-value path.
At worst, it triggered a race in the app: without a proper guard, the
synced value bytes would be re-decoded and stored *after* the decision
had been delivered, leaving the app in a state where it processed a
"new" value for an already-finalized height. The resulting
`ReceivedProposedValue` was eventually dropped during
`on_proposed_value`, but only after the app had done duplicate work and
exposed the race window.

## Change

In `code/crates/core-consensus/src/handle/sync.rs::on_value_response`,
after `process_commit_certificate` succeeds, check whether the driver
has transitioned to (or already is in) the `Commit` step:

- If yes, skip emitting `Effect::ValidSyncValue` and log a debug line —
the host already has the value, no need to re-deliver the bytes.
- If no, emit `Effect::ValidSyncValue` exactly as before so the host can
decode the bytes and produce a `ProposedValue` (the typical
sync-catch-up case).

The error path is unchanged: a failed certificate verification still
emits `Effect::InvalidSyncValue` so sync re-requests from another peer.
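
The resulting control flow, sketched with illustrative stand-ins for
`Step`, `Driver`, and `Effect` (not the actual core-consensus
definitions):

```rust
#[derive(PartialEq, Eq)]
enum Step {
    Propose,
    Prevote,
    Precommit,
    Commit,
}

struct Driver {
    step: Step,
}

enum Effect {
    ValidSyncValue,
}

/// Called after `process_commit_certificate` succeeds: only ask the host
/// to decode the value bytes if the certificate did not already drive
/// the state machine to `Commit`.
fn effect_after_certificate(driver: &Driver) -> Option<Effect> {
    if driver.step == Step::Commit {
        // The host already has the value via the local proposed-value
        // path; re-delivering the bytes would only re-open the race.
        None
    } else {
        Some(Effect::ValidSyncValue)
    }
}
```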

## Tests

Extended `code/crates/core-consensus/tests/sync_cert_verification.rs`:

- **`sync_value_response_skips_valid_sync_value_when_already_decided`**
— feeds a `ProposedValue` first, then a matching sync `ValueResponse`,
and asserts `Effect::ValidSyncValue` is **not** emitted (because
certificate processing decided via the sync decision path).
- **`sync_value_response_emits_valid_sync_value_when_not_decided`** —
sanity check of the inverse: with no locally stored `ProposedValue`, the
sync `ValueResponse` still emits `Effect::ValidSyncValue` so the host
can decode the bytes.

The pre-existing `sync_decision_path_verifies_commit_certificate_once`
continues to pass, confirming the cert-verification accounting did not
regress.

---

### PR author checklist

#### Contribution eligibility

- [ ] I am a core contributor, OR I have been explicitly assigned to the
linked issue
- [ ] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies
with all requirements
- [ ] I understand that PRs not meeting these requirements will be
closed without review

#### For all contributors

- [ ] Reference a GitHub issue
- [ ] Ensure the PR title follows the [conventional
commits][conv-commits] spec
- [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if
the change warrants it
- [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if
the change warrants it

#### For external contributors

- [ ] Maintainers of Malachite are [allowed to push changes to the
branch][gh-edit-branch]

[conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
[gh-edit-branch]:
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests

---------

Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
---

**fix: break network loop on event channel send failure**

## Problem

`handle_sync_event` and `handle_validator_proof_event` log `tx_event`
send failures and then return `ControlFlow::Continue(())`, so the
network task keeps running after the engine has dropped the receiver,
while consensus silently sees nothing.
## Solution

Align both handlers with the `if let Err(e) = tx_event.send(...).await {
error!(...); return ControlFlow::Break(()); }` pattern already used at
five other send sites in the same file. Three send sites change: sync
`Request`, sync `Response`, validator-proof `ProofReceived`.
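
The pattern in isolation, assuming a tokio channel and a simplified
event type (the real handlers in the same file operate on the libp2p
event types):

```rust
use std::ops::ControlFlow;
use tokio::sync::mpsc::Sender;

/// Forward an event to the engine; if the receiver is gone, stop the
/// network loop instead of continuing to run with nobody listening.
async fn forward_event(tx_event: &Sender<String>, event: String) -> ControlFlow<()> {
    if let Err(e) = tx_event.send(event).await {
        eprintln!("error: event channel closed, stopping network loop: {e}");
        return ControlFlow::Break(());
    }
    ControlFlow::Continue(())
}
```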

## Tests

Added two unit tests in `code/crates/network/src/lib.rs`:
- `handle_validator_proof_event_breaks_when_event_receiver_dropped`
- `handle_validator_proof_event_forwards_proof_and_continues`

`handle_sync_event`'s affected branches are not unit-testable —
`libp2p::request_response::{InboundRequestId, OutboundRequestId,
ResponseChannel}` have private constructors. The fix is structurally
identical to the tested handler.

---

### PR author checklist

#### Contribution eligibility

- [x] I am a core contributor, OR I have been explicitly assigned to the
linked issue
- [x] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies
with all requirements
- [x] I understand that PRs not meeting these requirements will be
closed without review

#### For all contributors

- [x] Reference a GitHub issue
- [x] Ensure the PR title follows the [conventional
commits][conv-commits] spec
- [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if
the change warrants it
- [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if
the change warrants it

[conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
---

**fix: Remove stale metric on peer disconnect**

## Summary

The `malachitebft_network_discovered_peers` gauge grows without bound on
long-lived nodes, eventually exhausting scrape budgets. Two related
metrics, `peer_mesh_membership` and `explicit_peers`, used the same
tombstone pattern and leaked stale series the same way.

## Root cause

Two bugs compounded across all three metrics:

1. **`discovered_peers` only:** `PeerInfoLabels` included the peer's
full multiaddr as the `address` label. The remote address embeds an
*ephemeral* TCP port that changes on every reconnection, so each
reconnect from the same peer minted a new Prometheus time series.
2. **All three metrics:** on disconnect / topic-leave / explicit-stale,
the code set the gauge to a sentinel value (`i64::MIN` for
`discovered_peers` and `explicit_peers`, `0` for `peer_mesh_membership`)
to "signal" the state change but never removed the entry from the
`Family`. Stale entries accumulated forever, growing with peer churn
rather than the live peer set.

## Fix

All in `code/crates/network/src/metrics.rs`:

**`discovered_peers`**
- Drop `address` from `PeerInfoLabels` (and from `labels_match`). Peers
are identified by the stable `peer_id` plus the bounded `slot` (capped
at `MAX_PEER_SLOTS = 100`). The address remains available in logs and on
`PeerInfo` for debugging.
- Replace `set(i64::MIN)` tombstones with `Family::remove(&labels)` in
`free_slot` and the `labels_changed` branch of `update_peer_labels`.

**`peer_mesh_membership`**
- In `update_peer_metrics`, topics leaving the mesh are now pruned with
`Family::remove(&labels)` instead of `set(0)`. Without this, every
`(peer_id, peer_moniker, topic)` combination the peer ever joined and
left lingered as a stale series — an unbounded growth path under moniker
churn.
- `free_slot` already prunes the peer's joined-topic entries on
disconnect.

**`explicit_peers`**
- `mark_explicit_peer_stale` now calls `Family::remove(&labels)` instead
of `set(i64::MIN)`. Otherwise every peer ever marked explicit
accumulated an entry over the node's lifetime.

Cardinality across all three metrics is now bounded by the *current*
peer / mesh / explicit-peer set rather than by node uptime.
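
For illustration, the disconnect path before and after, using the
`prometheus-client` API with a reduced `PeerInfoLabels` (the struct here
is an assumption beyond the `peer_id` and `slot` labels named above):

```rust
use prometheus_client::encoding::EncodeLabelSet;
use prometheus_client::metrics::{family::Family, gauge::Gauge};

// Stable labels only: no `address` field, so ephemeral-port churn can
// no longer mint new series.
#[derive(Clone, Debug, Hash, PartialEq, Eq, EncodeLabelSet)]
struct PeerInfoLabels {
    peer_id: String,
    slot: u32,
}

fn on_peer_disconnect(discovered_peers: &Family<PeerInfoLabels, Gauge>, labels: &PeerInfoLabels) {
    // Before: `discovered_peers.get_or_create(labels).set(i64::MIN)`
    // left a tombstone series behind forever. Now the series is dropped:
    discovered_peers.remove(labels);
}
```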

## Tests

Added a `#[cfg(test)] mod tests` in `metrics.rs` that scrapes the
registry to text and counts `<metric>{...}` series. A shared
`metric_series_count(registry, name)` helper (sketched after the list)
drives all the assertions:

`discovered_peers`
- `discovered_peers_bounded_under_ephemeral_port_churn` — 500
connect/disconnect cycles for one peer with varying ports → final series
count = 0.
- `discovered_peers_single_series_per_connected_peer` — 100
reconnections under different addresses → series count = 1.
- `discovered_peers_cardinality_tracks_connected_peers` — 100 distinct
peers connect (count = 100), then all disconnect (count = 0).

`peer_mesh_membership`
- `peer_mesh_membership_pruned_on_leave` — 200 join/leave cycles for one
topic → final series count = 0.
- `peer_mesh_membership_pruned_on_disconnect` — peer joins 3 topics
(count = 3), then disconnects (count = 0).

`explicit_peers`
- `explicit_peers_pruned_on_stale` — 200 record/mark-stale cycles for
distinct peers → series count = 0.
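
One plausible shape for the `metric_series_count` helper, using
`prometheus-client`'s text encoder (the helper name comes from the
description above; its body here is an assumption):

```rust
use prometheus_client::encoding::text::encode;
use prometheus_client::registry::Registry;

/// Scrape the registry to the text exposition format and count the
/// series lines belonging to the given metric name.
fn metric_series_count(registry: &Registry, name: &str) -> usize {
    let mut output = String::new();
    encode(&mut output, registry).expect("text encoding into a String cannot fail");
    let prefix = format!("{name}{{");
    output.lines().filter(|line| line.starts_with(&prefix)).count()
}
```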

---

### PR author checklist

#### Contribution eligibility

- [ ] I am a core contributor, OR I have been explicitly assigned to the
linked issue
- [ ] I have read [CONTRIBUTING.md](/CONTRIBUTING.md) and my PR complies
with all requirements
- [ ] I understand that PRs not meeting these requirements will be
closed without review

#### For all contributors

- [ ] Reference a GitHub issue
- [ ] Ensure the PR title follows the [conventional
commits][conv-commits] spec
- [ ] Add a release note in [`RELEASE_NOTES.md`](/RELEASE_NOTES.md) if
the change warrants it
- [ ] Add an entry in [`BREAKING_CHANGES.md`](/BREAKING_CHANGES.md) if
the change warrants it

#### For external contributors

- [ ] Maintainers of Malachite are [allowed to push changes to the
branch][gh-edit-branch]

[conv-commits]: https://www.conventionalcommits.org/en/v1.0.0/#summary
[gh-edit-branch]:
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests

---------

Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
Co-authored-by: Sergio Mena <sergio.mena@circle.com>

romac merged commit 0422137 into circlefin:main on May 11, 2026
19 checks passed
romac deleted the chore/port-downstream-fork-commits-2 branch on May 11, 2026 at 10:44