perf: make m_peer_mutex a SharedMutex (#6468)
Conversation
Guix Automation has begun to build this PR tagged as v22.1.0-devpr6468.fba3a04a. A new comment will be made when the image is pushed.
Guix Automation has completed; a release should be present here: https://github.com/dashpay/dash-dev-branches/releases/tag/v22.1.0-devpr6468.fba3a04a. The image should be on dockerhub soon.
Force-pushed from fba3a04 to f044507.
✅ No Merge Conflicts Detected: This PR currently has no conflicts with other open PRs.
Guix Automation has begun to build this PR tagged as v23.0.0-devpr6468.f0445074. A new comment will be made when the image is pushed.
Below is the aggregated contention data I collected on masternode-1 on testnet during 2 minutes of ~8 tps, so lots of tx locking was happening.

Also, when running a local functional test measuring islock latency, I saw this contention (contention detected 563 times). After this change, there is no contention over the lock.
Walkthrough

This change replaces the exclusive Mutex guarding m_peer_map with a SharedMutex: readers now take shared locks concurrently, while insert/remove paths keep exclusive locks.

Sequence diagram (summary): old flow — readers serialize on the exclusive lock; new flow — readers share the lock and only writers block each other. Colored blocks — shared reads (green), exclusive writes (orange).

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ❌ 1 failed check (warning), ✅ 2 passed.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/net_processing.cpp (1)
2515-2532: Fix lifetime hazard: raw Peer* used after releasing m_peer_mutex.

You store `Peer*` in `best[...]` under a shared lock, release the lock, then call `PushAddress(*best[i].second, …)`. If a peer is removed concurrently, this can become a dangling pointer (UAF). Capture `PeerRef` instead to keep the object alive post-unlock. Apply this diff:

```diff
-    std::array<std::pair<uint64_t, Peer*>, 2> best{{{0, nullptr}, {0, nullptr}}};
+    std::array<std::pair<uint64_t, PeerRef>, 2> best{{{0, {}}, {0, {}}}};
@@
-    best[i] = std::make_pair(hashKey, peer.get());
+    best[i] = std::make_pair(hashKey, peer);
@@
-    for (unsigned int i = 0; i < nRelayNodes && best[i].first != 0; i++) {
-        PushAddress(*best[i].second, addr, insecure_rand);
-    }
+    for (unsigned int i = 0; i < nRelayNodes && best[i].first != 0; i++) {
+        PushAddress(*best[i].second, addr, insecure_rand);
+    }
```

This mirrors the safe pattern used elsewhere (e.g., collecting `PeerRef`s in `AskPeersForTransaction`) and removes the UAF risk.
🧹 Nitpick comments (1)
src/net_processing.cpp (1)
2370-2375: Shrink critical section to reduce contention (optional).

Current loops hold `m_peer_mutex` (shared) while acquiring per-peer locks and doing work. Consider:

- Collecting `std::vector<PeerRef>` under `READ_LOCK`, then releasing `m_peer_mutex` and processing each peer (call `GetInvRelay()`, `PushInv()`, etc.).
- This reduces the time readers hold `m_peer_mutex` and lets writers (insert/remove) proceed sooner.

Also applies to: 2463-2471, 2476-2482
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between e5ce1f0 and f0445074a18e4961f3be8880dbc8a7268bcd1f4e.
📒 Files selected for processing (1)
src/net_processing.cpp (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.{cpp,h,cc,cxx,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.{cpp,h,cc,cxx,hpp}: Dash Core C++ code must be written in C++20 and requires at least Clang 16 or GCC 11.1
Dash uses unordered_lru_cache for efficient caching with LRU eviction
Files:
src/net_processing.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: win64-build / Build source
- GitHub Check: Lint / Run linters
- GitHub Check: arm-linux-gnueabihf / Build depends
- GitHub Check: x86_64-pc-linux-gnu / Build depends
🔇 Additional comments (4)
src/net_processing.cpp (4)
1731-1734: GetPeerRef under shared lock — LGTM. Read-only map lookup with `READ_LOCK(m_peer_mutex)` is correct; returns `PeerRef` to extend lifetime post-unlock.

2304-2306: SendPings: shared lock + atomic write — LGTM. Iterating under `READ_LOCK` and setting `std::atomic<bool> m_ping_queued` is safe.

2314-2324: AskPeersForTransaction: shared lock to collect PeerRefs — LGTM. Collecting `PeerRef`s under `READ_LOCK` then acting under `cs_main` avoids holding `m_peer_mutex` across downstream locks.

2382-2395: RelayDSQ: shared lock usage — LGTM. Read-only iteration and deferring sends via `nodes_send_all` after unlocking is sound.
```diff
@@ -819,7 +819,7 @@ class PeerManagerImpl final : public PeerManager
     /** Protects m_peer_map. This mutex must not be locked while holding a lock
      * on any of the mutexes inside a Peer object. */
-    mutable Mutex m_peer_mutex;
+    mutable SharedMutex m_peer_mutex;
```
🧩 Analysis chain
SharedMutex swap looks good; confirm macro support/fairness.

Change to SharedMutex is appropriate for read-heavy access. Please confirm:

- `READ_LOCK(...)` is defined by our sync wrappers and mapped to a shared-lock type.
- No writer-starvation policy surprises on our platforms (clang/gcc toolchains).
🏁 Script executed:

```bash
#!/bin/bash
# Ensure shared-lock macros/types exist and are used consistently
rg -nP '(^|[^A-Z_])READ_LOCK\s*\(' --type=cpp
rg -nP 'class|using\s+SharedMutex|struct\s+SharedMutex' --type=cpp
rg -nP 'LOCK_SHARED\s*\(' --type=cpp
```

Length of output: 205545
I can confirm the macro support is in place. Now I need to verify the actual code at the flagged lines to assess the lifetime hazard mentioned in the scratchpad.
🏁 Script executed:

```bash
#!/bin/bash
# Inspect RelayAddress and _RelayTransaction code context for potential UAF issues
echo "=== RelayAddress context (line 2515-2528) ==="
sed -n '2510,2535p' src/net_processing.cpp
echo ""
echo "=== _RelayTransaction context (line 2463-2471) ==="
sed -n '2458,2480p' src/net_processing.cpp
echo ""
echo "=== Check if PeerRef is defined/used ==="
rg -n "struct PeerRef|class PeerRef|using PeerRef" src/net_processing.cpp
```

Length of output: 1797
🏁 Script executed:

```bash
#!/bin/bash
# Inspect RelayAddress and _RelayTransaction code context for potential UAF issues
echo "=== RelayAddress context (line 2515-2528) ==="
sed -n '2510,2535p' src/net_processing.cpp
echo ""
echo "=== _RelayTransaction context (line 2463-2471) ==="
sed -n '2458,2480p' src/net_processing.cpp
echo ""
echo "=== Check if PeerRef is used and defined ==="
rg -n "PeerRef|using.*Ref" src/net_processing.cpp | head -20
```

Length of output: 2935
Fix the lifetime hazard in RelayAddress where raw peer pointers are used after releasing the lock.

The `READ_LOCK(...)` macro is properly supported (defined at src/sync.h:343 and mapped to `ReadLock<decltype(cs)>` on `std::shared_mutex`). However, RelayAddress stores raw `Peer*` pointers in the `best` array at line 2522, then uses them after the lock scope ends at line 2532. If a peer is removed from m_peer_map between lock release and the PushAddress call, this becomes a use-after-free.

Lines 2515-2528 (RelayAddress): capture `PeerRef` (shared_ptr, line 439) instead of raw pointers to extend peer lifetime beyond the lock. Replace `peer.get()` with the shared_ptr directly, or collect into `std::vector<PeerRef>` like AskPeersForTransaction does (line 2310).

The other `READ_LOCK` sites (_RelayTransaction, RelayRecoveredSig) are safe because all peer access occurs within the lock scope.
🤖 Prompt for AI Agents
src/net_processing.cpp lines 2515-2528: RelayAddress currently collects raw
Peer* pointers into `best` while holding the READ_LOCK, then uses them after the
lock is released causing a potential use-after-free; change the collection to
capture PeerRef (the shared_ptr alias used for peers) instead of raw pointers
(e.g., gather a std::vector<PeerRef> while inside the READ_LOCK), replace later
uses of `peer.get()` with the stored PeerRef, and then call PushAddress/other
post-lock APIs using the preserved PeerRef so peers remain alive after the lock
is released.
Please look at this again;
For the first case, it appears to me the READ_LOCK is held through the destruction of the best array. I don't see a UAF.
@PastaPastaPasta You’re right — in RelayAddress the READ_LOCK(m_peer_mutex) is acquired in the function scope and remains held through both loops, including the PushAddress calls. With the shared lock held, m_peer_map cannot be modified by writers, so the Peer* captured into best stays valid during its use. No UAF there. My earlier concern was unfounded — thanks for the nudge.
Optional: if you ever want to reduce the time m_peer_mutex is held, you could collect PeerRef entries under the read lock, release it, then call PushAddress on those refs. Not necessary for correctness here, just a potential micro-optimization.
|
Guix Automation has completed; a release should be present here: https://github.com/dashpay/dash-dev-branches/releases/tag/v23.0.0-devpr6468.f0445074. The image should be on dockerhub soon. |
kwvg left a comment:

utACK f0445074a18e4961f3be8880dbc8a7268bcd1f4e
This pull request has conflicts, please rebase.
Force-pushed from f044507 to 43fd60f.
Actionable comments posted: 0
🧹 Nitpick comments (6)
src/net_processing.cpp (6)
2293-2298: SendPings: read-lock + atomic write is safe; consider shortening lock scope.

Optional: collect PeerRef under the read lock, then set m_ping_queued outside to minimize time holding m_peer_mutex.

```diff
-    READ_LOCK(m_peer_mutex);
-    for(auto& it : m_peer_map) it.second->m_ping_queued = true;
+    std::vector<PeerRef> peers;
+    {
+        READ_LOCK(m_peer_mutex);
+        peers.reserve(m_peer_map.size());
+        for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+    }
+    for (const auto& peer : peers) {
+        peer->m_ping_queued = true;
+    }
```
2361-2366: RelayInv: avoid holding m_peer_mutex while pushing to peers.

Optional: mirror the AskPeersForTransaction pattern to shorten lock time.

```diff
-    READ_LOCK(m_peer_mutex);
-    for (const auto& [_, peer] : m_peer_map) {
-        if (!peer->GetInvRelay()) continue;
-        PushInv(*peer, inv);
-    }
+    std::vector<PeerRef> peers;
+    {
+        READ_LOCK(m_peer_mutex);
+        peers.reserve(m_peer_map.size());
+        for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+    }
+    for (const auto& peer : peers) {
+        if (!peer->GetInvRelay()) continue;
+        PushInv(*peer, inv);
+    }
```
2370-2393: RelayDSQ: consistent with shared lock; optionally move INV path out of lock. You already defer DSQ::ALL sends via a NodeId vector. For DSQ::INV, consider collecting PeerRef under READ_LOCK, then PushInv after releasing it to further reduce contention.
2453-2464: _RelayTransaction: shorten lock scope while broadcasting.

Optional: collect PeerRef, then PushInv outside the READ_LOCK.

```diff
-    READ_LOCK(m_peer_mutex);
-    for(auto& it : m_peer_map) {
-        Peer& peer = *it.second;
-        auto tx_relay = peer.GetTxRelay();
-        if (!tx_relay) continue;
-        PushInv(peer, inv);
-    };
+    std::vector<PeerRef> peers;
+    {
+        READ_LOCK(m_peer_mutex);
+        peers.reserve(m_peer_map.size());
+        for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+    }
+    for (const auto& peer : peers) {
+        if (auto tx_relay = peer->GetTxRelay()) {
+            PushInv(*peer, inv);
+        }
+    }
```
2468-2475: RelayRecoveredSig: same optional lock-narrowing.

Collect PeerRef under READ_LOCK, then PushInv after releasing it.

```diff
-    READ_LOCK(m_peer_mutex);
-    for (const auto& [_, peer] : m_peer_map) {
-        if (peer->m_wants_recsigs) {
-            PushInv(*peer, inv);
-        }
-    }
+    std::vector<PeerRef> peers;
+    {
+        READ_LOCK(m_peer_mutex);
+        peers.reserve(m_peer_map.size());
+        for (const auto& [_, peer] : m_peer_map) peers.emplace_back(peer);
+    }
+    for (const auto& peer : peers) {
+        if (peer->m_wants_recsigs) {
+            PushInv(*peer, inv);
+        }
+    }
```
2504-2525: RelayAddress: current approach is safe; optional lifetime-friendly tweak.

With READ_LOCK held across both loops, raw Peer* remains valid — no UAF. If you want to release m_peer_mutex before PushAddress, store PeerRef instead of raw pointers.

```diff
-    std::array<std::pair<uint64_t, Peer*>, 2> best{{{0, nullptr}, {0, nullptr}}};
+    std::array<std::pair<uint64_t, PeerRef>, 2> best{{{0, {}}, {0, {}}}};
@@
-    READ_LOCK(m_peer_mutex);
-    for (auto& [id, peer] : m_peer_map) {
+    {
+        READ_LOCK(m_peer_mutex);
+        for (const auto& [id, peer] : m_peer_map) {
@@
-    best[i] = std::make_pair(hashKey, peer.get());
+    best[i] = std::make_pair(hashKey, peer);
@@
-    };
+    } // release m_peer_mutex
@@
-    PushAddress(*best[i].second, addr, insecure_rand);
+    PushAddress(*best[i].second, addr, insecure_rand);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between f0445074a18e4961f3be8880dbc8a7268bcd1f4e and 43fd60f.
📒 Files selected for processing (1)
src/net_processing.cpp (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: linux64-build / Build source
- GitHub Check: arm-linux-build / Build source
- GitHub Check: linux64_nowallet-build / Build source
- GitHub Check: x86_64-apple-darwin / Build depends
- GitHub Check: x86_64-pc-linux-gnu_multiprocess / Build depends
- GitHub Check: Lint / Run linters
- GitHub Check: x86_64-w64-mingw32 / Build depends
- GitHub Check: x86_64-pc-linux-gnu / Build depends
- GitHub Check: Lint / Run linters
🔇 Additional comments (3)
src/net_processing.cpp (3)
815-824: SharedMutex swap is correct for read-dominant access. m_peer_map still uses exclusive LOCKs on mutation (InitializeNode/RemovePeer). An appropriate upgrade.

1724-1729: GetPeerRef now uses READ_LOCK — safe lifetime. Returning PeerRef under a shared lock, then releasing, is correct per the comment that a shared_ptr keeps the peer alive.

2304-2315: AskPeersForTransaction: good pattern. Collecting PeerRef under READ_LOCK and using them after releasing the lock is exactly the right approach here.
…ntion

e918729 refactor: add CacheTipHeight, improve cache API safety (UdjinM6)
493d077 refactor: simplify optional usage (pasta)
0e1f114 instantsend: cache heights and reduce cs_main contention (pasta)

Pull request description:

## Issue being fixed or feature implemented

Today's testing on testnet showed that under load, cs_main was a major contender (however, note the major contenders from before #6912 / #6468 are now significantly reduced):

```
========================================================================================================================
TOP 40 LOCATIONS BY TOTAL CONTENTION TIME (AGGREGATE)
========================================================================================================================
Lock Name                        Location                         Count  Total(μs)  Avg(μs)   Max(μs)  Nodes
------------------------------------------------------------------------------------------------------------------------
cs                               llmq/signing_shares.cpp:507      11952  81357095   6807.0    299734   8
::cs_main                        instantsend/signing.cpp:210      75241  62985844   837.1     93240    8
cs_main                          index/base.cpp:340               29584  27040349   914.0     1613365  8
cs_main                          net_processing.cpp:5462          69291  19784473   285.5     1329191  8
cs                               llmq/signing_shares.cpp:1732     2355   17466035   7416.6    127471   8
::cs_main                        llmq/quorums.cpp:1214            38975  15478543   397.1     883884   8
m_nodes_mutex                    net.cpp:2043                     33457  14736059   440.4     31622    8
::cs_main                        llmq/quorums.cpp:526             10587  10577503   999.1     2083501  8
pnode->cs_vSend                  net.cpp:2435                     68528  9671050    141.1     37265    8
cs_main                          net_processing.cpp:4196          4945   8608157    1740.8    1326146  8
::cs_main                        instantsend/instantsend.cpp:272  5602   8416269    1502.4    1260997  8
cs                               txmempool.cpp:1319               7942   8059684    1014.8    356199   8
::cs_main                        validation.cpp:3747              309    6468144    20932.5   1225274  8
::cs_main                        validation.cpp:6009              21855  3393152    155.3     228195   8
pnode->cs_vSend                  net.cpp:4709                     7532   2299813    305.3     41543    8
m_nodes_mutex                    ./net.h:1374                     6229   1998506    320.8     15593    8
inv_relay->m_tx_inventory_mutex  net_processing.cpp:1169          7687   1871859    243.5     16292    8
cs_db                            instantsend/db.cpp:239           4323   1527297    353.3     20779    8
cs_cache                         spork.cpp:244                    18083  1331472    73.6      27722    8
::cs_main                        chainlock/signing.cpp:58         98     1312496    13392.8   1273691  8
cs_db                            ./instantsend/db.h:139           4381   1305738    298.0     29777    8
mutexMsgProc                     net.cpp:2604                     4877   1243737    255.0     24641    8
```

- Add unordered_lru_cache for block heights in CInstantSendManager
- Provide GetTipHeight() (non-optional) and GetBlockHeight() helpers
- Use cache in InstantSendSigner::CheckCanLock
- Cache-first for cycleHash height in ISLOCK message path and batch verify
- Cache-first mined-height for HasChainLock and WriteInstantSendLockMined

## What was done?

## How Has This Been Tested?

## Breaking Changes

_Please describe any breaking changes your code introduces_

## Checklist:

_Go over all the following points, and put an `x` in all the boxes that apply._

- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have added or updated relevant unit/integration/functional/e2e tests
- [ ] I have made corresponding changes to the documentation
- [ ] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_

ACKs for top commit:
UdjinM6: utACK e918729

Tree-SHA512: 1ea45e92c717dbefe5625bba3b27612b9ffbb21ea106d11005d78836fbe6ce00503e3f3fdb927fb405b09fd795f50137ca2beb07351bf94488d2af5694e543ba
Issue being fixed or feature implemented
m_peer_mutex is a great use case for a shared mutex; it's private, has a well-contained scope, and 99% of usages are reads. The only time we hold an exclusive lock is when adding or removing a peer, which happens rarely compared to all other usages.
On the latest nightly, this mutex only represents about 2% of contention (all in PeerRef PeerManagerImpl::GetPeerRef); however, I intend to move mnauth / verified-protx related state into PeerManager, and I expect that will result in significantly more contention over m_peer_mutex, making this change more valuable.
I expect this will reduce contention over this mutex to ~0.
What was done?
How Has This Been Tested?
On testnet
Breaking Changes
Checklist:
Go over all the following points, and put an `x` in all the boxes that apply.