Optimize DHT Symbol Retrieval: Primary-Provider Waves, Streamed Writes, and Bounded Concurrency#227
Merged
mateeullahmalik merged 1 commit intomasterfrom Nov 7, 2025
Merged
Conversation
…s, and Bounded Concurrency
fb94027 to
f4877b0
Compare
j-rafique
approved these changes
Nov 7, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🧩 Summary
This PR completely overhauls the symbol retrieval pipeline to make it production-ready, memory-efficient, and network-optimized.
It introduces a streaming retrieval mechanism (
BatchRetrieveStream) that:Streams symbols directly to disk in the RaptorQ workspace — no in-memory accumulation.
Performs local-first retrieval using
RetrieveBatchValuesin 5k-key batches.Introduces a deterministic primary-provider algorithm to ensure that each key is initially requested from exactly one node, maximizing unique symbol coverage.
Adds multi-wave fallback logic — if primaries fail or are slow, subsequent waves fetch missing keys from alternate top-K nodes.
Implements strict per-node payload caps (
perNodeRequestCap = 600⇒ ~36 MB @ 60 KB/symbol).Enforces two-level concurrency limits for predictable performance (
fetchSymbolsBatchConcurrency = 8,storeSameSymbolsBatchConcurrency = 4).Uses atomic early-stop and cancellation to halt network fetches the moment 17 % symbol-threshold is reached.
Maintains global de-duplication via a concurrent
resSeenmap to avoid duplicate writes or wasted network calls.🧠 Design Goals
🧪 Testing Plan
Unit:
Mock DHT with in-memory store → verify early exit when
foundLocalCount ≥ required.Simulate 50 nodes with XOR spread → confirm unique provider per key per wave.
Validate per-node request count ≤ 600.
Integration (testnet):
Spin up 50-validator cluster.
Upload 1 GB file → measure retrieval time, memory footprint, and bandwidth.
Verify file reconstruction hash matches action’s
dataHash.Validate logs show expected early-stop (
found_network ≥ needNetwork).🧾 Key Files Changed
pkg/dht/retrieve_stream.goNew
BatchRetrieveStream,processBatchStream, anditerateBatchGetValuesStreamwith streaming and waves.pkg/dht/local.goAdded
fetchAndWriteLocalKeysBatchedfor batched local streaming.pkg/dht/constants.goAdded tuning constants and
perNodeRequestCap.🧰 Backwards Compatibility
✅ 100 % backwards-compatible.
Old
BatchRetrieveAPI remains untouched;BatchRetrieveStreamis a non-breaking enhancement used by cascade restore path.🧠 Reviewer Notes
Verify
doBatchGetValuesCallobservesctxdeadlines (it should).Confirm
s.ht.closestContactsWithIncludingNodereturns up-to-date routing info (top-K list correctness).Look out for any log spam; consider lowering some debug levels to trace if necessary.
✅ Checklist
Local streaming verified
Primary-provider waves tested
Concurrency limits validated (8×4 = 32 RPCs max)
No over-fetch or memory blow-up
File reconstruction verified via hash match
Integration test passed on 50-validator testnet
TL;DR:
This PR makes the supernode’s symbol retrieval fast, predictable, and production-grade — zero memory pressure, bounded concurrency, and smarter use of the network.