WIP: Enable downloading of transaction sets in parallel with early SCP stages by bboston7 · Pull Request #5209 · stellar/stellar-core

bboston7 · 2026-04-08T00:08:08Z

This is very much a work-in-progress. This is littered with TODOs that I am actively chipping away at. I've opened this draft PR as an easier way to point people towards the prototype as it evolves into a production-ready feature.

I try to keep this branch in a good enough state such that it:

Compiles
Can join the network
Can run on supercluster

Therefore, I tend to do most work on other branches and merge it in here when ready in larger single commits.

I plan to split this whole thing into multiple smaller PRs when it's time to actually merge it in.

bboston7 · 2026-04-16T19:06:19Z

I just rebased this on master and force pushed up so we can get a CI run. Ignore the terrible branch name; this is built on top of protocol 26, not 25.

Some early notes Item fetcher note More notes Notes Interface improvements Add way to get time spent waiting for tx hash Formatting Expose download times, and check in ballot protocol Find actual "step 3" location Notes Consider values being downloaded as valid for nomination Allow partially validated values to step 3 Fix nomination validation check Allow proceeding without tx set Logging More general logging More targeted log messages More super detailed logging More logging Improve pending envelope flow Handle combining candidates Sim network setup Manually feed SCP messages to node0 Loadgen Catchup two ledgers worth Optional checks fmt Gate feature on app being synced Lots of log changes and sanity checks Fix loadgen issues Allow processed messages through Add `feedTxSetFromStatement` Cleanup Add ability to feed all tx sets from a given slots Feed everywhere Test different injection points TODO Check wait time Revert "Check wait time" This reverts commit 0f8d1e4. Fix test for cases where the voted on value changes Harden test to handle case where first slot experiences issue Reapply "Check wait time" This reverts commit 53719f7. Outline skip ledger test Add support for message filtering Reliably force skip ledger condition Debug timers Simple vote-to-skip based on static timeout Checkpoint Builds Add `isSkipLedgerValue` Fix test instances Downstream skip ledger handling Take largest existing skip value when available todo Downstream skip ledger handling Replace v_3 v_2 Update some comments Bump xdr Handle high level union TODO signature checking Generate new skip values Don't short circuit on stall Refactor skip location Programmatically check that SKIP is externalized Better skip checking Metrics New (nonworking) test Fix vote reversal test Configurable timeout Bump default timeout Move debug messages to own partition Add new metric tracking time blocked during balloting Some testing notes Add some low level skip tests Fix errors introduced in rebase Cleanup + triage TODOs Fix lifetimes of tx set objects Formatting Handles todos 1 and 2 Fix todo 3 TODO 6 TODOs 7, 9, and 10 Add some details around TODOs 11 and 12 Value -> StellarValue error handling Add new fields to better construct skip value tx sets Replace `kInvalidValue` with skip in `maybeReplaceValueWithSkip` Remove todo 16 Clarify envelope parsing order Remove todo 20 Remove todo 21 Error handling around missing wait times Clean up TODOs around stall point Clean up catchup related TODOs Remove TODO(30) Remove TODO(31) Remove todos around min validation level Disallow skip values in nomination

Fix `getStatementValues` to extract all values, move validtion relax

We expect an extra cache hit now from the additional validation check before setting `mCommit` in the ballot protocol

marta-lokhova · 2026-04-22T20:44:30Z

            return ret;
        }
+
+        // If no more fully ready envelopes, proceed to processing partially


This might be an implementation detail or a naming issue, but doesn't having a single ready envelope for the slot imply all partial envelopes become ready?

I guess it depends on which envelopes are "ready" - if we can't vote any new values, then a ready envelope would imply all pending envelopes are ready.

doesn't having a single ready envelope for the slot imply all partial envelopes become ready

Not necessarily. There could be multiple Values on the network for a given slot, and the local node may have some but not all of them. One way this can happen is if nomination has multiple round leaders (timeouts have occurred) proposing different values. Processing "ready" envelopes first nudges the network towards values that the local node sees as available in the case of multiple values on the network.

You're right though that in the common case (a single round leader), either all envelopes will be "ready" or they'll all be "partially ready".

marta-lokhova · 2026-04-22T20:50:00Z

+        // accepted so that checkHeardFromQuorum can see a quorum and arm
+        // the ballot timer. The commit gate in setConfirmPrepared
+        // independently validates values before voting to commit.
+        return SCPDriver::kFullyValidatedValue;


Hmm, generic validation function returning a blanket "valid" for the whole PREPARE phase seems really sketchy - could you please walk me through what necessitates this change? Is this related to the relaxed validity checks we discussed before?

This might mean we need to rewrite the validation flow a bit (for example, I'm not sure why we'd validate values at commit time outside of this function)

This is exactly the relaxation we needed on PREPARE statements. The only things we were checking before were p, p', and ballot, but with parallel downloading these values may be legetimately invalid. I'm not really sure what's left to check here.

I'm not sure why we'd validate values at commit time outside of this function

I think I addressed this in the comment, but the added check at commit time isn't related to this. It's mainly to provide a more useful error if someone forgets to update their stellar-core instance.

the issue that I'm seeing is that we're not returning "maybe valid" to indicate relaxed validation, and instead we return "fully valid". It seems error-prone as the caller could misinterpret this result. Is there a reason we don't return "maybe valid" here?

marta-lokhova · 2026-04-22T20:54:17Z

               mSlot.getSlotIndex(), mSlot.getSCP().ballotToStr(c),
               mSlot.getSCP().ballotToStr(h));

+    throwIfValueInvalidForCommit(c.value, "setAcceptCommit");


I think this is related to my other comment, but I don't quite follow why an additional validation is needed here given that CONFIRM messages would have gone through validateValues that should have the tx set by then (or is this just a sanity check?)

This is really just here to offer a better error message when a quorum of nodes pushes through a value that the local node sees as invalid. Without this, we would still crash in that scenario, but we'd crash at the emitCurrentStateStatement call at the end of this function.

I don't think we really care that a malicious quorum can crash the node, but in practice this is most likely to occur when someone forgets to update prior to a protocol boundary. This check lets us tell them why their node crashed.

marta-lokhova · 2026-04-22T20:56:15Z

+{
+    auto validationLevel = mSlot.getSCPDriver().validateValue(
+        mSlot.getSlotIndex(), value, /*nomination=*/false);
+    if (validationLevel != SCPDriver::kInvalidValue)


Hmm, this might be an abstraction/implementation issue, but how come it's safe to have the value in "waiting for download" state at commit time?

Ah, this is a weird one. There are two ways to get to this function:

Via setConfirmCommit. setConfirmCommit is only called when the hint ballot is a CONFIRM or EXTERNALIZE ballot. PendingEnvelopes holds those back if they aren't fully downloaded, so a kAwaitingDownload value cannot end up here.

Via setAcceptCommit. Unlike setConfirmCommit, you can wind up at setAcceptCommit via a PREPARE hint ballot. In that case however, a quorum has already determined the value to be valid, so it must be valid. It's safe to set mCommit in that instance. It's worth noting that the node will naturally stall after this, as it cannot make further progress into the CONFIRM phase without the transaction set because PendingEnvelopes will hold partially fetched CONFIRM messages back. This is the one weird case where it's OK to have a kAwaitingDownload value in c.

Again, this function is technically unnecessary altogether. It really is just there to emit a better error message in a single edge case. I added it because of some confusion that came up in a model checker run where the model checker ran a scenario with a malicious quorum. I'm curious whether you think its existence makes the code more confusing overall, in which case I can remove it or better document its purpose.

marta-lokhova · 2026-04-22T21:05:15Z

+            auto maybeQSet = mApp.getHerder().getQSet(qSetHash);
+            if (maybeQSet)
+            {
+                processSCPQueue(true);


is this correct? wouldn't it be possible to have other message types in this queue, which would cause them to get processed as well?

I think that's fine? Any CONFIRM or EXTERNALIZE messages in the queue would have fully downloaded tx sets and should be safe to process. The "partially ready" queue only contains nomination and PREPARE messages, which should also be safe to process.

I don't see how this is different than the processSCPQueue call in the Herder::ENVELOPE_STATUS_READY case. In either case, all messages in the queue should safe to process, even though you may end up processing messages other than the one you just received.

Do you see a scenario where it would be wrong to process other message types here?

My worry is processing messages in unexpected order, such that it trips downstream invariance (e.g. value must be downloaded, but we accidentally fed the message that's not fully ready, crashing the node).

I don't see how this is different than the processSCPQueue call in the Herder::ENVELOPE_STATUS_READY case.

Do we understand why? Messages can be received in any order, so let's say we feed a ballot message with 1 value followed by a nomination message referencing 3 values that aren't all downloaded. I don't quite understand how this wouldn't trip SCP.

Oh, is this due to isNewerStatement checks in SCP itself? If so, that seems quite brittle at the two subsystems have an implicit correctness dependency between them that isn't documented. I know this was here before, but what worries me is that we're building on top of badness, making this even harder to maintain and reason about.

value must be downloaded, but we accidentally fed the message that's not fully ready, crashing the node

If the value must be downloaded (such as for CONFIRM and EXTERNALIZE), then the envelope isn't considered ready or partially ready and isn't in either queue, so processSCPQueue won't pass it on.

Messages can be received in any order, so let's say we feed a ballot message with 1 value followed by a nomination message referencing 3 values that aren't all downloaded. I don't quite understand how this wouldn't trip SCP.

I don't understand why this would be a problem? Nomination can run concurrently with balloting. In this case, we'd treat the undownloaded values as valid in nomination. If balloting times out and hasn't set mValueOverride, it'll pick up the latest composite candidate from nomination. If it has set mValueOverride (and therefore set h), it will stick with the value from balloting. This is no different than the situation without parallel downloading.

Following up about this (and the other comments on HerderImpl) to capture some discussion from our meeting.

The plan is to try pushing much of this logic (and related logic in PendingEnvelopes) down to SCP. My first thought is that it would live in Slot. PendingEnvelopes would treat all messages with downloaded qsets as "fully ready" (there would be no "partially ready" queue anymore) and pass them on to Slot. Slot would then interface with item fetcher as needed, and manage which values get passed to nomination/ballot protocol, and which get held back (CONFIRM and EXTERNALIZE messages without downloaded qsets). Slot would also handle delivering / replaying messages as tx sets arrive, potentially via a callback registered with ItemFetcher.

The first step here is to draw out the various data flows between Herder, PendingEnvelopes, ItemFetcher, and SCP. From there we can decide how to change these interfaces.

marta-lokhova · 2026-04-22T21:10:01Z

+        // a NOMINATE or PREPARE message. Technically both of these criteria
+        // should be properly handled downstream, but this provides some
+        // additional assurance.
+        if (mApp.getState() == Application::State::APP_SYNCED_STATE &&


Hmm, herder having to deal with the guts of SCP and differentiate between nomination, prepare and commit seems like an improper abstraction/separation of responsibilities situation. It also seems quite brittle - what if we change SCP in the future such that some CONFIRM messages would be allowed and some won't? or some PREPARE messages should be stalled?

It looks to me like the main problem is the weird mechanism in herder (which existed prior to your changes), which decides when to feed new messages to SCP. Does it really need to live in herder? Have you considered moving the mechanism inside of SCP, so it can automatically decide when to advance the protocol vs when to stall (then re-feed the messages back to the protocol).

Yeah, I see what you mean. I'll try moving this into SCP after wrapping up the test harness work.

marta-lokhova · 2026-04-22T21:12:37Z

+    }
+    break;
+    default:
+        // Value is valid or maybe valid, so we shouldn't replace it with skip


just wanted to clarify: this function should never be called with maybe valid state value right?

You can end up with kMaybeValidValue here when processing messages for slots beyond LCL+1.

Hm, based on what we discussed last week, I was under impression we weren't going to allow skip values for future slots - so it's unexpected to see maybe swapping values for slots we can't actually validate. Possibly this is related to the existing implementation of SCP and the hoops we have to jump through to get skip value to work. Is this because of how bumpState is implemented for all slots, current and future?

Ah, I think there might be a bug here. The intent was that this only had an impact on the current slot, but once I added the kInvalidValue arm to the switch statement, it can technically fire for future slots containing invalid values.

The root issue is that the BallotProtocol itself doesn't know whether this is a future slot or not (that's a Herder concern). I think what we might want is to make validateValue communicate a bit more information back. It should probably tell us whether this value belongs to a future slot or not.

Additionally, if we add more details about why a value is invalid (structural vs something related to the tx set), then we can do a bit more validation for PREPARE messages as well. I need to think about that a bit to make sure it wouldn't break anything.

bboston7 force-pushed the skip-ledgers-p25 branch from 285df2f to 1692131 Compare April 16, 2026 19:05

bboston7 added 12 commits April 20, 2026 13:26

Relax validation on PREPARE messages

ee87400

Fix `getStatementValues` to extract all values, move validtion relax

Remove outdated HMAC hack

717fbce

Properly handle skip in test feedTxSetFromStatement fn

eb279bd

Disable flaky tests that depend on uncontrollable behavior

0a8ff6b

Add note on malicious quorums to fork test

a06beef

Fix type error from rebase

3f6cd6f

Fix xdr submodule pointer

06f6c2b

Formatting

4173347

Detect and offer useful error on transition to invalid state

22548a7

Bump next submodule

72879c5

Fix issues from rebase

0564c22

bboston7 force-pushed the skip-ledgers-p25 branch from cf854d3 to 0564c22 Compare April 20, 2026 23:50

bboston7 added 2 commits April 21, 2026 13:40

Fix expected validateValue txSet cache counter

5756571

We expect an extra cache hit now from the additional validation check before setting `mCommit` in the ballot protocol

Surface proposer when replacing invalid value with skip

9d96bc4

marta-lokhova requested changes Apr 22, 2026

View reviewed changes

Remove outdated comment

2fcc493

Conversation

bboston7 commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bboston7 commented Apr 16, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bboston7 Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marta-lokhova Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bboston7 commented Apr 8, 2026 •

edited

Loading

bboston7 Apr 22, 2026 •

edited

Loading

marta-lokhova Apr 28, 2026 •

edited

Loading