
Experimental triggerNextLedger timer Change #4865

Open

SirTyson wants to merge 2 commits into stellar:master from SirTyson:externalize-clock

Conversation


@SirTyson SirTyson commented Aug 4, 2025

Description

This adds an experimental flag that, when set, uses the closeTime from the last externalized SCP message as the basis for setting the triggerNextLedger timer.

I include a couple of basic unit tests, making sure that the behavior of the trigger is correct when nodes are drifting and when we have long nomination timeouts. Most of the simulation testing is reported below using this supercluster change: stellar/supercluster#384

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@SirTyson SirTyson force-pushed the externalize-clock branch 2 times, most recently from 538021b to 72a3614 on February 26, 2026 18:49
@SirTyson SirTyson force-pushed the externalize-clock branch from 7a0d9bc to 226b1e7 on March 30, 2026 19:45
@SirTyson SirTyson force-pushed the externalize-clock branch from 226b1e7 to fbbbdc4 on April 14, 2026 23:17
@SirTyson SirTyson force-pushed the externalize-clock branch from 11605e9 to 385758e on April 30, 2026 23:22
@SirTyson SirTyson force-pushed the externalize-clock branch from 785022c to b17483d on May 1, 2026 09:16

SirTyson commented May 1, 2026

Key findings

Overall, the change works as expected. When clocks are mostly synced and most nodes enable the new timer,
ledger age falls to almost exactly 5 seconds, down from around 5.75 seconds without the change. As network
conditions deteriorate (i.e. not all nodes have enabled the flag, or nodes have large clock drift), block
time gradually degrades back up to 5.75 seconds. We have some safeguards around trigger time such that we
"fall back" to the old timer if we think we're drifting from the network. This ensures nodes never schedule
the trigger timer too far into the future and hang, and never trigger ledgers immediately, closing ledgers
too fast and potentially snowballing the network with nomination load.

We do see that nomination timeouts and nomination timing overall increase fairly significantly when clocks
are unsynced, or when the network has a mix of new and old timers. When clocks are synced, we see a slight
increase in nomination timeouts and timing. This is probably just the result of the network speeding up,
where there's less time to do the same amount of work.

However, as clocks become unsynced, we do see a significant increase in timeouts and nomination timing.
While block time is still faster under bad clock conditions in this experimental setting, in actual
network settings it's possible the additional nomination work will result in an overall decrease in
performance. With this change, it's important that we instruct validator operators to keep system clocks
synced with an NTP server, e.g. via a cron job (which is common practice on other blockchains). We should
also have stronger warnings about drifting clocks, given the potential decrease in network performance.

Importantly, while performance does decrease, the network does not get wedged, and degrades back to the
timer we have today. If a node thinks it's out of sync with the externalized timestamp, it falls back
to the ballot-protocol-based timer we use today. At very high levels of drift, we see performance similar
to today's. A conservative upper bound on trigger time, based on the ballot prepare cadence, ensures that
we don't get "stuck" by validators scheduling nomination start into the distant future. We also have a
lower bound, making sure that clock drift does not make the network speed up beyond our target ledger
close time.
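
To make this concrete, here is a minimal sketch of the anchoring-plus-bounds idea described above. The
names (computeNextTrigger, targetCloseTime, maxLookahead) are illustrative rather than the PR's actual
identifiers, and the real implementation in HerderImpl.cpp may clamp instead of falling back in some of
these cases:

  #include <chrono>
  #include <optional>

  using Clock = std::chrono::system_clock;

  // nullopt => fall back to today's ballot-protocol-based trigger timer.
  using TriggerTime = std::optional<Clock::time_point>;

  TriggerTime
  computeNextTrigger(Clock::time_point lastExternalizedCloseTime,
                     Clock::time_point now,
                     std::chrono::seconds targetCloseTime, // e.g. 5s
                     std::chrono::seconds maxLookahead)    // derived from ballot prepare cadence
  {
      // Anchor on the network-agreed close time rather than on when this node
      // happened to finish externalizing and applying the previous ledger.
      auto candidate = lastExternalizedCloseTime + targetCloseTime;

      // Lower bound: a candidate already in the past means our clock runs ahead
      // of the network; don't trigger immediately and speed ledgers up.
      // Upper bound: a candidate too far out means our clock lags the network
      // (or the externalized timestamp is off); don't stall nomination by waiting.
      if (candidate < now || candidate > now + maxLookahead)
      {
          return std::nullopt;
      }
      return candidate;
  }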

We actually see the peak decrease in performance at around 6 seconds of absolute clock drift. This seems to
be the worst case sweet spot where nodes can be maximally out of sync before the conservative safeguards
kick in. At much larger drifts, like 20 seconds, performance is improved and similar to what we see today.

I also tested with a mixed network, where some nodes had the flag enabled while others did not. Block time
gradually decreases based on the percentage of tier 1 nodes that have switched to the new timer. During this
transition, block time can have higher variance, but still stays within the [5, 5.7] second bounds. We also
see higher nomination time and more timeouts for nodes that have switched to the experimental timer.
Non-experimental nodes are not affected. As more and more nodes switch timers, the nomination time of
experimental nodes decreases.

This seems to indicate that this feature is fine as a non-protocol upgrade. Given that the network can proceed
with a mix of timers (with increased load due to nomination timeouts), it seems like this can be introduced as
a feature flag in a point release. We should not gradually flip this flag (i.e. test it out on just SDF nodes
first), but should just make it default true in a release. The flag can be a safeguard if we see issues in
prod, but we should still try to make the switch as atomic as possible. In simulation we saw block times close
to 5 seconds at around 75% adoption, though we still had increased timeouts. A stable block time of 5 seconds
with no degradation in nomination performance occurs as adoption approaches 100%.

Setup

  • Topology 3
    • tier 1 + 70 watchers
    • medium connection density
  • 250 TPS for 10 minutes

The general idea is to mimic pubnet as closely as possible without spinning up hundreds of
nodes. Specifically, we want to make sure tier 1 is not densely connected.

Changes to Metrics

Our traditional ledger.age.current-seconds benchmark is not accurate for determining the actual network
block time. Even on master, this metric shows high variance for all nodes from ledger to ledger. This is
expected, as any node's local ledger age metric depends on its latency from the leader for that given block.
Instead, I'm using the following metric to determine the actual network block time across all nodes, rather
than extrapolating the value from individual nodes:

  avg(
    (
      # seconds per ledger close = 1 / (instantaneous close rate);
      # clamp_min guards against division by zero
      1 / clamp_min(
        irate(stellar_core_ledger_ledger_close_seconds_count{
          kubernetes_namespace=~"$namespace$",
          network=~"^$network$",
          build=~".*${build:regex}$",
          kubernetes_pod_name=~".*${pods:regex}$"
        }[5m]),
        1e-9
      )
    # drop outlier samples of 10 seconds or more before averaging
    ) < 10
  )

We've also added metadata to node names. pX and mY indicate that the node has a drifting
clock, either +X seconds or -Y seconds. Note that actual drift values are in milliseconds, so
these are rounded values. Additionally, nodes with the "expr" string in their name are
running the experimental timer change.

Test Results

First, we want to analyze the change itself, where all nodes have the new timer change. We will then
compare this with a network where no nodes have the change.

In all of the following results, the experimental timer is disabled on the left and enabled on all nodes
on the right.

No Drift

https://grafana.stellar-ops.com/goto/bGk5BZTDg?orgId=1

Block Time

image.png

In our control, we see a block time of 5.7s, similar to pubnet. With the experimental timer, this drops to 5.01 seconds.

Nomination Timeouts

image-1.png

We see an increase in timeouts with the new timer.

Nomination p75

We see an increase in nomination timing with the timer change:

  • mean [0.634s, 1.03s]
  • max [0.764s, 1.33s]

Compared to the control group:

  • mean [0.594s, 0.781s]
  • max [0.702s, 1.16s]

image-2.png

Nomination p99

We see an increase again with the timer change:

  • mean [0.743s, 1.20s]
  • max [0.853s, 2.26s]

Compared to the control group:

  • mean [0.768s, 0.970s]
  • max [1.01s, 1.45s]

image-3.png

2 Second Absolute Drift

All nodes given a random drift uniformly selected from [-1000,+1000] ms.

https://grafana.stellar-ops.com/goto/ttwsLWTvg?orgId=1

Block Time

image-4.png

We see little change compared to synchronized clocks.

Nomination Timeouts

image-5.png

Elevated nomination timeouts observed in both experimental and non-experimental runs.
Experimental timer still has more timeouts.

Nomination p75

We see a significant increase in nomination time with the experimental timer compared to
the non-drifting test. This is directly correlated with the direction in which nodes are drifting.
Nodes drifting behind can experience very low nomination times, with some p75 means around 0.35s. Nodes
drifting ahead had longer nomination times, with upper-bound means around 2.35s. This delta correlates
with the drift delta.

  • mean [0.302, 2.47]
  • max [0.626, 3.52]

Compared to no experimental flags:

  • mean [0.487, 0.754]
  • max [0.806, 3.11]

image-6.png

Nomination p99

  • mean [0.574, 2.9]
  • max [0.856, 4.7]

Compared to no experimental flags:

  • mean [0.552, 0.897]
  • max [0.806, 3.1]

image-7.png

6 Second Absolute Drift

All nodes given a random drift uniformly selected from [-3000,+3000] ms.

https://grafana.stellar-ops.com/goto/3w0a6Kovg?orgId=1

Block Time

At this point, we see higher variance in the experimental timer and slower blocks overall, but still faster than
the control.

image-8.png

Nomination Timeouts

We see a significant increase in timeouts compared to more synced clocks.

image-9.png

Nomination p75

Much higher nomination times as well, with the new timer:

  • mean [0.585, 4.09]
  • max [0.900, 5.37]

Compared to control:

  • mean [0.601, 0.781]
  • max [0.697, 1.07]

image-10.png

Nomination p99

image-11.png

Experimental Timer:

  • mean [1.10, 4.61]
  • max [1.65, 6.53]

Compared to control:

  • mean [0.746, 1.1]
  • max [1.07, 2.15]

20 Second Absolute Drift

All nodes given a random drift uniformly selected from [-10,+10] seconds. At this point,
all gains from the experimental timer are gone, and block time is basically the same as the
control.

https://grafana.stellar-ops.com/goto/I05-W5ovg?orgId=1

Block Time

Basically the same as the control block time with relatively low variance.

image-12.png

Nomination Timeouts

Still significantly greater than the control, but less than the 6 second absolute drift case.

image-14.png

Nomination p75

Still larger than the control, but much improved from the 6 second case. With trigger timer:

  • mean [0.662, 1.97]
  • max [0.759, 4.41]

vs baseline:

  • mean [0.638, 0.781]
  • max [0.720, 1.08]

image-15.png

Nomination p99

  • mean [0.848, 3.58]
  • max [1.17, 5.88]

vs baseline:

  • mean [0.771, 0.940]
  • max [0.960, 2.26]

image-16.png

Extreme bimodal distribution

This is what I originally expected to be the worst-case stress test: 25% of nodes with minor drift (within
1 second), and 75% with a bimodal distribution between [-20, -10] seconds and [+10, +20] seconds. While the
block time was worse, from a nomination standpoint this was more stable than the 6-second absolute drift case.

My laptop died between runs, so they are on separate graphs: baseline,
experimental.

Block Time

Functionally equivalent.

Experimental timer:

image-18.png

Baseline:

image-17.png

Nomination Timeouts

Experimental timer:

image-19.png

Baseline:

image-20.png

Nomination p75

Experimental:

  • mean [0.632, 2.01]
  • max [0.697, 4.37]

image-21.png

Baseline:

  • mean [0.630, 0.758]
  • max [0.724, 1.24]

image-23.png

Nomination p99

Experimental:

  • mean [0.791, 3.24]
  • max [1, 4.92]

image-22.png

Baseline:

  • mean [0.760, 0.959]
  • max [0.931, 1.90]

image-24.png

Network with mix of experimental flag and non-experimental flag

For this test, we used a moderate clock drift of +- 1 second across all nodes. We then ran several
simulations, increasing the number of nodes using the experimental timer at each run.

This grafana board shows several
runs, going from 13%, 34%, 38%, 43%, 66%, 78%, then 90% experimental flag adoption.

Block Time

We see block time decrease gradually as more timers are enabled, but with higher variance in mixed networks.
At around 75% adoption, we achieve most of the block time gains.

image-25.png

Nomination

Whenever a node enables the experimental flag, its nomination timeouts and timing increase based on how much
of the rest of the network has the flag enabled. That is, when fewer nodes on the network have enabled the
timer, those that have enabled it see more timeouts and longer nomination timing. As more nodes adopt the
timer, the magnitude of the degradation across all upgraded nodes lessens, eventually converging on values
close to the non-experimental baseline (assuming little clock drift).

During this in-between phase, only nodes with the new timer seem affected. Even when most of the nodes have
upgraded their timer, those left behind see little increase in nomination timeouts or timing. This gives us
a safe "escape hatch" if we ship the upgrade behind a config flag that can disable the timer. If we see
network degradation, node operators can disable the timer and quickly see nomination timings for their node
return to previous values.

Timeouts

image-26.png

Nomination Timing

image-27.png

Different topologies

Most testing was with topology 3, as this is the "closest" 100-node approximation to pubnet, where tier 1 is
moderately connected but not directly connected. We also tested topology 0 (tier 1 only, complete graph),
topology 1 (100 nodes, tier 1 fully connected), and topology 2 (100 nodes, tier 1 maximally far apart in the
graph). Topologies 0 and 1 performed better than topology 3, which was used for the rest of our tests.
Topology 2 had worse performance, and is the worst-case topology for this change.

While other tests were run at 250 TPS, both the control and the experimental runs failed at this load, so we
reduced to 150 TPS.

Worst case topology, no drift

https://grafana.stellar-ops.com/goto/U5eoCpTvg?orgId=1

We tested the worst-case topology with "realistic" NTP drift (±100ms).
We found that block time was reduced to 5s and stable.

image-28.png

Nomination timeouts and timing increased, more so than in topology 3. While elevated relative to the
baseline and to more connected topologies, the network was still much healthier than more connected
topologies running with higher rates of drift.

image-29.png

image-30.png

image-31.png

Worst case topology, worst case drift

With topology 3 and the worst drift of 6 seconds absolute, we see high
increases in nomination timeouts and timing. Block time has an average of 5.1s,
but is more variable. The network definitely has less TX capacity with this
topology, but that is true both with and without the experimental timer.

https://grafana.stellar-ops.com/goto/ja6WR2oDg?orgId=1

Max TPS

While not the primary motivation of this change, the max TPS test gives us some
idea of how the new timer behaves under load.

It looks like the experimental timer gives a small increase in overall max TPS,
from around 2650 to 2850. We see that the experimental timer maintains a faster
block time of around 5.77s, while the current timer averages 7.85s. Nomination
timeouts and timing are similar between the two timers. It seems like the new timer,
because it has a smaller and more consistent block time, can achieve a higher TPS
due to bandwidth savings in block propagation.

Here is the max TPS test with the experimental timer
vs the current timer.

@SirTyson SirTyson requested a review from marta-lokhova May 1, 2026 17:55
@SirTyson SirTyson marked this pull request as ready for review May 1, 2026 17:55
Copilot AI review requested due to automatic review settings May 1, 2026 17:56

Copilot AI left a comment


Pull request overview

Adds an experimental mode to anchor HerderImpl’s triggerNextLedger scheduling off the last externalized SCP close time (with drift/availability fallbacks), plus test-only knobs to simulate clock drift and slow nomination message emission.

Changes:

  • Add EXPERIMENTAL_TRIGGER_TIMER and implement consensus-close-time-based trigger anchoring with fallback/metrics.
  • Add test-only support for simulated system clock drift and delayed nomination emit to exercise timeout/drift behavior.
  • Add a new SCP trigger fallback meter and a (hidden) herder simulation test covering drift and long nomination scenarios.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/util/Timer.h: Adds test-only system clock offset state + new actual_and_fake_system_now() API.
  • src/util/Timer.cpp: Implements drifted system_now() via injected offset and exposes paired sampling helper (see the sketch after this list).
  • src/test/test.cpp: Enables the experimental trigger timer in the default test config.
  • src/scp/Slot.h: Adds a test-only SCP timer ID for delayed nomination emission.
  • src/scp/SCPDriver.h: Adds a test-only virtual hook for configuring nomination emit delay.
  • src/scp/NominationProtocol.cpp: Defers nomination broadcast in tests via a new driver timer when configured.
  • src/main/Config.h: Adds EXPERIMENTAL_TRIGGER_TIMER and two new test-only timing knobs.
  • src/main/Config.cpp: Initializes/parses new config options; extends testing-only option list.
  • src/main/ApplicationImpl.cpp: Applies configured test-only system clock offset at startup via VirtualClock.
  • src/herder/test/HerderTests.cpp: Adds a (hidden) simulation test for experimental trigger behavior under drift/slow nomination.
  • src/herder/HerderSCPDriver.h: Declares test-only nomination emit delay accessor; exposes nomination timeout count getter.
  • src/herder/HerderSCPDriver.cpp: Implements test-only nomination emit delay accessor and special-cases emit timer callback behavior.
  • src/herder/HerderImpl.h: Declares new trigger anchor helper methods and adds a fallback meter to SCP metrics.
  • src/herder/HerderImpl.cpp: Implements consensus-close-time anchoring logic, fallback conditions, and new metrics wiring.
  • docs/metrics.md: Documents the new scp.trigger.prepare-start-fallback meter.
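
As a rough illustration of the test-only drift plumbing summarized above: the real change uses VirtualClock
(system_now() and the new actual_and_fake_system_now()), but a simplified stand-in, with assumed names,
looks something like this:

  #include <chrono>
  #include <utility>

  // Illustrative sketch only: a clock that returns both the real system time and
  // a "drifted" view shifted by a test-only offset, so tests can simulate nodes
  // whose clocks run ahead of or behind the rest of the network.
  class DriftableClock
  {
    public:
      using system_time_point = std::chrono::system_clock::time_point;

      // Test-only: configure how far this node's clock should appear to drift.
      void
      setTestOffset(std::chrono::milliseconds offset)
      {
          mTestOffset = offset;
      }

      // Returns {actual, drifted} sampled from the same underlying timestamp, so
      // the pair stays internally consistent even as the wall clock advances.
      std::pair<system_time_point, system_time_point>
      actualAndDriftedNow() const
      {
          auto actual = std::chrono::system_clock::now();
          return {actual, actual + mTestOffset};
      }

      // Callers asking for "the" current time only ever see the drifted view.
      system_time_point
      systemNow() const
      {
          return actualAndDriftedNow().second;
      }

    private:
      std::chrono::milliseconds mTestOffset{0};
  };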

Comment thread src/test/test.cpp
thisConfig.MANUAL_CLOSE = true;

thisConfig.TEST_CASES_ENABLED = true;
thisConfig.EXPERIMENTAL_TRIGGER_TIMER = true;

Copilot AI May 1, 2026


getTestConfig() unconditionally enables EXPERIMENTAL_TRIGGER_TIMER, which changes the default behavior for the entire test suite (including tests not exercising the experimental path). It would be safer to leave this flag at its default and enable it only in tests that specifically need it, so other tests continue validating the non-experimental behavior by default.

Suggested change
thisConfig.EXPERIMENTAL_TRIGGER_TIMER = true;

Comment thread src/util/Timer.h
Comment on lines +245 to +248
// Returns both the unshifted and drifted system time samples captured from
// the same underlying timestamp source.
std::pair<system_time_point, system_time_point>
actual_and_fake_system_now() const noexcept;

Copilot AI May 1, 2026


actual_and_fake_system_now() is added as a new public VirtualClock API, but it’s currently only used internally by system_now(). If it’s only needed for test drift injection, consider making it private (or #ifdef BUILD_TESTS) to avoid widening the public surface area of VirtualClock without a clear production use-case.

REQUIRE(herder0.getSCP().getHighestKnownSlotIndex() < FAR_FUTURE_BASE);
}

TEST_CASE("experimental trigger timer", "[herder][hide]")

Copilot AI May 1, 2026


This new test is tagged [hide], which typically excludes it from default CI runs. If this is meant to validate the experimental trigger timer behavior going forward, consider making at least a small/fast subset non-hidden (or adding a non-hidden unit test) so regressions are caught automatically.

Suggested change
TEST_CASE("experimental trigger timer", "[herder][hide]")
TEST_CASE("experimental trigger timer", "[herder]")

auto const newTimer = runSimulation(true, nominationDelay);

REQUIRE(oldTimer.elapsed > newTimer.elapsed);
REQUIRE(newTimer.elapsed < oldTimer.elapsed);

Copilot AI May 1, 2026


These two assertions are equivalent (oldTimer.elapsed > newTimer.elapsed and newTimer.elapsed < oldTimer.elapsed). Consider keeping just one to reduce redundancy.

Suggested change
REQUIRE(newTimer.elapsed < oldTimer.elapsed);

Comment thread src/main/Config.cpp
Comment on lines 70 to 74
"ARTIFICIALLY_SLEEP_MAIN_THREAD_FOR_TESTING",
"ARTIFICIALLY_DELAY_NOMINATION_EMIT_FOR_TESTING",
"ARTIFICIALLY_SKIP_CONNECTION_ADJUSTMENT_FOR_TESTING",
"ARTIFICIALLY_DELAY_LEDGER_CLOSE_FOR_TESTING",
"SKIP_HIGH_CRITICAL_VALIDATOR_CHECKS_FOR_TESTING",

Copilot AI May 1, 2026


ARTIFICIALLY_SET_SYSTEM_CLOCK_OFFSET_FOR_TESTING is parsed (under BUILD_TESTS) but isn’t listed in TESTING_ONLY_OPTIONS, so it won’t get the standard “testing-only option” warning logged when present. Consider adding it to TESTING_ONLY_OPTIONS for consistency with the other ARTIFICIALLY_*_FOR_TESTING knobs.
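
For illustration, the kind of fix being suggested would look roughly like adding the new option name to that
list (exact placement within TESTING_ONLY_OPTIONS in Config.cpp is assumed):

  "ARTIFICIALLY_SLEEP_MAIN_THREAD_FOR_TESTING",
  "ARTIFICIALLY_DELAY_NOMINATION_EMIT_FOR_TESTING",
  "ARTIFICIALLY_SET_SYSTEM_CLOCK_OFFSET_FOR_TESTING",
  "ARTIFICIALLY_SKIP_CONNECTION_ADJUSTMENT_FOR_TESTING",
  "ARTIFICIALLY_DELAY_LEDGER_CLOSE_FOR_TESTING",
  "SKIP_HIGH_CRITICAL_VALIDATOR_CHECKS_FOR_TESTING",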
