Summary
Daemon restarts during an active GPBFT instance cause self-equivocation via the WAL replay path, permanently silencing the restarted node for that instance. When multiple nodes restart during the same instance, quorum becomes permanently unreachable and F3 stalls indefinitely. There is no recovery mechanism — GPBFT has no MaxRounds limit and no instance-skip path when quorum is unachievable.
This is the same class of failure as lotus#13544 (F3 stuck at instances 465951/466454), but triggered by the protocol's own safety mechanism rather than voluntary non-participation.
Environment
Attached: logs.txt
- 4 Lotus nodes (each ~25% mining power) + 1 Forest node (unpowered observer)
- Deterministic fault injection (network partitions, process kills, packet delays)
- EC finality patched to 20 epochs (vs 900 mainnet) for faster bug surfacing
Reproduction
- Start a 4+ node F3 network with equal mining power
- Wait for GPBFT instance N to begin (QUALITY phase)
- Kill 2+ Lotus daemons simultaneously (e.g., `kill -9`)
- Restart them after 5-10 seconds (EC chain head has advanced)
- Observe: each restarted node self-equivocates on instance N; F3 stalls permanently
Root Cause
The WAL replay path triggers the equivocation filter
On daemon restart, `newRunner()` (host.go:68) replays WAL entries into the equivocation filter:

```go
// host.go:112
runner.equivFilter.ProcessBroadcast(v.Message)
```

This seeds `seenMessages` with the old QUALITY vote signature recorded before the restart.
Then `startInstanceAt()` (host.go:378) creates a fresh GPBFT instance, which calls `GetProposal()` (consensus_inputs.go:131), which in turn calls `ec.GetHead()` — returning the current EC chain head, which advanced during the downtime.
The new instance calls `beginQuality()` (gpbft/gpbft.go:391), which broadcasts a QUALITY vote for the new proposal. `BroadcastMessage()` (host.go:468) hits the equivocation filter:
```go
// equivocation.go:85-89
if ok && !bytes.Equal(msgInfo.signature, m.Signature) {
	if msgInfo.origin == ef.localPID {
		log.Warnw("local self-equivocation detected", ...)
		return false // broadcast BLOCKED
	}
}
```
The old signature (from the WAL) differs from the new signature (over the new proposal), so the filter flags a self-equivocation and blocks the broadcast.
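To make the mechanism concrete, here is a minimal model of the filter's keying — types, field names, and signature values are simplified stand-ins for illustration, not go-f3's actual structs. The key (sender, instance, round, phase) is identical before and after the restart, but the proposal, and therefore the signature, is not:

```go
package main

import "fmt"

// msgKey is a simplified stand-in for the equivocation filter's lookup key.
type msgKey struct {
	sender, instance, round uint64
	phase                   string
}

// isSelfEquivocation mirrors the filter's check: the same key was seen
// before, but with a different signature.
func isSelfEquivocation(seen map[msgKey][]byte, key msgKey, sig []byte) bool {
	old, ok := seen[key]
	return ok && string(old) != string(sig)
}

func main() {
	seen := map[msgKey][]byte{}
	key := msgKey{sender: 1002, instance: 24, round: 0, phase: "QUALITY"}

	// WAL replay seeds the filter with the pre-restart QUALITY signature.
	seen[key] = []byte("sig-over-old-EC-head")

	// The fresh instance signs a QUALITY vote over the advanced EC head.
	newSig := []byte("sig-over-new-EC-head")

	fmt.Println("blocked:", isSelfEquivocation(seen, key, newSig))
}
```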
The GPBFT state machine doesn't know
`BroadcastMessage()` returns `nil` instead of an error when the filter blocks:
```go
// host.go:468-470
if !h.equivFilter.ProcessBroadcast(msg) {
	return nil // silently dropped
}
```
The GPBFT `instance` struct continues as if its vote was broadcast, waiting for a quorum that can never arrive.
No recovery path exists
- No MaxRounds limit: `beginNextRound()` (gpbft/gpbft.go:705) increments `Round` without bound. Timeouts grow exponentially (`2^round * baseDelta`). No production round limit exists — `maxRounds` appears only in sim/sim.go (the test simulator).
- No instance skip: the only mechanism to abandon a stuck instance is certificate exchange (host.go:324-335). But if no quorum was reached, no certificate was produced, so there is nothing to exchange.
- OhShitStore doesn't help: the OhShitStore (powerstore/powerstore.go:252) catches up power tables but has no knowledge of GPBFT votes or equivocation.
Observed Timeline
| Time | Event |
| --- | --- |
| 211s | All 5 nodes decide instance 23 (last healthy instance) |
| 228-238s | 3 of 4 Lotus nodes restart (a network partition caused RPC failures) |
| 254s | lotus2: `local self-equivocation detected {sender:1002, instance:24, round:0, phase:QUALITY}` |
| 296s | lotus2 equivocates again after a second restart |
| 405s | lotus3: `local self-equivocation detected {sender:1003, instance:24, round:0, phase:QUALITY}` |
| 434s | Only lotus0 + forest0 decide instance 24 (~25% power, far below quorum) |
| 434-623s | Instances 25-31: only lotus0 + forest0 participate; F3 permanently stalled |
Key detail: The nodes self-equivocated before connecting to any peer — purely from local WAL replay vs stale chain head. lotus2 equivocated at 254s but didn't successfully connect to a peer until 510s.
Impact
- Trigger threshold is low: Restarting 2 of 4 miners (~50% power) during one GPBFT instance is enough to permanently stall F3.
- On mainnet (900-epoch EC finality): Would not cause a visible chain fork, but F3 finality would stop advancing — a Sev1 event.
- Cascade risk with lotus#13544: If F3 is already near the quorum edge (~63-67% participation) and operators restart nodes to fix it, the restarts cause equivocation, dropping participation further. The fix attempt amplifies the failure.
Suggested Fixes
1. Abstain on proposal divergence (highest impact, smallest change)
When `startInstanceAt()` replays WAL messages, compare the replayed QUALITY vote's `ECChain` against the new `GetProposal()` result. If they differ, abstain from the instance instead of voting:
```go
// In startInstanceAt(), after WAL replay:
if walProposal != nil && walProposal.Key() != newProposal.Key() {
	log.Warnw("proposal changed since last vote, abstaining from instance",
		"instance", instance)
	return nil // wait for certificate exchange
}
```
2. Add MaxRounds with forced base-decision
Add a configurable `MaxRounds`. When it is reached, decide on the base chain (no new finality, but this produces a valid certificate so the instance can advance):
```go
// gpbft/gpbft.go, beginNextRound():
if i.participant.maxRounds > 0 && i.current.Round >= i.participant.maxRounds {
	i.decide(i.input.Base())
	return
}
```
3. Don't start F3 participation until peers are connected
F3 currently resumes from WAL immediately on daemon start, before any peer connections. Gate on minimum peer count to avoid voting on stale local state.
4. Return error on blocked broadcast
Change `BroadcastMessage()` to return an error when the equivocation filter blocks, so the GPBFT state machine can transition to observer mode instead of waiting for quorum indefinitely.
Related
- lotus#13544 — F3 stuck at instances 465951/466454 (insufficient participation, same quorum-loss class)
- go-f3#1056 — F3 exponential backoff cap (addresses timeout growth but not equivocation deadlock)