
F3 GPBFT Not progressing #255

@parthshah1


Summary

Daemon restarts during an active GPBFT instance cause self-equivocation via the WAL replay path, permanently silencing the restarted node for that instance. When multiple nodes restart during the same instance, quorum becomes permanently unreachable and F3 stalls indefinitely. There is no recovery mechanism: GPBFT has no MaxRounds limit and no instance-skip path when quorum is unachievable.

This is the same class of failure as lotus#13544 (F3 stuck at instances 465951/466454), but triggered by the protocol's own safety mechanism rather than voluntary non-participation.

Environment

logs.txt

  • 4 Lotus nodes (each ~25% mining power) + 1 Forest node (unpowered observer)
  • Deterministic fault injection (network partitions, process kills, packet delays)
  • EC finality patched to 20 epochs (vs 900 mainnet) for faster bug surfacing

Reproduction

  1. Start a 4+ node F3 network with equal mining power
  2. Wait for GPBFT instance N to begin (QUALITY phase)
  3. Kill 2+ Lotus daemons simultaneously (e.g., kill -9)
  4. Restart them after 5-10 seconds (EC chain head has advanced)
  5. Observe: each restarted node self-equivocates on instance N; F3 stalls permanently

Root Cause

The WAL replay path triggers the equivocation filter

On daemon restart, newRunner() (host.go:68) replays WAL entries into the equivocation filter:

host.go:112: runner.equivFilter.ProcessBroadcast(v.Message)

This seeds seenMessages with the old QUALITY vote signature from before the restart.

startInstanceAt() (host.go:378) then creates a fresh GPBFT instance, which calls GetProposal() (consensus_inputs.go:131); that in turn calls ec.GetHead() and returns the current EC chain head, which has advanced during the downtime.

The new instance calls beginQuality() (gpbft/gpbft.go:391) which broadcasts a QUALITY vote for the new proposal. BroadcastMessage() (host.go:468) hits the equivocation filter:

// equivocation.go:85-89
if ok && !bytes.Equal(msgInfo.signature, m.Signature) {
    if msgInfo.origin == ef.localPID {
        log.Warnw("local self-equivocation detected", ...)
        return false  // broadcast BLOCKED
    }
}

The old signature (from the WAL) no longer matches the new signature (over the new proposal), so the filter flags local self-equivocation and blocks the broadcast.

The GPBFT state machine doesn't know

BroadcastMessage() returns nil instead of an error when the filter blocks:

// host.go:468-470
if !h.equivFilter.ProcessBroadcast(msg) {
    return nil  // silently dropped
}

The GPBFT instance continues as if its vote had been broadcast, waiting for a quorum that can never arrive.

No recovery path exists

  • No MaxRounds limit: beginNextRound() (gpbft/gpbft.go:705) increments Round without bound. Timeouts grow exponentially (2^round * baseDelta). No production round limit exists; maxRounds appears only in sim/sim.go (the test simulator).
  • No instance skip: The only mechanism to abandon a stuck instance is certificate exchange (host.go:324-335). But if no quorum was reached, no certificate was produced, so there's nothing to exchange.
  • OhShitStore doesn't help: The OhShitStore (powerstore/powerstore.go:252) catches up power tables but has zero knowledge of GPBFT votes or equivocation.

Observed Timeline

Time      Event
211s      All 5 nodes decide instance 23 (last healthy instance)
228-238s  3 of 4 Lotus nodes restart (network partition caused RPC failures)
254s      lotus2: local self-equivocation detected {sender:1002, instance:24, round:0, phase:QUALITY}
296s      lotus2: equivocates again after a second restart
405s      lotus3: local self-equivocation detected {sender:1003, instance:24, round:0, phase:QUALITY}
434s      Only lotus0 + forest0 decide instance 24 (~25% power, far below quorum)
434-623s  Instances 25-31: only lotus0 + forest0 participate. F3 permanently stalled.

Key detail: The nodes self-equivocated before connecting to any peer; the equivocation arises purely from local WAL replay against a stale chain head. lotus2 equivocated at 254s but didn't successfully connect to a peer until 510s.

Impact

  • Trigger threshold is low: Restarting 2 of 4 miners (~50% power) during one GPBFT instance is enough to permanently stall F3.
  • On mainnet (900-epoch EC finality): Would not cause a visible chain fork, but F3 finality would stop advancing — a Sev1 event.
  • Cascade risk with lotus#13544: If F3 is already near the quorum edge (~63-67% participation) and operators restart nodes to fix it, the restarts cause equivocation, dropping participation further. The fix attempt amplifies the failure.

Suggested Fixes

1. Abstain on proposal divergence (highest impact, smallest change)

When startInstanceAt() replays WAL messages, compare the replayed QUALITY vote's ECChain against the new GetProposal() result. If they differ, abstain from the instance instead of voting:

// In startInstanceAt(), after WAL replay:
if walProposal != nil && walProposal.Key() != newProposal.Key() {
    log.Warnw("proposal changed since last vote, abstaining from instance",
        "instance", instance)
    return nil // wait for certificate exchange
}

2. Add MaxRounds with forced base-decision

Add a configurable MaxRounds. When reached, decide on the base chain (no new finality but produces a valid certificate for instance advancement):

// gpbft/gpbft.go, beginNextRound():
if i.participant.maxRounds > 0 && i.current.Round >= i.participant.maxRounds {
    i.decide(i.input.Base())
    return
}

3. Don't start F3 participation until peers are connected

F3 currently resumes from WAL immediately on daemon start, before any peer connections. Gate on minimum peer count to avoid voting on stale local state.

4. Return error on blocked broadcast

Change BroadcastMessage() to return an error when the equivocation filter blocks, so the GPBFT state machine can transition to observer mode instead of waiting for quorum indefinitely.

Related

  • lotus#13544 — F3 stuck at instances 465951/466454 (insufficient participation, same quorum-loss class)
  • go-f3#1056 — F3 exponential backoff cap (addresses timeout growth but not equivocation deadlock)
