
F3 GPBFT Not progressing #255

@parthshah1


Summary

Daemon restarts during an active GPBFT instance cause self-equivocation via the WAL replay path, permanently silencing the restarted node for that instance. When multiple nodes restart during the same instance, quorum becomes permanently unreachable and F3 stalls indefinitely. There is no recovery mechanism: GPBFT has no MaxRounds limit and no instance-skip path when quorum is unachievable.

This is the same class of failure as lotus#13544 (F3 stuck at instances 465951/466454), but triggered by the protocol's own safety mechanism rather than voluntary non-participation.

Environment

logs.txt

  • 4 Lotus nodes (each ~25% mining power) + 1 Forest node (unpowered observer)
  • Deterministic fault injection (network partitions, process kills, packet delays)
  • EC finality patched to 20 epochs (vs 900 mainnet) for faster bug surfacing

Reproduction

  1. Start a 4+ node F3 network with equal mining power
  2. Wait for GPBFT instance N to begin (QUALITY phase)
  3. Kill 2+ Lotus daemons simultaneously (e.g., kill -9)
  4. Restart them after 5-10 seconds (EC chain head has advanced)
  5. Observe: each restarted node self-equivocates on instance N; F3 stalls permanently

Root Cause

The WAL replay path triggers the equivocation filter

On daemon restart, newRunner() (host.go:68) replays WAL entries into the equivocation filter:

host.go:112: runner.equivFilter.ProcessBroadcast(v.Message)

This seeds seenMessages with the old QUALITY vote signature from before the restart.

startInstanceAt() (host.go:378) then creates a fresh GPBFT instance, which calls GetProposal() (consensus_inputs.go:131); that in turn calls ec.GetHead() and returns the current EC chain head, which has advanced during the downtime.

The new instance calls beginQuality() (gpbft/gpbft.go:391) which broadcasts a QUALITY vote for the new proposal. BroadcastMessage() (host.go:468) hits the equivocation filter:

// equivocation.go:85-89
if ok && !bytes.Equal(msgInfo.signature, m.Signature) {
    if msgInfo.origin == ef.localPID {
        log.Warnw("local self-equivocation detected", ...)
        return false  // broadcast BLOCKED
    }
}

The old signature (from the WAL) no longer matches the new signature (over the new proposal), so the filter flags local self-equivocation and blocks the broadcast.

The GPBFT state machine doesn't know

BroadcastMessage() returns nil instead of an error when the filter blocks:

// host.go:468-470
if !h.equivFilter.ProcessBroadcast(msg) {
    return nil  // silently dropped
}

The GPBFT instance continues as if its vote had been broadcast, waiting for a quorum that can never arrive.

No recovery path exists

  • No MaxRounds limit: beginNextRound() (gpbft/gpbft.go:705) increments Round without bound. Timeouts grow exponentially (2^round * baseDelta). No production round limit exists; maxRounds appears only in sim/sim.go (the test simulator).
  • No instance skip: The only mechanism to abandon a stuck instance is certificate exchange (host.go:324-335). But if no quorum was reached, no certificate was produced, so there's nothing to exchange.
  • OhShitStore doesn't help: The OhShitStore (powerstore/powerstore.go:252) catches up power tables but has zero knowledge of GPBFT votes or equivocation.

Observed Timeline

Time      Event
211s      All 5 nodes decide instance 23 (last healthy instance)
228-238s  3 of 4 Lotus nodes restart (network partition caused RPC failures)
254s      lotus2: local self-equivocation detected {sender:1002, instance:24, round:0, phase:QUALITY}
296s      lotus2: equivocates again after a second restart
405s      lotus3: local self-equivocation detected {sender:1003, instance:24, round:0, phase:QUALITY}
434s      Only lotus0 + forest0 decide instance 24 (~25% power, far below quorum)
434-623s  Instances 25-31: only lotus0 + forest0 participate. F3 permanently stalled.

Key detail: The nodes self-equivocated before connecting to any peer; the equivocation arises purely from local WAL replay against a stale chain head. lotus2 equivocated at 254s but didn't successfully connect to a peer until 510s.

Impact

  • Trigger threshold is low: Restarting 2 of 4 miners (~50% power) during one GPBFT instance is enough to permanently stall F3.
  • On mainnet (900-epoch EC finality): Would not cause a visible chain fork, but F3 finality would stop advancing — a Sev1 event.
  • Cascade risk with lotus#13544: If F3 is already near the quorum edge (~63-67% participation) and operators restart nodes to fix it, the restarts cause equivocation, dropping participation further. The fix attempt amplifies the failure.

Suggested Fixes

1. Abstain on proposal divergence (highest impact, smallest change)

When startInstanceAt() replays WAL messages, compare the replayed QUALITY vote's ECChain against the new GetProposal() result. If they differ, abstain from the instance instead of voting:

// In startInstanceAt(), after WAL replay:
if walProposal != nil && walProposal.Key() != newProposal.Key() {
    log.Warnw("proposal changed since last vote, abstaining from instance",
        "instance", instance)
    return nil // wait for certificate exchange
}

2. Add MaxRounds with forced base-decision

Add a configurable MaxRounds. When reached, decide on the base chain (no new finality but produces a valid certificate for instance advancement):

// gpbft/gpbft.go, beginNextRound():
if i.participant.maxRounds > 0 && i.current.Round >= i.participant.maxRounds {
    i.decide(i.input.Base())
    return
}

3. Don't start F3 participation until peers are connected

F3 currently resumes from WAL immediately on daemon start, before any peer connections. Gate on minimum peer count to avoid voting on stale local state.

4. Return error on blocked broadcast

Change BroadcastMessage() to return an error when the equivocation filter blocks, so the GPBFT state machine can transition to observer mode instead of waiting for quorum indefinitely.

Related

  • lotus#13544 — F3 stuck at instances 465951/466454 (insufficient participation, same quorum-loss class)
  • go-f3#1056 — F3 exponential backoff cap (addresses timeout growth but not equivocation deadlock)
