119 changes: 119 additions & 0 deletions CHECKPOINT_SYNC_PROGRESS.md
@@ -0,0 +1,119 @@
# Checkpoint Sync Progress — epbs-devnet-1

## Goal
Make Lighthouse checkpoint sync against epbs-devnet-1 and follow head.

## Status: CHECKPOINT SYNC WORKS + ENVELOPE LOOKUP WIRED
- Checkpoint sync initializes correctly, zero block rejections, finalized root matches devnet
- Cannot test full sync-to-head: Prysm collocation limit blocks our IP
- Root cause: 22 peer IDs stored from our IP 85.10.201.236, exceeds Prysm's CollocationLimit=5
- Not an IP ban — it's an anti-Sybil measure in `beacon-chain/p2p/peers/status.go:isfromBadIP`
- Need: fresh IP, Prysm restart (clear peer store), or different machine
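The collocation rule described above can be sketched as a per-IP count of distinct peer IDs. This is an illustration only, not Prysm's code (Prysm is written in Go): `PeerStoreSketch`, `record`, and `exceeds_collocation_limit` are hypothetical names, and the strict `>` comparison against the limit is an assumption.

```rust
use std::collections::{HashMap, HashSet};
use std::net::IpAddr;

/// Limit value taken from the note above; the constant name is hypothetical.
const COLLOCATION_LIMIT: usize = 5;

/// Hypothetical peer store: remembers every distinct peer ID seen per IP.
#[derive(Default)]
struct PeerStoreSketch {
    ids_by_ip: HashMap<IpAddr, HashSet<String>>,
}

impl PeerStoreSketch {
    /// Records that `peer_id` was seen connecting from `ip`.
    fn record(&mut self, ip: IpAddr, peer_id: &str) {
        self.ids_by_ip
            .entry(ip)
            .or_default()
            .insert(peer_id.to_string());
    }

    /// Returns true when more distinct peer IDs than the limit share `ip`.
    /// Assumption: the comparison is strict (`>`), not `>=`.
    fn exceeds_collocation_limit(&self, ip: IpAddr) -> bool {
        self.ids_by_ip
            .get(&ip)
            .map_or(false, |ids| ids.len() > COLLOCATION_LIMIT)
    }
}
```

Under this model, stored peer IDs accumulate per IP until the store is cleared, which is consistent with the listed remedies (fresh IP, Prysm restart, or a different machine).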

## Devnet State (checked 2026-04-04)
- Devnet alive, head slot ~25232, synced, not optimistic
- Checkpoint sync URL: `https://beacon.epbs-devnet-1.ethpandaops.io/`

## Pre-existing Bugs (from DEVNET_SYNC_STATUS.md)

### Bug 1: MissingHotStateSummary — FIXED (prior work)
- **File**: `beacon_node/store/src/hot_cold_store.rs`
- **Fix**: Fall back to Pending when previous state summary missing during checkpoint sync

### Bug 2: Missing Envelope for Parent Block — WORKAROUND ONLY
- **File**: `beacon_node/beacon_chain/src/block_verification.rs`
- **Symptom**: `DBInconsistent("Missing envelope for parent block")`
- **Current workaround**: Falls back to Pending state, causes state root mismatches → validation fails → peers drop → stall
- **This is the blocking issue**

---

## My Attempts

### Attempt 1: Download envelope during checkpoint sync
- Downloaded envelope from checkpoint server HTTP API (works, Prysm serves it)
- Stored envelope in DB for invariant 5 compliance
- **Problem**: fork choice set `payload_received=true` for anchor (genesis logic), returned `Full` status
- `get_advanced_hot_state` couldn't find Full state (only Pending stored)
- **Fix**: Changed `is_genesis` in proto_array to require `slot == 0` in addition to `parent.is_none()`, so a parentless checkpoint anchor is no longer treated as genesis
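A minimal sketch of the `is_genesis` change, assuming a simplified node type. `NodeSketch` is a hypothetical stand-in for the real proto-array node and carries only the two fields the check reads.

```rust
/// Hypothetical stand-in for a proto-array node.
struct NodeSketch {
    parent_index: Option<usize>,
    slot: u64,
}

/// Behaviour before the fix: any node without a parent counts as genesis,
/// which also matches a checkpoint sync anchor.
fn is_genesis_old(node: &NodeSketch) -> bool {
    node.parent_index.is_none()
}

/// Behaviour after the fix: a parentless node is genesis only at slot 0.
fn is_genesis_fixed(node: &NodeSketch) -> bool {
    node.parent_index.is_none() && node.slot == 0
}
```

A checkpoint anchor has no parent in fork choice but sits at a non-zero slot, so only the fixed check tells it apart from genesis.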

### Attempt 2: Snapshot with envelope=None, fallback in load_parent
- Snapshot without envelope → fork choice returns Pending → head loads correctly
- Envelope stored in DB only
- `load_parent` falls back from Full→Pending when Full state not found
- **Problem**: Pending state has wrong `latest_block_hash` — hasn't been updated by envelope
- Child block's `ExecutionPayloadBid.parent_block_hash` doesn't match `state.latest_block_hash`
- Error: `ExecutionPayloadBidInvalid: ParentBlockHashMismatch`
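The failing check can be sketched as a plain equality test between the bid's parent hash and the state's `latest_block_hash`. The types below (`StateSketch`, `BidSketch`, `BidError`) are hypothetical simplifications and not Lighthouse's actual definitions.

```rust
type Hash256 = [u8; 32];

/// Hypothetical error mirroring the `ParentBlockHashMismatch` variant named above.
#[derive(Debug, PartialEq)]
enum BidError {
    ParentBlockHashMismatch { state: Hash256, bid: Hash256 },
}

/// Only the field the check reads.
struct StateSketch {
    latest_block_hash: Hash256,
}

/// Only the field the check reads.
struct BidSketch {
    parent_block_hash: Hash256,
}

/// The child's bid must build on the execution block the state last recorded.
fn check_bid_parent_hash(state: &StateSketch, bid: &BidSketch) -> Result<(), BidError> {
    if bid.parent_block_hash == state.latest_block_hash {
        Ok(())
    } else {
        Err(BidError::ParentBlockHashMismatch {
            state: state.latest_block_hash,
            bid: bid.parent_block_hash,
        })
    }
}
```

A Pending state that was never updated by the parent's envelope still carries the older hash, so a child bid that references the parent's payload fails this comparison.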

### Attempt 3: Mutate `latest_block_hash` on Pending state + recompute root
- Applied minimal mutation: set `state.latest_block_hash = envelope.payload.block_hash`
- Recomputed state root after mutation and updated split point
- Stored envelope in DB for invariant 5 compliance
- Snapshot uses `execution_envelope: None` so fork choice computes correct block root
- Proto_array fix: `is_genesis = parent_index.is_none() && block.slot == 0` (not just no parent)
- `load_parent` falls back from Full→Pending when Full state not found (for first child block)
- **Result**: Zero block rejections, head advances from checkpoint slot to slot+~65
- **Remaining issue**: Prysm peers rate-limit `data_columns_by_range` requests → peers disconnect → sync stalls
- This is a networking issue, not a checkpoint sync bug

### Bug 3: Peers disconnect with "Fault" — wrong finalized_root (CRITICAL)
- Prysm peers send `Goodbye: Fault` immediately after status exchange
- Our `finalized_root` doesn't match theirs for the same finalized epoch
- Root cause: mutating `latest_block_hash` on the checkpoint state changes the state root
- The changed state root cascades: `get_forkchoice_store` computes block header root using
the mutated state root → different block root → different finalized_root in status messages
- Blocks DO import correctly (head advances ~100 slots) but peers disconnect during status
- **The state mutation approach is fundamentally broken** — can't change state without
changing roots, which makes status messages incompatible

### Key insight: Can't mutate the Pending state
The downloaded Pending state has a specific root that matches what the network expects.
Mutating it changes the root, making our node incompatible. Need a different approach.
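The effect can be illustrated with a toy root function. Loud assumption: `toy_root` uses the standard library's `DefaultHasher` purely as a stand-in; the real state root is an SSZ hash tree root, and `StateSketch` is a hypothetical two-field state.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical two-field state; the real BeaconState has many more fields.
#[derive(Clone, Hash)]
struct StateSketch {
    slot: u64,
    latest_block_hash: [u8; 32],
}

/// Toy root: NOT the SSZ hash_tree_root, only a stand-in that commits to
/// every field of the state.
fn toy_root(state: &StateSketch) -> u64 {
    let mut hasher = DefaultHasher::new();
    state.hash(&mut hasher);
    hasher.finish()
}
```

Any root that commits to `latest_block_hash` changes when that field changes, and everything derived from the state root (block header root, finalized root in status messages) changes with it.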

### Attempt 4: Patch in-memory state only, don't mutate stored state (CURRENT)
- Reverted all stored-state mutations (keeps correct roots for status messages)
- In `load_parent`, when falling back from Full→Pending, load the envelope from DB and
apply `latest_block_hash = envelope.payload.block_hash` on the IN-MEMORY state only
- The on-disk state retains its original root → correct fork choice and status messages
- **Result**: Zero block rejections, head advances ~100+ slots from checkpoint
- **finalized_root matches devnet** — our status messages have correct finalized data
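A sketch of the in-memory patch, assuming simplified types. `StateSketch`, `EnvelopeSketch`, and `patched_parent_state` are hypothetical names; the real `load_parent` works on full beacon states and reads the envelope from the DB.

```rust
type Hash256 = [u8; 32];

/// Hypothetical state carrying only the patched field.
#[derive(Clone, Debug, PartialEq)]
struct StateSketch {
    latest_block_hash: Hash256,
}

/// Hypothetical envelope; `payload_block_hash` stands in for
/// `envelope.payload.block_hash`.
struct EnvelopeSketch {
    payload_block_hash: Hash256,
}

/// Returns a patched copy for processing the child block. The stored state is
/// only borrowed, so its contents (and therefore its root) stay unchanged.
fn patched_parent_state(
    stored: &StateSketch,
    envelope: Option<&EnvelopeSketch>,
) -> StateSketch {
    let mut in_memory = stored.clone();
    if let Some(env) = envelope {
        in_memory.latest_block_hash = env.payload_block_hash;
    }
    in_memory
}
```

Taking the stored state by shared reference makes the "never mutate what is on disk" rule visible in the signature: the function cannot write back to it.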

### Bug 4: Range sync doesn't download envelopes (CRITICAL for Gloas)
- `block_components_by_range_request` sends: BlocksByRange + DataColumnsByRange
- No `PayloadEnvelopesByRange` requests are made
- Blocks import successfully as Pending (beacon block processing succeeds without envelope)
- The chain operates in Pending-only mode — no Full states, no execution payload validation
- Eventually, child blocks whose parents were Full will fail bid validation:
`ExecutionPayloadBidInvalid: ParentBlockHashMismatch`
- Our `load_parent` in-memory patch covers the checkpoint block's children, but NOT
subsequent full blocks whose envelopes were never downloaded
- **Fix needed**: Add `PayloadEnvelopesByRange` to `block_components_by_range_request`,
similar to how `DataColumnsByRange` was integrated. This is a significant change to the
range sync pipeline and coupling logic.
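One possible shape for the coupling step, as a sketch under assumptions: envelopes returned by a hypothetical `PayloadEnvelopesByRange` response are matched to blocks by beacon block root, and blocks without an envelope keep `None`. `BlockSketch`, `EnvelopeSketch`, and `couple_blocks_and_envelopes` are illustrative names, not existing Lighthouse APIs.

```rust
use std::collections::HashMap;

type Root = [u8; 32];

/// Hypothetical block: only the fields needed for coupling.
#[derive(Clone, Debug, PartialEq)]
struct BlockSketch {
    root: Root,
    slot: u64,
}

/// Hypothetical envelope keyed by the beacon block it belongs to.
#[derive(Clone, Debug, PartialEq)]
struct EnvelopeSketch {
    beacon_block_root: Root,
}

/// Pairs each block with its envelope, if one was downloaded.
/// Block order is preserved; envelopes for unknown roots are dropped.
fn couple_blocks_and_envelopes(
    blocks: Vec<BlockSketch>,
    envelopes: Vec<EnvelopeSketch>,
) -> Vec<(BlockSketch, Option<EnvelopeSketch>)> {
    let mut by_root: HashMap<Root, EnvelopeSketch> = envelopes
        .into_iter()
        .map(|e| (e.beacon_block_root, e))
        .collect();
    blocks
        .into_iter()
        .map(|block| {
            let envelope = by_root.remove(&block.root);
            (block, envelope)
        })
        .collect()
}
```

Keeping the envelope optional matches the observation above that a block can import as Pending without one; a real implementation would also need to decide when a missing envelope should fail the batch.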

### Bug 5: Prysm peers reject us after previous broken sessions (NETWORKING)
- Prysm sends `Goodbye: Fault` immediately after status exchange
- Happens BEFORE any data requests, so it is not caused by rate limiting
- Our finalized_root and epoch match the devnet's canonical chain
- First suspected a Prysm interop issue with StatusMessageV2 or some field mismatch; the Status section above instead attributes the disconnects to Prysm's collocation limit, which is an anti-Sybil measure rather than an IP ban
- Lodestar peers at epoch 3/206 are far behind and correctly disconnected
- Blocks import correctly when peers are connected (zero rejections)
- **This is a separate P2P issue, not related to checkpoint sync**

### Changes Made (files modified)
1. `beacon_node/client/src/builder.rs` — Download envelope during Gloas checkpoint sync
2. `beacon_node/beacon_chain/src/builder.rs` — Accept envelope param, store in DB, no state mutation
3. `beacon_node/beacon_chain/src/block_verification.rs` — Fallback Full→Pending + in-memory block_hash patch
4. `beacon_node/beacon_chain/tests/store_tests.rs` — Updated call sites for new signature
5. `beacon_node/store/src/hot_cold_store.rs` — Bug 1 fix (handle missing previous_state_root)
6. `consensus/proto_array/src/proto_array.rs` — Fix `is_genesis` for checkpoint sync anchors
7. `.cargo/config.toml` — Build target dir

### Changes Made (files modified): earlier Attempt 3 version, superseded by the list above
1. `beacon_node/client/src/builder.rs` — Download envelope during Gloas checkpoint sync
2. `beacon_node/beacon_chain/src/builder.rs` — Accept envelope, mutate `latest_block_hash`, store envelope in DB, recompute state root
3. `beacon_node/beacon_chain/src/block_verification.rs` — Fallback from Full→Pending in load_parent
4. `beacon_node/beacon_chain/tests/store_tests.rs` — Updated call sites for new signature
5. `beacon_node/store/src/hot_cold_store.rs` — Bug 1 fix (handle missing previous_state_root)
6. `consensus/proto_array/src/proto_array.rs` — Fix `is_genesis` for checkpoint sync anchors
7. `.cargo/config.toml` — Build target dir
158 changes: 158 additions & 0 deletions DEVNET_SYNC_STATUS.md
@@ -0,0 +1,158 @@
# Lighthouse ePBS Devnet-1 Checkpoint Sync — Status & Handoff

## Goal
Get Lighthouse to checkpoint sync against epbs-devnet-1 and follow head.

## Branch
- **Location**: `/root/.openclaw/workspace/lighthouse-devnet-test`
- **Branch**: `devnet-test-combined` (local only, not pushed)
- **Base**: `sigp/unstable` @ `99f5a92b9`
- **Merged in**: sigp/lighthouse PR #9025 (Gloas fork choice redux, commit `68f18efbe`)
- **Merged in**: dapplion/lighthouse PR #68 (gloas-lookup-sync-fixes, branch `gloas-lookup-sync-fixes` @ `8f4a5f0a4`)
- **Local fixes**: 2 patches applied on top (see below)
- **Cargo target-dir**: `/mnt/ssd/builds/lighthouse-devnet-test`

## Devnet Config
- **Network**: epbs-devnet-1
- **Config files**: `/tmp/epbs-devnet-1/` (config.yaml, genesis.ssz, jwt.hex, boot_enrs.txt, el_bootnodes.txt, genesis.json)
- **Checkpoint sync URL**: `https://beacon.epbs-devnet-1.ethpandaops.io/`
- **Beacon API**: `https://beacon.epbs-devnet-1.ethpandaops.io/`
- **Ports used**: CL 9200/udp+tcp, HTTP 5053, EL authrpc 18551
- **Data dirs**: CL `/mnt/ssd/lighthouse-devnet-1`, EL `/mnt/ssd/geth-devnet-1`

## EL Setup
- **Image**: `ethpandaops/geth:epbs-devnet-0` (Docker)
- **Container name**: `geth-devnet-1`
- **Network ID**: 7070339337
- **Start command**:
```bash
# Join the EL bootnodes (one per line) into a comma-separated list, dropping the trailing comma
EL_BOOTNODES=$(tr '\n' ',' < /tmp/epbs-devnet-1/el_bootnodes.txt | sed 's/,$//')
docker run -d --name geth-devnet-1 --network host \
-v /mnt/ssd/geth-devnet-1:/data -v /tmp/epbs-devnet-1/jwt.hex:/jwt.hex \
ethpandaops/geth:epbs-devnet-0 \
--datadir /data --networkid 7070339337 --bootnodes "$EL_BOOTNODES" \
--port 30304 --discovery.port 30304 \
--http --http.port 8546 --http.api eth,net,web3,txpool \
--authrpc.port 18551 --authrpc.jwtsecret /jwt.hex \
--syncmode full --verbosity 3
```
- **Init**: Must run `docker run --rm ... geth init --datadir /data /genesis.json` first with the EL genesis

## CL Start Command
Script at `/tmp/start-lh-devnet.sh`:
```bash
# Join the boot ENRs (one per line) into a comma-separated list
BOOT_ENRS=$(paste -sd, /tmp/epbs-devnet-1/boot_enrs.txt)
exec /mnt/ssd/builds/lighthouse-devnet-test/release/lighthouse bn \
--testnet-dir /tmp/epbs-devnet-1 \
--datadir /mnt/ssd/lighthouse-devnet-1 \
--checkpoint-sync-url https://beacon.epbs-devnet-1.ethpandaops.io \
--boot-nodes "$BOOT_ENRS" \
--target-peers 50 --port 9200 --discovery-port 9200 \
--http --http-port 5053 \
--execution-endpoint http://localhost:18551 \
--execution-jwt /tmp/epbs-devnet-1/jwt.hex \
--subscribe-all-subnets --import-all-attestations
```

## Bugs Found & Fixed

### Bug 1: MissingHotStateSummary (FIXED)
- **File**: `beacon_node/store/src/hot_cold_store.rs` ~line 1897
- **Symptom**: `CRIT Failed to start beacon node: MissingHotStateSummary(0xe8ee...)`
- **Root cause**: During checkpoint sync, only ONE state is stored (the checkpoint state). `HotStateSummary::new` computes a `previous_state_root` pointing to slot-1's state root, but that state was never stored. When `get_hot_state_summary_payload_status()` tries to load it, it fails.
- **Fix applied**: In `get_hot_state_summary_payload_status()`, when `load_hot_state_summary(&previous_state_root)` returns `None`, fall back to determining the payload status from the current summary alone instead of returning an error:
  - If `summary.slot == summary.latest_block_slot` → Pending (block state)
  - Otherwise → Pending (safe default for checkpoint boundary states)
- **Why this is correct**: both branches return Pending, and checkpoint states at epoch boundaries are always Pending.
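The fallback can be sketched as follows. Assumptions: `SummarySketch` and `payload_status_with_fallback` are hypothetical names, and the normal path (previous summary present) is passed in as a closure because these notes do not describe how it classifies the state.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum PayloadStatus {
    Pending,
    Full,
}

/// Hypothetical summary with only the two fields the fallback compares.
struct SummarySketch {
    slot: u64,
    latest_block_slot: u64,
}

/// `classify` stands in for the existing logic that needs the previous summary.
fn payload_status_with_fallback<F>(
    current: &SummarySketch,
    previous: Option<&SummarySketch>,
    classify: F,
) -> PayloadStatus
where
    F: Fn(&SummarySketch, &SummarySketch) -> PayloadStatus,
{
    match previous {
        Some(prev) => classify(current, prev),
        // Previous summary was never stored (checkpoint sync): do not error.
        None => {
            if current.slot == current.latest_block_slot {
                PayloadStatus::Pending // block state
            } else {
                PayloadStatus::Pending // checkpoint boundary state, safe default
            }
        }
    }
}
```

The two fallback branches are kept separate only to mirror the description; since both yield Pending, they could be collapsed into a single arm.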

### Bug 2: Missing Envelope for Parent Block (PARTIALLY FIXED — NEEDS PROPER FIX)
- **File**: `beacon_node/beacon_chain/src/block_verification.rs` ~line 1976
- **Symptom**: `BlockProcessingFailure: DBInconsistent("Missing envelope for parent block 0xfd97...")`
- **Root cause**: During checkpoint sync, only the block and state are downloaded — NOT the execution payload envelope. When child blocks arrive and reference the checkpoint block as parent with `is_parent_block_full()=true`, the code needs the parent's envelope to get the Full state root. The envelope isn't in the DB.
- **Current workaround**: Falls back to `(Pending, parent_block.state_root())` when envelope is missing. This allows processing to proceed but **causes state root mismatches** → block validation fails → peers disconnect.
- **Result**: Node starts, briefly connects to peers, fails to validate blocks, loses all peers, stalls.
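A sketch of the workaround's decision, assuming simplified types. `EnvelopeSketch` with a `state_root` field and the function `parent_state_ref` are hypothetical; the real `load_parent` reads these values from the store.

```rust
type Hash256 = [u8; 32];

#[derive(Debug, PartialEq)]
enum PayloadStatus {
    Pending,
    Full,
}

/// Hypothetical envelope; assumes it records the post-payload (Full) state root.
struct EnvelopeSketch {
    state_root: Hash256,
}

/// Chooses which parent state to load for a child block.
fn parent_state_ref(
    parent_is_full: bool,
    envelope: Option<&EnvelopeSketch>,
    parent_block_state_root: Hash256,
) -> (PayloadStatus, Hash256) {
    match (parent_is_full, envelope) {
        // Normal case: the envelope gives the Full state root.
        (true, Some(env)) => (PayloadStatus::Full, env.state_root),
        // Workaround: envelope missing, fall back to the Pending state.
        // This is the path that leads to state root mismatches downstream.
        (true, None) => (PayloadStatus::Pending, parent_block_state_root),
        // Parent was not Full: the block's own state root is the right one.
        (false, _) => (PayloadStatus::Pending, parent_block_state_root),
    }
}
```

The `(true, None)` arm returns the same value as the not-Full case, which is why processing proceeds but the child is validated against the wrong state.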

## What Needs to Happen (Priority Order)

### 1. Fix the envelope problem (BLOCKING)
The core issue: checkpoint sync doesn't download/store the execution payload envelope for the checkpoint block. Three approaches:

**Option A — Download envelope during checkpoint sync (RECOMMENDED)**
- Extend `weak_subjectivity_state()` in `beacon_node/beacon_chain/src/builder.rs` (~line 425) to also download and store the checkpoint block's envelope
- The checkpoint sync server at `https://beacon.epbs-devnet-1.ethpandaops.io/` serves blocks via `/eth/v2/beacon/blocks/{slot}` which contains `signed_execution_payload_bid`
- The envelope itself may need a separate endpoint. Check whether `/eth/v1/beacon/execution_payload_envelopes/{block_root}` exists (it returned 404 when I tried)
- If the envelope isn't available via HTTP, you'd need to either:
- Add envelope support to the checkpoint sync protocol
- Or compute it: fetch the execution payload from geth for that block hash and construct the envelope
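For Option A, building the request URL might look like the sketch below. The path is the one quoted above, which returned 404 at the time of writing, so treat it as unconfirmed; `envelope_url` is a hypothetical helper, and the actual HTTP request is left out.

```rust
/// Builds the envelope URL for a block root given the checkpoint server base URL.
/// Assumption: the endpoint path is the unconfirmed one quoted in the notes.
fn envelope_url(base_url: &str, block_root_hex: &str) -> String {
    format!(
        "{}/eth/v1/beacon/execution_payload_envelopes/{}",
        // Tolerate a trailing slash on the configured base URL.
        base_url.trim_end_matches('/'),
        block_root_hex
    )
}
```

Trimming the trailing slash matters here because the checkpoint sync URL appears both with and without one in these notes.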

**Option B — Trigger P2P envelope lookup when missing**
- When `get_payload_envelope(&root)` returns `None`, instead of erroring or falling back, queue an envelope lookup via P2P (similar to how block lookups work)
- PR #68's lookup sync code may already have infrastructure for this: check `single_block_lookup.rs` and `network_context.rs` for envelope request methods
- `request_single_envelope()` exists in `network_context.rs` and may be usable

**Option C — Compute Full state from Pending state + payload**
- Load the Pending state, execute the payload against it to produce the Full state
- This requires having the execution payload data and running a state transition
- Complex and not ideal for the sync hot path

### 2. Peer connectivity issues
- Node connects to 2-7 peers initially but drops to 0 quickly
- This happens even before block processing (the checkpoint sync instance lost peers before any blocks were processed)
- Might be related to: fork digest mismatch, status message incompatibility, or rate limiting
- The genesis sync test (with vibehouse) maintained peers better — investigate why checkpoint sync loses them
- Could also be a gossip subnet issue — the devnet only has ~10 nodes total

### 3. EL sync coordination
- Geth starts and imports ~157 blocks but then stalls waiting for forkchoice updates from the CL
- Once the CL can process blocks, it will send forkchoice updates and geth will follow
- This is expected and not a bug — it's just downstream of fixing the envelope issue

## Key Code Locations

| What | File | Line |
|------|------|------|
| Checkpoint sync init | `beacon_node/beacon_chain/src/builder.rs` | ~425 (`weak_subjectivity_state`) |
| State storage | `beacon_node/store/src/hot_cold_store.rs` | ~1077 (`put_state`) |
| Hot state summary | `beacon_node/store/src/hot_cold_store.rs` | ~4220 (`HotStateSummary::new`) |
| Payload status check | `beacon_node/store/src/hot_cold_store.rs` | ~1864 (`get_hot_state_summary_payload_status`) |
| Parent state loading | `beacon_node/beacon_chain/src/block_verification.rs` | ~1960 (`load_parent`) |
| Envelope storage | `beacon_node/store/src/hot_cold_store.rs` | ~1064 (`put_payload_envelope`) |
| Envelope retrieval | `beacon_node/store/src/hot_cold_store.rs` | ~741 (`get_payload_envelope`) |
| Envelope P2P request | `beacon_node/network/src/sync/network_context.rs` | search for `request_single_envelope` |

## Files Modified (uncommitted)

1. **`beacon_node/store/src/hot_cold_store.rs`** — Bug 1 fix: handle missing previous_state_root summary in `get_hot_state_summary_payload_status`
2. **`beacon_node/beacon_chain/src/block_verification.rs`** — Bug 2 workaround: fallback to Pending when envelope missing + added `warn` to tracing imports

## What's NOT the Problem
- Build: compiles fine in release (~2-4 min incremental)
- EL: geth syncs and connects, authrpc works
- P2P boot ENRs: correct, 11 entries, work for genesis sync
- Checkpoint sync download: block and state download fine
- Config: correct network config, fork schedule, genesis state

## Useful Commands
```bash
# Check sync status
curl -s http://127.0.0.1:5053/eth/v1/node/syncing | jq .

# Check devnet head
curl -s "https://beacon.epbs-devnet-1.ethpandaops.io/eth/v1/beacon/headers/head" | jq '.data.header.message.slot'

# Check geth
docker logs geth-devnet-1 2>&1 | tail -20

# Check CL logs
tail -50 /tmp/lh-devnet.log

# Rebuild after changes
export PATH="$HOME/.cargo/bin:$PATH"
cd /root/.openclaw/workspace/lighthouse-devnet-test
cargo build --release --bin lighthouse

# Restart clean
pkill -f "lighthouse-devnet-test" 2>/dev/null; sleep 2
rm -rf /mnt/ssd/lighthouse-devnet-1
nohup /tmp/start-lh-devnet.sh > /tmp/lh-devnet.log 2>&1 &
```
18 changes: 18 additions & 0 deletions beacon_node/beacon_chain/src/beacon_chain.rs
@@ -2894,6 +2894,24 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
            }
        }

        // Pre-store payload envelopes in the DB before block processing.
        // Blocks from post-Gloas epochs may carry an envelope that must be
        // available in the store when the block is imported.
        for block in &chain_segment {
            if let Some(envelope) = block.envelope() {
                let block_root = block.block_root();
                if let Err(e) = self
                    .store
                    .put_payload_envelope(&block_root, envelope.as_ref().clone())
                {
                    return ChainSegmentResult::Failed {
                        imported_blocks: vec![],
                        error: BlockError::BeaconChainError(Box::new(Error::DBError(e))),
                    };
                }
            }
        }

        let mut imported_blocks = vec![];

        // Filter uninteresting blocks from the chain segment in a blocking task.