Resolver panic under load causes permanent block builder stall after restart #75

@wpank

Summary

Under load, the Commonware resolver actor panics with "resolver should not finish". After Docker auto-restarts the node, the transaction pool retains stale transactions from before the crash. The block builder then enters a permanent failure loop: on-chain nonces were reset, but the pooled transactions carry higher nonces, so every build attempt fails with NonceTooHigh.

Additionally, after any node restart, the resolver permanently blocks all peers within milliseconds because EVM block verification requires sequential parent snapshots that are lost on restart. This makes catch-up impossible.

Load Test Evidence (2026-05-22)

During a 1,000-tx load test:

ERROR commonware_runtime::utils::handle: task panicked err="resolver should not finish"

After restart, permanent block builder failure:

WARN build_block: execution failed height=288 txs=195 error=TxExecution("Transaction(NonceTooHigh { tx: 24, state: 0 })")

This repeated 1,373 times in a single minute.
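One possible mitigation for the stale-pool loop is to revalidate the pool against state nonces on startup. The sketch below is illustrative only: `PooledTx`, `prune_stale`, and the `u64` sender key are hypothetical names, not types from this codebase. It keeps, per sender, only the transactions whose nonces run contiguously from the sender's current state nonce.

```rust
use std::collections::HashMap;

/// Hypothetical pooled transaction: only the fields needed for the nonce check.
#[derive(Debug, Clone, PartialEq)]
pub struct PooledTx {
    pub sender: u64, // stand-in for an account address
    pub nonce: u64,
}

/// Keep only transactions that are executable against `state_nonces`.
/// Per sender, nonces must run contiguously from the sender's state nonce
/// (senders missing from the map are assumed to start at nonce 0).
pub fn prune_stale(pool: Vec<PooledTx>, state_nonces: &HashMap<u64, u64>) -> Vec<PooledTx> {
    // Group the pool by sender so each account is checked independently.
    let mut by_sender: HashMap<u64, Vec<PooledTx>> = HashMap::new();
    for tx in pool {
        by_sender.entry(tx.sender).or_default().push(tx);
    }

    let mut kept = Vec::new();
    for (sender, mut txs) in by_sender {
        txs.sort_by_key(|tx| tx.nonce);
        let mut expected = state_nonces.get(&sender).copied().unwrap_or(0);
        for tx in txs {
            if tx.nonce == expected {
                expected += 1;
                kept.push(tx);
            } else if tx.nonce > expected {
                // Nonce gap: this and all later txs would fail with NonceTooHigh.
                break;
            }
            // tx.nonce < expected: nonce already used, drop it and continue.
        }
    }

    // Deterministic output order for callers and tests.
    kept.sort_by_key(|tx| (tx.sender, tx.nonce));
    kept
}
```

With the values from the log above (tx nonce 24, state nonce 0), such a check would drop the transaction before it reaches the builder. A real pool might park gapped transactions in a queue rather than discard them; that choice is outside this sketch.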

Impact

  • Any node restart is potentially fatal to the network
  • Two restarts in a 4-validator cluster = permanent quorum loss
  • Rolling upgrades are impossible
  • No mechanism to unblock peers or recover without full cluster restart
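The quorum claim above can be checked with a small calculation. This sketch assumes the usual BFT bound (n validators tolerate f = (n - 1) / 3 faults and need n - f participants for quorum); the issue does not state the threshold, so treat the formula and the function names as assumptions.

```rust
/// Maximum number of faulty validators tolerated, assuming n >= 3f + 1.
pub fn max_faults(n: u32) -> u32 {
    n.saturating_sub(1) / 3
}

/// Number of validators required for quorum under the same assumption.
pub fn quorum(n: u32) -> u32 {
    n - max_faults(n)
}

/// True if the cluster can still reach quorum with `unavailable` validators
/// stalled or unable to catch up.
pub fn has_quorum(n: u32, unavailable: u32) -> bool {
    n.saturating_sub(unavailable) >= quorum(n)
}
```

Under this assumption a 4-validator cluster needs 3 participants, so one stalled node is survivable and two are not. Because a restarted node cannot catch up, the loss does not heal on its own.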

Root Cause

  1. Resolver panic: The resolver actor terminates unexpectedly under high load
  2. Peer blocking: After restart, verify_block() returns false when parent snapshots are missing (cold cache), and the resolver permanently blocks the peer
  3. Stale pool: Restarted node retains pool txs with nonces ahead of the reset state
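For the peer-blocking cause, one possible direction is to stop collapsing "cannot verify yet" and "invalid" into a single `false`. The sketch below is not the project's API: `VerifyOutcome`, `PeerAction`, `classify`, and `peer_action` are hypothetical names. It shows a three-way result in which only a definite verification failure leads to blocking the peer.

```rust
/// Hypothetical result of verifying a block received from a peer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyOutcome {
    /// Parent snapshot was available and execution succeeded.
    Valid,
    /// Parent snapshot was available and execution failed.
    Invalid,
    /// Parent snapshot is not cached (e.g. cold cache after restart).
    MissingParent,
}

/// Hypothetical action the resolver takes for the peer that sent the block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PeerAction {
    Accept,
    Block,
    RetryLater,
}

/// Classify a block. `execute` runs only when the parent snapshot is present,
/// since execution is meaningless without it.
pub fn classify(parent_cached: bool, execute: impl FnOnce() -> bool) -> VerifyOutcome {
    if !parent_cached {
        return VerifyOutcome::MissingParent;
    }
    if execute() {
        VerifyOutcome::Valid
    } else {
        VerifyOutcome::Invalid
    }
}

/// Only a definite verification failure should penalize the peer.
pub fn peer_action(outcome: VerifyOutcome) -> PeerAction {
    match outcome {
        VerifyOutcome::Valid => PeerAction::Accept,
        VerifyOutcome::Invalid => PeerAction::Block,
        VerifyOutcome::MissingParent => PeerAction::RetryLater,
    }
}
```

With this shape, a cold cache after restart yields `RetryLater` rather than a permanent block, leaving room to fetch or rebuild the missing parent snapshots before retrying.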

See tmp/issues/04-resolver-permanent-peer-blocking-after-restart.md for the full writeup.
