fix(snapshot-upload): pickUploadCandidate uploads incomplete snapshots after pod restarts #129

@bdchatham

Problem

`pickUploadCandidate` in `sidecar/tasks/snapshot_upload.go` returns the second-to-latest snapshot height by directory ordering:

```go
sort.Slice(heights, func(i, j int) bool { return heights[i] < heights[j] })
return heights[len(heights)-2], nil
```

The intent (and the comment) is to skip a potentially-in-progress snapshot. The heuristic assumes only the most recent directory could be in-progress.

In practice, every pod restart abandons whatever snapshot was being written when seid died, and Tendermint doesn't clean up the partial directory. As a result, multiple incomplete snapshot directories can stack up at the newest end of the list:

```
205168000 (4950 chunks, complete)
205208000 (4950 chunks, complete)
205660000 (1893 chunks, INCOMPLETE — pod restart killed it)
205666000 (1500 chunks, INCOMPLETE — second pod restart killed it)
205676000 ( 254 chunks, currently being written)
```

`heights[len(heights)-2]` returns `205660000`, an incomplete snapshot. The upload tars whatever is on disk (only 1893 of the expected 4950 chunks) and pushes a corrupt artifact to S3. To consumers, the artifact is indistinguishable from a successful upload until they try to restore from it.

Reproduced

Hit empirically on `pacific-1/syncer-0-0-0` while testing seictl#128. Manual fire of `POST /v0/tasks {"type":"snapshot-upload"}` produced an 18 GB tar.gz at `prod-sei-snapshots/pacific-1/state-sync/205660000.tar.gz` versus the expected ~47 GB. Bucket already cleaned up.

Fix

Filter `heights` to only complete snapshots before applying the second-to-latest pick. Tendermint tracks completion in `metadata.db` (a LevelDB at `/sei/data/snapshots/metadata.db`); each entry there represents a fully-finalized snapshot.

Two implementation paths:

A. Read metadata.db directly — open the LevelDB, iterate entries, build the set of "completed heights", filter `heights` to that set before sorting + picking second-to-latest.

B. Verify chunk count matches the snapshot's expected total — every complete snapshot in this fleet has 4950 chunks for format=1 at the current chain state. The expected total comes from the snapshot's `metadata.json` if tendermint writes one, otherwise from a prior known-complete reference. Less robust because the expected count varies with chain state size.

A is preferred — directly using tendermint's own completion tracking. The cosmos-sdk has Go libraries for opening the snapshot-store LevelDB.
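The filter-then-pick step can be sketched independently of how the set of completed heights is obtained. The following is a minimal sketch, not the actual `pickUploadCandidate`: the function name `pickCompleteCandidate`, the `uint64` height type, and the `map[uint64]bool` set are all assumptions for illustration, and the set is assumed to have been built from `metadata.db` beforehand. It also assumes that fewer than two complete snapshots should yield "no candidate", mirroring the existing second-to-latest rule.

```go
package main

import (
	"fmt"
	"sort"
)

// pickCompleteCandidate (hypothetical name) keeps only the heights present in
// the complete set, then returns the second-to-latest of those.
// ok is false when fewer than two complete snapshots exist, so the caller can
// log and skip the upload instead of shipping a partial directory.
func pickCompleteCandidate(heights []uint64, complete map[uint64]bool) (height uint64, ok bool) {
	filtered := make([]uint64, 0, len(heights))
	for _, h := range heights {
		if complete[h] {
			filtered = append(filtered, h)
		}
	}
	if len(filtered) < 2 {
		return 0, false
	}
	sort.Slice(filtered, func(i, j int) bool { return filtered[i] < filtered[j] })
	return filtered[len(filtered)-2], true
}

func main() {
	// Directory listing from the Problem section.
	heights := []uint64{205168000, 205208000, 205660000, 205666000, 205676000}
	// Assumed contents of the completion set: only the two finished snapshots.
	complete := map[uint64]bool{205168000: true, 205208000: true}

	h, ok := pickCompleteCandidate(heights, complete)
	fmt.Println(h, ok) // 205168000 true
}
```

Returning an explicit `ok` flag rather than a bare zero height is one way to keep the "no complete snapshot" case from being mistaken for a real height.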

Acceptance criteria

  • After 2+ pod restarts that orphan in-progress snapshots, `pickUploadCandidate` returns a height whose directory contents match the snapshot's expected chunk count (or that has a corresponding metadata.db entry).
  • Test reproduces the multi-incomplete scenario (e.g., create dirs with truncated chunks) and asserts pickUploadCandidate skips them.
  • No silent corrupt-upload path remains: if no complete snapshot exists, return 0/nil with a clear log line rather than the height of an incomplete snapshot.
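The multi-incomplete scenario from the second criterion could be reproduced on disk roughly as follows. This is a sketch under several assumptions: the `<root>/<height>/<format>/<chunk>` directory layout, the helper names `completeHeights`, `writeFixture`, and `runScenario`, and the use of a chunk-count check (path B) as the completeness test are all illustrative rather than taken from the sidecar code. The fixture is scaled down so that 5 chunks stands in for 4950.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
)

// completeHeights returns, in ascending order, the heights under root whose
// <height>/<format> directory holds exactly expectedChunks entries.
// Non-numeric names such as metadata.db are skipped.
func completeHeights(root string, format uint32, expectedChunks int) ([]uint64, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var heights []uint64
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		h, err := strconv.ParseUint(e.Name(), 10, 64)
		if err != nil {
			continue // not a snapshot height directory
		}
		dir := filepath.Join(root, e.Name(), strconv.FormatUint(uint64(format), 10))
		chunks, err := os.ReadDir(dir)
		if err != nil {
			continue // missing format directory: treat as incomplete
		}
		if len(chunks) == expectedChunks {
			heights = append(heights, h)
		}
	}
	sort.Slice(heights, func(i, j int) bool { return heights[i] < heights[j] })
	return heights, nil
}

// writeFixture creates root/<height>/<format>/ containing n empty chunk files.
func writeFixture(root string, height uint64, format uint32, n int) error {
	dir := filepath.Join(root, strconv.FormatUint(height, 10), strconv.FormatUint(uint64(format), 10))
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	for i := 0; i < n; i++ {
		if err := os.WriteFile(filepath.Join(dir, strconv.Itoa(i)), nil, 0o644); err != nil {
			return err
		}
	}
	return nil
}

// runScenario builds two complete and three truncated snapshot directories in
// a temp dir and reports which heights pass the chunk-count check.
func runScenario() ([]uint64, error) {
	root, err := os.MkdirTemp("", "snapshots")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	fixtures := []struct {
		height uint64
		chunks int
	}{
		{205168000, 5}, {205208000, 5}, {205660000, 2}, {205666000, 1}, {205676000, 0},
	}
	for _, f := range fixtures {
		if err := writeFixture(root, f.height, 1, f.chunks); err != nil {
			return nil, err
		}
	}
	return completeHeights(root, 1, 5)
}

func main() {
	heights, err := runScenario()
	if err != nil {
		panic(err)
	}
	fmt.Println(heights) // [205168000 205208000]
}
```

A real test would feed the surviving heights into `pickUploadCandidate` and assert that none of the three truncated heights is ever returned.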

Out of scope

  • The orphaned directories themselves are a tendermint-level issue (seid should clean them up on startup, but doesn't currently). This issue is just about the consumer-side defense against the resulting bad state.
  • The sei-infra script at `snapshotter/deploy/scripts/upload_state_sync_snapshots.sh` has the same vulnerability; it's avoided in practice because the EC2 producer doesn't get pod-killed. Worth a separate cleanup if those snapshotters ever move to a controlled-restart context.

References

  • `sidecar/tasks/snapshot_upload.go:184-211` (current pickUploadCandidate)
  • Surfaced during platform fork-test work; fix unblocks reliable K8s-side snapshot publishing.
