fix #3231 prevent snapshot corruption #3232
Conversation
@corhere any chance someone could review this PR before the next release of moby? Any help needed?
Pull request overview
Fixes raft snapshot splitting in SwarmKit transport to avoid snapshot corruption/panics when raft messages become too large for the configured gRPC max message size.
Changes:
- Add a lower bound to the computed snapshot chunk payload size used for splitting.
- Avoid mutating the original `raftpb.Message.Snapshot.Data` across chunk iterations by copying the `Snapshot` struct per chunk.
- Add regression tests for snapshot splitting and payload size behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| manager/state/raft/transport/peer.go | Adjusts snapshot chunk sizing and fixes snapshot chunking to avoid pointer/slice mutation across iterations. |
| manager/state/raft/transport/transport_test.go | Adds tests covering snapshot reassembly correctness and payload size minimum behavior. |
```diff
 func raftMessagePayloadSize(m *raftpb.Message) int {
-	return GRPCMaxMsgSize - raftMessageStructSize(m)
+	s := GRPCMaxMsgSize - raftMessageStructSize(m)
+	if s < 64<<10 {
+		s = 64 << 10
+	}
+	return s
```
raftMessagePayloadSize currently floors the payload to 64 KiB for any computed value < 64 KiB. If the struct overhead is close to (but still below) GRPCMaxMsgSize, this increases the chunk size beyond what fits in a gRPC message and can turn a previously-sendable snapshot into one that always hits ResourceExhausted. Consider only applying the 64 KiB fallback when the computed size is <= 0 (or otherwise keep the computed positive size), so chunking still respects GRPCMaxMsgSize whenever it’s possible to do so.
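For concreteness, a minimal sketch of that suggestion (not the change under review; it reuses the helpers shown in the diff above):

```go
// Sketch only: fall back solely when the computed size is non-positive,
// so chunking still honors GRPCMaxMsgSize whenever that is possible.
func raftMessagePayloadSize(m *raftpb.Message) int {
	s := GRPCMaxMsgSize - raftMessageStructSize(m)
	if s <= 0 {
		// Overhead alone exceeds the limit; pick a best-effort floor.
		s = 64 << 10
	}
	return s
}
```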
```go
t.Run("LargeMetadataDoesNotPanic", func(t *testing.T) {
	// Simulate a snapshot where struct overhead exceeds GRPCMaxMsgSize.
	// This is the scenario from the bug: many cluster objects cause
	// large raft message metadata, making payloadSize negative without
	// the fix.
	data := make([]byte, 5<<20) // 5 MiB snapshot data
	for i := range data {
		data[i] = byte(i % 256)
	}
	m := raftpb.Message{
		Type: raftpb.MsgSnap,
		From: 1,
		To:   2,
		Snapshot: &raftpb.Snapshot{
			Data: data,
			Metadata: raftpb.SnapshotMetadata{
				Index: uint64(len(data)),
				// Large ConfState to push struct size above GRPCMaxMsgSize
				ConfState: raftpb.ConfState{
					Voters: make([]uint64, 1<<20),
				},
			},
		},
	}
	// Must not panic
	msgs := splitSnapshotData(ctx, &m)
	assert.NotEmpty(t, msgs)
```
The test intends to simulate “struct overhead exceeds GRPCMaxMsgSize”, but ConfState.Voters is filled with zero values; protobuf varint encoding makes these extremely small (likely ~2 bytes/value including the tag), so this may not actually push the message overhead past 4 MiB. To make the test robust, assert the precondition (e.g., raftMessageStructSize(&m) > GRPCMaxMsgSize) or populate Voters with sufficiently large non-zero IDs so the encoded size truly exceeds the limit.
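A hedged sketch of that suggestion, reusing the test's `m` and the helpers from the diff above (assumes `math` and testify's `require` are imported):

```go
// Sketch: use worst-case voter IDs so each uint64 encodes as a full
// 10-byte varint, making the intended precondition actually hold
// (1<<20 voters * 10 bytes ≈ 10 MiB of encoded ConfState).
voters := make([]uint64, 1<<20)
for i := range voters {
	voters[i] = math.MaxUint64 // 10 bytes per value when varint-encoded
}
m.Snapshot.Metadata.ConfState.Voters = voters
// Assert the precondition the test relies on.
require.Greater(t, raftMessageStructSize(&m), GRPCMaxMsgSize)
```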
corhere
left a comment
> Clamped raftMessagePayloadSize to a minimum of 64 KiB. When a large
> cluster state causes the raft message struct overhead to exceed
> GRPCMaxMsgSize, the computed payload size went negative, leading to
> corruption during snapshot splitting. The payload size is now floored
> at 64 KiB to guarantee forward progress.
This doesn't make sense to me. No amount of splitting the snapshot data will yield a message small enough to send if the message overhead alone exceeds GRPCMaxMsgSize. Are you sure this is a real issue?
```go
// Capture the full snapshot data before the loop, because
// we need to copy the Snapshot struct for each chunk to avoid
// mutating m.Snapshot.Data through the shared pointer.
fullData := m.Snapshot.Data
```
I don't understand why another copy of the slice is needed. Cloning raftMsg.Snapshot is sufficient to ensure that assigning to raftMsg.Snapshot.Data does not mutate m.Snapshot.Data, therefore fullData == m.Snapshot.Data for all loop iterations.
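For illustration, a self-contained sketch of the slice-header semantics behind this point (the stand-in type and values are hypothetical):

```go
type snap struct{ Data []byte } // stand-in for raftpb.Snapshot

orig := &snap{Data: make([]byte, 10)}
clone := *orig              // struct copy: Data's header (ptr/len/cap) is copied
clone.Data = clone.Data[:4] // re-slicing rewrites only the copy's header
// len(orig.Data) is still 10; the original slice header is untouched,
// so fullData == orig.Data would hold on every iteration.
```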
```go
bigEntry := raftpb.Entry{Data: make([]byte, GRPCMaxMsgSize)}
m := &raftpb.Message{
	Type:    raftpb.MsgSnap,
	Entries: []raftpb.Entry{bigEntry},
```
How realistic is this scenario? Is it possible in practice for a MsgSnap message to be constructed with a non-empty Entries field?
Thanks @corhere for the review.
The shared *Snapshot pointer is the real cause of moby#3231: the
shallow copy of the raft message meant re-slicing raftMsg.Snapshot.Data
also re-sliced m.Snapshot.Data, shrinking its capacity until a later
iteration panicked with "slice bounds out of range". Cloning the
Snapshot struct alone is sufficient.

Drop the 64 KiB payload-size clamp (which could not produce a sendable
message when struct overhead already exceeds GRPCMaxMsgSize) and the
redundant fullData capture. Replace the tests with a single regression
test that feeds a realistic MsgSnap of 3*GRPCMaxMsgSize and asserts the
input message's Snapshot.Data is untouched after splitting.

Signed-off-by: Sylvere Richard <sylvere.richard@gmail.com>
corhere
left a comment
The code looks great; no notes! I just need you to squash the changes down to a single commit and update the PR description before I can approve and merge. Thank you so much for identifying and fixing this bug!
5afb846 to 6d7f722
Thank you @corhere.
@nowheresly one last thing before merging: the commit message. Could you please do the following?
Just updated my commit. Please let me know if it's fine now.
@nowheresly lines in the commit message should be wrapped to 72 chars, like how the commit messages were formatted before squashing.
Clone raft Snapshot so that re-slicing Snapshot.Data does not mutate
the initial Snapshot.Data through the shared pointer.

Signed-off-by: Sylvere Richard <sylvere.richard@gmail.com>
Is it better now?
- What I did
Fixed snapshot corruption that occurs when the swarm state is large enough that the raft message struct overhead exceeds GRPCMaxMsgSize.
- How I did it
Copied the Snapshot struct before slicing chunk data. The splitSnapshotData loop was assigning sub-slices of Snapshot.Data directly through a shared pointer, which mutated the original message's snapshot data across iterations. Each chunk now gets its own copy of the Snapshot struct so the original data remains intact for correct sub-slicing.
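For illustration, a minimal sketch of that loop shape (hand-simplified; the real splitSnapshotData also computes payloadSize and wraps each chunk for sending):

```go
// Sketch of the fixed iteration: clone the Snapshot struct per chunk so
// assigning the chunk's sub-slice never rewrites m.Snapshot.Data.
for start := 0; start < len(m.Snapshot.Data); start += payloadSize {
	end := start + payloadSize
	if end > len(m.Snapshot.Data) {
		end = len(m.Snapshot.Data)
	}
	raftMsg := *m                // shallow copy of the message
	snapshot := *m.Snapshot      // copy the Snapshot struct itself
	snapshot.Data = m.Snapshot.Data[start:end]
	raftMsg.Snapshot = &snapshot // the chunk points at its own copy
	// ... wrap raftMsg into a snapshot chunk and send it ...
}
```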
- How to test it
`go test ./manager/state/raft/transport/ -run TestSplitSnapshotData` verifies both normal snapshots and the large-metadata scenario that previously panicked/corrupted data.
- Description for the changelog