Skip to content

Fix resume data loss: route heartbeat coords through applyEventsQueue#34

Merged
grodowski merged 3 commits into
masterfrom
grodowski/coding-chimp/fix-heartbeat-dml-checkpoint-race
May 21, 2026
Merged

Fix resume data loss: route heartbeat coords through applyEventsQueue#34
grodowski merged 3 commits into
masterfrom
grodowski/coding-chimp/fix-heartbeat-dml-checkpoint-race

Conversation

@grodowski
Copy link
Copy Markdown
Member

onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly from the streamer goroutine, before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop reads CurrentCoordinates as "applied through this GTID" and could persist a checkpoint whose LastTrxCoords was ahead of what was actually applied.

If gh-ost crashed before applyEventsQueue drained, --resume read that checkpoint and called StartSyncGTID with the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table.

Fix: enqueue the heartbeat through applyEventsQueue. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant.

Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly
from the streamer goroutine, before any DML that preceded the heartbeat was
applied to the ghost table. The checkpoint loop reads CurrentCoordinates as
"applied through this GTID" and could persist a checkpoint whose
LastTrxCoords was ahead of what was actually applied.

If gh-ost crashed before applyEventsQueue drained, --resume read that
checkpoint and called StartSyncGTID with the persisted set; MySQL treated
the un-applied GTIDs as already-seen and never re-streamed them. The ghost
table silently lost those DMLs and cut-over produced a stale table.

Fix: enqueue a tableWriteFunc onto applyEventsQueue that performs the
coords bump. The apply goroutine executes it in order, after the DMLs the
streamer enqueued before the heartbeat, restoring the invariant.

Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at
the previous HEAD and passes after the fix; also asserts queue ordering to
guard against future changes that wrap the heartbeat enqueue in a goroutine.

Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
@grodowski grodowski requested a review from coding-chimp May 20, 2026 08:54
Comment thread go/logic/migrator.go Outdated
Comment on lines +332 to +343
// Route the coords bump through applyEventsQueue so it is ordered after
// any DMLs the streamer enqueued before this heartbeat.
//
// If we instead mutated applier.CurrentCoordinates directly from this
// streamer goroutine, the checkpoint loop (Migrator.Checkpoint) could
// observe coords that include DMLs still sitting un-applied in
// applyEventsQueue and write a checkpoint row whose LastTrxCoords is
// AHEAD of what has actually been applied to the ghost table. If gh-ost
// then crashes before the queue drains, resume reads that checkpoint and
// calls StartSyncGTID with the persisted set; MySQL treats the un-applied
// GTIDs as already-seen and never re-streams them, so the ghost table
// silently loses those DMLs and cut-over produces a stale table.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we really need this comment in the code. We have a test and context in git.

I would make it more concise if we intend to keep it.

grodowski and others added 2 commits May 20, 2026 11:05
Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@grodowski
Copy link
Copy Markdown
Member Author

FYI added d96ad02 to align with github#1639

@grodowski grodowski merged commit 6e49d64 into master May 21, 2026
9 checks passed
@grodowski grodowski deleted the grodowski/coding-chimp/fix-heartbeat-dml-checkpoint-race branch May 21, 2026 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants