fix(sync): reap stale running rows on primary startup #10

Merged

dotwaffle merged 1 commit into main from fix/reap-stale-running-rows on Apr 11, 2026

Conversation

@dotwaffle (Owner)

Summary

  • Add ReapStaleRunningRows helper in internal/sync/status.go that transitions every status='running' row to status='failed' with an explanatory error_message via a single SQL UPDATE (a sketch follows this list)
  • Call it from cmd/peeringdb-plus/main.go on the primary at startup, immediately after InitStatusTable
  • Unit tests cover the success case (mix of running/success/failed, verify only running rows are transitioned and pre-existing rows stay untouched) and the no-op case (empty table)
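
For reference, a minimal sketch of what the helper could look like. The function name, table name, and error text come from this PR, but the exact signature, return value, and column layout are assumptions, not the merged code:

```go
package sync

import (
	"context"
	"database/sql"
	"fmt"
)

// reapMessage matches the explanatory text described in this PR.
const reapMessage = "startup reap: process restarted before sync completed"

// ReapStaleRunningRows transitions every status='running' row to
// status='failed' in one UPDATE and reports how many rows changed.
// Signature is illustrative; the merged code may differ.
func ReapStaleRunningRows(ctx context.Context, db *sql.DB) (int64, error) {
	res, err := db.ExecContext(ctx,
		`UPDATE sync_status
		    SET status = 'failed', error_message = ?
		  WHERE status = 'running'`, reapMessage)
	if err != nil {
		return 0, fmt.Errorf("reap stale running rows: %w", err)
	}
	return res.RowsAffected()
}
```

A single UPDATE keeps the reap atomic and makes the empty-table case a natural no-op: zero rows match, nothing changes.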

Motivation

We currently have a stale row 929 in the production sync_status table in running state. It's left over from the rolling deploy that destroyed the original LHR primary machine (801e9df646e918) — that process was killed mid-cycle before RecordSyncComplete ran, and the running row has no process to clean it up. Phantom rows like this show up in /ui/about freshness queries and /readyz fallback logic, even though Worker.running is a per-process atomic and nothing is actually in flight.

Safety

Safe under Consul lease semantics:

  • Only one primary at a time holds the LiteFS write lease
  • The reap fires in main.go BEFORE the first sync tick, so there's no overlap with a legitimate in-flight row (see the startup-order sketch after this list)
  • Worker.running is reset on process start, so the cleanup is purely cosmetic — no future sync is blocked by the stale row
  • RecordSyncComplete's eventual UPDATE ... WHERE id = ? would win anyway if there were an overlap (latest-write-wins)
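
To make the ordering concrete, a hypothetical excerpt of the primary startup path. The module path, isPrimary flag, runSyncLoop stub, and the InitStatusTable signature are illustrative assumptions; the point is only the sequence InitStatusTable, then the reap, then the first sync tick:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	"example.com/peeringdb-plus/internal/sync" // hypothetical module path
)

// startPrimary illustrates the ordering guarantee from the list above.
func startPrimary(ctx context.Context, db *sql.DB, isPrimary bool) error {
	if isPrimary {
		if err := sync.InitStatusTable(ctx, db); err != nil {
			return fmt.Errorf("init status table: %w", err)
		}
		// Reap orphans left by a previous process killed mid-cycle.
		// This runs before the first tick, so it cannot race a live sync.
		n, err := sync.ReapStaleRunningRows(ctx, db)
		if err != nil {
			return fmt.Errorf("reap stale running rows: %w", err)
		}
		log.Printf("startup reap: transitioned %d stale running row(s)", n)
	}
	return runSyncLoop(ctx, db)
}

// runSyncLoop is a stub for illustration; the real worker loop lives elsewhere.
func runSyncLoop(ctx context.Context, db *sql.DB) error { return nil }
```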

Test plan

  • go test -race ./internal/sync/... ./cmd/peeringdb-plus/...
  • go vet ./...
  • golangci-lint run ./... (0 issues)

After merge + deploy, the existing prod row 929 should be transitioned to failed with error_message='startup reap: process restarted before sync completed'.
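
For the success case described in the summary, a unit test could look roughly like this. openTestDB and the single-column insert are assumptions about the test scaffolding, not the tests actually merged:

```go
func TestReapStaleRunningRows(t *testing.T) {
	db := openTestDB(t) // hypothetical helper: in-memory SQLite with the sync_status schema
	for _, s := range []string{"running", "success", "failed"} {
		if _, err := db.Exec(`INSERT INTO sync_status (status) VALUES (?)`, s); err != nil {
			t.Fatal(err)
		}
	}

	n, err := ReapStaleRunningRows(context.Background(), db)
	if err != nil {
		t.Fatal(err)
	}
	if n != 1 {
		t.Fatalf("reaped %d rows, want 1", n)
	}

	// Only the running row should have been transitioned; the seeded
	// success and failed rows stay untouched (failed gains the reaped row).
	for status, want := range map[string]int{"running": 0, "success": 1, "failed": 2} {
		var got int
		if err := db.QueryRow(
			`SELECT COUNT(*) FROM sync_status WHERE status = ?`, status,
		).Scan(&got); err != nil {
			t.Fatal(err)
		}
		if got != want {
			t.Fatalf("status=%s: got %d rows, want %d", status, got, want)
		}
	}
}
```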

🤖 Generated with Claude Code

A sync_status row can be left in 'running' state forever if the
primary process is killed mid-cycle before RecordSyncComplete runs —
typically during a rolling deploy that terminates the old primary
before it finishes syncing. The phantom row then shows up in
/ui/about and /readyz queries that search by status, even though
nothing is actually in flight.

Worker.running is an in-memory atomic reset on process start, so
no future sync is actually blocked by the stale row — the fix is
purely cosmetic, transitioning orphans to 'failed' with an
explanatory error_message on every primary startup.

Safe under Consul lease semantics: only one primary runs at a
time, and the reap fires BEFORE the first sync tick, so there's
no overlap with a legitimate in-flight row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Code Metrics Report

Overall coverage: 83.7% (files in pull request scope: 29.9%)

File                         Coverage
cmd/peeringdb-plus/main.go   14.0%
internal/sync/status.go      88.5%

Reported by octocov

@dotwaffle merged commit dfdf234 into main on Apr 11, 2026. 5 checks passed.
@dotwaffle deleted the fix/reap-stale-running-rows branch on April 11, 2026 at 13:15.
