fix(sync): reap stale running rows on primary startup #10

Merged

dotwaffle merged 1 commit into main from fix/reap-stale-running-rows on Apr 11, 2026

Conversation

@dotwaffle (Owner)

Summary

  • Add ReapStaleRunningRows helper in internal/sync/status.go that transitions every status='running' row to status='failed' with an explanatory error_message via a single SQL UPDATE (a sketch follows this list)
  • Call it from cmd/peeringdb-plus/main.go on the primary at startup, immediately after InitStatusTable
  • Unit tests cover the success case (mix of running/success/failed, verify only running rows are transitioned and pre-existing rows stay untouched) and the no-op case (empty table)
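
For reference, a minimal sketch of what the helper could look like. The function name, table name, and error text come from this PR, but the exact signature, return value, and column layout are assumptions, not the merged code:

```go
package sync

import (
	"context"
	"database/sql"
	"fmt"
)

// reapMessage matches the explanatory text described in this PR.
const reapMessage = "startup reap: process restarted before sync completed"

// ReapStaleRunningRows transitions every status='running' row to
// status='failed' in one UPDATE and reports how many rows changed.
// Signature is illustrative; the merged code may differ.
func ReapStaleRunningRows(ctx context.Context, db *sql.DB) (int64, error) {
	res, err := db.ExecContext(ctx,
		`UPDATE sync_status
		    SET status = 'failed', error_message = ?
		  WHERE status = 'running'`, reapMessage)
	if err != nil {
		return 0, fmt.Errorf("reap stale running rows: %w", err)
	}
	return res.RowsAffected()
}
```

A single UPDATE keeps the reap atomic and makes the empty-table case a natural no-op: zero rows match, nothing changes.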

Motivation

We currently have a stale row 929 in the production sync_status table in running state. It's left over from the rolling deploy that destroyed the original LHR primary machine (801e9df646e918) — that process was killed mid-cycle before RecordSyncComplete ran, and the running row has no process to clean it up. Phantom rows like this show up in /ui/about freshness queries and /readyz fallback logic, even though Worker.running is a per-process atomic and nothing is actually in flight.

Safety

Safe under Consul lease semantics:

  • Only one primary at a time holds the LiteFS write lease
  • The reap fires in main.go BEFORE the first sync tick, so there's no overlap with a legitimate in-flight row (see the startup-order sketch after this list)
  • Worker.running is reset on process start, so the cleanup is purely cosmetic — no future sync is blocked by the stale row
  • RecordSyncComplete's eventual UPDATE ... WHERE id = ? would win anyway if there were an overlap (latest-write-wins)
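
To make the ordering concrete, a hypothetical excerpt of the primary startup path. The module path, isPrimary flag, runSyncLoop stub, and the InitStatusTable signature are illustrative assumptions; the point is only the sequence InitStatusTable, then the reap, then the first sync tick:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	"example.com/peeringdb-plus/internal/sync" // hypothetical module path
)

// startPrimary illustrates the ordering guarantee from the list above.
func startPrimary(ctx context.Context, db *sql.DB, isPrimary bool) error {
	if isPrimary {
		if err := sync.InitStatusTable(ctx, db); err != nil {
			return fmt.Errorf("init status table: %w", err)
		}
		// Reap orphans left by a previous process killed mid-cycle.
		// This runs before the first tick, so it cannot race a live sync.
		n, err := sync.ReapStaleRunningRows(ctx, db)
		if err != nil {
			return fmt.Errorf("reap stale running rows: %w", err)
		}
		log.Printf("startup reap: transitioned %d stale running row(s)", n)
	}
	return runSyncLoop(ctx, db)
}

// runSyncLoop is a stub for illustration; the real worker loop lives elsewhere.
func runSyncLoop(ctx context.Context, db *sql.DB) error { return nil }
```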

Test plan

  • go test -race ./internal/sync/... ./cmd/peeringdb-plus/...
  • go vet ./...
  • golangci-lint run ./... (0 issues)

After merge + deploy, the existing prod row 929 should be transitioned to failed with error_message='startup reap: process restarted before sync completed'.
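
For the success case described in the summary, a unit test could look roughly like this. openTestDB and the single-column insert are assumptions about the test scaffolding, not the tests actually merged:

```go
func TestReapStaleRunningRows(t *testing.T) {
	db := openTestDB(t) // hypothetical helper: in-memory SQLite with the sync_status schema
	for _, s := range []string{"running", "success", "failed"} {
		if _, err := db.Exec(`INSERT INTO sync_status (status) VALUES (?)`, s); err != nil {
			t.Fatal(err)
		}
	}

	n, err := ReapStaleRunningRows(context.Background(), db)
	if err != nil {
		t.Fatal(err)
	}
	if n != 1 {
		t.Fatalf("reaped %d rows, want 1", n)
	}

	// Only the running row should have been transitioned; the seeded
	// success and failed rows stay untouched (failed gains the reaped row).
	for status, want := range map[string]int{"running": 0, "success": 1, "failed": 2} {
		var got int
		if err := db.QueryRow(
			`SELECT COUNT(*) FROM sync_status WHERE status = ?`, status,
		).Scan(&got); err != nil {
			t.Fatal(err)
		}
		if got != want {
			t.Fatalf("status=%s: got %d rows, want %d", status, got, want)
		}
	}
}
```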

🤖 Generated with Claude Code

A sync_status row can be left in 'running' state forever if the
primary process is killed mid-cycle before RecordSyncComplete runs —
typically during a rolling deploy that terminates the old primary
before it finishes syncing. The phantom row then shows up in
/ui/about and /readyz queries that search by status, even though
nothing is actually in flight.

Worker.running is an in-memory atomic reset on process start, so
no future sync is actually blocked by the stale row — the fix is
purely cosmetic, transitioning orphans to 'failed' with an
explanatory error_message on every primary startup.

Safe under Consul lease semantics: only one primary runs at a
time, and the reap fires BEFORE the first sync tick, so there's
no overlap with a legitimate in-flight row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Code Metrics Report

Overall coverage: 83.7% (files in pull request scope: 29.9%)

File                         Coverage
cmd/peeringdb-plus/main.go   14.0%
internal/sync/status.go      88.5%

Reported by octocov

@dotwaffle merged commit dfdf234 into main on Apr 11, 2026. 5 checks passed.
@dotwaffle deleted the fix/reap-stale-running-rows branch on April 11, 2026 at 13:15.
