Conversation
Silence on the debug channel was misleading: a running reaper produced no output when counts were 0, making "is it even running?" indistinguishable from "running but nothing to do".
- log loop start/exit at INFO so goroutine presence is observable
- log cycle start/end + stale worker count at INFO
- log each orphan-reap phase before and after the delete, including count=0, with elapsed time; if a cycle wedges, the last log line names the phase that hung
- extract per-phase handling into a reapPhase helper to cut repetition
Prod has gone months with a silently wedged cleanup loop. Wrap the DELETE in a transaction with SET LOCAL statement_timeout so a stalled query returns an error instead of hanging the goroutine. SET LOCAL reverts on commit/rollback and cannot leak across the connection pool.
Move StartHealthCheckCleanup from Thread.Start to Worker.Run so only one reaper runs per process, regardless of --concurrency. Gate it with --enable-reaper (default true) so multi-process deployments can leave it on for exactly one of their dataset-worker tasks. Also bump statement_timeout from 30s to 2min (no contention is expected with a singleton, so this only catches pathology) and reduce the files batch from 1000 to 200 (the files -> car_blocks cascade causes write amplification).
Author:
> considering replacing this approach with per-prep partitions
Author:
> going to merge this to test on IA; may end up reverting
Summary
We have a reaper that deletes orphaned rows (car_blocks, files, cars, etc.) on a 5-minute loop inside the dataset worker. While debugging a case of 7k+ orphan cars and 6 stale workers accumulated across ~14 months of production, we noticed that with `GOLOG_LOG_LEVEL=debug` we see heartbeat logs every minute but zero `running healthcheck cleanup` debug lines across 22 minutes of worker output: something is wedging the cleanup goroutine (or its path before the first log line) while the sibling heartbeat goroutine runs fine. Before diagnosing the wedge we need log output that makes a running-but-silent reaper distinguishable from a wedged one.
Changes
- Log loop start/exit (`healthcheck cleanup loop started`, `...exiting`) so the goroutine is visibly alive
- Log each orphan-reap phase with `count=0` and elapsed ms. If a cycle wedges, the last log line names the phase that hung.
- Extract per-phase handling into a `reapPhase` helper

No behavior changes: same SQL, same batch sizes, same cadence.
Test plan
- `go test ./service/healthcheck/...` passes on sqlite, mysql, postgres in devcontainer
- `go vet` + `staticcheck` clean