Skip to content

log healthcheck cleanup cycle at INFO#663

Merged
parkan merged 4 commits intomainfrom
feat/reaper-logging
Apr 17, 2026
Merged

log healthcheck cleanup cycle at INFO#663
parkan merged 4 commits intomainfrom
feat/reaper-logging

Conversation

@parkan
Copy link
Copy Markdown
Collaborator

@parkan parkan commented Apr 16, 2026

Summary

We have a reaper that deletes orphaned rows (car_blocks, files, cars, etc.) from a 5min loop inside the dataset worker. Debugging a case of 7k+ orphan cars + 6 stale workers accumulating across ~14 months of production, we noticed that with GOLOG_LOG_LEVEL=debug we see heartbeat logs every minute but zero running healthcheck cleanup debug lines across 22 minutes of worker output — something is wedging the cleanup goroutine (or its path before the first log line) while the sibling heartbeat goroutine runs fine.

Before diagnosing the wedge we need log output that makes a running-but-silent reaper distinguishable from a wedged one.

Changes

  • Loop start/exit at INFO (healthcheck cleanup loop started, ...exiting) so the goroutine is visibly alive
  • Cycle start/end at INFO + stale-worker count on every cycle
  • Each orphan-reap phase logs before and after the delete at INFO, including count=0 and elapsed ms. If a cycle wedges, the last log line names the phase that hung.
  • Extracted the five near-identical phase blocks into a reapPhase helper

No behavior changes -- same SQL, same batch sizes, same cadence.

Test plan

  • go test ./service/healthcheck/... passes on sqlite, mysql, postgres in devcontainer
  • go vet + staticcheck clean

parkan added 3 commits April 16, 2026 16:39
silence on the debug channel was misleading: a running reaper produced
no output when counts were 0, making 'is it even running' hard to tell
from 'running but nothing to do'.

- log loop start/exit at INFO so goroutine presence is observable
- log cycle start/end + stale worker count at INFO
- log each orphan-reap phase before and after the delete, including
  count=0, with elapsed time. if a cycle wedges, the last log line
  names the phase that hung.
- extract per-phase handling into reapPhase helper to cut repetition
prod has gone months with a silently-wedged cleanup loop. wrap the
DELETE in a transaction with SET LOCAL statement_timeout so a stalled
query returns an error instead of hanging the goroutine. SET LOCAL
reverts on commit/rollback and can't leak across the connection pool.
move StartHealthCheckCleanup from Thread.Start to Worker.Run so only
one reaper runs per process, regardless of --concurrency. gate with
--enable-reaper (default true) so multi-process deployments can leave
it on for exactly one of their dataset-worker tasks.

also bump statement_timeout from 30s to 2min (no contention expected
with a singleton, so this only catches pathology) and reduce files
batch from 1000 to 200 (files->car_blocks cascade write amplification).
@parkan
Copy link
Copy Markdown
Collaborator Author

parkan commented Apr 17, 2026

considering replacing this approach with per-prep partitions

@parkan
Copy link
Copy Markdown
Collaborator Author

parkan commented Apr 17, 2026

going to merge this to test on IA; may end up reverting

@parkan parkan merged commit d3734e1 into main Apr 17, 2026
3 checks passed
@parkan parkan deleted the feat/reaper-logging branch April 17, 2026 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant