Conversation
Silence on the debug channel was misleading: a running reaper produced no output when counts were 0, making "is it even running?" indistinguishable from "running but nothing to do".
- log loop start/exit at INFO so goroutine presence is observable
- log cycle start/end + stale worker count at INFO
- log each orphan-reap phase before and after the delete, including count=0, with elapsed time; if a cycle wedges, the last log line names the phase that hung
- extract per-phase handling into a reapPhase helper to cut repetition
Prod has gone months with a silently wedged cleanup loop. Wrap the DELETE in a transaction with SET LOCAL statement_timeout so a stalled query returns an error instead of hanging the goroutine. SET LOCAL reverts on commit/rollback and cannot leak across the connection pool.
Move StartHealthCheckCleanup from Thread.Start to Worker.Run so only one reaper runs per process, regardless of --concurrency. Gate it with --enable-reaper (default true) so multi-process deployments can leave it on for exactly one of their dataset-worker tasks. Also bump statement_timeout from 30s to 2min (no contention is expected with a singleton, so this only catches pathology) and reduce the files batch from 1000 to 200 (the files -> car_blocks cascade causes write amplification).
Author:
> considering replacing this approach with per-prep partitions
Author:
> going to merge this to test on IA; may end up reverting
Summary
We have a reaper that deletes orphaned rows (car_blocks, files, cars, etc.) on a 5-minute loop inside the dataset worker. While debugging a case of 7k+ orphan cars and 6 stale workers accumulated across ~14 months of production, we noticed that with `GOLOG_LOG_LEVEL=debug` we see heartbeat logs every minute but zero `running healthcheck cleanup` debug lines across 22 minutes of worker output: something is wedging the cleanup goroutine (or its path before the first log line) while the sibling heartbeat goroutine runs fine. Before diagnosing the wedge we need log output that makes a running-but-silent reaper distinguishable from a wedged one.
Changes
- Log loop start/exit (`healthcheck cleanup loop started`, `...exiting`) so the goroutine is visibly alive
- Log each orphan-reap phase with `count=0` and elapsed ms. If a cycle wedges, the last log line names the phase that hung.
- Extract per-phase handling into a `reapPhase` helper

No behavior changes: same SQL, same batch sizes, same cadence.
Test plan
- `go test ./service/healthcheck/...` passes on sqlite, mysql, postgres in devcontainer
- `go vet` + `staticcheck` clean