fix: collapse concurrent watch manager creates with singleflight #655
Open
savme wants to merge 1 commit into
Contributor
Is there a dashboard you used to confirm this behavior or observe the goroutine leak? Is it happening in all environments?
Problem

`milo-apiserver` leaks goroutines at ~1,100/sec and OOMs every 45–90 minutes (observed since v0.28.0).

A pprof goroutine dump from production shows ~31,375 orphaned `watchManager` instances, each holding ~18 goroutines (etcd watch, cacher, gRPC keepalive, reflector). The integer ratios in the dump are exact multiples of 31,375, pointing to a single coherent leak source.

Root cause: `getWatchManager()` has a TOCTOU race. On a cache miss, multiple concurrent goroutines each create and start a watch manager, then race to `LoadOrStore`. Only one survives in the cache; the rest are permanently orphaned: they are started but never reachable, so their TTL timers never fire and `Stop()` is never called.

The v0.28.0 `disengageProject` fix improved the controller-manager's manager footprint, but project re-engagements generate admission bursts that hit `getWatchManager()` with high concurrency.

Fix

Wrap the creation path in a `singleflight.Group` keyed by `projectID`. Concurrent calls for the same project collapse into one; all callers get the same result. A fast-path `sync.Map` load bypasses singleflight entirely on the hot path (cache hit).

An inner re-check inside `Do()` handles the sequential race: a goroutine that missed the outer cache check before a prior `Do()` stored its result will find the manager on the inner load and return it rather than creating a duplicate.
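For reviewers who want to see the shape of the fix in isolation, here is a minimal sketch of the pattern described above. It is not the code in this PR. The `registry`, `watchManager`, and `burst` names are hypothetical, and `flightGroup` is a small standard-library stand-in for `singleflight.Group` from `golang.org/x/sync/singleflight` with a simplified `Do` signature, used only so the example compiles without external modules.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// watchManager is a hypothetical stand-in for the real watch manager type.
type watchManager struct{ projectID string }

// call tracks one in-flight creation for a key.
type call struct {
	wg  sync.WaitGroup
	val *watchManager
}

// flightGroup is a stdlib-only stand-in for singleflight.Group (assumption:
// the real fix uses golang.org/x/sync/singleflight, whose Do also returns an
// error and a "shared" flag).
type flightGroup struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key at a time; concurrent callers wait for and share
// the leader's result.
func (g *flightGroup) Do(key string, fn func() *watchManager) *watchManager {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // a creation is already in flight: wait for it
		return c.val
	}
	c := &call{}
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val
}

// registry is a hypothetical holder for the per-project cache.
type registry struct {
	managers sync.Map // projectID -> *watchManager
	group    flightGroup
	created  atomic.Int64 // counts real creations, for the demo only
}

func (r *registry) getWatchManager(projectID string) *watchManager {
	// Fast path: a cache hit never touches the flight group.
	if v, ok := r.managers.Load(projectID); ok {
		return v.(*watchManager)
	}
	return r.group.Do(projectID, func() *watchManager {
		// Inner re-check: a previous Do may have stored the manager after
		// this caller missed the outer load.
		if v, ok := r.managers.Load(projectID); ok {
			return v.(*watchManager)
		}
		wm := &watchManager{projectID: projectID}
		r.created.Add(1)
		r.managers.Store(projectID, wm)
		return wm
	})
}

// burst fires n concurrent lookups for one project and reports how many
// managers were actually created.
func burst(n int) int64 {
	r := &registry{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.getWatchManager("project-a")
		}()
	}
	wg.Wait()
	return r.created.Load()
}
```

Calling `burst(200)` from a `main` function or a test should report exactly one creation. The manager is stored before the in-flight entry is removed, so any caller that arrives afterwards either hits the fast path or finds the manager on the inner re-check.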