Skip to content

fix: collapse concurrent watch manager creates with singleflight#655

Open
savme wants to merge 1 commit into
mainfrom
fix/watch-manager-singleflight
Open

fix: collapse concurrent watch manager creates with singleflight#655
savme wants to merge 1 commit into
mainfrom
fix/watch-manager-singleflight

Conversation

@savme

@savme savme commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Problem

milo-apiserver leaks goroutines at ~1,100/sec and OOMs every 45–90 minutes (observed since v0.28.0).

A pprof goroutine dump from production shows ~31,375 orphaned watchManager instances — each holding ~18 goroutines (etcd watch, cacher, gRPC keepalive, reflector). The integer ratios in the dump are exact multiples of 31,375, pointing to a single coherent leak source.

Root cause: getWatchManager() has a TOCTOU race. On a cache miss, multiple concurrent goroutines each create and start a watch manager, then race to LoadOrStore. Only one survives in the cache; the rest are permanently orphaned — started but never reachable, so their TTL timers never fire and Stop() is never called.

The v0.28.0 disengageProject fix improved the controller-manager's manager footprint, but project re-engagements generate admission bursts that hit getWatchManager() with high concurrency.

Fix

Wrap the creation path in a singleflight.Group keyed by projectID. Concurrent calls for the same project collapse into one; all callers get the same result. A fast-path sync.Map load bypasses singleflight entirely on the hot path (cache hit).

An inner re-check inside Do() handles the sequential race: a goroutine that missed the outer cache check before a prior Do() stored its result will find the manager on the inner load and return it rather than creating a duplicate.

@savme savme marked this pull request as ready for review June 16, 2026 17:10
@savme savme requested a review from scotwells June 16, 2026 17:11
@scotwells

Copy link
Copy Markdown
Contributor

Is there a dashboard you used to confirm this behavior or observe the goroutine leak? Is it happening in all environments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants