fix(agc): bound listener control-plane broker calls so a stalled broker cannot wedge session registration (Q134)#231
Merged
Conversation
f0d6d89 to
dce6ebb
Compare
Q134 (e2e "no session registered" flake) is root-caused to a missing per-call deadline on the AGC listener's control-plane broker calls; moved to the top of the Queue and rescoped S. Filed Q137 (reconciler self-healing gap: RequeueAfter=reapAfter is 0 with no worker pods, so a non-retriably-exited baseline listener is not revived until the ~10h resync), Q138 (bounded-by-default HTTP clients + forbidigo/noctx lint gate), and Q139 (e2e proxy-connect 504 flake).
The AGC listener's session-establishment broker calls -- createSession and the OAuth token exchange (refreshBrokerToken) -- ran on the long-lived manager context with http.DefaultClient (no read timeout), so a broker that accepts the connection but is slow to respond (an overloaded single-replica fakegithub under parallel CI load) wedged the goroutine inside a single attempt. The Multiplexer only restarts a baseline on a *returned* error, so a wedged call never retried and the RunnerGroup never registered a session within the e2e budget -- the intermittent "no session registered" 180s timeout (Q134). It predates its attempted fix in 5e490d9 (whose observedGeneration gate is set right after mux.Start() spawns the goroutine, before any session registers, so it never gated on session readiness). Add Config.ControlPlaneTimeout (default 30s) bounding createSession and refreshBrokerToken so a stall fails fast and retriably; the Multiplexer then restarts the baseline and many retries fit the budget. The GetMessage long-poll is deliberately left unbounded. Verified: new TestListener_CreateSessionStallDoesNotWedge (a broker that never responds makes Run return a retriable error in ~0.2s instead of hanging); make check green.
The first commit bounded the session-establishment calls (createSession + OAuth token), but the job-pickup path has the same unbounded control-plane call: AcquireJob. A CI e2e run got past session registration (session-4 registered) yet still timed out at job_lifecycle with "expected >= 1 new worker pods, have 0" -- a job was enqueued onto a live session but no worker pod spawned for 240s, the signature of a wedged AcquireJob (it runs on the manager ctx with http.DefaultClient, no read timeout). Bound AcquireJob with the same Config.ControlPlaneTimeout so a stalled broker fails fast. AcquireJob errors are already handled as recoverable (the poll loop logs and re-polls), so the listener re-acquires on the next delivery instead of wedging in handleJob. Test: TestListener_AcquireJobStallDoesNotWedge -- a broker that never responds to AcquireJob lets the listener re-enter GetMessage instead of hanging.
dce6ebb to
edc28ae
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Bounds the AGC listener's non-long-poll broker control-plane calls with a per-call deadline (
Config.ControlPlaneTimeout, default 30s):createSession, the OAuth token exchange (refreshBrokerToken), andAcquireJob. TheGetMessagelong-poll is intentionally left unbounded (it holds the connection open for the broker's poll interval by design).Why — root cause of the e2e flake (Q134)
The Tier-A kind e2e suite has been intermittently red with specs timing out (180–240s) on the AGC→fakegithub path.
The bisect in the report is misleading. The "last green
ac54a78→ first red5e490d9" boundary is an artifact of an intermittent flake: the identical failure also hitmainat5bab2d50e(run 27505615811), beforeac54a78. This is a pre-existing flake, not a regression from5e490d9(Q134's attempted fix, PR #226). That commit'sWaitForRunnerGroupReconciledgate is also ineffective —observedGenerationis set right aftermux.Start()spawns the goroutine, before any session registers — but it's harmless and left unchanged.The bug: the listener's control-plane calls ran on the long-lived manager context with
http.DefaultClient(no read timeout —BrokerConfig.HTTPClientis never set). Decisive clue:EnsureAgents(an HTTP round-trip to fakegithub) succeeds at reconcile time, proving connectivity is fine. The only way a call then fails to complete for 180–240s is a wedge: an overloaded single-replica fakegithub under 6-proc CI load accepts the TCP connection but is slow to respond, so the goroutine blocks inside one call. The Multiplexer/poll-loop only recovers on a returned error, so a wedged call never retries.Two call sites, two commits
createSession+ OAuth. A CI run on the first commit confirmed it: the suite got past session registration (job 1: picking a live session → enqueuing onto session-4).job_lifecycle_test.go:179("expected >= 1 new worker pods, have 0"): a job enqueued onto a live session, but no worker pod for 240s — the same class atAcquireJob, which was also unbounded. Now bounded too (its error is already handled as recoverable → the poll loop re-acquires).Verification
TestListener_CreateSessionStallDoesNotWedgeandTestListener_AcquireJobStallDoesNotWedge— a broker that never responds makes the listener fail fast / re-poll instead of hanging.cmd/agc/internal/listenerpackage (incl.-race) +make checkgreen. Rebased onto currentmain(resolves the Q105/fix(gmc): confine worker/proxy DNS egress to cluster DNS (Q105) #228 STATUS conflict).Scope / follow-ups
E2E_GMC_TenantProvisioning_ProxyConnectWorks(provisioning_test.go:282, curl-through-proxy egress, HTTP 504) — a fast failure, not a wedge. Filed as Q139.RequeueAfter=reapAfteris 0 with no worker pods, so a non-retriably-exited baseline listener isn't revived until the ~10h resync.http.DefaultClient; retrofit a bounded-by-defaulthttpxclient (long-poll the explicit exception) + aforbidigo/noctxlint gate.Root-cause write-up:
docs/plan/q134-session-registration-flake.md.