Skip to content

fix(agc): bound listener control-plane broker calls so a stalled broker cannot wedge session registration (Q134)#231

Merged
karlkfi merged 4 commits into
mainfrom
claude/nervous-chatelet-61bad8
Jun 14, 2026
Merged

fix(agc): bound listener control-plane broker calls so a stalled broker cannot wedge session registration (Q134)#231
karlkfi merged 4 commits into
mainfrom
claude/nervous-chatelet-61bad8

Conversation

@karlkfi

@karlkfi karlkfi commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

What

Bounds the AGC listener's non-long-poll broker control-plane calls with a per-call deadline (Config.ControlPlaneTimeout, default 30s): createSession, the OAuth token exchange (refreshBrokerToken), and AcquireJob. The GetMessage long-poll is intentionally left unbounded (it holds the connection open for the broker's poll interval by design).

Why — root cause of the e2e flake (Q134)

The Tier-A kind e2e suite has been intermittently red with specs timing out (180–240s) on the AGC→fakegithub path.

The bisect in the report is misleading. The "last green ac54a78 → first red 5e490d9" boundary is an artifact of an intermittent flake: the identical failure also hit main at 5bab2d50e (run 27505615811), before ac54a78. This is a pre-existing flake, not a regression from 5e490d9 (Q134's attempted fix, PR #226). That commit's WaitForRunnerGroupReconciled gate is also ineffective — observedGeneration is set right after mux.Start() spawns the goroutine, before any session registers — but it's harmless and left unchanged.

The bug: the listener's control-plane calls ran on the long-lived manager context with http.DefaultClient (no read timeout — BrokerConfig.HTTPClient is never set). Decisive clue: EnsureAgents (an HTTP round-trip to fakegithub) succeeds at reconcile time, proving connectivity is fine. The only way a call then fails to complete for 180–240s is a wedge: an overloaded single-replica fakegithub under 6-proc CI load accepts the TCP connection but is slow to respond, so the goroutine blocks inside one call. The Multiplexer/poll-loop only recovers on a returned error, so a wedged call never retries.

Two call sites, two commits

  1. Session establishmentcreateSession + OAuth. A CI run on the first commit confirmed it: the suite got past session registration (job 1: picking a live session → enqueuing onto session-4).
  2. Job pickup — that same run then timed out at job_lifecycle_test.go:179 ("expected >= 1 new worker pods, have 0"): a job enqueued onto a live session, but no worker pod for 240s — the same class at AcquireJob, which was also unbounded. Now bounded too (its error is already handled as recoverable → the poll loop re-acquires).

Verification

  • New tests: TestListener_CreateSessionStallDoesNotWedge and TestListener_AcquireJobStallDoesNotWedge — a broker that never responds makes the listener fail fast / re-poll instead of hanging.
  • Full cmd/agc/internal/listener package (incl. -race) + make check green. Rebased onto current main (resolves the Q105/fix(gmc): confine worker/proxy DNS egress to cluster DNS (Q105) #228 STATUS conflict).
  • This is a CI-load-induced, multi-flaky suite — confirm e2e over multiple runs rather than from one.

Scope / follow-ups

  • Separate, unrelated flake observed in the same run: E2E_GMC_TenantProvisioning_ProxyConnectWorks (provisioning_test.go:282, curl-through-proxy egress, HTTP 504) — a fast failure, not a wedge. Filed as Q139.
  • Q137 (filed): reconciler self-healing gap — RequeueAfter=reapAfter is 0 with no worker pods, so a non-retriably-exited baseline listener isn't revived until the ~10h resync.
  • Q138 (filed): the broad class — ~8 production clients fall back to http.DefaultClient; retrofit a bounded-by-default httpx client (long-poll the explicit exception) + a forbidigo/noctx lint gate.

Root-cause write-up: docs/plan/q134-session-registration-flake.md.

@karlkfi karlkfi force-pushed the claude/nervous-chatelet-61bad8 branch from f0d6d89 to dce6ebb Compare June 14, 2026 23:19
karlkfi added 4 commits June 14, 2026 16:22
Q134 (e2e "no session registered" flake) is root-caused to a missing
per-call deadline on the AGC listener's control-plane broker calls; moved
to the top of the Queue and rescoped S. Filed Q137 (reconciler
self-healing gap: RequeueAfter=reapAfter is 0 with no worker pods, so a
non-retriably-exited baseline listener is not revived until the ~10h
resync), Q138 (bounded-by-default HTTP clients + forbidigo/noctx lint
gate), and Q139 (e2e proxy-connect 504 flake).
The AGC listener's session-establishment broker calls -- createSession
and the OAuth token exchange (refreshBrokerToken) -- ran on the
long-lived manager context with http.DefaultClient (no read timeout), so
a broker that accepts the connection but is slow to respond (an
overloaded single-replica fakegithub under parallel CI load) wedged the
goroutine inside a single attempt. The Multiplexer only restarts a
baseline on a *returned* error, so a wedged call never retried and the
RunnerGroup never registered a session within the e2e budget -- the
intermittent "no session registered" 180s timeout (Q134). It predates
its attempted fix in 5e490d9 (whose observedGeneration gate is set right
after mux.Start() spawns the goroutine, before any session registers,
so it never gated on session readiness).

Add Config.ControlPlaneTimeout (default 30s) bounding createSession and
refreshBrokerToken so a stall fails fast and retriably; the Multiplexer
then restarts the baseline and many retries fit the budget. The
GetMessage long-poll is deliberately left unbounded.

Verified: new TestListener_CreateSessionStallDoesNotWedge (a broker that
never responds makes Run return a retriable error in ~0.2s instead of
hanging); make check green.
The first commit bounded the session-establishment calls (createSession +
OAuth token), but the job-pickup path has the same unbounded control-plane
call: AcquireJob. A CI e2e run got past session registration (session-4
registered) yet still timed out at job_lifecycle with "expected >= 1 new
worker pods, have 0" -- a job was enqueued onto a live session but no
worker pod spawned for 240s, the signature of a wedged AcquireJob (it runs
on the manager ctx with http.DefaultClient, no read timeout).

Bound AcquireJob with the same Config.ControlPlaneTimeout so a stalled
broker fails fast. AcquireJob errors are already handled as recoverable
(the poll loop logs and re-polls), so the listener re-acquires on the next
delivery instead of wedging in handleJob.

Test: TestListener_AcquireJobStallDoesNotWedge -- a broker that never
responds to AcquireJob lets the listener re-enter GetMessage instead of
hanging.
@karlkfi karlkfi force-pushed the claude/nervous-chatelet-61bad8 branch from dce6ebb to edc28ae Compare June 14, 2026 23:23
@karlkfi karlkfi merged commit c60f3ba into main Jun 14, 2026
24 checks passed
@karlkfi karlkfi deleted the claude/nervous-chatelet-61bad8 branch June 14, 2026 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant