fix(agc): bound listener control-plane broker calls so a stalled broker cannot wedge session registration (Q134) by karlkfi · Pull Request #231 · actions-gateway/github-actions-gateway

karlkfi · 2026-06-14T21:38:31Z

What

Bounds the AGC listener's non-long-poll broker control-plane calls with a per-call deadline (Config.ControlPlaneTimeout, default 30s): createSession, the OAuth token exchange (refreshBrokerToken), and AcquireJob. The GetMessage long-poll is intentionally left unbounded (it holds the connection open for the broker's poll interval by design).

Why — root cause of the e2e flake (Q134)

The Tier-A kind e2e suite has been intermittently red with specs timing out (180–240s) on the AGC→fakegithub path.

The bisect in the report is misleading. The "last green ac54a78 → first red 5e490d9" boundary is an artifact of an intermittent flake: the identical failure also hit main at 5bab2d50e (run 27505615811), before ac54a78. This is a pre-existing flake, not a regression from 5e490d9 (Q134's attempted fix, PR #226). That commit's WaitForRunnerGroupReconciled gate is also ineffective — observedGeneration is set right after mux.Start() spawns the goroutine, before any session registers — but it's harmless and left unchanged.

The bug: the listener's control-plane calls ran on the long-lived manager context with http.DefaultClient (no read timeout — BrokerConfig.HTTPClient is never set). Decisive clue: EnsureAgents (an HTTP round-trip to fakegithub) succeeds at reconcile time, proving connectivity is fine. The only way a call then fails to complete for 180–240s is a wedge: an overloaded single-replica fakegithub under 6-proc CI load accepts the TCP connection but is slow to respond, so the goroutine blocks inside one call. The Multiplexer/poll-loop only recovers on a returned error, so a wedged call never retries.

Two call sites, two commits

Session establishment — createSession + OAuth. A CI run on the first commit confirmed it: the suite got past session registration (job 1: picking a live session → enqueuing onto session-4).
Job pickup — that same run then timed out at job_lifecycle_test.go:179 ("expected >= 1 new worker pods, have 0"): a job enqueued onto a live session, but no worker pod for 240s — the same class at AcquireJob, which was also unbounded. Now bounded too (its error is already handled as recoverable → the poll loop re-acquires).

Verification

New tests: TestListener_CreateSessionStallDoesNotWedge and TestListener_AcquireJobStallDoesNotWedge — a broker that never responds makes the listener fail fast / re-poll instead of hanging.
Full cmd/agc/internal/listener package (incl. -race) + make check green. Rebased onto current main (resolves the Q105/fix(gmc): confine worker/proxy DNS egress to cluster DNS (Q105) #228 STATUS conflict).
This is a CI-load-induced, multi-flaky suite — confirm e2e over multiple runs rather than from one.

Scope / follow-ups

Separate, unrelated flake observed in the same run: E2E_GMC_TenantProvisioning_ProxyConnectWorks (provisioning_test.go:282, curl-through-proxy egress, HTTP 504) — a fast failure, not a wedge. Filed as Q139.
Q137 (filed): reconciler self-healing gap — RequeueAfter=reapAfter is 0 with no worker pods, so a non-retriably-exited baseline listener isn't revived until the ~10h resync.
Q138 (filed): the broad class — ~8 production clients fall back to http.DefaultClient; retrofit a bounded-by-default httpx client (long-poll the explicit exception) + a forbidigo/noctx lint gate.

Root-cause write-up: docs/plan/q134-session-registration-flake.md.

Q134 (e2e "no session registered" flake) is root-caused to a missing per-call deadline on the AGC listener's control-plane broker calls; moved to the top of the Queue and rescoped S. Filed Q137 (reconciler self-healing gap: RequeueAfter=reapAfter is 0 with no worker pods, so a non-retriably-exited baseline listener is not revived until the ~10h resync), Q138 (bounded-by-default HTTP clients + forbidigo/noctx lint gate), and Q139 (e2e proxy-connect 504 flake).

The AGC listener's session-establishment broker calls -- createSession and the OAuth token exchange (refreshBrokerToken) -- ran on the long-lived manager context with http.DefaultClient (no read timeout), so a broker that accepts the connection but is slow to respond (an overloaded single-replica fakegithub under parallel CI load) wedged the goroutine inside a single attempt. The Multiplexer only restarts a baseline on a *returned* error, so a wedged call never retried and the RunnerGroup never registered a session within the e2e budget -- the intermittent "no session registered" 180s timeout (Q134). It predates its attempted fix in 5e490d9 (whose observedGeneration gate is set right after mux.Start() spawns the goroutine, before any session registers, so it never gated on session readiness). Add Config.ControlPlaneTimeout (default 30s) bounding createSession and refreshBrokerToken so a stall fails fast and retriably; the Multiplexer then restarts the baseline and many retries fit the budget. The GetMessage long-poll is deliberately left unbounded. Verified: new TestListener_CreateSessionStallDoesNotWedge (a broker that never responds makes Run return a retriable error in ~0.2s instead of hanging); make check green.

The first commit bounded the session-establishment calls (createSession + OAuth token), but the job-pickup path has the same unbounded control-plane call: AcquireJob. A CI e2e run got past session registration (session-4 registered) yet still timed out at job_lifecycle with "expected >= 1 new worker pods, have 0" -- a job was enqueued onto a live session but no worker pod spawned for 240s, the signature of a wedged AcquireJob (it runs on the manager ctx with http.DefaultClient, no read timeout). Bound AcquireJob with the same Config.ControlPlaneTimeout so a stalled broker fails fast. AcquireJob errors are already handled as recoverable (the poll loop logs and re-polls), so the listener re-acquires on the next delivery instead of wedging in handleJob. Test: TestListener_AcquireJobStallDoesNotWedge -- a broker that never responds to AcquireJob lets the listener re-enter GetMessage instead of hanging.

karlkfi force-pushed the claude/nervous-chatelet-61bad8 branch from f0d6d89 to dce6ebb Compare June 14, 2026 23:19

karlkfi added 4 commits June 14, 2026 16:22

docs(plan): record AcquireJob extension + multi-flake finding for Q134

edc28ae

karlkfi force-pushed the claude/nervous-chatelet-61bad8 branch from dce6ebb to edc28ae Compare June 14, 2026 23:23

karlkfi merged commit c60f3ba into main Jun 14, 2026
24 checks passed

karlkfi deleted the claude/nervous-chatelet-61bad8 branch June 14, 2026 23:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agc): bound listener control-plane broker calls so a stalled broker cannot wedge session registration (Q134)#231

fix(agc): bound listener control-plane broker calls so a stalled broker cannot wedge session registration (Q134)#231
karlkfi merged 4 commits into
mainfrom
claude/nervous-chatelet-61bad8

karlkfi commented Jun 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

karlkfi commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why — root cause of the e2e flake (Q134)

Two call sites, two commits

Verification

Scope / follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

karlkfi commented Jun 14, 2026 •

edited

Loading