This repository was archived by the owner on May 19, 2026. It is now read-only.

scheduler: detect job stuck queued while its parent run has completed#25

Merged: luhenry merged 1 commit into main from go on May 15, 2026

@luhenry luhenry commented May 15, 2026

Contributor

A workflow_job sometimes stays status=queued on GitHub forever after its parent workflow run terminates -- e.g. when a sibling fails fast or the run is cancelled before scheduling reaches the job. The scheduler used to keep trying to provision a runner for those jobs indefinitely.

sync_jobs now probes the parent run when GitHub still reports the job as queued. If the run has completed, the job is marked failed with a v1 record that captures the run's conclusion. The probe is gated on JobStuckQueuedMinAge (10 min in internal/constants.go) so that freshly queued jobs do not incur an extra GitHub API call on every reconcile cycle.

Side fixes hit while wiring this up:

  • internal.FailureInfo gained a Message field so v1 rows ({version:1, message:"..."}) match the legacy on-disk shape that templates_test.go has pinned all along. The two existing v1 callsites in sync_jobs (installation 404, job 404) were putting the human-readable message into the typed Reason field; switch them to Message. Document the v1 vs v2 split in the struct doc.
  • Extend internal.GHJob with the run_id field GitHub already returns, and add GHRun + GitHubClient.GetRunInfo for the new probe.
  • Wire OnGetRunInfo through FakeGH for tests.

https://claude.ai/code/session_01Vda2TpwJnGYRYuw1Xg46Da

Comment thread container/internal/contract.go Outdated
@luhenry luhenry marked this pull request as ready for review May 15, 2026 23:49
@luhenry luhenry merged commit f4036c4 into main May 15, 2026
2 checks passed