[BUG] Stack sometimes omits relevant Kubernetes events in stalled-job failure messages #856

Description

@adanducci

Describe the bug

My team and I manage Buildkite for our organization, and we've recently fielded questions about stalled-job cleanup in which a job was terminated with only a vague "Events: none" message:

Image

Checking our Kubernetes cluster for events corresponding to the job, we see exceeded-quota events (see Logs), but they did not appear in the UI. This makes debugging much more difficult for many users in our organization, who have to guess at what went wrong until we're able to check the Kubernetes cluster manually.

To determine if the agent stack gathers and prints the events at all, I ran a pipeline explicitly configured to exceed the resource limits of its host cluster on two of my team's Buildkite clusters (one staging cluster, and one production cluster). When run on our staging cluster, exceeded-quota Kubernetes events were included in the UI. When run on our production cluster, exceeded-quota Kubernetes events were not included in the UI. The two Buildkite clusters have minor environmental differences like running on different underlying Kubernetes clusters with different namespaces and different requests.memory values in the namespace's ResourceQuota. However, the most obvious parts of the environment are identical: both Buildkite clusters are running on Kubernetes clusters with the same version, both Buildkite clusters are deployed to the Kubernetes clusters with the same version of the agent-stack-k8s Helm chart, both Buildkite clusters are running the same agent-stack-k8s controller image, etc.

To Reproduce

Exact reproduction conditions are unclear, as the bug consistently reproduces in one environment and consistently does not reproduce in another seemingly identical environment. However, for testing purposes, configuring a pipeline to use the Kubernetes plugin on a command step with a podSpec requesting significantly more memory than is available in the Kubernetes cluster is an easy way to force relevant Kubernetes errors to occur.

Expected behavior

Kubernetes errors should be visible, as they are on jobs run on the "good" staging cluster:

Image

Environment

  • agent-stack-k8s version: 0.40.0
  • Kubernetes version: v1.32.10
  • Deployment method: Helm chart

Logs

kubectl events shows the events I'd expect in both cases:

$ kubectl ... events --namespace buildkite-prod-queue-k8s-default --for job/buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781
LAST SEEN   TYPE      REASON         OBJECT                                               MESSAGE
23m         Warning   FailedCreate   Job/buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781   Error creating: pods "buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781-hmj92" is forbidden: exceeded quota: default-resource-quota, requested: requests.memory=76802048M, used: requests.memory=77096157184, limited: requests.memory=1280G
...
$ kubectl ... events --namespace buildkite-staging-queue-k8s-default --for job/buildkite-019d7813-0826-4dc2-b129-e4877346b48c
LAST SEEN   TYPE      REASON             OBJECT                                               MESSAGE
3m34s       Warning   FailedCreate       Job/buildkite-019d7813-0826-4dc2-b129-e4877346b48c   Error creating: pods "buildkite-019d7813-0826-4dc2-b129-e4877346b48c-xqd4q" is forbidden: exceeded quota: default-resource-quota, requested: requests.memory=76802048M, used: requests.memory=20877934592, limited: requests.memory=768G
...
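
To make the numbers in the two FailedCreate messages easier to compare, here is a minimal Go sketch of the quota arithmetic. It assumes the decimal SI meaning of the Kubernetes suffixes (M = 10^6, G = 10^9) and only handles the forms that appear in the logs above. `parseDecimalQuantity` and `exceedsQuota` are hypothetical helpers written for illustration; they are not part of agent-stack-k8s or the Kubernetes API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDecimalQuantity converts a quantity such as "76802048M", "1280G",
// or a plain byte count into bytes. Hypothetical, simplified stand-in for
// Kubernetes quantity parsing: only "M" (10^6) and "G" (10^9) are handled.
func parseDecimalQuantity(s string) (int64, error) {
	multipliers := []struct {
		suffix string
		factor int64
	}{
		{"G", 1_000_000_000},
		{"M", 1_000_000},
	}
	for _, m := range multipliers {
		if strings.HasSuffix(s, m.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, m.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * m.factor, nil
		}
	}
	return strconv.ParseInt(s, 10, 64)
}

// exceedsQuota reports whether used + requested is greater than the limit.
func exceedsQuota(requested, used, limit string) (bool, error) {
	req, err := parseDecimalQuantity(requested)
	if err != nil {
		return false, err
	}
	use, err := parseDecimalQuantity(used)
	if err != nil {
		return false, err
	}
	lim, err := parseDecimalQuantity(limit)
	if err != nil {
		return false, err
	}
	return use+req > lim, nil
}

func main() {
	// Values copied from the production and staging events above.
	prod, _ := exceedsQuota("76802048M", "77096157184", "1280G")
	staging, _ := exceedsQuota("76802048M", "20877934592", "768G")
	fmt.Println("production exceeds quota:", prod)
	fmt.Println("staging exceeds quota:", staging)
}
```

In both namespaces the request alone (about 76.8 TB) is far above the limit, so the same FailedCreate event is expected on both clusters regardless of the differing `used` and `limited` values.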

Additional context

Note that even in the screenshot for the "good" staging cluster, the last event timestamps are likely wrong: they're all Golang's zero time. This is captured in #857.
