Describe the bug
My team and I manage Buildkite for our organization, and we've recently fielded questions around stalled job cleanup where a job was terminated with a vague Events: none:
Checking our Kubernetes cluster for events corresponding to the job, we see exceeded quota events (see Logs), but we didn't see them in the UI. This makes debugging for many users in our organization much more difficult, as they have to make pure guesses about what went wrong until we're able to check the Kubernetes cluster manually.
To determine if the agent stack gathers and prints the events at all, I ran a pipeline explicitly configured to exceed the resource limits of its host cluster on two of my team's Buildkite clusters (one staging cluster, and one production cluster). When run on our staging cluster, exceeded-quota Kubernetes events were included in the UI. When run on our production cluster, exceeded-quota Kubernetes events were not included in the UI. The two Buildkite clusters have minor environmental differences like running on different underlying Kubernetes clusters with different namespaces and different requests.memory values in the namespace's ResourceQuota. However, the most obvious parts of the environment are identical: both Buildkite clusters are running on Kubernetes clusters with the same version, both Buildkite clusters are deployed to the Kubernetes clusters with the same version of the agent-stack-k8s Helm chart, both Buildkite clusters are running the same agent-stack-k8s controller image, etc.
To Reproduce
Exact reproduction conditions are unclear, as the bug consistently reproduces in one environment and consistently does not reproduce in another, seemingly identical environment. However, for testing purposes, configuring a pipeline to use the Kubernetes plugin on a command step with a podspec requesting significantly more memory than is available in the Kubernetes cluster is an easy way to force the relevant Kubernetes errors to occur.
Expected behavior
Kubernetes errors should be visible, as they are on jobs run on the "good" staging cluster:
Environment
- agent-stack-k8s version: 0.40.0
- Kubernetes version: v1.32.10
- Deployment method: Helm chart
Logs
kubectl events shows the events I'd expect in both cases:
$ kubectl ... events --namespace buildkite-prod-queue-k8s-default --for job/buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781
LAST SEEN TYPE REASON OBJECT MESSAGE
23m Warning FailedCreate Job/buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781 Error creating: pods "buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781-hmj92" is forbidden: exceeded quota: default-resource-quota, requested: requests.memory=76802048M, used: requests.memory=77096157184, limited: requests.memory=1280G
...
$ kubectl ... events --namespace buildkite-staging-queue-k8s-default --for job/buildkite-019d7813-0826-4dc2-b129-e4877346b48c
LAST SEEN TYPE REASON OBJECT MESSAGE
3m34s Warning FailedCreate Job/buildkite-019d7813-0826-4dc2-b129-e4877346b48c Error creating: pods "buildkite-019d7813-0826-4dc2-b129-e4877346b48c-xqd4q" is forbidden: exceeded quota: default-resource-quota, requested: requests.memory=76802048M, used: requests.memory=20877934592, limited: requests.memory=768G
...
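As a sanity check that both events describe the same failure mode, the quota numbers in the messages above can be compared directly. The following is a minimal sketch, not part of agent-stack-k8s: the helper names (parse_quantity, quota_fields, exceeds_quota) are hypothetical, and it assumes only the plain-integer, decimal (k, M, G, T) and binary (Ki, Mi, Gi, Ti) quantity suffixes appear in the message.

```python
import re

# Subset of Kubernetes resource quantity suffixes (assumption: no others appear).
SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_quantity(text):
    """Convert a quantity such as '1280G' or '77096157184' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text)
    if match is None or match.group(2) not in SUFFIXES:
        raise ValueError(f"unsupported quantity: {text!r}")
    return int(match.group(1)) * SUFFIXES[match.group(2)]


def quota_fields(message, resource="requests.memory"):
    """Extract the requested/used/limited values for one resource from an event message."""
    pattern = rf"(requested|used|limited): {re.escape(resource)}=(\w+)"
    return {key: parse_quantity(value) for key, value in re.findall(pattern, message)}


def exceeds_quota(message):
    """True when requested + used is larger than the quota limit."""
    fields = quota_fields(message)
    return fields["requested"] + fields["used"] > fields["limited"]


# Quota portions of the two FailedCreate messages shown above.
PROD = ("exceeded quota: default-resource-quota, requested: requests.memory=76802048M, "
        "used: requests.memory=77096157184, limited: requests.memory=1280G")
STAGING = ("exceeded quota: default-resource-quota, requested: requests.memory=76802048M, "
           "used: requests.memory=20877934592, limited: requests.memory=768G")

for name, message in (("prod", PROD), ("staging", STAGING)):
    print(name, quota_fields(message), exceeds_quota(message))
```

In both namespaces the request (about 76.8 TB) alone is far larger than the limit (1.28 TB in production, 0.768 TB in staging), so the difference in quota values between the two clusters does not change which Kubernetes event is produced.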
Additional context
Note that even in the screenshot for the "good" staging cluster, the last event timestamps are likely wrong: they're all Go's zero time. This is captured in #857.