[BUG] Stack sometimes omits relevant Kubernetes events in stalled-job failure messages #856

## Describe the bug

My team and I manage Buildkite for our organization, and we've recently fielded questions around stalled job cleanup where a job was terminated with a vague `Events: none`:

<img width="528" height="188" alt="Image" src="https://github.com/user-attachments/assets/31475133-0892-48db-9d74-3db8b6efef1e" />

When we check our Kubernetes cluster for events corresponding to the job, we see `exceeded quota` events (see Logs), but they do not appear in the Buildkite UI. This makes debugging much harder for many users in our organization, who have to guess at what went wrong until we can check the Kubernetes cluster manually.

To determine whether the agent stack gathers and prints the events at all, I ran a pipeline explicitly configured to exceed the resource limits of its host cluster on two of my team's Buildkite clusters (one staging, one production). On the staging cluster, the exceeded-quota Kubernetes events were included in the UI. On the production cluster, they were _not_ included in the UI.

The two Buildkite clusters have minor environmental differences: they run on different underlying Kubernetes clusters, in different namespaces, with different `requests.memory` values in the namespace's `ResourceQuota`. However, the most obvious parts of the environment are identical: both run on Kubernetes clusters of the same version, both are deployed with the same version of the `agent-stack-k8s` Helm chart, and both run the same `agent-stack-k8s` controller image.

## To Reproduce

Exact reproduction conditions are unclear, as the bug consistently reproduces in one environment and consistently does not reproduce in another, seemingly identical environment. For testing purposes, however, an easy way to force the relevant Kubernetes errors is to configure a pipeline that uses the Kubernetes plugin on a command step, with a podspec requesting significantly more memory than is available in the Kubernetes cluster.
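
For illustration, a pipeline along these lines should trigger the quota error. This is a minimal sketch rather than our exact pipeline: the queue name, label, image, and command are placeholders, and the memory request simply reuses the value that appears in the Logs section.

```yaml
steps:
  - label: "Exceed namespace memory quota"  # placeholder label
    agents:
      queue: kubernetes                     # placeholder: use the queue your stack serves
    plugins:
      - kubernetes:
          podSpec:
            containers:
              - image: alpine:latest        # placeholder image
                command:
                  - echo "this should never run"
                resources:
                  requests:
                    # Far larger than the ResourceQuota's requests.memory,
                    # so pod creation is rejected with "exceeded quota".
                    memory: "76802048M"
```

Any request that exceeds the namespace's `ResourceQuota` should behave the same way; the specific number is not significant.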

## Expected behavior

Kubernetes errors should be visible, as they are on jobs run on the "good" staging cluster:

<img width="876" height="372" alt="Image" src="https://github.com/user-attachments/assets/e3785125-839f-4ce0-9ce6-c3d910f3ed6b" />

## Environment
- agent-stack-k8s version: 0.40.0
- Kubernetes version: v1.32.10
- Deployment method: Helm chart

## Logs

`kubectl events` shows the events I'd expect in both cases:

```
$ kubectl ... events --namespace buildkite-prod-queue-k8s-default --for job/buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781
LAST SEEN   TYPE      REASON         OBJECT                                               MESSAGE
23m         Warning   FailedCreate   Job/buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781   Error creating: pods "buildkite-019d7800-7a93-4c15-bc5a-b2fccd9c7781-hmj92" is forbidden: exceeded quota: default-resource-quota, requested: requests.memory=76802048M, used: requests.memory=77096157184, limited: requests.memory=1280G
...
$ kubectl ... events --namespace buildkite-staging-queue-k8s-default --for job/buildkite-019d7813-0826-4dc2-b129-e4877346b48c
LAST SEEN   TYPE      REASON             OBJECT                                               MESSAGE
3m34s       Warning   FailedCreate       Job/buildkite-019d7813-0826-4dc2-b129-e4877346b48c   Error creating: pods "buildkite-019d7813-0826-4dc2-b129-e4877346b48c-xqd4q" is forbidden: exceeded quota: default-resource-quota, requested: requests.memory=76802048M, used: requests.memory=20877934592, limited: requests.memory=768G
...
```
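
As a sanity check that the two events are equivalent apart from the quota numbers, the messages can be parsed and compared. The snippet below is only a rough sketch: the helper names are hypothetical, it handles only the plain-byte and decimal-suffix quantities that appear in these messages, and it is not how `agent-stack-k8s` itself processes events.

```python
import re

# Decimal SI suffixes seen in the messages above; binary suffixes (Ki, Mi, Gi)
# are deliberately not handled in this sketch.
SUFFIXES = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text: str) -> int:
    """Convert a quantity such as '1280G' or '77096157184' to bytes."""
    match = re.fullmatch(r"(\d+)([kMGT]?)", text)
    if match is None:
        raise ValueError(f"unsupported quantity: {text!r}")
    number, suffix = match.groups()
    return int(number) * SUFFIXES[suffix]


def parse_quota_message(message: str) -> dict:
    """Extract requested/used/limited requests.memory values, in bytes."""
    fields = re.findall(r"(requested|used|limited): requests\.memory=(\w+)", message)
    return {name: parse_quantity(value) for name, value in fields}


def exceeds_quota(message: str) -> bool:
    """True if the new request would push usage past the quota limit."""
    quota = parse_quota_message(message)
    return quota["used"] + quota["requested"] > quota["limited"]


PROD = ("exceeded quota: default-resource-quota, requested: requests.memory=76802048M, "
        "used: requests.memory=77096157184, limited: requests.memory=1280G")
STAGING = ("exceeded quota: default-resource-quota, requested: requests.memory=76802048M, "
           "used: requests.memory=20877934592, limited: requests.memory=768G")

for name, message in (("prod", PROD), ("staging", STAGING)):
    print(name, parse_quota_message(message), exceeds_quota(message))
```

Both messages carry the same request and both exceed their namespace limit, so nothing in the event text itself explains why only one of them reaches the UI.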

## Additional context

Note that even in the screenshot for the "good" staging cluster, the last event timestamps are likely wrong: they are all Go's zero time. This is captured in #857.
