Add new alerts #48

Open
bio-boris wants to merge 5 commits into master from add_new_alerts

@bio-boris

Contributor

No description provided.

Copilot AI review requested due to automatic review settings May 11, 2026 19:39
@bio-boris bio-boris marked this pull request as draft May 11, 2026 19:39
This script checks for failed Kubernetes jobs due to BackoffLimitExceeded or DeadlineExceeded events and outputs the status for CheckMK.
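
As a rough illustration of the check described in this commit message, the sketch below filters the JSON from `kubectl get events -A -o json` for Job events with one of the two failure reasons and formats a CheckMK local-check line. It is not the code from this PR: the function names, the service name "K8s Failed Jobs", and the `failed_jobs` metric are assumptions made for the example.

```python
import json

FAILURE_REASONS = {"BackoffLimitExceeded", "DeadlineExceeded"}


def failed_job_events(events_json: str) -> list[dict]:
    """Return events whose reason marks a failed Job (hypothetical helper)."""
    items = json.loads(events_json).get("items", [])
    return [
        event
        for event in items
        if event.get("reason") in FAILURE_REASONS
        and event.get("involvedObject", {}).get("kind") == "Job"
    ]


def checkmk_line(events: list[dict], service: str = "K8s Failed Jobs") -> str:
    """Format one local-check line: <state> "<service>" <perfdata> <text>."""
    # Several events can refer to the same Job, so collapse them by namespace/name.
    jobs = sorted(
        {
            "{}/{}".format(
                event["involvedObject"].get("namespace", "?"),
                event["involvedObject"].get("name", "?"),
            )
            for event in events
        }
    )
    if not jobs:
        return f'0 "{service}" failed_jobs=0 No failed jobs found'
    return f'2 "{service}" failed_jobs={len(jobs)} Failed jobs: ' + ", ".join(jobs)
```

State 0 is OK and state 2 is CRIT in the local-check format; the real script may group or word its output differently.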

Copilot AI left a comment

Pull request overview

This PR adds a new CheckMK local check script intended to detect failed Velero backups by querying Kubernetes for Velero Backup resources and emitting local-check results.

Changes:

  • Added a new executable local check lakehouse/velero_failed_backups that runs kubectl get backups -A -o json.
  • Emits CRIT results for backups in Failed / PartiallyFailed phase, including basic perfdata (errors/warnings) and timestamps.
  • Emits UNKNOWN output on kubectl timeout/command errors/JSON parse errors.
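
The behaviour listed above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of lakehouse/velero_failed_backups: the function name, service names, and message wording are hypothetical, and the sketch only covers the parsing step (the kubectl call and its timeout handling are left out).

```python
import json

BAD_PHASES = {"Failed", "PartiallyFailed"}


def velero_check_lines(backups_json: str) -> list[str]:
    """Turn `kubectl get backups -A -o json` output into local-check lines."""
    try:
        items = json.loads(backups_json).get("items", [])
    except (json.JSONDecodeError, AttributeError) as exc:
        # State 3 is UNKNOWN: the output could not be interpreted at all.
        return [f'3 "Velero Backups" - Could not parse kubectl output: {exc}']

    lines = []
    for backup in items:
        status = backup.get("status", {})
        phase = status.get("phase")
        if phase not in BAD_PHASES:
            continue
        name = backup.get("metadata", {}).get("name", "unknown")
        errors = status.get("errors", 0)
        warnings = status.get("warnings", 0)
        started = status.get("startTimestamp", "n/a")
        completed = status.get("completionTimestamp", "n/a")
        # State 2 is CRIT; perfdata metrics are separated by "|".
        lines.append(
            f'2 "Velero Backup {name}" errors={errors}|warnings={warnings} '
            f"phase={phase} started={started} completed={completed}"
        )
    if not lines:
        lines.append('0 "Velero Backups" - No failed backups')
    return lines
```

Emitting one service per failed backup is a design choice made for this sketch; a single aggregated service would work as well.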

Comment thread lakehouse/velero_failed_backups Outdated
Comment thread lakehouse/velero_failed_backups
Comment thread lakehouse/velero_failed_backups
bio-boris and others added 3 commits May 11, 2026 15:07
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@bio-boris bio-boris marked this pull request as ready for review June 30, 2026 17:21

@kkellerlbl kkellerlbl left a comment

Member

LGTM (k8s and microk8s scripts only minimally reviewed; velero script reviewed)

It may be useful to check the date on the most recent backup, and throw an alert if it's too old, but that's not a blocker.
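
A minimal sketch of such a staleness check, assuming the backups are read from the same JSON and that each completed backup carries `status.completionTimestamp` in UTC. The 26-hour threshold, the service name, and the function names are placeholders for illustration, not values from this PR.

```python
from datetime import datetime, timedelta, timezone


def newest_completion(backups: list[dict]):
    """Return the latest completionTimestamp as an aware datetime, or None."""
    stamps = []
    for backup in backups:
        raw = backup.get("status", {}).get("completionTimestamp")
        if raw:
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            stamps.append(parsed.replace(tzinfo=timezone.utc))
    return max(stamps, default=None)


def staleness_line(backups, now, max_age=timedelta(hours=26)) -> str:
    """Emit CRIT when the newest completed backup is older than max_age."""
    newest = newest_completion(backups)
    if newest is None:
        return '2 "Velero Backup Age" - No completed backup found'
    age = now - newest
    hours = age.total_seconds() / 3600
    state = 2 if age > max_age else 0
    return (
        f'{state} "Velero Backup Age" age_hours={hours:.1f} '
        f"Most recent backup completed {newest.isoformat()}"
    )
```

Passing `now` in explicitly (for example `datetime.now(timezone.utc)`) keeps the function easy to test.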

@bio-boris

Contributor Author

That is a good idea about the backup date. We can do that in a new PR.

For this PR, I think we should somehow trigger these to fail to see if the checkmk checks are actually working though. Were you able to do that?
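
One possible way to force a failure for the failed-jobs check, sketched under assumptions: generate a Job whose only container exits non-zero, with `backoffLimit` set to 0 so the Job fails after the first attempt. The Job name, namespace, and image below are hypothetical and should be adapted to the cluster.

```python
import json


def failing_job_manifest(name: str = "checkmk-alert-test",
                         namespace: str = "default") -> dict:
    """Build a batch/v1 Job that fails on its first and only attempt."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # No retries: the first pod failure marks the Job as failed.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "fail",
                            "image": "busybox",  # assumed to be pullable
                            "command": ["sh", "-c", "exit 1"],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(failing_job_manifest(), indent=2))
```

The output could be piped into `microk8s kubectl apply -f -`, and the Job deleted again once the check has turned CRIT.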

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comment thread lakehouse/microk8s_certs
Comment on lines +13 to +15
if [ -n "$cmd" ]; then
    echo "2 \"MicroK8s Certs\" - Refresh microk8s cert with \`$cmd"
else
Comment thread lakehouse/microk8s_certs
Comment on lines +7 to +9
if (line=="CA") { print "microk8s refresh-certs -e ca.crt` - this will TERMINATE all running workloads"; next }
c=tolower(line); gsub(/ /,"-",c)
print "microk8s refresh-certs -e " c ".crt`"
Comment thread lakehouse/k8s_failed_jobs
Comment on lines +16 to +18
import json
import subprocess
import sys
Comment thread lakehouse/k8s_failed_jobs
Comment on lines +21 to +33
KUBECTL = "microk8s kubectl"
FAILURE_REASONS = {"BackoffLimitExceeded", "DeadlineExceeded"}


def get_events() -> list[dict]:
    try:
        result = subprocess.run(
            f"{KUBECTL} get events -A -o json",
            shell=True,
            capture_output=True,
            text=True,
            timeout=30,
        )
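
The excerpt above is cut off before the error handling. As a hedged variation, and not necessarily what the review comment asked for, the same call can be made without `shell=True` by splitting the `KUBECTL` string into an argument list. The `run_kubectl` helper and its `(stdout, error)` return convention are assumptions for this sketch.

```python
import shlex
import subprocess

KUBECTL = "microk8s kubectl"


def run_kubectl(args, kubectl: str = KUBECTL, timeout: int = 30):
    """Run kubectl with an argument list; return (stdout, error_message)."""
    cmd = shlex.split(kubectl) + list(args)
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None, f"timed out after {timeout}s"
    except OSError as exc:
        # Covers a missing or non-executable binary.
        return None, f"could not run {cmd[0]}: {exc}"
    if result.returncode != 0:
        return None, result.stderr.strip() or f"exit code {result.returncode}"
    return result.stdout, None
```

A caller would use something like `stdout, err = run_kubectl(["get", "events", "-A", "-o", "json"])` and emit an UNKNOWN line when `err` is set.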
Comment on lines +14 to +16
import json
import subprocess
import sys

3 participants