Integration: Add prometheusAlerts source type to TaskSpawner for alert-driven autonomous remediation #775

🤖 Kelos Strategist Agent @gjkim42

Summary

Propose a new prometheusAlerts source type for the TaskSpawner when field that discovers firing alerts from a Prometheus Alertmanager instance and spawns agent Tasks to investigate and remediate them. This enables a closed-loop observability workflow: alert fires → agent analyzes code/config → agent opens a fix PR, all within the Kubernetes cluster where Kelos already runs.

Motivation

Kelos currently supports four trigger sources: GitHub Issues, GitHub Pull Requests, Jira, and Cron. All are developer-initiated or time-based. There is no way to trigger agents from operational signals — the metrics and alerts that indicate something is wrong in production.

This is a gap because:

  1. Prometheus/Alertmanager is the de facto monitoring stack in Kubernetes — the same environment where Kelos runs. Most Kelos users will already have it deployed.
  2. Many alerts have code-level root causes that an AI agent can investigate: memory leaks, unoptimized queries, missing error handling, misconfigured resource limits, broken health checks.
  3. Alert → ticket → human → fix → deploy is slow. Alert → agent → fix PR cuts the response time from hours/days to minutes.
  4. Existing proposals for error tracking (Integration: Add errorTracking source type for production-error-driven agent remediation (Sentry, Datadog) #736) focus on application-level errors from external SaaS platforms (Sentry, Datadog). This proposal targets infrastructure/platform alerts from the cluster-local Prometheus stack — a fundamentally different trigger surface that doesn't require external service integration.

Proposed API

Add a new prometheusAlerts field to the When struct:

type When struct {
    // ... existing fields ...

    // PrometheusAlerts discovers firing alerts from a Prometheus Alertmanager instance.
    // +optional
    PrometheusAlerts *PrometheusAlerts `json:"prometheusAlerts,omitempty"`
}

type PrometheusAlerts struct {
    // URL is the Alertmanager API endpoint (e.g., "http://alertmanager.monitoring:9093").
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Pattern="^https?://.+"
    URL string `json:"url"`

    // AlertNames filters alerts by alertname. When empty, all firing alerts are discovered.
    // +optional
    AlertNames []string `json:"alertNames,omitempty"`

    // Labels filters alerts by label matchers (all must match).
    // Example: {"severity": "critical", "team": "backend"}
    // +optional
    Labels map[string]string `json:"labels,omitempty"`

    // ExcludeLabels excludes alerts matching any of these label pairs.
    // +optional
    ExcludeLabels map[string]string `json:"excludeLabels,omitempty"`

    // Severities filters alerts by severity label value.
    // +optional
    Severities []string `json:"severities,omitempty"`

    // PollInterval overrides spec.pollInterval for this source (e.g., "30s", "2m").
    // Shorter intervals are typical for alerts vs issue sources.
    // +optional
    PollInterval string `json:"pollInterval,omitempty"`

    // SecretRef optionally references a Secret for Alertmanager authentication
    // (key "ALERTMANAGER_TOKEN" for Bearer auth, or "ALERTMANAGER_USER" + "ALERTMANAGER_PASSWORD" for Basic auth).
    // +optional
    SecretRef *SecretReference `json:"secretRef,omitempty"`
}
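The filter semantics above (AlertNames and Severities as allow-lists, Labels as AND matchers, ExcludeLabels as OR excluders) could be sketched as follows. The Alert and Filter types here are illustrative stand-ins for the proposal, not actual Kelos types:

```go
package main

import (
	"fmt"
	"slices"
)

// Alert is a minimal stand-in for an Alertmanager alert (labels only).
type Alert struct {
	Labels map[string]string
}

// Filter mirrors the filtering fields of the proposed PrometheusAlerts spec.
type Filter struct {
	AlertNames    []string
	Labels        map[string]string
	ExcludeLabels map[string]string
	Severities    []string
}

// Matches returns true when the alert passes every configured filter.
func (f Filter) Matches(a Alert) bool {
	// AlertNames: when non-empty, alertname must be one of them.
	if len(f.AlertNames) > 0 && !slices.Contains(f.AlertNames, a.Labels["alertname"]) {
		return false
	}
	// Labels: every configured matcher must be present with the same value (AND).
	for k, v := range f.Labels {
		if a.Labels[k] != v {
			return false
		}
	}
	// ExcludeLabels: any single matching pair excludes the alert (OR).
	for k, v := range f.ExcludeLabels {
		if a.Labels[k] == v {
			return false
		}
	}
	// Severities: when non-empty, the severity label must be listed.
	if len(f.Severities) > 0 && !slices.Contains(f.Severities, a.Labels["severity"]) {
		return false
	}
	return true
}

func main() {
	f := Filter{
		Labels:        map[string]string{"severity": "critical"},
		ExcludeLabels: map[string]string{"silenced": "true"},
	}
	a := Alert{Labels: map[string]string{"alertname": "HighErrorRate", "severity": "critical"}}
	fmt.Println(f.Matches(a)) // true: severity matches and no exclude label is present
}
```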

WorkItem mapping

| WorkItem field | Source |
| --- | --- |
| ID | alertname + fingerprint (unique per alert instance) |
| Number | Alert fingerprint (int hash) |
| Title | alertname label value |
| Body | Alert annotations (summary + description) plus all labels as key-value pairs |
| URL | generatorURL from the alert (links to the Prometheus query) |
| Labels | All alert label keys (enables priorityLabels sorting by severity) |
| Kind | "Alert" |
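As a sketch of this mapping (the AMAlert and WorkItem types and the fingerprint-to-int scheme here are assumptions for illustration, not the actual Kelos internals):

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// AMAlert is a minimal stand-in for an Alertmanager v2 alert.
type AMAlert struct {
	Fingerprint  string // hex string from the v2 API
	Labels       map[string]string
	Annotations  map[string]string
	GeneratorURL string
}

// WorkItem approximates the spawner's work item (field names assumed).
type WorkItem struct {
	ID     string
	Number int
	Title  string
	Body   string
	URL    string
	Labels []string
	Kind   string
}

// toWorkItem applies the mapping table above.
func toWorkItem(a AMAlert) WorkItem {
	name := a.Labels["alertname"]

	// Number: a stable int derived from the hex fingerprint; truncated to
	// the last 8 hex digits to avoid overflow (the real hash may differ).
	suffix := a.Fingerprint
	if len(suffix) > 8 {
		suffix = suffix[len(suffix)-8:]
	}
	num, _ := strconv.ParseInt(suffix, 16, 64)

	// Body: annotations first, then all labels as key-value pairs.
	var b strings.Builder
	fmt.Fprintf(&b, "%s\n\n%s\n\nLabels:\n", a.Annotations["summary"], a.Annotations["description"])
	keys := make([]string, 0, len(a.Labels))
	for k := range a.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(&b, "  %s: %s\n", k, a.Labels[k])
	}

	return WorkItem{
		ID:     name + "-" + a.Fingerprint, // unique per alert instance
		Number: int(num),
		Title:  name,
		Body:   b.String(),
		URL:    a.GeneratorURL,
		Labels: keys,
		Kind:   "Alert",
	}
}

func main() {
	wi := toWorkItem(AMAlert{
		Fingerprint: "0a1b2c3d4e5f6a7b",
		Labels:      map[string]string{"alertname": "KubeContainerOOMKilled", "namespace": "production"},
		Annotations: map[string]string{"summary": "Container was OOM-killed"},
	})
	fmt.Println(wi.ID, wi.Kind) // KubeContainerOOMKilled-0a1b2c3d4e5f6a7b Alert
}
```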

Template variables

All standard variables apply. {{.Body}} contains the full alert context (annotations plus labels), giving the agent enough information to investigate the root cause.

Example configs

1. OOM-kill remediation agent

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: oom-remediator
spec:
  when:
    prometheusAlerts:
      url: http://alertmanager.monitoring.svc:9093
      alertNames: [KubePodCrashLooping, KubeContainerOOMKilled]
      labels:
        namespace: production
      pollInterval: 2m
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef:
        name: claude-credentials
    workspaceRef:
      name: main-app
    agentConfigRef:
      name: sre-agent
    branch: fix/alert-{{.ID}}
    promptTemplate: |
      A production alert is firing: {{.Title}}

      Alert details:
      {{.Body}}

      Prometheus query: {{.URL}}

      Investigate the root cause in the codebase. Look at resource limits,
      memory allocation patterns, and recent changes. If you can identify a
      fix, open a PR. If the issue requires manual intervention, create a
      GitHub issue with your analysis.
    ttlSecondsAfterFinished: 7200
    podOverrides:
      activeDeadlineSeconds: 1800
  maxConcurrency: 2
  maxTotalTasks: 20

2. SLO violation investigator (critical alerts only)

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: slo-investigator
spec:
  when:
    prometheusAlerts:
      url: http://alertmanager.monitoring.svc:9093
      labels:
        severity: critical
        team: platform
      excludeLabels:
        silenced: "true"
      pollInterval: 1m
  taskTemplate:
    type: claude-code
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth
    workspaceRef:
      name: platform-services
    agentConfigRef:
      name: sre-investigator
    branch: investigate/{{.Title}}-{{.ID}}
    promptTemplate: |
      CRITICAL ALERT: {{.Title}}

      {{.Body}}

      This alert indicates an SLO violation. Analyze the codebase to:
      1. Identify the likely root cause
      2. Check recent commits for related changes
      3. If a code fix is possible, open a PR
      4. Document your findings in a GitHub issue regardless
    ttlSecondsAfterFinished: 3600
  maxConcurrency: 3

Implementation approach

The implementation follows the established source pattern in internal/source/:

  1. New source file: internal/source/alertmanager.go implementing Source interface
  2. Alertmanager API client: Uses the Alertmanager v2 API GET /api/v2/alerts endpoint with query parameter filtering
  3. Deduplication: Use alert fingerprint as the work item ID. The spawner's existing dedup logic (skip items with existing non-terminal Tasks) prevents duplicate remediation Tasks for the same alert.
  4. Re-trigger on re-fire: If an alert resolves and fires again, TriggerTime is set to the alert's startsAt timestamp. If this is newer than the previous Task's completion time, a new Task is spawned — matching the existing retrigger pattern used by GitHub sources.
  5. Polling mode: Polling-based (creates a Deployment), consistent with GitHub and Jira sources. Default pollInterval should be shorter (e.g., "2m") since alerts are time-sensitive.
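Steps 1-2 could look roughly like this. The /api/v2/alerts endpoint and its active/silenced/inhibited/filter query parameters are the real Alertmanager v2 API; the function and type names are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// gettableAlert covers the Alertmanager v2 API fields this source needs.
type gettableAlert struct {
	Fingerprint  string            `json:"fingerprint"`
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations"`
	StartsAt     time.Time         `json:"startsAt"`
	GeneratorURL string            `json:"generatorURL"`
}

// buildAlertsQuery asks Alertmanager to pre-filter server-side: only
// active (firing), unsilenced, uninhibited alerts, narrowed by label
// matchers via the repeatable `filter` parameter.
func buildAlertsQuery(labels map[string]string) url.Values {
	q := url.Values{}
	q.Set("active", "true")
	q.Set("silenced", "false")
	q.Set("inhibited", "false")
	for k, v := range labels {
		q.Add("filter", fmt.Sprintf(`%s=%q`, k, v))
	}
	return q
}

// listFiringAlerts calls GET /api/v2/alerts with optional Bearer auth
// (the token would come from the SecretRef described above).
func listFiringAlerts(base string, labels map[string]string, token string) ([]gettableAlert, error) {
	req, err := http.NewRequest(http.MethodGet, base+"/api/v2/alerts?"+buildAlertsQuery(labels).Encode(), nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("alertmanager: unexpected status %s", resp.Status)
	}
	var alerts []gettableAlert
	err = json.NewDecoder(resp.Body).Decode(&alerts)
	return alerts, err
}

func main() {
	fmt.Println(buildAlertsQuery(map[string]string{"severity": "critical"}).Encode())
}
```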

API compatibility note

This is a purely additive change to the When struct — a new optional field alongside the existing four source types. No changes to existing source types or validation rules. The "exactly one field" validation in When already covers mutual exclusivity.

Why this matters

This integration positions Kelos as the bridge between observability and remediation in Kubernetes clusters. Instead of alerts sitting in dashboards waiting for human attention, agents can begin investigation immediately. Even when automated fixes aren't possible, the agent's analysis (posted as a GitHub issue) gives the on-call engineer a head start.

/kind feature
