🤖 Kelos Strategist Agent @gjkim42
## Summary

Propose a new `prometheusAlerts` source type in the TaskSpawner `when` field that discovers firing alerts from a Prometheus Alertmanager instance and spawns agent Tasks to investigate and remediate them. This enables a closed-loop observability workflow: alert fires → agent analyzes code/config → agent opens a fix PR — all within the Kubernetes cluster where Kelos already runs.
## Motivation
Kelos currently supports four trigger sources: GitHub Issues, GitHub Pull Requests, Jira, and Cron. All are developer-initiated or time-based. There is no way to trigger agents from operational signals — the metrics and alerts that indicate something is wrong in production.
This is a gap because:
- Prometheus/Alertmanager is the de facto monitoring stack in Kubernetes — the same environment where Kelos runs. Most Kelos users will already have it deployed.
- Many alerts have code-level root causes that an AI agent can investigate: memory leaks, unoptimized queries, missing error handling, misconfigured resource limits, broken health checks.
- Alert → ticket → human → fix → deploy is slow. Alert → agent → fix PR cuts the response time from hours/days to minutes.
- Existing proposals for error tracking (Integration: Add errorTracking source type for production-error-driven agent remediation (Sentry, Datadog) #736) focus on application-level errors from external SaaS platforms (Sentry, Datadog). This proposal targets infrastructure/platform alerts from the cluster-local Prometheus stack — a fundamentally different trigger surface that doesn't require external service integration.
## Proposed API

Add a new `prometheusAlerts` field to the `When` struct:
```go
type When struct {
	// ... existing fields ...

	// PrometheusAlerts discovers firing alerts from a Prometheus Alertmanager instance.
	// +optional
	PrometheusAlerts *PrometheusAlerts `json:"prometheusAlerts,omitempty"`
}

type PrometheusAlerts struct {
	// URL is the Alertmanager API endpoint (e.g., "http://alertmanager.monitoring:9093").
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Pattern="^https?://.+"
	URL string `json:"url"`

	// AlertNames filters alerts by alertname. When empty, all firing alerts are discovered.
	// +optional
	AlertNames []string `json:"alertNames,omitempty"`

	// Labels filters alerts by label matchers (all must match).
	// Example: {"severity": "critical", "team": "backend"}
	// +optional
	Labels map[string]string `json:"labels,omitempty"`

	// ExcludeLabels excludes alerts matching any of these label pairs.
	// +optional
	ExcludeLabels map[string]string `json:"excludeLabels,omitempty"`

	// Severities filters alerts by severity label value.
	// +optional
	Severities []string `json:"severities,omitempty"`

	// PollInterval overrides spec.pollInterval for this source (e.g., "30s", "2m").
	// Shorter intervals are typical for alerts vs issue sources.
	// +optional
	PollInterval string `json:"pollInterval,omitempty"`

	// SecretRef optionally references a Secret for Alertmanager authentication
	// (key "ALERTMANAGER_TOKEN" for Bearer auth, or "ALERTMANAGER_USER" +
	// "ALERTMANAGER_PASSWORD" for Basic auth).
	// +optional
	SecretRef *SecretReference `json:"secretRef,omitempty"`
}
```

## WorkItem mapping
| WorkItem field | Source |
|---|---|
| `ID` | `alertname` + fingerprint (unique per alert instance) |
| `Number` | Alert fingerprint (int hash) |
| `Title` | `alertname` label value |
| `Body` | Alert annotations (summary + description) + all labels as key-value pairs |
| `URL` | `generatorURL` from alert (links to Prometheus query) |
| `Labels` | All alert label keys (enables `priorityLabels` sorting by severity) |
| `Kind` | `"Alert"` |
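The mapping above can be sketched as a pure function from an Alertmanager alert to a work item. The `Alert` and `WorkItem` shapes here are illustrative stand-ins, not the actual Kelos types:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// Alert is an illustrative subset of an Alertmanager alert.
type Alert struct {
	Labels       map[string]string // includes "alertname"
	Annotations  map[string]string // "summary", "description"
	Fingerprint  string
	GeneratorURL string
}

// WorkItem is a stand-in for the spawner's work item type.
type WorkItem struct {
	ID     string
	Number int
	Title  string
	Body   string
	URL    string
	Labels []string
	Kind   string
}

func toWorkItem(a Alert) WorkItem {
	name := a.Labels["alertname"]

	// Number: stable int hash of the Alertmanager fingerprint.
	h := fnv.New32a()
	h.Write([]byte(a.Fingerprint))

	// Body: annotations first, then every label as a key-value pair.
	var body strings.Builder
	fmt.Fprintf(&body, "%s\n\n%s\n\nLabels:\n",
		a.Annotations["summary"], a.Annotations["description"])
	keys := make([]string, 0, len(a.Labels))
	for k := range a.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic Body and Labels order
	for _, k := range keys {
		fmt.Fprintf(&body, "  %s=%s\n", k, a.Labels[k])
	}

	return WorkItem{
		ID:     name + "-" + a.Fingerprint, // unique per alert instance
		Number: int(h.Sum32()),
		Title:  name,
		Body:   body.String(),
		URL:    a.GeneratorURL,
		Labels: keys,
		Kind:   "Alert",
	}
}

func main() {
	item := toWorkItem(Alert{
		Labels:      map[string]string{"alertname": "KubePodCrashLooping", "severity": "critical"},
		Annotations: map[string]string{"summary": "Pod is crash looping"},
		Fingerprint: "3c9f1a",
	})
	fmt.Println(item.ID, item.Kind) // KubePodCrashLooping-3c9f1a Alert
}
```

Sorting the label keys keeps `ID`-adjacent output stable across polls, which matters for the dedup logic described below.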
## Template variables

All standard variables apply. The `{{.Body}}` variable contains the full alert context (annotations + labels), giving the agent enough information to investigate the root cause.
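The `{{.Title}}`/`{{.Body}}` syntax is Go `text/template` notation, so expansion presumably behaves as below; the `WorkItem` shape and `renderPrompt` helper are illustrative, not Kelos code:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// WorkItem carries the standard template variables (illustrative subset).
type WorkItem struct {
	ID, Title, Body, URL string
}

// renderPrompt expands a promptTemplate with Go's text/template,
// matching the {{.Field}} syntax used in the example configs.
func renderPrompt(tmplText string, item WorkItem) (string, error) {
	tmpl, err := template.New("prompt").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, item); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderPrompt(
		"A production alert is firing: {{.Title}}\n\n{{.Body}}\n",
		WorkItem{
			Title: "KubeContainerOOMKilled",
			Body:  "summary: container OOM-killed\nnamespace=production severity=critical",
		})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```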
## Example configs

### 1. OOM-kill remediation agent

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: oom-remediator
spec:
  when:
    prometheusAlerts:
      url: http://alertmanager.monitoring.svc:9093
      alertNames: [KubePodCrashLooping, KubeContainerOOMKilled]
      labels:
        namespace: production
      pollInterval: 2m
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef:
        name: claude-credentials
    workspaceRef:
      name: main-app
    agentConfigRef:
      name: sre-agent
    branch: fix/alert-{{.ID}}
    promptTemplate: |
      A production alert is firing: {{.Title}}

      Alert details:
      {{.Body}}

      Prometheus query: {{.URL}}

      Investigate the root cause in the codebase. Look at resource limits,
      memory allocation patterns, and recent changes. If you can identify a
      fix, open a PR. If the issue requires manual intervention, create a
      GitHub issue with your analysis.
    ttlSecondsAfterFinished: 7200
    podOverrides:
      activeDeadlineSeconds: 1800
  maxConcurrency: 2
  maxTotalTasks: 20
```

### 2. SLO violation investigator (critical alerts only)
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: slo-investigator
spec:
  when:
    prometheusAlerts:
      url: http://alertmanager.monitoring.svc:9093
      labels:
        severity: critical
        team: platform
      excludeLabels:
        silenced: "true"
      pollInterval: 1m
  taskTemplate:
    type: claude-code
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth
    workspaceRef:
      name: platform-services
    agentConfigRef:
      name: sre-investigator
    branch: investigate/{{.Title}}-{{.ID}}
    promptTemplate: |
      CRITICAL ALERT: {{.Title}}

      {{.Body}}

      This alert indicates an SLO violation. Analyze the codebase to:
      1. Identify the likely root cause
      2. Check recent commits for related changes
      3. If a code fix is possible, open a PR
      4. Document your findings in a GitHub issue regardless
    ttlSecondsAfterFinished: 3600
  maxConcurrency: 3
```

## Implementation approach
The implementation follows the established source pattern in `internal/source/`:

- New source file: `internal/source/alertmanager.go` implementing the `Source` interface.
- Alertmanager API client: uses the Alertmanager v2 API `GET /api/v2/alerts` endpoint with query-parameter filtering.
- Deduplication: use the alert fingerprint as the work item ID. The spawner's existing dedup logic (skip items with existing non-terminal Tasks) prevents duplicate remediation Tasks for the same alert.
- Re-trigger on re-fire: if an alert resolves and fires again, `TriggerTime` is set to the alert's `startsAt` timestamp. If this is newer than the previous Task's completion time, a new Task is spawned — matching the existing retrigger pattern used by the GitHub sources.
- Polling mode: polling-based (creates a Deployment), consistent with the GitHub and Jira sources. The default `pollInterval` should be shorter (e.g., `"2m"`) since alerts are time-sensitive.
## API compatibility note

This is a purely additive change to the `When` struct — a new optional field alongside the existing four source types. No changes to existing source types or validation rules. The "exactly one field" validation in `When` already covers mutual exclusivity.
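A sketch of why the check stays additive: the new pointer field just joins the candidate list being counted. All type and field names besides `PrometheusAlerts` are hypothetical stand-ins for the existing four sources, not the real Kelos validation code:

```go
package main

import "fmt"

// Hypothetical stand-ins for the existing source configs.
type GitHubIssues struct{}
type GitHubPullRequests struct{}
type Jira struct{}
type Cron struct{}
type PrometheusAlerts struct{}

type When struct {
	GitHubIssues       *GitHubIssues
	GitHubPullRequests *GitHubPullRequests
	Jira               *Jira
	Cron               *Cron
	PrometheusAlerts   *PrometheusAlerts // new optional field
}

// validateExactlyOne mirrors the "exactly one field" rule: adding a
// source only extends the slice below, so the rule's shape is unchanged.
func validateExactlyOne(w When) error {
	n := 0
	for _, set := range []bool{
		w.GitHubIssues != nil,
		w.GitHubPullRequests != nil,
		w.Jira != nil,
		w.Cron != nil,
		w.PrometheusAlerts != nil,
	} {
		if set {
			n++
		}
	}
	if n != 1 {
		return fmt.Errorf("when: exactly one source must be set, got %d", n)
	}
	return nil
}

func main() {
	fmt.Println(validateExactlyOne(When{PrometheusAlerts: &PrometheusAlerts{}})) // <nil>
	fmt.Println(validateExactlyOne(When{}))
}
```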
## Differentiation from existing proposals
- Integration: Add errorTracking source type for production-error-driven agent remediation (Sentry, Datadog) #736 (errorTracking / Sentry / Datadog): Focuses on application-level error events from external SaaS platforms, requiring outbound API integration. This proposal targets cluster-local Prometheus alerts — no external dependencies, leveraging the Kubernetes-native stack already in place.
- Integration: Add kubernetesResources source type to TaskSpawner for cluster-event-driven agent execution #697 (kubernetesResources): Watches arbitrary Kubernetes resource state changes (e.g., Pod status, CRD updates). Prometheus alerts are metric-derived signals with rich context (annotations, labels, generator URLs) that don't map cleanly to K8s resource watches.
- Integration: Add generic webhook source type to TaskSpawner for universal event-driven task triggering #687 (generic webhooks): Alertmanager could technically push to a webhook endpoint, but that requires the user to configure Alertmanager receivers, expose an ingress, and handle authentication — significant operational overhead. A native polling source is simpler and follows Kelos's existing pattern.
## Why this matters
This integration positions Kelos as the bridge between observability and remediation in Kubernetes clusters. Instead of alerts sitting in dashboards waiting for human attention, agents can begin investigation immediately. Even when automated fixes aren't possible, the agent's analysis (posted as a GitHub issue) gives the on-call engineer a head start.
/kind feature