🤖 Kelos Strategist Agent @gjkim42
## Summary

Propose a new `prometheusAlerts` source type in the TaskSpawner `when` field that discovers firing alerts from a Prometheus Alertmanager instance and spawns agent Tasks to investigate and remediate them. This enables a closed-loop observability workflow: alert fires → agent analyzes code/config → agent opens a fix PR — all within the Kubernetes cluster where Kelos already runs.
## Motivation
Kelos currently supports four trigger sources: GitHub Issues, GitHub Pull Requests, Jira, and Cron. All are developer-initiated or time-based. There is no way to trigger agents from operational signals — the metrics and alerts that indicate something is wrong in production.
This is a gap because:
- Prometheus/Alertmanager is the de facto monitoring stack in Kubernetes — the same environment where Kelos runs. Most Kelos users will already have it deployed.
- Many alerts have code-level root causes that an AI agent can investigate: memory leaks, unoptimized queries, missing error handling, misconfigured resource limits, broken health checks.
- Alert → ticket → human → fix → deploy is slow. Alert → agent → fix PR cuts the response time from hours/days to minutes.
- Existing proposals for error tracking (Integration: Add errorTracking source type for production-error-driven agent remediation (Sentry, Datadog) #736) focus on application-level errors from external SaaS platforms (Sentry, Datadog). This proposal targets infrastructure/platform alerts from the cluster-local Prometheus stack — a fundamentally different trigger surface that doesn't require external service integration.
## Proposed API

Add a new `prometheusAlerts` field to the `When` struct:
```go
type When struct {
	// ... existing fields ...

	// PrometheusAlerts discovers firing alerts from a Prometheus Alertmanager instance.
	// +optional
	PrometheusAlerts *PrometheusAlerts `json:"prometheusAlerts,omitempty"`
}

type PrometheusAlerts struct {
	// URL is the Alertmanager API endpoint (e.g., "http://alertmanager.monitoring:9093").
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Pattern="^https?://.+"
	URL string `json:"url"`

	// AlertNames filters alerts by alertname. When empty, all firing alerts are discovered.
	// +optional
	AlertNames []string `json:"alertNames,omitempty"`

	// Labels filters alerts by label matchers (all must match).
	// Example: {"severity": "critical", "team": "backend"}
	// +optional
	Labels map[string]string `json:"labels,omitempty"`

	// ExcludeLabels excludes alerts matching any of these label pairs.
	// +optional
	ExcludeLabels map[string]string `json:"excludeLabels,omitempty"`

	// Severities filters alerts by severity label value.
	// +optional
	Severities []string `json:"severities,omitempty"`

	// PollInterval overrides spec.pollInterval for this source (e.g., "30s", "2m").
	// Shorter intervals are typical for alerts vs issue sources.
	// +optional
	PollInterval string `json:"pollInterval,omitempty"`

	// SecretRef optionally references a Secret for Alertmanager authentication
	// (key "ALERTMANAGER_TOKEN" for Bearer auth, or "ALERTMANAGER_USER" +
	// "ALERTMANAGER_PASSWORD" for Basic auth).
	// +optional
	SecretRef *SecretReference `json:"secretRef,omitempty"`
}
```

## WorkItem mapping
| WorkItem field | Source |
|---|---|
| `ID` | `alertname` + fingerprint (unique per alert instance) |
| `Number` | Alert fingerprint (int hash) |
| `Title` | `alertname` label value |
| `Body` | Alert annotations (summary + description) + all labels as key-value pairs |
| `URL` | `generatorURL` from alert (links to Prometheus query) |
| `Labels` | All alert label keys (enables `priorityLabels` sorting by severity) |
| `Kind` | `"Alert"` |
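The mapping above can be sketched as a pure function from an Alertmanager alert to a work item. The `Alert` and `WorkItem` shapes here are illustrative stand-ins, not the actual Kelos types:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// Alert is an illustrative subset of an Alertmanager alert.
type Alert struct {
	Labels       map[string]string // includes "alertname"
	Annotations  map[string]string // "summary", "description"
	Fingerprint  string
	GeneratorURL string
}

// WorkItem is a stand-in for the spawner's work item type.
type WorkItem struct {
	ID     string
	Number int
	Title  string
	Body   string
	URL    string
	Labels []string
	Kind   string
}

func toWorkItem(a Alert) WorkItem {
	name := a.Labels["alertname"]

	// Number: stable int hash of the Alertmanager fingerprint.
	h := fnv.New32a()
	h.Write([]byte(a.Fingerprint))

	// Body: annotations first, then every label as a key-value pair.
	var body strings.Builder
	fmt.Fprintf(&body, "%s\n\n%s\n\nLabels:\n",
		a.Annotations["summary"], a.Annotations["description"])
	keys := make([]string, 0, len(a.Labels))
	for k := range a.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic Body and Labels order
	for _, k := range keys {
		fmt.Fprintf(&body, "  %s=%s\n", k, a.Labels[k])
	}

	return WorkItem{
		ID:     name + "-" + a.Fingerprint, // unique per alert instance
		Number: int(h.Sum32()),
		Title:  name,
		Body:   body.String(),
		URL:    a.GeneratorURL,
		Labels: keys,
		Kind:   "Alert",
	}
}

func main() {
	item := toWorkItem(Alert{
		Labels:      map[string]string{"alertname": "KubePodCrashLooping", "severity": "critical"},
		Annotations: map[string]string{"summary": "Pod is crash looping"},
		Fingerprint: "3c9f1a",
	})
	fmt.Println(item.ID, item.Kind) // KubePodCrashLooping-3c9f1a Alert
}
```

Sorting the label keys keeps `ID`-adjacent output stable across polls, which matters for the dedup logic described below.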
## Template variables

All standard variables apply. The `{{.Body}}` variable contains the full alert context (annotations + labels), giving the agent enough information to investigate the root cause.
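The `{{.Title}}`/`{{.Body}}` syntax is Go `text/template` notation, so expansion presumably behaves as below; the `WorkItem` shape and `renderPrompt` helper are illustrative, not Kelos code:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// WorkItem carries the standard template variables (illustrative subset).
type WorkItem struct {
	ID, Title, Body, URL string
}

// renderPrompt expands a promptTemplate with Go's text/template,
// matching the {{.Field}} syntax used in the example configs.
func renderPrompt(tmplText string, item WorkItem) (string, error) {
	tmpl, err := template.New("prompt").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, item); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderPrompt(
		"A production alert is firing: {{.Title}}\n\n{{.Body}}\n",
		WorkItem{
			Title: "KubeContainerOOMKilled",
			Body:  "summary: container OOM-killed\nnamespace=production severity=critical",
		})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```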
## Example configs

### 1. OOM-kill remediation agent

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: oom-remediator
spec:
  when:
    prometheusAlerts:
      url: http://alertmanager.monitoring.svc:9093
      alertNames: [KubePodCrashLooping, KubeContainerOOMKilled]
      labels:
        namespace: production
      pollInterval: 2m
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef:
        name: claude-credentials
    workspaceRef:
      name: main-app
    agentConfigRef:
      name: sre-agent
    branch: fix/alert-{{.ID}}
    promptTemplate: |
      A production alert is firing: {{.Title}}

      Alert details:
      {{.Body}}

      Prometheus query: {{.URL}}

      Investigate the root cause in the codebase. Look at resource limits,
      memory allocation patterns, and recent changes. If you can identify a
      fix, open a PR. If the issue requires manual intervention, create a
      GitHub issue with your analysis.
    ttlSecondsAfterFinished: 7200
    podOverrides:
      activeDeadlineSeconds: 1800
  maxConcurrency: 2
  maxTotalTasks: 20
```

### 2. SLO violation investigator (critical alerts only)
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: slo-investigator
spec:
  when:
    prometheusAlerts:
      url: http://alertmanager.monitoring.svc:9093
      labels:
        severity: critical
        team: platform
      excludeLabels:
        silenced: "true"
      pollInterval: 1m
  taskTemplate:
    type: claude-code
    credentials:
      type: oauth
      secretRef:
        name: claude-oauth
    workspaceRef:
      name: platform-services
    agentConfigRef:
      name: sre-investigator
    branch: investigate/{{.Title}}-{{.ID}}
    promptTemplate: |
      CRITICAL ALERT: {{.Title}}

      {{.Body}}

      This alert indicates an SLO violation. Analyze the codebase to:
      1. Identify the likely root cause
      2. Check recent commits for related changes
      3. If a code fix is possible, open a PR
      4. Document your findings in a GitHub issue regardless
    ttlSecondsAfterFinished: 3600
  maxConcurrency: 3
```

## Implementation approach
The implementation follows the established source pattern in `internal/source/`:

- New source file: `internal/source/alertmanager.go` implementing the `Source` interface.
- Alertmanager API client: uses the Alertmanager v2 API `GET /api/v2/alerts` endpoint with query-parameter filtering.
- Deduplication: use the alert fingerprint as the work item ID. The spawner's existing dedup logic (skip items with existing non-terminal Tasks) prevents duplicate remediation Tasks for the same alert.
- Re-trigger on re-fire: if an alert resolves and fires again, `TriggerTime` is set to the alert's `startsAt` timestamp. If this is newer than the previous Task's completion time, a new Task is spawned — matching the existing retrigger pattern used by the GitHub sources.
- Polling mode: polling-based (creates a Deployment), consistent with the GitHub and Jira sources. The default `pollInterval` should be shorter (e.g., `"2m"`) since alerts are time-sensitive.
## API compatibility note

This is a purely additive change to the `When` struct — a new optional field alongside the existing four source types. No changes to existing source types or validation rules. The "exactly one field" validation in `When` already covers mutual exclusivity.
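A sketch of why the check stays additive: the new pointer field just joins the candidate list being counted. All type and field names besides `PrometheusAlerts` are hypothetical stand-ins for the existing four sources, not the real Kelos validation code:

```go
package main

import "fmt"

// Hypothetical stand-ins for the existing source configs.
type GitHubIssues struct{}
type GitHubPullRequests struct{}
type Jira struct{}
type Cron struct{}
type PrometheusAlerts struct{}

type When struct {
	GitHubIssues       *GitHubIssues
	GitHubPullRequests *GitHubPullRequests
	Jira               *Jira
	Cron               *Cron
	PrometheusAlerts   *PrometheusAlerts // new optional field
}

// validateExactlyOne mirrors the "exactly one field" rule: adding a
// source only extends the slice below, so the rule's shape is unchanged.
func validateExactlyOne(w When) error {
	n := 0
	for _, set := range []bool{
		w.GitHubIssues != nil,
		w.GitHubPullRequests != nil,
		w.Jira != nil,
		w.Cron != nil,
		w.PrometheusAlerts != nil,
	} {
		if set {
			n++
		}
	}
	if n != 1 {
		return fmt.Errorf("when: exactly one source must be set, got %d", n)
	}
	return nil
}

func main() {
	fmt.Println(validateExactlyOne(When{PrometheusAlerts: &PrometheusAlerts{}})) // <nil>
	fmt.Println(validateExactlyOne(When{}))
}
```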
## Differentiation from existing proposals
- Integration: Add errorTracking source type for production-error-driven agent remediation (Sentry, Datadog) #736 (errorTracking / Sentry / Datadog): Focuses on application-level error events from external SaaS platforms, requiring outbound API integration. This proposal targets cluster-local Prometheus alerts — no external dependencies, leveraging the Kubernetes-native stack already in place.
- Integration: Add kubernetesResources source type to TaskSpawner for cluster-event-driven agent execution #697 (kubernetesResources): Watches arbitrary Kubernetes resource state changes (e.g., Pod status, CRD updates). Prometheus alerts are metric-derived signals with rich context (annotations, labels, generator URLs) that don't map cleanly to K8s resource watches.
- Integration: Add generic webhook source type to TaskSpawner for universal event-driven task triggering #687 (generic webhooks): Alertmanager could technically push to a webhook endpoint, but that requires the user to configure Alertmanager receivers, expose an ingress, and handle authentication — significant operational overhead. A native polling source is simpler and follows Kelos's existing pattern.
## Why this matters
This integration positions Kelos as the bridge between observability and remediation in Kubernetes clusters. Instead of alerts sitting in dashboards waiting for human attention, agents can begin investigation immediately. Even when automated fixes aren't possible, the agent's analysis (posted as a GitHub issue) gives the on-call engineer a head start.
/kind feature