Automated Remediation and Playbooks #6

@achtsnits

Problem Statement

When alerts fire, the response is still mostly manual. Alertmanager can route and deliver alert notifications, but it is not built to run remediation steps or operational workflows.

This creates a gap between detecting a problem and reacting to it in a clear and repeatable way. Operators still need to investigate the issue, decide what to do, and execute the action manually.

Keep helps close this gap. It adds workflow execution and playbook-based handling on top of the existing alerting setup. This makes it possible to move from simple notifications to guided or automated remediation.

This work follows up on #4 and applies it to one concrete, EO-relevant demo setup: an EO platform stack (APISIX -> eoapi -> PostgreSQL) serving STAC APIs. The goal is to show how reusable remediation patterns can be implemented and tested in a real platform scenario.

Scope

Introduce GitOps-managed remediation workflows and playbooks for the APISIX -> eoapi -> PostgreSQL setup.

Tasks

  • Define remediation actions such as restart, scale, or cleanup
  • Implement playbooks managed through Git
  • Use Keep to trigger or coordinate remediation workflows
  • Validate the full flow: SLO -> burn rate -> alert -> enrichment -> remediation
  • Test the setup with synthetic load scenarios
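
To make the validation flow above more concrete, here is a minimal Python sketch of the chain SLO -> burn rate -> alert -> enrichment -> remediation. It is only an illustration and not how Keep or Alertmanager implement anything. The alert names, the `PLAYBOOKS` table, and all function names are hypothetical. The burn rate uses the usual definition (observed error ratio divided by the error budget, `1 - SLO target`), and the remediation step only returns a description of the planned action instead of executing it.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical playbook table (assumption, not from this issue):
# alert name -> one of the remediation actions listed in the tasks.
PLAYBOOKS = {
    "EoapiHighErrorBurnRate": "restart",
    "ApisixLatencyBurnRate": "scale",
    "PostgresConnectionsExhausted": "cleanup",
}


@dataclass
class Alert:
    name: str
    service: str
    burn_rate: float
    labels: dict = field(default_factory=dict)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def evaluate(service: str, alert_name: str, error_ratio: float,
             slo_target: float, threshold: float) -> Optional[Alert]:
    """SLO -> burn rate -> alert: return an Alert once the threshold is reached."""
    rate = burn_rate(error_ratio, slo_target)
    if rate < threshold:
        return None
    return Alert(name=alert_name, service=service, burn_rate=rate)


def enrich(alert: Alert) -> Alert:
    """Enrichment: attach the playbook action; unknown alerts stay notify-only."""
    alert.labels["action"] = PLAYBOOKS.get(alert.name, "notify-only")
    return alert


def plan(alert: Alert) -> str:
    """Remediation (dry run): describe the planned action, do not execute it."""
    return f"{alert.labels.get('action', 'notify-only')} {alert.service}"


if __name__ == "__main__":
    # Synthetic scenario: 2% errors against a 99.9% availability target.
    alert = evaluate("eoapi", "EoapiHighErrorBurnRate", 0.02, 0.999, 14.4)
    if alert is not None:
        print(plan(enrich(alert)))
```

In this synthetic scenario a 2% error ratio against a 99.9% target gives a burn rate of about 20, which is above the example threshold of 14.4, so the alert is raised and mapped to the hypothetical restart playbook. Keeping the mapping as plain data is one way to make playbooks reviewable through Git.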

Outcome

A repeatable setup for guided or automated remediation that improves response time and operational consistency, demonstrated on a realistic EO platform use case.
