altikva/spero: Self-healing supervision agent for Linux hosts and Kubernetes

 ___ _ __   ___ _ __ ___
/ __| '_ \ / _ \ '__/ _ \
\__ \ |_) |  __/ | | (_) |
|___/ .__/ \___|_|  \___/
    |_|

Self-healing supervision agent for Linux hosts and Kubernetes.

Spero watches the things you run (processes, services, disks, workloads), notices when they break, and heals them under policy-governed autonomy. It sits between a single-host tool like Monit and a full Prometheus + Alertmanager + Ansible stack: lightweight, agent-based, cluster-aware, with an AI layer for prediction, root-cause analysis, and policy-gated remediation.

Shipped under Altikva.

Status: pre-alpha but functional, published on PyPI as spero 0.4.0. It supervises and heals Linux hosts (local + asyncssh) and Kubernetes (kubectl) through one engine, with an AI layer (prediction, root-cause, natural-language queries, agentic policy-gated remediation), a control plane with live status / metrics / logs, dial-home fleet operation, and pluggable alerting. 265 tests, ruff + mypy clean.

Concepts

Spero is built on four seams so it spans hosts and Kubernetes through one engine:

Seam	Question	Examples
Provider	where things run	`local`, `ssh:[user@]host[:port]`, `k8s:[context][/namespace]`
Probe	how you know it is healthy	host: process, systemd, port, disk; k8s: pod, deployment, restart-count, resource-usage, pvc, cert-expiry; serverless: keda-scaledobject, knative-service, elpio-*; data: http, command, postgres, kafka, trino, clickhouse
Remediation	what to do about it	host: restart, respawn, kill, rotate; k8s: rollout-restart, scale, delete-pod, patch-requests, keda-unpause
Policy	the declared intent	YAML: target → probe → remediations + autonomy

Each seam is a registry: add a capability by writing the class and registering it. Remediations carry an autonomy level (suggest, gated, or auto), so low-risk healing happens on its own while high-risk actions wait for a human (or the AI approver). Destructive actions (kill, delete-pod, patch-requests) can never be auto.
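As an illustration of the registry and autonomy rules above, here is a minimal sketch. It is not spero's actual code: `REMEDIATIONS`, `register`, and `RemediationStep` are hypothetical names, and only the autonomy levels and the list of destructive actions come from this README.

```python
from dataclasses import dataclass
from typing import Callable, Dict

AUTONOMY_LEVELS = ("suggest", "gated", "auto")
# Destructive actions named in the README; they may never run as "auto".
DESTRUCTIVE = {"kill", "delete-pod", "patch-requests"}

# Hypothetical registry: remediation name -> handler callable.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a handler to the registry under `name`."""
    def wrap(fn):
        REMEDIATIONS[name] = fn
        return fn
    return wrap


@dataclass(frozen=True)
class RemediationStep:
    """One policy entry: which remediation to use and how freely it may run."""
    name: str
    autonomy: str

    def __post_init__(self):
        if self.name not in REMEDIATIONS:
            raise KeyError(f"unknown remediation: {self.name}")
        if self.autonomy not in AUTONOMY_LEVELS:
            raise ValueError(f"unknown autonomy level: {self.autonomy}")
        if self.autonomy == "auto" and self.name in DESTRUCTIVE:
            raise ValueError(f"{self.name} is destructive and cannot be auto")


@register("restart")
def restart(target: str) -> str:
    return f"restarted {target}"


@register("kill")
def kill(target: str) -> str:
    return f"killed {target}"
```

In this sketch the destructive-action rule is enforced when the step is constructed, so a policy that pairs kill with auto is rejected at load time rather than at remediation time.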

Install

pip install spero            # or: uv pip install spero
pip install "spero[ai]"      # Claude-backed AI layer (prediction, root-cause, NL queries)
pip install "spero[k8s]"     # Kubernetes provider deps
pip install "spero[tui]"     # Textual dashboard for `spero top`
pip install "spero[certs]"   # the cert-expiry probe (cryptography)

Configuration is env-driven (SPERO_* prefix or a .env file): SPERO_POLICY_PATH, SPERO_DATABASE_URL, SPERO_HOST, SPERO_PORT, plus the auth / alerting / privacy knobs below.
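To make the env-driven model concrete, the following sketch reads the four variables named above into a settings object. The `Settings` class and `load_settings` function are hypothetical, the host default is an assumption, and the port default simply mirrors the :8800 shown for spero serve. It does not parse .env files.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    policy_path: Optional[str]
    database_url: Optional[str]
    host: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the SPERO_* variables from the README; defaults are assumptions."""
    return Settings(
        policy_path=env.get("SPERO_POLICY_PATH"),
        database_url=env.get("SPERO_DATABASE_URL"),
        host=env.get("SPERO_HOST", "127.0.0.1"),  # assumed default
        port=int(env.get("SPERO_PORT", "8800")),  # :8800 per `spero serve`
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.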

Quickstart

Running spero with no arguments shows a branded landing screen and the command list. The main commands:

spero status                       # show targets from the active policy
spero run                          # run one supervision cycle (gated actions wait for a human)
spero run --ai-approve             # agentic: the model decides gated remediations
spero watch                        # supervise continuously, each target on its interval
spero top                          # live k9s-style dashboard (needs the tui extra)
spero heal nginx                   # probe one target, walk its remediations interactively
spero ask "what flapped today?"    # natural-language query over the event history
spero diagnose nginx               # LLM root-cause sketch for a target
spero forecast disk-root           # predictive: when a disk crosses a threshold
spero serve                        # run the control-plane API on :8800
spero --version

A single spero run cycle reports each target's health and the action taken.
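The README does not say how spero forecast predicts a threshold crossing. One simple way to think about it is a least-squares line through recent usage samples, extrapolated to the threshold. The sketch below shows that technique only; `hours_until_threshold` is a hypothetical helper, not spero's implementation.

```python
from typing import Optional, Sequence, Tuple


def hours_until_threshold(
    samples: Sequence[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Fit a least-squares line through (hour, percent_used) samples.

    Returns hours from the last sample until the line reaches `threshold`,
    0.0 if usage is already at or past it, or None if there are too few
    samples or usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    last_t, last_u = samples[-1]
    if last_u >= threshold:
        return 0.0
    if slope <= 0:
        return None  # no growth, so no predicted crossing
    intercept = mean_u - slope * mean_t
    crossing_t = (threshold - intercept) / slope
    return max(crossing_t - last_t, 0.0)
```

For example, usage of 70%, 72%, and 74% at hours 0, 1, and 2 grows by 2 points per hour, so a 90% threshold is reached 8 hours after the last sample.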

Control plane and dashboards

spero serve runs a FastAPI control plane that supervises in the background and exposes its live state over HTTP:

Endpoint	What
`/health`	liveness (always open)
`/status`, `/events`	per-target health + the last action, recent events
`/objects/{target}`	the target's underlying object as YAML
`/logs/{target}`	last N log lines; `/logs/{target}/stream` follows over SSE
`/metrics`	Prometheus text (per-target health and failure counts)

spero top renders a k9s-style live grid of targets and a rolling event feed. With the tui extra it is a full Textual UI (mouse, scrollback, command palette); without it, it falls back to rich.Live. Keys: a approves a gated action, f toggles the freeze, i inspects YAML, l tails logs, L follows logs, s opens a shell in a pod (local, using your kubectl). spero top --remote http://host:port observes a running worker over the endpoints above instead of probing locally.

Every route except /health is guarded by a bearer token when SPERO_API_TOKEN is set (empty means auth off, the localhost default); pass it to an observer with spero top --remote <url> --token <token>.
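A script can talk to the control plane the same way an observer does. The sketch below builds a request with the standard `Authorization: Bearer` header using only the standard library. `build_request` and `fetch_json` are hypothetical helpers, and treating `/status` as a JSON response is an assumption, since the README does not state the response format.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(
    base_url: str, path: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build a GET request, adding a bearer token only when one is given."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def fetch_json(base_url: str, path: str, token: Optional[str] = None) -> Any:
    """Fetch and decode an endpoint, assuming it returns JSON."""
    req = build_request(base_url, path, token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a worker running locally, `fetch_json("http://localhost:8800", "/status", token)` would return the decoded body. `/health` needs no token because it is always open.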

Fleet operation (dial-home)

For clusters you cannot reach inbound, run the worker as spero agent --owner <url>: it supervises locally and dials out to a spero owner service, reporting status and events on a timer and pulling orders. The owner answers gated remediations as a remote approver and can push a new policy to a running agent (hot-swapped, no redeploy). Remediations set to auto still run if the owner is offline. SPERO_OWNER_TOKEN guards the owner.
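The interplay between autonomy levels and the owner can be sketched as a small decision function. This is an illustration under stated assumptions, not the agent's real logic: `Decision` and `decide` are hypothetical, and the README does not say what happens to a gated action while the owner is unreachable (here it waits) or how suggest is handled (here it is report-only).

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    RUN = "run"
    WAIT = "wait"
    SKIP = "skip"


def decide(autonomy: str, ask_owner: Callable[[], Optional[bool]]) -> Decision:
    """Decide what an agent does with one remediation.

    `ask_owner` returns True or False for an owner verdict, or None when
    the owner cannot be reached.
    """
    if autonomy == "auto":
        return Decision.RUN  # per the README: runs even if the owner is offline
    if autonomy == "suggest":
        return Decision.SKIP  # assumption: suggestions are report-only
    verdict = ask_owner()  # gated: the owner acts as remote approver
    if verdict is None:
        return Decision.WAIT  # assumption: hold until the owner is back
    return Decision.RUN if verdict else Decision.SKIP
```

Note that the auto branch never calls `ask_owner`, which is what keeps low-risk healing independent of connectivity.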

Alerting

Spero fires on first failure and resolves on recovery. NullAlerter is the default; configure email (SMTP with optional STARTTLS + login), a generic JSON webhook (SPERO_ALERT_WEBHOOK_URL), or Slack (SPERO_SLACK_WEBHOOK_URL). The channel is used by run, watch, serve, and the dial-home agent.
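"Fires on first failure and resolves on recovery" describes edge-triggered alerting: a message goes out on a state change, not on every failing probe. A minimal sketch of that behaviour follows. Only the name NullAlerter comes from this README; the `send` signature, `RecordingAlerter`, `AlertTracker`, and the "firing"/"resolved" labels are assumptions for illustration.

```python
from typing import Dict, List, Protocol, Tuple


class Alerter(Protocol):
    def send(self, target: str, state: str) -> None: ...


class NullAlerter:
    """Default channel in this sketch: drops every alert."""

    def send(self, target: str, state: str) -> None:
        pass


class RecordingAlerter:
    """Test double that keeps alerts in memory."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, target: str, state: str) -> None:
        self.sent.append((target, state))


class AlertTracker:
    """Alert only on healthy->failing and failing->healthy transitions."""

    def __init__(self, alerter: Alerter) -> None:
        self.alerter = alerter
        self.failing: Dict[str, bool] = {}

    def observe(self, target: str, healthy: bool) -> None:
        was_failing = self.failing.get(target, False)
        if not healthy and not was_failing:
            self.alerter.send(target, "firing")
        elif healthy and was_failing:
            self.alerter.send(target, "resolved")
        self.failing[target] = not healthy
```

Feeding the tracker a sequence of healthy, failing, failing, healthy produces exactly two alerts: one when the failure starts and one when it clears.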

Kubernetes deployment

Run spero in-cluster with the Kustomize manifests in deploy/k8s/: a supervise-only base (read RBAC, always-on Deployment, default-deny NetworkPolicy) and an acting overlay that opts into remediation with exactly the mutating verbs it needs plus leader-election leases. The image runs non-root with a read-only root filesystem. See deploy/k8s/README.md.

Data egress and privacy

With ANTHROPIC_API_KEY set, spero ask, spero diagnose, and --ai-approve send text to Anthropic's API: the question, target names, remediation params, and recorded event details (which can include command output). Event details are also stored in the local sqlite database. With no key, spero uses the NullLLM fallback and nothing leaves the host. To scrub likely secrets/PII from event text before it is sent to the model, set SPERO_REDACT_EVENTS=1 (best-effort; see src/spero/ai/redact.py). The control-plane endpoints that expose object YAML and pod logs are token-guarded; see deploy/k8s/README.md.
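To show what best-effort redaction means in practice, here is a small regex-based scrubber. It is not the content of src/spero/ai/redact.py. The patterns and the `redact` function below are hypothetical, and they will miss secrets that do not match these shapes.

```python
import re

# Illustrative patterns only; real secrets take many more shapes.
_PATTERNS = [
    # key=value or key: value pairs whose key looks sensitive
    (
        re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    # bearer tokens in copied HTTP headers
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    # email addresses
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
]


def redact(text: str) -> str:
    """Apply each pattern in order and return the scrubbed text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Because command output can carry credentials in arbitrary formats, a scrubber like this reduces exposure but cannot guarantee that nothing sensitive is sent.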

From source

git clone https://github.com/altikva/spero && cd spero
uv sync --locked --all-extras
uv run pytest                      # run the suite

License

Spero is released under the ALTIKVA Dual License v1.0 (MIT and CC BY-NC-SA 4.0), SPDX MIT AND CC-BY-NC-SA-4.0. See LICENSE.

