monitoring is the single point of execution for the monitoring scripts on
the VPS. If it goes down, the scheduled checks do not run and alerts go silent.
This runbook covers the operations on-call needs to recover from common
failures.
All command examples assume you are SSH'd into the VPS as the deploy user (the
same account the systemd unit runs as), with the repository checked out at
/srv/monitoring.
No Docker or compose. One systemd unit runs supercronic, which executes
python -m automation run <profile> on each cron tick per automation/jobs.yaml.
An optional monitoring-api unit exposes persisted alert history on localhost;
public authentication and rate limiting belong at the reverse proxy.
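
To make the scheduling model concrete, here is a minimal sketch of how a jobs.yaml-style schedule could be turned into the crontab supercronic runs. The dictionary shape (a cron string plus an optional enabled flag per profile), the render_crontab name, and the lock file location are assumptions made for illustration, not the project's actual schema or code. Only the flock -n wrapper and the python -m automation run <profile> command come from this runbook.

```python
from typing import Any, Mapping


def render_crontab(jobs: Mapping[str, Mapping[str, Any]], lock_dir: str = "/tmp") -> str:
    """Render one crontab line per enabled profile (hypothetical schema)."""
    lines = []
    for name, profile in jobs.items():
        if not profile.get("enabled", True):
            continue  # a disabled profile gets no line in the crontab
        cron = str(profile["cron"])
        # Assumption: this sketch accepts only classic 5-field expressions.
        if len(cron.split()) != 5:
            # Fail loudly so a malformed schedule aborts the render.
            raise ValueError(f"profile {name!r}: expected 5 cron fields, got {cron!r}")
        # Hypothetical lock path; flock -n makes an overlapping tick exit at once.
        lines.append(
            f"{cron} flock -n {lock_dir}/{name}.lock python -m automation run {name}"
        )
    return "\n".join(lines) + "\n"
```

A profile that is missing from the rendered output never runs, which is why checking the rendered crontab is the first step when a profile appears to be silent.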
systemctl status monitoring
systemctl status monitoring-api

See the schedule supercronic is actually running (rendered from jobs.yaml):
cd /srv/monitoring
uv run python -m automation list # profiles + cron + task counts
uv run python -m automation render-crontab # the exact crontab supercronic runs

supercronic logs every job start and exit to journald (see below); that's the record of what ran and when.
journalctl -u monitoring -f --since '15 min ago'
journalctl -u monitoring-api -f

supercronic prints a structured line per job (channel=stdout, job.position,
exit status). Filter to a single profile run by grepping for its flock-wrapped command:
journalctl -u monitoring --since '1 hour ago' | grep 'automation run hourly'

The Python tasks themselves log at LOG_LEVEL (default INFO). A profile with
failing tasks also posts one Telegram digest (see Failure handling below), so
you usually learn about a failure there before reading journald.
Running a profile by hand is useful when a tick was skipped, or to validate a
hand-edited config. It replaces workflow_dispatch from the old GitHub Actions setup.
cd /srv/monitoring
# Print what would run without executing:
uv run python -m automation run hourly --dry-run
# Actually run it (sends real alerts / Telegram on failure):
uv run python -m automation run hourly

Available profiles: hourly, daily, multisig (see automation/jobs.yaml).
Contributors merge to main through PRs; nobody normally SSHes in to ship code.
Two things move main onto the box, split on "does the change need the
scheduler restarted to take effect?"
Auto-sync (no restart, the common case). The multisig profile fetches
origin main and runs git reset --hard origin/main before its tasks every 10
minutes (sync_before_run: true in jobs.yaml). Since supercronic re-spawns
every profile fresh against the on-disk tree, that one sync keeps the whole
checkout current for all profiles within ~10 minutes — every other profile
rides along for free. Tracked local edits are intentionally discarded in favor
of reviewed main, so the VPS does not get stuck on a divergent local branch or
operator hotfix. The sync is best-effort: a failure is logged (journalctl -u monitoring | grep 'pre-run git sync') and the multisig checks run against the
existing checkout anyway, so a git hiccup never silences an alert. This covers
scripts, config modules, data, and jobs.yaml task bodies — anything read
fresh by a subprocess.
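
The best-effort behaviour described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function names, the injectable runner parameter, and the 60-second timeout are hypothetical. The two git commands and the 'pre-run git sync' log text are taken from this runbook.

```python
import logging
import subprocess
from typing import Callable, List

log = logging.getLogger("automation.sync")

SYNC_COMMANDS = (
    ("git", "fetch", "origin", "main"),
    ("git", "reset", "--hard", "origin/main"),
)


def pre_run_git_sync(repo_dir: str, runner: Callable[..., object] = subprocess.run) -> bool:
    """Try to move the checkout to origin/main. Never raises."""
    for cmd in SYNC_COMMANDS:
        try:
            runner(list(cmd), cwd=repo_dir, check=True, timeout=60)
        except (subprocess.SubprocessError, OSError) as exc:
            # Log and give up on the sync; the caller continues regardless.
            log.warning("pre-run git sync failed (%s): %s", " ".join(cmd), exc)
            return False
    return True


def run_profile(
    repo_dir: str,
    tasks: List[Callable[[], str]],
    sync_before_run: bool,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run every task, syncing first if asked; a failed sync never blocks tasks."""
    if sync_before_run:
        pre_run_git_sync(repo_dir, runner)  # result deliberately ignored
    return [task() for task in tasks]
```

The design point is that the sync result is never used to gate the tasks: a failed fetch or reset costs freshness, not coverage.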
So for the common case — a script tweak, a new task in a profile — merge the PR and the next ~10-min multisig tick syncs it in; no SSH needed.
Manual restart (cadence + deps). Two kinds of change land on disk via the auto-sync but stay inert until a restart, because they're read once at scheduler boot, not per-tick:
- a cron: cadence change in jobs.yaml (the crontab is rendered at unit start by ExecStartPre), and
- a pyproject.toml / uv.lock change (the venv).
For those, after the PR merges:
cd /srv/monitoring
git fetch origin main && git reset --hard origin/main # or let the next multisig tick land it
uv sync --frozen --extra ai # only if pyproject.toml / uv.lock changed (--extra ai: openai client for the AI explainer)
sudo systemctl restart monitoring # re-renders the crontab and re-points supercronic at the tree

The restart re-renders the crontab and points supercronic at the freshly-pulled tree. Total downtime is a few seconds and only affects the scheduler, not an in-flight job.
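
The "does this change need a restart?" rule can be expressed as a small check. The sketch below assumes you already have the list of changed paths (for example from git diff --name-only) and the per-profile cron strings before and after the merge. The restart_reasons function and its argument shapes are hypothetical helpers for illustration, not part of the repository.

```python
from typing import Iterable, List, Mapping

# Files whose changes only take effect after uv sync plus a restart.
DEP_FILES = {"pyproject.toml", "uv.lock"}


def restart_reasons(
    changed_paths: Iterable[str],
    old_cron: Mapping[str, str],
    new_cron: Mapping[str, str],
) -> List[str]:
    """Return the reasons a manual restart is needed; empty means auto-sync suffices."""
    reasons = []
    if DEP_FILES & set(changed_paths):
        reasons.append("dependencies changed: run uv sync --frozen, then restart")
    # A profile counts as changed if its cron string differs, or it was added or removed.
    changed = sorted(
        name
        for name in set(old_cron) | set(new_cron)
        if old_cron.get(name) != new_cron.get(name)
    )
    if changed:
        reasons.append("cron cadence changed for: " + ", ".join(changed))
    return reasons
```

An empty result corresponds to the common case: the next multisig tick lands the change and nothing else is required.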
Secrets live in /etc/monitoring/.env (mode 0640, root:),
loaded by the unit via EnvironmentFile. They are not in git. Edit in place
and restart:
sudo $EDITOR /etc/monitoring/.env
sudo systemctl restart monitoring

The restart re-reads the env and re-renders the crontab. Anyone who could read the file has seen the old values, so a leaked credential should be rotated at the provider, not just edited here.
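
After editing the env file it can be worth confirming that it still exists with the expected permissions. The following is a minimal sketch of such a check; env_file_problems is a hypothetical helper, and it only inspects the mode and the owning user because the group is site-specific.

```python
import os
import stat
from typing import List


def env_file_problems(path: str, expected_mode: int = 0o640) -> List[str]:
    """Return human-readable problems with the env file; empty means it looks fine."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path} is missing"]
    problems = []
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if mode != expected_mode:
        problems.append(f"{path} has mode {oct(mode)}, expected {oct(expected_mode)}")
    if st.st_uid != 0:
        problems.append(f"{path} is owned by uid {st.st_uid}, expected root (uid 0)")
    return problems
```

Calling env_file_problems("/etc/monitoring/.env") on the VPS returns an empty list when the file is present, owned by root, and set to mode 0640.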
To validate the whole fleet without spamming production chats — e.g. comparing
this VPS's output against the old GitHub Actions runs — set
TELEGRAM_TEST_CHAT_ID in the env. While it is set, every alert from every
protocol is sent to that single chat via the default bot, prefixed with a
[protocol] label and with no topic threading, so production routing
(TELEGRAM_TOPIC_ID_* / per-protocol chats) is bypassed entirely:
sudo $EDITOR /etc/monitoring/.env
# TELEGRAM_TEST_CHAT_ID=-1001234567890 # the dummy group
# (the default bot, TELEGRAM_BOT_TOKEN_DEFAULT, must be a member of it)
sudo systemctl restart monitoring

Keep LOG_LEVEL=INFO (the default): LOG_LEVEL=DEBUG skips all Telegram sends,
so nothing would arrive. Comment the line out and restart to restore normal
per-protocol routing.
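
The routing rules above can be summarised in a few lines. This is a sketch under assumptions, not the project's alerting code: the Route type and resolve_route function are hypothetical, and so is the TELEGRAM_CHAT_ID_<PROTOCOL> variable name used for per-protocol chats. TELEGRAM_TEST_CHAT_ID, the TELEGRAM_TOPIC_ID_ prefix, and the LOG_LEVEL behaviour are taken from this runbook.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Route:
    chat_id: str
    topic_id: Optional[str]
    text: str


def resolve_route(protocol: str, text: str, env: Mapping[str, str]) -> Optional[Route]:
    """Decide where an alert goes; None means nothing is sent."""
    if env.get("LOG_LEVEL", "INFO").upper() == "DEBUG":
        return None  # DEBUG skips all Telegram sends
    test_chat = env.get("TELEGRAM_TEST_CHAT_ID", "").strip()
    if test_chat:
        # Test mode: one chat, no topic threading, message labelled by protocol.
        return Route(test_chat, None, f"[{protocol}] {text}")
    key = protocol.upper()
    chat = env.get(f"TELEGRAM_CHAT_ID_{key}")  # hypothetical variable name
    if chat is None:
        return None
    return Route(chat, env.get(f"TELEGRAM_TOPIC_ID_{key}"), text)
```

Because the test chat is checked before any per-protocol lookup, setting it overrides production routing for every protocol at once, and unsetting it restores the normal path without further changes.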
Single-node failover (cattle, not pets):
- Provision a fresh VPS: sudo bash deploy/install.sh (installs uv/Python/supercronic, clones the repo, creates the venv, installs the systemd unit). It can also be curled; see the header of install.sh.
- Drop the production env at /etc/monitoring/.env (mode 0640, root:). Copy it from the old host or recreate it from .env.example.
- Start it:
    sudo systemctl enable --now monitoring
    systemctl status monitoring
- Confirm the schedule and the first ticks:
    cd /srv/monitoring && uv run python -m automation render-crontab
    journalctl -u monitoring -f
- Stop the old host so alerts aren't sent twice:
    sudo systemctl disable --now monitoring
systemd restarts the unit on crash (Restart=on-failure, capped at 5
restarts/60s). There is no HTTP healthcheck (and no daemon to wedge): supercronic
is a foreground process whose death systemd observes directly. If supercronic is
running, the crontab is being ticked. If a single job hangs it blocks only its
own profile (each is wrapped in flock -n, so the next tick of that profile is
skipped, not queued) — the others keep ticking.
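
The "skipped, not queued" behaviour of flock -n can be reproduced in Python with a non-blocking exclusive lock. The sketch below is only an illustration of the semantics on a Unix host. The run_exclusive name and the lock path are hypothetical, and the real jobs use the flock command-line tool rather than this code.

```python
import fcntl
import os
from typing import Callable, Optional


def run_exclusive(lock_path: str, job: Callable[[], str]) -> Optional[str]:
    """Run job while holding an exclusive lock; return None if the lock is busy."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes the attempt fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # another run holds the lock: skip this tick
        return job()
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

A tick that arrives while a previous run of the same profile is still holding the lock returns immediately, so hung runs do not pile up. Locks for different profiles use different files and never contend with each other.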
The alerts API is a separate service. It reads the SQLite alert database under
/srv/cache and binds to 127.0.0.1:8923 by default, so stopping it does not
stop scheduled monitoring. See alerts-api.md for endpoint
examples and response shapes.
sudo systemctl enable --now monitoring-api
systemctl status monitoring-api
journalctl -u monitoring-api -f
curl http://127.0.0.1:8923/healthz
curl 'http://127.0.0.1:8923/v1/alerts?limit=10&source=protocol'

Forward public traffic through a reverse proxy to 127.0.0.1:8923. The proxy
must require bearer token or basic auth, apply rate limiting, and set request
and response timeouts. Do not expose /srv/cache or the SQLite database file
directly.
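
For scripted checks on the VPS, the same requests as the curl examples can be made from Python's standard library. The helper names below are hypothetical. The endpoint path and the limit and source parameters are the ones shown above, and the response is treated as opaque JSON because its shape is documented in alerts-api.md rather than here.

```python
import json
from typing import Any, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8923"  # default local bind of the alerts API


def alerts_url(limit: int = 10, source: Optional[str] = None, base_url: str = BASE_URL) -> str:
    """Build the alerts query URL, omitting the source filter when it is not given."""
    params = {"limit": limit}
    if source is not None:
        params["source"] = source
    return f"{base_url}/v1/alerts?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 5.0) -> Any:
    """GET a URL and decode the JSON body; raises on network or HTTP errors."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, fetch_json(alerts_url(10, "protocol")) issues the same request as the second curl command above.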
Alert history and migrated monitor state live in /srv/cache/monitoring.db.
SQLite may also create /srv/cache/monitoring.db-wal and
/srv/cache/monitoring.db-shm; keep all three files local to the VPS and owned
by the deploy user. deploy/install.sh installs the sqlite3 CLI, creates the
database schema, and imports existing text cache files into the monitor_state
table.
For the current live host, freeze writes briefly and run the migration once:
cd /srv/monitoring
git pull --ff-only
uv sync --frozen --extra ai
sudo systemctl stop monitoring
sudo REPO_DIR=/srv/monitoring CACHE_DIR=/srv/cache ./deploy/migrate-file-cache-to-db.sh
sudo systemctl daemon-reload
sudo systemctl start monitoring
sudo systemctl enable --now monitoring-api

The migration imports known cache files from automation/jobs.yaml, including
cache-id.txt, cache-id-daily.txt, and nonces.txt. It preserves existing
SQLite values by default, so rerunning it is safe. Use --overwrite only when
the legacy text files are known to be the source of truth.
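
One way to get "preserve by default, replace only with --overwrite" semantics in SQLite is to switch between INSERT OR IGNORE and INSERT OR REPLACE. The sketch below illustrates that idea only. The primary key on (namespace, key), the column types, and the import_state function are assumptions, and the real migration script may work differently.

```python
import sqlite3
from typing import Iterable, Tuple

# Assumed schema: only the table and column names come from this runbook.
SCHEMA = """
CREATE TABLE IF NOT EXISTS monitor_state (
    namespace TEXT NOT NULL,
    key       TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (namespace, key)
)
"""


def import_state(
    conn: sqlite3.Connection,
    rows: Iterable[Tuple[str, str, str]],
    overwrite: bool = False,
) -> None:
    """Import (namespace, key, value) rows; keep existing values unless overwrite."""
    verb = "INSERT OR REPLACE" if overwrite else "INSERT OR IGNORE"
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            f"{verb} INTO monitor_state (namespace, key, value) VALUES (?, ?, ?)",
            list(rows),
        )
```

With the default mode a second import of the same key is a no-op, which is what makes a rerun safe. With overwrite the imported value wins.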
Inspect migrated state with:
sqlite3 /srv/cache/monitoring.db \
'select namespace, key, value from monitor_state order by namespace, key limit 20;'

If rollback is needed, set CACHE_BACKEND=file in /etc/monitoring/.env and
restart monitoring. That makes utils.cache use the legacy text files again.
Remove the variable and restart to return to SQLite-backed cache state.
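
The rollback switch amounts to a single environment lookup. The sketch below shows one plausible shape for it. The function names, the treatment of any value other than file as SQLite, and the fallback for CACHE_DIR are assumptions, not a description of utils.cache.

```python
from pathlib import Path
from typing import Mapping


def cache_backend(env: Mapping[str, str]) -> str:
    """Return "file" only when CACHE_BACKEND=file is set; otherwise "sqlite"."""
    return "file" if env.get("CACHE_BACKEND", "").strip().lower() == "file" else "sqlite"


def cache_path(basename: str, env: Mapping[str, str]) -> Path:
    """Resolve a cache file name against CACHE_DIR (fallback value is an assumption)."""
    return Path(env.get("CACHE_DIR", "/srv/cache")) / basename
```

Because the selection is read from the environment, it only changes when the unit is restarted with a new env file, which matches the edit-and-restart procedure above.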
| Symptom | First thing to check | Likely cause |
|---|---|---|
| Active: failed on start | journalctl -u monitoring -n 50 | Malformed automation/jobs.yaml (render-crontab aborts the start), or /etc/monitoring/.env missing (the unit refuses to start without it). |
| Telegram suddenly silent | Is TELEGRAM_BOT_TOKEN_DEFAULT valid? LOG_LEVEL=DEBUG skips sends. | Bot revoked, chat removed bot, or LOG_LEVEL left at DEBUG. |
| One profile never runs | uv run python -m automation render-crontab: is its line present? | Profile/task enabled: false in jobs.yaml, or its flock lock is stuck held by a hung run (restart clears it). |
| ModuleNotFoundError after a deploy | journalctl -u monitoring -n 50 | Forgot uv sync --frozen after a pyproject.toml/uv.lock change. |
| Cache/dedupe acting up | ls -l /srv/cache | Wrong perms (must be writable by the runner user) or a corrupt cache file (safe to delete; it re-seeds). |
- Source tree: /srv/monitoring (owned by the deploy user).
- Python venv: /srv/monitoring/.venv (created by uv sync).
- Cache / dedupe state: /srv/cache (owned by the deploy user; the unit grants it via ReadWritePaths and sets CACHE_DIR=/srv/cache, which utils.cache resolves every cache file against). A profile only overrides a cache basename in automation/jobs.yaml when it needs an isolated file (e.g. daily).
- SQLite database: /srv/cache/monitoring.db plus WAL/shm sidecars.
- Env file: /etc/monitoring/.env (mode 0640, root:; operator-supplied, not in git).
- systemd units: /etc/systemd/system/monitoring.service and /etc/systemd/system/monitoring-api.service.
- Rendered crontab: /tmp/crontab (per-service PrivateTmp; regenerated on every start).
- Code sync: no separate unit. The multisig profile's pre-run git fetch origin main + git reset --hard origin/main (sync_before_run in jobs.yaml) runs every 10 min. The unit grants the repo write access via ReadWritePaths=/srv/cache /srv/monitoring so the sync can update .git/ under ProtectSystem=strict.