feat: watchdog for stuck triggers + daemon health check #73

Open
talos-adapt2move wants to merge 1 commit into mxzinke:master from stefan-adapt2move:feat/watchdog-and-health-checks
@talos-adapt2move

Summary

Production-hardening scripts for any Atlas deployment:

  • watchdog-triggers.sh: Detects trigger-runner processes that have been running longer than MAX_AGE_MIN (default 30 minutes), kills them (children first, with SIGKILL as a fallback), cleans up orphan lock files, and re-fires the trigger for automatic recovery
  • check-daemon-health.sh: Generic supervisord service health checker. Restarts failed services and tracks consecutive failures against a configurable alert threshold
  • Adds watchdog to default crontab (every 5 minutes)
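The watchdog steps listed above (age check, child-first kill with a SIGKILL fallback, orphan lock cleanup) could be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function names, the `GRACE_S` variable, and the assumption that a lock file contains a single PID are all hypothetical, and `find_stuck` relies on the `etimes` column of the procps `ps` found on Linux.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the watchdog steps; not the PR's watchdog-triggers.sh.

MAX_AGE_MIN="${MAX_AGE_MIN:-30}"   # same default the PR describes
GRACE_S="${GRACE_S:-5}"            # assumed grace period before SIGKILL

# Succeed if a process that has run for $1 seconds exceeds a limit of $2 minutes.
is_stuck() {
  [ "$1" -gt $(( $2 * 60 )) ]
}

# Print the PIDs of processes whose command line contains $1 and that are stuck.
# Relies on the `etimes` (elapsed seconds) column of procps `ps` on Linux.
find_stuck() {
  ps -eo pid=,etimes=,args= | while read -r pid etimes args; do
    case "$args" in
      *"$1"*) if is_stuck "$etimes" "$MAX_AGE_MIN"; then echo "$pid"; fi ;;
    esac
  done
}

# Terminate children first, then the process itself; SIGKILL if SIGTERM is ignored.
kill_tree() {
  local pid="$1" child
  for child in $(pgrep -P "$pid"); do
    kill_tree "$child"
  done
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  sleep "$GRACE_S"
  if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid" 2>/dev/null
  fi
  return 0
}

# Remove lock file $1 if the PID stored in it no longer exists.
# Assumes the lock file holds a single PID, which the PR does not state.
# kill -0 also fails for processes owned by another user, so run as the owner.
clean_orphan_lock() {
  local lock="$1" owner
  owner="$(cat "$lock" 2>/dev/null)" || return 0
  if [ -n "$owner" ] && ! kill -0 "$owner" 2>/dev/null; then
    rm -f "$lock"
  fi
}
```

A caller would loop over `find_stuck trigger-runner`, pass each PID to `kill_tree`, and then run `clean_orphan_lock` on the matching lock file. For the schedule, "every 5 minutes" corresponds to the cron expression `*/5 * * * *`.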

Why this matters

Trigger sessions can hang indefinitely because of API timeouts, SDK issues, or network problems. A hung session blocks all subsequent messages on that channel until someone intervenes manually. This watchdog detects the hang and recovers from it automatically.

Configuration

  • MAX_AGE_MIN env var (default: 30 minutes)
  • WATCHDOG_LOG env var (default: /atlas/logs/watchdog-triggers.log)
  • check-daemon-health.sh takes positional arguments: <service-name> [alert-threshold]
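As a rough illustration of how the health check's positional arguments and consecutive-failure tracking might fit together, here is a sketch under stated assumptions. It is not the script from this PR: the function names, the `STATE_DIR` variable, the counter file layout, and the default threshold of 3 are hypothetical. It uses only `supervisorctl status <name>` and `supervisorctl restart <name>`, and assumes the state (for example `RUNNING` or `FATAL`) is the second column of the status line.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the PR's check-daemon-health.sh.

# Extract the state column (RUNNING, FATAL, ...) from a `supervisorctl status` line.
service_state() {
  printf '%s\n' "$1" | awk '{ print $2 }'
}

# Update the consecutive-failure counter in file $1 given state $2.
# Resets to 0 on RUNNING, otherwise increments; prints the new count.
record_result() {
  local file="$1" state="$2" count=0
  [ -f "$file" ] && count="$(cat "$file")"
  if [ "$state" = "RUNNING" ]; then
    count=0
  else
    count=$(( count + 1 ))
  fi
  echo "$count" > "$file"
  echo "$count"
}

# Usage: check_service <service-name> [alert-threshold]
check_service() {
  local name="$1" threshold="${2:-3}"   # default of 3 is an assumption
  local counter="${STATE_DIR:-/tmp}/health-$name.count"
  local state failures
  state="$(service_state "$(supervisorctl status "$name")")"
  failures="$(record_result "$counter" "$state")"
  if [ "$state" != "RUNNING" ]; then
    supervisorctl restart "$name"
    if [ "$failures" -ge "$threshold" ]; then
      echo "ALERT: $name failed $failures consecutive checks" >&2
    fi
  fi
}
```

Keeping the counter in a file lets the count survive between cron invocations, so an alert fires only after the service has been found unhealthy on several runs in a row rather than on a single transient failure.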

🤖 Generated with Claude Code

- watchdog-triggers.sh: Detects trigger-runner processes running longer
  than MAX_AGE_MIN (default 30min), kills them and their child process
  trees, cleans up orphan lock files, and re-fires the trigger for
  automatic recovery.

- check-daemon-health.sh: Generic supervisord service health checker.
  Restarts failed services and tracks consecutive failures with a
  configurable alert threshold.

- Adds watchdog to default crontab (every 5 minutes).

These address a real production need: trigger sessions can hang
indefinitely due to API timeouts or SDK issues, blocking all
subsequent messages on that channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>