feat: watchdog for stuck triggers + daemon health check by talos-adapt2move · Pull Request #73 · mxzinke/atlas

talos-adapt2move · 2026-03-28T08:03:04Z

Summary

Production-hardening scripts for any Atlas deployment:

watchdog-triggers.sh — Detects trigger-runner processes stuck for >30min, kills them (children first, then SIGKILL fallback), cleans orphan lock files, and re-fires the trigger for automatic recovery
check-daemon-health.sh — Generic supervisord service health checker. Restarts failed services, tracks consecutive failures with configurable alert threshold
Adds watchdog to default crontab (every 5 minutes)
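
The kill sequence described above (children first, then a SIGKILL fallback) could be sketched roughly as follows. This is not the PR's actual `watchdog-triggers.sh`: the function names `is_stuck` and `kill_tree` and the grace period are hypothetical, chosen only for illustration. On procps-based Linux systems the elapsed time would likely come from something like `ps -o etimes= -p PID`, which is also an assumption about the target platform.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the PR's actual watchdog-triggers.sh.
MAX_AGE_MIN="${MAX_AGE_MIN:-30}"   # default taken from the PR description

# Succeeds when the elapsed time (seconds) exceeds the limit (minutes).
is_stuck() {
  local elapsed_sec="$1" max_age_min="$2"
  [ "$elapsed_sec" -gt $(( max_age_min * 60 )) ]
}

# Terminate the children of a process first, then the process itself.
# If it is still present after the grace period, fall back to SIGKILL.
kill_tree() {
  local pid="$1" grace="${2:-5}" child waited=0
  for child in $(pgrep -P "$pid"); do
    kill_tree "$child" "$grace"
  done
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  while [ "$waited" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited after SIGTERM
    sleep 1
    waited=$(( waited + 1 ))
  done
  kill -KILL "$pid" 2>/dev/null || true       # SIGKILL fallback
}
```

A caller would presumably loop over the trigger-runner PIDs, test each with `is_stuck "$elapsed" "$MAX_AGE_MIN"`, and pass the stuck ones to `kill_tree` before cleaning up lock files and re-firing the trigger.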

Why this matters

Trigger sessions can hang indefinitely due to API timeouts, SDK issues, or network problems — blocking all subsequent messages on that channel until manual intervention. This watchdog provides automatic detection and recovery.

Configuration

MAX_AGE_MIN env var (default: 30 minutes)
WATCHDOG_LOG env var (default: /atlas/logs/watchdog-triggers.log)
check-daemon-health.sh <service-name> [alert-threshold]
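
As an illustration of how consecutive-failure tracking with an alert threshold might work, here is a minimal sketch. It is not the PR's `check-daemon-health.sh`: the `STATE_DIR` location, the function names, the default threshold of 3, and the alert message are all assumptions. Only `supervisorctl status` and `supervisorctl restart` are real supervisord commands.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the PR's actual check-daemon-health.sh.
STATE_DIR="${STATE_DIR:-/tmp/daemon-health}"   # assumed location for counters

# Update the consecutive-failure counter for a service and print the new value.
# A healthy check resets the counter to 0.
record_check() {
  local service="$1" healthy="$2" file count
  mkdir -p "$STATE_DIR"
  file="$STATE_DIR/$service.failures"
  count=$(cat "$file" 2>/dev/null || echo 0)
  if [ "$healthy" = "yes" ]; then
    count=0
  else
    count=$(( count + 1 ))
  fi
  echo "$count" > "$file"
  echo "$count"
}

# Check one supervisord service; restart it on failure and alert once the
# number of consecutive failures reaches the threshold (assumed default: 3).
check_service() {
  local service="$1" threshold="${2:-3}" failures
  if supervisorctl status "$service" 2>/dev/null | grep -q RUNNING; then
    record_check "$service" yes >/dev/null
    return 0
  fi
  supervisorctl restart "$service" >/dev/null 2>&1 || true
  failures=$(record_check "$service" no)
  if [ "$failures" -ge "$threshold" ]; then
    echo "ALERT: $service failed $failures consecutive checks" >&2
  fi
  return 1
}
```

Keeping the counter in a per-service file lets it survive between cron runs, so the threshold counts failures across invocations rather than within a single one.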

🤖 Generated with Claude Code

- watchdog-triggers.sh: Detects trigger-runner processes running longer than MAX_AGE_MIN (default 30min), kills them and their child process trees, cleans up orphan lock files, and re-fires the trigger for automatic recovery.
- check-daemon-health.sh: Generic supervisord service health checker. Restarts failed services and tracks consecutive failures with configurable alert threshold.
- Adds watchdog to default crontab (every 5 minutes).

These address a real production need: trigger sessions can hang indefinitely due to API timeouts or SDK issues, blocking all subsequent messages on that channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

talos-adapt2move wants to merge 1 commit into mxzinke:master from stefan-adapt2move:feat/watchdog-and-health-checks
