Sentinel is an autonomous Site Reliability Engineering (SRE) agent that continuously watches system health, uses AI to understand what went wrong, applies safety checks, and fixes issues automatically when safe or with human approval when needed.
┌─────────────────┐    ┌──────────────┐    ┌─────────────┐
│    cAdvisor     │    │ Node Exporter│    │ Prometheus  │
│    (Docker)     │───▶│   (System)   │───▶│ (Time-series│
│  - Containers   │    │  - CPU       │    │  Database)  │
│  - CPU/Memory   │    │  - Memory    │    │  - Queries  │
└─────────────────┘    └──────────────┘    └─────────────┘
         │                     │                  │
         ▼                     ▼                  ▼
┌───────────────────────────────────────────────────────────────┐
│                   Sentinel Monitoring Layer                   │
│  - Health checks with REAL thresholds                         │
│  - Alert detection from actual metrics                        │
│  - Latency calculation: 25ms base + 2ms/MB/s + 100ms/error    │
└───────────────────────────────────────────────────────────────┘
- Memory: Triggers at 60% usage
- CPU: Triggers at 80% usage
- Disk: Triggers at 95% usage
- Network Latency: Triggers at 200ms
- Container Status: Any container down = Critical
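The thresholds above, together with the documented latency formula (25ms base + 2ms per MB/s + 100ms per error/sec), could be evaluated roughly as follows. This is a minimal sketch, not Sentinel's actual code: the `Metrics` shape, the function names, the alert labels, and the use of `>=` for "triggers at" are all assumptions made for illustration.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not Sentinel's actual code.
interface Metrics {
  memoryPct: number;      // memory usage, 0-100
  cpuPct: number;         // CPU usage, 0-100
  diskPct: number;        // disk usage, 0-100
  networkMBps: number;    // network throughput in MB/s
  networkErrors: number;  // network errors per second
  containersDown: number; // number of containers that are not running
}

// Latency estimate as documented: 25ms base + 2ms per MB/s + 100ms per error/sec.
function estimateLatencyMs(m: Metrics): number {
  return 25 + m.networkMBps * 2 + m.networkErrors * 100;
}

// Returns a label for every threshold that is currently breached.
// Assumption: "triggers at N" is treated as "value >= N".
function evaluateThresholds(m: Metrics): string[] {
  const alerts: string[] = [];
  if (m.memoryPct >= 60) alerts.push("memory");
  if (m.cpuPct >= 80) alerts.push("cpu");
  if (m.diskPct >= 95) alerts.push("disk");
  if (estimateLatencyMs(m) >= 200) alerts.push("network_latency");
  if (m.containersDown > 0) alerts.push("container_down");
  return alerts;
}
```

For example, 72% memory usage with 10 MB/s of traffic and 2 errors/sec gives an estimated latency of 25 + 20 + 200 = 245ms, so both the memory and the latency thresholds are breached.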
// Real latency calculation based on network activity
const baseLatency = 25 // Base network latency
const loadLatency = networkMBps * 2 // +2ms per MB/s traffic
const errorLatency = networkErrors * 100 // +100ms per error/sec
const simulatedLatency = baseLatency + loadLatency + errorLatency

┌───────────────┐    ┌──────────────┐    ┌─────────────┐
│ OpenTelemetry │───▶│    Jaeger    │    │   Grafana   │
│   (Tracing)   │    │ (UI & Query) │    │ (Dashboards)│
└───────────────┘    └──────────────┘    └─────────────┘
        │                    │                  │
        ▼                    ▼                  ▼
┌───────────────────────────────────────────────────────────┐
│                   Production Monitoring                   │
│  - Distributed request tracing across all steps           │
│  - Visual workflow analysis in Jaeger UI                  │
│  - Custom dashboards for infrastructure metrics           │
└───────────────────────────────────────────────────────────┘
Jaeger Features:
- Distributed request tracing across all Motia steps
- Visual workflow analysis and bottleneck identification
- Service dependency mapping
- Performance metrics and latency analysis
Grafana Features:
- Real-time infrastructure dashboards
- Prometheus metric visualization
- Custom alerts and notifications
- Historical trend analysis
┌─────────────────────────────────────────────────────────────────┐
│                         Security Layer                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │      Auth       │  │     Sandbox     │  │    Rollback     │  │
│  │      RBAC       │  │     Safety      │  │    Snapshots    │  │
│  │  - Admin        │  │  - Isolated     │  │  - Before       │  │
│  │  - Operator     │  │ - /Motia/Sandbox│  │  - After Fix    │  │
│  │  - Viewer       │  │  - Real Exec    │  │  - Procedures   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
- API Key Authentication with role-based access control
- Admin Role: Full access (approve, trigger, rollback, sandbox cleanup)
- Operator Role: View and approve incidents
- Viewer Role: Read-only access
- Isolated Directory: `/home/sidharth/Desktop/Motia/Sandbox` only
- Real Disk Cleanup: Actually removes files from safe directories
- Preserved Data: Never touches the `data/` subdirectory
- Threshold Monitoring: Alerts at 100MB sandbox size
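One way to enforce the sandbox rules above is to resolve every candidate path and refuse anything outside the sandbox root or inside `data/`. The sketch below assumes a POSIX-style filesystem; `isSafeToDelete` and `SANDBOX_ROOT` are hypothetical names, and Sentinel's real cleanup step may work differently.

```typescript
import * as path from "node:path";

// Hypothetical helper: the function name and constant are illustrative.
const SANDBOX_ROOT = "/home/sidharth/Desktop/Motia/Sandbox";

// Returns true only for paths strictly inside the sandbox root
// and outside its preserved data/ subdirectory.
function isSafeToDelete(target: string, root: string = SANDBOX_ROOT): boolean {
  const resolvedRoot = path.resolve(root);
  const resolved = path.resolve(resolvedRoot, target);
  const rel = path.relative(resolvedRoot, resolved);

  // An empty relative path is the root itself; a ".." prefix or an
  // absolute result means the target escaped the sandbox.
  if (rel === "" || rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    return false;
  }

  // Never touch the preserved data/ subdirectory.
  return rel.split(path.sep)[0] !== "data";
}
```

Resolving before comparing matters here: a plain string-prefix check would accept a path such as `logs/../../outside.txt`, which escapes the sandbox once it is normalized.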
// Captures "before" state before applying any fix
const snapshot = {
  incidentId,
  beforeState: { /* captured metrics */ },
  fixApplied: "memory_limit_adjustment",
  rollbackProcedure: [ /* steps */ ]
}

- Motia States Plugin with TTL and caching
- Incident Lifecycle Tracking: Complete audit trail
- Rollback History: Success/failure rates and procedures
- Pending Approvals: Real-time status tracking
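To illustrate how a "before" snapshot can drive a rollback decision, the sketch below compares the captured metrics with the metrics observed after the fix. The `Snapshot` typing, the `shouldRollback` name, and the rule that higher values are worse are assumptions for illustration; the source does not specify how Sentinel detects degradation.

```typescript
// Hypothetical types: the field types here are assumed, not taken from Sentinel.
interface Snapshot {
  incidentId: string;
  beforeState: Record<string, number>; // metric name -> value before the fix
  fixApplied: string;
  rollbackProcedure: string[];
}

// Returns true if any tracked metric got worse by more than `tolerance`.
// Assumption: higher values are worse (usage percentages, latency).
function shouldRollback(
  snapshot: Snapshot,
  afterState: Record<string, number>,
  tolerance = 0,
): boolean {
  return Object.entries(snapshot.beforeState).some(([metric, before]) => {
    const after = afterState[metric];
    // Metrics missing from afterState are ignored rather than treated as failures.
    return after !== undefined && after > before + tolerance;
  });
}
```

A fix that drops memory usage from 72% to 55% would pass this check, while one that raises it to 80% would trigger the stored rollback procedure.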
# Real AI analysis using the OpenRouter API
# Excerpt: `client` and the alert fields (alert_type, severity, metric, value,
# threshold, resource) are defined elsewhere in the step and extracted from `alert`.
async def handler(alert, context):
    analysis = await client.analyze_alert(
        alert_type=alert_type,
        severity=severity,
        metric=metric,
        current_value=value,
        threshold=threshold,
        affected_resource=resource,
        system_context=context
    )
    return {
        "root_cause": analysis.root_cause,
        "proposed_fix": analysis.recommended_action,
        "risk_level": analysis.risk_assessment,
        "confidence": analysis.confidence_score
    }

AI Analysis Includes:
- Root Cause Identification: DeepSeek R1 reasoning
- Risk Assessment: Low/Medium/High classification
- Proposed Fix: Specific remediation actions
Sentinel performs real actions, not simulations:
- Sandbox directory cleanup (logs, temp, cache only)
- Container memory analysis via `docker stats`
- Docker container restarts
- Slack-based approvals and execution
Example:
High memory alert → Sentinel runs `docker stats` → presents restart options → operator approves directly in Slack.
- SSE Streaming: Real-time incident updates
- Infrastructure Metrics: From Prometheus (cAdvisor + Node Exporter)
- Network Latency: Calculated live (25ms base + load + errors)
- Interactive Controls: Manual approval with auth, rollback
- Live Metrics: CPU, Memory, Disk, Network, Containers
- Incident Timeline: Real-time status updates
- Approval Queue: Interactive human approval workflow
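For readers unfamiliar with SSE, each real-time update reaches the dashboard as a small text message. The sketch below shows only the standard Server-Sent Events wire format, independent of how Motia Streams deliver it; the `IncidentUpdate` shape and the event name are hypothetical.

```typescript
// Hypothetical event shape: the field names are illustrative.
interface IncidentUpdate {
  incidentId: string;
  status: string;
}

// Serialize one update in the Server-Sent Events wire format:
// an "event:" line, a "data:" line, and a blank line that ends the message.
function toSseMessage(eventName: string, update: IncidentUpdate): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(update)}\n\n`;
}
```

In the browser, an `EventSource` listener registered for the same event name receives the JSON string in `event.data`.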
| # | Motia Feature | Sentinel Implementation | Real Production Use |
|---|---|---|---|
| 1 | Cron Steps | `src/monitoring/infrastructure-monitor.step.ts` | Scheduled monitoring every 2 minutes |
| 2 | API Steps | `src/api/` (8+ endpoints) | REST API, webhooks, dashboard |
| 3 | Event Steps | `src/ai/`, `src/policy/`, `src/execution/` | Background processing workflows |
| 4 | Python Steps | `src/ai/root_cause_analysis_step.py` | AI analysis with DeepSeek R1 |
| 5 | TypeScript Steps | All `.step.ts` files | Business logic, policy engine |
| 6 | State Management | Motia States plugin | Incident tracking, TTL, caching |
| 7 | Streams | SSE streaming | Real-time dashboard updates |
| 8 | Middlewares | `middlewares/` directory | Auth, error handling, logging |
| 9 | Virtual Steps | `src/flows/approval-gate.step.ts` | Workflow visualization |
| 10 | Flows | `sentinal-sre` complete flow | Full incident resolution pipeline |
| 11 | Conditional Emits | Policy engine routing | Risk-based smart decision making |
| 12 | Error Handling | Built-in retry mechanisms | 3x retry with exponential backoff |
| 13 | Polyglot | TypeScript + Python | Optimal language per task |
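Row 12 mentions 3x retry with exponential backoff. The generic sketch below shows what that pattern means; it is not Motia's built-in mechanism, and `withRetry`, its parameters, and the 100ms base delay are hypothetical.

```typescript
// Hypothetical helper: a generic illustration, not Motia's built-in retry.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // The delay doubles after each failure: baseDelayMs, 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

With the defaults, a task is attempted at most four times (one initial call plus three retries), waiting 100ms, 200ms, and 400ms between attempts.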
Demo link: https://youtu.be/QEWtsZIajeY

Example real-case workflow:
1. MONITORING
┌─────────────────┐
│    Cron Step    │  Every 2 minutes
│  (Monitoring)   │
└─────────────────┘
         │
         ▼
2. ALERT DETECTION
┌─────────────────┐    ┌──────────────────────────┐
│  Health Checks  │───▶│      Alert Creation      │
│  (Prometheus)   │    │    (Real Thresholds)     │
└─────────────────┘    └──────────────────────────┘
         │
         ▼
3. AI ANALYSIS
┌─────────────────┐    ┌──────────────────────────┐
│   Python Step   │───▶│       DeepSeek R1        │
│   (AI Brain)    │    │   Root Cause Analysis    │
└─────────────────┘    └──────────────────────────┘
         │
         ▼
4. RISK ASSESSMENT
┌─────────────────┐    ┌──────────────────────────┐
│  Policy Engine  │───▶│ Risk Level Classification│
│  (TypeScript)   │    │  (Low / Medium / High)   │
└─────────────────┘    └──────────────────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         │                      │                      │
         ▼                      ▼                      ▼
     LOW RISK              MEDIUM RISK             HIGH RISK
    (Auto-fix)          (Human Approval)       (Admin Required)
         │                      │                      │
         ▼                      ▼                      ▼
5. APPROVAL & TRIGGER
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Auto Trigger   │    │   Slack / API   │    │   Admin Auth    │
│  (Event Step)   │    │    Approval     │    │    Approval     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                                ▼
6. EXECUTION LAYER
┌──────────────────────────────────────────┐
│             EXECUTION LAYER              │
├──────────────────────────────────────────┤
│ • Sandbox Cleanup (real file operations) │
│ • Container Restart (Docker CLI)         │
│ • Scenario Reset (state management)      │
└──────────────────────────────────────────┘
         │
         ▼
7. VERIFICATION & ROLLBACK
┌─────────────────┐    ┌──────────────────────────┐
│ Verify Metrics  │───▶│   Snapshot + Rollback    │
│  (Prometheus)   │    │ (If degradation detected)│
└─────────────────┘    └──────────────────────────┘
         │
         ▼
8. COMPLETION
┌──────────────────────────────────────────┐
│   State Update + Audit Trail + Metrics   │
│       Incident Closed or Escalated       │
└──────────────────────────────────────────┘
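The branching at step 4 of the workflow can be sketched as a small routing function. The event topic names and the mapping of medium risk to the Operator role are assumptions for illustration; the source only states that low risk is auto-fixed, medium risk needs human approval, and high risk requires an admin.

```typescript
// Hypothetical names: the topics below are illustrative, not Sentinel's actual topics.
type RiskLevel = "low" | "medium" | "high";

interface Route {
  topic: string;                             // event to emit next
  requiredRole: "operator" | "admin" | null; // who must approve, if anyone
}

// Map the classified risk level to the next step in the pipeline.
function routeByRisk(risk: RiskLevel): Route {
  switch (risk) {
    case "low":
      return { topic: "fix.auto_execute", requiredRole: null };
    case "medium":
      // Assumption: any operator-level approval is sufficient for medium risk.
      return { topic: "approval.requested", requiredRole: "operator" };
    case "high":
      return { topic: "approval.requested", requiredRole: "admin" };
  }
}
```

Keeping the decision in one pure function makes the policy easy to test and audit separately from the steps that emit events or post to Slack.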
# Required
- Node.js 18+
- npm or yarn
- Docker (for container monitoring)
- Prometheus + node-exporter (optional, has fallback)
# Optional
- Slack webhook URL (for notifications)
- OpenRouter API key (for AI analysis)

# 1. Clone repository
git clone <repo-url>
cd Sentinal
# 2. Install dependencies
npm install
# 3. Configure environment variables
cp .env.example .env
# Edit .env with your API keys:
# - OPENROUTER_API_KEY=your_key
# - SLACK_WEBHOOK_URL=your_webhook
# - PUBLIC_URL=http://localhost:3000
# 4. Start development server
npm run dev
# 5. Open dashboard
open http://localhost:3000/dashboard.html
# 6. Test Real Features
```bash
# Trigger a real incident (requires admin API key)
curl -H "X-API-Key: sk-sentinal-admin-demo-key-12345" \
  http://localhost:3000/api/trigger-alert

# Monitor real metrics
curl http://localhost:3000/api/infrastructure-status

# View pending approvals
curl -H "X-API-Key: sk-sentinal-operator-demo-key-67890" \
  http://localhost:3000/api/approvals/pending
```

| Role | API Key | Permissions |
|---|---|---|
| Admin | `sk-sentinal-admin-demo-key-12345` | Full access + sandbox cleanup |
| Operator | `sk-sentinal-operator-demo-key-67890` | View + approve incidents |
| Viewer | `sk-sentinal-viewer-demo-key-11111` | Read-only access |
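The role table above maps naturally onto a permission lookup. The following is a minimal sketch under stated assumptions: the action names, `PERMISSIONS`, `API_KEYS`, and `isAllowed` are hypothetical, and a real deployment would load keys from configuration instead of hard-coding the demo keys.

```typescript
// Hypothetical sketch: action names and lookup shapes are illustrative.
type Role = "admin" | "operator" | "viewer";
type Action = "view" | "approve" | "trigger" | "rollback" | "sandbox_cleanup";

const PERMISSIONS: Record<Role, readonly Action[]> = {
  admin: ["view", "approve", "trigger", "rollback", "sandbox_cleanup"],
  operator: ["view", "approve"],
  viewer: ["view"],
};

// Demo keys from the table above, for local testing only.
const API_KEYS = new Map<string, Role>([
  ["sk-sentinal-admin-demo-key-12345", "admin"],
  ["sk-sentinal-operator-demo-key-67890", "operator"],
  ["sk-sentinal-viewer-demo-key-11111", "viewer"],
]);

// Returns true only if the key is known and its role grants the action.
function isAllowed(apiKey: string | undefined, action: Action): boolean {
  if (apiKey === undefined) return false;
  const role = API_KEYS.get(apiKey);
  return role !== undefined && PERMISSIONS[role].includes(action);
}
```

A middleware could call `isAllowed` with the `X-API-Key` header value and reject the request when it returns `false`.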
This is a hackathon project. Contributions welcome after the event!
Built with ❤️ for the hackathon. Thanks to System Failures.
⭐ Star this repo if you find it useful!