AGENTS.md: 33 changes (19 additions, 14 deletions)
@@ -4,9 +4,12 @@ This document contains system prompts and protocols for specialized agents withi
 
 ## Audit & Remediation Agent
 
-**Role:** High-Integrity Systems Auditor & First-Principles Engineer
+**Role:** High-Integrity Systems Auditor & First-Principles Engineer
+**Type:** Internal Monitoring Tool (Not Part of Core YouTube-Link Workflow)
 **Frequency:** Nightly Execution (02:00 UTC)
 
+**Note:** This agent is an internal monitoring and maintenance tool that operates independently of EventRelay's core workflow (YouTube link → context extraction → agent dispatch → outputs). It performs system health audits and generates recommendations for operator review.
+
 ### 1. Objective: The Audit Trace
 
 You are tasked with a nightly deep-scan of all system logs, transaction traces, and state changes. Your goal is not just to find "failures," but to identify **divergences from first principles** (e.g., "Is this data state logically possible given the laws of the system?").
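The header pins the audit to 02:00 UTC every night. As a minimal sketch of how a scheduler could compute the next slot (the helper `next_audit_run` and its "strictly after now" rule are assumptions, not part of the PR):

```python
from datetime import datetime, time, timedelta, timezone

# Nightly audit slot taken from the AGENTS.md header: 02:00 UTC.
AUDIT_TIME_UTC = time(hour=2, minute=0, tzinfo=timezone.utc)


def next_audit_run(now: datetime) -> datetime:
    """Return the first 02:00 UTC strictly after `now` (hypothetical helper)."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    # combine() takes its tzinfo from AUDIT_TIME_UTC, so the result is aware.
    candidate = datetime.combine(now_utc.date(), AUDIT_TIME_UTC)
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_audit_run(datetime.now(timezone.utc)).isoformat())
```

Rejecting naive datetimes avoids silently interpreting a server's local time as UTC.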
@@ -21,25 +24,27 @@ For every error, bottleneck, or inconsistency discovered, you must execute a "Fi
 
 ### 3. Execution Phase: Ruthless Solutions
 
-You are authorized to take autonomous action on **all** issues found. You must bypass "band-aid" fixes in favor of **ruthless, proven solutions**:
+You may autonomously execute only **pre-approved, low-risk maintenance actions** (e.g., log aggregation, report generation, and safe database cleanup routines when database components are specifically unhealthy). For all other issues, you must generate **ruthless, first-principles recommendations** for a human operator to review and implement:
 
-* **Discard the Fragile:** If a component fails consistently, your suggested action should be a structural rewrite rather than a patch.
-* **Atomic Remediation:** Ensure every fix is idempotent and verified against the system's core constraints.
-* **No Half-Measures:** If a record is corrupt, quarantine and rebuild from the last known-good state; do not attempt to "guess" missing data.
+* **Discard the Fragile (Advisory):** If a component fails consistently, your suggested action should be a structural rewrite rather than a patch. This is a recommendation only; you do not perform structural rewrites yourself.
+* **Atomic Remediation (Advisory):** For each issue, propose fixes that would be idempotent and verifiable against the system's core constraints. Clearly label these as recommendations requiring manual approval.
+* **No Half-Measures (Advisory):** If a record appears corrupt, flag it, explain why, and recommend quarantining and rebuilding from the last known-good state. Do **not** attempt to directly modify, quarantine, or rebuild production records autonomously.
 
 ### 4. Fortification: Preventative Measures
 
-Every remediation must be accompanied by a hard-coded preventative measure. This includes:
+Every **recommended** remediation must be accompanied by a proposed preventative measure. This includes recommendations such as:
 
+* **Constraint Injection (Advisory):** Suggest schema-level or logic-level guards that would make the error mathematically impossible to repeat, but do not change schemas or business logic directly.
+* **Automated Regression (Advisory):** Propose new trace-points or monitoring hooks for this failure mode so it can be caught in real-time before the next nightly audit; implementation is left to human operators.
+
-* **Constraint Injection:** Adding schema-level or logic-level guards to make the error mathematically impossible to repeat.
-* **Automated Regression:** Creating a new trace-point specifically for this failure mode to catch it in real-time before the next nightly audit.
+_Current implementation note:_ Automated behavior is limited to log analysis, report generation, and safe maintenance tasks like database cleanup when database components are specifically unhealthy. Structural changes, schema updates, and record-level repairs are **advisory-only** and require human review.
 
 ### Implementation Instructions for Jules
 
-1. **Initialize Audit Agent:** Load the trace logs for the previous 24-hour window.
-2. **Filter Logic:** Flag any status code > 400 or any latency > 200ms.
+1. **Initialize Audit Agent:** Load the complete trace logs from the available log file(s).
+2. **Filter Logic:** Flag any status code >= 400 or any latency > 200ms.
 3. **Action Loop:**
-   * **IF** issue found **THEN** execute `FirstPrinciplesAnalysis()`.
-   * **EXECUTE** `RuthlessCleanup()`.
-   * **DEPLOY** `PreventativeGuard()`.
-4. **Reporting:** Summarize all "Ruthless Actions" taken and list the new constraints added to the system.
+   * **IF** issue found **THEN** execute `FirstPrinciplesAnalysis()` to generate a root-cause narrative and proposed remediations.
+   * **EXECUTE** `RuthlessCleanup()` only for pre-approved maintenance tasks (e.g., database cleanup when database components are unhealthy); for all other items, record "ruthless" cleanup steps as recommendations rather than actions.
+   * **DEPLOY** `PreventativeGuard()` as a set of recommended constraints and monitoring additions for human review, not as direct schema or code changes.
+4. **Reporting:** Summarize (a) all automated maintenance actions actually executed and (b) all advisory "Ruthless Actions" and preventative guards recommended for operators to implement.
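The revised action loop and reporting step split the agent's output into work it actually performed and work it only recommends. A minimal sketch of that gate, under stated assumptions: `PRE_APPROVED`, `Action`, `AuditReport`, `run_action_loop`, and the action names are hypothetical, and the precondition that database cleanup runs only when a database component is unhealthy is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical allow-list: the only actions the agent may execute unattended.
PRE_APPROVED = {"aggregate_logs", "generate_report", "database_cleanup"}


@dataclass
class Action:
    name: str
    description: str
    run: Callable[[], str]  # performs the action and returns a short result message


@dataclass
class AuditReport:
    executed: List[str] = field(default_factory=list)  # (a) actions actually run
    advisory: List[str] = field(default_factory=list)  # (b) recommendations only


def run_action_loop(actions: List[Action]) -> AuditReport:
    report = AuditReport()
    for action in actions:
        if action.name in PRE_APPROVED:
            report.executed.append(f"{action.name}: {action.run()}")
        else:
            # Never invoked: recorded for a human operator to review and implement.
            report.advisory.append(f"[ADVISORY] {action.name}: {action.description}")
    return report
```

Keeping the allow-list as data rather than branching logic makes it easy to audit which actions can ever run without a human in the loop.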
scripts/nightly_audit_agent.py: 50 changes (27 additions, 23 deletions)
@@ -19,16 +19,15 @@
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Dict, Any
 
 # Add src to python path to allow imports
-sys.path.append(str(Path(__file__).parent.parent / "src"))
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
 
 # Imports - Fail fast if missing dependencies
 from youtube_extension.backend.services.health_monitoring_service import (
     HealthMonitoringService,
     HealthStatus
 )
-from youtube_extension.backend.services.metrics_service import MetricsService
-from youtube_extension.backend.services.logging_service import LoggingService
 from youtube_extension.backend.services.database_cleanup_service import run_database_cleanup
 
 # Configure logging
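Switching from `sys.path.append` to `sys.path.insert(0, ...)` changes lookup order: the repository's `src` directory is now searched before anything already on the path, so an older installed copy of `youtube_extension` cannot shadow the working tree. The sketch below demonstrates the effect with two throwaway directories and a hypothetical module name, `demo_shadow`; it is an illustration of Python's import order, not code from the script.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write_module(directory: Path, origin: str) -> None:
    # Each directory gets its own demo_shadow.py that records where it lives.
    (directory / "demo_shadow.py").write_text(f"ORIGIN = {origin!r}\n")


def resolve_origin(installed: Path, local: Path, use_insert: bool) -> str:
    saved_path = list(sys.path)
    try:
        sys.path.append(str(installed))  # stands in for an already-installed copy
        if use_insert:
            sys.path.insert(0, str(local))  # new behaviour: searched first
        else:
            sys.path.append(str(local))  # old behaviour: searched last
        sys.modules.pop("demo_shadow", None)
        importlib.invalidate_caches()
        return importlib.import_module("demo_shadow").ORIGIN
    finally:
        sys.path[:] = saved_path
        sys.modules.pop("demo_shadow", None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        installed, local = Path(a), Path(b)
        write_module(installed, "installed")
        write_module(local, "src")
        print("append ->", resolve_origin(installed, local, use_insert=False))
        print("insert ->", resolve_origin(installed, local, use_insert=True))
```

With `append`, the first matching entry on `sys.path` wins, which is the pre-existing copy; with `insert(0, ...)` the local tree wins.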
@@ -48,17 +47,16 @@ def __init__(self, dry_run: bool = False):
         self.issues = []
         self.remediations = []
         self.fortifications = []
-        self.log_dir = Path("logs")
-        self.report_dir = Path("audit_reports")
+        # Anchor paths to project root
+        project_root = Path(__file__).resolve().parent.parent
+        self.log_dir = project_root / "logs"
+        self.report_dir = project_root / "audit_reports"
 
         # Ensure directories exist
         self.report_dir.mkdir(parents=True, exist_ok=True)
 
-        # Initialize services
+        # Initialize services (only those needed)
         self.health_service = HealthMonitoringService()
-        self.metrics_service = MetricsService()
-        # Logging service is usually a singleton
-        self.logging_service = LoggingService()
 
     async def run_audit(self):
         """Main execution loop."""
@@ -105,7 +103,7 @@ async def analyze_health(self):
             })
 
     async def analyze_logs(self):
-        """Analyze logs for status codes > 400."""
+        """Analyze logs for status codes >= 400."""
         logger.info("Analyzing logs...")
         log_file = self.log_dir / "structured_logs.jsonl"
 
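The docstring follows the threshold fix from `> 400` to `>= 400`: HTTP 400 (Bad Request) is itself a client error, so the old comparison skipped it. A sketch of the AGENTS.md flagging rule applied to `structured_logs.jsonl`, assuming each line is a JSON object with `status_code` and `latency_ms` fields; both field names and the `malformed_log_line` marker are assumptions, since the log schema is not shown in this diff.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, List

STATUS_THRESHOLD = 400        # flag every 4xx/5xx, including 400 itself
LATENCY_THRESHOLD_MS = 200.0  # "any latency > 200ms"


def should_flag(entry: Dict[str, Any]) -> bool:
    status = entry.get("status_code")
    latency = entry.get("latency_ms")
    if isinstance(status, int) and status >= STATUS_THRESHOLD:
        return True
    return isinstance(latency, (int, float)) and latency > LATENCY_THRESHOLD_MS


def flag_lines(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            # Report unparseable lines instead of dropping them silently.
            yield {"type": "malformed_log_line", "line": line_no}
            continue
        if isinstance(entry, dict) and should_flag(entry):
            yield entry


def flagged_entries(log_file: Path) -> List[Dict[str, Any]]:
    with log_file.open(encoding="utf-8") as handle:
        return list(flag_lines(handle))
```

Separating `flag_lines` from file handling keeps the threshold logic testable without touching disk.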
@@ -190,7 +188,8 @@ async def first_principles_analysis(self, issue: Dict[str, Any]) -> Dict[str, An
 
     async def ruthless_remediation(self, diagnosis: Dict[str, Any]):
         """
-        Execute ruthless solutions.
+        Execute ruthless solutions (pre-approved maintenance tasks only).
+        Structural changes and schema updates are advisory-only.
         """
         fix = diagnosis['proposed_fix']
         logger.info(f"Executing remediation: {fix}")
@@ -203,21 +202,26 @@ async def ruthless_remediation(self, diagnosis: Dict[str, Any]):
         # "Ruthless" Actions implementation
         if "Restart" in fix:
             # In a real env, this might trigger a k8s restart or systemctl
-            self.remediations.append(f"Triggered restart for components related to {diagnosis['issue']['type']}")
+            self.remediations.append(f"[ADVISORY] Triggered restart for components related to {diagnosis['issue']['type']}")
 
         elif "Review" in fix:
-            self.remediations.append(f"Flagged {diagnosis['issue']['type']} for immediate manual review (Ticket created)")
+            self.remediations.append(f"[ADVISORY] Flagged {diagnosis['issue']['type']} for immediate manual review (Ticket created)")
 
         elif "Optimize" in fix:
-            self.remediations.append("Triggered auto-optimization (e.g., ANALYZE DB)")
+            self.remediations.append("[ADVISORY] Triggered auto-optimization (e.g., ANALYZE DB)")
 
-        # Always run DB cleanup if it's a health issue, just in case
+        # Only run DB cleanup if database component is specifically unhealthy
         if diagnosis['issue']['type'] == 'health_check':
-            try:
-                results = await run_database_cleanup()
-                self.remediations.append(f"Ran database cleanup: {len(results)} databases cleaned")
-            except Exception as e:
-                self.remediations.append(f"Database cleanup failed: {e}")
+            unhealthy_components = diagnosis['issue'].get('components', [])
+            db_components = [c for c in unhealthy_components if 'database' in c.lower() or 'db' in c.lower()]
+
+            if db_components:
+                try:
+                    results = await run_database_cleanup()
+                    self.remediations.append(f"Ran database cleanup for unhealthy DB components {db_components}: {len(results)} databases cleaned")
+                except Exception as e:
+                    logger.error(f"Database cleanup failed: {e}")
+                    self.remediations.append(f"Database cleanup failed: {e}")
 
     async def fortify(self, diagnosis: Dict[str, Any]):
         """
@@ -227,9 +231,9 @@ async def fortify(self, diagnosis: Dict[str, Any]):
         logger.info(f"Applying fortification: {measure}")
 
         if self.dry_run:
-            logger.info("[DRY RUN] Fortification skipped.")
-            self.fortifications.append(f"[DRY RUN] {measure}")
-            return
+            logger.info("[DRY RUN] Fortification skipped.")
+            self.fortifications.append(f"[DRY RUN] {measure}")
+            return
 
         self.fortifications.append(f"Applied: {measure}")
         # In a real system, this might write to a 'constraints.json' or update WAF rules.
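With `[ADVISORY]` and `[DRY RUN]` prefixes on the recorded entries, the two-part report from AGENTS.md (maintenance actually executed versus recommendations) can be derived from `self.remediations` and `self.fortifications` alone. A minimal sketch, assuming the report is a plain dict; `split_report` and its keys are hypothetical, since the script's report format is not shown in this diff.

```python
from typing import Dict, List

ADVISORY_PREFIX = "[ADVISORY]"
DRY_RUN_PREFIX = "[DRY RUN]"
APPLIED_PREFIX = "Applied:"


def split_report(remediations: List[str], fortifications: List[str]) -> Dict[str, List[str]]:
    return {
        # (a) maintenance the agent actually attempted (e.g. database cleanup)
        "executed_maintenance": [r for r in remediations if not r.startswith(ADVISORY_PREFIX)],
        # (b) remediations left for operators to implement
        "advisory_remediations": [r for r in remediations if r.startswith(ADVISORY_PREFIX)],
        "applied_fortifications": [f for f in fortifications if f.startswith(APPLIED_PREFIX)],
        "dry_run_fortifications": [f for f in fortifications if f.startswith(DRY_RUN_PREFIX)],
    }
```

Failed cleanup attempts (`Database cleanup failed: ...`) land under executed maintenance, since the agent did try to run them.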