PROJECT DOCUMENT: AXIOM-SOVEREIGN (WINNER'S EDITION)
- Title & Basic Info
  • Project Name: Axiom-Sovereign
  • One-line Tagline: Distributed Autonomous Remediation with Multi-Agent Explainability & Governance.
  • Domain / Category: SRE / Autonomous Agents / Chaos Engineering.
  • Team: The Butchers
- Elevator Pitch
  Axiom-Sovereign is a lean autonomous remediation engine that resolves infrastructure failures through a multi-agent "Debate" protocol. It features sub-10ms system-level event detection (simulated for demo), an AI Explainability panel, and automated Jira governance to ensure every fix is audited and justified.
- The 48-Hour "Winner" Features
- AI Explainability Panel: Live UI section showing the Agent’s Reasoning, Hypothesis, and Confidence Scores.
- Auto Root Cause Graph: A visual flow showing failure propagation (e.g., DB Down → App Degraded).
- Structured Agent Debate: Two internal LLM prompts (Agent A vs Agent B) that output a JSON-structured conflict resolution.
- Smart Retry Strategy: Fixes escalate one level at a time when the previous level fails: Restart (L1) → Cache Clear (L2) → Human Escalation (L3).
- Jira State Machine: Automated ticket lifecycle (To Do → Done) with the "Decision JSON" logged as a comment.
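The Smart Retry Strategy above can be sketched as a small escalation ladder. This is a minimal illustration, not the project's actual code: the `remediate` function, the `(level, name, run)` tuple shape, and the `escalate` callback are all hypothetical names, and each fix is assumed to report success as a boolean.

```python
from typing import Callable, Dict, List, Tuple

# (level, action name, callable that returns True when the fix worked).
# This tuple shape is an assumption made for the sketch.
AutoStep = Tuple[str, str, Callable[[], bool]]


def remediate(auto_steps: List[AutoStep],
              escalate: Callable[[List[str]], None]) -> Dict[str, object]:
    """Try automated fixes in order; hand off to a human if all of them fail."""
    attempted: List[str] = []
    for level, name, run in auto_steps:
        attempted.append(f"{level}:{name}")
        if run():
            # Stop at the first fix that reports success.
            return {"resolved_at": level, "attempted": attempted}
    # Every automated level failed: escalate (L3) with the attempt history.
    escalate(list(attempted))
    attempted.append("L3:human_escalation")
    return {"resolved_at": None, "attempted": attempted}
```

In a demo, `run` could wrap a mock script and `escalate` could post the attempt history to the Jira ticket, so the human responder sees what was already tried.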
- Problem Statement
  • The Crisis: Manual L1 triage is a major bottleneck in SRE and drives up MTTR.
  • The Gap: Standard bots act without explaining why, which makes them risky in production.
  • The Solution: A system that debates the best fix before executing it and logs the reasoning behind every action.
- System Architecture (Lean & Real)
  • Sentinel Layer (C++): Lightweight system-level monitoring (simulating kernel-level events for demo speed) to detect log/state changes.
  • Intelligence Orchestrator (Python): Central engine managing the Multi-Agent Debate and Jira API.
  • Governance Layer (Jira): Source of truth for incident state and audit trail.
  • Chaos Controller: Dashboard buttons to "Choke" nodes (Stop Process, DB Timeout).
  • Visualizer (Streamlit): Unified UI for the Node Map, RCA Graph, and Explainability Panel.
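One way the RCA Graph in the Visualizer could be derived is a breadth-first walk over a service dependency map, starting from the node the Sentinel flagged. The sketch below is an assumption-heavy illustration: the `DEPENDENTS` topology, the `DOWN`/`DEGRADED` labels, and the function names are hypothetical and not taken from the project.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical topology: each key lists the services that depend on it.
DEPENDENTS: Dict[str, List[str]] = {
    "db": ["app"],
    "app": ["gateway"],
    "gateway": [],
}


def propagate(root: str, dependents: Dict[str, List[str]]
              ) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Walk outward from the failed node; return node statuses and graph edges."""
    status = {root: "DOWN"}
    edges: List[Tuple[str, str]] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            edges.append((node, child))
            if child not in status:  # visit each downstream service once
                status[child] = "DEGRADED"
                queue.append(child)
    return status, edges


def render(status: Dict[str, str], edges: List[Tuple[str, str]]) -> List[str]:
    """Format each edge as a line such as 'db DOWN → app DEGRADED'."""
    return [f"{a} {status[a]} → {b} {status[b]}" for a, b in edges]
```

The edge list returned by `propagate` is plain data, so it could be passed to whichever graph widget the Streamlit UI ends up using.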
- The Logic: Sense-Think-Act-Explain
- Sense: C++ Sentinel flags a CRITICAL state change.
- Think (JSON Debate): Orchestrator runs a dual-prompt debate. Internal JSON Output:
  {
    "agent_A": {"action": "restart_db", "confidence": 0.85},
    "agent_B": {"action": "check_pool", "confidence": 0.72},
    "final_decision": "restart_db",
    "reasoning": "High confidence in service recovery via restart."
  }
- Act: Executes systemctl restart (or mock script) via subprocess.
- Explain: Updates UI and Jira with the "Why" using the reasoning from JSON.
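The Think and Act steps above can be sketched as follows. The document does not state how the final decision is chosen, so this sketch assumes the proposal with the higher confidence wins and that ties go to Agent A. The function names and the `ALLOWED_COMMANDS` allow-list (which uses mock `echo` commands in place of a real `systemctl restart`) are hypothetical.

```python
import json
import subprocess
from typing import Dict, List

# Hypothetical allow-list: only these actions may ever reach subprocess.
# Mock commands stand in for real remediation scripts.
ALLOWED_COMMANDS: Dict[str, List[str]] = {
    "restart_db": ["echo", "mock: restart database service"],
    "check_pool": ["echo", "mock: inspect connection pool"],
}


def parse_proposal(raw: str) -> dict:
    """Validate one agent's raw JSON reply before it enters the debate."""
    data = json.loads(raw)
    action = data["action"]
    confidence = float(data["confidence"])
    if not isinstance(action, str) or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"invalid proposal: {raw!r}")
    return {"action": action, "confidence": confidence}


def resolve_debate(agent_a: dict, agent_b: dict) -> dict:
    """Assumed rule: higher confidence wins; ties go to Agent A."""
    name, winner = "agent_A", agent_a
    if agent_b["confidence"] > agent_a["confidence"]:
        name, winner = "agent_B", agent_b
    reasoning = (f"{name} proposed {winner['action']} with the higher "
                 f"confidence ({winner['confidence']:.2f}).")
    return {
        "agent_A": agent_a,
        "agent_B": agent_b,
        "final_decision": winner["action"],
        "reasoning": reasoning,
    }


def act(decision: dict, dry_run: bool = True) -> List[str]:
    """Run the allow-listed command for the decision; dry_run only returns it."""
    command = ALLOWED_COMMANDS.get(decision["final_decision"])
    if command is None:
        raise ValueError(f"action not allowed: {decision['final_decision']}")
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Validating each reply before resolution keeps a malformed LLM response from reaching the Act step, and the allow-list means the model can only select a command, never write one.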
- The Winning Demo Flow
- Baseline: System is green. Traffic is flowing on the visualizer.
- Chaos: Click "CHOKE DATABASE".
- Detection: Dashboard turns red; Sentinel alerts the Orchestrator.
- Reasoning: Explainability Panel populates with the JSON Debate.
- Governance: Jira ticket is created; Ticket ID appears on UI.
- Remediation: Agent executes fix. System returns to green.
- Closure: Jira ticket auto-moves to DONE.
- Punchline: "Not only does it fix, it explains exactly WHY it fixed it."
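The Governance and Closure steps in the flow above could be driven by the Jira Cloud REST API roughly as sketched here. The helpers only build `urllib.request.Request` objects and do not send them. The function names, the example site URL, and the issue key are hypothetical. The sketch assumes basic auth with an account email and API token, and it posts the comment through the v2 endpoint because that accepts a plain-text body.

```python
import base64
import json
import urllib.request
from typing import Optional


def _request(base_url: str, path: str, email: str, api_token: str,
             payload: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request; POST when a payload is given, else GET."""
    token = base64.b64encode(f"{email}:{api_token}".encode()).decode()
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}",
        data=data,
        method="POST" if payload is not None else "GET",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def find_transition_id(transitions_response: dict, target: str) -> Optional[str]:
    """Pick the transition id whose name matches, e.g. 'Done'.

    Expects the JSON returned by GET /rest/api/3/issue/{key}/transitions.
    """
    for transition in transitions_response.get("transitions", []):
        if transition["name"].lower() == target.lower():
            return transition["id"]
    return None


def transition_request(base_url, issue_key, transition_id, email, api_token):
    """Request that moves the ticket through the given workflow transition."""
    path = f"/rest/api/3/issue/{issue_key}/transitions"
    return _request(base_url, path, email, api_token,
                    {"transition": {"id": transition_id}})


def comment_request(base_url, issue_key, decision, email, api_token):
    """Request that logs the Decision JSON as a plain-text comment (v2 API)."""
    body = "Decision JSON:\n" + json.dumps(decision, indent=2)
    path = f"/rest/api/2/issue/{issue_key}/comment"
    return _request(base_url, path, email, api_token, {"body": body})
```

Each request can be sent with `urllib.request.urlopen(req)`. Transition ids differ between Jira workflows, which is why the sketch looks the id up by name instead of hard-coding it.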
- Tech Stack
  • Monitoring: C++.
  • Orchestration: Python (FastAPI/Streamlit).
  • Intelligence: Gemini 1.5 Flash (Fast & Lean).
  • Governance: Jira Cloud API.
  • State Store: Local global_state.json (No Redis/External DB for 48h speed).
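Because the Sentinel, Orchestrator, and Streamlit UI would all touch the same global_state.json, a half-written file is a realistic failure mode. A minimal sketch of one mitigation follows: write to a temporary file, then swap it into place with `os.replace`. The `save_state` and `load_state` helper names are hypothetical.

```python
import json
import os
import tempfile
from typing import Optional


def save_state(state: dict, path: str = "global_state.json") -> None:
    """Write the state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
        # os.replace swaps the file in one step, so readers never see
        # a partially written global_state.json.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path: str = "global_state.json",
               default: Optional[dict] = None) -> dict:
    """Read the state file; fall back to a default when it does not exist yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {} if default is None else default
```

This protects readers from torn writes but does not serialize concurrent writers: if two processes save at once, the last write wins.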
- Evaluation Strategy
  • MTTR: Time from Choke to Jira Done.
  • Explainability Score: Clarity of the Agent Debate log.
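The MTTR metric above reduces to a timestamp difference per incident, averaged over demo runs. A minimal sketch, assuming the Orchestrator records timezone-aware timestamps for the Choke click and the Jira Done transition; the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def mttr_seconds(choke_at: datetime, done_at: datetime) -> float:
    """Seconds from the Choke click to the ticket reaching Done."""
    delta = (done_at - choke_at).total_seconds()
    if delta < 0:
        raise ValueError("done_at is earlier than choke_at")
    return delta


def mean_mttr(incidents: List[Tuple[datetime, datetime]]) -> float:
    """Average MTTR in seconds across (choke_at, done_at) pairs."""
    if not incidents:
        raise ValueError("no incidents recorded")
    return sum(mttr_seconds(c, d) for c, d in incidents) / len(incidents)


# Example: one incident choked at 12:00:00 UTC and closed at 12:00:42 UTC.
choke = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
done = datetime(2024, 1, 1, 12, 0, 42, tzinfo=timezone.utc)
print(mttr_seconds(choke, done))  # 42.0
```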
- ChatGPT / LLM Context Prompt
  "I am building 'Axiom-Sovereign' with team 'The Butchers'. It's an autonomous remediation system for a hackathon. It uses C++ monitoring and a Python orchestrator. Key features: Multi-agent debate (outputting JSON with confidence scores), Jira integration, and a Streamlit explainability panel. Help me implement the JSON debate logic and the Jira status transition."