AI-powered GOAP agent that automates incident triage, root cause analysis, and guided remediation across an enterprise Java/Spring stack. Integrates with Prometheus, Loki, Tempo, Grafana, PagerDuty, ArgoCD, and Slack to reduce MTTR and on-call toil.
Hexagonal architecture enforced by ArchUnit — domain has zero dependencies on infrastructure, agents, or frameworks.
┌─────────────────────────────────────────────────────────────────────────┐
│                            APPLICATION LAYER                            │
│             REST API · PagerDuty Webhook · Spring Shell CLI             │
└──────────────────────────────┬──────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                       AGENT LAYER (Embabel GOAP)                        │
│                                                                         │
│  Triage · Logs · Deploy · Health · Trace · Fatigue · PostMortem · SLO   │
│                                                                         │
│                          ┌───────────────────┐                          │
│                          │ LLM (Claude 4.5)  │                          │
│                          └───────────────────┘                          │
└──────────────────────────────┬──────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                      DOMAIN LAYER (pure, no deps)                       │
│                                                                         │
│             Port Interfaces · Records & Enums · Formatters              │
└──────────────────────────────┬──────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                          INFRASTRUCTURE LAYER                           │
│                                                                         │
│     PrometheusAdapter · LokiAdapter · TempoAdapter · GrafanaAdapter     │
│     ArgoCDAdapter · PagerDutyAdapter · SlackAdapter · MockAdapters      │
└───────┬──────────┬──────────┬─────────┬─────────┬──────────┬───────────┘
        │          │          │         │         │          │
        ▼          ▼          ▼         ▼         ▼          ▼
   Prometheus    Loki       Tempo    Grafana   ArgoCD    PagerDuty Slack
| Layer | Technology |
|---|---|
| Language | Java 25 |
| Framework | Spring Boot 4.0.3 (WebFlux — reactive) |
| AI Agent | Embabel Agent Framework 0.3.4 (GOAP orchestration) |
| LLM | Ollama + llama3.2 (local, free) — swappable to Claude or OpenAI |
| Build | Gradle 9.3.1 (Kotlin DSL) |
| Architecture | Hexagonal (ports & adapters), validated by ArchUnit |
| Code Style | Spotless (auto-formatting) |
| UC | Agent | Command | REST Endpoint | Persona |
|---|---|---|---|---|
| UC-1 | IncidentTriageAgent | `triage <alertId>` | `POST /api/v1/triage` | Senior SRE |
| UC-2 | LogAnalysisAgent | `logs <service>` | `POST /api/v1/log-analysis` | Log Analyst |
| UC-3 | DeployImpactAgent | `deploy-impact <service>` | `POST /api/v1/deploy-impact` | Senior SRE |
| UC-4 | ServiceHealthAgent | `health <service>` | `GET /api/v1/health/{service}` | Senior SRE |
| UC-5 | TraceAnalysisAgent | `trace <service> [traceId]` | `POST /api/v1/trace-analysis` | Senior SRE |
| UC-8 | AlertFatigueAgent | `alert-fatigue <team> <days>` | `GET /api/v1/alert-fatigue` | SRE Manager |
| UC-9 | PostMortemAgent | `postmortem <incidentId> [service]` | `POST /api/v1/postmortem` | Incident Commander |
| UC-10 | SLOMonitorAgent | `slo <service>` | `GET /api/v1/slo/{service}` | Senior SRE |
Each agent uses the Embabel GOAP (Goal-Oriented Action Planning) engine to autonomously plan and execute a sequence of actions — fetching metrics, logs, traces, and deploys — then synthesizes findings via an LLM into a structured report.
- Java 25+
- Docker (for observability stack)
- Ollama (for local LLM)
ollama serve
ollama pull llama3.2

./gradlew bootRun

This starts the Embabel interactive shell with mock data for all external services. Try:
embabel> health alert-api
embabel> slo alert-api
embabel> logs alert-api
embabel> triage ALERT-123
Start the price-alert observability stack (Prometheus, Loki, Tempo, Grafana):
cd /path/to/price-alert
docker compose up -d

Start WireMock mocks for PagerDuty, Slack, and ArgoCD:

cd docker && docker compose up -d

Run the agent with all adapters enabled:

SPRING_PROFILES_ACTIVE=e2e SERVER_PORT=8090 ./gradlew bootRun

# Service health check
curl http://localhost:8090/api/v1/health/alert-api
# SLO burn rate
curl http://localhost:8090/api/v1/slo/alert-api
# Log analysis
curl -X POST http://localhost:8090/api/v1/log-analysis \
-H "Content-Type: application/json" \
-d '{"service": "alert-api", "timeWindow": "1h", "severity": "error"}'
# Incident triage
curl -X POST http://localhost:8090/api/v1/triage \
-H "Content-Type: application/json" \
-d '{"alertId": "ALERT-123", "service": "alert-api", "severity": "SEV2", "description": "High error rate"}'
# Deploy impact
curl -X POST http://localhost:8090/api/v1/deploy-impact \
-H "Content-Type: application/json" \
-d '{"service": "evaluator"}'
# Trace analysis
curl -X POST http://localhost:8090/api/v1/trace-analysis \
-H "Content-Type: application/json" \
-d '{"service": "alert-api"}'
# Alert fatigue
curl "http://localhost:8090/api/v1/alert-fatigue?team=platform&days=7"
# Post-mortem draft
curl -X POST http://localhost:8090/api/v1/postmortem \
-H "Content-Type: application/json" \
-d '{"incidentId": "INC-456", "service": "alert-api"}'

src/main/java/com/stablebridge/oncall/
├── agent/ # 8 GOAP agents (one per use case)
│ ├── triage/ # IncidentTriageAgent
│ ├── logs/ # LogAnalysisAgent
│ ├── deploy/ # DeployImpactAgent
│ ├── health/ # ServiceHealthAgent
│ ├── trace/ # TraceAnalysisAgent
│ ├── fatigue/ # AlertFatigueAgent
│ ├── postmortem/ # PostMortemAgent
│ ├── slo/ # SLOMonitorAgent
│ └── persona/ # OnCallPersonas (4 personas)
├── domain/ # Pure domain — no framework deps
│ ├── model/ # 45 immutable records + 8 enums
│ │ ├── common/ # Enums, exceptions
│ │ ├── alert/ # AlertContext, TriageReport, etc.
│ │ ├── metrics/ # MetricsSnapshot, SLOSnapshot, etc.
│ │ ├── logs/ # LogCluster, LogAnalysisReport
│ │ ├── deploy/ # DeploySnapshot, DeployImpactReport
│ │ ├── health/ # ServiceHealthReport, DependencyStatus
│ │ ├── trace/ # CallChainStep, TraceAnalysisReport
│ │ ├── fatigue/ # AlertFatigueReport, NoisyRule
│ │ ├── postmortem/ # PostMortemDraft, ActionItem
│ │ └── slo/ # SLOReport, BurnContributor
│ ├── port/ # 12 port interfaces
│ │ ├── loki/ # LogSearchProvider
│ │ ├── prometheus/ # MetricsProvider, DependencyGraphProvider
│ │ ├── tempo/ # TraceProvider
│ │ ├── argocd/ # DeployHistoryProvider
│ │ ├── grafana/ # DashboardProvider
│ │ ├── pagerduty/ # AlertProvider, AlertHistoryProvider, AlertNotifier
│ │ └── notification/ # SlackNotifier
│ └── service/ # 7 pure formatters
├── infrastructure/ # Adapter implementations
│ ├── loki/ # LokiAdapter
│ ├── prometheus/ # PrometheusAdapter, PrometheusDependencyAdapter
│ ├── tempo/ # TempoAdapter
│ ├── grafana/ # GrafanaAdapter
│ ├── argocd/ # ArgoCDAdapter
│ ├── pagerduty/ # PagerDutyAlertAdapter, HistoryAdapter, NotifierAdapter
│ ├── notification/ # SlackAdapter
│ ├── config/ # WebClientConfig, ServiceProperties
│ └── mock/ # MockAdaptersConfig (@ConditionalOnMissingBean)
├── application/
│ ├── controller/ # 8 REST controllers
│ └── webhook/ # PagerDuty webhook receiver
└── shell/ # Spring Shell CLI (OnCallCommands)
Each agent follows a Goal-Oriented Action Planning pattern:
- User Input → parsed into context (service name, alert ID, etc.)
- Planning → GOAP engine determines which actions to execute based on the goal
- Data Fetch → actions call ports (Prometheus, Loki, Tempo, ArgoCD, PagerDuty) in parallel
- LLM Synthesis → collected data is sent to the LLM with a persona-specific prompt
- Structured Output → LLM produces a typed report (e.g., `TriageReport`, `ServiceHealthReport`)
- Formatting → domain formatters render the report as markdown
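
The planning step in the list above can be illustrated with a toy planner. This is a minimal sketch of goal-oriented action planning in plain Java, not Embabel's API: the `Action` record, the fact names, and the breadth-first search are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical action: runnable once its preconditions hold, then adds its effects. */
record Action(String name, Set<String> preconditions, Set<String> effects) {}

/** Toy planner: breadth-first search over sets of known facts. */
class ToyGoapPlanner {

    /** A search node: the facts known so far and the actions taken to reach them. */
    private record Node(Set<String> facts, List<String> plan) {}

    /**
     * Returns a shortest action sequence that makes the goal fact true.
     * Returns an empty list when the goal already holds or cannot be reached.
     */
    static List<String> plan(Set<String> start, String goal, List<Action> actions) {
        Deque<Node> queue = new ArrayDeque<>();
        Set<Set<String>> visited = new HashSet<>();
        queue.add(new Node(start, List.of()));
        visited.add(start);
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            if (node.facts().contains(goal)) {
                return node.plan();
            }
            for (Action action : actions) {
                if (!node.facts().containsAll(action.preconditions())) {
                    continue; // preconditions not met yet
                }
                Set<String> next = new HashSet<>(node.facts());
                next.addAll(action.effects());
                if (visited.add(next)) { // only explore fact sets not seen before
                    List<String> plan = new ArrayList<>(node.plan());
                    plan.add(action.name());
                    queue.add(new Node(next, plan));
                }
            }
        }
        return List.of();
    }
}
```

Given three hypothetical actions (`fetchMetrics` and `fetchLogs` each requiring `serviceName`, and `synthesize` requiring both `metrics` and `logs`), calling `ToyGoapPlanner.plan(Set.of("serviceName"), "report", actions)` yields the two fetches followed by `synthesize`. A real GOAP engine typically adds action costs and typed world state, which this sketch leaves out.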
UserInput("alert-api")
→ FetchMetrics (Prometheus) → MetricsSnapshot
→ FetchDependencies (Prometheus) → List<DependencyStatus>
→ FetchSLOBudget (Prometheus) → SLOSnapshot
→ FetchAnnotations (Grafana) → List<String>
→ LLM Synthesis (Senior SRE persona)
→ ServiceHealthReport
→ HealthCardFormatter → Markdown output
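
The flow above fans out to several ports and joins the results before synthesis. The following is a minimal sketch of that fan-out using the JDK's `CompletableFuture`. The records are simplified stand-ins (their fields are assumptions), `HealthInputs` and `HealthFanOut` are hypothetical names, and plain suppliers take the place of the port interfaces. Since the project is built on WebFlux, the actual code may well use reactive types instead.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Simplified stand-ins; the project's real records have their own fields.
record MetricsSnapshot(double errorRate) {}
record DependencyStatus(String name, boolean healthy) {}
record SLOSnapshot(double budgetRemaining) {}

/** Hypothetical holder for everything handed to the LLM synthesis step. */
record HealthInputs(MetricsSnapshot metrics,
                    List<DependencyStatus> dependencies,
                    SLOSnapshot slo,
                    List<String> annotations) {}

class HealthFanOut {

    /** Starts all four fetches concurrently and joins them into one input bundle. */
    static HealthInputs gather(Supplier<MetricsSnapshot> metrics,
                               Supplier<List<DependencyStatus>> dependencies,
                               Supplier<SLOSnapshot> slo,
                               Supplier<List<String>> annotations) {
        var m = CompletableFuture.supplyAsync(metrics);
        var d = CompletableFuture.supplyAsync(dependencies);
        var s = CompletableFuture.supplyAsync(slo);
        var a = CompletableFuture.supplyAsync(annotations);
        // join() blocks until each fetch completes; a failed fetch
        // surfaces here as a CompletionException.
        return new HealthInputs(m.join(), d.join(), s.join(), a.join());
    }
}
```

The bundle returned by `gather` corresponds to the data that the synthesis step would receive along with the persona prompt.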
Each adapter is activated via @ConditionalOnProperty:
app:
  services:
    prometheus:
      enabled: true   # Activates PrometheusAdapter + PrometheusDependencyAdapter
    loki:
      enabled: false  # Falls back to MockAdaptersConfig

When no real adapter is active, `MockAdaptersConfig` provides `@ConditionalOnMissingBean` fallbacks with realistic dummy data.
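
The fallback behavior can be pictured without Spring. The sketch below is plain Java and only mimics the effect of pairing `@ConditionalOnProperty` with `@ConditionalOnMissingBean`. The single-method `LogSearchProvider`, both implementations, and the `select` helper are hypothetical simplifications, and the property key assumes the YAML nesting shown above.

```java
import java.util.List;
import java.util.Map;

/** Simplified port; the real LogSearchProvider interface may look different. */
interface LogSearchProvider {
    List<String> errorLogs(String service);
}

/** Stand-in for a real adapter that would query Loki over HTTP. */
class LokiAdapterStub implements LogSearchProvider {
    @Override
    public List<String> errorLogs(String service) {
        throw new UnsupportedOperationException("no Loki call in this sketch");
    }
}

/** Stand-in for the mock fallback: canned data, no network access. */
class MockLogSearchProvider implements LogSearchProvider {
    @Override
    public List<String> errorLogs(String service) {
        return List.of("[mock] " + service + ": sample error");
    }
}

class AdapterSelection {

    /** Picks the real adapter only when its enabled flag is "true", else the mock. */
    static LogSearchProvider select(Map<String, String> properties) {
        boolean enabled = Boolean.parseBoolean(
                properties.getOrDefault("app.services.loki.enabled", "false"));
        return enabled ? new LokiAdapterStub() : new MockLogSearchProvider();
    }
}
```

Because callers only see the port interface, they behave the same whether the real adapter or the mock was selected.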
| Persona | Used By | Focus |
|---|---|---|
| Senior SRE | UC-1, UC-3, UC-4, UC-5, UC-10 | Root cause identification, MTTR reduction |
| Log Analyst | UC-2 | Error pattern clustering, signal vs noise |
| Incident Commander | UC-9 | Blameless post-mortems, action items |
| SRE Manager | UC-8 | Alert noise reduction, on-call toil |
Alert: AlertContext, AlertSummary, AlertHistorySnapshot, IncidentAssessment, TriageReport
Metrics: MetricsSnapshot, MetricsWindow, SLOSnapshot, SLISnapshot
Logs: LogCluster, LogAnalysisReport, NewPattern
Deploy: DeploySnapshot, DeployDetail, RollbackHistory, MetricChange, NewErrorSummary, DeployCorrelation, DeployImpactReport
Health: ServiceHealthReport, DependencyStatus, Risk
Trace: CallChainStep, TraceAnalysisReport, BottleneckInfo, CascadeImpact
Fatigue: AlertFatigueReport, NoisyRule, DuplicateGroup, TuningRecommendation
Post-Mortem: PostMortemDraft, TimelineEntry, ActionItem, ImpactSummary
SLO: SLOReport, BurnContributor
Enums: IncidentSeverity, HealthStatus, AlertStatus, RollbackDecision, SLOStatus, Trend, Confidence, FindingCategory
| Service | Port Interface | Adapter | What It Provides |
|---|---|---|---|
| Prometheus | MetricsProvider, DependencyGraphProvider | PrometheusAdapter, PrometheusDependencyAdapter | Error rates, latency percentiles, throughput, CPU/memory, SLO budgets, dependency graph |
| Loki | LogSearchProvider | LokiAdapter | Error log clusters, exception grouping, stack traces |
| Tempo | TraceProvider | TempoAdapter | Distributed traces, call chains, bottleneck detection |
| ArgoCD | DeployHistoryProvider | ArgoCDAdapter | Deployment history, commit diffs, rollback info |
| Grafana | DashboardProvider | GrafanaAdapter | Dashboard annotations, incident markers |
| PagerDuty | AlertProvider, AlertHistoryProvider, AlertNotifier | PagerDutyAlertAdapter, PagerDutyHistoryAdapter, PagerDutyNotifierAdapter | Alert context, incident history, automated notes |
| Slack | SlackNotifier | SlackAdapter | Incident notifications |
src/test/java/ # 29 unit tests
src/integration-test/java/ # 9 integration tests
src/e2e-test/java/ # 9 E2E tests
src/testFixtures/java/ # 9 shared fixture factories
| Type | Count | Framework | What It Tests |
|---|---|---|---|
| Unit | 29 | Mockito + AssertJ | Agents (mocked ports), controllers (WebTestClient), adapters (WireMock) |
| Integration | 9 | EmbabelMockitoIntegrationTest | Full GOAP chain with mocked ports |
| E2E | 9 | Live stack | Real Prometheus/Loki/Tempo + WireMock PagerDuty/Slack |
| Architecture | 1 | ArchUnit | Hexagonal layer dependency rules |
./gradlew test # Unit tests
./gradlew integrationTest # Integration tests
./gradlew e2eTest # E2E tests (requires live stack)
./gradlew check # All tests (unit + integration)
./gradlew spotlessApply # Fix code formatting

Shared test fixtures are located in `src/testFixtures/java/com/stablebridge/oncall/fixtures/` and follow a factory pattern:
// Usage in tests
var alert = AlertFixtures.anAlertContext();
var metrics = MetricsFixtures.aMetricsSnapshot();
var deploy = DeployFixtures.aDeploySnapshot();

| Variable | Default | Description |
|---|---|---|
| `PROMETHEUS_URL` | `http://localhost:9090` | Prometheus server |
| `LOKI_URL` | `http://localhost:3100` | Loki log aggregator |
| `TEMPO_URL` | `http://localhost:3200` | Tempo trace backend |
| `GRAFANA_URL` | `http://localhost:3000` | Grafana dashboards |
| `GRAFANA_API_KEY` | (none) | Grafana API key |
| `ARGOCD_URL` | `http://localhost:8080` | ArgoCD server |
| `ARGOCD_AUTH_TOKEN` | (none) | ArgoCD auth token |
| `PAGERDUTY_API_KEY` | (none) | PagerDuty REST API key |
| `SLACK_WEBHOOK_URL` | (none) | Slack incoming webhook |
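
Each URL variable falls back to the default listed in the table when it is unset. The following is a minimal sketch of that lookup in plain Java. `EnvDefaults` is a hypothetical helper, and treating a blank value as unset is an assumption; the project itself presumably binds these variables through Spring configuration.

```java
import java.util.Map;

class EnvDefaults {

    /** Returns the value from env, or the given fallback when it is unset or blank. */
    static String resolve(Map<String, String> env, String name, String fallback) {
        String value = env.get(name);
        return (value == null || value.isBlank()) ? fallback : value;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        // Defaults taken from the table above.
        System.out.println(resolve(env, "PROMETHEUS_URL", "http://localhost:9090"));
        System.out.println(resolve(env, "LOKI_URL", "http://localhost:3100"));
    }
}
```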
Default is Ollama (local, free). To switch providers, change the dependency in build.gradle.kts:
// Ollama (default — local, free)
implementation("com.embabel.agent:embabel-agent-starter-ollama:$embabelVersion")
// Anthropic Claude
implementation("com.embabel.agent:embabel-agent-starter-anthropic:$embabelVersion")
// OpenAI
implementation("com.embabel.agent:embabel-agent-starter-openai:$embabelVersion")

And update application.yml:

embabel:
  models:
    default-llm: claude-sonnet-4-5 # or gpt-4o, llama3.2:latest

Available Anthropic models: claude-sonnet-4-5, claude-opus-4-1, claude-haiku-4-5
| Profile | Purpose |
|---|---|
| (default) | Mock adapters, Ollama LLM |
| `e2e` | All adapters enabled, pointed at live stack |
cd docker && docker compose up -d

Starts WireMock containers for:

- ArgoCD (`:8100`): deploy history, revision metadata
- PagerDuty (`:8101`): incidents, log entries, notes
- Slack (`:8102`): webhook receiver
# Start E2E stack (WireMock mocks)
scripts/e2e-stack.sh up
# Run failure scenario tests
scripts/e2e-failure-tests.sh
# Tear down
scripts/e2e-stack.sh down

Enforced by ArchUnit. The rules are validated when the architecture test runs (ArchUnit inspects compiled classes at test time, not during compilation):

- Domain isolation: the `domain` package must not depend on `infrastructure`, `agent`, `application`, or `shell`
- No framework in domain: domain must not use Spring Web, WebClient, or Spring annotations
- Layer boundaries: infrastructure must not depend on the agent layer
- Domain purity: only records, enums, exceptions, port interfaces, and pure services
- Notification isolation: notification adapters must not depend on domain services
- Agent boundaries: agents must not directly call notification ports
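
To show what a violation of the first rule looks like, here is a hand-rolled illustration in plain Java. It is not ArchUnit code and not how the project implements the check: `DomainIsolationCheck` and its input (a map from class name to the types that class imports) are hypothetical, and only the package names come from the project layout.

```java
import java.util.List;
import java.util.Map;

class DomainIsolationCheck {

    private static final String DOMAIN = "com.stablebridge.oncall.domain.";

    /** Packages the domain must not depend on, per the domain isolation rule. */
    private static final List<String> FORBIDDEN = List.of(
            "com.stablebridge.oncall.infrastructure.",
            "com.stablebridge.oncall.agent.",
            "com.stablebridge.oncall.application.",
            "com.stablebridge.oncall.shell.");

    /** Returns "class -> import" for every domain class importing a forbidden package. */
    static List<String> violations(Map<String, List<String>> importsByClass) {
        return importsByClass.entrySet().stream()
                .filter(e -> e.getKey().startsWith(DOMAIN))
                .flatMap(e -> e.getValue().stream()
                        .filter(imp -> FORBIDDEN.stream().anyMatch(imp::startsWith))
                        .map(imp -> e.getKey() + " -> " + imp))
                .sorted()
                .toList();
    }
}
```

An empty result means the rule holds for the given classes. ArchUnit performs this kind of check against compiled bytecode, so it also catches dependencies that never appear as import statements.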
make help # Show all available targets
make test # Run unit tests
make integration-test # Run integration tests (real services)
make check # Run all tests and checks
make format # Fix code formatting (Spotless)
make clean # Clean build artifacts
make run # Run with mock adapters
make run-live # Run against live price-alert stack

This project is licensed under the MIT License.