IncidentIQ is an SRE Assistant Chatbot, a Slack-style Site Reliability Engineering assistant built with Google's Agent Development Kit (ADK). It provides an SRE-oriented chat interface for incident triage, reliability reviews, AWS operations, AWS cost analysis, and Kubernetes operations.
The project started as an MVP Slack bot and has been extended into a more complete local operations assistant with multiple model providers, a background service runner, ADK Web UI testing, health checks, and read-only infrastructure tooling.
SRE AssistaBot lets an engineer ask operational questions in Slack threads or through the ADK Web UI. The root SRE agent can answer general reliability questions directly and, when full model mode is enabled, delegate specialized requests to sub-agents.
Primary workflows:
- Create incident briefs from messy incident reports.
- Generate first 15-minute incident response plans.
- Review system designs from an SRE perspective.
- Recommend SLIs, SLOs, alerts, dashboards, and runbook steps.
- Analyze AWS cost and usage patterns when AWS credentials are configured.
- Inspect AWS infrastructure when AWS credentials are configured.
- Inspect Kubernetes cluster state through read-only kubectl tools.
- Search local runbooks and past incidents for citation-backed guidance.
- Classify alerts for page, ticket, or dedupe routing.
- Keep Slack thread context through ADK sessions.
This project uses several AI concepts, not just a basic chatbot wrapper.
Main AI Concepts Used:
- Large Language Models: The bot uses LLMs to understand natural-language SRE questions and generate structured operational responses. Supported providers include Ollama, Amazon Bedrock, Google Gemini, and Anthropic Claude.
- Agentic AI: The system is built with Google ADK, so the bot is modeled as an agent that can reason about a user request, decide whether to answer directly, or delegate to a specialized sub-agent.
- Multi-Agent Architecture: The root SRE agent can delegate to specialized agents:
  - AWS Cost agent
  - AWS Core operations agent
  - Kubernetes operations agent
- Tool Use / Function Calling: The agents can use tools such as AWS Cost Explorer helpers, AWS infrastructure checks, and read-only Kubernetes kubectl tools to gather operational evidence.
- Retrieval-Augmented Generation: The full-model path can search a local runbook and past-incident knowledge base before answering reliability, incident, and alerting questions. Retrieved sources include citation IDs such as [RB-001] and confidence scores.
- Natural Language Understanding: Users can ask questions in normal SRE language, such as "create an incident brief" or "review this system design," and the bot maps that request into structured SRE output.
- Prompt Engineering: The project uses system prompts to shape the bot's behavior, tone, safety rules, response format, delegation rules, and SRE-specific reasoning style.
- Retrieval of Live Operational Context: When tools are enabled, the bot can retrieve live or configured infrastructure data from AWS or Kubernetes instead of only relying on static model knowledge.
- Session Memory / Conversational Context: ADK sessions let the bot maintain conversation context across messages, especially inside Slack threads.
- Reasoning and Decision Support: The bot helps with incident triage, root-cause hypotheses, rollback criteria, risk assessment, SLO thinking, and reliability reviews.
- Human-in-the-Loop Safety: The prompts and tool design emphasize read-only checks first and ask for confirmation before risky or production-impacting actions.
- Alert Classification And Deflection: A deterministic alert-intelligence helper classifies alert severity, recommends page/ticket/dedupe routing, matches known issues, and measures pager-noise reduction in evals.
In short: this project combines LLMs, agentic AI, multi-agent delegation, tool use, RAG, prompt engineering, alert classification, and conversational memory to create an SRE-focused operational assistant.
Implemented and tested:
- Slack bot integration using Slack Socket Mode.
- ADK API server for programmatic sessions and /run calls.
- ADK Web UI for local browser-based testing.
- Root SRE orchestrator agent.
- AWS Cost sub-agent.
- AWS Core operations sub-agent.
- Kubernetes operations sub-agent.
- Local runbook and past-incident RAG tool with citations and confidence.
- Alert escalation, known-issue matching, and pager-noise deflection helper.
- Multiple model providers:
  - Ollama local demo mode.
  - Amazon Bedrock.
  - Google Gemini.
  - Anthropic Claude.
- Local background process manager for day-to-day use.
- Optional Windows login task for auto-start.
- Docker Compose stack for containerized local development.
- Health checks, request logging, response chunking, Slack event de-dupe, and safer environment handling.
- Tests and linting.
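Two of the items above, response chunking and Slack event de-dupe, can be illustrated with a short sketch. This is not the code in slack_bot/; the function and class names, the 3000-character limit, and the 1000-event memory are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def chunk_text(text: str, limit: int = 3000) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers to break at the last newline inside the window and falls back
    to a hard split when no newline is available.
    """
    chunks = []
    remaining = text
    while len(remaining) > limit:
        cut = remaining.rfind("\n", 0, limit)
        if cut <= 0:
            cut = limit  # no usable newline: hard split
        chunks.append(remaining[:cut])
        remaining = remaining[cut:].lstrip("\n")
    if remaining:
        chunks.append(remaining)
    return chunks


class SeenEvents:
    """Bounded memory of recently seen event IDs (hypothetical helper)."""

    def __init__(self, max_size: int = 1000) -> None:
        self._seen = OrderedDict()
        self._max_size = max_size

    def is_duplicate(self, event_id: str) -> bool:
        """Return True if the ID was seen before; otherwise remember it."""
        if event_id in self._seen:
            return True
        self._seen[event_id] = None
        if len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # evict the oldest ID
        return False
```

A bounded memory keeps the de-dupe check cheap for a long-running listener, at the cost of forgetting very old event IDs.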
Validation at completion:
.\.venv\Scripts\python.exe -m pytest tests -q
.\.venv\Scripts\python.exe -m ruff check .
Latest local result:
77 passed, 1 skipped
All checks passed
The original MVP scope remains intact:
- Slack bot -> ADK API -> SRE agent flow still works.
- ADK Web UI and API testing are still available.
- AWS Cost, AWS Core, and Kubernetes sub-agents are still present.
- Docker Compose support, health checks, environment files, and dev tooling are still supported.
The newer resume-metric features are additive:
- RAG over local runbooks and past incidents.
- Citation and confidence scoring.
- Alert severity classification, dedupe keys, known-issue matching, and deflection scoring.
- TTFT and benchmark probes.
- Expanded eval sets for SRE response quality, RAG retrieval, and alert routing.
Provider behavior matters:
- Ollama demo mode keeps the bot stable by answering directly with the root SRE prompt and avoiding sub-agent/tool delegation.
- Bedrock, Gemini, and Claude full-model modes enable richer ADK behavior, including sub-agent delegation and root-agent tools such as RAG and alert classification.
The project includes lightweight eval and benchmark harnesses in evals/.
The SRE quality harness (evals.run_sre_eval) calls the running local ADK API and measures:
- non-streaming /run response latency
- p50 and p95 latency
- deterministic SRE quality score
- pass rate across canned SRE scenarios
- missing required SRE concepts per answer
- unsupported live-claim rate, such as claiming logs or metrics were checked when no tool result was provided
- hallucination proxy rate, currently equal to unsupported live-claim rate
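As a rough illustration of two of the measurements listed above, the sketch below computes a latency percentile and flags an unsupported live claim. It is not taken from evals/: the nearest-rank percentile method, the regex patterns, and all names are assumptions, and the real harness may score answers differently.

```python
import math
import re

# Hypothetical phrases that suggest the answer claims to have looked at live data.
LIVE_CLAIM_PATTERNS = [
    r"\bI (?:checked|queried|inspected) the (?:logs|metrics|dashboards?)\b",
    r"\bthe (?:logs|metrics) show\b",
]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def has_unsupported_live_claim(answer: str, tool_results_provided: bool) -> bool:
    """True when the answer claims live evidence but no tool result backed it."""
    if tool_results_provided:
        return False
    return any(
        re.search(pattern, answer, flags=re.IGNORECASE)
        for pattern in LIVE_CLAIM_PATTERNS
    )
```

The unsupported live-claim rate would then be the share of answers for which `has_unsupported_live_claim` returns True.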
Run the full benchmark:
.\.venv\Scripts\python.exe -m evals.run_sre_eval --api-url http://localhost:8001
Run a smaller smoke benchmark:
.\.venv\Scripts\python.exe -m evals.run_sre_eval --limit 3
Results are written to:
evals/results/
Those generated result files are ignored by Git.
Latest local smoke benchmark:
Cases: 3
Successful API calls: 3
Pass rate: 100.0%
Average quality score: 0.885
Average latency: 24.236s
P50 latency: 24.745s
P95 latency: 28.339s
Unsupported live-claim rate: 0.0%
Hallucination proxy rate: 0.0%
This benchmark was run against the currently reachable local API. To benchmark a specific provider, restart the bot with that provider first, then run the eval:
.\start-assistabot.ps1 -Provider bedrock -Restart
.\.venv\Scripts\python.exe -m evals.run_sre_eval --limit 3
Token-level time-to-first-token is measured through ADK's streaming /run_sse endpoint:
.\.venv\Scripts\python.exe -m evals.run_ttft_probe --api-url http://localhost:8001
Latest local TTFT probe:
Status: success
First streamed text / TTFT: 15.031s
Total response time: 24.497s
Events observed: 253
TTFT depends heavily on the active provider, model size, local machine, and whether Ollama/Bedrock/Gemini/Claude is being used.
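The timing logic behind a TTFT probe can be sketched independently of the transport. The function below is an assumption-laden illustration, not the code in evals.run_ttft_probe: it takes any iterable of text chunks rather than parsing /run_sse events, and the injectable clock parameter exists only to make the sketch testable.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a stream of text chunks.

    Returns seconds until the first non-empty chunk (TTFT), total seconds
    until the stream ended, and the number of chunks observed.
    """
    start = clock()
    ttft: Optional[float] = None
    events = 0
    for chunk in chunks:
        events += 1
        if ttft is None and chunk.strip():
            ttft = clock() - start  # first visible text arrived
    total = clock() - start
    return {"ttft_s": ttft, "total_s": total, "events": events}
```

Ignoring empty chunks matters because a stream can emit keep-alive or metadata events before any text; counting those would understate TTFT.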
The repository includes an expanded local knowledge base under
agents/sre_agent/knowledge_base/documents/ with 14 runbooks and 10
past-incident notes. The root agent can use search_knowledge_base in full
model mode to cite retrieved sources such as [RB-001] and [PI-001].
Run the RAG benchmark:
.\.venv\Scripts\python.exe -m evals.run_rag_eval
Latest local RAG benchmark:
Cases: 30
Hit@1: 96.7%
Hit@3: 100.0%
Hit@5: 100.0%
MRR: 0.983
Citation precision@3: 0.544
Citation precision@5: 0.333
Average top relevant confidence: 0.986
These are real measurements over the included demo/anonymized corpus. They should not be presented as production RAG metrics until the corpus is replaced or augmented with actual sanitized team runbooks and incident records.
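For readers unfamiliar with the retrieval metrics reported above, here is a minimal sketch of the textbook definitions of Hit@K, reciprocal rank (averaged over queries to get MRR), and precision@K. Whether evals.run_rag_eval computes citation precision exactly this way is an assumption; the function names are illustrative.

```python
def hit_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> bool:
    """True if any relevant document appears in the top k results."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_reciprocal_rank(per_query: list[float]) -> float:
    """Average of per-query reciprocal ranks."""
    return sum(per_query) / len(per_query) if per_query else 0.0
```

Under these definitions, precision@K falls as K grows whenever each query has only one or two relevant documents, which is consistent with precision@5 being lower than precision@3 in the results above.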
The alert eval calls the deterministic alert classification helper and measures page/ticket/dedupe routing quality.
.\.venv\Scripts\python.exe -m evals.run_alert_eval
Latest local alert benchmark:
Cases: 30
Page decision accuracy: 100.0%
Severity accuracy: 100.0%
Known-issue hit rate: 100.0%
Alert deflection rate: 60.0%
False escalation rate: 0.0%
Missed page rate: 0.0%
PagerNoise reduction: 60.0%
This is a controlled benchmark over 30 representative alert scenarios, not a claim from live PagerDuty production traffic.
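To make the idea of deterministic alert routing concrete, the following sketch classifies an alert with fixed rules and computes a deflection rate. Everything in it is hypothetical: the Alert fields, the 5% error-rate threshold, the severity labels, and the definition of deflection as "alerts that did not page" are assumptions, not the logic in alert_intelligence.py.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    customer_impact: bool
    error_rate: float  # fraction of failing requests, 0.0 to 1.0


def dedupe_key(alert: Alert) -> str:
    """Stable key so repeats of the same alert on the same service collapse."""
    raw = f"{alert.service}:{alert.name}".lower()
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


def classify(alert: Alert, open_keys: set[str]) -> dict:
    """Return severity, route, and dedupe key using fixed illustrative rules."""
    key = dedupe_key(alert)
    if alert.customer_impact and alert.error_rate >= 0.05:
        severity = "SEV1"
    elif alert.customer_impact or alert.error_rate >= 0.05:
        severity = "SEV2"
    else:
        severity = "SEV3"
    if key in open_keys:
        route = "dedupe"  # already being handled; do not page again
    elif severity == "SEV1":
        route = "page"
    else:
        route = "ticket"
    return {"severity": severity, "route": route, "dedupe_key": key}


def deflection_rate(routes: list[str]) -> float:
    """Share of alerts that did not result in a page."""
    if not routes:
        return 0.0
    return sum(1 for route in routes if route != "page") / len(routes)
```

Because the rules contain no model call, the same alert always yields the same route, which is what makes accuracy figures over a fixed scenario set reproducible.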
Implemented and measured locally:
- Slack-based SRE assistant.
- Structured incident briefs and first-response plans.
- RAG over local runbooks and past incidents.
- Citations and confidence scores.
- Eval harnesses for SRE quality, RAG, alert routing, and TTFT.
- Hallucination proxy based on unsupported live-data claims.
- Guardrails for read-only-first investigation and risky-action confirmation.
- Severity classification, dedupe keys, known-issue detection, and alert deflection metrics.
Not claimed as production evidence yet:
- Real production PagerDuty alert deflection.
- Real production PagerNoise reduction.
- Real team/user time saved per incident.
- A 250-scenario benchmark.
- Production RAG quality over actual company runbooks and incident records.
The repo now has the structure needed for larger resume claims, but the wording must match the evidence:
- Current evidence: 30 RAG queries, 30 alert scenarios, 20 SRE quality prompts, 14 runbooks, and 10 past-incident notes.
- To claim "250 incident scenarios", add at least 250 JSONL cases across sre_eval_scenarios.jsonl, rag_eval_queries.jsonl, and alert_eval_scenarios.jsonl, then run the evals and cite the measured output.
- To claim production RAG quality, replace or augment the demo corpus with sanitized real runbooks and real past incidents.
- To claim real alert deflection or PagerNoise reduction, compare the classifier against historical alert/page data, not only synthetic benchmark scenarios.
Slack / ADK Web UI / API clients
        |
        v
FastAPI ADK server: agents/sre_agent/serve.py
        |
        v
Root SRE agent: agents/sre_agent/agent.py
        |
        +-- local knowledge-base tools
        |     +-- runbook / past-incident retrieval with citations
        |     +-- alert severity, known-issue, and deflection scoring
        |
        +-- aws_cost_agent
        |     +-- Cost Explorer tools
        |     +-- service, account, tag, monthly, trend, and optimization analysis
        |
        +-- aws_core_agent
        |     +-- AWS account, EC2, S3, RDS, region, and connectivity checks
        |
        +-- kubernetes_agent
              +-- read-only kubectl tools for contexts, nodes, pods,
                  deployments, services, pod logs, and cluster summary
The local knowledge base is stored as Markdown documents with lightweight metadata:
agents/sre_agent/knowledge_base/documents/
RB-001...RB-014 Runbooks
PI-001...PI-010 Past incident notes
Root-agent tools:
agents/sre_agent/tools/knowledge_base.py Lexical retrieval, citations, confidence
agents/sre_agent/tools/alert_intelligence.py Severity, route, dedupe, known issue scoring
The retriever is deterministic and local. It does not call an external vector database. That keeps the project easy to run for a class/demo environment while still making RAG behavior measurable through Hit@K, MRR, citation precision, and confidence.
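A deterministic lexical retriever of this kind can be sketched in a few lines. The version below scores documents by token overlap with the query and returns citation IDs with a confidence between 0 and 1. It is only an illustration: the scoring formula, the function names, and the sample documents are assumptions, and knowledge_base.py may rank and score differently.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query: str, documents: dict[str, str], top_k: int = 3) -> list[dict]:
    """Rank documents by the share of query tokens they contain.

    Ties are broken by citation ID so results are fully deterministic.
    """
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    results = []
    for doc_id, text in documents.items():
        overlap = len(query_tokens & tokenize(text))
        if overlap:
            results.append(
                {
                    "citation": f"[{doc_id}]",
                    "confidence": round(overlap / len(query_tokens), 3),
                }
            )
    results.sort(key=lambda item: (-item["confidence"], item["citation"]))
    return results[:top_k]


# Made-up sample corpus for illustration only.
SAMPLE_DOCS = {
    "RB-001": "Checkout 5xx errors: check recent deploys and roll back.",
    "PI-001": "Past incident: database failover caused latency.",
}
```

With no embeddings or external index, the same query always returns the same ranked citations, which keeps Hit@K and MRR measurements repeatable across runs.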
Main directories:
agents/sre_agent/
  agent.py                     Root ADK SRE agent and delegation logic
  serve.py                     FastAPI ADK API server with health checks
  settings.py                  Session DB configuration
  utils.py                     Model selection and shared helpers
  tools/                       Root-agent RAG and alert-intelligence tools
  knowledge_base/documents/    Local runbooks and past incidents
  aws_auth/                    Optional role-based AWS auth layer
  sub_agents/
    aws_cost/                  AWS cost analysis sub-agent
    aws_core/                  AWS infrastructure operations sub-agent
    kubernetes/                Kubernetes operations sub-agent
slack_bot/
  main.py                      Slack Socket Mode and Events API integration
  modules/health.py            Slack listener health check
tests/                         Unit tests for auth, tools, providers, and k8s
The project supports four provider paths. Provider selection can be automatic, but local scripts let you force the provider explicitly.
Ollama mode is the free local demo path.
Use it when:
- You do not want paid API usage.
- You want to demo Slack integration locally.
- You want basic SRE guidance without cloud billing.
Important behavior:
- Uses a local model through Ollama.
- Defaults to qwen2.5:1.5b.
- Enables SRE_OLLAMA_SIMPLE_MODE=true.
- Disables active sub-agent handoffs for stability with small local models.
- Still answers as an SRE assistant, but does not actively query AWS or Kubernetes tools in simple mode.
Start Ollama mode:
.\start-assistabot.ps1 -Provider ollama -Restart
Bedrock mode is the AWS-native paid model path used for stronger responses and full ADK delegation.
Use it when:
- You want better model quality than local Ollama.
- You want sub-agent delegation enabled.
- You are comfortable with AWS Bedrock usage charges.
Recommended smoke-test model:
BEDROCK_MODEL_ID=amazon.nova-micro-v1:0
BEDROCK_REGION=us-east-1
Authentication options:
# Option A: Bedrock API key
BEDROCK_API_KEY=your_bedrock_api_key_here
# Option B: official AWS bearer-token env var
AWS_BEARER_TOKEN_BEDROCK=your_bedrock_api_key_here
# Option C: normal AWS credentials/profile
AWS_PROFILE=your_aws_profile
AWS_REGION=us-east-1
Start Bedrock mode:
.\start-assistabot.ps1 -Provider bedrock -Restart
Bedrock calls are billable AWS usage. Use short prompts while testing.
Gemini is supported through Google AI Studio API keys.
GOOGLE_API_KEY=your_google_api_key_here
GOOGLE_AI_MODEL=gemini-2.0-flash
Start Gemini mode:
.\start-assistabot.ps1 -Provider google -Restart
Claude is supported through LiteLLM.
ANTHROPIC_API_KEY=your_anthropic_api_key_here
ANTHROPIC_MODEL=claude-3-5-sonnet-20240620
Start Claude mode:
.\start-assistabot.ps1 -Provider anthropic -Restart
If MODEL_PROVIDER is not forced, the code checks providers in this order:
- Google Gemini
- Anthropic Claude
- Amazon Bedrock
The local scripts are preferred because they make the selected provider
explicit and avoid confusion when multiple keys exist in agents/.env.
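The selection order described above can be sketched as a small function. This is an illustration rather than the logic in utils.py: which environment variable signals each provider, the provider names returned, and the None result when nothing is configured are all assumptions.

```python
from typing import Mapping, Optional


def select_provider(env: Mapping[str, str]) -> Optional[str]:
    """Pick a provider: a forced MODEL_PROVIDER wins, then the documented order."""
    forced = env.get("MODEL_PROVIDER", "").strip().lower()
    if forced:
        return forced
    if env.get("GOOGLE_API_KEY"):
        return "google"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    # Assumption: either Bedrock key variable counts as "Bedrock configured".
    if env.get("BEDROCK_API_KEY") or env.get("AWS_BEARER_TOKEN_BEDROCK"):
        return "bedrock"
    return None  # assumption: the caller decides what to do with no provider
```

Passing `os.environ` as `env` would apply the check to the current process. The sketch also shows why leftover keys cause surprises: a stale GOOGLE_API_KEY would win over a freshly added Bedrock key unless MODEL_PROVIDER is set.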
| Scenario | Recommended Provider | Suggested Model | Why |
|---|---|---|---|
| Free local demo, no cloud spend | Ollama | qwen2.5:1.5b | Runs locally and is enough to prove Slack integration, background services, and basic SRE response behavior. |
| Best low-cost AWS smoke test | Amazon Bedrock | amazon.nova-micro-v1:0 | Confirms Bedrock auth, billing, and ADK provider wiring without starting with a larger model. |
| AWS-native project demo | Amazon Bedrock | Nova or Claude model available in Bedrock | Keeps the model path inside AWS and enables full sub-agent delegation for a stronger demo. |
| Fast general SRE assistance | Google Gemini | gemini-2.0-flash | Good default for quick operational guidance and lower-latency interactions. |
| Strong incident/reliability reasoning | Anthropic Claude | claude-3-5-sonnet-20240620 or newer available model | Best fit for detailed incident command, tradeoff analysis, and design reviews. |
| Full production-style behavior | Bedrock, Gemini, or Claude | Stronger hosted model | Hosted models handle ADK delegation more reliably than the small local Ollama demo model. |
python -m venv .venv
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -r agents\sre_agent\requirements.txt
.\.venv\Scripts\python.exe -m pip install -r slack_bot\requirements.txt
.\.venv\Scripts\python.exe -m pip install -r requirements-dev.txt
Copy-Item agents\env.example agents\.env
Copy-Item slack_bot\env.example slack_bot\.env
Edit:
- agents/.env for model provider, AWS, and Kubernetes settings.
- slack_bot/.env for Slack tokens.
Do not commit .env files.
Create a Slack app at:
https://api.slack.com/apps
Required bot scopes:
app_mentions:read
channels:history
channels:join
chat:write
chat:write.public
im:history
im:read
im:write
Enable Socket Mode and create an app-level token with:
connections:write
Set these values in slack_bot/.env:
SLACK_BOT_TOKEN=xoxb-your-token
SLACK_SIGNING_SECRET=your-signing-secret
SLACK_APP_TOKEN=xapp-your-app-token
The local Slack runner forces:
SLACK_SOCKET_MODE=true
SRE_AGENT_API_URL=http://localhost:8001
Socket Mode avoids needing ngrok or a public Events API URL for local testing.
Start the Slack bot and ADK API in the background:
.\start-assistabot.ps1 -Provider bedrock -Restart
or:
.\start-assistabot.ps1 -Provider ollama -Restart
Check status:
.\status-assistabot.ps1
Stop all background services:
.\stop-assistabot.ps1
Tail logs:
Get-Content .runtime\logs\agent.out.log -Wait
Get-Content .runtime\logs\slack.out.log -Wait
The ADK Web UI is optional. It lets you test the agent directly in the browser without Slack.
.\start-assistabot.ps1 -Provider bedrock -WithWeb -Restart
Then open:
http://localhost:8000/dev-ui/
Install the Windows scheduled task:
.\install-assistabot-login-task.ps1 -Provider bedrock -WithWeb
If Windows returns Access is denied, run PowerShell as Administrator and retry.
Remove the login task:
.\uninstall-assistabot-login-task.ps1
Create a session:
curl -X POST http://localhost:8001/apps/sre_agent/users/u_123/sessions/s_123 \
  -H "Content-Type: application/json" \
  -d '{"state": {"source": "manual-test"}}'
Send a message:
curl -X POST http://localhost:8001/run \
  -H "Content-Type: application/json" \
  -d '{
    "app_name": "sre_agent",
    "user_id": "u_123",
    "session_id": "s_123",
    "new_message": {
      "role": "user",
      "parts": [{"text": "Create an incident brief for checkout 5xx errors."}]
    }
  }'
Health checks:
http://localhost:8001/health
http://localhost:8001/health/readiness
http://localhost:8001/health/liveness
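The same two API calls can be prepared from Python with only the standard library, which may be convenient on Windows where the curl quoting above does not work unchanged in PowerShell. The sketch mirrors the URLs and JSON bodies shown in the curl examples; the helper names are assumptions, and the functions only build the requests without sending them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"
APP_NAME = "sre_agent"
JSON_HEADERS = {"Content-Type": "application/json"}


def session_request(user_id: str, session_id: str, state: dict) -> urllib.request.Request:
    """Build the POST that creates a session with an initial state."""
    url = f"{BASE_URL}/apps/{APP_NAME}/users/{user_id}/sessions/{session_id}"
    body = json.dumps({"state": state}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=JSON_HEADERS, method="POST")


def run_request(user_id: str, session_id: str, text: str) -> urllib.request.Request:
    """Build the POST to /run that sends one user message."""
    payload = {
        "app_name": APP_NAME,
        "user_id": user_id,
        "session_id": session_id,
        "new_message": {"role": "user", "parts": [{"text": text}]},
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/run", data=body, headers=JSON_HEADERS, method="POST"
    )
```

To send either request while the local API is running, pass it to `urllib.request.urlopen(request, timeout=60)` and read the response body; the timeout value here is arbitrary.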
Docker Compose remains available for containerized local development.
Services:
- sre-bot-web: ADK Web UI on port 8000.
- sre-bot-api: ADK API server on port 8001.
- slack-bot: Slack integration service on port 8002.
- postgres: session database.
Start the stack:
docker compose up --build
Start selected services:
docker compose up sre-bot-web
docker compose up sre-bot-api slack-bot
View logs:
docker compose logs -f sre-bot-api slack-bot
The Windows PowerShell scripts are the recommended path for the current local demo because they support provider switching, background execution, and local SQLite sessions without requiring Docker Desktop.
The Kubernetes sub-agent is read-only. It shells out to kubectl with structured
arguments and never runs mutating commands.
Tools include:
- current context
- available contexts
- nodes and readiness
- pods by namespace or label selector
- deployments and replica health
- services
- pod logs with tail limits
- cluster summary
Configuration:
KUBE_CONTEXT=your_kube_context
# Optional:
KUBE_NAMESPACE=default
KUBECTL_PATH=C:\path\to\kubectl.exe
If kubectl is missing, the tools return an actionable error instead of
crashing the agent.
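One way to enforce the read-only guarantee and the missing-kubectl behavior is an allowlist in front of subprocess. The sketch below is not the sub-agent's actual code: the allowlisted verbs, function names, return shape, and 15-second timeout are assumptions chosen for illustration.

```python
import shutil
import subprocess
from typing import Optional

# Assumption: a small allowlist of verbs that never modify cluster state.
READ_ONLY_VERBS = {"get", "describe", "logs"}


def build_kubectl_args(
    verb: str,
    *args: str,
    context: Optional[str] = None,
    namespace: Optional[str] = None,
    kubectl_path: str = "kubectl",
) -> list[str]:
    """Build an argument list, rejecting any verb outside the allowlist."""
    if verb not in READ_ONLY_VERBS:
        raise ValueError(f"verb {verb!r} is not allowed; read-only verbs only")
    cmd = [kubectl_path]
    if context:
        cmd += ["--context", context]
    if namespace:
        cmd += ["--namespace", namespace]
    cmd += [verb, *args]
    return cmd


def run_kubectl(cmd: list[str], timeout: int = 15) -> dict:
    """Run the command, returning an error dict instead of raising if kubectl is absent."""
    if shutil.which(cmd[0]) is None:
        return {
            "ok": False,
            "error": "kubectl not found; install it or set KUBECTL_PATH",
        }
    proc = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=False
    )
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

Passing arguments as a list rather than a shell string means user-supplied values such as namespaces or label selectors are never interpreted by a shell.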
The AWS sub-agents are available when full model mode is enabled and AWS credentials are configured.
AWS Core can help with:
- caller identity
- AWS connectivity checks
- region discovery
- EC2, S3, and RDS summaries
- account-level operational review
AWS Cost can help with:
- monthly totals
- current month-to-date cost
- previous month cost
- last N months trend
- spend by service
- spend by tag
- spend by linked account
- most expensive linked account
- daily averages
- step-change and trend summaries
- cost optimization recommendations
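As an example of what a step-change summary could look like, the sketch below works on a plain list of monthly totals and flags months whose spend moved sharply against the previous month. It does not call Cost Explorer and is not the sub-agent's implementation; the function names and the 20% threshold are assumptions.

```python
from typing import Optional


def month_over_month(totals: list[float]) -> list[Optional[float]]:
    """Percent change between consecutive months; None when the prior month is zero."""
    changes: list[Optional[float]] = []
    for previous, current in zip(totals, totals[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append((current - previous) / previous * 100.0)
    return changes


def step_changes(totals: list[float], threshold_pct: float = 20.0) -> list[int]:
    """Indices of months whose spend moved by at least threshold_pct versus the prior month."""
    flagged = []
    for index, change in enumerate(month_over_month(totals), start=1):
        if change is not None and abs(change) >= threshold_pct:
            flagged.append(index)
    return flagged
```

For monthly totals of 100, 105, 150, and 148, only the third month (index 2) is flagged, since it is the only one that moved by more than 20% against its predecessor.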
Configuration:
AWS_PROFILE=your_aws_profile
AWS_REGION=us-east-1
Optional role-based auth:
AWS_AUTH_ENABLE_CACHING=true
AWS_AUTH_DEFAULT_REGION=us-east-1
AWS_AUTH_DEFAULT_ROLE_ARN=arn:aws:iam::123456789012:role/SRERole
AWS_AUTH_DEFAULT_ACCOUNT_ID=123456789012
AWS_AUTH_DEFAULT_SESSION_NAME=SREBotSession
agents/env.example template for agent model/AWS/Kubernetes settings
agents/.env local agent secrets and settings, not committed
slack_bot/env.example template for Slack settings
slack_bot/.env local Slack secrets, not committed
Important .gitignore protections:
- .env
- **/.env
- .venv/
- .runtime/
- *.log
- *.db
- postgres_data/
Run tests:
.\.venv\Scripts\python.exe -m pytest tests -q
Run lint:
.\.venv\Scripts\python.exe -m ruff check .
Format:
.\.venv\Scripts\python.exe -m ruff format .
Pre-commit:
.\.venv\Scripts\pre-commit.exe run --all-files
CI is configured in:
.github/workflows/ci.yml
Check local service status:
.\status-assistabot.ps1
Restart with a specific provider:
.\start-assistabot.ps1 -Provider bedrock -Restart
.\start-assistabot.ps1 -Provider ollama -Restart
Common issues:
- Slack does not reply:
  - Run .\status-assistabot.ps1.
  - Check .runtime\logs\slack.out.log.
  - Verify SLACK_BOT_TOKEN and SLACK_APP_TOKEN.
- API returns 500:
  - Check .runtime\logs\agent.out.log.
  - Verify the selected model provider credentials.
  - If using Ollama, confirm Ollama is running on localhost:11434.
- Bedrock fails:
  - Verify BEDROCK_API_KEY or AWS credentials.
  - Verify BEDROCK_MODEL_ID and BEDROCK_REGION.
  - Remember that Bedrock usage is billable.
- Kubernetes tools fail:
  - Run kubectl config current-context.
  - Verify KUBE_CONTEXT.
  - Set KUBECTL_PATH if kubectl is not on PATH.
- Never commit .env files.
- Rotate any key that was accidentally exposed during development.
- Use least-privilege AWS and Kubernetes credentials.
- Prefer read-only AWS/IAM/Kubernetes permissions for demos.
- Bedrock, Gemini, and Claude API calls may incur provider charges.
- Review logs before sharing them; application logs may contain operational details.






















