Author: Prateek Chaudhary
Website: https://kakveda.com
Open‑source, event‑driven platform that gives LLM systems a memory of failures, runtime "this failed before" warnings, and a system‑level health view.
Kakveda sits around LLM runtimes and observability tools and adds something most systems lack: failure memory. Instead of treating failures as logs, it treats them as first‑class entities that can be remembered, matched, warned against, and analyzed over time.
This repository provides a complete, production‑adjacent, single‑node implementation designed for local use, demos, and learning — with a clear path to future enterprise extensions.
| Document | Description |
|---|---|
| docs/architecture.md | Architecture and event flow |
| docs/concepts.md | Core concepts (failures, patterns, fingerprints) |
| docs/failure-intelligence.md | What "failure intelligence" means |
| docs/COMPARISON.md | Kakveda vs Datadog, LangSmith, MLflow, etc. |
| docs/netra-host-install.md | Install and run kakveda-netra on any host |
| TROUBLESHOOTING.md | Common issues and solutions |
- AI/LLM systems fail in recurring ways, but failures are mostly stored as logs, not reusable knowledge.
- Teams get post-incident visibility, but weak pre-incident prevention.
- Multi-agent and host-level observability is fragmented across tools and environments.
- Root-cause and remediation context gets lost across runs, teams, and projects.
- Converts failures into a persistent Failure Knowledge Base (GFKB).
- Performs pattern detection and emits pre-flight warnings to prevent repeat failures.
- Adds unified host + observability telemetry via `kakveda-netra`.
- Gives one dashboard for warning history, traces, infra signals, and reliability indicators.
- Keeps deployment self-hostable and governance-safe for teams needing data control.
- Positions Kakveda as a single platform for infra monitoring, observability, and LLM/AI/agent monitoring, where many market offerings still require multiple separate tools.
Platform baseline (delivered):
- Failure Knowledge Base (GFKB), recurring pattern detection, pre-flight warning flow.
- Event-driven architecture with trace ingestion, classifier, pattern detector, and health scoring.
- Dashboard for runs, warnings, evaluations, prompts, experiments, and feedback.
Observability + host coverage (strengthened):
- `kakveda-netra` host agent with full infra payload groups: CPU, memory, disk, network, process, file descriptors, system, load, temperature.
- Docker container metrics with diagnostics (`docker_error`, socket diagnostics).
- Kubernetes inventory/metadata collection (nodes, pods, deployments, services, configmaps, secrets).
- Observability views: golden signals, SLO/error budget, inferred service map, synthetic checks, incident timeline, forecast summary, correlation summary.
- Detail pages + chart fallback rendering for environments where CDN chart scripts are blocked.
- Dashboard-driven Netra runtime controls (observability toggle and config sync).
Additional observability + APM capabilities:
- Realtime service map UX upgrades: zoom/pan/fit/hover + topology density filters + demo mode.
- Realtime service map page with dependency edges and environment filtering.
- APM error tracking with grouped exceptions, workflow states, and replay context.
- Continuous profiler view (method hotspots), version comparison, trace-to-profile drill-down.
- Dynamic instrumentation controls (dashboard-managed runtime rules, no restart flow).
- Instrumentation execution feedback timeline (agent-applied/failed/skipped ack).
- Database monitoring (DBM): slow query hotspots, query fingerprints, wait/event insights, explain-plan payload support.
- RUM (Real User Monitoring): frontend/web activity, LCP/FID/CLS, JS error visibility, RUM monitors + alerts.
- Cross-telemetry correlation page: joins trace with RUM, infra snapshots, observability snapshots, DB samples, APM errors, and security signals.
- APM monitors page: metric/trace/anomaly watchdog monitors with auto-generated defaults and alert lifecycle.
Kakveda uses one native host agent (kakveda-netra) to capture infra + observability + container + cluster signals and push them directly into kakveda-v1.0.
This keeps integration simple:
- install Netra on host,
- provide dashboard API key,
- start agent (foreground, background, or systemd),
- data appears in `/infra` and `/observability`.
Note: Highlights only. For the full matrix, see docs/COMPARISON.md.
| Capability / Feature | Kakveda | LangSmith | MLflow | Arize AI | Weights & Biases | APM (Datadog/AppD) |
|---|---|---|---|---|---|---|
| Open Source | ✅ Yes (Apache 2.0) | ❌ No | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Self-hosted | ✅ Yes | ❌ No | ✅ Yes | ❌ No | ❌ No | |
| Playground | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No |
| LLM Tracing | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | | |
| Failure Knowledge Base (Memory) | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Pre-flight Warnings | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Health Score Over Time | ✅ Yes | ❌ No | ❌ No | ✅ Yes | ❌ No | ✅ Infra only |
| Warnings Dashboard + Filters | ✅ Yes | ❌ No | ❌ No | ❌ No | | |
Pricing snapshot date: March 1, 2026. Values are public starting prices (or published billing models) and can change by region, volume, and contract.
| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda + Netra (OSS) | $0 license/mo | Self-hosted |
| Datadog Infrastructure Pro | $15 (annual) / $18 (month-to-month) | per host |
| Splunk AppDynamics Infrastructure Edition | $6 | per vCPU |
| Dynatrace Infrastructure Monitoring | $29 | per host |
| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda + Netra (OSS) | $0 license/mo | Self-hosted |
| Datadog APM (Standard) | $31 | per host |
| Datadog Log Management | $0.10 | per GB ingested |
| Logz.io Infrastructure Monitoring | ~$12.00 | per 1000 time-series/mo (from $0.40/day) |
| Logz.io Log Management | ~$27.60 | per GB/mo (from $0.92/day) |
| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda (OSS) | $0 license/mo | Self-hosted |
| LangSmith Plus | $39 | per seat (+ usage) |
| Weights & Biases Team | $50 | per user |
| Arize AI | Contact sales | custom |
| MLflow OSS | $0 license/mo | Self-hosted |
| Tool | Infra | Observability/APM | AI/LLM/Agent Monitoring | Public pricing visibility |
|---|---|---|---|---|
| Kakveda v1.0 + Netra | $0 license/mo | $0 license/mo | $0 license/mo | OSS self-host |
| Datadog | $15-$18 per host/mo | APM $31/host/mo; Logs $0.10/GB | LLM observability add-on model | Public list pricing |
| Splunk AppDynamics | $6 per vCPU/mo | APM bundles from $33/vCPU/mo | Enterprise packaging path | Public starting tiers |
| Logz.io | ~$12/mo equivalent entry | Logs/traces usage-based (published daily rates) | Agentic observability usage pricing | Public usage pricing |
| LangSmith | N/A | N/A | $39/seat/mo (+ usage) | Public |
| Arize | N/A | Product observability by plan | AX Pro $50/mo; OSS option exists | Public tiers/plan pages |
| Azure Monitor | Usage-based | Usage-based per GB/retention | Integration-led via Azure stack | No single global flat monthly number |
| MLflow OSS | N/A | N/A | $0 license/mo | OSS |
Detailed matrix and notes: docs/COMPARISON.md
Pricing sources:
- Datadog: https://www.datadoghq.com/pricing/list/
- Splunk AppDynamics / Observability: https://www.splunk.com/en_us/products/pricing/it-operations.html
- Dynatrace: https://www.dynatrace.com/pricing/
- Logz.io: https://logz.io/pricing/
- LangSmith: https://www.langchain.com/pricing
- Weights & Biases: https://wandb.ai/site/pricing
- Arize AI: https://arize.com/pricing/
- Azure Monitor: https://azure.microsoft.com/pricing/details/monitor/
- MLflow: https://mlflow.org/
What many teams still do not operationalize well:
- failure recurrence memory as a first-class data model,
- pre-flight prevention signals before repeat incidents,
- single-stack visibility across infra + observability + AI/LLM/agent runtime behavior.
Kakveda’s unique combination:
- durable failure knowledge base + warning-policy feedback loop,
- one native host-side agent (`kakveda-netra`) for infra + container + k8s + observability push,
- self-hosted, governance-safe deployment with low setup friction.
- Start Kakveda: `docker compose up -d --build`
- Install Netra on the host and provide the dashboard API key.
- Run Netra (foreground/background/systemd).
- Verify signals in `/infra`, `/observability`, and `/observability/service-map`.
- Stores failures in a Global Failure Knowledge Base (GFKB)
- Detects repeated and recurring failure patterns across runs
- Provides pre‑flight warnings when an execution matches a past failure
- Computes a system health score over time
- Offers a full dashboard with scenarios, traces, datasets, evaluations, prompts, and experiments
- Runs locally with Docker Compose in one command
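To make the "health score over time" idea concrete, here is a toy illustration. This is not Kakveda's actual scoring formula — just a sketch of the concept that recent runs should weigh more than older ones:

```python
def health_score(runs: list[bool]) -> float:
    """Toy health score: weighted fraction of successful runs.

    `runs` is ordered oldest-first; newer runs get linearly larger weights,
    so a recent failure drags the score down more than an old one.
    """
    if not runs:
        return 1.0  # no data: assume healthy
    weights = range(1, len(runs) + 1)
    total = sum(weights)
    return sum(w for ok, w in zip(runs, weights) if ok) / total
```

For example, a recent failure (`[True, False]`) scores lower than an old one (`[False, True]`), even though both histories contain one failure each.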
- Failure as data: Failures are stored, versioned, and matched — not just logged.
- Event‑driven flow: Each service reacts to events (trace ingested → failure detected → pattern updated).
- Deterministic demo: Ollama is optional; a deterministic stub keeps the system runnable everywhere.
- Separation of concerns: Each capability runs as its own microservice.
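"Failure as data" implies failures must be matchable across runs. A common approach — shown here as an illustrative sketch, not Kakveda's actual matching algorithm — is to normalize volatile details out of the error message and hash the result into a stable fingerprint:

```python
import hashlib
import re

def failure_fingerprint(error_type: str, message: str) -> str:
    """Hash a failure into a stable ID so recurring failures match across runs."""
    # Strip volatile details (hex ids, numbers) so "timed out after 30s"
    # and "timed out after 45s" produce the same fingerprint.
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "<n>", message.lower()).strip()
    return hashlib.sha256(f"{error_type}:{normalized}".encode()).hexdigest()[:16]
```

Two timeouts with different durations then collapse to one fingerprint, which is what lets a knowledge base count recurrences instead of storing unrelated log lines.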
Note: the diagram below is pipeline-centric. The dashboard is both (a) the UI entrypoint that triggers scenario runs and (b) the consumer/visualizer for warnings, runs, and health.
Scenario Runner
│
▼
Warning Policy ◀───────────┐
│ │
▼ │
Model (Ollama / Stub) │
│ │
▼ │
Trace Ingestion ──▶ Event Bus ──▶ Failure Classifier
│
▼
Global Failure KB
│
▼
Pattern Detector
│
▼
Health Scoring
│
▼
Dashboard
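The services above communicate over an HTTP event bus, but the flow is easier to see in-process. This sketch is illustrative only (`TinyEventBus` is not the real event-bus API); it mirrors the trace → classifier → failure-detected path from the diagram:

```python
from collections import defaultdict

class TinyEventBus:
    """Minimal in-process pub/sub: handlers subscribe to event types."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subs[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subs[event_type]:
            handler(payload)

# Wiring that mirrors the diagram: an ingested trace triggers classification,
# which may emit a failure event consumed downstream (here, a simple list).
bus = TinyEventBus()
failures = []

def classify(trace):
    if trace.get("is_failure"):
        bus.publish("failure.detected", trace)

bus.subscribe("trace.ingested", classify)
bus.subscribe("failure.detected", failures.append)

bus.publish("trace.ingested", {"run_id": "r1", "is_failure": True})
bus.publish("trace.ingested", {"run_id": "r2", "is_failure": False})
```

Each real service plays the role of one subscriber here, reacting to events rather than being called directly — which is why new capabilities (pattern detection, health scoring) can be added without touching the ingestion path.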
| Service | Purpose |
|---|---|
| event-bus | Demo HTTP pub/sub for events |
| ingestion | Receives traces and publishes events |
| gfkb | Global Failure Knowledge Base (failures + patterns) |
| failure-classifier | Detects failures from traces |
| pattern-detector | Maintains recurring failure patterns |
| warning-policy | Pre‑flight "this failed before" warnings |
| health-scoring | Computes health timeline |
| dashboard | UI, auth, RBAC, analytics, scenario runner |
| ollama (optional) | Local LLM runtime |
- Home overview with recent warnings
- Scenario runner with warning integration
- Warning history and analytics
- Runs & traces with nested spans and timelines
- Feedback on runs
- Datasets and examples
- Evaluations with aggregate metrics
- Prompt library with versioning
- Experiments (grouping runs)
- Playground UI
- Login / register / forgot / reset password flows
- Cookie‑based JWT sessions
- Role‑based access control: admin / operator / viewer
- Admin UI for user management and role assignment
- CSRF protection for browser forms
- Security headers (CSP, X‑Frame‑Options, etc.)
- JWT revocation (Redis‑backed when configured)
- Rate limiting (in‑memory demo, Redis optional)
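As an illustration of the in-memory rate limiting mentioned above (a toy sketch, not the dashboard's actual implementation), a sliding-window limiter can be as small as:

```python
import time
from collections import defaultdict, deque

class InMemoryRateLimiter:
    """Allow at most `limit` calls per `window` seconds per key (sliding window)."""
    def __init__(self, limit=5, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

The Redis-backed variant replaces the per-process `dict` with shared keys so limits hold across dashboard replicas.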
⚠️ This is a production‑adjacent demo.
- Docker + Docker Compose (V2 recommended)
Option 1: Using CLI (Recommended)

```bash
git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
pip install -e .
kakveda up
```

Option 2: Using Docker Compose directly

```bash
git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
docker-compose up -d
```

Optional: start the companion kakveda-kids-agent demo service (only if you have ../kakveda-kids-agent present):

```bash
docker-compose --profile kids up -d --build
```

Open the dashboard: http://localhost:8110
```bash
kakveda init                        # Interactive .env setup
kakveda up                          # Start all services
kakveda down                        # Stop all services
kakveda status                      # Show running services and URLs
kakveda logs                        # Show logs (all services)
kakveda logs dashboard --tail 50    # Show specific service logs
kakveda reset                       # Full reset (stops + clears data)
kakveda doctor                      # Diagnose system issues
kakveda version                     # Show version info
```

💡 Having issues? See TROUBLESHOOTING.md for common problems and solutions.
- admin@local / admin123 (admin)
- operator@kakveda.local / Operator@123 (operator)
- viewer@kakveda.local / Viewer@123 (viewer)
⚠️ Security warning:
- The default admin is for first-time setup only. If your browser blocks `admin@local` as an invalid email, use `admin@kakveda.local` (same password: `admin123`).
- You must change the admin password immediately after setup!
- For production, create a new admin and disable or delete the default.
Kakveda supports connecting external AI agents for centralized observability, tracing, and failure intelligence. Follow this step-by-step guide to integrate your custom agent.
You can integrate any agent framework (LangChain, LangGraph, custom Python, etc.) with minimal code using `kakveda_sdk`.
```python
from kakveda_sdk import KakvedaAgent

agent = KakvedaAgent(capabilities=["my_tool"])

result = agent.execute(
    prompt=user_input,
    tool_name="my_tool",
    execute_fn=my_tool_fn,
    metadata={"user_id": "123"},
)
```

Minimum env vars:

```bash
KAKVEDA_WARN_URL=http://warning-policy:8105/warn
KAKVEDA_EVENT_BUS_URL=http://event-bus:8100/publish
DASHBOARD_URL=http://dashboard:8110
DASHBOARD_API_KEY=<your-api-key>
AGENT_NAME=my-agent
AGENT_APP_ID=my-agent
AGENT_VERSION=1.0.0
```

If you want the agent visible in the dashboard with heartbeats, ensure `/health` is exposed (see examples/langchain-agent-demo/agent_app.py).
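For a custom agent, those variables can be loaded into a single config object. This is an illustrative sketch (the SDK reads its own configuration; `AgentConfig` here is a hypothetical name), with defaults matching the in-network service names above:

```python
import os
from dataclasses import dataclass

@dataclass
class AgentConfig:
    warn_url: str
    event_bus_url: str
    dashboard_url: str
    api_key: str
    agent_name: str
    app_id: str
    version: str

    @classmethod
    def from_env(cls) -> "AgentConfig":
        # Defaults mirror the Docker-network service names used in the README.
        return cls(
            warn_url=os.environ.get("KAKVEDA_WARN_URL", "http://warning-policy:8105/warn"),
            event_bus_url=os.environ.get("KAKVEDA_EVENT_BUS_URL", "http://event-bus:8100/publish"),
            dashboard_url=os.environ.get("DASHBOARD_URL", "http://dashboard:8110"),
            api_key=os.environ.get("DASHBOARD_API_KEY", ""),
            agent_name=os.environ.get("AGENT_NAME", "my-agent"),
            app_id=os.environ.get("AGENT_APP_ID", "my-agent"),
            version=os.environ.get("AGENT_VERSION", "1.0.0"),
        )
```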
```bash
git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
docker-compose up -d
```

This starts the following services:
| Service | Port | URL |
|---|---|---|
| Event Bus | 8100 | http://localhost:8100 |
| Dashboard | 8110 | http://localhost:8110 |
| Ollama LLM | 11434 | http://localhost:11434 |
```bash
# Check all services
docker ps

# Check Dashboard
curl http://localhost:8110

# Find Kakveda's Docker network
docker network ls | grep kakveda
```

Output example:

```
abc123   kakveda-v10_default   bridge   local
```
Example using our Kids Education Agent:
```bash
cd ..
git clone https://github.com/prateekdevisingh/kakveda-kids-agent.git
cd kakveda-kids-agent
docker build -t kakveda-kids .
```

Generic format:
```bash
docker run -d \
  --name <your-agent-name> \
  --network <kakveda-docker-network> \
  -p <host-port>:<container-port> \
  -e OLLAMA_URL=http://ollama:11434 \
  -e EVENT_BUS_URL=http://event-bus:8100 \
  -e DASHBOARD_URL=http://dashboard:8110 \
  -e DASHBOARD_API_KEY=<your-api-key> \
  <your-docker-image>
```

Example (Kids Education Agent):
```bash
docker run -d \
  --name kakveda-kids-agent \
  --network kakveda-v10_default \
  -p 8122:8120 \
  -e OLLAMA_URL=http://ollama:11434 \
  -e EVENT_BUS_URL=http://event-bus:8100 \
  -e DASHBOARD_URL=http://dashboard:8110 \
  -e DASHBOARD_API_KEY=your-api-key \
  kakveda-kids
```

Parameter reference:
| Parameter | Placeholder | Description |
|---|---|---|
| `--name` | `<your-agent-name>` | Unique name for your container |
| `--network` | `<kakveda-docker-network>` | Kakveda's network (find via `docker network ls \| grep kakveda`) |
| `-p` | `<host-port>:<container-port>` | Port mapping (e.g., 8122:8120) |
| `-e OLLAMA_URL` | `http://ollama:11434` | LLM service (use service name, not localhost) |
| `-e EVENT_BUS_URL` | `http://event-bus:8100` | Traces go here for failure intelligence |
| `-e DASHBOARD_URL` | `http://dashboard:8110` | For agent auto-registration |
| `-e DASHBOARD_API_KEY` | `<your-api-key>` | Get from Dashboard → Admin → API Keys |
| Image | `<your-docker-image>` | Your built Docker image name |
```bash
# Health check
curl http://localhost:8122/health

# Ask a question
curl -X POST http://localhost:8122/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "tell me about birds", "child_name": "Arya"}'
```

- Open http://localhost:8110
- Go to Runs → See your agent's traces
- Go to Agents → See registered agents
- Go to Playground → Select your agent from dropdown and test
Kakveda does not require any external agent to run. The core stack (dashboard, event-bus, ingestion, etc.) works standalone.
If you add a new agent:
- Prefer running it as a separate container (or as an optional Compose profile).
- Don't make the agent build mandatory when its source folder isn't present; that breaks fresh installs.
- Use a unique host port to avoid conflicts (e.g., don't reuse `8120`/`8122` if something is already bound).
- When running inside the Docker network, use service DNS names like `http://event-bus:8100` and `http://dashboard:8110` (not `localhost`).
If you do add an agent into docker-compose.yml, wrap it behind a profile:
```yaml
my-agent:
  profiles: ["agents"]
  build: ../my-agent
  environment:
    - EVENT_BUS_URL=http://event-bus:8100
  ports:
    - "8125:8120"
```

Then start it only when needed:

```bash
docker-compose --profile agents up -d --build
```

For your agent to fully integrate with Kakveda, implement these endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check (return `{"status": "healthy"}`) |
| `/api/ask` | POST | Main query endpoint |
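The bundled examples use FastAPI, but the contract is framework-agnostic. As a dependency-free sketch using only Python's standard library (handler names and the echo response are illustrative — adapt to your framework):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def _send(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            self._send(200, {"status": "healthy"})  # dashboard heartbeat target
        else:
            self._send(404, {"error": "not found"})

    def do_POST(self):
        if self.path == "/api/ask":
            length = int(self.headers.get("Content-Length", 0))
            data = json.loads(self.rfile.read(length) or b"{}")
            # Placeholder answer; a real agent would call its LLM/tools here.
            self._send(200, {"answer": f"echo: {data.get('question', '')}"})
        else:
            self._send(404, {"error": "not found"})

    def log_message(self, *args):
        pass  # keep demo output quiet

def serve(port=8120):
    """Start the agent HTTP server on a background thread; returns the server."""
    server = HTTPServer(("127.0.0.1", port), AgentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```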
Send traces to the Event Bus:

```python
import uuid

import httpx

async def send_trace(question: str, answer: str, latency: float):
    async with httpx.AsyncClient() as client:  # context manager closes the client
        await client.post(
            f"{EVENT_BUS_URL}/publish",
            json={
                "event": {
                    "event_type": "trace.ingested",
                    "run_id": str(uuid.uuid4()),
                    "scenario_name": "your-agent-name",
                    "input": question,
                    "output": answer,
                    "latency_ms": latency,
                    "is_failure": False,
                }
            },
        )
```

Auto-register with the Dashboard (optional):
```python
@app.on_event("startup")
async def register_with_kakveda():
    async with httpx.AsyncClient() as client:
        await client.post(
            f"{DASHBOARD_URL}/api/agents/register",
            json={
                "name": "your-agent-name",
                "base_url": "http://your-agent:port",
                "description": "Your agent description",
                "capabilities": ["capability1", "capability2"],
            },
        )
```
)To enable password reset emails, set these environment variables (in .env):
```bash
SMTP_HOST=smtp.yourorg.com
SMTP_PORT=587
SMTP_USER=youruser
SMTP_PASS=yourpassword
SMTP_FROM=noreply@yourorg.com
SMTP_TLS=true
```
If SMTP is not set, password reset links will be shown in the UI (for dev/testing only).
- If Ollama is running, the dashboard will call it for generation.
- If not available, Kakveda automatically falls back to a deterministic stub response.
This keeps demos reproducible and dependency‑free.
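The fallback behavior can be pictured as follows — an illustrative sketch, not the dashboard's actual code (`generate_with_fallback` and the stub text are hypothetical):

```python
def generate_with_fallback(prompt, ollama_generate=None):
    """Call the real model if available; otherwise return a deterministic stub."""
    if ollama_generate is not None:
        try:
            return ollama_generate(prompt)
        except Exception:
            pass  # Ollama unreachable or errored: fall through to the stub
    # Deterministic stub: the same prompt always yields the same response,
    # which keeps demo runs and tests reproducible without a local LLM.
    return f"[stub] deterministic response for: {prompt}"
```

Because the stub is a pure function of the prompt, scenario runs produce identical traces on every machine, with or without Ollama installed.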
Key environment variables:
- `KAKVEDA_ENV` – dev / production
- `DASHBOARD_DB_URL` – SQLite (default) or Postgres
- `KAKVEDA_REDIS_URL` – optional Redis for revocation & rate limits
- `KAKVEDA_OTEL_ENABLED` – enable OpenTelemetry export
Configuration is explicit and environment‑driven.
- Use Docker Compose (same as Quick Start) for a clean, reproducible stack.
- Default mode uses SQLite and a deterministic model stub (works everywhere).
If you prefer a guided setup, use the built-in CLI to generate a .env file and start the stack.
```bash
python -m kakveda_cli.cli init
python -m kakveda_cli.cli up
```

Useful CLI commands:
```bash
python -m kakveda_cli.cli status
python -m kakveda_cli.cli down
python -m kakveda_cli.cli reset
```

Before running tests, stop the Docker stack to avoid port/resource conflicts (and to make test runs deterministic):

```bash
python -m kakveda_cli.cli down
```

Run unit tests:

```bash
pytest -q
```

Optional: bring the stack back up after tests:

```bash
python -m kakveda_cli.cli up
```

- Keep the default stub model for deterministic demos.
- Use the built-in demo accounts.
- Use the dashboard scenario runner to generate runs/warnings quickly.
This repo is built for single-node demos, but supports production-adjacent toggles:
- Use Postgres by setting `DASHBOARD_DB_URL`
- Use Redis by setting `KAKVEDA_REDIS_URL` (revocation + rate limiting)
- Enable OpenTelemetry export with `KAKVEDA_OTEL_ENABLED`
An example compose file is provided in docker-compose.prod.yml.
This repo IS:
- A complete, runnable system
- Suitable for learning, experimentation, and local use
- A reference architecture for failure‑intelligent LLM systems
This repo is NOT:
- A fully hardened enterprise deployment
- A multi‑cluster or HA setup
- A compliance‑certified system
| Login | Register | Forgot Password |
|---|---|---|
| ![]() | ![]() | ![]() |
| Dashboard Overview | Dashboard Footer |
|---|---|
| ![]() | ![]() |
| Scenarios | Run View | Warnings |
|---|---|---|
| ![]() | ![]() | ![]() |
| Playground | Experiments | Datasets |
|---|---|---|
| ![]() | ![]() | ![]() |
| Prompts | Admin RBAC |
|---|---|
| ![]() | ![]() |
This repo includes clean, spec-friendly drawings under docs/figures/:
Fig. 1 — Pipeline-centric architecture for failure-intelligence
Fig. 2 — Example data model for failure entities and pattern entities
Fig. 3 — Pre-flight matching and policy decision flow
- Pluggable event bus implementations
- Pluggable storage backends
- Advanced evaluation plugins
- Improved pattern detection strategies
- Enterprise extensions (separate distribution)
Contributions are welcome!
Please read CONTRIBUTING.md.
Please see SECURITY.md for vulnerability reporting and security notes.
This project is licensed under the Apache License 2.0 (see LICENSE).
Kakveda aims to become a failure‑intelligence layer that complements existing LLM runtimes and observability stacks by adding what they lack most: memory and prevention of past failures.
The open-source core is designed to remain transparent, usable, and self-hostable. Future commercial offerings, if any, may focus on scale, operational hardening, and compliance-oriented features, while keeping the core concepts openly accessible.
Intellectual Property Note: The project is released as open source. Certain aspects of the underlying concepts may be the subject of patent filings.
Copyright 2026 Prateek Chaudhary, Built in India 🇮🇳












