Author: Prateek Chaudhary
Website: https://kakveda.com
Open‑source, event‑driven platform that gives LLM systems a memory of failures, runtime "this failed before" warnings, and a system‑level health view.
Kakveda sits around LLM runtimes and observability tools and adds something most systems lack: failure memory. Instead of treating failures as logs, it treats them as first‑class entities that can be remembered, matched, warned against, and analyzed over time.
This repository provides a complete, production‑adjacent, single‑node implementation designed for local use, demos, and learning — with a clear path to future enterprise extensions.
| Document | Description |
|---|---|
| docs/architecture.md | Architecture and event flow |
| docs/concepts.md | Core concepts (failures, patterns, fingerprints) |
| docs/failure-intelligence.md | What "failure intelligence" means |
| docs/COMPARISON.md | Kakveda vs Datadog, LangSmith, MLflow, etc. |
| docs/netra-host-install.md | Install and run kakveda-netra on any host |
| TROUBLESHOOTING.md | Common issues and solutions |
- AI/LLM systems fail in recurring ways, but failures are mostly stored as logs, not reusable knowledge.
- Teams get post-incident visibility, but weak pre-incident prevention.
- Multi-agent and host-level observability is fragmented across tools and environments.
- Root-cause and remediation context gets lost across runs, teams, and projects.
- Converts failures into a persistent Failure Knowledge Base (GFKB).
- Performs pattern detection and emits pre-flight warnings to prevent repeat failures.
- Adds unified host + observability telemetry via `kakveda-netra`.
- Gives one dashboard for warning history, traces, infra signals, and reliability indicators.
- Keeps deployment self-hostable and governance-safe for teams needing data control.
- Positions Kakveda as a single platform for infra monitoring, observability, and LLM/AI/agent monitoring, where many market offerings still require multiple separate tools.
Platform baseline (delivered):
- Failure Knowledge Base (GFKB), recurring pattern detection, pre-flight warning flow.
- Event-driven architecture with trace ingestion, classifier, pattern detector, and health scoring.
- Dashboard for runs, warnings, evaluations, prompts, experiments, and feedback.
Observability + host coverage (strengthened):
- `kakveda-netra` host agent with full infra payload groups: CPU, memory, disk, network, process, file descriptors, system, load, temperature.
- Docker container metrics with diagnostics (`docker_error`, socket diagnostics).
- Kubernetes inventory/metadata collection (nodes, pods, deployments, services, configmaps, secrets).
- Observability views: golden signals, SLO/error budget, inferred service map, synthetic checks, incident timeline, forecast summary, correlation summary.
- Detail pages + chart fallback rendering for environments where CDN chart scripts are blocked.
- Dashboard-driven Netra runtime controls (observability toggle and config sync).
Additional observability + APM capabilities:
- Realtime service map UX upgrades: zoom/pan/fit/hover + topology density filters + demo mode.
- Realtime service map page with dependency edges and environment filtering.
- APM error tracking with grouped exceptions, workflow states, and replay context.
- Continuous profiler view (method hotspots), version comparison, trace-to-profile drill-down.
- Dynamic instrumentation controls (dashboard-managed runtime rules, no restart flow).
- Instrumentation execution feedback timeline (agent-applied/failed/skipped ack).
- Database monitoring (DBM): slow query hotspots, query fingerprints, wait/event insights, explain-plan payload support.
- RUM (Real User Monitoring): frontend/web activity, LCP/FID/CLS, JS error visibility, RUM monitors + alerts.
- Cross-telemetry correlation page: joins trace with RUM, infra snapshots, observability snapshots, DB samples, APM errors, and security signals.
- APM monitors page: metric/trace/anomaly watchdog monitors with auto-generated defaults and alert lifecycle.
Kakveda uses one native host agent (kakveda-netra) to capture infra + observability + container + cluster signals and push them directly into kakveda-v1.0.
This keeps integration simple:
- install Netra on host,
- provide dashboard API key,
- start agent (foreground, background, or systemd),
- data appears in `/infra` and `/observability`.
Note: Highlights only. For the full matrix, see docs/COMPARISON.md.
| Capability / Feature | Kakveda | LangSmith | MLflow | Arize AI | Weights & Biases | APM (Datadog/AppD) |
|---|---|---|---|---|---|---|
| Open Source | ✅ Yes (Apache 2.0) | ❌ No | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Self-hosted | ✅ Yes | ❌ No | ✅ Yes | ❌ No | ❌ No | |
| Playground | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No |
| LLM Tracing | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | | |
| Failure Knowledge Base (Memory) | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Pre-flight Warnings | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Health Score Over Time | ✅ Yes | ❌ No | ❌ No | ✅ Yes | ❌ No | ✅ Infra only |
| Warnings Dashboard + Filters | ✅ Yes | ❌ No | ❌ No | ❌ No | | |
Pricing snapshot date: March 1, 2026. Values are public starting prices (or published billing models) and can change by region, volume, and contract.
| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda + Netra (OSS) | $0 license/mo | Self-hosted |
| Datadog Infrastructure Pro | $15 (annual) / $18 (month-to-month) | per host |
| Splunk AppDynamics Infrastructure Edition | $6 | per vCPU |
| Dynatrace Infrastructure Monitoring | $29 | per host |
| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda + Netra (OSS) | $0 license/mo | Self-hosted |
| Datadog APM (Standard) | $31 | per host |
| Datadog Log Management | $0.10 | per GB ingested |
| Logz.io Infrastructure Monitoring | ~$12.00 | per 1000 time-series/mo (from $0.40/day) |
| Logz.io Log Management | ~$27.60 | per GB/mo (from $0.92/day) |
| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda (OSS) | $0 license/mo | Self-hosted |
| LangSmith Plus | $39 | per seat (+ usage) |
| Weights & Biases Team | $50 | per user |
| Arize AI | Contact sales | custom |
| MLflow OSS | $0 license/mo | Self-hosted |
| Tool | Infra | Observability/APM | AI/LLM/Agent Monitoring | Public pricing visibility |
|---|---|---|---|---|
| Kakveda v1.0 + Netra | $0 license/mo | $0 license/mo | $0 license/mo | OSS self-host |
| Datadog | $15-$18 per host/mo | APM $31/host/mo; Logs $0.10/GB | LLM observability add-on model | Public list pricing |
| Splunk AppDynamics | $6 per vCPU/mo | APM bundles from $33/vCPU/mo | Enterprise packaging path | Public starting tiers |
| Logz.io | ~$12/mo equivalent entry | Logs/traces usage-based (published daily rates) | Agentic observability usage pricing | Public usage pricing |
| LangSmith | N/A | N/A | $39/seat/mo (+ usage) | Public |
| Arize | N/A | Product observability by plan | AX Pro $50/mo; OSS option exists | Public tiers/plan pages |
| Azure Monitor | Usage-based | Usage-based per GB/retention | Integration-led via Azure stack | No single global flat monthly number |
| MLflow OSS | N/A | N/A | $0 license/mo | OSS |
Detailed matrix and notes: docs/COMPARISON.md
Pricing sources:
- Datadog: https://www.datadoghq.com/pricing/list/
- Splunk AppDynamics / Observability: https://www.splunk.com/en_us/products/pricing/it-operations.html
- Dynatrace: https://www.dynatrace.com/pricing/
- Logz.io: https://logz.io/pricing/
- LangSmith: https://www.langchain.com/pricing
- Weights & Biases: https://wandb.ai/site/pricing
- Arize AI: https://arize.com/pricing/
- Azure Monitor: https://azure.microsoft.com/pricing/details/monitor/
- MLflow: https://mlflow.org/
What many teams still do not operationalize well:
- failure recurrence memory as a first-class data model,
- pre-flight prevention signals before repeat incidents,
- single-stack visibility across infra + observability + AI/LLM/agent runtime behavior.
Kakveda’s unique combination:
- durable failure knowledge base + warning-policy feedback loop,
- one native host-side agent (`kakveda-netra`) for infra + container + k8s + observability push,
- self-hosted, governance-safe deployment with low setup friction.
- Start Kakveda: `docker compose up -d --build`
- Install Netra on the host and provide the dashboard API key.
- Run Netra (foreground/background/systemd).
- Verify signals in `/infra`, `/observability`, and `/observability/service-map`.
- Stores failures in a Global Failure Knowledge Base (GFKB)
- Detects repeated and recurring failure patterns across runs
- Provides pre‑flight warnings when an execution matches a past failure
- Computes a system health score over time
- Offers a full dashboard with scenarios, traces, datasets, evaluations, prompts, and experiments
- Runs locally with Docker Compose in one command
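To make the "health score over time" idea concrete, here is a toy illustration. This is not Kakveda's actual scoring formula — just a sketch of the concept that recent runs should weigh more than older ones:

```python
def health_score(runs: list[bool]) -> float:
    """Toy health score: weighted fraction of successful runs.

    `runs` is ordered oldest-first; newer runs get linearly larger weights,
    so a recent failure drags the score down more than an old one.
    """
    if not runs:
        return 1.0  # no data: assume healthy
    weights = range(1, len(runs) + 1)
    total = sum(weights)
    return sum(w for ok, w in zip(runs, weights) if ok) / total
```

For example, a recent failure (`[True, False]`) scores lower than an old one (`[False, True]`), even though both histories contain one failure each.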
- Failure as data: Failures are stored, versioned, and matched — not just logged.
- Event‑driven flow: Each service reacts to events (trace ingested → failure detected → pattern updated).
- Deterministic demo: Ollama is optional; a deterministic stub keeps the system runnable everywhere.
- Separation of concerns: Each capability runs as its own microservice.
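"Failure as data" implies failures must be matchable across runs. A common approach — shown here as an illustrative sketch, not Kakveda's actual matching algorithm — is to normalize volatile details out of the error message and hash the result into a stable fingerprint:

```python
import hashlib
import re

def failure_fingerprint(error_type: str, message: str) -> str:
    """Hash a failure into a stable ID so recurring failures match across runs."""
    # Strip volatile details (hex ids, numbers) so "timed out after 30s"
    # and "timed out after 45s" produce the same fingerprint.
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "<n>", message.lower()).strip()
    return hashlib.sha256(f"{error_type}:{normalized}".encode()).hexdigest()[:16]
```

Two timeouts with different durations then collapse to one fingerprint, which is what lets a knowledge base count recurrences instead of storing unrelated log lines.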
Note: the diagram below is pipeline-centric. The dashboard is both (a) the UI entrypoint that triggers scenario runs and (b) the consumer/visualizer for warnings, runs, and health.
Scenario Runner
│
▼
Warning Policy ◀───────────┐
│ │
▼ │
Model (Ollama / Stub) │
│ │
▼ │
Trace Ingestion ──▶ Event Bus ──▶ Failure Classifier
│
▼
Global Failure KB
│
▼
Pattern Detector
│
▼
Health Scoring
│
▼
Dashboard
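The services above communicate over an HTTP event bus, but the flow is easier to see in-process. This sketch is illustrative only (`TinyEventBus` is not the real event-bus API); it mirrors the trace → classifier → failure-detected path from the diagram:

```python
from collections import defaultdict

class TinyEventBus:
    """Minimal in-process pub/sub: handlers subscribe to event types."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subs[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subs[event_type]:
            handler(payload)

# Wiring that mirrors the diagram: an ingested trace triggers classification,
# which may emit a failure event consumed downstream (here, a simple list).
bus = TinyEventBus()
failures = []

def classify(trace):
    if trace.get("is_failure"):
        bus.publish("failure.detected", trace)

bus.subscribe("trace.ingested", classify)
bus.subscribe("failure.detected", failures.append)

bus.publish("trace.ingested", {"run_id": "r1", "is_failure": True})
bus.publish("trace.ingested", {"run_id": "r2", "is_failure": False})
```

Each real service plays the role of one subscriber here, reacting to events rather than being called directly — which is why new capabilities (pattern detection, health scoring) can be added without touching the ingestion path.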
| Service | Purpose |
|---|---|
| event-bus | Demo HTTP pub/sub for events |
| ingestion | Receives traces and publishes events |
| gfkb | Global Failure Knowledge Base (failures + patterns) |
| failure-classifier | Detects failures from traces |
| pattern-detector | Maintains recurring failure patterns |
| warning-policy | Pre‑flight "this failed before" warnings |
| health-scoring | Computes health timeline |
| dashboard | UI, auth, RBAC, analytics, scenario runner |
| ollama (optional) | Local LLM runtime |
- Home overview with recent warnings
- Scenario runner with warning integration
- Warning history and analytics
- Runs & traces with nested spans and timelines
- Feedback on runs
- Datasets and examples
- Evaluations with aggregate metrics
- Prompt library with versioning
- Experiments (grouping runs)
- Playground UI
- Login / register / forgot / reset password flows
- Cookie‑based JWT sessions
- Role‑based access control: admin / operator / viewer
- Admin UI for user management and role assignment
- CSRF protection for browser forms
- Security headers (CSP, X‑Frame‑Options, etc.)
- JWT revocation (Redis‑backed when configured)
- Rate limiting (in‑memory demo, Redis optional)
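As an illustration of the in-memory rate limiting mentioned above (a toy sketch, not the dashboard's actual implementation), a sliding-window limiter can be as small as:

```python
import time
from collections import defaultdict, deque

class InMemoryRateLimiter:
    """Allow at most `limit` calls per `window` seconds per key (sliding window)."""
    def __init__(self, limit=5, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

The Redis-backed variant replaces the per-process `dict` with shared keys so limits hold across dashboard replicas.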
⚠️ This is a production‑adjacent demo.
- Docker + Docker Compose (V2 recommended)
Option 1: Using CLI (Recommended)

```bash
git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
pip install -e .
kakveda up
```

Option 2: Using Docker Compose directly

```bash
git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
docker-compose up -d
```

Optional: start the companion kakveda-kids-agent demo service (only if you have ../kakveda-kids-agent present):

```bash
docker-compose --profile kids up -d --build
```

Open the dashboard: http://localhost:8110
```bash
kakveda init                        # Interactive .env setup
kakveda up                          # Start all services
kakveda down                        # Stop all services
kakveda status                      # Show running services and URLs
kakveda logs                        # Show logs (all services)
kakveda logs dashboard --tail 50    # Show specific service logs
kakveda reset                       # Full reset (stops + clears data)
kakveda doctor                      # Diagnose system issues
kakveda version                     # Show version info
```

💡 Having issues? See TROUBLESHOOTING.md for common problems and solutions.
- admin@local / admin123 (admin)
- operator@kakveda.local / Operator@123 (operator)
- viewer@kakveda.local / Viewer@123 (viewer)
⚠️ Security warning:
- The default admin is for first-time setup only. If your browser blocks `admin@local` as an invalid email, use `admin@kakveda.local` (same password: `admin123`).
- You must change the admin password immediately after setup!
- For production, create a new admin and disable or delete the default.
Kakveda supports connecting external AI agents for centralized observability, tracing, and failure intelligence. Follow this step-by-step guide to integrate your custom agent.
You can integrate any agent framework (LangChain, LangGraph, custom Python, etc.) with minimal code using `kakveda_sdk`.
```python
from kakveda_sdk import KakvedaAgent

agent = KakvedaAgent(capabilities=["my_tool"])

result = agent.execute(
    prompt=user_input,
    tool_name="my_tool",
    execute_fn=my_tool_fn,
    metadata={"user_id": "123"},
)
```

Minimum env vars:

```bash
KAKVEDA_WARN_URL=http://warning-policy:8105/warn
KAKVEDA_EVENT_BUS_URL=http://event-bus:8100/publish
DASHBOARD_URL=http://dashboard:8110
DASHBOARD_API_KEY=<your-api-key>
AGENT_NAME=my-agent
AGENT_APP_ID=my-agent
AGENT_VERSION=1.0.0
```

If you want the agent visible in the dashboard with heartbeats, ensure `/health` is exposed (see examples/langchain-agent-demo/agent_app.py).
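For a custom agent, those variables can be loaded into a single config object. This is an illustrative sketch (the SDK reads its own configuration; `AgentConfig` here is a hypothetical name), with defaults matching the in-network service names above:

```python
import os
from dataclasses import dataclass

@dataclass
class AgentConfig:
    warn_url: str
    event_bus_url: str
    dashboard_url: str
    api_key: str
    agent_name: str
    app_id: str
    version: str

    @classmethod
    def from_env(cls) -> "AgentConfig":
        # Defaults mirror the Docker-network service names used in the README.
        return cls(
            warn_url=os.environ.get("KAKVEDA_WARN_URL", "http://warning-policy:8105/warn"),
            event_bus_url=os.environ.get("KAKVEDA_EVENT_BUS_URL", "http://event-bus:8100/publish"),
            dashboard_url=os.environ.get("DASHBOARD_URL", "http://dashboard:8110"),
            api_key=os.environ.get("DASHBOARD_API_KEY", ""),
            agent_name=os.environ.get("AGENT_NAME", "my-agent"),
            app_id=os.environ.get("AGENT_APP_ID", "my-agent"),
            version=os.environ.get("AGENT_VERSION", "1.0.0"),
        )
```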
```bash
git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
docker-compose up -d
```

This starts the following services:
| Service | Port | URL |
|---|---|---|
| Event Bus | 8100 | http://localhost:8100 |
| Dashboard | 8110 | http://localhost:8110 |
| Ollama LLM | 11434 | http://localhost:11434 |
```bash
# Check all services
docker ps

# Check Dashboard
curl http://localhost:8110

# Find Kakveda's Docker network
docker network ls | grep kakveda
```

Output example:

```
abc123   kakveda-v10_default   bridge   local
```
Example using our Kids Education Agent:
```bash
cd ..
git clone https://github.com/prateekdevisingh/kakveda-kids-agent.git
cd kakveda-kids-agent
docker build -t kakveda-kids .
```

Generic format:
```bash
docker run -d \
  --name <your-agent-name> \
  --network <kakveda-docker-network> \
  -p <host-port>:<container-port> \
  -e OLLAMA_URL=http://ollama:11434 \
  -e EVENT_BUS_URL=http://event-bus:8100 \
  -e DASHBOARD_URL=http://dashboard:8110 \
  -e DASHBOARD_API_KEY=<your-api-key> \
  <your-docker-image>
```

Example (Kids Education Agent):
```bash
docker run -d \
  --name kakveda-kids-agent \
  --network kakveda-v10_default \
  -p 8122:8120 \
  -e OLLAMA_URL=http://ollama:11434 \
  -e EVENT_BUS_URL=http://event-bus:8100 \
  -e DASHBOARD_URL=http://dashboard:8110 \
  -e DASHBOARD_API_KEY=your-api-key \
  kakveda-kids
```

Parameter reference:
| Parameter | Placeholder | Description |
|---|---|---|
| `--name` | `<your-agent-name>` | Unique name for your container |
| `--network` | `<kakveda-docker-network>` | Kakveda's network (find via `docker network ls \| grep kakveda`) |
| `-p` | `<host-port>:<container-port>` | Port mapping (e.g., 8122:8120) |
| `-e OLLAMA_URL` | `http://ollama:11434` | LLM service (use service name, not localhost) |
| `-e EVENT_BUS_URL` | `http://event-bus:8100` | Traces go here for failure intelligence |
| `-e DASHBOARD_URL` | `http://dashboard:8110` | For agent auto-registration |
| `-e DASHBOARD_API_KEY` | `<your-api-key>` | Get from Dashboard → Admin → API Keys |
| Image | `<your-docker-image>` | Your built Docker image name |
```bash
# Health check
curl http://localhost:8122/health

# Ask a question
curl -X POST http://localhost:8122/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "tell me about birds", "child_name": "Arya"}'
```

- Open http://localhost:8110
- Go to Runs → See your agent's traces
- Go to Agents → See registered agents
- Go to Playground → Select your agent from dropdown and test
Kakveda does not require any external agent to run. The core stack (dashboard, event-bus, ingestion, etc.) works standalone.
If you add a new agent:
- Prefer running it as a separate container (or as an optional Compose profile).
- Don't make the agent build mandatory when its source folder isn't present; that breaks fresh installs.
- Use a unique host port to avoid conflicts (e.g., don't reuse `8120`/`8122` if something is already bound).
- When running inside the Docker network, use service DNS names like `http://event-bus:8100` and `http://dashboard:8110` (not `localhost`).
If you do add an agent into docker-compose.yml, wrap it behind a profile:
```yaml
my-agent:
  profiles: ["agents"]
  build: ../my-agent
  environment:
    - EVENT_BUS_URL=http://event-bus:8100
  ports:
    - "8125:8120"
```

Then start it only when needed:

```bash
docker-compose --profile agents up -d --build
```

For your agent to fully integrate with Kakveda, implement these endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check (return `{"status": "healthy"}`) |
| `/api/ask` | POST | Main query endpoint |
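The bundled examples use FastAPI, but the contract is framework-agnostic. As a dependency-free sketch using only Python's standard library (handler names and the echo response are illustrative — adapt to your framework):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def _send(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            self._send(200, {"status": "healthy"})  # dashboard heartbeat target
        else:
            self._send(404, {"error": "not found"})

    def do_POST(self):
        if self.path == "/api/ask":
            length = int(self.headers.get("Content-Length", 0))
            data = json.loads(self.rfile.read(length) or b"{}")
            # Placeholder answer; a real agent would call its LLM/tools here.
            self._send(200, {"answer": f"echo: {data.get('question', '')}"})
        else:
            self._send(404, {"error": "not found"})

    def log_message(self, *args):
        pass  # keep demo output quiet

def serve(port=8120):
    """Start the agent HTTP server on a background thread; returns the server."""
    server = HTTPServer(("127.0.0.1", port), AgentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```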
Send traces to the Event Bus:

```python
import uuid

import httpx

async def send_trace(question: str, answer: str, latency: float):
    async with httpx.AsyncClient() as client:  # context manager closes the client
        await client.post(
            f"{EVENT_BUS_URL}/publish",
            json={
                "event": {
                    "event_type": "trace.ingested",
                    "run_id": str(uuid.uuid4()),
                    "scenario_name": "your-agent-name",
                    "input": question,
                    "output": answer,
                    "latency_ms": latency,
                    "is_failure": False,
                }
            },
        )
```

Auto-register with the Dashboard (optional):
```python
@app.on_event("startup")
async def register_with_kakveda():
    async with httpx.AsyncClient() as client:
        await client.post(
            f"{DASHBOARD_URL}/api/agents/register",
            json={
                "name": "your-agent-name",
                "base_url": "http://your-agent:port",
                "description": "Your agent description",
                "capabilities": ["capability1", "capability2"],
            },
        )
```
)To enable password reset emails, set these environment variables (in .env):
```bash
SMTP_HOST=smtp.yourorg.com
SMTP_PORT=587
SMTP_USER=youruser
SMTP_PASS=yourpassword
SMTP_FROM=noreply@yourorg.com
SMTP_TLS=true
```
If SMTP is not set, password reset links will be shown in the UI (for dev/testing only).
- If Ollama is running, the dashboard will call it for generation.
- If not available, Kakveda automatically falls back to a deterministic stub response.
This keeps demos reproducible and dependency‑free.
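The fallback behavior can be pictured as follows — an illustrative sketch, not the dashboard's actual code (`generate_with_fallback` and the stub text are hypothetical):

```python
def generate_with_fallback(prompt, ollama_generate=None):
    """Call the real model if available; otherwise return a deterministic stub."""
    if ollama_generate is not None:
        try:
            return ollama_generate(prompt)
        except Exception:
            pass  # Ollama unreachable or errored: fall through to the stub
    # Deterministic stub: the same prompt always yields the same response,
    # which keeps demo runs and tests reproducible without a local LLM.
    return f"[stub] deterministic response for: {prompt}"
```

Because the stub is a pure function of the prompt, scenario runs produce identical traces on every machine, with or without Ollama installed.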
Key environment variables:
- `KAKVEDA_ENV` – dev / production
- `DASHBOARD_DB_URL` – SQLite (default) or Postgres
- `KAKVEDA_REDIS_URL` – optional Redis for revocation & rate limits
- `KAKVEDA_OTEL_ENABLED` – enable OpenTelemetry export
Configuration is explicit and environment‑driven.
- Use Docker Compose (same as Quick Start) for a clean, reproducible stack.
- Default mode uses SQLite and a deterministic model stub (works everywhere).
If you prefer a guided setup, use the built-in CLI to generate a .env file and start the stack.
```bash
python -m kakveda_cli.cli init
python -m kakveda_cli.cli up
```

Useful CLI commands:
```bash
python -m kakveda_cli.cli status
python -m kakveda_cli.cli down
python -m kakveda_cli.cli reset
```

Before running tests, stop the Docker stack to avoid port/resource conflicts (and to make test runs deterministic):

```bash
python -m kakveda_cli.cli down
```

Run unit tests:

```bash
pytest -q
```

Optional: bring the stack back up after tests:

```bash
python -m kakveda_cli.cli up
```

- Keep the default stub model for deterministic demos.
- Use the built-in demo accounts.
- Use the dashboard scenario runner to generate runs/warnings quickly.
This repo is built for single-node demos, but supports production-adjacent toggles:
- Use Postgres by setting `DASHBOARD_DB_URL`
- Use Redis by setting `KAKVEDA_REDIS_URL` (revocation + rate limiting)
- Enable OpenTelemetry export with `KAKVEDA_OTEL_ENABLED`
An example compose file is provided in docker-compose.prod.yml.
This repo IS:
- A complete, runnable system
- Suitable for learning, experimentation, and local use
- A reference architecture for failure‑intelligent LLM systems
This repo is NOT:
- A fully hardened enterprise deployment
- A multi‑cluster or HA setup
- A compliance‑certified system
| Login | Register | Forgot Password |
|---|---|---|
| ![]() | ![]() | ![]() |
| Dashboard Overview | Dashboard Footer |
|---|---|
| ![]() | ![]() |
| Scenarios | Run View | Warnings |
|---|---|---|
| ![]() | ![]() | ![]() |
| Playground | Experiments | Datasets |
|---|---|---|
| ![]() | ![]() | ![]() |
| Prompts | Admin RBAC |
|---|---|
| ![]() | ![]() |
This repo includes clean, spec-friendly drawings under docs/figures/:
Fig. 1 — Pipeline-centric architecture for failure-intelligence
Fig. 2 — Example data model for failure entities and pattern entities
Fig. 3 — Pre-flight matching and policy decision flow
- Pluggable event bus implementations
- Pluggable storage backends
- Advanced evaluation plugins
- Improved pattern detection strategies
- Enterprise extensions (separate distribution)
Contributions are welcome!
Please read CONTRIBUTING.md.
Please see SECURITY.md for vulnerability reporting and security notes.
This project is licensed under the Apache License 2.0 (see LICENSE).
Kakveda aims to become a failure‑intelligence layer that complements existing LLM runtimes and observability stacks by adding what they lack most: memory and prevention of past failures.
The open-source core is designed to remain transparent, usable, and self-hostable. Future commercial offerings, if any, may focus on scale, operational hardening, and compliance-oriented features, while keeping the core concepts openly accessible.
Intellectual Property Note: The project is released as open source. Certain aspects of the underlying concepts may be the subject of patent filings.
Copyright 2026 Prateek Chaudhary, Built in India 🇮🇳












