
prateekdevisingh/kakveda


Kakveda – LLM Failure Intelligence Platform

Author: Prateek Chaudhary

Website: https://kakveda.com

Open‑source, event‑driven platform that gives LLM systems a memory of failures, runtime "this failed before" warnings, and a system‑level health view.

Kakveda sits around LLM runtimes and observability tools and adds something most systems lack: failure memory. Instead of treating failures as logs, it treats them as first‑class entities that can be remembered, matched, warned against, and analyzed over time.

This repository provides a complete, production‑adjacent, single‑node implementation designed for local use, demos, and learning — with a clear path to future enterprise extensions.


📚 Documentation

| Document | Description |
|---|---|
| docs/architecture.md | Architecture and event flow |
| docs/concepts.md | Core concepts (failures, patterns, fingerprints) |
| docs/failure-intelligence.md | What "failure intelligence" means |
| docs/COMPARISON.md | Kakveda vs Datadog, LangSmith, MLflow, etc. |
| docs/netra-host-install.md | Install and run kakveda-netra on any host |
| TROUBLESHOOTING.md | Common issues and solutions |

🌍 Current Problem and Kakveda’s Resolution

What the world is currently facing

  • AI/LLM systems fail in recurring ways, but failures are mostly stored as logs, not reusable knowledge.
  • Teams get post-incident visibility, but weak pre-incident prevention.
  • Multi-agent and host-level observability is fragmented across tools and environments.
  • Root-cause and remediation context gets lost across runs, teams, and projects.

What Kakveda resolves

  • Converts failures into a persistent Global Failure Knowledge Base (GFKB).
  • Performs pattern detection and emits pre-flight warnings to prevent repeat failures.
  • Adds unified host + observability telemetry via kakveda-netra.
  • Gives one dashboard for warning history, traces, infra signals, and reliability indicators.
  • Keeps deployment self-hostable and governance-safe for teams needing data control.
  • Combines infra monitoring, observability, and LLM/AI/agent monitoring in a single platform, where many market offerings still require several separate tools.

✅ Latest Feature Rollup

(platform baseline delivered):

  • Global Failure Knowledge Base (GFKB), recurring pattern detection, pre-flight warning flow.
  • Event-driven architecture with trace ingestion, classifier, pattern detector, and health scoring.
  • Dashboard for runs, warnings, evaluations, prompts, experiments, and feedback.

(observability + host coverage strengthened):

  • kakveda-netra host agent with full infra payload groups:
    • CPU, memory, disk, network, process, file descriptors, system, load, temperature.
    • Docker container metrics with diagnostics (docker_error, socket diagnostics).
    • Kubernetes inventory/metadata collection (nodes, pods, deployments, services, configmaps, secrets).
  • Observability views:
    • Golden signals, SLO/error budget, inferred service map, synthetic checks, incident timeline, forecast summary, and correlation summary.
  • Detail pages + chart fallback rendering for environments where CDN chart scripts are blocked.
  • Dashboard-driven Netra runtime controls (observability toggle and config sync).
  • Realtime service map UX upgrades: zoom/pan/fit/hover + topology density filters + demo mode.
  • Realtime service map page with dependency edges and environment filtering.
  • APM error tracking with grouped exceptions, workflow states, and replay context.
  • Continuous profiler view (method hotspots), version comparison, trace-to-profile drill-down.
  • Dynamic instrumentation controls (dashboard-managed runtime rules, no restart flow).
  • Instrumentation execution feedback timeline (agent-applied/failed/skipped ack).
  • Database monitoring (DBM): slow query hotspots, query fingerprints, wait/event insights, explain-plan payload support.
  • RUM (Real User Monitoring): frontend/web activity, LCP/FID/CLS, JS error visibility, RUM monitors + alerts.
  • Cross-telemetry correlation page: joins trace with RUM, infra snapshots, observability snapshots, DB samples, APM errors, and security signals.
  • APM monitors page: metric/trace/anomaly watchdog monitors with auto-generated defaults and alert lifecycle.

Native Single Tool Positioning

Kakveda uses one native host agent (kakveda-netra) to capture infra + observability + container + cluster signals and push them directly into kakveda-v1.0. This keeps integration simple:

  • install Netra on host,
  • provide dashboard API key,
  • start agent (foreground, background, or systemd),
  • data appears in /infra and /observability.
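To make the "infra payload groups" concrete, here is a minimal, stdlib-only sketch of the kind of host snapshot an agent like kakveda-netra collects before pushing it to the dashboard. The function name and every payload field name are hypothetical (this is not Netra's schema), and the push step is omitted because Netra's ingest endpoint is not documented in this README.

```python
import os
import platform
import shutil
import socket
import time


def collect_host_snapshot(disk_path: str = "/") -> dict:
    """Gather a small, hypothetical subset of host metrics using only the stdlib."""
    try:
        load1, load5, load15 = os.getloadavg()  # Unix only
        load = {"1m": load1, "5m": load5, "15m": load15}
    except (AttributeError, OSError):
        load = None  # e.g. Windows, where getloadavg is unavailable

    usage = shutil.disk_usage(disk_path)
    return {
        "host": socket.gethostname(),
        "collected_at": time.time(),
        "system": {"os": platform.system(), "release": platform.release()},
        "cpu": {"logical_cores": os.cpu_count()},
        "load": load,
        "disk": {
            "path": disk_path,
            "total_bytes": usage.total,
            "used_bytes": usage.used,
            "free_bytes": usage.free,
            "used_pct": round(100 * usage.used / usage.total, 1) if usage.total else 0.0,
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(collect_host_snapshot(), indent=2))
```

A real agent adds the remaining groups (network, processes, file descriptors, temperature, Docker, Kubernetes) and sends the snapshot on a timer with the dashboard API key.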

🔎 Basic comparison (highlights)

Note: Highlights only. For the full matrix, see docs/COMPARISON.md.

| Capability / Feature | Kakveda | LangSmith | MLflow | Arize AI | Weights & Biases | APM (Datadog/AppD) |
|---|---|---|---|---|---|---|
| Open Source | ✅ Yes (Apache 2.0) | ❌ No | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Self-hosted | ✅ Yes | ❌ No | ✅ Yes | ❌ No | ⚠️ Limited | ❌ No |
| Playground | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No |
| LLM Tracing | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes | ✅ Yes | ⚠️ Infra only |
| Failure Knowledge Base (Memory) | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Pre-flight Warnings | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Health Score Over Time | ✅ Yes | ❌ No | ❌ No | ✅ Yes | ❌ No | ✅ Infra only |
| Warnings Dashboard + Filters | ✅ Yes | ❌ No | ❌ No | ⚠️ Alerts | ❌ No | ⚠️ Alerts |

💵 Direct Monthly Price Comparison (USD)

Pricing snapshot date: March 1, 2026. Values are public starting prices (or published billing models) and can change by region, volume, and contract.

Infra Monitoring

| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda + Netra (OSS) | $0 license/mo | Self-hosted |
| Datadog Infrastructure Pro | $15 (annual) / $18 (month-to-month) | per host |
| Splunk AppDynamics Infrastructure Edition | $6 | per vCPU |
| Dynatrace Infrastructure Monitoring | $29 | per host |

Observability (APM / Logs / Metrics)

| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda + Netra (OSS) | $0 license/mo | Self-hosted |
| Datadog APM (Standard) | $31 | per host |
| Datadog Log Management | $0.10 | per GB ingested |
| Logz.io Infrastructure Monitoring | ~$12.00 | per 1000 time-series/mo (from $0.40/day) |
| Logz.io Log Management | ~$27.60 | per GB/mo (from $0.92/day) |
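Per-host, per-vCPU, and per-day rates only become comparable once they are turned into a monthly total for a concrete fleet. A minimal sketch of that arithmetic using the list prices from the March 1, 2026 snapshot above; the fleet (10 hosts with 4 vCPUs each) is a hypothetical example, and the 30-day month matches the Logz.io daily-to-monthly conversions in the table ($0.40/day to ~$12.00, $0.92/day to ~$27.60).

```python
from decimal import Decimal

DAYS_PER_MONTH = 30  # the table's Logz.io figures assume 30 billing days


def monthly_from_daily(daily_rate: Decimal) -> Decimal:
    """Convert a published daily rate into a 30-day monthly rate."""
    return daily_rate * DAYS_PER_MONTH


def fleet_cost(unit_price: Decimal, units: int) -> Decimal:
    """List-price monthly cost for a per-host or per-vCPU billing basis."""
    return unit_price * units


hosts, vcpus_per_host = 10, 4  # hypothetical fleet

estimates = {
    "Datadog Infrastructure Pro (annual)": fleet_cost(Decimal("15"), hosts),
    "Datadog Infrastructure Pro (month-to-month)": fleet_cost(Decimal("18"), hosts),
    "Datadog APM (Standard)": fleet_cost(Decimal("31"), hosts),
    "Splunk AppDynamics Infrastructure": fleet_cost(Decimal("6"), hosts * vcpus_per_host),
    "Dynatrace Infrastructure": fleet_cost(Decimal("29"), hosts),
    "Kakveda + Netra (license only)": Decimal("0"),
}

if __name__ == "__main__":
    for name, cost in estimates.items():
        print(f"{name}: ${cost}/mo")
    print("Logz.io infra per 1000 time-series: $", monthly_from_daily(Decimal("0.40")))
```

These are license/list prices only: the $0 for Kakveda excludes the cost of the hosts you self-host it on, and vendor contracts, volume, and region change the real numbers.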

AI / Agent / LLM Monitoring

| Platform | Public starting monthly price | Billing basis |
|---|---|---|
| Kakveda (OSS) | $0 license/mo | Self-hosted |
| LangSmith Plus | $39 | per seat (+ usage) |
| Weights & Biases Team | $50 | per user |
| Arize AI | Contact sales | custom |
| MLflow OSS | $0 license/mo | Self-hosted |

Consolidated Requested Tool Matrix

| Tool | Infra | Observability/APM | AI/LLM/Agent Monitoring | Public pricing visibility |
|---|---|---|---|---|
| Kakveda v1.0 + Netra | $0 license/mo | $0 license/mo | $0 license/mo | OSS self-host |
| Datadog | $15-$18 per host/mo | APM $31/host/mo; Logs $0.10/GB | LLM observability add-on model | Public list pricing |
| Splunk AppDynamics | $6 per vCPU/mo | APM bundles from $33/vCPU/mo | Enterprise packaging path | Public starting tiers |
| Logz.io | ~$12/mo equivalent entry | Logs/traces usage-based (published daily rates) | Agentic observability usage pricing | Public usage pricing |
| LangSmith | N/A | N/A | $39/seat/mo (+ usage) | Public |
| Arize | N/A | Product observability by plan | AX Pro $50/mo; OSS option exists | Public tiers/plan pages |
| Azure Monitor | Usage-based | Usage-based per GB/retention | Integration-led via Azure stack | No single global flat monthly number |
| MLflow OSS | N/A | N/A | $0 license/mo | OSS |

Detailed matrix and notes: docs/COMPARISON.md

Pricing sources:


🌐 What Makes Kakveda Unique

What many teams still do not operationalize well:

  • failure recurrence memory as a first-class data model,
  • pre-flight prevention signals before repeat incidents,
  • single-stack visibility across infra + observability + AI/LLM/agent runtime behavior.

Kakveda’s unique combination:

  • durable failure knowledge base + warning-policy feedback loop,
  • one native host-side agent (kakveda-netra) for infra + container + k8s + observability push,
  • self-hosted governance-safe deployment with low setup friction.

Easy Setup (Practical)

  1. Start Kakveda: docker compose up -d --build
  2. Install Netra on host and provide dashboard API key.
  3. Run Netra (foreground/background/systemd).
  4. Verify signals in /infra, /observability, and /observability/service-map.

✨ What this project does

  • Stores failures in a Global Failure Knowledge Base (GFKB)
  • Detects repeated and recurring failure patterns across runs
  • Provides pre‑flight warnings when an execution matches a past failure
  • Computes a system health score over time
  • Offers a full dashboard with scenarios, traces, datasets, evaluations, prompts, and experiments
  • Runs locally with Docker Compose in one command
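The pre-flight warning idea can be illustrated with a small sketch: before running a prompt, compare it with prompts that previously failed and warn when the overlap is high. This uses token-set Jaccard similarity with a hypothetical 0.6 threshold purely for illustration; it is not how Kakveda's warning-policy service scores matches, and the `PastFailure` record shape is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PastFailure:
    prompt: str
    failure_type: str


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def preflight_warnings(prompt: str, history: list[PastFailure], threshold: float = 0.6):
    """Return (score, failure) pairs for past failures similar to `prompt`, best first."""
    query = tokens(prompt)
    scored = [(jaccard(query, tokens(f.prompt)), f) for f in history]
    hits = [(score, f) for score, f in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


history = [
    PastFailure("Summarize this PDF and cite page numbers", "hallucinated_citation"),
    PastFailure("Translate the invoice into French", "format_error"),
]

for score, failure in preflight_warnings("Summarize this PDF and cite the page numbers", history):
    print(f"this failed before ({failure.failure_type}), similarity={score:.2f}")
```

Lexical overlap is the simplest possible matcher; a production policy would more likely match on failure fingerprints or embeddings, and would decide between warning, blocking, or rewriting the request.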

🧠 Core Concepts

  • Failure as data: Failures are stored, versioned, and matched — not just logged.
  • Event‑driven flow: Each service reacts to events (trace ingested → failure detected → pattern updated).
  • Deterministic demo: Ollama is optional; a deterministic stub keeps the system runnable everywhere.
  • Separation of concerns: Each capability runs as its own microservice.
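"Failure as data" means a failure becomes a record with a stable identity that later runs can be matched against. A minimal sketch, assuming the identity is a fingerprint made by hashing the failure type together with a normalized error message (hex ids and numbers masked); the normalization rules, class names, and fields are hypothetical and not the GFKB schema described in docs/concepts.md.

```python
import hashlib
import re
from dataclasses import dataclass, field


def fingerprint(failure_type: str, message: str) -> str:
    """Stable id for 'the same' failure: volatile parts (ids, numbers) are masked."""
    normalized = message.lower()
    normalized = re.sub(r"\b(0x[0-9a-f]+|[0-9a-f]{8,})\b", "<id>", normalized)
    normalized = re.sub(r"\d+", "<n>", normalized)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(f"{failure_type}|{normalized}".encode()).hexdigest()[:16]


@dataclass
class FailureRecord:
    fingerprint: str
    failure_type: str
    example_message: str
    run_ids: list[str] = field(default_factory=list)

    @property
    def occurrences(self) -> int:
        return len(self.run_ids)


class FailureKB:
    """In-memory stand-in for a failure knowledge base."""

    def __init__(self) -> None:
        self.records: dict[str, FailureRecord] = {}

    def record(self, run_id: str, failure_type: str, message: str) -> FailureRecord:
        fp = fingerprint(failure_type, message)
        rec = self.records.setdefault(fp, FailureRecord(fp, failure_type, message))
        rec.run_ids.append(run_id)
        return rec

    def recurring(self, min_occurrences: int = 2) -> list[FailureRecord]:
        """Failures seen at least `min_occurrences` times, in first-seen order."""
        return [r for r in self.records.values() if r.occurrences >= min_occurrences]


if __name__ == "__main__":
    kb = FailureKB()
    kb.record("run-1", "timeout", "Tool call timed out after 30s (request 9f8e7d6c5b4a)")
    kb.record("run-2", "timeout", "Tool call timed out after 45s (request 1a2b3c4d5e6f)")
    print([(r.failure_type, r.occurrences) for r in kb.recurring()])
```

Both timeouts normalize to "tool call timed out after <n>s (request <id>)", so they share one record; that shared identity is what lets pattern detection count recurrences instead of treating each log line as new.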

🏗️ Architecture Overview

Note: the diagram below is pipeline-centric. The dashboard is both (a) the UI entrypoint that triggers scenario runs and (b) the consumer/visualizer for warnings, runs, and health.

Scenario Runner
      │
      ▼
Warning Policy  ◀───────────┐
      │                     │
      ▼                     │
Model (Ollama / Stub)       │
      │                     │
      ▼                     │
Trace Ingestion ──▶ Event Bus ──▶ Failure Classifier
                                      │
                                      ▼
                           Global Failure KB
                                      │
                                      ▼
                             Pattern Detector
                                      │
                                      ▼
                               Health Scoring
                                      │
                                      ▼
                                  Dashboard
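The event-driven core of this pipeline (trace ingested, then failure detected, then pattern updated) can be sketched with an in-process publish/subscribe bus. Only `trace.ingested` appears elsewhere in this README; the `failure.detected` and `pattern.updated` event names and the handler logic are assumptions, and the real services talk to each other through the HTTP event-bus service rather than inside one process.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Minimal synchronous pub/sub: each event type maps to a list of handlers."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[str] = []  # event types in the order they were published

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: dict) -> None:
        self.log.append(event["event_type"])
        for handler in self._subscribers[event["event_type"]]:
            handler(event)


bus = EventBus()
pattern_counts: dict[str, int] = defaultdict(int)


def failure_classifier(event: dict) -> None:
    # Hypothetical rule: trust the trace's own is_failure flag.
    if event.get("is_failure"):
        bus.publish({"event_type": "failure.detected", "scenario_name": event["scenario_name"]})


def pattern_detector(event: dict) -> None:
    # Count failures per scenario and announce the updated pattern.
    name = event["scenario_name"]
    pattern_counts[name] += 1
    bus.publish({"event_type": "pattern.updated", "scenario_name": name,
                 "count": pattern_counts[name]})


bus.subscribe("trace.ingested", failure_classifier)
bus.subscribe("failure.detected", pattern_detector)
```

Publishing one failing trace, e.g. `bus.publish({"event_type": "trace.ingested", "scenario_name": "qa", "is_failure": True})`, cascades through both handlers; a successful trace stops at the classifier. Each service only knows the event types it consumes and emits, which is what keeps the microservices decoupled.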

🧩 Included Microservices

| Service | Purpose |
|---|---|
| event-bus | Demo HTTP pub/sub for events |
| ingestion | Receives traces and publishes events |
| gfkb | Global Failure Knowledge Base (failures + patterns) |
| failure-classifier | Detects failures from traces |
| pattern-detector | Maintains recurring failure patterns |
| warning-policy | Pre‑flight "this failed before" warnings |
| health-scoring | Computes health timeline |
| dashboard | UI, auth, RBAC, analytics, scenario runner |
| ollama (optional) | Local LLM runtime |
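The health-scoring service turns run outcomes into a timeline. This README does not specify Kakveda's formula, so, as an assumption for illustration only, the sketch below scores each time bucket as the percentage of successful runs: a bucket with 1 failure out of 4 runs scores 75.0.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    timestamp: float  # seconds since epoch
    is_failure: bool


def health_timeline(runs: list[Run], bucket_seconds: int = 3600) -> list[tuple[int, float]]:
    """Return (bucket_start, score) pairs; score = 100 * successful runs / total runs."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for run in runs:
        start = int(run.timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(run.is_failure)

    timeline = []
    for start in sorted(buckets):
        outcomes = buckets[start]
        ok = sum(1 for failed in outcomes if not failed)
        timeline.append((start, round(100 * ok / len(outcomes), 1)))
    return timeline


runs = [Run(0, False), Run(600, True), Run(1200, False), Run(1800, False),
        Run(3700, True), Run(3900, True)]
print(health_timeline(runs))  # [(0, 75.0), (3600, 0.0)]
```

Empty buckets are simply absent rather than scored 100, so a quiet period does not look artificially healthy; a weighted variant could penalize recurring (already-known) failures more than new ones.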

🖥️ Dashboard Features

  • Home overview with recent warnings
  • Scenario runner with warning integration
  • Warning history and analytics
  • Runs & traces with nested spans and timelines
  • Feedback on runs
  • Datasets and examples
  • Evaluations with aggregate metrics
  • Prompt library with versioning
  • Experiments (grouping runs)
  • Playground UI

🔐 Security & Access Control

  • Login / register / forgot / reset password flows
  • Cookie‑based JWT sessions
  • Role‑based access control: admin / operator / viewer
  • Admin UI for user management and role assignment
  • CSRF protection for browser forms
  • Security headers (CSP, X‑Frame‑Options, etc.)
  • JWT revocation (Redis‑backed when configured)
  • Rate limiting (in‑memory demo, Redis optional)
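The three roles (admin / operator / viewer) lend themselves to an ordered check. A minimal sketch, assuming each role inherits the permissions of the roles below it; the action names and the mapping are hypothetical and not taken from Kakveda's dashboard code.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    OPERATOR = 2
    ADMIN = 3


# Hypothetical mapping: minimum role required for each action.
REQUIRED_ROLE = {
    "view_runs": Role.VIEWER,
    "view_warnings": Role.VIEWER,
    "run_scenario": Role.OPERATOR,
    "manage_prompts": Role.OPERATOR,
    "manage_users": Role.ADMIN,
    "assign_roles": Role.ADMIN,
}


def is_allowed(role: Role, action: str) -> bool:
    """Deny unknown actions; otherwise compare against the minimum required role."""
    required = REQUIRED_ROLE.get(action)
    return required is not None and role >= required
```

Denying unknown actions by default means a newly added endpoint stays locked until someone assigns it a role, which is the safer failure mode for an admin UI.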

⚠️ This is a production‑adjacent demo.


🚀 Quick Start

Prerequisites

  • Docker + Docker Compose (V2 recommended)

Run the stack

Option 1: Using CLI (Recommended)

git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
pip install -e .
kakveda up

Option 2: Using Docker Compose directly

git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
docker-compose up -d

Optional: start the companion kakveda-kids-agent demo service (only if you have ../kakveda-kids-agent present):

docker-compose --profile kids up -d --build

Open the dashboard:

http://localhost:8110

CLI Commands

kakveda init        # Interactive .env setup
kakveda up          # Start all services
kakveda down        # Stop all services
kakveda status      # Show running services and URLs
kakveda logs        # Show logs (all services)
kakveda logs dashboard --tail 50   # Show specific service logs
kakveda reset       # Full reset (stops + clears data)
kakveda doctor      # Diagnose system issues
kakveda version     # Show version info

💡 Having issues? See TROUBLESHOOTING.md for common problems and solutions.

Demo Accounts (auto‑created)

⚠️ Security warning:

  • The default admin is for first-time setup only. If your browser blocks admin@local as an invalid email, use admin@kakveda.local (same password: admin123).
  • You must change the admin password immediately after setup!
  • For production, create a new admin and disable or delete the default.

🔌 Connect Your Own AI Agent to Kakveda

Kakveda supports connecting external AI agents for centralized observability, tracing, and failure intelligence. Follow this step-by-step guide to integrate your custom agent.

SDK Quick Integration (Recommended)

You can integrate any agent framework (LangChain, LangGraph, custom Python, etc.) with minimal code using kakveda_sdk.

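from kakveda_sdk import KakvedaAgent

# Declare the tool names (capabilities) this agent exposes.
agent = KakvedaAgent(capabilities=["my_tool"])

# user_input and my_tool_fn are placeholders for your prompt and your own tool function.
result = agent.execute(
    prompt=user_input,
    tool_name="my_tool",
    execute_fn=my_tool_fn,        # callable that performs the actual tool work
    metadata={"user_id": "123"},  # optional extra context sent with the run
)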

Minimum env vars:

KAKVEDA_WARN_URL=http://warning-policy:8105/warn
KAKVEDA_EVENT_BUS_URL=http://event-bus:8100/publish
DASHBOARD_URL=http://dashboard:8110
DASHBOARD_API_KEY=<your-api-key>
AGENT_NAME=my-agent
AGENT_APP_ID=my-agent
AGENT_VERSION=1.0.0

If you want the agent visible in the dashboard with heartbeats, ensure /health is exposed (see examples/langchain-agent-demo/agent_app.py).
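A missing variable tends to fail quietly (no traces, no warnings), so it can help to validate them once at agent startup. A small sketch that checks the minimum variables listed above; the `AgentConfig` class and `load_agent_config` helper are hypothetical and not part of kakveda_sdk.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Order must match the AgentConfig field order below.
REQUIRED = (
    "KAKVEDA_WARN_URL",
    "KAKVEDA_EVENT_BUS_URL",
    "DASHBOARD_URL",
    "DASHBOARD_API_KEY",
    "AGENT_NAME",
    "AGENT_APP_ID",
    "AGENT_VERSION",
)


@dataclass(frozen=True)
class AgentConfig:
    warn_url: str
    event_bus_url: str
    dashboard_url: str
    dashboard_api_key: str
    agent_name: str
    agent_app_id: str
    agent_version: str


def load_agent_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Fail fast with one error that lists every missing or empty variable."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing Kakveda settings: " + ", ".join(missing))
    return AgentConfig(*(env[name] for name in REQUIRED))
```

Reporting all missing names at once avoids the fix-one-restart-repeat loop when several variables are absent.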

Step 1: Start Kakveda Platform

git clone https://github.com/prateekdevisingh/kakveda.git
cd kakveda/kakveda-v1.0
docker-compose up -d

This starts the stack; the services you will call directly are:

| Service | Port | URL |
|---|---|---|
| Event Bus | 8100 | http://localhost:8100 |
| Dashboard | 8110 | http://localhost:8110 |
| Ollama LLM | 11434 | http://localhost:11434 |

Step 2: Verify Kakveda is Running

# Check all services
docker ps

# Check Dashboard
curl http://localhost:8110
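Containers can take a few seconds to become ready after `docker-compose up -d`, so a scripted check usually needs retries. A stdlib-only sketch that polls a URL until the server answers with a status below 500; the `wait_for` name, attempt count, and delay are arbitrary choices, and the only assumption about Kakveda is that the dashboard listens on http://localhost:8110 as shown above.

```python
import time
import urllib.error
import urllib.request


def wait_for(url: str, attempts: int = 30, delay: float = 2.0, timeout: float = 3.0) -> bool:
    """Poll `url` until the server answers with a status below 500; False if it never does."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status < 500:
                    return True
        except urllib.error.HTTPError as err:
            if err.code < 500:  # e.g. 401/404: the server is up, just not serving this path
                return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused, DNS failure, or timeout: not ready yet
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_for("http://localhost:8110")` returns True once the dashboard responds. `HTTPError` is caught before `URLError` because it is a subclass; a 4xx still proves the server is up.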

Step 3: Find Docker Network Name

docker network ls | grep kakveda

Output example:

abc123   kakveda-v10_default   bridge   local
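The `kakveda-v10_default` name follows from Docker Compose defaults: the project name is the directory name, lowercased, with every character other than letters, digits, `-`, and `_` removed (so the `.` in `kakveda-v1.0` disappears), and the default network is `<project>_default`. A sketch of that rule; it is an approximation of Compose's normalization, and it does not apply if the project name is overridden with `-p`, `COMPOSE_PROJECT_NAME`, or a top-level `name:` in the compose file.

```python
import re
from pathlib import Path


def compose_project_name(directory: str) -> str:
    """Approximate Compose's default project name for a project directory."""
    name = Path(directory).name.lower()
    name = re.sub(r"[^-_a-z0-9]", "", name)  # drop '.', spaces, and other characters
    return name.lstrip("-_")  # Compose v2 also trims leading '-' and '_'


def default_network_name(directory: str) -> str:
    return f"{compose_project_name(directory)}_default"


print(default_network_name("kakveda/kakveda-v1.0"))  # kakveda-v10_default
```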

Step 4: Download/Create Your Custom Agent

Example using our Kids Education Agent:

cd ..
git clone https://github.com/prateekdevisingh/kakveda-kids-agent.git
cd kakveda-kids-agent

Step 5: Build Agent Docker Image

docker build -t kakveda-kids .

Step 6: Connect Agent to Kakveda Network

Generic Format:

docker run -d \
  --name <your-agent-name> \
  --network <kakveda-docker-network> \
  -p <host-port>:<container-port> \
  -e OLLAMA_URL=http://ollama:11434 \
  -e EVENT_BUS_URL=http://event-bus:8100 \
  -e DASHBOARD_URL=http://dashboard:8110 \
  -e DASHBOARD_API_KEY=<your-api-key> \
  <your-docker-image>

Example (Kids Education Agent):

docker run -d \
  --name kakveda-kids-agent \
  --network kakveda-v10_default \
  -p 8122:8120 \
  -e OLLAMA_URL=http://ollama:11434 \
  -e EVENT_BUS_URL=http://event-bus:8100 \
  -e DASHBOARD_URL=http://dashboard:8110 \
  -e DASHBOARD_API_KEY=your-api-key \
  kakveda-kids

Parameter Reference:

| Parameter | Placeholder | Description |
|---|---|---|
| `--name` | `<your-agent-name>` | Unique name for your container |
| `--network` | `<kakveda-docker-network>` | Kakveda's network (find via `docker network ls \| grep kakveda`) |
| `-p` | `<host-port>:<container-port>` | Port mapping (e.g., 8122:8120) |
| `-e OLLAMA_URL` | `http://ollama:11434` | LLM service (use the service name, not localhost) |
| `-e EVENT_BUS_URL` | `http://event-bus:8100` | Traces go here for failure intelligence |
| `-e DASHBOARD_URL` | `http://dashboard:8110` | Used for agent auto-registration |
| `-e DASHBOARD_API_KEY` | `<your-api-key>` | Get it from Dashboard → Admin → API Keys |
| Image | `<your-docker-image>` | Name of your built Docker image |
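The same parameters can be assembled programmatically, which also makes it easy to catch the most common mistake noted in the table: pointing an in-network URL at localhost. A sketch with a hypothetical `docker_run_args` helper; it only builds the argument list (printable with `shlex.join`) and never calls Docker.

```python
import shlex
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def docker_run_args(name: str, network: str, host_port: int, container_port: int,
                    image: str, env: dict[str, str]) -> list[str]:
    """Build `docker run` arguments; reject *_URL values that point at localhost."""
    for key, value in env.items():
        if key.endswith("_URL") and urlparse(value).hostname in LOCAL_HOSTS:
            raise ValueError(f"{key}={value}: inside the Docker network use the "
                             "service name (e.g. http://event-bus:8100), not localhost")
    args = ["docker", "run", "-d", "--name", name, "--network", network,
            "-p", f"{host_port}:{container_port}"]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args


cmd = docker_run_args(
    name="kakveda-kids-agent",
    network="kakveda-v10_default",
    host_port=8122,
    container_port=8120,
    image="kakveda-kids",
    env={
        "OLLAMA_URL": "http://ollama:11434",
        "EVENT_BUS_URL": "http://event-bus:8100",
        "DASHBOARD_URL": "http://dashboard:8110",
        "DASHBOARD_API_KEY": "your-api-key",
    },
)
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with API keys.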

Step 7: Test Your Agent

# Health check
curl http://localhost:8122/health

# Ask a question
curl -X POST http://localhost:8122/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "tell me about birds", "child_name": "Arya"}'
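The same request from Python, using only the standard library. The body mirrors the curl example for the Kids Education Agent (`question`, `child_name`); other agents will expect their own fields, and the response format is not documented here, so the sketch simply decodes whatever JSON comes back. The helper names are hypothetical.

```python
import json
import urllib.request


def build_ask_request(base_url: str, question: str, child_name: str) -> urllib.request.Request:
    """POST /api/ask with a JSON body, as in the curl example above."""
    body = json.dumps({"question": question, "child_name": child_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, question: str, child_name: str, timeout: float = 30.0):
    """Send the request and return the decoded JSON response."""
    req = build_ask_request(base_url, question, child_name)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires the agent from Step 6 to be running on port 8122):
# print(ask("http://localhost:8122", "tell me about birds", "Arya"))
```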

Step 8: View Traces in Dashboard

  1. Open http://localhost:8110
  2. Go to Runs → See your agent's traces
  3. Go to Agents → See registered agents
  4. Go to Playground → Select your agent from dropdown and test

Agent Integration Requirements

Kakveda does not require any external agent to run. The core stack (dashboard, event-bus, ingestion, etc.) works standalone.

If you add a new agent:

  • Prefer running it as a separate container (or as an optional Compose profile).
  • Don't make the agent build mandatory: a fresh install that lacks the agent's source folder should still start cleanly.
  • Use a unique host port to avoid conflicts (e.g., don’t reuse 8120/8122 if something is already bound).
  • When running inside the Docker network, use service DNS names like http://event-bus:8100 and http://dashboard:8110 (not localhost).

If you do add an agent into docker-compose.yml, wrap it behind a profile:

    my-agent:
        profiles: ["agents"]
        build: ../my-agent
        environment:
            - EVENT_BUS_URL=http://event-bus:8100
        ports:
            - "8125:8120"

Then start it only when needed:

docker-compose --profile agents up -d --build

For your agent to fully integrate with Kakveda, implement these endpoints:

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check (return `{"status": "healthy"}`) |
| `/api/ask` | POST | Main query endpoint |
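A minimal agent exposing both endpoints, written with only the Python standard library so it runs without extra dependencies. Only the `/health` response body comes from the table above; the `/api/ask` field names (`question`, `answer`) and the echo reply are placeholders for your own model or tool logic.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AgentHandler(BaseHTTPRequestHandler):
    """Implements GET /health and POST /api/ask."""

    def _send_json(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        if self.path == "/health":
            self._send_json(200, {"status": "healthy"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self) -> None:
        if self.path != "/api/ask":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            data = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # invalid JSON or invalid UTF-8
            data = None
        if not isinstance(data, dict):
            self._send_json(400, {"error": "expected a JSON object"})
            return
        question = str(data.get("question", ""))
        # Placeholder answer: call your model or tool here instead.
        self._send_json(200, {"answer": f"You asked: {question}"})

    def log_message(self, format: str, *args) -> None:
        pass  # silence per-request logging for the demo


def start_in_background(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Serve on a daemon thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer((host, port), AgentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Inside a container, bind to all interfaces so the published port works, e.g. `ThreadingHTTPServer(("0.0.0.0", 8120), AgentHandler).serve_forever()`, where 8120 matches the container port used in the examples above.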

Send traces to Event Bus:

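import os
import uuid

import httpx

# Event bus base URL; inside the Docker network use the service name, not localhost.
EVENT_BUS_URL = os.environ.get("EVENT_BUS_URL", "http://event-bus:8100")

async def send_trace(question: str, answer: str, latency: float) -> None:
    """Publish one trace.ingested event; latency is in milliseconds."""
    async with httpx.AsyncClient() as client:  # closes the connection pool when done
        response = await client.post(
            f"{EVENT_BUS_URL}/publish",
            json={
                "event": {
                    "event_type": "trace.ingested",
                    "run_id": str(uuid.uuid4()),  # unique id per run
                    "scenario_name": "your-agent-name",
                    "input": question,
                    "output": answer,
                    "latency_ms": latency,
                    "is_failure": False,  # set True when the call failed
                }
            },
        )
        response.raise_for_status()  # surface 4xx/5xx instead of failing silently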
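Rather than hard-coding `is_failure` to False, the agent can time each call and mark exceptions as failures before publishing. A synchronous sketch that builds an event in the same shape as the payload above; the `traced_call` helper and the extra `error` field are hypothetical, and publishing is left to code like `send_trace`.

```python
import time
import uuid
from typing import Callable, Optional


def traced_call(fn: Callable[[str], str], question: str,
                scenario_name: str) -> tuple[Optional[str], dict]:
    """Run fn(question), measuring latency in ms and recording failures instead of raising."""
    start = time.perf_counter()
    answer: Optional[str] = None
    error: Optional[str] = None
    try:
        answer = fn(question)
    except Exception as exc:  # any tool/model error counts as a failure for the trace
        error = f"{type(exc).__name__}: {exc}"
    latency_ms = (time.perf_counter() - start) * 1000

    event = {
        "event_type": "trace.ingested",
        "run_id": str(uuid.uuid4()),
        "scenario_name": scenario_name,
        "input": question,
        "output": answer if answer is not None else "",
        "latency_ms": round(latency_ms, 2),
        "is_failure": error is not None,
    }
    if error is not None:
        event["error"] = error  # hypothetical extra field for the classifier
    return answer, event
```

Returning the event instead of raising lets the caller publish the failure first and then decide whether to re-raise or show a fallback answer to the user.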

Auto-register with Dashboard (optional):

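import os

import httpx
from fastapi import FastAPI

app = FastAPI()
DASHBOARD_URL = os.environ.get("DASHBOARD_URL", "http://dashboard:8110")

@app.on_event("startup")
async def register_with_kakveda():
    # Announce this agent to the dashboard so it appears under Agents and in the Playground.
    async with httpx.AsyncClient() as client:
        await client.post(
            f"{DASHBOARD_URL}/api/agents/register",
            json={
                "name": "your-agent-name",
                "base_url": "http://your-agent:port",  # URL the dashboard uses to reach this agent
                "description": "Your agent description",
                "capabilities": ["capability1", "capability2"],
            },
        )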

📧 SMTP for Password Reset

To enable password reset emails, set these environment variables (in .env):

SMTP_HOST=smtp.yourorg.com
SMTP_PORT=587
SMTP_USER=youruser
SMTP_PASS=yourpassword
SMTP_FROM=noreply@yourorg.com
SMTP_TLS=true

If SMTP is not set, password reset links will be shown in the UI (for dev/testing only).
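To check these settings independently of Kakveda, they map directly onto Python's standard `smtplib`. This is a sketch, not Kakveda's mailer: the helper names are hypothetical, and treating `SMTP_TLS=true` as "upgrade the connection with STARTTLS" (the usual meaning on port 587) is an assumption.

```python
import os
import smtplib
from email.message import EmailMessage
from typing import Mapping


def smtp_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the SMTP_* variables; SMTP_USER/SMTP_PASS may be omitted for open relays."""
    return {
        "host": env["SMTP_HOST"],
        "port": int(env["SMTP_PORT"]),
        "user": env.get("SMTP_USER"),
        "password": env.get("SMTP_PASS"),
        "sender": env["SMTP_FROM"],
        # Assumption: SMTP_TLS=true means "upgrade the connection with STARTTLS".
        "tls": env.get("SMTP_TLS", "").strip().lower() == "true",
    }


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "SMTP test"
    msg.set_content("If you received this, the SMTP settings work.")
    return msg


def send_test_email(recipient: str, env: Mapping[str, str] = os.environ) -> None:
    s = smtp_settings(env)
    with smtplib.SMTP(s["host"], s["port"], timeout=10) as server:
        if s["tls"]:
            server.starttls()  # upgrade the plain connection to TLS
        if s["user"]:
            server.login(s["user"], s["password"])
        server.send_message(build_test_message(s["sender"], recipient))
```

Running `send_test_email("you@example.com")` with the same `.env` loaded confirms the relay accepts your credentials before you rely on it for password resets.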


🤖 Ollama Integration (Optional)

  • If Ollama is running, the dashboard will call it for generation.
  • If not available, Kakveda automatically falls back to a deterministic stub response.

This keeps demos reproducible and dependency‑free.
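The fallback can be sketched as: try Ollama's HTTP generate endpoint, and on any connection or decoding error return a stub whose text depends only on the prompt, so repeated demo runs produce identical output. `POST /api/generate` with `model`, `prompt`, and `stream: false` is Ollama's documented API; the model name, timeout, and stub wording are assumptions, not what the dashboard actually uses.

```python
import hashlib
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # inside Docker, use http://ollama:11434


def stub_response(prompt: str) -> str:
    """Deterministic placeholder: the same prompt always yields the same text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
    return f"[stub:{digest}] Deterministic response for: {prompt[:60]}"


def generate(prompt: str, model: str = "llama3", base_url: str = OLLAMA_URL,
             timeout: float = 30.0) -> str:
    """Ask Ollama for a completion; fall back to the stub if it is unreachable."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())["response"]
    except (urllib.error.URLError, OSError, KeyError, ValueError):
        return stub_response(prompt)
```

Hashing the prompt (rather than using randomness or timestamps) is what makes the stub reproducible across machines and runs.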


⚙️ Configuration

Key environment variables:

  • KAKVEDA_ENV – dev / production
  • DASHBOARD_DB_URL – SQLite (default) or Postgres
  • KAKVEDA_REDIS_URL – optional Redis for revocation & rate limits
  • KAKVEDA_OTEL_ENABLED – enable OpenTelemetry export

Configuration is explicit and environment‑driven.
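A sketch of reading these variables into one typed object at startup. The defaults (dev environment, no Redis, OpenTelemetry off) and the accepted boolean spellings are assumptions for illustration; an unset `DASHBOARD_DB_URL` is represented as None to mean "use the built-in SQLite default", whose actual path this README does not state.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class KakvedaSettings:
    env: str
    db_url: Optional[str]      # None -> SQLite default
    redis_url: Optional[str]   # None -> in-memory revocation / rate limits
    otel_enabled: bool

    @property
    def uses_postgres(self) -> bool:
        return bool(self.db_url) and self.db_url.startswith(("postgres://", "postgresql"))


def load_settings(environ: Mapping[str, str] = os.environ) -> KakvedaSettings:
    env = environ.get("KAKVEDA_ENV", "dev")
    if env not in {"dev", "production"}:
        raise ValueError(f"KAKVEDA_ENV must be 'dev' or 'production', got {env!r}")
    return KakvedaSettings(
        env=env,
        db_url=environ.get("DASHBOARD_DB_URL") or None,
        redis_url=environ.get("KAKVEDA_REDIS_URL") or None,
        otel_enabled=environ.get("KAKVEDA_OTEL_ENABLED", "").strip().lower() in TRUTHY,
    )
```

Validating `KAKVEDA_ENV` up front turns a typo such as `prod` into an immediate startup error instead of silently running with dev defaults.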


🧰 Install & use (local / demo / other envs)

Local (recommended)

  • Use Docker Compose (same as Quick Start) for a clean, reproducible stack.
  • Default mode uses SQLite and a deterministic model stub (works everywhere).

CLI alternative (interactive)

If you prefer a guided setup, use the built-in CLI to generate a .env file and start the stack.

python -m kakveda_cli.cli init
python -m kakveda_cli.cli up

Useful CLI commands:

python -m kakveda_cli.cli status
python -m kakveda_cli.cli down
python -m kakveda_cli.cli reset

✅ Testing (step-by-step)

Before running tests, stop the Docker stack to avoid port/resource conflicts (and to make test runs deterministic):

python -m kakveda_cli.cli down

Run unit tests:

pytest -q

Optional: bring the stack back up after tests:

python -m kakveda_cli.cli up

Demo setup

  • Keep the default stub model for deterministic demos.
  • Use the built-in demo accounts.
  • Use the dashboard scenario runner to generate runs/warnings quickly.

Other environments (staging/production-like)

This repo is built for single-node demos, but supports production-adjacent toggles:

  • Use Postgres by setting DASHBOARD_DB_URL
  • Use Redis by setting KAKVEDA_REDIS_URL (revocation + rate limiting)
  • Enable OpenTelemetry export with KAKVEDA_OTEL_ENABLED

An example compose file is provided in docker-compose.prod.yml.


🧪 What this repo is (and is not)

This repo IS:

  • A complete, runnable system
  • Suitable for learning, experimentation, and local use
  • A reference architecture for failure‑intelligent LLM systems

This repo is NOT:

  • A fully hardened enterprise deployment
  • A multi‑cluster or HA setup
  • A compliance‑certified system

📸 Demo Screenshots

  • Login & Authentication: Login, Register, Forgot Password
  • Dashboard: Dashboard Overview, Dashboard Footer
  • Scenario Runner & Warnings: Scenarios, Run View, Warnings
  • Advanced Features: Playground, Experiments, Datasets
  • Admin Panel: Prompts, Admin, RBAC

🖼️ Drawings

This repo includes clean, spec-friendly drawings under docs/figures/:

Fig. 1 — Pipeline-centric architecture for failure-intelligence

Fig. 2 — Example data model for failure entities and pattern entities

Fig. 3 — Pre-flight matching and policy decision flow


🛣️ Roadmap (High Level)

  • Pluggable event bus implementations
  • Pluggable storage backends
  • Advanced evaluation plugins
  • Improved pattern detection strategies
  • Enterprise extensions (separate distribution)

🤝 Contributing

Contributions are welcome!

Please read CONTRIBUTING.md.


🔐 Security

Please see SECURITY.md for vulnerability reporting and security notes.


📄 License

This project is licensed under the Apache License 2.0 (see LICENSE).


🌱 Long‑term vision

Kakveda aims to become a failure‑intelligence layer that complements existing LLM runtimes and observability stacks by adding what they lack most: memory and prevention of past failures.

The open-source core is designed to remain transparent, usable, and self-hostable. Future commercial offerings, if any, may focus on scale, operational hardening, and compliance-oriented features, while keeping the core concepts openly accessible.

Intellectual Property Note: The project is released as open source. Certain aspects of the underlying concepts may be the subject of patent filings.


Copyright 2026 Prateek Chaudhary, Built in India 🇮🇳

About

Open-source failure intelligence platform for LLM & agent systems. Adds failure memory, pre-flight warnings, pattern detection, and system health scoring.
