diff --git a/docs/archive/metrics-for-chatbot-agent-tools.md b/docs/archive/metrics-for-chatbot-agent-tools.md new file mode 100644 index 00000000..f2d66c05 --- /dev/null +++ b/docs/archive/metrics-for-chatbot-agent-tools.md @@ -0,0 +1,197 @@ +# Metrics for Chatbot Agent/Tools Usage + +**Story:** [SC-41316](https://app.shortcut.com/sefaria/story/41316) — MCP: add per-tool time-series metrics +**Epic:** DevOps +**Date:** 2026-02-19 + +## Executive Summary + +SC-41316 asks for Prometheus metrics that support **time-series queries** (e.g. "calls per tool per hour") for the chatbot's MCP tools. This document brainstorms a comprehensive metrics strategy to capture all information needed from agent/tool usage. + +--- + +## 1. Story Context (SC-41316) + +### Current State (from story) + +The story references these Prometheus metrics in `main.py`: + +| Metric | Type | Labels | Purpose | +|--------|------|--------|---------| +| `mcp_tool_calls_total` | Counter | tool_name, status | Total tool invocations | +| `mcp_tool_duration_seconds` | Histogram | tool_name | Latency per tool | +| `mcp_tool_payload_bytes` | Histogram | tool_name | Payload size | +| `mcp_active_connections` | Gauge | — | Active MCP connections | +| `mcp_errors_total` | Counter | tool_name, error_type | Error counts | + +**Note:** The ai-chatbot codebase does not currently expose these Prometheus metrics. They may live in a separate MCP server or Sefaria main repo. The chatbot uses Braintrust for tracing and DB for persistence. + +### Problem (from story) + +> We want to map usage per `tool_name` over time, but we currently only store total calls. We cannot get time-series (e.g. "calls per tool per hour") from the existing metrics alone. + +**Clarification:** Prometheus Counters with labels *are* inherently time-series. `rate(mcp_tool_calls_total[1h])` or `increase(mcp_tool_calls_total[1h])` should yield calls per second or per hour. The issue may be: + +1. 
Metrics are not exposed at all in the chatbot service +2. Metrics are aggregated without `tool_name` label (losing per-tool breakdown) +3. Scraping interval or retention prevents useful queries + +### Goal (from story) + +> Add to `_run_with_metrics` (or another appropriate place) an option to send/record metrics per individual tool in a way that supports time-series queries (e.g. so we can graph usage per tool_name by time in Prometheus/Grafana). + +--- + +## 2. Current ai-chatbot Observability + +### What exists today + +| Source | Data | Time-series? | +|--------|------|--------------| +| **Braintrust** | Spans, tool_calls in metadata, latency_ms, tool_count, tokens, cost | Yes (via Braintrust UI) | +| **ChatMessage** | tool_calls_data (JSON), tool_calls_count, latency_ms | Yes (via DB queries) | +| **ChatSession** | total_tool_calls, total_input_tokens, total_output_tokens, total_cost_usd | Aggregates only | +| **claude_service** | tool_calls_list per turn (tool_name, tool_input, tool_output, is_error, latency_ms) | Per-turn only | + +### Gap + +- No Prometheus metrics in the chatbot service +- No real-time dashboards for ops (Grafana) +- No alerting on tool error rates or latency +- Per-tool breakdown requires DB queries or Braintrust drill-down + +--- + +## 3. 
Metrics to Add (Brainstorm) + +### 3.1 Per-tool time-series (SC-41316 core) + +| Metric | Type | Labels | Use case | +|--------|------|--------|----------| +| `chatbot_tool_calls_total` | Counter | tool_name, status (success/error) | `rate(...[1h])` → calls/sec per tool; `increase(...[1h])` → calls per hour | +| `chatbot_tool_duration_seconds` | Histogram | tool_name | p50, p95, p99 latency per tool; SLO monitoring | +| `chatbot_tool_errors_total` | Counter | tool_name, error_type (optional) | Error rate per tool; alert on spike | + +**Implementation point:** In `_build_sdk_tools` handler (claude_service.py ~line 546), after `tool_executor.execute()` and before appending to `tool_calls_list`, increment/observe Prometheus metrics. + +### 3.2 Agent/turn-level metrics + +| Metric | Type | Labels | Use case | +|--------|------|--------|----------| +| `chatbot_turns_total` | Counter | status (success/error/guardrail_blocked) | Turn volume over time | +| `chatbot_turn_latency_seconds` | Histogram | — | End-to-end latency distribution | +| `chatbot_turn_tool_count` | Histogram | — | Tools per turn (distribution) | +| `chatbot_guardrail_blocks_total` | Counter | reason (optional) | Guardrail effectiveness | + +### 3.3 Token and cost metrics + +| Metric | Type | Labels | Use case | +|--------|------|--------|----------| +| `chatbot_tokens_total` | Counter | type (prompt/completion/cache_read/cache_creation) | Token usage trends | +| `chatbot_cost_usd_total` | Counter | — | Cost tracking; budget alerts | +| `chatbot_llm_calls_total` | Counter | — | Multi-step reasoning frequency | + +### 3.4 Session and user metrics (optional) + +| Metric | Type | Labels | Use case | +|--------|------|--------|----------| +| `chatbot_sessions_active` | Gauge | — | Concurrent sessions | +| `chatbot_messages_total` | Counter | role (user/assistant) | Message volume | + +### 3.5 Tool-specific business metrics (optional) + +| Metric | Type | Labels | Use case | 
+|--------|------|--------|----------| +| `chatbot_tool_get_text_refs_total` | Counter | — | Popular references (if we add ref as label, watch cardinality) | +| `chatbot_tool_search_queries_total` | Counter | tool_name | Search vs semantic vs in-book usage | + +**Cardinality warning:** Avoid high-cardinality labels (e.g. full query text, user_id) in Prometheus. Use exemplars or logs for drill-down. + +--- + +## 4. Implementation Options + +### Option A: Add Prometheus to ai-chatbot (recommended) + +- Add `prometheus_client` to the Django app +- Expose `/metrics` endpoint (or use `django-prometheus`) +- Instrument in `claude_service._build_sdk_tools` handler and `TurnLoggingService` +- **Pros:** Single source of truth, aligns with story, Grafana-ready +- **Cons:** New dependency, need to wire into existing deployment + +### Option B: Rely on Braintrust + DB + +- Use Braintrust for traces and ad-hoc analysis +- Add DB views or scheduled jobs to aggregate tool_calls_data into time-series +- **Pros:** No new infra +- **Cons:** Not real-time, no alerting, heavier DB load for analytics + +### Option C: Hybrid + +- Prometheus for RED metrics (rate, errors, duration) and alerting +- Braintrust for deep traces and token/cost +- DB for audit and long-term analytics + +--- + +## 5. Minimal Set for SC-41316 + +To satisfy the story with minimal scope: + +1. **`chatbot_tool_calls_total`** (Counter, labels: tool_name, status) + - Increment in tool handler on every call (success or error) + - Enables: `rate(chatbot_tool_calls_total{tool_name="get_text"}[1h])` → calls/sec + +2. **`chatbot_tool_duration_seconds`** (Histogram, labels: tool_name) + - Observe latency in tool handler + - Enables: `histogram_quantile(0.95, rate(chatbot_tool_duration_seconds_bucket[5m]))` → p95 + +3. 
**`chatbot_tool_errors_total`** (Counter, labels: tool_name) + - Increment when `result.is_error` + - Or fold into `chatbot_tool_calls_total` with status=error + +**Instrumentation point:** `server/chat/V2/agent/claude_service.py`, inside the `handler` in `_build_sdk_tools`, after `result = await self.tool_executor.execute(...)`. + +--- + +## 6. Grafana Dashboard Ideas + +| Panel | Query | Purpose | +|-------|-------|---------| +| Tool calls per hour | `sum by (tool_name) (increase(chatbot_tool_calls_total[1h]))` | Usage trends per tool | +| Tool error rate | `sum(rate(chatbot_tool_calls_total{status="error"}[5m])) / sum(rate(chatbot_tool_calls_total[5m]))` | Error ratio | +| Tool p95 latency | `histogram_quantile(0.95, sum by (le, tool_name) (rate(chatbot_tool_duration_seconds_bucket[5m])))` | Latency SLO | +| Top tools by volume | `topk(10, sum by (tool_name) (increase(chatbot_tool_calls_total[24h])))` | Most-used tools | + +--- + +## 7. Trade-offs + +| Approach | Complexity | Real-time | Alerting | Cost tracking | +|----------|------------|-----------|----------|---------------| +| Prometheus only | Medium | Yes | Yes | Via counter | +| Braintrust only | Low | Yes (UI) | Limited | Yes | +| DB aggregation | High | No | No | Yes | +| Hybrid (Prometheus + Braintrust) | Medium | Yes | Yes | Yes | + +--- + +## 8. Recommendations + +1. **Implement Option A** — Add Prometheus metrics to ai-chatbot for per-tool time-series (SC-41316). +2. **Start with the minimal set** — `chatbot_tool_calls_total`, `chatbot_tool_duration_seconds`, and optionally `chatbot_tool_errors_total`. +3. **Instrument in the tool handler** — Single place, low overhead, captures every invocation. +4. **Add agent-level metrics later** — `chatbot_turns_total`, `chatbot_turn_latency_seconds` for full visibility. +5. **Keep Braintrust** — For traces, token/cost detail, and debugging; Prometheus for ops dashboards and alerting. + +--- + +## 9. 
Next Steps + +- [x] Add `prometheus_client` (or `django-prometheus`) to `requirements.txt` +- [x] Create metrics module (e.g. `server/chat/metrics.py`) with Counter/Histogram definitions +- [x] Instrument tool handler in `claude_service._build_sdk_tools` +- [x] Expose `/metrics` endpoint +- [x] Add ServiceMonitor in `manifests/servicemonitor.yaml` for Prometheus Operator +- [ ] Add Grafana dashboard for tool usage +- [ ] Document Prometheus scrape config for deployment (see `manifests/README.md`) diff --git a/manifests/README.md b/manifests/README.md new file mode 100644 index 00000000..44207347 --- /dev/null +++ b/manifests/README.md @@ -0,0 +1,33 @@ +# Kubernetes Manifests + +## ServiceMonitor + +The `servicemonitor.yaml` configures Prometheus Operator to scrape metrics from the ai-chatbot service. + +### Requirements + +1. **Prometheus Operator** must be installed in the cluster. +2. **Target Service** must have: + - Label `app.kubernetes.io/name: ai-chatbot` (or update the ServiceMonitor selector) + - A port named `http` (or update `spec.endpoints[].port` in the ServiceMonitor) +3. **Prometheus** must be configured to select this ServiceMonitor (e.g. `serviceMonitorSelector` matching the ServiceMonitor labels). + +### Service port example + +If your Service does not have a port named `http`, add it: + +```yaml +spec: + ports: + - name: http + port: 8080 + targetPort: 8080 +``` + +### Metrics exposed + +- `chatbot_tool_calls_total` — Tool invocations (labels: tool_name, status) +- `chatbot_tool_duration_seconds` — Tool latency histogram (labels: tool_name) +- `chatbot_tool_errors_total` — Tool errors (labels: tool_name) + +See `docs/archive/metrics-for-chatbot-agent-tools.md` for Grafana query examples. 
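### Example query

Once Prometheus picks up the ServiceMonitor and begins scraping, per-tool usage over time can be graphed directly from the counter; for example, calls per tool per hour:

```promql
sum by (tool_name) (increase(chatbot_tool_calls_total[1h]))
```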
diff --git a/manifests/servicemonitor.yaml b/manifests/servicemonitor.yaml new file mode 100644 index 00000000..0c35aa69 --- /dev/null +++ b/manifests/servicemonitor.yaml @@ -0,0 +1,27 @@ +# ServiceMonitor for Prometheus Operator +# Tells Prometheus to scrape /api/metrics from the ai-chatbot service. +# +# Prerequisites: +# - Prometheus Operator installed in the cluster +# - Service for ai-chatbot with a port named "http" (or update port below) +# - Prometheus instance configured to select ServiceMonitors (e.g. via +# spec.serviceMonitorSelector or spec.podMonitorSelector) +# +# Apply: kubectl apply -f manifests/servicemonitor.yaml +# Or include in Helm/Kustomize/Flux deployment. +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: ai-chatbot + labels: + app.kubernetes.io/name: ai-chatbot + app.kubernetes.io/component: metrics +spec: + selector: + matchLabels: + app.kubernetes.io/name: ai-chatbot + endpoints: + - port: http + path: /api/metrics + interval: 30s + scrapeTimeout: 10s diff --git a/server/chat/V2/agent/claude_service.py b/server/chat/V2/agent/claude_service.py index e19d9661..57665e90 100644 --- a/server/chat/V2/agent/claude_service.py +++ b/server/chat/V2/agent/claude_service.py @@ -35,6 +35,7 @@ from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient, create_sdk_mcp_server, tool from claude_agent_sdk.types import AssistantMessage, ResultMessage +from ...metrics import record_tool_call from ..guardrail import get_guardrail_service from ..prompts import PromptService, get_prompt_service from ..prompts.prompt_fragments import ( @@ -545,6 +546,11 @@ async def handler(args: dict[str, Any]) -> dict[str, Any]: tool_start = time.time() result = await self.tool_executor.execute(tool_name, tool_input) tool_latency = int((time.time() - tool_start) * 1000) + duration_seconds = time.time() - tool_start + + # Record Prometheus metrics for time-series queries (SC-41316) + status_label = "error" if result.is_error else "success" + 
record_tool_call(tool_name, status_label, duration_seconds) # Flatten content blocks into a single string for the preview output_text = "".join( diff --git a/server/chat/metrics.py b/server/chat/metrics.py new file mode 100644 index 00000000..e87b891e --- /dev/null +++ b/server/chat/metrics.py @@ -0,0 +1,34 @@ +""" +Prometheus metrics for chatbot agent and tool usage. + +Enables time-series queries (e.g. calls per tool per hour) in Grafana. +See docs/archive/metrics-for-chatbot-agent-tools.md. +""" + +from prometheus_client import Counter, Histogram + +# Per-tool metrics (SC-41316) +TOOL_CALLS = Counter( + "chatbot_tool_calls_total", + "Total tool invocations", + ["tool_name", "status"], +) +TOOL_DURATION = Histogram( + "chatbot_tool_duration_seconds", + "Tool execution latency in seconds", + ["tool_name"], + buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0), +) +TOOL_ERRORS = Counter( + "chatbot_tool_errors_total", + "Tool execution errors", + ["tool_name"], +) + + +def record_tool_call(tool_name: str, status: str, duration_seconds: float) -> None: + """Record a tool invocation for Prometheus.""" + TOOL_CALLS.labels(tool_name=tool_name, status=status).inc() + TOOL_DURATION.labels(tool_name=tool_name).observe(duration_seconds) + if status == "error": + TOOL_ERRORS.labels(tool_name=tool_name).inc() diff --git a/server/chat/tests/test_metrics.py b/server/chat/tests/test_metrics.py new file mode 100644 index 00000000..79823c7b --- /dev/null +++ b/server/chat/tests/test_metrics.py @@ -0,0 +1,48 @@ +"""Tests for Prometheus metrics endpoint and recording.""" + +from django.test import Client + +from chat.metrics import TOOL_CALLS, TOOL_DURATION, TOOL_ERRORS, record_tool_call + + +class TestMetricsEndpoint: + """Test /api/metrics endpoint.""" + + def test_metrics_returns_200(self) -> None: + client = Client() + response = client.get("/api/metrics") + assert response.status_code == 200 + assert "text/plain" in response.get("Content-Type", "") + + def 
test_metrics_contains_chatbot_metrics(self) -> None: + record_tool_call("get_text", "success", 0.1) + client = Client() + response = client.get("/api/metrics") + assert response.status_code == 200 + body = response.content.decode() + assert "chatbot_tool_calls_total" in body + assert "chatbot_tool_duration_seconds" in body + assert "chatbot_tool_errors_total" in body + + +class TestRecordToolCall: + """Test record_tool_call helper.""" + + def test_increments_counter_on_success(self) -> None: + before = TOOL_CALLS.labels(tool_name="get_text", status="success")._value.get() + record_tool_call("get_text", "success", 0.1) + after = TOOL_CALLS.labels(tool_name="get_text", status="success")._value.get() + assert after == before + 1 + + def test_increments_error_counter_on_error(self) -> None: + before = TOOL_ERRORS.labels(tool_name="get_text")._value.get() + record_tool_call("get_text", "error", 0.1) + after = TOOL_ERRORS.labels(tool_name="get_text")._value.get() + assert after == before + 1 + + def test_observes_duration(self) -> None: + record_tool_call("text_search", "success", 0.5) + # Histogram stores observations; we can't easily assert without + # scraping the registry. Just verify no exception. + samples = list(TOOL_DURATION.labels(tool_name="text_search").collect()) + assert len(samples) > 0 diff --git a/server/chat/urls.py b/server/chat/urls.py index a85afc96..8f224e8b 100644 --- a/server/chat/urls.py +++ b/server/chat/urls.py @@ -3,6 +3,7 @@ """ from django.urls import path +from django.views.decorators.http import require_GET from . 
import views from .V2 import views as v2_views @@ -20,4 +21,5 @@ # Admin/management endpoints path("admin/reload-prompts", views.reload_prompts, name="reload_prompts"), path("health", views.health, name="health"), + path("metrics", require_GET(views.metrics), name="metrics"), ] diff --git a/server/chat/views.py b/server/chat/views.py index 5482767c..26d0aba6 100644 --- a/server/chat/views.py +++ b/server/chat/views.py @@ -4,6 +4,8 @@ from datetime import datetime from urllib.parse import urlparse +from django.http import HttpResponse +from prometheus_client import CONTENT_TYPE_LATEST, generate_latest from rest_framework import status from rest_framework.decorators import api_view from rest_framework.response import Response @@ -126,6 +128,16 @@ def reload_prompts(request): return Response({"error": str(e)}, status=status.HTTP_500_INTERNAL_SERVER_ERROR) +def metrics(request): + """ + Prometheus metrics endpoint. + + GET /api/metrics + """ + data = generate_latest() + return HttpResponse(data, content_type=CONTENT_TYPE_LATEST) + + @api_view(["GET"]) def health(request): """ diff --git a/server/requirements.txt b/server/requirements.txt index 811901e2..dd2a14a2 100644 --- a/server/requirements.txt +++ b/server/requirements.txt @@ -10,6 +10,7 @@ claude-agent-sdk>=0.1.0 # Tracing/Observability langsmith>=0.1.0 braintrust>=0.0.150 +prometheus-client>=0.20.0 # HTTP client httpx>=0.27.0
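As a sanity check of the approach in `server/chat/metrics.py`, the counter/histogram pair can be exercised in isolation. This sketch assumes `prometheus_client` is installed; it uses a private `CollectorRegistry` (rather than the process-global default registry the module above relies on) so it can run standalone without label/name clashes:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# Private registry so this sketch does not pollute the process-global REGISTRY.
registry = CollectorRegistry()

TOOL_CALLS = Counter(
    "chatbot_tool_calls_total", "Total tool invocations",
    ["tool_name", "status"], registry=registry,
)
TOOL_DURATION = Histogram(
    "chatbot_tool_duration_seconds", "Tool execution latency in seconds",
    ["tool_name"], registry=registry,
)


def record_tool_call(tool_name: str, status: str, duration_seconds: float) -> None:
    """Mirror of the helper added in this diff, wired to the private registry."""
    TOOL_CALLS.labels(tool_name=tool_name, status=status).inc()
    TOOL_DURATION.labels(tool_name=tool_name).observe(duration_seconds)


record_tool_call("get_text", "success", 0.12)
record_tool_call("get_text", "error", 0.87)

# The text exposition now carries one sample per label combination; Prometheus
# stores each as its own time series, so rate()/increase() queries work per tool.
exposition = generate_latest(registry).decode()
print("chatbot_tool_calls_total" in exposition)
```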