197 changes: 197 additions & 0 deletions docs/archive/metrics-for-chatbot-agent-tools.md
@@ -0,0 +1,197 @@
# Metrics for Chatbot Agent/Tools Usage

**Story:** [SC-41316](https://app.shortcut.com/sefaria/story/41316) — MCP: add per-tool time-series metrics
**Epic:** DevOps
**Date:** 2026-02-19

## Executive Summary

SC-41316 asks for Prometheus metrics that support **time-series queries** (e.g. "calls per tool per hour") for the chatbot's MCP tools. This document brainstorms a comprehensive metrics strategy to capture all information needed from agent/tool usage.

---

## 1. Story Context (SC-41316)

### Current State (from story)

The story references these Prometheus metrics in `main.py`:

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| `mcp_tool_calls_total` | Counter | tool_name, status | Total tool invocations |
| `mcp_tool_duration_seconds` | Histogram | tool_name | Latency per tool |
| `mcp_tool_payload_bytes` | Histogram | tool_name | Payload size |
| `mcp_active_connections` | Gauge | — | Active MCP connections |
| `mcp_errors_total` | Counter | tool_name, error_type | Error counts |

**Note:** The ai-chatbot codebase does not currently expose these Prometheus metrics. They may live in a separate MCP server or Sefaria main repo. The chatbot uses Braintrust for tracing and DB for persistence.

### Problem (from story)

> We want to map usage per `tool_name` over time, but we currently only store total calls. We cannot get time-series (e.g. "calls per tool per hour") from the existing metrics alone.

**Clarification:** Prometheus Counters with labels *are* inherently time-series. `rate(mcp_tool_calls_total[1h])` yields the per-second call rate averaged over the last hour, and `increase(mcp_tool_calls_total[1h])` yields the number of calls in that hour. The issue is more likely one of:

1. Metrics are not exposed at all in the chatbot service
2. Metrics are aggregated without `tool_name` label (losing per-tool breakdown)
3. Scraping interval or retention prevents useful queries
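If the counters in the table above are in fact exposed with a `tool_name` label, standard PromQL already answers the story's question (metric names taken from §1; whether they are actually exposed is the open question):

```promql
# Per-second call rate per tool, averaged over the last hour
sum by (tool_name) (rate(mcp_tool_calls_total[1h]))

# Absolute call count per tool over the last hour
sum by (tool_name) (increase(mcp_tool_calls_total[1h]))
```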

### Goal (from story)

> Add to `_run_with_metrics` (or another appropriate place) an option to send/record metrics per individual tool in a way that supports time-series queries (e.g. so we can graph usage per tool_name by time in Prometheus/Grafana).

---

## 2. Current ai-chatbot Observability

### What exists today

| Source | Data | Time-series? |
|--------|------|--------------|
| **Braintrust** | Spans, tool_calls in metadata, latency_ms, tool_count, tokens, cost | Yes (via Braintrust UI) |
| **ChatMessage** | tool_calls_data (JSON), tool_calls_count, latency_ms | Yes (via DB queries) |
| **ChatSession** | total_tool_calls, total_input_tokens, total_output_tokens, total_cost_usd | Aggregates only |
| **claude_service** | tool_calls_list per turn (tool_name, tool_input, tool_output, is_error, latency_ms) | Per-turn only |

### Gap

- No Prometheus metrics in the chatbot service
- No real-time dashboards for ops (Grafana)
- No alerting on tool error rates or latency
- Per-tool breakdown requires DB queries or Braintrust drill-down

---

## 3. Metrics to Add (Brainstorm)

### 3.1 Per-tool time-series (SC-41316 core)

| Metric | Type | Labels | Use case |
|--------|------|--------|----------|
| `chatbot_tool_calls_total` | Counter | tool_name, status (success/error) | `rate(...[1h])` → calls/sec per tool; `increase(...[1h])` → calls per hour |
| `chatbot_tool_duration_seconds` | Histogram | tool_name | p50, p95, p99 latency per tool; SLO monitoring |
| `chatbot_tool_errors_total` | Counter | tool_name, error_type (optional) | Error rate per tool; alert on spike |

**Implementation point:** In the tool `handler` built by `_build_sdk_tools` (claude_service.py, ~line 546), after `tool_executor.execute()` returns and before appending to `tool_calls_list`, increment the counter and observe the histogram.
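A minimal sketch of that instrumentation, assuming the handler shape described above — `ToolResult` and `execute_tool` are stand-ins for the real `tool_executor.execute()`:

```python
import asyncio
import time
from dataclasses import dataclass

from prometheus_client import Counter, Histogram

TOOL_CALLS = Counter(
    "chatbot_tool_calls_total", "Total tool invocations", ["tool_name", "status"]
)
TOOL_DURATION = Histogram(
    "chatbot_tool_duration_seconds", "Tool execution latency in seconds", ["tool_name"]
)


@dataclass
class ToolResult:
    content: str
    is_error: bool = False


async def execute_tool(tool_name: str, tool_input: dict) -> ToolResult:
    # Stand-in for tool_executor.execute(); the real one calls Sefaria APIs.
    return ToolResult(content="In the beginning...")


async def handler(tool_name: str, tool_input: dict) -> ToolResult:
    # Time the call, then record both metrics regardless of outcome.
    start = time.perf_counter()
    result = await execute_tool(tool_name, tool_input)
    status = "error" if result.is_error else "success"
    TOOL_CALLS.labels(tool_name=tool_name, status=status).inc()
    TOOL_DURATION.labels(tool_name=tool_name).observe(time.perf_counter() - start)
    return result


result = asyncio.run(handler("get_text", {"ref": "Genesis 1:1"}))
print(result.is_error)  # False
```

Keeping the metric writes inside the handler (rather than in each tool) means every invocation is captured in one place, matching the story's `_run_with_metrics` intent.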

### 3.2 Agent/turn-level metrics

| Metric | Type | Labels | Use case |
|--------|------|--------|----------|
| `chatbot_turns_total` | Counter | status (success/error/guardrail_blocked) | Turn volume over time |
| `chatbot_turn_latency_seconds` | Histogram | — | End-to-end latency distribution |
| `chatbot_turn_tool_count` | Histogram | — | Tools per turn (distribution) |
| `chatbot_guardrail_blocks_total` | Counter | reason (optional) | Guardrail effectiveness |

### 3.3 Token and cost metrics

| Metric | Type | Labels | Use case |
|--------|------|--------|----------|
| `chatbot_tokens_total` | Counter | type (prompt/completion/cache_read/cache_creation) | Token usage trends |
| `chatbot_cost_usd_total` | Counter | — | Cost tracking; budget alerts |
| `chatbot_llm_calls_total` | Counter | — | Multi-step reasoning frequency |
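A sketch of recording the token/cost counters above, assuming the caller can pull usage numbers off the LLM response — the `record_usage` helper and its call site are hypothetical:

```python
from prometheus_client import Counter

TOKENS = Counter("chatbot_tokens_total", "LLM tokens consumed", ["type"])
COST = Counter("chatbot_cost_usd_total", "Cumulative LLM spend in USD")


def record_usage(prompt_tokens: int, completion_tokens: int, cost_usd: float) -> None:
    # Counters accept an increment amount, so one call per turn is enough.
    TOKENS.labels(type="prompt").inc(prompt_tokens)
    TOKENS.labels(type="completion").inc(completion_tokens)
    COST.inc(cost_usd)


record_usage(prompt_tokens=1200, completion_tokens=350, cost_usd=0.0042)
print(TOKENS.labels(type="prompt")._value.get())  # 1200.0
```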

### 3.4 Session and user metrics (optional)

| Metric | Type | Labels | Use case |
|--------|------|--------|----------|
| `chatbot_sessions_active` | Gauge | — | Concurrent sessions |
| `chatbot_messages_total` | Counter | role (user/assistant) | Message volume |

### 3.5 Tool-specific business metrics (optional)

| Metric | Type | Labels | Use case |
|--------|------|--------|----------|
| `chatbot_tool_get_text_refs_total` | Counter | — | Popular references (if we add ref as label, watch cardinality) |
| `chatbot_tool_search_queries_total` | Counter | tool_name | Search vs semantic vs in-book usage |

**Cardinality warning:** Avoid high-cardinality labels (e.g. full query text, user_id) in Prometheus. Use exemplars or logs for drill-down.
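One way to enforce the warning above is to whitelist label values before they reach Prometheus — the tool names below are illustrative, not the chatbot's actual tool list:

```python
# Bound label cardinality: anything outside the known set collapses to "other",
# so a renamed tool or a bug can't mint unbounded label values.
KNOWN_TOOLS = frozenset({"get_text", "text_search", "semantic_search", "in_book_search"})


def safe_tool_label(tool_name: str) -> str:
    return tool_name if tool_name in KNOWN_TOOLS else "other"


print(safe_tool_label("get_text"))     # get_text
print(safe_tool_label("Genesis 1:1"))  # other
```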

---

## 4. Implementation Options

### Option A: Add Prometheus to ai-chatbot (recommended)

- Add `prometheus_client` to the Django app
- Expose `/metrics` endpoint (or use `django-prometheus`)
- Instrument in `claude_service._build_sdk_tools` handler and `TurnLoggingService`
- **Pros:** Single source of truth, aligns with story, Grafana-ready
- **Cons:** New dependency, need to wire into existing deployment
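A framework-agnostic sketch of what the `/metrics` endpoint returns — in Django you would wrap this in a view returning `HttpResponse`, or let `django-prometheus` do the wiring; the `DEMO_CALLS` metric exists only so this sketch has output:

```python
from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

# Any registered metric appears in the default registry's exposition output.
DEMO_CALLS = Counter("chatbot_demo_calls_total", "Demo counter for the sketch")
DEMO_CALLS.inc()


def metrics_payload() -> tuple[bytes, str]:
    """Return (body, content_type) in the Prometheus text exposition format."""
    return generate_latest(), CONTENT_TYPE_LATEST


body, content_type = metrics_payload()
print(content_type)  # text/plain; version=0.0.4; charset=utf-8
```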

### Option B: Rely on Braintrust + DB

- Use Braintrust for traces and ad-hoc analysis
- Add DB views or scheduled jobs to aggregate tool_calls_data into time-series
- **Pros:** No new infra
- **Cons:** Not real-time, no alerting, heavier DB load for analytics

### Option C: Hybrid

- Prometheus for RED metrics (rate, errors, duration) and alerting
- Braintrust for deep traces and token/cost
- DB for audit and long-term analytics

---

## 5. Minimal Set for SC-41316

To satisfy the story with minimal scope:

1. **`chatbot_tool_calls_total`** (Counter, labels: tool_name, status)
- Increment in tool handler on every call (success or error)
- Enables: `rate(chatbot_tool_calls_total{tool_name="get_text"}[1h])` → calls/sec

2. **`chatbot_tool_duration_seconds`** (Histogram, labels: tool_name)
- Observe latency in tool handler
- Enables: `histogram_quantile(0.95, rate(chatbot_tool_duration_seconds_bucket[5m]))` → p95

3. **`chatbot_tool_errors_total`** (Counter, labels: tool_name)
- Increment when `result.is_error`
- Or fold into `chatbot_tool_calls_total` with status=error

**Instrumentation point:** `server/chat/V2/agent/claude_service.py`, inside the `handler` in `_build_sdk_tools`, after `result = await self.tool_executor.execute(...)`.

---

## 6. Grafana Dashboard Ideas

| Panel | Query | Purpose |
|-------|-------|---------|
| Tool calls per hour | `sum by (tool_name) (increase(chatbot_tool_calls_total[1h]))` | Usage trends per tool |
| Tool error rate | `sum(rate(chatbot_tool_calls_total{status="error"}[5m])) / sum(rate(chatbot_tool_calls_total[5m]))` | Error ratio |
| Tool p95 latency | `histogram_quantile(0.95, sum by (le, tool_name) (rate(chatbot_tool_duration_seconds_bucket[5m])))` | Latency SLO |
| Top tools by volume | `topk(10, sum by (tool_name) (increase(chatbot_tool_calls_total[24h])))` | Most-used tools |

---

## 7. Trade-offs

| Approach | Complexity | Real-time | Alerting | Cost tracking |
|----------|------------|-----------|----------|---------------|
| Prometheus only | Medium | Yes | Yes | Via counter |
| Braintrust only | Low | Yes (UI) | Limited | Yes |
| DB aggregation | High | No | No | Yes |
| Hybrid (Prometheus + Braintrust) | Medium | Yes | Yes | Yes |

---

## 8. Recommendations

1. **Implement Option A** — Add Prometheus metrics to ai-chatbot for per-tool time-series (SC-41316).
2. **Start with the minimal set** — `chatbot_tool_calls_total`, `chatbot_tool_duration_seconds`, and optionally `chatbot_tool_errors_total`.
3. **Instrument in the tool handler** — Single place, low overhead, captures every invocation.
4. **Add agent-level metrics later** — `chatbot_turns_total`, `chatbot_turn_latency_seconds` for full visibility.
5. **Keep Braintrust** — For traces, token/cost detail, and debugging; Prometheus for ops dashboards and alerting.

---

## 9. Next Steps

- [x] Add `prometheus_client` (or `django-prometheus`) to `requirements.txt`
- [x] Create metrics module (e.g. `server/chat/metrics.py`) with Counter/Histogram definitions
- [x] Instrument tool handler in `claude_service._build_sdk_tools`
- [x] Expose `/metrics` endpoint
- [x] Add ServiceMonitor in `manifests/servicemonitor.yaml` for Prometheus Operator
- [ ] Add Grafana dashboard for tool usage
- [ ] Document Prometheus scrape config for deployment (see `manifests/README.md`)
33 changes: 33 additions & 0 deletions manifests/README.md
@@ -0,0 +1,33 @@
# Kubernetes Manifests

## ServiceMonitor

The `servicemonitor.yaml` configures Prometheus Operator to scrape metrics from the ai-chatbot service.

### Requirements

1. **Prometheus Operator** must be installed in the cluster.
2. **Target Service** must have:
- Label `app.kubernetes.io/name: ai-chatbot` (or update the ServiceMonitor selector)
- A port named `http` (or update `spec.endpoints[].port` in the ServiceMonitor)
3. **Prometheus** must be configured to select this ServiceMonitor (e.g. `serviceMonitorSelector` matching the ServiceMonitor labels).

### Service port example

If your Service does not have a port named `http`, add it:

```yaml
spec:
ports:
- name: http
port: 8080
targetPort: 8080
```

### Metrics exposed

- `chatbot_tool_calls_total` — Tool invocations (labels: tool_name, status)
- `chatbot_tool_duration_seconds` — Tool latency histogram (labels: tool_name)
- `chatbot_tool_errors_total` — Tool errors (labels: tool_name)

See `docs/archive/metrics-for-chatbot-agent-tools.md` for Grafana query examples.
27 changes: 27 additions & 0 deletions manifests/servicemonitor.yaml
@@ -0,0 +1,27 @@
# ServiceMonitor for Prometheus Operator
# Tells Prometheus to scrape /api/metrics from the ai-chatbot service.
#
# Prerequisites:
# - Prometheus Operator installed in the cluster
# - Service for ai-chatbot with a port named "http" (or update port below)
# - Prometheus instance configured to select ServiceMonitors (e.g. via
# spec.serviceMonitorSelector or spec.podMonitorSelector)
#
# Apply: kubectl apply -f manifests/servicemonitor.yaml
# Or include in Helm/Kustomize/Flux deployment.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ai-chatbot
labels:
app.kubernetes.io/name: ai-chatbot
app.kubernetes.io/component: metrics
spec:
selector:
matchLabels:
app.kubernetes.io/name: ai-chatbot
endpoints:
- port: http
path: /api/metrics
interval: 30s
scrapeTimeout: 10s
6 changes: 6 additions & 0 deletions server/chat/V2/agent/claude_service.py
@@ -35,6 +35,7 @@
from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient, create_sdk_mcp_server, tool
from claude_agent_sdk.types import AssistantMessage, ResultMessage

from ...metrics import record_tool_call
from ..guardrail import get_guardrail_service
from ..prompts import PromptService, get_prompt_service
from ..prompts.prompt_fragments import (
@@ -545,6 +546,11 @@ async def handler(args: dict[str, Any]) -> dict[str, Any]:
tool_start = time.time()
result = await self.tool_executor.execute(tool_name, tool_input)
duration_seconds = time.time() - tool_start
tool_latency = int(duration_seconds * 1000)

# Record Prometheus metrics for time-series queries (SC-41316)
status_label = "error" if result.is_error else "success"
record_tool_call(tool_name, status_label, duration_seconds)

# Flatten content blocks into a single string for the preview
output_text = "".join(
34 changes: 34 additions & 0 deletions server/chat/metrics.py
@@ -0,0 +1,34 @@
"""
Prometheus metrics for chatbot agent and tool usage.

Enables time-series queries (e.g. calls per tool per hour) in Grafana.
See docs/archive/metrics-for-chatbot-agent-tools.md.
"""

from prometheus_client import Counter, Histogram

# Per-tool metrics (SC-41316)
TOOL_CALLS = Counter(
"chatbot_tool_calls_total",
"Total tool invocations",
["tool_name", "status"],
)
TOOL_DURATION = Histogram(
"chatbot_tool_duration_seconds",
"Tool execution latency in seconds",
["tool_name"],
buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
TOOL_ERRORS = Counter(
"chatbot_tool_errors_total",
"Tool execution errors",
["tool_name"],
)


def record_tool_call(tool_name: str, status: str, duration_seconds: float) -> None:
"""Record a tool invocation for Prometheus."""
TOOL_CALLS.labels(tool_name=tool_name, status=status).inc()
TOOL_DURATION.labels(tool_name=tool_name).observe(duration_seconds)
if status == "error":
TOOL_ERRORS.labels(tool_name=tool_name).inc()
48 changes: 48 additions & 0 deletions server/chat/tests/test_metrics.py
@@ -0,0 +1,48 @@
"""Tests for Prometheus metrics endpoint and recording."""

from django.test import Client

from chat.metrics import TOOL_CALLS, TOOL_DURATION, TOOL_ERRORS, record_tool_call


class TestMetricsEndpoint:
"""Test /api/metrics endpoint."""

def test_metrics_returns_200(self) -> None:
client = Client()
response = client.get("/api/metrics")
assert response.status_code == 200
assert "text/plain" in response.get("Content-Type", "")

def test_metrics_contains_chatbot_metrics(self) -> None:
record_tool_call("get_text", "success", 0.1)
client = Client()
response = client.get("/api/metrics")
assert response.status_code == 200
body = response.content.decode()
assert "chatbot_tool_calls_total" in body
assert "chatbot_tool_duration_seconds" in body
assert "chatbot_tool_errors_total" in body


class TestRecordToolCall:
"""Test record_tool_call helper."""

def test_increments_counter_on_success(self) -> None:
before = TOOL_CALLS.labels(tool_name="get_text", status="success")._value.get()
record_tool_call("get_text", "success", 0.1)
after = TOOL_CALLS.labels(tool_name="get_text", status="success")._value.get()
assert after == before + 1

def test_increments_error_counter_on_error(self) -> None:
before = TOOL_ERRORS.labels(tool_name="get_text")._value.get()
record_tool_call("get_text", "error", 0.1)
after = TOOL_ERRORS.labels(tool_name="get_text")._value.get()
assert after == before + 1

def test_observes_duration(self) -> None:
# Histogram children expose a _sum accumulator (same private-attribute
# style as _value.get() above); assert the observation was recorded.
before = TOOL_DURATION.labels(tool_name="text_search")._sum.get()
record_tool_call("text_search", "success", 0.5)
after = TOOL_DURATION.labels(tool_name="text_search")._sum.get()
assert after == before + 0.5
2 changes: 2 additions & 0 deletions server/chat/urls.py
@@ -3,6 +3,7 @@
"""

from django.urls import path
from django.views.decorators.http import require_GET

from . import views
from .V2 import views as v2_views
@@ -20,4 +21,5 @@
# Admin/management endpoints
path("admin/reload-prompts", views.reload_prompts, name="reload_prompts"),
path("health", views.health, name="health"),
path("metrics", require_GET(views.metrics), name="metrics"),
]