LLM infrastructure implementation #436

Open
Sadeequ wants to merge 1 commit into Traqora:main from Sadeequ:LLM_infra

@Sadeequ Sadeequ commented Jun 28, 2026

I fixed the "Implement health checks for LLM infrastructure" issue by building end-to-end monitoring for the LLM provider layer.

What I created:

astroml/llm/metrics.py — Prometheus metrics for the LLM layer: astroml_llm_requests_total, astroml_llm_request_latency_seconds, astroml_llm_cost_usd_total, astroml_llm_tokens_total, and astroml_llm_provider_health.

astroml/llm/health.py — Async health check functions for OpenAI, Anthropic, and HuggingFace using aiohttp. Returns per-provider status, latency, and HTTP codes.
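
For reviewers, the shape of a per-provider check can be sketched roughly as follows. This is a minimal illustration, not the code in health.py: the real module issues requests with aiohttp, whereas here the HTTP call is replaced by an injected coroutine (`Probe`, `stub_probe`) so the sketch runs without network access. The function names, the 5-second timeout, and the result keys are all assumptions.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Iterable

# Hypothetical probe type: takes a provider name and returns an HTTP status code.
# In the real module this role is played by an aiohttp request.
Probe = Callable[[str], Awaitable[int]]


async def check_provider(name: str, probe: Probe, timeout: float = 5.0) -> Dict[str, object]:
    """Run one probe and report status, latency, and HTTP code."""
    start = time.perf_counter()
    try:
        code = await asyncio.wait_for(probe(name), timeout=timeout)
        healthy = 200 <= code < 300
        error = None
    except Exception as exc:  # timeout or transport failure
        code, healthy, error = None, False, type(exc).__name__
    latency = time.perf_counter() - start
    return {
        "provider": name,
        "status": "healthy" if healthy else "unhealthy",
        "latency_seconds": latency,
        "http_code": code,
        "error": error,
    }


async def check_all(providers: Iterable[str], probe: Probe) -> Dict[str, Dict[str, object]]:
    """Check every provider concurrently, the way an all-providers endpoint might."""
    results = await asyncio.gather(*(check_provider(p, probe) for p in providers))
    return {str(r["provider"]): r for r in results}


async def stub_probe(name: str) -> int:
    # Stand-in for a real HTTP call: pretend HuggingFace is returning 503.
    await asyncio.sleep(0)
    return 503 if name == "huggingface" else 200


report = asyncio.run(check_all(["openai", "anthropic", "huggingface"], stub_probe))
```

Catching the exception inside `check_provider` keeps one failing provider from hiding the results for the others, which is the property the all-providers endpoint needs.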

astroml/llm/tracker.py — Wired the existing LLMUsageTracker to emit Prometheus metrics on every successful/failed LLM call. Added record_error() for error-rate tracking.
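
The bookkeeping behind error-rate tracking can be pictured with the sketch below. It is a hypothetical stand-in for LLMUsageTracker, not its actual code: plain `collections.Counter` objects take the place of the Prometheus counters, and `record_success` and `error_rate` are illustrative names. Only `record_error()` is named in this PR.

```python
from collections import Counter


class UsageTrackerSketch:
    """Hypothetical stand-in for LLMUsageTracker; plain counters replace Prometheus metrics."""

    def __init__(self) -> None:
        self.requests = Counter()  # keyed by (provider, outcome)
        self.tokens = Counter()    # keyed by provider
        self.cost_usd = Counter()  # keyed by provider

    def record_success(self, provider: str, tokens: int, cost_usd: float) -> None:
        # A successful call bumps the request, token, and cost totals together.
        self.requests[(provider, "success")] += 1
        self.tokens[provider] += tokens
        self.cost_usd[provider] += cost_usd

    def record_error(self, provider: str) -> None:
        # A failed call only bumps the request total, under the "error" outcome.
        self.requests[(provider, "error")] += 1

    def error_rate(self, provider: str) -> float:
        ok = self.requests[(provider, "success")]
        failed = self.requests[(provider, "error")]
        total = ok + failed
        return failed / total if total else 0.0


tracker = UsageTrackerSketch()
for _ in range(3):
    tracker.record_success("openai", tokens=100, cost_usd=0.5)
tracker.record_error("openai")
rate = tracker.error_rate("openai")  # 1 failure out of 4 calls
```

Recording errors under the same request total, split by outcome, is what lets an alert express the error rate as a ratio of two series.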

astroml/llm/explainer.py — Added error tracking in FraudExplainer so failures increment the error counter.

api/routers/llm_health.py — New FastAPI router exposing:
GET /api/v1/llm/health — all providers at once
GET /api/v1/llm/health/{provider} — single provider

api/app.py — Registered llm_health_router and added a /metrics endpoint serving Prometheus generate_latest() output.

api/routers/__init__.py — Exported llm_health_router.

monitoring/prometheus/alert_rules.yml — Added four LLM alert rules: LLMProviderDown (critical), LLMHighErrorRate, LLMCostThreshold ($10/hr window), LLMHighLatency (P95 > 5s).
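
To make the P95 > 5s condition concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, a simplified version of what Prometheus's `histogram_quantile` does for classic histograms. The bucket bounds and counts are made-up illustrative values; the PR does not state which buckets astroml_llm_request_latency_seconds uses.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative data only: 100 requests, 10 of them slower than 5 seconds.
latency_buckets = [(1.0, 50), (2.5, 80), (5.0, 90), (10.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
high_latency = p95 > 5.0
```

With these numbers the 95th request lands halfway through the 5s to 10s bucket, so the estimate is about 7.5s and the latency condition would hold.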

monitoring/prometheus/prometheus.yml — Added astroml-api scrape job targeting api:8000/metrics.

monitoring/grafana/llm_health_dashboard.json — Full Grafana dashboard with panels for provider health, P95 latency, error rate, 1h cost, token volume, and total requests.

docs/runbooks/llm_health.md — Runbook covering architecture, metrics reference, alert response procedures (provider down, error spikes, cost spikes), and verification commands.

api/tests/test_llm_health.py — Integration tests covering health endpoints, per-provider health, and metrics exposition.

Acceptance criteria met:

  • Health checks poll every 60s via Prometheus (scrape config + dashboard refresh)
  • Cost alerts fire within 1 hour (Prometheus LLMCostThreshold rule uses increase(...[1h]) > 10)
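
The arithmetic behind the cost rule can be checked by hand with a small sketch. It assumes a cumulative cost counter sampled every 15 minutes, with made-up values, and computes the growth over the trailing hour as last minus first sample in the window. This simplifies what Prometheus does: the real `increase()` also extrapolates to the window edges and compensates for counter resets.

```python
from typing import List, Tuple


def increase(samples: List[Tuple[float, float]], now: float, window: float) -> float:
    """Growth of a cumulative counter over the trailing window (simplified)."""
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


# Illustrative data only: (seconds since start, cumulative cost in USD).
cost_samples = [(0, 2.0), (900, 4.5), (1800, 7.0), (2700, 10.5), (3600, 13.0)]

hourly_cost = increase(cost_samples, now=3600, window=3600)  # 13.0 - 2.0
cost_alert = hourly_cost > 10  # threshold taken from the LLMCostThreshold rule
```

Here the counter grows by 11 USD within the hour, which exceeds the 10 USD threshold, so the condition is true.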

Also fixed:
astroml/db/schema.py — indentation errors on ProcessedLedger model (duplicate mapped_column and extra ) at EOF)

Closes #404

drips-wave Bot commented Jun 28, 2026

@Sadeequ Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Development

Successfully merging this pull request may close these issues.

[LLM] Implement health checks for LLM infrastructure