LLM infrastructure implementation by Sadeequ · Pull Request #436 · Traqora/astroml

Sadeequ · 2026-06-28T23:19:30Z

I fixed the "Implement health checks for LLM infrastructure" issue by building end-to-end monitoring for the LLM provider layer.

What I created:

astroml/llm/metrics.py — Prometheus metrics for LLM usage: astroml_llm_requests_total, astroml_llm_request_latency_seconds, astroml_llm_cost_usd_total, astroml_llm_tokens_total, and astroml_llm_provider_health.

astroml/llm/health.py — Async health check functions for OpenAI, Anthropic, and HuggingFace using aiohttp. Returns per-provider status, latency, and HTTP codes.
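For reviewers who want a mental model of the health module, here is a minimal sketch of a concurrent per-provider check. It is not the code in this PR: the names `ProviderHealth`, `check_provider`, `check_all`, and `aiohttp_probe` are hypothetical, the probe URLs are placeholders (the PR description does not say which endpoints are called), and treating only 2xx responses as healthy is an assumption. The probe is injectable so the logic can be exercised without network access.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Optional

# Placeholder URLs (assumption): substitute whatever endpoints the real module probes.
PROVIDER_URLS: Dict[str, str] = {
    "openai": "https://example.invalid/openai",
    "anthropic": "https://example.invalid/anthropic",
    "huggingface": "https://example.invalid/huggingface",
}

# A probe takes a URL and returns the HTTP status code it observed.
Probe = Callable[[str], Awaitable[int]]


@dataclass
class ProviderHealth:
    provider: str
    healthy: bool
    latency_seconds: float
    http_status: Optional[int]
    error: Optional[str] = None


async def aiohttp_probe(url: str) -> int:
    """Issue a GET with aiohttp and return the response status code."""
    import aiohttp  # imported lazily so the rest of the sketch runs without it

    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            return response.status


async def check_provider(provider: str, url: str, probe: Probe) -> ProviderHealth:
    """Probe one provider and report status, latency, and HTTP code."""
    start = time.perf_counter()
    try:
        status = await probe(url)
    except Exception as exc:  # timeouts, DNS failures, connection errors
        elapsed = time.perf_counter() - start
        return ProviderHealth(provider, False, elapsed, None, repr(exc))
    elapsed = time.perf_counter() - start
    return ProviderHealth(provider, 200 <= status < 300, elapsed, status)


async def check_all(probe: Probe = aiohttp_probe) -> Dict[str, ProviderHealth]:
    """Check every configured provider concurrently."""
    results = await asyncio.gather(
        *(check_provider(name, url, probe) for name, url in PROVIDER_URLS.items())
    )
    return {result.provider: result for result in results}
```

Running the checks with `asyncio.gather` keeps the all-providers endpoint about as slow as the slowest provider instead of the sum of all three, and catching exceptions per provider means one outage does not hide the status of the others.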

astroml/llm/tracker.py — Wired the existing LLMUsageTracker to emit Prometheus metrics on every successful/failed LLM call. Added record_error() for error-rate tracking.
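The tracker wiring could look roughly like the sketch below, which uses the `prometheus_client` library. Only the metric names come from this PR; the label sets (`provider`, `model`, `status`, `kind`), the `UsageTrackerSketch` class, and the `record_success()` signature are assumptions made for illustration, and the real `LLMUsageTracker` may differ. A private `CollectorRegistry` keeps the example isolated from the process-wide default registry.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# Private registry (assumption): the real module may use the default registry.
REGISTRY = CollectorRegistry()

# Metric names are taken from the PR description; labels are hypothetical.
REQUESTS = Counter(
    "astroml_llm_requests_total",
    "LLM requests by outcome",
    ["provider", "model", "status"],
    registry=REGISTRY,
)
LATENCY = Histogram(
    "astroml_llm_request_latency_seconds",
    "LLM request latency in seconds",
    ["provider", "model"],
    registry=REGISTRY,
)
COST = Counter(
    "astroml_llm_cost_usd_total",
    "Cumulative LLM spend in USD",
    ["provider", "model"],
    registry=REGISTRY,
)
TOKENS = Counter(
    "astroml_llm_tokens_total",
    "Tokens consumed, split by kind",
    ["provider", "model", "kind"],
    registry=REGISTRY,
)


class UsageTrackerSketch:
    """Hypothetical stand-in for LLMUsageTracker that only emits metrics."""

    def record_success(
        self,
        provider: str,
        model: str,
        latency_seconds: float,
        prompt_tokens: int,
        completion_tokens: int,
        cost_usd: float,
    ) -> None:
        REQUESTS.labels(provider, model, "success").inc()
        LATENCY.labels(provider, model).observe(latency_seconds)
        TOKENS.labels(provider, model, "prompt").inc(prompt_tokens)
        TOKENS.labels(provider, model, "completion").inc(completion_tokens)
        COST.labels(provider, model).inc(cost_usd)

    def record_error(self, provider: str, model: str) -> None:
        # Failures share the request counter so an error rate can be
        # derived as errors / all requests in a single PromQL expression.
        REQUESTS.labels(provider, model, "error").inc()
```

Keeping successes and failures in one counter with a `status` label is one common design; a separate error counter would work equally well if the alert rule is written against it.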

astroml/llm/explainer.py — Added error tracking in FraudExplainer so failures increment the error counter.

api/routers/llm_health.py — New FastAPI router exposing:
GET /api/v1/llm/health — all providers at once
GET /api/v1/llm/health/{provider} — single provider

api/app.py — Registered llm_health_router and added a /metrics endpoint serving Prometheus generate_latest() output.

api/routers/__init__.py — Exported llm_health_router.

monitoring/prometheus/alert_rules.yml — Added four LLM alert rules: LLMProviderDown (critical), LLMHighErrorRate, LLMCostThreshold ($10/hr window), LLMHighLatency (P95 > 5s).

monitoring/prometheus/prometheus.yml — Added astroml-api scrape job targeting api:8000/metrics.

monitoring/grafana/llm_health_dashboard.json — Full Grafana dashboard with panels for provider health, P95 latency, error rate, 1h cost, token volume, and total requests.

docs/runbooks/llm_health.md — Runbook covering architecture, metrics reference, alert response procedures (provider down, error spikes, cost spikes), and verification commands.

api/tests/test_llm_health.py — Integration tests covering health endpoints, per-provider health, and metrics exposition.

Acceptance criteria met:

Health checks poll every 60s via Prometheus (scrape config + dashboard refresh)
Cost alerts fire within 1 hour (Prometheus LLMCostThreshold rule uses increase(...[1h]) > 10)
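To make the cost criterion concrete, the sketch below approximates in plain Python what an `increase(...[1h]) > 10` condition evaluates over a cumulative cost counter. It is a simplification: real Prometheus also extrapolates to the window boundaries, which is omitted here, and the function names and sample data are hypothetical.

```python
from typing import List, Tuple

# One scraped sample: (unix timestamp in seconds, cumulative cost in USD).
Sample = Tuple[float, float]


def counter_increase(
    samples: List[Sample], now: float, window_seconds: float = 3600.0
) -> float:
    """Approximate the growth of a cumulative counter inside a time window.

    A drop in value is treated as a counter reset (for example, a process
    restart), so the post-reset value is counted as fresh growth from zero.
    """
    in_window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    total = 0.0
    for (_, previous), (_, current) in zip(in_window, in_window[1:]):
        total += current - previous if current >= previous else current
    return total


def cost_alert_firing(
    samples: List[Sample], now: float, threshold_usd: float = 10.0
) -> bool:
    """Return True when spend over the last hour exceeds the threshold."""
    return counter_increase(samples, now) > threshold_usd
```

Because the rule looks at growth within the window rather than at the counter's absolute value, an API restart that resets the counter does not silence the alert or make it fire spuriously.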

Also fixed:
astroml/db/schema.py — indentation errors on ProcessedLedger model (duplicate mapped_column and extra ) at EOF)

Closes #404

drips-wave · 2026-06-28T23:19:38Z

@Sadeequ Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

LLM infrastructure implementation

5757ef4

LLM infrastructure implementation #436
Sadeequ wants to merge 1 commit into Traqora:main from Sadeequ:LLM_infra
