Add comprehensive observability with Prometheus metrics #3

luabagg · 2025-11-22T04:02:02Z

…anced health checks

Add production-ready monitoring and health check infrastructure:

Prometheus Metrics:

HTTP metrics: request count, latency, request/response sizes
Upload metrics: total uploads, file sizes, processing times, error tracking
File type metrics: counter by content type
Queue metrics: messages published/consumed, processing time, poison queue
Cache metrics: hits, misses, errors by cache type
Dependency health gauges: S3, DynamoDB, Redis status (1=up, 0=down)
AWS metrics: S3 and DynamoDB operation counts and latencies
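To make the shape of these metrics concrete, here is a minimal sketch of how a counter and a dependency health gauge look in the Prometheus text exposition format. It uses only the Python standard library as a stand-in for a real Prometheus client. The class name `MetricsRegistry` and the metric and label names (`cache_hits_total`, `dependency_up`, `cache_type`, `dependency`) are hypothetical and are not taken from this PR.

```python
from collections import Counter


class MetricsRegistry:
    """Tiny in-memory registry; a stand-in for a real Prometheus client library."""

    def __init__(self):
        self.counters = Counter()  # monotonically increasing values
        self.gauges = {}           # values that can go up or down

    def inc(self, name, labels, amount=1):
        # Label pairs are sorted so the same label set always maps to one series.
        self.counters[(name, tuple(sorted(labels.items())))] += amount

    def set_gauge(self, name, labels, value):
        self.gauges[(name, tuple(sorted(labels.items())))] = value

    def render(self):
        """Render every series as a `name{label="value"} value` line."""
        lines = []
        merged = {**self.counters, **self.gauges}
        for (name, labels), value in sorted(merged.items()):
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        return "\n".join(lines) + "\n"


registry = MetricsRegistry()
registry.inc("cache_hits_total", {"cache_type": "metadata"})   # hypothetical name
registry.set_gauge("dependency_up", {"dependency": "redis"}, 1)  # 1 = up
registry.set_gauge("dependency_up", {"dependency": "s3"}, 0)     # 0 = down
print(registry.render(), end="")
```

With a gauge like this, a query that filters for series equal to 0 surfaces every dependency currently reported as down.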

Enhanced Health Checks:

GET /v1/health - Basic health check
GET /v1/health/live - Kubernetes liveness probe (checks app is running)
GET /v1/health/ready - Kubernetes readiness probe (checks all dependencies)
- Tests Redis connectivity with ping
- Tests S3 access with ListBuckets
- Tests DynamoDB access with ListTables
- Returns 503 if any dependency is down
- Includes latency measurements for each dependency
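The readiness behaviour described above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: the function name `run_readiness_checks`, the response field names, and the stub check functions are hypothetical, and the real checks would call Redis ping, S3 ListBuckets, and DynamoDB ListTables.

```python
import time
from typing import Callable, Dict, Tuple


def run_readiness_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each dependency check, time it, and derive the HTTP status code.

    A check signals failure by raising an exception.
    """
    dependencies = {}
    all_up = True
    for name, check in checks.items():
        start = time.perf_counter()
        error = None
        try:
            check()
        except Exception as exc:  # any failure marks the dependency as down
            error = str(exc)
            all_up = False
        latency_ms = (time.perf_counter() - start) * 1000
        entry = {
            "status": "down" if error is not None else "up",
            "latency_ms": round(latency_ms, 2),
        }
        if error is not None:
            entry["error"] = error
        dependencies[name] = entry

    status_code = 200 if all_up else 503  # 503 if any dependency is down
    body = {"status": "ready" if all_up else "not_ready", "dependencies": dependencies}
    return status_code, body


# Hypothetical stubs standing in for the real Redis, S3, and DynamoDB calls.
def redis_ping():
    pass


def s3_list_buckets():
    raise ConnectionError("timed out")


def dynamodb_list_tables():
    pass


status_code, body = run_readiness_checks(
    {"redis": redis_ping, "s3": s3_list_buckets, "dynamodb": dynamodb_list_tables}
)
print(status_code, body["status"])
```

Every check still runs after one fails, so the response reports the status and latency of all dependencies rather than only the first failure.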

Metrics Middleware:

Automatically tracks all HTTP requests
Records latency histograms with configurable buckets
Captures request/response sizes
Labels by method, endpoint, and status code
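As an illustration of what such a middleware does, the sketch below wraps a WSGI application and records request counts and cumulative latency buckets keyed by method, endpoint, and status code. It is a standard-library approximation, not the code from this PR: the class name `MetricsMiddleware`, the default bucket boundaries, and the demo app are assumptions, and a production version would hand the observations to a Prometheus client library.

```python
import time
from collections import defaultdict

# Assumed bucket boundaries in seconds; the PR describes them as configurable.
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


class MetricsMiddleware:
    def __init__(self, app, buckets=DEFAULT_BUCKETS):
        self.app = app
        self.buckets = tuple(sorted(buckets))
        self.request_count = defaultdict(int)
        # One counter per bucket, plus a final slot for the +Inf bucket.
        self.bucket_counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))

    def observe(self, method, endpoint, status, seconds):
        key = (method, endpoint, status)
        self.request_count[key] += 1
        counts = self.bucket_counts[key]
        # Histogram buckets are cumulative: increment every bucket whose
        # upper bound is at or above the observed value.
        for index, upper in enumerate(self.buckets):
            if seconds <= upper:
                counts[index] += 1
        counts[-1] += 1  # +Inf bucket counts every observation

    def __call__(self, environ, start_response):
        start = time.perf_counter()
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status.split(" ", 1)[0]  # "200 OK" -> "200"
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            # Measures time until the app returns its response iterable,
            # not until the body has been fully sent to the client.
            self.observe(
                environ.get("REQUEST_METHOD", "GET"),
                environ.get("PATH_INFO", "/"),
                captured.get("status", "500"),
                time.perf_counter() - start,
            )


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


middleware = MetricsMiddleware(demo_app, buckets=(0.01, 0.05, 0.1))
response = middleware(
    {"REQUEST_METHOD": "GET", "PATH_INFO": "/v1/health"},
    lambda status, headers, exc_info=None: None,
)
```

Using the raw request path as the endpoint label can create one series per distinct URL; labelling by route template keeps label cardinality bounded.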

Documentation:

Comprehensive observability guide (docs/OBSERVABILITY.md)
Prometheus configuration examples
Grafana dashboard queries
Kubernetes integration (ServiceMonitor, deployment configs)
Alert rule examples for common issues
Troubleshooting guide

Endpoints:

/v1/metrics - Prometheus-compatible metrics endpoint
/v1/health - Basic health status
/v1/health/live - Liveness probe
/v1/health/ready - Readiness probe with dependency checks

Production Benefits:

Real-time visibility into service health
Automatic dependency monitoring
Kubernetes-ready probes for automated restarts and traffic routing
SLA monitoring capabilities
Capacity planning data
Early warning for degraded dependencies

This closes the observability gap and provides production-grade monitoring.


Add Docker Compose integration for testing metrics and health checks locally:

**Prometheus Setup:**
- Pre-configured to scrape Filepoint API and Webhook Sender
- Scrapes /v1/metrics every 15 seconds
- Persistent storage with Docker volumes
- Accessible at http://localhost:9090

**Grafana Setup:**
- Auto-configured Prometheus datasource
- Accessible at http://localhost:3000 (admin/admin)
- Ready for dashboard creation
- Persistent storage for dashboards

**Testing Guide (docs/TESTING_OBSERVABILITY.md):**
- Step-by-step local testing instructions
- Example Prometheus queries for all metrics
- Grafana dashboard panel configurations
- Testing scenarios (normal operation, dependency down, high load)
- Troubleshooting guide
- Performance testing with vegeta
- What metrics to monitor and thresholds

**No External Accounts Needed:**
- Everything runs locally via Docker
- No cloud services required
- Full observability stack in one command: docker compose up

**Configuration Files:**
- config/prometheus.yml - Prometheus scrape configs
- config/grafana-datasource.yml - Auto-provisions Prometheus datasource

This enables developers to test metrics, health checks, and dashboards locally before deploying to production.
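For a quick local check without opening Prometheus or Grafana, a small script can read the text served at /v1/metrics and flag dependencies reported as down. The sketch below is an assumption-laden illustration: the helper names, the `dependency_up` metric name, and the sample payload are hypothetical, and the parser only handles sample lines of the form `series value` without a trailing timestamp.

```python
import urllib.request


def fetch_metrics(url):
    """Fetch raw metrics text, e.g. from the API's /v1/metrics endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Map each series (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def down_dependencies(samples, gauge="dependency_up"):
    """Return the series of the given gauge whose value is 0 (down)."""
    return sorted(
        series
        for series, value in samples.items()
        if series.startswith(gauge + "{") and value == 0
    )


# Hypothetical payload in the Prometheus text exposition format.
SAMPLE = """\
# HELP dependency_up Dependency status (1=up, 0=down)
# TYPE dependency_up gauge
dependency_up{dependency="redis"} 1
dependency_up{dependency="s3"} 0
dependency_up{dependency="dynamodb"} 1
"""

samples = parse_metrics(SAMPLE)
print(down_dependencies(samples))
```

Against a running stack, the text returned by `fetch_metrics` with the API's own metrics URL can replace `SAMPLE`, for example while stopping a dependency container to exercise the "dependency down" scenario.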

claude added 2 commits November 22, 2025 04:00


3 participants
