@luabagg luabagg commented Nov 22, 2025

…anced health checks

Add production-ready monitoring and health check infrastructure:

Prometheus Metrics:

  • HTTP metrics: request count, latency, request/response sizes
  • Upload metrics: total uploads, file sizes, processing times, error tracking
  • File type metrics: counter by content type
  • Queue metrics: messages published/consumed, processing time, poison queue
  • Cache metrics: hits, misses, errors by cache type
  • Dependency health gauges: S3, DynamoDB, Redis status (1=up, 0=down)
  • AWS metrics: S3 and DynamoDB operation counts and latencies
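
For reviewers unfamiliar with what these metrics look like on the wire, here is a minimal stdlib-only sketch of a counter and a gauge rendered in the Prometheus text exposition format. It is not the code in this PR: the `Metric` class and the metric names `dependency_up` and `cache_hits_total` are hypothetical, chosen only to mirror the dependency gauges (1=up, 0=down) and cache counters listed above.

```python
class Metric:
    """Tiny illustrative counter/gauge; not a replacement for a real client library."""

    def __init__(self, name, help_text, kind, label_names):
        if kind not in ("counter", "gauge"):
            raise ValueError("kind must be 'counter' or 'gauge'")
        self.name = name
        self.help_text = help_text
        self.kind = kind
        self.label_names = tuple(label_names)
        self.samples = {}  # maps a tuple of label values to the current value

    def inc(self, labels, amount=1):
        # Counters may only go up; gauges may move in either direction.
        if self.kind == "counter" and amount < 0:
            raise ValueError("counters can only increase")
        self.samples[labels] = self.samples.get(labels, 0) + amount

    def set(self, labels, value):
        if self.kind != "gauge":
            raise TypeError("only gauges can be set directly")
        self.samples[labels] = value

    def render(self):
        """Render in the text exposition format (label escaping omitted for brevity)."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} {self.kind}",
        ]
        for labels, value in sorted(self.samples.items()):
            pairs = ",".join(
                f'{key}="{val}"' for key, val in zip(self.label_names, labels)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


# Hypothetical metric names, for illustration only.
dependency_up = Metric(
    "dependency_up", "Dependency status (1=up, 0=down)", "gauge", ["dependency"]
)
dependency_up.set(("s3",), 1)
dependency_up.set(("redis",), 0)

cache_hits = Metric("cache_hits_total", "Cache hits by cache type", "counter", ["cache"])
cache_hits.inc(("metadata",))
cache_hits.inc(("metadata",))

print(dependency_up.render() + cache_hits.render(), end="")
```

A real service would normally use an official Prometheus client library rather than hand-rolling this; the sketch only shows the shape of the data a scrape returns.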

Enhanced Health Checks:

  • GET /v1/health - Basic health check
  • GET /v1/health/live - Kubernetes liveness probe (checks app is running)
  • GET /v1/health/ready - Kubernetes readiness probe (checks all dependencies)
    • Tests Redis connectivity with ping
    • Tests S3 access with ListBuckets
    • Tests DynamoDB access with ListTables
    • Returns 503 if any dependency is down
    • Includes latency measurements for each dependency
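
The readiness behaviour described above (run every dependency check, time each one, return 503 if any fails) can be sketched roughly as follows. This is an assumption-laden illustration, not the handler from this PR: `run_readiness_checks`, the response field names, and the stub checks are all hypothetical, and the real checks would call Redis ping, S3 ListBuckets, and DynamoDB ListTables.

```python
import time


def run_readiness_checks(checks):
    """Run each dependency check and return (http_status, response_body).

    `checks` maps a dependency name to a zero-argument callable that
    raises an exception when the dependency is unreachable.
    """
    dependencies = {}
    all_up = True
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            result = {"status": "up"}
        except Exception as exc:  # any failure marks the dependency as down
            result = {"status": "down", "error": str(exc)}
            all_up = False
        # Latency is recorded for both successful and failed checks.
        result["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
        dependencies[name] = result

    body = {
        "status": "ready" if all_up else "not_ready",
        "dependencies": dependencies,
    }
    return (200 if all_up else 503), body


# Stub checks standing in for the real Redis, S3, and DynamoDB calls.
def redis_ping():
    pass


def dynamodb_list_tables():
    raise ConnectionError("connection refused")


status_code, body = run_readiness_checks(
    {"redis": redis_ping, "dynamodb": dynamodb_list_tables}
)
print(status_code, body["status"])
```

Running the checks sequentially keeps the sketch simple; running them concurrently with a per-check timeout would bound the probe's total latency when a dependency hangs.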

Metrics Middleware:

  • Automatically tracks all HTTP requests
  • Records latency histograms with configurable buckets
  • Captures request/response sizes
  • Labels by method, endpoint, and status code
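
As a rough illustration of the middleware idea, the sketch below wraps a WSGI app, counts requests, and records latency into configurable histogram buckets, labelled by method, endpoint, and status code. It is a stdlib-only approximation under stated assumptions: `MetricsMiddleware`, `DEFAULT_BUCKETS`, and the in-memory dictionaries are hypothetical, request/response size tracking is left out, and the PR's actual middleware may be structured quite differently.

```python
import time
from bisect import bisect_left

# Upper bounds in seconds; an implicit +Inf bucket is added at the end.
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


class MetricsMiddleware:
    """WSGI middleware recording request counts and latency histograms."""

    def __init__(self, app, buckets=DEFAULT_BUCKETS):
        self.app = app
        self.buckets = tuple(sorted(buckets))
        self.request_count = {}   # (method, endpoint, status) -> count
        self.latency_counts = {}  # (method, endpoint, status) -> per-bucket counts

    def __call__(self, environ, start_response):
        started = time.perf_counter()
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status.split(" ", 1)[0]
            return start_response(status, headers, exc_info)

        try:
            # Note: this times the app call only, not streaming of the body.
            return self.app(environ, recording_start_response)
        finally:
            # Using the raw path as a label can explode cardinality; a real
            # middleware would usually label by route template instead.
            key = (
                environ.get("REQUEST_METHOD", ""),
                environ.get("PATH_INFO", ""),
                captured.get("status", "500"),
            )
            self._observe(key, time.perf_counter() - started)

    def _observe(self, key, seconds):
        self.request_count[key] = self.request_count.get(key, 0) + 1
        counts = self.latency_counts.setdefault(key, [0] * (len(self.buckets) + 1))
        # bisect_left finds the first bucket whose upper bound is >= seconds.
        counts[bisect_left(self.buckets, seconds)] += 1

    def cumulative(self, key):
        """Return Prometheus-style cumulative bucket counts for one label set."""
        total, result = 0, []
        for count in self.latency_counts[key]:
            total += count
            result.append(total)
        return result
```

Wrapping an app is then a one-liner such as `app = MetricsMiddleware(app, buckets=(0.1, 0.5, 1.0))`, which is what makes the tracking automatic for every route.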

Documentation:

  • Comprehensive observability guide (docs/OBSERVABILITY.md)
  • Prometheus configuration examples
  • Grafana dashboard queries
  • Kubernetes integration (ServiceMonitor, deployment configs)
  • Alert rule examples for common issues
  • Troubleshooting guide

Endpoints:

  • /v1/metrics - Prometheus-compatible metrics endpoint
  • /v1/health - Basic health status
  • /v1/health/live - Liveness probe
  • /v1/health/ready - Readiness probe with dependency checks

Production Benefits:

  • Real-time visibility into service health
  • Automatic dependency monitoring
  • Kubernetes-ready probes for automatic restarts (liveness) and traffic routing (readiness)
  • SLA monitoring capabilities
  • Capacity planning data
  • Early warning for degraded dependencies

This closes the observability gap and provides production-grade monitoring.

Add Docker Compose integration for testing metrics and health checks locally:

**Prometheus Setup:**
- Pre-configured to scrape Filepoint API and Webhook Sender
- Scrapes /v1/metrics every 15 seconds
- Persistent storage with Docker volumes
- Accessible at http://localhost:9090

**Grafana Setup:**
- Auto-configured Prometheus datasource
- Accessible at http://localhost:3000 (default login: admin/admin)
- Ready for dashboard creation
- Persistent storage for dashboards

**Testing Guide (docs/TESTING_OBSERVABILITY.md):**
- Step-by-step local testing instructions
- Example Prometheus queries for all metrics
- Grafana dashboard panel configurations
- Testing scenarios (normal operation, dependency down, high load)
- Troubleshooting guide
- Performance testing with vegeta
- Which metrics to monitor and suggested thresholds
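
For scripted checks against the local stack, queries can also be sent to Prometheus's HTTP API at `/api/v1/query` instead of the web UI. The sketch below is illustrative only: the helper names are hypothetical, and the metric name `http_requests_total` in the example PromQL is an assumption, so substitute whatever names the service actually exports.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # local address given in this PR


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def run_instant_query(promql, base_url=PROMETHEUS_URL, timeout=5):
    """Execute the query and return the list of result samples."""
    with urlopen(instant_query_url(promql, base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Hypothetical metric name; adjust to the names exposed at /v1/metrics.
REQUEST_RATE_BY_STATUS = "sum(rate(http_requests_total[5m])) by (status)"

if __name__ == "__main__":
    print(instant_query_url(REQUEST_RATE_BY_STATUS))
```

Calling `run_instant_query("up")` once `docker compose up` has finished is a quick way to confirm that the scrape targets are reachable.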

**No External Accounts Needed:**
- Everything runs locally via Docker
- No cloud services required
- Full observability stack in one command: docker compose up

**Configuration Files:**
- config/prometheus.yml - Prometheus scrape configs
- config/grafana-datasource.yml - Auto-provisions Prometheus datasource

This enables developers to test metrics, health checks, and dashboards
locally before deploying to production.