Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions devenv.nix
Original file line number Diff line number Diff line change
Expand Up @@ -24,5 +24,21 @@
web-marketing.exec = "cd marketing && pnpm run dev";
go-server.exec = "cd server && air";
ai-service.exec = "cd services/ai && uv run uvicorn app.main:app --port 8000";
otel.exec = "docker run -d --name jaeger -e COLLECTOR_OTLP_ENABLED=true -p 16686:16686 -p 4317:4317 -p 4318:4318 jaegertracing/all-in-one:latest";
};

tasks = {
"otel:cleanup" = {
exec = ''
echo "Cleaning up Jaeger container..."
# Stop and remove the Jaeger container
docker stop jaeger 2>/dev/null || true
docker rm jaeger 2>/dev/null || true
# Clean up any dangling volumes (optional)
docker volume prune -f 2>/dev/null || true
echo "Jaeger cleanup completed"
'';
after = [ "devenv:processes:otel" ];
};
};
}
213 changes: 213 additions & 0 deletions docs/OBSERVABILITY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,213 @@
# Observability Setup

This document explains the OpenTelemetry and logging setup for the nuts backend.

## Overview

The nuts backend now includes comprehensive observability through:
- **OpenTelemetry** for distributed tracing and metrics collection
- **Enhanced logging** with trace context integration
- **Database query tracing** for PostgreSQL operations
- **HTTP request instrumentation** for all API endpoints
- **Business metrics** for tracking key application events
- **Error tracking** with detailed error classification

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OTEL_ENABLED` | `true` | Enable/disable OpenTelemetry tracing |
| `OTEL_SERVICE_NAME` | `nuts-backend` | Service name for traces |
| `OTEL_SERVICE_VERSION` | `unknown` | Service version for traces |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | (empty) | OTLP HTTP endpoint for trace export |
| `OTEL_EXPORTER_OTLP_HEADERS` | (empty) | HTTP headers for OTLP requests (format: key1=value1,key2=value2) |
| `OTEL_EXPORTER_OTLP_COMPRESSION` | `gzip` | Compression for OTLP data (gzip or none) |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | `http/protobuf` | Protocol for OTLP exports |
| `OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE` | `delta` | Metrics temporality (delta or cumulative) |
| `OTEL_RESOURCE_ATTRIBUTES` | (empty) | Additional resource attributes (format: key1=value1,key2=value2) |
| `OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT` | `4095` | Maximum length for attribute values |
| `LOG_LEVEL` | `info` | Log level (trace, debug, info, warn, error, fatal, panic) |
| `ENVIRONMENT` | `development` | Environment (affects log format and defaults) |

### Local Development Setup

For local development with Jaeger:

```bash
# Start Jaeger with Docker
docker run -d --name jaeger \
-p 16686:16686 \
-p 14268:14268 \
-p 4318:4318 \
-e COLLECTOR_OTLP_ENABLED=true \
jaegertracing/all-in-one:latest

# Configure environment
export OTEL_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export LOG_LEVEL=debug

# Start the backend
go run ./server/cmd/api
```

Then visit http://localhost:16686 to view traces in Jaeger UI.

## Features

### HTTP Request Tracing
- Every HTTP request creates a trace span when `OTEL_ENABLED=true`
- Includes request method, URL, response status, and duration
- Automatic trace context propagation
- Can be disabled by setting `OTEL_ENABLED=false`

### Database Query Tracing
- All PostgreSQL queries are traced using [otelpgx](https://github.com/exaring/otelpgx)
- Includes SQL statement, execution time, and affected rows
- Error tracking for failed queries
- Configurable via `OTEL_ENABLED` setting
- Zero overhead when telemetry is disabled

### Metrics Collection
- **HTTP Request Metrics**: Request count, duration histograms by handler and status code
- **Error Metrics**: Error counts categorized by type and handler
- **Business Metrics**: Key application events (logins, signups, transactions, etc.)
- **Authentication Events**: Login success/failure, token refresh, MFA operations
- **Transaction Events**: Creation, deletion, updates with success/failure tracking
- **User Events**: Profile updates, account operations
- All metrics respect the `OTEL_ENABLED` configuration for complete control

#### Available Metrics
- `http_requests_total`: Counter of HTTP requests by method, handler, and status
- `http_request_duration_seconds`: Histogram of HTTP request durations
- `errors_total`: Counter of errors by error type and handler
- `business_events_total`: Counter of business events by event type and outcome

#### Business Event Examples
- `auth_login` (success/failure)
- `auth_signup` (success/failure)
- `auth_token_refresh` (success/failure)
- `auth_logout` (success/failure)
- `auth_mfa_setup_initiate` (success/failure)
- `transaction_create` (success/failure)
- `transaction_delete` (success/failure)
- `account_create` (success/failure)
- `user_update` (success/failure)

### Enhanced Logging
- Structured JSON logging in production
- Console-friendly logging in development
- Automatic trace_id and span_id injection when telemetry is enabled
- Contextual information (request_id, user_id, etc.)
- Works with or without OpenTelemetry enabled

### Example Log Output

```json
{
"level": "info",
"time": "2024-01-15T10:30:00Z",
"trace_id": "4128d29e878cbbc9e873c4625d9e5cd9",
"span_id": "4b25642f3e1c5176",
"request_id": "pkrvmpptgkbjq6m/Cij20K5UKX-000001",
"message": "Login attempt started",
"handler": "auth.Login",
"remote_addr": "192.168.1.100",
"email": "user@example.com"
}
```

### Disabling Telemetry

To completely disable OpenTelemetry tracing:

```bash
export OTEL_ENABLED=false
```

When disabled:
- No HTTP tracing middleware is added
- No database query tracing is configured
- Logging continues to work without trace context
- Zero performance impact from telemetry
- All existing functionality remains intact

## Production Considerations

### OTLP Exporter Endpoints

#### Basic Configuration
- **Jaeger**: `http://jaeger:4318`
- **Local OTEL Collector**: `http://localhost:4318`

#### Cloud Providers with Authentication
- **Honeycomb**:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=your_api_key
```
- **Grafana Cloud**:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-central-0.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS=authorization=Basic base64(instanceId:token)
```
- **New Relic**:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.nr-data.net
OTEL_EXPORTER_OTLP_HEADERS=api-key=your_license_key
```
- **Datadog** (requires Datadog Agent with OTLP enabled):
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

#### Advanced Configuration
```bash
# Optimize for bandwidth with compression
OTEL_EXPORTER_OTLP_COMPRESSION=gzip

# Add custom resource attributes for better filtering
OTEL_RESOURCE_ATTRIBUTES=environment=production,datacenter=us-east-1,version=1.2.3

# Use cumulative metrics for some monitoring systems
OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative
```

### Security
- Never log sensitive information (passwords, tokens, etc.)
- Use appropriate log levels in production
- Ensure OTLP endpoints are secured with authentication

### Performance
- Sampling can be configured for high-traffic applications
- Database query logging should be set to appropriate levels
- Consider using asynchronous exporters for production

## Monitoring Alerts

Consider setting up alerts for:
- High error rates in traces
- Slow database queries
- Authentication failures
- Service availability

## Troubleshooting

### No traces appearing
1. Check `OTEL_ENABLED` is set to `true`
2. Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is correct
3. Ensure the OTLP endpoint is reachable
4. Check application logs for telemetry errors

### Missing trace context in logs
1. Ensure requests go through the OTEL HTTP middleware
2. Check that the logger is created with `logging.LoggerWithTraceCtx()`
3. Verify spans are being created properly

### High overhead
1. Adjust log levels to reduce verbosity
2. Configure sampling rates for traces
3. Use asynchronous exporters
4. Monitor resource usage
27 changes: 18 additions & 9 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ require (
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.17.72
github.com/aws/aws-sdk-go-v2/service/s3 v1.79.2
github.com/docker/go-connections v0.5.0
github.com/exaring/otelpgx v0.9.3
github.com/go-chi/chi/v5 v5.2.2
github.com/go-playground/validator/v10 v10.26.0
github.com/golang-jwt/jwt/v5 v5.2.2
Expand All @@ -31,12 +32,18 @@ require (
github.com/stretchr/testify v1.10.0
github.com/testcontainers/testcontainers-go v0.36.0
github.com/testcontainers/testcontainers-go/modules/localstack v0.36.0
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.49.0
go.opentelemetry.io/otel v1.35.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.24.0
go.opentelemetry.io/otel/sdk v1.34.0
go.opentelemetry.io/otel/trace v1.35.0
golang.org/x/crypto v0.37.0
golang.org/x/text v0.24.0
gopkg.in/gomail.v2 v2.0.0-20160411212932-81ebce5c23df
)

require (
cloud.google.com/go/compute/metadata v0.3.0 // indirect
cloud.google.com/go/compute/metadata v0.5.2 // indirect
dario.cat/mergo v1.0.1 // indirect
github.com/Azure/go-ansiterm v0.0.0-20230124172434-306776ec8161 // indirect
github.com/Microsoft/go-winio v0.6.2 // indirect
Expand All @@ -56,7 +63,7 @@ require (
github.com/aws/smithy-go v1.22.3 // indirect
github.com/boombuler/barcode v1.0.2 // indirect
github.com/boyter/go-string v1.0.5 // indirect
github.com/cenkalti/backoff/v4 v4.2.1 // indirect
github.com/cenkalti/backoff/v4 v4.3.0 // indirect
github.com/containerd/log v0.1.0 // indirect
github.com/containerd/platforms v0.2.1 // indirect
github.com/cpuguy83/dockercfg v0.3.2 // indirect
Expand All @@ -77,6 +84,7 @@ require (
github.com/goccy/go-json v0.10.5 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/protobuf v1.5.4 // indirect
github.com/grpc-ecosystem/grpc-gateway/v2 v2.25.1 // indirect
github.com/jackc/pgpassfile v1.0.0 // indirect
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 // indirect
github.com/jackc/puddle/v2 v2.2.2 // indirect
Expand Down Expand Up @@ -115,21 +123,22 @@ require (
github.com/tklauser/numcpus v0.7.0 // indirect
github.com/yusufpapurcu/wmi v1.2.4 // indirect
go.opentelemetry.io/auto/sdk v1.1.0 // indirect
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.49.0 // indirect
go.opentelemetry.io/otel v1.35.0 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.24.0 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.34.0 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.24.0 // indirect
go.opentelemetry.io/otel/metric v1.35.0 // indirect
go.opentelemetry.io/otel/sdk v1.34.0 // indirect
go.opentelemetry.io/otel/trace v1.35.0 // indirect
go.opentelemetry.io/otel/sdk/metric v1.34.0 // indirect
go.opentelemetry.io/proto/otlp v1.5.0 // indirect
go.uber.org/goleak v1.3.0 // indirect
golang.org/x/mod v0.24.0 // indirect
golang.org/x/net v0.38.0 // indirect
golang.org/x/oauth2 v0.17.0 // indirect
golang.org/x/oauth2 v0.24.0 // indirect
golang.org/x/sync v0.13.0 // indirect
golang.org/x/sys v0.32.0 // indirect
google.golang.org/appengine v1.6.8 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20250115164207-1a7da9e5054f // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20250115164207-1a7da9e5054f // indirect
google.golang.org/grpc v1.69.4 // indirect
google.golang.org/protobuf v1.36.6 // indirect
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc // indirect
gopkg.in/gomail.v2 v2.0.0-20160411212932-81ebce5c23df // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)
Loading