Otel sources#47

Merged
dantheuber merged 48 commits into main from otel-sources
Apr 7, 2026

@dantheuber commented Apr 7, 2026

Summary

Add a full OpenTelemetry (OTLP) push ingestion pipeline, covering both metrics and traces, alongside Prometheus scraping support. This transforms depsera from a pull-only health poller into a hybrid pull/push observability platform with automatic dependency discovery from distributed traces.

Linear tickets: DPS-77 through DPS-116

Changes

OTLP Push Ingestion (DPS-77 -- DPS-82)

  • Add health_endpoint_format column and OTel foundation types
  • Add TeamApiKeyStore with full API key CRUD routes and audit logging
  • Add requireApiKeyAuth middleware for team-scoped API key authentication
  • Add OTLP JSON parser and Prometheus text exposition parser (text/plain; version=0.0.4)
  • Add OTLP receiver endpoint (POST /v1/metrics) with auto-registration of unknown services and per-service custom metric/attribute name mappings
  • Integrate format-aware parser dispatch into health poller
  • Add format configuration UI and API key management pages
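
For readers unfamiliar with the Prometheus text exposition format mentioned above, the following is a minimal sketch of parsing sample lines. It is not the PrometheusParser from this PR; the names `PromSample` and `parsePrometheusText` are hypothetical, and the sketch ignores `# TYPE` metadata and escape sequences inside label values.

```typescript
// Hypothetical shape for one parsed sample; not taken from the PR.
interface PromSample {
  metric: string;
  labels: Record<string, string>;
  value: number;
}

// The format spells infinities as +Inf / -Inf, which Number() does not accept.
function parseSampleValue(text: string): number {
  if (text === "+Inf") return Infinity;
  if (text === "-Inf") return -Infinity;
  return Number(text); // "NaN" parses to NaN
}

// Extract key="value" pairs from the text between the braces.
function parseLabels(labelText: string): Record<string, string> {
  const labels: Record<string, string> = {};
  const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
  let m: RegExpExecArray | null;
  while ((m = labelRe.exec(labelText)) !== null) {
    labels[m[1]] = m[2]; // escape sequences are left as-is in this sketch
  }
  return labels;
}

function parsePrometheusText(body: string): PromSample[] {
  // name, optional {labels}, value, optional integer timestamp
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$/;
  const samples: PromSample[] = [];
  for (const rawLine of body.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks, HELP/TYPE/comments
    const m = lineRe.exec(line);
    if (!m) continue; // ignore lines this sketch cannot parse
    samples.push({
      metric: m[1],
      labels: m[2] ? parseLabels(m[2]) : {},
      value: parseSampleValue(m[3]),
    });
  }
  return samples;
}
```
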

API Key Rate Limiting & Usage Tracking (DPS-84 -- DPS-102)

  • Add migrations for rate limit columns and usage bucket tables
  • Add ApiKeyUsageStore for time-bucketed usage persistence
  • Add perKeyRateLimit and trackApiKeyUsage middleware with retention pruning
  • Add team and admin API routes for rate limit configuration and usage data
  • Extend OTLP stats endpoints with rate limit config and usage summaries
  • Add frontend pages: ApiKeyUsageChart, OtlpStats, ApiKeys, OtlpAdmin
  • Add comprehensive test coverage (unit, integration, and frontend component tests)

Trace-Based Dependency Discovery (DPS-110 -- DPS-116)

  • Add 5 new migrations (037-041): trace discovery schema, external node enrichment, percentile latency columns, span storage, and span retention settings
  • Add new DB types: DiscoverySource, Span, ExternalNodeEnrichment
  • Add new stores: SpanStore, ExternalNodeEnrichmentStore, AppSettingsStore, plus extensions to DependencyStore, AssociationStore, and LatencyHistoryStore
  • Add OTLP trace, histogram, and sum type definitions
  • Add TraceParser service -- extracts dependencies from CLIENT and PRODUCER spans in OTLP trace payloads
  • Add TraceDependencyBridge -- converts trace-discovered dependencies into dependency records
  • Add otlpServiceResolver -- shared module for service lookup/auto-creation
  • Add POST /v1/traces endpoint with full integration tests
  • Add histogram/sum metric processing with percentile latency extraction (p50/p95/p99) via linear interpolation
  • Add AutoAssociator -- auto-associates trace-discovered dependencies to registered services
  • Add management API endpoints for auto-discovered dependencies and external node enrichment
  • Integrate auto-discovered dependencies into the dependency graph builder
  • Add span retention cleanup and admin settings endpoint

Documentation & Misc

  • Update specs: data model, auth, API reference, health polling, dependency graph, security, store layer, and configuration
  • Update README with OTLP and Prometheus capabilities
  • Fix login redirect and auth timeout issues

Testing

  • New/updated tests included
  • All tests pass (npm test)
  • Linting passes (npm run lint)

Test coverage added:

  • Unit tests for TraceParser, TraceDependencyBridge, OtlpParser, PrometheusParser, AutoAssociator, perKeyRateLimit, trackApiKeyUsage, ApiKeyUsageStore, TeamApiKeyStore, SpanStore, ExternalNodeEnrichmentStore, AppSettingsStore, histogramPercentiles, and validation utilities
  • Integration tests for POST /v1/traces, POST /v1/metrics, rate limit/usage routes, admin OTLP stats, span retention, discovered dependency management, external node enrichment, and association confirm/dismiss flows
  • Frontend component tests for ApiKeyUsageChart, OtlpStats, ApiKeys, OtlpAdmin
  • Migration tests for 034 and 037-041

Checklist

Add migration 034 with health_endpoint_format discriminator on services
table (default/schema/prometheus/otlp) and team_api_keys table for OTLP
push authentication. Backfills existing services with schema_config to
schema format. Adds HealthEndpointFormat, TeamApiKey, and
CreateTeamApiKeyInput types, extends audit action/resource types for API
key operations, and updates store input types.
Implement team API key store infrastructure for OTLP push authentication:
- ITeamApiKeyStore interface with findByTeamId, findByKeyHash, create, delete, updateLastUsed
- TeamApiKeyStore implementation with dps_ prefixed key generation and SHA-256 hashing
- Register store in StoreRegistry
- Unit tests covering all CRUD operations
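
A minimal sketch of the key scheme described above, assuming the raw key is the `dps_` prefix followed by random hex and that only the SHA-256 hex digest is persisted. The function names and the 32-byte key length are assumptions for illustration, not details confirmed by this PR.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hash a raw key for storage and lookup; only this digest would be persisted.
function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

// Generate a new key. The raw value is shown to the user once;
// the 32 random bytes are an assumed length.
function generateApiKey(): { rawKey: string; keyHash: string } {
  const rawKey = "dps_" + randomBytes(32).toString("hex");
  return { rawKey, keyHash: hashApiKey(rawKey) };
}
```

Because only the hash is stored, a lost key cannot be recovered and would have to be revoked and reissued.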
API key authentication middleware for OTLP push endpoints:
- Validates Authorization: Bearer dps_... header format
- Hashes key and looks up by key_hash in team_api_keys table
- Sets req.apiKeyTeamId on success, updates last_used_at
- Returns 401 for missing, malformed, or invalid keys
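
The header validation step can be sketched as a pure function, separate from any web framework. This is an illustration only: `parseApiKeyHeader` and the `AuthResult` shape are hypothetical names, and the real middleware additionally hashes the key, looks it up, and updates last_used_at.

```typescript
// Hypothetical result type: either the raw key or a 401 with a reason.
type AuthResult =
  | { ok: true; rawKey: string }
  | { ok: false; status: 401; reason: string };

function parseApiKeyHeader(authorization: string | undefined): AuthResult {
  if (!authorization) {
    return { ok: false, status: 401, reason: "missing Authorization header" };
  }
  // Expect exactly "Bearer dps_<something>" with no embedded whitespace in the key.
  const match = /^Bearer (dps_\S+)$/.exec(authorization);
  if (!match) {
    return { ok: false, status: 401, reason: "malformed API key" };
  }
  return { ok: true, rawKey: match[1] };
}
```
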
Team API key management endpoints mounted on team router:
- GET /api/teams/:id/api-keys (list, team lead/admin, strips key_hash)
- POST /api/teams/:id/api-keys (create, returns raw key once)
- DELETE /api/teams/:id/api-keys/:keyId (revoke with ownership check)
- Audit logging on create and revoke
- Route-level tests covering RBAC, validation, and audit events
Add health_endpoint_format selector to ServiceForm with conditional
field visibility (OTLP hides endpoint URL, schema shows editor).
Add format badge to ServiceDetail, API key CRUD client functions,
ApiKeys component with create/revoke/copy workflow and collector
config snippet, and API Keys tab on TeamDetail for team leads/admins.
Document health_endpoint_format column, team_api_keys table, API key
authentication, OTLP receiver endpoint, format-aware parser dispatch,
Prometheus parsing, OTLP rate limiting, and ITeamApiKeyStore across
all relevant spec files.
Add OTLP push ingestion, Prometheus scraping, and API key auth to
README features and API table. Add OTLP rate limit env vars to
.env.example.
Migration 035 adds rate_limit_rpm (nullable INTEGER) and
rate_limit_admin_locked (INTEGER DEFAULT 0) to team_api_keys.

Migration 036 creates api_key_usage_buckets table with composite PK
(api_key_id, bucket_start, granularity) and supporting indexes for
per-key time-series usage tracking.
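
To illustrate how a push could map onto the composite key (api_key_id, bucket_start, granularity), here is a small sketch of an in-memory accumulator. It assumes bucket_start is an ISO-8601 UTC timestamp floored to the bucket size; the storage format and the names `bucketStart` and `UsageAccumulator` are assumptions, not taken from the migration.

```typescript
type Granularity = "minute" | "hour";

const BUCKET_SIZE_MS: Record<Granularity, number> = {
  minute: 60_000,
  hour: 3_600_000,
};

// Floor a timestamp to the start of its bucket (UTC), as an ISO string.
function bucketStart(timestampMs: number, granularity: Granularity): string {
  const size = BUCKET_SIZE_MS[granularity];
  return new Date(Math.floor(timestampMs / size) * size).toISOString();
}

// Counts pushes per (key, bucket, granularity) until drained for a bulk upsert.
class UsageAccumulator {
  private counts = new Map<string, number>();

  record(apiKeyId: string, timestampMs: number): void {
    for (const g of ["minute", "hour"] as const) {
      const key = `${apiKeyId}|${bucketStart(timestampMs, g)}|${g}`;
      this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    }
  }

  // Return the pending counts and reset, e.g. on a periodic flush.
  drain(): Array<[string, number]> {
    const entries = Array.from(this.counts.entries());
    this.counts.clear();
    return entries;
  }
}
```
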
- Extend TeamApiKey with rate_limit_rpm and rate_limit_admin_locked columns
- Add ApiKeyUsageBucket interface for usage bucket rows
- Add findById, updateRateLimit, setAdminLock to TeamApiKeyStore
- Extend requireApiKeyAuth to set req.apiKeyId for downstream middleware
- Add IApiKeyUsageStore interface with bulkUpsert, getBuckets, getBucketsByTeam,
  getAllBuckets, getSummaryForKeys, and prune methods
- Implement all methods using better-sqlite3 prepared statements and transactions
- Register ApiKeyUsageStore in StoreRegistry
- Implement token bucket rate limiter with burst capacity, soft-limit warnings,
  OTLP-format 429 responses, and injectable getNow for testing
- Implement usage accumulator with 5-second bulk flush to SQLite,
  minute+hour granularity, and rejected count tracking
- Add usage bucket pruning to DataRetentionService (minute=24h, hour=30d, orphaned=7d)
- Rename createOtlpRateLimit to createOtlpGlobalRateLimit for clarity
- Insert perKeyRateLimit and trackApiKeyUsage into /v1/metrics middleware chain
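
The token bucket technique named above can be sketched as follows. This is a generic illustration with an injectable clock, not the perKeyRateLimit implementation; the class name, constructor parameters, and the omission of soft-limit warnings are simplifications of my own.

```typescript
// Generic token bucket: refills continuously at ratePerMinute, capped at burstCapacity.
class TokenBucket {
  private readonly ratePerMinute: number;
  private readonly burstCapacity: number;
  private readonly getNow: () => number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(ratePerMinute: number, burstCapacity: number, getNow: () => number = Date.now) {
    this.ratePerMinute = ratePerMinute;
    this.burstCapacity = burstCapacity;
    this.getNow = getNow;
    this.tokens = burstCapacity; // start full so an initial burst is allowed
    this.lastRefillMs = getNow();
  }

  // Returns true if the request is allowed; false means the caller should respond 429.
  tryConsume(): boolean {
    const now = this.getNow();
    const refill = ((now - this.lastRefillMs) * this.ratePerMinute) / 60_000;
    this.tokens = Math.min(this.burstCapacity, this.tokens + refill);
    this.lastRefillMs = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

Injecting the clock lets tests advance time deterministically instead of sleeping, which is presumably the motivation for the injectable getNow mentioned in the commit.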
- PATCH /api/teams/:id/api-keys/:keyId/rate-limit (team lead, with lock check)
- GET /api/teams/:id/api-keys/:keyId/usage (team lead, time-series buckets)
- PATCH /api/admin/api-keys/:keyId/rate-limit (admin, with lock toggle)
- GET /api/admin/api-keys/:keyId/usage (admin, per-key time-series)
- GET /api/admin/otlp-usage (admin, cross-team hourly overview)
…ummaries

- Add rate_limit_rpm, rate_limit_is_custom, rate_limit_admin_locked,
  usage_1h/24h/7d, and rejected_24h/7d to team and admin otlpStats responses
- Uses batched getSummaryForKeys queries (3 per endpoint call)
- Update test schemas with rate limit columns and usage buckets table
- Extend OtlpApiKeyStats with rate limit and usage summary fields
- Add ApiKeyUsageBucket and ApiKeyUsageResponse types
- Add API client functions for team/admin rate limit PATCH and usage GET endpoints
Add rate limit display (with custom/default label and lock indicator)
to the API keys management table, and a modal edit dialog for team
leads to update rate limits on unlocked keys.
…s, and usage charts

Add per-key usage summary row (1h/24h/7d push counts, rejected warnings),
rate limit display with lock indicator and edit dialog for team leads,
expandable ApiKeyUsageChart per key (lazy-mounted), and warning badges
for keys approaching or exceeding rate limits.
… admin rate limit controls

Add cross-team usage overview section with 24h/7d push totals, rejection
counts, and top-5 keys table. Extend per-team key cards with usage summary
rows, rate limit display with lock indicators, expandable usage charts, and
admin rate limit edit dialog with lock checkbox. Add amber/red card
highlighting for keys with active rejections. Add AdminOtlpUsageResponse
type and update API client return type.
…db/types.ts

Add foundational TypeScript types for trace-based dependency discovery:
- DiscoverySource type ('manual' | 'otlp_metric' | 'otlp_trace')
- Span and CreateSpanInput interfaces for full span storage
- ExternalNodeEnrichment and UpsertExternalNodeEnrichmentInput interfaces
- Extend Dependency with discovery_source, user_display_name/description/impact
- Extend DependencyAssociation with is_auto_suggested, is_dismissed
- Extend ProactiveDepsStatus.health with optional percentiles
- Update all test mocks and runtime code for new required fields
- 037: discovery_source, user enrichment columns on dependencies;
  re-add is_auto_suggested/is_dismissed on dependency_associations
- 038: external_node_enrichment table for org-wide external node metadata
- 039: percentile latency columns (p50/p95/p99/min/max/request_count/source)
  on dependency_latency_history
- 040: spans table with indexes for trace correlation and timeline views
- 041: app_settings table seeded with span_retention_days = 7
- Register all five migrations in migrate.ts
- Trace types: OtlpSpan, OtlpSpanStatus, OtlpScopeSpans,
  OtlpResourceSpans, OtlpExportTraceServiceRequest
- Histogram types: OtlpHistogramDataPoint, OtlpHistogram
- Sum types: OtlpSum (monotonic + non-monotonic)
- Extend OtlpMetric with histogram? and sum? fields
New stores:
- ISpanStore / SpanStore: bulkInsert, findByTraceId, findByServiceName,
  deleteOlderThan for full span storage
- IAppSettingsStore / AppSettingsStore: get/set for admin-configurable
  app settings (e.g., span_retention_days)
- IExternalNodeEnrichmentStore / ExternalNodeEnrichmentStore: CRUD for
  org-wide external node enrichment metadata

Store extensions:
- DependencyStore.upsert(): passes discovery_source through INSERT;
  preserves 'manual' on conflict (never downgrade to otlp_trace)
- LatencyHistoryStore.recordWithPercentiles(): stores histogram-derived
  p50/p95/p99/min/max/requestCount with source tag
- AssociationStore: create() supports is_auto_suggested flag;
  new findAutoSuggested(), confirm(), dismiss() methods

All stores registered in StoreRegistry. Test inline schemas updated
for new columns.
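
The conflict rule for discovery_source can be sketched as a pure function. The two documented cases are that 'manual' is never overwritten and that otlp_metric can be upgraded to otlp_trace; the behavior for every other combination below (the incoming value wins) is an assumption, and `resolveDiscoverySource` is a hypothetical name.

```typescript
type DiscoverySource = "manual" | "otlp_metric" | "otlp_trace";

// Decide which discovery_source to keep when an upsert hits an existing row.
function resolveDiscoverySource(
  existing: DiscoverySource | undefined,
  incoming: DiscoverySource,
): DiscoverySource {
  if (existing === undefined) return incoming; // fresh insert
  if (existing === "manual") return "manual"; // never downgrade a manual record
  return incoming; // assumed: otherwise the incoming source wins
}
```
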
… extensions

Migration tests (037-041):
- Verify columns, defaults, indexes, FK constraints, cascade delete
- Verify discovery_source backfill for OTLP services
- Verify span_retention_days seeded default

New store tests:
- SpanStore: bulkInsert, findByTraceId ordering, findByServiceName
  with since/limit, deleteOlderThan
- AppSettingsStore: get seeded value, get missing key, set create/update
- ExternalNodeEnrichmentStore: upsert create/update, findByCanonicalName,
  findAll ordered, delete

Extended store tests:
- DependencyStore: discovery_source passthrough, manual default,
  manual preserved on conflict, otlp_metric upgradable to otlp_trace
- LatencyHistoryStore: recordWithPercentiles stores all fields,
  partial percentile data, backward compat with record()
- AssociationStore: create with is_auto_suggested, findAutoSuggested
  filters correctly, confirm/dismiss update flags
…cy discovery

Parses OTLP trace payloads, extracting dependency information from CLIENT
and PRODUCER spans. Implements target name resolution priority chain
(peer.service → db.system → messaging.system → rpc.system → server.address
→ url.full hostname), dependency type inference, auto-description generation,
and deduplication by target name with aggregated latency/error state.
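
The priority chain above can be sketched as a lookup over span attributes. This is an illustration rather than the TraceParser code: it assumes attributes have already been flattened into a string map, and `resolveTargetName` is a hypothetical name.

```typescript
// Assumed input: OTLP span attributes flattened to string values.
type SpanAttributes = Record<string, string | undefined>;

// Checked in order; the first non-empty attribute wins.
const TARGET_ATTRIBUTE_PRIORITY = [
  "peer.service",
  "db.system",
  "messaging.system",
  "rpc.system",
  "server.address",
] as const;

function resolveTargetName(attrs: SpanAttributes): string | null {
  for (const key of TARGET_ATTRIBUTE_PRIORITY) {
    const value = attrs[key];
    if (value) return value;
  }
  // Last resort: the hostname portion of url.full.
  const fullUrl = attrs["url.full"];
  if (fullUrl) {
    try {
      return new URL(fullUrl).hostname;
    } catch {
      return null; // not a parseable absolute URL
    }
  }
  return null;
}
```
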
…odule

Move the inline findOrCreateService helper from the metrics OTLP route
into a reusable module so the upcoming trace receiver route can share it.
POST /v1/traces endpoint receives OTLP trace payloads, stores ALL spans
for future timeline views, and feeds CLIENT/PRODUCER spans through the
TraceDependencyBridge into the existing dependency upsert pipeline.
Mounted with the same middleware stack as /v1/metrics (2mb limit).
Tests cover: valid payload acceptance, auto-registration, CLIENT/PRODUCER
dependency creation, ALL span storage, DB target resolution, server.address
fallback, invalid payload 400, unauthorized 401, status change events,
idempotent upsert, duration_ms calculation, attribute serialization,
multi-service payloads, and span kind filtering.
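
Two of the behaviors covered by these tests, span kind filtering and duration_ms calculation, can be sketched as below. In OTLP JSON the nanosecond timestamps arrive as decimal strings and span kinds as integers (CLIENT = 3, PRODUCER = 4 in the OTLP SpanKind enum); the function names are hypothetical and the route's actual helpers may differ.

```typescript
// OTLP SpanKind enum values relevant to dependency discovery.
const SPAN_KIND_CLIENT = 3;
const SPAN_KIND_PRODUCER = 4;

// Only outbound spans describe a dependency on another system.
function isDependencySpan(kind: number): boolean {
  return kind === SPAN_KIND_CLIENT || kind === SPAN_KIND_PRODUCER;
}

// Timestamps exceed Number.MAX_SAFE_INTEGER, so subtract as BigInt first,
// then convert the (small) difference to milliseconds.
function spanDurationMs(startTimeUnixNano: string, endTimeUnixNano: string): number {
  const deltaNs = BigInt(endTimeUnixNano) - BigInt(startTimeUnixNano);
  return Number(deltaNs) / 1_000_000;
}
```
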
- histogramPercentiles utility: linear interpolation from OTLP histogram
  buckets with auto seconds-to-ms conversion
- OtlpParser: process histogram dataPoints for percentile extraction,
  sum dataPoints for gauge-like/counter values, buildDependency populates
  health.percentiles
- DependencyUpsertService: route histogram percentiles to
  recordWithPercentiles() with otlp_histogram source
- LatencyHistoryStore: bucket queries include avg_p50/p95/p99
- Tests: percentile utility, OtlpParser histogram/sum paths,
  DependencyUpsertService percentile recording; fix latency test schemas
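
Percentile extraction by linear interpolation over explicit-bounds histogram buckets can be sketched as follows. This is not the histogramPercentiles utility itself: it assumes bucket counts have already been converted to numbers, treats the first bucket's lower edge as 0 (reasonable for latencies), clamps the overflow bucket to the last bound, and leaves out the seconds-to-ms conversion.

```typescript
// bucketCounts has one more entry than explicitBounds; bucket i covers
// (explicitBounds[i - 1], explicitBounds[i]], and the last bucket is unbounded above.
function histogramPercentile(
  bucketCounts: number[],
  explicitBounds: number[],
  q: number, // quantile in [0, 1], e.g. 0.95 for p95
): number | null {
  const total = bucketCounts.reduce((sum, c) => sum + c, 0);
  if (total === 0 || explicitBounds.length === 0) return null;

  const lastBound = explicitBounds[explicitBounds.length - 1];
  const rank = q * total;
  let cumulative = 0;

  for (let i = 0; i < bucketCounts.length; i++) {
    const count = bucketCounts[i];
    if (count === 0) continue;
    if (cumulative + count >= rank) {
      if (i >= explicitBounds.length) return lastBound; // overflow bucket: clamp
      const lower = i === 0 ? 0 : explicitBounds[i - 1]; // assumed lower edge of first bucket
      const upper = explicitBounds[i];
      // Assume observations are spread uniformly within the bucket.
      return lower + ((rank - cumulative) / count) * (upper - lower);
    }
    cumulative += count;
  }
  return lastBound;
}
```

For example, with bounds [10, 20, 50] and counts [2, 6, 2, 0], the median falls halfway through the second bucket and comes out as 15.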
…ciation

Automatically link trace-discovered dependencies to registered services
when an exact name match (case-insensitive) or canonical name match via
alias resolution is found. Creates associations with is_auto_suggested=1,
skips self-links and already-associated pairs (including dismissed), and
catches UNIQUE constraint violations as no-ops for race-condition safety.
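
The matching rule can be sketched as a pure function over in-memory data. It is illustrative only: `RegisteredService`, `findAutoAssociationTarget`, and the alias map keyed by lowercased alias are assumptions, and the checks for existing or dismissed associations and the UNIQUE-constraint handling are left out.

```typescript
// Hypothetical minimal view of a registered service.
interface RegisteredService {
  id: string;
  name: string;
}

// Find the service a trace-discovered dependency should be linked to, if any.
function findAutoAssociationTarget(
  dependencyName: string,
  owningServiceId: string,
  services: RegisteredService[],
  aliasToCanonical: Map<string, string>, // assumed: keys are lowercased aliases
): RegisteredService | null {
  const wanted = dependencyName.toLowerCase();
  const canonical = aliasToCanonical.get(wanted)?.toLowerCase();

  const match = services.find((service) => {
    const candidate = service.name.toLowerCase();
    return candidate === wanted || (canonical !== undefined && candidate === canonical);
  });

  // Skip unknown names and self-links.
  if (!match || match.id === owningServiceId) return null;
  return match;
}
```
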
…s and external nodes

Add confirm/dismiss endpoints for auto-suggested associations, PATCH
enrichment for trace-discovered dependencies, GET discovered dependencies
list, external node enrichment CRUD, and frontend API client functions.
Extend the dependency graph to visually distinguish auto-discovered
(trace-based) dependencies from manually configured ones, and overlay
enrichment metadata on external nodes.

Backend:
- Add discoverySource, isAutoSuggested, associationId to GraphEdgeData
- Add discoveredDependencyCount to ServiceNodeData
- Include discovery_source, is_auto_suggested, association_id in
  dependency queries (DependencyStore)
- Populate new fields in DependencyGraphBuilder.createEdgeData()
- Add ExternalNodeBuilder.applyEnrichment() for enrichment overlay
- Wire ExternalNodeEnrichmentStore through GraphService

Frontend:
- Extend client GraphEdgeData with discoverySource, isAutoSuggested,
  associationId
- CustomEdge: dashed style + "suggested" badge for auto-suggested edges
- EdgeDetailsPanel: discovery source badge, confirm/dismiss buttons
- NodeDetailsPanel: enriched description/impact/contact for external nodes

Tests: 15 new test cases across backend and frontend
Add configurable span retention (default 7 days) via app_settings table,
dismissed auto-suggestion cleanup using the same retention window, and
admin GET/PUT endpoints at /api/admin/settings/span-retention.
…sing, and span storage

Document new tables (spans, app_settings, external_node_enrichment), extended
columns on dependencies/associations/latency_history, trace ingestion flow
(TraceParser, TraceDependencyBridge, AutoAssociator), histogram/sum metric
processing, discovery source graph styling, and external node enrichment.
@dantheuber dantheuber merged commit 8a25159 into main Apr 7, 2026
3 checks passed
@dantheuber dantheuber deleted the otel-sources branch April 7, 2026 02:10