Add Prometheus metrics export endpoint for monitoring integration #80
Reviewer's Guide
Adds a Prometheus-based metrics export pipeline to the MCP server by wiring a new PrometheusExporter into the existing health check HTTP server, sourcing data from the AuthManager metrics collector, tool registry, service pool, and storage stats, and documenting the new /metrics endpoint and available metrics in the README.
Sequence diagram for Prometheus /metrics export flow
sequenceDiagram
actor Prometheus
participant HealthCheckServer
participant PrometheusExporter
participant MetricsCollector
participant ToolRegistry
participant LighthouseServiceFactory
participant ILighthouseService
Prometheus->>HealthCheckServer: HTTP GET /metrics
HealthCheckServer->>HealthCheckServer: route /metrics
alt metrics enabled
HealthCheckServer->>PrometheusExporter: getMetrics()
PrometheusExporter->>PrometheusExporter: updateMetrics()
PrometheusExporter->>MetricsCollector: getMetrics()
MetricsCollector-->>PrometheusExporter: auth metrics
PrometheusExporter->>MetricsCollector: getCacheCounters()
MetricsCollector-->>PrometheusExporter: cache counters
PrometheusExporter->>MetricsCollector: getSecurityEvents()
MetricsCollector-->>PrometheusExporter: security events
PrometheusExporter->>ToolRegistry: getMetrics()
ToolRegistry-->>PrometheusExporter: registry metrics
loop per tool
PrometheusExporter->>ToolRegistry: getToolStats(toolName)
ToolRegistry-->>PrometheusExporter: tool stats
end
PrometheusExporter->>LighthouseServiceFactory: getStats()
LighthouseServiceFactory-->>PrometheusExporter: service pool stats
PrometheusExporter->>ILighthouseService: getStorageStats()
ILighthouseService-->>PrometheusExporter: storage stats
PrometheusExporter-->>HealthCheckServer: prometheusText = registry.metrics()
HealthCheckServer->>PrometheusExporter: getContentType()
PrometheusExporter-->>HealthCheckServer: contentType
HealthCheckServer-->>Prometheus: 200 OK, text/plain, version=0.0.4
else metrics disabled
HealthCheckServer-->>Prometheus: 404 Metrics endpoint not enabled
end
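The two branches of the sequence diagram can be sketched as a small pure function. This is only an illustration: `MetricsExporter`, `MetricsResponse`, and `buildMetricsResponse` are hypothetical names, and the 500 branch is an assumption that the diagram does not show.

```typescript
// Minimal stand-in for the exporter surface used in the diagram
// (getMetrics and getContentType); not the PR's actual class.
interface MetricsExporter {
  getMetrics(): Promise<string>;
  getContentType(): string;
}

interface MetricsResponse {
  status: number;
  contentType: string;
  body: string;
}

// Hypothetical helper: decides what a GET /metrics request should return.
async function buildMetricsResponse(
  metricsEnabled: boolean,
  exporter: MetricsExporter | undefined,
): Promise<MetricsResponse> {
  // "else metrics disabled" branch of the diagram.
  if (!metricsEnabled || exporter === undefined) {
    return { status: 404, contentType: "text/plain", body: "Metrics endpoint not enabled" };
  }
  try {
    // Same call order as the diagram: metrics text first, then content type.
    const body = await exporter.getMetrics();
    return { status: 200, contentType: exporter.getContentType(), body };
  } catch {
    // Assumption: collection failures map to a 500; the diagram does not cover this.
    return { status: 500, contentType: "text/plain", body: "Failed to collect metrics" };
  }
}
```

Keeping the decision separate from the `http.ServerResponse` plumbing makes both branches easy to unit test without opening a socket.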
Updated class diagram for Prometheus metrics export pipeline
classDiagram
class HealthCheckServer {
- HealthCheckDependencies deps
- HealthCheckConfig healthConfig
- Logger logger
- PrometheusExporter prometheusExporter
- lastConnectivityCheck : up boolean, lastChecked number
+ constructor(healthConfig: HealthCheckConfig, deps: HealthCheckDependencies)
+ start() Promise~void~
- handleRequest(req: http.IncomingMessage, res: http.ServerResponse) void
- handleHealth(res: http.ServerResponse) Promise~void~
- handleReady(res: http.ServerResponse) Promise~void~
- handleMetrics(res: http.ServerResponse) Promise~void~
- sendJSON(res: http.ServerResponse, statusCode: number, body: unknown) void
- checkSDK() ReadinessCheck
}
class PrometheusExporter {
- PrometheusExporterDependencies deps
- client.Registry registry
- client.Counter authTotal
- client.Counter cacheHitsTotal
- client.Counter cacheMissesTotal
- client.Counter securityEventsTotal
- client.Counter toolCallsTotal
- client.Gauge cacheSize
- client.Gauge cacheMaxSize
- client.Gauge servicePoolSize
- client.Gauge servicePoolMaxSize
- client.Gauge storageFiles
- client.Gauge storageBytes
- client.Gauge storageMaxBytes
- client.Gauge storageUtilization
- client.Gauge uniqueApiKeys
- client.Gauge toolsRegistered
- client.Histogram requestDuration
- client.Histogram authDuration
- lastCacheCounters : hits number, misses number
- lastAuthMetrics : authenticatedRequests number, failedAuthentications number, fallbackRequests number
- lastSecurityEventCounts : Map~string, number~
- lastToolCallCounts : Map~string, number~
+ constructor(deps: PrometheusExporterDependencies)
- initializeMetrics() void
- updateMetrics() void
- updateAuthMetrics() void
- updateCacheMetrics() void
- updateSecurityMetrics() void
- updateToolMetrics() void
- updateServicePoolMetrics() void
- updateStorageMetrics() void
+ getMetrics() Promise~string~
+ getContentType() string
+ reset() void
}
class PrometheusExporterDependencies {
+ MetricsCollector metricsCollector
+ ToolRegistry registry
+ LighthouseServiceFactory serviceFactory
+ ILighthouseService lighthouseService
}
class AuthManager {
- AuthConfig config
- KeyValidationCache cache
- RateLimiter rateLimiter
- MetricsCollector metricsCollector
+ constructor(config: AuthConfig)
+ authenticateRequest(req: IncomingMessage) Promise~AuthenticationResult~
+ getMetricsCollector() MetricsCollector
+ getCacheStats() CacheStats
+ getRateLimiterStatus(keyHash: string) RateLimiterStatus
+ destroy() void
}
class MetricsCollector {
- cacheHits : number
- cacheMisses : number
+ recordCacheAccess(hit: boolean) void
+ recordAuthentication(result: AuthenticationResult) void
+ getMetrics() AuthMetrics
+ getCacheCounters() hits number, misses number
+ getSecurityEvents() SecurityEvent[]
+ destroy() void
}
class ToolRegistry {
+ getMetrics() ToolRegistryMetrics
+ getToolStats(toolName: string) ToolStats
}
class LighthouseServiceFactory {
+ getStats() ServicePoolStats
}
class ILighthouseService {
<<interface>>
+ getStorageStats() StorageStats
}
class HealthCheckConfig {
+ port : number
+ enabled : boolean
+ lighthouseApiUrl : string
+ connectivityCheckInterval : number
+ connectivityTimeout : number
+ metricsEnabled : boolean
}
class HealthCheckDependencies {
+ AuthManager authManager
+ ToolRegistry registry
+ LighthouseServiceFactory serviceFactory
+ ILighthouseService lighthouseService
+ Logger logger
}
HealthCheckServer --> HealthCheckConfig
HealthCheckServer --> HealthCheckDependencies
HealthCheckServer --> PrometheusExporter
HealthCheckDependencies --> AuthManager
HealthCheckDependencies --> ToolRegistry
HealthCheckDependencies --> LighthouseServiceFactory
HealthCheckDependencies --> ILighthouseService
PrometheusExporter --> PrometheusExporterDependencies
PrometheusExporterDependencies --> MetricsCollector
PrometheusExporterDependencies --> ToolRegistry
PrometheusExporterDependencies --> LighthouseServiceFactory
PrometheusExporterDependencies --> ILighthouseService
AuthManager --> MetricsCollector
AuthManager --> KeyValidationCache
AuthManager --> RateLimiter
MetricsCollector --> AuthenticationResult
MetricsCollector --> SecurityEvent
ToolRegistry --> ToolRegistryMetrics
ToolRegistry --> ToolStats
LighthouseServiceFactory --> ServicePoolStats
ILighthouseService --> StorageStats
Hey - I've found 2 issues, and left some high level feedback:
- In `PrometheusExporter.updateCacheMetrics`, you're only exporting hit/miss counters and leaving a TODO-style comment about cache size/max size; consider wiring this up to `AuthManager.getCacheStats()` (or similar) so `lighthouse_cache_size` and `lighthouse_cache_max_size` reflect real values instead of being omitted.
- For `authDuration` and `requestDuration` histograms you're currently observing averages (per scrape / per tool) rather than individual request latencies, which can produce misleading distributions; if possible, move the histogram instrumentation closer to the actual auth and tool execution paths to record per-request observations.
- The `metricsEnabled` flag is parsed as `process.env.PROMETHEUS_METRICS_ENABLED !== "false"`, which means any non-`"false"` value (including typos) enables metrics; consider using a stricter boolean parser (e.g. only `"true"` enables) to avoid surprising configuration behavior.
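A stricter parser for the `metricsEnabled` flag could look roughly like this. The helper name `parseBooleanFlag` and the choice to trim and lowercase the value are assumptions for illustration, not code from the PR.

```typescript
// Hypothetical helper: only the literal "true" enables the flag.
// Unset or empty values fall back to the supplied default; anything
// else (including typos such as "ture") is treated as disabled.
function parseBooleanFlag(raw: string | undefined, defaultValue: boolean): boolean {
  if (raw === undefined) {
    return defaultValue;
  }
  const normalized = raw.trim().toLowerCase();
  if (normalized === "") {
    return defaultValue;
  }
  return normalized === "true";
}
```

It would be called as `parseBooleanFlag(process.env.PROMETHEUS_METRICS_ENABLED, true)` if the current default-on behavior should be preserved when the variable is unset.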
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `PrometheusExporter.updateCacheMetrics`, you're only exporting hit/miss counters and leaving a TODO-style comment about cache size/max size — consider wiring this up to `AuthManager.getCacheStats()` (or similar) so `lighthouse_cache_size` and `lighthouse_cache_max_size` reflect real values instead of being omitted.
- For `authDuration` and `requestDuration` histograms you're currently observing averages (per scrape / per tool) rather than individual request latencies, which can produce misleading distributions; if possible, move the histogram instrumentation closer to the actual auth and tool execution paths to record per-request observations.
- The `metricsEnabled` flag is parsed as `process.env.PROMETHEUS_METRICS_ENABLED !== "false"`, which means any non-`"false"` value (including typos) enables metrics; consider using a stricter boolean parser (e.g. only `"true"` enables) to avoid surprising configuration behavior.
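To see why observing averages distorts a histogram (the second overall comment above), here is a small self-contained sketch. The bucket bounds and latencies are made-up illustrative values, and `bucketCounts` is a hypothetical helper that mimics cumulative Prometheus-style buckets rather than calling any real client library.

```typescript
// Cumulative bucket counts: for each upper bound, how many observations
// are less than or equal to it (the same shape as Prometheus "le" buckets).
function bucketCounts(observations: number[], upperBounds: number[]): number[] {
  return upperBounds.map((bound) => observations.filter((v) => v <= bound).length);
}

// Illustrative values only: bucket bounds in seconds, three fast requests, one slow.
const bounds = [0.1, 0.5, 1, 5];
const latencies = [0.05, 0.05, 0.05, 4.0];

// Per-request observation: one sample per request.
const perRequest = bucketCounts(latencies, bounds);

// Per-scrape average: a single sample that matches none of the real requests.
const average = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;
const averaged = bucketCounts([average], bounds);

console.log(perRequest); // [3, 3, 3, 4]
console.log(averaged); // [0, 0, 0, 1]
```

With per-request samples, three of four requests land in the fastest bucket. The averaged sample reports a single request slower than one second, so any quantile computed from it would describe neither the fast nor the slow requests.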
## Individual Comments
### Comment 1
<location> `apps/mcp-server/src/health/PrometheusExporter.ts:267-276` </location>
<code_context>
+ // For now, we derive from the metrics collector's data
+ }
+
+ private updateSecurityMetrics(): void {
+ const events = this.deps.metricsCollector.getSecurityEvents();
+
+ // Count events by type
+ const eventCounts: Map<string, number> = new Map();
+ for (const eventType of Object.values(SecurityEventType)) {
+ eventCounts.set(eventType, 0);
+ }
+
+ for (const event of events) {
+ const current = eventCounts.get(event.type) || 0;
+ eventCounts.set(event.type, current + 1);
+ }
+
+ // Calculate deltas and increment counters
+ for (const [type, count] of eventCounts.entries()) {
+ const lastCount = this.lastSecurityEventCounts.get(type) || 0;
+ const delta = count - lastCount;
+ if (delta > 0) {
+ this.securityEventsTotal.labels(type).inc(delta);
+ }
+ this.lastSecurityEventCounts.set(type, count);
+ }
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Security event counters may stall or undercount if `getSecurityEvents()` uses a sliding time window.
This logic assumes `getSecurityEvents()` is cumulative. If it’s backed by a sliding window or otherwise bounded, counts can decrease as older events expire, making `delta` negative and ignored so `lighthouse_security_events_total` stops reflecting the true total. To keep this counter monotonic, either expose cumulative counts from `MetricsCollector` or increment the counter at event time instead of deriving it from a windowed collection on scrape.
</issue_to_address>
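The undercounting described in Comment 1 can be reproduced with a small simulation. Everything here is hypothetical: `SecurityEventTally` stands in for a collector that keeps both a bounded recent-event list and cumulative per-type totals, and `"AUTH_FAILURE"` is a made-up event type. It is not known from the PR whether `getSecurityEvents()` is actually windowed.

```typescript
// Hypothetical collector: a bounded list of recent events plus
// cumulative per-type totals that never decrease.
class SecurityEventTally {
  private readonly maxRecent: number;
  private readonly recent: string[] = [];
  private readonly totals = new Map<string, number>();

  constructor(maxRecent: number) {
    this.maxRecent = maxRecent;
  }

  record(type: string): void {
    this.recent.push(type);
    if (this.recent.length > this.maxRecent) {
      this.recent.shift(); // oldest event falls out of the window
    }
    this.totals.set(type, (this.totals.get(type) ?? 0) + 1);
  }

  recentCount(type: string): number {
    return this.recent.filter((t) => t === type).length;
  }

  cumulativeCount(type: string): number {
    return this.totals.get(type) ?? 0;
  }
}

// Scrape-time delta, clamped at zero as in the reviewed code.
function counterDelta(current: number, last: number): number {
  return Math.max(0, current - last);
}

// Five events, two scrapes, window of three.
function simulate(): { fromWindow: number; fromTotals: number } {
  const tally = new SecurityEventTally(3);
  const type = "AUTH_FAILURE"; // hypothetical event type
  let fromWindow = 0;
  let lastWindow = 0;
  let fromTotals = 0;
  let lastTotal = 0;

  const scrape = (): void => {
    const windowCount = tally.recentCount(type);
    fromWindow += counterDelta(windowCount, lastWindow);
    lastWindow = windowCount;

    const totalCount = tally.cumulativeCount(type);
    fromTotals += counterDelta(totalCount, lastTotal);
    lastTotal = totalCount;
  };

  for (let i = 0; i < 3; i++) tally.record(type);
  scrape();
  for (let i = 0; i < 2; i++) tally.record(type);
  scrape();
  return { fromWindow, fromTotals };
}

console.log(simulate()); // { fromWindow: 3, fromTotals: 5 }
```

The window-derived counter stalls at 3 because the window is already full, while the counter fed from cumulative totals reaches the true value of 5.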
### Comment 2
<location> `apps/mcp-server/src/health/PrometheusExporter.ts:246-255` </location>
<code_context>
+ }
+ }
+
+ private updateCacheMetrics(): void {
+ const cacheCounters = this.deps.metricsCollector.getCacheCounters();
+
+ // Calculate deltas
+ const hitsDelta = cacheCounters.hits - this.lastCacheCounters.hits;
+ const missesDelta = cacheCounters.misses - this.lastCacheCounters.misses;
+
+ if (hitsDelta > 0) {
+ this.cacheHitsTotal.inc(hitsDelta);
+ }
+ if (missesDelta > 0) {
+ this.cacheMissesTotal.inc(missesDelta);
+ }
+
+ // Update last known values
+ this.lastCacheCounters = { ...cacheCounters };
+
+ // Note: Cache size/maxSize would need to come from AuthManager.getCacheStats()
+ // For now, we derive from the metrics collector's data
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Cache size and max-size gauges are defined but never updated.
`cacheSize` and `cacheMaxSize` are registered in `initializeMetrics` but never set here, so Prometheus will always see zero/stale values. If cache size/max size are available (e.g. via `AuthManager.getCacheStats()`), update the gauges in this method; otherwise consider removing these metrics to avoid exporting misleading data.
</issue_to_address>
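One way to address Comment 2 is to set both gauges from a cache stats snapshot on every scrape. This sketch assumes a stats shape of `{ size, maxSize }`; the real return type of `AuthManager.getCacheStats()` is not shown in the review. `GaugeLike` is a hypothetical minimal interface covering only the `set(value)` method that a gauge needs here. Since `PrometheusExporterDependencies` in the class diagram does not include `AuthManager`, a stats provider would also have to be passed in somehow.

```typescript
// Minimal structural type: anything with set(value) can act as a gauge here.
interface GaugeLike {
  set(value: number): void;
}

// Assumed shape of the cache stats; the actual CacheStats type may differ.
interface CacheStatsLike {
  size: number;
  maxSize: number;
}

// Hypothetical helper: copy the current cache stats into the two gauges.
function updateCacheGauges(
  stats: CacheStatsLike,
  cacheSize: GaugeLike,
  cacheMaxSize: GaugeLike,
): void {
  cacheSize.set(stats.size);
  cacheMaxSize.set(stats.maxSize);
}
```

Gauges are set to absolute values, so unlike the hit/miss counters no delta tracking is required.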
Pull Request
Description
#57. Expose internal metrics in Prometheus format via /metrics endpoint for integration with standard monitoring stacks. Includes authentication, cache, tool usage, storage, and process metrics with histogram support for request latencies.
Summary by Sourcery
Expose a Prometheus-compatible /metrics endpoint on the MCP server’s health check HTTP server to export internal metrics for external monitoring systems.