feat: add slow request tracking, request size histogram, and upstream health gauge#7
Conversation
… health gauge - Add slow request counter (requests exceeding 1s threshold) - Add request size histogram with buckets from 100B to 100MB - Add upstream health gauge (1=healthy, 0=unhealthy based on failure rate) - Refactor log.lua to extract helper functions for SSL and upstream metrics - Add unit tests for new constants and functions - Update README with new metrics documentation - Update CHANGELOG with new features
There was a problem hiding this comment.
Pull request overview
This PR adds three new monitoring capabilities to the ngxstats library: slow request tracking, request size histograms, and upstream health status calculation.
Changes:
- Introduces slow request counter to track requests exceeding 1 second threshold
- Adds request size histogram with buckets ranging from 100 bytes to 100MB
- Implements upstream health gauge based on failure rate (10% threshold)
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/resty/ngxstats/common.lua | Added constants for new metrics and health calculation function |
| lib/resty/ngxstats/log.lua | Refactored metric recording into helper functions and integrated new metrics |
| lib/resty/ngxstats/prometheus.lua | Added Prometheus format definitions for new metrics |
| spec/common_spec.lua | Added test coverage for new constants and health calculation logic |
| spec/prometheus_spec.lua | Added tests verifying Prometheus output format for new metrics |
| README.md | Updated documentation with new metrics and feature comparison |
| CHANGELOG.md | Documented new features in unreleased section |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if proxy_host ~= "" then | ||
| local health = common.calculate_upstream_health(stats, upstream_name) | ||
| common.update_num(stats, common.key({'upstreams', upstream_name, 'health'}), health) | ||
| end |
There was a problem hiding this comment.
The health calculation should be scoped within the upstream_response_time conditional block (lines 152-161) to ensure it only executes when upstream metrics are being recorded. Currently, it relies on upstream_name which is only set when an upstream exists, but the logic would be clearer if this block was inside the upstream metrics section.
| it("should return 1 (healthy) when exactly at threshold", function() | ||
| -- 90 requests, 10 failures = 10% failure rate (at threshold, not above) | ||
| mock_stats:set("upstreams:edge:requests", 90) | ||
| mock_stats:set("upstreams:edge:failures", 10) | ||
|
|
||
| local health = common.calculate_upstream_health(mock_stats, "edge") | ||
| assert.equals(1, health) | ||
| end) |
There was a problem hiding this comment.
The test comment indicates the failure rate is 'at threshold, not above', but with 90 requests and 10 failures, the total is 100 and the failure rate is 10%, which equals the threshold (0.1). The implementation uses > on line 267 of common.lua, so this test correctly expects healthy (1). However, consider adding a test case with 91 requests and 10 failures (9.9% failure rate) to verify the boundary condition more explicitly.
| local requests = stats:get(requests_key) or 0 | ||
| local failures = stats:get(failures_key) or 0 | ||
|
|
||
| -- Need at least some requests to determine health |
There was a problem hiding this comment.
The comment says 'Need at least some requests to determine health' but the code also returns healthy (1) when there is no data. Consider clarifying the comment to explicitly state the assumption, such as: 'Assume healthy if no data is available yet'.
| -- Need at least some requests to determine health | |
| -- Need at least some requests to determine health; assume healthy if no data is available yet |
Summary
nginx_server_zone_slow_requests_total) - counts requests exceeding configurable threshold (default 1 second)nginx_server_zone_request_length_bytes_bucket) - tracks request body size distribution with buckets from 100B to 100MBnginx_upstream_health) - provides calculated health status (1=healthy, 0=unhealthy) based on failure rate threshold (default 10%)Changes
lib/resty/ngxstats/common.lua: Added constants (REQUEST_SIZE_BUCKETS,SLOW_REQUEST_THRESHOLD,UPSTREAM_HEALTH_FAILURE_THRESHOLD) andcalculate_upstream_health()functionlib/resty/ngxstats/log.lua: Refactored to extractrecord_ssl_metrics()andrecord_upstream_metrics()helper functions, added new metric collectionlib/resty/ngxstats/prometheus.lua: Added metric definitions and handlers for new metricsspec/common_spec.lua: Added 11 new unit tests for constants and health calculationspec/prometheus_spec.lua: Added 3 new unit tests for Prometheus format outputREADME.md: Updated metrics table and feature listCHANGELOG.md: Documented new featuresTest plan