[Bugfix][Router] Fix stale Prometheus endpoint health metrics (#888)#936
[Bugfix][Router] Fix stale Prometheus endpoint health metrics (#888)#936prashansapkota wants to merge 2 commits intovllm-project:mainfrom
Conversation
…lm-project#888) Endpoint health gauges retained stale server= labels indefinitely when a backend was removed from service discovery, causing Prometheus to keep exporting the last known value for dead endpoints. Changes: - Add EndpointInfo.healthy field (default True) to carry health state from service discovery into the metrics handler - Introduce _LABEL_GAUGES list and _clear_label_gauges() helper in metrics_router.py; call it at the start of every /metrics scrape so only currently-active endpoints appear in the output - Update healthy_pods_total to use ep.healthy instead of a hardcoded 1 - Add test_stale_metrics.py covering: EndpointInfo.healthy default/ override, stale label removal, full endpoint removal, unlabeled gauge preservation, _LABEL_GAUGES completeness, and post-clear repopulation - Remove redundant inline comments and section headers throughout Signed-off-by: prashansapkota <prashan.sapkota3456@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to clear stale Prometheus metrics by tracking gauges with server labels and clearing them within the metrics endpoint. It also adds a healthy field to the EndpointInfo dataclass and includes new tests for metric clearing. The review feedback points out a missing metric in the clearing list, suggests restoring a removed docstring for maintainability, and recommends expanding the test coverage to include all labeled gauges.
| _LABEL_GAUGES = [ | ||
| current_qps, | ||
| avg_decoding_length, | ||
| num_prefill_requests, | ||
| num_decoding_requests, | ||
| num_requests_running, | ||
| avg_latency, | ||
| avg_itl, | ||
| num_requests_swapped, | ||
| gpu_prefix_cache_hit_rate, | ||
| gpu_prefix_cache_hits_total, | ||
| gpu_prefix_cache_queries_total, | ||
| healthy_pods_total, | ||
| ] |
There was a problem hiding this comment.
The _LABEL_GAUGES list is missing num_requests_waiting, which is defined with a server label in metrics_service. To ensure all endpoint-specific gauges are cleared of stale labels, this metric should be included in the list. Note that it will also need to be imported from vllm_router.services.metrics_service.
| """ | ||
|
|
||
| # Collect CPU utilization (short interval) | ||
| _clear_label_gauges() |
There was a problem hiding this comment.
The docstring for the metrics endpoint was removed. This documentation provides a clear overview of the metrics being gathered and exported, which is valuable for maintainability. It should be restored.
"""
Endpoint to expose Prometheus metrics for the vLLM router.
This function gathers request statistics, engine metrics, and health status
of the service endpoints to update Prometheus gauges. It exports metrics
such as queries per second (QPS), average decoding length, number of prefill
and decoding requests, average latency, average inter-token latency, number
of swapped requests, and the number of healthy pods for each server. The
metrics are used to monitor the performance and health of the vLLM router
services.
Returns:
Response: A HTTP response containing the latest Prometheus metrics in
the appropriate content type.
"""
_clear_label_gauges()| from vllm_router.services.metrics_service import ( | ||
| current_qps, | ||
| gpu_prefix_cache_hit_rate, | ||
| gpu_prefix_cache_hits_total, | ||
| gpu_prefix_cache_queries_total, | ||
| healthy_pods_total, | ||
| num_requests_running, | ||
| ) |
There was a problem hiding this comment.
Several metrics used in the router are not imported here, which limits the completeness of the tests. Please import all metrics that have a server label to allow for a comprehensive check in test_label_gauges_list_contains_all_expected_gauges.
from vllm_router.services.metrics_service import (
avg_decoding_length,
avg_itl,
avg_latency,
current_qps,
gpu_prefix_cache_hit_rate,
gpu_prefix_cache_hits_total,
gpu_prefix_cache_queries_total,
healthy_pods_total,
num_decoding_requests,
num_prefill_requests,
num_requests_running,
num_requests_swapped,
)| expected = { | ||
| current_qps, | ||
| gpu_prefix_cache_hit_rate, | ||
| gpu_prefix_cache_hits_total, | ||
| gpu_prefix_cache_queries_total, | ||
| healthy_pods_total, | ||
| num_requests_running, | ||
| } |
There was a problem hiding this comment.
The expected set in test_label_gauges_list_contains_all_expected_gauges is incomplete. It should include all metrics that are present in _LABEL_GAUGES to ensure the clearing logic is correctly verified for all relevant gauges.
| expected = { | |
| current_qps, | |
| gpu_prefix_cache_hit_rate, | |
| gpu_prefix_cache_hits_total, | |
| gpu_prefix_cache_queries_total, | |
| healthy_pods_total, | |
| num_requests_running, | |
| } | |
| expected = { | |
| current_qps, | |
| avg_decoding_length, | |
| num_prefill_requests, | |
| num_decoding_requests, | |
| num_requests_running, | |
| avg_latency, | |
| avg_itl, | |
| num_requests_swapped, | |
| gpu_prefix_cache_hit_rate, | |
| gpu_prefix_cache_hits_total, | |
| gpu_prefix_cache_queries_total, | |
| healthy_pods_total, | |
| } |
- Add num_requests_waiting to _LABEL_GAUGES and import it so stale waiting-request series are cleared along with all other server labels - Restore the metrics() endpoint docstring - Expand test imports and expected set in test_label_gauges_list_contains_all_expected_gauges to cover all server-labeled gauges including num_requests_waiting Signed-off-by: prashansapkota <prashan.sapkota3456@gmail.com>
|
Thanks for the review! Addressed all feedback in the latest commit (36748d0):
|
Summary
Fixes #888
Endpoint health gauges retained stale
server=labels indefinitely when a backend was removed from service discovery, causing Prometheus to keep exporting the last known value for dead endpoints.EndpointInfo.healthyfield (defaultTrue) to carry health state from service discovery into the metrics handler_LABEL_GAUGESlist and_clear_label_gauges()helper inmetrics_router.py; call it at the start of every/metricsscrape so only currently-active endpoints appear in the outputhealthy_pods_totalto useep.healthyinstead of a hardcoded1test_stale_metrics.pywith 7 tests covering:EndpointInfo.healthydefault/override, stale label removal, full endpoint removal, unlabeled gauge preservation,_LABEL_GAUGEScompleteness, and post-clear repopulationTest plan
pytest src/tests/test_stale_metrics.py. All 7 tests passpytest src/tests/test_static_service_discovery.py src/tests/test_utils.py. All existing tests pass/metricsno longer exports labels for removed endpoints after a scrape cycle