[Bugfix][Router] Fix stale Prometheus endpoint health metrics (#888) by prashansapkota · Pull Request #936 · vllm-project/production-stack

prashansapkota · 2026-04-29T04:08:24Z

Summary

Fixes #888

Endpoint health gauges retained stale server= labels indefinitely when a backend was removed from service discovery, causing Prometheus to keep exporting the last known value for dead endpoints.

Add EndpointInfo.healthy field (default True) to carry health state from service discovery into the metrics handler
Introduce _LABEL_GAUGES list and _clear_label_gauges() helper in metrics_router.py; call it at the start of every /metrics scrape so only currently-active endpoints appear in the output
Update healthy_pods_total to use ep.healthy instead of a hardcoded 1
Add test_stale_metrics.py with 7 tests covering: EndpointInfo.healthy default/override, stale label removal, full endpoint removal, unlabeled gauge preservation, _LABEL_GAUGES completeness, and post-clear repopulation

Test plan

pytest src/tests/test_stale_metrics.py. All 7 tests pass
pytest src/tests/test_static_service_discovery.py src/tests/test_utils.py. All existing tests pass
Manually verify /metrics no longer exports labels for removed endpoints after a scrape cycle

…lm-project#888) Endpoint health gauges retained stale server= labels indefinitely when a backend was removed from service discovery, causing Prometheus to keep exporting the last known value for dead endpoints. Changes: - Add EndpointInfo.healthy field (default True) to carry health state from service discovery into the metrics handler - Introduce _LABEL_GAUGES list and _clear_label_gauges() helper in metrics_router.py; call it at the start of every /metrics scrape so only currently-active endpoints appear in the output - Update healthy_pods_total to use ep.healthy instead of a hardcoded 1 - Add test_stale_metrics.py covering: EndpointInfo.healthy default/ override, stale label removal, full endpoint removal, unlabeled gauge preservation, _LABEL_GAUGES completeness, and post-clear repopulation - Remove redundant inline comments and section headers throughout Signed-off-by: prashansapkota <prashan.sapkota3456@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a mechanism to clear stale Prometheus metrics by tracking gauges with server labels and clearing them within the metrics endpoint. It also adds a healthy field to the EndpointInfo dataclass and includes new tests for metric clearing. The review feedback points out a missing metric in the clearing list, suggests restoring a removed docstring for maintainability, and recommends expanding the test coverage to include all labeled gauges.

gemini-code-assist · 2026-04-29T04:10:59Z

+_LABEL_GAUGES = [
+    current_qps,
+    avg_decoding_length,
+    num_prefill_requests,
+    num_decoding_requests,
+    num_requests_running,
+    avg_latency,
+    avg_itl,
+    num_requests_swapped,
+    gpu_prefix_cache_hit_rate,
+    gpu_prefix_cache_hits_total,
+    gpu_prefix_cache_queries_total,
+    healthy_pods_total,
+]


The _LABEL_GAUGES list is missing num_requests_waiting, which is defined with a server label in metrics_service. To ensure all endpoint-specific gauges are cleared of stale labels, this metric should be included in the list. Note that it will also need to be imported from vllm_router.services.metrics_service.

gemini-code-assist · 2026-04-29T04:10:59Z

-    """
-
-    # Collect CPU utilization (short interval)
+    _clear_label_gauges()


The docstring for the metrics endpoint was removed. This documentation provides a clear overview of the metrics being gathered and exported, which is valuable for maintainability. It should be restored.

""" Endpoint to expose Prometheus metrics for the vLLM router. This function gathers request statistics, engine metrics, and health status of the service endpoints to update Prometheus gauges. It exports metrics such as queries per second (QPS), average decoding length, number of prefill and decoding requests, average latency, average inter-token latency, number of swapped requests, and the number of healthy pods for each server. The metrics are used to monitor the performance and health of the vLLM router services. Returns: Response: A HTTP response containing the latest Prometheus metrics in the appropriate content type. """ _clear_label_gauges()

gemini-code-assist · 2026-04-29T04:11:00Z

+from vllm_router.services.metrics_service import (
+    current_qps,
+    gpu_prefix_cache_hit_rate,
+    gpu_prefix_cache_hits_total,
+    gpu_prefix_cache_queries_total,
+    healthy_pods_total,
+    num_requests_running,
+)


Several metrics used in the router are not imported here, which limits the completeness of the tests. Please import all metrics that have a server label to allow for a comprehensive check in test_label_gauges_list_contains_all_expected_gauges.

from vllm_router.services.metrics_service import ( avg_decoding_length, avg_itl, avg_latency, current_qps, gpu_prefix_cache_hit_rate, gpu_prefix_cache_hits_total, gpu_prefix_cache_queries_total, healthy_pods_total, num_decoding_requests, num_prefill_requests, num_requests_running, num_requests_swapped, )

gemini-code-assist · 2026-04-29T04:11:00Z

+    expected = {
+        current_qps,
+        gpu_prefix_cache_hit_rate,
+        gpu_prefix_cache_hits_total,
+        gpu_prefix_cache_queries_total,
+        healthy_pods_total,
+        num_requests_running,
+    }


The expected set in test_label_gauges_list_contains_all_expected_gauges is incomplete. It should include all metrics that are present in _LABEL_GAUGES to ensure the clearing logic is correctly verified for all relevant gauges.

Suggested change

expected = {

current_qps,

gpu_prefix_cache_hit_rate,

gpu_prefix_cache_hits_total,

gpu_prefix_cache_queries_total,

healthy_pods_total,

num_requests_running,

}

expected = {

current_qps,

avg_decoding_length,

num_prefill_requests,

num_decoding_requests,

num_requests_running,

avg_latency,

avg_itl,

num_requests_swapped,

gpu_prefix_cache_hit_rate,

gpu_prefix_cache_hits_total,

gpu_prefix_cache_queries_total,

healthy_pods_total,

}

- Add num_requests_waiting to _LABEL_GAUGES and import it so stale waiting-request series are cleared along with all other server labels - Restore the metrics() endpoint docstring - Expand test imports and expected set in test_label_gauges_list_contains_all_expected_gauges to cover all server-labeled gauges including num_requests_waiting Signed-off-by: prashansapkota <prashan.sapkota3456@gmail.com>

prashansapkota · 2026-04-29T04:18:36Z

Thanks for the review! Addressed all feedback in the latest commit (36748d0):

Added num_requests_waiting to _LABEL_GAUGES: this gauge has a server label and was missing from the clearing list, meaning stale waiting-request series would have persisted for removed endpoints. Also added it to the import in metrics_router.py.
Restored the metrics() endpoint docstring: the docstring was removed as part of a comment cleanup pass but reviewers flagged it as valuable for maintainability. It's been restored in full.
Expanded imports in test_stale_metrics.py: the test file was only importing a subset of server-labeled gauges. All gauges (avg_decoding_length, avg_itl, avg_latency, num_decoding_requests, num_prefill_requests, num_requests_swapped, num_requests_waiting) are now imported to allow comprehensive coverage.
Expanded expected set in test_label_gauges_list_contains_all_expected_gauges: the set now mirrors the full _LABEL_GAUGES list exactly, including num_requests_waiting, ensuring the test catches any future omissions from the clearing list.

gemini-code-assist Bot reviewed Apr 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bugfix][Router] Fix stale Prometheus endpoint health metrics (#888)#936

[Bugfix][Router] Fix stale Prometheus endpoint health metrics (#888)#936
prashansapkota wants to merge 2 commits intovllm-project:mainfrom
prashansapkota:fix-stale-metrics-888-v2

prashansapkota commented Apr 29, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Uh oh!

prashansapkota commented Apr 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

prashansapkota commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

prashansapkota commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

prashansapkota commented Apr 29, 2026 •

edited

Loading

prashansapkota commented Apr 29, 2026 •

edited

Loading