
[Bugfix][Router] Fix stale Prometheus endpoint health metrics (#888)#936

Open
prashansapkota wants to merge 2 commits into vllm-project:main from prashansapkota:fix-stale-metrics-888-v2

Conversation


@prashansapkota prashansapkota commented Apr 29, 2026

Summary

Fixes #888

Endpoint health gauges retained stale server= labels indefinitely when a backend was removed from service discovery, causing Prometheus to keep exporting the last known value for dead endpoints.

  • Add EndpointInfo.healthy field (default True) to carry health state from service discovery into the metrics handler
  • Introduce _LABEL_GAUGES list and _clear_label_gauges() helper in metrics_router.py; call it at the start of every /metrics scrape so only currently-active endpoints appear in the output
  • Update healthy_pods_total to use ep.healthy instead of a hardcoded 1
  • Add test_stale_metrics.py with 7 tests covering: EndpointInfo.healthy default/override, stale label removal, full endpoint removal, unlabeled gauge preservation, _LABEL_GAUGES completeness, and post-clear repopulation
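The clearing mechanism described above can be sketched in isolation with prometheus_client. The gauge name mirrors the PR, but the registry wiring and endpoint URLs here are illustrative stand-ins, not the actual metrics_service code:

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

# Private registry so the stand-in gauge doesn't collide with anything else.
registry = CollectorRegistry()
healthy_pods_total = Gauge(
    "healthy_pods_total", "Healthy pods per server", ["server"], registry=registry
)

# The PR keeps every server-labeled gauge in one list so a single helper
# can clear them all at the start of each /metrics scrape.
_LABEL_GAUGES = [healthy_pods_total]

def _clear_label_gauges() -> None:
    # Gauge.clear() drops every labeled child, so a scrape only exports
    # series for endpoints that are re-populated afterwards.
    for gauge in _LABEL_GAUGES:
        gauge.clear()

# First scrape cycle: two endpoints are alive.
healthy_pods_total.labels(server="http://a:8000").set(1)
healthy_pods_total.labels(server="http://b:8000").set(1)

# Endpoint "b" disappears from service discovery. Without clearing, its
# last value would be exported indefinitely; with clearing it vanishes.
_clear_label_gauges()
healthy_pods_total.labels(server="http://a:8000").set(1)

output = generate_latest(registry).decode()
assert 'server="http://a:8000"' in output
assert 'server="http://b:8000"' not in output
```

This is why the fix lives in the scrape handler rather than in service discovery: clearing at scrape time guarantees the exported series always reflect the current endpoint set, regardless of how or when an endpoint was removed.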

Test plan

  • pytest src/tests/test_stale_metrics.py: all 7 tests pass
  • pytest src/tests/test_static_service_discovery.py src/tests/test_utils.py: all existing tests pass
  • Manually verify /metrics no longer exports labels for removed endpoints after a scrape cycle

…lm-project#888)

Endpoint health gauges retained stale server= labels indefinitely when
a backend was removed from service discovery, causing Prometheus to
keep exporting the last known value for dead endpoints.

Changes:
- Add EndpointInfo.healthy field (default True) to carry health state
  from service discovery into the metrics handler
- Introduce _LABEL_GAUGES list and _clear_label_gauges() helper in
  metrics_router.py; call it at the start of every /metrics scrape so
  only currently-active endpoints appear in the output
- Update healthy_pods_total to use ep.healthy instead of a hardcoded 1
- Add test_stale_metrics.py covering: EndpointInfo.healthy default/
  override, stale label removal, full endpoint removal, unlabeled gauge
  preservation, _LABEL_GAUGES completeness, and post-clear repopulation
- Remove redundant inline comments and section headers throughout

Signed-off-by: prashansapkota <prashan.sapkota3456@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a mechanism to clear stale Prometheus metrics by tracking gauges with server labels and clearing them within the metrics endpoint. It also adds a healthy field to the EndpointInfo dataclass and includes new tests for metric clearing. The review feedback points out a missing metric in the clearing list, suggests restoring a removed docstring for maintainability, and recommends expanding the test coverage to include all labeled gauges.

Comment on lines +58 to +71
_LABEL_GAUGES = [
    current_qps,
    avg_decoding_length,
    num_prefill_requests,
    num_decoding_requests,
    num_requests_running,
    avg_latency,
    avg_itl,
    num_requests_swapped,
    gpu_prefix_cache_hit_rate,
    gpu_prefix_cache_hits_total,
    gpu_prefix_cache_queries_total,
    healthy_pods_total,
]

medium

The _LABEL_GAUGES list is missing num_requests_waiting, which is defined with a server label in metrics_service. To ensure all endpoint-specific gauges are cleared of stale labels, this metric should be included in the list. Note that it will also need to be imported from vllm_router.services.metrics_service.

"""

# Collect CPU utilization (short interval)
_clear_label_gauges()

medium

The docstring for the metrics endpoint was removed. This documentation provides a clear overview of the metrics being gathered and exported, which is valuable for maintainability. It should be restored.

    """
    Endpoint to expose Prometheus metrics for the vLLM router.

    This function gathers request statistics, engine metrics, and health status
    of the service endpoints to update Prometheus gauges. It exports metrics
    such as queries per second (QPS), average decoding length, number of prefill
    and decoding requests, average latency, average inter-token latency, number
    of swapped requests, and the number of healthy pods for each server. The
    metrics are used to monitor the performance and health of the vLLM router
    services.

    Returns:
        Response: A HTTP response containing the latest Prometheus metrics in
        the appropriate content type.
    """
    _clear_label_gauges()

Comment on lines +6 to +13
from vllm_router.services.metrics_service import (
    current_qps,
    gpu_prefix_cache_hit_rate,
    gpu_prefix_cache_hits_total,
    gpu_prefix_cache_queries_total,
    healthy_pods_total,
    num_requests_running,
)

medium

Several metrics used in the router are not imported here, which limits the completeness of the tests. Please import all metrics that have a server label to allow for a comprehensive check in test_label_gauges_list_contains_all_expected_gauges.

from vllm_router.services.metrics_service import (
    avg_decoding_length,
    avg_itl,
    avg_latency,
    current_qps,
    gpu_prefix_cache_hit_rate,
    gpu_prefix_cache_hits_total,
    gpu_prefix_cache_queries_total,
    healthy_pods_total,
    num_decoding_requests,
    num_prefill_requests,
    num_requests_running,
    num_requests_swapped,
)

Comment on lines +97 to +104
expected = {
    current_qps,
    gpu_prefix_cache_hit_rate,
    gpu_prefix_cache_hits_total,
    gpu_prefix_cache_queries_total,
    healthy_pods_total,
    num_requests_running,
}

medium

The expected set in test_label_gauges_list_contains_all_expected_gauges is incomplete. It should include all metrics that are present in _LABEL_GAUGES to ensure the clearing logic is correctly verified for all relevant gauges.

Suggested change
expected = {
    current_qps,
    gpu_prefix_cache_hit_rate,
    gpu_prefix_cache_hits_total,
    gpu_prefix_cache_queries_total,
    healthy_pods_total,
    num_requests_running,
}
expected = {
    current_qps,
    avg_decoding_length,
    num_prefill_requests,
    num_decoding_requests,
    num_requests_running,
    avg_latency,
    avg_itl,
    num_requests_swapped,
    gpu_prefix_cache_hit_rate,
    gpu_prefix_cache_hits_total,
    gpu_prefix_cache_queries_total,
    healthy_pods_total,
}

- Add num_requests_waiting to _LABEL_GAUGES and import it so stale
  waiting-request series are cleared along with all other server labels
- Restore the metrics() endpoint docstring
- Expand test imports and expected set in
  test_label_gauges_list_contains_all_expected_gauges to cover all
  server-labeled gauges including num_requests_waiting

Signed-off-by: prashansapkota <prashan.sapkota3456@gmail.com>
Author

prashansapkota commented Apr 29, 2026

Thanks for the review! Addressed all feedback in the latest commit (36748d0):

  1. Added num_requests_waiting to _LABEL_GAUGES: this gauge has a server label and was missing from the clearing list, meaning stale waiting-request series would have persisted for removed endpoints. Also added it to the import in metrics_router.py.

  2. Restored the metrics() endpoint docstring: the docstring was removed as part of a comment cleanup pass but reviewers flagged it as valuable for maintainability. It's been restored in full.

  3. Expanded imports in test_stale_metrics.py: the test file was only importing a subset of server-labeled gauges. All gauges (avg_decoding_length, avg_itl, avg_latency, num_decoding_requests, num_prefill_requests, num_requests_swapped, num_requests_waiting) are now imported to allow comprehensive coverage.

  4. Expanded expected set in test_label_gauges_list_contains_all_expected_gauges: the set now mirrors the full _LABEL_GAUGES list exactly, including num_requests_waiting, ensuring the test catches any future omissions from the clearing list.



Development

Successfully merging this pull request may close these issues.

bug: Router - Stale Prometheus endpoint health metrics when engines become unavailable in service discovery
