[Bugfix][Dashboard] Dedup healthy_pods_total across router replicas by banlor · Pull Request #943 · vllm-project/production-stack

banlor · 2026-05-07T10:13:57Z

The "Available vLLM instances" stat panel reads:

count by(endpoint) (vllm:healthy_pods_total)

The metric is emitted by each router replica from its own service-discovery view, so Prometheus stores N time series per backend, one per router scrape target. count by(endpoint) counts every one of those series, so the panel value is inflated by a factor equal to the router replica count.

Fix:

sum(max by (server) (vllm:healthy_pods_total))

max by (server) dedups across router replicas before summing. The metric is 0/1, so max is effectively a logical OR across replicas: if any router sees a backend as healthy, it counts.
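
To make the difference concrete, the two aggregations can be imitated in plain Python. This is only a minimal sketch, not the dashboard code: the router and backend names are hypothetical, and it assumes every series carries the same endpoint label value, so that count by(endpoint) reduces to a single count of series.

```python
# Hypothetical scrape result: 3 router replicas, each reporting 2 healthy backends.
# Each tuple stands for one time series: (router_replica, server, value).
series = [
    (router, server, 1)
    for router in ("router-0", "router-1", "router-2")
    for server in ("backend-a", "backend-b")
]


def count_by_endpoint(series):
    """Imitate count by(endpoint) (...), assuming one shared endpoint value:
    every series is counted, whatever its sample value."""
    return len(series)


def sum_max_by_server(series):
    """Imitate sum(max by (server) (...)): keep the highest value seen
    for each server, then add the per-server maxima together."""
    per_server = {}
    for _router, server, value in series:
        per_server[server] = max(per_server.get(server, value), value)
    return sum(per_server.values())


print(count_by_endpoint(series))  # 6
print(sum_max_by_server(series))  # 2
```

With three replicas and two backends the series count is 6, while the per-server maximum summed over servers is 2, which matches the first verification case in this PR.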

Went with max (optimistic) over min (pessimistic) on purpose. During scale-up or SD propagation lag, routers can briefly disagree about a backend's health. For an availability stat, the useful answer is "is there a router willing to route there", and min would under-report during transient lag.
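
A small Python illustration of that trade-off, under the assumption of a single backend seen by three hypothetical router replicas during service-discovery lag:

```python
# Hypothetical views of one backend: one replica already sees it as healthy,
# the other two have not caught up yet.
views = {"router-0": 1, "router-1": 0, "router-2": 0}

optimistic = max(views.values())   # what max by (server) would keep
pessimistic = min(views.values())  # what min by (server) would keep

print(optimistic, pessimistic)  # 1 0
```

With max the backend is counted as soon as one router can route to it; with min it would stay at 0 until every router agrees.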

Verification

promtool tests with synthetic series:

$ promtool test rules healthy_pods_test.yml
  SUCCESS

3 routers × 2 healthy backends → old query: 6, new query: 2
1 router × 2 healthy backends → new query: 2
3 routers × 1 healthy + 1 unhealthy → new query: 1
Router disagreement (1 healthy view, 2 unhealthy) → new query: 1
All unhealthy → new query: 0
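
The cases above can also be cross-checked outside Prometheus. The following is a rough Python sketch, not the promtool test file from this PR: router names, backend names, and the number of routers and backends in the disagreement and all-unhealthy cases are assumptions chosen for illustration. The expected values are the ones listed above.

```python
def healthy_backends(views):
    """views maps router name -> {server: 0 or 1}.
    Returns the sum over servers of the maximum value any router reports,
    imitating sum(max by (server) (...))."""
    per_server = {}
    for router_view in views.values():
        for server, value in router_view.items():
            per_server[server] = max(per_server.get(server, value), value)
    return sum(per_server.values())


# (synthetic router views, expected result of the new query)
scenarios = {
    "3 routers x 2 healthy": ({f"r{i}": {"a": 1, "b": 1} for i in range(3)}, 2),
    "1 router x 2 healthy": ({"r0": {"a": 1, "b": 1}}, 2),
    "3 routers x 1 healthy + 1 unhealthy": (
        {f"r{i}": {"a": 1, "b": 0} for i in range(3)},
        1,
    ),
    "router disagreement": ({"r0": {"a": 1}, "r1": {"a": 0}, "r2": {"a": 0}}, 1),
    "all unhealthy": ({f"r{i}": {"a": 0, "b": 0} for i in range(3)}, 0),
}

for name, (views, expected) in scenarios.items():
    assert healthy_backends(views) == expected, name
```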

Fixes #644

Make sure the code changes pass the pre-commit checks.
Sign-off your commit
PR title classified [Bugfix]

count by(endpoint) (vllm:healthy_pods_total) returns N samples per backend with N router replicas: each router emits its own time-series with distinct scrape-target labels. The "Available vLLM instances" stat ends up multiplied.

sum(max by (server) (vllm:healthy_pods_total)) collapses replicas first, so the count matches actual healthy backends regardless of router count.

Fixes vllm-project#644

Signed-off-by: Mikhail Basov <Michael.S.Sinclair@protonmail.com>

gemini-code-assist

Code Review

This pull request updates the Prometheus expressions in several vLLM dashboard configurations to use a sum(max by (server) ...) aggregation for counting healthy pods. The reviewer recommends further refining this logic by including namespace and model labels in the max by clause to ensure accurate deduplication and prevent collisions in multi-tenant or multi-model environments.

gemini-code-assist Bot reviewed May 7, 2026

Comment thread helm/dashboards/vllm-dashboard.json

Comment thread tutorials/terraform/coreweave/config/vllm-dashboard.json

Comment thread tutorials/terraform/eks/config/vllm-dashboard.json

Comment thread tutorials/terraform/nebius/config/vllm-dashboard.json

banlor wants to merge 1 commit into vllm-project:main from banlor:fix/healthy-pods-count-router-replicas
