[Bugfix][Dashboard] Dedup healthy_pods_total across router replicas #943

Open
banlor wants to merge 1 commit into vllm-project:main from banlor:fix/healthy-pods-count-router-replicas

Conversation

@banlor
Contributor

@banlor banlor commented May 7, 2026

The "Available vLLM instances" stat panel reads:

count by(endpoint) (vllm:healthy_pods_total)

The metric is emitted by each router replica from its own service-discovery view, so Prometheus stores N time-series per backend (one per scrape target) when N router replicas are running. count by(endpoint) counts every one of those series, so the panel value is multiplied by the router replica count.

Fix:

sum(max by (server) (vllm:healthy_pods_total))

max by (server) dedups across router replicas before summing. The metric is 0/1, so max is effectively OR across replicas. If any router sees a backend healthy, it counts.
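
To make the difference concrete, here is a rough Python model of the two aggregations. It is a sketch, not the Prometheus evaluator: the `instance` label, the `endpoint` value `"metrics"`, and the backend addresses are made up for illustration, and it assumes the `endpoint` label has the same value on every router's series.

```python
from collections import defaultdict

def make_series(routers, backends):
    """One sample per (router replica, backend) pair, as each router reports its own view.

    The `instance` label and the "metrics" endpoint value are hypothetical.
    """
    return [
        ({"endpoint": "metrics", "instance": router, "server": server}, healthy)
        for router in routers
        for server, healthy in backends.items()
    ]

def count_by(series, label):
    """Rough model of `count by (label) (...)`: number of series per group, values ignored."""
    groups = defaultdict(int)
    for labels, _value in series:
        groups[labels[label]] += 1
    return dict(groups)

def max_by(series, label):
    """Rough model of `max by (label) (...)`: highest sample value per group."""
    groups = {}
    for labels, value in series:
        key = labels[label]
        groups[key] = max(groups.get(key, value), value)
    return groups

series = make_series(
    routers=["router-0", "router-1", "router-2"],
    backends={"10.0.0.1:8000": 1, "10.0.0.2:8000": 1},
)

old = count_by(series, "endpoint")            # one group holding all 6 series
new = sum(max_by(series, "server").values())  # one value per backend, then summed

print(old, new)  # {'metrics': 6} 2
```

Because `count` looks only at how many series exist, the old expression would also count a backend that every router reports as 0.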

Went with max (optimistic) over min (pessimistic) on purpose. During scale-up or SD propagation lag, routers can briefly disagree about a backend's health. For an availability stat, the useful answer is "is there a router willing to route there", and min would under-report during transient lag.
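
A small sketch of that trade-off, using made-up per-router views (the addresses and values below are illustrative, not taken from a real cluster):

```python
# Hypothetical health views: for each backend, the 0/1 value reported by
# each of three router replicas. The first backend is mid-propagation,
# so only one router has marked it healthy so far.
views = {
    "10.0.0.1:8000": [1, 0, 0],
    "10.0.0.2:8000": [1, 1, 1],
}

# max over replicas behaves like OR: healthy if any router says so.
optimistic = sum(max(v) for v in views.values())

# min over replicas behaves like AND: healthy only if every router agrees.
pessimistic = sum(min(v) for v in views.values())

print(optimistic, pessimistic)  # 2 1
```

Once the routers converge on the same view, both aggregations return the same number, so the choice only matters during the disagreement window.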

Verification

promtool tests with synthetic series:

$ promtool test rules healthy_pods_test.yml
  SUCCESS
  1. 3 routers × 2 healthy backends → old query: 6, new query: 2
  2. 1 router × 2 healthy backends → new query: 2
  3. 3 routers × 1 healthy + 1 unhealthy → new query: 1
  4. Router disagreement (1 healthy view, 2 unhealthy) → new query: 1
  5. All unhealthy → new query: 0

Fixes #644


  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit
  • PR title classified [Bugfix]

count by(endpoint) (vllm:healthy_pods_total) returns N samples per
backend with N router replicas — each router emits its own time-series
with distinct scrape-target labels. The "Available vLLM instances"
stat ends up multiplied.

sum(max by (server) (vllm:healthy_pods_total)) collapses replicas
first, so the count matches actual healthy backends regardless of
router count.

Fixes vllm-project#644

Signed-off-by: Mikhail Basov <Michael.S.Sinclair@protonmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the Prometheus expressions in several vLLM dashboard configurations to use a sum(max by (server) ...) aggregation for counting healthy pods. The reviewer recommends further refining this logic by including namespace and model labels in the max by clause to ensure accurate deduplication and prevent collisions in multi-tenant or multi-model environments.
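
The collision the reviewer describes can be sketched as follows. This is an illustration only: it assumes the metric carries a `namespace` label and that two tenants can expose backends with the same `server` value, neither of which is confirmed in this PR. The namespace names and address are hypothetical.

```python
def sum_max_by(series, keys):
    """Rough model of `sum(max by (keys...) (...))` over (labels, value) samples."""
    groups = {}
    for labels, value in series:
        key = tuple(labels[k] for k in keys)
        groups[key] = max(groups.get(key, value), value)
    return sum(groups.values())

# Hypothetical: two tenants whose backends happen to share a `server` value.
series = [
    ({"namespace": "team-a", "server": "10.0.0.1:8000"}, 1),
    ({"namespace": "team-b", "server": "10.0.0.1:8000"}, 1),
]

by_server = sum_max_by(series, ["server"])                  # the two backends merge
by_ns_server = sum_max_by(series, ["namespace", "server"])  # they stay separate

print(by_server, by_ns_server)  # 1 2
```

Under these assumptions, grouping by `server` alone under-counts, while adding the extra label to the `by` clause keeps the two backends distinct.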

Comment thread helm/dashboards/vllm-dashboard.json
Comment thread tutorials/terraform/coreweave/config/vllm-dashboard.json
Comment thread tutorials/terraform/eks/config/vllm-dashboard.json
Comment thread tutorials/terraform/nebius/config/vllm-dashboard.json

Development

Successfully merging this pull request may close these issues.

No of vLLM instances in observability is wrong if router replicas are more than 1

1 participant