Connection-refused errors don't trigger replica banning #785

@kevinelliott

Summary

When a PostgreSQL replica backend is unreachable (TCP connection refused), pgdog's health check mechanism does not detect the failure or ban the replica from the load balancer. Clients continue to be routed to the dead backend, experiencing timeouts before eventually falling back to the primary.

Environment

  • pgdog: latest (pulled ~2026-02-20)
  • Deployment: Docker Swarm via inline configs
  • Backend: 2 PostgreSQL nodes (1 primary, 1 replica), replica is down (port 5432 not listening, connection refused)

Configuration

[general]
healthcheck_interval = 15_000
idle_healthcheck_interval = 5_000
idle_healthcheck_delay = 5_000
healthcheck_timeout = 2_000
ban_timeout = 300_000
connect_timeout = 1_000
connect_attempts = 1
checkout_timeout = 5_000
read_write_split = "include_primary_if_replica_banned"

Expected Behavior

When a replica is unreachable (connection refused on port 5432):

  1. Health checks should detect the failure (either via idle health checks creating ephemeral connections, or via failed client connection attempts)
  2. The error counter for that pool should increment
  3. The replica should be banned from the load balancer
  4. With read_write_split = "include_primary_if_replica_banned", reads should route to the primary
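The expected accounting can be sketched as a toy model. This is not pgdog's actual code; `ReplicaPool`, `record_connect_error`, and `route_read` are invented names for illustration, and the ban timeout mirrors the `ban_timeout = 300_000` ms setting above:

```python
import time


class ReplicaPool:
    """Toy model of the expected pool accounting (not pgdog internals)."""

    def __init__(self, name, ban_timeout=300.0):
        self.name = name
        self.errors = 0          # SHOW POOLS "errors" counter
        self.banned_until = 0.0  # monotonic deadline; 0.0 means not banned
        self.ban_timeout = ban_timeout

    def record_connect_error(self):
        # Expected: a refused TCP connect counts as a pool error
        # and bans the replica for ban_timeout seconds.
        self.errors += 1
        self.banned_until = time.monotonic() + self.ban_timeout

    @property
    def banned(self):
        return time.monotonic() < self.banned_until


def route_read(replicas, primary):
    # Models include_primary_if_replica_banned: serve reads from a
    # live replica, falling back to the primary when all are banned.
    live = [r for r in replicas if not r.banned]
    return live[0] if live else primary
```

Under this model, the first refused connect increments `errors`, flips `banned`, and redirects subsequent reads to the primary — which is the behavior the configuration above asks for.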

Actual Behavior

  1. SHOW POOLS shows errors = 0 and banned = f for the dead replica — indefinitely
  2. SHOW SERVERS shows zero server connections to the dead replica (expected)
  3. SHOW REPLICATION shows the replica with empty LSN values
  4. Read queries still get routed to the dead replica's pool
  5. Clients experience ~6-10s delays (connect_timeout + checkout_timeout) before pgdog falls back to the primary
  6. The error counter never increments, so the replica is never banned
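The delay floor matches the configured timeouts (`connect_timeout` 1 s plus `checkout_timeout` 5 s is ~6 s). Worth noting: a refused connect does not even need a timeout to be detected — the OS reports ECONNREFUSED essentially instantly. A standalone sketch (it binds and releases an ephemeral local port to get an address that is almost certainly not listening):

```python
import socket
import time

# Find a local port that is not listening: bind an ephemeral
# port, record its number, then close the socket again.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

refused = False
start = time.monotonic()
try:
    socket.create_connection(("127.0.0.1", closed_port), timeout=2.0)
except ConnectionRefusedError:
    refused = True
elapsed = time.monotonic() - start
# The connect fails in milliseconds; it never waits out the 2 s timeout.
```

So the failure signal is available to pgdog immediately on each attempt; the observed 6-10 s is retry/fallback overhead, not the cost of detecting the dead backend.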

Reproduction Steps

  1. Configure pgdog with 2 backends (1 primary, 1 replica) using role = "auto"
  2. Stop PostgreSQL on the replica (so port 5432 returns connection refused)
  3. Wait for health check intervals to pass
  4. Run SHOW POOLS — observe errors = 0, banned = f for the replica
  5. Run a read query — observe it takes ~6-10s instead of <1s
  6. Run SHOW POOLS again — errors still 0

Impact

This defeats the purpose of health checks and banning for the most common failure mode (backend process stopped/crashed). The include_primary_if_replica_banned setting works correctly when banning is triggered, but the ban never fires for connection-refused failures.

Workaround

Currently the only workaround is to manually remove the dead replica from the pgdog config and restart, which eliminates the automatic failover benefit.

Notes

  • The documentation states: "If there are no idle connections available, PgDog will create an ephemeral connection to perform the healthcheck." This ephemeral connection should fail with connection-refused, but it doesn't appear to trigger a ban.
  • This may be related to how pgdog handles ECONNREFUSED vs query errors on established connections. It seems only the latter triggers banning.
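If that hypothesis is right, the fix amounts to feeding both error classes into the same ban counter. As a hypothetical illustration of the distinction (invented names, not pgdog code):

```python
import errno


def classify_backend_error(exc):
    """Hypothetical classifier; both kinds should count toward banning."""
    if isinstance(exc, OSError) and exc.errno == errno.ECONNREFUSED:
        return "connect-error"  # reportedly ignored by the ban logic today
    return "query-error"        # reportedly the only kind that bans


# A refused TCP connect surfaces as an OSError carrying ECONNREFUSED:
refused = ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")

# A query failure on an established connection is a different class:
query_err = RuntimeError('relation "missing_table" does not exist')
```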
