Connection-refused errors don't trigger replica banning #785

@kevinelliott

Summary

When a PostgreSQL replica backend is unreachable (TCP connection refused), pgdog's health check mechanism does not detect the failure or ban the replica from the load balancer. Clients continue to be routed to the dead backend, experiencing timeouts before eventually falling back to the primary.

Environment

  • pgdog: latest (pulled ~2026-02-20)
  • Deployment: Docker Swarm via inline configs
  • Backend: 2 PostgreSQL nodes (1 primary, 1 replica), replica is down (port 5432 not listening, connection refused)

Configuration

[general]
healthcheck_interval = 15_000
idle_healthcheck_interval = 5_000
idle_healthcheck_delay = 5_000
healthcheck_timeout = 2_000
ban_timeout = 300_000
connect_timeout = 1_000
connect_attempts = 1
checkout_timeout = 5_000
read_write_split = "include_primary_if_replica_banned"

Expected Behavior

When a replica is unreachable (connection refused on port 5432):

  1. Health checks should detect the failure (either via idle health checks creating ephemeral connections, or via failed client connection attempts)
  2. The error counter for that pool should increment
  3. The replica should be banned from the load balancer
  4. With read_write_split = "include_primary_if_replica_banned", reads should route to the primary
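The expected accounting can be sketched as a toy model. This is not pgdog's actual code; `ReplicaPool`, `record_connect_error`, and `route_read` are invented names for illustration, and the ban timeout mirrors the `ban_timeout = 300_000` ms setting above:

```python
import time


class ReplicaPool:
    """Toy model of the expected pool accounting (not pgdog internals)."""

    def __init__(self, name, ban_timeout=300.0):
        self.name = name
        self.errors = 0          # SHOW POOLS "errors" counter
        self.banned_until = 0.0  # monotonic deadline; 0.0 means not banned
        self.ban_timeout = ban_timeout

    def record_connect_error(self):
        # Expected: a refused TCP connect counts as a pool error
        # and bans the replica for ban_timeout seconds.
        self.errors += 1
        self.banned_until = time.monotonic() + self.ban_timeout

    @property
    def banned(self):
        return time.monotonic() < self.banned_until


def route_read(replicas, primary):
    # Models include_primary_if_replica_banned: serve reads from a
    # live replica, falling back to the primary when all are banned.
    live = [r for r in replicas if not r.banned]
    return live[0] if live else primary
```

Under this model, the first refused connect increments `errors`, flips `banned`, and redirects subsequent reads to the primary — which is the behavior the configuration above asks for.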

Actual Behavior

  1. SHOW POOLS shows errors = 0 and banned = f for the dead replica — indefinitely
  2. SHOW SERVERS shows zero server connections to the dead replica (expected)
  3. SHOW REPLICATION shows the replica with empty LSN values
  4. Read queries still get routed to the dead replica's pool
  5. Clients experience ~6-10s delays (connect_timeout + checkout_timeout) before pgdog falls back to the primary
  6. The error counter never increments, so the replica is never banned
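The delay floor matches the configured timeouts (`connect_timeout` 1 s plus `checkout_timeout` 5 s is ~6 s). Worth noting: a refused connect does not even need a timeout to be detected — the OS reports ECONNREFUSED essentially instantly. A standalone sketch (it binds and releases an ephemeral local port to get an address that is almost certainly not listening):

```python
import socket
import time

# Find a local port that is not listening: bind an ephemeral
# port, record its number, then close the socket again.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

refused = False
start = time.monotonic()
try:
    socket.create_connection(("127.0.0.1", closed_port), timeout=2.0)
except ConnectionRefusedError:
    refused = True
elapsed = time.monotonic() - start
# The connect fails in milliseconds; it never waits out the 2 s timeout.
```

So the failure signal is available to pgdog immediately on each attempt; the observed 6-10 s is retry/fallback overhead, not the cost of detecting the dead backend.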

Reproduction Steps

  1. Configure pgdog with 2 backends (1 primary, 1 replica) using role = "auto"
  2. Stop PostgreSQL on the replica (so port 5432 returns connection refused)
  3. Wait for health check intervals to pass
  4. Run SHOW POOLS — observe errors = 0, banned = f for the replica
  5. Run a read query — observe it takes ~6-10s instead of <1s
  6. Run SHOW POOLS again — errors still 0

Impact

This defeats the purpose of health checks and banning for the most common failure mode (backend process stopped/crashed). The include_primary_if_replica_banned setting works correctly when banning is triggered, but the ban never fires for connection-refused failures.

Workaround

Currently the only workaround is to manually remove the dead replica from the pgdog config and restart, which eliminates the automatic failover benefit.

Notes

  • The documentation states: "If there are no idle connections available, PgDog will create an ephemeral connection to perform the healthcheck." This ephemeral connection should fail with connection-refused, but it doesn't appear to trigger a ban.
  • This may be related to how pgdog handles ECONNREFUSED vs query errors on established connections. It seems only the latter triggers banning.
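If that hypothesis is right, the fix amounts to feeding both error classes into the same ban counter. As a hypothetical illustration of the distinction (invented names, not pgdog code):

```python
import errno


def classify_backend_error(exc):
    """Hypothetical classifier; both kinds should count toward banning."""
    if isinstance(exc, OSError) and exc.errno == errno.ECONNREFUSED:
        return "connect-error"  # reportedly ignored by the ban logic today
    return "query-error"        # reportedly the only kind that bans


# A refused TCP connect surfaces as an OSError carrying ECONNREFUSED:
refused = ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")

# A query failure on an established connection is a different class:
query_err = RuntimeError('relation "missing_table" does not exist')
```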
