fix(db): auto-reap idle sessions to prevent connection-pool exhaustion (V0062) #793
Evrard-Nil wants to merge 1 commit into
Add migration V0062 setting idle_session_timeout=300s and idle_in_transaction_session_timeout=60s on the database so Postgres closes sessions orphaned by crashed/recycled cloud-api instances, instead of letting them accumulate to max_connections.

Incident 2026-06-15: after a prod deploy, crash-looping/recycled instances left dozens of idle client backends (no server idle timeout; dead-peer TCP backends linger). The leader filled to max_connections and even a single healthy instance got 'FATAL: sorry, too many clients already'.

Safe with deadpool: Fast recycling re-validates on checkout, so a warm pooled connection reaped after going idle is discarded and re-created transparently. Version-guarded (idle_session_timeout is PG14+, cluster is PG16) so it can never wedge startup on an older node.

NOTE: this is the durable/recurrence fix. It does NOT clear an active pileup (it runs after the initial pool connect) and existing sessions are unaffected. Pair it with a manual pg_terminate_backend of idle backends and a DATABASE_MAX_CONNECTIONS reduction.
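The DATABASE_MAX_CONNECTIONS reduction is at heart a budgeting question: the worst-case number of connections all instances can open has to stay under the server's max_connections. A minimal sketch of that arithmetic follows. It assumes each instance holds one write pool and one read pool, both capped at the same value; the instance counts and the `reserved` headroom are illustrative, not taken from this PR.

```rust
/// Worst-case number of server connections the whole fleet can open.
/// Assumption: every instance has one write pool and one read pool,
/// each capped at `pool_max`.
fn worst_case_connections(instances: u32, pool_max: u32) -> u32 {
    instances * 2 * pool_max
}

/// True when the fleet fits under the server limit after setting aside
/// `reserved` slots for admin/replication sessions (assumed headroom).
fn fits_under_limit(instances: u32, pool_max: u32, max_connections: u32, reserved: u32) -> bool {
    worst_case_connections(instances, pool_max) <= max_connections.saturating_sub(reserved)
}

fn main() {
    let max_connections = 100; // server default mentioned in the PR discussion
    let reserved = 10; // hypothetical headroom
    for pool_max in [32, 16] {
        for instances in 1..=3 {
            println!(
                "instances={instances} pool_max={pool_max} worst_case={} fits={}",
                worst_case_connections(instances, pool_max),
                fits_under_limit(instances, pool_max, max_connections, reserved)
            );
        }
    }
}
```

Under these assumptions, two instances at 32 per pool could open 128 connections and would not fit, while two instances at 16 per pool need at most 64 and do.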
Code Review
This pull request introduces a database migration to automatically reap idle client sessions by setting idle_session_timeout and idle_in_transaction_session_timeout via ALTER DATABASE commands. However, the reviewer identified a critical issue: ALTER DATABASE cannot be executed inside a transaction block, which will cause the migration to fail and crash the application during startup since the migration runner executes migrations within a transaction. The reviewer recommends setting these parameters at the connection pool level or executing them manually instead.
    DO $$
    BEGIN
        IF current_setting('server_version_num')::int >= 140000 THEN
            EXECUTE format(
                'ALTER DATABASE %I SET idle_session_timeout = %L',
                current_database(), '300s'
            );
        END IF;

        -- Available since PG 9.6; reaps sessions stuck idle-in-transaction
        -- (the classic connection leak) much sooner.
        EXECUTE format(
            'ALTER DATABASE %I SET idle_in_transaction_session_timeout = %L',
            current_database(), '60s'
        );
    END
    $$;
In PostgreSQL, ALTER DATABASE commands (including ALTER DATABASE ... SET ...) cannot be executed inside a transaction block. Attempting to do so results in the following error:
ERROR: ALTER DATABASE cannot run inside a transaction block
Since refinery executes migrations within a transaction block by default, this migration will fail to apply and will cause the application to crash during startup.
Suggested Solutions:
- Connection Pool Configuration (Recommended): Set these parameters at the connection pool level instead of the database level. This can be done by adding them to the connection options (e.g., -c idle_session_timeout=300s -c idle_in_transaction_session_timeout=60s) in the database configuration or connection string. This avoids global database state mutation and ensures compatibility with transactional migrations.
- Manual Execution: Execute the ALTER DATABASE statements manually on the target databases as a one-time administrative task outside of the transactional migration runner.
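For reference, the connection-options route could be sketched as below. This assumes the app builds a libpq-style key/value connection string and that its driver honours the `options` parameter; the host, database name, user, and helper names are hypothetical.

```rust
/// Build the value of the `options` connection parameter from
/// (setting, value) pairs, e.g. "-c a=1 -c b=2".
fn session_options(settings: &[(&str, &str)]) -> String {
    settings
        .iter()
        .map(|(name, value)| format!("-c {name}={value}"))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Append the options to a key/value connection string.
/// The value is single-quoted because it contains spaces.
fn with_session_options(conninfo: &str, settings: &[(&str, &str)]) -> String {
    format!("{conninfo} options='{}'", session_options(settings))
}

fn main() {
    let timeouts = [
        ("idle_session_timeout", "300s"),
        ("idle_in_transaction_session_timeout", "60s"),
    ];
    // Hypothetical connection parameters, for illustration only.
    let conninfo = with_session_options("host=db.internal dbname=app user=app", &timeouts);
    println!("{conninfo}");
}
```

Unlike the migration, this path has no version guard, so a server older than PG14 would likely reject the connection because it does not know idle_session_timeout.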
Review: V0062 idle session timeout
Solid, well-documented fix. The version guard, transaction-safety of
Incident context (2026-06-15)
After a prod deploy, crash-looping / recycled cloud-api instances left dozens of idle client backends behind; there's no server-side idle timeout, and dead-peer TCP backends linger for minutes. The Patroni leader filled to max_connections, so even a single healthy instance got FATAL: sorry, too many clients already and panicked at init_database (lib.rs:108), crash-looping.
What this does
Adds migration V0062 setting, on the application database:
- idle_session_timeout = 300s: Postgres closes abandoned idle client sessions on its own (the orphaned-connection reaper this incident needed).
- idle_in_transaction_session_timeout = 60s: reaps the classic idle-in-transaction leak sooner.
Why it's safe
- deadpool's Fast recycling re-validates on checkout (is_closed() check), so a warm pooled connection the server reaps after it goes idle is simply discarded and re-created on next use; no error reaches the app.
- 300s is well above normal inter-query gaps under load, so only genuinely-abandoned sessions are reaped.
- idle_session_timeout is PG14+; the cluster is PG16/Spilo-16, but the migration is version-guarded (server_version_num >= 140000) so it's a no-op rather than an error on any older node.
- ALTER DATABASE … SET is transaction-safe (runs fine under refinery).
Note: this is the durable/recurrence fix.
- The migration runs in run_migrations() (lib.rs:114), after the initial pool connect that fails during the outage, so it cannot clear an active pileup.
- Pair it with (1) a manual pg_terminate_backend of idle backends to restore service now, and (2) a DATABASE_MAX_CONNECTIONS reduction (32→16) so instances × (write+read) pools fit under the server max_connections (currently the default ~100).
Follow-ups worth filing
- init_database panics when the initial connect fails (lib.rs:108, .expect); it should back off and retry.
- Pool sizing (max_write/read_connections) isn't bounded against the server max_connections for the instance count.
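The back-off-and-retry follow-up might look roughly like this. It is a blocking, std-only sketch; the real startup path is presumably async and would use its runtime's sleep instead, and `connect_with_retry`, `backoff_delay`, and the simulated connect closure are all hypothetical names.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Call `connect` until it succeeds or `max_attempts` calls have failed,
/// sleeping with exponential backoff between failures.
/// At least one attempt is always made.
fn connect_with_retry<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Out of attempts: surface the last error instead of panicking.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulated connect: fails twice as if the server were full, then succeeds.
    let mut calls = 0;
    let result: Result<&str, &str> = connect_with_retry(
        5,
        Duration::from_millis(10),
        Duration::from_millis(100),
        || {
            calls += 1;
            if calls < 3 {
                Err("sorry, too many clients already")
            } else {
                Ok("connected")
            }
        },
    );
    println!("{result:?} after {calls} calls");
}
```

Returning the last error after the final attempt lets the caller decide whether to exit, rather than panicking on the first failed connect.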