Skip to content

fix(db): auto-reap idle sessions to prevent connection-pool exhaustion (V0062)#793

Closed
Evrard-Nil wants to merge 1 commit into
mainfrom
fix/db-idle-session-timeout
Closed

fix(db): auto-reap idle sessions to prevent connection-pool exhaustion (V0062)#793
Evrard-Nil wants to merge 1 commit into
mainfrom
fix/db-idle-session-timeout

Conversation

@Evrard-Nil

Copy link
Copy Markdown
Collaborator

Incident context (2026-06-15)

After a prod deploy, crash-looping / recycled cloud-api instances left dozens of idle client backends behind — there's no server-side idle timeout, and dead-peer TCP backends linger for minutes. The Patroni leader filled to max_connections, so even a single healthy instance got FATAL: sorry, too many clients already and panicked at init_database (lib.rs:108), crash-looping.

What this does

Adds migration V0062 setting, on the application database:

  • idle_session_timeout = 300s — Postgres closes abandoned idle client sessions on its own (the orphaned-connection reaper this incident needed).
  • idle_in_transaction_session_timeout = 60s — reaps the classic idle-in-transaction leak sooner.

Why it's safe

  • With our pooling: deadpool uses Fast recycling (is_closed() check on checkout), so a warm pooled connection the server reaps after it goes idle is simply discarded and re-created on next use — no error reaches the app. 300s is well above normal inter-query gaps under load, so only genuinely-abandoned sessions are reaped.
  • Won't wedge startup: idle_session_timeout is PG14+; the cluster is PG16/Spilo-16, but the migration is version-guarded (server_version_num >= 140000) so it's a no-op rather than an error on any older node. ALTER DATABASE … SET is transaction-safe (runs fine under refinery).

⚠️ Scope — this is the recurrence fix, not the live-outage fix

  • It runs at run_migrations() (lib.rs:114), after the initial pool connect that fails during the outage — so it cannot clear an active pileup.
  • It only affects new sessions; existing zombie backends are unaffected.
  • Pair it with: (1) a manual pg_terminate_backend of idle backends to restore service now, and (2) a DATABASE_MAX_CONNECTIONS reduction (32→16) so instances × (write+read) pools fit under the server max_connections (currently the default ~100).

Follow-ups worth filing

  • cloud-api panics at boot instead of retrying when the DB is briefly full (lib.rs:108 .expect) — should back off and retry.
  • Pool sizing (max_write/read_connections) isn't bounded against the server max_connections for the instance count.

Add migration V0062 setting idle_session_timeout=300s and
idle_in_transaction_session_timeout=60s on the database so Postgres closes
sessions orphaned by crashed/recycled cloud-api instances, instead of letting
them accumulate to max_connections.

Incident 2026-06-15: after a prod deploy, crash-looping/recycled instances left
dozens of idle client backends (no server idle timeout; dead-peer TCP backends
linger). The leader filled to max_connections and even a single healthy instance
got 'FATAL: sorry, too many clients already'.

Safe with deadpool: Fast recycling re-validates on checkout, so a warm pooled
connection reaped after going idle is discarded and re-created transparently.
Version-guarded (idle_session_timeout is PG14+, cluster is PG16) so it can never
wedge startup on an older node.

NOTE: this is the durable/recurrence fix. It does NOT clear an active pileup (it
runs after the initial pool connect) and existing sessions are unaffected — pair
with a manual pg_terminate_backend of idle backends and a DATABASE_MAX_CONNECTIONS
reduction.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a database migration to automatically reap idle client sessions by setting idle_session_timeout and idle_in_transaction_session_timeout via ALTER DATABASE commands. However, the reviewer identified a critical issue: ALTER DATABASE cannot be executed inside a transaction block, which will cause the migration to fail and crash the application during startup since the migration runner executes migrations within a transaction. The reviewer recommends setting these parameters at the connection pool level or executing them manually instead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +21 to +37
DO $$
BEGIN
IF current_setting('server_version_num')::int >= 140000 THEN
EXECUTE format(
'ALTER DATABASE %I SET idle_session_timeout = %L',
current_database(), '300s'
);
END IF;

-- Available since PG 9.6; reaps sessions stuck idle-in-transaction
-- (the classic connection leak) much sooner.
EXECUTE format(
'ALTER DATABASE %I SET idle_in_transaction_session_timeout = %L',
current_database(), '60s'
);
END
$$;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

In PostgreSQL, ALTER DATABASE commands (including ALTER DATABASE ... SET ...) cannot be executed inside a transaction block. Attempting to do so results in the following error:
ERROR: ALTER DATABASE cannot run inside a transaction block

Since refinery executes migrations within a transaction block by default, this migration will fail to apply and will cause the application to crash during startup.

Suggested Solutions:

  1. Connection Pool Configuration (Recommended): Set these parameters at the connection pool level instead of the database level. This can be done by adding them to the connection options (e.g., -c idle_session_timeout=300s -c idle_in_transaction_session_timeout=60s) in the database configuration or connection string. This avoids global database state mutation and ensures compatibility with transactional migrations.
  2. Manual Execution: Execute the ALTER DATABASE statements manually on the target databases as a one-time administrative task outside of the transactional migration runner.

@Evrard-Nil Evrard-Nil closed this Jun 15, 2026
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review — V0062 idle session timeout

Solid, well-documented fix. The version guard, transaction-safety of ALTER DATABASE … SET, and the deadpool Fast-recycling reasoning all check out, and the PR is appropriately scoped (recurrence fix, not the live-outage fix). A few things worth confirming before merge.

⚠️ Critical: this migration can itself crash-loop the service

ALTER DATABASE … SET requires the connecting role to own the database (or be superuser) — otherwise ERROR: must be owner of database. This is a stronger privilege than prior migrations needed: creating/altering tables only requires schema/table privileges, so a role that has run every migration up to V0061 successfully can still fail on V0062.

If it fails, run_migrations() is .expect'd at lib.rs:116, so every instance panics at boot and crash-loops — the exact failure mode this PR is trying to prevent, now triggered unconditionally on the next deploy.

Two asks:

  • Confirm the migration/app role owns the application DB in the Patroni/Spilo cluster (not just a privileged non-owner). If it isn't, this won't deploy.
  • Consider making it non-fatal so a privilege/edge issue degrades gracefully instead of bricking startup, e.g. wrap the ALTERs in EXCEPTION WHEN insufficient_privilege THEN RAISE WARNING … inside the DO block. A best-effort hardening setting shouldn't be able to take down the service it protects.

Minor / non-blocking

  • idle_in_transaction_session_timeout = 60s is database-wide, not pool-scoped — it applies to every connection to the app DB (manual psql, analytics jobs, any out-of-band tooling), not just cloud-api. The repo's own transactions are all tight repository ops (none held across streaming awaits), so app traffic is safe; just flagging that anything legitimately sitting idle-in-transaction >60s elsewhere will be terminated. Almost certainly desirable here, but worth being deliberate about.
  • Takes effect on new sessions only (already noted in the PR). Combined with deadpool reusing warm connections, existing pooled connections won't pick up the timeouts until recycled — the protective effect ramps in rather than applying immediately.
  • Numbering is fine (the V0054 gap is pre-existing).

The two follow-ups you filed (retry-with-backoff instead of .expect at boot, and bounding pool size against server max_connections) are the real durable fixes — strongly second filing those.

⚠️ (one item to confirm: DB ownership / non-fatal handling before merge)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant