fix(db): auto-reap idle sessions to prevent connection-pool exhaustion (V0062) by Evrard-Nil · Pull Request #793 · nearai/cloud-api

Evrard-Nil · 2026-06-15T10:15:02Z

Incident context (2026-06-15)

After a prod deploy, crash-looping / recycled cloud-api instances left dozens of idle client backends behind — there's no server-side idle timeout, and dead-peer TCP backends linger for minutes. The Patroni leader filled to max_connections, so even a single healthy instance got FATAL: sorry, too many clients already and panicked at init_database (lib.rs:108), crash-looping.

What this does

Adds migration V0062 setting, on the application database:

idle_session_timeout = 300s — Postgres closes abandoned idle client sessions on its own (the orphaned-connection reaper this incident needed).
idle_in_transaction_session_timeout = 60s — reaps the classic idle-in-transaction leak sooner.

Why it's safe

With our pooling: deadpool uses Fast recycling (is_closed() check on checkout), so a warm pooled connection the server reaps after it goes idle is simply discarded and re-created on next use — no error reaches the app. 300s is well above normal inter-query gaps under load, so only genuinely-abandoned sessions are reaped.
Won't wedge startup: idle_session_timeout is PG14+; the cluster is PG16/Spilo-16, but the migration is version-guarded (server_version_num >= 140000) so it's a no-op rather than an error on any older node. ALTER DATABASE … SET is transaction-safe (runs fine under refinery).

⚠️ Scope — this is the recurrence fix, not the live-outage fix

It runs at run_migrations() (lib.rs:114), after the initial pool connect that fails during the outage — so it cannot clear an active pileup.
It only affects new sessions; existing zombie backends are unaffected.
Pair it with: (1) a manual pg_terminate_backend of idle backends to restore service now, and (2) a DATABASE_MAX_CONNECTIONS reduction (32→16) so instances × (write+read) pools fit under the server max_connections (currently the default ~100).

Follow-ups worth filing

cloud-api panics at boot instead of retrying when the DB is briefly full (lib.rs:108 .expect) — should back off and retry.
Pool sizing (max_write/read_connections) isn't bounded against the server max_connections for the instance count.

Add migration V0062 setting idle_session_timeout=300s and idle_in_transaction_session_timeout=60s on the database so Postgres closes sessions orphaned by crashed/recycled cloud-api instances, instead of letting them accumulate to max_connections. Incident 2026-06-15: after a prod deploy, crash-looping/recycled instances left dozens of idle client backends (no server idle timeout; dead-peer TCP backends linger). The leader filled to max_connections and even a single healthy instance got 'FATAL: sorry, too many clients already'. Safe with deadpool: Fast recycling re-validates on checkout, so a warm pooled connection reaped after going idle is discarded and re-created transparently. Version-guarded (idle_session_timeout is PG14+, cluster is PG16) so it can never wedge startup on an older node. NOTE: this is the durable/recurrence fix. It does NOT clear an active pileup (it runs after the initial pool connect) and existing sessions are unaffected — pair with a manual pg_terminate_backend of idle backends and a DATABASE_MAX_CONNECTIONS reduction.

gemini-code-assist

Code Review

This pull request introduces a database migration to automatically reap idle client sessions by setting idle_session_timeout and idle_in_transaction_session_timeout via ALTER DATABASE commands. However, the reviewer identified a critical issue: ALTER DATABASE cannot be executed inside a transaction block, which will cause the migration to fail and crash the application during startup since the migration runner executes migrations within a transaction. The reviewer recommends setting these parameters at the connection pool level or executing them manually instead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-15T10:16:03Z

+DO $$
+BEGIN
+    IF current_setting('server_version_num')::int >= 140000 THEN
+        EXECUTE format(
+            'ALTER DATABASE %I SET idle_session_timeout = %L',
+            current_database(), '300s'
+        );
+    END IF;
+
+    -- Available since PG 9.6; reaps sessions stuck idle-in-transaction
+    -- (the classic connection leak) much sooner.
+    EXECUTE format(
+        'ALTER DATABASE %I SET idle_in_transaction_session_timeout = %L',
+        current_database(), '60s'
+    );
+END
+$$;


In PostgreSQL, ALTER DATABASE commands (including ALTER DATABASE ... SET ...) cannot be executed inside a transaction block. Attempting to do so results in the following error:
ERROR: ALTER DATABASE cannot run inside a transaction block

Since refinery executes migrations within a transaction block by default, this migration will fail to apply and will cause the application to crash during startup.

Suggested Solutions:

Connection Pool Configuration (Recommended): Set these parameters at the connection pool level instead of the database level. This can be done by adding them to the connection options (e.g., -c idle_session_timeout=300s -c idle_in_transaction_session_timeout=60s) in the database configuration or connection string. This avoids global database state mutation and ensures compatibility with transactional migrations.

Manual Execution: Execute the ALTER DATABASE statements manually on the target databases as a one-time administrative task outside of the transactional migration runner.

claude · 2026-06-15T10:17:33Z

Review — V0062 idle session timeout

Solid, well-documented fix. The version guard, transaction-safety of ALTER DATABASE … SET, and the deadpool Fast-recycling reasoning all check out, and the PR is appropriately scoped (recurrence fix, not the live-outage fix). A few things worth confirming before merge.

⚠️ Critical: this migration can itself crash-loop the service

ALTER DATABASE … SET requires the connecting role to own the database (or be superuser) — otherwise ERROR: must be owner of database. This is a stronger privilege than prior migrations needed: creating/altering tables only requires schema/table privileges, so a role that has run every migration up to V0061 successfully can still fail on V0062.

If it fails, run_migrations() is .expect'd at lib.rs:116, so every instance panics at boot and crash-loops — the exact failure mode this PR is trying to prevent, now triggered unconditionally on the next deploy.

Two asks:

Confirm the migration/app role owns the application DB in the Patroni/Spilo cluster (not just a privileged non-owner). If it isn't, this won't deploy.
Consider making it non-fatal so a privilege/edge issue degrades gracefully instead of bricking startup, e.g. wrap the ALTERs in EXCEPTION WHEN insufficient_privilege THEN RAISE WARNING … inside the DO block. A best-effort hardening setting shouldn't be able to take down the service it protects.

Minor / non-blocking

idle_in_transaction_session_timeout = 60s is database-wide, not pool-scoped — it applies to every connection to the app DB (manual psql, analytics jobs, any out-of-band tooling), not just cloud-api. The repo's own transactions are all tight repository ops (none held across streaming awaits), so app traffic is safe; just flagging that anything legitimately sitting idle-in-transaction >60s elsewhere will be terminated. Almost certainly desirable here, but worth being deliberate about.
Takes effect on new sessions only (already noted in the PR). Combined with deadpool reusing warm connections, existing pooled connections won't pick up the timeouts until recycled — the protective effect ramps in rather than applying immediately.
Numbering is fine (the V0054 gap is pre-existing).

The two follow-ups you filed (retry-with-backoff instead of .expect at boot, and bounding pool size against server max_connections) are the real durable fixes — strongly second filing those.

⚠️ (one item to confirm: DB ownership / non-fatal handling before merge)

Evrard-Nil requested review from PierreLeGuen and lloydmak99 June 15, 2026 10:15

Evrard-Nil temporarily deployed to Cloud API test env June 15, 2026 10:15 — with GitHub Actions Inactive

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

Evrard-Nil closed this Jun 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(db): auto-reap idle sessions to prevent connection-pool exhaustion (V0062)#793

fix(db): auto-reap idle sessions to prevent connection-pool exhaustion (V0062)#793
Evrard-Nil wants to merge 1 commit into
mainfrom
fix/db-idle-session-timeout

Evrard-Nil commented Jun 15, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

claude Bot commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Evrard-Nil commented Jun 15, 2026

Incident context (2026-06-15)

What this does

Why it's safe

⚠️ Scope — this is the recurrence fix, not the live-outage fix

Follow-ups worth filing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Suggested Solutions:

Uh oh!

claude Bot commented Jun 15, 2026

Review — V0062 idle session timeout

⚠️ Critical: this migration can itself crash-loop the service

Minor / non-blocking

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant