database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification by elovelan · Pull Request #2 · elovelan/moovio-base

elovelan · 2026-06-30T12:21:45Z

Problem

Postgres/AlloyDB connections fail during brief network outages — most notably AlloyDB's monthly maintenance switchovers (< 1s downtime), but also any transient network blip. Today:

Dead connections linger in the pool. database/sql with the pgx driver has no background connection health checks, so after a switchover, dead connections sit in the pool and get handed to the application.
No pool limits by default. ConnectionsConfig defaults to all zero values, so database/sql holds connections forever.
In-flight queries have no retry path. A query that fails mid-flight (connection severed) just returns an error to the caller — no retry, no indication whether a write landed.

The result: services return 500s during maintenance windows, and callers have to guess whether a write committed after a network error.

What this does

Three layers, each independently useful:

1. Failover-safe connection pool (`postgres.go`, `model_config.go`, `database.go`)

Switch from sql.Open("pgx", …) to pgxpool.NewWithConfig() + stdlib.OpenDBFromPool(), keeping the *sql.DB return type so no downstream changes are needed.
HealthCheckPeriod = 1s — a background goroutine evicts connections that exceeded MaxConnLifetime or MaxConnIdleTime every second. (It does not ping — liveness pinging happens at acquire time via stdlib's ResetSession, default "ping if idle > 1s"; database/sql then retries on a fresh conn. HealthCheckPeriod just keeps expired/idle connections from accumulating.)
DefaultPostgresConnectionsConfig() + ApplyPostgresConnectionsConfig() — failover-safe defaults (MaxLifetime=5m, MaxIdleTime=30s, MaxOpen=25, MaxIdle=5) applied automatically when services don't explicitly configure them.

2. Retry helpers (`retry.go`, `postgres.go`)

Two functions so the common case (selects, idempotent DML) doesn't carry "unsafe" wording or transaction overhead:

RetryIdempotent(ctx, opts, fn func() error) — retries fn (no transaction wrapper) on any error by default except context.Canceled/context.DeadlineExceeded. The caller vouches that fn is idempotent. Lightweight; for selects and idempotent DML.
RetryPostgresNonIdempotent(ctx, db, opts, fn func(*sql.Tx) error) — runs fn in a Postgres transaction and retries the whole transaction on transient errors. fn receives a fresh *sql.Tx each attempt. For non-idempotent operations (the transaction makes retry safe via rollback, plus commit verification below). Shape matches pgx.BeginTxFunc / pgxtransactions.Transaction.

Both use a non-configurable maxRetryAttempts = 3 (matching database/sql's maxBadConnRetries+1), full-jitter backoff (up to 100ms), and respect the caller's context deadline.

RetryMySQLNonIdempotent / RetrySpannerNonIdempotent are placeholders (return "not yet implemented") so the cross-backend API surface is visible.

3. Commit verification (`postgres.go`, `retry.go`)

The hard part of retrying transactions: a commit error is ambiguous — the commit may have succeeded on the server before the error was returned to the client (e.g., the connection was severed after COMMIT was sent but before the response). Blindly retrying could duplicate committed work.

RetryPostgresNonIdempotent resolves this by asking the server, per the PostgreSQL docs (which explicitly endorse pg_xact_status for "determine whether their transaction committed or aborted after the application and database server become disconnected while a COMMIT is in progress"):

Before each Commit, capture the transaction id: SELECT pg_current_xact_id_if_assigned()::text (NULL → read-only — no writes to duplicate, safe to retry without verifying).
On a commit error, query SELECT pg_xact_status($1::xid8) on a fresh connection (the original may be dead; xid8 is globally unique so this works across connections/reconnects, modulo a brief failover replay-lag window):
- 'aborted' → retry (the commit didn't land).
- 'committed' → ErrCommitted (don't retry — would duplicate; the caller reconciles/alerts).
- NULL / 'in progress' / query error → ErrCommitPhase (inconclusive; caller decides).

Commit errors are always verified (not classified by a client-side guess) — the server is the authority on whether a commit landed. Class-40 commit errors (40001 serialization, 40P01 deadlock — server rolled back) → pg_xact_status says 'aborted' → retry. A retryable commit error (verified-aborted or read-only) on exhaustion returns the original error (not ErrCommitPhase — it's known non-commit, just out of retries). ErrCommitPhase is only for inconclusive commits.

Retry design summary

Pre-commit errors (BeginTx/fn): opt-OUT classifier — pre-commit retry is always safe (the transaction rolls back), so it retries everything except context.Canceled/context.DeadlineExceeded and the permanent SQLSTATE classes (22xxx data exceptions, 23xxx constraint violations, 42xxx syntax/schema/permission — via pgerrcode). Unknown codes and non-PgError errors are retried (a false positive wastes ~200ms; a false negative causes a spurious user-facing failure — the former is safer).
Commit errors: always verify via pg_xact_status (above).
opts.IsRetryable is additive (OR) on the pre-commit classifier (commit errors are verified, not classified — safer; a caller shouldn't override commit verification).

Sentinels

if errors.Is(err, database.ErrCommitted) {
    // Verified committed: do NOT retry (would duplicate). Reconcile/alert.
}
if errors.Is(err, database.ErrCommitPhase) {
    // Inconclusive: the commit may or may not have landed. Decide / re-check.
}

Both preserve the underlying commit error for errors.Is/errors.As.

Other

pgerrcode: adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants, by the pgx author) to eliminate string literals and inline human-friendly-name comments.
Naming: RetryIdempotent (no tx) / RetryPostgresNonIdempotent (tx + verification) — the Tx suffix was dropped (the fn func(*sql.Tx) error parameter makes the transaction obvious). The existing RetryPostgres is removed (this is unmerged work).
database/sql interaction: *sql.DB methods already retry driver.ErrBadConn internally (immediate, up to 3 attempts); this outer retry layers on top with jitter for longer outages. *sql.Tx methods have no internal retry, so this is the only retry for tx.Exec/tx.Commit.

Test plan

TestRetryIdempotent — success, retries on any error by default (incl. 23505), custom IsRetryable short-circuits, no retry on context.Canceled, ctx cancellation, attempt exhaustion.
TestRetryPostgresNonIdempotent — success; fn-error-then-succeed (rollback between attempts); non-retryable fn error short-circuits (23505); driver.ErrBadConn from fn retried; SafeToRetry Begin failure retried; SafeToRetry Commit failure → verified-aborted → retry; verify-aborted → retry; verify-committed → ErrCommitted; verify-unknown (NULL) → ErrCommitPhase; verify-error → ErrCommitPhase; read-only (no xid) → retry without verifying; class-40 exhaustion → verify-aborted each attempt → original error (not ErrCommitPhase); 57P01+inconclusive → ErrCommitPhase; additive opts.IsRetryable (pre-commit); ctx cancellation; attempt exhaustion. Uses a database/sql/driver mock (with QueryerContext to script the verification queries).
TestIsRetryablePostgresPreCommitError — pre-commit classifier leg coverage.
TestErrCommitPhase / TestErrCommitted — sentinel detection; underlying error preserved; distinct from each other.
TestRetryMySQLNonIdempotentNotImplemented / TestRetrySpannerNonIdempotentNotImplemented — placeholders return "not yet implemented".
Existing IsRetryablePostgresError / PostgresUniqueViolation / PostgresDeadlockFound / pool-defaults tests still pass.
go build ./..., go vet ./..., go test ./database/... -short all pass.

Origin

This combines the failover-safe pool work (upstream moov-io/base PR #490) with the transaction-retry pattern suggested in its review (r3470551946, r3487236671, r3482973525, r3483137216).

…oyDB During AlloyDB maintenance switchovers (< 1s downtime), services using database/sql with pgx fail to recover because: 1. No connection pool limits are set by default, so dead connections persist indefinitely 2. No retry logic exists for transient connection errors This adds: - DefaultPostgresConnectionsConfig() with MaxLifetime=5m, MaxIdleTime=30s to ensure dead connections are evicted quickly after failover - ApplyPostgresConnectionsConfig() that fills in safe defaults when services don't explicitly configure pool settings - IsRetryablePostgresError() to classify transient PG/network errors - RetryPostgres() for services to wrap critical DB operations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Switch from sql.Open("pgx") to pgxpool.NewWithConfig() wrapped with stdlib.OpenDBFromPool(). This gives us pgxpool's HealthCheckPeriod (set to 5s) which proactively pings idle connections and evicts dead ones — the most important fix for surviving AlloyDB maintenance switchovers. The return type remains *sql.DB so no downstream changes are needed. Also cleans up the dialer TODO (dialer lifecycle is now tied to the pool) and removes the unused pgx.ParseConfig import path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

For sub-second AlloyDB switchovers, 1s health checks detect and evict dead connections faster with negligible overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Collapse switch cases in IsRetryablePostgresError for readability - Make context cancellation test more specific (assert context.Canceled and exact call count) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… codes Remove the 08xxx connection_exception PgError cases — pgx never surfaces these as *pgconn.PgError, so they were dead code. Add the codes that pgx does return and where the server guarantees a rollback before returning the error: 40001 (serialization_failure), 40P01 (deadlock_detected), 53300 (too_many_connections), and 57014 (query_canceled). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace the fixed 200ms×attempt linear backoff with a random delay in [0, 100ms) per retry gap. Full jitter spreads concurrent retries across the fleet instead of having all callers wait the same fixed interval and slam the database simultaneously. Worst-case retry window across 3 attempts is ~200ms, sized for AlloyDB planned maintenance switchovers which typically cause less than one second of downtime. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Drop "connection refused" (not present in pgx's codebase) and "unexpected EOF" (already caught by errors.Is(err, io.ErrUnexpectedEOF)). Add comments explaining why the remaining string matches are safe to retry. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…etryUnsafe Implements the transaction-retry pattern suggested in PR moov-io#490 review comments (discussion_r3470551946 and discussion_r3487236671): - RetryPostgresTx wraps fn in an explicit transaction (fn receives a *sql.Tx) and retries the whole transaction on transient errors. This is the "safe" variant: the explicit transaction guarantees fn's work is atomic, so a retried attempt starts from a clean slate. - RetryUnsafe retries fn (no transaction wrapper) on any error by default; the caller vouches that fn is idempotent. This is the "unsafe" variant for cases where a transaction isn't feasible. - RetryMySQLTx and RetrySpannerTx are added as not-implemented placeholders so the cross-backend API surface is visible and the gap is enforced by tests. The safe variant's default classifier (isRetryablePostgresTxError) is the OR of three signals: 1. driver.ErrBadConn -- pgx's stdlib adapter converts pgconn.SafeToRetry-flagged Exec/Query errors to driver.ErrBadConn before fn sees them; this is the primary retry trigger inside a transaction. 2. pgconn.SafeToRetry -- covers Begin/Commit failures where the original pgx error flows through unchanged. 3. IsRetryablePostgresError -- server-returned SQLSTATE codes that guarantee rollback (40001, 40P01, 57P01, ...). RetryTxOptions.IsRetryable is additive on top of the default classifier so consumers can extend it with implementation-specific cases without narrowing the safety floor. RetryUnsafeOptions.IsRetryable replaces the default (caller takes full control). The existing RetryPostgres is removed (the PR is unmerged). A stacked follow-up will add pg_current_xact_id()/pg_xact_status() commit verification; a TODO marks the insertion point in retryTx. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without an explicit transaction there is no way to know whether a network error occurred before or after the server accepted the query, so retrying could duplicate committed work. IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that the server guarantees were rolled back before the error was returned (40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError, io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by peer"/"broken pipe"/"conn closed") checks are removed. The context.DeadlineExceeded check is removed as redundant (non-PgError errors already return false). The doc comment explains why network errors are excluded and points callers to RetryPostgresTx for transactional retry. The network-error checks move into a new internal isPostgresNetworkError helper, which is wired into isRetryablePostgresTxError as a 4th leg. Network errors ARE safe to retry within an explicit transaction (the server rolls back the uncommitted transaction), so RetryPostgresTx keeps them. String matching is dropped entirely -- driver.ErrBadConn and pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError / io checks cover post-send cases, so substring matching is redundant and fragile. Tests updated: - TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError, and the string-matched errors now assert false; SQLSTATE assertions unchanged. - TestIsRetryablePostgresTxError: added a Leg 4 section asserting io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the transactional classifier. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

…narrow) phases Addresses review feedback questioning the claim that commit-phase errors always roll back: a commit-phase error is ambiguous because the commit may have succeeded on the server before the error was returned to the client (e.g., the TCP connection was severed after the COMMIT message was sent but before the response was received). Retrying such an error could duplicate the committed work. Changes: - IsRetryablePostgresError -> isSafeRetryablePostgresError (unexported). It classified only SQLSTATE codes that the server guarantees were rolled back, and the recommended path for callers is now RetryPostgresTx or RetryUnsafe, not building a custom retry loop on top of this classifier. - isRetryablePostgresTxError -> isRetryablePostgresPreCommitError (broad, for errors from BeginTx or fn): driver.ErrBadConn + pgconn.SafeToRetry + isSafeRetryablePostgresError + isPostgresNetworkError. Any error from Begin or fn is safe to retry because the transaction rolls back, so network errors are included. Added a comment on the pgconn.SafeToRetry leg explaining pgx guarantees these always occur before any data is sent to the server. - New isRetryablePostgresCommitError (narrow, for errors from (*sql.Tx).Commit): driver.ErrBadConn + pgconn.SafeToRetry + isSafeRetryablePostgresError. Network errors are intentionally EXCLUDED because they could mean the commit succeeded but the response was lost. - New ErrCommitPhase sentinel: commit-phase errors that are not retried (e.g., a network error during commit) are wrapped with ErrCommitPhase so callers can detect them via errors.Is and decide whether to alert, reconcile, or check whether the commit landed (future pg_xact_status() work). The underlying error is preserved for inspection. - retryTx now takes a retryClassifier (preCommit + commit funcs) and runOneTxAttempt returns (err, commitPhase bool); the retry loop applies the commit classifier when commitPhase is true and wraps the final commit-phase error with ErrCommitPhase. opts.IsRetryable remains additive on top of BOTH phases. - TODO(stacked-pr) -> TODO(future) for the pg_current_xact_id()/ pg_xact_status() commit-verification hook. Tests: - Removed TestIsRetryablePostgresError from postgres_test.go (function is now private); SQLSTATE coverage moved to internal TestIsSafeRetryablePostgresError. - Renamed TestIsRetryablePostgresTxError -> TestIsRetryablePostgresPreCommitError. - Added TestIsRetryablePostgresCommitError (network NOT retryable at commit). - Added TestRetryPostgresTx subtests: commit-phase network error is not retried and is wrapped with ErrCommitPhase; commit-phase exhaustion wraps the final error with ErrCommitPhase. - Added TestErrCommitPhase (sentinel detection + underlying error preserved). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

isRetryablePostgresCommitError was a strict subset of isRetryablePostgresPreCommitError (pre-send + server-rolled-back, without the pre-commit-only network leg), duplicating the driver.ErrBadConn and pgconn.SafeToRetry checks that also belonged to the 'safe at any phase' bucket. Folded those two legs into isSafeRetryablePostgresError so it becomes the single 'safe to retry regardless of transaction phase' classifier (pre-send + server-rolled-back SQLSTATE codes), and assigned it directly to postgresTxClassifier.commit. isRetryablePostgresPreCommitError now delegates: isSafeRetryablePostgresError || isPostgresNetworkError. Behavior is unchanged; only the code structure is simpler. Also documented on isRetryablePostgresPreCommitError the permanent-error categories that are intentionally NOT retried pre-commit even though retrying them would be safe (the transaction rolls back): constraint violations (23xxx), syntax errors (42601), schema errors (42xxx), permission errors (42501), and data exceptions (22xxx). These surface at query time (not connection time) and would fail again on retry, so retrying only adds ~200ms of latency before the caller receives the error. Callers who want to retry a broader set can pass opts.IsRetryable. Tests: folded TestIsRetryablePostgresCommitError assertions into TestIsSafeRetryablePostgresError (added driver.ErrBadConn + SafeToRetry -> true); removed TestIsRetryablePostgresCommitError; TestIsRetryablePostgresPreCommitError unchanged. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

…config Two related changes from review discussion: 1. Pre-commit classifier is now OPT-OUT. Pre-commit, retrying is always safe (the transaction rolls back), so the cost of a false positive (retrying a permanent error we didn't blocklist) is ~200ms of wasted latency, while the cost of a false negative (NOT retrying a transient error we didn't allowlist) is a spurious user-facing failure. The former is strictly safer, so isRetryablePostgresPreCommitError now retries everything EXCEPT context.Canceled/DeadlineExceeded and the permanent SQLSTATE classes (22xxx data exceptions, 23xxx constraint violations, 42xxx syntax/schema/permission), via a new isPermanentPostgresError helper. Unknown *pgconn.PgError codes and non-PgError errors are retried. The commit classifier (isSafeRetryablePostgresError) stays OPT-IN (narrow) because at commit retrying is not always safe. isPostgresNetworkError is removed (the pre-commit catch-all now covers network errors; the commit classifier never used it). 2. MaxAttempts is no longer configurable. Added a maxRetryAttempts=3 constant matching database/sql's maxBadConnRetries+1 convention, and removed the MaxAttempts field from RetryTxOptions and RetryUnsafeOptions (plus the '<= 0 -> default to 3' logic). Services that need a different retry budget should bound the total time via the context deadline or wrap the call in their own retry loop. Added a comment on retryTx explaining the interaction with database/sql's internal driver.ErrBadConn retry (immediate, up to 3 attempts for *sql.DB methods) — our outer retry layers on top with full-jitter backoff to survive longer outages; for *sql.Tx methods there is no internal retry, so this is the only one. Tests: TestIsRetryablePostgresPreCommitError rewritten for opt-out (application error and unknown PgError code now retryable; added 22xxx/ 42xxx permanent cases); removed isPostgresNetworkError comment refs; replaced all MaxAttempts: 3 with {} in RetryTxOptions/RetryUnsafeOptions. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

The driver.ErrBadConn check in isSafeRetryablePostgresError (the commit-phase classifier) was unreachable dead code: - *sql.Tx.Commit does not retry driver.ErrBadConn internally (only *sql.DB methods have the maxBadConnRetries loop — see golang/go#11264), so there is no database/sql retry to double up with at commit time. - pgx's wrapTx.Commit returns the native pgx error directly (return wtx.tx.Commit(wtx.ctx)); the SafeToRetry -> driver.ErrBadConn conversion only happens in Conn.ExecContext/QueryContext (i.e., for fn's tx.Exec/tx.Query), NOT for Commit. So the commit classifier never sees driver.ErrBadConn with pgx. - The pre-send commit case (a SafeToRetry-flagged pgx error from commit) is already covered by the pgconn.SafeToRetry leg. Removed the leg and the now-unused database/sql/driver import from postgres.go, and documented on isSafeRetryablePostgresError why driver.ErrBadConn is not checked there. driver.ErrBadConn from fn's tx.Exec is still retryable — that's a pre-commit error handled by isRetryablePostgresPreCommitError's opt-out catch-all (TestIsRetryablePostgresPreCommitError covers it). TestIsSafeRetryablePostgresError: removed the driver.ErrBadConn -> true assertions (now false) and added a comment explaining why driver.ErrBadConn isn't in the commit classifier. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Trim verbose comments, drop general-knowledge explanations (e.g. Rollback is a no-op after Commit), remove rationale duplicated across function docs, and fix stale references to the renamed isRetryablePostgresTxError in the mysql.go/spanner.go placeholder TODOs. Behavior unchanged. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

HealthCheckPeriod drives the background goroutine that evicts connections exceeded MaxConnLifetime/MaxConnIdleTime; it does NOT ping for liveness. Liveness pinging is at acquire time (stdlib ResetSession, default ping if idle > 1s), with database/sql retrying on a fresh conn. The old comment claimed HealthCheckPeriod pings idle connections in the background and evicts dead connections before the app sees them — both wrong. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

quickdie() (the SIGQUIT handler that raises ERRCODE_CRASH_SHUTDOWN) does _exit(2) WITHOUT rolling back — the source comment says 'we don't want to try to clean up our transaction'. The rollback is via crash recovery on restart, not before the client sees 57P02. Contrast with die() (57P01 admin_shutdown), which processes the interrupt through proc_exit/ AbortTransaction and does roll back before sending. So 57P02 at commit is ambiguous: the commit may have been recorded before SIGQUIT, and quickdie skips rollback. Removed 57P02 from isSafeRetryable- PostgresError (the commit classifier) so it's wrapped with ErrCommitPhase at commit (caller decides). 57P02 is still retried pre-commit via isRetryablePostgresPreCommitError's opt-out — the tx didn't commit (process exited before committing), so retry is safe there. Tests: TestIsSafeRetryablePostgresError asserts 57P02 -> false at commit; TestIsRetryablePostgresPreCommitError asserts 57P02 -> true pre-commit (with a comment noting it's safe despite no pre-return rollback). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Commit classifier (isSafeRetryablePostgresError) now retries only errors that can NEVER mean 'possibly committed': - pgconn.SafeToRetry (pgx guarantees pre-send: COMMIT message never sent) - pgerrcode.IsTransactionRollback (class 40: server rolled back the tx before returning the ErrorResponse — 40001 serialization_failure, 40P01 deadlock_detected, 40002, 40003, 40000) Dropped 57P01 (admin_shutdown), 57P02 (crash_shutdown), 57014 (query_canceled), 57P03, 53300 from the commit classifier — each could mean 'possibly committed' (signal handlers can interrupt after the commit is recorded; quickdie skips rollback; cancel timing has an edge case; connect-time codes are irrelevant). They're still retried pre-commit via isRetryablePostgresPreCommitError's opt-out (the tx didn't commit pre-commit). At commit they're wrapped with ErrCommitPhase so the caller decides. Adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants, by the pgx author) to eliminate string literals and the inline human-friendly- name comments: removed the local postgresErrUniqueViolation/postgresErrDeadlock- Found consts; PostgresUniqueViolation/PostgresDeadlockFound use pgerrcode; isPermanentPostgresError uses pgerrcode.IsDataException/IsIntegrityConstraint- Violation/IsSyntaxErrororAccessRuleViolation (class 22/23/42); the commit classifier uses pgerrcode.IsTransactionRollback (class 40). Tests: TestIsSafeRetryablePostgresError now asserts only SafeToRetry + class 40 retryable at commit (57P01/57P02/57P03/53300/57014 false); added a TestRetryPostgresTx subtest for 57P01-at-commit not retried + wrapped with ErrCommitPhase; TestIsRetryablePostgresPreCommitError covers 57P01/57P02/57P03/ 53300/57014 still retryable pre-commit; all string literals swapped for pgerrcode constants (postgres_test.go, database_test.go too). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

…PostgresNonIdempotent Renames so the common case (selects, idempotent DML) doesn't put 'unsafe' all over callers' codebases, and the transactional variant's name signals it's for non-idempotent ops: RetryUnsafe -> RetryIdempotent RetryUnsafeOptions -> RetryIdempotentOptions RetryPostgresTx -> RetryPostgresNonIdempotent RetryMySQLTx -> RetryMySQLNonIdempotent (placeholder) RetrySpannerTx -> RetrySpannerNonIdempotent (placeholder) RetryTxOptions -> RetryNonIdempotentOptions Dropped the redundant 'Tx' suffix from the public names: the fn func(*sql.Tx) parameter makes the transaction obvious at every call site, so 'Tx' in the name added little. The internal retryTx helper keeps its name (unexported, descriptive). No behavior change. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

… + pg_xact_status RetryPostgresNonIdempotent now verifies whether an ambiguous commit landed, per r3470551946 and the PostgreSQL docs (pg_xact_status is documented for 'determine whether their transaction committed or aborted after the application and database server become disconnected while a COMMIT is in progress'). Always-on for the non-idempotent variant (no opt-in). Before each Commit, capture the transaction id: SELECT pg_current_xact_id_if_assigned()::text (NULL → read-only, no writes to duplicate). On a commit-phase error the commit classifier rejects (network, 57P01/57P02, etc.), query pg_xact_status($1::xid8) on a FRESH connection (the original may be dead; xid8 is globally unique so this works across connections/reconnects, modulo a failover replay-lag window): - 'aborted' → retry (commit didn't land). - 'committed' → ErrCommitted (don't retry — would duplicate). - NULL/'in progress'/error → ErrCommitPhase (ambiguous, caller decides). - read-only (xid nil) → retry without verifying (no writes to duplicate). Class-40 commit errors (40001/40P01, server rolled back) are retryable per the commit classifier, so they're retried without verification; on exhaustion they return the original error (not ErrCommitPhase — they're known non-commit). ErrCommitPhase is now only for INCONCLUSIVE commits. Implementation: - ErrCommitted sentinel (distinct from ErrCommitPhase). - commitStatus enum (aborted/committed/unknown). - retryDB interface (BeginTx + QueryRowContext) so retryTx can run the fresh-connection pg_xact_status query. - retryClassifier gains captureXid/verifyCommit hooks (nil for future MySQL/Spanner → ErrCommitPhase fallback, current behavior). - runOneTxAttempt captures the xid before Commit and returns it. - retryTx loop: on a classifier-rejected commit error, verify and decide retry / ErrCommitted / ErrCommitPhase; read-only → retry. - postgres.go: capturePostgresXid (::text cast sidesteps pgx's xid8 Go type-mapping) and verifyPostgresCommit (::xid8 cast), wired into postgresTxClassifier. Removed the TODO(future). Tests: mock driver implements QueryerContext, scripting the two verification queries (captureXidResult/verifyStatusResult/verifyErr). New TestRetryPostgres- NonIdempotent subtests cover verify-aborted→retry, verify-committed→ErrCommitted, verify-unknown→ErrCommitPhase, read-only→retry-without-verifying, verify-error →ErrCommitPhase, and class-40-exhaustion→original-error. TestErrCommitted added. ErrCommitPhase-exhaustion test repurposed (class-40 now returns the original, not ErrCommitPhase). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Now that commit verification (pg_xact_status) is in place, the separate 'commit classifier' (isSafeRetryablePostgresError) is unnecessary. Commit errors are rare, and we already pay a round-trip to capture the xid before every commit — so always checking pg_xact_status on a commit error (one more round-trip, only on the rare commit error) is negligible, and it lets the SERVER be the authority on whether the commit landed instead of a client-side guess. retryTx now: pre-commit errors → preCommit classifier (opt-out, unchanged); commit errors → always verify (read-only/xid-nil → retry without verifying; aborted → retry; committed → ErrCommitted; inconclusive → ErrCommitPhase). No commit classifier, no classify method. opts.IsRetryable is additive on preCommit only (commit errors are verified, not classified — safer, since a caller shouldn't override commit verification). Removed isSafeRetryablePostgresError (and its complex exclusion doc about 57P01/57P02/57014/57P03/53300/network/driver.ErrBadConn) and TestIsSafeRetryablePostgresError. postgresTxClassifier drops the commit field. Trade-off: class-40 / SafeToRetry commit errors now go through verification (+1 pg_xact_status round-trip; previously fast-pathed). No regression for common cases — class-40 → 'aborted' → retry (same outcome); the rare SafeToRetry+failover+replay-lag case → ErrCommitPhase (recoverable). Class-40 doesn't hit the replay-lag case (the server responded, no failover). Tests updated: SafeToRetry-commit subtest now verifies (verifyStatusResult= aborted, verifyCalls==1); class-40-exhaustion subtest verifies each attempt (verifyCalls==3) and still returns the original error (not ErrCommitPhase). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

stevemsmith and others added 20 commits June 25, 2026 12:41

database: lower HealthCheckPeriod from 5s to 1s

c3e3eda

For sub-second AlloyDB switchovers, 1s health checks detect and evict dead connections faster with negligible overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

database: address review feedback from Gemini

b56ff9e

- Collapse switch cases in IsRetryablePostgresError for readability - Make context cancellation test more specific (assert context.Canceled and exact call count) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor Bot changed the title ~~database: RetryIdempotent, RetryPostgresNonIdempotent~~ database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification Jun 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification#2

database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification#2
elovelan wants to merge 20 commits into
masterfrom
cursor/postgres-retry-tx-0f34

elovelan commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

elovelan commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

What this does

1. Failover-safe connection pool (postgres.go, model_config.go, database.go)

2. Retry helpers (retry.go, postgres.go)

3. Commit verification (postgres.go, retry.go)

Retry design summary

Sentinels

Other

Test plan

Origin

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

elovelan commented Jun 30, 2026 •

edited

Loading

1. Failover-safe connection pool (`postgres.go`, `model_config.go`, `database.go`)

2. Retry helpers (`retry.go`, `postgres.go`)

3. Commit verification (`postgres.go`, `retry.go`)