Skip to content

database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification#2

Open
elovelan wants to merge 20 commits into
masterfrom
cursor/postgres-retry-tx-0f34
Open

database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification#2
elovelan wants to merge 20 commits into
masterfrom
cursor/postgres-retry-tx-0f34

Conversation

@elovelan

@elovelan elovelan commented Jun 30, 2026

Copy link
Copy Markdown
Owner

Problem

Postgres/AlloyDB connections fail during brief network outages — most notably AlloyDB's monthly maintenance switchovers (< 1s downtime), but also any transient network blip. Today:

  1. Dead connections linger in the pool. database/sql with the pgx driver has no background connection health checks, so after a switchover, dead connections sit in the pool and get handed to the application.
  2. No pool limits by default. ConnectionsConfig defaults to all zero values, so database/sql holds connections forever.
  3. In-flight queries have no retry path. A query that fails mid-flight (connection severed) just returns an error to the caller — no retry, no indication whether a write landed.

The result: services return 500s during maintenance windows, and callers have to guess whether a write committed after a network error.

What this does

Three layers, each independently useful:

1. Failover-safe connection pool (postgres.go, model_config.go, database.go)

  • Switch from sql.Open("pgx", …) to pgxpool.NewWithConfig() + stdlib.OpenDBFromPool(), keeping the *sql.DB return type so no downstream changes are needed.
  • HealthCheckPeriod = 1s — a background goroutine evicts connections that exceeded MaxConnLifetime or MaxConnIdleTime every second. (It does not ping — liveness pinging happens at acquire time via stdlib's ResetSession, default "ping if idle > 1s"; database/sql then retries on a fresh conn. HealthCheckPeriod just keeps expired/idle connections from accumulating.)
  • DefaultPostgresConnectionsConfig() + ApplyPostgresConnectionsConfig() — failover-safe defaults (MaxLifetime=5m, MaxIdleTime=30s, MaxOpen=25, MaxIdle=5) applied automatically when services don't explicitly configure them.

2. Retry helpers (retry.go, postgres.go)

Two functions so the common case (selects, idempotent DML) doesn't carry "unsafe" wording or transaction overhead:

  • RetryIdempotent(ctx, opts, fn func() error) — retries fn (no transaction wrapper) on any error by default except context.Canceled/context.DeadlineExceeded. The caller vouches that fn is idempotent. Lightweight; for selects and idempotent DML.
  • RetryPostgresNonIdempotent(ctx, db, opts, fn func(*sql.Tx) error) — runs fn in a Postgres transaction and retries the whole transaction on transient errors. fn receives a fresh *sql.Tx each attempt. For non-idempotent operations (the transaction makes retry safe via rollback, plus commit verification below). Shape matches pgx.BeginTxFunc / pgxtransactions.Transaction.

Both use a non-configurable maxRetryAttempts = 3 (matching database/sql's maxBadConnRetries+1), full-jitter backoff (up to 100ms), and respect the caller's context deadline.

RetryMySQLNonIdempotent / RetrySpannerNonIdempotent are placeholders (return "not yet implemented") so the cross-backend API surface is visible.

3. Commit verification (postgres.go, retry.go)

The hard part of retrying transactions: a commit error is ambiguous — the commit may have succeeded on the server before the error was returned to the client (e.g., the connection was severed after COMMIT was sent but before the response). Blindly retrying could duplicate committed work.

RetryPostgresNonIdempotent resolves this by asking the server, per the PostgreSQL docs (which explicitly endorse pg_xact_status for "determine whether their transaction committed or aborted after the application and database server become disconnected while a COMMIT is in progress"):

  1. Before each Commit, capture the transaction id: SELECT pg_current_xact_id_if_assigned()::text (NULL → read-only — no writes to duplicate, safe to retry without verifying).
  2. On a commit error, query SELECT pg_xact_status($1::xid8) on a fresh connection (the original may be dead; xid8 is globally unique so this works across connections/reconnects, modulo a brief failover replay-lag window):
    • 'aborted' → retry (the commit didn't land).
    • 'committed'ErrCommitted (don't retry — would duplicate; the caller reconciles/alerts).
    • NULL / 'in progress' / query error → ErrCommitPhase (inconclusive; caller decides).

Commit errors are always verified (not classified by a client-side guess) — the server is the authority on whether a commit landed. Class-40 commit errors (40001 serialization, 40P01 deadlock — server rolled back) → pg_xact_status says 'aborted' → retry. A retryable commit error (verified-aborted or read-only) on exhaustion returns the original error (not ErrCommitPhase — it's known non-commit, just out of retries). ErrCommitPhase is only for inconclusive commits.

Retry design summary

  • Pre-commit errors (BeginTx/fn): opt-OUT classifier — pre-commit retry is always safe (the transaction rolls back), so it retries everything except context.Canceled/context.DeadlineExceeded and the permanent SQLSTATE classes (22xxx data exceptions, 23xxx constraint violations, 42xxx syntax/schema/permission — via pgerrcode). Unknown codes and non-PgError errors are retried (a false positive wastes ~200ms; a false negative causes a spurious user-facing failure — the former is safer).
  • Commit errors: always verify via pg_xact_status (above).
  • opts.IsRetryable is additive (OR) on the pre-commit classifier (commit errors are verified, not classified — safer; a caller shouldn't override commit verification).

Sentinels

if errors.Is(err, database.ErrCommitted) {
    // Verified committed: do NOT retry (would duplicate). Reconcile/alert.
}
if errors.Is(err, database.ErrCommitPhase) {
    // Inconclusive: the commit may or may not have landed. Decide / re-check.
}

Both preserve the underlying commit error for errors.Is/errors.As.

Other

  • pgerrcode: adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants, by the pgx author) to eliminate string literals and inline human-friendly-name comments.
  • Naming: RetryIdempotent (no tx) / RetryPostgresNonIdempotent (tx + verification) — the Tx suffix was dropped (the fn func(*sql.Tx) error parameter makes the transaction obvious). The existing RetryPostgres is removed (this is unmerged work).
  • database/sql interaction: *sql.DB methods already retry driver.ErrBadConn internally (immediate, up to 3 attempts); this outer retry layers on top with jitter for longer outages. *sql.Tx methods have no internal retry, so this is the only retry for tx.Exec/tx.Commit.

Test plan

  • TestRetryIdempotent — success, retries on any error by default (incl. 23505), custom IsRetryable short-circuits, no retry on context.Canceled, ctx cancellation, attempt exhaustion.
  • TestRetryPostgresNonIdempotent — success; fn-error-then-succeed (rollback between attempts); non-retryable fn error short-circuits (23505); driver.ErrBadConn from fn retried; SafeToRetry Begin failure retried; SafeToRetry Commit failure → verified-aborted → retry; verify-aborted → retry; verify-committed → ErrCommitted; verify-unknown (NULL) → ErrCommitPhase; verify-error → ErrCommitPhase; read-only (no xid) → retry without verifying; class-40 exhaustion → verify-aborted each attempt → original error (not ErrCommitPhase); 57P01+inconclusive → ErrCommitPhase; additive opts.IsRetryable (pre-commit); ctx cancellation; attempt exhaustion. Uses a database/sql/driver mock (with QueryerContext to script the verification queries).
  • TestIsRetryablePostgresPreCommitError — pre-commit classifier leg coverage.
  • TestErrCommitPhase / TestErrCommitted — sentinel detection; underlying error preserved; distinct from each other.
  • TestRetryMySQLNonIdempotentNotImplemented / TestRetrySpannerNonIdempotentNotImplemented — placeholders return "not yet implemented".
  • Existing IsRetryablePostgresError / PostgresUniqueViolation / PostgresDeadlockFound / pool-defaults tests still pass.
  • go build ./..., go vet ./..., go test ./database/... -short all pass.

Origin

This combines the failover-safe pool work (upstream moov-io/base PR #490) with the transaction-retry pattern suggested in its review (r3470551946, r3487236671, r3482973525, r3483137216).

stevemsmith and others added 20 commits June 25, 2026 12:41
…oyDB

During AlloyDB maintenance switchovers (< 1s downtime), services using
database/sql with pgx fail to recover because:
1. No connection pool limits are set by default, so dead connections
   persist indefinitely
2. No retry logic exists for transient connection errors

This adds:
- DefaultPostgresConnectionsConfig() with MaxLifetime=5m, MaxIdleTime=30s
  to ensure dead connections are evicted quickly after failover
- ApplyPostgresConnectionsConfig() that fills in safe defaults when
  services don't explicitly configure pool settings
- IsRetryablePostgresError() to classify transient PG/network errors
- RetryPostgres() for services to wrap critical DB operations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from sql.Open("pgx") to pgxpool.NewWithConfig() wrapped with
stdlib.OpenDBFromPool(). This gives us pgxpool's HealthCheckPeriod
(set to 5s) which proactively pings idle connections and evicts dead
ones — the most important fix for surviving AlloyDB maintenance
switchovers. The return type remains *sql.DB so no downstream changes
are needed.

Also cleans up the dialer TODO (dialer lifecycle is now tied to the
pool) and removes the unused pgx.ParseConfig import path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For sub-second AlloyDB switchovers, 1s health checks detect and evict
dead connections faster with negligible overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Collapse switch cases in IsRetryablePostgresError for readability
- Make context cancellation test more specific (assert context.Canceled
  and exact call count)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… codes

Remove the 08xxx connection_exception PgError cases — pgx never surfaces
these as *pgconn.PgError, so they were dead code. Add the codes that pgx
does return and where the server guarantees a rollback before returning
the error: 40001 (serialization_failure), 40P01 (deadlock_detected),
53300 (too_many_connections), and 57014 (query_canceled).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace the fixed 200ms×attempt linear backoff with a random delay in
[0, 100ms) per retry gap. Full jitter spreads concurrent retries across
the fleet instead of having all callers wait the same fixed interval and
slam the database simultaneously. Worst-case retry window across 3
attempts is ~200ms, sized for AlloyDB planned maintenance switchovers
which typically cause less than one second of downtime.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Drop "connection refused" (not present in pgx's codebase) and
"unexpected EOF" (already caught by errors.Is(err, io.ErrUnexpectedEOF)).
Add comments explaining why the remaining string matches are safe to retry.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…etryUnsafe

Implements the transaction-retry pattern suggested in PR moov-io#490 review
comments (discussion_r3470551946 and discussion_r3487236671):

- RetryPostgresTx wraps fn in an explicit transaction (fn receives a
  *sql.Tx) and retries the whole transaction on transient errors. This
  is the "safe" variant: the explicit transaction guarantees fn's work
  is atomic, so a retried attempt starts from a clean slate.
- RetryUnsafe retries fn (no transaction wrapper) on any error by
  default; the caller vouches that fn is idempotent. This is the
  "unsafe" variant for cases where a transaction isn't feasible.
- RetryMySQLTx and RetrySpannerTx are added as not-implemented
  placeholders so the cross-backend API surface is visible and the gap
  is enforced by tests.

The safe variant's default classifier (isRetryablePostgresTxError) is
the OR of three signals:
  1. driver.ErrBadConn -- pgx's stdlib adapter converts
     pgconn.SafeToRetry-flagged Exec/Query errors to driver.ErrBadConn
     before fn sees them; this is the primary retry trigger inside a
     transaction.
  2. pgconn.SafeToRetry -- covers Begin/Commit failures where the
     original pgx error flows through unchanged.
  3. IsRetryablePostgresError -- server-returned SQLSTATE codes that
     guarantee rollback (40001, 40P01, 57P01, ...).

RetryTxOptions.IsRetryable is additive on top of the default classifier
so consumers can extend it with implementation-specific cases without
narrowing the safety floor. RetryUnsafeOptions.IsRetryable replaces the
default (caller takes full control).

The existing RetryPostgres is removed (the PR is unmerged). A stacked
follow-up will add pg_current_xact_id()/pg_xact_status() commit
verification; a TODO marks the insertion point in retryTx.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without
an explicit transaction there is no way to know whether a network error
occurred before or after the server accepted the query, so retrying could
duplicate committed work.

IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that
the server guarantees were rolled back before the error was returned
(40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError,
io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by
peer"/"broken pipe"/"conn closed") checks are removed. The
context.DeadlineExceeded check is removed as redundant (non-PgError errors
already return false). The doc comment explains why network errors are
excluded and points callers to RetryPostgresTx for transactional retry.

The network-error checks move into a new internal isPostgresNetworkError
helper, which is wired into isRetryablePostgresTxError as a 4th leg.
Network errors ARE safe to retry within an explicit transaction (the
server rolls back the uncommitted transaction), so RetryPostgresTx keeps
them. String matching is dropped entirely -- driver.ErrBadConn and
pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError
/ io checks cover post-send cases, so substring matching is redundant and
fragile.

Tests updated:
- TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError,
  and the string-matched errors now assert false; SQLSTATE assertions
  unchanged.
- TestIsRetryablePostgresTxError: added a Leg 4 section asserting
  io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the
  transactional classifier.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…narrow) phases

Addresses review feedback questioning the claim that commit-phase errors
always roll back: a commit-phase error is ambiguous because the commit may
have succeeded on the server before the error was returned to the client
(e.g., the TCP connection was severed after the COMMIT message was sent but
before the response was received). Retrying such an error could duplicate
the committed work.

Changes:

- IsRetryablePostgresError -> isSafeRetryablePostgresError (unexported). It
  classified only SQLSTATE codes that the server guarantees were rolled
  back, and the recommended path for callers is now RetryPostgresTx or
  RetryUnsafe, not building a custom retry loop on top of this classifier.

- isRetryablePostgresTxError -> isRetryablePostgresPreCommitError (broad,
  for errors from BeginTx or fn): driver.ErrBadConn + pgconn.SafeToRetry +
  isSafeRetryablePostgresError + isPostgresNetworkError. Any error from
  Begin or fn is safe to retry because the transaction rolls back, so
  network errors are included. Added a comment on the pgconn.SafeToRetry
  leg explaining pgx guarantees these always occur before any data is sent
  to the server.

- New isRetryablePostgresCommitError (narrow, for errors from
  (*sql.Tx).Commit): driver.ErrBadConn + pgconn.SafeToRetry +
  isSafeRetryablePostgresError. Network errors are intentionally EXCLUDED
  because they could mean the commit succeeded but the response was lost.

- New ErrCommitPhase sentinel: commit-phase errors that are not retried
  (e.g., a network error during commit) are wrapped with ErrCommitPhase so
  callers can detect them via errors.Is and decide whether to alert,
  reconcile, or check whether the commit landed (future pg_xact_status()
  work). The underlying error is preserved for inspection.

- retryTx now takes a retryClassifier (preCommit + commit funcs) and
  runOneTxAttempt returns (err, commitPhase bool); the retry loop applies
  the commit classifier when commitPhase is true and wraps the final
  commit-phase error with ErrCommitPhase. opts.IsRetryable remains
  additive on top of BOTH phases.

- TODO(stacked-pr) -> TODO(future) for the pg_current_xact_id()/
  pg_xact_status() commit-verification hook.

Tests:
- Removed TestIsRetryablePostgresError from postgres_test.go (function is
  now private); SQLSTATE coverage moved to internal TestIsSafeRetryablePostgresError.
- Renamed TestIsRetryablePostgresTxError -> TestIsRetryablePostgresPreCommitError.
- Added TestIsRetryablePostgresCommitError (network NOT retryable at commit).
- Added TestRetryPostgresTx subtests: commit-phase network error is not
  retried and is wrapped with ErrCommitPhase; commit-phase exhaustion wraps
  the final error with ErrCommitPhase.
- Added TestErrCommitPhase (sentinel detection + underlying error preserved).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
isRetryablePostgresCommitError was a strict subset of
isRetryablePostgresPreCommitError (pre-send + server-rolled-back, without
the pre-commit-only network leg), duplicating the driver.ErrBadConn and
pgconn.SafeToRetry checks that also belonged to the 'safe at any phase'
bucket. Folded those two legs into isSafeRetryablePostgresError so it
becomes the single 'safe to retry regardless of transaction phase'
classifier (pre-send + server-rolled-back SQLSTATE codes), and assigned
it directly to postgresTxClassifier.commit. isRetryablePostgresPreCommitError
now delegates: isSafeRetryablePostgresError || isPostgresNetworkError.
Behavior is unchanged; only the code structure is simpler.

Also documented on isRetryablePostgresPreCommitError the permanent-error
categories that are intentionally NOT retried pre-commit even though
retrying them would be safe (the transaction rolls back): constraint
violations (23xxx), syntax errors (42601), schema errors (42xxx),
permission errors (42501), and data exceptions (22xxx). These surface at
query time (not connection time) and would fail again on retry, so
retrying only adds ~200ms of latency before the caller receives the
error. Callers who want to retry a broader set can pass opts.IsRetryable.

Tests: folded TestIsRetryablePostgresCommitError assertions into
TestIsSafeRetryablePostgresError (added driver.ErrBadConn + SafeToRetry
-> true); removed TestIsRetryablePostgresCommitError;
TestIsRetryablePostgresPreCommitError unchanged.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…config

Two related changes from review discussion:

1. Pre-commit classifier is now OPT-OUT. Pre-commit, retrying is always
   safe (the transaction rolls back), so the cost of a false positive
   (retrying a permanent error we didn't blocklist) is ~200ms of wasted
   latency, while the cost of a false negative (NOT retrying a transient
   error we didn't allowlist) is a spurious user-facing failure. The
   former is strictly safer, so isRetryablePostgresPreCommitError now
   retries everything EXCEPT context.Canceled/DeadlineExceeded and the
   permanent SQLSTATE classes (22xxx data exceptions, 23xxx constraint
   violations, 42xxx syntax/schema/permission), via a new
   isPermanentPostgresError helper. Unknown *pgconn.PgError codes and
   non-PgError errors are retried. The commit classifier
   (isSafeRetryablePostgresError) stays OPT-IN (narrow) because at commit
   retrying is not always safe. isPostgresNetworkError is removed (the
   pre-commit catch-all now covers network errors; the commit classifier
   never used it).

2. MaxAttempts is no longer configurable. Added a maxRetryAttempts=3
   constant matching database/sql's maxBadConnRetries+1 convention, and
   removed the MaxAttempts field from RetryTxOptions and RetryUnsafeOptions
   (plus the '<= 0 -> default to 3' logic). Services that need a different
   retry budget should bound the total time via the context deadline or
   wrap the call in their own retry loop. Added a comment on retryTx
   explaining the interaction with database/sql's internal driver.ErrBadConn
   retry (immediate, up to 3 attempts for *sql.DB methods) — our outer
   retry layers on top with full-jitter backoff to survive longer outages;
   for *sql.Tx methods there is no internal retry, so this is the only one.

Tests: TestIsRetryablePostgresPreCommitError rewritten for opt-out
(application error and unknown PgError code now retryable; added 22xxx/
42xxx permanent cases); removed isPostgresNetworkError comment refs;
replaced all MaxAttempts: 3 with {} in RetryTxOptions/RetryUnsafeOptions.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
The driver.ErrBadConn check in isSafeRetryablePostgresError (the commit-phase
classifier) was unreachable dead code:

  - *sql.Tx.Commit does not retry driver.ErrBadConn internally (only *sql.DB
    methods have the maxBadConnRetries loop — see golang/go#11264), so there
    is no database/sql retry to double up with at commit time.
  - pgx's wrapTx.Commit returns the native pgx error directly
    (return wtx.tx.Commit(wtx.ctx)); the SafeToRetry -> driver.ErrBadConn
    conversion only happens in Conn.ExecContext/QueryContext (i.e., for fn's
    tx.Exec/tx.Query), NOT for Commit. So the commit classifier never sees
    driver.ErrBadConn with pgx.
  - The pre-send commit case (a SafeToRetry-flagged pgx error from commit) is
    already covered by the pgconn.SafeToRetry leg.

Removed the leg and the now-unused database/sql/driver import from postgres.go,
and documented on isSafeRetryablePostgresError why driver.ErrBadConn is not
checked there. driver.ErrBadConn from fn's tx.Exec is still retryable — that's
a pre-commit error handled by isRetryablePostgresPreCommitError's opt-out
catch-all (TestIsRetryablePostgresPreCommitError covers it).

TestIsSafeRetryablePostgresError: removed the driver.ErrBadConn -> true
assertions (now false) and added a comment explaining why driver.ErrBadConn
isn't in the commit classifier.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Trim verbose comments, drop general-knowledge explanations (e.g. Rollback is
a no-op after Commit), remove rationale duplicated across function docs, and
fix stale references to the renamed isRetryablePostgresTxError in the
mysql.go/spanner.go placeholder TODOs. Behavior unchanged.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
HealthCheckPeriod drives the background goroutine that evicts connections
exceeded MaxConnLifetime/MaxConnIdleTime; it does NOT ping for liveness.
Liveness pinging is at acquire time (stdlib ResetSession, default ping if
idle > 1s), with database/sql retrying on a fresh conn. The old comment
claimed HealthCheckPeriod pings idle connections in the background and
evicts dead connections before the app sees them — both wrong.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
quickdie() (the SIGQUIT handler that raises ERRCODE_CRASH_SHUTDOWN) does
_exit(2) WITHOUT rolling back — the source comment says 'we don't want to
try to clean up our transaction'. The rollback is via crash recovery on
restart, not before the client sees 57P02. Contrast with die() (57P01
admin_shutdown), which processes the interrupt through proc_exit/
AbortTransaction and does roll back before sending.

So 57P02 at commit is ambiguous: the commit may have been recorded before
SIGQUIT, and quickdie skips rollback. Removed 57P02 from isSafeRetryable-
PostgresError (the commit classifier) so it's wrapped with ErrCommitPhase
at commit (caller decides). 57P02 is still retried pre-commit via
isRetryablePostgresPreCommitError's opt-out — the tx didn't commit (process
exited before committing), so retry is safe there.

Tests: TestIsSafeRetryablePostgresError asserts 57P02 -> false at commit;
TestIsRetryablePostgresPreCommitError asserts 57P02 -> true pre-commit (with
a comment noting it's safe despite no pre-return rollback).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Commit classifier (isSafeRetryablePostgresError) now retries only errors that
can NEVER mean 'possibly committed':

  - pgconn.SafeToRetry (pgx guarantees pre-send: COMMIT message never sent)
  - pgerrcode.IsTransactionRollback (class 40: server rolled back the tx
    before returning the ErrorResponse — 40001 serialization_failure, 40P01
    deadlock_detected, 40002, 40003, 40000)

Dropped 57P01 (admin_shutdown), 57P02 (crash_shutdown), 57014 (query_canceled),
57P03, 53300 from the commit classifier — each could mean 'possibly committed'
(signal handlers can interrupt after the commit is recorded; quickdie skips
rollback; cancel timing has an edge case; connect-time codes are irrelevant).
They're still retried pre-commit via isRetryablePostgresPreCommitError's
opt-out (the tx didn't commit pre-commit). At commit they're wrapped with
ErrCommitPhase so the caller decides.

Adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants,
by the pgx author) to eliminate string literals and the inline human-friendly-
name comments: removed the local postgresErrUniqueViolation/postgresErrDeadlock-
Found consts; PostgresUniqueViolation/PostgresDeadlockFound use pgerrcode;
isPermanentPostgresError uses pgerrcode.IsDataException/IsIntegrityConstraint-
Violation/IsSyntaxErrororAccessRuleViolation (class 22/23/42); the commit
classifier uses pgerrcode.IsTransactionRollback (class 40).

Tests: TestIsSafeRetryablePostgresError now asserts only SafeToRetry + class 40
retryable at commit (57P01/57P02/57P03/53300/57014 false); added a
TestRetryPostgresTx subtest for 57P01-at-commit not retried + wrapped with
ErrCommitPhase; TestIsRetryablePostgresPreCommitError covers 57P01/57P02/57P03/
53300/57014 still retryable pre-commit; all string literals swapped for
pgerrcode constants (postgres_test.go, database_test.go too).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…PostgresNonIdempotent

Renames so the common case (selects, idempotent DML) doesn't put 'unsafe' all
over callers' codebases, and the transactional variant's name signals it's for
non-idempotent ops:

  RetryUnsafe             -> RetryIdempotent
  RetryUnsafeOptions      -> RetryIdempotentOptions
  RetryPostgresTx         -> RetryPostgresNonIdempotent
  RetryMySQLTx            -> RetryMySQLNonIdempotent  (placeholder)
  RetrySpannerTx          -> RetrySpannerNonIdempotent (placeholder)
  RetryTxOptions          -> RetryNonIdempotentOptions

Dropped the redundant 'Tx' suffix from the public names: the fn func(*sql.Tx)
parameter makes the transaction obvious at every call site, so 'Tx' in the
name added little. The internal retryTx helper keeps its name (unexported,
descriptive). No behavior change.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
… + pg_xact_status

RetryPostgresNonIdempotent now verifies whether an ambiguous commit landed,
per r3470551946 and the PostgreSQL docs (pg_xact_status is documented for
'determine whether their transaction committed or aborted after the
application and database server become disconnected while a COMMIT is in
progress'). Always-on for the non-idempotent variant (no opt-in).

Before each Commit, capture the transaction id:
  SELECT pg_current_xact_id_if_assigned()::text
(NULL → read-only, no writes to duplicate). On a commit-phase error the
commit classifier rejects (network, 57P01/57P02, etc.), query
pg_xact_status($1::xid8) on a FRESH connection (the original may be dead;
xid8 is globally unique so this works across connections/reconnects, modulo
a failover replay-lag window):

  - 'aborted'     → retry (commit didn't land).
  - 'committed'   → ErrCommitted (don't retry — would duplicate).
  - NULL/'in progress'/error → ErrCommitPhase (ambiguous, caller decides).
  - read-only (xid nil) → retry without verifying (no writes to duplicate).

Class-40 commit errors (40001/40P01, server rolled back) are retryable per
the commit classifier, so they're retried without verification; on exhaustion
they return the original error (not ErrCommitPhase — they're known
non-commit). ErrCommitPhase is now only for INCONCLUSIVE commits.

Implementation:
  - ErrCommitted sentinel (distinct from ErrCommitPhase).
  - commitStatus enum (aborted/committed/unknown).
  - retryDB interface (BeginTx + QueryRowContext) so retryTx can run the
    fresh-connection pg_xact_status query.
  - retryClassifier gains captureXid/verifyCommit hooks (nil for future
    MySQL/Spanner → ErrCommitPhase fallback, current behavior).
  - runOneTxAttempt captures the xid before Commit and returns it.
  - retryTx loop: on a classifier-rejected commit error, verify and decide
    retry / ErrCommitted / ErrCommitPhase; read-only → retry.
  - postgres.go: capturePostgresXid (::text cast sidesteps pgx's xid8 Go
    type-mapping) and verifyPostgresCommit (::xid8 cast), wired into
    postgresTxClassifier. Removed the TODO(future).

Tests: mock driver implements QueryerContext, scripting the two verification
queries (captureXidResult/verifyStatusResult/verifyErr). New TestRetryPostgres-
NonIdempotent subtests cover verify-aborted→retry, verify-committed→ErrCommitted,
verify-unknown→ErrCommitPhase, read-only→retry-without-verifying, verify-error
→ErrCommitPhase, and class-40-exhaustion→original-error. TestErrCommitted
added. ErrCommitPhase-exhaustion test repurposed (class-40 now returns the
original, not ErrCommitPhase).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Now that commit verification (pg_xact_status) is in place, the separate
'commit classifier' (isSafeRetryablePostgresError) is unnecessary. Commit
errors are rare, and we already pay a round-trip to capture the xid before
every commit — so always checking pg_xact_status on a commit error (one more
round-trip, only on the rare commit error) is negligible, and it lets the
SERVER be the authority on whether the commit landed instead of a client-side
guess.

retryTx now: pre-commit errors → preCommit classifier (opt-out, unchanged);
commit errors → always verify (read-only/xid-nil → retry without verifying;
aborted → retry; committed → ErrCommitted; inconclusive → ErrCommitPhase).
No commit classifier, no classify method. opts.IsRetryable is additive on
preCommit only (commit errors are verified, not classified — safer, since a
caller shouldn't override commit verification).

Removed isSafeRetryablePostgresError (and its complex exclusion doc about
57P01/57P02/57014/57P03/53300/network/driver.ErrBadConn) and
TestIsSafeRetryablePostgresError. postgresTxClassifier drops the commit field.

Trade-off: class-40 / SafeToRetry commit errors now go through verification
(+1 pg_xact_status round-trip; previously fast-pathed). No regression for
common cases — class-40 → 'aborted' → retry (same outcome); the rare
SafeToRetry+failover+replay-lag case → ErrCommitPhase (recoverable). Class-40
doesn't hit the replay-lag case (the server responded, no failover).

Tests updated: SafeToRetry-commit subtest now verifies (verifyStatusResult=
aborted, verifyCalls==1); class-40-exhaustion subtest verifies each attempt
(verifyCalls==3) and still returns the original error (not ErrCommitPhase).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
@cursor cursor Bot changed the title database: RetryIdempotent, RetryPostgresNonIdempotent database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants