
perf(db): Phase 1 - indexes, batched cleanup, and distributed cleanup lock #164

Open
appleboy wants to merge 4 commits into main from worktree-DB

Conversation

@appleboy
Member

Summary

Phase 1 of a multi-phase DB + Redis optimization plan targeting 20k+ users on 5+ pod deployments. This PR focuses on low-risk, high-impact changes: indexes, cleanup batching, distributed cleanup lock, and production-recommended cache defaults. No service/handler logic changed; all OAuth flows behave identically.

  • Indexes (11 new): covers owner lookups, per-app consent grants, OAuth connections, audit-log filters, pending device codes, and authorization-code cleanup.
  • Batched cleanup: DeleteExpiredTokens / DeleteExpiredDeviceCodes / DeleteOldAuditLogs now delete 10k rows per iteration with a 200ms pause instead of a single unbounded DELETE. Backed by a shared deleteInBatches helper.
  • Distributed cleanup lock (rueidislock): with 5+ pods enabled, all replicas may run cleanup — a Redis lock ensures only one actually executes each tick. Idempotent DELETEs mean split-brain during Redis failover is safe (at worst: transient double work).
  • Cache defaults for production: new .env.production.example ships recommended redis-aside settings for token / client / user / metrics caches + connection-pool caps.

Implementation flow (what changed and why)

1. Index additions (11 indexes)

Modified GORM tags so AutoMigrate creates the indexes on fresh databases.

| Model | Index | Query it fixes |
| --- | --- | --- |
| OAuthApplication | `user_id`, `status` | ListClientsByUserID, CountClientsByStatus, dashboard |
| UserAuthorization | `(user_id, is_active)`, `(client_id, is_active)` | ListUserAuthorizations, RevokeAllUserAuthorizationsByClientID |
| OAuthConnection | `user_id` (single) | GetOAuthConnectionsByUserID, bulk delete per user |
| AuditLog | `severity`, `(event_type, event_time DESC)`, `(actor_user_id, event_time DESC)`, `(resource_type, resource_id, event_time DESC)` | Admin filter combinations on large tables |
| DeviceCode | `(authorized, expires_at)` | CountPendingDeviceCodes, metrics gauge |
| AuthorizationCode | `(used_at, expires_at)` | Replay prevention, cleanup |

For production PostgreSQL, internal/store/migrations/001_add_phase1_indexes.sql uses CREATE INDEX CONCURRENTLY IF NOT EXISTS so adding these to a live DB does not block writes. AutoMigrate on startup becomes a no-op since the indexes already exist.

2. Batched cleanup (avoid long-lock DELETE)

internal/store/cleanup.go gains a shared helper:

func (s *Store) deleteInBatches(model any, whereClause string, args ...any) (int64, error)

It loops DELETE ... WHERE id IN (SELECT id ... LIMIT N) until a batch returns fewer than cleanupBatchSize (default 10000) rows, sleeping cleanupBatchPause (default 200ms) between iterations. The subquery form is used because PostgreSQL does not support DELETE ... LIMIT directly, and it lets the inner SELECT use the expires_at / created_at index.

All three cleanup functions (DeleteExpiredTokens, DeleteExpiredDeviceCodes, DeleteOldAuditLogs) now delegate to this helper.
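The loop logic can be sketched in isolation. This is a simplified stand-in for the PR's `deleteInBatches` helper, not its actual signature: `execBatch` abstracts the GORM call that runs `DELETE ... WHERE id IN (SELECT id ... LIMIT batchSize)` and returns the rows removed, which keeps the termination behavior visible and testable without a database.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// deleteInBatches repeatedly deletes up to batchSize rows until a batch
// comes back short, pausing between iterations so other transactions can
// make progress. execBatch stands in for the real GORM batch delete.
func deleteInBatches(batchSize int, pause time.Duration, execBatch func(limit int) (int64, error)) (int64, error) {
	if batchSize <= 0 {
		// Guard: with GORM, Limit(0) disables the clause, which would
		// turn a misconfigured size into an unbounded delete.
		return 0, errors.New("batch size must be positive")
	}
	var total int64
	for {
		n, err := execBatch(batchSize)
		if err != nil {
			return total, err
		}
		total += n
		// A short (or empty) batch means no more matching rows.
		if n < int64(batchSize) {
			return total, nil
		}
		time.Sleep(pause)
	}
}

func main() {
	remaining := int64(250) // pretend 250 expired rows exist
	deleted, _ := deleteInBatches(100, 0, func(limit int) (int64, error) {
		n := int64(limit)
		if remaining < n {
			n = remaining
		}
		remaining -= n
		return n, nil
	})
	fmt.Println(deleted) // 250 (batches of 100, 100, 50)
}
```

This mirrors the unit-test scenario below: 250 expired rows with a batch size of 100 exercises the multi-iteration path and the short-batch exit.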

3. Distributed cleanup lock

New file internal/bootstrap/cleanup_lock.go:

  • initializeCleanupLocker creates a rueidislock.Locker (single-node mode, KeyMajority=1) when ENABLE_CLEANUP_LOCK=true.
  • runWithCleanupLock wraps each cleanup job: calls TryWithContext — if another pod holds the lock this tick, the job is skipped silently and the next tick tries again.
  • Locker is closed on graceful shutdown.

Existing addExpiredTokenCleanupJob and addAuditLogCleanupJob now receive the locker and delegate through runWithCleanupLock. Lock key names live in constants (cleanupLockAuditLogs, cleanupLockExpiredTokens) to keep them greppable for runbooks.

New config (in internal/config/config.go):

| Env var | Default | Purpose |
| --- | --- | --- |
| ENABLE_CLEANUP_LOCK | false | Opt-in: gate the lock behind an explicit flag |
| CLEANUP_LOCK_KEY_VALIDITY | 5m | Lock TTL; rueidislock auto-extends while held |

4. Production .env template

New .env.production.example documents the 5+ pod starting point:

  • TOKEN_CACHE_ENABLED=true + all four caches on redis-aside
  • Connection pool: DB_MAX_OPEN_CONNS=25 × 5 pods = 125 (fits a standard max_connections=200 Postgres)
  • ENABLE_EXPIRED_TOKEN_CLEANUP=true + ENABLE_CLEANUP_LOCK=true
  • AUDIT_LOG_RETENTION=2160h (90 days)
  • METRICS_GAUGE_UPDATE_ENABLED=true — enable on one replica only (or use PromQL aggregation)

Dev .env.example is intentionally not touched so local sqlite + no Redis still works out of the box.

How to verify this PR

Unit tests (new)

go test ./internal/store/ -run 'TestDeleteExpired|TestDeleteOld' -v

Covers:

  • 250 expired tokens with cleanupBatchSize=100 → multi-iteration loop correctness, 30 active tokens preserved
  • Zero expired rows → loop exits immediately
  • 120 expired device codes + 15 active → same for device codes
  • 80 old audit logs + 20 recent → returns int64(80) from DeleteOldAuditLogs

Full regression

make generate   # templ + swagger
make test       # full test suite
make lint       # golangci-lint (no issues)
make build      # produces bin/authgate

Manual / staging verification

  1. Index confirmation (PostgreSQL):

    \d user_authorizations
    \d audit_logs
    EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM user_authorizations WHERE user_id = '<uuid>' AND is_active = true;
    -- expect: Index Scan using idx_user_active
  2. Distributed lock: bring up 2+ pods with ENABLE_CLEANUP_LOCK=true, force a cleanup tick. Exactly one pod should log Cleaned up N ..., the others silently skip.

  3. Cache hit rate: hit /metrics after warm-up, confirm cache_hits_total{cache_name="token"} / (hits + misses) > 90%.

  4. Graceful shutdown: send SIGTERM, verify Closing cleanup locker... / closed appears in logs with no leaked Redis connections.

  5. Load test (optional, requires k6/vegeta):

    • 500 qps /oauth/tokeninfo — P99 should drop vs baseline, DB qps near flat (cache absorbs).
  • Admin audit filter event_type=DEVICE_CODE_AUTHORIZED&start_time=... — EXPLAIN ANALYZE should use idx_audit_type_time.

How to use (operator guide)

Fresh deployment

  1. Copy .env.production.example to .env (or inject via Helm).
  2. Fill in secrets (JWT_SECRET, SESSION_SECRET, DATABASE_DSN, REDIS_ADDR).
  3. Start pods. AutoMigrate creates all indexes on first boot.

Existing deployment migrating live

  1. Run internal/store/migrations/001_add_phase1_indexes.sql against PostgreSQL one statement at a time (CREATE INDEX CONCURRENTLY cannot run in a transaction block). Each statement takes seconds to minutes depending on table size; writes stay unblocked.
  2. Roll out this version of AuthGate. AutoMigrate detects existing indexes and is a no-op.
  3. Flip env vars per the production template — especially TOKEN_CACHE_ENABLED=true, ENABLE_EXPIRED_TOKEN_CLEANUP=true, ENABLE_CLEANUP_LOCK=true.

Redis requirements

  • Redis 6.0+ (rueidis-aside requires RESP3 CLIENT TRACKING).
  • For rueidislock: any Redis reachable by all pods. HA (Sentinel / managed) recommended; split-brain during failover is safe due to idempotent DELETEs.

Scope / non-goals

Phase 1 only. The following are intentionally deferred:

  • Phase 2: device code polling cache, dashboard cache, username cache.
  • Phase 3: OAuth connection provider-token encryption, read replica (gorm.io/plugin/dbresolver), list endpoint pagination.
  • Phase 4: pg_trgm GIN for admin search, external audit sink.

Test plan

  • go build ./...
  • go test ./... -short — all packages pass
  • go test ./internal/store/ -run 'TestDeleteExpired|TestDeleteOld' -v — 4 new tests pass
  • make lint — 0 issues
  • make build — produces bin/authgate
  • Staging: verify AutoMigrate creates indexes on fresh SQLite DB
  • Staging: run migration SQL against PostgreSQL, confirm \d shows new indexes
  • Staging: multi-pod cleanup lock behavior (only one pod logs cleanup per tick)
  • Staging: cache_hits_total growing, cache_misses_total plateaus after warm-up

🤖 Generated with Claude Code

…up lock

- Add composite and single-column indexes on oauth_applications, user_authorizations, oauth_connections, audit_logs, device_codes, and authorization_codes for multi-pod hot paths
- Ship PostgreSQL migration using CREATE INDEX CONCURRENTLY to avoid write locks on large tables
- Batch DELETE of expired tokens, device codes, and old audit logs via shared deleteInBatches helper (10k rows per iteration)
- Add Redis-backed distributed cleanup lock via rueidislock so only one pod runs each cleanup per tick
- Introduce .env.production.example with recommended redis-aside cache defaults for 5+ pod deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 12, 2026 13:44

codecov bot commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 30.35714% with 78 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/bootstrap/server.go | 0.00% | 41 Missing ⚠️ |
| internal/bootstrap/cleanup_lock.go | 0.00% | 29 Missing ⚠️ |
| internal/bootstrap/bootstrap.go | 0.00% | 6 Missing ⚠️ |
| internal/store/cleanup.go | 90.00% | 1 Missing and 1 partial ⚠️ |



Copilot AI left a comment


Pull request overview

Phase 1 of DB/Redis performance optimizations aimed at large, multi-pod deployments by adding DB indexes, batching cleanup deletes, and serializing cleanup work across pods via a Redis-backed lock.

Changes:

  • Added a PostgreSQL migration with concurrent index creation plus corresponding GORM index tags.
  • Implemented batched cleanup deletes for expired tokens/device codes and old audit logs, with unit tests.
  • Introduced an optional distributed cleanup lock (rueidislock) and a production-focused .env template.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| internal/store/migrations/001_add_phase1_indexes.sql | Adds Phase 1 Postgres indexes using CREATE INDEX CONCURRENTLY IF NOT EXISTS. |
| internal/store/cleanup.go | Adds shared deleteInBatches helper and uses it for token/device-code cleanup. |
| internal/store/cleanup_test.go | Adds unit tests covering batched cleanup behavior. |
| internal/store/audit_log.go | Switches audit-log cleanup to use batched deletion helper. |
| internal/models/user_authorization.go | Adds composite indexes for active authorization lookups. |
| internal/models/oauth_connection.go | Adds an index tag for user lookups. |
| internal/models/oauth_application.go | Adds index tags for owner/status lookups. |
| internal/models/device_code.go | Adds composite index for pending/expiry queries. |
| internal/models/authorization_code.go | Adds composite index for used/expiry queries. |
| internal/models/audit_log.go | Adds composite indexes to support admin filtering queries. |
| internal/config/config.go | Adds config/env for enabling cleanup lock + lock validity. |
| internal/bootstrap/cleanup_lock.go | Implements optional Redis-backed locker and a wrapper for guarded runs. |
| internal/bootstrap/server.go | Wraps cleanup jobs with the distributed lock + adds locker shutdown handling. |
| internal/bootstrap/bootstrap.go | Wires locker initialization and passes it to cleanup jobs. |
| .env.production.example | Adds a production-oriented env template for caches/pools/cleanup lock defaults. |


Comment thread internal/models/oauth_application.go Outdated
Comment thread internal/models/oauth_connection.go Outdated
Comment thread internal/models/audit_log.go Outdated
Comment thread internal/bootstrap/server.go
…eanup interval

- Rename GORM index tags on oauth_applications.user_id / status, oauth_connections.user_id, and audit_logs.severity to match the names used in 001_add_phase1_indexes.sql so AutoMigrate does not create duplicate default-named indexes on existing PostgreSQL deployments
- Use cfg.AuditLogCleanupInterval instead of a hard-coded 24h ticker in addAuditLogCleanupJob and treat <= 0 as disabling cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.



Comment thread internal/store/cleanup.go
Comment thread internal/store/cleanup.go Outdated
…ment id assumption

- Rename deleteInBatches to deleteByIDInBatches and document that it requires an id primary key column
- Return an error when cleanupBatchSize <= 0 so a misconfigured var cannot silently loop forever via GORM Limit(0) disabling the clause
- Also terminate on RowsAffected == 0 as a belt-and-braces exit condition
- Add a unit test covering the invalid batch size guard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.



Comment thread internal/bootstrap/server.go
Comment thread internal/store/migrations/001_add_phase1_indexes.sql Outdated
Comment thread .env.production.example Outdated
Comment thread internal/config/config.go
- Validate EnableCleanupLock requires REDIS_ADDR and positive CleanupLockKeyValidity, and require positive ExpiredTokenCleanupInterval when expired-token cleanup is enabled, so misconfiguration fails fast instead of panicking time.NewTicker or throwing runtime lock errors
- Guard addExpiredTokenCleanupJob against non-positive interval as defense in depth
- Drop standalone idx_oauth_conn_user_id in favor of the existing composite unique index idx_oauth_user_provider whose leading column already serves user_id-only predicates
- Flip METRICS_GAUGE_UPDATE_ENABLED default to false in the production template so copy-pasting into every pod does not double-count gauge metrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.


