perf(db): Phase 1 - indexes, batched cleanup, and distributed cleanup lock#164
…up lock

- Add composite and single-column indexes on oauth_applications, user_authorizations, oauth_connections, audit_logs, device_codes, and authorization_codes for multi-pod hot paths
- Ship a PostgreSQL migration using CREATE INDEX CONCURRENTLY to avoid write locks on large tables
- Batch the DELETE of expired tokens, device codes, and old audit logs via a shared deleteInBatches helper (10k rows per iteration)
- Add a Redis-backed distributed cleanup lock via rueidislock so only one pod runs each cleanup per tick
- Introduce .env.production.example with recommended redis-aside cache defaults for 5+ pod deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Phase 1 of DB/Redis performance optimizations aimed at large, multi-pod deployments by adding DB indexes, batching cleanup deletes, and serializing cleanup work across pods via a Redis-backed lock.
Changes:
- Added a PostgreSQL migration with concurrent index creation plus corresponding GORM index tags.
- Implemented batched cleanup deletes for expired tokens/device codes and old audit logs, with unit tests.
- Introduced an optional distributed cleanup lock (rueidislock) and a production-focused `.env` template.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/store/migrations/001_add_phase1_indexes.sql | Adds Phase 1 Postgres indexes using CREATE INDEX CONCURRENTLY IF NOT EXISTS. |
| internal/store/cleanup.go | Adds shared deleteInBatches helper and uses it for token/device-code cleanup. |
| internal/store/cleanup_test.go | Adds unit tests covering batched cleanup behavior. |
| internal/store/audit_log.go | Switches audit-log cleanup to use batched deletion helper. |
| internal/models/user_authorization.go | Adds composite indexes for active authorization lookups. |
| internal/models/oauth_connection.go | Adds an index tag for user lookups. |
| internal/models/oauth_application.go | Adds index tags for owner/status lookups. |
| internal/models/device_code.go | Adds composite index for pending/expiry queries. |
| internal/models/authorization_code.go | Adds composite index for used/expiry queries. |
| internal/models/audit_log.go | Adds composite indexes to support admin filtering queries. |
| internal/config/config.go | Adds config/env for enabling cleanup lock + lock validity. |
| internal/bootstrap/cleanup_lock.go | Implements optional Redis-backed locker and a wrapper for guarded runs. |
| internal/bootstrap/server.go | Wraps cleanup jobs with the distributed lock + adds locker shutdown handling. |
| internal/bootstrap/bootstrap.go | Wires locker initialization and passes it to cleanup jobs. |
| .env.production.example | Adds a production-oriented env template for caches/pools/cleanup lock defaults. |
…eanup interval

- Rename GORM index tags on oauth_applications.user_id / status, oauth_connections.user_id, and audit_logs.severity to match the names used in 001_add_phase1_indexes.sql, so AutoMigrate does not create duplicate default-named indexes on existing PostgreSQL deployments
- Use cfg.AuditLogCleanupInterval instead of a hard-coded 24h ticker in addAuditLogCleanupJob, and treat <= 0 as disabling cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
…ment id assumption

- Rename deleteInBatches to deleteByIDInBatches and document that it requires an id primary-key column
- Return an error when cleanupBatchSize <= 0, so a misconfigured var cannot silently loop forever via GORM's Limit(0) disabling the clause
- Also terminate on RowsAffected == 0 as a belt-and-braces exit condition
- Add a unit test covering the invalid batch size guard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
- Validate that EnableCleanupLock requires REDIS_ADDR and a positive CleanupLockKeyValidity, and require a positive ExpiredTokenCleanupInterval when expired-token cleanup is enabled, so misconfiguration fails fast instead of panicking in time.NewTicker or throwing runtime lock errors
- Guard addExpiredTokenCleanupJob against a non-positive interval as defense in depth
- Drop the standalone idx_oauth_conn_user_id in favor of the existing composite unique index idx_oauth_user_provider, whose leading column already serves user_id-only predicates
- Flip the METRICS_GAUGE_UPDATE_ENABLED default to false in the production template so copy-pasting it into every pod does not double-count gauge metrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.
Summary
Phase 1 of a multi-phase DB + Redis optimization plan targeting 20k+ users on 5+ pod deployments. This PR focuses on low-risk, high-impact changes: indexes, cleanup batching, distributed cleanup lock, and production-recommended cache defaults. No service/handler logic changed; all OAuth flows behave identically.
- Batched cleanup: `DeleteExpiredTokens` / `DeleteExpiredDeviceCodes` / `DeleteOldAuditLogs` now delete 10k rows per iteration with a 200ms pause instead of issuing a single unbounded DELETE, backed by a shared `deleteInBatches` helper.
- Distributed cleanup lock (`rueidislock`): with 5+ pods, all replicas may run cleanup; a Redis lock ensures only one actually executes each tick. Idempotent DELETEs mean split-brain during Redis failover is safe (at worst: transient double work).
- `.env.production.example` ships recommended `redis-aside` settings for token / client / user / metrics caches plus connection-pool caps.

Implementation flow (what changed and why)
1. Index additions (11 indexes)
Modified GORM tags so `AutoMigrate` creates the indexes on fresh databases:

| Model | New index(es) | Queries served |
|---|---|---|
| `OAuthApplication` | `user_id`, `status` | `ListClientsByUserID`, `CountClientsByStatus`, dashboard |
| `UserAuthorization` | `(user_id, is_active)`, `(client_id, is_active)` | `ListUserAuthorizations`, `RevokeAllUserAuthorizationsByClientID` |
| `OAuthConnection` | `user_id` (single) | `GetOAuthConnectionsByUserID`, bulk delete per user |
| `AuditLog` | `severity`, `(event_type, event_time DESC)`, `(actor_user_id, event_time DESC)`, `(resource_type, resource_id, event_time DESC)` | admin filtering queries |
| `DeviceCode` | `(authorized, expires_at)` | `CountPendingDeviceCodes`, metrics gauge |
| `AuthorizationCode` | `(used_at, expires_at)` | used/expiry cleanup queries |

For production PostgreSQL, `internal/store/migrations/001_add_phase1_indexes.sql` uses `CREATE INDEX CONCURRENTLY IF NOT EXISTS`, so adding these to a live DB does not block writes. `AutoMigrate` on startup then becomes a no-op since the indexes already exist.

2. Batched cleanup (avoid long-lock DELETE)
`internal/store/cleanup.go` gains a shared helper: it loops `DELETE ... WHERE id IN (SELECT id ... LIMIT N)` until a batch returns fewer than `cleanupBatchSize` (default 10000) rows, sleeping `cleanupBatchPause` (default 200ms) between iterations. The subquery form is used because PostgreSQL does not support `DELETE ... LIMIT` directly, and it lets the inner `SELECT` use the `expires_at` / `created_at` index. All three cleanup functions (`DeleteExpiredTokens`, `DeleteExpiredDeviceCodes`, `DeleteOldAuditLogs`) now delegate to this helper.
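The helper's control flow can be sketched like this. The actual GORM DELETE is abstracted behind a callback, and all names are illustrative rather than the PR's exact code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// deleteByIDInBatches sketches the loop: deleteBatch stands in for one
// DELETE ... WHERE id IN (SELECT id ... LIMIT batchSize) round trip and
// returns that statement's RowsAffected. The loop stops when a batch
// deletes fewer rows than batchSize (including zero), pausing between
// full batches. A non-positive batchSize is rejected up front because
// in GORM, Limit(0) would drop the LIMIT clause and loop forever.
func deleteByIDInBatches(batchSize int, pause time.Duration, deleteBatch func(limit int) (int64, error)) (int64, error) {
	if batchSize <= 0 {
		return 0, errors.New("cleanup batch size must be positive")
	}
	var total int64
	for {
		affected, err := deleteBatch(batchSize)
		if err != nil {
			return total, err
		}
		total += affected
		if affected == 0 || affected < int64(batchSize) {
			return total, nil // nothing left to delete
		}
		time.Sleep(pause) // let other transactions breathe between batches
	}
}

func main() {
	// Simulate 25 expired rows with a batch size of 10: three round trips.
	remaining := int64(25)
	total, err := deleteByIDInBatches(10, 0, func(limit int) (int64, error) {
		n := int64(limit)
		if remaining < n {
			n = remaining
		}
		remaining -= n
		return n, nil
	})
	fmt.Println(total, err) // prints: 25 <nil>
}
```

With the defaults (10k rows per batch, 200ms pause), deleting a million expired rows becomes roughly 100 short statements instead of one long table-locking DELETE.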
3. Distributed cleanup lock

New file `internal/bootstrap/cleanup_lock.go`: `initializeCleanupLocker` creates a `rueidislock.Locker` (single-node mode, `KeyMajority=1`) when `ENABLE_CLEANUP_LOCK=true`. `runWithCleanupLock` wraps each cleanup job: it calls `TryWithContext`; if another pod holds the lock this tick, the job is skipped silently and the next tick tries again.
Existing `addExpiredTokenCleanupJob` and `addAuditLogCleanupJob` now receive the locker and delegate through `runWithCleanupLock`. Lock key names live in constants (`cleanupLockAuditLogs`, `cleanupLockExpiredTokens`) to keep them greppable for runbooks.
New config (in `internal/config/config.go`):

| Variable | Default |
|---|---|
| `ENABLE_CLEANUP_LOCK` | `false` |
| `CLEANUP_LOCK_KEY_VALIDITY` | `5m` |

4. Production `.env` template

New `.env.production.example` documents the 5+ pod starting point:

- `TOKEN_CACHE_ENABLED=true` plus all four caches on `redis-aside`
- `DB_MAX_OPEN_CONNS=25` × 5 pods = 125 (fits a standard `max_connections=200` Postgres)
- `ENABLE_EXPIRED_TOKEN_CLEANUP=true` + `ENABLE_CLEANUP_LOCK=true`
- `AUDIT_LOG_RETENTION=2160h` (90 days)
- `METRICS_GAUGE_UPDATE_ENABLED=true`: enable on one replica only (or use PromQL aggregation)
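Collected as an env-file fragment, those settings look something like the following. Variable names and values are the ones named in this PR; the real template contains more entries than shown here.

```env
# Cleanup: enabled, serialized across pods via Redis
ENABLE_EXPIRED_TOKEN_CLEANUP=true
ENABLE_CLEANUP_LOCK=true
CLEANUP_LOCK_KEY_VALIDITY=5m

# Caching and DB pool sizing (25 conns x 5 pods = 125 < max_connections=200)
TOKEN_CACHE_ENABLED=true
DB_MAX_OPEN_CONNS=25

# Audit retention: 90 days
AUDIT_LOG_RETENTION=2160h
```
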
Dev `.env.example` is intentionally not touched, so local `sqlite + no Redis` still works out of the box.

How to verify this PR
Unit tests (new)
Covers:
- `cleanupBatchSize=100` → multi-iteration loop correctness, 30 active tokens preserved
- `int64(80)` returned from `DeleteOldAuditLogs`

Full regression
Manual / staging verification
Index confirmation (PostgreSQL): query the `pg_indexes` catalog view (or `\di idx_*` in psql) and confirm the new `idx_*` indexes exist.
Distributed lock: bring up 2+ pods with `ENABLE_CLEANUP_LOCK=true` and force a cleanup tick. Exactly one pod should log `Cleaned up N ...`; the others silently skip.

Cache hit rate: hit `/metrics` after warm-up; confirm `cache_hits_total{cache_name="token"}` / (hits + misses) > 90%.

Graceful shutdown: send SIGTERM and verify `Closing cleanup locker... / closed` appears in the logs with no leaked Redis connections.

Load test (optional, requires k6/vegeta): `/oauth/tokeninfo` P99 should drop vs baseline with DB qps near flat (the cache absorbs reads); an audit-log query with `event_type=DEVICE_CODE_AUTHORIZED&start_time=...` should use `idx_audit_type_time` under `EXPLAIN ANALYZE`.

How to use (operator guide)
Fresh deployment

- Copy `.env.production.example` → `.env` (or inject via Helm).
- Set `JWT_SECRET`, `SESSION_SECRET`, `DATABASE_DSN`, `REDIS_ADDR`.
- `AutoMigrate` creates all indexes on first boot.

Existing deployment migrating live
- Run `internal/store/migrations/001_add_phase1_indexes.sql` against PostgreSQL one statement at a time (`CREATE INDEX CONCURRENTLY` cannot run in a transaction block). Each statement takes seconds to minutes depending on table size; writes stay unblocked.
- `AutoMigrate` detects the existing indexes and is a no-op.
- Enable `TOKEN_CACHE_ENABLED=true`, `ENABLE_EXPIRED_TOKEN_CLEANUP=true`, `ENABLE_CLEANUP_LOCK=true`.

Redis requirements
- Caches (`redis-aside`): a Redis that supports server-assisted client-side caching (`CLIENT TRACKING`).
- `rueidislock`: any Redis reachable by all pods. HA (Sentinel / managed) recommended; split-brain during failover is safe due to idempotent DELETEs.

Scope / non-goals
Phase 1 only. The following are intentionally deferred:
- Read/write splitting (`gorm.io/plugin/dbresolver`), list endpoint pagination.
- `pg_trgm` GIN for admin search, external audit sink.

Test plan
- `go build ./...`
- `go test ./... -short`: all packages pass
- `go test ./internal/store/ -run 'TestDeleteExpired|TestDeleteOld' -v`: 4 new tests pass
- `make lint`: 0 issues
- `make build`: produces `bin/authgate`
- `AutoMigrate` creates indexes on a fresh SQLite DB
- `\d` shows the new indexes
- `cache_hits_total` growing; `cache_misses_total` plateaus after warm-up

🤖 Generated with Claude Code