
perf(db): Phase 1 - indexes, batched cleanup, and distributed cleanup lock #164

Open
appleboy wants to merge 4 commits into main from worktree-DB

Conversation

@appleboy
Member

Summary

Phase 1 of a multi-phase DB + Redis optimization plan targeting 20k+ users on 5+ pod deployments. This PR focuses on low-risk, high-impact changes: indexes, cleanup batching, distributed cleanup lock, and production-recommended cache defaults. No service/handler logic changed; all OAuth flows behave identically.

  • Indexes (11 new): covers owner lookups, per-app consent grants, OAuth connections, audit-log filters, pending device codes, and authorization-code cleanup.
  • Batched cleanup: DeleteExpiredTokens / DeleteExpiredDeviceCodes / DeleteOldAuditLogs now delete 10k rows per iteration with a 200ms pause instead of a single unbounded DELETE. Backed by a shared deleteInBatches helper.
  • Distributed cleanup lock (rueidislock): with 5+ pods enabled, all replicas may run cleanup — a Redis lock ensures only one actually executes each tick. Idempotent DELETEs mean split-brain during Redis failover is safe (at worst: transient double work).
  • Cache defaults for production: new .env.production.example ships recommended redis-aside settings for token / client / user / metrics caches + connection-pool caps.

Implementation flow (what changed and why)

1. Index additions (11 indexes)

Modified GORM tags so AutoMigrate creates the indexes on fresh databases.

| Model | Index | Query it fixes |
| --- | --- | --- |
| OAuthApplication | `user_id`, `status` | ListClientsByUserID, CountClientsByStatus, dashboard |
| UserAuthorization | `(user_id, is_active)`, `(client_id, is_active)` | ListUserAuthorizations, RevokeAllUserAuthorizationsByClientID |
| OAuthConnection | `user_id` (single) | GetOAuthConnectionsByUserID, bulk delete per user |
| AuditLog | `severity`, `(event_type, event_time DESC)`, `(actor_user_id, event_time DESC)`, `(resource_type, resource_id, event_time DESC)` | Admin filter combinations on large tables |
| DeviceCode | `(authorized, expires_at)` | CountPendingDeviceCodes, metrics gauge |
| AuthorizationCode | `(used_at, expires_at)` | Replay prevention, cleanup |

For production PostgreSQL, internal/store/migrations/001_add_phase1_indexes.sql uses CREATE INDEX CONCURRENTLY IF NOT EXISTS so adding these to a live DB does not block writes. AutoMigrate on startup becomes a no-op since the indexes already exist.

2. Batched cleanup (avoid long-lock DELETE)

internal/store/cleanup.go gains a shared helper:

func (s *Store) deleteInBatches(model any, whereClause string, args ...any) (int64, error)

It loops DELETE ... WHERE id IN (SELECT id ... LIMIT N) until a batch returns fewer than cleanupBatchSize (default 10000) rows, sleeping cleanupBatchPause (default 200ms) between iterations. The subquery form is used because PostgreSQL does not support DELETE ... LIMIT directly, and it lets the inner SELECT use the expires_at / created_at index.

All three cleanup functions (DeleteExpiredTokens, DeleteExpiredDeviceCodes, DeleteOldAuditLogs) now delegate to this helper.
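The loop logic can be sketched in isolation. This is a simplified stand-in for the PR's `deleteInBatches` helper, not its actual signature: `execBatch` abstracts the GORM call that runs `DELETE ... WHERE id IN (SELECT id ... LIMIT batchSize)` and returns the rows removed, which keeps the termination behavior visible and testable without a database.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// deleteInBatches repeatedly deletes up to batchSize rows until a batch
// comes back short, pausing between iterations so other transactions can
// make progress. execBatch stands in for the real GORM batch delete.
func deleteInBatches(batchSize int, pause time.Duration, execBatch func(limit int) (int64, error)) (int64, error) {
	if batchSize <= 0 {
		// Guard: with GORM, Limit(0) disables the clause, which would
		// turn a misconfigured size into an unbounded delete.
		return 0, errors.New("batch size must be positive")
	}
	var total int64
	for {
		n, err := execBatch(batchSize)
		if err != nil {
			return total, err
		}
		total += n
		// A short (or empty) batch means no more matching rows.
		if n < int64(batchSize) {
			return total, nil
		}
		time.Sleep(pause)
	}
}

func main() {
	remaining := int64(250) // pretend 250 expired rows exist
	deleted, _ := deleteInBatches(100, 0, func(limit int) (int64, error) {
		n := int64(limit)
		if remaining < n {
			n = remaining
		}
		remaining -= n
		return n, nil
	})
	fmt.Println(deleted) // 250 (batches of 100, 100, 50)
}
```

This mirrors the unit-test scenario below: 250 expired rows with a batch size of 100 exercises the multi-iteration path and the short-batch exit.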

3. Distributed cleanup lock

New file internal/bootstrap/cleanup_lock.go:

  • initializeCleanupLocker creates a rueidislock.Locker (single-node mode, KeyMajority=1) when ENABLE_CLEANUP_LOCK=true.
  • runWithCleanupLock wraps each cleanup job: calls TryWithContext — if another pod holds the lock this tick, the job is skipped silently and the next tick tries again.
  • Locker is closed on graceful shutdown.

Existing addExpiredTokenCleanupJob and addAuditLogCleanupJob now receive the locker and delegate through runWithCleanupLock. Lock key names live in constants (cleanupLockAuditLogs, cleanupLockExpiredTokens) to keep them greppable for runbooks.

New config (in internal/config/config.go):

| Env var | Default | Purpose |
| --- | --- | --- |
| ENABLE_CLEANUP_LOCK | false | Opt-in: gate the lock behind an explicit flag |
| CLEANUP_LOCK_KEY_VALIDITY | 5m | Lock TTL; rueidislock auto-extends while held |

4. Production .env template

New .env.production.example documents the 5+ pod starting point:

  • TOKEN_CACHE_ENABLED=true + all four caches on redis-aside
  • Connection pool: DB_MAX_OPEN_CONNS=25 × 5 pods = 125 (fits a standard max_connections=200 Postgres)
  • ENABLE_EXPIRED_TOKEN_CLEANUP=true + ENABLE_CLEANUP_LOCK=true
  • AUDIT_LOG_RETENTION=2160h (90 days)
  • METRICS_GAUGE_UPDATE_ENABLED=true — enable on one replica only (or use PromQL aggregation)

Dev .env.example is intentionally not touched so local sqlite + no Redis still works out of the box.

How to verify this PR

Unit tests (new)

go test ./internal/store/ -run 'TestDeleteExpired|TestDeleteOld' -v

Covers:

  • 250 expired tokens with cleanupBatchSize=100 → multi-iteration loop correctness, 30 active tokens preserved
  • Zero expired rows → loop exits immediately
  • 120 expired device codes + 15 active → same for device codes
  • 80 old audit logs + 20 recent → returns int64(80) from DeleteOldAuditLogs

Full regression

make generate   # templ + swagger
make test       # full test suite
make lint       # golangci-lint (no issues)
make build      # produces bin/authgate

Manual / staging verification

  1. Index confirmation (PostgreSQL):

    \d user_authorizations
    \d audit_logs
    EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM user_authorizations WHERE user_id = '<uuid>' AND is_active = true;
    -- expect: Index Scan using idx_user_active
  2. Distributed lock: bring up 2+ pods with ENABLE_CLEANUP_LOCK=true, force a cleanup tick. Exactly one pod should log Cleaned up N ..., the others silently skip.

  3. Cache hit rate: hit /metrics after warm-up, confirm cache_hits_total{cache_name="token"} / (hits + misses) > 90%.

  4. Graceful shutdown: send SIGTERM, verify Closing cleanup locker... / closed appears in logs with no leaked Redis connections.

  5. Load test (optional, requires k6/vegeta):

    • 500 qps /oauth/tokeninfo — P99 should drop vs baseline, DB qps near flat (cache absorbs).
  • Admin audit filter event_type=DEVICE_CODE_AUTHORIZED&start_time=... — EXPLAIN ANALYZE should use idx_audit_type_time.

How to use (operator guide)

Fresh deployment

  1. Copy .env.production.example to .env (or inject via Helm).
  2. Fill in secrets (JWT_SECRET, SESSION_SECRET, DATABASE_DSN, REDIS_ADDR).
  3. Start pods. AutoMigrate creates all indexes on first boot.

Existing deployment migrating live

  1. Run internal/store/migrations/001_add_phase1_indexes.sql against PostgreSQL one statement at a time (CREATE INDEX CONCURRENTLY cannot run in a transaction block). Each statement takes seconds to minutes depending on table size; writes stay unblocked.
  2. Roll out this version of AuthGate. AutoMigrate detects existing indexes and is a no-op.
  3. Flip env vars per the production template — especially TOKEN_CACHE_ENABLED=true, ENABLE_EXPIRED_TOKEN_CLEANUP=true, ENABLE_CLEANUP_LOCK=true.

Redis requirements

  • Redis 6.0+ (rueidis-aside requires RESP3 CLIENT TRACKING).
  • For rueidislock: any Redis reachable by all pods. HA (Sentinel / managed) recommended; split-brain during failover is safe due to idempotent DELETEs.

Scope / non-goals

Phase 1 only. The following are intentionally deferred:

  • Phase 2: device code polling cache, dashboard cache, username cache.
  • Phase 3: OAuth connection provider-token encryption, read replica (gorm.io/plugin/dbresolver), list endpoint pagination.
  • Phase 4: pg_trgm GIN for admin search, external audit sink.

Test plan

  • go build ./...
  • go test ./... -short — all packages pass
  • go test ./internal/store/ -run 'TestDeleteExpired|TestDeleteOld' -v — 4 new tests pass
  • make lint — 0 issues
  • make build — produces bin/authgate
  • Staging: verify AutoMigrate creates indexes on fresh SQLite DB
  • Staging: run migration SQL against PostgreSQL, confirm \d shows new indexes
  • Staging: multi-pod cleanup lock behavior (only one pod logs cleanup per tick)
  • Staging: cache_hits_total growing, cache_misses_total plateaus after warm-up

🤖 Generated with Claude Code

…up lock

- Add composite and single-column indexes on oauth_applications, user_authorizations, oauth_connections, audit_logs, device_codes, and authorization_codes for multi-pod hot paths
- Ship PostgreSQL migration using CREATE INDEX CONCURRENTLY to avoid write locks on large tables
- Batch DELETE of expired tokens, device codes, and old audit logs via shared deleteInBatches helper (10k rows per iteration)
- Add Redis-backed distributed cleanup lock via rueidislock so only one pod runs each cleanup per tick
- Introduce .env.production.example with recommended redis-aside cache defaults for 5+ pod deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 12, 2026 13:44

codecov bot commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 30.35714% with 78 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/bootstrap/server.go | 0.00% | 41 Missing ⚠️ |
| internal/bootstrap/cleanup_lock.go | 0.00% | 29 Missing ⚠️ |
| internal/bootstrap/bootstrap.go | 0.00% | 6 Missing ⚠️ |
| internal/store/cleanup.go | 90.00% | 1 Missing and 1 partial ⚠️ |



Copilot AI left a comment


Pull request overview

Phase 1 of DB/Redis performance optimizations aimed at large, multi-pod deployments by adding DB indexes, batching cleanup deletes, and serializing cleanup work across pods via a Redis-backed lock.

Changes:

  • Added a PostgreSQL migration with concurrent index creation plus corresponding GORM index tags.
  • Implemented batched cleanup deletes for expired tokens/device codes and old audit logs, with unit tests.
  • Introduced an optional distributed cleanup lock (rueidislock) and a production-focused .env template.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| internal/store/migrations/001_add_phase1_indexes.sql | Adds Phase 1 Postgres indexes using CREATE INDEX CONCURRENTLY IF NOT EXISTS. |
| internal/store/cleanup.go | Adds shared deleteInBatches helper and uses it for token/device-code cleanup. |
| internal/store/cleanup_test.go | Adds unit tests covering batched cleanup behavior. |
| internal/store/audit_log.go | Switches audit-log cleanup to use batched deletion helper. |
| internal/models/user_authorization.go | Adds composite indexes for active authorization lookups. |
| internal/models/oauth_connection.go | Adds an index tag for user lookups. |
| internal/models/oauth_application.go | Adds index tags for owner/status lookups. |
| internal/models/device_code.go | Adds composite index for pending/expiry queries. |
| internal/models/authorization_code.go | Adds composite index for used/expiry queries. |
| internal/models/audit_log.go | Adds composite indexes to support admin filtering queries. |
| internal/config/config.go | Adds config/env for enabling cleanup lock + lock validity. |
| internal/bootstrap/cleanup_lock.go | Implements optional Redis-backed locker and a wrapper for guarded runs. |
| internal/bootstrap/server.go | Wraps cleanup jobs with the distributed lock + adds locker shutdown handling. |
| internal/bootstrap/bootstrap.go | Wires locker initialization and passes it to cleanup jobs. |
| .env.production.example | Adds a production-oriented env template for caches/pools/cleanup lock defaults. |


Comment thread internal/models/oauth_application.go Outdated
Comment thread internal/models/oauth_connection.go Outdated
Comment thread internal/models/audit_log.go Outdated
Comment thread internal/bootstrap/server.go
…eanup interval

- Rename GORM index tags on oauth_applications.user_id / status, oauth_connections.user_id, and audit_logs.severity to match the names used in 001_add_phase1_indexes.sql so AutoMigrate does not create duplicate default-named indexes on existing PostgreSQL deployments
- Use cfg.AuditLogCleanupInterval instead of a hard-coded 24h ticker in addAuditLogCleanupJob and treat <= 0 as disabling cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.



Comment thread internal/store/cleanup.go
Comment thread internal/store/cleanup.go Outdated
…ment id assumption

- Rename deleteInBatches to deleteByIDInBatches and document that it requires an id primary key column
- Return an error when cleanupBatchSize <= 0 so a misconfigured var cannot silently loop forever via GORM Limit(0) disabling the clause
- Also terminate on RowsAffected == 0 as a belt-and-braces exit condition
- Add a unit test covering the invalid batch size guard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.



Comment thread internal/bootstrap/server.go
Comment thread internal/store/migrations/001_add_phase1_indexes.sql Outdated
Comment thread .env.production.example Outdated
Comment thread internal/config/config.go
- Validate EnableCleanupLock requires REDIS_ADDR and positive CleanupLockKeyValidity, and require positive ExpiredTokenCleanupInterval when expired-token cleanup is enabled, so misconfiguration fails fast instead of panicking time.NewTicker or throwing runtime lock errors
- Guard addExpiredTokenCleanupJob against non-positive interval as defense in depth
- Drop standalone idx_oauth_conn_user_id in favor of the existing composite unique index idx_oauth_user_provider whose leading column already serves user_id-only predicates
- Flip METRICS_GAUGE_UPDATE_ENABLED default to false in the production template so copy-pasting into every pod does not double-count gauge metrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.


