
perf: bulk pre-load message IDs and cache group IDs to eliminate ~45K DB queries per sync #358

Open
lanzalibre wants to merge 3 commits into LogicLabs-OU:main from lanzalibre:performance/nas-optimizations

@lanzalibre

Summary

  • Before: checkDuplicate performed 3 DB queries per email (findGroupSourceIds × 2 + findFirst), processEmail called findGroupSourceIds 2 more times per email, and mkdir was invoked for every single file over SMB. With 6,700 archived emails this meant ~45,500 DB operations and ~6,500 SMB round-trips per sync cycle.
  • After: 1 bulk pre-load query returns all known message IDs as a Set<string> before the fetch loop. checkDuplicate becomes an O(1) in-memory lookup. processEmail accepts optional cached groupSourceIds and knownMessageIds. Newly processed IDs are added back to the Set to catch in-run duplicates.
  • Impact: ~45,500 operations → ~9 per sync cycle.
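
The pre-load and O(1) lookup described above can be sketched as follows. This is a minimal illustration, not the PR's code: the row shape and the helper names `buildKnownMessageIds` and `shouldProcess` are hypothetical, and the bulk query is replaced by an in-memory array. For brevity the sketch records a new ID at check time, whereas the PR adds it after the insert.

```typescript
// Hypothetical row shape; the real schema is not shown in this PR description.
interface ArchivedEmailRow {
  messageIdHeader: string | null;
}

// Stand-in for the single bulk query: build the Set once, skipping null headers.
function buildKnownMessageIds(rows: ArchivedEmailRow[]): Set<string> {
  const known = new Set<string>();
  for (const row of rows) {
    if (row.messageIdHeader !== null) {
      known.add(row.messageIdHeader);
    }
  }
  return known;
}

// O(1) duplicate check. New IDs are recorded so repeats within the same run are caught.
function shouldProcess(messageId: string, known: Set<string>): boolean {
  if (known.has(messageId)) {
    return false;
  }
  known.add(messageId);
  return true;
}

const known = buildKnownMessageIds([
  { messageIdHeader: "<a@example.com>" },
  { messageIdHeader: null },
]);

// "<a@example.com>" is already archived; "<b@example.com>" arrives twice in one run.
const fetched = ["<a@example.com>", "<b@example.com>", "<b@example.com>"];
const processed = fetched.filter((id) => shouldProcess(id, known));
console.log(processed);
```

Only the first `<b@example.com>` passes the filter: the archived ID is rejected by the pre-loaded Set, and the second copy is rejected because the first one was added during the run.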

Changes

| Change | File | Description |
| --- | --- | --- |
| 1 | LocalFileSystemProvider.ts | Track known directories in a Set, skip fs.mkdir for cached dirs |
| 2 | IngestionService.ts | Add preloadExistingMessageIds(): bulk query returns Set<string> + string[] |
| 3 | process-mailbox.processor.ts | Call pre-load before fetch loop, pass Set and IDs through to connector and processEmail |
| 4 | IngestionService.ts | processEmail checks in-memory Set before DB query, adds to Set after insert |
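
Change 1 can be sketched roughly as below. The class name `CachedDirCreator` and the injected `makeDir` callback are assumptions made so the example runs without touching a real share; the actual provider presumably awaits `fs.mkdir` instead of calling a synchronous callback.

```typescript
// Hypothetical minimal provider. `makeDir` stands in for the real directory-creation
// call so the sketch can run without a filesystem or SMB mount.
class CachedDirCreator {
  private readonly knownDirs = new Set<string>();
  private readonly makeDir: (dir: string) => void;

  constructor(makeDir: (dir: string) => void) {
    this.makeDir = makeDir;
  }

  ensureDir(dir: string): void {
    if (this.knownDirs.has(dir)) {
      return; // this instance already created the directory, skip the round-trip
    }
    this.makeDir(dir);
    this.knownDirs.add(dir); // cache only after the call succeeded
  }
}

const calls: string[] = [];
const creator = new CachedDirCreator((dir) => {
  calls.push(dir);
});

creator.ensureDir("archive/2026/05");
creator.ensureDir("archive/2026/05"); // served from the cache
creator.ensureDir("archive/2026/06");
console.log(calls.length);
```

Because the Set is an instance field, each provider instance keeps its own cache, which matches the per-instance isolation the tests mention.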

Tests

19 unit tests covering all 4 changes:

  • LocalFileSystemProvider.test.ts — 5 tests (mkdir caching, per-instance isolation)
  • IngestionService.preload.test.ts — 4 tests (Set construction, null filtering, merge groups)
  • IngestionService.processEmail.test.ts — 8 tests (in-memory dedup, Set.add after insert, in-run duplicates, fallback to DB, caching)
  • processMailboxProcessor.test.ts — 2 tests (integration: pre-load before fetch, Set-based checkDuplicate)

Run with pnpm --filter @open-archiver/backend test.

Root Cause Addressed

Duplicate email records (177 groups, 315 extra DB rows) were caused by a race between the findFirst duplicate check and the INSERT: stale session cleanup spawned multiple concurrent sync sessions, and all of them passed the check before any of them inserted. The in-memory Set now prevents duplicates within a single job, and the planned UNIQUE constraint (Change 5 in the plan) will prevent cross-job race conditions.
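
The check-then-insert race can be illustrated with an explicit interleaving. This is a simplified model, assuming an in-memory array in place of the table and two hand-sequenced "jobs"; the helper names are illustrative stand-ins, not the service's real methods.

```typescript
// In-memory stand-in for the table, with no unique constraint.
const rows: string[] = [];

// Stand-ins for the duplicate check and the INSERT.
const findFirst = (messageId: string): boolean => rows.includes(messageId);
const insertRow = (messageId: string): void => {
  rows.push(messageId);
};

const duplicateId = "<dup@example.com>";

// Two concurrent jobs both run the check before either one inserts.
const jobASeesDuplicate = findFirst(duplicateId);
const jobBSeesDuplicate = findFirst(duplicateId);

// Neither job saw an existing row, so both insert.
if (!jobASeesDuplicate) insertRow(duplicateId);
if (!jobBSeesDuplicate) insertRow(duplicateId);

console.log(rows.length); // 2
```

A per-job Set cannot help here because each job holds its own copy, which is why a database-level constraint is needed for the cross-job case.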

Files

  • packages/backend/src/services/storage/LocalFileSystemProvider.ts
  • packages/backend/src/services/IngestionService.ts
  • packages/backend/src/jobs/processors/process-mailbox.processor.ts
  • packages/backend/src/__tests__/ (new)
  • packages/backend/vitest.config.ts (new)
  • packages/backend/tsconfig.json (exclude tests from build)
  • packages/backend/package.json (add vitest + test scripts)
  • pnpm-lock.yaml

Full details: plans/nas-performance-improvements.md

… DB queries per sync

Before: checkDuplicate did 3 DB queries per email (findGroupSourceIds × 2 + findFirst),
processEmail did 2 more findGroupSourceIds calls, and mkdir was called for every
file over SMB. With 6,700 emails this meant ~45,500 DB operations per sync cycle.

After:
- preloadExistingMessageIds: 1 DB query returns Set<string> + string[] for the full
  source. checkDuplicate becomes an in-memory Set lookup.
- processEmail accepts optional groupSourceIds + knownMessageIds. When provided,
  findGroupSourceIds is skipped and the in-memory Set is checked before the DB query.
  Newly processed messageIds are added to the Set to catch in-run duplicates.
- LocalFileSystemProvider.mkdir caching: directories are tracked in a Set, skipping
  redundant fs.mkdir calls. Reduces ~6,500 SMB round-trips to ~8.

Added 19 unit tests covering all 4 changes.

Fixes: duplicate email records from race conditions (check+insert not atomic),
stale session cleanup spawning multiple concurrent sync sessions.

See plans/nas-performance-improvements.md for full details.
@github-actions

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

lanzalibre added 2 commits May 1, 2026 08:45
…e condition duplicates

Adds a uniqueIndex on (messageIdHeader, ingestionSourceId) to the archived_emails
schema and converts both INSERT branches in processEmail to use onConflictDoNothing.
When a concurrent job races past the in-memory Set check, the DB constraint catches
the duplicate at insert time instead of creating a duplicate row.

On conflict, returning() yields an empty array, so the destructured archivedEmail is
undefined; we check !archivedEmail and return null, the same behavior as the in-memory path.

Also adds a test verifying the onConflictDoNothing path.
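
The conflict path can be modeled without a database. The sketch below is an in-memory simulation, not Drizzle code: `ArchivedEmailsTable` and `archive` are hypothetical names, and the Map keyed on both columns stands in for the unique index. It mirrors the PostgreSQL behavior where a conflicting `INSERT ... ON CONFLICT DO NOTHING RETURNING` returns no row, so the destructured first element is undefined.

```typescript
interface EmailRow {
  messageIdHeader: string;
  ingestionSourceId: string;
}

// Simulates a table with a unique index on (messageIdHeader, ingestionSourceId).
class ArchivedEmailsTable {
  private readonly byKey = new Map<string, EmailRow>();

  // Mimics INSERT ... ON CONFLICT DO NOTHING RETURNING *:
  // a conflicting insert changes nothing and returns no rows.
  insertOnConflictDoNothing(row: EmailRow): EmailRow[] {
    const key = JSON.stringify([row.messageIdHeader, row.ingestionSourceId]);
    if (this.byKey.has(key)) {
      return [];
    }
    this.byKey.set(key, row);
    return [row];
  }

  get size(): number {
    return this.byKey.size;
  }
}

// Returns the stored row, or null when another job already inserted it.
function archive(table: ArchivedEmailsTable, row: EmailRow): EmailRow | null {
  const [archivedEmail] = table.insertOnConflictDoNothing(row);
  if (!archivedEmail) {
    return null;
  }
  return archivedEmail;
}

const table = new ArchivedEmailsTable();
const email: EmailRow = { messageIdHeader: "<dup@example.com>", ingestionSourceId: "src-1" };

const firstResult = archive(table, email);
const secondResult = archive(table, email); // simulates the losing concurrent job
console.log(firstResult !== null, secondResult === null, table.size);
```

The same message ID under a different ingestion source is a distinct key, so it still inserts.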
…tial unique index

Specifying target columns in onConflictDoNothing() caused PostgreSQL error 42P10
(invalid_column_reference): the unique index was created without a WHERE clause
at the DB level, but Drizzle's target resolution conflicted with it. Without a
target, ON CONFLICT DO NOTHING applies to a violation of any unique constraint
or index, so PostgreSQL does not need to match a specific index.