perf: bulk pre-load message IDs and cache group IDs to eliminate ~45K DB queries per sync by lanzalibre · Pull Request #358 · LogicLabs-OU/OpenArchiver

lanzalibre · 2026-04-27T07:37:46Z

Summary

Before: checkDuplicate performed 3 DB queries per email (findGroupSourceIds × 2 + findFirst), processEmail called findGroupSourceIds 2 more times per email, and mkdir was invoked for every single file over SMB. With 6,700 archived emails this meant ~45,500 DB operations and ~6,500 SMB round-trips per sync cycle.
After: 1 bulk pre-load query returns all known message IDs as a Set<string> before the fetch loop. checkDuplicate becomes an O(1) in-memory lookup. processEmail accepts optional cached groupSourceIds and knownMessageIds. Newly processed IDs are added back to the Set to catch in-run duplicates.
Impact: ~45,500 operations → ~9 per sync cycle.
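
The pre-load and in-memory dedup described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the row shape and the helper names `buildKnownMessageIds` and `filterNewMessages` are hypothetical, and the bulk query itself is assumed to have already returned its rows.

```typescript
// Hypothetical row shape: only the message ID header matters for dedup.
interface ArchivedEmailRow {
	messageIdHeader: string | null;
}

// Build the lookup Set from the result of one bulk query, dropping null IDs.
function buildKnownMessageIds(rows: ArchivedEmailRow[]): Set<string> {
	const known = new Set<string>();
	for (const row of rows) {
		if (row.messageIdHeader !== null) {
			known.add(row.messageIdHeader);
		}
	}
	return known;
}

// Return only the IDs that still need processing. Each accepted ID is added
// back to the Set so a second copy later in the same run is also skipped.
function filterNewMessages(known: Set<string>, incoming: string[]): string[] {
	const fresh: string[] = [];
	for (const id of incoming) {
		if (known.has(id)) continue; // O(1) lookup instead of a DB query
		known.add(id);
		fresh.push(id);
	}
	return fresh;
}
```

The Set is mutated in place on purpose: one instance is shared across the whole fetch loop, which is what lets it catch in-run duplicates without a DB round-trip.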

Changes

| Change | File | Description |
| --- | --- | --- |
| 1 | `LocalFileSystemProvider.ts` | Track known directories in a `Set`, skip `fs.mkdir` for cached dirs |
| 2 | `IngestionService.ts` | Add `preloadExistingMessageIds()`: bulk query returns `Set<string>` + `string[]` |
| 3 | `process-mailbox.processor.ts` | Call pre-load before fetch loop, pass Set and IDs through to connector and `processEmail` |
| 4 | `IngestionService.ts` | `processEmail` checks in-memory Set before DB query, adds to Set after insert |
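
Change 1 (directory caching) might look roughly like the sketch below. The class name `DirectoryCache`, the `MkdirFn` type, and the injectable `mkdir` parameter are assumptions made for illustration; the actual `LocalFileSystemProvider` may be structured differently.

```typescript
import { promises as fs } from "node:fs";

// Hypothetical type for the directory-creating call. It is injectable so the
// number of real calls can be counted in a test.
type MkdirFn = (dir: string) => Promise<void>;

// Default implementation backed by Node's fs.mkdir with recursive creation.
const realMkdir: MkdirFn = async (dir) => {
	await fs.mkdir(dir, { recursive: true });
};

class DirectoryCache {
	// Per-instance record of directories already created by this instance.
	private readonly knownDirs = new Set<string>();
	private readonly mkdir: MkdirFn;

	constructor(mkdir: MkdirFn = realMkdir) {
		this.mkdir = mkdir;
	}

	// Create the directory only on first sight; later calls for the same
	// path return without touching the file system.
	async ensureDir(dir: string): Promise<void> {
		if (this.knownDirs.has(dir)) return;
		await this.mkdir(dir);
		this.knownDirs.add(dir);
	}
}
```

One trade-off of this approach: if a directory is deleted externally while the instance is alive, the cache goes stale and a later write would fail. Keeping the cache per instance limits how long such an entry can survive.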

Tests

19 unit tests covering all 4 changes:

LocalFileSystemProvider.test.ts — 5 tests (mkdir caching, per-instance isolation)
IngestionService.preload.test.ts — 4 tests (Set construction, null filtering, merge groups)
IngestionService.processEmail.test.ts — 8 tests (in-memory dedup, Set.add after insert, in-run duplicates, fallback to DB, caching)
processMailboxProcessor.test.ts — 2 tests (integration: pre-load before fetch, Set-based checkDuplicate)

Run with `pnpm --filter @open-archiver/backend test`.

Root Cause Addressed

Duplicate email records (177 groups, 315 extra DB rows) were caused by the race condition between the findFirst duplicate check and the INSERT — multiple concurrent sync sessions spawned by stale session cleanup all racing past the check simultaneously. The in-memory Set now prevents duplicates within a single job, and the planned UNIQUE constraint (Change 5 in the plan) will prevent cross-job race conditions.
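
The check-then-insert race can be reproduced in miniature. The sketch below uses an in-memory array as a stand-in for the `archived_emails` table; `checkThenInsert` and `simulateConcurrentJobs` are hypothetical names, and the `setTimeout` yield stands in for the latency between the `findFirst` check and the INSERT.

```typescript
// Yield to the event loop, as a real DB round-trip would.
const yieldToOtherJobs = (): Promise<void> =>
	new Promise((resolve) => setTimeout(() => resolve(), 0));

// Non-atomic dedup: the existence check and the insert are separated by an
// await, so other jobs can run their own check in between.
async function checkThenInsert(table: string[], messageId: string): Promise<void> {
	const exists = table.includes(messageId); // stands in for findFirst
	await yieldToOtherJobs();
	if (!exists) {
		table.push(messageId); // stands in for INSERT
	}
}

// Start several jobs for the same message at once and report how many rows
// end up in the table.
async function simulateConcurrentJobs(jobCount: number, messageId: string): Promise<number> {
	const table: string[] = [];
	await Promise.all(
		Array.from({ length: jobCount }, () => checkThenInsert(table, messageId))
	);
	return table.length;
}
```

With three concurrent jobs, every job performs its check before any of them inserts, so `simulateConcurrentJobs(3, "<a@x>")` resolves to 3 rather than 1. A Set local to one job cannot close this gap across jobs, which is why a database-level constraint is still needed.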

Files

packages/backend/src/services/storage/LocalFileSystemProvider.ts
packages/backend/src/services/IngestionService.ts
packages/backend/src/jobs/processors/process-mailbox.processor.ts
packages/backend/src/__tests__/ (new)
packages/backend/vitest.config.ts (new)
packages/backend/tsconfig.json (exclude tests from build)
packages/backend/package.json (add vitest + test scripts)
pnpm-lock.yaml

Full details: plans/nas-performance-improvements.md

… DB queries per sync

Before: checkDuplicate did 3 DB queries per email (findGroupSourceIds × 2 + findFirst), processEmail did 2 more findGroupSourceIds calls, and mkdir was called for every file over SMB. With 6,700 emails this meant ~45,500 DB operations per sync cycle.

After:
- preloadExistingMessageIds: 1 DB query returns Set<string> + string[] for the full source. checkDuplicate becomes an in-memory Set lookup.
- processEmail accepts optional groupSourceIds + knownMessageIds. When provided, findGroupSourceIds is skipped and the in-memory Set is checked before the DB query. Newly processed messageIds are added to the Set to catch in-run duplicates.
- LocalFileSystemProvider.mkdir caching: directories are tracked in a Set, skipping redundant fs.mkdir calls. Reduces ~6,500 SMB round-trips to ~8.

Added 19 unit tests covering all 4 changes.

Fixes: duplicate email records from race conditions (check+insert not atomic), stale session cleanup spawning multiple concurrent sync sessions.

See plans/nas-performance-improvements.md for full details.

github-actions · 2026-04-27T07:37:56Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

_You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot._

…e condition duplicates

Adds a uniqueIndex on (messageIdHeader, ingestionSourceId) to the archived_emails schema and converts both INSERT branches in processEmail to use onConflictDoNothing. When a concurrent job races past the in-memory Set check, the DB constraint catches the duplicate at insert time instead of creating a duplicate row. When the conflict fires, returning() yields an empty array, so the destructured archivedEmail is undefined; we check !archivedEmail and return null, the same behavior as the in-memory path. Also adds a test verifying the onConflictDoNothing path.
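
The handling of a conflicting insert can be sketched without a database. The class below is a hypothetical in-memory stand-in that mimics the row-returning behavior of `INSERT ... ON CONFLICT DO NOTHING RETURNING` on a unique (messageIdHeader, ingestionSourceId) key; it is not Drizzle's API, and `processEmailSketch` is an illustrative name rather than the PR's `processEmail`.

```typescript
interface EmailRow {
	id: number;
	messageIdHeader: string;
	ingestionSourceId: string;
}

// In-memory stand-in for a table with a unique index on
// (messageIdHeader, ingestionSourceId).
class FakeArchivedEmails {
	private readonly rows = new Map<string, EmailRow>();
	private nextId = 1;

	// Returns the inserted row in an array, or an empty array when the
	// unique key already exists (no row is created, nothing is returned).
	insertOnConflictDoNothing(messageIdHeader: string, ingestionSourceId: string): EmailRow[] {
		const key = JSON.stringify([messageIdHeader, ingestionSourceId]);
		if (this.rows.has(key)) return [];
		const row: EmailRow = { id: this.nextId++, messageIdHeader, ingestionSourceId };
		this.rows.set(key, row);
		return [row];
	}
}

// Destructure the first returned row; on conflict it is undefined, and the
// function reports "nothing archived" by returning null.
function processEmailSketch(
	db: FakeArchivedEmails,
	messageIdHeader: string,
	ingestionSourceId: string
): EmailRow | null {
	const [archivedEmail] = db.insertOnConflictDoNothing(messageIdHeader, ingestionSourceId);
	if (!archivedEmail) return null;
	return archivedEmail;
}
```

Returning null on conflict keeps the caller's contract identical whether the duplicate was caught by the in-memory Set or by the unique index.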

…tial unique index

Specifying target columns in onConflictDoNothing() caused PostgreSQL error 42P10 (invalid_column_reference): the unique index was created without a WHERE clause at the DB level, and the conflict target generated by Drizzle did not match it. Removing the target parameter avoids the mismatch, because without a conflict target ON CONFLICT DO NOTHING applies to a violation of any unique index or constraint on the table.

lanzalibre added 2 commits May 1, 2026 08:45

lanzalibre wants to merge 3 commits into LogicLabs-OU:main from lanzalibre:performance/nas-optimizations
