perf: bulk pre-load message IDs and cache group IDs to eliminate ~45K DB queries per sync#358
Open
lanzalibre wants to merge 3 commits into
Open
perf: bulk pre-load message IDs and cache group IDs to eliminate ~45K DB queries per sync#358lanzalibre wants to merge 3 commits into
lanzalibre wants to merge 3 commits into
Conversation
… DB queries per sync Before: checkDuplicate did 3 DB queries per email (findGroupSourceIds × 2 + findFirst), processEmail did 2 more findGroupSourceIds calls, and mkdir was called for every file over SMB. With 6,700 emails this meant ~45,500 DB operations per sync cycle. After: - preloadExistingMessageIds: 1 DB query returns Set<string> + string[] for the full source. checkDuplicate becomes an in-memory Set lookup. - processEmail accepts optional groupSourceIds + knownMessageIds. When provided, findGroupSourceIds is skipped and the in-memory Set is checked before the DB query. newly processed messageIds are added to the Set to catch in-run duplicates. - LocalFileSystemProvider.mkdir caching: directories are tracked in a Set, skipping redundant fs.mkdir calls. Reduces ~6,500 SMB round-trips to ~8. Added 19 unit tests covering all 4 changes. Fixes: duplicate email records from race conditions (check+insert not atomic), stale session cleanup spawning multiple concurrent sync sessions. See plans/nas-performance-improvements.md for full details.
|
I have read the CLA Document and I hereby sign the CLA You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot. |
…e condition duplicates Adds a uniqueIndex on (messageIdHeader, ingestionSourceId) to the archived_emails schema and converts both INSERT branches in processEmail to use onConflictDoNothing. When a concurrent job races past the in-memory Set check, the DB constraint catches the duplicate at insert time instead of creating a duplicate row. The returning() array yields [undefined] when the conflict fires, so we check !archivedEmail and return null — same behavior as the in-memory path. Also adds a test verifying the onConflictDoNothing path.
…tial unique index Specifying target columns in onConflictDoNothing() caused PostgreSQL error 42P10 (invalid_column_reference) because the unique index was created without a WHERE clause at the DB level but Drizzle's target resolution conflicted. Removing the target parameter lets PostgreSQL infer the unique index automatically.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
checkDuplicateperformed 3 DB queries per email (findGroupSourceIds × 2 + findFirst),processEmailcalledfindGroupSourceIds2 more times per email, andmkdirwas invoked for every single file over SMB. With 6,700 archived emails this meant ~45,500 DB operations and ~6,500 SMB round-trips per sync cycle.Set<string>before the fetch loop.checkDuplicatebecomes an O(1) in-memory lookup.processEmailaccepts optional cachedgroupSourceIdsandknownMessageIds. Newly processed IDs are added back to the Set to catch in-run duplicates.Changes
LocalFileSystemProvider.tsSet, skipfs.mkdirfor cached dirsIngestionService.tspreloadExistingMessageIds()— bulk query returnsSet<string>+string[]process-mailbox.processor.tsprocessEmailIngestionService.tsprocessEmailchecks in-memory Set before DB query, adds to Set after insertTests
19 unit tests covering all 4 changes:
LocalFileSystemProvider.test.ts— 5 tests (mkdir caching, per-instance isolation)IngestionService.preload.test.ts— 4 tests (Set construction, null filtering, merge groups)IngestionService.processEmail.test.ts— 8 tests (in-memory dedup, Set.add after insert, in-run duplicates, fallback to DB, caching)processMailboxProcessor.test.ts— 2 tests (integration: pre-load before fetch, Set-based checkDuplicate)Run with
pnpm --filter @open-archiver/backend test.Root Cause Addressed
Duplicate email records (177 groups, 315 extra DB rows) were caused by the race condition between the
findFirstduplicate check and the INSERT — multiple concurrent sync sessions spawned by stale session cleanup all racing past the check simultaneously. The in-memory Set now prevents duplicates within a single job, and the planned UNIQUE constraint (Change 5 in the plan) will prevent cross-job race conditions.Files
packages/backend/src/services/storage/LocalFileSystemProvider.tspackages/backend/src/services/IngestionService.tspackages/backend/src/jobs/processors/process-mailbox.processor.tspackages/backend/src/__tests__/(new)packages/backend/vitest.config.ts(new)packages/backend/tsconfig.json(exclude tests from build)packages/backend/package.json(add vitest + test scripts)pnpm-lock.yamlFull details:
plans/nas-performance-improvements.md