fix(backend): batch processed file ownership validation by clonder · Pull Request #6725 · loculus-project/loculus

clonder · 2026-06-22T13:25:06Z

Resolves #3951

Batches file ownership validation in /submit-processed-data so the backend checks file groups once per stream chunk instead of once per sequence entry.

This keeps the existing safety check that processed data cannot reference files from another group, while reducing many small DB lookups during large preprocessing batches.

🚀 Preview: Add preview label to enable

anna-parker · 2026-06-23T10:09:32Z

+                    .validateFilenameCharacters(fileMapping)
+                    .validateFilenamesAreUnique(fileMapping)
+                    .validateCategoriesMatchOutputSchema(fileMapping, organism)
+                    .validateMultipartUploads(fileMapping.fileIds)


validateMultipartUploads also queries the DB (FilesTable), so this will still run once per entry.

anna-parker · 2026-06-23T10:09:35Z

+                    .validateFilenamesAreUnique(fileMapping)
+                    .validateCategoriesMatchOutputSchema(fileMapping, organism)
+                    .validateMultipartUploads(fileMapping.fileIds)
+                    .validateFilesExist(fileMapping.fileIds)


This one actually queries FilesTable twice for all fileIds.

anna-parker · 2026-06-23T10:25:28Z

+                organism,
+                processedFiles,
+            )
+            validateFilesBelongToSubmittingGroups(filesToValidate)


I think the other file validations that call the DB, validateMultipartUploads and validateFilesExist, can be added here so that they only hit the DB once per batch. They take a set of fileIds as input, so I don't see an issue with moving them, unless I overlooked something.

Moved validateMultipartUploads and validateFilesExist out of the per-entry loop.

anna-parker · 2026-06-23T10:25:54Z

@claude review

claude · 2026-06-23T10:26:08Z

Claude finished @anna-parker's task in 2m 43s

Code Review

Read changed files and understand the batching logic
Review correctness and performance
Check test coverage
Post review findings

Summary

The PR correctly batches validateFilesBelongToSubmittingGroups — reducing from 2 DB calls per entry to 2 per chunk. The refactoring is clean and the new test covers the cross-group scenario in a batch. That said, there are a few issues worth addressing.

Issues

1. `validateMultipartUploads` and `validateFilesExist` still issue per-entry DB calls (anna-parker's point)

In validateFileMappingsAndCollectFileIds the chain at lines 346–347:

fileMappingPreconditionValidator
    .validateMultipartUploads(fileMapping.fileIds)   // 1 DB call per entry
    .validateFilesExist(fileMapping.fileIds)          // 2 DB calls per entry (getNonExistentFileIds + getUncheckedFileIds)

...is called once per entry inside the per-batch forEach, so with a default streamBatchSize of 1000 this still produces up to 3000 DB calls per chunk (3 calls × 1000 entries) when every entry has files. Both methods accept a Set<FileId>, so the fix (as anna-parker suggests) is to collect all file IDs across the batch first and pass the full set in a single call each. The error messages are already file-ID-centric, so no diagnostic quality is lost.
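To make the difference in query counts concrete, here is a minimal sketch of per-entry versus batched validation. Everything in it is a hypothetical stand-in: `CountingFileStore`, `findMissing`, `validatePerEntry`, and `validateBatched` are illustration names, not the real FilesTable queries or validator methods from this PR, and `FileId` is modelled as a plain string.

```kotlin
// Hypothetical stand-in for the FileId type used by the backend.
typealias FileId = String

// Hypothetical in-memory store that counts simulated database round trips.
class CountingFileStore(private val existing: Set<FileId>) {
    var queryCount = 0
        private set

    // Simulates one query returning the requested IDs that are not in the store.
    fun findMissing(fileIds: Set<FileId>): Set<FileId> {
        queryCount++
        return fileIds - existing
    }
}

// Per-entry validation: one query for every entry in the chunk.
fun validatePerEntry(store: CountingFileStore, entries: List<Set<FileId>>): Set<FileId> =
    entries.flatMap { store.findMissing(it) }.toSet()

// Batched validation: union the IDs first, then issue a single query for the chunk.
fun validateBatched(store: CountingFileStore, entries: List<Set<FileId>>): Set<FileId> {
    val allIds = entries.flatten().toSet()
    if (allIds.isEmpty()) return emptySet() // nothing to look up, so skip the query
    return store.findMissing(allIds)
}
```

Both functions report the same missing IDs; only the number of simulated round trips differs, which is the point of moving the call out of the per-entry loop.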


2. Behavioral change for errored entries — no test coverage

Lines 353–357 introduce a new behaviour: entries with errors skip file ownership validation entirely.

// Only releasable processed entries need file ownership checks.
if (submittedProcessedData.errors.orEmpty().isEmpty()) {
    filesToValidate[accessionVersion] = fileMapping.fileIds
}

The old implementation validated ownership for all entries. While the rationale (errored entries can't be released) is reasonable, this is an undocumented semantic change with no test. A pipeline that submits an errored entry referencing a cross-group file would now silently succeed where it previously errored. It would be worth adding a test (and confirming the behaviour is intentional) for the case where an errored entry references a file from another group.
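A test for this case could start from a sketch like the one below, which isolates the selection rule. `ProcessedEntry` and `collectFilesToValidate` are hypothetical names for illustration; the real `submittedProcessedData` type carries more fields and its error type is not a plain string.

```kotlin
// Hypothetical, simplified shape of a submitted processed entry.
data class ProcessedEntry(
    val accessionVersion: String,
    val errors: List<String>?,
    val fileIds: Set<String>,
)

// Mirrors the new rule: only entries without errors are queued for ownership checks.
fun collectFilesToValidate(entries: List<ProcessedEntry>): Map<String, Set<String>> =
    entries
        .filter { it.errors.orEmpty().isEmpty() }
        .associate { it.accessionVersion to it.fileIds }
```

Under this rule an errored entry that references another group's file never reaches the ownership check, which is exactly the behaviour the missing test would need to pin down.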

3. Silent skip on unknown accession version

In validateFilesBelongToSubmittingGroups:

val sequenceEntryGroup = sequenceEntryGroups[accessionVersion] ?: return@forEach

If an accession is not found in SequenceEntriesTable the ownership check is silently skipped for that entry, relying on insertProcessedData to catch it later. This is correct and the comment explains it, but it means a pipeline error (wrong accession) won't produce an early ownership-related error — just a slightly different error downstream. This seems acceptable.
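The batched ownership check, including the skip for unknown accession versions, can be sketched as follows. This assumes the two lookups (accession to group, file to group) have already been loaded into maps; `findCrossGroupFiles`, the string keys, and the integer group IDs are illustration choices, not the PR's actual signatures.

```kotlin
// Returns, per accession version, the file IDs that belong to a different group.
// Both lookup maps are assumed to come from one query each per chunk.
fun findCrossGroupFiles(
    filesToValidate: Map<String, Set<String>>,
    sequenceEntryGroups: Map<String, Int>,
    fileGroups: Map<String, Int>,
): Map<String, Set<String>> {
    val violations = mutableMapOf<String, Set<String>>()
    filesToValidate.forEach { (accessionVersion, fileIds) ->
        // Unknown accession versions are skipped here and left to a later step to reject.
        val entryGroup = sequenceEntryGroups[accessionVersion] ?: return@forEach
        val foreign = fileIds.filter { fileId ->
            val fileGroup = fileGroups[fileId]
            // Files missing from the lookup are ignored in this sketch (assumed to be
            // reported by a separate existence check).
            fileGroup != null && fileGroup != entryGroup
        }.toSet()
        if (foreign.isNotEmpty()) violations[accessionVersion] = foreign
    }
    return violations
}
```

In the test data an entry with an unknown accession version produces no violation at all, matching the silent skip described above.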

Positives

The ownership batching itself is implemented correctly: 1 query for accession→group and 1 query for file→group per chunk instead of 2 per entry. Good.
accessionVersion.displayAccessionVersion() in the error message is correct; the old code used submittedProcessedData.displayAccessionVersion() which was equivalent but slightly more indirect.
The new integration test for cross-group files within a batch is exactly the right test to add for this PR.
parseSubmittedProcessedDataLine extraction is a clean readability improvement.

anna-parker · 2026-06-23T10:48:42Z

This is a preexisting issue, but I'm now concerned the stream is not being closed correctly. I believe ServletInputStream (the inputStream) handles closing itself, but I'm not sure this BufferedReader class does. It probably isn't an issue, as we've been doing it this way, but it might be good to check.

https://stackoverflow.com/questions/30862577/does-closing-a-stream-close-the-bufferedreader-source

It doesn't seem to close the stream on its own? https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html

@claude what are your thoughts? do we need to close this stream explicitly?

Claude finished @anna-parker's task in 1m 7s

Read the relevant code around line 274

Analyze BufferedReader / ServletInputStream lifecycle

Post answer

Stream Closing Analysis

Short answer: no resource leak in practice, but the use {} idiom is still the right fix.

Here's the chain:

request.inputStream              ← ServletInputStream (managed by servlet container)
└─ InputStreamReader(...)
   └─ BufferedReader(...)        ← never closed

Why it isn't technically a leak:
request.inputStream is a ServletInputStream whose lifecycle is owned by the servlet container (Tomcat/Jetty etc.). Per the Servlet spec (§3.1.1), the container closes the stream after the response is committed — regardless of whether application code closed it first. So even if BufferedReader is never closed, no file descriptor or socket leaks.

Why it's still worth fixing:

BufferedReader implements AutoCloseable, so leaving it unclosed will trip static analysis (e.g. detekt) and is a bad pattern to copy.

If an exception is thrown mid-stream (which does happen here — validation errors), the BufferedReader is abandoned without flushing any internal buffer. That's harmless for a read-only reader, but structurally sloppy.

The lineSequence() on an unclosed reader is fine as long as the sequence is fully consumed (it is, via .forEach), but use {} makes the ownership explicit.

The fix is a one-liner — wrap with Kotlin's use {}:

BufferedReader(InputStreamReader(inputStream)).use { reader ->
    reader.lineSequence().chunked(streamBatchSize).forEach { lines ->
        ...
    }
}

Calling close() on the BufferedReader will propagate to InputStreamReader → ServletInputStream. The servlet container handles a double-close gracefully (it's a no-op after the first), so this is safe.

Verdict: Preexisting issue, low severity, but worth a one-line fix to keep the code idiomatic and static-analysis clean. Since this is marked as a preexisting issue, it could go in this PR or a follow-up chore.
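The close propagation described above can be checked outside a servlet container with a small sketch. `TrackingInputStream` and `chunkSizes` are hypothetical helpers written for this illustration; the in-memory stream stands in for `request.inputStream`, so this says nothing about how a particular container handles a second close.

```kotlin
import java.io.BufferedReader
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.io.InputStreamReader

// Hypothetical in-memory stream that records whether close() reached it.
class TrackingInputStream(data: ByteArray) : ByteArrayInputStream(data) {
    var closed = false
        private set

    override fun close() {
        closed = true
        super.close()
    }
}

// Reads the stream in chunks of `batchSize` lines and returns each chunk's size.
// The sequence is fully consumed inside use {}, before the reader is closed.
fun chunkSizes(inputStream: InputStream, batchSize: Int): List<Int> =
    BufferedReader(InputStreamReader(inputStream)).use { reader ->
        reader.lineSequence().chunked(batchSize).map { it.size }.toList()
    }
```

Because `use {}` closes the reader in a finally-style block, the underlying stream is also closed when processing a chunk throws, which is the case the validation errors in this endpoint would hit.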

added use {} here so the lifecycle is explicit

clonder requested a review from anna-parker June 22, 2026 13:25

clonder self-assigned this Jun 22, 2026

clonder added the bug, backend, and performance labels Jun 22, 2026

clonder force-pushed the fix/backend-batch-processed-file-ownership branch 2 times, most recently from 3d25b1b to 8f9ed32, June 22, 2026 13:55

fix(backend): batch processed file ownership validation

93f70d4

clonder force-pushed the fix/backend-batch-processed-file-ownership branch from 8f9ed32 to 93f70d4, June 22, 2026 14:02

anna-parker requested a review from tombch June 23, 2026 08:42

anna-parker reviewed Jun 23, 2026


Comment thread backend/src/main/kotlin/org/loculus/backend/service/submission/SubmissionDatabaseService.kt

anna-parker reviewed Jun 23, 2026


fix(backend): batch processed file validation

18cce92

clonder wants to merge 2 commits into main from fix/backend-batch-processed-file-ownership


2 participants
