Skip to content

fix(backend): batch processed file ownership validation#6725

Open
clonder wants to merge 2 commits into
mainfrom
fix/backend-batch-processed-file-ownership
Open

fix(backend): batch processed file ownership validation#6725
clonder wants to merge 2 commits into
mainfrom
fix/backend-batch-processed-file-ownership

Conversation

@clonder

@clonder clonder commented Jun 22, 2026

Copy link
Copy Markdown

Resolves #3951

Batches file ownership validation in /submit-processed-data so the backend checks file groups once per stream chunk instead of once per sequence entry.

This keeps the existing safety check that processed data cannot reference files from another group, while reducing many small DB lookups during large preprocessing batches.

🚀 Preview: Add preview label to enable

@clonder clonder requested a review from anna-parker June 22, 2026 13:25
@clonder clonder self-assigned this Jun 22, 2026
@clonder clonder added bug Something isn't working backend related to the loculus backend component performance labels Jun 22, 2026
@clonder clonder force-pushed the fix/backend-batch-processed-file-ownership branch 2 times, most recently from 3d25b1b to 8f9ed32 Compare June 22, 2026 13:55
@clonder clonder force-pushed the fix/backend-batch-processed-file-ownership branch from 8f9ed32 to 93f70d4 Compare June 22, 2026 14:02
@anna-parker anna-parker requested a review from tombch June 23, 2026 08:42
.validateFilenameCharacters(fileMapping)
.validateFilenamesAreUnique(fileMapping)
.validateCategoriesMatchOutputSchema(fileMapping, organism)
.validateMultipartUploads(fileMapping.fileIds)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validateMultipartUploads also calls the db (FilesTable), so this will still be done once per entry

.validateFilenamesAreUnique(fileMapping)
.validateCategoriesMatchOutputSchema(fileMapping, organism)
.validateMultipartUploads(fileMapping.fileIds)
.validateFilesExist(fileMapping.fileIds)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also calls the FilesTable twice actually for all fileIds

organism,
processedFiles,
)
validateFilesBelongToSubmittingGroups(filesToValidate)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other file validations that call the db: validateMultipartUploads and validateFilesExist can be added here, so that they only call the db once per batch (they take a set of fileIds as input so I dont see an issue moving them unless I overlooked sth)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved validateMultipartUploads and validateFilesExist out of the per-entry loop.

@anna-parker

Copy link
Copy Markdown
Contributor

@claude review

@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Claude finished @anna-parker's task in 2m 43s —— View job


Code Review

  • Read changed files and understand the batching logic
  • Review correctness and performance
  • Check test coverage
  • Post review findings

Summary

The PR correctly batches validateFilesBelongToSubmittingGroups — reducing from 2 DB calls per entry to 2 per chunk. The refactoring is clean and the new test covers the cross-group scenario in a batch. That said, there are a few issues worth addressing.


Issues

1. validateMultipartUploads and validateFilesExist still issue per-entry DB calls (anna-parker's point)

In validateFileMappingsAndCollectFileIds the chain at lines 346–347:

fileMappingPreconditionValidator
    .validateMultipartUploads(fileMapping.fileIds)   // 1 DB call per entry
    .validateFilesExist(fileMapping.fileIds)          // 2 DB calls per entry (getNonExistentFileIds + getUncheckedFileIds)

...is called once per entry in the per-batch forEach, so with a default streamBatchSize of 1000 this still produces up to 3000 DB calls per chunk for file-having entries. Both methods accept a Set<FileId>, so the fix (as anna-parker suggests) is to collect all file IDs across the batch first and pass the full set in a single call each. The error messages are already file-ID-centric, so no diagnostic quality is lost.

Fix this →

2. Behavioral change for errored entries — no test coverage

Lines 353–357 introduce a new behaviour: entries with errors skip file ownership validation entirely.

// Only releasable processed entries need file ownership checks.
if (submittedProcessedData.errors.orEmpty().isEmpty()) {
    filesToValidate[accessionVersion] = fileMapping.fileIds
}

The old implementation validated ownership for all entries. While the rationale (errored entries can't be released) is reasonable, this is an undocumented semantic change with no test. A pipeline that submits an errored entry referencing a cross-group file would now silently succeed where it previously errored. It would be worth adding a test (and confirming the behaviour is intentional) for the case where an errored entry references a file from another group.

3. Silent skip on unknown accession version

In validateFilesBelongToSubmittingGroups:

val sequenceEntryGroup = sequenceEntryGroups[accessionVersion] ?: return@forEach

If an accession is not found in SequenceEntriesTable the ownership check is silently skipped for that entry, relying on insertProcessedData to catch it later. This is correct and the comment explains it, but it means a pipeline error (wrong accession) won't produce an early ownership-related error — just a slightly different error downstream. This seems acceptable.


Positives

  • The ownership batching itself is implemented correctly: 1 query for accession→group and 1 query for file→group per chunk instead of 2 per entry. Good.
  • accessionVersion.displayAccessionVersion() in the error message is correct; the old code used submittedProcessedData.displayAccessionVersion() which was equivalent but slightly more indirect.
  • The new integration test for cross-group files within a batch is exactly the right test to add for this PR.
  • parseSubmittedProcessedDataLine extraction is a clean readability improvement.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a preexisting issue but Im now concerned the stream is not being closed correctly, I believe ServletInputStream (the inputStream) handles closing itsself but Im not sure this BufferedReader class does... probably this isnt an issue as weve been doing it but might be good to check

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesnt seem to close the stream on its own? https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude what are your thoughts? do we need to close this stream explicitly?

@claude claude Bot Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @anna-parker's task in 1m 7s —— View job


  • Read the relevant code around line 274
  • Analyze BufferedReader / ServletInputStream lifecycle
  • Post answer

Stream Closing Analysis

Short answer: no resource leak in practice, but the use {} idiom is still the right fix.

Here's the chain:

request.inputStream          ← ServletInputStream (managed by servlet container)
  └─ InputStreamReader(...)
       └─ BufferedReader(...)   ← never closed

Why it isn't technically a leak:
request.inputStream is a ServletInputStream whose lifecycle is owned by the servlet container (Tomcat/Jetty etc.). Per the Servlet spec (§3.1.1), the container closes the stream after the response is committed — regardless of whether application code closed it first. So even if BufferedReader is never closed, no file descriptor or socket leaks.

Why it's still worth fixing:

  1. BufferedReader implements AutoCloseable, so leaving it unclosed will trip static analysis (e.g. detekt) and is a bad pattern to copy.
  2. If an exception is thrown mid-stream (which does happen here — validation errors), the BufferedReader is abandoned without flushing any internal buffer. That's harmless for a read-only reader, but structurally sloppy.
  3. The lineSequence() on an unclosed reader is fine as long as the sequence is fully consumed (it is, via .forEach), but use {} makes the ownership explicit.

The fix is a one-liner — wrap with Kotlin's use {}:

BufferedReader(InputStreamReader(inputStream)).use { reader ->
    reader.lineSequence().chunked(streamBatchSize).forEach { lines ->
        ...
    }
}

Calling close() on the BufferedReader will propagate to InputStreamReaderServletInputStream. The servlet container handles a double-close gracefully (it's a no-op after the first), so this is safe.

Verdict: Preexisting issue, low severity, but worth a one-line fix to keep the code idiomatic and static-analysis clean. Since this is marked as a preexisting issue, it could go in this PR or a follow-up chore.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added use {} here so the lifecycle is explicit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend related to the loculus backend component bug Something isn't working performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of file validation for /submit-processed-data

2 participants