feat(backend, config): S3 garbage collection without race condition#6678
feat(backend, config): S3 garbage collection without race condition#6678anna-parker wants to merge 62 commits into
Conversation
|
Tanks a lot for this! Definitely better to guard against this :) Do I understand correctly that in the current setup, the scan of all se and sepd is done twice per GC task? I'm wondering if it's not possible to modify the query to return file_id, upload_requested_at, and marked_for_deletion_at at the same time, and then figure out in kotlin what files need to be deleted and which to be marked for deletion. That way you'd only have to do the heavy SQL work once. Or is there something I'm missing that would make that impractical? |
|
@maverbiest I think I covered this option in |
|
@anna-parker hmm if I understand the description of the alternative correctly that is about having one query that both finds and deletes. That's not what I was thinking but I probably didn't explain very well; what I meant is to have one query to find unreferenced files, but instead of it just returning file_id, also have it return upload_requested_at, and marked_for_deletion_at. Something like: val unreferenced = filesDatabaseService.getUnreferencedFiles()
val toDelete = unreferenced.filter { it.markedForDeletionAt != null }.map { it.id }
val toMark = unreferenced.filter { it.markedForDeletionAt == null && it.uploadRequestedAt < threshold }.map { it.id }Then run another query to delete what files in |
|
ah yes that would work! its a bit more custom code I guess but doable! |
|
I was wondering if the plan moving forward is to get this PR ready to merge, or to incorporate these improvements into #6543? I'm good either way, but if we're focusing on this PR I'll probably close my original one to avoid confusion & double work :) |
There was a problem hiding this comment.
Maybe we could rename this file since we now also add a column in addition to an index
| } | ||
| } | ||
|
|
||
| fun getMarkedForDeletionFileIds(fileIds: Set<FileId>): Set<FileId> = fileIds.chunkedForDatabase({ chunk -> |
There was a problem hiding this comment.
Not blocking but could we add a short docstring here explaining that the function filters a set of input FileIds (similar to getUncheckedFileIds below)? Upon first reading the function name I expected it to get the FileIds of all files marked for deletion in the DB
| val markedOrphans = orphans.filterValues { it != null }.keys | ||
| val newOrphans = orphans.filterValues { it == null }.keys | ||
|
|
||
| // Phase 2: delete files that were marked in a previous run and are still unreferenced |
There was a problem hiding this comment.
| // Phase 2: delete files that were marked in a previous run and are still unreferenced | |
| // Phase 2: delete files that were marked in a previous run |
Maybe remove this part? There is no way for a file to be 'rescued' once it's been marked for deletion, right? Or is there?
| if (newOrphans.isNotEmpty()) { | ||
| filesDatabaseService.markFilesForDeletion(newOrphans) | ||
| log.info { "S3 garbage collection task marked ${newOrphans.size} orphan(s) for deletion" } | ||
| } else { | ||
| log.info { "S3 garbage collection task identified no new orphan files" } | ||
| } |
There was a problem hiding this comment.
One thing I'm wondering here is whether marking files for deletion should be gated by enabled as well?
If I understand the current approach correctly, even when we disable the GC files will get marked for deletion which means they will cause submissions to be rejected. Not blocking in this one way or the other but might be worth flagging in a comment at least
| ) | ||
| if (deleteFailures > 0) { | ||
| log.warn { | ||
| "S3 garbage collection task unsuccessfully attempted to delete $deleteFailures orphan file(s)" |
There was a problem hiding this comment.
Maybe we could keep track of s3DeleteFailures and dbDeleteFailures separately?
| gcInitialDelayMinutes: 15 # initial delay before first garbage collection task runs | ||
| gcEnabled: false # have garbage collection log instead of actually delete files | ||
| gcFrequencyMinutes: 1440 # frequency of garbage collection polling in minutes | ||
| orphanRetentionPeriodMinutes: 7 |
There was a problem hiding this comment.
| orphanRetentionPeriodMinutes: 7 | |
| orphanRetentionPeriodMinutes: 10080 |
There was a problem hiding this comment.
I think I'd prefer this to also carry the gc prefix in line with the other gc-related config . We could do gcOrphanRetentionPeriodMinutes (or maybe gcGracePeriodMinutes)?
| private val dateProvider: DateProvider, | ||
| private val auditLogger: AuditLogger, | ||
| @Value("\${${BackendSpringProperty.S3_ORPHAN_RETENTION_PERIOD_MINUTES}}") private val orphanRetentionPeriod: Int, | ||
| @Value("\${${BackendSpringProperty.S3_GC_ENABLED}:false}") private val enabled: Boolean = false, |
There was a problem hiding this comment.
I find it confusing that we set the default value twice in one line. It's not obvious to me what would happen if they ever get out of sync. Then there's also a third default value in application.json (which actually defaults to true for enabled instead of false).
Could we keep all defaults in application.json so they all live in one place? Also for the initial delay and the frequency?
|
|
||
| private var groupId = 0 | ||
|
|
||
| private fun createProcessedSubmissionReferencingFile(file: UUID) { |
There was a problem hiding this comment.
This has a lot of overlap with the second test in GetOrphanedFileIdsTest.kt, could this be extracted and reused? Or would that be too much of a hassle since they're slightly different?
builds on #6543 with modifications to prevent a race condition and additional code improvements, tests were modified to use existing helper functions used elsewhere in the codebase
resolves #3935
resolves #6543 (comment)
Config
Requires #6705 to be merged before GC can be enabled on instances with multiple backend replicas, the default config parameters will be applied without user changes:
The Exact Race Window
The current code has two separate steps with no lock between them:
validateFileIdsExist checks the DB, then the submission writes to sequence_entries later. Between those two events, GC
can delete the file. The window is roughly: time to delete all the S3 objects + DB rows in the batch.
Additional improvements
Alternatives considered
Unresolved Issues
As discussed with @corneliusroemer async there is still a very small window between when the files are validated and are uploaded to the db during a submission. When only one backend is running this can be ignored, however on PPX we have two backends if both start this job simultaneously we can have a submission occur during deletion and have one backend attempt to delete a file within this interval. Leading to the issue we are attempting to resolve still occuring.
We should probably prevent multiple garbage collection tasks from running simultaneously to prevent this from occuring. The alternative is to only delete files a certain time period after they have been marked as ready for deletion - this would prevent this race condition.
Screenshot
PR Checklist
🚀 Preview: Add
previewlabel to enable