feat(backend): run scheduled tasks with ShedLock by theosanderson-agent · Pull Request #6705 · loculus-project/loculus

theosanderson-agent · 2026-06-17T13:43:52Z

Summary

Implementation of ShedLock to close #6704.

Today every @Scheduled task runs independently on each backend replica, so the effective execution frequency scales with the replica count (e.g. with 3 replicas the cleanup task runs ~3× as often, and the CrossRef citation fetch hits the external service 3×). ShedLock fixes this by having replicas compete for a shared, named lock before running a task — only the winner executes that tick; the rest skip it. The result is "run once per interval, regardless of replica count".

How it works

ShedLock stores locks in a small shedlock table in the existing Postgres database (no new infrastructure). When a scheduled method fires:

The replica attempts to INSERT/UPDATE a row keyed by the task name, setting lock_until.
If it wins the row lock, it runs the task and releases the lock when done.
Other replicas, seeing an unexpired lock, skip that execution.

lockAtMostFor is a safety net: if the holding replica crashes mid-task, the lock auto-expires after that duration so another replica can take over. We use the database clock (usingDbTime()) for all lock timing, so clock drift between replicas can't cause double-runs.

Changes

build.gradle — add net.javacrumbs.shedlock:shedlock-spring and shedlock-provider-jdbc-template (7.7.0); gradle.lockfile regenerated (STRICT lock mode).
BackendSpringConfig.kt — @EnableSchedulerLock(defaultLockAtMostFor = "PT30M") and a JdbcTemplateLockProvider bean backed by the existing DataSource with usingDbTime().
Migration V1.31__add_shedlock_table.sql — creates the shedlock table.

All four scheduled tasks annotated with @SchedulerLock, each lockAtMostFor sized above its expected runtime:

Task	Interval	`lockAtMostFor`
`CleanUpStaleSequencesInProcessingTask`	~60s	`PT5M`
`UseNewerProcessingPipelineVersionTask`	~10s	`PT10M`
`CleanUpAuxTableTask`	1h	`PT15M`
`SeqSetCrossRefCitationsTask`	6h	`PT1H`

Note the org.jetbrains.kotlin.plugin.spring plugin auto-opens Spring components, so the @SchedulerLock methods are proxyable without extra open modifiers.

Testing

./gradlew compileKotlin ktlintFormat — pass.
./gradlew dependencies --write-locks — lockfile updated.
BackendApplicationTest (context load with mocked DataSource) — pass; the lockProvider bean wires up without touching the DB at construction.
A DB-backed EndpointTest against the Postgres test container — pass; Flyway log confirms Migrating schema "public" to version "1.31 - add shedlock table" and Successfully applied 32 migrations.

🤖 Generated with Claude Code

🚀 Preview: Add preview label to enable

@scheduled

Introduce ShedLock so that the backend's scheduled tasks execute on only one replica at a time, instead of once per replica. This addresses #6704: previously the effective frequency of every @scheduled task scaled with the number of backend replicas. ShedLock coordinates via a shared `shedlock` table in the existing Postgres database (added in migration V1.31). Before a task runs, the replica tries to acquire a row-level lock keyed by the task name; only the replica that wins runs the task, the others skip that tick. Changes: - Add shedlock-spring and shedlock-provider-jdbc-template (7.7.0) deps. - Register a JdbcTemplateLockProvider backed by the existing DataSource, using `usingDbTime()` so lock timing relies on the database clock and is unaffected by clock drift between replicas. - Enable locking with @EnableSchedulerLock (defaultLockAtMostFor=PT30M as a safety net if a replica dies mid-task). - Annotate all four scheduled tasks with @SchedulerLock, each with a lockAtMostFor sized above its expected runtime. - Add Flyway migration V1.31 creating the `shedlock` table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-06-17T14:07:41Z

Claude finished @anna-parker's task in 2m 6s —— View job

Code Review

Overall this is a clean, well-scoped implementation. The approach (ShedLock backed by the existing Postgres DB, usingDbTime() for clock-safe locking) is the right one for this problem. Two things worth addressing before merging, and one smaller note:

TIMESTAMP vs TIMESTAMP WITH TIME ZONE in migration — The ShedLock PostgreSQL documentation recommends TIMESTAMP WITH TIME ZONE for the lock_until and locked_at columns so that timezone-aware comparisons work correctly when usingDbTime() calls NOW() (which returns timestamptz). Plain TIMESTAMP will work because PostgreSQL silently strips the timezone on assignment, but aligning with the documented schema is safer. Inline suggestion posted.

lockAtMostFor for UseNewerProcessingPipelineVersionTask — This task fires every 10 seconds; a PT10M safety timeout means a crashed replica locks out all others for up to 10 minutes. The actual work is a lightweight DB read, so PT1M is still generous and keeps the failure window much smaller. Inline suggestion posted.

Test coverage for the lock mechanism — The existing task tests call task() directly (bypassing the ShedLock AOP proxy), so they never exercise the lock itself. The business logic tests are fine as-is, but there's no test confirming the shedlock table gets populated or that the lock correctly prevents double-execution. This may be acceptable for a demo PR, but worth tracking before the replica count is increased. Inline comment posted.

…ble.sql Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

anna-parker

lgtm!

I confirmed tasks are running on the preview!

theosanderson · 2026-06-17T18:10:58Z

OK, I will let you/someone else merge as I haven't looked into this enough to take responsibility in the way that merging would imply (thanks a lot for checking!)

anna-parker · 2026-06-17T20:19:38Z

@corneliusroemer and I took a closer look and discovered that the behavior is slightly different than I expected from quickly reading the docs.

Shedlock prevents instances running a task in parallel. It does not check that a task is executed only as many times as defined by the task schedule frequency.

If a second instance tries to start a task while the first instance is running the same task (i.e. that task is locked in the table) the second task run will be skipped. However, if the second instance tried to run the task AFTER the initial instance has finished running (and removed the lock) the second task would still run. As backends will not all start in unison in theory we could get to a state where a task is still run multiple times during its defined frequency period. This will definitely prevent peaks of load when multiple backends start a task in unison but not fully prevent multiple executions of a task.

I believe that a way to fix this would be to set the lockAtMostFor and lockAtLeastFor parameters to the task frequency. However, I am wondering if it makes sense to just add code for this instead of using a library.

lockAtMostFor should be greater than the time the task is expected to run for so that another task doesnt try to start while the first task is still running.

theosanderson · 2026-06-17T20:58:36Z

Yes my vague expectation was what you say. Avoiding parallel runs is anyway a good start. I was thinking we could maybe run on a timetable based on times and then discard the runs that are stopped by locks (possibly)

anna-parker · 2026-06-17T21:50:09Z

alternative suggestion that should handle these edge cases with self rolled code (quite simple as only covers what we need and leads to no external dependencies): #6711

anna-parker · 2026-06-18T14:36:15Z

From discussing with @theosanderson we should increase the lockAtMostFor to at least the task run frequency for all tasks (for the GC task this can be higher), and probably set lockAtLeastFor to the task frequency (slightly less is ok if we are ok with potentially higher task run frequency).

This will require some test changes

theosanderson · 2026-06-18T14:42:20Z

yeah I guess I would phrase it differently, I would say we set lockAtLeastFor to the desired frequency. We set lockAtMostFor to the highest interval we would be prepared to tolerate in the rare event that a job died.

…astFor The previous ShedLock setup only set lockAtMostFor, which prevents *simultaneous* runs but not frequency-scaling with replica count: because the lock is released as soon as the task finishes and replicas fire out of sync, the task could still run up to N times per interval with N replicas. Add a configurable lockAtLeastFor to every scheduled task so the lock is held for at least the desired interval, which is what actually guarantees "run once per interval regardless of replica count". The schedulers now poll frequently (ShedLock skips rather than queues when a lock is held), and lockAtLeastFor governs the real cadence. For the GC-style aux cleanup, lockAtMostFor is set above lockAtLeastFor so a long run keeps the lock (avoiding parallelism) while still releasing if a replica dies mid-task. Lock minimums are overridable via loculus.locks.<task>.atLeast and are set to PT0S in tests (ShedLock's in-memory LockRecordRegistry is incompatible with truncating the shedlock table between tests). ShedLockIntegrationTest now verifies lockAtLeastFor directly through the LockProvider. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

outdated

theosanderson · 2026-06-19T09:02:40Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca6520502e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Addresses Codex review: useNewerProcessingPipelineVersion hardcoded a lockAtLeastFor of PT10S, which silently overrode operator-configured pipelineVersionUpgradeCheckIntervalSeconds values below 10s (schema allows down to 1s). lockAtLeastFor now defaults to the configured interval via a nested placeholder, so the interval is honored rather than capped. The Helm chart passes both bounds derived from the same value (lockAtLeastFor = interval, lockAtMostFor = 5x interval) so the at-most ceiling always stays above the floor for any configured interval. cleanUpStaleSequencesInProcessing likewise defaults its floor to run-every-seconds. ShedLock does not evaluate SpEL in the duration attributes (only in name), so the 5x is computed in the chart rather than the annotation. Added ShedLockPropertyResolutionTest to cover the production default path, which the DB-backed test can't (it always overrides atLeast to PT0S). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…piry boundary cleanUpStaleSequencesInProcessing used fixedRate with lockAtLeastFor equal to the run-every interval. fixedRate fires on a wall-clock grid every R while the lock expires at acquire+R, so the next tick lands exactly on the expiry boundary and is skipped or not depending on sub-millisecond clock jitter, making the effective cadence drift toward ~2x. Switching to fixedDelay schedules from completion, so each poll fires after lockAtLeastFor has already elapsed since acquisition and reliably re-acquires, while still holding the lock for the full interval (at-most-once per interval across replicas preserved). The other tasks are unaffected: aux/crossref poll far more often than their lock, and pipeline is already fixedDelay. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Remove restatements of the lock mechanism that the annotation values already convey, keeping only the non-obvious rationale (fixedDelay choice, the unique test lock name, and why the property-resolution test exists). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

corneliusroemer · 2026-06-22T08:33:53Z

+        // non-Helm fallback. Both overridable via `loculus.locks.*` (tests use PT0S).
+        lockAtLeastFor = "\${loculus.locks.useNewerProcessingPipelineVersion.atLeast:" +
+            "PT\${${BackendSpringProperty.PIPELINE_VERSION_UPGRADE_CHECK_INTERVAL_SECONDS}}S}",
+        lockAtMostFor = "\${loculus.locks.useNewerProcessingPipelineVersion.atMost:PT1M}",


This hard coding is problematic when the task starts taking longer on big instances

Ugh this is actually configurable but bypassing our explicit config variables

yeah agree it could be confusing to have this set in two ways. Would be happy to discard one

theosanderson · 2026-06-23T10:11:45Z

Closing as Anya and Cornelius prefer a hand-rolled version: #6711

theosanderson-agent mentioned this pull request Jun 17, 2026

Make scheduled tasks not scale in frequency with number of backend replicas #6704

Open

claude Bot added the backend related to the loculus backend component label Jun 17, 2026

anna-parker added the update_db_schema label Jun 17, 2026

actions-user and others added 2 commits June 17, 2026 13:55

Update schema documentation based on migration changes

9a20161

Your commit message

a52f401

anna-parker marked this pull request as ready for review June 17, 2026 14:07

claude Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread backend/src/main/resources/db/migration/V1.31__add_shedlock_table.sql

claude Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread .../main/kotlin/org/loculus/backend/service/submission/UseNewerProcessingPipelineVersionTask.kt Outdated

claude Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread .../main/kotlin/org/loculus/backend/service/submission/CleanUpStaleSequencesInProcessingTask.kt Outdated

anna-parker and others added 3 commits June 17, 2026 16:14

Update backend/src/main/resources/db/migration/V1.31__add_shedlock_ta…

0169f4e

…ble.sql Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

Update schema documentation based on migration changes

0bc9cb8

Apply suggestion from @claude[bot]

16cf025

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

anna-parker self-requested a review June 17, 2026 14:23

feat: add a little test

3898c3c

anna-parker previously approved these changes Jun 17, 2026

View reviewed changes

anna-parker added the preview Triggers a deployment to argocd label Jun 17, 2026

more tests

c8e1f7b

anna-parker mentioned this pull request Jun 17, 2026

feat(backend): ensure a task is only executed once regardless of how many backend instances are run #6711

Open

3 tasks

anna-parker removed the preview Triggers a deployment to argocd label Jun 18, 2026

anna-parker mentioned this pull request Jun 18, 2026

feat(backend, config): S3 garbage collection without race condition #6678

Open

3 tasks

chatgpt-codex-connector Bot reviewed Jun 19, 2026

View reviewed changes

Comment thread .../main/kotlin/org/loculus/backend/service/submission/UseNewerProcessingPipelineVersionTask.kt Outdated

Theo Sanderson and others added 3 commits June 19, 2026 10:27

theosanderson changed the title ~~feat(backend): run scheduled tasks on a single replica with ShedLock~~ feat(backend): run scheduled tasks with ShedLock Jun 19, 2026

theosanderson requested a review from anna-parker June 19, 2026 10:40

corneliusroemer reviewed Jun 22, 2026

View reviewed changes

theosanderson closed this Jun 23, 2026

Conversation

theosanderson-agent commented Jun 17, 2026 • edited by theosanderson Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it works

Changes

Testing

Uh oh!

claude Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anna-parker left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

theosanderson commented Jun 17, 2026

Uh oh!

anna-parker commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

theosanderson commented Jun 17, 2026

Uh oh!

anna-parker commented Jun 17, 2026

Uh oh!

anna-parker commented Jun 18, 2026

Uh oh!

theosanderson commented Jun 18, 2026

Uh oh!

theosanderson commented Jun 19, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

corneliusroemer Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

corneliusroemer Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

theosanderson Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

theosanderson commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

theosanderson-agent commented Jun 17, 2026 •

edited by theosanderson

Loading

claude Bot commented Jun 17, 2026 •

edited

Loading

anna-parker left a comment •

edited

Loading

anna-parker commented Jun 17, 2026 •

edited

Loading