GH-5308: Stop a job execution by signalling and awaiting its running step(s) by kyungrae · Pull Request #5442 · spring-projects/spring-batch

kyungrae · 2026-06-25T06:41:03Z

Problem

JobOperator.stop() runs on the caller thread and itself calls jobRepository.update(stepExecution) on the running step. The thread executing that step also writes the same BATCH_STEP_EXECUTION row (per chunk and on completion). Both use optimistic locking, so they race:

caller loses → OptimisticLockingFailureException propagates out of stop();
worker loses → AbstractStep marks the step UNKNOWN ("should not be restarted") → the job can no longer be restarted.

The re-sync added in #5217 only narrows the window (get-then-update isn't atomic, and under REPEATABLE READ the re-read returns a stale snapshot). Full analysis + per-vendor reproduction: #5308.
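
The version check behind this race can be sketched with a simplified in-memory row. This is only an illustration: `Row` and its `update` method are hypothetical stand-ins, not Spring Batch classes, and a real repository would throw OptimisticLockingFailureException where this sketch returns `false`.

```java
public class OptimisticLockRace {

    /** Hypothetical stand-in for one BATCH_STEP_EXECUTION row (not a Spring Batch class). */
    public static final class Row {
        private int version = 0;
        private String status = "STARTED";

        /** The update succeeds only if the caller's version matches the stored one. */
        public synchronized boolean update(int expectedVersion, String newStatus) {
            if (expectedVersion != version) {
                return false; // stale version: the writer lost the race
            }
            status = newStatus;
            version++;
            return true;
        }

        public synchronized int version() {
            return version;
        }

        public synchronized String status() {
            return status;
        }
    }

    public static void main(String[] args) {
        Row row = new Row();

        // Both writers read the same version before either one writes.
        int workerSeen = row.version();
        int callerSeen = row.version();

        // The stop() caller writes first, so the worker's next write is stale.
        boolean callerWon = row.update(callerSeen, "STOPPED");
        boolean workerWon = row.update(workerSeen, "STARTED");

        System.out.println("caller=" + callerWon + " worker=" + workerWon);
    }
}
```

Swapping the order of the two `update` calls makes the caller lose instead, which corresponds to the first failure mode listed above.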

Approach — single writer + signal/await

Make the thread executing the step the sole writer of its StepExecution:

stop() no longer persists the step execution (removes the second writer → the race is gone).
It marks the job STOPPING in a short transaction, signals each running step, then waits, outside the transaction, for the step to confirm it has stopped. The confirmation is a future that AbstractStep completes once the execution terminates and its final metadata is saved. The wait is bounded by a configurable stopTimeout (default 30s); on timeout, stop() throws JobExecutionStopException.
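
The signal/await shape described above can be sketched with a plain `CompletableFuture`. This is a minimal illustration under stated assumptions, not the PR's implementation: `RunningStep`, `signalStop` and `termination` are hypothetical names, the worker's "persist" is just the completion of the future, and the sketch throws `TimeoutException` where the PR throws JobExecutionStopException.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class SignalAndAwaitSketch {

    /** Hypothetical stand-in for a running step; not the Spring Batch API. */
    public static final class RunningStep {
        private final AtomicBoolean terminateOnly = new AtomicBoolean(false);
        private final CompletableFuture<String> termination = new CompletableFuture<>();
        private final long chunkMillis;

        public RunningStep(long chunkMillis) {
            this.chunkMillis = chunkMillis;
        }

        /** Signal only: the caller never writes the step's state. */
        public void signalStop() {
            terminateOnly.set(true);
        }

        public CompletableFuture<String> termination() {
            return termination;
        }

        /** Worker loop: checks the flag at each chunk boundary, then records the final status itself. */
        public void run() {
            try {
                while (!terminateOnly.get()) {
                    Thread.sleep(chunkMillis); // simulated chunk
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            termination.complete("STOPPED"); // the worker is the sole writer of the final state
        }
    }

    /** Signals the step, then waits up to stopTimeout for it to confirm; returns the final status. */
    public static String stop(RunningStep step, Duration stopTimeout) throws TimeoutException {
        step.signalStop();
        try {
            return step.termination().get(stopTimeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the step", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) throws Exception {
        RunningStep step = new RunningStep(10);
        Thread worker = new Thread(step::run);
        worker.start();
        System.out.println(stop(step, Duration.ofSeconds(30)));
        worker.join();
    }
}
```

Because only the worker completes the future, the caller has nothing to write and therefore nothing to race on; it simply waits for the worker's confirmation or gives up after the timeout.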

Why wait?

The wait is the point of the change, for two reasons:

Durability during shutdown. The main motivation for persisting the stopped state from stop() (issue #4023, "Spring batch terminate in started status after sigterm") was that on SIGTERM the job would otherwise be left STARTED in the database. With this change the worker records the terminal state, but stop() blocks until that record is persisted. When stop() is called from a shutdown hook, it therefore keeps the JVM alive long enough for the worker to durably write STOPPED before resources are torn down. We get the same durability guarantee without the operator racing the worker.
stop() becomes a reliable, synchronous operation. Previously stop() was fire-and-forget: it requested a stop and returned, with no guarantee the job had actually stopped. Now it returns only after the running step(s) have genuinely terminated, so the caller can trust that on return the job is stopped (or get a JobExecutionStopException if it didn't stop within the timeout).
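
To illustrate the shutdown case, here is a hedged sketch of a JVM shutdown hook that blocks on a synchronous stop. `BlockingStop` and `stopAndAwait` are hypothetical names standing in for a blocking stop() call; they are not part of Spring Batch, and the `log` list exists only to make the ordering observable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShutdownHookSketch {

    /** Hypothetical blocking stop operation; stands in for a synchronous stop() call. */
    public interface BlockingStop {
        void stopAndAwait() throws Exception;
    }

    /** Builds a hook thread that does not finish until the stop has been confirmed or has failed. */
    public static Thread shutdownHook(BlockingStop stop, List<String> log) {
        return new Thread(() -> {
            try {
                stop.stopAndAwait(); // returns only after the worker has recorded its final state
                log.add("stopped");
            } catch (Exception e) {
                log.add("stop failed: " + e.getMessage()); // for example, the timeout elapsed
            }
        }, "batch-stop-hook");
    }

    public static void main(String[] args) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        // The JVM waits for registered hooks to finish before halting,
        // so a blocking stop keeps it alive until the worker is done.
        Runtime.getRuntime().addShutdownHook(
                shutdownHook(() -> log.add("worker wrote STOPPED"), log));
    }
}
```

With a fire-and-forget stop, the hook thread would return right after requesting the stop, and the JVM could halt before the worker wrote anything.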

Notes (behavior change)

stop() now blocks until the running step(s) confirm they have stopped, where it previously returned immediately (fire-and-forget). This is the intended improvement — it makes stop reliable and durable. The wait is bounded by a configurable stopTimeout (default 30s); only a step that genuinely cannot be interrupted in time (e.g. a long single tasklet that never reaches a chunk boundary) hits the timeout and surfaces a JobExecutionStopException. If a step is no longer actually running in this JVM, there is nothing to wait for and stop() returns immediately.
stop() no longer runs inside one operator-wide transaction; it persists STOPPING in its own short, atomic transaction and waits outside it. Since that STOPPING update is now the only write stop() makes, there is no multi-write atomicity to preserve.

Testing

Existing unit + functional tests pass.

Per-vendor reproduction across both tests — JobOperatorFunctionalTests (restart scenario) and GracefulShutdownFunctionalTests (the test this issue references, previously @Disabled because of exactly this race). Looped 100× per cell against HSQLDB / MySQL 8 / PostgreSQL 16:

| Test | Vendor | Baseline (no fix) | After this PR |
|---|---|---|---|
| `GracefulShutdownFunctionalTests` | HSQLDB | 3/100 | 0/100 |
| `GracefulShutdownFunctionalTests` | MySQL | 27/100 | 0/100 |
| `GracefulShutdownFunctionalTests` | PostgreSQL | 3/100 | 0/100 |
| `JobOperatorFunctionalTests` | HSQLDB | 1/100 | 0/100 |
| `JobOperatorFunctionalTests` | MySQL | 22/100 | 0/100 |
| `JobOperatorFunctionalTests` | PostgreSQL | 8/100 | 0/100 |

Under baseline, both tests fail far more often on MySQL than on PostgreSQL or HSQLDB, which is consistent with both exercising the same StepExecution write race. After this PR there are no optimistic-lock conflicts, no UNKNOWN step states, and no flaky failures in any of the 600 runs (6 test/vendor cells × 100 runs each).

Harness + script: experiment/olf-investigation. Run with:

RUNS=100 ./reproduce-gh-5308.sh   # before the fix (baseline), then again after applying this PR

Resolves #5308

JobOperator.stop() ran on the caller thread and itself persisted the running StepExecution (jobRepository.update). The thread executing the step persists the same BATCH_STEP_EXECUTION row, so the two raced on the optimistic lock: whichever lost either propagated an OptimisticLockingFailureException out of stop() or was driven into an UNKNOWN state (and could no longer be restarted).

Make the thread executing the step the sole writer of its StepExecution:

- stop() no longer persists step executions. It marks the job STOPPING in a short transaction, signals each running step, then waits - outside any transaction - for the step to terminate and persist its own stopped state, with a configurable timeout (JobExecutionStopException on expiry). This also gives graceful shutdown the durability it needs: the caller blocks until the stopped state is persisted.
- StoppableStep gains subscribeToTermination(StepExecution); AbstractStep completes the returned future once the execution terminates.
- StoppableStep.stop() default now only sets terminateOnly; the worker owns the STOPPED / exit status / end time transition.
- Revert the stop-time StepExecution version re-sync in SimpleJobRepository.update, which only narrowed the race window.
- stop() is no longer wrapped in the operator's transaction.

Resolves spring-projects#5308

Signed-off-by: Kyungrae Kim <rlarudfo93@gmail.com>

kyungrae force-pushed the fix/gh-5308-stop-awaits-step-termination branch 2 times, most recently from 1e82526 to 7176811 Compare June 25, 2026 08:51

kyungrae force-pushed the fix/gh-5308-stop-awaits-step-termination branch from 7176811 to 3d7086f Compare June 26, 2026 00:41

This was referenced Jun 28, 2026

GracefulShutdownFunctionalTests.testStopJob fails intermittently due to a race condition #5308

Open

GH-5308: Serialize step execution updates with concurrent stop requests #5448

Open

kyungrae wants to merge 1 commit into spring-projects:main from kyungrae:fix/gh-5308-stop-awaits-step-termination
