GH-5308: Stop a job execution by signalling and awaiting its running step(s)#5442
Open
kyungrae wants to merge 1 commit into
Open
GH-5308: Stop a job execution by signalling and awaiting its running step(s)#5442kyungrae wants to merge 1 commit into
kyungrae wants to merge 1 commit into
Conversation
1e82526 to
7176811
Compare
JobOperator.stop() ran on the caller thread and itself persisted the running StepExecution (jobRepository.update). The thread executing the step persists the same BATCH_STEP_EXECUTION row, so the two raced on the optimistic lock: whichever lost either propagated an OptimisticLockingFailureException out of stop() or was driven into an UNKNOWN state (and could no longer be restarted). Make the thread executing the step the sole writer of its StepExecution: - stop() no longer persists step executions. It marks the job STOPPING in a short transaction, signals each running step, then waits - outside any transaction - for the step to terminate and persist its own stopped state, with a configurable timeout (JobExecutionStopException on expiry). This also gives graceful shutdown the durability it needs: the caller blocks until the stopped state is persisted. - StoppableStep gains subscribeToTermination(StepExecution); AbstractStep completes the returned future once the execution terminates. - StoppableStep.stop() default now only sets terminateOnly; the worker owns the STOPPED / exit status / end time transition. - Revert the stop-time StepExecution version re-sync in SimpleJobRepository.update, which only narrowed the race window. - stop() is no longer wrapped in the operator's transaction. Resolves spring-projects#5308 Signed-off-by: Kyungrae Kim <rlarudfo93@gmail.com>
7176811 to
3d7086f
Compare
This was referenced Jun 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
JobOperator.stop()runs on the caller thread and itself callsjobRepository.update(stepExecution)on the running step. The thread executing that step also writes the sameBATCH_STEP_EXECUTIONrow (per chunk and on completion). Both use optimistic locking, so they race:OptimisticLockingFailureExceptionpropagates out ofstop();AbstractStepmarks the stepUNKNOWN("should not be restarted") → the job can no longer be restarted.The re-sync added in #5217 only narrows the window (get-then-update isn't atomic, and under
REPEATABLE READthe re-read returns a stale snapshot). Full analysis + per-vendor reproduction: #5308.Approach — single writer + signal/await
Make the thread executing the step the sole writer of its
StepExecution:stop()no longer persists the step execution (removes the second writer → the race is gone).STOPPINGin a short transaction, signals each running step, then waits — outside the transaction — for the step to confirm it has stopped (a future completed byAbstractSteponce the execution terminates and its final metadata is saved), bounded by a configurablestopTimeout(default 30s;JobExecutionStopExceptionon timeout).Why wait?
The wait is the point of the change, for two reasons:
stop()(Spring batch terminate in started status after sigterm #4023) was that onSIGTERMthe job would otherwise be leftSTARTEDin the database. With this change the worker is the one thatrecords the terminal state, but
stop()blocks until that record is persisted — so whenstop()is called from a shutdown hook, it keeps the JVM alive long enough for the worker to durably writeSTOPPEDbefore resources are torn down. We get the same durability guarantee without the operator racing the worker.stop()becomes a reliable, synchronous operation. Previouslystop()was fire-and-forget: it requested a stop and returned, with no guarantee the job had actually stopped. Now it returns only after the running step(s) have genuinely terminated, so the caller can trust that on return the job is stopped (or get aJobExecutionStopExceptionif it didn't stop within the timeout).Notes (behavior change)
stop()now blocks until the running step(s) confirm they have stopped, where it previously returned immediately (fire-and-forget). This is the intended improvement — it makes stop reliable and durable. The wait is bounded by a configurablestopTimeout(default 30s); only a step that genuinely cannot be interrupted in time (e.g. a long single tasklet that never reaches a chunk boundary) hits the timeout and surfaces aJobExecutionStopException. If a step is no longer actually running in this JVM, there is nothing to wait for andstop()returns immediately.stop()no longer runs inside one operator-wide transaction; it persistsSTOPPINGin its own short, atomic transaction and waits outside it. Since thatSTOPPINGupdate is now the only writestop()makes, there is no multi-write atomicity to preserve.Testing
Existing unit + functional tests pass.
Per-vendor reproduction across both tests —
JobOperatorFunctionalTests(restart scenario) andGracefulShutdownFunctionalTests(the test this issue references, previously@Disabledbecause of exactly this race). Looped 100× per cell against HSQLDB / MySQL 8 / PostgreSQL 16:GracefulShutdownFunctionalTestsGracefulShutdownFunctionalTestsGracefulShutdownFunctionalTestsJobOperatorFunctionalTestsJobOperatorFunctionalTestsJobOperatorFunctionalTestsBoth tests show the same vendor spread under baseline (MySQL ≫ PostgreSQL ≈ HSQLDB), confirming they exercise the same
StepExecutionwrite race. After this PR, no optimistic-lock conflicts, noUNKNOWNstep states, and no flaky failures in any of the 600 baseline-reproducing cells.Harness + script:
experiment/olf-investigation. Run with:RUNS=100 ./reproduce-gh-5308.sh # before the fix (baseline), then again after applying this PRResolves #5308