GH-5308: Stop a job execution by signalling and awaiting its running step(s) #5442

Open
kyungrae wants to merge 1 commit into spring-projects:main from kyungrae:fix/gh-5308-stop-awaits-step-termination

kyungrae commented Jun 25, 2026

Problem

JobOperator.stop() runs on the caller thread and itself calls jobRepository.update(stepExecution) on the running step. The thread executing that step also writes the same BATCH_STEP_EXECUTION row (per chunk and on completion). Both use optimistic locking, so they race:

  • caller loses: OptimisticLockingFailureException propagates out of stop();
  • worker loses: AbstractStep marks the step UNKNOWN ("should not be restarted") → the job can no longer be restarted.
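
The two outcomes above can be illustrated with a minimal, deterministic sketch. It is not Spring Batch code: `StopRaceSketch` and `VersionedRow` are hypothetical names, and the in-memory row only mimics a version-checked `UPDATE` on a single row. Both writers read the same version, and whichever writes second is rejected.

```java
// Hypothetical illustration of two writers sharing one optimistically locked row.
final class StopRaceSketch {

    // In-memory stand-in for a single BATCH_STEP_EXECUTION row (assumption, not the real schema).
    static final class VersionedRow {
        private int version = 0;
        private String status = "STARTED";

        // Mimics "UPDATE ... SET status = ?, version = version + 1 WHERE version = ?".
        synchronized boolean update(String newStatus, int expectedVersion) {
            if (version != expectedVersion) {
                return false; // stale version: this writer lost the race
            }
            status = newStatus;
            version++;
            return true;
        }

        synchronized int version() {
            return version;
        }
    }

    // Both writers read the same version, then both try to update the row.
    static String race(boolean workerWritesFirst) {
        VersionedRow row = new VersionedRow();
        int seenByCaller = row.version(); // read by the thread calling stop()
        int seenByWorker = row.version(); // read by the thread executing the step
        if (workerWritesFirst) {
            row.update("STARTED", seenByWorker); // per-chunk update by the worker
            boolean ok = row.update("STOPPED", seenByCaller);
            return ok ? "no conflict" : "caller loses";
        }
        row.update("STOPPED", seenByCaller); // stop() persists first
        boolean ok = row.update("STARTED", seenByWorker);
        return ok ? "no conflict" : "worker loses";
    }

    public static void main(String[] args) {
        System.out.println(race(true));  // prints "caller loses"
        System.out.println(race(false)); // prints "worker loses"
    }
}
```

With a single writer per row, the second `update` call never exists, so neither branch can fail.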

The re-sync added in #5217 only narrows the window (get-then-update isn't atomic, and under REPEATABLE READ the re-read returns a stale snapshot). Full analysis + per-vendor reproduction: #5308.

Approach — single writer + signal/await

Make the thread executing the step the sole writer of its StepExecution:

  • stop() no longer persists the step execution (removes the second writer → the race is gone).
  • It marks the job STOPPING in a short transaction, signals each running step, then waits — outside the transaction — for the step to confirm it has stopped (a future completed by AbstractStep once the execution terminates and its final metadata is saved), bounded by a configurable stopTimeout (default 30s; JobExecutionStopException on timeout).
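
The signal-then-await flow described above can be sketched with plain `java.util.concurrent` primitives. This is an illustration under stated assumptions, not the PR's implementation: `SignalAwaitSketch`, `RunningStep`, `signalStop`, and `termination` are hypothetical names, the "persisted" status is just a field, and an `IllegalStateException` stands in for the exception the PR throws on timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

final class SignalAwaitSketch {

    // Hypothetical stand-in for a running step; the worker is the only writer of persistedStatus.
    static final class RunningStep {
        private final AtomicBoolean terminateOnly = new AtomicBoolean(false);
        private final CompletableFuture<String> terminated = new CompletableFuture<>();
        private volatile String persistedStatus = "STARTED";

        void signalStop() {
            terminateOnly.set(true); // signal only, no write to the step's state
        }

        CompletableFuture<String> termination() {
            return terminated;
        }

        // Worker loop: checks the flag at each simulated chunk boundary.
        void run(long chunkMillis) {
            try {
                while (!terminateOnly.get()) {
                    Thread.sleep(chunkMillis); // simulated chunk of work
                }
                persistedStatus = "STOPPED";          // the worker records the final state
                terminated.complete(persistedStatus); // then confirms termination
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                terminated.completeExceptionally(e);
            }
        }
    }

    // Caller side: signal, then wait (bounded) for the worker's confirmation.
    static String stop(RunningStep step, long timeoutMillis) {
        step.signalStop();
        try {
            return step.termination().get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("step did not stop within " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the step", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("step failed while stopping", e.getCause());
        }
    }

    static String demo() {
        RunningStep step = new RunningStep();
        Thread worker = new Thread(() -> step.run(10));
        worker.start();
        String status = stop(step, 2_000);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return status;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "STOPPED"
    }
}
```

The caller never touches `persistedStatus`; it only flips the flag and observes the future, which is the single-writer property the approach relies on.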

Why wait?

The wait is the point of the change, for two reasons:

  1. Durability during shutdown. The main motivation behind persisting the stopped state from stop() (Spring batch terminate in started status after sigterm #4023) was that on SIGTERM the job would otherwise be left STARTED in the database. With this change the worker is the one that records the terminal state, but stop() blocks until that record is persisted, so when stop() is called from a shutdown hook, it keeps the JVM alive long enough for the worker to durably write STOPPED before resources are torn down. We get the same durability guarantee without the operator racing the worker.
  2. stop() becomes a reliable, synchronous operation. Previously stop() was fire-and-forget: it requested a stop and returned, with no guarantee the job had actually stopped. Now it returns only after the running step(s) have genuinely terminated, so the caller can trust that on return the job is stopped (or get a JobExecutionStopException if it didn't stop within the timeout).

Notes (behavior change)

  • stop() now blocks until the running step(s) confirm they have stopped, where it previously returned immediately (fire-and-forget). This is the intended improvement — it makes stop reliable and durable. The wait is bounded by a configurable stopTimeout (default 30s); only a step that genuinely cannot be interrupted in time (e.g. a long single tasklet that never reaches a chunk boundary) hits the timeout and surfaces a JobExecutionStopException. If a step is no longer actually running in this JVM, there is nothing to wait for and stop() returns immediately.
  • stop() no longer runs inside one operator-wide transaction; it persists STOPPING in its own short, atomic transaction and waits outside it. Since that STOPPING update is now the only write stop() makes, there is no multi-write atomicity to preserve.
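
The three outcomes in the first note (nothing to wait for, stopped in time, timed out) can be sketched as follows. All names are hypothetical: `StopTimeoutSketch`, `register`, and `awaitStop` are illustrative, the map keyed by step execution id is an assumption about how running steps might be tracked, and `StopTimedOutException` only stands in for the `JobExecutionStopException` the PR reports on timeout.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class StopTimeoutSketch {

    // Hypothetical stand-in for the exception raised when the stop timeout expires.
    static final class StopTimedOutException extends RuntimeException {
        StopTimedOutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Termination futures of the steps running in this JVM, keyed by step execution id (assumption).
    private final Map<Long, CompletableFuture<Void>> running = new ConcurrentHashMap<>();

    // Called when a step starts; the step completes the future once it has terminated.
    CompletableFuture<Void> register(long stepExecutionId) {
        CompletableFuture<Void> termination = new CompletableFuture<>();
        running.put(stepExecutionId, termination);
        return termination;
    }

    String awaitStop(long stepExecutionId, long timeoutMillis) {
        CompletableFuture<Void> termination = running.get(stepExecutionId);
        if (termination == null) {
            return "not running here"; // nothing to wait for: return immediately
        }
        try {
            termination.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "stopped";
        } catch (TimeoutException e) {
            throw new StopTimedOutException("step did not stop within " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new StopTimedOutException("interrupted while waiting for the step", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("step failed while stopping", e.getCause());
        }
    }

    public static void main(String[] args) {
        StopTimeoutSketch sketch = new StopTimeoutSketch();
        System.out.println(sketch.awaitStop(1L, 50)); // prints "not running here"
        sketch.register(2L).complete(null);           // a step that terminates promptly
        System.out.println(sketch.awaitStop(2L, 50)); // prints "stopped"
        sketch.register(3L);                          // a step that never confirms
        try {
            sketch.awaitStop(3L, 50);
        } catch (StopTimedOutException e) {
            System.out.println("timed out");          // prints "timed out"
        }
    }
}
```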

Testing

Existing unit + functional tests pass.

Per-vendor reproduction across both tests: JobOperatorFunctionalTests (restart scenario) and GracefulShutdownFunctionalTests (the test this issue references, previously @Disabled because of exactly this race). Looped 100× per cell against HSQLDB / MySQL 8 / PostgreSQL 16:

| Test | Vendor | Baseline (no fix) | After this PR |
| --- | --- | --- | --- |
| GracefulShutdownFunctionalTests | HSQLDB | 3/100 | 0/100 |
| GracefulShutdownFunctionalTests | MySQL | 27/100 | 0/100 |
| GracefulShutdownFunctionalTests | PostgreSQL | 3/100 | 0/100 |
| JobOperatorFunctionalTests | HSQLDB | 1/100 | 0/100 |
| JobOperatorFunctionalTests | MySQL | 22/100 | 0/100 |
| JobOperatorFunctionalTests | PostgreSQL | 8/100 | 0/100 |

Both tests show the same vendor spread under baseline (MySQL ≫ PostgreSQL ≈ HSQLDB), confirming they exercise the same StepExecution write race. After this PR, there were no optimistic-lock conflicts, no UNKNOWN step states, and no flaky failures in any of the 600 runs (6 test/vendor cells × 100 runs each).

Harness + script: experiment/olf-investigation. Run with:

RUNS=100 ./reproduce-gh-5308.sh   # before the fix (baseline), then again after applying this PR

Resolves #5308

kyungrae force-pushed the fix/gh-5308-stop-awaits-step-termination branch 2 times, most recently from 1e82526 to 7176811, June 25, 2026 08:51

JobOperator.stop() ran on the caller thread and itself persisted the
running StepExecution (jobRepository.update). The thread executing the
step persists the same BATCH_STEP_EXECUTION row, so the two raced on the
optimistic lock: whichever lost either propagated an
OptimisticLockingFailureException out of stop() or was driven into an
UNKNOWN state (and could no longer be restarted).

Make the thread executing the step the sole writer of its StepExecution:

- stop() no longer persists step executions. It marks the job STOPPING in
  a short transaction, signals each running step, then waits - outside any
  transaction - for the step to terminate and persist its own stopped
  state, with a configurable timeout (JobExecutionStopException on expiry).
  This also gives graceful shutdown the durability it needs: the caller
  blocks until the stopped state is persisted.
- StoppableStep gains subscribeToTermination(StepExecution); AbstractStep
  completes the returned future once the execution terminates.
- StoppableStep.stop() default now only sets terminateOnly; the worker
  owns the STOPPED / exit status / end time transition.
- Revert the stop-time StepExecution version re-sync in
  SimpleJobRepository.update, which only narrowed the race window.
- stop() is no longer wrapped in the operator's transaction.

Resolves spring-projects#5308

Signed-off-by: Kyungrae Kim <rlarudfo93@gmail.com>
Successfully merging this pull request may close these issues.

GracefulShutdownFunctionalTests.testStopJob fails intermittently due to a race condition