GH-5308: Serialize step execution updates with concurrent stop requests #5448

Open
kyungrae wants to merge 1 commit into spring-projects:main from kyungrae:fix/gh-5308-serialize-step-execution-callback

kyungrae commented Jun 29, 2026

Problem

When JobOperator.stop(jobExecution) is called while a step is running, the stopping thread persists the step execution's stopped state (jobRepository.update(stepExecution)) on its own thread — concurrently with the worker thread still committing chunks for the same BATCH_STEP_EXECUTION row. Both issue optimistic-locking updates:

UPDATE BATCH_STEP_EXECUTION SET ..., VERSION = ? WHERE STEP_EXECUTION_ID = ? AND VERSION = ?

so one matches 0 rows and fails with OptimisticLockingFailureException. It is timing- and vendor-sensitive: frequent on MySQL (REPEATABLE READ), occasional on PostgreSQL (READ COMMITTED), almost never on in-memory HSQLDB — which is why CI rarely catches it and GracefulShutdownFunctionalTests was @Disabled.

A contributing factor: SimpleJobRepository.update(StepExecution) re-read the row version via stepExecutionDao.synchronizeStatus(stepExecution) while the job was stopping. Under MySQL REPEATABLE READ that read returns the stale snapshot version, overwriting the correct in-memory version and guaranteeing the stopping thread's update loses.
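
The version check behind this failure can be illustrated with a small in-memory simulation. This is only a sketch: `VersionedRow` and `OptimisticLockDemo` are hypothetical names, not Spring Batch classes, and the interleaving is replayed sequentially so the outcome is deterministic. An `AtomicInteger` compare-and-set stands in for the `WHERE ... AND VERSION = ?` predicate.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for one BATCH_STEP_EXECUTION row; not Spring Batch code. */
final class VersionedRow {
    private final AtomicInteger version = new AtomicInteger(0);

    /** Mimics UPDATE ... SET VERSION = expected + 1 WHERE VERSION = expected; returns rows matched. */
    int update(int expectedVersion) {
        return version.compareAndSet(expectedVersion, expectedVersion + 1) ? 1 : 0;
    }

    int currentVersion() {
        return version.get();
    }
}

final class OptimisticLockDemo {
    /** Returns the row counts seen by the worker and the stopping thread, in that order. */
    static int[] raceWithStaleVersion() {
        VersionedRow row = new VersionedRow();
        int workerView = row.currentVersion();      // both callers read version 0
        int stopperView = row.currentVersion();
        int workerRows = row.update(workerView);    // matches; version becomes 1
        int stopperRows = row.update(stopperView);  // stale version: matches 0 rows
        return new int[] { workerRows, stopperRows };
    }

    public static void main(String[] args) {
        int[] rows = raceWithStaleVersion();
        System.out.println("worker rows=" + rows[0] + ", stopper rows=" + rows[1]);
    }
}
```

In the real repository, an update that matches 0 rows is what surfaces as OptimisticLockingFailureException; the sketch only reports the row count.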

Solution

Guard every update to a step execution's metadata with a per-execution lock, held across the surrounding transaction's commit, shared between the worker and the stopping thread:

  • AbstractStep keeps one Semaphore per running step execution and exposes StoppableStep.callUnderLock(StepExecution, Runnable). The worker's start/chunk/final updates and the operator's stop update all run under it, so they serialize and neither observes a stale version.
  • TaskletStep and ChunkOrientedStep take this shared lock around their chunk transactions.
  • SimpleJobRepository.update(StepExecution) no longer re-reads the version while stopping — under the lock the shared in-memory execution already holds the current version (the stale re-read was the root failure on MySQL).
  • The operator sets the job execution status to STOPPED directly instead of STOPPING (which update(JobExecution) upgraded to STOPPED anyway).
  • Re-enables GracefulShutdownFunctionalTests, which was disabled under #5308 ("GracefulShutdownFunctionalTests.testStopJob fails intermittently due to a race condition").
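
The per-execution locking idea in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `StepExecutionLocks` and `LockDemo` are hypothetical names, the lock is keyed by a plain `long` id rather than a StepExecution, and no transaction is modeled, so "held across the commit" is represented only by running the whole update inside the lock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical per-execution lock registry; names are illustrative, not the PR's API. */
final class StepExecutionLocks {
    private final Map<Long, Semaphore> locks = new ConcurrentHashMap<>();

    /** Runs the update while holding the binary semaphore for the given execution id. */
    void callUnderLock(long stepExecutionId, Runnable update) {
        Semaphore lock = locks.computeIfAbsent(stepExecutionId, id -> new Semaphore(1));
        lock.acquireUninterruptibly();
        try {
            update.run();
        } finally {
            lock.release();
        }
    }
}

final class LockDemo {
    /** A worker and a stopper thread each bump a shared in-memory version under the lock. */
    static int runConcurrentUpdates(int updatesPerThread) {
        StepExecutionLocks locks = new StepExecutionLocks();
        int[] version = { 0 }; // shared state, only touched while the lock is held
        Runnable loop = () -> {
            for (int i = 0; i < updatesPerThread; i++) {
                locks.callUnderLock(42L, () -> version[0]++);
            }
        };
        Thread worker = new Thread(loop, "worker");
        Thread stopper = new Thread(loop, "stopper");
        worker.start();
        stopper.start();
        try {
            worker.join();
            stopper.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return version[0];
    }

    public static void main(String[] args) {
        System.out.println(runConcurrentUpdates(10_000)); // prints 20000: no update is lost
    }
}
```

Because every increment runs under the same semaphore, the two threads never interleave inside an update, which is the property the worker and the stopping thread need in order to see a consistent version.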

Validation

JobOperatorFunctionalTests and GracefulShutdownFunctionalTests, 100 runs per vendor (MySQL 8 / PostgreSQL 16 in Docker, HSQLDB in-memory), shown as failures/runs before and after the fix:

| Test | Vendor | Baseline (no fix) | After this PR |
|---|---|---|---|
| GracefulShutdownFunctionalTests | HSQLDB | 3/100 | 0/100 |
| GracefulShutdownFunctionalTests | MySQL | 27/100 | 0/100 |
| GracefulShutdownFunctionalTests | PostgreSQL | 3/100 | 0/100 |
| JobOperatorFunctionalTests | HSQLDB | 1/100 | 0/100 |
| JobOperatorFunctionalTests | MySQL | 22/100 | 0/100 |
| JobOperatorFunctionalTests | PostgreSQL | 8/100 | 0/100 |

(0 failures / 600 runs. Before-fix numbers: see #5442.) Plus a new AbstractStepTests unit test asserting that updates to one step execution are serialized; the full spring-batch-core suite passes.

Resolves #5308

When a JobOperator stops a running job, the stopping thread persisted the
step execution's stopped state on its own thread, concurrently with the
worker thread still committing chunks for the same step execution. Both
issued optimistic-locking UPDATEs against the same BATCH_STEP_EXECUTION row,
so the stopping thread could fail with OptimisticLockingFailureException.

Guard every update to a step execution's metadata with a per-execution lock,
held across the surrounding transaction's commit, and share it between the
worker and the stopping thread.

Re-enables GracefulShutdownFunctionalTests.

Issue spring-projects#5308

Signed-off-by: Kyungrae Kim <rlarudfo93@gmail.com>
Development

Successfully merging this pull request may close these issues.

GracefulShutdownFunctionalTests.testStopJob fails intermittently due to a race condition
