GH-5308: Serialize step execution updates with concurrent stop requests #5448
Open
kyungrae wants to merge 1 commit into
When a JobOperator stops a running job, the stopping thread persisted the step execution's stopped state on its own thread, concurrently with the worker thread still committing chunks for the same step execution. Both issued optimistic-locking UPDATEs against the same BATCH_STEP_EXECUTION row, so the stopping thread could fail with OptimisticLockingFailureException.
Guard every update to a step execution's metadata with a per-execution lock, held across the surrounding transaction's commit, and share it between the worker and the stopping thread.
Re-enables GracefulShutdownFunctionalTests.
Issue spring-projects#5308
Signed-off-by: Kyungrae Kim <rlarudfo93@gmail.com>
Problem
When JobOperator.stop(jobExecution) is called while a step is running, the stopping thread persists the step execution's stopped state (jobRepository.update(stepExecution)) on its own thread, concurrently with the worker thread that is still committing chunks for the same BATCH_STEP_EXECUTION row. Both issue optimistic-locking updates against that row, so one matches 0 rows and fails with OptimisticLockingFailureException.
The failure is timing- and vendor-sensitive: frequent on MySQL (REPEATABLE READ), occasional on PostgreSQL (READ COMMITTED), and almost never on in-memory HSQLDB, which is why CI rarely catches it and GracefulShutdownFunctionalTests was @Disabled.
A contributing factor: SimpleJobRepository.update(StepExecution) re-read the row version via stepExecutionDao.synchronizeStatus(stepExecution) while the job was stopping. Under MySQL REPEATABLE READ that read returns the stale snapshot version, overwriting the correct in-memory version and guaranteeing the stopping thread's update loses.
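The lost update described above can be reproduced in miniature. The following is a minimal sketch, not Spring Batch code: the Row class and its updateIfVersionMatches method are hypothetical stand-ins for a version-checked UPDATE (assumed here to have the usual "SET VERSION = VERSION + 1 WHERE VERSION = ?" shape). Two callers read the same version, and the second update matches 0 rows.

```java
public class OptimisticLockRace {

    /** Hypothetical stand-in for one BATCH_STEP_EXECUTION row; only the version column is modelled. */
    static final class Row {
        private int version = 0;

        /**
         * Mimics a version-checked UPDATE (assumed shape:
         * SET VERSION = VERSION + 1 WHERE VERSION = expectedVersion).
         * Returns the number of rows matched: 1 on success, 0 on a stale version.
         */
        synchronized int updateIfVersionMatches(int expectedVersion) {
            if (version != expectedVersion) {
                return 0; // stale version: the real DAO would raise OptimisticLockingFailureException
            }
            version++;
            return 1;
        }

        synchronized int version() {
            return version;
        }
    }

    public static void main(String[] args) {
        Row row = new Row();

        // Both threads observe the same version before either one writes.
        int workerView = row.version();
        int stopperView = row.version();

        // The worker commits its chunk first and bumps the version.
        int workerRows = row.updateIfVersionMatches(workerView);
        // The stopping thread still holds the old version and matches nothing.
        int stopperRows = row.updateIfVersionMatches(stopperView);

        System.out.println("worker rows updated:  " + workerRows);
        System.out.println("stopper rows updated: " + stopperRows);
    }
}
```

The interleaving is fixed here for determinism; in the real failure it depends on thread timing and on which version each transaction sees.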
Solution
Guard every update to a step execution's metadata with a per-execution lock, held across the surrounding transaction's commit, shared between the worker and the stopping thread:
- AbstractStep keeps one Semaphore per running step execution and exposes StoppableStep.callUnderLock(StepExecution, Runnable). The worker's start/chunk/final updates and the operator's stop update all run under it, so they serialize and neither observes a stale version.
- TaskletStep and ChunkOrientedStep take this shared lock around their chunk transactions.
- SimpleJobRepository.update(StepExecution) no longer re-reads the version while stopping. Under the lock, the shared in-memory execution already holds the current version (the stale re-read was the root failure on MySQL).
- Stopping now sets STOPPED directly instead of STOPPING (which update(JobExecution) upgraded to STOPPED anyway).
- Re-enables GracefulShutdownFunctionalTests (disabled under #5308, "GracefulShutdownFunctionalTests.testStopJob fails intermittently due to a race condition").
Validation
JobOperatorFunctionalTests and GracefulShutdownFunctionalTests, 100 runs per vendor (MySQL 8 / PostgreSQL 16 in Docker, HSQLDB in-memory), after the fix:
GracefulShutdownFunctionalTests on all three vendors and JobOperatorFunctionalTests on all three vendors: 0 failures / 600 runs. (Before-fix numbers: see #5442.)
Plus a new AbstractStepTests unit test asserting that updates to one step execution are serialized; the full spring-batch-core suite passes.
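The property the new unit test targets, that updates to one step execution never overlap, can be illustrated with a minimal sketch. This is not the code from this PR: PerExecutionLockSketch, its lock registry, and the use of a plain long id in place of a StepExecution are assumptions made for illustration, and transactions are not modelled.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class PerExecutionLockSketch {

    // Hypothetical registry: one binary semaphore per step execution id.
    private final ConcurrentMap<Long, Semaphore> locks = new ConcurrentHashMap<>();

    /** Runs the action while holding the lock that belongs to the given execution id. */
    public void callUnderLock(long stepExecutionId, Runnable action) {
        Semaphore lock = locks.computeIfAbsent(stepExecutionId, id -> new Semaphore(1));
        lock.acquireUninterruptibly();
        try {
            action.run();
        } finally {
            lock.release();
        }
    }

    /** Runs locked "updates" from several threads and returns the highest overlap observed. */
    static int maxConcurrentUpdates(int threads, int updatesPerThread) {
        PerExecutionLockSketch sketch = new PerExecutionLockSketch();
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    // All threads target the same execution id, so they share one semaphore.
                    sketch.callUnderLock(42L, () -> {
                        int now = inFlight.incrementAndGet();
                        maxSeen.accumulateAndGet(now, Math::max);
                        inFlight.decrementAndGet();
                    });
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        // With the lock in place, no two updates are ever in flight together.
        System.out.println("max concurrent updates: " + maxConcurrentUpdates(2, 1000)); // prints 1
    }
}
```

Unlike a ReentrantLock, a Semaphore permit is not tied to the thread that acquired it. The sketch does not rely on that; it only shows that a lock shared per execution id serializes the updates.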
Resolves #5308