[AMORO-4089] Fix optimizing process stuck permanently after AMS restart #4090
Open
j1wonpark wants to merge 1 commit into apache:master
When AMS restarts, SCHEDULED/ACKED tasks were loaded into taskMap but never re-queued, causing the optimizing process to hang forever. This fix addresses two issues:
1. Reset SCHEDULED/ACKED tasks to PLANNED during recovery in loadTaskRuntimes(), so they are re-queued for execution.
2. Fix race condition in OptimizerKeeper where expired optimizer tokens were still in authOptimizers during task scanning. Move unregisterOptimizer() before collectTasks() so orphaned tasks are correctly detected.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
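The recovery step described in the commit message can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual Amoro code: the class `RecoverySketch`, the `Task` type, and its fields are assumptions made for illustration; only the names `taskMap`, `taskQueue`, `loadTaskRuntimes`, and the task states come from this PR.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified stand-in for the optimizing queue; not the Amoro classes.
class RecoverySketch {
  enum Status { PLANNED, SCHEDULED, ACKED, SUCCESS }

  static final class Task {
    final String id;
    Status status;
    String token; // token of the optimizer the task was handed to, or null

    Task(String id, Status status, String token) {
      this.id = id;
      this.status = status;
      this.token = token;
    }
  }

  final Map<String, Task> taskMap = new LinkedHashMap<>();
  final Queue<Task> taskQueue = new ArrayDeque<>();

  // Reload persisted tasks after a restart.
  void loadTaskRuntimes(List<Task> persisted) {
    for (Task task : persisted) {
      // The optimizer that held a SCHEDULED/ACKED task is unreachable after a
      // restart, so the task is reset and made eligible for scheduling again.
      if (task.status == Status.SCHEDULED || task.status == Status.ACKED) {
        task.status = Status.PLANNED;
        task.token = null;
      }
      taskMap.put(task.id, task);
      if (task.status == Status.PLANNED) {
        taskQueue.add(task);
      }
    }
  }

  public static void main(String[] args) {
    RecoverySketch queue = new RecoverySketch();
    queue.loadTaskRuntimes(List.of(
        new Task("t1", Status.PLANNED, null),
        new Task("t2", Status.SCHEDULED, "old-token"),
        new Task("t3", Status.ACKED, "old-token"),
        new Task("t4", Status.SUCCESS, "old-token")));
    // prints "tracked=4 queued=3"
    System.out.println(
        "tracked=" + queue.taskMap.size() + " queued=" + queue.taskQueue.size());
  }
}
```

In this model, without the reset branch only t1 would be queued, while t2 and t3 would sit in taskMap with no one left to complete them.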
Why are the changes needed?
When AMS restarts (e.g., during a Kubernetes rolling update), in-flight optimizing tasks in SCHEDULED or ACKED state are loaded into taskMap but never placed into taskQueue in loadTaskRuntimes(). Since the connection to the original Optimizer is lost after restart, these tasks can never be completed, causing the table to remain stuck in MAJOR_OPTIMIZING / MINOR_OPTIMIZING status permanently.

Additionally, OptimizerKeeper has a race condition where it scans for orphaned tasks before removing the expired optimizer's token from authOptimizers. This causes the predicate !activeTokens.contains(task.getToken()) to evaluate to false, so orphaned tasks are never detected. After the token is removed, no further keeper events are generated, leaving the tasks permanently stuck.

Resolves #4089
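The ordering problem in OptimizerKeeper can be illustrated with a small, self-contained model. Everything here is a hypothetical sketch rather than the real DefaultOptimizingService code: the class `KeeperOrderSketch`, the string-based task and token representation, and the two `onExpired...` methods are assumptions; only `authOptimizers`, `unregisterOptimizer()`, `collectTasks()`, and the `!activeTokens.contains(...)` predicate are taken from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the keeper's expiry handling; not the Amoro classes.
class KeeperOrderSketch {
  final Set<String> authOptimizers = new HashSet<>();      // registered optimizer tokens
  final Map<String, String> taskTokens = new HashMap<>();  // task id -> owning token

  void unregisterOptimizer(String token) {
    authOptimizers.remove(token);
  }

  // A task is orphaned when its token no longer belongs to a registered optimizer.
  List<String> collectTasks() {
    Set<String> activeTokens = new HashSet<>(authOptimizers);
    List<String> orphaned = new ArrayList<>();
    for (Map.Entry<String, String> entry : taskTokens.entrySet()) {
      if (!activeTokens.contains(entry.getValue())) {
        orphaned.add(entry.getKey());
      }
    }
    return orphaned;
  }

  // Old order: the expired token is still registered during the scan.
  List<String> onExpiredScanFirst(String expiredToken) {
    List<String> orphaned = collectTasks();
    unregisterOptimizer(expiredToken);
    return orphaned;
  }

  // New order: the expired token is removed before the scan.
  List<String> onExpiredUnregisterFirst(String expiredToken) {
    unregisterOptimizer(expiredToken);
    return collectTasks();
  }

  static KeeperOrderSketch withOneExpiredOptimizer() {
    KeeperOrderSketch keeper = new KeeperOrderSketch();
    keeper.authOptimizers.add("expired-token");
    keeper.taskTokens.put("task-1", "expired-token");
    return keeper;
  }

  public static void main(String[] args) {
    // prints "scan first: 0 orphaned"
    System.out.println("scan first: "
        + withOneExpiredOptimizer().onExpiredScanFirst("expired-token").size()
        + " orphaned");
    // prints "unregister first: 1 orphaned"
    System.out.println("unregister first: "
        + withOneExpiredOptimizer().onExpiredUnregisterFirst("expired-token").size()
        + " orphaned");
  }
}
```

With the scan-first order the task is missed in the only pass that the expiry triggers, which matches the stuck state described above; unregistering first lets the same pass detect it.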
Brief change log
- Reset SCHEDULED/ACKED tasks to PLANNED in OptimizingQueue.loadTaskRuntimes() so they are re-queued after a restart
- Fix the race condition in DefaultOptimizingService.OptimizerKeeper.run() by moving unregisterOptimizer() before collectTasks(), so expired tokens are removed from authOptimizers before orphaned task detection
- Update testTouchTimeout, testReloadScheduledTask, and testReloadAckTask to reflect the new recovery behavior

How was this patch tested?
Documentation
no
not applicable